
Influencing Leading and Following in Human-Robot Teams

Minae Kwon*, Stanford University

Mengxi Li*, Stanford University

Alexandre Bucquet, Stanford University

Dorsa Sadigh, Stanford University

Abstract—Recent efforts in human-robot interaction have been focused on modeling and interacting with single human agents. However, when modeling teams of humans, current models are not able to capture underlying emergent dynamics that define group behavior, such as leading and following. We introduce a mathematical framework that enables robots to influence human teams by modeling emergent leading and following behaviors. We tackle the task in two steps. First, we develop a scalable representation of latent leading-following structures by combining model-based methods and data-driven techniques. Second, we optimize for a robot policy that leverages this representation to influence a human team toward a desired outcome. We demonstrate our approach on three tasks where a robot optimizes for changing a leader-follower relationship, distracting a team, and leading a team towards an optimal goal. Our evaluations show that our representation is scalable with different human team sizes, generalizable across different tasks, and can be used to design meaningful robot policies.

I. INTRODUCTION

Humans are capable of seamlessly interacting and collaborating with each other. They can easily form teams and decide if they should follow or lead to efficiently complete a task as a group [39, 41, 12]. This is apparent in sports teams, human driving behavior, or simply having two people move a table together. Similarly, humans and robots are expected to seamlessly interact with each other to achieve collaborative tasks. Examples include collaborative manufacturing, search and rescue missions, and, in an implicit way, collaborating on roads shared by self-driving and human-driven cars.

These collaborative settings bring two important challenges for robots. First, robots need to understand the emergent leading and following dynamics that underlie seamless human coordination. Second, robots must be able to influence human teams to achieve a desired goal. For instance, imagine a mixed human-drone search-and-rescue team headed toward the island on the left as shown in Fig. 1. When a drone senses that the survivors are actually on the island on the right, how should it direct the rest of the human team toward the desired goal?

Natural language or pre-defined, task-specific communication signals (e.g., blinking lights) are viable solutions [42, 5]. However, in this work, we focus on finding more general solutions that do not rely on the assumption that there will be direct channels of communication.

One common solution is to assign leading and following roles to the team a priori, before starting the team's mission.

*Authors have contributed equally.

Fig. 1: We learn leading and following relationships in human teams (denoted by the arrows), and use this to create influential robot policies: an autonomous drone senses extra information and influences the human team. The grey arrows represent intended human leading and following behaviors, whereas the black arrows represent updated leading and following behaviors after the influencing robot action.

Many current human-robot interactions determine leader-follower roles beforehand [16, 22, 26, 40, 43, 17, 34], which include learning-from-demonstration tasks where the human is the leader and the robot is the follower [10, 3, 1, 44, 13, 8, 28], assistive tasks where the robot leads by teaching and assisting human users [33, 20], or other interactions based on game-theoretic formulations where the robot leads by influencing its human partner [37, 38]. However, assigning leadership roles a priori is not always feasible in dynamically changing environments or long-term interactions. For instance, the drone that collects new information should be able to dynamically become a leader.

If leading and following roles are not assigned a priori, the robot must estimate human state and be equipped with policies that can influence teammates. In addition to the learning from demonstration work mentioned above, prior work has shown that robots have been able to infer human preferences through interactions using partially observable Markov decision processes (POMDPs), which allow reasoning over uncertainty on the human's internal state or intent [9, 14, 25, 27, 19, 36]. Human intent inference has also been achieved through human-robot cross-training [28] as well as various other approximations to POMDP solutions such as augmented MDPs, belief space planning, approximating reachable belief space, and decentralization [2, 23, 24, 31, 32, 35]. These approaches do not easily generalize to larger teams of humans due to computational intractability.

Prior work has also studied how we can construct intelligent robot policies that induce desired behaviors from people [38, 18, 29, 30, 7]. However, all of these works optimize for robot policies that influence only a single human. Human state estimation, and optimizing actions based on that estimation, becomes computationally infeasible with larger groups of humans.

Instead of keeping track of each individual's state in a team, we propose a more scalable method that estimates the collective team's state. Our key insight is that, similar to individuals, teams exhibit behavioral patterns that robots can use to create intelligent influencing policies. One particular behavioral pattern we focus on is leading and following relationships.

In this paper, we introduce a framework for extracting underlying leading and following structures in human teams that is scalable with the number of human agents. We call this structure a leader-follower graph. This graph provides a concise and informative representation of the current state of the team and can be used when planning. We then use the leader-follower graph to optimize for robot policies that can influence a human team to achieve a goal.

Our contributions in this paper are as follows:

• Formalizing and learning a scalable structure, the leader-follower graph, that captures complex leading and following relationships between members in human teams.

• Developing optimization-based robot strategies that leverage the leader-follower graph to influence the team towards a more efficient objective.

• Providing simulation experiments in a pursuit-evasion game, demonstrating the robot's influencing strategies to redirect a leader-follower relationship, distract a team, and lead a team towards an optimal goal based on its learned leader-follower graph.

II. FORMALISM FOR MODELING LEADING AND FOLLOWING IN HUMAN TEAMS

Running Example: Pursuit-Evasion Game. We define a multi-player pursuit-evasion game on a 2D plane as our main running example. In this game, each pursuer is an agent in the set of agents I that can take actions in the 2D space to navigate. There are a number of stationary evaders, which we refer to as goals G. The objective of the pursuers is to collaboratively capture the goals. Fig. 2 shows an example of a game with three pursuers shown in orange and three goals shown in green. The action space of each agent is identical: A_i = {move up, move down, move left, move right, stay still}; the action spaces of all agents collectively define the joint action space A. All pursuers must jointly and implicitly agree on a goal to target without directly communicating with one another. A goal will be captured when all pursuers collide with it, as shown in Fig. 2b.

Leaders and Followers. In order to collectively capture a goal, each agent i ∈ I must decide to follow a goal or another agent. We refer to the goal or agent being followed as a leader. Formally, we let l_i ∈ G ∪ I, where l_i is either an agent or a fixed goal g who is the leader of agent i (agent i follows l_i).

Fig. 2: Pursuit-evasion game. (a) Pursuit-evasion game with three goals (green circles) and three pursuers (orange triangles). The pursuers must jointly agree on a goal to target. (b) Three pursuers collide with a goal to capture it.

This is shown in Fig. 3a, where agent 2 follows goal g_1 (l_2 = g_1) and agent 3 follows agent 2 (l_3 = 2).

The set of goals G abstracts the idea of the agents reaching a set of states in order to fully optimize the joint reward function. For instance, in a pursuit-evasion game, all the agents need to reach a state corresponding to the goals being captured. An individual goal g intuitively signifies a way for the agents to coordinate strategies with each other. For instance, in a pursuit-evasion game, the agents should collaboratively plan on actions that capture each goal.

Leader-Follower Graph. The set of leaders and followers form a directed leader-follower graph, as shown in Fig. 3a. Each node represents an agent i ∈ I or goal g ∈ G. The directed edges represent leading-following relationships, where there is an outgoing edge from a follower to its leader. The weights on the edges represent a leadership score, which is the probability that the tail node is the head node's leader. For instance, in Fig. 3a, w_32 represents the probability that 2 is 3's leader. The leader-follower graph is dynamic in that agents can decide to change their leaders at any time. We assume that there could be an implicit transitivity in a leader-follower graph, i.e., if an agent i follows an agent j, implicitly it could be following agent j's believed ultimate goal.

Some patterns are not desirable in a leader-follower graph. For instance, an agent would never follow itself, and we do not expect to observe cyclic leading-following behaviors (Fig. 3b). Other patterns that are likely include chain patterns (Fig. 3c) or patterns with multiple teams, where multiple agents directly follow goals (Fig. 3d). We describe how to construct a leader-follower graph that is scalable with the number of agents and avoids the undesirable patterns in Sec. III.

Partial Observability. The leader of each agent, l_i, is a latent variable. We assume that agents cannot directly observe the one-on-one leading and following dynamics of other agents. Thus, constructing leader-follower graphs can help robot teammates predict who will follow whom, allowing them to strategically influence teammates to adapt roles. We assume agents are homogeneous and have full information on the observations of themselves and all other agents (e.g., positions and velocities of agents).
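To make the running example concrete, here is a minimal Python sketch of the pursuit-evasion setup described above. The class and function names (PursuitEvasionGame, capture_radius, and so on) are our own illustrative choices rather than the authors' implementation, and the capture radius is an assumed threshold for what counts as "colliding" with a goal.

```python
import numpy as np

# Identical action space A_i for every agent.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}

class PursuitEvasionGame:
    """2D pursuit-evasion game with stationary goals (evaders)."""

    def __init__(self, pursuer_positions, goal_positions, capture_radius=5.0):
        self.pursuers = np.array(pursuer_positions, dtype=float)  # |I| x 2 positions
        self.goals = np.array(goal_positions, dtype=float)        # |G| x 2 positions
        self.capture_radius = capture_radius                      # assumed collision threshold

    def step(self, joint_action, step_size=1.0):
        """Apply one named action per pursuer (an element of the joint action space A)."""
        for i, a in enumerate(joint_action):
            self.pursuers[i] += step_size * np.array(ACTIONS[a], dtype=float)

    def captured(self, goal_idx):
        """A goal is captured when all pursuers collide with it (Sec. II)."""
        dists = np.linalg.norm(self.pursuers - self.goals[goal_idx], axis=1)
        return bool(np.all(dists < self.capture_radius))
```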

III. CONSTRUCTION OF A LEADER-FOLLOWER GRAPH

In this section, we focus on constructing the leader-follower graph that emerges in collaborative teams using a combination of data-driven and graph-theoretic approaches.

Fig. 3: Variations of the leader-follower graph. (a) Leader-follower graph. Green circles are the goals that need to be captured; orange triangles are the pursuers. (b) Cyclic leader-follower graph. We design policies that avoid such cyclic behaviors. (c) Chain behavior in the leader-follower graph. (d) Multiple teams.

Our aim is to leverage this leader-follower graph to enable robot teammates to produce helpful leading behaviors. We will first focus on learning pairwise relationships between agents using a supervised learning approach. We then generalize our ideas to multi-player settings using algorithms from graph theory. Our combination of data-driven and graph-theoretic approaches allows the leader-follower graph to efficiently scale with the number of agents.

A. Pairwise Leadership Scores

We will first focus on learning the probability of any agent i following any goal or agent j ∈ G ∪ I. The pairwise probabilities help us estimate the leadership score w_ij, i.e., the weight of the edge (i, j) in the leader-follower graph.

We propose a general framework for estimating the leadership scores using a supervised learning approach. Consider a two-player setting where I = {i, j}: we collect labeled data where agent i is asked to follow j, and agent j is asked to optimize for the joint reward function assuming it is leading i, i.e., following a fixed goal g in the pursuit-evasion game (l_i = j and l_j = g). We then train a Long Short-Term Memory (LSTM) network with a softmax layer to predict each agent's most likely leader.

Training with a Scalable Network Architecture. Our network architecture consists of two LSTM submodules: one to predict player-player leader-follower relationships (P-P LSTM) and one to predict player-evader relationships (P-E LSTM). We use a softmax output layer with a cross-entropy loss function to get a probability distribution over j and all goals g ∈ G of being i's leader. We take the leader (an agent or a goal) with the highest probability and assign this as the leadership score. The P-P and P-E submodules allow us to scale training to a game of any number of players and evaders, as we can add or remove P-P and P-E submodules depending on the number of players and evaders in a game. An example of our scalable network architecture is illustrated in Fig. 4.

Fig. 4: Scalable neural network architecture. This example predicts the probability of another agent j being agent 2's leader, w_2j. Three LSTM submodules are used because there are two possible evaders and one possible agent that could be agent 2's leader.
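As an illustration of this architecture, below is a minimal PyTorch-style sketch; it is our own reading of the description above, not the authors' released code, and the per-timestep feature size, hidden size, and the way pair trajectories are encoded are assumptions. The key property is that a single shared P-P submodule scores every candidate agent leader and a single shared P-E submodule scores every candidate goal, so submodules can be added or removed to match the number of players and evaders.

```python
import torch
import torch.nn as nn

class LeaderPredictor(nn.Module):
    """Predicts a distribution over agent i's possible leaders (other agents or goals)."""

    def __init__(self, feat_dim=4, hidden_dim=64):
        super().__init__()
        # Shared submodules: one for pursuer-pursuer pairs, one for pursuer-evader pairs.
        self.pp_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.pe_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.pp_score = nn.Linear(hidden_dim, 1)
        self.pe_score = nn.Linear(hidden_dim, 1)

    def forward(self, agent_pair_seqs, goal_pair_seqs):
        # agent_pair_seqs: list of (B, T, feat_dim) tensors, one per candidate agent leader
        # goal_pair_seqs:  list of (B, T, feat_dim) tensors, one per candidate goal leader
        scores = []
        for seq in agent_pair_seqs:
            _, (h, _) = self.pp_lstm(seq)
            scores.append(self.pp_score(h[-1]))      # (B, 1) score for this agent candidate
        for seq in goal_pair_seqs:
            _, (h, _) = self.pe_lstm(seq)
            scores.append(self.pe_score(h[-1]))      # (B, 1) score for this goal candidate
        logits = torch.cat(scores, dim=1)            # (B, num_candidates)
        return torch.softmax(logits, dim=1)          # leadership scores over candidates
```

Training would then minimize a cross-entropy loss between this distribution and the assigned leader labels, as described above.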

Data Collection. We collect labeled human data by asking participants to play a pursuit-evasion game with pre-assigned leaders. We recruited pairs of humans and randomly assigned leaders l_i to them (i.e., another agent or a goal). Participants played the game in a web browser using their arrow keys and were asked to move toward their assigned leader l_i. In order to create a balanced dataset, we collected data from all possible configurations of leaders and followers in a two-player setting. We collected a total of 186 games.

In addition, since human data is often difficult to collect in large amounts, we augmented our dataset with synthetic data. The synthetic data is generated by simulating humans playing the pursuit-evasion game. We simulated humans based on the well-known potential field path planner model [6]. We find that the simple nature of the task given to humans (i.e., move directly toward your assigned leader l_i) is qualitatively easily replicated using a potential field path planner. Agents at location q plan their path under the influence of an artificial potential field U(q), which is constructed to reflect the environment. Agents move toward their leaders l_i by following an attractive potential field. Other agents and goals that are not their leaders are treated as obstacles and emit a repulsive potential field.

We denote the set of attractions i ∈ A and the set of repulsive obstacles j ∈ R. The overall potential field is a weighted sum of the potential fields from all attractions and repulsive obstacles; θ_i is the weight for the attractive potential field from attraction i ∈ A, and θ_j is the weight for the repulsive potential field from obstacle j ∈ R:

U(q) = \sum_{i \in A} \theta_i U^{att}_i(q) + \sum_{j \in R} \theta_j U^{rep}_j(q)    (1)

The optimal action a an agent would take lies in the direction of the potential field gradient:

a = -\nabla U(q) = -\sum_{i \in A} \theta_i \nabla U^{att}_i(q) - \sum_{j \in R} \theta_j \nabla U^{rep}_j(q)

The attractive potential field increases as the distance to the goal becomes larger. The repulsive potential field has a fixed effective range, within which the potential field increases as the distance to the obstacle decreases. The attractive and repulsive potential fields are constructed in the same way for all attractions and repulsive obstacles. We elaborate on the details of the potential field functions in the Appendix (Section VII).

Implementation Details. We simulated 15,000 two-player games, equally representing the number of possible leader and follower configurations. Each game stored the position of each agent and goal at all timesteps. Before training, we pre-processed the data by shifting it to have a zero-centered mean, normalizing it, and down-sampling it. Each game was then fed into our network as a sequence. Based on our experiments, hyperparameters that worked well included a batch size of 250, a learning rate of 0.0001, and a hidden dimension size of 64. In addition, we used gradient clipping and layer normalization [4] to stabilize gradient updates. Using these hyperparameters, we trained our network for 100,000 steps.

Evaluating Pairwise Scores. Our network trained on two-player simulated data performed with a validation accuracy of 83%. We also experimented with training on three-player simulated data, two-player human data, as well as a combination of two-player simulated and human data (two-player mixed data). Validation results for the first 60,000 steps are shown in Fig. 5. For our mixed two-player model, we first took a pre-trained model trained on two-player simulated data and then trained it on two-player human data; the two-player mixed curve shows the model's accuracy as it is fine-tuned on human data. Our three-player model had a validation accuracy of 74%, while our two-player human model and our mixed two-player data model had a validation accuracy of 65%.
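As a small illustration of the pre-processing mentioned under Implementation Details, the sketch below zero-centers, normalizes, and down-samples a stored game; the array layout and the down-sampling rate are our own assumptions.

```python
import numpy as np

def preprocess_game(trajectory, downsample_every=5):
    """trajectory: (T, num_entities, 2) array of 2D positions for all agents and goals."""
    traj = np.asarray(trajectory, dtype=float)
    traj = traj - traj.mean(axis=(0, 1), keepdims=True)  # shift to a zero-centered mean
    traj = traj / (np.abs(traj).max() + 1e-8)            # normalize positions
    return traj[::downsample_every]                      # down-sample in time
```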

Fig. 5: Validation accuracy (y-axis) over training steps (x-axis) when calculating the pairwise leadership scores described in Sec. III-A, for models trained on 2P simulated, 3P simulated, 2P human, and 2P mixed data.

B. Maximum Likelihood Leader-Follower Graph in Teams

With a model that can predict pairwise relationships, we focus on building a model that can generalize beyond two players. We compute pairwise weights, or leadership scores, w_ij of leader-follower relationships between all possible pairs of leaders i and followers j. The pairwise weights can be computed based on the supervised learning approach described above, indicating the probability of one agent or goal being another agent's leader.

After computing w_ij for all combinations of leaders and followers, we create a directed graph G = (V, E), where V = I ∪ G and E = {(i, j) | i ∈ I, j ∈ I ∪ G, i ≠ j}, and the weights on each edge (i, j) correspond to w_ij.

To create a more concise representation, we extract the maximum likelihood leader-follower graph G* by pruning the edges of our constructed graph G. We prune the graph by greedily selecting the outgoing edge with the highest weight for each agent node. In other words, we select the edge associated with the agent or goal that has the highest probability of being agent i's leader, where the probabilities correspond to edge weights. When pruning, we make sure that no cycles are formed; if we find a cycle, we choose the next most probable edge. Our pruning approach is inspired by Edmonds' algorithm [15, 11], which finds a maximum weight arborescence [21] in a graph. An arborescence is an acyclic directed tree structure where there is exactly one outgoing edge from a node to another. We use a modified version of Edmonds' algorithm to find the maximum likelihood leader-follower graph. Compared to our approach, a maximum weight arborescence is more restrictive since it requires the resulting graph to be a tree.
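The greedy pruning step can be sketched as follows. This is our own simplified reading of the procedure rather than the authors' code, and it assumes the pairwise scores are available as a dictionary of edge weights keyed by (follower, leader).

```python
def extract_leader_follower_graph(agents, candidates, weights):
    """Greedily pick each agent's most likely leader while avoiding cycles.

    agents:     list of agent ids
    candidates: dict mapping agent id -> possible leaders (other agents or goals)
    weights:    dict mapping (follower, leader) -> leadership score w_ij
    """
    chosen = {}  # follower -> selected leader

    def creates_cycle(follower, leader):
        # Walk up the already-chosen leaders; adding follower -> leader creates a
        # cycle exactly when that chain leads back to the follower.
        node = leader
        while node in chosen:
            if node == follower:
                return True
            node = chosen[node]
        return node == follower

    for i in agents:
        # Candidate leaders sorted by leadership score, highest first.
        for j in sorted(candidates[i], key=lambda j: weights[(i, j)], reverse=True):
            if not creates_cycle(i, j):
                chosen[i] = j            # keep the most probable non-cyclic edge
                break
    return chosen
```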

Evaluating the Leader-Follower Graph. We evaluate the accuracy of our leader-follower graph with three or more agents when trained on simulated two-player and three-player data, as well as a combination of simulated and human two-player data (shown in Table I). In each of these multi-player games, we extracted a leader-follower graph at each timestep and compared our leader-follower graph's predictions against the ground-truth labels. Our leader-follower graph performs better than a random policy, which selects a leader l_i ∈ I ∪ G for agent i at random, where l_i ≠ i; the chance of being right is thus 1/(|G| + |I| − 1). We take the average of all success probabilities for all leader-follower graph configurations to compute the overall accuracy. In all experiments shown in Table I, our trained model outperforms the random policy. Most notably, the models scale naturally to settings with larger numbers of players as well as human data. We find that training with larger numbers of players helps with generalization. The tradeoff between training with larger numbers of players and collecting more data will depend on the problem domain. For our experiments (Section V), we use the model trained on three-player simulated data.

TABLE I: Generalization accuracy of the leader-follower graph (LFG).

Training Data         | Testing Data          | LFG Accuracy | Random Accuracy
----------------------|-----------------------|--------------|----------------
2 players simulated   | 3 players simulated   | 0.67         | 0.29
2 players simulated   | 4 players simulated   | 0.45         | 0.23
2 players simulated   | 5 players simulated   | 0.41         | 0.19
2 players simulated   | 2 players human       | 0.68         | 0.44
2 players simulated   | 3 players human       | 0.47         | 0.29
2 players mixed       | 3 players simulated   | 0.44         | 0.29
2 players mixed       | 4 players simulated   | 0.38         | 0.23
2 players mixed       | 5 players simulated   | 0.28         | 0.19
2 players mixed       | 2 players human       | 0.69         | 0.44
2 players mixed       | 3 players human       | 0.44         | 0.29
3 players simulated   | 4 players simulated   | 0.53         | 0.23
3 players simulated   | 5 players simulated   | 0.50         | 0.19
3 players simulated   | 3 players human       | 0.63         | 0.29

Fig. 6: On the left, creating a maximum-likelihood leader-follower graph G*; on the right, examples of influential leaders in G*. The green circles are goals, orange triangles are human agents, and black triangles are robot agents. (a) Graph: the directed edges represent pairwise likelihoods that the tail node is the head node's leader. (b) Maximum-likelihood leader-follower graph, shown in bold. (c) The most influential leader is agent 2. (d) The most influential leader is agent 1, since the other human agents are targeting the preferred goal.

IV. PLANNING BASED ON INFERENCE ON LEADER-FOLLOWER GRAPHS

We have so far computed a graphical representation for latent leadership structures in human teams, G*. In this section, we will use G* to positively influence human teams, i.e., move the team towards a more desirable outcome. We first describe how a robot can use the leader-follower graph to infer useful team structures. We then describe how a robot can leverage these inferences to plan for a desired outcome.

A. Inference based on the leader-follower graph

Leader-follower graphs enable a robot to infer useful information about a team, such as agents' goals or who the most influential leader is. These pieces of information help the robot identify key goals or agents that are useful in achieving a desired outcome (e.g., identifying shared goals in a collaborative task). Especially in multi-agent settings, following key goals or influencing key agents is a strategic move that allows the robot to plan for desired outcomes. We begin by describing different inferences a robot can perform on the leader-follower graph.

Goal inference in multi-agent settings. One way a robot can use structure in the leader-follower graph is to perform goal inference. An agent's goal can be inferred from the outgoing edges from agents to goals. In the case where there is an outgoing edge from an agent to another agent (i.e., agent i follows agent j), we assume transitivity, where agent i can be implicitly following agent j's believed ultimate goal.

Influencing the most influential leader. In order to lead a team toward a desired goal, the robot can also leverage the leader-follower graph to predict who the most influential leader is. We define the most influential leader to be the agent i* ∈ I with the largest number of followers. Identifying the most influential leader i* allows the robot to strategically influence a single teammate who in turn indirectly influences the other teammates that are following i*. For example, in Fig. 6c and Fig. 6d, we show two examples of identifying the most influential leader from G*. In the case where some of the agents are already going for the preferred goal, the one that has the most followers among the remaining players becomes the most influential leader, as shown in Fig. 6d.
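Both inferences reduce to simple queries on G*. A minimal sketch, assuming G* is stored as the follower-to-leader mapping produced by the pruning step earlier, is shown below.

```python
def most_influential_leader(chosen, agents):
    """Return the agent with the largest number of followers in the pruned graph."""
    follower_counts = {i: 0 for i in agents}
    for follower, leader in chosen.items():
        if leader in follower_counts:          # only agents (not goals) can be counted here
            follower_counts[leader] += 1
    return max(follower_counts, key=follower_counts.get)

def inferred_goal(chosen, agent):
    """Follow the chain of leaders until it ends, using the transitivity assumption."""
    node = agent
    while node in chosen:
        node = chosen[node]
    return node  # a goal, or an agent with no assigned leader
```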

B. Optimization based on the leader-follower graph

We leverage the leader-follower graph to design robot policies that optimize for an objective (e.g., leading the team towards a preferred goal). We introduce one such way we can use the leader-follower graph: by directly incorporating inferences we make from it into a simple optimization.

At each timestep, we generate graphs G^{a_t}_{t+k} that simulate what the leader-follower graph would look like at timestep t + k if the robot takes an action a_t at the current timestep t. Over the next k steps, we assume human agents will continue along their current trajectory with constant velocity.

From each graph G^{a_t}_{t+k}, we can obtain the weights w^{t+k}_{ij} corresponding to an objective r that the robot is optimizing for (e.g., the robot becoming agent i's leader). We then iterate over the robot's actions, assuming the action set is finite and discrete, and choose the action a^*_t that maximizes the objective r. We note that r must be expressible in terms of the w^{t+k}_{ij}'s and w^{t+k}_{ig}'s for i, j ∈ I and g ∈ G:

a^*_t = \arg\max_{a_t \in A} \; r\big(\{w^{t+k}_{ij}(a_t)\}_{i,j \in I}, \{w^{t+k}_{ig}(a_t)\}_{i \in I, g \in G}\big)    (2)

We describe three specific tasks that we plan for using the optimization described in Eqn. (2).
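A minimal sketch of this one-step lookahead optimization follows. The helper names and the constant-velocity rollout details are our own assumptions; in particular, predict_weights stands in for querying the learned leadership-score model at time t + k, and objective is a caller-supplied function implementing r.

```python
import numpy as np

def select_action(robot_pos, human_pos, human_vel, robot_actions, k, objective, predict_weights):
    """Pick the robot action that maximizes an objective defined over predicted LFG weights.

    robot_actions:   dict mapping action name -> 2D step vector (the finite, discrete set A)
    objective:       maps a predicted weight dict {(follower, leader): w} to a scalar r
    predict_weights: stands in for querying the learned model for leadership scores at t + k
    """
    robot_pos = np.asarray(robot_pos, dtype=float)
    human_pos = [np.asarray(p, dtype=float) for p in human_pos]
    human_vel = [np.asarray(v, dtype=float) for v in human_vel]

    best_action, best_value = None, float("-inf")
    for name, step in robot_actions.items():
        step = np.asarray(step, dtype=float)
        # Roll the world forward k steps: the robot repeats `step`; humans continue
        # along their current trajectories with constant velocity.
        robot_traj = [robot_pos + step * s for s in range(1, k + 1)]
        human_trajs = [[p + v * s for s in range(1, k + 1)]
                       for p, v in zip(human_pos, human_vel)]
        value = objective(predict_weights(robot_traj, human_trajs))  # r(w^{t+k}(a_t))
        if value > best_value:
            best_action, best_value = name, value
    return best_action
```

For instance, the redirecting task described next would pass an objective that simply returns the single weight w_{ji} for the edge being reversed.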

Redirecting a leader-follower relationship. A robot can directly influence team dynamics by changing leader-follower relationships. For a given directed edge between agents i and j, the robot can use the optimization outlined in Eqn. (2) for actions that reverse an edge or direct the edge to a different agent. For instance, to reverse the direction of the edge from agent i to agent j, the robot will select actions that maximize the probability of agent j following agent i:

a^*_t = \arg\max_{a_t \in A} \; w^{t+k}_{ji}(a_t), \quad i, j \in I    (3)

The robot can also take actions to eliminate an edge between agents i and j by minimizing w_ij. One might want to modify edges in the leader-follower graph when trying to change the leadership structure in a team. For instance, in a setting where agents must collectively decide on a goal, a robot can help unify a team with sub-groups (an example is shown in Fig. 3d) by re-directing the edges of one sub-group to follow another.

Distracting a team. In adversarial settings, a robot might want to prevent a team of humans from reaching a collective goal g. In order to stall the team, a robot can use the leader-follower graph to identify who the current most influential leader i* is. The robot can then select actions that maximize the probability of the robot becoming the most influential leader's leader, and minimize the probability of the most influential leader following the collective goal g:

a^*_t = \arg\max_{a_t \in A} \; w^{t+k}_{i^* r}(a_t), \quad i^* \in I    (4)

Distracting a team from reaching a collective goal can be useful in cases where the team is an adversary. For instance, a team of adversarial robots may want to prevent their opponents from collaboratively reaching a joint goal.

Leading a team towards the optimal goal. In collaborative settings where the team needs to agree on a goal g ∈ G, a robot that knows where the optimal goal g* ∈ G is should maximize joint utility by leading all of its teammates to reach g*. To influence the team, the robot can use the leader-follower graph to infer who the current most influential leader i* is. The robot can then select actions that maximize the probability of the most influential leader following the optimal goal g*:

a^*_t = \arg\max_{a_t \in A} \; w^{t+k}_{i^* r}(a_t) + w^{t+k}_{r g^*}(a_t), \quad i^* \in I    (5)

Being able to lead a team of humans to a goal is useful in many real-life scenarios. For instance, in search-and-rescue missions, robots with more information about the location of survivors should be able to lead the team in the optimal direction.
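Under the hypothetical select_action sketch above, the three tasks differ only in the objective they pass in. The illustrative objectives below mirror Eqns. (3)-(5), with robot_id denoting the robot and the weight keys following the (follower, leader) convention assumed earlier.

```python
def redirect_objective(i, j):
    # Eqn. (3): make agent j follow agent i.
    return lambda w: w[(j, i)]

def distract_objective(i_star, robot_id):
    # Eqn. (4): make the robot the most influential leader's leader.
    return lambda w: w[(i_star, robot_id)]

def lead_objective(i_star, robot_id, g_star):
    # Eqn. (5): robot leads i_star while itself heading for the optimal goal g_star.
    return lambda w: w[(i_star, robot_id)] + w[(robot_id, g_star)]
```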

V. EXPERIMENTS

With an optimization framework that plans for robot actions using the leader-follower graph, we evaluate our approach on the three tasks described in Sec. IV. For each task, we compare task performance under robot policies that use the leader-follower graph against robot policies that do not. Across all tasks, we find that robot policies that use the leader-follower graph perform well compared to other policies, showing that our graph can be easily generalized to different settings.

Task setup. Our tasks take place in the pursuit-evasion domain. Within each task, we conduct experiments with simulated human behavior. Simulated humans move along a potential field as in Eqn. (1), where there are two sources of attraction, the agent's goal (a_g) and the crowd center (a_c), so simulated human agents make trade-offs between following the crowd and moving toward a target goal. The weights θ_g and θ_c encode how determined human players are about reaching the goal or moving towards the crowd. Higher θ_c means that players care more about being close to the whole team, while higher θ_g indicates more determined players who care about reaching their goals; this can make it harder for a robot to influence the humans. For validating our framework, we use weights θ_g = 0.6 and θ_c = 0.4, where going toward the goal is slightly more advantageous than moving with the whole team, setting moderate challenges for the robot to complete tasks.

At the same time, we prevent agents from colliding with one another by making non-target agents and goals sources of repulsion with weight θ_j = 1.

An important limitation is that our simulation does not capture the broad range of behaviors humans would show in response to a robot that is trying to influence them. We only offer one simple interpretation of how humans would react, based on the potential field.

In each iteration of the task, the initial positions of agents and goals are randomized. For all of our experiments, the size of the game canvas is 500 × 500 pixels. At every timestep, a simulated human can move 1 unit in one of the four directions (up, down, left, right) or stay at its position. The robot agent's action space is the same, but it can move 5 units at a time. The maximum game time limit for all our experiments is 1000 timesteps. For convenience, we will refer to simulated human agents as human agents for the rest of the section.
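For reference, these constants can be collected into a single configuration object; the field names below are our own, while the values are the ones stated above.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    canvas_size: int = 500        # 500 x 500 pixel game canvas
    human_step: int = 1           # units a simulated human moves per timestep
    robot_step: int = 5           # units the robot moves per timestep
    max_timesteps: int = 1000     # game time limit
    theta_goal: float = 0.6       # attraction weight toward the target goal
    theta_crowd: float = 0.4      # attraction weight toward the crowd center
    theta_repulsion: float = 1.0  # repulsion weight for non-target agents and goals
```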

A. Changing edges of the leader-follower graph

We evaluate a robot's ability to change an edge of a leader-follower graph. Unlike the other tasks we described, the goal of this task is not to affect some larger outcome in the game (e.g., influence humans toward a particular goal). Instead, this task serves as a preliminary to the others, where we evaluate how well a robot is able to manipulate the leader-follower graph.

Methods. Given a human agent i who is predisposed to following a goal with weights θ_g = 0.6, θ_a = 0.4, we created a robot policy that encouraged the agent to follow the robot r instead. The robot optimized for the probability w_ir that it would become agent i's leader, as outlined in Eqn. (3). We compared our approach against a random baseline, where the robot chooses its next action at random.

Metrics. For both our approach and our baseline, we evaluated the performance of the robot based on the leadership scores, i.e., the probabilities w_ir given by the leader-follower graph.

Results. We show that the robot can influence a human agent to follow it. Fig. 7 contains probabilities averaged over ten tasks. The probability of the robot being the human agent's leader, w_ir, increases over time and averages to 73%, represented by the orange dashed line. Our approach performs well compared to the random baseline, which has an average performance of 26%, represented by the grey dashed line.

Fig. 7: Smoothed probabilities of a human agent following the robot over 10 tasks (probability vs. task steps, for our approach and the random baseline). The robot is quickly able to become the human agent's leader, with an average leadership score of 73%.

B. Adversarial task

We consider a different task where the robot is an adversary that is trying to distract a team of humans from reaching a goal. There are m goals and n agents in the pursuit-evasion game. Among the n agents, we have 1 robot and n − 1 humans. The robot's goal is to distract the team of humans so that they cannot converge to the same goal. In order to capture a goal, we enforce the constraint that n − 2 agents must collide with it at the same time, allowing 1 agent to be absent. Since we allow one agent to be absent, this prevents the robot from adopting the simple strategy of blocking an agent throughout the game. The game ends when all goals are captured or when the game time exceeds the limit. Thus, another way to interpret the robot's objective is to extend the game time as much as possible.

Methods. We compared our approach against 3 baseline heuristic strategies (random, to one player, to farthest goal) that do not use the leader-follower graph.

In the Random strategy, the robot picks an action at each timestep with uniform probability. In the To one player strategy, the robot randomly selects a human agent and then moves towards it to try to block its way. Finally, a robot enacting the To farthest goal strategy selects the goal whose average distance to the human players is largest and then goes to that goal, in the hope that the human agents will be influenced, or may change goals upon observing that some players are heading for another goal.

We also experimented with two variations of our approach. LFG closest pursuer involves the robot agent selecting the closest pursuer and choosing an action to maximize the probability of that pursuer following it (as predicted by the LFG). Similarly, the LFG influential pursuer strategy involves the robot targeting the most influential human agent predicted by the LFG and then conducting the same optimization of maximizing the following probability, as shown in Eqn. (4).

Finally, we explored different game settings by varying the number of players n = 3, ..., 6 and the number of goals m = 1, ..., 4.

Metrics. We evaluated the performance of the robot with game time as the metric. Longer game time indicates that the robot does well in distracting human players.

Results. We sampled 50 outcomes for every game setting and robot policy. Tables II and III show the mean and standard deviation of game time for every game setting and robot policy across the 50 samples. Across game settings, our models based on the leader-follower graph consistently outperformed methods without knowledge of the leader-follower graph. These experimental results are also visualized in Fig. 8.

As can be seen from Fig. 8, the average game time goes up as the number of players increases. This is because it is more challenging for more players to agree upon which goal to capture. The consistent advantageous performance across all game settings suggests the effectiveness of the leader-follower graph for inference and optimization in this scenario.

We demonstrate an example of robot behavior in the adversarial game in Fig. 9 (dashed lines represent trajectories).

Fig. 8: Average game time over 50 games with a different number of players, across all baseline methods and our model.

Fig. 9: Adversarial task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

Player 1 and player 2 start very close to goal g_1, making g_1 the probable target. The robot then approaches agent 2, blocks its way, and leads it to another goal g_2. In this way, the robot successfully extends the game time.

C. Cooperative task

Finally, we evaluate the robot in a cooperative setting where the robot tries to be helpful to human teams. The robot aims to lead its teammates in a way that everyone can reach the target goal that gives the team the greatest joint utility, g* ∈ G. However, g* is not immediately observable to all teammates; we assume a setting where only the robot knows where g* is.

For consistency, the experiment setting is the same as the Adversarial task, where n − 2 human agents need to collide with a goal to capture it. In this scenario, the task is considered successful if the goal with the greatest joint utility g* is captured. The task is unsuccessful if any other suboptimal goal is captured or if the game time exceeds the limit.

Methods. Similar to the Adversarial task, we explore two variations of our approach, where the robot chooses to influence its closest human agent or the most influential agent predicted by the leader-follower graph. The variation where the robot targets the most influential leader is outlined in Eqn. (5). Different from the Adversarial task, here the robot is optimizing the probability of both becoming the target agent's leader and going toward the target goal. We also experimented with three baseline strategies. The Random strategy involves the robot taking actions at random. The To target goal strategy has the robot move directly to the optimal goal g* and stay there, trying to attract other human agents. The To goal farthest player strategy involves the robot going to the player that is farthest away from g*, in an attempt to indirectly influence the player to move back to g*. Finally, we experimented with game settings by varying the number of goals m = 2, ..., 6.

Metrics. We evaluated the performance of the robot strategy using the game success rate over 100 games. We calculate the success rate by dividing the number of successful games by the number of total games.

TABLE II: Average game time over 50 adversarial games with varying number of players (number of goals m = 2).

Model                          | n=3          | n=4          | n=5          | n=6
-------------------------------|--------------|--------------|--------------|--------------
LFG closest pursuer (ours)     | 233.04±51.82 | 305.08±49.48 | 461.18±55.73 | 550.88±51.67
LFG influential pursuer (ours) | 201.94±45.15 | 286.44±48.54 | 414.78±50.98 | 515.92±48.80
random                         | 129.2±32.66  | 209.40±39.86 | 388.92±53.24 | 437.16±43.17
to one pursuer                 | 215.04±50.00 | 231.42±44.69 | 455.16±58.35 | 472.36±49.75
to farthest goal               | 132.84±34.22 | 198.5±36.14  | 382.08±52.59 | 445.64±46.77

TABLE III: Average game time over 50 adversarial games with varying number of goals (number of players n = 4).

Model                          | m=1          | m=2          | m=3          | m=4
-------------------------------|--------------|--------------|--------------|--------------
LFG closest pursuer (ours)     | 210.94±33.23 | 305.08±49.48 | 289.22±52.99 | 343.00±55.90
LFG influential pursuer (ours) | 239.04±39.73 | 286.44±48.54 | 219.56±41.00 | 301.80±52.00
random                         | 155.94±21.42 | 209.40±39.86 | 205.74±43.05 | 294.62±54.01
to one pursuer                 | 123.58±9.56  | 231.42±44.69 | 225.52±41.47 | 317.92±54.75
to farthest goal               | 213.36±34.83 | 198.5±36.14  | 218.68±43.67 | 258.30±50.64

Results. Our results are summarized in Table IV. We expected the To target goal baseline to be very strong, since moving towards the target goal directly influences other simulated agents to also move toward the target goal. We find that this baseline strategy is especially effective when the game is not complex, i.e., when the number of goals is small. However, our model based on the leader-follower graph still demonstrates competitive performance compared to it. Especially as the number of goals increases, the advantage of the leader-follower graph becomes clearer. This indicates that in complex scenarios, brute-force methods that do not have knowledge of the human team's hidden structure will not suffice. Another thing to note is that the difference between all of the strategies becomes smaller as the number of goals increases. This is because the difficulty of the game increases for all strategies, and thus whether a game succeeds depends more on the game's initial conditions. We show snapshots of one game in Fig. 10.

TABLE IV: Success rate over 100 collaborative games with varying number of goals m (number of players n = 4).

Model                   | m=2  | m=3  | m=4  | m=5  | m=6
------------------------|------|------|------|------|------
LFG closest pursuer     | 0.59 | 0.38 | 0.29 | 0.27 | 0.22
LFG influential pursuer | 0.57 | 0.36 | 0.32 | 0.24 | 0.19
random                  | 0.55 | 0.35 | 0.24 | 0.21 | 0.20
to target goal          | 0.60 | 0.42 | 0.28 | 0.24 | 0.21
to goal farthest player | 0.47 | 0.29 | 0.17 | 0.19 | 0.21

In this game, the robot approaches the other agents and the desired goal in the collaborative pattern, trying to help capture the goal g_1.

Fig. 10: Collaborative task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

VI. DISCUSSION

Summary. We propose an approach for modeling leading and following behavior in multi-agent human teams. We use a combination of data-driven and graph-theoretic techniques to learn a leader-follower graph. This graph representation, encoding the human team's hidden structure, is scalable with the team size (number of agents), since we first learn local pairwise relationships and then combine them to create a global model. We demonstrate the effectiveness of the leader-follower graph by experimenting with optimization-based robot policies that leverage the graph to influence human teams in different scenarios. Our policies are general and perform well across all tasks compared to other high-performing, task-specific policies.

Limitations and Future Work. We view our work as a first step toward modeling latent, dynamic human team structures. Perhaps our greatest limitation is the reliance on simulated human behavior to validate our framework. Further experiments with real human data are needed to support our framework's effectiveness for noisier human behavior understanding. Moreover, the robot policies that use the leader-follower graphs are fairly simple. Although this may be a limitation, it is also promising that simple policies were able to perform well using the leader-follower graph.

For future work, we plan to evaluate our model in large-scale human-robot experiments in both simulation and navigation settings, such as robot swarms, to further improve our model's generalization capacity. We also plan on experimenting with combining the leader-follower graph with more advanced policy learning methods, such as deep reinforcement learning. We think the leader-follower graph could contribute to multi-agent reinforcement learning in various ways, such as reward design and state representation.

Acknowledgements. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would also like to acknowledge General Electric (GE) and the Stanford School of Engineering Fellowship.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements. The International Journal of Robotics Research, 33(2):268–304, 2014.
[3] Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea L. Thomaz. Keyframe-based learning from demonstration. International Journal of Social Robotics, 4(4):343–355, 2012.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[5] Kim Baraka, Ana Paiva, and Manuela Veloso. Expressive lights for revealing mobile service robot state. In Robot 2015: Second Iberian Robotics Conference, pages 107–119. Springer, 2016.
[6] Jerome Barraquand, Bruno Langlois, and J.-C. Latombe. Numerical potential field techniques for robot path planning. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):224–241, 1992.
[7] Aaron Bestick, Ruzena Bajcsy, and Anca D. Dragan. Implicitly assisting humans to choose good grasps in robot to human handovers. In International Symposium on Experimental Robotics, pages 341–354. Springer, 2016.
[8] Erdem Biyik and Dorsa Sadigh. Batch active preference-based learning of reward functions. In Conference on Robot Learning (CoRL), October 2018.
[9] Frank Broz, Illah Nourbakhsh, and Reid Simmons. Designing POMDP models of socially situated tasks. In RO-MAN 2011, pages 39–46. IEEE, 2011.
[10] Sonia Chernova and Andrea L. Thomaz. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(3):1–121, 2014.
[11] Yoeng-Jin Chu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
[12] D. Scott DeRue. Adaptive leadership theory: Leading and following as a complex adaptive process. Research in Organizational Behavior, 31:125–150, 2011.
[13] Anca D. Dragan, Dorsa Sadigh, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
[14] Finale Doshi and Nicholas Roy. Efficient model learning for dialog management. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 65–72. ACM, 2007.
[15] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part 1(335-345):25, 1968.
[16] Katie Genter, Noa Agmon, and Peter Stone. Ad hoc teamwork for leading a flock. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 531–538. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[17] Matthew C. Gombolay, Cindy Huang, and Julie A. Shah. Coordination of human-robot teaming with human task preferences. In AAAI Fall Symposium Series on AI-HRI, volume 11, page 2015, 2015.
[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[19] Shervin Javdani, Henny Admoni, Stefania Pellegrinelli, Siddhartha S. Srinivasa, and J. Andrew Bagnell. Shared autonomy via hindsight optimization for teleoperation and teaming. The International Journal of Robotics Research, page 0278364918776060, 2018.
[20] Takayuki Kanda, Takayuki Hirano, Daniel Eaton, and Hiroshi Ishiguro. Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19(1):61–84, 2004.
[21] Richard M. Karp. A simple derivation of Edmonds' algorithm for optimum branchings. Networks, 1(3):265–272, 1971.
[22] Piyush Khandelwal, Samuel Barrett, and Peter Stone. Leading the way: An efficient multi-robot guidance system. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1625–1633. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
[23] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[24] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008. Zurich, Switzerland, 2008.
[25] Oliver Lemon. Conversational interfaces. In Data-Driven Methods for Adaptive Spoken Dialogue Systems, pages 1–4. Springer, 2012.
[26] Michael L. Littman and Peter Stone. Leading best-response strategies in repeated games. In Seventeenth Annual International Joint Conference on Artificial Intelligence Workshop on Economic Agents, Models, and Mechanisms. Citeseer, 2001.
[27] Owen Macindoe, Leslie Pack Kaelbling, and Tomas Lozano-Perez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.
[28] Stefanos Nikolaidis and Julie Shah. Human-robot cross-training: Computational formulation, modeling and evaluation of a human team training strategy. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, pages 33–40. IEEE Press, 2013.
[29] Stefanos Nikolaidis, Anton Kuznetsov, David Hsu, and Siddharta Srinivasa. Formalizing human-robot mutual adaptation: A bounded memory model. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction, pages 75–82. IEEE Press, 2016.
[30] Stefanos Nikolaidis, Swaprava Nath, Ariel D. Procaccia, and Siddhartha Srinivasa. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 323–331. ACM, 2017.
[31] Shayegan Omidshafiei, Ali-Akbar Agha-Mohammadi, Christopher Amato, and Jonathan P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5962–5969. IEEE, 2015.
[32] Robert Platt Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
[33] Ben Robins, Kerstin Dautenhahn, Rene Te Boekhorst, and Aude Billard. Effects of repeated exposure to a humanoid robot on children with autism. Designing a More Inclusive World, pages 225–236, 2004.
[34] Michael Rubenstein, Adrian Cabrera, Justin Werfel, Golnaz Habibi, James McLurkin, and Radhika Nagpal. Collective transport of complex objects by simple robots: Theory and experiments. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 47–54. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[35] Dorsa Sadigh. Safe and Interactive Autonomy: Control, Learning, and Verification. PhD thesis, University of California, Berkeley, 2017.
[36] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca Dragan. Information gathering actions over human internal state. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 66–73. IEEE, October 2016. doi: 10.1109/IROS.2016.7759036.
[37] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Proceedings of Robotics: Science and Systems (RSS), June 2016. doi: 10.15607/RSS.2016.XII.029.
[38] Dorsa Sadigh, Nick Landolfi, Shankar S. Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for cars that coordinate with people: Leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405–1426, 2018.
[39] Michaela C. Schippers, Deanne N. Den Hartog, Paul L. Koopman, and Daan van Knippenberg. The role of transformational leadership in enhancing team reflexivity. Human Relations, 61(11):1593–1616, 2008.
[40] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, and Noa Agmon. Teaching and leading an ad hoc teammate: Collaboration without pre-coordination. Artificial Intelligence, 203:35–65, 2013.
[41] Fay Sudweeks and Simeon J. Simoff. Leading conversations: Communication behaviours of emergent leaders in virtual teams. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pages 108a–108a. IEEE, 2005.
[42] Daniel Szafir, Bilge Mutlu, and Terrence Fong. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 19–26. IEEE, 2015.
[43] Zijian Wang and Mac Schwager. Force-amplifying n-robot transport system (force-ants) for cooperative planar manipulation without communication. The International Journal of Robotics Research, 35(13):1564–1586, 2016.
[44] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

VII. APPENDIX

A. Potential Field for Simulated Human Planning

In our implementation, the attractive potential field of attraction i, denoted as U^{att}_i(q), is constructed as the square of the Euclidean distance ρ_i(q) between the agent at location q and attraction i at location q_i. In this way, the attraction increases as the distance to the goal becomes larger. ε is the hyper-parameter controlling how strong the attraction is, and it has a consistent value for all attractions:

\rho_i(q) = \| q - q_i \|

U^{att}_i(q) = \frac{1}{2} \epsilon \, \rho_i(q)^2

-\nabla U^{att}_i(q) = -\epsilon \, \rho_i(q) \, \nabla \rho_i(q)

The repulsive potential field U^{rep}_j(q) is used for obstacle avoidance. It usually has a limited effective range, since we don't want the obstacle to affect agents' planning if they are far away from each other. Our choice for U^{rep}_j(q) has a limited range γ_0, outside of which the value is zero. Within distance γ_0, the repulsive potential field increases as the agent approaches the obstacle. Here, we denote the minimum distance from the agent to the obstacle j as γ_j(q). The coefficient η and range γ_0 are the hyper-parameters controlling how conservative we want our collision avoidance to be, and they are consistent for all obstacles. Larger values of η and γ_0 mean that we are more conservative with collision avoidance and want the agent to keep a larger distance from obstacles:

\gamma_j(q) = \min_{q' \in \text{obs}_j} \| q - q' \|

U^{rep}_j(q) =
\begin{cases}
\frac{1}{2} \eta \left( \frac{1}{\gamma_j(q)} - \frac{1}{\gamma_0} \right)^2 & \gamma_j(q) < \gamma_0 \\
0 & \gamma_j(q) \geq \gamma_0
\end{cases}

\nabla U^{rep}_j(q) =
\begin{cases}
\eta \left( \frac{1}{\gamma_j(q)} - \frac{1}{\gamma_0} \right) \left( \frac{1}{\gamma_j(q)^2} \right) \nabla \gamma_j(q) & \gamma_j(q) < \gamma_0 \\
0 & \gamma_j(q) \geq \gamma_0
\end{cases}

Page 2: Influencing Leading and Following in Human-Robot Teamsiliad.stanford.edu/pdfs/publications/kwon2019influencing.pdf · Running Example: Pursuit-Evasion Game. We define a multi-player

how we can construct intelligent robot policies that induce desired behaviors from people [38, 18, 29, 30, 7]. However, all of these works optimize for robot policies that influence only a single human. Human state estimation, and optimization for actions based on that estimation, will be computationally infeasible with larger groups of humans.

Instead of keeping track of each individual's state in a team, we propose a more scalable method that estimates the collective team's state. Our key insight is that, similar to individuals, teams exhibit behavioral patterns that robots can use to create intelligent influencing policies. One particular behavioral pattern we focus on is leading and following relationships.

In this paper, we introduce a framework for extracting underlying leading and following structures in human teams that is scalable with the number of human agents. We call this structure a leader-follower graph. This graph provides a concise and informative representation of the current state of the team and can be used when planning. We then use the leader-follower graph to optimize for robot policies that can influence a human team to achieve a goal.

Our contributions in this paper are as follows:

• Formalizing and learning a scalable structure, the leader-follower graph, that captures complex leading and following relationships between members in human teams.

• Developing optimization-based robot strategies that leverage the leader-follower graph to influence the team towards a more efficient objective.

• Providing simulation experiments in a pursuit-evasion game demonstrating the robot's influencing strategies to redirect a leader-follower relationship, distract a team, and lead a team towards an optimal goal based on its learned leader-follower graph.

II. FORMALISM FOR MODELING LEADING AND FOLLOWING IN HUMAN TEAMS

Running Example: Pursuit-Evasion Game. We define a multi-player pursuit-evasion game on a 2D plane as our main running example. In this game, each pursuer is an agent in the set of agents $I$ that can take actions in the 2D space to navigate. There are a number of stationary evaders, which we refer to as goals $G$. The objective of the pursuers is to collaboratively capture the goals. Fig. 2 shows an example of a game with three pursuers, shown in orange, and three goals, shown in green. The action space of each agent is identical, $A_i = \{$move up, move down, move left, move right, stay still$\}$; the action spaces of all agents collectively define the joint action space $A$. All pursuers must jointly and implicitly agree on a goal to target without directly communicating with one another. A goal is captured when all pursuers collide with it, as shown in Fig. 2b.

Fig. 2: Pursuit-evasion game. (a) Pursuit-evasion game with three goals (green circles) and three pursuers (orange triangles); the pursuers must jointly agree on a goal to target. (b) Three pursuers collide with a goal to capture it.

Leaders and Followers. In order to collectively capture a goal, each agent $i \in I$ must decide to follow a goal or another agent. We refer to the goal or agent being followed as a leader. Formally, we let $l_i \in G \cup I$, where $l_i$ is either an agent or a fixed goal $g$ who is the leader of agent $i$ (agent $i$ follows $l_i$). This is shown in Fig. 3a, where agent 2 follows goal $g_1$ ($l_2 = g_1$) and agent 3 follows agent 2 ($l_3 = 2$).

The set of goals $G$ abstracts the idea of the agents reaching a set of states in order to fully optimize the joint reward function. For instance, in a pursuit-evasion game, all the agents need to reach a state corresponding to the goals being captured. An individual goal $g$ intuitively signifies a way for the agents to coordinate strategies with each other. For instance, in a pursuit-evasion game, the agents should collaboratively plan on actions that capture each goal.

Leader-Follower Graph. The set of leaders and followers form a directed leader-follower graph, as shown in Fig. 3a. Each node represents an agent $i \in I$ or goal $g \in G$. The directed edges represent leading-following relationships, where there is an outgoing edge from a follower to its leader. The weights on the edges represent a leadership score, which is the probability that the tail node is the head node's leader. For instance, in Fig. 3a, $w_{32}$ represents the probability that 2 is 3's leader. The leader-follower graph is dynamic in that agents can decide to change their leaders at any time. We assume that there could be an implicit transitivity in a leader-follower graph, i.e., if an agent $i$ follows an agent $j$, implicitly it could be following agent $j$'s believed ultimate goal.

Some patterns are not desirable in a leader-follower graph. For instance, an agent would never follow itself, and we do not expect to observe cyclic leading-following behaviors (Fig. 3b). Other patterns that are likely include chain patterns (Fig. 3c) or patterns with multiple teams, where multiple agents directly follow goals (Fig. 3d). We describe how to construct a leader-follower graph that is scalable with the number of agents and avoids the undesirable patterns in Sec. III.

Partial Observability. The leader of each agent, $l_i$, is a latent variable. We assume that agents cannot directly observe the one-on-one leading and following dynamics of other agents. Thus, constructing leader-follower graphs can help robot teammates predict who will follow whom, allowing them to strategically influence teammates to adapt roles. We assume agents are homogeneous and have full information on the observations of themselves and all other agents (e.g., positions and velocities of agents).

III. CONSTRUCTION OF A LEADER-FOLLOWER GRAPH

In this section, we focus on constructing the leader-follower graph that emerges in collaborative teams using a combination of data-driven and graph-theoretic approaches. Our aim is to leverage this leader-follower graph to enable robot teammates to produce helpful leading behaviors. We will first focus on learning pairwise relationships between agents using a supervised learning approach. We then generalize our ideas to multi-player settings using algorithms from graph theory. Our combination of data-driven and graph-theoretic approaches allows the leader-follower graph to efficiently scale with the number of agents.

Fig. 3: Variations of the leader-follower graph. (a) Leader-follower graph: green circles are the goals that need to be captured, and orange triangles are the pursuers. (b) Cyclic leader-follower graph; we design policies that avoid such cyclic behaviors. (c) Chain behavior in the leader-follower graph. (d) Multiple teams.

A. Pairwise Leadership Scores

We will first focus on learning the probability of any agent $i$ following any goal or agent $j \in G \cup I$. The pairwise probabilities help us estimate the leadership score $w_{ij}$, i.e., the weight of the edge $(i, j)$ in the leader-follower graph.

We propose a general framework for estimating the leadership scores using a supervised learning approach. Consider a two-player setting where $I = \{i, j\}$: we collect labeled data where agent $i$ is asked to follow $j$, and agent $j$ is asked to optimize for the joint reward function assuming it is leading $i$, i.e., following a fixed goal $g$ in the pursuit-evasion game ($l_i = j$ and $l_j = g$). We then train a Long Short-Term Memory (LSTM) network with a softmax layer to predict each agent's most likely leader.

Training with a Scalable Network Architecture. Our network architecture consists of two LSTM submodules: one to predict player-player leader-follower relationships (P-P LSTM) and one to predict player-evader relationships (P-E LSTM). We use a softmax output layer with a cross-entropy loss function to get a probability distribution over $j$ and all goals $g \in G$ of being $i$'s leader. We take the leader (an agent or a goal) with the highest probability and assign this as the leadership score. The P-P and P-E submodules allow us to scale training to a game with any number of players and evaders, as we can add or remove P-P and P-E submodules depending on the number of players and evaders in a game. An example of our scalable network architecture is illustrated in Fig. 4.

Data Collection. We collect labeled human data by asking participants to play a pursuit-evasion game with pre-assigned leaders. We recruited pairs of humans and randomly assigned leaders $l_i$ to them (i.e., another agent or a goal). Participants played the game in a web browser using their arrow keys and were asked to move toward their assigned leader $l_i$. In order to create a balanced dataset, we collected data from all possible configurations of leaders and followers in a two-player setting. We collected a total of 186 games.

Fig. 4: Scalable neural network architecture. This example is predicting the probability of another agent $j$ being agent 2's leader, $w_{2j}$. There are three LSTM submodules used because there are two possible evaders and one possible agent that could be agent 2's leader.
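To make the architecture concrete, the following is a minimal sketch of the scalable P-P/P-E design described above and illustrated in Fig. 4, assuming PyTorch; the feature dimension, hidden size, and module names are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class PairwiseLeaderNet(nn.Module):
    """Scores every candidate leader (other pursuers and evaders) of a single agent."""

    def __init__(self, feat_dim=4, hidden_dim=64):
        super().__init__()
        # One LSTM shared across all pursuer-pursuer pairs and one shared across
        # all pursuer-evader pairs, so submodules can be added or removed per game.
        self.pp_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.pe_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)  # scalar logit per candidate leader

    def forward(self, pp_seqs, pe_seqs):
        # pp_seqs: (num_other_pursuers, T, feat_dim) relative-state sequences
        # pe_seqs: (num_evaders,        T, feat_dim)
        logits = []
        for seq in pp_seqs:                       # one P-P submodule per other pursuer
            _, (h, _) = self.pp_lstm(seq.unsqueeze(0))
            logits.append(self.score(h[-1]))
        for seq in pe_seqs:                       # one P-E submodule per evader
            _, (h, _) = self.pe_lstm(seq.unsqueeze(0))
            logits.append(self.score(h[-1]))
        logits = torch.cat(logits, dim=0).squeeze(-1)   # (num_candidates,)
        return torch.softmax(logits, dim=0)             # leadership scores over candidates
```

During training, the pre-softmax logits would be fed to a cross-entropy loss against the assigned leader label; at test time, the softmax output provides the leadership scores $w_{ij}$.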

In addition, since human data is often difficult to collect in large amounts, we augmented our dataset with synthetic data. The synthetic data is generated by simulating humans playing the pursuit-evasion game. We simulated humans based on the well-known potential field path planner model [6]. We find that the simple nature of the task given to humans (i.e., move directly toward your assigned leader $l_i$) is qualitatively easy to replicate using a potential field path planner. Agents at location $q$ plan their path under the influence of an artificial potential field $U(q)$, which is constructed to reflect the environment. Agents move toward their leaders $l_i$ by following an attractive potential field. Other agents and goals that are not their leaders are treated as obstacles and emit a repulsive potential field.

We denote the set of attractions as $i \in A$ and the set of repulsive obstacles as $j \in R$. The overall potential field is a weighted sum of the potential fields from all attractions and repulsive obstacles; $\theta_i$ is the weight for the attractive potential field from attraction $i \in A$, and $\theta_j$ is the weight for the repulsive potential field from obstacle $j \in R$:

$$U(q) = \sum_{i \in A} \theta_i U^i_{att}(q) + \sum_{j \in R} \theta_j U^j_{rep}(q) \qquad (1)$$

The optimal action $a$ an agent would take lies in the direction of the potential field gradient:

$$a = -\nabla U(q) = -\sum_{i \in A} \theta_i \nabla U^i_{att}(q) - \sum_{j \in R} \theta_j \nabla U^j_{rep}(q)$$

The attractive potential field increases as the distance to the goal becomes larger. The repulsive potential field has a fixed effective range, within which the potential field increases as the distance to the obstacle decreases. The attractive and repulsive potential fields are constructed in the same way for all attractions and repulsive obstacles. We elaborate on the details of the potential field functions in the Appendix, Section VII.

Implementation Details. We simulated 15,000 two-player games, equally representing the number of possible leader and follower configurations. Each game stored the position of each agent and goal at all timesteps. Before training, we pre-processed the data by shifting it to have a zero-centered mean, normalizing it, and down-sampling it. Each game was then fed into our network as a sequence. Based on our experiments, hyperparameters that worked well included a batch size of 250, a learning rate of 0.0001, and a hidden dimension size of 64. In addition, we used gradient clipping and layer normalization [4] to stabilize gradient updates. Using these hyperparameters, we trained our network for 100,000 steps.

Evaluating Pairwise Scores. Our network trained on two-player simulated data performed with a validation accuracy of 83%. We also experimented with training on three-player simulated data and two-player human data, as well as a combination of two-player simulated and human data (two-player mixed data). Validation results for the first 60,000 steps are shown in Fig. 5. For our mixed two-player model, we first took a model pre-trained on two-player simulated data and then trained it on two-player human data. The two-player mixed curve shows the model's accuracy as it is fine-tuned on human data. Our three-player model had a validation accuracy of 74%, while our two-player human model and our mixed two-player data model had a validation accuracy of 65%.

Fig. 5: Validation accuracy when calculating the pairwise leadership scores described in Sec. III-A, for the 2P simulated, 3P simulated, 2P human, and 2P mixed models over the first 60,000 training steps.

B. Maximum Likelihood Leader-Follower Graph in Teams

With a model that can predict pairwise relationships, we focus on building a model that can generalize beyond two players. We compute pairwise weights, or leadership scores, $w_{ij}$ of leader-follower relationships between all possible pairs of leaders $i$ and followers $j$. The pairwise weights can be computed based on the supervised learning approach described above, indicating the probability of one agent or goal being another agent's leader. After computing $w_{ij}$ for all combinations of leaders and followers, we create a directed graph $G = (V, E)$, where $V = I \cup G$, $E = \{(i, j) \mid i \in I, j \in I \cup G, i \neq j\}$, and the weights on each edge $(i, j)$ correspond to $w_{ij}$. To create a more concise representation, we extract the maximum likelihood leader-follower graph $G^*$ by pruning the edges of our constructed graph $G$. We prune the graph by greedily selecting the outgoing edge with the highest weight for each agent node. In other words, we select the edge associated with the agent or goal that has the highest probability of being agent $i$'s leader, where the probabilities correspond to edge weights. When pruning, we make sure that no cycles are formed; if we find a cycle, we choose the next most probable edge. Our pruning approach is inspired by Edmonds' algorithm [15, 11], which finds a maximum weight arborescence [21] in a graph. An arborescence is an acyclic directed tree structure where there is exactly one outgoing edge from a node to another. We use a modified version of Edmonds' algorithm to find the maximum likelihood leader-follower graph. Compared to our approach, a maximum weight arborescence is more restrictive, since it requires the resulting graph to be a tree.

Evaluating the Leader-Follower Graph. We evaluate the accuracy of our leader-follower graph with three or more agents when trained on simulated two-player and three-player data, as well as a combination of simulated and human two-player data (shown in Table I). In each of these multi-player games, we extracted a leader-follower graph at each timestep and compared our leader-follower graph's predictions against the ground-truth labels. Our leader-follower graph performs better than a random policy, which selects a leader $l_i \in I \cup G$ for agent $i$ at random, where $l_i \neq i$; the chance of being right is thus $\frac{1}{|G| + |I| - 1}$. We take the average of all success probabilities for all leader-follower graph configurations to compute the overall accuracy. In all experiments shown in Table I, our trained model outperforms the random policy. Most notably, the models scale naturally to settings with larger numbers of players as well as human data. We find that training with larger numbers of players helps with generalization. The tradeoff between training with larger numbers of players and collecting more data will depend on the problem domain. For our experiments (Section V), we use the model trained on three-player simulated data.

TABLE I: Generalization accuracy of the leader-follower graph (LFG).

Training Data         Testing Data          LFG Accuracy   Random Accuracy
2 players simulated   3 players simulated   0.67           0.29
2 players simulated   4 players simulated   0.45           0.23
2 players simulated   5 players simulated   0.41           0.19
2 players simulated   2 players human       0.68           0.44
2 players simulated   3 players human       0.47           0.29
2 players mixed       3 players simulated   0.44           0.29
2 players mixed       4 players simulated   0.38           0.23
2 players mixed       5 players simulated   0.28           0.19
2 players mixed       2 players human       0.69           0.44
2 players mixed       3 players human       0.44           0.29
3 players simulated   4 players simulated   0.53           0.23
3 players simulated   5 players simulated   0.50           0.19
3 players simulated   3 players human       0.63           0.29
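As a concrete illustration of the pruning step described above, the following is a minimal sketch of greedily selecting one acyclic outgoing edge per agent, in the spirit of the modified Edmonds' procedure; it processes agents in a fixed order and assumes that `weights[(i, j)]` holds the learned probability that `j` is agent `i`'s leader, both of which are simplifying assumptions rather than the authors' exact implementation.

```python
def prune_leader_follower_graph(agents, goals, weights):
    """Return {agent: (leader, score)}, keeping one cycle-free outgoing edge per agent."""
    best_leader = {}

    def creates_cycle(follower, leader):
        # Walk up the already-chosen leader chain; a cycle exists if it returns to `follower`.
        node = leader
        while node in best_leader:
            node = best_leader[node][0]
            if node == follower:
                return True
        return False

    for i in agents:
        # Candidate leaders sorted by leadership score, highest first.
        candidates = sorted(
            ((j, weights[(i, j)]) for j in list(agents) + list(goals) if j != i),
            key=lambda pair: pair[1],
            reverse=True,
        )
        for j, w in candidates:
            if j in goals or not creates_cycle(i, j):
                best_leader[i] = (j, w)   # edges into goals can never create cycles
                break
    return best_leader


# Example with two pursuers and one goal: a2 -> a1 would close a cycle,
# so the next most probable edge a2 -> g1 is chosen instead.
agents, goals = ["a1", "a2"], ["g1"]
weights = {("a1", "a2"): 0.7, ("a1", "g1"): 0.3,
           ("a2", "a1"): 0.6, ("a2", "g1"): 0.4}
print(prune_leader_follower_graph(agents, goals, weights))
# {'a1': ('a2', 0.7), 'a2': ('g1', 0.4)}
```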

Fig. 6: On the left, creating a maximum-likelihood leader-follower graph $G^*$: (a) the full graph, where the directed edges represent pairwise likelihoods that the tail node is the head node's leader, and (b) the maximum-likelihood leader-follower graph shown in bold. On the right, examples of influential leaders in $G^*$: (c) the most influential leader is agent 2, and (d) the most influential leader is agent 1, since the other human agents are already targeting the preferred goal. The green circles are goals, orange triangles are human agents, and black triangles are robot agents.

IV. PLANNING BASED ON INFERENCE ON LEADER-FOLLOWER GRAPHS

We have so far computed a graphical representation for latent leadership structures in human teams, $G^*$. In this section, we will use $G^*$ to positively influence human teams, i.e., move the team towards a more desirable outcome. We first describe how a robot can use the leader-follower graph to infer useful team structures. We then describe how a robot can leverage these inferences to plan for a desired outcome.

A. Inference based on the leader-follower graph

Leader-follower graphs enable a robot to infer useful information about a team, such as agents' goals or who the most influential leader is. These pieces of information help the robot identify key goals or agents that are useful in achieving a desired outcome (e.g., identifying shared goals in a collaborative task). Especially in multi-agent settings, following key goals or influencing key agents is a strategic move that allows the robot to plan for desired outcomes. We begin by describing different inferences a robot can perform on the leader-follower graph.

Goal inference in multi-agent settings. One way a robot can use structure in the leader-follower graph is to perform goal inference. An agent's goal can be inferred from the outgoing edges from agents to goals. In the case where there is an outgoing edge from an agent to another agent (i.e., agent $i$ follows agent $j$), we assume transitivity, where agent $i$ can be implicitly following agent $j$'s believed ultimate goal.

Influencing the most influential leader. In order to lead a team toward a desired goal, the robot can also leverage the leader-follower graph to predict who the most influential leader is. We define the most influential leader to be the agent $i^* \in I$ with the largest number of followers. Identifying the most influential leader $i^*$ allows the robot to strategically influence a single teammate, which also indirectly influences the other teammates that are following $i^*$. For example, in Fig. 6c and Fig. 6d, we show two examples of identifying the most influential leader from $G^*$. In the case where some of the agents are already going for the preferred goal, the one that has the most followers among the remaining players becomes the most influential leader, as shown in Fig. 6d.
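Reading these two inferences off the pruned graph is straightforward; the sketch below assumes the `best_leader` dictionary produced by the pruning sketch earlier (a simplifying assumption) and counts only direct followers when picking the most influential leader, with an optional exclusion set for agents already headed to the preferred goal (as in Fig. 6d).

```python
from collections import Counter

def infer_goal(agent, best_leader, goals, max_hops=10):
    """Follow leader edges transitively until a goal (or a dead end) is reached."""
    node = agent
    for _ in range(max_hops):
        if node in goals:
            return node
        if node not in best_leader:
            return None
        node = best_leader[node][0]
    return None

def most_influential_leader(best_leader, agents, exclude=()):
    """Agent with the most direct followers, optionally skipping excluded agents."""
    counts = Counter(
        leader for _, (leader, _) in best_leader.items()
        if leader in agents and leader not in exclude
    )
    return counts.most_common(1)[0][0] if counts else None
```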

B. Optimization based on leader-follower graph

We leverage the leader-follower graph to design robot policies that optimize for an objective (e.g., leading the team towards a preferred goal). We introduce one such way we can use the leader-follower graph: by directly incorporating inferences we make from it into a simple optimization.

At each timestep, we generate graphs $G^{a_t}_{t+k}$ that simulate what the leader-follower graph would look like at timestep $t+k$ if the robot takes action $a_t$ at the current timestep $t$. Over the next $k$ steps, we assume human agents will continue along their current trajectories with constant velocity.

From each graph $G^{a_t}_{t+k}$, we can obtain the weights $w^{t+k}_{ij}$ corresponding to an objective $r$ that the robot is optimizing for (e.g., the robot becoming agent $i$'s leader). We then iterate over the robot's actions, assuming the action set is finite and discrete, and choose the action $a^*_t$ that maximizes the objective $r$. We note that $r$ must be expressible in terms of the $w^{t+k}_{ij}$'s and $w^{t+k}_{ig}$'s for $i, j \in I$ and $g \in G$:

$$a^*_t = \arg\max_{a_t \in A} \; r\big(\{w^{t+k}_{ij}(a_t)\}_{i,j \in I}, \{w^{t+k}_{ig}(a_t)\}_{i \in I, g \in G}\big) \qquad (2)$$
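A minimal sketch of the constant-velocity rollout used to form $G^{a_t}_{t+k}$ is shown below; the position/velocity containers and the step length are illustrative assumptions, and the extrapolated trajectories would then be fed to the learned pairwise model of Sec. III to produce the edge weights $w^{t+k}_{ij}(a_t)$ used in Eqn. (2).

```python
import numpy as np

def rollout_positions(positions, velocities, robot_id, robot_action, k):
    """Extrapolate teammates at constant velocity for k steps while the robot applies a_t."""
    future = {}
    for agent, pos in positions.items():
        if agent == robot_id:
            future[agent] = pos + k * np.asarray(robot_action, dtype=float)
        else:
            future[agent] = pos + k * velocities[agent]   # constant-velocity assumption
    return future
```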

We describe three specific tasks that we plan for using the optimization described in Eqn. (2).

Redirecting a leader-follower relationship. A robot can directly influence team dynamics by changing leader-follower relationships. For a given directed edge between agents $i$ and $j$, the robot can use the optimization outlined in Eqn. (2) for actions that reverse an edge or direct the edge to a different agent. For instance, to reverse the direction of the edge from agent $i$ to agent $j$, the robot will select actions that maximize the probability of agent $j$ following agent $i$:

$$a^*_t = \arg\max_{a_t \in A} \; w^{t+k}_{ji}(a_t), \quad i, j \in I \qquad (3)$$

The robot can also take actions to eliminate an edge between agents $i$ and $j$ by minimizing $w_{ij}$. One might want to modify edges in the leader-follower graph when trying to change the leadership structure in a team. For instance, in a setting where agents must collectively decide on a goal, a robot can help unify a team with sub-groups (an example is shown in Fig. 3d) by re-directing the edges of one sub-group to follow another.

Distracting a team. In adversarial settings, a robot might want to prevent a team of humans from reaching a collective goal $g$. In order to stall the team, a robot can use the leader-follower graph to identify who the current most influential leader $i^*$ is. The robot can then select actions that maximize the probability of the robot becoming the most influential leader's leader and minimize the probability of the most influential leader following the collective goal $g$:

$$a^*_t = \arg\max_{a_t \in A} \; w^{t+k}_{i^*r}(a_t), \quad i^* \in I \qquad (4)$$

Distracting a team from reaching a collective goal can be useful in cases where the team is an adversary. For instance, a team of adversarial robots may want to prevent their opponents from collaboratively reaching a joint goal.

Leading a team towards the optimal goal. In collaborative settings where the team needs to agree on a goal $g \in G$, a robot that knows where the optimal goal $g^* \in G$ is should maximize joint utility by leading all of its teammates to reach $g^*$. To influence the team, the robot can use the leader-follower graph to infer who the current most influential leader $i^*$ is. The robot can then select actions that maximize the probability of the most influential leader following the optimal goal $g^*$:

$$a^*_t = \arg\max_{a_t \in A} \; w^{t+k}_{i^*r}(a_t) + w^{t+k}_{rg^*}(a_t), \quad i^* \in I \qquad (5)$$

Being able to lead a team of humans to a goal is useful in many real-life scenarios. For instance, in search-and-rescue missions, robots with more information about the location of survivors should be able to lead the team in the optimal direction.
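Since the robot's action set is small and discrete, the optimization in Eqns. (2)-(5) can be evaluated by enumeration; the sketch below is a minimal illustration of that loop, where the action set, the agent identifiers, and the `predict_edge_weights` function are hypothetical placeholders rather than the paper's implementation.

```python
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]   # unit moves; scale is task-specific

def choose_action(predict_edge_weights, objective):
    """Pick the action whose predicted edge weights maximize the scalar objective r."""
    best_action, best_value = None, float("-inf")
    for action in ACTIONS:
        weights = predict_edge_weights(action)   # {(follower, leader): w^{t+k}} for this action
        value = objective(weights)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Example objectives, with "r" the robot, "i_star" the most influential leader,
# and "g_star" the optimal goal (all placeholder identifiers):
redirect = lambda w: w[("j", "i")]                            # Eqn. (3): make j follow i
distract = lambda w: w[("i_star", "r")]                       # Eqn. (4): make i* follow the robot
lead     = lambda w: w[("i_star", "r")] + w[("r", "g_star")]  # Eqn. (5): ...while the robot heads to g*
```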

V. EXPERIMENTS

With an optimization framework that plans for robot actions using the leader-follower graph, we evaluate our approach on the three tasks described in Sec. IV. For each task, we compare the task performance of robot policies that use the leader-follower graph against robot policies that do not. Across all tasks, we find that robot policies that use the leader-follower graph perform well compared to other policies, showing that our graph can be easily generalized to different settings.

Task setup. Our tasks take place in the pursuit-evasion domain. Within each task, we conduct experiments with simulated human behavior. Simulated humans move along a potential field, as shown in Eqn. (1), where there are two sources of attraction: the agent's goal ($a_g$) and the crowd center ($a_c$). Simulated human agents make trade-offs between following the crowd and moving toward a target goal. The weights $\theta_g$ and $\theta_c$ encode how determined human players are about reaching the goal or moving towards the crowd. Higher $\theta_c$ means that players care more about being close to the whole team, while higher $\theta_g$ indicates more determined players who care about reaching their goals; this can make it harder for a robot to influence the humans. For validating our framework, we use weights $\theta_g = 0.6$ and $\theta_c = 0.4$, where going toward the goal is slightly more advantageous than moving with the whole team, setting moderate challenges for the robot to complete the tasks. At the same time, we prevent agents from colliding with one another by making non-target agents and goals sources of repulsion with weight $\theta_j = 1$.

An important limitation is that our simulation does not capture the broad range of behaviors humans would show in response to a robot that is trying to influence them. We only offer one simple interpretation of how humans would react, based on the potential field.

In each iteration of the task, the initial positions of agents and goals are randomized. For all of our experiments, the size of the game canvas is $500 \times 500$ pixels. At every time step, a simulated human can move 1 unit in one of the four directions (up, down, left, right) or stay at its position. The robot agent's action space is the same, but it can move 5 units at a time. The maximum game time limit for all our experiments is 1000 timesteps. For convenience, we will refer to simulated human agents as human agents for the rest of the section.

A. Changing edges of the leader-follower graph

We evaluate a robot's ability to change an edge of a leader-follower graph. Unlike the other tasks we described, the goal of this task is not to affect some larger outcome in the game (e.g., influence humans toward a particular goal). Instead, this task serves as a preliminary to the others, where we evaluate how well a robot is able to manipulate the leader-follower graph.

Methods. Given a human agent $i$ who is predisposed to following a goal with weights $\theta_g = 0.6$ and $\theta_c = 0.4$, we created a robot policy that encouraged the agent to follow the robot $r$ instead. The robot optimized for the probability $w_{ir}$ that it would become agent $i$'s leader, as outlined in Eqn. (3). We compared our approach against a random baseline, where the robot chooses its next action at random.

Metrics. For both our approach and our baseline, we evaluated the performance of the robot based on the leadership scores, i.e., the probabilities $w_{ir}$ given by the leader-follower graph.

Results. We show that the robot can influence a human agent to follow it. Fig. 7 contains probabilities averaged over ten tasks. The probability of the robot being the human agent's leader, $w_{ir}$, increases over time and averages to 73%, represented by the orange dashed line. Our approach performs well compared to the random baseline, which has an average performance of 26%, represented by the grey dashed line.

Fig. 7: Smoothed probabilities of a human agent following the robot over 10 tasks, for our approach and the random baseline. The robot is quickly able to become the human agent's leader, with an average leadership score of 73%.

B. Adversarial task

We consider a different task where the robot is an adversary that is trying to distract a team of humans from reaching a goal. There are $m$ goals and $n$ agents in the pursuit-evasion game. Among the $n$ agents, we have 1 robot and $n-1$ humans. The robot's goal is to distract the team of humans so that they cannot converge to the same goal. In order to capture a goal, we enforce the constraint that $n-2$ agents must collide with it at the same time, allowing 1 agent to be absent. Since we allow one agent to be absent, this prevents the robot from adopting the simple strategy of blocking an agent throughout the game. The game ends when all goals are captured or when the game time exceeds the limit. Thus, another way to interpret the robot's objective is to extend the game time as much as possible.

Methods. We compared our approach against 3 baseline heuristic strategies (random, to one player, to farthest goal) that do not use the leader-follower graph.

In the Random strategy, the robot picks an action at each timestep with uniform probability. In the To one player strategy, the robot randomly selects a human agent and then moves towards it to try to block its way. Finally, a robot enacting the To farthest goal strategy selects the goal for which the average distance to the human players is largest and then goes to that goal, in the hope that human agents will be influenced, or may change goals upon observing that some players are heading for another goal.
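For concreteness, a minimal sketch of the To farthest goal heuristic is given below, under the assumption that positions are 2D numpy arrays; the step size is an illustrative parameter rather than the paper's setting.

```python
import numpy as np

def to_farthest_goal_action(robot_pos, human_positions, goal_positions, step=5.0):
    """Head toward the goal whose average distance to the human players is largest."""
    avg_dist = {
        g: float(np.mean([np.linalg.norm(p - gp) for p in human_positions]))
        for g, gp in goal_positions.items()
    }
    target = goal_positions[max(avg_dist, key=avg_dist.get)]
    direction = target - robot_pos
    norm = np.linalg.norm(direction)
    return np.zeros(2) if norm < 1e-9 else step * direction / norm
```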

We also experimented with two variations of our approach. LFG closest pursuer involves the robot agent selecting the closest pursuer and choosing an action to maximize the probability of the pursuer following it (as predicted by the LFG). Similarly, the LFG influential pursuer strategy involves the robot targeting the most influential human agent predicted by the LFG and then conducting the same optimization of maximizing the following probability, as shown in Eqn. (4).

Finally, we explored different game settings by varying the number of players, $n = 3, \dots, 6$, and the number of goals, $m = 1, \dots, 4$.

Metrics. We evaluated the performance of the robot with game time as the metric. Longer game time indicates that the robot does well in distracting the human players.

Results. We sampled 50 outcomes for every game setting and robot policy. Tables II and III show the mean and standard deviation of game time for every game setting and robot policy across the 50 samples. Across game settings, our models based on the leader-follower graph consistently outperformed methods without knowledge of the leader-follower graph. These experimental results are also visualized in Fig. 8.

As can be seen from Fig. 8, the average game time goes up as the number of players increases. This is because it is more challenging for more players to agree on which goal to capture. The consistently advantageous performance across all game settings suggests the effectiveness of the leader-follower graph for inference and optimization in this scenario.

Fig. 8: Average game time over 50 games with different numbers of players, across all baseline methods and our model.

Fig. 9: Adversarial task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

We demonstrate an example of robot behavior in the adversarial game shown in Fig. 9 (dashed lines represent trajectories). Player 1 and player 2 start very close to goal $g_1$, making $g_1$ the probable target. The robot then approaches agent 2, blocks its way, and leads it to another goal, $g_2$. In this way, the robot successfully extends the game time.

C. Cooperative task

Finally, we evaluate the robot in a cooperative setting where the robot tries to be helpful to human teams. The robot aims to lead its teammates in a way that everyone can reach the target goal that gives the team the greatest joint utility, $g^* \in G$. However, $g^*$ is not immediately observable to all teammates; we assume a setting where only the robot knows where $g^*$ is.

For consistency, the experiment setting is the same as the Adversarial task, where $n-2$ human agents need to collide with a goal to capture it. In this scenario, the task is considered successful if the goal with the greatest joint utility, $g^*$, is captured. The task is unsuccessful if any other suboptimal goal is captured or if the game time exceeds the limit.

Methods. Similar to the Adversarial task, we explore two variations of our approach, where the robot chooses to influence its closest human agent or the most influential agent predicted by the leader-follower graph. The variation where the robot targets the most influential leader is outlined in Eqn. (5). Different from the Adversarial task, here the robot is optimizing the probability of both becoming the target agent's leader and going toward the target goal. We also experimented with three baseline strategies. The Random strategy involves the robot taking actions at random. The To target goal strategy has the robot move directly to the optimal goal $g^*$ and stay there, trying to attract other human agents. The To goal farthest player strategy involves the robot going to the player that is farthest away from $g^*$, in an attempt to indirectly influence the player to move back to $g^*$. Finally, we experimented with game settings by varying the number of goals, $m = 2, \dots, 6$.

Metrics. We evaluated the performance of the robot strategy using the game success rate over 100 games. We calculate the success rate by dividing the number of successful games by the number of total games.

TABLE II: Average game time over 50 adversarial games with varying number of players (number of goals m=2).

Model                           n=3             n=4             n=5             n=6
LFG closest pursuer (ours)      233.04±51.82    305.08±49.48    461.18±55.73    550.88±51.67
LFG influential pursuer (ours)  201.94±45.15    286.44±48.54    414.78±50.98    515.92±48.80
random                          129.2±32.66     209.40±39.86    388.92±53.24    437.16±43.17
to one pursuer                  215.04±50.00    231.42±44.69    455.16±58.35    472.36±49.75
to farthest goal                132.84±34.22    198.5±36.14     382.08±52.59    445.64±46.77

TABLE III: Average game time over 50 adversarial games with varying number of goals (number of players n=4).

Model                           m=1             m=2             m=3             m=4
LFG closest pursuer (ours)      210.94±33.23    305.08±49.48    289.22±52.99    343.00±55.90
LFG influential pursuer (ours)  239.04±39.73    286.44±48.54    219.56±41.00    301.80±52.00
random                          155.94±21.42    209.40±39.86    205.74±43.05    294.62±54.01
to one pursuer                  123.58±9.56     231.42±44.69    225.52±41.47    317.92±54.75
to farthest goal                213.36±34.83    198.5±36.14     218.68±43.67    258.30±50.64

Results. Our results are summarized in Table IV. We expected the To target goal baseline to be very strong, since moving towards the target goal directly influences other simulated agents to also move toward the target goal. We find that this baseline strategy is especially effective when the game is not complex, i.e., when the number of goals is small. However, our model based on the leader-follower graph still demonstrates competitive performance compared to it. Especially as the number of goals increases, the advantage of the leader-follower graph becomes clearer. This indicates that in complex scenarios, brute-force methods that do not have knowledge of the human team's hidden structure will not suffice. Another thing to note is that the difference between all of the strategies becomes smaller as the number of goals increases. This is because the difficulty of the game increases for all strategies, and thus whether a game will succeed depends more on the game's initial conditions.

TABLE IV: Success rate over 100 collaborative games with varying number of goals m (number of players n=4).

Model                    m=2    m=3    m=4    m=5    m=6
LFG closest pursuer      0.59   0.38   0.29   0.27   0.22
LFG influential pursuer  0.57   0.36   0.32   0.24   0.19
random                   0.55   0.35   0.24   0.21   0.20
to target goal           0.60   0.42   0.28   0.24   0.21
to goal farthest player  0.47   0.29   0.17   0.19   0.21

We show snapshots of one such game in Fig. 10: the robot approaches the other agents and the desired goal in a collaborative pattern, trying to help capture the goal $g_1$.

Fig. 10: Collaborative task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

VI. DISCUSSION

Summary. We propose an approach for modeling leading and following behavior in multi-agent human teams. We use a combination of data-driven and graph-theoretic techniques to learn a leader-follower graph. This graph representation, encoding the hidden structure of a human team, is scalable with the team size (number of agents), since we first learn local pairwise relationships and then combine them to create a global model. We demonstrate the effectiveness of the leader-follower graph by experimenting with optimization-based robot policies that leverage the graph to influence human teams in different scenarios. Our policies are general and perform well across all tasks compared to other high-performing, task-specific policies.

Limitations and Future Work. We view our work as a first step toward modeling latent, dynamic human team structures. Perhaps our greatest limitation is the reliance on simulated human behavior to validate our framework. Further experiments with real human data are needed to support our framework's effectiveness for noisier human behavior understanding. Moreover, the robot policies that use the leader-follower graphs are fairly simple. Although this may be a limitation, it is also promising that simple policies were able to perform well using the leader-follower graph.

For future work, we plan to evaluate our model in large-scale human-robot experiments in both simulation and navigation settings, such as robot swarms, to further improve our model's generalization capacity. We also plan on experimenting with combining the leader-follower graph with more advanced policy learning methods, such as deep reinforcement learning. We think the leader-follower graph could contribute to multi-agent reinforcement learning in various ways, such as reward design and state representation.

Acknowledgements. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would also like to acknowledge General Electric (GE) and the Stanford School of Engineering Fellowship.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[2] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements. The International Journal of Robotics Research, 33(2):268-304, 2014.

[3] Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea L. Thomaz. Keyframe-based learning from demonstration. International Journal of Social Robotics, 4(4):343-355, 2012.

[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[5] Kim Baraka, Ana Paiva, and Manuela Veloso. Expressive lights for revealing mobile service robot state. In Robot 2015: Second Iberian Robotics Conference, pages 107-119. Springer, 2016.

[6] Jerome Barraquand, Bruno Langlois, and J.-C. Latombe. Numerical potential field techniques for robot path planning. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):224-241, 1992.

[7] Aaron Bestick, Ruzena Bajcsy, and Anca D. Dragan. Implicitly assisting humans to choose good grasps in robot to human handovers. In International Symposium on Experimental Robotics, pages 341-354. Springer, 2016.

[8] Erdem Biyik and Dorsa Sadigh. Batch active preference-based learning of reward functions. In Conference on Robot Learning (CoRL), October 2018.

[9] Frank Broz, Illah Nourbakhsh, and Reid Simmons. Designing POMDP models of socially situated tasks. In RO-MAN 2011, pages 39-46. IEEE, 2011.

[10] Sonia Chernova and Andrea L. Thomaz. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(3):1-121, 2014.

[11] Yoeng-Jin Chu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396-1400, 1965.

[12] D. Scott DeRue. Adaptive leadership theory: Leading and following as a complex adaptive process. Research in Organizational Behavior, 31:125-150, 2011.

[13] Anca D. Dragan, Dorsa Sadigh, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.

[14] Finale Doshi and Nicholas Roy. Efficient model learning for dialog management. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 65-72. ACM, 2007.

[15] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part 1(335-345):25, 1968.

[16] Katie Genter, Noa Agmon, and Peter Stone. Ad hoc teamwork for leading a flock. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 531-538. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[17] Matthew C. Gombolay, Cindy Huang, and Julie A. Shah. Coordination of human-robot teaming with human task preferences. In AAAI Fall Symposium Series on AI-HRI, volume 11, page 2015, 2015.

[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909-3917, 2016.

[19] Shervin Javdani, Henny Admoni, Stefania Pellegrinelli, Siddhartha S. Srinivasa, and J. Andrew Bagnell. Shared autonomy via hindsight optimization for teleoperation and teaming. The International Journal of Robotics Research, page 0278364918776060, 2018.

[20] Takayuki Kanda, Takayuki Hirano, Daniel Eaton, and Hiroshi Ishiguro. Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19(1):61-84, 2004.

[21] Richard M. Karp. A simple derivation of Edmonds' algorithm for optimum branchings. Networks, 1(3):265-272, 1971.

[22] Piyush Khandelwal, Samuel Barrett, and Peter Stone. Leading the way: An efficient multi-robot guidance system. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1625-1633. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[23] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.

[24] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008, Zurich, Switzerland, 2008.

[25] Oliver Lemon. Conversational interfaces. In Data-Driven Methods for Adaptive Spoken Dialogue Systems, pages 1-4. Springer, 2012.

[26] Michael L. Littman and Peter Stone. Leading best-response strategies in repeated games. In Seventeenth Annual International Joint Conference on Artificial Intelligence Workshop on Economic Agents, Models, and Mechanisms. Citeseer, 2001.

[27] Owen Macindoe, Leslie Pack Kaelbling, and Tomas Lozano-Perez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.

[28] Stefanos Nikolaidis and Julie Shah. Human-robot cross-training: Computational formulation, modeling and evaluation of a human team training strategy. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, pages 33-40. IEEE Press, 2013.

[29] Stefanos Nikolaidis, Anton Kuznetsov, David Hsu, and Siddharta Srinivasa. Formalizing human-robot mutual adaptation: A bounded memory model. In The Eleventh ACM/IEEE International Conference on Human-Robot Interaction, pages 75-82. IEEE Press, 2016.

[30] Stefanos Nikolaidis, Swaprava Nath, Ariel D. Procaccia, and Siddhartha Srinivasa. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 323-331. ACM, 2017.

[31] Shayegan Omidshafiei, Ali-Akbar Agha-Mohammadi, Christopher Amato, and Jonathan P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5962-5969. IEEE, 2015.

[32] Robert Platt Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.

[33] Ben Robins, Kerstin Dautenhahn, Rene Te Boekhorst, and Aude Billard. Effects of repeated exposure to a humanoid robot on children with autism. Designing a More Inclusive World, pages 225-236, 2004.

[34] Michael Rubenstein, Adrian Cabrera, Justin Werfel, Golnaz Habibi, James McLurkin, and Radhika Nagpal. Collective transport of complex objects by simple robots: Theory and experiments. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 47-54. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[35] Dorsa Sadigh. Safe and Interactive Autonomy: Control, Learning, and Verification. PhD thesis, University of California, Berkeley, 2017.

[36] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca Dragan. Information gathering actions over human internal state. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 66-73. IEEE, October 2016. doi: 10.1109/IROS.2016.7759036.

[37] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Proceedings of Robotics: Science and Systems (RSS), June 2016. doi: 10.15607/RSS.2016.XII.029.

[38] Dorsa Sadigh, Nick Landolfi, Shankar S. Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for cars that coordinate with people: Leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405-1426, 2018.

[39] Michaela C. Schippers, Deanne N. Den Hartog, Paul L. Koopman, and Daan van Knippenberg. The role of transformational leadership in enhancing team reflexivity. Human Relations, 61(11):1593-1616, 2008.

[40] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, and Noa Agmon. Teaching and leading an ad hoc teammate: Collaboration without pre-coordination. Artificial Intelligence, 203:35-65, 2013.

[41] Fay Sudweeks and Simeon J. Simoff. Leading conversations: Communication behaviours of emergent leaders in virtual teams. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pages 108a-108a. IEEE, 2005.

[42] Daniel Szafir, Bilge Mutlu, and Terrence Fong. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 19-26. IEEE, 2015.

[43] Zijian Wang and Mac Schwager. Force-amplifying n-robot transport system (force-ants) for cooperative planar manipulation without communication. The International Journal of Robotics Research, 35(13):1564-1586, 2016.

[44] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438. Chicago, IL, USA, 2008.

VII. APPENDIX

A. Potential Field for Simulated Human Planning

In our implementation, the attractive potential field of attraction $i$, denoted as $U^i_{att}(q)$, is constructed as the square of the Euclidean distance $\rho_i(q)$ between the agent at location $q$ and attraction $i$ at location $q_i$. In this way, the attraction increases as the distance to the goal becomes larger. $\epsilon$ is the hyper-parameter controlling how strong the attraction is, and it has a consistent value for all attractions:

$$\rho_i(q) = \|q - q_i\|, \qquad U^i_{att}(q) = \frac{1}{2}\epsilon\rho_i(q)^2, \qquad -\nabla U^i_{att}(q) = -\epsilon\rho_i(q)\nabla\rho_i(q)$$

The repulsive potential field $U^j_{rep}(q)$ is used for obstacle avoidance. It has a limited effective radius, since we do not want an obstacle to affect an agent's planning when they are far away from each other. Our choice of $U^j_{rep}(q)$ has a limited range $\gamma_0$, and its value is zero outside that range. Within distance $\gamma_0$, the repulsive potential field increases as the agent approaches the obstacle. Here we denote the minimum distance from the agent to obstacle $j$ as $\gamma_j(q)$. The coefficient $\eta$ and range $\gamma_0$ are the hyper-parameters controlling how conservative we want our collision avoidance to be, and they are consistent for all obstacles. Larger values of $\eta$ and $\gamma_0$ mean that we are more conservative with collision avoidance and want the agent to keep a larger distance to obstacles:

$$\gamma_j(q) = \min_{q' \in \text{obs}_j} \|q - q'\|$$

$$U^j_{rep}(q) = \begin{cases} \frac{1}{2}\eta\left(\frac{1}{\gamma_j(q)} - \frac{1}{\gamma_0}\right)^2 & \gamma_j(q) < \gamma_0 \\ 0 & \gamma_j(q) \geq \gamma_0 \end{cases}$$

$$-\nabla U^j_{rep}(q) = \begin{cases} \eta\left(\frac{1}{\gamma_j(q)} - \frac{1}{\gamma_0}\right)\frac{1}{\gamma_j(q)^2}\nabla\gamma_j(q) & \gamma_j(q) < \gamma_0 \\ 0 & \gamma_j(q) \geq \gamma_0 \end{cases}$$
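The following is a minimal sketch of this potential-field planner for the simulated humans, assuming 2D numpy arrays for positions; the default hyper-parameter values are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def attractive_force(q, q_i, eps=1.0):
    # -grad U_att(q) = -eps * rho_i(q) * grad rho_i(q) = -eps * (q - q_i): pulls toward attraction i.
    return -eps * (q - q_i)

def repulsive_force(q, obstacle_pts, eta=1.0, gamma0=50.0):
    # -grad U_rep(q): zero outside the effective radius gamma0, pushes the agent away
    # from the closest point of obstacle j inside that radius.
    diffs = q - np.asarray(obstacle_pts, dtype=float)
    dists = np.linalg.norm(diffs, axis=1)
    j = int(np.argmin(dists))
    gamma = dists[j]
    if gamma >= gamma0 or gamma < 1e-9:
        return np.zeros(2)
    grad_gamma = diffs[j] / gamma   # grad gamma_j(q), unit vector pointing away from the obstacle
    return eta * (1.0 / gamma - 1.0 / gamma0) * (1.0 / gamma ** 2) * grad_gamma

def simulated_human_step(q, leader_pos, obstacles, theta_att=1.0, theta_rep=1.0, step=1.0):
    """Take one unit step along the weighted negative potential-field gradient."""
    force = theta_att * attractive_force(q, leader_pos)
    for obs in obstacles:               # each obstacle is an array of sample points, shape (N, 2)
        force += theta_rep * repulsive_force(q, obs)
    norm = np.linalg.norm(force)
    return q if norm < 1e-9 else q + step * force / norm
```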

Page 3: Influencing Leading and Following in Human-Robot Teamsiliad.stanford.edu/pdfs/publications/kwon2019influencing.pdf · Running Example: Pursuit-Evasion Game. We define a multi-player

(c) Chain behavior in the leader-follower graph

(b) Cyclic leader-follower graph We design policies that avoid such

cyclic behaviors

(d) Multiple teams (a) Leader-follower graph Green circles are the goals that need to be captured Orange triangles are the

pursuers

$amp(

amp(

$

amp

1 3

2

$amp

$

amp$

amp

1 3

2

$amp$

amp(

$

amp

1 3

2

$amp)

amp(

$

amp

1

2

3

4

Fig 3 Variations of the leader-follower graph

leverage this leader-follower graph to enable robot teammatesto produce helpful leading behaviors We will first focuson learning pairwise relationships between agents using asupervised learning approach We then generalize our ideas tomulti-player settings using algorithms from graph-theory Ourcombination of data-driven and graph-theoretic approachesallows the leader-follower graph to efficiently scale with thenumber of agents

A Pairwise Leadership Scores

We first will focus on learning the probability of any agenti following any goal or agent j isin G cup I The pairwiseprobabilities help us estimate the leadership score wij iethe weight of the edge (i j) in the leader-follower graph

We propose a general framework of estimating the leader-ship scores using a supervised learning approach Consider atwo-player setting where I = i j we collect labeled datawhere agent i is asked to follow j and agent j is asked tooptimize for the joint reward function assuming it is leadingi ie following a fixed goal g in the pursuit-evasion game(li = j and lj = g) We then train a Long Short-Term Memory(LSTM) network with a softmax layer to predict each agentrsquosmost likely leaderTraining with a Scalable Network Architecture Our net-work architecture consists of two LSTM submodules oneto predict player-player leader-follower relationships (P-PLSTM) and one to predict player-evader relationships (P-ELSTM) We use a softmax output layer with a cross-entropyloss function to get a probability distribution over j and allgoals g isin G of being irsquos leader We take the leader (an agentor a goal) with the highest probability and assign this as theleadership score The P-P and P-E submodules allow us toscale training to a game of any number of players and evadersas we can add or remove P-P and P-E submodules dependingon the number of players and evaders in a game An exampleof our scalable network architecture is illustrated in Fig 4Data Collection We collect labeled human data by askingparticipants to play a pursuit evasion game with pre-assignedleaders We recruited pairs of humans and randomly assignedleaders li to them (ie another agent or a goal) Participantsplayed the game in a web browser using their arrow keys andwere asked to move toward their assigned leader li In order tocreate a balanced dataset we collected data from all possible

1

2

1

2

2

2

2

LSTM pursuer to pursuer

LSTM pursuer to evader

LSTM pursuer to evader

2

LSTM pursuer to evader

1

Softmax Layer

$= amp(

$= amp(

$= amp

$)= amp

Training data

Fig 4 Scalable neural network architecture This example is pre-dicting the probability of another agent j being agent 2rsquos leaderw2j There are three LSTM submodules used because there are twopossible evaders and one possible agent that could be agent 2rsquos leader

configurations of leader and followers in a two-player settingWe collected a total of 186 games

In addition since human data is often difficult to collectin large amounts we augmented our dataset with syntheticdata The synthetic data is generated by simulating humansplaying the pursuit evasion game We simulated humans basedon the well-known potential field path planner model [6] Wefind that the simple nature of the task given to humans (iemove directly toward your assigned leader li) is qualitativelyeasily replicated using a potential field path planner Agentsat location q plan their path under the influence of an artificialpotential field U(q) which is constructed to reflect the en-vironment Agents move toward their leaders li by followingan attractive potential field Other agents and goals that arenot their leaders are treated as obstacles and emit a repulsivepotential field

We denote the set of attractions i isin A and the setof repulsive obstacles j isin R The overall potential fieldis weighted sum of potential field from all attractions andrepulsive obstacles θi is the weight for the attractive potentialfield from attraction i isin A and θj is the weight for repulsivepotential field from obstacle j isin R

U(q) =983131

iisinA

θiUiatt(q) +

983131

jisinR

θjUjrep(q) (1)

The optimal action a an agent would take lies in the direction

of the potential field gradient

a = minusnablaU(q) = minus983131

iisinA

θinablaU iatt(q)minus

983131

jisinR

θjnablaU jrep(q)

The attractive potential field increases as the distance togoal becomes larger The repulsive potential field has a fixedeffective range within which the potential field increases asthe distance to the obstacle decreases The attractive andrepulsive potential fields are constructed in the same way forall attractive and repulsive obstacles We elaborate on detailsof the potential field function in the Appendix Section VIIImplementation Details We simulated 15000 two-playergames equally representing the number of possible leaderand follower configurations Each game stored the position ofeach agent and goal at all timesteps Before training we pre-processed the data by shifting it to have a zero-centered meannormalizing it and down-sampling it Each game was then fedinto our network as a sequence Based on our experimentshyperparameters that worked well included a batch size of 250learning rate of 00001 and hidden dimension size of 64 Inaddition we used gradient clipping and layer normalization [4]to stabilize gradient updates Using these hyperparameters wetrained our network for 100000 stepsEvaluating Pairwise Scores Our network trained on two-player simulated data performed with a validation accuracyof 83 We also experimented with training with three-player simulated data two-player human data as well as acombination of two-player simulated and human data (two-player mixed data) Validation results for the first 60000 stepsare shown in Fig 5 For our mixed two-player model we firsttook a pre-trained model trained on two-player simulated dataand then trained on two-player human data The two-playermixed curve shows the modelrsquos accuracy as it is fined-tuned onhuman data Our three-player model had a validation accuracyof 74 while our two player human model and our mixedtwo-player data model had a validation accuracy of 65

0 20e3 40e3 60e3steps

02

04

06

08

accu

racy

2P simulated3P simulated2P human2P mixed

Fig 5 Validation accuracy when calculating pairwise leadershipscores described in Sec III-A

B Maximum Likelihood Leader-Follower Graph in Teams

With a model that can predict pairwise relationships wefocus on building a model that can generalize beyond twoplayers We compute pairwise weights or leadership scoreswij of leader-follower relationships between all possiblepairs of leaders i and followers j The pairwise weightscan be computed based on the supervised learning approachdescribed above indicating the probability of one agent orgoal being another agentrsquos leader After computing wij for all

combinations of leaders and followers we create a directedgraph G = (VE) where V = I cup G and E = (i j)|i isinI j isin I cup G i ∕= j and the weights on each edge (i j)correspond to wij To create a more concise representationwe extract the maximum likelihood leader-follower graph Glowast

by pruning the edges of our constructed graph G We prunethe graph by greedily selecting the outgoing edge with highestweight for each agent node In other words we select theedge associated with the agent or goal that has the highestprobability of being agent irsquos leader where the probabilitiescorrespond to edge weights When pruning we make surethat no cycles are formed If we find a cycle we will choosethe next probable edge Our pruning approach is inspired byEdmondsrsquo algorithm [15 11] which finds a maximum weightarborescence [21] in a graph An arborescence is an acyclicdirected tree structure where there is exactly one outgoingedge from a node to another We use a modified version ofEdmondsrsquo algorithm to find the maximum likelihood leader-follower graph Compared to our approach a maximum weightarborescence is more restrictive since it requires the resultinggraph to be a treeEvaluating the Leader-Follower Graph We evaluate theaccuracy of our leader-follower graph with three or moreagents when trained on simulated two-player and three-playerdata as well as a combination of simulated and human two-player data (shown in Table I) In each of these multi-playergames we extracted a leader-follower graph at each timestepand compared our leader-follower graphrsquos predictions againstthe ground-truth labels Our leader-follower graph performsbetter than a random policy which selects a leader li isin I cupGfor agent i at random where li ∕= i The chance of beingright is thus 1

|G|+|I|minus1 We take the average of all successprobabilities for all leader-follower graph configurations tocompute the overall accuracy In all experiments shown inTable I our trained model outperforms the random policyMost notably the models scale naturally to settings with largenumbers of players as well as human data We find that trainingwith larger numbers of players helps with generalization Thetradeoff between training with larger numbers of players andcollecting more data will depend on the problem domain Forour experiments (Section V) we use the model trained onthree-player simulated dataTABLE I Generalization accuracy of leader-follower graph (LFG)

Training Data         Testing Data          LFG Accuracy   Random Accuracy
2 players simulated   3 players simulated   0.67           0.29
2 players simulated   4 players simulated   0.45           0.23
2 players simulated   5 players simulated   0.41           0.19
2 players simulated   2 players human       0.68           0.44
2 players simulated   3 players human       0.47           0.29
2 players mixed       3 players simulated   0.44           0.29
2 players mixed       4 players simulated   0.38           0.23
2 players mixed       5 players simulated   0.28           0.19
2 players mixed       2 players human       0.69           0.44
2 players mixed       3 players human       0.44           0.29
3 players simulated   4 players simulated   0.53           0.23
3 players simulated   5 players simulated   0.50           0.19
3 players simulated   3 players human       0.63           0.29
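As a concrete illustration of the greedy, cycle-free pruning described in Sec. III-B, the sketch below extracts $G^*$ from the pairwise scores. It is a simplified stand-in for the modified Edmonds'-style procedure, not the exact implementation; the `networkx` representation and the (follower, leader) key ordering are assumptions.

```python
import networkx as nx

def prune_to_max_likelihood_graph(agents, candidates, w):
    """Greedily give each agent its most probable leader without creating cycles.

    agents:     list of agent ids (I)
    candidates: list of agent and goal ids (I union G)
    w:          dict mapping (follower, leader) -> leadership score w_ij
    """
    g_star = nx.DiGraph()
    g_star.add_nodes_from(candidates)
    for i in agents:
        # Candidate leaders of agent i, sorted by descending leadership score.
        ranked = sorted((c for c in candidates if c != i),
                        key=lambda c: w.get((i, c), 0.0), reverse=True)
        for leader in ranked:
            g_star.add_edge(i, leader, weight=w.get((i, leader), 0.0))
            if nx.is_directed_acyclic_graph(g_star):
                break                      # keep the highest-weight acyclic choice
            g_star.remove_edge(i, leader)  # cycle found: try the next most probable edge
    return g_star
```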

[Fig. 6 diagrams omitted: on the left, the full graph G and its pruned version G*; on the right, two examples of identifying the most influential leader.]

(a) Graph: the directed edges represent pairwise likelihoods that the tail node is the head node's leader. (b) Maximum-likelihood leader-follower graph, shown in bold. (c) The most influential leader is agent 2. (d) The most influential leader is agent 1, since the other human agents are targeting the preferred goal.

Fig. 6: On the left, creating a maximum-likelihood leader-follower graph G*. On the right, examples of influential leaders in G*. The green circles are goals, orange triangles are human agents, and black triangles are robot agents.

IV. PLANNING BASED ON INFERENCE ON LEADER-FOLLOWER GRAPHS

We have so far computed a graphical representation $G^*$ for latent leadership structures in human teams. In this section, we use $G^*$ to positively influence human teams, i.e., move the team towards a more desirable outcome. We first describe how a robot can use the leader-follower graph to infer useful team structures. We then describe how a robot can leverage these inferences to plan for a desired outcome.

A. Inference based on the leader-follower graph

Leader-follower graphs enable a robot to infer useful information about a team, such as agents' goals or who the most influential leader is. These pieces of information help the robot identify key goals or agents that are useful in achieving a desired outcome (e.g., identifying shared goals in a collaborative task). Especially in multi-agent settings, following key goals or influencing key agents is a strategic move that allows the robot to plan for desired outcomes. We begin by describing different inferences a robot can perform on the leader-follower graph.

Goal inference in multi-agent settings. One way a robot can use the structure in the leader-follower graph is to perform goal inference. An agent's goal can be inferred from the outgoing edges from agents to goals. In the case where there is an outgoing edge from an agent to another agent (i.e., agent $i$ follows agent $j$), we assume transitivity, where agent $i$ can be implicitly following agent $j$'s believed ultimate goal.

Influencing the most influential leader. In order to lead a team toward a desired goal, the robot can also leverage the leader-follower graph to predict who the most influential leader is. We define the most influential leader to be the agent $i^* \in I$ with the largest number of followers. Identifying the most influential leader $i^*$ allows the robot to strategically influence a single teammate, which in turn indirectly influences the other teammates that are following $i^*$. For example, in Fig. 6c and Fig. 6d, we show two examples of identifying the most influential leader from $G^*$. In the case where some of the agents are already going for the preferred goal, the agent with the most followers among the remaining players becomes the most influential leader, as shown in Fig. 6d.
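A minimal sketch of these two inferences on $G^*$, assuming the graph representation from the pruning sketch above (one outgoing edge per agent, pointing to its most likely leader); the helper names are illustrative.

```python
from collections import Counter

def most_influential_leader(g_star, agents):
    """Return the agent with the most followers in G* (ties broken arbitrarily)."""
    follower_counts = Counter(leader for follower, leader in g_star.edges()
                              if leader in agents)
    return max(agents, key=lambda a: follower_counts.get(a, 0))

def inferred_goal(g_star, agent, goals, max_hops=10):
    """Follow leader edges transitively until a goal node is reached, if any."""
    node = agent
    for _ in range(max_hops):              # bounded walk; G* is acyclic by construction
        successors = list(g_star.successors(node))
        if not successors:
            return None                    # no leader edge: goal unknown
        node = successors[0]               # each agent has one outgoing edge in G*
        if node in goals:
            return node
    return None
```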

B. Optimization based on the leader-follower graph

We leverage the leader-follower graph to design robot policies that optimize for an objective (e.g., leading the team towards a preferred goal). We introduce one such way to use the leader-follower graph: directly incorporating the inferences we make from it into a simple optimization.

At each timestep, we generate graphs $G^{a_t}_{t+k}$ that simulate what the leader-follower graph would look like at timestep $t+k$ if the robot takes an action $a_t$ at the current timestep $t$. Over the next $k$ steps, we assume human agents will continue along their current trajectory with constant velocity.

From each graph $G^{a_t}_{t+k}$, we can obtain the weights $w^{t+k}_{ij}$ corresponding to an objective $r$ that the robot is optimizing for (e.g., the robot becoming agent $i$'s leader). We then iterate over the robot's actions, assuming the action set is finite and discrete, and choose the action $a^*_t$ that maximizes the objective $r$. We note that $r$ must be expressible in terms of the $w^{t+k}_{ij}$'s and $w^{t+k}_{ig}$'s for $i, j \in I$ and $g \in \mathcal{G}$:

$$a^*_t = \operatorname*{argmax}_{a_t \in A} \; r\big(\{w^{t+k}_{ij}(a_t)\}_{i,j \in I},\ \{w^{t+k}_{ig}(a_t)\}_{i \in I,\, g \in \mathcal{G}}\big) \qquad (2)$$
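A minimal sketch of the one-step lookahead in Eqn. (2), assuming a discrete robot action set, a constant-velocity forward simulation of the humans, and a `pairwise_scores` predictor that returns the weights $w^{t+k}_{ij}$; all of these names and interfaces are assumptions for illustration.

```python
def choose_robot_action(state, robot_actions, objective, pairwise_scores,
                        simulate_forward, k=10):
    """Pick a_t* = argmax_{a_t} r({w_ij^{t+k}(a_t)}) as in Eqn. (2).

    objective:        maps a dict of weights {(follower, leader): w} to a scalar r
    pairwise_scores:  maps a predicted state to that dict of weights
    simulate_forward: rolls the world k steps ahead under action a_t, with humans
                      continuing along their current trajectories at constant velocity
    """
    best_action, best_value = None, float('-inf')
    for a_t in robot_actions:                       # finite, discrete action set
        predicted_state = simulate_forward(state, a_t, k)
        weights = pairwise_scores(predicted_state)  # w^{t+k}(a_t) for all pairs
        value = objective(weights)
        if value > best_value:
            best_action, best_value = a_t, value
    return best_action
```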

We describe three specific tasks that we plan for using the optimization described in Eqn. (2).

Redirecting a leader-follower relationship. A robot can directly influence team dynamics by changing leader-follower relationships. For a given directed edge between agents $i$ and $j$, the robot can use the optimization outlined in Eqn. (2) to find actions that reverse an edge or redirect the edge to a different agent. For instance, to reverse the direction of the edge from agent $i$ to agent $j$, the robot selects actions that maximize the probability of agent $j$ following agent $i$:

$$a^*_t = \operatorname*{argmax}_{a_t \in A} \; w^{t+k}_{ji}(a_t), \qquad i, j \in I \qquad (3)$$

The robot can also take actions to eliminate an edge between agents $i$ and $j$ by minimizing $w_{ij}$. One might want to modify edges in the leader-follower graph when trying to change the leadership structure in a team. For instance, in a setting where agents must collectively decide on a goal, a robot can help unify a team with sub-groups (an example is shown in Fig. 3d) by re-directing the edges of one sub-group to follow another.

Distracting a team. In adversarial settings, a robot might want to prevent a team of humans from reaching a collective goal $g$. In order to stall the team, a robot can use the leader-follower graph to identify who the current most influential leader $i^*$ is. The robot can then select actions that maximize the probability of the robot becoming the most influential leader's leader, and minimize the probability of the most influential leader following the collective goal $g$:

$$a^*_t = \operatorname*{argmax}_{a_t \in A} \; w^{t+k}_{i^*r}(a_t), \qquad i^* \in I \qquad (4)$$

Distracting a team from reaching a collective goal can be useful in cases where the team is an adversary. For instance, a team of adversarial robots may want to prevent their opponents from collaboratively reaching a joint goal.

Leading a team towards the optimal goal. In collaborative settings where the team needs to agree on a goal $g \in \mathcal{G}$, a robot that knows where the optimal goal $g^* \in \mathcal{G}$ is should maximize joint utility by leading all of its teammates to reach $g^*$. To influence the team, the robot can use the leader-follower graph to infer who the current most influential leader $i^*$ is. The robot can then select actions that maximize the probability of the most influential leader following the optimal goal $g^*$:

$$a^*_t = \operatorname*{argmax}_{a_t \in A} \; w^{t+k}_{i^*r}(a_t) + w^{t+k}_{r g^*}(a_t), \qquad i^* \in I \qquad (5)$$

Being able to lead a team of humans to a goal is useful in many real-life scenarios. For instance, in search-and-rescue missions, robots with more information about the location of survivors should be able to lead the team in the optimal direction.
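The three objectives in Eqns. (3)-(5) can be plugged into the generic lookahead sketch above as simple functions of the predicted weights. The dictionary interface and the (follower, leader) key ordering follow the earlier sketches and are assumptions, not the exact formulation.

```python
# Objectives r(.) for choose_robot_action; `w` maps (follower, leader-or-goal)
# pairs to the predicted weights w^{t+k}(a_t).

def redirect_edge_objective(i, j):
    # Eqn. (3): make agent j follow agent i.
    return lambda w: w[(j, i)]

def distract_objective(i_star, robot):
    # Eqn. (4): make the robot the most influential leader's leader.
    return lambda w: w[(i_star, robot)]

def lead_to_goal_objective(i_star, robot, g_star):
    # Eqn. (5): most influential leader follows the robot, robot heads to g*.
    return lambda w: w[(i_star, robot)] + w[(robot, g_star)]

# Example usage with the earlier sketch:
# a_t = choose_robot_action(state, robot_actions,
#                           lead_to_goal_objective(i_star, robot_id, goal_id),
#                           pairwise_scores, simulate_forward)
```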

V. EXPERIMENTS

With an optimization framework that plans robot actions using the leader-follower graph, we evaluate our approach on the three tasks described in Sec. IV. For each task, we compare task performance with robot policies that use the leader-follower graph against robot policies that do not. Across all tasks, we find that robot policies using the leader-follower graph perform well compared to other policies, showing that our graph can be easily generalized to different settings.

Task setup. Our tasks take place in the pursuit-evasion domain. Within each task, we conduct experiments with simulated human behavior. Simulated humans move along a potential field as shown in Eqn. (1), where there are two sources of attraction, the agent's goal ($a_g$) and the crowd center ($a_c$), so simulated human agents trade off between following the crowd and moving toward a target goal. The weights $\theta_g$ and $\theta_c$ encode how determined human players are about reaching the goal or moving towards the crowd. Higher $\theta_c$ means that players care more about being close to the whole team, while higher $\theta_g$ indicates more determined players who care about reaching their goals; this can make it harder for a robot to influence the humans. For validating our framework, we use weights $\theta_g = 0.6$ and $\theta_c = 0.4$, where going toward the goal is slightly more advantageous than moving with the whole team, setting moderate challenges for the robot to complete the tasks. At the same time, we prevent agents from colliding with one another by making non-target agents and goals sources of repulsion with weight $\theta_j = 1$.

An important limitation is that our simulation does not capture the broad range of behaviors humans would show in response to a robot that is trying to influence them. We only offer one simple interpretation of how humans would react, based on the potential field.

In each iteration of the task, the initial positions of agents and goals are randomized. For all of our experiments, the size of the game canvas is 500 × 500 pixels. At every timestep, a simulated human can move 1 unit in one of the four directions (up, down, left, right) or stay at its position. The robot agent's action space is the same, but it can move 5 units at a time. The maximum game time limit for all our experiments is 1000 timesteps. For convenience, we will refer to simulated human agents as human agents for the rest of the section.
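A rough sketch of how such a potential-field-driven simulated human could pick its move on the discretized canvas, assuming the attractive and repulsive terms from the Appendix; the hyper-parameter values and the greedy one-step selection are illustrative assumptions.

```python
import numpy as np

def human_step(pos, goal, crowd_center, obstacles,
               theta_g=0.6, theta_c=0.4, theta_j=1.0,
               eps=1.0, eta=1.0, gamma_0=30.0):
    """Move one unit (or stay) in the direction that most decreases the potential."""
    def potential(q):
        u = 0.0
        for weight, attractor in ((theta_g, goal), (theta_c, crowd_center)):
            u += weight * 0.5 * eps * np.linalg.norm(q - attractor) ** 2   # attraction
        for obs in obstacles:                                              # repulsion
            gamma = np.linalg.norm(q - obs)
            if 1e-6 < gamma < gamma_0:
                u += theta_j * 0.5 * eta * (1.0 / gamma - 1.0 / gamma_0) ** 2
        return u

    moves = [np.array(d) for d in ((0, 0), (0, 1), (0, -1), (1, 0), (-1, 0))]
    return pos + min(moves, key=lambda d: potential(pos + d))
```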

A. Changing edges of the leader-follower graph

We evaluate a robot's ability to change an edge of the leader-follower graph. Unlike the other tasks we describe, the goal of this task is not to affect some larger outcome in the game (e.g., influence humans toward a particular goal). Instead, this task serves as a preliminary to the others, where we evaluate how well a robot is able to manipulate the leader-follower graph.

Methods. Given a human agent $i$ who is predisposed to following a goal with weights $\theta_g = 0.6$ and $\theta_c = 0.4$, we created a robot policy that encouraged the agent to follow the robot $r$ instead. The robot optimized for the probability $w_{ir}$ that it would become agent $i$'s leader, as outlined in Eqn. (3). We compared our approach against a random baseline, where the robot chooses its next action at random.

Metrics. For both our approach and our baseline, we evaluated the performance of the robot based on the leadership scores, i.e., the probabilities $w_{ir}$ given by the leader-follower graph.

Results. We show that the robot can influence a human agent to follow it. Fig. 7 contains probabilities averaged over ten tasks. The probability of the robot being the human agent's leader, $w_{ir}$, increases over time and averages to 73%, represented by the orange dashed line. Our approach performs well compared to the random baseline, which has an average performance of 26%, represented by the grey dashed line.

[Fig. 7 plot omitted: probability (y-axis, roughly 0.2-0.8) versus task steps (x-axis, 0 to 600) for our approach and the random baseline.]

Fig. 7: Smoothed probabilities of a human agent following the robot over 10 tasks. The robot is quickly able to become the human agent's leader, with an average leadership score of 73%.

B. Adversarial task

We consider a different task where the robot is an adversary trying to distract a team of humans from reaching a goal. There are $m$ goals and $n$ agents in the pursuit-evasion game. Among the $n$ agents, we have 1 robot and $n-1$ humans. The robot's goal is to distract the team of humans so that they cannot converge on the same goal. In order to capture a goal, we enforce the constraint that $n-2$ agents must collide with it at the same time, allowing 1 agent to be absent. Since we allow one agent to be absent, this prevents the robot from adopting the simple strategy of blocking an agent throughout the game. The game ends when all goals are captured or when the game time exceeds the limit. Thus, another way to interpret the robot's objective is to extend the game time as much as possible.

Methods. We compared our approach against 3 baseline heuristic strategies (random, to one player, to farthest goal) that do not use the leader-follower graph.

In the Random strategy, the robot picks an action at each timestep with uniform probability. In the To one player strategy, the robot randomly selects a human agent and then moves towards it to try to block its way. Finally, a robot enacting the To farthest goal strategy selects the goal whose average distance to the human players is largest and then goes to that goal, in the hope that human agents will be influenced, or may further change goals after observing that some players are heading for another goal.

We also experimented with two variations of our approach. LFG closest pursuer involves the robot agent selecting the closest pursuer and choosing an action to maximize the probability of that pursuer following it (as predicted by the LFG). Similarly, the LFG influential pursuer strategy involves the robot targeting the most influential human agent predicted by the LFG and then conducting the same optimization of maximizing the following probability, as shown in Eqn. (4).

Finally, we explored different game settings by varying the number of players $n = 3, \ldots, 6$ and the number of goals $m = 1, \ldots, 4$.

Metrics. We evaluated the performance of the robot with game time as the metric. Longer game time indicates that the robot does well in distracting the human players.

Results. We sampled 50 outcomes for every game setting and robot policy. Tables II and III show the mean and standard deviation of game time for every game setting and robot policy across the 50 samples. Across game settings, our models based on the leader-follower graph consistently outperformed methods without knowledge of the leader-follower graph. These experimental results are also visualized in Fig. 8.

As can be seen from Fig. 8, the average game time goes up as the number of players increases. This is because it is more challenging for more players to agree on which goal to capture. The consistently advantageous performance across all game settings suggests the effectiveness of the leader-follower graph for inference and optimization in this scenario.

Fig. 8: Average game time over 50 games with a different number of players, across all baseline methods and our model.

Fig. 9: Adversarial task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

We demonstrate an example of robot behavior in the adversarial game in Fig. 9 (dashed lines represent trajectories). Player 1 and player 2 start very close to goal $g_1$, making $g_1$ the probable target. The robot then approaches agent 2, blocks its way, and leads it to another goal $g_2$. In this way, the robot successfully extends the game time.

C. Cooperative task

Finally, we evaluate the robot in a cooperative setting, where the robot tries to be helpful to the human team. The robot aims to lead its teammates so that everyone can reach the target goal that gives the team the greatest joint utility, $g^* \in \mathcal{G}$. However, $g^*$ is not immediately observable to all teammates; we assume a setting where only the robot knows where $g^*$ is.

For consistency, the experiment setting is the same as in the Adversarial task, where $n-2$ human agents need to collide with a goal to capture it. In this scenario, the task is considered successful if the goal with the greatest joint utility $g^*$ is captured. The task is unsuccessful if any other suboptimal goal is captured or if the game time exceeds the limit.

Methods. Similar to the Adversarial task, we explore two variations of our approach, where the robot chooses to influence either its closest human agent or the most influential agent predicted by the leader-follower graph. The variation where the robot targets the most influential leader is outlined in Eqn. (5). Different from the Adversarial task, here the robot optimizes the probability of both becoming the target agent's leader and going toward the target goal. We also experimented with three baseline strategies. The Random strategy involves the robot taking actions at random. The To target goal strategy has the robot move directly to the optimal goal $g^*$ and stay there, trying to attract the other human agents. The To goal farthest player strategy involves the robot going to the player that is farthest away from $g^*$, in an attempt to indirectly influence that player to move back toward $g^*$. Finally, we experimented with game settings by varying the number of goals $m = 2, \ldots, 6$.

Metrics. We evaluated the performance of the robot strategy using the game success rate over 100 games. We calculate the success rate by dividing the number of successful games by the total number of games.

TABLE II: Average game time over 50 adversarial games with varying number of players (number of goals m=2)

Model                            n=3             n=4             n=5             n=6
LFG closest pursuer (ours)       233.04±51.82    305.08±49.48    461.18±55.73    550.88±51.67
LFG influential pursuer (ours)   201.94±45.15    286.44±48.54    414.78±50.98    515.92±48.80
random                           129.20±32.66    209.40±39.86    388.92±53.24    437.16±43.17
to one pursuer                   215.04±50.00    231.42±44.69    455.16±58.35    472.36±49.75
to farthest goal                 132.84±34.22    198.50±36.14    382.08±52.59    445.64±46.77

TABLE III: Average game time over 50 adversarial games with varying number of goals (number of players n=4)

Model                            m=1             m=2             m=3             m=4
LFG closest pursuer (ours)       210.94±33.23    305.08±49.48    289.22±52.99    343.00±55.90
LFG influential pursuer (ours)   239.04±39.73    286.44±48.54    219.56±41.00    301.80±52.00
random                           155.94±21.42    209.40±39.86    205.74±43.05    294.62±54.01
to one pursuer                   123.58±9.56     231.42±44.69    225.52±41.47    317.92±54.75
to farthest goal                 213.36±34.83    198.50±36.14    218.68±43.67    258.30±50.64

Results. Our results are summarized in Table IV. We expected the To target goal baseline to be very strong, since moving towards the target goal directly influences other simulated agents to also move toward the target goal. We find that this baseline strategy is especially effective when the game is not complex, i.e., when the number of goals is small. However, our model based on the leader-follower graph still demonstrates competitive performance compared to it. Especially as the number of goals increases, the advantage of the leader-follower graph becomes clearer. This indicates that in complex scenarios, brute-force methods that have no knowledge of the human team's hidden structure will not suffice. Another thing to note is that the difference between all of the strategies becomes smaller as the number of goals increases. This is because the difficulty of the game increases for all strategies, and thus whether a game succeeds depends more on its initial conditions. We show snapshots of one game in Fig. 10: the robot approaches the other agents and the desired goal in a collaborative pattern, trying to help capture the goal $g_1$.

TABLE IV: Success rate over 100 collaborative games with varying number of goals m (number of players n=4)

Model                      m=2     m=3     m=4     m=5     m=6
LFG closest pursuer        0.59    0.38    0.29    0.27    0.22
LFG influential pursuer    0.57    0.36    0.32    0.24    0.19
random                     0.55    0.35    0.24    0.21    0.20
to target goal             0.60    0.42    0.28    0.24    0.21
to goal farthest player    0.47    0.29    0.17    0.19    0.21

Fig. 10: Collaborative task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

VI. DISCUSSION

Summary. We propose an approach for modeling leading and following behavior in multi-agent human teams. We use a combination of data-driven and graph-theoretic techniques to learn a leader-follower graph. This graph representation, which encodes the human team's hidden structure, is scalable with team size (number of agents), since we first learn local pairwise relationships and then combine them to create a global model. We demonstrate the effectiveness of the leader-follower graph by experimenting with optimization-based robot policies that leverage the graph to influence human teams in different scenarios. Our policies are general and perform well across all tasks compared to other high-performing, task-specific policies.

Limitations and Future Work. We view our work as a first step toward modeling latent, dynamic human team structures. Perhaps our greatest limitation is the reliance on simulated human behavior to validate our framework. Further experiments with real human data are needed to support our framework's effectiveness for noisier human behavior. Moreover, the robot policies that use the leader-follower graph are fairly simple. Although this may be a limitation, it is also promising that simple policies were able to perform well using the leader-follower graph.

For future work, we plan to evaluate our model in large-scale human-robot experiments, in both simulation and navigation settings such as robot swarms, to further improve our model's generalization capacity. We also plan on experimenting with combining the leader-follower graph with more advanced policy learning methods such as deep reinforcement learning. We think the leader-follower graph could contribute to multi-agent reinforcement learning in various ways, such as reward design and state representation.

Acknowledgements. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would also like to acknowledge General Electric (GE) and the Stanford School of Engineering Fellowship.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
[2] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements. The International Journal of Robotics Research, 33(2):268–304, 2014.
[3] Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea L. Thomaz. Keyframe-based learning from demonstration. International Journal of Social Robotics, 4(4):343–355, 2012.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[5] Kim Baraka, Ana Paiva, and Manuela Veloso. Expressive lights for revealing mobile service robot state. In Robot 2015: Second Iberian Robotics Conference, pages 107–119. Springer, 2016.
[6] Jerome Barraquand, Bruno Langlois, and J.-C. Latombe. Numerical potential field techniques for robot path planning. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):224–241, 1992.
[7] Aaron Bestick, Ruzena Bajcsy, and Anca D. Dragan. Implicitly assisting humans to choose good grasps in robot to human handovers. In International Symposium on Experimental Robotics, pages 341–354. Springer, 2016.
[8] Erdem Biyik and Dorsa Sadigh. Batch active preference-based learning of reward functions. In Conference on Robot Learning (CoRL), October 2018.
[9] Frank Broz, Illah Nourbakhsh, and Reid Simmons. Designing POMDP models of socially situated tasks. In RO-MAN, 2011 IEEE, pages 39–46. IEEE, 2011.
[10] Sonia Chernova and Andrea L. Thomaz. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(3):1–121, 2014.
[11] Yoeng-Jin Chu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
[12] D. Scott DeRue. Adaptive leadership theory: Leading and following as a complex adaptive process. Research in Organizational Behavior, 31:125–150, 2011.
[13] Anca D. Dragan, Dorsa Sadigh, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
[14] Finale Doshi and Nicholas Roy. Efficient model learning for dialog management. In Proceedings of the ACM/IEEE international conference on Human-robot interaction, pages 65–72. ACM, 2007.
[15] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part 1(335-345):25, 1968.
[16] Katie Genter, Noa Agmon, and Peter Stone. Ad hoc teamwork for leading a flock. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 531–538. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[17] Matthew C. Gombolay, Cindy Huang, and Julie A. Shah. Coordination of human-robot teaming with human task preferences. In AAAI Fall Symposium Series on AI-HRI, volume 11, page 2015, 2015.
[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909–3917, 2016.
[19] Shervin Javdani, Henny Admoni, Stefania Pellegrinelli, Siddhartha S. Srinivasa, and J. Andrew Bagnell. Shared autonomy via hindsight optimization for teleoperation and teaming. The International Journal of Robotics Research, page 0278364918776060, 2018.
[20] Takayuki Kanda, Takayuki Hirano, Daniel Eaton, and Hiroshi Ishiguro. Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19(1):61–84, 2004.
[21] Richard M. Karp. A simple derivation of Edmonds' algorithm for optimum branchings. Networks, 1(3):265–272, 1971.
[22] Piyush Khandelwal, Samuel Barrett, and Peter Stone. Leading the way: An efficient multi-robot guidance system. In Proceedings of the 2015 international conference on autonomous agents and multiagent systems, pages 1625–1633. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
[23] Mykel J. Kochenderfer. Decision making under uncertainty: theory and application. MIT Press, 2015.
[24] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008. Zurich, Switzerland, 2008.
[25] Oliver Lemon. Conversational interfaces. In Data-Driven Methods for Adaptive Spoken Dialogue Systems, pages 1–4. Springer, 2012.
[26] Michael L. Littman and Peter Stone. Leading best-response strategies in repeated games. In Seventeenth Annual International Joint Conference on Artificial Intelligence Workshop on Economic Agents, Models, and Mechanisms. Citeseer, 2001.
[27] Owen Macindoe, Leslie Pack Kaelbling, and Tomas Lozano-Perez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.
[28] Stefanos Nikolaidis and Julie Shah. Human-robot cross-training: computational formulation, modeling and evaluation of a human team training strategy. In Proceedings of the 8th ACM/IEEE international conference on Human-robot interaction, pages 33–40. IEEE Press, 2013.
[29] Stefanos Nikolaidis, Anton Kuznetsov, David Hsu, and Siddharta Srinivasa. Formalizing human-robot mutual adaptation: A bounded memory model. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction, pages 75–82. IEEE Press, 2016.
[30] Stefanos Nikolaidis, Swaprava Nath, Ariel D. Procaccia, and Siddhartha Srinivasa. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 323–331. ACM, 2017.
[31] Shayegan Omidshafiei, Ali-Akbar Agha-Mohammadi, Christopher Amato, and Jonathan P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5962–5969. IEEE, 2015.
[32] Robert Platt Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
[33] Ben Robins, Kerstin Dautenhahn, Rene Te Boekhorst, and Aude Billard. Effects of repeated exposure to a humanoid robot on children with autism. Designing a More Inclusive World, pages 225–236, 2004.
[34] Michael Rubenstein, Adrian Cabrera, Justin Werfel, Golnaz Habibi, James McLurkin, and Radhika Nagpal. Collective transport of complex objects by simple robots: theory and experiments. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 47–54. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[35] Dorsa Sadigh. Safe and Interactive Autonomy: Control, Learning, and Verification. PhD thesis, University of California, Berkeley, 2017.
[36] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca Dragan. Information gathering actions over human internal state. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 66–73. IEEE, October 2016. doi: 10.1109/IROS.2016.7759036.
[37] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Proceedings of Robotics: Science and Systems (RSS), June 2016. doi: 10.15607/RSS.2016.XII.029.
[38] Dorsa Sadigh, Nick Landolfi, Shankar S. Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405–1426, 2018.
[39] Michaela C. Schippers, Deanne N. Den Hartog, Paul L. Koopman, and Daan van Knippenberg. The role of transformational leadership in enhancing team reflexivity. Human Relations, 61(11):1593–1616, 2008.
[40] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, and Noa Agmon. Teaching and leading an ad hoc teammate: Collaboration without pre-coordination. Artificial Intelligence, 203:35–65, 2013.
[41] Fay Sudweeks and Simeon J. Simoff. Leading conversations: Communication behaviours of emergent leaders in virtual teams. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pages 108a–108a. IEEE, 2005.
[42] Daniel Szafir, Bilge Mutlu, and Terrence Fong. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 19–26. IEEE, 2015.
[43] Zijian Wang and Mac Schwager. Force-amplifying n-robot transport system (force-ants) for cooperative planar manipulation without communication. The International Journal of Robotics Research, 35(13):1564–1586, 2016.
[44] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

VII. APPENDIX

A. Potential Field for Simulated Human Planning

In our implementation, the attractive potential field of attraction $i$, denoted as $U^i_{att}(q)$, is constructed as the square of the Euclidean distance $\rho_i(q)$ between the agent at location $q$ and attraction $i$ at location $q_i$. In this way, the attraction increases as the distance to the goal becomes larger. $\epsilon$ is the hyper-parameter controlling how strong the attraction is, and it has a consistent value for all attractions:

$$\rho_i(q) = \|q - q_i\|$$
$$U^i_{att}(q) = \tfrac{1}{2}\epsilon\,\rho_i(q)^2$$
$$-\nabla U^i_{att}(q) = -\epsilon\,\rho_i(q)\,\nabla\rho_i(q)$$

The repulsive potential field $U^j_{rep}(q)$ is used for obstacle avoidance. It usually has a limited effective radius, since we don't want an obstacle to affect an agent's planning if they are far away from each other. Our choice of $U^j_{rep}(q)$ has a limited range $\gamma_0$, outside of which its value is zero. Within distance $\gamma_0$, the repulsive potential field increases as the agent approaches the obstacle. Here we denote the minimum distance from the agent to obstacle $j$ as $\gamma_j(q)$. The coefficient $\eta$ and range $\gamma_0$ are the hyper-parameters controlling how conservative we want our collision avoidance to be, and they are consistent for all obstacles. Larger values of $\eta$ and $\gamma_0$ mean that we are more conservative with collision avoidance and want the agent to keep a larger distance from obstacles:

$$\gamma_j(q) = \min_{q' \in \mathrm{obs}_j} \|q - q'\|$$
$$U^j_{rep}(q) = \begin{cases} \tfrac{1}{2}\eta\left(\tfrac{1}{\gamma_j(q)} - \tfrac{1}{\gamma_0}\right)^2 & \gamma_j(q) < \gamma_0 \\ 0 & \gamma_j(q) \geq \gamma_0 \end{cases}$$
$$-\nabla U^j_{rep}(q) = \begin{cases} \eta\left(\tfrac{1}{\gamma_j(q)} - \tfrac{1}{\gamma_0}\right)\tfrac{1}{\gamma_j(q)^2}\,\nabla\gamma_j(q) & \gamma_j(q) < \gamma_0 \\ 0 & \gamma_j(q) \geq \gamma_0 \end{cases}$$
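A direct transcription of these potential terms and their negative gradients, assuming NumPy arrays for positions; the hyper-parameter values are placeholders rather than the ones used in the experiments.

```python
import numpy as np

def attractive_gradient(q, q_i, eps=1.0):
    """-grad U_att for attraction i: pulls the agent toward q_i."""
    return -eps * (q - q_i)             # equals -eps * rho_i(q) * grad(rho_i)(q)

def repulsive_gradient(q, obstacle_points, eta=1.0, gamma_0=30.0):
    """-grad U_rep for obstacle j: pushes the agent away inside the range gamma_0."""
    diffs = q - np.asarray(obstacle_points)
    dists = np.linalg.norm(diffs, axis=1)
    k = np.argmin(dists)                 # closest point on the obstacle defines gamma_j(q)
    gamma = dists[k]
    if gamma >= gamma_0 or gamma < 1e-9:
        return np.zeros_like(q)
    grad_gamma = diffs[k] / gamma        # grad(gamma_j)(q)
    return eta * (1.0 / gamma - 1.0 / gamma_0) * (1.0 / gamma ** 2) * grad_gamma
```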

Page 4: Influencing Leading and Following in Human-Robot Teamsiliad.stanford.edu/pdfs/publications/kwon2019influencing.pdf · Running Example: Pursuit-Evasion Game. We define a multi-player

of the potential field gradient

a = minusnablaU(q) = minus983131

iisinA

θinablaU iatt(q)minus

983131

jisinR

θjnablaU jrep(q)

The attractive potential field increases as the distance togoal becomes larger The repulsive potential field has a fixedeffective range within which the potential field increases asthe distance to the obstacle decreases The attractive andrepulsive potential fields are constructed in the same way forall attractive and repulsive obstacles We elaborate on detailsof the potential field function in the Appendix Section VIIImplementation Details We simulated 15000 two-playergames equally representing the number of possible leaderand follower configurations Each game stored the position ofeach agent and goal at all timesteps Before training we pre-processed the data by shifting it to have a zero-centered meannormalizing it and down-sampling it Each game was then fedinto our network as a sequence Based on our experimentshyperparameters that worked well included a batch size of 250learning rate of 00001 and hidden dimension size of 64 Inaddition we used gradient clipping and layer normalization [4]to stabilize gradient updates Using these hyperparameters wetrained our network for 100000 stepsEvaluating Pairwise Scores Our network trained on two-player simulated data performed with a validation accuracyof 83 We also experimented with training with three-player simulated data two-player human data as well as acombination of two-player simulated and human data (two-player mixed data) Validation results for the first 60000 stepsare shown in Fig 5 For our mixed two-player model we firsttook a pre-trained model trained on two-player simulated dataand then trained on two-player human data The two-playermixed curve shows the modelrsquos accuracy as it is fined-tuned onhuman data Our three-player model had a validation accuracyof 74 while our two player human model and our mixedtwo-player data model had a validation accuracy of 65

0 20e3 40e3 60e3steps

02

04

06

08

accu

racy

2P simulated3P simulated2P human2P mixed

Fig 5 Validation accuracy when calculating pairwise leadershipscores described in Sec III-A

B Maximum Likelihood Leader-Follower Graph in Teams

With a model that can predict pairwise relationships wefocus on building a model that can generalize beyond twoplayers We compute pairwise weights or leadership scoreswij of leader-follower relationships between all possiblepairs of leaders i and followers j The pairwise weightscan be computed based on the supervised learning approachdescribed above indicating the probability of one agent orgoal being another agentrsquos leader After computing wij for all

combinations of leaders and followers we create a directedgraph G = (VE) where V = I cup G and E = (i j)|i isinI j isin I cup G i ∕= j and the weights on each edge (i j)correspond to wij To create a more concise representationwe extract the maximum likelihood leader-follower graph Glowast

by pruning the edges of our constructed graph G We prunethe graph by greedily selecting the outgoing edge with highestweight for each agent node In other words we select theedge associated with the agent or goal that has the highestprobability of being agent irsquos leader where the probabilitiescorrespond to edge weights When pruning we make surethat no cycles are formed If we find a cycle we will choosethe next probable edge Our pruning approach is inspired byEdmondsrsquo algorithm [15 11] which finds a maximum weightarborescence [21] in a graph An arborescence is an acyclicdirected tree structure where there is exactly one outgoingedge from a node to another We use a modified version ofEdmondsrsquo algorithm to find the maximum likelihood leader-follower graph Compared to our approach a maximum weightarborescence is more restrictive since it requires the resultinggraph to be a treeEvaluating the Leader-Follower Graph We evaluate theaccuracy of our leader-follower graph with three or moreagents when trained on simulated two-player and three-playerdata as well as a combination of simulated and human two-player data (shown in Table I) In each of these multi-playergames we extracted a leader-follower graph at each timestepand compared our leader-follower graphrsquos predictions againstthe ground-truth labels Our leader-follower graph performsbetter than a random policy which selects a leader li isin I cupGfor agent i at random where li ∕= i The chance of beingright is thus 1

|G|+|I|minus1 We take the average of all successprobabilities for all leader-follower graph configurations tocompute the overall accuracy In all experiments shown inTable I our trained model outperforms the random policyMost notably the models scale naturally to settings with largenumbers of players as well as human data We find that trainingwith larger numbers of players helps with generalization Thetradeoff between training with larger numbers of players andcollecting more data will depend on the problem domain Forour experiments (Section V) we use the model trained onthree-player simulated dataTABLE I Generalization accuracy of leader-follower graph (LFG)

Training Data Testing Data LFGAccuracy

RandomAccuracy

3 players simulated 067 0294 players simulated 045 023

2 players simulated 5 players simulated 041 0192 players human 068 0443 players human 047 029

3 players simulated 044 0294 players simulated 038 023

2 players mixed 5 players simulated 028 0192 players human 069 0443 players human 044 029

4 players simulated 053 0233 players simulated 5 players simulated 050 019

3 players human 063 029

1 23

4

3

21

1 23

12

3

4

(a) Graph The directed edges represent pairwise likelihoods that the tail node

is the head nodes leader

(b) Maximum-likelihood leader-follower graph shown in bold

(c) Most influential leader is agent 2

(d) Most influential leader is agent 1 since other human agents are targeting the preferred goal

2

31

2

3

4

12

3

1preferred goal

1

2

34

robotrobot

preferred goal

Fig 6 On left creating a maximum-likelihood leader-follower graph Glowast On right examples of influential leaders in Glowast The green circlesare goals orange triangles are human agents and black triangles are robot agents

IV PLANNING BASED ON INFERENCE ONLEADER-FOLLOWER GRAPHS

We so far have computed a graphical representation forlatent leadership structures in human teams Glowast In this sectionwe will use Glowast to positively influence human teams ie movethe team towards a more desirable outcome We first describehow a robot can use the leader-follower graph to infer usefulteam structures We then describe how a robot can leveragethese inferences to plan for a desired outcome

A Inference based on leader-follower graph

Leader-follower graphs enable a robot to infer useful in-formation about a team such as agentsrsquo goals or who themost influential leader is These pieces of information helpthe robot to identify key goals or agents that are useful inachieving a desired outcome (eg identifying shared goalsin a collaborative task) Especially in multi-agent settingsfollowing key goals or influencing key agents is a strategicmove that allows the robot to plan for desired outcomes Webegin by describing different inferences a robot can performon the leader-follower graphGoal inference in multiagent settings One way a robot canuse structure in the leader-follower graph is to perform goalinference An agentrsquos goal can be inferred by the outgoingedges from agents to goals In the case where there is anoutgoing edge from an agent to another agent (ie agent ifollows agent j) we assume transitivity where agent i can beimplicitly following agent jrsquos believed ultimate goalInfluencing the most influential leader In order to leada team toward a desired goal the robot can also leveragethe leader-follower graph to predict who the most influentialleader is We define the most influential leader to be theagent ilowast isin I with the most number of followers Identifyingthe most influential leader ilowast allows the robot to strategicallyinfluence a single teammate that also indirectly influences theother teammates that are following ilowast For example in Fig6c and Fig 6d we show two examples of identifying themost influential leader from Glowast In the case where some ofagents are already going for the preferred goal the one thathas the most followers among the remaining players becomesthe most influential leader as shown in Fig 6d

B Optimization based on leader-follower graph

We leverage the leader-follower graph to design robotpolicies that optimize for an objective (eg leading the teamtowards a preferred goal) We introduce one such way wecan use the leader-follower graph by directly incorporatinginferences we make from it into a simple optimization

At each timestep we generate graphs Gat

t+k that simulatewhat the leader-follower graph would look like at timestept+k if the robot takes an action at at current timestep t Overthe next k steps we assume human agents will continue alongthe current trajectory with constant velocity

From each graph Gat

t+k we can obtain the weights wt+kij

corresponding to an objective r that the robot is optimizingfor (eg the robot becoming agent irsquos leader) We then iterateover the robotrsquos action assuming it is finite and discrete andchoose the action alowastt that maximizes the objective r We notethat r must be expressible in terms of wt+k

ij rsquos and wt+kig rsquos for

i j isin I and g isin G

alowastt = argmaxatisinA

r983043wt+k

ij (at)ijisinI wt+kig (at)iisinIgisinG

983044

(2)We describe three specific tasks that we plan for using the

optimization described in Eqn (2)Redirecting a leader-follower relationship A robot candirectly influence team dynamics by changing leader-followerrelationships For a given directed edge between agents i andj the robot can use the optimization outlined in Eqn (2) foractions that reverse an edge or direct the edge to a differentagent For instance to reverse the direction of the edge fromagent i to agent j the robot will select actions that maximizethe probability of agent j following agent i

alowastt = argmaxatisinA

wt+kji (at) i j isin I (3)

The robot can also take actions to eliminate an edge betweenagents i and j by minimizing wij One might want to modifyedges in the leader-follower graph when trying to change theleadership structure in a team For instance in a setting whereagents must collectively decide on a goal a robot can helpunify a team with sub-groups (an example is shown in Fig 3d)by re-directing the edges of one sub-group to follow another

Distracting a team In adversarial settings a robot might wantto prevent a team of humans from reaching a collective goal gIn order to stall the team a robot can use the leader-followergraph to identify who the current most influential leader ilowast isThe robot can then select actions that maximize the probabilityof the robot becoming the most influential leaderrsquos leaderand minimize the probability of the most influential leaderfollowing the collective goal g

alowastt = argmaxatisinA

wt+kilowastr (at) ilowast isin I (4)

Distracting a team from reaching a collective goal can beuseful in cases where the team is an adversary For instance ateam of adversarial robots may want to prevent their opponentsfrom collaboratively reaching a joint goalLeading a team towards the optimal goal In collaborativesettings where the team needs to agree on a goal g isin G arobot that knows where the optimal goal glowast isin G is shouldmaximize joint utility by leading all of its teammates to reachglowast To influence the team the robot can use the leader-followergraph to infer who the current most influential leader ilowast is Therobot can then select actions that maximize the probability ofthe most influential leader following the optimal goal glowast

alowastt = argmaxatisinA

wt+kilowastr (at) + wt+k

rlowastglowast(at) ilowast isin I (5)

Being able to lead a team of humans to a goal is useful in manyreal-life scenarios For instance in search-and-rescue missionsrobots with more information about the location of survivorsshould be able to lead the team in the optimal direction

V EXPERIMENTS

With an optimization framework that plans for robot actionsusing the leader-follower graph we evaluate our approachon the three tasks described in Sec IV For each task wecompare task performance with robot policies that use theleader-follower graph against robot policies that do not Acrossall tasks we find that robot policies that use the leader-followergraph perform well compared to other policies showing thatour graph can be easily generalized to different settingsTask setup Our tasks take place in the pursuit-evasiondomain Within each task we conduct experiments withsimulated human behavior Simulated humans move along apotential field as shown in Eqn 1 where there are two sourcesof attraction the agentrsquos goal (ag) and the crowd center (ac)and simulated human agents would make trade offs betweenfollowing the crowd and moving toward a target goal Theweights θg and θc encode how determined human players areabout reaching the goal or moving towards the crowd Higherθc means that players care more about being close to the wholeteam while higher θg indicates more determined players whocare about reaching their goals This can make it harder for arobot to influence the humans For validating our frameworkwe use weights θg = 06 and θc = 04 where going toward thegoal is slightly more advantageous than moving with the wholeteam setting moderate challenges for the robot to completetasks At the same time we prevent agents from colliding with

one another by making non-target agents and goals sources ofrepulsion with weight θj = 1

An important limitation is that our simulation does notcapture the broad range of behaviors humans would show inresponse to a robot that is trying to influence them We onlyoffer one simple interpretation of how humans would reactbased on the potential field

In each iteration of the task the initial position of agentsand goals are randomized For all of our experiments the sizeof the game canvas is 500times 500 pixels At every time step asimulated human can move 1 unit in one of the four directionsup down left right or stay at its position The robot agentrsquosaction space is the same but can move 5 units at a time Themaximum game time limit for all our experiments is 1000timesteps For convenience we will refer to simulated humanagents as human agents for the rest of the section

A Changing edges of the leader-follower graph

We evaluate a robotrsquos ability to change an edge of a leader-follower graph Unlike the other tasks we described the goalof this task is not to affect some larger outcome in the game(eg influence humans toward a particular goal) Instead thistask serves as a preliminary to others where we evaluate howwell a robot is able to manipulate the leader-follower graphMethods Given a human agent i who is predisposed tofollowing a goal with weights θg = 06 θa = 04 we createda robot policy that encouraged the agent to follow the robotr instead The robot optimized for the probability wir that itwould become agent irsquos leader as outlined in Eqn (3) Wecompared our approach against a random baseline where therobot chooses its next action at randomMetrics For both our approach and our baseline we evaluatedthe performance of the robot based on the leadership scoresie probabilities wir given by the leader-follower graphResults We show that the robot can influence a humanagent to follow it Fig 7 contains averaged probabilitiesover ten tasks The probability of the robot being the humanagentrsquos leader wir increases over time and averages to 73represented by the orange dashed line Our approach performswell compared to the random baseline which has an averageperformance of 26 represented by the grey dashed line

0 100 200 300 400 500 600task steps

02

04

06

08

pro

bab

ilit

y

oursrandom

Fig 7 Smoothed probabilities of a human agent following the robotover 10 tasks The robot is quickly able to become the human agentrsquosleader with an average of leadership score of 73

B Adversarial taskWe consider a different task where the robot is an adversary

that is trying to distract a team of humans from reaching a goalThere are m goals and n agents in the pursuit-evasion gameAmong the n agents we have 1 robot and n minus 1 humansThe robotrsquos goal is to distract a team of humans so that theycannot converge to the same goal In order to capture a goalwe enforce the constraint that nminus2 agents must collide with itat the same time allowing 1 agent to be absent Since we allowone agent to be absent this prevents the robot from adoptingthe simple strategy of blocking an agent throughout the gameThe game ends when all goals are captured or when the gametime exceeds the limit Thus another way to interpret therobotrsquos objective is to extend game time as much as possibleMethods We compared our approach against 3 baselineheuristic strategies (random to one player to farthest goal)that do not use the leader-follower graph

In the Random strategy the robot picks an action at eachtimestep with uniform probability In the To one player strat-egy the robot randomly selects a human agent and then movestowards it to try to block its way Finally a robot enactingthe To farthest goal strategy selects the goal that the averagedistance to human players are largest and then go to that goalin the hope that human agents would get influenced or mayfurther change goal by observing that some players are headingfor another goal

We also experimented with two variations of our approachLFG closest pursuer involves the robot agent selecting theclosest pursuer and choosing an action to maximize theprobability of the pursuer following it (as predicted by theLFG) Similarly LFG influential pursuer strategy involvesthe robot targeting the most influential human agent predictedby the LFG and then conducting the same optimization ofmaximizing the following probability as shown in Eqn (4)

Finally we explored different game settings by varying thenumber of players n = 3 6 and the number of goalsm = 1 4Metrics We evaluated the performance of the robot with gametime as the metric Longer game time indicates that the robotdoes well in distracting human playersResults We sampled 50 outcomes for every game setting androbot policy Tables II and III show the mean and standarddeviation of game time for every game setting and robotpolicy across 50 samples Across game settings our modelsbased on the leader-follower graph consistently outperformedmethods without knowledge of leader-follower graph Theseexperimental results are also visualized in Fig 8

As can be seen from Fig 8 the average game time goes upas the number of players increases This is because it is morechallenging for more players to agree upon on which goal tocapture The consistent advantageous performance across allgame settings suggests the effectiveness of the leader-followergraph for inference and optimization in this scenario

We demonstrate an example of robot behavior in theadversarial game shown in Fig 9 (dashed lines representtrajectories) Player 1 and player 2 start very close to goal

Fig 8 Average game time over 50 games with a different numberof players across all baseline methods and our model

Fig 9 Adversarial task snapshots The orange triangles are the humanagents and the black triangle is the robot agent

g1 making g1 the probable target The robot then approachesagent 2 blocks its way and leads it to another goal g2 In thisway the robot successfully extends the game time

C Cooperative task

Finally we evaluate the robot in a cooperative setting wherethe robot tries to be helpful for human teams The robot aimsto lead its teammates in a way that everyone can reach thetarget goal that gives the team the greatest joint utility glowast isin GHowever glowast is not immediately observable to all teammatesWe assume a setting where only the robot knows where glowast is

For consistency the experiment setting is the same as theAdversarial task where n minus 2 human agents need to collidewith a goal to capture it In this scenario the task is consideredsuccessful if the goal with greatest joint utility glowast is capturedThe task is unsuccessful if any other suboptimal goal iscaptured or if the game time exceeds the limitMethods Similar to the Adversarial task we explore twovariations of our approach where the robot chooses to in-fluence its closest human agent or the most influential agentpredicted by the leader-follower graph The variation wherethe robot targets the most influential leader is outlined in Eqn(5) Different from the Adversarial task here the robot isoptimizing the probability of both becoming the target agentrsquosleader and going toward the target goal We also experimentedwith three baseline strategies The Random strategy involvesthe robot taking actions at random The To target goal strat-egy has the robot move directly to the optimal goal glowastand stay there trying to attract other human agents TheTo goal farthest player strategy involves the robot going tothe player that is farthest away from glowast in an attempt toindirectly influence the player to move back to glowast Finallywe experimented with game settings by varying the numberof goals m = 2 6Metrics We evaluated the performance of the robot strategyusing the game success rate over 100 games We calculate thesuccess rate by dividing the number of successful games overnumber of total games

TABLE II Average Game time over 50 adversarial games with varying number of players

number of goals (m=2)

Model n=3 n=4 n=5 n=6

LFG closest pursuer (ours) 23304plusmn5182 30508plusmn4948 46118plusmn5573 55088plusmn5167LFG influential pursuer (ours) 20194plusmn4515 28644plusmn4854 41478plusmn5098 51592plusmn4880

random 1292plusmn3266 20940plusmn3986 38892plusmn5324 43716plusmn4317to one pursuer 21504plusmn5000 23142plusmn4469 45516plusmn5835 47236plusmn4975to farthest goal 13284plusmn3422 1985plusmn3614 38208plusmn5259 44564plusmn4677

TABLE III Average Game time over 50 adversarial games with varying number of goals

number of players (n=4)

Model m=1 m=2 m=3 m=4

LFG closest pursuer (ours) 21094plusmn3323 30508plusmn4948 28922plusmn5299 34300plusmn5590LFG influential pursuer (ours) 23904plusmn3973 28644plusmn4854 21956plusmn4100 30180plusmn5200

random 15594plusmn2142 20940plusmn3986 20574plusmn4305 29462plusmn5401to one pursuer 12358plusmn956 23142plusmn4469 22552plusmn4147 31792plusmn5475to farthest goal 21336plusmn3483 1985plusmn3614 21868plusmn4367 25830plusmn5064

Results Our results are summarized in Table IV We expectedthe To target goal baseline to be very strong since movingtowards the target goal directly influences other simulatedagents to also move toward the target goal We find that thisbaseline strategy is especially effective when the game is notcomplex ie the number of goals is small However ourmodel based on the leader-follower graph still demonstratescompetitive performance compared to it Especially when thenumber of goals increase the advantage of leader-followergraph becomes clearer This indicates that in complex scenar-ios brute force methods that do not have knowledge of humanteam hidden structure will not suffice Another thing to noteis that the difference between all of the strategies becomessmaller as the number of goals increases This is because thedifficulty of the game increases for all strategies and thuswhether a game would succeed depends more on the gameinitial conditions We have shown snapshots of one game

TABLE IV Success rate over 100 collaborative games with varyingnumber of goals m

number of players (n=4)

Model m=2 m=3 m=4 m=5 m=6

LFG closest pursuer 059 038 029 027 022LFG influential pursuer 057 036 032 024 019

random 055 035 024 021 020to target goal 060 042 028 024 021

to goal farthest player 047 029 017 019 021

as in Fig 10 In this game the robot approaches other agentsand the desired goal in the collaborative pattern trying to helpcatch the goal g1

Fig 10 Collaborative task snapshots The orange triangles are thehuman agents and the black triangle is the robot agent

VI DISCUSSION

Summary We propose an approach for modeling leadingand following behavior in multi-agent human teams We usea combination of data-driven and graph-theoretic techniquesto learn a leader-follower graph This graph representationencoding human team hidden structure is scalable with theteam size (number of agents) since we first learn localpairwise relationships and combine them to create a globalmodel We demonstrate the effectiveness of the leader-followergraph by experimenting with optimization based robot policiesthat leverage the graph to influence human teams in differentscenarios Our policies are general and perform well across alltasks compared to other high-performing task-specific policiesLimitations and Future Work We view our work as a firststep into modeling latent dynamic human team structures Per-haps our greatest limitation is the reliance on simulated humanbehavior to validate our framework Further experiments withreal human data are needed to support our frameworkrsquos effec-tiveness for noisier human behavior understanding Moreoverthe robot policies that use the leader-follower graphs are fairlysimple Although this may be a limitation it is also promisingthat simple policies were able to perform well using the leader-follower graph

For future work, we plan to evaluate our model in large-scale human-robot experiments, in both simulation and navigation settings such as robot swarms, to further examine our model's generalization capacity. We also plan on combining the leader-follower graph with more advanced policy learning methods such as deep reinforcement learning; the leader-follower graph could contribute to multi-agent reinforcement learning in various ways, such as reward design and state representation.

Acknowledgements: Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would also like to acknowledge General Electric (GE) and the Stanford School of Engineering Fellowship.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements. The International Journal of Robotics Research, 33(2):268–304, 2014.
[3] Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea L. Thomaz. Keyframe-based learning from demonstration. International Journal of Social Robotics, 4(4):343–355, 2012.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[5] Kim Baraka, Ana Paiva, and Manuela Veloso. Expressive lights for revealing mobile service robot state. In Robot 2015: Second Iberian Robotics Conference, pages 107–119. Springer, 2016.
[6] Jerome Barraquand, Bruno Langlois, and J.-C. Latombe. Numerical potential field techniques for robot path planning. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):224–241, 1992.
[7] Aaron Bestick, Ruzena Bajcsy, and Anca D. Dragan. Implicitly assisting humans to choose good grasps in robot to human handovers. In International Symposium on Experimental Robotics, pages 341–354. Springer, 2016.
[8] Erdem Biyik and Dorsa Sadigh. Batch active preference-based learning of reward functions. In Conference on Robot Learning (CoRL), October 2018.
[9] Frank Broz, Illah Nourbakhsh, and Reid Simmons. Designing POMDP models of socially situated tasks. In RO-MAN, 2011 IEEE, pages 39–46. IEEE, 2011.
[10] Sonia Chernova and Andrea L. Thomaz. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(3):1–121, 2014.
[11] Yoeng-Jin Chu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
[12] D. Scott DeRue. Adaptive leadership theory: Leading and following as a complex adaptive process. Research in Organizational Behavior, 31:125–150, 2011.
[13] Anca D. Dragan, Dorsa Sadigh, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
[14] Finale Doshi and Nicholas Roy. Efficient model learning for dialog management. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 65–72. ACM, 2007.
[15] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part 1(335-345):25, 1968.
[16] Katie Genter, Noa Agmon, and Peter Stone. Ad hoc teamwork for leading a flock. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 531–538. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[17] Matthew C. Gombolay, Cindy Huang, and Julie A. Shah. Coordination of human-robot teaming with human task preferences. In AAAI Fall Symposium Series on AI-HRI, volume 11, page 2015, 2015.
[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[19] Shervin Javdani, Henny Admoni, Stefania Pellegrinelli, Siddhartha S. Srinivasa, and J. Andrew Bagnell. Shared autonomy via hindsight optimization for teleoperation and teaming. The International Journal of Robotics Research, page 0278364918776060, 2018.
[20] Takayuki Kanda, Takayuki Hirano, Daniel Eaton, and Hiroshi Ishiguro. Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19(1):61–84, 2004.
[21] Richard M. Karp. A simple derivation of Edmonds' algorithm for optimum branchings. Networks, 1(3):265–272, 1971.
[22] Piyush Khandelwal, Samuel Barrett, and Peter Stone. Leading the way: An efficient multi-robot guidance system. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1625–1633. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
[23] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[24] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008, Zurich, Switzerland, 2008.
[25] Oliver Lemon. Conversational interfaces. In Data-Driven Methods for Adaptive Spoken Dialogue Systems, pages 1–4. Springer, 2012.
[26] Michael L. Littman and Peter Stone. Leading best-response strategies in repeated games. In Seventeenth Annual International Joint Conference on Artificial Intelligence Workshop on Economic Agents, Models, and Mechanisms. Citeseer, 2001.
[27] Owen Macindoe, Leslie Pack Kaelbling, and Tomas Lozano-Perez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.
[28] Stefanos Nikolaidis and Julie Shah. Human-robot cross-training: computational formulation, modeling and evaluation of a human team training strategy. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, pages 33–40. IEEE Press, 2013.
[29] Stefanos Nikolaidis, Anton Kuznetsov, David Hsu, and Siddharta Srinivasa. Formalizing human-robot mutual adaptation: A bounded memory model. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction, pages 75–82. IEEE Press, 2016.
[30] Stefanos Nikolaidis, Swaprava Nath, Ariel D. Procaccia, and Siddhartha Srinivasa. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 323–331. ACM, 2017.

[31] Shayegan Omidshafiei, Ali-Akbar Agha-Mohammadi, Christopher Amato, and Jonathan P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5962–5969. IEEE, 2015.
[32] Robert Platt Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
[33] Ben Robins, Kerstin Dautenhahn, Rene Te Boekhorst, and Aude Billard. Effects of repeated exposure to a humanoid robot on children with autism. Designing a More Inclusive World, pages 225–236, 2004.
[34] Michael Rubenstein, Adrian Cabrera, Justin Werfel, Golnaz Habibi, James McLurkin, and Radhika Nagpal. Collective transport of complex objects by simple robots: theory and experiments. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 47–54. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[35] Dorsa Sadigh. Safe and Interactive Autonomy: Control, Learning, and Verification. PhD thesis, University of California, Berkeley, 2017.
[36] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca Dragan. Information gathering actions over human internal state. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 66–73. IEEE, October 2016. doi: 10.1109/IROS.2016.7759036.
[37] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Proceedings of Robotics: Science and Systems (RSS), June 2016. doi: 10.15607/RSS.2016.XII.029.
[38] Dorsa Sadigh, Nick Landolfi, Shankar S. Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405–1426, 2018.
[39] Michaela C. Schippers, Deanne N. Den Hartog, Paul L. Koopman, and Daan van Knippenberg. The role of transformational leadership in enhancing team reflexivity. Human Relations, 61(11):1593–1616, 2008.
[40] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, and Noa Agmon. Teaching and leading an ad hoc teammate: Collaboration without pre-coordination. Artificial Intelligence, 203:35–65, 2013.
[41] Fay Sudweeks and Simeon J. Simoff. Leading conversations: Communication behaviours of emergent leaders in virtual teams. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pages 108a–108a. IEEE, 2005.
[42] Daniel Szafir, Bilge Mutlu, and Terrence Fong. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 19–26. IEEE, 2015.
[43] Zijian Wang and Mac Schwager. Force-amplifying n-robot transport system (force-ants) for cooperative planar manipulation without communication. The International Journal of Robotics Research, 35(13):1564–1586, 2016.
[44] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

VII. APPENDIX

A. Potential Field for Simulated Human Planning

In our implementation, the attractive potential field of attraction i, denoted as U_att^i(q), is constructed as the square of the Euclidean distance ρ_i(q) between the agent at location q and attraction i at location q_i. In this way, the attraction grows as the distance to the goal becomes larger. The hyper-parameter ε controls how strong the attraction is and takes the same value for all attractions.

\rho_i(q) = \| q - q_i \|
U_{att}^i(q) = \tfrac{1}{2} \epsilon \, \rho_i(q)^2
-\nabla U_{att}^i(q) = -\epsilon \, \rho_i(q) \, \nabla\rho_i(q)

The repulsive potential field U_rep^j(q) is used for obstacle avoidance. It has a limited effective radius, since we do not want an obstacle to affect the agents' planning when they are far away from each other. Our choice of U_rep^j(q) has a limited range γ_0, and its value is zero outside that range. Within distance γ_0, the repulsive potential field increases as the agent approaches the obstacle. We denote the minimum distance from the agent to obstacle j as γ_j(q). The coefficient η and the range γ_0 are hyper-parameters controlling how conservative the collision avoidance is, and they are consistent for all obstacles. Larger values of η and γ_0 mean that we are more conservative about collision avoidance and want the agent to keep a larger distance from obstacles.

\gamma_j(q) = \min_{q' \in \text{obs}_j} \| q - q' \|
U_{rep}^j(q) = \begin{cases} \tfrac{1}{2} \eta \left( \tfrac{1}{\gamma_j(q)} - \tfrac{1}{\gamma_0} \right)^2 & \gamma_j(q) < \gamma_0 \\ 0 & \gamma_j(q) \geq \gamma_0 \end{cases}
-\nabla U_{rep}^j(q) = \begin{cases} \eta \left( \tfrac{1}{\gamma_j(q)} - \tfrac{1}{\gamma_0} \right) \tfrac{1}{\gamma_j(q)^2} \, \nabla\gamma_j(q) & \gamma_j(q) < \gamma_0 \\ 0 & \gamma_j(q) \geq \gamma_0 \end{cases}
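To make the appendix concrete, the sketch below shows one way a simulated human step could be computed from these potentials, with one attractive term for the agent's goal, one for the crowd center, and repulsive terms for the other agents, using the weights θg = 0.6, θc = 0.4, and θj = 1 from Sec. V. This is a minimal sketch, not the authors' implementation: the names (simulated_human_step, EPS, ETA, GAMMA_0), the treatment of other agents as point obstacles, and the discrete steepest-descent move selection are all illustrative assumptions.

import numpy as np

# Illustrative hyper-parameters (assumed values, not from the paper).
EPS = 1.0       # attraction strength (epsilon above)
ETA = 1.0       # repulsion strength (eta above)
GAMMA_0 = 30.0  # effective radius of the repulsive field, in pixels

def attractive_grad(q, q_attr):
    # Gradient of U_att = 0.5 * eps * ||q - q_attr||^2, i.e. eps * (q - q_attr).
    return EPS * (q - q_attr)

def repulsive_grad(q, q_obs):
    # Gradient of U_rep = 0.5 * eta * (1/gamma - 1/gamma_0)^2 for a point obstacle.
    diff = q - q_obs
    gamma = np.linalg.norm(diff)
    if gamma >= GAMMA_0 or gamma == 0.0:
        return np.zeros_like(q)
    # d(gamma)/dq = diff / gamma; chain rule gives the expression below.
    return -ETA * (1.0 / gamma - 1.0 / GAMMA_0) * (1.0 / gamma**2) * (diff / gamma)

def simulated_human_step(q, goal, crowd_center, obstacles,
                         theta_g=0.6, theta_c=0.4, theta_j=1.0):
    # Pick the move in {stay, up, down, left, right} best aligned with the
    # negative gradient of the weighted potential (discrete steepest descent).
    grad = (theta_g * attractive_grad(q, goal)
            + theta_c * attractive_grad(q, crowd_center)
            + theta_j * sum(repulsive_grad(q, o) for o in obstacles))
    moves = {"stay": np.array([0, 0]), "up": np.array([0, 1]),
             "down": np.array([0, -1]), "left": np.array([-1, 0]),
             "right": np.array([1, 0])}
    best = min(moves, key=lambda m: float(np.dot(grad, moves[m])))
    return q + moves[best]

# Example: one step for an agent at (100, 100) heading toward a goal at (400, 300),
# with the crowd centered at (150, 120) and another agent nearby at (105, 100).
q_next = simulated_human_step(np.array([100.0, 100.0]), np.array([400.0, 300.0]),
                              np.array([150.0, 120.0]), [np.array([105.0, 100.0])])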




[17] Matthew C Gombolay Cindy Huang and Julie A ShahCoordination of human-robot teaming with human taskpreferences In AAAI Fall Symposium Series on AI-HRIvolume 11 page 2015 2015

[18] Dylan Hadfield-Menell Stuart J Russell Pieter Abbeeland Anca Dragan Cooperative inverse reinforcementlearning In Advances in neural information processingsystems pages 3909ndash3917 2016

[19] Shervin Javdani Henny Admoni Stefania PellegrinelliSiddhartha S Srinivasa and J Andrew Bagnell Sharedautonomy via hindsight optimization for teleoperationand teaming The International Journal of RoboticsResearch page 0278364918776060 2018

[20] Takayuki Kanda Takayuki Hirano Daniel Eaton andHiroshi Ishiguro Interactive robots as social partners andpeer tutors for children A field trial Human-computerinteraction 19(1)61ndash84 2004

[21] Richard M Karp A simple derivation of edmondsrsquoalgorithm for optimum branchings Networks 1(3)265ndash272 1971

[22] Piyush Khandelwal Samuel Barrett and Peter StoneLeading the way An efficient multi-robot guidance sys-tem In Proceedings of the 2015 international conferenceon autonomous agents and multiagent systems pages1625ndash1633 International Foundation for AutonomousAgents and Multiagent Systems 2015

[23] Mykel J Kochenderfer Decision making under uncer-tainty theory and application MIT press 2015

[24] Hanna Kurniawati David Hsu and Wee Sun Lee SarsopEfficient point-based pomdp planning by approximatingoptimally reachable belief spaces In Robotics Scienceand systems volume 2008 Zurich Switzerland 2008

[25] Oliver Lemon Conversational interfaces In Data-DrivenMethods for Adaptive Spoken Dialogue Systems pages1ndash4 Springer 2012

[26] Michael L Littman and Peter Stone Leading best-response strategies in repeated games In In SeventeenthAnnual International Joint Conference on Artificial In-telligence Workshop on Economic Agents Models andMechanisms Citeseer 2001

[27] Owen Macindoe Leslie Pack Kaelbling and TomasLozano-Perez Pomcop Belief space planning for side-kicks in cooperative games In AIIDE 2012

[28] Stefanos Nikolaidis and Julie Shah Human-robot cross-training computational formulation modeling and eval-uation of a human team training strategy In Pro-ceedings of the 8th ACMIEEE international conferenceon Human-robot interaction pages 33ndash40 IEEE Press2013

[29] Stefanos Nikolaidis Anton Kuznetsov David Hsu andSiddharta Srinivasa Formalizing human-robot mutualadaptation A bounded memory model In The EleventhACMIEEE International Conference on Human Robot

Interaction pages 75ndash82 IEEE Press 2016[30] Stefanos Nikolaidis Swaprava Nath Ariel D Procaccia

and Siddhartha Srinivasa Game-theoretic modeling ofhuman adaptation in human-robot collaboration InProceedings of the 2017 ACMIEEE International Con-ference on Human-Robot Interaction pages 323ndash331ACM 2017

[31] Shayegan Omidshafiei Ali-Akbar Agha-MohammadiChristopher Amato and Jonathan P How Decentralizedcontrol of partially observable markov decision processesusing belief space macro-actions In Robotics and Au-tomation (ICRA) 2015 IEEE International Conferenceon pages 5962ndash5969 IEEE 2015

[32] Robert Platt Jr Russ Tedrake Leslie Kaelbling andTomas Lozano-Perez Belief space planning assumingmaximum likelihood observations 2010

[33] Ben Robins Kerstin Dautenhahn Rene Te Boekhorstand Aude Billard Effects of repeated exposure to ahumanoid robot on children with autism Designing amore inclusive world pages 225ndash236 2004

[34] Michael Rubenstein Adrian Cabrera Justin Werfel Gol-naz Habibi James McLurkin and Radhika Nagpal Col-lective transport of complex objects by simple robotstheory and experiments In Proceedings of the 2013 in-ternational conference on Autonomous agents and multi-agent systems pages 47ndash54 International Foundation forAutonomous Agents and Multiagent Systems 2013

[35] Dorsa Sadigh Safe and Interactive Autonomy ControlLearning and Verification PhD thesis University ofCalifornia Berkeley 2017

[36] Dorsa Sadigh S Shankar Sastry Sanjit A Seshia andAnca Dragan Information gathering actions over hu-man internal state In Proceedings of the IEEE RSJInternational Conference on Intelligent Robots and Sys-tems (IROS) pages 66ndash73 IEEE October 2016 doi101109IROS20167759036

[37] Dorsa Sadigh S Shankar Sastry Sanjit A Seshia andAnca D Dragan Planning for autonomous cars thatleverage effects on human actions In Proceedings ofRobotics Science and Systems (RSS) June 2016 doi1015607RSS2016XII029

[38] Dorsa Sadigh Nick Landolfi Shankar S Sastry Sanjit ASeshia and Anca D Dragan Planning for cars thatcoordinate with people leveraging effects on humanactions for planning and active information gatheringover human internal state Autonomous Robots 42(7)1405ndash1426 2018

[39] Michaela C Schippers Deanne N Den Hartog Paul LKoopman and Daan van Knippenberg The role oftransformational leadership in enhancing team reflexivityHuman Relations 61(11)1593ndash1616 2008

[40] Peter Stone Gal A Kaminka Sarit Kraus Jeffrey SRosenschein and Noa Agmon Teaching and lead-ing an ad hoc teammate Collaboration without pre-coordination Artificial Intelligence 20335ndash65 2013

[41] Fay Sudweeks and Simeon J Simoff Leading conversa-

tions Communication behaviours of emergent leaders invirtual teams In proceedings of the 38th annual Hawaiiinternational conference on system sciences pages 108andash108a IEEE 2005

[42] Daniel Szafir Bilge Mutlu and Terrence Fong Com-municating directionality in flying robots In 2015 10thACMIEEE International Conference on Human-RobotInteraction (HRI) pages 19ndash26 IEEE 2015

[43] Zijian Wang and Mac Schwager Force-amplifying n-robot transport system (force-ants) for cooperative planarmanipulation without communication The InternationalJournal of Robotics Research 35(13)1564ndash1586 2016

[44] Brian D Ziebart Andrew L Maas J Andrew Bagnell andAnind K Dey Maximum entropy inverse reinforcementlearning In AAAI volume 8 pages 1433ndash1438 ChicagoIL USA 2008

VII APPENDIX

A Potential Field for Simulated Human Planning

In our implementation the attractive potential field of at-traction i denoted as U i

att(q) is constructed as the square ofthe Euclidean distance ρi(q) between agent at location q andattraction i at location qi In this way the attraction increasesas the distance to goal becomes larger 983171 is the hyper-parameterfor controlling how strong the attraction is and has consistentvalue for all attractions

ρi(q) = 983042q minus qi983042U iatt(q) =

12983171ρi(q)

2

minusnablaU iatt(q) = minus983171ρi(q)(nablaρi(q))

The repulsive potential field U jrep(q) is used for obstacle

avoidance It usually has a limited effective radius since wedonrsquot want the obstacle to affect agentsrsquo planning if theyare far way from each other Our choice for U j

rep(q) has alimited range γ0 where the value is zero outside the rangeWithin distance γ0 the repulsive potential field increases as theagent approaches the obstacle Here we denote the minimumdistance from the agent to the obstacle j as γj(q) Coefficientη and range γ0 are the hyper-parameters for controlling howconservative we want our collision avoidance to be and isconsistent for all obstacles Larger values of η and γ0 meanthat we are more conservative with collision avoidance andwant the agent to keep a larger distance to obstacles

γj(q) = minqprimeisinobsj 983042q minus qprime983042

U jrep(q) =

983069 12η(

1γj(q)

minus 1γ0) γj(q) lt γ0

0 γj(q) gt γ0

nablaU jrep(q) =

983069η( 1

γj(q)minus 1

γ0)( 1

γj(q)2)nablaγ(q) γ(q)j lt γ0

0 γj(q) gt γ0

Page 7: Influencing Leading and Following in Human-Robot Teamsiliad.stanford.edu/pdfs/publications/kwon2019influencing.pdf · Running Example: Pursuit-Evasion Game. We define a multi-player

B. Adversarial task

We consider a different task where the robot is an adversary that is trying to distract a team of humans from reaching a goal. There are m goals and n agents in the pursuit-evasion game. Among the n agents, we have 1 robot and n − 1 humans. The robot's goal is to distract the team of humans so that they cannot converge to the same goal. In order to capture a goal, we enforce the constraint that n − 2 agents must collide with it at the same time, allowing 1 agent to be absent. Since we allow one agent to be absent, this prevents the robot from adopting the simple strategy of blocking a single agent throughout the game. The game ends when all goals are captured or when the game time exceeds the limit. Thus, another way to interpret the robot's objective is to extend the game time as much as possible.

Methods: We compared our approach against three baseline heuristic strategies (random, to one player, to farthest goal) that do not use the leader-follower graph.

In the Random strategy, the robot picks an action at each timestep with uniform probability. In the To one player strategy, the robot randomly selects a human agent and then moves towards it to try to block its way. Finally, a robot enacting the To farthest goal strategy selects the goal whose average distance to the human players is largest and then moves to that goal, in the hope that the human agents will be influenced, or may change goals after observing that some players are heading for another goal.

We also experimented with two variations of our approach. The LFG closest pursuer strategy involves the robot agent selecting the closest pursuer and choosing an action that maximizes the probability of that pursuer following it (as predicted by the LFG). Similarly, the LFG influential pursuer strategy involves the robot targeting the most influential human agent predicted by the LFG and then performing the same optimization of maximizing the following probability, as shown in Eqn. (4).
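A minimal sketch of this action selection is shown below. It assumes a discrete set of candidate robot actions and a hypothetical predictor follow_prob(lfg, state, action, target_id) standing in for the learned leader-follower graph and the Eqn. (4) objective; the state.human_positions field is likewise an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def closest_pursuer(state, robot_pos):
    """Pick the human agent nearest to the robot as the influence target."""
    robot_pos = np.asarray(robot_pos, dtype=float)
    dists = {i: np.linalg.norm(np.asarray(p, dtype=float) - robot_pos)
             for i, p in state.human_positions.items()}
    return min(dists, key=dists.get)

def lfg_adversarial_action(lfg, state, robot_pos, candidate_actions, follow_prob):
    """LFG closest pursuer variant: choose the action that maximizes the
    predicted probability that the targeted pursuer follows the robot."""
    target = closest_pursuer(state, robot_pos)
    scores = [follow_prob(lfg, state, action, target) for action in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]
```

The LFG influential pursuer variant would differ only in how the target is chosen (the most influential agent in the graph rather than the nearest one).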

Finally, we explored different game settings by varying the number of players n = 3, ..., 6 and the number of goals m = 1, ..., 4.
Metrics: We evaluated the performance of the robot with game time as the metric. Longer game time indicates that the robot does well in distracting the human players.
Results: We sampled 50 outcomes for every game setting and robot policy. Tables II and III show the mean and standard deviation of game time for every game setting and robot policy across the 50 samples. Across game settings, our models based on the leader-follower graph consistently outperformed the methods without knowledge of the leader-follower graph. These experimental results are also visualized in Fig. 8.

As can be seen from Fig. 8, the average game time goes up as the number of players increases. This is because it is more challenging for more players to agree on which goal to capture. The consistently strong performance across all game settings suggests the effectiveness of the leader-follower graph for inference and optimization in this scenario.
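The 50-sample evaluation protocol above can be summarized in a short sketch; run_adversarial_game is a hypothetical simulator call used only for illustration, not part of the paper's released code.

```python
import numpy as np

def evaluate_game_time(run_adversarial_game, policy, n_players, m_goals,
                       num_samples=50):
    """Return mean and standard deviation of game time over sampled games."""
    times = [run_adversarial_game(n_players, m_goals, policy, seed=s)
             for s in range(num_samples)]
    return float(np.mean(times)), float(np.std(times))
```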

We demonstrate an example of robot behavior in the adversarial game in Fig. 9 (dashed lines represent trajectories). Player 1 and player 2 start very close to goal g1, making g1 the probable target. The robot then approaches agent 2, blocks its way, and leads it to another goal, g2. In this way, the robot successfully extends the game time.

Fig. 8: Average game time over 50 games with a different number of players, across all baseline methods and our model.

Fig. 9: Adversarial task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

C. Cooperative task

Finally, we evaluate the robot in a cooperative setting where the robot tries to be helpful to human teams. The robot aims to lead its teammates so that everyone reaches the target goal that gives the team the greatest joint utility, g* ∈ G. However, g* is not immediately observable to all teammates. We assume a setting where only the robot knows where g* is.

For consistency, the experiment setting is the same as the Adversarial task, where n − 2 human agents need to collide with a goal to capture it. In this scenario, the task is considered successful if the goal with the greatest joint utility g* is captured. The task is unsuccessful if any other suboptimal goal is captured or if the game time exceeds the limit.
Methods: Similar to the Adversarial task, we explore two variations of our approach, where the robot chooses to influence either its closest human agent or the most influential agent predicted by the leader-follower graph. The variation where the robot targets the most influential leader is outlined in Eqn. (5). Different from the Adversarial task, here the robot optimizes the probability of both becoming the target agent's leader and moving toward the target goal. We also experimented with three baseline strategies. The Random strategy involves the robot taking actions at random. The To target goal strategy has the robot move directly to the optimal goal g* and stay there, trying to attract the other human agents. The To goal farthest player strategy involves the robot going to the player that is farthest away from g*, in an attempt to indirectly influence that player to move back toward g*. Finally, we experimented with game settings by varying the number of goals m = 2, ..., 6.
Metrics: We evaluated the performance of each robot strategy using the game success rate over 100 games. We calculate the success rate by dividing the number of successful games by the total number of games.
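The sketch below illustrates an Eqn. (5)-style trade-off between becoming the target agent's leader and making progress toward the optimal goal g*. The predictor lead_prob, the weighting alpha, and the simple additive form are illustrative assumptions rather than the paper's exact formulation; target_id would be the closest or the most influential agent depending on the variant.

```python
import numpy as np

def cooperative_action(lfg, state, robot_pos, g_star, target_id,
                       candidate_actions, lead_prob, step_size=1.0, alpha=0.5):
    """Choose an action that both attracts the target agent to follow the
    robot and moves the robot toward the optimal goal g*."""
    robot_pos = np.asarray(robot_pos, dtype=float)
    g_star = np.asarray(g_star, dtype=float)

    def score(action):
        next_pos = robot_pos + step_size * np.asarray(action, dtype=float)
        progress = -np.linalg.norm(next_pos - g_star)   # closer to g* is better
        leading = lead_prob(lfg, state, action, target_id)
        return alpha * leading + (1.0 - alpha) * progress

    return max(candidate_actions, key=score)
```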

TABLE II: Average game time over 50 adversarial games with a varying number of players (number of goals m = 2).

Model                            n=3          n=4          n=5          n=6
LFG closest pursuer (ours)       23304±5182   30508±4948   46118±5573   55088±5167
LFG influential pursuer (ours)   20194±4515   28644±4854   41478±5098   51592±4880
random                           1292±3266    20940±3986   38892±5324   43716±4317
to one pursuer                   21504±5000   23142±4469   45516±5835   47236±4975
to farthest goal                 13284±3422   1985±3614    38208±5259   44564±4677

TABLE III: Average game time over 50 adversarial games with a varying number of goals (number of players n = 4).

Model                            m=1          m=2          m=3          m=4
LFG closest pursuer (ours)       21094±3323   30508±4948   28922±5299   34300±5590
LFG influential pursuer (ours)   23904±3973   28644±4854   21956±4100   30180±5200
random                           15594±2142   20940±3986   20574±4305   29462±5401
to one pursuer                   12358±956    23142±4469   22552±4147   31792±5475
to farthest goal                 21336±3483   1985±3614    21868±4367   25830±5064

Results: Our results are summarized in Table IV. We expected the To target goal baseline to be very strong, since moving towards the target goal directly influences the other simulated agents to also move toward it. We find that this baseline strategy is especially effective when the game is not complex, i.e., when the number of goals is small. However, our model based on the leader-follower graph still demonstrates competitive performance. In particular, as the number of goals increases, the advantage of the leader-follower graph becomes clearer. This indicates that in complex scenarios, brute-force methods that have no knowledge of the human team's hidden structure will not suffice. Another thing to note is that the difference between all of the strategies becomes smaller as the number of goals increases. This is because the difficulty of the game increases for all strategies, and thus whether a game succeeds depends more on its initial conditions.

TABLE IV: Success rate over 100 collaborative games with a varying number of goals m (number of players n = 4).

Model                      m=2    m=3    m=4    m=5    m=6
LFG closest pursuer        0.59   0.38   0.29   0.27   0.22
LFG influential pursuer    0.57   0.36   0.32   0.24   0.19
random                     0.55   0.35   0.24   0.21   0.20
to target goal             0.60   0.42   0.28   0.24   0.21
to goal farthest player    0.47   0.29   0.17   0.19   0.21

We show snapshots of one game in Fig. 10. In this game, the robot approaches the other agents and the desired goal in a collaborative pattern, trying to help the team capture the goal g1.

Fig. 10: Collaborative task snapshots. The orange triangles are the human agents and the black triangle is the robot agent.

VI. DISCUSSION

Summary: We propose an approach for modeling leading and following behavior in multi-agent human teams. We use a combination of data-driven and graph-theoretic techniques to learn a leader-follower graph. This graph representation, which encodes the human team's hidden structure, is scalable with the team size (number of agents), since we first learn local pairwise relationships and then combine them to create a global model. We demonstrate the effectiveness of the leader-follower graph by experimenting with optimization-based robot policies that leverage the graph to influence human teams in different scenarios. Our policies are general and perform well across all tasks compared to other high-performing, task-specific policies.

Limitations and Future Work: We view our work as a first step toward modeling latent, dynamic human team structures. Perhaps our greatest limitation is the reliance on simulated human behavior to validate our framework. Further experiments with real human data are needed to support our framework's effectiveness for understanding noisier human behavior. Moreover, the robot policies that use the leader-follower graphs are fairly simple. Although this may be a limitation, it is also promising that simple policies were able to perform well using the leader-follower graph.

For future work, we plan to evaluate our model in large-scale human-robot experiments in both simulation and navigation settings, such as robot swarms, to further improve our model's generalization capacity. We also plan on experimenting with combining the leader-follower graph with more advanced policy learning methods such as deep reinforcement learning. We think the leader-follower graph could contribute to multi-agent reinforcement learning in various ways, such as reward design and state representation.

Acknowledgements: Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would also like to acknowledge General Electric (GE) and the Stanford School of Engineering Fellowship.

REFERENCES

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] Ali-Akbar Agha-Mohammadi, Suman Chakravorty, and Nancy M. Amato. FIRM: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements. The International Journal of Robotics Research, 33(2):268–304, 2014.
[3] Baris Akgun, Maya Cakmak, Karl Jiang, and Andrea L. Thomaz. Keyframe-based learning from demonstration. International Journal of Social Robotics, 4(4):343–355, 2012.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[5] Kim Baraka, Ana Paiva, and Manuela Veloso. Expressive lights for revealing mobile service robot state. In Robot 2015: Second Iberian Robotics Conference, pages 107–119. Springer, 2016.
[6] Jerome Barraquand, Bruno Langlois, and J.-C. Latombe. Numerical potential field techniques for robot path planning. IEEE Transactions on Systems, Man, and Cybernetics, 22(2):224–241, 1992.
[7] Aaron Bestick, Ruzena Bajcsy, and Anca D. Dragan. Implicitly assisting humans to choose good grasps in robot to human handovers. In International Symposium on Experimental Robotics, pages 341–354. Springer, 2016.
[8] Erdem Biyik and Dorsa Sadigh. Batch active preference-based learning of reward functions. In Conference on Robot Learning (CoRL), October 2018.
[9] Frank Broz, Illah Nourbakhsh, and Reid Simmons. Designing POMDP models of socially situated tasks. In RO-MAN, 2011 IEEE, pages 39–46. IEEE, 2011.
[10] Sonia Chernova and Andrea L. Thomaz. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(3):1–121, 2014.
[11] Yoeng-Jin Chu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
[12] D. Scott DeRue. Adaptive leadership theory: Leading and following as a complex adaptive process. Research in Organizational Behavior, 31:125–150, 2011.
[13] Anca D. Dragan, Dorsa Sadigh, Shankar Sastry, and Sanjit A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
[14] Finale Doshi and Nicholas Roy. Efficient model learning for dialog management. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 65–72. ACM, 2007.
[15] Jack Edmonds. Optimum branchings. Mathematics and the Decision Sciences, Part 1(335-345):25, 1968.
[16] Katie Genter, Noa Agmon, and Peter Stone. Ad hoc teamwork for leading a flock. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 531–538. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[17] Matthew C. Gombolay, Cindy Huang, and Julie A. Shah. Coordination of human-robot teaming with human task preferences. In AAAI Fall Symposium Series on AI-HRI, volume 11, page 2015, 2015.
[18] Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[19] Shervin Javdani, Henny Admoni, Stefania Pellegrinelli, Siddhartha S. Srinivasa, and J. Andrew Bagnell. Shared autonomy via hindsight optimization for teleoperation and teaming. The International Journal of Robotics Research, page 0278364918776060, 2018.
[20] Takayuki Kanda, Takayuki Hirano, Daniel Eaton, and Hiroshi Ishiguro. Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19(1):61–84, 2004.
[21] Richard M. Karp. A simple derivation of Edmonds' algorithm for optimum branchings. Networks, 1(3):265–272, 1971.
[22] Piyush Khandelwal, Samuel Barrett, and Peter Stone. Leading the way: An efficient multi-robot guidance system. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1625–1633. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
[23] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[24] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, volume 2008, Zurich, Switzerland, 2008.
[25] Oliver Lemon. Conversational interfaces. In Data-Driven Methods for Adaptive Spoken Dialogue Systems, pages 1–4. Springer, 2012.
[26] Michael L. Littman and Peter Stone. Leading best-response strategies in repeated games. In Seventeenth Annual International Joint Conference on Artificial Intelligence Workshop on Economic Agents, Models, and Mechanisms. Citeseer, 2001.
[27] Owen Macindoe, Leslie Pack Kaelbling, and Tomas Lozano-Perez. POMCoP: Belief space planning for sidekicks in cooperative games. In AIIDE, 2012.
[28] Stefanos Nikolaidis and Julie Shah. Human-robot cross-training: computational formulation, modeling and evaluation of a human team training strategy. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, pages 33–40. IEEE Press, 2013.
[29] Stefanos Nikolaidis, Anton Kuznetsov, David Hsu, and Siddharta Srinivasa. Formalizing human-robot mutual adaptation: A bounded memory model. In The Eleventh ACM/IEEE International Conference on Human-Robot Interaction, pages 75–82. IEEE Press, 2016.
[30] Stefanos Nikolaidis, Swaprava Nath, Ariel D. Procaccia, and Siddhartha Srinivasa. Game-theoretic modeling of human adaptation in human-robot collaboration. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 323–331. ACM, 2017.
[31] Shayegan Omidshafiei, Ali-Akbar Agha-Mohammadi, Christopher Amato, and Jonathan P. How. Decentralized control of partially observable Markov decision processes using belief space macro-actions. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 5962–5969. IEEE, 2015.
[32] Robert Platt Jr., Russ Tedrake, Leslie Kaelbling, and Tomas Lozano-Perez. Belief space planning assuming maximum likelihood observations. 2010.
[33] Ben Robins, Kerstin Dautenhahn, Rene Te Boekhorst, and Aude Billard. Effects of repeated exposure to a humanoid robot on children with autism. Designing a More Inclusive World, pages 225–236, 2004.
[34] Michael Rubenstein, Adrian Cabrera, Justin Werfel, Golnaz Habibi, James McLurkin, and Radhika Nagpal. Collective transport of complex objects by simple robots: theory and experiments. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 47–54. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[35] Dorsa Sadigh. Safe and Interactive Autonomy: Control, Learning, and Verification. PhD thesis, University of California, Berkeley, 2017.
[36] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca Dragan. Information gathering actions over human internal state. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 66–73. IEEE, October 2016. doi: 10.1109/IROS.2016.7759036.
[37] Dorsa Sadigh, S. Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Proceedings of Robotics: Science and Systems (RSS), June 2016. doi: 10.15607/RSS.2016.XII.029.
[38] Dorsa Sadigh, Nick Landolfi, Shankar S. Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405–1426, 2018.
[39] Michaela C. Schippers, Deanne N. Den Hartog, Paul L. Koopman, and Daan van Knippenberg. The role of transformational leadership in enhancing team reflexivity. Human Relations, 61(11):1593–1616, 2008.
[40] Peter Stone, Gal A. Kaminka, Sarit Kraus, Jeffrey S. Rosenschein, and Noa Agmon. Teaching and leading an ad hoc teammate: Collaboration without pre-coordination. Artificial Intelligence, 203:35–65, 2013.
[41] Fay Sudweeks and Simeon J. Simoff. Leading conversations: Communication behaviours of emergent leaders in virtual teams. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, pages 108a–108a. IEEE, 2005.
[42] Daniel Szafir, Bilge Mutlu, and Terrence Fong. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 19–26. IEEE, 2015.
[43] Zijian Wang and Mac Schwager. Force-amplifying n-robot transport system (force-ants) for cooperative planar manipulation without communication. The International Journal of Robotics Research, 35(13):1564–1586, 2016.
[44] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438, Chicago, IL, USA, 2008.

VII. APPENDIX

A. Potential Field for Simulated Human Planning

In our implementation, the attractive potential field of attraction i, denoted as U^i_att(q), is constructed as the square of the Euclidean distance ρ_i(q) between the agent at location q and attraction i at location q_i. In this way, the attraction increases as the distance to the goal becomes larger. Here, ε is the hyper-parameter controlling how strong the attraction is, and it has a consistent value for all attractions.

\rho_i(q) = \| q - q_i \|, \qquad U^{i}_{\text{att}}(q) = \tfrac{1}{2}\,\epsilon\,\rho_i(q)^2, \qquad -\nabla U^{i}_{\text{att}}(q) = -\epsilon\,\rho_i(q)\,\nabla\rho_i(q)

The repulsive potential field U^j_rep(q) is used for obstacle avoidance. It usually has a limited effective radius, since we do not want an obstacle to affect the agents' planning if they are far away from each other. Our choice for U^j_rep(q) has a limited range γ_0, and its value is zero outside that range. Within distance γ_0, the repulsive potential field increases as the agent approaches the obstacle. Here we denote the minimum distance from the agent to obstacle j as γ_j(q). The coefficient η and range γ_0 are hyper-parameters controlling how conservative we want our collision avoidance to be, and they are consistent for all obstacles. Larger values of η and γ_0 mean that we are more conservative with collision avoidance and want the agent to keep a larger distance from obstacles.

\gamma_j(q) = \min_{q' \in \text{obs}_j} \| q - q' \|

U^{j}_{\text{rep}}(q) =
\begin{cases}
\tfrac{1}{2}\,\eta \left( \frac{1}{\gamma_j(q)} - \frac{1}{\gamma_0} \right)^2 & \gamma_j(q) < \gamma_0 \\
0 & \gamma_j(q) \geq \gamma_0
\end{cases}

-\nabla U^{j}_{\text{rep}}(q) =
\begin{cases}
\eta \left( \frac{1}{\gamma_j(q)} - \frac{1}{\gamma_0} \right) \frac{1}{\gamma_j(q)^2}\,\nabla\gamma_j(q) & \gamma_j(q) < \gamma_0 \\
0 & \gamma_j(q) \geq \gamma_0
\end{cases}
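The following is a minimal sketch of how these potential fields could drive a simulated human's motion, implementing the attractive and repulsive negative gradients above. The function names, hyper-parameter defaults, and the plain gradient-step update are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def attractive_force(q, q_i, eps=1.0):
    """-grad U_att(q) = -eps * rho_i(q) * grad rho_i(q) = eps * (q_i - q)."""
    return eps * (np.asarray(q_i, dtype=float) - np.asarray(q, dtype=float))

def repulsive_force(q, obstacle_points, eta=1.0, gamma0=2.0):
    """-grad U_rep(q) for one obstacle; zero outside the effective radius gamma0."""
    q = np.asarray(q, dtype=float)
    obstacle_points = np.asarray(obstacle_points, dtype=float)
    dists = np.linalg.norm(obstacle_points - q, axis=1)
    j = int(np.argmin(dists))
    gamma = dists[j]
    if gamma >= gamma0 or gamma < 1e-9:
        return np.zeros_like(q)
    grad_gamma = (q - obstacle_points[j]) / gamma          # direction away from the obstacle
    return eta * (1.0 / gamma - 1.0 / gamma0) * (1.0 / gamma ** 2) * grad_gamma

def potential_field_step(q, goals, obstacles, step_size=0.05,
                         eps=1.0, eta=1.0, gamma0=2.0):
    """Move a simulated agent one step along the total negative gradient."""
    q = np.asarray(q, dtype=float)
    force = np.zeros_like(q)
    for q_i in goals:                       # attractions pull toward each goal
        force += attractive_force(q, q_i, eps)
    for obs in obstacles:                   # each obstacle: an array of boundary points
        force += repulsive_force(q, obs, eta, gamma0)
    return q + step_size * force
```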

Page 8: Influencing Leading and Following in Human-Robot Teamsiliad.stanford.edu/pdfs/publications/kwon2019influencing.pdf · Running Example: Pursuit-Evasion Game. We define a multi-player

TABLE II Average Game time over 50 adversarial games with varying number of players

number of goals (m=2)

Model n=3 n=4 n=5 n=6

LFG closest pursuer (ours) 23304plusmn5182 30508plusmn4948 46118plusmn5573 55088plusmn5167LFG influential pursuer (ours) 20194plusmn4515 28644plusmn4854 41478plusmn5098 51592plusmn4880

random 1292plusmn3266 20940plusmn3986 38892plusmn5324 43716plusmn4317to one pursuer 21504plusmn5000 23142plusmn4469 45516plusmn5835 47236plusmn4975to farthest goal 13284plusmn3422 1985plusmn3614 38208plusmn5259 44564plusmn4677

TABLE III Average Game time over 50 adversarial games with varying number of goals

number of players (n=4)

Model m=1 m=2 m=3 m=4

LFG closest pursuer (ours) 21094plusmn3323 30508plusmn4948 28922plusmn5299 34300plusmn5590LFG influential pursuer (ours) 23904plusmn3973 28644plusmn4854 21956plusmn4100 30180plusmn5200

random 15594plusmn2142 20940plusmn3986 20574plusmn4305 29462plusmn5401to one pursuer 12358plusmn956 23142plusmn4469 22552plusmn4147 31792plusmn5475to farthest goal 21336plusmn3483 1985plusmn3614 21868plusmn4367 25830plusmn5064

Results Our results are summarized in Table IV We expectedthe To target goal baseline to be very strong since movingtowards the target goal directly influences other simulatedagents to also move toward the target goal We find that thisbaseline strategy is especially effective when the game is notcomplex ie the number of goals is small However ourmodel based on the leader-follower graph still demonstratescompetitive performance compared to it Especially when thenumber of goals increase the advantage of leader-followergraph becomes clearer This indicates that in complex scenar-ios brute force methods that do not have knowledge of humanteam hidden structure will not suffice Another thing to noteis that the difference between all of the strategies becomessmaller as the number of goals increases This is because thedifficulty of the game increases for all strategies and thuswhether a game would succeed depends more on the gameinitial conditions We have shown snapshots of one game

TABLE IV Success rate over 100 collaborative games with varyingnumber of goals m

number of players (n=4)

Model m=2 m=3 m=4 m=5 m=6

LFG closest pursuer 059 038 029 027 022LFG influential pursuer 057 036 032 024 019

random 055 035 024 021 020to target goal 060 042 028 024 021

to goal farthest player 047 029 017 019 021

as in Fig 10 In this game the robot approaches other agentsand the desired goal in the collaborative pattern trying to helpcatch the goal g1

Fig 10 Collaborative task snapshots The orange triangles are thehuman agents and the black triangle is the robot agent

VI DISCUSSION

Summary We propose an approach for modeling leadingand following behavior in multi-agent human teams We usea combination of data-driven and graph-theoretic techniquesto learn a leader-follower graph This graph representationencoding human team hidden structure is scalable with theteam size (number of agents) since we first learn localpairwise relationships and combine them to create a globalmodel We demonstrate the effectiveness of the leader-followergraph by experimenting with optimization based robot policiesthat leverage the graph to influence human teams in differentscenarios Our policies are general and perform well across alltasks compared to other high-performing task-specific policiesLimitations and Future Work We view our work as a firststep into modeling latent dynamic human team structures Per-haps our greatest limitation is the reliance on simulated humanbehavior to validate our framework Further experiments withreal human data are needed to support our frameworkrsquos effec-tiveness for noisier human behavior understanding Moreoverthe robot policies that use the leader-follower graphs are fairlysimple Although this may be a limitation it is also promisingthat simple policies were able to perform well using the leader-follower graph

For future work we plan to evaluate our model on largescale human-robot experiments in both simulation and nav-igation settings such as robot swarms to further improveour modelrsquos generalization capacity We also plan on experi-menting with combining the leader-follower graph with moreadvanced policy learning methods such as deep reinforcementlearning We think the leader-follower graph could contributeto multi-agent reinforcement learning in various ways such asreward design and state representationAcknowledgements Toyota Research Institute (ldquoTRIrdquo) pro-vided funds to assist the authors with their research but thisarticle solely reflects the opinions and conclusions of itsauthors and not TRI or any other Toyota entity The authorswould also like to acknowledge General Electric (GE) andStanford School of Engineering Fellowship

REFERENCES

[1] Pieter Abbeel and Andrew Y Ng Apprenticeship learn-ing via inverse reinforcement learning In Proceedingsof the twenty-first international conference on Machinelearning page 1 ACM 2004

[2] Ali-Akbar Agha-Mohammadi Suman Chakravorty andNancy M Amato Firm Sampling-based feedbackmotion-planning under motion uncertainty and imperfectmeasurements The International Journal of RoboticsResearch 33(2)268ndash304 2014

[3] Baris Akgun Maya Cakmak Karl Jiang and Andrea LThomaz Keyframe-based learning from demonstrationInternational Journal of Social Robotics 4(4)343ndash3552012

[4] Jimmy Lei Ba Jamie Ryan Kiros and Geoffrey EHinton Layer normalization arXiv preprintarXiv160706450 2016

[5] Kim Baraka Ana Paiva and Manuela Veloso Expressivelights for revealing mobile service robot state In Robot2015 Second Iberian Robotics Conference pages 107ndash119 Springer 2016

[6] Jerome Barraquand Bruno Langlois and J-C LatombeNumerical potential field techniques for robot path plan-ning IEEE transactions on systems man and cybernet-ics 22(2)224ndash241 1992

[7] Aaron Bestick Ruzena Bajcsy and Anca D Dragan Im-plicitly assisting humans to choose good grasps in robotto human handovers In International Symposium onExperimental Robotics pages 341ndash354 Springer 2016

[8] Erdem Biyik and Dorsa Sadigh Batch active preference-based learning of reward functions In Conference onRobot Learning (CoRL) October 2018

[9] Frank Broz Illah Nourbakhsh and Reid Simmons De-signing pomdp models of socially situated tasks In RO-MAN 2011 IEEE pages 39ndash46 IEEE 2011

[10] Sonia Chernova and Andrea L Thomaz Robot learningfrom human teachers Synthesis Lectures on ArtificialIntelligence and Machine Learning 8(3)1ndash121 2014

[11] Yoeng-Jin Chu On the shortest arborescence of adirected graph Science Sinica 141396ndash1400 1965

[12] D Scott DeRue Adaptive leadership theory Leadingand following as a complex adaptive process Researchin organizational behavior 31125ndash150 2011

[13] Anca D Dragan Dorsa Sadigh Shankar Sastry and San-jit A Seshia Active preference-based learning of rewardfunctions In Robotics Science and Systems (RSS) 2017

[14] Finale Doshi and Nicholas Roy Efficient model learningfor dialog management In Proceedings of the ACMIEEEinternational conference on Human-robot interactionpages 65ndash72 ACM 2007

[15] Jack Edmonds Optimum branchings Mathematics andthe Decision Sciences Part 1(335-345)25 1968

[16] Katie Genter Noa Agmon and Peter Stone Ad hocteamwork for leading a flock In Proceedings of the2013 international conference on Autonomous agents and

multi-agent systems pages 531ndash538 International Foun-dation for Autonomous Agents and Multiagent Systems2013

[17] Matthew C Gombolay Cindy Huang and Julie A ShahCoordination of human-robot teaming with human taskpreferences In AAAI Fall Symposium Series on AI-HRIvolume 11 page 2015 2015

[18] Dylan Hadfield-Menell Stuart J Russell Pieter Abbeeland Anca Dragan Cooperative inverse reinforcementlearning In Advances in neural information processingsystems pages 3909ndash3917 2016

[19] Shervin Javdani Henny Admoni Stefania PellegrinelliSiddhartha S Srinivasa and J Andrew Bagnell Sharedautonomy via hindsight optimization for teleoperationand teaming The International Journal of RoboticsResearch page 0278364918776060 2018

[20] Takayuki Kanda Takayuki Hirano Daniel Eaton andHiroshi Ishiguro Interactive robots as social partners andpeer tutors for children A field trial Human-computerinteraction 19(1)61ndash84 2004

[21] Richard M Karp A simple derivation of edmondsrsquoalgorithm for optimum branchings Networks 1(3)265ndash272 1971

[22] Piyush Khandelwal Samuel Barrett and Peter StoneLeading the way An efficient multi-robot guidance sys-tem In Proceedings of the 2015 international conferenceon autonomous agents and multiagent systems pages1625ndash1633 International Foundation for AutonomousAgents and Multiagent Systems 2015

[23] Mykel J Kochenderfer Decision making under uncer-tainty theory and application MIT press 2015

[24] Hanna Kurniawati David Hsu and Wee Sun Lee SarsopEfficient point-based pomdp planning by approximatingoptimally reachable belief spaces In Robotics Scienceand systems volume 2008 Zurich Switzerland 2008

[25] Oliver Lemon Conversational interfaces In Data-DrivenMethods for Adaptive Spoken Dialogue Systems pages1ndash4 Springer 2012

[26] Michael L Littman and Peter Stone Leading best-response strategies in repeated games In In SeventeenthAnnual International Joint Conference on Artificial In-telligence Workshop on Economic Agents Models andMechanisms Citeseer 2001

[27] Owen Macindoe Leslie Pack Kaelbling and TomasLozano-Perez Pomcop Belief space planning for side-kicks in cooperative games In AIIDE 2012

[28] Stefanos Nikolaidis and Julie Shah Human-robot cross-training computational formulation modeling and eval-uation of a human team training strategy In Pro-ceedings of the 8th ACMIEEE international conferenceon Human-robot interaction pages 33ndash40 IEEE Press2013

[29] Stefanos Nikolaidis Anton Kuznetsov David Hsu andSiddharta Srinivasa Formalizing human-robot mutualadaptation A bounded memory model In The EleventhACMIEEE International Conference on Human Robot

Interaction pages 75ndash82 IEEE Press 2016[30] Stefanos Nikolaidis Swaprava Nath Ariel D Procaccia

and Siddhartha Srinivasa Game-theoretic modeling ofhuman adaptation in human-robot collaboration InProceedings of the 2017 ACMIEEE International Con-ference on Human-Robot Interaction pages 323ndash331ACM 2017

[31] Shayegan Omidshafiei Ali-Akbar Agha-MohammadiChristopher Amato and Jonathan P How Decentralizedcontrol of partially observable markov decision processesusing belief space macro-actions In Robotics and Au-tomation (ICRA) 2015 IEEE International Conferenceon pages 5962ndash5969 IEEE 2015

[32] Robert Platt Jr Russ Tedrake Leslie Kaelbling andTomas Lozano-Perez Belief space planning assumingmaximum likelihood observations 2010

[33] Ben Robins Kerstin Dautenhahn Rene Te Boekhorstand Aude Billard Effects of repeated exposure to ahumanoid robot on children with autism Designing amore inclusive world pages 225ndash236 2004

[34] Michael Rubenstein Adrian Cabrera Justin Werfel Gol-naz Habibi James McLurkin and Radhika Nagpal Col-lective transport of complex objects by simple robotstheory and experiments In Proceedings of the 2013 in-ternational conference on Autonomous agents and multi-agent systems pages 47ndash54 International Foundation forAutonomous Agents and Multiagent Systems 2013

[35] Dorsa Sadigh Safe and Interactive Autonomy ControlLearning and Verification PhD thesis University ofCalifornia Berkeley 2017

[36] Dorsa Sadigh S Shankar Sastry Sanjit A Seshia andAnca Dragan Information gathering actions over hu-man internal state In Proceedings of the IEEE RSJInternational Conference on Intelligent Robots and Sys-tems (IROS) pages 66ndash73 IEEE October 2016 doi101109IROS20167759036

[37] Dorsa Sadigh S Shankar Sastry Sanjit A Seshia andAnca D Dragan Planning for autonomous cars thatleverage effects on human actions In Proceedings ofRobotics Science and Systems (RSS) June 2016 doi1015607RSS2016XII029

[38] Dorsa Sadigh Nick Landolfi Shankar S Sastry Sanjit ASeshia and Anca D Dragan Planning for cars thatcoordinate with people leveraging effects on humanactions for planning and active information gatheringover human internal state Autonomous Robots 42(7)1405ndash1426 2018

[39] Michaela C Schippers Deanne N Den Hartog Paul LKoopman and Daan van Knippenberg The role oftransformational leadership in enhancing team reflexivityHuman Relations 61(11)1593ndash1616 2008

[40] Peter Stone Gal A Kaminka Sarit Kraus Jeffrey SRosenschein and Noa Agmon Teaching and lead-ing an ad hoc teammate Collaboration without pre-coordination Artificial Intelligence 20335ndash65 2013

[41] Fay Sudweeks and Simeon J Simoff Leading conversa-

tions Communication behaviours of emergent leaders invirtual teams In proceedings of the 38th annual Hawaiiinternational conference on system sciences pages 108andash108a IEEE 2005

[42] Daniel Szafir Bilge Mutlu and Terrence Fong Com-municating directionality in flying robots In 2015 10thACMIEEE International Conference on Human-RobotInteraction (HRI) pages 19ndash26 IEEE 2015

[43] Zijian Wang and Mac Schwager Force-amplifying n-robot transport system (force-ants) for cooperative planarmanipulation without communication The InternationalJournal of Robotics Research 35(13)1564ndash1586 2016

[44] Brian D Ziebart Andrew L Maas J Andrew Bagnell andAnind K Dey Maximum entropy inverse reinforcementlearning In AAAI volume 8 pages 1433ndash1438 ChicagoIL USA 2008

VII APPENDIX

A Potential Field for Simulated Human Planning

In our implementation the attractive potential field of at-traction i denoted as U i

att(q) is constructed as the square ofthe Euclidean distance ρi(q) between agent at location q andattraction i at location qi In this way the attraction increasesas the distance to goal becomes larger 983171 is the hyper-parameterfor controlling how strong the attraction is and has consistentvalue for all attractions

ρi(q) = 983042q minus qi983042U iatt(q) =

12983171ρi(q)

2

minusnablaU iatt(q) = minus983171ρi(q)(nablaρi(q))

The repulsive potential field U jrep(q) is used for obstacle

avoidance It usually has a limited effective radius since wedonrsquot want the obstacle to affect agentsrsquo planning if theyare far way from each other Our choice for U j

rep(q) has alimited range γ0 where the value is zero outside the rangeWithin distance γ0 the repulsive potential field increases as theagent approaches the obstacle Here we denote the minimumdistance from the agent to the obstacle j as γj(q) Coefficientη and range γ0 are the hyper-parameters for controlling howconservative we want our collision avoidance to be and isconsistent for all obstacles Larger values of η and γ0 meanthat we are more conservative with collision avoidance andwant the agent to keep a larger distance to obstacles

γj(q) = minqprimeisinobsj 983042q minus qprime983042

U jrep(q) =

983069 12η(

1γj(q)

minus 1γ0) γj(q) lt γ0

0 γj(q) gt γ0

nablaU jrep(q) =

983069η( 1

γj(q)minus 1

γ0)( 1

γj(q)2)nablaγ(q) γ(q)j lt γ0

0 γj(q) gt γ0

Page 9: Influencing Leading and Following in Human-Robot Teamsiliad.stanford.edu/pdfs/publications/kwon2019influencing.pdf · Running Example: Pursuit-Evasion Game. We define a multi-player

REFERENCES

[1] Pieter Abbeel and Andrew Y Ng Apprenticeship learn-ing via inverse reinforcement learning In Proceedingsof the twenty-first international conference on Machinelearning page 1 ACM 2004

[2] Ali-Akbar Agha-Mohammadi Suman Chakravorty andNancy M Amato Firm Sampling-based feedbackmotion-planning under motion uncertainty and imperfectmeasurements The International Journal of RoboticsResearch 33(2)268ndash304 2014

[3] Baris Akgun Maya Cakmak Karl Jiang and Andrea LThomaz Keyframe-based learning from demonstrationInternational Journal of Social Robotics 4(4)343ndash3552012

[4] Jimmy Lei Ba Jamie Ryan Kiros and Geoffrey EHinton Layer normalization arXiv preprintarXiv160706450 2016

[5] Kim Baraka Ana Paiva and Manuela Veloso Expressivelights for revealing mobile service robot state In Robot2015 Second Iberian Robotics Conference pages 107ndash119 Springer 2016

[6] Jerome Barraquand Bruno Langlois and J-C LatombeNumerical potential field techniques for robot path plan-ning IEEE transactions on systems man and cybernet-ics 22(2)224ndash241 1992

[7] Aaron Bestick Ruzena Bajcsy and Anca D Dragan Im-plicitly assisting humans to choose good grasps in robotto human handovers In International Symposium onExperimental Robotics pages 341ndash354 Springer 2016

[8] Erdem Biyik and Dorsa Sadigh Batch active preference-based learning of reward functions In Conference onRobot Learning (CoRL) October 2018

[9] Frank Broz Illah Nourbakhsh and Reid Simmons De-signing pomdp models of socially situated tasks In RO-MAN 2011 IEEE pages 39ndash46 IEEE 2011

[10] Sonia Chernova and Andrea L Thomaz Robot learningfrom human teachers Synthesis Lectures on ArtificialIntelligence and Machine Learning 8(3)1ndash121 2014

[11] Yoeng-Jin Chu On the shortest arborescence of adirected graph Science Sinica 141396ndash1400 1965

[12] D Scott DeRue Adaptive leadership theory Leadingand following as a complex adaptive process Researchin organizational behavior 31125ndash150 2011

[13] Anca D Dragan Dorsa Sadigh Shankar Sastry and San-jit A Seshia Active preference-based learning of rewardfunctions In Robotics Science and Systems (RSS) 2017

[14] Finale Doshi and Nicholas Roy Efficient model learningfor dialog management In Proceedings of the ACMIEEEinternational conference on Human-robot interactionpages 65ndash72 ACM 2007

[15] Jack Edmonds Optimum branchings Mathematics andthe Decision Sciences Part 1(335-345)25 1968

[16] Katie Genter Noa Agmon and Peter Stone Ad hocteamwork for leading a flock In Proceedings of the2013 international conference on Autonomous agents and

multi-agent systems pages 531ndash538 International Foun-dation for Autonomous Agents and Multiagent Systems2013

[17] Matthew C Gombolay Cindy Huang and Julie A ShahCoordination of human-robot teaming with human taskpreferences In AAAI Fall Symposium Series on AI-HRIvolume 11 page 2015 2015

[18] Dylan Hadfield-Menell Stuart J Russell Pieter Abbeeland Anca Dragan Cooperative inverse reinforcementlearning In Advances in neural information processingsystems pages 3909ndash3917 2016

[19] Shervin Javdani Henny Admoni Stefania PellegrinelliSiddhartha S Srinivasa and J Andrew Bagnell Sharedautonomy via hindsight optimization for teleoperationand teaming The International Journal of RoboticsResearch page 0278364918776060 2018

[20] Takayuki Kanda Takayuki Hirano Daniel Eaton andHiroshi Ishiguro Interactive robots as social partners andpeer tutors for children A field trial Human-computerinteraction 19(1)61ndash84 2004

[21] Richard M Karp A simple derivation of edmondsrsquoalgorithm for optimum branchings Networks 1(3)265ndash272 1971

[22] Piyush Khandelwal Samuel Barrett and Peter StoneLeading the way An efficient multi-robot guidance sys-tem In Proceedings of the 2015 international conferenceon autonomous agents and multiagent systems pages1625ndash1633 International Foundation for AutonomousAgents and Multiagent Systems 2015

[23] Mykel J Kochenderfer Decision making under uncer-tainty theory and application MIT press 2015

[24] Hanna Kurniawati David Hsu and Wee Sun Lee SarsopEfficient point-based pomdp planning by approximatingoptimally reachable belief spaces In Robotics Scienceand systems volume 2008 Zurich Switzerland 2008

[25] Oliver Lemon Conversational interfaces In Data-DrivenMethods for Adaptive Spoken Dialogue Systems pages1ndash4 Springer 2012

[26] Michael L Littman and Peter Stone Leading best-response strategies in repeated games In In SeventeenthAnnual International Joint Conference on Artificial In-telligence Workshop on Economic Agents Models andMechanisms Citeseer 2001

[27] Owen Macindoe Leslie Pack Kaelbling and TomasLozano-Perez Pomcop Belief space planning for side-kicks in cooperative games In AIIDE 2012

[28] Stefanos Nikolaidis and Julie Shah Human-robot cross-training computational formulation modeling and eval-uation of a human team training strategy In Pro-ceedings of the 8th ACMIEEE international conferenceon Human-robot interaction pages 33ndash40 IEEE Press2013

[29] Stefanos Nikolaidis Anton Kuznetsov David Hsu andSiddharta Srinivasa Formalizing human-robot mutualadaptation A bounded memory model In The EleventhACMIEEE International Conference on Human Robot

Interaction pages 75ndash82 IEEE Press 2016[30] Stefanos Nikolaidis Swaprava Nath Ariel D Procaccia

and Siddhartha Srinivasa Game-theoretic modeling ofhuman adaptation in human-robot collaboration InProceedings of the 2017 ACMIEEE International Con-ference on Human-Robot Interaction pages 323ndash331ACM 2017

[31] Shayegan Omidshafiei Ali-Akbar Agha-MohammadiChristopher Amato and Jonathan P How Decentralizedcontrol of partially observable markov decision processesusing belief space macro-actions In Robotics and Au-tomation (ICRA) 2015 IEEE International Conferenceon pages 5962ndash5969 IEEE 2015

[32] Robert Platt Jr Russ Tedrake Leslie Kaelbling andTomas Lozano-Perez Belief space planning assumingmaximum likelihood observations 2010

[33] Ben Robins Kerstin Dautenhahn Rene Te Boekhorstand Aude Billard Effects of repeated exposure to ahumanoid robot on children with autism Designing amore inclusive world pages 225ndash236 2004

[34] Michael Rubenstein Adrian Cabrera Justin Werfel Gol-naz Habibi James McLurkin and Radhika Nagpal Col-lective transport of complex objects by simple robotstheory and experiments In Proceedings of the 2013 in-ternational conference on Autonomous agents and multi-agent systems pages 47ndash54 International Foundation forAutonomous Agents and Multiagent Systems 2013

[35] Dorsa Sadigh Safe and Interactive Autonomy ControlLearning and Verification PhD thesis University ofCalifornia Berkeley 2017

[36] Dorsa Sadigh S Shankar Sastry Sanjit A Seshia andAnca Dragan Information gathering actions over hu-man internal state In Proceedings of the IEEE RSJInternational Conference on Intelligent Robots and Sys-tems (IROS) pages 66ndash73 IEEE October 2016 doi101109IROS20167759036

[37] Dorsa Sadigh S Shankar Sastry Sanjit A Seshia andAnca D Dragan Planning for autonomous cars thatleverage effects on human actions In Proceedings ofRobotics Science and Systems (RSS) June 2016 doi1015607RSS2016XII029

[38] Dorsa Sadigh Nick Landolfi Shankar S Sastry Sanjit ASeshia and Anca D Dragan Planning for cars thatcoordinate with people leveraging effects on humanactions for planning and active information gatheringover human internal state Autonomous Robots 42(7)1405ndash1426 2018

[39] Michaela C Schippers Deanne N Den Hartog Paul LKoopman and Daan van Knippenberg The role oftransformational leadership in enhancing team reflexivityHuman Relations 61(11)1593ndash1616 2008

[40] Peter Stone Gal A Kaminka Sarit Kraus Jeffrey SRosenschein and Noa Agmon Teaching and lead-ing an ad hoc teammate Collaboration without pre-coordination Artificial Intelligence 20335ndash65 2013

[41] Fay Sudweeks and Simeon J Simoff Leading conversa-

tions Communication behaviours of emergent leaders invirtual teams In proceedings of the 38th annual Hawaiiinternational conference on system sciences pages 108andash108a IEEE 2005

[42] Daniel Szafir Bilge Mutlu and Terrence Fong Com-municating directionality in flying robots In 2015 10thACMIEEE International Conference on Human-RobotInteraction (HRI) pages 19ndash26 IEEE 2015

[43] Zijian Wang and Mac Schwager Force-amplifying n-robot transport system (force-ants) for cooperative planarmanipulation without communication The InternationalJournal of Robotics Research 35(13)1564ndash1586 2016

[44] Brian D Ziebart Andrew L Maas J Andrew Bagnell andAnind K Dey Maximum entropy inverse reinforcementlearning In AAAI volume 8 pages 1433ndash1438 ChicagoIL USA 2008

VII APPENDIX

A Potential Field for Simulated Human Planning

In our implementation the attractive potential field of at-traction i denoted as U i

att(q) is constructed as the square ofthe Euclidean distance ρi(q) between agent at location q andattraction i at location qi In this way the attraction increasesas the distance to goal becomes larger 983171 is the hyper-parameterfor controlling how strong the attraction is and has consistentvalue for all attractions

ρi(q) = 983042q minus qi983042U iatt(q) =

12983171ρi(q)

2

minusnablaU iatt(q) = minus983171ρi(q)(nablaρi(q))

The repulsive potential field U^j_rep(q) is used for obstacle avoidance. It has a limited effective radius, since we do not want an obstacle to affect the agents' planning when they are far away from each other. Our choice of U^j_rep(q) has a limited range γ_0: its value is zero outside this range, and within distance γ_0 the repulsive potential increases as the agent approaches the obstacle. We denote the minimum distance from the agent to obstacle j as γ_j(q). The coefficient η and the range γ_0 are the hyper-parameters controlling how conservative we want collision avoidance to be, and they take the same values for all obstacles. Larger values of η and γ_0 mean that we are more conservative about collision avoidance and want the agent to keep a larger distance from obstacles.

\[
\gamma_j(q) = \min_{q' \in \mathrm{obs}_j} \|q - q'\|
\]
\[
U^j_{\mathrm{rep}}(q) =
\begin{cases}
\tfrac{1}{2}\,\eta \left( \dfrac{1}{\gamma_j(q)} - \dfrac{1}{\gamma_0} \right)^2 & \gamma_j(q) < \gamma_0 \\[6pt]
0 & \gamma_j(q) \ge \gamma_0
\end{cases}
\]
\[
-\nabla U^j_{\mathrm{rep}}(q) =
\begin{cases}
\eta \left( \dfrac{1}{\gamma_j(q)} - \dfrac{1}{\gamma_0} \right) \dfrac{1}{\gamma_j(q)^2}\, \nabla\gamma_j(q) & \gamma_j(q) < \gamma_0 \\[6pt]
0 & \gamma_j(q) \ge \gamma_0
\end{cases}
\]
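Under the same assumptions, a minimal sketch of the repulsive term and of one simulated-human planning step (summing the negative gradients of all attractions and obstacles and taking a small step) could look as follows. The names repulsive_gradient and plan_step, the obstacle-point representation, and the step_size default are illustrative choices rather than the paper's implementation; plan_step reuses attractive_gradient from the sketch above.

```python
import numpy as np

def repulsive_gradient(q, obstacle_points, eta=1.0, gamma0=2.0):
    """Negative gradient of the repulsive potential for a single obstacle j.

    obstacle_points: (N, d) array of points sampled on obstacle j, so that
    gamma_j(q) is the minimum distance from q to any of these points.
    """
    diffs = q - obstacle_points
    dists = np.linalg.norm(diffs, axis=1)
    k = int(np.argmin(dists))
    gamma = dists[k]
    # Outside the effective radius gamma0 (or exactly on the obstacle) the field contributes nothing.
    if gamma >= gamma0 or gamma == 0.0:
        return np.zeros_like(q)
    grad_gamma = diffs[k] / gamma  # unit vector pointing away from the closest obstacle point
    # -grad(U_rep) = eta * (1/gamma - 1/gamma0) * (1/gamma^2) * grad(gamma): a push away from the obstacle.
    return eta * (1.0 / gamma - 1.0 / gamma0) * (1.0 / gamma**2) * grad_gamma

def plan_step(q, attractions, obstacles, eps=1.0, eta=1.0, gamma0=2.0, step_size=0.1):
    """One gradient-descent step on the combined potential (step_size is a hypothetical hyper-parameter)."""
    force = np.zeros_like(q)
    for q_i in attractions:
        force += attractive_gradient(q, q_i, eps)          # defined in the sketch above
    for obs in obstacles:
        force += repulsive_gradient(q, obs, eta, gamma0)
    return q + step_size * force
```

Consistent with the discussion above, increasing eta or gamma0 in repulsive_gradient makes the simulated agent more conservative, keeping it farther from obstacles during planning.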
