
arXiv:2012.07195v1 [cs.AI] 14 Dec 2020

Efficient Querying for Cooperative Probabilistic Commitments

Qi Zhang1, Edmund H. Durfee2, Satinder Singh2

1 Artificial Intelligence Institute, University of South Carolina
2 Computer Science and Engineering, University of Michigan

[email protected], [email protected], [email protected]

Abstract

Multiagent systems can use commitments as the core of a general coordination infrastructure, supporting both cooperative and non-cooperative interactions. Agents whose objectives are aligned, and where one agent can help another achieve greater reward by sacrificing some of its own reward, should choose a cooperative commitment to maximize their joint reward. We present a solution to the problem of how cooperative agents can efficiently find an (approximately) optimal commitment by querying about carefully-selected commitment choices. We prove structural properties of the agents' values as functions of the parameters of the commitment specification, and develop a greedy method for composing a query with provable approximation bounds, which we empirically show can find nearly optimal commitments in a fraction of the time required by methods that lack our insights.

1 Introduction

Commitments are a proven approach to multiagent coordination (Singh 2012; Cohen and Levesque 1990; Castelfranchi 1995; Mallya and Huhns 2003; Chesani et al. 2013; Al-Saqqar et al. 2014). Through commitments, agents know more about what to expect from others, and thus can plan actions with higher confidence of success. That said, commitments are generally uncertain: an agent might abandon a commitment if it discovers that it cannot achieve what it promised, or that it prefers to achieve something else, or that others will not uphold their side of the commitment (Jennings 1993; Xing and Singh 2001; Winikoff 2006).

One way to deal with commitment uncertainty is to institute protocols so participating agents are aware of the status of commitments through their lifecycles (Venkatraman and Singh 1999; Xing and Singh 2001; Yolum and Singh 2002; Fornara and Colombetti 2008; Baldoni et al. 2015; Gunay, Liu, and Zhang 2016; Pereira, Oren, and Meneguzzi 2017; Dastani, van der Torre, and Yorke-Smith 2017). Another has been to qualify commitments with conditional statements about what must (not) be true in the environment for the commitment to be fulfilled (Singh 2012; Agotnes, Goranko, and Jamroga 2007; Vokrínek, Komenda, and Pechoucek 2009). When such conditions might not be fully observable to all agents, agents might summarize the likelihood of the conditions being satisfied in the form of a probabilistic commitment (Kushmerick, Hanks, and Weld 1994; Xuan and Lesser 1999; Witwicki and Durfee 2007).

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Our focus is the process by which agents choose a probabilistic commitment, which serves as a probabilistic promise from one agent (the provider) to another (the recipient) about establishing a precondition for the recipient's preferred actions/objectives. We formalize how the space of probabilistic commitments for the precondition captures different tradeoffs between timing and likelihood, where in general the recipient gets higher reward from earlier timing and/or higher likelihood, while the provider prefers later timing and/or lower likelihood because these leave it less constrained when optimizing its own policy. Thus, when agents agree to work together (e.g., Han, Pereira, and Lenaerts 2017), forming a commitment generally involves a negotiation (Kraus 1997; Aknine, Pinson, and Shakun 2004; Rahwan 2004).

Sometimes, however, a pair of agents might have objectives/payoffs that are aligned/shared. For example, they might be a chef and waiter working in a restaurant (see Section 6.2). In a commitment-based coordination framework, such agents should find a cooperative probabilistic commitment, whose timing and likelihood maximize their joint (summed) reward in expectation. Decomposing the joint reward into local, individual rewards is common elsewhere as well, as in the Dec-POMDP literature (Oliehoek, Amato et al. 2016) and multi-agent reinforcement learning (Zhang et al. 2018). This optimization problem is complicated by two main factors: i) the information relevant to optimization is distributed, and thus the agents need to exchange knowledge, preferably with low communication cost; and ii) the space of possible timing/probability combinations is large and evaluating a combination (requiring each agent to compute an optimal policy) is expensive, and thus identifying a desirable probabilistic commitment is computationally challenging even with perfect centralized information.

The main contribution of this paper is an approach that addresses both challenges for cooperative agents to efficiently converge on an approximately-optimal probabilistic commitment. To address i), our approach adopts a decentralized, query-based protocol for the agents to exchange knowledge effectively with low communication cost. To get the efficiency for ii), we prove the existence of structural properties in the agents' value functions, and show that these can be provably exploited by the query-based protocol.

2 Related Work

Commitments are a widely-adopted framework for multiagent coordination (Kushmerick, Hanks, and Weld 1994; Xuan and Lesser 1999; Singh 2012). We build on prior research on probabilistic commitments (Xuan and Lesser 1999; Bannazadeh and Leon-Garcia 2010), where the timing and likelihood of achieving a desired outcome are explicitly specified. Choosing a probabilistic commitment thus corresponds to searching over the combinatorial space of possible commitment times and probabilities. Prior work on such search largely relies on heuristics. Witwicki and Durfee (2007) propose the first probabilistic commitment search algorithm, which initializes a set of commitments and then performs local adjustments on time and probability. Later work (Witwicki and Durfee 2009; Oliehoek, Witwicki, and Kaelbling 2012) further incorporates the commitment's feasibility and the best response strategy (Nair et al. 2003) to guide the search. In contrast to these heuristic approaches, in this paper we analytically reveal the structure of the commitment space, which enables efficient search that provably finds the optimal commitment.

Because the search process is decentralized, it will involve message passing. The message passing between our decision-theoretic agents serves the purpose of preference elicitation, which is typically framed in terms of an agent querying another about which from among a set of choices it most prefers (Chajewska, Koller, and Parr 2000; Boutilier 2002; Viappiani and Boutilier 2010). We adopt such a querying protocol as a means for information exchange between the agents. In particular, we draw on recent work that uses value-of-information concepts to formulate multiple-choice queries (Viappiani and Boutilier 2010; Cohn, Singh, and Durfee 2014; Zhang, Durfee, and Singh 2017), but as we will explain, we augment prior approaches by annotating offered choices with the preferences of the agent posing the query. Moreover, we prove several characteristic properties of agents' commitment value functions, which enable efficient formulation of near-optimal queries.

3 Decision-Theoretic Commitments

The provider's and recipient's environments are modeled as two separate Markov Decision Processes (MDPs). An MDP is defined as $M = (S, A, P, R, H, s_0)$ where $S$ is the finite state space, $A$ the finite action space, $P : S \times A \to \Delta(S)$ the transition function ($\Delta(S)$ denotes the set of all probability distributions over $S$), $R : S \times A \to \mathbb{R}$ the reward function, $H$ the finite horizon, and $s_0$ the initial state. The state space is partitioned into disjoint sets by time step, $S = \bigcup_{h=0}^{H} S_h$, where states in $S_h$ only transition to states in $S_{h+1}$. The MDP starts in $s_0$ and ends in $S_H$. Given a policy $\pi : S \to \Delta(A)$, a random sequence of transitions $\{(s_h, a_h, r_h, s_{h+1})\}_{h=0}^{H-1}$ is generated by $a_h \sim \pi(s_h)$, $r_h = R(s_h, a_h)$, $s_{h+1} \sim P(s_h, a_h)$. The value function of $\pi$ is $V^\pi_M(s) = \mathbb{E}[\sum_{h'=h}^{H-1} r_{h'} \mid \pi, s_h = s]$ where $h$ is such that $s \in S_h$. The optimal policy $\pi^*_M$ maximizes $V^\pi_M$ for all $s \in S$, with value function $V^{\pi^*_M}_M$ abbreviated as $V^*_M$.

Superscripts $p$ and $r$ denote the provider and recipient, respectively. Thus, the provider's MDP is $M^p$, and the recipient's MDP is $M^r$, sharing the horizon $H = H^p = H^r$. We assume that the two MDPs are weakly coupled in one direction, in the sense that the provider's actions might affect certain aspects of the recipient's state but not the other way around. As one way to model such an interaction, we adopt the Transition-Decoupled POMDP (TD-POMDP) framework (Witwicki and Durfee 2010). Formally, both the provider's state $s^p$ and the recipient's state $s^r$ can be factored into state features. The provider fully controls its state features. The recipient's state can be factored as $s^r = (l^r, u)$, where $l^r$ is the set of all the recipient's state features locally controlled by the recipient, and $u$ is the set of state features uncontrollable by the recipient but shared with the provider, i.e. $u = s^p \cap s^r$. Formally, the dynamics of the recipient's state factor as $P^r = (P^r_l, P^r_u)$:

$$P^r(s^r_{h+1} \mid s^r_h, a^r_h) = P^r\big((l^r_{h+1}, u_{h+1}) \mid (l^r_h, u_h), a^r_h\big) = P^r_u(u_{h+1} \mid u_h)\, P^r_l\big(l^r_{h+1} \mid (l^r_h, u_h), a^r_h\big),$$

where the dynamics of $u$, $P^r_u$, is controlled only by the provider's policy (i.e., it is not a function of $a^r_h$). Prior work refers to $P^r_u$ as the influence (Witwicki and Durfee 2010; Oliehoek, Witwicki, and Kaelbling 2012) that the provider exerts on the recipient's environment. In this paper, we focus on the setting where $u$ contains a single binary state feature, $u \in \{u^-, u^+\}$, with $u$ initially taking the value $u^-$. Intuitively, $u^+$ ($u^-$) stands for an enabled (disabled) precondition needed by the recipient, and the provider commits to enabling the precondition. Further, we focus on a scenario where the flipping is permanent (Hindriks and van Riemsdijk 2007; Witwicki and Durfee 2009; Zhang et al. 2016). That is, once feature $u$ flips to $u^+$, the precondition is permanently established and will not revert back to $u^-$.

The provider's commitment semantics. Borrowing from the literature (Witwicki and Durfee 2007; Zhang et al. 2016), we define a probabilistic commitment w.r.t. the shared feature $u$ via a tuple $c = (T, p)$, where $T$ is the commitment time and $p$ is the commitment probability. The provider's commitment semantics is to follow a policy $\pi^p$ that, starting from initial state $s^p_0$ (in which $u$ is $u^-$), sets $u$ to $u^+$ by time step $T$ with at least probability $p$:

$$\Pr(u^+ \in s^p_T \mid s^p_0, \pi^p) \ge p. \quad (1)$$

For a commitment $c$, let $\Pi^p(c)$ be the set of all possible provider policies respecting the commitment semantics (Eq. (1)). We call commitment $c$ feasible if and only if $\Pi^p(c)$ is non-empty. For a given commitment time $T$, there is a maximum feasible probability $p(T) \le 1$ such that commitment $(T, p)$ is feasible if and only if $p \le p(T)$, and $p(T)$ can be computed by solving the provider's MDP with the reward function modified to give $+1$ reward for states where the commitment is realized at $T$, and $0$ otherwise. This is because maximizing this reward is equivalent to maximizing the probability of realizing the commitment at time step $T$, and thus the optimal initial state value is the maximum feasible probability $p(T)$.
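As an illustration of this construction, the sketch below (ours; it assumes the FiniteHorizonMDP layout above and a set u_plus_states of state indices in which $u = u^+$) computes $p(T)$ by backward induction with the modified 0/+1 reward.

```python
import numpy as np

def max_feasible_probability(mdp, T, u_plus_states):
    """p(T): the maximum probability of having realized the commitment by T.

    Equivalent to planning in the provider's MDP with reward +1 for states
    where the commitment is realized at T and 0 otherwise; the optimal
    initial state value is then exactly that probability.
    """
    V = np.zeros(mdp.nS)
    V[list(u_plus_states)] = 1.0          # indicator of u = u+ at step T
    for _ in range(T):                    # back up from step T to step 0
        V = (mdp.P @ V).max(axis=1)       # V_h(s) = max_a E[V_{h+1}(s')]
    return V[mdp.s0]
```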

Given a feasible $c$, the provider's optimal policy maximizes the value of its initial state under its original reward function while respecting the commitment semantics:

$$v^p(c) = \max_{\pi^p \in \Pi^p(c)} V^{\pi^p}_{M^p}(s^p_0). \quad (2)$$

We call $v^p(c)$ the provider's commitment value function, and $\pi^p(c)$ denotes the provider's policy maximizing Eq. (2).

The recipient's commitment modeling. Abstracting the provider's influence using a single time/probability pair reduces the complexity and communication between the two agents, and prior work has also shown that such abstraction, by leaving other time steps unconstrained, helps the provider handle uncertainty in its environment (Zhang et al. 2016; Zhang, Durfee, and Singh 2020b). Specifying just a single time/probability pair, however, increases the uncertainty of the recipient. Given commitment $c$, the recipient creates an approximation $P^r_u(c)$ of influence $P^r_u$, where $P^r_u(c)$ hypothesizes the flipping probabilities at other time steps. Formally, given $P^r_u(c)$, let $M^r(c)$ be the recipient's approximate model that differs from $M^r$ only in terms of the dynamics of $u$. The recipient's value of commitment $c$ is defined to be the optimal value of the initial state in $M^r(c)$:

$$v^r(c) = \max_{\pi^r \in \Pi^r} V^{\pi^r}_{M^r(c)}(s^r_0). \quad (3)$$

We call $v^r(c)$ the recipient's commitment value function, and $\pi^r(c)$ the recipient's policy maximizing Eq. (3).

Previous work (Witwicki and Durfee 2010; Zhang, Durfee, and Singh 2020a) has chosen an intuitive and straightforward strategy for the recipient to create $P^r_u(c)$, which models the flipping with a single branch at the commitment time with the commitment probability. In this paper, we adopt this commitment modeling strategy in Eq. (3) for the recipient, where the strategy determines the transition function of $M^r(c)$ through $P^r_u(c)$.

The optimal commitment. Let $\mathcal{T} = \{1, 2, \ldots, H\}$ be the space of possible commitment times, $[0, 1]$ be the continuous commitment probability space, and $v^{p+r} = v^p + v^r$ be the joint commitment value function. The optimal commitment is a feasible commitment that maximizes the joint value, i.e.

$$c^* = \arg\max_{\text{feasible } c \in \mathcal{T} \times [0,1]} v^{p+r}(c). \quad (4)$$

Since commitment feasibility is a constraint for all our optimization problems, for notational simplicity we omit it for the rest of this paper. A naive strategy for solving the problem in Eq. (4) is to discretize the commitment probability space, and evaluate every feasible commitment in the discretized space. The finer the discretization, the better the solution will be; at the same time, the finer the discretization, the larger the computational cost of evaluating all the possible commitments. Next, we prove structural properties of the provider's and the recipient's commitment value functions that enable us to develop algorithms that efficiently search for the exact optimal commitment.
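For reference, the naive strategy is easy to state in code. In this sketch (ours), provider_value, recipient_value, and feasible are assumed callables implementing Eqs. (2), (3), and the feasibility test:

```python
def naive_optimal_commitment(H, n, provider_value, recipient_value, feasible):
    """Grid search for Eq. (4): T in {1..H}, p in {0, 1/n, ..., 1}."""
    best_c, best_v = None, float("-inf")
    for T in range(1, H + 1):
        for i in range(n + 1):
            p = i / n
            if not feasible(T, p):
                continue
            v = provider_value(T, p) + recipient_value(T, p)  # v^{p+r}(c)
            if v > best_v:
                best_c, best_v = (T, p), v
    return best_c, best_v
```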

4 Commitment Space Structure

4.1 Properties of the Commitment Values

We show that, as functions of the commitment probability, both commitment value functions are monotonic and piecewise linear; the provider's commitment value function is concave, and the recipient's is convex. Proofs of all the theorems and lemmas are included in the appendix.

Theorem 1. Let $v^p(c) = v^p(T, p)$ be the provider's commitment value as defined in Eq. (2). For any fixed commitment time $T$, $v^p(T, p)$ is monotonically non-increasing, concave, and piecewise linear in $p$.

We introduce Assumption 1, which formalizes the notion that $u^+$, as opposed to $u^-$, is the value of $u$ that is desirable for the recipient, and then state the properties of the recipient's commitment value function in Theorem 2.

Assumption 1. Let $M^{r+}$ ($M^{r-}$) be defined as the recipient's MDP identical to $M^r$ except that $u$ is always set to $u^+$ ($u^-$). For any $M^r$ and any locally-controlled feature $l^r$, letting $s^{r+} = (l^r, u^+)$ and $s^{r-} = (l^r, u^-)$, we assume $V^*_{M^{r-}}(s^{r-}) \le V^*_{M^{r+}}(s^{r+})$.

Theorem 2. Let $v^r(c) = v^r(T, p)$ be the recipient's commitment value as defined in Eq. (3). For any fixed commitment time $T$, under Assumption 1, $v^r(T, p)$ is monotonically non-decreasing, convex, and piecewise linear in $p$.

4.2 Efficient Optimal Commitment Search

As an immediate consequence of Theorems 1 and 2, the joint commitment value is piecewise linear in the probability, and any local maximum for a fixed commitment time $T$ can be attained by a probability at the extremes of zero and $p(T)$, or where the slope of the provider's commitment value function changes. We refer to these probabilities as the provider's linearity breakpoints, or breakpoints for short. Therefore, one can solve the problem in Eq. (4) to find an optimal commitment by searching only over these breakpoints, as formally stated in Theorem 3.

Theorem 3. Let $\mathcal{P}(T)$ be the provider's breakpoints for a fixed commitment time $T$. Let $C = \{(T, p) : T \in \mathcal{T}, p \in \mathcal{P}(T)\}$ be the set of commitments in which the probability is a provider's breakpoint. We have

$$\max_{c \in \mathcal{T} \times [0,1]} v^{p+r}(c) = \max_{c \in C} v^{p+r}(c).$$

Further, the property of convexity/concavity assures that, for any commitment time, the commitment value function is linear in a probability interval $[p_l, p_u]$ if and only if the value of an intermediate commitment probability $p_m \in (p_l, p_u)$ is the linear interpolation of the two extremes. This enables us to adopt the binary search procedure in Algorithm 1 to efficiently identify the provider's breakpoints. For any fixed commitment time $T$, the strategy first computes the maximum feasible probability $p(T)$. Beginning with the entire interval $[p_l, p_u] = [0, p(T)]$, it recursively checks the linearity of an interval by checking the middle point, $p_m = (p_l + p_u)/2$. The recursion continues with the two halves, $[p_l, p_m]$ and $[p_m, p_u]$, only if the commitment value


Algorithm 1: Binary search for breakpoints
Input: The provider's $M^p$, commitment time $T$.
Output: $\mathcal{P}(T)$: the provider's breakpoints for $T$.
1: $p(T) \leftarrow$ the maximum feasible probability for $T$
2: $q \leftarrow$ a FIFO queue of probability intervals
3: $q$.push($[0, p(T)]$)
4: Compute and save the provider's commitment values for $p = 0$ and $p = p(T)$, i.e. $v^p(T, 0)$ and $v^p(T, p(T))$
5: Initialize $\mathcal{P}(T) \leftarrow \{\}$
6: while $q$ not empty do
7:   $[p_l, p_u] \leftarrow q$.pop(); $\mathcal{P}(T) \leftarrow \mathcal{P}(T) \cup \{p_l, p_u\}$
8:   $p_m \leftarrow (p_l + p_u)/2$; compute and save $v^p(T, p_m)$
9:   if $v^p(T, p_m)$ is not the linear interpolation of $v^p(T, p_l)$ and $v^p(T, p_u)$ then
10:    $q$.push($[p_l, p_m]$); $q$.push($[p_m, p_u]$)
11:  end if
12: end while

function is verified to be nonlinear in the interval $[p_l, p_u]$. Stepping through $T \in \mathcal{T}$ and doing the above binary search for each will find all probability breakpoint commitments $C$.
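A direct Python transcription of Algorithm 1 might look as follows (a sketch under our assumptions: v_p is a callable implementing $p \mapsto v^p(T, p)$, p_max is $p(T)$, and tol absorbs numerical error in the interpolation test):

```python
from collections import deque

def breakpoints(v_p, p_max, tol=1e-9):
    """Binary search for the linearity breakpoints of v_p on [0, p_max].

    By concavity/piecewise linearity, v_p is linear on [pl, pu] iff the
    midpoint value equals the linear interpolation of the endpoint values.
    """
    vals = {0.0: v_p(0.0), p_max: v_p(p_max)}   # cache of evaluated values
    pts = set()
    q = deque([(0.0, p_max)])                   # FIFO queue of intervals
    while q:
        pl, pu = q.popleft()
        pts.update((pl, pu))
        pm = (pl + pu) / 2.0
        vals[pm] = v_p(pm)
        if abs(vals[pm] - (vals[pl] + vals[pu]) / 2.0) > tol:
            q.append((pl, pm))                  # nonlinear: split the interval
            q.append((pm, pu))
    return sorted(pts)
```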

This allows for an efficient centralized procedure to search for the optimal commitment: construct $C$ as just described, compute the value of each $c \in C$ for both the provider and recipient, and return the $c$ with the highest summed value. We will use it to benchmark the decentralized algorithms we develop in Section 5.
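Putting the pieces together, the centralized benchmark can be sketched as follows (ours, reusing the hypothetical breakpoints helper above; v_p_of(T) and v_r_of(T) are assumed to return callables for $p \mapsto v^p(T, p)$ and $p \mapsto v^r(T, p)$, and p_max_of(T) returns $p(T)$):

```python
def centralized_optimal_commitment(H, v_p_of, v_r_of, p_max_of):
    """Solve Eq. (4) by searching only breakpoint commitments (Theorem 3)."""
    best_c, best_v = None, float("-inf")
    for T in range(1, H + 1):
        v_p, v_r = v_p_of(T), v_r_of(T)
        for p in breakpoints(v_p, p_max_of(T)):
            v = v_p(p) + v_r(p)               # joint value v^{p+r}(T, p)
            if v > best_v:
                best_c, best_v = (T, p), v
    return best_c, best_v
```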

5 Commitment Queries

We now develop a querying approach for eliciting the jointly-preferred (cooperative) commitment in a decentralized setting where neither agent has full knowledge about the other's environment. In our querying approach, one agent poses a commitment query consisting of information about a set of feasible commitments, and the other responds by selecting the commitment from the set that best satisfies their joint preferences. To limit communication cost and response time, the set of commitments in the query is kept small. A query poser thus should optimize its choices of commitments to include, and the responder's choice should reflect joint value. In general, either the provider or the recipient could be responsible for posing the query, and the other for responding; in future work we will consider how these roles could be dynamically assigned. In this paper, though, we always assign the provider to be the query poser and the recipient to be the responder. We do this because the agents must assuredly be able to adopt the responder's selected choice, which means it must be feasible, and per Section 3, only the provider knows which commitments are feasible.

Specifically, we consider a setting where the provider fully knows its MDP, and where its uncertainty about the recipient's MDP is modeled as a distribution $\mu$ over a finite set of $N$ candidate MDPs containing the recipient's true MDP. Given uncertainty $\mu$, the Expected Utility (EU) of a feasible commitment $c$ is defined as:

$$EU(c; \mu) = \mathbb{E}_\mu\!\left[v^{p+r}(c)\right], \quad (5)$$

where the expectation is w.r.t. the uncertainty about the recipient's MDP. If the provider had to singlehandedly select a commitment based on its uncertainty $\mu$, the best commitment is the one that maximizes the expected utility:

$$c^*(\mu) = \arg\max_c EU(c; \mu). \quad (6)$$

But through querying, the provider is given a chance to refine its knowledge about the recipient's actual MDP. Formally, the provider's commitment query $Q$ consists of a finite number $k = |Q|$ of feasible commitments. The provider offers these choices to the recipient, annotating each choice with the expected local value of its optimal policy respecting the commitment (Eq. (2)). The recipient computes (using Eq. (3)) its own expected value for each commitment offered in the query, and adds that to the annotated value from the provider. It responds with the commitment that maximizes the summed value (with ties broken by selecting the smallest indexed), which becomes the commitment the two agents agree on. Therefore, our motivation for a small query size $k$ is two-fold: it avoids large communication cost, and it induces a short response time for the recipient evaluating each commitment in the query.

More formally, let $Q \rightsquigarrow c$ denote the recipient's response that selects $c \in Q$. With the provider's prior uncertainty $\mu$, the posterior distribution given the response is denoted as $\mu \mid Q \rightsquigarrow c$, which can be computed by Bayes' rule. When the query size $k = |Q|$ is limited, the response usually cannot fully resolve the provider's uncertainty. In that case, the value of a query $Q$ is the EU with respect to the posterior distribution, averaged over all the commitments in the query being a possible response; consistent with prior work (Viappiani and Boutilier 2010), we refer to it as the query's Expected Utility of Selection (EUS):

$$EUS(Q; \mu) = \mathbb{E}_{Q \rightsquigarrow c;\, \mu}\left[EU(c; \mu \mid Q \rightsquigarrow c)\right].$$

Here, the expectation is with respect to the recipient's response under $\mu$. The provider's querying problem thus is to formulate a query $Q \subseteq \mathcal{T} \times [0, 1]$ consisting of $|Q| = k$ feasible commitments that maximizes EUS:

$$\max_{Q \subseteq \mathcal{T} \times [0,1],\, |Q| = k} EUS(Q; \mu). \quad (7)$$
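Because the recipient's response is deterministic given its true MDP (the noiseless response model; see the proof of Theorem 4), one can check that EUS has a simple closed form: candidate MDP $i$ selects $\arg\max_{c \in Q}[v^p(c) + v^r_i(c)]$, so $EUS(Q; \mu) = \sum_i \mu_i \max_{c \in Q}[v^p(c) + v^r_i(c)]$. A sketch (ours; v_p and each element of v_r_list are assumed dicts mapping commitments to values):

```python
def eus(query, mu, v_p, v_r_list):
    """EUS(Q; mu) under the noiseless response model.

    query    : iterable of commitments c, e.g. (T, p) tuples
    mu       : probabilities over the N candidate recipient MDPs
    v_p      : provider's commitment values v^p(c)
    v_r_list : v_r_list[i][c] is the recipient's value v^r_i(c) under MDP i
    """
    total = 0.0
    for mu_i, v_r in zip(mu, v_r_list):
        # Candidate i responds with the commitment maximizing the annotated
        # provider value plus its own value.
        total += mu_i * max(v_p[c] + v_r[c] for c in query)
    return total
```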

Importantly, we can show that $EUS(Q; \mu)$ is a submodular function of $Q$, as formally stated in Theorem 4. Submodularity serves as the basis for a greedy optimization algorithm (Nemhauser, Wolsey, and Fisher 1978), which we will describe after Theorem 5.

Theorem 4. For any uncertainty $\mu$, $EUS(Q; \mu)$ is a submodular function of $Q$.

Submodularity means that adding a commitment to the query can increase the EUS, but the increase diminishes with the size of the query. An upper bound on the EUS of any query of any size is attained when $k \ge N$, since then the query can include the optimal commitment for each candidate recipient MDP, i.e.

$$\overline{EUS} = \mathbb{E}_\mu\!\left[\max_{c \in \mathcal{T} \times [0,1]} v^{p+r}(c)\right]. \quad (8)$$

As the objective of Eq. (7) increases with size $k$, in practice the agents could choose $k$ large enough to meet some predefined EUS. We will empirically investigate the effect of the choice of $k$ in Section 6.


Structure of the Commitment Query Space. Due to the properties of the individual commitment value functions proved in Section 4, the expected utility $EU(c; \mu)$ defined in Eq. (5), as calculated by the provider alone, becomes a summation of the non-increasing provider's commitment value function and the (provider-computed) weighted average of the non-decreasing recipient's commitment value functions. By the same reasoning as for Theorem 3, the optimality of the breakpoint commitments generalizes to any uncertainty, as formalized in Lemma 1.

Lemma 1. Let $C$ be defined as in Theorem 3. We have $\max_{c \in \mathcal{T} \times [0,1]} EU(c; \mu) = \max_{c \in C} EU(c; \mu)$.

As a consequence of Lemma 1, for EUS maximization there is no loss in considering only the provider's breakpoints, as formally stated in Theorem 5.

Theorem 5. For any query size $k$ and uncertainty $\mu$, we have

$$\max_{Q \subseteq \mathcal{T} \times [0,1],\, |Q| = k} EUS(Q; \mu) = \max_{Q \subseteq C,\, |Q| = k} EUS(Q; \mu).$$

Theorem 5 enables an efficient procedure for solving the query formulation problem (Eq. (7)). The provider first identifies its breakpoint commitments $C$ and evaluates them for its MDP and each of the $N$ candidate recipient MDPs. Due to the concavity and convexity properties, $C$ can be identified and evaluated efficiently with the binary search strategy we described in Section 4.2. Finally, a size-$k$ query is formulated from commitments $C$ that solves the EUS maximization problem either exactly with exhaustive search, or approximately with greedy search (Viappiani and Boutilier 2010; Cohn, Singh, and Durfee 2014). The greedy search begins with $Q_0$ as an empty set and iteratively performs $Q_i \leftarrow Q_{i-1} \cup \{c_i\}$ for $i = 1, \ldots, k$, where $c_i = \arg\max_{c \in C,\, c \notin Q_{i-1}} EUS(Q_{i-1} \cup \{c\}; \mu)$. Since EUS is a submodular function of the query (Theorem 4), the greedily-formed size-$k$ query $Q_k$ is within a factor of $1 - \left(\frac{k-1}{k}\right)^k$ of the optimal EUS (Nemhauser, Wolsey, and Fisher 1978).

6 Empirical Evaluation

Our empirical evaluations focus on these questions:
• For EUS maximization, how effective and efficient is the breakpoints discretization compared with alternatives?
• For EUS maximization, how effective and efficient is greedy query search compared with exhaustive search?

To answer these questions, in Section 6.1 we conduct empirical evaluations in synthetic MDPs with minimal assumptions on the structure of transition and reward functions, and in Section 6.2 we use an environment inspired by the video game Overcooked to evaluate the breakpoints discretization and the greedy query search in a more grounded and structured domain.

6.1 Synthetic MDPs

The provider's environment is a randomly-generated MDP. It has 10 states the provider can be in at any time step, one of which is an absorbing state denoted as $s^+$, and the initial state is chosen from the non-absorbing states. Feature $u$ takes the value $u^+$ only in the absorbing state, i.e. $u^+ \in s^p$ if and only if $s^p = s^+$. There are 3 actions. For each state-action pair $(s^p, a^p)$ where $s^p \neq s^+$, the transition function $P^p(\cdot \mid s^p, a^p)$ is determined independently by filling the 10 entries with values uniformly drawn from $[0, 1]$ and normalizing $P^p(\cdot \mid s^p, a^p)$. The reward $R^p(s^p, a^p)$ for a non-absorbing state $s^p \neq s^+$ is sampled uniformly and independently from $[0, 1]$, and for the absorbing state $s^p = s^+$ it is zero. Thus, the random MDPs are intentionally generated to introduce a tension for the provider between helping the recipient (but getting no further local reward) and accumulating more local reward. (Our algorithms also work fine in cases without this tension, but the commitment search is less interesting without it because no compromise is needed.)

The recipient's environment is a one-dimensional space with 10 locations represented as integers $\{0, 1, \ldots, 9\}$. In locations 1-8, the recipient can move right, move left, or stay still. Once the recipient reaches either end (location 0 or 9), it stays there. There is a gate between locations 0 and 1, for which $u = u^+$ denotes the gate being open and $u = u^-$ closed. Initially, the gate is closed and the recipient starts at an initial location $L_0$. A negative reward of $-10$ is incurred by bumping into the closed gate. For each time step the recipient is at neither end, it gets a reward of $-1$. If it reaches the left end (i.e. location 0), it gets a one-time reward of $r_0 > 0$. The recipient gets a reward of 0 if it reaches the right end. In a specific instantiation, $L_0$ and $r_0$ are fixed. $L_0$ is randomly chosen from locations 1-8 and $r_0$ from the interval $(0, 10)$ to create various MDPs for the recipient.
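For concreteness, a minimal sketch (ours; the action encoding and the exact handling of the gate bump are illustrative assumptions, not the authors' implementation) of the recipient's one-step reward in this corridor environment:

```python
def recipient_reward(loc, action, gate_open, r0=5.0):
    """One-step reward in the 10-location corridor.

    loc: current location 0..9; action: -1 (left), 0 (stay), +1 (right).
    The gate sits between locations 0 and 1; u = u+ means it is open.
    """
    if loc in (0, 9):
        return 0.0                    # both ends are absorbing
    if loc == 1 and action == -1 and not gate_open:
        return -10.0                  # bumped into the closed gate
    if loc + action == 0:
        return r0                     # one-time reward at the left end
    if loc + action == 9:
        return 0.0                    # right end
    return -1.0                       # per-step cost away from both ends
```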

To generate a random coordination problem, we sample an MDP for the provider and $N$ candidate MDPs for the recipient, setting the provider's prior uncertainty $\mu$ over the recipient's MDP to be the uniform distribution over the $N$ candidates. The horizon for both agents is set to 20. Since the left end has higher reward than the right end, if the recipient's start position is close enough to the left end and the provider commits to opening the gate early enough with high enough probability, the recipient should utilize the commitment by checking whether the gate is open by the commitment time, and pass through it if so; otherwise, the recipient should simply ignore the commitment and move to the right end. The distribution for generating the recipient's MDPs is designed to include diverse preferences regarding the commitments, such that the provider's query should be carefully formulated to elicit the recipient's preference.

Evaluating the Breakpoints Discretization. The principal result from Section 4 was that the commitment probabilities to consider can be restricted to breakpoints without loss of optimality. Further, the hypothesis was that the space of breakpoints would be relatively small, allowing the search to be faster. We now empirically confirm the optimality result, and test the hypothesis of greater efficiency, by comparing the breakpoint commitments discretization to the following alternative discretizations:

Even discretization. Prior work (Witwicki and Durfee 2007) discretizes the probability space up to a certain granularity. Here, the probability space $[0, 1]$ is evenly discretized as $\{p_0, \ldots, p_n\}$ where $p_i = i/n$.


Figure 1: Means and standard errors of the EUS (left) and runtime (right) of the discretizations in Synthetic MDPs.

Table 1: Averaged discretization size per commitment time (mean and standard error) in Synthetic MDPs.

              n = 10       n = 20       n = 50
Even          7.8 ± 0.1    15.1 ± 0.1   37.1 ± 0.1
DP            6.1 ± 0.2    12.0 ± 0.3   26.5 ± 0.7
Breakpoints   10.0 ± 0.1

Deterministic Policy (DP) discretization. This discretization finds all of the probabilities of toggling feature $u$ at the commitment time that can be attained by the provider following a deterministic policy (Witwicki and Durfee 2007, 2010).

For the even discretization, we consider the resolutions $n \in \{10, 20, 50\}$. For DP, we found that the number of toggling probabilities of all the provider's deterministic policies is large, and the corresponding computational cost of identifying and evaluating them is high. To reduce the computational cost, and for a fair comparison, we group the probabilities in the DP discretization that are within $1/n$ of each other, for $n \in \{10, 20, 50\}$. Since the problem instances have different reward scales, to facilitate analysis we normalize the EUS for each instance using the upper bound $\overline{EUS}$ defined in Eq. (8) and the EUS of the optimal and greedy query of the even discretization for $k = 1, n = 10$.

Figure 1 gives the EUS for the seven discretizations over 50 randomly-generated problem instances, for $N = 10$ candidate recipient MDPs and $k = 2$ and $5$. Figure 1 shows that, coupled with the greedy query algorithm, our breakpoint commitments discretization yields the highest EUS with the lowest computational cost. In Figure 1 (left), we see that, for the even and the DP discretizations, the EUS increases with the probability resolution $n$, and only at $n = 50$ is the EUS comparable to our breakpoints discretization. Figure 1 (right) compares the runtimes of forming the discretizations and evaluating their commitments for the downstream query formulation procedure, confirming the hypothesis that using breakpoints is faster. Table 1 compares the sizes of these discretizations, and confirms our intuition that the breakpoints discretization is most efficient because it identifies fewer commitments that are nonetheless sufficient for EUS maximization.

Evaluating the Greedy Query. Next, we empirically confirm that the greedy query search is effective for EUS maximization.

Figure 2: Means and standard errors of the EUS (left) and runtime (right) of the optimal, the greedy, and the random queries formulated from the breakpoints in Synthetic MDPs. (Both panels plot against query size $k \in \{1, 2, 3, 5, 10, 20\}$ on a log scale; the average number of breakpoints is 203.6 ± 2.9.)

Given the results confirming the effectiveness and efficiency of the breakpoint discretization, the query searches here are over the breakpoint commitments. Figure 2 (left) compares the EUS of the greedily-formulated query with the optimal (exhaustive search) query, and with a query comprised of randomly-chosen breakpoints. The EUS is normalized with $\overline{EUS}$ and the optimal EU prior to querying given uncertainty $\mu$ as defined in Eq. (6). We vary the query size $k$, and report means and standard errors over the same 50 coordination problems. We see that the EUS of the greedy query tracks that of the optimal query closely, while greedy's runtime scales much better.

6.2 Overcooked

In this section, we further test our approach in a more grounded domain. The domain, Overcooked, was inspired by the video game of the same name and introduced by Wang et al. (2020) to study theory of mind in the absence of communication but with global observability. We reuse one of their Overcooked settings with two high-level modifications: 1) instead of having global observability, each agent observes only its local environment, and 2) we introduce probabilistic transitions. These modifications induce a rich space of meaningful commitments, over which the agents should carefully negotiate for the optimal cooperative behavior.

Figure 3: Overcooked.

Figure 3 illustrates this Overcooked environment. Two agents, the chef and the waiter, together occupy a grid with counters as the boundaries. The chef is supposed to pick up the tomato, chop it, and place it on the plate. Afterwards, the waiter is supposed to pick up the chopped tomato and deliver it to the counter labelled by the star. Meanwhile, the chef needs to take care of the pot, which can probabilistically begin boiling, and the waiter needs to take care of a dine-in customer (labelled by the plate with fork and knife). This introduces interesting tensions between delivering the food and taking care of the pot and the customer. Please refer to Appendix C for a detailed description of the environment.


Figure 4: Means and standard errors of the EUS (left) and runtime (right) in Overcooked.

Table 2: Averaged discretization size per commitment time (mean and standard error) in Overcooked.

              n = 10       n = 20       n = 50
Even          6.4 ± 0.1    11.8 ± 0.3   28.0 ± 0.7
DP            5.2 ± 0.2    8.5 ± 0.4    16.4 ± 0.9
Breakpoints   4.9 ± 0.2


For coordination, the chef makes a probabilistic commitment that it will place the chopped tomato on the plate. Thus, the chef is the provider and the waiter is the recipient. Crucially, the commitment decouples the agents' planning problems, allowing each agent to model only the MDP in its half of the grid. We repeat the experiments in Section 6.1 that evaluate the breakpoints discretization and the greedy query, over 50 problem instances. We conjecture that, since the provider's transition function in Overcooked is more structured than in Section 6.1, the breakpoints discretization is relatively smaller, leading to greater efficiency. The results, presented in Figure 4 and Table 2 as the counterparts of Figure 1 and Table 1, confirm our conjecture. Comparing Table 2 with Table 1, we see that the breakpoints discretization in Overcooked is relatively smaller. Therefore, it is unsurprising that the runtime in the Overcooked environment, as shown in Figure 4 (right), is relatively smaller than in Figure 1 (right). The counterpart results of Figure 2 are presented in Appendix E.1, confirming that the greedy query is again efficient and effective.

Quality of Coordination from Querying. The results thus far confirm that the agents are able to agree on a commitment from our querying process that achieves high expected joint commitment value (i.e. EUS). By agreeing on $c$, the agents will execute the joint policies $(\pi^p(c), \pi^r(c))$ derived from $c$. Note that, because $\pi^r(c)$ corresponds to the recipient's approximation of the provider's true influence (Eq. (3)), the joint commitment value is not perfectly aligned with the value of the joint policies. A reasonable remaining question is: how effective is our commitment query approach in terms of maximizing the joint policies' value, compared with the coordination approach originally examined in Overcooked, and with other approaches that make different tradeoffs with respect to observability, communication, and centralization?

Table 3: Values of joint policies (mean and standard error in %) in Overcooked, with MMDPs normalized to 100 and the null commitment to 0.

               Centralized                Decentralized
Global Obs.    100 (MMDPs)                near 100 (Wang et al.)
Local Obs.     99.9 ± 0.1 (Optimal c)     99.6 ± 0.1, 99.8 ± 0.1 (Query k = 2, 5)

Null c: 0; Random c: 14.5 ± 1.4

To answer this question, we measure the following joint values: 1) centralized planning with global observability, which views the provider and the recipient as a single agent; this corresponds to the multi-agent MDP (MMDP) model (Boutilier 1996), where agents have global observability of the state and the reward when selecting actions; 2) decentralized MMDPs, which allow global observability but no centralization, so that the agents need to infer each other's intentions individually; in this case, Wang et al. (2020) achieved values close to centralization in Overcooked; 3) centralized local observability, where a centralized planner yields joint policies that select actions based on local observability (i.e. half of the grid and private rewards); in particular, we consider policies derived from the optimal commitment (Section 4.2) found by the centralized planner; and 4) decentralized local observability, which corresponds to our commitment query approach; here we also consider the null commitment policy, where the chef and the waiter only optimize the reward for the pot and the dine-in customer without delivering food. Note that this null commitment policy is a reasonable baseline when no communication is allowed.

The results, presented in Table 3, show that our commitment query approach uses modest communication to achieve joint values comparable to cases with stronger information infrastructures that assume centralization and/or global observability during planning and/or execution. The results also confirm that careful selection of commitments for querying is crucial to induce effective coordination, as random commitments yield significantly lower joint values.

7 Discussion

Built on provable foundations and evaluated in two separate domains, our approach is well suited to settings where cooperative agents coordinate their plans through commitments in a decentralized manner, and could provide a good performance/cost tradeoff even compared to coordination that is not restricted to being commitment-based.

For future directions, if the agents can afford the time and bandwidth, querying need not be limited to a single round, which raises questions about how agents should consider future rounds when deciding what to ask in the current round. The querying can also be extended to the setting where the query poser is uncertain about both the responder's environment and its own. As dependencies between agents get richer (with chains and even cycles of commitments), continuing to identify and exploit structure in intertwined value functions will be critical to scaling up to efficient multi-round querying of connected commitments.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This work was supported in part by the Air Force Office of Scientific Research under grant FA9550-15-1-0039. Opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsor.

References

Agotnes, T.; Goranko, V.; and Jamroga, W. 2007. Strategic commitment and release in logics for multi-agent systems. Technical Report IfI-08-01, Clausthal University.

Aknine, S.; Pinson, S.; and Shakun, M. F. 2004. An extended multi-agent negotiation protocol. Autonomous Agents and Multi-Agent Systems 8(1): 5-45.

Al-Saqqar, F.; Bentahar, J.; Sultan, K.; and El-Menshawy, M. 2014. On the interaction between knowledge and social commitments in multi-agent systems. Applied Intelligence 41(1): 235-259.

Altman, E. 1999. Constrained Markov Decision Processes, volume 7. CRC Press.

Baldoni, M.; Baroglio, C.; Chopra, A. K.; and Singh, M. P. 2015. Composing and verifying commitment-based multiagent protocols. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 10-17.

Bannazadeh, H.; and Leon-Garcia, A. 2010. A distributed probabilistic commitment control algorithm for service-oriented systems. IEEE Transactions on Network and Service Management 7(4): 204-217.

Boutilier, C. 1996. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, 195-210.

Boutilier, C. 2002. A POMDP formulation of preference elicitation problems. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, 239-246.

Castelfranchi, C. 1995. Commitments: From individual intentions to groups and organizations. In Proceedings of the International Conference on Multiagent Systems, 41-48.

Chajewska, U.; Koller, D.; and Parr, R. 2000. Making rational decisions using adaptive utility elicitation. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, 363-369.

Chesani, F.; Mello, P.; Montali, M.; and Torroni, P. 2013. Representing and monitoring social commitments using the event calculus. Autonomous Agents and Multi-Agent Systems 27(1): 85-130.

Cohen, P. R.; and Levesque, H. J. 1990. Intention is choice with commitment. Artificial Intelligence 42(2-3): 213-261.

Cohn, R.; Singh, S.; and Durfee, E. 2014. Characterizing EVOI-sufficient k-response query sets in decision problems. In International Conference on Artificial Intelligence and Statistics, 131-139.

Dastani, M.; van der Torre, L. W. N.; and Yorke-Smith, N. 2017. Commitments and interaction norms in organisations. Autonomous Agents and Multi-Agent Systems 31(2): 207-249.

Fern, A.; Yoon, S. W.; and Givan, R. 2004. Learning domain-specific control knowledge from random walks. In Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling, 191-199.

Fornara, N.; and Colombetti, M. 2008. Specifying and enforcing norms in artificial institutions. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, 1481-1484.

Gunay, A.; Liu, Y.; and Zhang, J. 2016. ProMoca: Probabilistic modeling and analysis of agents in commitment protocols. Journal of Artificial Intelligence Research 57: 465-508.

Han, T. A.; Pereira, L. M.; and Lenaerts, T. 2017. Evolution of commitment and level of participation in public goods games. Autonomous Agents and Multi-Agent Systems 31(3): 561-583.

Hindriks, K. V.; and van Riemsdijk, M. B. 2007. Satisfying maintenance goals. In 5th International Workshop on Declarative Agent Languages and Technologies (DALT), 86-103.

Jennings, N. R. 1993. Commitments and conventions: The foundation of coordination in multi-agent systems. The Knowledge Engineering Review 8(3): 223-250.

Kraus, S. 1997. Negotiation and cooperation in multi-agent environments. Artificial Intelligence 94(1-2): 79-97.

Kushmerick, N.; Hanks, S.; and Weld, D. 1994. An algorithm for probabilistic least-commitment planning. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 1073-1078.

Mallya, A. U.; and Huhns, M. N. 2003. Commitments among agents. IEEE Internet Computing 7(4): 90-93.

Nair, R.; Tambe, M.; Yokoo, M.; Pynadath, D.; and Marsella, S. 2003. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, volume 3, 705-711.

Nakhost, H.; and Muller, M. 2009. Monte-Carlo exploration for deterministic planning. In Twenty-First International Joint Conference on Artificial Intelligence, 1766-1771.

Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions. Mathematical Programming 14(1): 265-294.

Oliehoek, F. A.; Amato, C.; et al. 2016. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems.

Oliehoek, F. A.; Witwicki, S. J.; and Kaelbling, L. P. 2012. Influence-based abstraction for multiagent systems. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 1422-1428.

Pereira, R. F.; Oren, N.; and Meneguzzi, F. 2017. Detecting commitment abandonment by monitoring sub-optimal steps during plan execution. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 1685-1687.


Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rahwan, I. 2004. Interest-based negotiation in multi-agent systems. Ph.D. thesis, University of Melbourne, Department of Information Systems, Melbourne.

Singh, M. P. 2012. Commitments in multiagent systems: Some history, some confusions, some controversies, some prospects. In The Goals of Cognition: Essays in Honor of Cristiano Castelfranchi, 601-626. London.

Venkatraman, M.; and Singh, M. P. 1999. Verifying compliance with commitment protocols. Autonomous Agents and Multi-Agent Systems 2(3): 217-236.

Viappiani, P.; and Boutilier, C. 2010. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In Advances in Neural Information Processing Systems, 2352-2360.

Vokrínek, J.; Komenda, A.; and Pechoucek, M. 2009. Decommitting in multi-agent execution in non-deterministic environment: Experimental approach. In 8th International Joint Conference on Autonomous Agents and Multiagent Systems, 977-984.

Wang, R. E.; Wu, S. A.; Evans, J. A.; Tenenbaum, J. B.; Parkes, D. C.; and Kleiman-Weiner, M. 2020. Too many cooks: Coordinating multi-agent collaboration through inverse planning. arXiv preprint arXiv:2003.11778.

Winikoff, M. 2006. Implementing flexible and robust agent interactions using distributed commitment machines. Multiagent and Grid Systems 2(4): 365-381.

Witwicki, S. J.; and Durfee, E. H. 2007. Commitment-driven distributed joint policy search. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, 480-487.

Witwicki, S. J.; and Durfee, E. H. 2009. Commitment-based service coordination. International Journal of Agent-Oriented Software Engineering 3: 59-87.

Witwicki, S. J.; and Durfee, E. H. 2010. Influence-based policy abstraction for weakly-coupled Dec-POMDPs. In Proceedings of the Twentieth International Conference on Automated Planning and Scheduling, 185-192.

Xing, J.; and Singh, M. P. 2001. Formalization of commitment-based agent interaction. In Proceedings of the 2001 ACM Symposium on Applied Computing, 115-120. ACM.

Xuan, P.; and Lesser, V. R. 1999. Incorporating uncertainty in agent commitments. In International Workshop on Agent Theories, Architectures, and Languages, 57-70. Springer.

Yolum, P.; and Singh, M. P. 2002. Flexible protocol specification and execution: Applying event calculus planning using commitments. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems, 527-534.

Zhang, K.; Yang, Z.; Liu, H.; Zhang, T.; and Basar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. arXiv preprint arXiv:1802.08757.

Zhang, Q.; Durfee, E.; and Singh, S. 2020a. Modeling probabilistic commitments for maintenance is inherently harder than for achievement. In Proceedings of the AAAI Conference on Artificial Intelligence, 10326-10333.

Zhang, Q.; Durfee, E. H.; and Singh, S. 2020b. Semantics and algorithms for trustworthy commitment achievement under model uncertainty. Autonomous Agents and Multi-Agent Systems 34(1): 19.

Zhang, Q.; Durfee, E. H.; Singh, S.; Chen, A.; and Witwicki, S. J. 2016. Commitment semantics for sequential decision making under reward uncertainty. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 3315-3323.

Zhang, S.; Durfee, E.; and Singh, S. 2017. Approximately-optimal queries for planning in reward-uncertain Markov decision processes. In Proceedings of the Twenty-Seventh International Conference on Automated Planning and Scheduling, 339-347.


Appendix

A Proofs

A.1 Proof of Theorem 1

Proof of monotonicity. By the commitment semantics of Eq. (1), $\Pi^p(c) = \Pi^p(T, p)$ is monotonically non-increasing in $p$ for any fixed $T$, i.e. $\Pi^p(T, p') \subseteq \Pi^p(T, p)$ for any $p' > p$. Therefore, $v^p(T, p)$ is monotonically non-increasing in $p$.

Proof of concavity Consider the linear program (LP), pat-terned on the literature(Altman 1999; Witwicki and Dur-fee 2007), that solves the provider’s planning problem inEq. (2):

maxx

∑sp,ap

x(sp, ap)Rp(sp, ap) (9a)

s.t. ∀sp, ap x(sp, ap) ≥ 0; (9b)

∀sp′∑ap′

x(sp′, ap′) (9c)

=∑sp,ap

x(sp, ap)P p(sp′|sp, ap) + δ(sp′, sp0);

∑sp∈s+T

∑ap

x(sp, ap) ≥ p (9d)

where δ(sp′, sp0) is the Kronecker delta that returns 1 when

sp′ = sp0 and 0 otherwise, and s+

T = {sp : sp ∈ SpT , u

+ ∈sp} is the set of the provider’s states at commitment time Tin which u = u+. If x satisfies constraints (9b) and (9c),then it is the occupancy measure of policy πp,

πp(ap|sp) =x(sp, ap)∑ap′ x(sp, ap′)

,

where x(sp, ap) is the expected number of times action ap

is taken in state sp by following policy πp. Constraint (9d)expresses the commitment semantics of Eq. (1). The ex-pected cumulative reward is expressed in the objective func-tion (9a). Therefore, vp(c) is the optimal value of this linearprogram.

For a fixed commitment time T and any two commitmentprobabilities p and p′, let x∗p, x

∗p′ be the optimal solutions to

the LP, respectively. For any η ∈ [0, 1], let pη = ηp′ + (1−η)p. Consider xη that is the η-interpolation of x∗p, x

∗p′ ,

xη(sp, ap) = ηx∗p′(sp, ap) + (1− η)x∗p(s

p, ap).

Note that xη satisfies constraints (9b) and (9c), and so it isthe occupancy measure of policy πp

η defined as

πpη(ap|sp) =

xη(sp, ap)∑ap xη(s, ap)

.

Since the occupancy measure of πpη is the η-interpolation of

x∗p and x∗p′ , it is easy to verify that πpη is feasible for commit-

ment probability pη . Therefore, the concavity holds because

vp(T, pη)

≥V πpη

Mp(sp0) =

∑sp,ap

xη(sp, ap)Rp(sp, ap)

=∑sp,ap

(ηx∗p′(s

p, ap) + (1− η)x∗p0(sp, ap))Rp(sp, ap)

=ηvp(T, p′) + (1− η)vp(T, p0).

Proof of piecewise linearity. We first convert the original linear program into its standard form:

\begin{align*}
\max_x \quad & r^\top x \\
\text{s.t.} \quad & Ax = b; \quad x \ge 0.
\end{align*}

To convert constraint (9d) into an equality constraint, we introduce a slack variable $\xi \ge 0$:

\[
\sum_{s^p \in s^+_T} \sum_{a^p} x(s^p, a^p) - \xi = p.
\]

The slack variable is a decision variable in the standard form, $x = [x \mid \xi] \in \mathbb{R}^{|S^p||A^p| + 1}$. The standard form eliminates redundant constraints so that $A \in \mathbb{R}^{m \times (|S^p||A^p| + 1)}$ has full row rank ($\mathrm{rank}(A) = m$). Note that the elimination produces $b \in \mathbb{R}^m$ whose elements are linear in $p$.

Pick a set of indices $B$ corresponding to $m$ linearly independent columns of the matrix $A$. We can think of $A$ as the concatenation of two matrices $A_B$ and $A_N$, where $A_B$ is the $m \times m$ matrix of these $m$ linearly independent columns and $A_N$ contains the other columns. Correspondingly, $x$ is decomposed into $x_B$ and $x_N$. Then $x = [x_B \mid x_N]$ is basic feasible if $x_N = 0$, $A_B$ is invertible, and $x_B = A_B^{-1} b \ge 0$. It is known that the optimal solution can be found among the basic feasible solutions:

\[
v^p(T, p) = \max_{B : x \text{ is basic feasible}} r^\top x
          = \max_{B : x \text{ is basic feasible}} r_B^\top x_B
          = \max_{B : x \text{ is basic feasible}} r_B^\top A_B^{-1} b.
\]

Since $b$ is linear in $p$, $v^p(T, p)$ is the maximum of a set of linear functions of $p$, and therefore it is piecewise linear.

A.2 Proof of Theorem 2

Proof of monotonicity. We fix the commitment time $T$. For any recipient policy $\pi^r$, let $v^{\pi^r}_{T,1}$ be the initial state value of $\pi^r$ when $u$ is flipped from $u^-$ to $u^+$ with probability 1 at $T$, and let $v^{\pi^r}_{T,0}$ be the initial state value of $\pi^r$ when $u$ never flips to $u^+$. It is useful to notice that

\[
V^{\pi^r}_{M^r(c)}(s^r_0) = p\, v^{\pi^r}_{T,1} + (1 - p)\, v^{\pi^r}_{T,0}. \tag{11}
\]

In words, the initial state value can be expressed as the weighted sum of the values in these two scenarios, with the weight determined by the commitment probability. Consider the optimal policy $\pi^*_{M^r(c)}$ for $M^r(c)$. It is guaranteed that $v^{\pi^*_{M^r(c)}}_{T,1} \ge v^{\pi^*_{M^r(c)}}_{T,0}$ because, intuitively, $u^+$ is more desirable than $u^-$ to the recipient; we prove this formally below. Now consider $p' > p$ and let $c' = (T, p')$:

\[
v^r(T, p) = p\, v^{\pi^*_{M^r(c)}}_{T,1} + (1 - p)\, v^{\pi^*_{M^r(c)}}_{T,0}
\le p'\, v^{\pi^*_{M^r(c)}}_{T,1} + (1 - p')\, v^{\pi^*_{M^r(c)}}_{T,0}
\le v^r(T, p').
\]

We finish the proof by formally showing $v^{\pi^*_{M^r(c)}}_{T,1} \ge v^{\pi^*_{M^r(c)}}_{T,0}$. To this end, it is useful to first give Lemma 2, which follows directly from Assumption 1 and states that the value when $u$ is always set to $u^-$ is no more than the value of any arbitrary $M^r$.

Lemma 2. Under Assumption 1, for any $M^r$ with arbitrary $P^r_u$, we have $V^*_{M^{r-}}(s^r_0) \le V^*_{M^r}(s^r_0)$.

Proof. First consider the case in which $P^r_u$ flips $u$ only at a single time step $T$. We show $V^*_{M^{r-}}(s^r_0) \le V^*_{M^r}(s^r_0)$ by constructing a policy in $M^r$ whose value is at least $V^*_{M^{r-}}(s^r_0)$, by mimicking $\pi^*_{M^{r-}}$: the constructed policy $\pi_{M^r}$ chooses the same actions as $\pi^*_{M^{r-}}$ up until time step $T$. If $u$ is not flipped at $T$, the policy keeps choosing the same actions as $\pi^*_{M^{r-}}$ throughout the episode; otherwise, after $T$ it chooses actions that are optimal for $u^+$. By Assumption 1, this policy yields a value that is at least $V^*_{M^{r-}}(s^r_0)$.

For the case in which $P^r_u$ flips $u$ with positive probability at $K > 1$ time steps, we can decompose the value function for $P^r_u$ as the weighted average of $K$ value functions, each corresponding to the scenario where $u$ flips only at a single time step, with the weights of the average being the flipping probabilities of $P^r_u$ at these $K$ time steps.

Now we can show $v^{\pi^*_{M^r(c)}}_{T,1} \ge v^{\pi^*_{M^r(c)}}_{T,0}$ because, otherwise, we would have

\[
v^r(T, p) = p\, v^{\pi^*_{M^r(c)}}_{T,1} + (1 - p)\, v^{\pi^*_{M^r(c)}}_{T,0}
< p\, v^{\pi^*_{M^r(c)}}_{T,0} + (1 - p)\, v^{\pi^*_{M^r(c)}}_{T,0}
= v^{\pi^*_{M^r(c)}}_{T,0} \le V^*_{M^{r-}}(s^{r-}_0),
\]

where $v^r(T, p) < V^*_{M^{r-}}(s^{r-}_0)$ contradicts Lemma 2.

Proof of convexity and piecewise linearity. Let $\Pi^r_D$ be the set of all the recipient's deterministic policies. It is well known (Puterman 2014) that the optimal value can be attained by a deterministic policy,

\[
v^r(T, p) = \max_{\pi^r \in \Pi^r_D} V^{\pi^r}_{M^r(c)}(s^r_0)
          = \max_{\pi^r \in \Pi^r_D} \big( p\, v^{\pi^r}_{T,1} + (1 - p)\, v^{\pi^r}_{T,0} \big),
\]

which indicates that $v^r(T, p)$ is the maximum of a finite number of functions that are linear in $p$. Therefore, $v^r(T, p)$ is convex and piecewise linear in $p$.

A.3 Proof of Theorem 3

As an immediate consequence of Theorems 1 and 2, the joint commitment value is piecewise linear in the probability, and any local maximum for a fixed commitment time $T$ can be attained at a breakpoint probability. Therefore, restricting to those breakpoint commitments incurs no loss of optimality.
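Since $v^p(T, \cdot)$ is concave and $v^r(T, \cdot)$ convex, each being piecewise linear, their breakpoints can be isolated by recursive bisection; the union over both agents then contains all breakpoints of the joint value at time $T$. The following is a hedged sketch of one such search, not necessarily the paper's exact procedure; v is any convex or concave piecewise-linear function of the commitment probability, e.g., evaluated by solving the agent's planning problem.

```python
def find_breakpoints(v, lo=0.0, hi=1.0, eps=1e-6, tol=1e-9):
    """Isolate the kinks of a convex or concave piecewise-linear v on [lo, hi].

    For convex/concave v, the midpoint lying on the chord is an exact
    test of linearity on the interval.
    """
    def is_linear(a, b):
        return abs(v((a + b) / 2) - (v(a) + v(b)) / 2) <= tol

    if is_linear(lo, hi):
        return []
    if hi - lo <= eps:
        return [(lo + hi) / 2]      # a kink isolated to width eps
    mid = (lo + hi) / 2
    if is_linear(lo, mid) and is_linear(mid, hi):
        return [mid]                # the lone kink sits exactly at mid
    # (for simplicity, assumes kinks do not otherwise coincide exactly
    # with a recursion midpoint)
    return (find_breakpoints(v, lo, mid, eps, tol)
            + find_breakpoints(v, mid, hi, eps, tol))
```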

A.4 Proof of Theorem 4

Since the recipient always chooses the commitment that maximizes the joint value over all commitments in the query, our setting reduces to the scenario referred to as the noiseless response model in prior work on EUS maximization (Viappiani and Boutilier 2010). That work proves submodularity under the noiseless response model, which also proves Theorem 4.
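Submodularity of EUS is what licenses greedy query composition with the $1 - (\frac{k-1}{k})^k$ bound. Below is a minimal sketch under the noiseless response model; mu is assumed to be a list of (recipient value function, probability) pairs, and vp an evaluator of the provider's commitment value. All names are illustrative.

```python
def eus(Q, mu, vp):
    """EUS of query Q under the noiseless response model: each candidate
    recipient MDP picks the joint-value-maximizing commitment in Q."""
    return sum(w * max(vp(c) + vr_m(c) for c in Q) for vr_m, w in mu)

def greedy_query(candidates, mu, vp, k):
    """Greedily grow a size-k query; submodularity of EUS (Theorem 4)
    yields the 1 - ((k-1)/k)^k approximation guarantee for this loop."""
    Q = []
    for _ in range(k):
        best = max((c for c in candidates if c not in Q),
                   key=lambda c: eus(Q + [c], mu, vp))
        Q.append(best)
    return Q
```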

A.5 Proof of Theorem 5

We first give Lemma 3, which says that any discretization containing the linearity breakpoints is no worse than any other discretization.

Lemma 3. Let $\bar{C}$ be defined in the same manner as in Theorem 5. Consider any finite set of commitments $C$ that contains $\bar{C}$, i.e., $C \supseteq \bar{C}$. For any query size $k$ and any uncertainty $\mu$,

\[
\max_{Q \subseteq C, |Q| = k} \mathrm{EUS}(Q; \mu) = \max_{Q \subseteq \bar{C}, |Q| = k} \mathrm{EUS}(Q; \mu). \tag{12}
\]

Proof. Because $C \supseteq \bar{C}$, it is obvious that "$\ge$" holds for Eq. (12). We next show "$\le$". Given a commitment query $Q = \{c_1, \dots, c_k\}$, define $T(Q)$ as the commitment query in which each commitment is the optimal commitment with respect to the posterior given a response for $Q$, i.e.,

\[
T(Q) = \{c^*(\mu \mid Q \to c_1), \dots, c^*(\mu \mid Q \to c_k)\}.
\]

Previous work (Viappiani and Boutilier 2010) shows that $\mathrm{EUS}(T(Q); \mu) \ge \mathrm{EUS}(Q; \mu)$. Due to Lemma 1, we have $c^*(\mu) \in \bar{C}$ for any uncertainty $\mu$. Thus, given an EUS maximizer $Q^*$ for $C$, $T(Q^*)$ is a subset of $\bar{C}$ with an EUS that is no smaller, which shows that "$\le$" holds for Eq. (12). This concludes the proof.

We are ready to prove Theorem 5. Consider the even discretization of $[0, 1]$, $P_n = \{p_0, p_1, \dots, p_n\}$ where $p_i = i/n$. Because $v^p + v^r$ is bounded and piecewise linear in the commitment probability, for any $\varepsilon > 0$ there exists a large enough discretization resolution $n$ such that for any size-$k$ query $Q \subseteq \mathcal{T} \times [0, 1]$ there is a size-$k$ query $\bar{Q} \subseteq \mathcal{T} \times P_n$ with $|\mathrm{EUS}(Q; \mu) - \mathrm{EUS}(\bar{Q}; \mu)| \le \varepsilon$. Therefore, we have

\[
\mathrm{EUS}(Q; \mu) - \varepsilon
\le \max_{\bar{Q} \subseteq \mathcal{T} \times P_n, |\bar{Q}| = k} \mathrm{EUS}(\bar{Q}; \mu)
\le \max_{\bar{Q} \subseteq (\bar{C} \cup \mathcal{T} \times P_n), |\bar{Q}| = k} \mathrm{EUS}(\bar{Q}; \mu)
= \max_{\bar{Q} \subseteq \bar{C}, |\bar{Q}| = k} \mathrm{EUS}(\bar{Q}; \mu)
\]

for any query $Q \subseteq \mathcal{T} \times [0, 1]$ with $|Q| = k$, where the equality is a direct result of Lemma 3. This concludes the proof.


B Domain Description: Synthetic MDPs

The provider's environment is a randomly-generated MDP, drawn from a distribution designed such that, in expectation, the provider's reward when enabling the precondition is smaller than when not enabling it. This introduces tension in the provider between enabling the precondition to help the recipient and increasing its own reward.

We now describe the provider's MDP-generating distribution. The MDP has 10 states, one of which is an absorbing state denoted $s^+$, and the initial state is chosen from the non-absorbing states. Feature $u$ takes the value $u^+$ only in the absorbing state, i.e., $u^+ \in s^p$ if and only if $s^p = s^+$. There are 3 actions. For each state-action pair $(s^p, a^p)$ with $s^p \ne s^+$, the transition function $P^p(\cdot \mid s^p, a^p)$ is determined independently by filling the 10 entries with values drawn uniformly from $[0, 1]$ and then normalizing. The reward $R^p(s^p, a^p)$ for a non-absorbing state $s^p \ne s^+$ is sampled uniformly and independently from $[0, 1]$, and is zero for the absorbing state $s^p = s^+$, meaning the provider prefers to avoid the absorbing state, yet that state is the only one that satisfies the commitment.
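A minimal sketch of this generating distribution using numpy; the function name and interface are illustrative stand-ins.

```python
import numpy as np

def sample_provider_mdp(n_states=10, n_actions=3, rng=None):
    """Sample a provider MDP from the distribution described above;
    state n_states-1 plays the role of the absorbing s+."""
    rng = rng or np.random.default_rng()
    s_plus = n_states - 1
    # Uniform random transition entries, normalized per (s, a)
    P = rng.uniform(size=(n_states, n_actions, n_states))
    P /= P.sum(axis=-1, keepdims=True)
    P[s_plus] = 0.0
    P[s_plus, :, s_plus] = 1.0               # s+ is absorbing
    R = rng.uniform(size=(n_states, n_actions))
    R[s_plus] = 0.0                          # zero reward in s+
    s0 = int(rng.integers(0, n_states - 1))  # non-absorbing initial state
    return P, R, s0
```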

The recipient’s environment, inspired by the randomwalk domains used in the planning literature (Fern, Yoon,and Givan 2004; Nakhost and Muller 2009), is a one-dimensional space with 10 locations represented as integers{0, 1, ..., 9}, as illustrated in Figure 5. In locations 1−8, therecipient can move right, left, or stay still. Once the recipi-ent reaches either end (location 0 or 9), it stays there. Thereis a gate between locations 0 and 1 for which u = u+ de-notes the state of open and u = u− closed. Initially, the gateis closed and the recipient starts at an initial location L0.A negative reward of −10 is incurred by bumping into theclosed gate. For each time step the recipient is at neither end,it gets a reward of −1. If it reaches the left end (i.e. location0), it gets a one-time reward of r0 ≥ 0. The recipient gets areward of 0 if it reaches the right end. In a specific instanti-ation of the recipient’s MDP, L0 and r0 are fixed, and theyare randomly chosen to create various MDPs for the recipi-ent. L0 is randomly chosen from locations 1−8 and r0 frominterval [0, 10].

Figure 5: 1D Walk as the recipient's environment in Section 6.1. The recipient starts at location $L_0$ on the line of locations 0-9; the gate sits between locations 0 and 1, and reaching location 0 yields the one-time reward $r_0$.

To generate a random coordination problem, we sample an MDP for the provider and $N$ candidate MDPs for the recipient, setting the provider's prior uncertainty $\mu$ over the recipient's MDP to be the uniform distribution over the $N$ candidates. The horizon for both agents is set to $H = H^p = H^r = 20$. Since the left end has higher reward than the right end, if the recipient's start position is close enough to the left end and the provider commits to opening the gate early enough with high enough probability, the recipient should utilize the commitment by checking whether the gate is open by the commitment time and passing through it if so; otherwise, the recipient should simply ignore the commitment and move to the right end. The distribution for generating the recipient's MDPs is designed to include diverse preferences over commitments, such that the provider's query must be carefully formulated to elicit the recipient's preference.

C Domain Description: Overcooked

The domain, Overcooked, was inspired by the video game of the same name and introduced by Wang et al. (2020). It is a gridworld domain that requires the agents to cooperate in an environment that mimics a restaurant. We use an Overcooked environment as in Wang et al. (2020) with two high-level modifications: 1) instead of having global observability, each agent observes its local environment, and 2) we introduce probabilistic effects into the transition function. We make these modifications to induce a rich space of meaningful commitments for the domain, over which the agents should carefully negotiate for the optimal cooperative behavior.

Figure 3 illustrates this Overcooked environment, and we describe it here in detail. Two agents, the chef and the waiter, together occupy a 7x7 grid with counters as the boundaries. Counters divide the grid into halves, with the chef on the left and the waiter on the right. The chef is supposed to pick up the tomato, chop it, and place it on the plate. Afterwards, the waiter is supposed to pick up the chopped tomato and deliver it to the counter labelled by the star. Meanwhile, the chef needs to take care of the pot, and the waiter needs to take care of a dine-in customer (labelled by the plate with fork and knife). Specifically, each agent has nine actions: {N,E,S,W}-move, {N,E,S,W}-interact, and do-nothing. The {N,E,S,W}-move actions change the agent's location in the cardinal directions. The {N,E,S,W}-interact actions change the status of the object in the corresponding cardinal direction: the chef picks up the (unchopped) tomato by interacting with it; after picking up the tomato, the chef chops it (and keeps carrying it) by interacting with the knife; after chopping the tomato, the chef places it on the plate by interacting with the plate, which is initially empty; the waiter picks up the tomato on the plate by interacting with the plate; after picking up the tomato, the waiter delivers it by interacting with the counter labelled by the star; the pot, initially unboiled, can turn boiling at each time step with probability $p_{\text{boiling}}$, and when it is boiling the chef can turn the heat off by interacting with it. Except for the aforementioned cases, the interact actions have no effect (equivalent to do-nothing). The chef gets a reward of $-1$ for every time step the pot is boiling. The waiter gets a positive reward $r_{\text{delivery}}$ upon the delivery. At every time step, the waiter also gets a reward of $d \cdot r_{\text{distance}}$, where $d$ is the Manhattan distance between the waiter and the dine-in customer and $r_{\text{distance}}$ is a negative number, which encourages the waiter to stay close to the dine-in customer.
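The two agents' reward terms can be summarized in a few lines; the function names are illustrative, and the sign convention for r_distance follows the description above.

```python
def chef_reward(pot_is_boiling):
    """-1 per time step while the pot is boiling, else 0."""
    return -1.0 if pot_is_boiling else 0.0

def waiter_reward(delivered, dist_to_customer, r_delivery, r_distance):
    """One-time delivery bonus plus a per-step distance term;
    r_distance < 0, so a larger Manhattan distance is penalized."""
    return (r_delivery if delivered else 0.0) + dist_to_customer * r_distance
```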

To facilitate coordination between the two agents, we consider commitments concerning the tomato, where the chef makes a commitment that it will place the chopped tomato on the plate by some time step with at least a certain probability. Thus, the chef is the provider and the waiter is the recipient. Crucially, the commitment decouples the agents' planning problems, allowing the chef to model only the left half of the grid in its MDP and the waiter to model only the right half. Similarly to Section 6, we generate a random coordination problem by sampling an MDP for the chef and 10 candidate MDPs for the waiter. For the chef's MDP, the initial locations of the chef, the tomato, the pot, and the knife are random and distinct, while the plate is always on the counter shown in Figure 3. For the waiter's MDP, the initial locations of the waiter, the delivery counter, and the dine-in customer are random and distinct. The probability $p_{\text{boiling}}$ is uniformly sampled from $[0, 0.1]$, $r_{\text{distance}}$ from $[0, 0.1]$, and $r_{\text{delivery}}$ from $[5, 15]$. Knowing $p_{\text{boiling}}$ but not $r_{\text{distance}}$ and $r_{\text{delivery}}$, the chef must carefully formulate the commitment query to elicit the commitment that balances the tradeoff between delivering the food and taking care of the pot and the dine-in customer.

D Supplementary Results in Synthetic MDPs

All experiments were run on Intel Xeon E5-2630 v4 (2.20GHz) CPUs, with the mip and pulp Python packages as (MI)LP solvers.

D.1 Diverse Priors and Multi-round Querying

Figure 2 demonstrated the effectiveness of the greedy query for a particular type of the provider's prior $\mu$: the uniform distribution over the recipient's $N = 10$ candidate MDPs. Here, we further show that the greedy query's effectiveness is robust to diverse prior types. Besides the uniform prior, we consider two other prior types. For the random prior, the probability of each candidate recipient MDP is proportional to a number randomly sampled from the interval $[0, 1]$. For the Gaussian prior, the probability of each candidate recipient MDP is proportional to the standard Gaussian distribution's probability density function evaluated at a number randomly sampled from the three-sigma interval $[-3, 3]$. Figure 6 shows the EUS, normalized in the same manner as in Figure 2, of the greedy query for the three prior types, with the number of candidate recipient MDPs $N = 10$ and $50$. For comparison, Figure 6 also shows, for query sizes $k = 1, 2, 3$, the EUS of the optimal query and the greedy query's theoretical lower bound, i.e., $\big(1 - (\frac{k-1}{k})^k\big)$ times the EUS of the optimal query of size $k$.
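A short sketch of the three prior constructions, with illustrative names; the Gaussian case assumes scipy is available for the density.

```python
import numpy as np
from scipy.stats import norm

def sample_prior(kind, N, rng=None):
    """Return a probability vector over N candidate recipient MDPs."""
    rng = rng or np.random.default_rng()
    if kind == "uniform":
        w = np.ones(N)
    elif kind == "random":
        w = rng.uniform(size=N)                       # weights ~ U[0, 1]
    elif kind == "gaussian":
        w = norm.pdf(rng.uniform(-3.0, 3.0, size=N))  # density at U[-3, 3]
    else:
        raise ValueError(kind)
    return w / w.sum()
```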

Besides priors that are synthetically generated, we also explore priors that naturally emerge in a two-round querying process. Specifically, the provider's initial prior $\mu_0$ is a random prior over $N$ candidate recipient MDPs, generated as described above. The provider forms the first greedy query of size $k_0$, updates its prior to $\mu_1$ based on the recipient's response, and then forms the second greedy query of size $k$ for prior $\mu_1$. We are interested in the quality of this second greedy query for the updated prior $\mu_1$ that emerges from the first round of querying. Figure 7 shows the results for $N = 50$ and $k_0 = 2$ and $5$, comparing the greedy query with its theoretical lower bound and the optimal query.

Figure 6: EUS of the greedy query for the uniform (top), random (middle), and Gaussian (bottom) priors, for N = 10 (left) and N = 50 (right). Each panel plots normalized EUS against query size k (1, 2, 3, 5, 10, 20, log scale), showing the greedy query, the optimal query, EU*(µ), and the theoretical lower bound. The queries are formed from the breakpoints discretization. The results are means and standard errors of the EUS over 50 problem instances, each consisting of one provider MDP and N recipient MDPs randomly generated as described in Section 6.1.

Consistent with the results in Figure 6, the results in Figure 7 show that the greedy query is effective for the priors that emerge from the first round of querying.

D.2 Probabilistic Commitment Effectiveness

The results in Section 6.1 confirm the effectiveness of our greedy approach to forming a query from the breakpoint commitments. Similarly to Section 6.2, we measure the joint value of the policies $(\pi^p(c^*), \pi^r(c^*))$ derived from the optimal commitment found by the centralized algorithm, and compare it with the optimal joint value obtained when the provider and the recipient plan together as a single agent. Note that this single-agent optimal value is an upper bound on the joint value of any distributed policies the agents could find. Table 4 shows the results for the pairs of provider and recipient MDPs from the same 50 coordination problems. For each pair, the values are normalized by the optimal single-agent plan's value. We also report the joint value achieved by a random policy and by policies $(\pi^p(c), \pi^r(c))$ derived from a randomly-selected feasible commitment.

As expected, the restriction of coordinating plans only through a commitment specification incurs some loss in joint value relative to optimal joint planning.


Figure 7: EUS of the greedy query in the second round of querying, for k0 = 2 (left) and k0 = 5 (right). For the first round, the prior is the random prior over N = 50 candidate recipient MDPs, and the provider forms the first greedy query of size k0 and updates its prior based on the recipient's response. For the second round, the provider constructs the second query of size k (X-axis) for the updated prior, and the corresponding normalized EUS is shown along the Y-axis, together with the optimal query and the theoretical lower bound. The results are means and standard errors of the EUS over 50 problem instances, each consisting of this two-round querying process, with the provider's MDP and N = 50 recipient MDPs randomly generated as described in Section 6.1.

That loss, however, is modest, given that (as evidenced by the poor performance of the other approaches) the problems were not inherently easy. The results assure us that our commitment-query approach induces high-quality coordination.

Table 4: Joint value (mean and standard error) of the policies based on probabilistic commitments, normalized by the optimal single-agent plan's value.

Random Policy    Random Commitment    Optimal Commitment    Optimal Single Agent
−4.62 ± 1.58     0.82 ± 0.03          0.90 ± 0.02           1.00


E Supplementary Results in Overcooked

E.1 Greedy Query from the Breakpoints

For Overcooked, we repeat the experiments in Section 6.1, with Figure 4, Table 2, and Figure 8 as the counterparts of Figure 1, Table 1, and Figure 2, respectively.

The results in Figure 4 and Table 2, presented in the main body, confirm our conjecture that the breakpoints discretization in Overcooked is relatively small, leading to greater efficiency. Figure 4 (left) confirms that the breakpoints discretization still yields the highest EUS. Comparing Table 2 with Table 1, we see that the breakpoints discretization in Overcooked is even smaller than the even discretization with n = 10, whereas for the synthetic MDPs in Section 6.1 the breakpoints discretization is significantly larger than the even discretization with n = 10. It is therefore unsurprising that the runtime in the Overcooked environment, as shown in Figure 4 (right), is shorter than that in Figure 1 (right). Figure 8 confirms that formulating the greedy query is again both efficient and effective.

Figure 8: Means (markers) and standard errors (bars) of the EUS (left, normalized) and runtime (right, seconds, log scale) of the optimal, greedy, and random queries formulated from the breakpoints in Overcooked, plotted against query size k (log scale). The average number of breakpoints is 98.5 ± 3.3.

E.2 Multiple Food Items

We also consider the scenario with more than one food item. In such a scenario, if the time horizon is long enough, the commitment querying process could consist of multiple rounds, with one commitment per round concerning a single food item. Here we evaluate our approach by restricting the chef to make a single commitment regarding one of the multiple food items, leaving the full-fledged problem of multi-round querying to future work. If there are $m > 1$ food items, then besides the commitment time $T_c$ and probability $p_c$, the commitment $c$ should also specify, as its commitment feature $u_c$, the id of the food item it concerns, i.e., $u_c \in \{1, \dots, m\}$. It can be easily verified that for each fixed $u_c \in \{1, \dots, m\}$, the structural properties presented in Section 4 still hold. Therefore, we can use the binary search procedure to identify the breakpoints for each $u_c \in \{1, \dots, m\}$ independently, as sketched below, and all the theoretical guarantees presented in the main body still hold.
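Concretely, the independent per-food search could reuse the find_breakpoints sketch from Appendix A; everything here (names, interface) is illustrative.

```python
def breakpoints_per_food(times, m, v_agent, find_breakpoints):
    """Run the breakpoint search independently for each commitment time T
    and each food id u_c, where v_agent(T, p, u_c) evaluates an agent's
    piecewise-linear value at commitment (T, p, u_c)."""
    # default args bind T and u_c eagerly inside the comprehension
    return {(T, u_c): find_breakpoints(lambda p, T=T, u_c=u_c: v_agent(T, p, u_c))
            for T in times for u_c in range(1, m + 1)}
```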

We repeat the experiments in Section 6.2 with $m = 3$. The results are presented in Figure 9, Table 5, Figure 10, and Table 6 as the counterparts of Figure 4, Table 2, Figure 8, and Table 3, respectively. These results are qualitatively consistent with the results for $m = 1$.

Figure 9: Means and standard errors of the EUS (left, normalized) and runtime (right, seconds) in Overcooked with m = 3 food items, for query sizes k = 2 and k = 5, comparing the even discretization (n = 10, 20, 50), the DP discretization (n = 10, 20, 50), and the breakpoints discretization.

Table 5: Average discretization size per commitment time (mean and standard error) in Overcooked with m = 3 food items.

              n = 10       n = 20       n = 50
Even          6.6 ± 0.1    11.1 ± 0.2   28.8 ± 0.5
DP            5.3 ± 0.2    8.7 ± 0.4    17.0 ± 0.8
Breakpoints   5.0 ± 0.1

Table 6: Values of joint policies (mean and standard error, in %) in Overcooked with m = 3 food items, normalized so that the MMDP value is 100 and the null commitment is 0.

                 Centralized               Decentralized
Global Obs.      100 (MMDP)                – (Wang et al.)
Local Obs.       99.9 ± 0.1 (Optimal c)    99.6 ± 0.1, 99.9 ± 0.1 (Query k = 2, 5)

Null c: 0; Random c: 14.5 ± 1.4

Figure 10: Means (markers) and standard errors (bars) of the EUS (left, normalized) and runtime (right, seconds, log scale) of the optimal, greedy, and random queries formulated from the breakpoints in Overcooked with m = 3 food items, plotted against query size k (log scale). The average number of breakpoints is 299.6 ± 8.3.