


Computational Rationalization: The Inverse Equilibrium Problem

Kevin Waugh, Brian D. Ziebart, J. Andrew Bagnell

Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA 15213

Abstract

Modeling the purposeful behavior of imperfect agents from a small number of observations is a challenging task. When restricted to the single-agent decision-theoretic setting, inverse optimal control techniques assume that observed behavior is an approximately optimal solution to an unknown decision problem. These techniques learn a utility function that explains the example behavior and can then be used to accurately predict or imitate future behavior in similar observed or unobserved situations.

In this work, we consider similar tasks in competitive and cooperative multi-agent domains. Here, unlike single-agent settings, a player cannot myopically maximize its reward — it must speculate on how the other agents may act to influence the game's outcome. Employing the game-theoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior, as well as recovering a reward function in these domains.

1. Introduction

Predicting the actions of others in complex and strategic settings is an important facet of intelligence that guides our interactions—from walking in crowds to negotiating multi-party deals. Recovering such behavior from merely a few observations is an important and challenging machine learning task.

While mature computational frameworks for decision-making have been developed to prescribe the behavior that an agent should perform, such frameworks are often ill-suited for predicting the behavior that an agent will perform. Foremost, the standard assumption of decision-making frameworks that a criterion for preferring actions (e.g., costs, motivations and goals) is known a priori often does not hold. Moreover, real behavior is typically not consistently optimal or completely rational; it may be influenced by factors that are difficult to model or subject to various types of error when executed. Meanwhile, the standard tools of statistical machine learning (e.g., classification and regression) may be equally poorly matched to modeling purposeful behavior; an agent's goals often succinctly, but implicitly, encode a strategy that would require tremendous amounts of data to learn.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

A natural approach to mitigate the complexity of recovering a full strategy for an agent is to consider identifying a compactly expressed utility function that rationalizes observed behavior: that is, identify rewards for which the demonstrated behavior is optimal and then leverage these rewards for future prediction. Unfortunately, the problem is fundamentally ill-posed: in general, many reward functions can make behavior seem rational, and in fact, the trivial, everywhere 0 reward function makes all behavior appear rational (Ng & Russell, 2000). Further, after removing such trivial reward functions, there may be no reward function for which the demonstrated behavior is optimal, as agents may be imperfect and the real world they operate in may be only approximately represented.

In the single-agent decision-theoretic setting, inverse optimal control methods have been used to bridge this gap between the prescriptive frameworks and predictive applications (Abbeel & Ng, 2004; Ratliff et al., 2006; Ziebart et al., 2008a; 2010). Successful applications include learning and prediction tasks in personalized vehicle route planning (Ziebart et al., 2008a), robotic crowd navigation (Henry et al., 2010), quadruped foot placement and grasp selection (Ratliff et al., 2009). A reward function is learned by these techniques that both explains demonstrated behavior and approximates the optimality criteria of prescriptive decision-theoretic frameworks.

As these methods only capture a single reward function and do not reason about competitive or cooperative motives, inverse optimal control proves inadequate for modeling the strategic interactions of multiple agents. In this paper, we consider the game-theoretic concept of regret as a necessary stand-in for the optimality criteria of the single-agent work. As with the inverse optimal control problem, the result is fundamentally ill-posed. We address this by requiring that for any utility function linear in known features, our learned model must have no more regret than that of the observed behavior. We demonstrate that this requirement can be re-cast as a set of equivalent convex constraints that we denote the inverse correlated equilibrium (ICE) polytope.

As we are interested in the effective prediction of behavior, we will use a maximum entropy criterion to select behavior from this polytope. We demonstrate that optimizing this criterion leads to minimax optimal prediction of behavior subject to approximate rationality. We consider the dual of this problem and note that it generalizes the traditional log-linear maximum entropy family of problems (Della Pietra et al., 2002). We provide a simple and computationally efficient gradient-based optimization strategy for this family and show that only a small number of observations are required for accurate prediction and transfer of behavior. We conclude by considering a matrix routing game and compare the ICE approach to a variety of natural alternatives.

Before we formalize imitation learning in matrix games, motivate our assumptions, and describe and analyze our approach, we will review the game-theoretic notions of regret and the correlated equilibrium.

2. Game Theory Background

Matrix games are the canonical tool of game theorists for representing strategic interactions ranging from illustrative toy problems, such as the "Prisoner's Dilemma" and the "Battle of the Sexes" games, to important negotiations, collaborations, and auctions. In this work, we employ a class of games with payoffs or utilities that are linear functions of features defined over the outcome space.

Definition 1. A linearly parameterized normal-form game, or matrix game, $\Gamma = (N, A, F)$, is composed of: a finite set of players, $N$; a set of joint-actions or outcomes, $A = \times_{i \in N} A_i$, consisting of a finite set of actions for each player, $A_i$; and a set of outcome features, $F = \{\theta^i_a \in \mathbb{R}^K\}$ for each outcome, that induce a parameterized utility function, $u_i(a|w) = \theta^{i\,T}_a w$, the reward for player $i$ achieving outcome $a$ with respect to utility weights $w$.

For notational convenience, we let $a_{-i}$ denote the vector $a$ excluding component $i$ and let $A_{-i} = \times_{j \in N, j \neq i} A_j$ be the set of such vectors.

In contrast to standard normal-form games where the utility functions for game outcomes are known, in this work we assume that the "true" utility weights, $w^*$, which govern observed behavior, are unknown. This allows us to model real-world scenarios where a cardinal utility is not available or is subject to personal taste.

We model the players with a distribution $\sigma \in \Delta_A$ over the game's joint-actions. Coordination between players can exist; thus, this distribution need not factor into independent strategies for each player. Conceptually, a signaling mechanism, such as a traffic light, can be thought to sample a joint-action from $\sigma$ and communicate to each player $a_i$, its portion of the joint-action. Each player can then consider deviating from $a_i$ using a modification function, $f_i : A_i \mapsto A_i$ (Blum & Mansour, 2007).

The switch modification function, for instance,

$$\mathrm{switch}^{x \to y}_i(a_i) = \begin{cases} y & \text{if } a_i = x \\ a_i & \text{otherwise} \end{cases} \qquad (1)$$

substitutes action $y$ for recommendation $x$.

Instantaneous regret measures how much a player would benefit from a particular modification function when the coordination device draws joint-action $a$,

$$\begin{aligned}
\mathrm{regret}_i(a|f_i, w) &= u_i(f_i(a_i), a_{-i}|w) - u_i(a|w) & (2)\\
&= \left[\theta^i_{f_i(a_i), a_{-i}} - \theta^i_{a_i, a_{-i}}\right]^T w & (3)\\
&= r^{f_i\,T}_{i,a} w. & (4)
\end{aligned}$$

Players do not have knowledge of the complete joint-action; thus, each must reason about the expected regret with respect to a modification function,

$$\begin{aligned}
\sigma^T R^{f_i}_i w &= \mathbb{E}_{a \sim \sigma}\left[\mathrm{regret}_i(a|f_i, w)\right] & (5)\\
&= \sum_{a \in A} \sigma_a\, r^{f_i\,T}_{i,a} w. & (6)
\end{aligned}$$

It is helpful to consider regret with respect to a class of modification functions. Two classes are particularly important for our discussion. First, internal regret corresponds to the set of modification functions where a single action is replaced by a new action, $\Phi^{\mathrm{int}}_i = \{\mathrm{switch}^{x \to y}_i(\cdot) : \forall x, y \in A_i\}$. Second, swap regret corresponds to the set of all modification functions, $\Phi^{\mathrm{swap}}_i = \{f_i\}$. We denote $\Phi = \cup_{i \in N} \Phi_i$.

The expected regret with respect to $\Phi$ and outcome distribution $\sigma$,

$$R^{\Phi}(\sigma, w) = \max_{f_i \in \Phi} \mathbb{E}_{a \sim \sigma}\left[\mathrm{regret}_i(a|f_i, w)\right], \qquad (7)$$

is important for understanding the incentive to deviate from, and hence the stability of, the specified behavior. The most general modification class, $\Phi^{\mathrm{swap}}$, leads to the notion of $\varepsilon$-correlated equilibrium (Osborne & Rubinstein, 1994), in which $\sigma$ satisfies $R^{\Phi^{\mathrm{swap}}}(\sigma, w^*) \leq \varepsilon$. Thus, regret can be thought of as a substitute for utility when assessing the optimality of behavior in multi-agent settings.
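To make these regret quantities concrete, the following sketch computes the per-outcome regret vectors $r^{f_i}_{i,a}$ of Equation (4) and the expected regret $\sigma^T R^{f_i}_i w$ of Equations (5)-(6) for a single switch modification in a small two-player game. It is our illustration, not code from the paper; the game sizes, feature values, and utility weights are invented for the example.

import itertools
import numpy as np

# Toy 2-player game: player 0 has 3 actions, player 1 has 2; K = 2 outcome features.
n_actions = [3, 2]
K = 2
rng = np.random.default_rng(0)
outcomes = list(itertools.product(*[range(n) for n in n_actions]))
# theta[i][a] is the K-vector of features player i receives at joint-action a.
theta = {i: {a: rng.normal(size=K) for a in outcomes} for i in range(2)}
w = np.array([1.0, -0.5])                            # hypothetical utility weights
sigma = np.full(len(outcomes), 1.0 / len(outcomes))  # a joint-action distribution

def switch(a_i, x, y):
    """switch_i^{x->y}: play y whenever the recommendation is x."""
    return y if a_i == x else a_i

def regret_vector(i, a, f):
    """r^{f_i}_{i,a}: feature difference between deviating with f and obeying a."""
    a_dev = list(a)
    a_dev[i] = f(a[i])
    return theta[i][tuple(a_dev)] - theta[i][a]

def expected_regret(i, f, sigma, w):
    """sigma^T R^{f_i}_i w, as in Equations (5)-(6)."""
    return sum(s * (regret_vector(i, a, f) @ w) for s, a in zip(sigma, outcomes))

# Expected regret of player 0 for the single modification switch_0^{0->2}.
print(expected_regret(0, lambda a_i: switch(a_i, 0, 2), sigma, w))

Maximizing this quantity over all modifications in a class $\Phi$ gives $R^{\Phi}(\sigma, w)$ of Equation (7).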

3. Imitation Learning in Matrix Games

We are now equipped with the tools necessary to introduce our approach for imitation learning in multi-agent settings. As input, we observe a sequence of outcomes, $\{a_m\}_{m=1}^M$, sampled from $\sigma$, the true behavior. We denote the empirical distribution of this sequence by $\tilde{\sigma}$, the demonstrated behavior. We aim to learn a predictive behavior distribution, $\hat{\sigma}$, from these demonstrations. Moreover, we would like our learning procedure to extract the motives and intent for the behavior so that we may imitate the players in similarly structured, but unobserved, games.

Imitation appears hard barring further assumptions. In particular, if the agents are unmotivated or their intentions are not coerced by the observed game, there is little hope of recovering principled behavior in a new game. Thus, we require some form of rationality.

3.1. Rationality Assumptions

We say that agents are rational under their true preferences when they are indifferent between $\hat{\sigma}$ and their true behavior, that is, when $R^{\Phi}(\hat{\sigma}, w^*) \leq R^{\Phi}(\sigma, w^*)$.

As agents' true preferences $w^*$ are unknown to the observer, we must consider an encompassing assumption that requires any behavior that we estimate to satisfy this property for all possible utility weights, or

$$\forall w \in \mathbb{R}^K, \quad R^{\Phi}(\hat{\sigma}, w) \leq R^{\Phi}(\sigma, w). \qquad (8)$$

Any behavior achieving this restriction, strong rationality, is also rational, and, by virtue of the contrapositive, we see that unless we have additional information regarding the agents' true preferences, we must adopt this strong assumption or we risk violating rationality.

Lemma 1. If strong rationality does not hold for alternative behavior $\hat{\sigma}$, then there exist agent utilities under which they would prefer $\sigma$ to $\hat{\sigma}$.

By restricting our attention to behavior that satisfies strong rationality, at worst, agents acting according to unknown true preference $w^*$ will be indifferent between our predictive distribution and their true behavior.

3.2. Inverse Correlated Equilibria

Unfortunately, a direct translation of the strong rationality requirement into constraints on the distribution $\hat{\sigma}$ leads to a non-convex optimization problem, as it involves products of varying utility vectors and the behavior to be estimated. Fortunately, however, we can provide an equivalent concise convex description of the constraints on $\hat{\sigma}$ that ensures any feasible distribution satisfies strong rationality. We denote this set of equivalent constraints as the Inverse Correlated Equilibria (ICE) polytope:

Definition 2 (ICE Polytope).

$$\hat{\sigma}^T R^{f_i}_i = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j, \;\forall f_i \in \Phi; \qquad \eta^{f_i} \in \Delta_{\Phi}, \;\forall f_i \in \Phi; \qquad \hat{\sigma} \in \Delta_A. \qquad (9)$$

Theorem 1. A distribution, $\hat{\sigma}$, satisfies the constraints above for some $\eta$ if and only if it satisfies strong rationality. That is, $\forall w \in \mathbb{R}^K, R^{\Phi}(\hat{\sigma}, w) \leq R^{\Phi}(\tilde{\sigma}, w)$ if and only if $\forall f_i \in \Phi, \exists \eta^{f_i} \in \Delta_{\Phi}$ such that $\hat{\sigma}^T R^{f_i}_i = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j$.

The proof of Theorem 1 is provided in the Appendix (Waugh et al., 2011).

We note that this polytope, perhaps unsurprisingly, is similar to the polytope of correlated equilibrium itself, but here it is defined in terms of the behavior we observe instead of the (unknown) reward function. Given any observed behavior $\tilde{\sigma}$, the constraints are feasible, as the demonstrated behavior satisfies them; our goal is to choose from these behaviors without estimating a full joint-action distribution. While the ICE polytope establishes a basic requirement for estimating rational behavior, there are generally infinitely many distributions consistent with its constraints.

3.3. Principle of Maximum Entropy

As we are interested in the problem of statistical prediction of strategic behavior, we must find a mechanism to resolve the ambiguity remaining after accounting for the rationality constraints. The principle of maximum entropy provides a principled method for choosing such a distribution. This choice leads not only to statistical guarantees on the resulting predictions, but also to efficient optimization.

The Shannon entropy of a distribution $\sigma$ is defined as $H(\sigma) = -\sum_{x \in X} \sigma_x \log \sigma_x$. The principle of maximum entropy advocates choosing the distribution with maximum entropy subject to known (linear) constraints (Jaynes, 1957):

$$\sigma_{\mathrm{MaxEnt}} = \operatorname*{argmax}_{\sigma \in \Delta_X} H(\sigma), \text{ subject to: } g(\sigma) = 0 \text{ and } h(\sigma) \leq 0. \qquad (10)$$

The resulting log-linear family of distributions (e.g., logistic regression, Markov random fields, conditional random fields) is widely used within statistical machine learning. For our problem, the constraints are precisely that the distribution is in the ICE polytope, ensuring that whatever distribution is learned has no more regret than the demonstrated behavior.

Importantly, the maximum entropy distribution subject to our constraints enjoys the following guarantee:

Lemma 2. The maximum entropy ICE distribution minimizes, over all strongly rational distributions, the worst-case log-loss, $-\sum_{a \in A} \sigma_a \log \hat{\sigma}_a$, when $\sigma$ is chosen adversarially and subject to strong rationality.

The proof of Lemma 2 follows immediately from the result of Grunwald and Dawid (2003).

In the context of multi-agent behavior, the principle of maximum entropy has been employed to obtain correlated equilibria with predictive guarantees in normal-form games when the utilities are known a priori (Ortiz et al., 2007). We will now leverage its power with our rationality assumption to select predictive distributions in games where the utilities are unknown.
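As a small, self-contained illustration of the generic program in Equation (10) (ours, not the paper's), the sketch below finds the maximum entropy distribution over four support points subject to a single invented moment constraint, using a general-purpose constrained solver.

import numpy as np
from scipy.optimize import minimize

x_vals = np.array([0.0, 1.0, 2.0, 3.0])   # support of the distribution
target_mean = 1.2                          # invented linear constraint: E[x] = 1.2

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))           # minimize -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # p is a distribution
    {"type": "eq", "fun": lambda p: p @ x_vals - target_mean},  # g(sigma) = 0
]
p0 = np.full(4, 0.25)
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 4, constraints=constraints)
print(res.x)

The solution has the log-linear form mentioned above, $p(x) \propto \exp(\lambda x)$ for some multiplier $\lambda$; the MaxEnt ICE program replaces this single moment constraint with membership in the ICE polytope.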

3.4. Prediction of Behavior

Let us first consider prediction of the demonstrated behavior using the principle of maximum entropy and our strong rationality condition. Afterwards, we will extend the approach to behavior transfer and analyze the error introduced as a by-product of sampling $\tilde{\sigma}$ from $\sigma$.

The mathematical program that maximizes the entropy of $\hat{\sigma}$ under strong rationality with respect to $\tilde{\sigma}$,

$$\begin{aligned}
\operatorname*{argmax}_{\hat{\sigma}, \eta}\;\; & H(\hat{\sigma}), \text{ subject to:} & (11)\\
& \hat{\sigma}^T R^{f_i}_i = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j, \;\forall f_i \in \Phi\\
& \eta^{f_i} \in \Delta_{\Phi}, \;\forall f_i \in \Phi; \quad \hat{\sigma} \in \Delta_A,
\end{aligned}$$

is convex with linear constraints, feasible, and bounded. That is, it is simple and can be efficiently solved in this form. Before presenting our preferred dual optimization procedure, however, let us describe an approach for behavior transfer that further illustrates the advantages of this approach over directly estimating $\sigma$.
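One way to see that program (11) is directly solvable is to hand it to an off-the-shelf convex solver. The sketch below is ours, not the authors' implementation: it assumes the per-modification regret matrices R[i] (shape |A| x K, with rows $r^{f_i}_{i,a}$) and the demonstrated regret vectors demo_regrets (row $j$ equal to $\tilde{\sigma}^T R^{f_j}_j$) have already been assembled, and it uses the cvxpy modeling library.

import cvxpy as cp
import numpy as np

def maxent_ice(R, demo_regrets):
    """R: list of |Phi| arrays, each (|A|, K); demo_regrets: array (|Phi|, K)."""
    n_outcomes, K = R[0].shape
    F = len(R)
    sigma = cp.Variable(n_outcomes, nonneg=True)   # predicted behavior, hat{sigma}
    eta = cp.Variable((F, F), nonneg=True)         # one mixture eta^{f_i} per constraint
    constraints = [cp.sum(sigma) == 1, cp.sum(eta, axis=1) == 1]
    for i in range(F):
        # hat{sigma}^T R^{f_i}_i == sum_j eta^{f_i}_{f_j} * tilde{sigma}^T R^{f_j}_j
        constraints.append(sigma @ R[i] == eta[i] @ demo_regrets)
    problem = cp.Problem(cp.Maximize(cp.sum(cp.entr(sigma))), constraints)
    problem.solve()
    return sigma.value

# Purely illustrative data: 6 outcomes, 3 modification functions, K = 2 features.
rng = np.random.default_rng(1)
R = [rng.normal(size=(6, 2)) for _ in range(3)]
demo_regrets = np.vstack([np.full(6, 1.0 / 6) @ Rj for Rj in R])
print(maxent_ice(R, demo_regrets))

Adding a nonnegative slack variable to each equality (with penalty $-C\nu$ in the objective) and drawing R from the new game while demo_regrets stays with the observed one gives the transfer variant of the next section; the authors instead optimize the dual, for the reasons developed in Section 4.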

3.5. Transfer of Behavior

A principal justification of inverse optimal control techniques that attempt to identify behavior in terms of utility functions is the ability to consider what behavior might result if the underlying decision problem were changed while the interpretation of features into utilities remains the same (Ng & Russell, 2000; Ratliff et al., 2006). This enables prediction of agent behavior in a no-regret or agnostic sense in problems such as a robot encountering novel terrain (Silver et al., 2010), as well as route recommendation for drivers traveling to unseen destinations (Ziebart et al., 2008b).

Econometricians are interested in similar situations, but for much different reasons. Typically, they aim to validate a model of market behavior from observations of product sales. In these models, the firms assume a fixed pricing policy given known demand. The econometrician uses this fixed policy along with product features and sales data to estimate or bound both the consumers' utility functions as well as unknown production parameters, like markup and production cost (Berry et al., 1995; Nevo, 2001; Yang, 2009). In this line of work, the observed behavior is considered accurate to start with; it is not suitable for settings with limited observations.

Until now, we have considered the problem of identifying behavior in a single game. We note, however, that our approach enables behavior transfer to games equipped with the same features. We denote this unobserved game as $\tilde{\Gamma}$. As with prediction, to develop a technique for behavior transfer we assume a link between regret and the agents' preferences across the known space of possible preferences. Furthermore, we assume a relation between the regrets in both games.

Property 1 (Transfer Rationality). For some constant $\kappa > 0$,

$$\forall w, \quad R^{\tilde{\Phi}}(\hat{\sigma}, w) \leq \kappa\, R^{\Phi}(\sigma, w). \qquad (12)$$

Roughly, we assume that under preferences with low regret in the original game, the behavior in the unobserved game should also have low regret. By enforcing this property, if the agents are performing well with respect to their true preferences, then the transferred behavior will also be of high quality.


As we are not privileged to know $\kappa$, and this property is not guaranteed to hold, we introduce a slack variable to allow for violations of the strong rationality constraints while guaranteeing feasibility. Intuitively, the transfer-ICE polytope we now optimize over requires that, for any linear reward function and for every player, the predicted behavior in the new game must have no more regret than the demonstrated behavior does in the observed game, using the same parametric form of reward function. The corresponding mathematical program is:

$$\begin{aligned}
\max_{\hat{\sigma}, \eta, \nu}\;\; & H(\hat{\sigma}) - C\nu, \text{ subject to:} & (13)\\
& \hat{\sigma}^T \tilde{R}^{f_i}_i - \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j \leq \nu, \;\forall f_i \in \tilde{\Phi}\\
& \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j - \hat{\sigma}^T \tilde{R}^{f_i}_i \leq \nu, \;\forall f_i \in \tilde{\Phi}\\
& \eta^{f_i} \in \Delta_{\Phi}, \;\forall f_i \in \tilde{\Phi}; \quad \hat{\sigma} \in \Delta_{\tilde{A}}; \quad \nu \geq 0.
\end{aligned}$$

In the above formulation, $C > 0$ is a slack penalty parameter, which allows us to choose the trade-off between obeying the rationality constraints and maximizing the entropy. Additionally, we have omitted $\kappa$ above by considering it intrinsic to $R$.

We observe that this program is almost identical to the behavior prediction program introduced above. We have simply made substitutions of the regret matrices and modification sets in the appropriate places. That is, if $\tilde{\Gamma} = \Gamma$, we recover prediction with a slack.

Given $\hat{\sigma}$ and $\nu$, we can bound the violation of the strong rationality constraint for any utility vector.

Lemma 3. If $\hat{\sigma}$ violates the strong rationality constraints in the slack formulation by $\nu$, then for all $w$,

$$R^{\tilde{\Phi}}(\hat{\sigma}, w) \leq R^{\Phi}(\tilde{\sigma}, w) + \nu\, \|w\|_1. \qquad (14)$$

One could choose to institute multiple slack variables, say one for each $f_i \in \tilde{\Phi}$, instead of a single slack across all modification functions. Our choice is motivated by the interpretation of the dual multipliers presented in the next section. There, we will also address selection of an appropriate value for $C$.

4. Duality and Efficient Optimization

In this section, we will derive, interpret, and describe a procedure for optimizing the dual program for solving the MaxEnt ICE problem. We will see that the dual multipliers can be interpreted as utility vectors and that optimization in the dual has computational advantages. We begin by presenting the dual of the transfer program:

$$\begin{aligned}
\min_{\alpha, \beta, \xi}\;\; & \sum_{f_i \in \tilde{\Phi}} \max_{f_j \in \Phi} \left[ \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i}) \right] + \log Z(\alpha, \beta)\\
\text{subject to: } & \xi + \sum_{f_i \in \tilde{\Phi}} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k = C, \quad \alpha, \beta, \xi \geq 0,
\end{aligned}$$

where $Z(\alpha, \beta)$ is the partition function,

$$Z(\alpha, \beta) = \sum_{a \in \tilde{A}} \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right).$$

Algorithm 1 Dual MaxEnt ICE
  Input: $T$, $\gamma$, $C > 0$, $R$, $\tilde{R}$, $\Phi$ and $\tilde{\Phi}$
  $\forall f_i \in \tilde{\Phi}$: $\alpha^{f_i}, \beta^{f_i} \leftarrow 1/(|\tilde{\Phi}|K + 1)$
  for $t$ from 1 to $T$ do
    /* compute the gradient */
    $\forall a \in \tilde{A}$: $z_a \leftarrow \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right)$
    $Z \leftarrow \sum_{a \in \tilde{A}} z_a$
    for $f_i \in \tilde{\Phi}$ do
      $f^*_j \leftarrow \operatorname{argmax}_{f_j \in \Phi} \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i})$
      $g^{f_i} \leftarrow \tilde{\sigma}^T R^{f^*_j}_j - \sum_{a \in \tilde{A}} z_a\, \tilde{r}^{f_i}_{i,a} / Z$
    end for
    /* descend and project */
    $\gamma_t \leftarrow \gamma / \sqrt{t}$
    $\rho \leftarrow 1 + \sum_{f_i, k} \alpha^{f_i}_k \exp(-\gamma_t g^{f_i}_k) + \beta^{f_i}_k \exp(\gamma_t g^{f_i}_k)$
    $\forall f_i \in \tilde{\Phi}, k$: $\alpha^{f_i}_k \leftarrow C \alpha^{f_i}_k \exp(-\gamma_t g^{f_i}_k) / \rho$
    $\forall f_i \in \tilde{\Phi}, k$: $\beta^{f_i}_k \leftarrow C \beta^{f_i}_k \exp(\gamma_t g^{f_i}_k) / \rho$
  end for
  return $(\alpha, \beta)$

Removing the equality constraint is equivalent to disallowing any slack. We derive the dual in the appendix (Waugh et al., 2011).

For $C > 0$, the dual's feasible set has non-empty interior and is bounded. Therefore, by Slater's condition, strong duality holds: there is no duality gap. In particular, we can use a dual solution to recover $\hat{\sigma}$.

Lemma 4. Given a dual solution, $(\alpha, \beta)$, we can recover the primal solution, $\hat{\sigma}$. Specifically,

$$\hat{\sigma}_a = \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right) / Z(\alpha, \beta). \qquad (15)$$

Intuitively, the probability of predicting an outcome is small if that outcome has high regret.

In general, the dual multipliers are utility vectors associated with each modification function in $\tilde{\Phi}$. Under the slack formulation, there is a natural interpretation of these variables as a single utility vector. Given a dual solution, $(\alpha, \beta)$, with slack penalty $C$, we choose

$$\begin{aligned}
\lambda^{f_i} &= \alpha^{f_i} - \beta^{f_i}, & (16)\\
\pi^{f_i} &= \frac{1}{C} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k, \text{ and} & (17)\\
w &= \sum_{f_i \in \tilde{\Phi}} \pi^{f_i} \lambda^{f_i}. & (18)
\end{aligned}$$

That is, we can associate with each modification function a probability, $\pi^{f_i}$, and a utility vector, $\lambda^{f_i}$. Thus, a natural estimate for $w$ is the expected utility vector. Note that $\sum_{f_i \in \tilde{\Phi}} \pi^{f_i}$ need not sum to one; the remaining mass, $\xi/C$, is assigned to the zero utility vector.

The above observation implies that introducing a slack variable coincides with bounding the L1 norm of the utility vectors under consideration by $C$. This insight suggests that we choose $C \geq \|w^*\|_1$, if possible, as smaller values of $C$ will exclude $w^*$ from the feasible set. If a bound on the L1 norm is not available, we may solve the prediction problem on the observed game without slack and use $\|w\|_1$ as a proxy.

The dual formulation of our program has important inherent computational advantages. First, it is an optimization over a simple set that is particularly well-suited for gradient-based optimization, a trait not shared by the primal program. Second, the number of dual variables, $2|\tilde{\Phi}|K$, is typically much smaller than the number of primal variables, $|\tilde{A}| + 2|\Phi|^2$. Though the work per iteration is still a function of $|\tilde{A}|$ (to compute the partition function), these two advantages together let us scale to larger problems than if we consider optimizing the primal objective. Computing the expectations necessary to descend the dual gradient can leverage recent advances in structured, compact game representations: in particular, any graphical game with low treewidth or finite-horizon Markov game (Kakade et al., 2003) enables these computations to be performed in time that scales only polynomially in the number of decision makers or time steps.

Algorithm 1 employs exponentiated gradient descent (Kivinen & Warmuth, 1995) to find an optimal dual solution. The step size parameter, $\gamma$, is commonly taken to be $\sqrt{2 \log(|\tilde{\Phi}|K)}/\Delta$, with $\Delta$ being the largest value in any $\tilde{R}^{f_i}_i$. With this step size, if the optimization is run for $T \geq 2\Delta^2 \log(|\tilde{\Phi}|K)/\varepsilon^2$ iterations, then the dual solution will be within $\varepsilon$ of optimal. Alternatively, one can exactly measure the duality gap on each iteration and halt when the desired accuracy is achieved. This is often preferred, as the lower bound on the number of iterations is conservative in practice.
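The following is our compact sketch of Algorithm 1's exponentiated gradient update, together with the recovery of $\hat{\sigma}$ from Lemma 4 and of a utility estimate via Equations (16)-(18). It assumes the same data layout as the earlier snippets, treats prediction in a single game (i.e., $\tilde{\Gamma} = \Gamma$), and is not the authors' code.

import numpy as np

def dual_maxent_ice(R, demo_regrets, C=10.0, T=2000, gamma=1.0):
    """R: array (F, n_outcomes, K) of regret vectors r^{f_i}_{i,a};
    demo_regrets: array (F, K) with rows tilde{sigma}^T R^{f_j}_j."""
    F, n, K = R.shape
    alpha = np.full((F, K), 1.0 / (F * K + 1))
    beta = np.full((F, K), 1.0 / (F * K + 1))
    for t in range(1, T + 1):
        lam = alpha - beta                                # lambda^{f_i} = alpha - beta
        z = np.exp(-np.einsum('fak,fk->a', R, lam))       # unnormalized sigma_a
        Z = z.sum()
        scores = demo_regrets @ lam.T                     # [j, i]: demonstrated regret of f_j under lambda^{f_i}
        best = demo_regrets[np.argmax(scores, axis=0)]    # best-response row for each f_i
        model = np.einsum('a,fak->fk', z, R) / Z          # model expectation of r^{f_i}_{i,a}
        g = best - model                                  # gradient g^{f_i}
        step = gamma / np.sqrt(t)
        a_new = alpha * np.exp(-step * g)
        b_new = beta * np.exp(step * g)
        rho = 1.0 + a_new.sum() + b_new.sum()             # project onto the scaled simplex
        alpha, beta = C * a_new / rho, C * b_new / rho
    lam = alpha - beta
    z = np.exp(-np.einsum('fak,fk->a', R, lam))
    sigma_hat = z / z.sum()                               # Lemma 4
    pi = (alpha + beta).sum(axis=1) / C                   # Equation (17)
    w_hat = (pi[:, None] * lam).sum(axis=0)               # Equations (16) and (18)
    return sigma_hat, w_hat

For transfer, R would instead be built from the new game $\tilde{\Gamma}$ while demo_regrets continues to summarize the observed game, matching the inputs of Algorithm 1.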

5. Sample Complexity

In practice, we do not have full access to the agents' true behavior – if we did, prediction would be straightforward and not require our estimation technique. Instead, we can only approximate it through finite observation of play. In real applications there are costs associated with gathering these observations and, thus, there are inherent limitations on the quality of this approximation. In this section, we will analyze the sensitivity of our approach to these types of errors.

First, although $|A|$ is exponential in the number of players, our technique only accesses $\tilde{\sigma}$ through products of the form $\tilde{\sigma}^T R^{f_j}_j$. That is, we need only approximate these products accurately, not the distribution $\tilde{\sigma}$ itself. As a result, we can bound the approximation error in terms of $|\Phi|$ and $K$.

Theorem 2. With probability at least $1 - \delta$, for any $w$, by observing $M \geq \frac{2}{\varepsilon^2} \log \frac{2|\Phi|K}{\delta}$ outcomes we have $R^{\Phi}(\tilde{\sigma}, w) \leq R^{\Phi}(\sigma, w) + \varepsilon\Delta\, \|w\|_1$.

The proof is an application of Hoeffding's inequality and is provided in the Appendix (Waugh et al., 2011). As an immediate corollary, considering only the true, but unknown, reward function $w^*$:

Corollary 1. With probability at least $1 - \delta$, by sampling according to the above rule, $R^{\Phi}(\hat{\sigma}, w^*) \leq R^{\Phi}(\sigma, w^*) + (\varepsilon\Delta + \nu)\, \|w^*\|_1$ for $\hat{\sigma}$ with slack $\nu$.

That is, so long as we assume bounded utility, with high probability we need only logarithmically many samples in terms of $|\Phi|$ and $K$ to closely approximate $\tilde{\sigma}^T R^{f_j}_j$ and avoid a large violation of our rationality condition.
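As a rough, invented sense of scale: with $K = 4$ utility features, $|\Phi| \approx 100$ internal modification functions, confidence $\delta = 0.05$, and accuracy $\varepsilon = 0.1$, Theorem 2 asks for

$$M \;\geq\; \frac{2}{(0.1)^2} \log \frac{2 \cdot 100 \cdot 4}{0.05} \;=\; 200 \log(16{,}000) \;\approx\; 1.9 \times 10^3$$

observations, which is far smaller than the number of joint outcomes in even a modestly sized game.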

We note that choosing $\Phi = \Phi^{\mathrm{int}}$ is particularly appealing, as $|\Phi^{\mathrm{int}}| \leq |N|\,|A_i|^2$, compared to $|\Phi^{\mathrm{swap}}| \leq |N|\,|A_i|!$. As internal regret closely approximates swap regret, we do not lose much of the strategic complexity by choosing the more limited set, but we require both fewer observations and fewer computational resources.

6. Experimental Results

To evaluate our approach experimentally, we designed a simple routing game shown in Figure 1. Seven drivers in this game choose how to travel home during rush hour after a long day at the office. The different road segments have varying capacities, visualized by the line thickness in the figure, that make some of them more or less susceptible to congestion or to traffic accidents. Upon arrival home, each driver records the total time and distance they traveled, the gas that they used, and the amount of time they spent stopped at intersections or in traffic jams – their utility features.



Figure 1. A visualization of the routing game.

In this game, each of the drivers chooses from four possible routes (solid lines in Figure 1), yielding over 16,000 possible outcomes. We obtained an $\varepsilon$-social welfare maximizing correlated equilibrium for those drivers where the drivers preferred mainly to minimize their travel time, but were also slightly concerned with gas usage. The demonstrated behavior $\tilde{\sigma}$ was sampled from this true behavior distribution $\sigma$.

First, we evaluate the differences between the true behavior distribution $\sigma$ and the predicted behavior distribution $\hat{\sigma}$ trained from observed behavior sampled from $\sigma$. In Figure 2 we compare the prediction accuracy when varying the number of observations using log-loss, $-\sum_{a \in A} \sigma_a \log \hat{\sigma}_a$. The baseline algorithms we compare against are: a maximum likelihood estimate of the distribution over the joint-actions with a uniform prior, an exponential family distribution parameterized by the outcome's utilities trained with logistic regression, and a maximum entropy inverse optimal control approach (Ziebart et al., 2008a) trained individually for each player.

In Figure 2, we see that MaxEnt ICE predicts behavior with higher accuracy than all other algorithms when the number of observations is limited. In particular, it achieves close to its best performance with as few as 16 observations. The maximum likelihood estimator eventually overtakes it, as expected, since it will ultimately converge to $\sigma$, but only after 10,000 observations, or about as many observations as there are outcomes in the game. This experiment demonstrates that learning underlying utility functions to estimate observed behavior can be much more data-efficient for small sample sizes, and additionally, that the regret-based assumptions of MaxEnt ICE are both reasonable and beneficial in our strategic routing game setting.

Figure 2. Prediction error (log-loss) as a function of number of observations, comparing MLE, MaxEnt IOC, the Logistic Model, and MaxEnt ICE.

Table 1. Transfer error (log-loss) on unobserved games.

  Problem        Logistic Model    MaxEnt ICE
  Add Highway    4.177             3.093
  Add Driver     4.060             3.477
  Gas Shortage   3.498             3.137
  Congestion     3.345             2.965

Next, we evaluate behavior transfer from this routing game to four similar games, the results of which are displayed in Table 1. The first game, Add Highway, adds the dashed route to the game. That is, we model the city building a new highway. The second game, Add Driver, adds another driver to the game. The third game, Gas Shortage, keeps the structure of the game the same, but changes the reward function to make gas mileage more important to the drivers. The final game, Congestion, adds construction to the major roadway, delaying the drivers.

These transfer experiments even more directly demonstrate the benefits of learning utility weights rather than directly learning the joint-action distribution; direct strategy-learning approaches are incapable of being applied to the general transfer setting. Thus, we only compare against the Logistic Model. We see from Table 1 that MaxEnt ICE outperforms the Logistic Model in all of our tests. For reference, in these new games, the uniform strategy has a loss of approximately 6.8 in all games, and the true behavior has a loss of approximately 2.7.


7. Conclusion

In this paper, we extended inverse optimal control to multi-agent settings by combining the principle of maximum entropy with the game-theoretic notion of regret. We observed that our formulation has a particularly appealing dual program, which led to a simple gradient-based optimization procedure. Perhaps the most appealing quality of our technique is its theoretical and practical sample complexity. In our experiments, MaxEnt ICE performed exceptionally well after only 0.1% of the game had been observed.

Acknowledgments

This work is supported by the ONR MURI grant N00014-09-1-1052 and by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2004.

Berry, S., Levinsohn, J., and Pakes, A. Automobile prices in market equilibrium. Econometrica, 63(4):841–890, July 1995.

Blum, A. and Mansour, Y. Algorithmic Game Theory, chapter Learning, Regret Minimization and Equilibria, pp. 79–102. Cambridge University Press, 2007.

Della Pietra, S., Della Pietra, V., and Lafferty, J. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 2002. ISSN 0162-8828.

Grunwald, P. D. and Dawid, A. P. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2003.

Henry, P., Vollmer, C., Ferris, B., and Fox, D. Learning to navigate through crowded environments. In Proceedings of Robotics and Automation, 2010.

Jaynes, E. T. Information theory and statistical mechanics. Physical Review, 106(4):620–630, May 1957.

Kakade, S., Kearns, M., Langford, J., and Ortiz, L. Correlated equilibria in graphical games. In Proceedings of Electronic Commerce, pp. 42–47, 2003.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1995.

Nevo, A. Measuring market power in the ready-to-eat cereal industry. Econometrica, 69(2):307–342, March 2001.

Ng, A. and Russell, S. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2000.

Ortiz, L. E., Schapire, R. E., and Kakade, S. M. Maximum entropy correlated equilibrium. In Proceedings of Artificial Intelligence and Statistics, pp. 347–354, 2007.

Osborne, M. J. and Rubinstein, A. A Course in Game Theory. The MIT Press, 1994. ISBN 0262650401.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. Maximum margin planning. In Proceedings of the International Conference on Machine Learning, 2006.

Ratliff, N. D., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.

Silver, D., Bagnell, J. A., and Stentz, A. Learning from demonstration for autonomous navigation in complex unstructured terrain. International Journal of Robotics Research, 29(1):1565–1592, October 2010.

Waugh, K., Ziebart, B., and Bagnell, J. A. Computational rationalization: The inverse equilibrium problem. arXiv, abs/1103.5254, 2011.

Yang, Z. Correlated equilibrium and the estimation of discrete games of complete information. Working paper, http://www.econ.vt.edu/faculty/2008vitas research/joeyang research.htm, 2009.

Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2008a.

Ziebart, B. D., Maas, A., Dey, A. K., and Bagnell, J. A. Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior. In Proceedings of the International Conference on Ubiquitous Computing, 2008b.

Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the International Conference on Machine Learning, 2010.


Appendix

Rationality Properties and Primal Programs

The proof of Theorem 1 relies upon the following technical lemmas.

Lemma 5. $b^T w \leq \max_{a_i \in A} a_i^T w \;\Leftrightarrow\; \exists \lambda \in \Delta_A$ s.t. $b^T w \leq \lambda^T A w$.

Proof of Lemma 5. Given $b^T w \leq \max_{a_i \in A} a_i^T w$, choose

$$\lambda_i = \begin{cases} 1 & \text{if } a_i = \operatorname{argmax}_{a_i \in A} a_i^T w \\ 0 & \text{otherwise} \end{cases} \qquad (19)$$

Thus, $b^T w \leq \max_{a_i \in A} a_i^T w = \lambda^T A w$.

Given $\exists \lambda \in \Delta_A$ s.t. $b^T w \leq \lambda^T A w$,

$$\begin{aligned}
b^T w &\leq \lambda^T A w & (20)\\
&\leq \sum_{a_j \in A} \lambda_{a_j} \max_{a_i \in A} a_i^T w & (21)\\
&= \left[ \max_{a_i \in A} a_i^T w \right] \sum_{a_j \in A} \lambda_{a_j} & (22)\\
&= \max_{a_i \in A} a_i^T w & (23)
\end{aligned}$$

Lemma 6. $\forall w \in \mathbb{R}^K, \; b^T w \leq \max_{a_i \in A} a_i^T w \;\Leftrightarrow\; \exists \lambda \in \Delta_A$ s.t. $b = \lambda^T A$.

Proof of Lemma 6.

$$\begin{aligned}
& \forall w \in \mathbb{R}^K, \; b^T w \leq \max_{a_i \in A} a_i^T w & (24)\\
\Leftrightarrow\; & \forall w \in \mathbb{R}^K, \exists \lambda \in \Delta_A \text{ s.t. } b^T w \leq \lambda^T A w & (25)\\
\Leftrightarrow\; & \forall w \in \mathbb{R}^K, \exists \lambda \in \Delta_A \text{ s.t. } \left[ b - \lambda^T A \right]^T w \leq 0 & (26)
\end{aligned}$$

$\Leftrightarrow$ the following linear program has optimal value 0:

$$\max_{w, t}\; b^T w - t \qquad (27)$$
$$\text{subject to: } t \geq a_i^T w, \;\forall a_i \in A.$$

The following linear feasibility problem is the dual of the above program:

$$\min_{\lambda}\; 0 \qquad (28)$$
$$\text{subject to: } b = \lambda^T A, \quad \lambda \in \Delta_A.$$

By strong duality for linear programming, the primal has value 0 if and only if the dual is feasible, which is exactly when $\exists \lambda \in \Delta_A$ s.t. $b = \lambda^T A$.


Proof of Theorem 1.

$$\begin{aligned}
& \forall w \in \mathbb{R}^K, \; R^{\Phi}(\hat{\sigma}, w) \leq R^{\Phi}(\tilde{\sigma}, w) & (29)\\
\Leftrightarrow\; & \forall w \in \mathbb{R}^K, \; \max_{f_i \in \Phi} \hat{\sigma}^T R^{f_i}_i w \leq \max_{f_i \in \Phi} \tilde{\sigma}^T R^{f_i}_i w & (30)\\
\Leftrightarrow\; & \forall f_i \in \Phi, \forall w \in \mathbb{R}^K, \; \hat{\sigma}^T R^{f_i}_i w \leq \max_{f_j \in \Phi} \tilde{\sigma}^T R^{f_j}_j w & (31)\\
\Leftrightarrow\; & \forall f_i \in \Phi, \exists \eta^{f_i} \in \Delta_{\Phi} \text{ s.t. } \hat{\sigma}^T R^{f_i}_i = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j & (32)
\end{aligned}$$

The last step makes use of our second technical lemma.

Derivation of the Dual Program

The Lagrange dual is

$$\begin{aligned}
\min_{\alpha,\beta,\gamma,\delta,u,v,\xi}\; \max_{\hat{\sigma},\eta,\nu}\;\;
& -\sum_{a \in \tilde{A}} \hat{\sigma}_a \log \hat{\sigma}_a - C\nu
- \sum_{f_i \in \tilde{\Phi}} \left[ \hat{\sigma}^T \tilde{R}^{f_i}_i - \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j - \nu \right] \alpha^{f_i} & (33)\\
& - \sum_{f_i \in \tilde{\Phi}} \left[ \sum_{f_j \in \Phi} \eta^{f_i}_{f_j}\, \tilde{\sigma}^T R^{f_j}_j - \hat{\sigma}^T \tilde{R}^{f_i}_i - \nu \right] \beta^{f_i} & (34)\\
& + \sum_{f_i \in \tilde{\Phi}} \left[ 1 - \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \right] \gamma^{f_i} + \left[ 1 - \sum_{a \in \tilde{A}} \hat{\sigma}_a \right] \delta & (35)\\
& + \sum_{f_i \in \tilde{\Phi}} \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} u^{f_i}_{f_j} + \sum_{a \in \tilde{A}} \hat{\sigma}_a v_a + \nu\xi & (36)\\
\text{subject to: } & \alpha, \beta, u, v, \xi \geq 0 & (37)
\end{aligned}$$

To solve the unconstrained inner optimization, we take derivatives with respect to $\hat{\sigma}$, $\eta$ and $\nu$ and set them equal to 0:

$$\begin{aligned}
& -\log \hat{\sigma}_a - 1 - \sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) - \delta + v_a = 0, & (38)\\
& \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i}) - \gamma^{f_i} + u^{f_i}_{f_j} = 0, \;\forall f_i \in \tilde{\Phi}, f_j \in \Phi, \text{ and} & (39)\\
& \xi - C + \sum_{f_i \in \tilde{\Phi}} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k = 0. & (40)
\end{aligned}$$

Substituting into the Lagrangian, we get

$$\begin{aligned}
\min_{\alpha,\beta,\gamma,\delta,u,v,\xi}\;\; & \sum_{f_i \in \tilde{\Phi}} \gamma^{f_i} + \delta + \exp(-1-\delta) \sum_{a \in \tilde{A}} \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) + v_a \right) & (41)\\
\text{subject to: } & \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i}) - \gamma^{f_i} + u^{f_i}_{f_j} = 0, \;\forall f_i \in \tilde{\Phi}, f_j \in \Phi & (42)\\
& \xi + \sum_{f_i \in \tilde{\Phi}} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k = C, & (43)\\
& \alpha, \beta, u, v, \xi \geq 0. & (44)
\end{aligned}$$


We note that $u$ are slack variables, and that, by inspection, $v = 0$ at optimality. Thus, an equivalent program is

$$\begin{aligned}
\min_{\alpha,\beta,\gamma,\delta,\xi}\;\; & \sum_{f_i \in \tilde{\Phi}} \gamma^{f_i} + \delta + \exp(-1-\delta) \sum_{a \in \tilde{A}} \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right) & (45)\\
\text{subject to: } & \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i}) \leq \gamma^{f_i}, \;\forall f_i \in \tilde{\Phi}, f_j \in \Phi & (46)\\
& \xi + \sum_{f_i \in \tilde{\Phi}} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k \leq C, & (47)\\
& \alpha, \beta, \xi \geq 0. & (48)
\end{aligned}$$

We eliminate $\delta$ by setting its partial derivative to 0 and solving for $\delta$,

$$\delta = \log \sum_{a \in \tilde{A}} \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right) - 1, \qquad (49)$$

and substituting back into the objective:

$$\begin{aligned}
\min_{\alpha,\beta,\gamma,\xi}\;\; & \sum_{f_i \in \tilde{\Phi}} \gamma^{f_i} + \log \sum_{a \in \tilde{A}} \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right) - 1 & (50)\\
\text{subject to: } & \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i}) \leq \gamma^{f_i}, \;\forall f_i \in \tilde{\Phi}, f_j \in \Phi & (51)\\
& \xi + \sum_{f_i \in \tilde{\Phi}} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k \leq C, & (52)\\
& \alpha, \beta, \xi \geq 0. & (53)
\end{aligned}$$

By inspection, at optimality, $\gamma^{f_i} = \max_{f_j \in \Phi} \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i})$. Thus an equivalent program is

$$\begin{aligned}
\min_{\alpha,\beta,\xi}\;\; & \sum_{f_i \in \tilde{\Phi}} \left[ \max_{f_j \in \Phi} \tilde{\sigma}^T R^{f_j}_j (\alpha^{f_i} - \beta^{f_i}) \right] + \log \sum_{a \in \tilde{A}} \exp\left( -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) \right) - 1 & (54)\\
& \xi + \sum_{f_i \in \tilde{\Phi}} \sum_{k=1}^K \alpha^{f_i}_k + \beta^{f_i}_k = C, & (55)\\
& \alpha, \beta, \xi \geq 0. & (56)
\end{aligned}$$

Proof of Lemma 4. In the derivation of the dual program, we observed that at optimality

$$\log \hat{\sigma}_a = -1 - \sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) - \delta + v_a. \qquad (57)$$

Noting $v = 0$ and substituting for the optimal $\delta$, we get

$$\log \hat{\sigma}_a = -\sum_{f_i \in \tilde{\Phi}} \tilde{r}^{f_i\,T}_{i,a} (\alpha^{f_i} - \beta^{f_i}) - \log \sum_{a' \in \tilde{A}} \exp\left( -\sum_{f_j \in \tilde{\Phi}} \tilde{r}^{f_j\,T}_{j,a'} (\alpha^{f_j} - \beta^{f_j}) \right). \qquad (58)$$

All that remains is to exponentiate both sides.


Sample Complexity

Proof of Theorem 2.

$$\begin{aligned}
P\left( \max_{f_i \in \Phi, k \in K} \left| \tilde{\sigma}^T R^{f_i}_i - \sigma^T R^{f_i}_i \right|_k \geq \varepsilon\Delta \right)
&\leq P\left( \bigcup_{f_i \in \Phi, k \in K} \left| \tilde{\sigma}^T R^{f_i}_i - \sigma^T R^{f_i}_i \right|_k \geq \varepsilon\Delta \right) & (59)\\
&\leq \sum_{f_i \in \Phi, k \in K} P\left( \left| \tilde{\sigma}^T R^{f_i}_i - \sigma^T R^{f_i}_i \right|_k \geq \varepsilon\Delta \right) & (60)\\
&\leq \sum_{f_i \in \Phi, k \in K} 2\exp\left( \frac{-\varepsilon^2 M}{2} \right) & (61)\\
&= 2|\Phi|K \exp\left( \frac{-\varepsilon^2 M}{2} \right) & (62)\\
&\leq \delta & (63)
\end{aligned}$$

We use the union bound in step 2 and Hoeffding's inequality in step 3. Solving for $M$, we get our result:

$$M \geq \frac{2}{\varepsilon^2} \log \frac{2|\Phi|K}{\delta}. \qquad (64)$$

Proof of Corollary 1. We have $\forall w, \; R^{\Phi}(\hat{\sigma}, w) \leq R^{\Phi}(\tilde{\sigma}, w) + \nu\,\|w\|_1$, where $\nu$ depends on the choice of the slack penalty. Thus, we have $R^{\Phi}(\hat{\sigma}, w^*) \leq R^{\Phi}(\tilde{\sigma}, w^*) + \nu\,\|w^*\|_1 \leq R^{\Phi}(\sigma, w^*) + (\varepsilon\Delta + \nu)\,\|w^*\|_1$ with probability at least $1 - \delta$, so long as $M$ is as large as Theorem 2 deems. We can make $\nu$ as small as we like by increasing the slack penalty.


Modeling Interaction via the Principle of Maximum Causal Entropy

Brian D. Ziebart, J. Andrew Bagnell, Anind K. Dey

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

The principle of maximum entropy provides a powerful framework for statistical models of joint, conditional, and marginal distributions. However, there are many important distributions with elements of interaction and feedback where its applicability has not been established. This work presents the principle of maximum causal entropy—an approach based on causally conditioned probabilities that can appropriately model the availability and influence of sequentially revealed side information. Using this principle, we derive models for sequential data with revealed information, interaction, and feedback, and demonstrate their applicability for statistically framing inverse optimal control and decision prediction tasks.

1. Introduction

The principle of maximum entropy (Jaynes, 1957) serves a foundational role in the theory and practice of constructing statistical models, with applicability to statistical mechanics, natural language processing, econometrics, and ecology (Dudík & Schapire, 2006). Conditional extensions of the principle that consider a sequence of side information (i.e., additional variables that are not predicted, but are related to variables that are predicted), and specifically conditional random fields (Lafferty et al., 2001), have been applied with remarkable success in recognition, segmentation, and classification tasks, and are a preferred tool in natural language processing and activity recognition.

This work extends the maximum entropy approach to conditional probability distributions in settings characterized by interaction with stochastic processes where side information is dynamic, i.e., revealed over time. For example, consider an agent interacting with a stochastic environment. The agent may have a model for the distribution of future states given its current state and possible actions, but, due to stochasticity, it does not know what value a future state will take until after selecting the sequence of actions temporally preceding it. Thus, future states have no causal influence over earlier actions. Conditional maximum entropy approaches are ill-suited for this setting as they assume all side information is available a priori.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Building on the recent advance of the Marko-Massey theory of directed information (Massey, 1990), we present the principle of maximum causal entropy (MaxCausalEnt). It prescribes a probability distribution by maximizing the entropy of a sequence of variables conditioned on the side information available at each time step. This contribution extends the maximum entropy framework for statistical modeling to processes with information revelation, feedback, and interaction. We motivate and apply this approach on decision prediction tasks, where actions stochastically influence revealed side information (the state of the world), with examples from inverse optimal control, multi-player dynamic games, and interaction with partially observable systems.

Though we focus on the connection to decision making and control in this work, it is important to note that the principle of maximum causal entropy is not specific to those domains. It is a general approach that is applicable to any sequential data where future side information's assumed lack of causal influence over earlier variables is reasonable.

2. Maximum Causal Entropy

Motivated by the task of modeling decisions with elements of sequential interaction, we introduce the principle of maximum causal entropy, describe its core theoretical properties, and provide efficient algorithms for inference and learning.


2.1. Preliminaries

When faced with an ill-posed problem, the principle of maximum entropy (Jaynes, 1957) prescribes the use of "the least committed" probability distribution that is consistent with known problem constraints. This criterion is formally measured by Shannon's information entropy, $\mathbb{E}_Y[-\log P(Y)]$, and many of the fundamental building blocks of statistics, including Gaussian and Markov random field distributions, maximize this entropy subject to moment constraints.

In the presence of side information, $\mathbf{X}$, that we do not desire to model, the standard prescription is to maximize the conditional entropy, $\mathbb{E}_{\mathbf{Y},\mathbf{X}}[-\log P(\mathbf{Y}|\mathbf{X})]$, yielding, for example, the conditional random field (CRF) (Lafferty et al., 2001). Though our intention is to similarly model conditional probability distributions, CRFs assume a knowledge of future side information, $X_{t+1:T}$, for each $Y_t$ that does not match settings with dynamically revealed information. Despite not being originally formulated for such uses, marginalizing over the CRF's joint distribution is possible:

$$P(Y_t | \mathbf{X}_{1:t}, \mathbf{Y}_{1:t-1}) \propto \sum_{\mathbf{X}_{t+1:T}, \mathbf{Y}_{t+1:T}} e^{\theta^\top F(\mathbf{X},\mathbf{Y})}\, P(\mathbf{X}_{t+1:T} | \mathbf{X}_{1:t}, \mathbf{Y}_{1:t-1}), \qquad (1)$$

in what we refer to as a latent CRF model. However, we argue that entropy-based approaches like this that do not address the causal influence of side information are inadequate for interactive settings.

In the context of this paper, we focus on modeling the sequential actions of an agent interacting with a stochastic environment. Thus, we replace the predicted variables, $\mathbf{Y}$, and side information, $\mathbf{X}$, with sequences of action variables, $\mathbf{A}$, and state variables, $\mathbf{S}$.

2.2. Directed Information and Causal Entropy

The causally conditioned probability (Kramer, 1998) from the Marko-Massey theory of directed information (Massey, 1990) is a natural extension of the conditional probability, $P(\mathbf{A}|\mathbf{S})$, to the situation where each $A_t$ is conditioned on only a portion of the $\mathbf{S}$ variables, $S_{1:t}$, rather than the entirety, $S_{1:T}$. Following the previously developed notation (Kramer, 1998), the probability of $\mathbf{A}$ causally conditioned on $\mathbf{S}$ is

$$P(\mathbf{A}^T \| \mathbf{S}^T) \triangleq \prod_{t=1}^T P(A_t | S_{1:t}, A_{1:t-1}). \qquad (2)$$

The subtle, but significant, difference from the conditional probability, $P(\mathbf{A}|\mathbf{S}) = \prod_{t=1}^T P(A_t | S_{1:T}, A_{1:t-1})$, serves as the underlying basis for our approach.

Causal entropy (Kramer, 1998; Permuter et al., 2008),

$$H(\mathbf{A}^T \| \mathbf{S}^T) \triangleq \mathbb{E}_{\mathbf{A},\mathbf{S}}\left[ -\log P(\mathbf{A}^T \| \mathbf{S}^T) \right] = \sum_{t=1}^T H(A_t | S_{1:t}, A_{1:t-1}), \qquad (3)$$

measures the uncertainty present in the causally conditioned distribution. It upper bounds the conditional entropy; intuitively, this reflects the fact that additionally conditioning on information from the future (i.e., acausally) only decreases uncertainty. The causal entropy has previously found applicability in the analysis of communication channels with feedback (Kramer, 1998), decentralized control (Tatikonda & Mitter, 2004), and sequential investment and online compression with side information (Permuter et al., 2008).

Using this notation, any joint distribution can be expressed as $P(\mathbf{A},\mathbf{S}) = P(\mathbf{A}^T \| \mathbf{S}^T)\, P(\mathbf{S}^T \| \mathbf{A}^{T-1})$. Our approach estimates a policy, i.e., the factors $P(A_t|S_{1:t},A_{1:t-1})$ of $P(\mathbf{A}^T \| \mathbf{S}^T)$, based on a provided (explicitly or implicitly) distribution of side information, $P(\mathbf{S}^T \| \mathbf{A}^{T-1}) = \prod_t P(S_t | A_{1:t-1}, S_{1:t-1})$.

2.3. Maximum Causal Entropy Optimization

With the causal entropy (Equation 3) as our objective function, we now pose and solve the maximum causal entropy optimization problem. We constrain our distribution to match expected feature functions, $F(\mathbf{S},\mathbf{A})$, with empirical expectations of those same functions, $\tilde{\mathbb{E}}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A})]$, yielding the following optimization problem:

$$\begin{aligned}
\operatorname*{argmax}_{\{P(A_t|S_{1:t},A_{1:t-1})\}}\;\; & H(\mathbf{A}^T \| \mathbf{S}^T) & (4)\\
\text{such that: } & \mathbb{E}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A})] = \tilde{\mathbb{E}}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A})]\\
& \forall S_{1:t}, A_{1:t-1}: \;\sum_{A_t} P(A_t | S_{1:t}, A_{1:t-1}) = 1,\\
\text{and given: } & P(\mathbf{S}^T \| \mathbf{A}^{T-1}).
\end{aligned}$$

We assume for explanatory simplicity that feature functions factor as $F(\mathbf{S},\mathbf{A}) = \sum_t F(S_t, A_t)$, and that state dynamics are first-order Markovian, $P(\mathbf{S}^T \| \mathbf{A}^{T-1}) = \prod_t P(S_t | A_{t-1}, S_{t-1})$.

Theorem 1. The distribution satisfying the maximum causal entropy constrained optimization (Equation 4) has a form defined recursively as:

$$\begin{aligned}
P_\theta(A_t|S_t) &= \frac{Z_{A_t|S_t,\theta}}{Z_{S_t,\theta}} & (5)\\
\log Z_{A_t|S_t,\theta} &= \theta^\top F(S_t, A_t) + \sum_{S_{t+1}} P(S_{t+1}|S_t, A_t) \log Z_{S_{t+1},\theta}\\
\log Z_{S_t,\theta} &= \log \sum_{A_t} Z_{A_t|S_t,\theta} = \operatorname*{softmax}_{A_t} \log Z_{A_t|S_t,\theta}
\end{aligned}$$


where $\operatorname{softmax}_x f(x) \triangleq \log \sum_x e^{f(x)}$.

Proof (sketch). The (negated) primal objective function (Equation 4) is convex in the variables $P(\mathbf{A}\|\mathbf{S})$ and subject to linear constraints on feature function expectation matching, valid probability distributions, and the non-causal influence of future side information. Differentiating the Lagrangian of the maximum causal entropy optimization (Equation 4), and equating to zero, we obtain the general form:

$$P_\theta(A_t|S_t) \propto \exp\left\{ \theta^\top \mathbb{E}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A}) | S_t, A_t] - \sum_{\tau > t} \mathbb{E}_{\mathbf{S},\mathbf{A}}[\log P_\theta(A_\tau|S_\tau) | S_t, A_t] \right\}. \qquad (6)$$

Substituting the more operational recurrence of Equation 5 into Equation 6 verifies the theorem.

We note that Theorem 1 relies on strong duality to identify the form of this probability distribution; the sharp version of Slater's condition (Boyd & Vandenberghe, 2004), using the existence of a feasible point in the relative interior, ensures this but requires that (1) prescribed feature constraints are achievable, and (2) the distribution has full support. The first naturally follows if both model and empirical expectations are taken with respect to the provided model of side information, $P(\mathbf{S}^T \| \mathbf{A}^{T-1})$. For technical simplicity in this work, we will further assume full support for the modeled distribution, although relatively simple modifications (e.g., constraints hold within a small deviation $\varepsilon$) ensure the correctness of this form in all cases.

Theorem 2. The gradient of the dual with respect to $\theta$ is $\tilde{\mathbb{E}}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A})] - \mathbb{E}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A})]$, which is the difference between the empirical feature vector, given the complete policy $\{P(A_t|S_t)\}$, and the expected feature vector under the probabilistic model.

In many instances, the statistics of interest, $\tilde{\mathbb{E}}_{\mathbf{S},\mathbf{A}}[F(\mathbf{S},\mathbf{A})]$, for the gradient (Theorem 2) are only known approximately, as they are obtained from small sample sets. We note that this uncertainty can be rigorously addressed by extending the duality analysis of Dudík & Schapire (2006), leading to parameter regularization that may be naturally adopted in the causal setting as well.

Theorem 3. The maximum causal entropy distribution minimizes the worst-case prediction log-loss, i.e.,

$$\inf_{\hat{P}(\mathbf{A}\|\mathbf{S})} \; \sup_{P(\mathbf{A}^T\|\mathbf{S}^T)} \; -\sum_{\mathbf{A},\mathbf{S}} P(\mathbf{A},\mathbf{S}) \log \hat{P}(\mathbf{A}^T\|\mathbf{S}^T),$$

given $P(\mathbf{A},\mathbf{S}) = P(\mathbf{A}^T\|\mathbf{S}^T)\, P(\mathbf{S}^T\|\mathbf{A}^{T-1})$ and feature expectations $\mathbb{E}_{P(\mathbf{S},\mathbf{A})}[F(\mathbf{S},\mathbf{A})]$, when $\mathbf{S}$ is sequentially revealed from a known distribution and actions are sequentially predicted using only previously revealed variables.

Theorem 3 follows naturally from Grunwald & Dawid (2003) and extends their "robust Bayes" results to the interactive setting as one justification for the maximum causal entropy approach. The theorem can be understood by viewing maximum causal entropy as a maximin game where nature chooses a distribution to maximize a predictor's perplexity while the predictor tries to minimize it. By duality, the minimax view of the theorem is equivalent. This strong result is not shared when maximizing alternate entropy measures (e.g., conditional or joint entropy) and marginalizing out future side information (as in Equation 1).

2.4. Inference and Learning Algorithms

The procedure for inferring decision probabilities inthe MaxCausalEnt model based on Theorem 1 is illus-trated by Algorithm 1.

Algorithm 1 MaxCausalEnt Inference Procedure

1: for t = T to 1 do
2:   if t = T then
3:     ∀ A_t, S_t: log Z_{A_t|S_t, θ} ← θ⊤F(A_t, S_t)
4:   else
5:     ∀ A_t, S_t: log Z_{A_t|S_t, θ} ← θ⊤F(A_t, S_t) + E_{S_{t+1}}[log Z_{S_{t+1}, θ}|S_t, A_t]
6:   end if
7:   ∀ S_t: log Z_{S_t, θ} ← softmax_{A_t} log Z_{A_t|S_t, θ}
8:   ∀ A_t, S_t: P(A_t|S_t) ← Z_{A_t|S_t, θ} / Z_{S_t, θ}
9: end for
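As a concrete illustration, a minimal tabular sketch of this backward recursion is given below; the array layout, function name, and the assumption of time-invariant features and dynamics are our own, not specifics from the paper.

import numpy as np
from scipy.special import logsumexp

def maxcausalent_inference(theta, features, P_trans, T):
    """Backward recursion of Algorithm 1 for a small tabular problem.

    features: (S, A, K) array holding F(s, a).
    P_trans:  (S, A, S) array holding P(s' | s, a).
    Returns a list of T per-time-step policies, each of shape (S, A)."""
    reward = features @ theta                      # theta^T F(s, a), shape (S, A)
    log_Z_state = np.zeros(P_trans.shape[0])       # log Z beyond the horizon is 0
    policies = [None] * T
    for t in range(T - 1, -1, -1):                 # t = T, ..., 1
        log_Z_action = reward + P_trans @ log_Z_state       # lines 3 and 5
        log_Z_state = logsumexp(log_Z_action, axis=1)        # softmax, line 7
        policies[t] = np.exp(log_Z_action - log_Z_state[:, None])  # line 8
    return policies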

Using the resulting action distributions and empirical feature functions, Ẽ[F], the gradient is obtained by employing Algorithm 2.

Algorithm 2 MaxCausalEnt Gradient Calculation

1: for t = 1 to T do
2:   if t = 1 then
3:     ∀ S_t, A_t: D_{S_t,A_t} ← P(S_t) P(A_t|S_t)
4:   else
5:     ∀ S_t, A_t: D_{S_t,A_t} ← ∑_{S_{t−1},A_{t−1}} D_{S_{t−1},A_{t−1}} P(S_t|S_{t−1}, A_{t−1}) P(A_t|S_t)
6:   end if
7:   E[F] ← E[F] + ∑_{S_t,A_t} D_{S_t,A_t} F(A_t, S_t)
8: end for
9: ∇_θ log P(A||S) ← Ẽ[F] − E[F]
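In the same tabular setting, the occupancy propagation and gradient of Algorithm 2 can be sketched as follows (again, the names and array shapes are our own assumptions):

import numpy as np

def maxcausalent_gradient(policies, features, P_trans, P_init, emp_features):
    """Forward pass of Algorithm 2: propagate state-action occupancies D,
    accumulate model feature expectations, and return the gradient.

    policies:     list of (S, A) arrays from the inference sketch above
    P_init:       (S,) initial state distribution
    emp_features: (K,) empirical feature expectations from demonstrations."""
    D = P_init[:, None] * policies[0]                        # line 3
    model_features = np.einsum('sa,sak->k', D, features)     # line 7
    for t in range(1, len(policies)):
        state_marginal = np.einsum('sa,sap->p', D, P_trans)  # push D through dynamics
        D = state_marginal[:, None] * policies[t]            # line 5
        model_features += np.einsum('sa,sak->k', D, features)
    return emp_features - model_features                     # line 9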

As a consequence of convexity, standard gradient-based optimization techniques converge to the maximum likelihood estimate of feature weights, θ.
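For instance, plain gradient ascent could simply alternate the two sketches above; the loop below is only illustrative (the learning rate and iteration count are our own choices, and any convex optimizer would do):

def fit_theta(features, P_trans, P_init, emp_features, T, lr=0.1, iters=200):
    """Illustrative gradient ascent to the maximum likelihood feature weights."""
    theta = np.zeros(features.shape[-1])
    for _ in range(iters):
        policies = maxcausalent_inference(theta, features, P_trans, T)
        theta += lr * maxcausalent_gradient(policies, features, P_trans,
                                            P_init, emp_features)
    return theta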

2.5. Graphical Representation

We extend the influence diagram graphical framework (Howard & Matheson, 1984) as a convenient representation for the MaxCausalEnt variables and their relationships. The structural elements of the representation are outlined in Table 1.


Table 1. Graphical representation structural elements.

Type           Parent relationship
Decision       Specifies observed variables when A is selected
Uncertainty    Specifies conditional probability, P(S|par(S))
Utility        Specifies feature functions, θ⊤F(par(U)) → ℝ

We constrain the graph to possess perfect recall¹. Every graphical representation can then be reduced to the earlier causal entropy form (Equation 4) by marginalizing over each decision's non-parent uncertainty nodes to obtain side information transition dynamics and expected feature functions. Whereas in traditional influence diagrams, the parent-dependent values for decision nodes that provide the highest expected utility are inferred, in the MaxCausalEnt setting, utilities should be interpreted as potentials for inferring a probability distribution over decisions.

3. Applications

We now present a series of applications with increasing complexity of interaction: (1) control with stochastic dynamics; (2) multiple agent interaction; and (3) interaction with a partially observable system.

3.1. Inverse Optimal Stochastic Control

Optimal control frameworks, such as Markov decision processes (MDPs) and linear-quadratic regulators (LQRs), provide rich representations of interactions with stochastic systems. Inverse optimal control (IOC) is the problem of recovering a cost function that makes a particular controller or policy (nearly) optimal (Kalman, 1964; Boyd et al., 1994). Recent work has demonstrated that IOC is a powerful technique for modeling the decision-making behavior of intelligent agents in problems as diverse as robotics (Ratliff et al., 2009), personal navigation (Ziebart et al., 2008), and cognitive science (Ullman et al., 2009).

Many IOC approaches (Abbeel & Ng, 2004; Ziebart et al., 2008) consider cost functions linear in a set of features and attempt to find behaviors that induce the same feature counts as the policy to be mimicked (E[∑_t f_{S_t}] = Ẽ[∑_t f_{S_t}]); by linearity such behaviors achieve the same expected value. For settings with vectors of continuous states and actions, matching quadratic moments, e.g., E[∑_t s_t s_t⊤] = Ẽ[∑_t s_t s_t⊤], guarantees equivalent performance under quadratic cost functions, e.g., ∑_t s_t⊤ Q s_t for unknown Q.

¹ Variables observed during previous decisions are either observed in future decisions or irrelevant (i.e., d-separated from future value nodes by observed variables).

Unfortunately, matching feature counts is fundamentally ill-posed—usually no truly optimal policy will achieve those feature counts, but many stochastic policies (and policy mixtures) will satisfy the constraint. Ziebart et al. (2008) resolve this ambiguity by using the classical maximum entropy criteria to select a single policy from all the distributions over decisions that match feature counts. However, for inverse optimal stochastic control (IOSC)—characterized by stochastic state dynamics—their proposed approach is to marginalize over future state variables (Equation 1). For IOSC, the maximum causal entropy approach provides prediction guarantees (Theorem 3) and a softened interpretation of optimal decision theory, while the latent CRF approach provides neither.

3.1.1. MDP and LQR Formulations

In the IOSC problem, side information (states) and decisions (actions) are inter-dependent with the distribution of side information provided by the known dynamics, P(S^T||A^{T−1}) = ∏_t P(S_t|S_{t−1}, A_{t−1}). We employ the MaxCausalEnt framework to model this setting, as shown in Figure 1.

Figure 1. The graphical representation for MaxCausalEnt inverse optimal control. For MDPs: U_t(S_t, A_t) = θ⊤f_{S_t}, and for LQRs: U_t(s_t, a_t) = s_t⊤Q s_t + a_t⊤R a_t.

Using the action-based cost-to-go (Q) and state-based value (V) notation, the inference procedure for MDP MaxCausalEnt IOC reduces to:

Q^soft_θ(A_t, S_t) = E_{S_{t+1}}[V^soft_θ(S_{t+1})|S_t, A_t]   (7)
V^soft_θ(S_t) = softmax_{A_t} Q^soft_θ(A_t, S_t) + θ⊤f_{S_t},

and for the continuous, quadratic-reward setting to

Q^soft_θ(a_t, s_t) = E_{s_{t+1}}[V^soft_θ(s_{t+1})|s_t, a_t] + a_t⊤R a_t   (8)
V^soft_θ(s_t) = softmax_{a_t} Q^soft_θ(a_t, s_t) + s_t⊤Q s_t.

Note that by replacing the softmax² function with the maximum, this algorithm becomes equivalent to the


(stochastic) value iteration algorithm (Bellman, 1957) for finding the optimal control policy. The relative magnitudes of the action values in the MaxCausalEnt model control the amount of stochasticity in the resulting action policy, π_θ(a|s) ∝ e^{Q^soft_θ(a,s)}.

² The continuous version of the softened maximum is defined as: softmax_x f(x) ≜ log ∫_x e^{f(x)} dx.

For the special case where dynamics are linear functions with Gaussian noise, the quadratic MaxCausalEnt model permits a closed-form solution and, given dynamics s_{t+1} ∼ N(A s_t + B a_t, Σ), Equation 8 reduces to:

Q^soft_θ(a_t, s_t) = [a_t; s_t]⊤ [ B⊤DB + R   A⊤DB
                                    B⊤DA        A⊤DA ] [a_t; s_t]

V^soft_θ(s_t) = s_t⊤(C_{s,s} + Q − C_{a,s}⊤ C_{a,a}^{−1} C_{a,s}) s_t + const,

where C and D are recursively computed as: C_{a,a} = B⊤DB; C_{s,a} = C_{a,s}⊤ = B⊤DA; C_{s,s} = A⊤DA; and D = C_{s,s} + Q − C_{a,s}⊤ C_{a,a}^{−1} C_{a,s}.
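A compact sketch of this backward recursion is given below; whether R is folded into the action-action block, the zero terminal value, and the dropped additive constants are our own bookkeeping assumptions rather than details taken from the paper.

import numpy as np

def soft_lqr_value_recursion(A, B, Q, R, horizon):
    """Backward recursion for the quadratic value matrix D described above."""
    n = A.shape[0]
    D = np.zeros((n, n))                # terminal value assumed zero
    Ds = []
    for _ in range(horizon):
        C_aa = B.T @ D @ B + R          # action-action block (R assumed included here)
        C_as = B.T @ D @ A              # action-state block (C_sa = C_as^T)
        C_ss = A.T @ D @ A              # state-state block
        D = C_ss + Q - C_as.T @ np.linalg.solve(C_aa, C_as)
        Ds.append(D)
    return Ds[::-1]                     # D_t for t = 1, ..., horizon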

3.1.2. Inverse Helicopter Control

We demonstrate the MaxCausalEnt approach to IOSC on the problem of building a controller for a helicopter with linearized stochastic dynamics. Existing approaches to IOSC (Ratliff et al., 2006; Abbeel & Ng, 2004) have both practical and theoretical difficulties in the presence of imperfect demonstrated behavior, leading to unstable controllers due to large changes in cost weights (Abbeel et al., 2007) or poor predictive accuracy (Ratliff et al., 2006). To test the robustness of our approach, we generated five 100-time-step sub-optimal training trajectories (Figure 2) by noisily sampling actions from an optimal LQR controller designed for hovering using the linearized stochastic helicopter simulator of Abbeel et al. (2007).

Figure 2. Left: An example sub-optimal helicopter trajectory attempting to hover around the origin point. Right: The average cost under the original cost function of: (1) demonstrated trajectories; (2) the optimal controller using the inverse optimal control model; and (3) the optimal controller using the maximum causal entropy model.

We contrast the policies obtained from the maximum margin planning (Ratliff et al., 2006) (labeled InvOpt in Figure 2) and MaxCausalEnt models trained using demonstrated trajectories. Using the true cost function, we measure the cost of trajectories sampled from the optimal policy under the cost function learned by each model. The InvOpt model performs poorly because there is no optimal trajectory for any cost function that matches demonstrated features. In contrast, by design the MaxCausalEnt model is guaranteed to match the performance of demonstrated behavior (Demo) in expectation even if that behavior is sub-optimal. Additionally, because of the quadratic cost function, the optimal controller using the MaxCausalEnt cost function is always at least as good as the demonstrated behavior on the original, unknown cost function, and often better, as shown in Figure 2. In this sense, MaxCausalEnt provides a rigorous approach to learning a cost function for such stochastic optimal control problems: it is both predictive and can guarantee good performance of the learned controller.

3.2. Inverse Dynamic Games

Modeling the interactions of multiple agents is an important task for uncovering the motives of negotiating parties, planning a robot's movement in a crowded environment, and assessing the perceived roles of interacting agents (Ullman et al., 2009). While game and decision theory can prescribe equilibria or optimal policies when the utilities of agents are known, often the utilities are not known and only observed behavior is available for modeling tasks. We investigate recovering the agents' utilities from those observations.

3.2.1. Markov Game Formulation

We consider a Markov game where agents act in sequence, taking turns after observing the preceding agents' actions and the resulting stochastic outcome sampled from known state dynamics, P(S_{t+1}|A_t, S_t). Agents are assumed to act based on a utility function that is linear in features associated with each state, w_i⊤f_S, and to know the other agents' utilities and policies.

Learning a single agent’s MaxCausalEnt policy, πi,given the others’ policies, π−i, reduces to an IOSCproblem where the entropy of the agent’s actions givenstate is maximized while matching state features:

argmax_{π_i(A|S)} H(A^(i) || S^(i))   (9)
such that: E[∑_t f_i(S_t)] = Ẽ[∑_t f_i(S_t)]
and given: π_{−i}(A|S) and P(S||A).

The distribution of side information (i.e., the agent's next state, S_{t+N}, given the agent's previous state and


action, S_t and A_t) is obtained by marginalizing over the other agents' states and actions:

P(S_{t+N}|A_t, S_t) = E_{S_{t+1:t+N−1}, A_{t+1:t+N−1}}[P(S_{t+N}|S_{t+N−1}, A_{t+N−1}) | S_t, A_t].

Difficulties arise, however, because the policies of other agents are not known in our setting. Instead, all of the agents' utilities and policies are learned from demonstrated trajectories. Maximizing the joint causal entropy of all agents' actions leads either to non-convexity (multiplicative functions of agent policies in the feature matching constraints) or an assumption that agents share a common utility function.

We settle for potentially non-optimal solution policies by employing a cyclic coordinate descent approach. It iteratively maximizes the causal entropy of agent i's actions subject to constraints:

π_i ← argmax_{π_i} F_p(π_i | π_{−i}),

where F_p is the Lagrangian primal of Equation 9. However, instead of matching observed feature counts, which may be infeasible given the estimate of π_{−i}, the expectation of agent i's features given its empirical actions under the current estimate of agent policies, F(S,A) = E_{A,S}[∑_{S∈S} f_S | π], is matched.
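A high-level sketch of this alternating scheme is given below; uniform_policy, induced_dynamics, feature_expectation, and fit_single_agent are hypothetical placeholders for the single-agent IOSC machinery described earlier, not functions from the paper.

def fit_markov_game(demos, n_agents, iters=20):
    """Cyclic coordinate descent over the agents' MaxCausalEnt models.

    demos: demonstrated trajectories used to form empirical feature targets."""
    policies = [uniform_policy(i) for i in range(n_agents)]
    thetas = [None] * n_agents
    for _ in range(iters):
        for i in range(n_agents):
            # Marginalize over the other agents' current policy estimates
            # to obtain agent i's side-information dynamics.
            dynamics_i = induced_dynamics(i, policies)
            # Match agent i's feature expectation under its empirical actions
            # and the current estimate of the other agents' policies.
            target_i = feature_expectation(i, demos, policies)
            thetas[i], policies[i] = fit_single_agent(i, dynamics_i, target_i)
    return thetas, policies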

3.2.2. Pursuit-Evasion Modeling

We consider a generalization of the pursuit-evasion multi-agent setting (Parsons, 1976) with three agents operating in a four-by-four grid world (Figure 3). Each agent has a mobility, m_i ∈ [0.2, 1], which corresponds to the probability of success when attempting to move in one of the four cardinal directions, and a utility, w_{i,j} ∈ [−1, 1], for being co-located with each of the other agents. Unlike traditional pursuer-evader settings, there is no "capture" event in this game; agents continue to act after being co-located. The mobilities of each agent and a time sequence of their actions and locations are provided, and the task is to predict their future actions.

Figure 3. The pursuit-evasion grid with three agents and their co-location utilities. Agent X has a mobility of m_X and a utility of w_{X,Y} when co-located with agent Y.

We generate data for this setting using the following procedure. First, mobilities and co-location utilities are sampled (uniformly from their domain). Next, optimal policies³, π^{t*}_i(A|S), for a range of time horizons t ∈ {T_0, ..., T_0 + ∆T}, are computed with complete knowledge of other agents' utilities and future policies. A stochastic policy, π_i(A|S), is obtained by first sampling a time horizon from P(t) = 1/∆T, and then sampling an action from the optimal policy for that time horizon. Lastly, starting from random initial locations, trajectories of length 40 time steps (five for training and one for testing) are sampled from the policy and state dynamics for six different parameters. Despite its simplicity, this setting produces surprisingly rich behavior. For example, under certain optimal policies, a first evader will help its pursuer corner a more desirable second evader so that the first evader will be spared.

Figure 4. The average per-action perplexities of the latent CRF model and the MaxCausalEnt model plotted against each other for three agents from six different pursuit-evasion settings. The MaxCausalEnt model outperforms the latent CRF model in the region below the dotted line.

A comparison between the latent CRF model (Equation 1) trained to maximize data likelihood and the maximum causal entropy model is shown in Figure 4 using perplexity, (1/T) ∑_{a_t,s_t} log P(a_t|s_t), as the evaluation metric of predictive performance. MaxCausalEnt consistently outperforms the latent CRF model. The explanation for this is based on the observation that under the latent CRF model, the action-value of Equation 7 is instead: Q(a_t, s_t) = softmax_{s_{t+1}} (V(s_{t+1}) + log P(s_{t+1}|s_t, a_t)). This has a disconcerting interpretation that the agent chooses the next state by "paying" an extra log P(s_{t+1}|s_t, a_t) penalty to ignore the true stochasticity of the state transition dynamics, resulting in an unrealistic probability distribution.
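For reference, one common way to turn held-out action log-probabilities into a per-action perplexity is sketched below; the exponentiated-average-negative-log convention is our choice and may differ in sign or normalization from the quantity plotted in Figure 4.

import numpy as np

def per_action_perplexity(log_probs):
    """Per-action perplexity from the log-probabilities a model assigned
    to each demonstrated action."""
    return float(np.exp(-np.mean(np.asarray(log_probs, dtype=float))))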

³ "Ties" in action value lead to uniform distributions over the corresponding actions.


3.3. Inverse Diagnostics

Many important interaction tasks involve partial observability. In medical diagnosis, for example, sequences of symptoms, tests, and treatments are used to identify (and mediate) unknown illnesses. Motivated by the objective of learning diagnosis policies from experts, we investigate the inverse diagnostics problem of modeling interaction with partially observed systems.

3.3.1. Bayes Net Diagnosis Formulation

We consider a set of variables (distributed according to a known dynamic Bayesian network) that evolve over time based in part on employed actions, A_{1:T}, as shown in Figure 5. Those actions are made with only partial knowledge of the Bayes net variables, as relayed through observation variables O_{1:T}. Previous actions determine what information from the hidden variables is revealed in the next time step.

Figure 5. The MaxCausalEnt representation of the diagnostic problem with an abstract dynamic Bayesian network represented as BN nodes. Perfect recall edges from all past observations and actions to future actions are suppressed.

We assume that the utility for the state of the Bayes net and actions invoked is an unknown linear function of known feature vectors, θ⊤f_{BN_t, A_t}. We formulate the modeling problem as a maximum causal entropy optimization by maximizing the causally conditioned entropy of actions given observations, H(A^T||O^T). We marginalize over the latent Bayes net variables to obtain side information dynamics, P(O_{t+1}|O_{1:t}, A_{1:t}) = E_{BN_{1:t}}[P(O_{t+1}|par(O_{t+1})) | O_{1:t}, A_{1:t}], and expected feature functions, E[f_t|O_{1:t}, A_{1:t}] = E_{BN_{1:t}}[F_{U_t}(par(U_t))|O_{1:t}, A_{1:t}], to estimate the full-history policy, P(A_t|O_{1:t}, A_{1:t}), using Algorithm 1 and Algorithm 2.
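One simple (if inefficient) way to realize this marginalization is Monte Carlo: sample Bayes-net values consistent with the observed history and average. In the sketch below, sample_bn_and_observation, consistency_weight, and utility_features are hypothetical helpers, not functions from the paper.

import numpy as np
from collections import defaultdict

def estimate_side_information(history_obs, history_acts, n_samples=1000):
    """Monte Carlo marginalization over latent Bayes-net variables.

    sample_bn_and_observation: draws latent values and a next observation
    given the action history; consistency_weight: scores agreement with the
    observed history (0/1 rejection or an importance weight);
    utility_features: evaluates F_{U_t} on the sampled parents of U_t."""
    obs_weight = defaultdict(float)
    feat_sum, total = 0.0, 0.0
    for _ in range(n_samples):
        latent, next_obs = sample_bn_and_observation(history_acts)
        w = consistency_weight(latent, history_obs)
        obs_weight[next_obs] += w
        feat_sum = feat_sum + w * np.asarray(utility_features(latent))
        total += w
    if total == 0:
        raise ValueError("no samples were consistent with the observed history")
    obs_dist = {o: v / total for o, v in obs_weight.items()}
    return obs_dist, feat_sum / total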

3.3.2. Vehicle Diagnosis Experiments

We apply our inverse diagnostics approach to the vehicle fault detection Bayesian network (Heckerman et al., 1994) shown in Figure 6 with fully specified conditional probability distributions.

Figure 6. The vehicle fault detection Bayesian network with replaceable variables denoted with an asterisk. All variables are binary (working or not) except Battery Age.

Apart from the relationship between Battery Age and Battery (exponentially increasing probability of failure with battery age), the remaining conditional probability distributions are deterministic-or's (i.e., failure in any parent causes a failure in the child).

A mechanic can either test a component of the vehicle (revealing its status) or repair a component (making it and potentially its descendants operational). Replacements and tests are both characterized by three action features: (1) a cost to the vehicle owner; (2) a profit for the mechanic; and (3) a time requirement. Ideally the sequence of the mechanic's actions would minimize the expected cost to the vehicle owner, but an over-booked mechanic might instead choose to minimize the total repair time, and a less ethical mechanic might seek to optimize personal profit.

To generate a dataset of observations and replacements, a stochastic policy is obtained by adding Gaussian noise, ε_{s,a}, to each action's future expected value, Q*(s, a), under the optimal policy for a fixed set of feature weights, and the highest noisy-valued action, Q*(s, a) + ε_{s,a}, is selected at each time step. Different vehicle failure samples are generated from the Bayesian network conditioned on the vehicle's engine failing to start, and the stochastic policy is sampled until the vehicle is operational.
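The noisy action-selection step amounts to the following trivial sketch (q_values is assumed to hold Q*(s, a) for the current state, and sigma is a noise scale of our choosing):

import numpy as np

def sample_noisy_greedy_action(q_values, sigma=1.0, rng=None):
    """Pick the action whose optimal value plus Gaussian noise is highest."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(q_values, dtype=float) + rng.normal(0.0, sigma, len(q_values))
    return int(np.argmax(noisy))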

In Figure 7, we evaluate the prediction error rate and perplexity of our model and a Markov model that ignores the underlying mechanisms for decision making and simply predicts behavior in proportion to the frequency it has previously been observed (with small pseudo-count priors). The MaxCausalEnt approach consistently outperforms the Markov model even with an order of magnitude less training data. The classification error rate quickly reaches the limit implied by the stochasticity of the data generation process.


Figure 7. Error rate and perplexity of the MaxCausalEnt model and Markov model for diagnosis action prediction as training set size (log-scale) increases.

4. Conclusion and Future Work

We extended the principle of maximum entropy to settings with sequentially revealed information in this work. We demonstrated the applicability of the resulting principle of maximum causal entropy for learning policies in stochastic control, multi-agent interaction, and partially observable settings. In addition to further investigating modeling applications, our future work will investigate the applicability of MaxCausalEnt on non-modeling tasks in dynamic settings. For instance, we note that the proposed principle provides a natural criterion for efficiently identifying a correlated equilibrium in dynamic Markov games, generalizing the approach to normal-form games of Ortiz et al. (2007).

Acknowledgments

The authors gratefully acknowledge the Richard King Mellon Foundation, the Quality of Life Technology Center, and the Office of Naval Research MURI N00014-09-1-1052 for support of this research. We also thank our reviewers for many useful suggestions.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, pp. 1–8, 2004.

Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. An application of reinforcement learning to aerobatic helicopter flight. In NIPS, pp. 1–8, 2007.

Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, 6:679–684, 1957.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Boyd, S., Ghaoui, L. El, Feron, E., and Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory, volume 15 of Studies in Applied Mathematics. 1994.

Dudík, M. and Schapire, R. E. Maximum entropy distribution estimation with generalized regularization. In Proc. COLT, pp. 123–138, 2006.

Grunwald, P. D. and Dawid, A. P. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2003.

Heckerman, D., Breese, J. S., and Rommelse, K. Troubleshooting under uncertainty. In Communications of the ACM, pp. 121–130, 1994.

Howard, R. A. and Matheson, J. E. Influence diagrams. In Readings on the Principles and Applications of Decision Analysis, pp. 721–762. Strategic Decisions Group, 1984.

Jaynes, E. T. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.

Kalman, R. When is a linear control system optimal? Trans. ASME, J. Basic Engrg., 86:51–60, 1964.

Kramer, G. Directed Information for Channels with Feedback. PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, 1998.

Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pp. 282–289, 2001.

Massey, J. L. Causality, feedback and directed information. In Proc. IEEE International Symposium on Information Theory and Its Applications, pp. 27–30, 1990.

Ortiz, L. E., Schapire, R. E., and Kakade, S. M. Maximum entropy correlated equilibria. In AISTATS, pp. 347–354, 2007.

Parsons, T. D. Pursuit-evasion in a graph. In Theory and Applications of Graphs, pp. 426–441. Springer-Verlag, 1976.

Permuter, H. H., Kim, Y.-H., and Weissman, T. On directed information and gambling. In Proc. IEEE International Symposium on Information Theory, pp. 1403–1407, 2008.

Ratliff, N., Bagnell, J. A., and Zinkevich, M. Maximum margin planning. In Proc. ICML, pp. 729–736, 2006.

Ratliff, N., Silver, D., and Bagnell, J. A. Learning to search: Functional gradient techniques for imitation learning. Auton. Robots, 27(1):25–53, 2009.

Tatikonda, S. and Mitter, S. Control under communication constraints. Automatic Control, IEEE Transactions on, 49(7):1056–1068, July 2004.

Ullman, T., Baker, C., Macindoe, O., Evans, O., Goodman, N., and Tenenbaum, J. Help or hinder: Bayesian models of social goal inference. In Proc. NIPS, pp. 1874–1882, 2009.

Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proc. AAAI, pp. 1433–1438, 2008.


Navigate Like a Cabbie: Probabilistic Reasoning from Observed Context-Aware Behavior

Brian D. Ziebart, Andrew L. Maas, Anind K. Dey, and J. Andrew Bagnell
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA
[email protected], [email protected], [email protected], [email protected]

ABSTRACT

We present PROCAB, an efficient method for Probabilistically Reasoning from Observed Context-Aware Behavior. It models the context-dependent utilities and underlying reasons that people take different actions. The model generalizes to unseen situations and scales to incorporate rich contextual information. We train our model using the route preferences of 25 taxi drivers demonstrated in over 100,000 miles of collected data, and demonstrate the performance of our model by inferring: (1) decision at next intersection, (2) route to known destination, and (3) destination given partially traveled route.

Author Keywords

Decision modeling, vehicle navigation, route prediction

ACM Classification Keywords

I.5.1 Pattern Recognition: Statistical; I.2.6 Artificial Intelligence: Parameter Learning; G.3 Probability and Statistics: Markov Processes

INTRODUCTION

We envision future navigation devices that learn drivers' preferences and habits, and provide valuable services by reasoning about driver behavior. These devices will incorporate real-time contextual information, like accident reports, and a detailed knowledge of the road network to help mitigate the many unknowns drivers face every day. They will provide context-sensitive route recommendations [19] that match a driver's abilities and safety requirements, driving style, and fuel efficiency trade-offs. They will also alert drivers of unanticipated hazards well in advance [25] – even when the driver does not specify the intended destination, and better optimize vehicle energy consumption using short-term turn prediction [10]. Realizing such a vision requires new models of non-myopic behavior capable of incorporating rich contextual information.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. UbiComp'08, September 21-24, 2008, Seoul, Korea.

Copyright 2008 ACM 978-1-60558-136-1/08/09...$5.00.

In this paper, we present PROCAB, a method for Probabilistically Reasoning from Observed Context-Aware Behavior. PROCAB enables reasoning about context-sensitive user actions, preferences, and goals – a requirement for realizing ubiquitous computing systems that provide the right information and services to users at the appropriate moment. Unlike other methods, which directly model action sequences, PROCAB models the negative utility or cost of each action as a function of contextual variables associated with that action. This allows it to model the reasons for actions rather than the actions themselves. These reasons generalize to new situations and differing goals. In many settings, reason-based modeling provides a compact model for human behavior with fewer relationships between variables, which can be learned more precisely. PROCAB probabilistically models a distribution over all behaviors (i.e., sequences of actions) [27] using the principle of maximum entropy [9] within the framework of inverse reinforcement learning [18].

In vehicle navigation, roads in the road network differ by type (e.g., interstate vs. alleyway), number of lanes, and speed limit. In the PROCAB model, a driver's utility for different combinations of these road features is learned, rather than the driver's affinity for particular roads. This allows generalization to locations a driver has never previously visited.

Learning good context-dependent driving routes can be difficult for a driver, and mistakes can cause undesirable travel delays and frustration [7]. Our approach is to learn driving routes from a group of experts – 25 taxi cab drivers – who have a good knowledge of alternative routes and contextual influences on driving. We model context-dependent taxi driving behavior from over 100,000 miles of collected GPS data and then recommend routes that an efficient taxi driver would be likely to take.

Route recommendation quality is difficult to assess directly. Instead, we evaluate using three prediction tasks:

• Turn Prediction: What is the probability distribution over actions at the next intersection?

• Route Prediction: What is the most likely route to a specified destination?

• Destination Prediction: What is the most probable destination given a partially traveled route?


We validate our model by comparing its prediction accuracy on our taxi driver dataset with that of existing approaches. We believe average drivers who will use our envisioned smart navigation devices choose less efficient routes, but visit fewer destinations, making route prediction somewhat harder and destination prediction significantly easier.

In the remainder of the paper, we first discuss existing probabilistic models of behavior to provide points of comparison to PROCAB. We then present evidence that preference and context are important in route preference modeling. Next, we describe our PROCAB model for probabilistically modeling context-sensitive behavior compactly, which allows us to obtain a more accurate model with smaller amounts of training data. We then describe the data we collected, our experimental formulation, and evaluation results that illustrate the PROCAB model's ability to reason about user behavior. Finally, we conclude and discuss extensions of our work to other domains.

BACKGROUND

In this section, we describe existing approaches for modeling human actions, behavior, and activities in context-rich domains. We then provide some illustration of the contextual sensitivity of behavior in our domain of focus, modeling vehicle route preferences.

Context-Aware Behavior Modeling

An important research theme in ubiquitous computing is the recognition of a user's current activities and behaviors from sensor data. This has been explored in a large number of domains including classifying a user's posture [17], tracking and classifying user exercises [5], classifying a user's interruptibility [6], and detecting a variety of daily activities within the home [26, 24]. Very little work, in comparison, looks at how to predict what users want to do: their goals and intentions. Most context-aware systems attempt to model context as a proxy for user intention. For example, a system might infer that a user wants more information about a museum exhibit because it observes that she is standing near it.

More sophisticated work in context-awareness has focused on the problem of predicting where someone is going or going to be. Bhattacharya and Das used Markov predictors to develop a probability distribution of transitions from one GSM cell to another [3]. Ashbrook and Starner take a similar approach to develop a distribution of transitions between learned significant locations from GPS data [2]. Patterson et al. used a Bayesian model of a mobile user conditioned on the mode of transportation to predict the user's future location [20]. Mayrhofer extends this existing work in location prediction to that of context prediction, or predicting what situation a user will be in, by developing a supporting general architecture that uses a neural gas approach [15]. In the Neural Network House, a smart home observed the behaviors and paths of an occupant and learned to anticipate his needs with respect to temperature and lighting control. The occupant was tracked by motion detectors and a neural network was used to predict the next room the person was going to enter along with the time at which he would enter and leave the home [16].

However, all of these approaches are limited in their ability to learn or predict from only small variations in user behavior. In contrast, the PROCAB approach is specifically designed to learn and predict from small amounts of observed user data, much of which is expected to contain context-dependent and preference-dependent variations. In the next section, we discuss the importance of these variations in understanding route desirability.

Understanding Route Preference

Previous research on predicting the route preferences of drivers found that only 35% of the 2,517 routes taken by 102 different drivers were the "fastest" route, as defined by a popular commercial route planning application [13]. Disagreements between route planning software and empirical routes were attributed to contextual factors, like the time of day, and differences in personal preferences between drivers.

We conducted a survey of 21 college students who drive regularly in our city to help understand the variability of route preference as well as the personal and contextual factors that influence route choice. We presented each participant with 4 different maps labeled with a familiar start and destination point. In each map we labeled a number of different potentially preferable routes. Participants selected their preferred route from the set of provided routes under 6 different contextual situations for each endpoint pair. The contextual situations we considered were: early weekday morning, morning rush hour, noon on Saturday, evening rush hour, immediately after snow fall, and at night.

                      Route
Context             A  B  C  D  E  F  G
Early morning       6  6  4  1  2  2  0
Morning rush hour   8  4  5  0  1  2  1
Saturday noon       7  5  4  0  1  2  2
Evening rush hour   8  4  5  0  0  3  1
After snow          7  4  4  3  2  1  0
Midnight            6  4  4  2  1  4  0

Table 1. Context-dependent route preference survey results for one particular pair of endpoints

The routing problem most familiar to our participants had the most variance of preference. The number of participants who most preferred each route under each of the contextual situations for this particular routing problem is shown in Table 1. Of the 7 available routes to choose from (A-G) under 6 different contexts, the route with highest agreement was only preferred by 8 people (38%). In addition to route choice being dependent on personal preferences, route choice was often context-dependent. Of the 21 participants, only 6 had route choices that were context-invariant. 11 participants used two different routes depending on the context, and 4 participants employed three different context-dependent routes.


Our participants were not the only ones in disagreement over the best route for this particular endpoint pair. We generated route recommendations from four major commercial mapping and direction services for the same endpoints to compare against. The resulting route recommendations also demonstrated significant variation, though all services are context- and preference-independent. Google Maps and Microsoft's MapPoint both chose route E, while MapQuest generated route D, and Yahoo! Maps provided route A.

Participants additionally provided their preference towards different driving situations on a five-point Likert scale (see Table 2). The scale was defined as: (1) very strong dislike, avoid at all costs; (2) dislike, sometimes avoid; (3) don't care, doesn't affect route choice; (4) like, sometimes prefer; (5) very strong like, always prefer.

                                Preference
Situation                     1   2   3   4   5
Interstate/highway            0   0   3  14   4
Excess of speed limit         1   4   5  10   1
Exploring unfamiliar area     1   8   4   7   1
Stuck behind slow driver      8  10   3   0   0
Longer routes with no stops   0   4   3  13   1

Table 2. Situational preference survey results

While some situations, like driving on the interstate, are disliked by no one, and being stuck behind a slow driver is preferred by no one, other situations have a wide range of preference with only a small number of people expressing indifference. For example, the number of drivers who prefer routes that explore unfamiliar areas is roughly the same as the number of drivers with the opposite preference, and while the majority prefer to drive in excess of the speed limit and take longer routes with no stops, there were a number of others with differing preferences. We expect that a population covering all ranges of age and driving ability will possess an even larger preference variance than the more homogeneous participants in our surveys.

The results of our formative research strongly suggest that drivers' choices of routes vary greatly and are highly dependent on personal preferences and contextual factors.

PROCAB: CONTEXT-AWARE BEHAVIOR MODELING

The variability of route preference from person to person and from situation to situation makes perfectly predicting every route choice for a group of drivers extremely difficult. We adopt the more modest goal of developing a probabilistic model that assigns as much probability as possible to the routes the drivers prefer. Some of the variability from personal preference and situation is explained by incorporating contextual variables within our probabilistic model. The remaining variability in the probabilistic model stems from influences on route choices that are unobserved by our model.

Many different assumptions and frameworks can be employed to probabilistically model the relationships between contextual variables and sequences of actions. The PROCAB approach is based on three principled techniques:

• Representing behavior as sequential actions in a Markov Decision Process (MDP) with parametric cost values

• Using Inverse Reinforcement Learning to recover cost weights for the MDP that explain observed behavior

• Employing the principle of maximum entropy to find cost weights that have the least commitment

The resulting probabilistic model of behavior is context-dependent, compact, and efficient to learn and reason about. In this section, we describe the comprising techniques and the benefits each provides. We then explain how the resulting model is employed to efficiently reason about context-dependent behavior. Our previous work [27] provides a more rigorous and general theoretical derivation and evaluation.

Markov Decision Process Representation

Markov Decision Processes (MDPs) [21] provide a natural framework for representing sequential decision making, such as route planning. The agent takes a sequence of actions (a ∈ A), which transition the agent between states (s ∈ S) and incur an action-based cost¹ (c(a) ∈ ℜ). A simple deterministic MDP with 8 states and 20 actions is shown in Figure 1.

Figure 1. A simple Markov Decision Process with action costs

The agent is trying to minimize the sum of costs while reaching some destination. We call the sequence of actions a path, ζ. For MDPs with parametric costs, a set of features (f_a ∈ ℜ^k) characterize each action, and the cost of the action is a linear function of these features parameterized by a cost weight vector (θ ∈ ℜ^k). Path features, f_ζ, are the sum of the features of actions in the path: ∑_{a∈ζ} f_a. The path cost is the sum of action costs (Figure 1), or, equivalently, the cost weight applied to the path features:

cost(ζ|θ) = ∑_{a∈ζ} θ⊤f_a = θ⊤f_ζ
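As a concrete illustration of this linear path cost (the variable names and array layout are our own, not from the paper):

import numpy as np

def path_cost(theta, path_features):
    """Cost of a path: cost weight applied to the summed action features.

    path_features: iterable of per-action feature vectors f_a."""
    f_path = np.sum(np.asarray(list(path_features), dtype=float), axis=0)
    return float(theta @ f_path)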

The MDP formulation provides a natural framework for a number of behavior modeling tasks relevant to ubiquitous computing. For example, in a shopping assistant application, shopping strategies within an MDP are generated that optimally offset the time cost of visiting more stores with the benefits of purchasing needed items [4].

¹ The negation of costs, rewards, are more common in the MDP literature, but less intuitive for our application.


The advantage of the MDP approach is that the cost weight is a compact set of variables representing the reasons for preferring different behavior, and if it is assumed that the agent acts sensibly with regard to incurred costs, the approach generalizes to previously unseen situations.

Inverse Reinforcement Learning

Much research on MDPs focuses on efficiently finding the optimal behavior for an agent given its cost weight [21]. In our work, we focus on the inverse problem, that of finding an agent's cost weight given demonstrated behavior (i.e., traversed driving routes). Recent research in Inverse Reinforcement Learning [18, 1] focuses on exactly this problem. Abbeel and Ng showed that if a model of behavior matches feature counts with demonstrated feature counts, f̃ (Equation 1), then the model's expected cost matches the agent's incurred costs [1].

∑_{Path ζ_i} P(ζ_i) f_{ζ_i} = f̃   (1)

This constraint has an intuitive appeal. If a driver uses 136.3 miles of interstate and crosses 12 bridges in a month's worth of trips, the model should also use 136.3 miles of interstate and 12 bridges in expectation for those same start-destination pairs, along with matching other features of significance. By incorporating many relevant features to match, the model will begin to behave similarly to the agent. However, many distributions over paths can match feature counts, and some will be very different from observed behavior. In our simple example, the model could produce plans that avoid the interstate and bridges for all routes except one, which drives in circles on the interstate for 136 miles and crosses 12 bridges.

Maximum Entropy Principle

If we only care about matching feature counts, and many different distributions can accomplish that, it is not sensible to choose a distribution that shows preference towards other factors that we do not care about. We employ the mathematical formulation of this intuition, the principle of maximum entropy [9], to select a distribution over behaviors. The probability distribution over paths that maximizes Shannon's information entropy while satisfying constraints (Equation 1) has the least commitment possible to other non-influential factors. For our problem, this means the model will show no more preference for routes containing specific streets than is required to match feature counts. In our previous example, avoiding all 136 miles of interstate and 12 bridge crossings except for one single trip is a very low entropy distribution over behavior. The maximum entropy principle much more evenly assigns the probability of interstate driving and bridge crossing to many different trips from the set. The resulting distribution is shown in Equation 2.

P(ζ|θ) = e^{−cost(ζ|θ)} / ∑_{path ζ′} e^{−cost(ζ′|θ)}   (2)

Low-cost paths are strongly preferred by the model, and paths with equal cost have equal probability.

Compactness Advantage

Many different probabilistic models can represent the same dependencies between variables that pertain to behavior – context, actions, preference, and goal. The main advantage of the PROCAB distribution is that of compactness. In a more compact model, fewer parameters need to be learned so a more accurate model can be obtained from smaller amounts of training data.

Figure 2. Directed graphical model of sequential context-sensitive, goal-oriented, preference-dependent actions

Directly modeling the probability of each action given all other relevant information is difficult when incorporating context, because a non-myopic action choice depends not only on the context directly associated with that action (e.g., whether the next road is congested), but context associated with all future actions as well (e.g., whether the roads that a road leads to are congested). This approach corresponds to a directed graphical model (Figure 2) [22, 14] where the probability of each action is dependent on all contextual information, the intended destination, and personal preferences. If there are a large number of possible destinations and a rich set of contextual variables, these conditional action probability distributions will require an inordinate amount of observed data to accurately model.

Figure 3. Undirected graphical model of sequential context-sensitive, goal-oriented, personalized actions

The PROCAB approach assumes drivers imperfectly minimize the cost of their route to some destination (Equation 2). The cost of each road depends only on the contextual information directly associated with that road. This model corresponds to an undirected graphical model (Figure 3), where cliques (e.g., "Action 1," "Context 1," and "Preference") define road costs, and the model forms a distribution over paths based on those costs. If a road becomes congested, its cost will increase and other alternative routes will become more probable in the model without requiring any other road costs to change. This independence is why the PROCAB model


is more compact and can be more accurately learned using more reasonable amounts of data.

Probabilistic Inference

We would like to predict future behavior using our PROCAB model, but first we must train the model by finding the parameter values that best explain previously demonstrated behavior. Both prediction and training require probabilistic inference within the model. We focus on the inference of computing the expected number of times a certain action will be taken given known origin, goal state, and cost weights². A simple approach is to enumerate all paths and probabilistically count the number of paths and times in each path the particular state is visited.

Algorithm 1 Expected Action Frequency Calculation

Inputs: cost weight θ, initial state s_o, and goal state s_g
Output: expected action visitation frequencies D_{a_{i,j}}

Backward pass
1. Set Z_{s_i} = 1 for valid goal states, 0 otherwise
2. Recursively compute for T iterations
   Z_{a_{i,j}} = e^{−cost(a_{i,j}|θ)} Z_{s:a_{i,j}}
   Z_{s_i} = ∑_{actions a_{i,j} of s_i} Z_{a_{i,j}}

Forward pass
3. Set Z′_{s_i} = 1 for the initial state, 0 otherwise
4. Recursively compute for T iterations
   Z′_{a_{i,j}} = Z′_{s_i} e^{−cost(a_{i,j}|θ)}
   Z′_{s_i} = ∑_{actions a_{j,i} to s_i} Z′_{a_{j,i}}

Summing frequencies
5. D_{a_{i,j}} = Z′_{s_i} e^{−cost(a_{i,j}|θ)} Z_{s_j} / Z_{s_initial}

Algorithm 1 employs a more tractable approach by finding the probabilistic weight of all paths from the origin (o) to a specific action (a), Z′_a = ∑_{ζ_{o→a}} e^{−cost(ζ)}, all paths from the action to the goal (g)³, Z_a = ∑_{ζ_{a→g}} e^{−cost(ζ)}, and all paths from the origin to the goal, Z_o = Z′_g = ∑_{ζ_{o→g}} e^{−cost(ζ)}. Expected action frequencies are obtained by combining these results (Equation 3).

D_a = Z_a Z′_a e^{−cost(a)} / Z_o   (3)

Using dynamic programming to compute the required Z values is exponentially more efficient than path enumeration.

² Other inferences, like probabilities of a sequence of actions or one specific action, are obtained using the same approach.
³ In practice, we use a finite T, which considers a set of paths of limited length.
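A tabular sketch of this forward/backward computation is given below; the data layout (actions as (source, target) edges with per-action costs) and the way the goal and origin weights are re-seeded each iteration are our own choices, not details from the paper's implementation.

import numpy as np

def expected_action_frequencies(costs, edges, n_states, origin, goal, T):
    """Expected action visitation frequencies for a deterministic MDP,
    following the backward/forward structure of Algorithm 1.

    costs: dict mapping edge (i, j) -> action cost
    edges: list of (i, j) action edges (road segment transitions)."""
    w = {e: np.exp(-costs[e]) for e in edges}          # e^{-cost(a)}

    # Backward pass: Z[s] weights paths of length <= T from s to the goal.
    Z = np.zeros(n_states); Z[goal] = 1.0
    for _ in range(T):
        Z_new = np.zeros(n_states); Z_new[goal] = 1.0
        for (i, j) in edges:
            Z_new[i] += w[(i, j)] * Z[j]
        Z = Z_new

    # Forward pass: Zp[s] weights paths of length <= T from the origin to s.
    Zp = np.zeros(n_states); Zp[origin] = 1.0
    for _ in range(T):
        Zp_new = np.zeros(n_states); Zp_new[origin] = 1.0
        for (i, j) in edges:
            Zp_new[j] += w[(i, j)] * Zp[i]
        Zp = Zp_new

    # Combine (Equation 3): weight of paths passing through each action edge.
    return {(i, j): Zp[i] * w[(i, j)] * Z[j] / Z[origin] for (i, j) in edges}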

Cost Weight Learning

We train our model by finding the parameters that maximize the [log] probability of (and best explain) previously observed behavior.

θ* = argmax_θ log ∏_i [ e^{−cost(ζ_i|θ)} / ∑_{path ζ_j} e^{−cost(ζ_j|θ)} ]   (4)

Algorithm 2 Learn Cost Weights from Data

Stochastic Exponentiated Gradient Ascent
Initialize random θ ∈ ℜ^k, γ > 0
For t = 1 to T:
  For a random example, compute D_a for all actions a (using Algorithm 1)
  Compute the gradient, ∇F, from D_a (Equation 5)
  θ ← θ e^{γ_t ∇F}

We employ a gradient-based method (Algorithm 2) for this convex optimization. The gradient is simply the difference between the demonstrated behavior feature counts and the model's expected feature counts, so at the optima these feature counts match. We use the action frequency expectations, D_a, to more efficiently compute the gradient (Equation 5).

f̃ − ∑_{Path ζ_i} P(ζ_i) f_{ζ_i} = f̃ − ∑_a D_a f_a   (5)

The algorithm raises the cost of features that the model is over-predicting so that they will be avoided more, and lowers the cost of features that the model is under-predicting, until convergence near the optima.
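A minimal version of this update loop is sketched below; it reuses the expected_action_frequencies sketch above, the step size and example layout are our own choices, and the update direction follows the verbal description (raise the cost of over-predicted features), since sign conventions for ∇F vary.

import numpy as np

def learn_cost_weights(examples, action_features, edges, n_states, T,
                       gamma=0.05, epochs=100, rng=None):
    """Stochastic exponentiated-gradient sketch for learning cost weights.

    examples:        list of (origin, goal, empirical_feature_counts)
    action_features: dict mapping edge (i, j) -> feature vector f_a."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(next(iter(action_features.values())))
    theta = rng.random(k)                        # positive initialization
    for _ in range(epochs):
        origin, goal, f_emp = examples[rng.integers(len(examples))]
        costs = {e: float(theta @ np.asarray(f)) for e, f in action_features.items()}
        D = expected_action_frequencies(costs, edges, n_states, origin, goal, T)
        f_model = sum(D[e] * np.asarray(action_features[e]) for e in edges)
        # Over-predicted features get their cost raised, under-predicted lowered.
        theta = theta * np.exp(gamma * (f_model - f_emp))
    return theta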

TAXI DRIVER ROUTE PREFERENCE DATA

Now that we have described our model for probabilistically reasoning from observed context-aware behavior, we will describe the data we collected to evaluate this model. We recruited 25 Yellow Cab taxi drivers to collect data from. Their experience as taxi drivers in the city ranged from 1 month to 40 years. The average and median were 12.9 years and 9 years respectively. All participants reported living in the area for at least 15 years.

Collected Position Data

We collected location traces from our study participants over a 3 month period using global positioning system (GPS) devices that log locations over time. Each participant was provided one of these devices⁴, which records a reading roughly every 6 to 10 seconds while in motion. The data collection yielded a dataset of over 100,000 miles of travel collected from over 3,000 hours of driving. It covers a large area surrounding our city (Figure 4). Note that no map is being overlaid in this figure. Repeated travel over the same roads leaves the appearance of the road network itself.

⁴ In a few cases, two participants who shared a taxi also shared a GPS logging device.


Figure 4. The collected GPS datapoints

Road Network Representation

The deterministic action-state representation of the corresponding road network contains over 300,000 states (i.e., road segments) and over 900,000 actions (i.e., available transitions between road segments). There are characteristics describing the speed categorization, functionality categorization, and lane categorization of roads. Additionally, we can use geographic information to obtain road lengths and turn types at intersections (e.g., straight, right, left).

Fitting to the Road Network and Segmenting

To address noise in the GPS data, we fit it to the road network using a particle filter. A particle filter simulates a large number of vehicles traversing over the road network, focusing its attention on particles that best match the GPS readings. A motion model is employed to simulate the movement of the vehicle and an observation model is employed to express the relationship between the true location of the vehicle and the GPS reading of the vehicle. We use a motion model based on the empirical distribution of changes in speed and a Laplace distribution for our observation model.

Once fitted to the road network, we segmented our GPS traces into distinct trips. Our segmentation is based on time-thresholds. Vehicles with a small velocity for a period of time are considered to be at the end of one trip and the beginning of a new trip. We note that this problem is particularly difficult for taxi driver data, because these drivers may often stop only long enough to let out a passenger and this can be difficult to distinguish from stopping at a long stoplight. We discard trips that are too short, too noisy, or too cyclic.

MODELING ROUTE PREFERENCES

In this section, we describe the features that we employed to model the utilities of different road segments in our model, and machine learning techniques employed to learn good cost weights for the model.

Feature Sets and Context-Awareness

As mentioned, we have characteristics of the road network that describe speed, functionality, lanes, and turns. These combine to form path-level features that describe a path as numbers of different turn types along the path and road mileage at different speed categories, functionality categories, and numbers of lanes. To form these path features, we combine the comprising road segments' features for each of their characteristics, weighted by the road segment length or intersection transition count.

Feature            Value
Highway            3.3 miles
Major Streets      2.0 miles
Local Streets      0.3 miles
Above 55mph        4.0 miles
35-54mph           1.1 miles
25-34mph           0.5 miles
Below 24mph        0 miles
3+ Lanes           0.5 miles
2 Lanes            3.3 miles
1 Lane             1.8 miles

Feature            Value
Hard left turn     1
Soft left turn     3
Soft right turn    5
Hard right turn    0
No turn            25
U-turn             0

Table 3. Example feature counts for a driver's demonstrated route(s)

The PROCAB model finds the cost weight for different features so that the model's feature counts will match (in expectation) those demonstrated by a driver (e.g., as shown in Table 3) when planning for the same starting point and destination. Additionally, unobserved features may make certain road segments more or less desirable. To model these unobservable features, we add unique features associated with each road segment to our model. This allows the cost of each road segment to vary independently.

We conducted a survey of our taxi drivers to help identify the main contextual factors that impact their route choices. The perceived influences (with average response) on a 5-point Likert scale ranging from "no influence" (1) to "strong influence" (5) are: Accidents (4.50), Construction (4.42), Time of Day (4.31), Sporting Events (4.27), and Weather (3.62). We model some of these influences by adding real-time features to our road segments for Accident, Congestion, and Closures (Road and Lane) according to traffic data collected every 15 minutes from Yahoo's traffic service.

We incorporate sensitivity to time of day and day of week by adding duplicate features that are only active during certain times of day and days of week. We use morning rush hour, day time, evening rush hour, evening, and night as our time of day categories, and weekday and weekend as day of week categories. Using these features, the model tries to not only match the total number of, e.g., interstate miles, but it also tries to get the right amount of interstate miles under each time of day category. For example, if our taxi drivers try to avoid the interstate during rush hour, the PROCAB model assigns a higher weight to the joint interstate and rush hour feature. It is then less likely to prefer routes on the interstate during rush hour given that higher weight.

Matching all possible contextual feature counts highly constrains the model's predictions. In fact, given enough contextual features, the model may exactly predict the actual demonstrated behavior and overfit to the training data. We avoid this problem using regularization, a technique that relaxes the feature matching constraint by introducing a penalty term (−∑_i λ_i θ_i²) to the optimization. This penalty


prevents cost weights corresponding to highly specialized features from becoming large enough to force the model to perfectly fit observed behavior.

Learned Cost Weights

We learn the cost weights that best explain a set of demonstrated routes using Algorithm 2. Using this approach, we can obtain cost weights for each driver or a collective cost weight for a group of drivers. In this paper, we group the routes gathered from all 25 taxi drivers together and learn a single cost weight using a training set of 80% of those routes.

Figure 5. Speed categorization and road type cost factors normalized to seconds assuming 65mph driving on fastest and largest roads

Figure 5 shows how road type and speed categorization influence the road's cost in our learned model.

NAVIGATION APPLICATIONS AND EVALUATION

We now focus on the applications that our route preference model enables. We evaluate our model on a number of prediction tasks needed for those applications. We compare the PROCAB model's performance with other state-of-the-art methods on these tasks using the remaining 20% of the taxi route dataset to evaluate.

Turn Prediction

We first consider the problem of predicting the action at the next intersection given a known final destination. This problem is important for applications such as automatic turn signaling and fuel consumption optimization when the destination is either specified by the driver or can be inferred from his or her normal driving habits. We compare PROCAB's ability to predict 55,725 decisions at intersections with multiple options⁵ to approaches based on Markov Models.

We measure the accuracy of each model's predictions (i.e., the percentage of predictions that are correct) and the average log likelihood, (1/#decisions) ∑_{decision d} log P_model(d), of the actual actions in the model. This latter metric evaluates the model's ability to probabilistically model the distribution of decisions. Values closer to 0 are closer to perfectly modeling the data. As a baseline, guessing uniformly at random (without U-turns) yields an accuracy of 46.4% and a log likelihood of −0.781 for our dataset.

⁵ Previous Markov Model evaluations include "intersections" with only one available decision, which comprise 95% [22] and 28% [11] of decisions depending on the road network representation.

Markov Models for Turn Prediction

We implemented two Markov Models for turn prediction. A Markov Model predicts the next action (or road segment) given some observed conditioning variables. Its predictions are obtained by considering a set of previously observed decisions at an intersection that match the conditioning variables. The most common action from that set can be predicted, or the probability of each action can be set in proportion to its frequency in the set (with some chance of selecting previously untaken actions added by smoothing the distribution). For example, Liao et al. [14] employ the destination and mode of transport for modeling multi-modal transportation routines. We evaluate the Markov Models employed by Krumm [11], which conditions on the previous K traversed road segments, and Simmons et al. [22], which conditions on the route destination.

            Non-reducing         Reducing
History     Accu.    Likel.      Accu.    Likel.
1 edge      85.8%    −0.322      85.8%    −0.322
2 edge      85.6%    −0.321      86.0%    −0.319
5 edge      84.8%    −0.330      86.2%    −0.321
10 edge     83.4%    −0.347      86.1%    −0.328
20 edge     81.5%    −0.367      85.7%    −0.337
100 edge    79.3%    −0.399      85.1%    −0.356

Table 4. K-order Markov Model Performance

Evaluation results of a K-order Markov Model based on road segment histories are shown in Table 4. We consider two variants of the model. For the Non-reducing variant, if there are no previously observed decisions that match the K-sized history, a random guess is made. In the Reducing variant, instead of guessing randomly when no matching histories are found, the model tries to match a smaller history size until some matching examples are found. If no matching histories exist for K = 1 (i.e., no previous experience with this decision), a random guess is then made.
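A sketch of the reducing history-matching strategy (illustrative data structures, not the implementations of [11] or [22]):

from collections import Counter, defaultdict
import random

class ReducingMarkovModel:
    """Back off to shorter histories when the full K-edge history was never observed."""
    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(Counter)   # history tuple -> Counter of next segments

    def fit(self, routes):
        for route in routes:                 # route: list of road-segment ids
            for i in range(1, len(route)):
                for h in range(1, self.k + 1):
                    if i - h >= 0:
                        self.counts[tuple(route[i - h:i])][route[i]] += 1

    def predict(self, history, options):
        for h in range(min(self.k, len(history)), 0, -1):
            seen = self.counts.get(tuple(history[-h:]))
            if seen:
                valid = {a: c for a, c in seen.items() if a in options}
                if valid:
                    return max(valid, key=valid.get)
        return random.choice(list(options))  # no matching history at any length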

In the non-reducing model we notice a performance degradation as the history size, K, increases. The reduction strategy for history matching greatly diminishes the degradation of having a larger history, but we still find a small performance degradation as the history size increases beyond 5 edges.

Destinations       Accuracy   Likelihood
1x1 grid           85.8%      −0.322
2x2 grid           85.1%      −0.319
5x5 grid           84.1%      −0.327
10x10 grid         83.5%      −0.327
40x40 grid         78.6%      −0.365
100x100 grid       73.1%      −0.416
2000x2000 grid     59.3%      −0.551

Table 5. Destination Markov Model Performance

We present the evaluation of the Markov Model conditioned on a known destination in Table 5. We approximate each unique destination with a grid of destination cells, starting from a single cell (1x1) covering the entire map, all the way up to 4 million cells (2000x2000). As the grid dimensions grow to infinity, the model treats each destination as unique. We find empirically that using finer grid cells actually degrades accuracy, and knowing the destination provides no advantage over the history-based Markov Model for this task.
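A small sketch of this grid approximation, mapping a destination coordinate to one of n x n cells (the bounds and helper below are hypothetical):

def grid_cell(lat, lon, bounds, n):
    """Map a coordinate to one of n x n destination cells over the map bounds
    (bounds = (min_lat, max_lat, min_lon, max_lon))."""
    min_lat, max_lat, min_lon, max_lon = bounds
    row = min(int((lat - min_lat) / (max_lat - min_lat) * n), n - 1)
    col = min(int((lon - min_lon) / (max_lon - min_lon) * n), n - 1)
    return row, col

# e.g., a 10x10 grid over rough, illustrative bounds around Pittsburgh
print(grid_cell(40.44, -79.99, (40.3, 40.6, -80.2, -79.8), 10))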

Both of these results show the inherent problem of data sparsity for directed graphical models in these types of domains. With an infinite amount of previously observed data available to construct a model, having more information (i.e., a larger history or a finer grid resolution) can never degrade the model's performance. However, with finite amounts of data there will often be few or no examples with matching conditioning variables, providing a poor estimate of the true conditional distribution, which leads to lower performance in applications of the model.

PROCAB Turn Prediction

The PROCAB model predicts turns by reasoning about paths to the destination. Each path has some probability within the model, and many different paths share the same first action. An action's probability is obtained by summing the probabilities of all paths that start with that action. The PROCAB model provides a compact representation over destinations, context, and action sequences, so the exact destination and rich contextual information can be incorporated without the data sparsity degradations suffered by the Markov Model.
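A minimal sketch of this marginalization over paths (the path-probability dictionary is an illustrative stand-in; in practice the sums are computed with dynamic programming rather than by enumerating paths):

from collections import defaultdict

def next_action_distribution(path_probs):
    """path_probs: dict mapping a path (tuple of road segments from the current
    intersection to the destination) to its probability under the model.
    The next-action probability is the total probability of paths starting with it."""
    action_prob = defaultdict(float)
    for path, p in path_probs.items():
        action_prob[path[0]] += p
    total = sum(action_prob.values())
    return {a: p / total for a, p in action_prob.items()}

# Toy example with three candidate paths from the current intersection
print(next_action_distribution({("a", "c"): 0.5, ("a", "d"): 0.2, ("b", "e"): 0.3}))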

Model                    Accuracy   Likelihood
Random guess             46.4%      −0.781
Best History MM          86.2%      −0.319
Best Destination MM      85.8%      −0.319
PROCAB (no context)      91.0%      −0.240
PROCAB (context)         93.2%      −0.201

Table 6. Baseline and PROCAB turn prediction performance

We summarize the best comparison models' results for turn prediction and present the PROCAB turn prediction performance in Table 6. The PROCAB approach provides a large improvement over the other models in both prediction accuracy and log-likelihood. Additionally, incorporating time of day, day of week, and traffic report contextual information provides improved performance. We include this additional contextual information in the remainder of our experiments.

Route Prediction

We now focus on route prediction, where the origin and destination of a route are known, but the route between the two is not and must be predicted. Two important applications of this problem are route recommendation, where a driver requests a desirable route connecting two specified points, and unanticipated hazard warning, where an application predicts that the driver will encounter a hazard he is unaware of and warns him beforehand.

We evaluate the prediction quality based on the amount of matching distance between the predicted route and the actual route, and on the percentage of predictions that match, where we consider all routes that share 90% of their distance to be matches. This final measure ignores minor route differences, like those caused by noise in GPS data. We evaluate a previously described Markov Model, a model based on estimated travel time, and our PROCAB model.

Model             Dist. Match   90% Match
Markov (1x1)      62.4%         30.1%
Markov (3x3)      62.5%         30.1%
Markov (5x5)      62.5%         29.9%
Markov (10x10)    62.4%         29.6%
Markov (30x30)    62.2%         29.4%
Travel Time       72.5%         44.0%
PROCAB            82.6%         61.0%

Table 7. Evaluation results for the Markov Model with various grid sizes, the time-based model, and the PROCAB model
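The following sketch shows one way to compute these two measures for a single prediction; the exact matching-distance definition used in the paper may differ in its details:

def route_match_stats(predicted, actual, segment_length):
    """predicted, actual: collections of road-segment ids; segment_length: id -> meters.
    Returns the fraction of the actual route's distance covered by shared segments
    and whether the prediction counts as a 90% match."""
    shared = set(predicted) & set(actual)
    shared_dist = sum(segment_length[s] for s in shared)
    actual_dist = sum(segment_length[s] for s in actual)
    frac = shared_dist / actual_dist
    return frac, frac >= 0.9

# Toy example: the prediction misses one short segment of the actual route.
lengths = {"a": 500, "b": 1200, "c": 300}
print(route_match_stats({"a", "b"}, {"a", "b", "c"}, lengths))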

Markov Model Route Planning

We employ route planning using the previously described destination-conditioned Markov Model [22]. The model recommends the most probable route satisfying the origin and destination constraints. The results (Table 7, Markov) are fairly uniform regardless of the number of grid cells employed, though there is a subtle degradation with more grid cells.

Travel Time-Based Planning

A number of approaches for vehicle route recommendation are based on estimating the travel time of each route and recommending the fastest route. Commercial route recommendation systems, for example, try to provide the fastest route. The CarTel project [8] works to better estimate the travel times of different road segments in near real-time using fleets of instrumented vehicles. One approach to route prediction is to assume that the driver will also try to take this most expedient route.

We use the distance and speed categorization of each road segment to estimate travel times, and then provide route predictions using the fastest route between origin and destination. The results (Table 7, Travel Time) show a large improvement over the Markov Model. We believe this is due to data sparsity in the Markov Model and the travel time model's ability to generalize to previously unseen situations (e.g., new origin-destination pairs).
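A sketch of this travel time-based planner on a toy road network, using assumed per-category speeds (the speeds and attribute names below are illustrative, not the paper's values):

import networkx as nx

# Assumed speed estimates in mph per speed/road category (illustrative).
CATEGORY_SPEED_MPH = {"interstate": 65, "arterial": 35, "small_road": 25}

def add_travel_times(graph):
    """Annotate each edge with an estimated traversal time (seconds) from its
    length and speed category."""
    for u, v, data in graph.edges(data=True):
        data["travel_time"] = data["miles"] / CATEGORY_SPEED_MPH[data["category"]] * 3600.0

def fastest_route(graph, origin, destination):
    return nx.shortest_path(graph, origin, destination, weight="travel_time")

# Usage on a toy road network with two possible routes from A to C.
G = nx.DiGraph()
G.add_edge("A", "B", miles=5.0, category="interstate")
G.add_edge("B", "C", miles=1.0, category="arterial")
G.add_edge("A", "B2", miles=2.0, category="small_road")
G.add_edge("B2", "C", miles=2.0, category="small_road")
add_travel_times(G)
print(fastest_route(G, "A", "C"))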

PROCAB Route Prediction

Our view of route prediction and recommendation is fundamentally different from those based solely on travel time estimates. Earlier research [13] and our own study have shown that there is a great deal of variability in route preference between drivers. Rather than assume drivers are trying to optimize one particular metric and more accurately estimate that metric in the road network, we implicitly learn the metric that the driver is actually trying to optimize in practice. This allows other factors in route choice, such as fuel efficiency, safety, reduced stress, and familiarity, to be modeled. While the model does not explicitly understand that one route is more stressful than another, it does learn to imitate a driver's avoidance of more stressful routes and the features associated with those routes.


                      Percentage of trip observed
Model              10%      20%      40%      60%      80%      90%
MM 5x5             9.61     9.24     8.75     8.65     8.34     8.17
MM 10x10           9.31     8.50     7.91     7.58     7.09     6.74
MM 20x20           9.69     8.99     8.23     7.66     6.94     6.40
MM 30x30           10.5     9.94     9.14     8.66     7.95     7.59
Pre 40x40 MAP      7.98     7.74     6.10     5.42     4.59     4.24
Pre 40x40 Mean     7.74     7.53     5.77     4.83     4.12     3.79
Pre 80x80 MAP      11.02    7.26     5.17     4.86     4.21     3.88
Pre 80x80 Mean     8.69     7.27     5.28     4.46     3.95     3.69
PRO MAP            11.18    8.63     6.63     5.44     3.99     3.13
PRO Mean           6.70     5.81     5.01     4.70     3.98     3.32

Table 8. Prediction error of Markov, Predestination, and PROCAB models in kilometers

The PROCAB model provides increased performance over both the Markov Model and the model based on travel time estimates, as shown in Table 7. This is because it is able to better estimate the utilities of the road segments than the feature-based travel time estimates.

Destination Prediction

Finally, we evaluate models for destination prediction. In situations where a driver has not entered her destination into an in-car navigation system, accurately predicting her destination would be useful for proactively providing an alternative route. Given the difficulty that users have in entering destinations [23], destination prediction can be particularly useful. It can also be used in conjunction with our model's route prediction and turn prediction to deal with settings where the destination is unknown.

In this setting, a partial route of the driver is observed and the final destination of the driver is predicted. This application is especially difficult given our set of drivers, who visit a much wider range of destinations than typical drivers. We compare PROCAB's ability to predict destinations to two other models, in settings where the set of possible destinations is not fully known beforehand.

We evaluate our models using 1881 withheld routes and allow each model to observe various amounts of the route, from 10% to 90%. The model is provided no knowledge of how much of the trip has been completed. Each model provides a single point estimate for the location of the intended destination, and we evaluate the distance between the true destination and this predicted destination in kilometers.
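For concreteness, a sketch of this error measure, taking the great-circle distance between the predicted and true destination coordinates (one reasonable way to compute the kilometer error; the paper does not specify the exact distance computation):

import math

def prediction_error_km(pred, true):
    """Great-circle (haversine) distance in km between the predicted and true
    destinations, each given as (latitude, longitude) in degrees."""
    (lat1, lon1), (lat2, lon2) = pred, true
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Toy example: a prediction a couple of kilometers away from the true destination.
print(round(prediction_error_km((40.44, -79.99), (40.46, -80.01)), 2))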

Bayes’ Rule

Using Bayes' Rule (Equation 6), route preference models that predict a route (B) given a destination (A) can be employed along with a prior on destinations, P(A), to obtain a probability distribution over destinations, P(A|B):

$$P(A|B) = \frac{P(B|A)\,P(A)}{\sum_{A'} P(B|A')\,P(A')} \propto P(B|A)\,P(A) \qquad (6)$$

Since the denominator is constant with respect to A, the probability is often expressed as being proportional to the numerator.
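A minimal sketch of this posterior computation over a set of candidate destinations (the likelihoods and prior below are illustrative numbers):

def destination_posterior(route_likelihood, prior):
    """Combine P(B|A), the likelihood of the observed partial route given each
    candidate destination A, with a prior P(A) and normalize (Equation 6)."""
    unnorm = {dest: route_likelihood[dest] * prior[dest] for dest in prior}
    z = sum(unnorm.values())
    return {dest: p / z for dest, p in unnorm.items()}

# Toy example with two candidate destinations.
print(destination_posterior({"airport": 0.02, "downtown": 0.001},
                            {"airport": 0.3, "downtown": 0.7}))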

Figure 6. The best Markov Model, Predestination, and PROCAB prediction errors. (Line plot: prediction error in kilometers versus percentage of trip completed for the Markov Model, Predestination, and PROCAB.)

Markov Model Destination Prediction

We first evaluate a destination-conditioned Markov Model for predicting the destination region in a grid. The model employs Bayes' Rule to obtain a distribution over cells for the destination based on the observed partial route. The prior distribution over grid cells is obtained from the empirical distribution of destinations in the training dataset. The center point of the most probable grid cell is used as the destination location estimate.

Evaluation results for various grid cell sizes are shown in Table 8 (MM). We find that as more of the driver's route is observed, the destination is more accurately predicted. As with the other applications of this model, we note that having the most grid cells does not provide the best model, due to data sparsity issues.

Predestination

The Predestination system [12] grids the world into a number of cells and uses the observation that the partially traveled route is usually an efficient path through the grid to the final destination. Using Bayes' rule, destinations that are opposite the direction of travel have much lower probability than destinations for which the partially traveled route is efficient. Our implementation employs a prior distribution over destination grid cells conditioned on the starting location rather than the more detailed original Predestination prior. We consider two variants of prediction with Predestination. One predicts the center of the most probable cell (i.e., the maximum a posteriori, or MAP, estimate). The other, Mean, predicts the probabilistic average over cell center beliefs. We find that Predestination shows significant improvement (Table 8, Pre) over the Markov Model's performance.
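A sketch of these two estimators over a grid-cell posterior (hypothetical cell centers and beliefs):

def map_and_mean_estimates(cell_posteriors):
    """cell_posteriors: dict mapping a cell's center (lat, lon) to its posterior probability.
    MAP picks the most probable cell center; Mean averages centers weighted by belief."""
    map_est = max(cell_posteriors, key=cell_posteriors.get)
    total = sum(cell_posteriors.values())
    mean_lat = sum(lat * p for (lat, _), p in cell_posteriors.items()) / total
    mean_lon = sum(lon * p for (_, lon), p in cell_posteriors.items()) / total
    return map_est, (mean_lat, mean_lon)

# Toy example with two grid cells.
print(map_and_mean_estimates({(40.4, -80.0): 0.6, (40.5, -79.9): 0.4}))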

PROCAB Destination Prediction

In the PROCAB model, destination prediction is also an application of Bayes' rule. Consider a partial path, $\zeta_{A \to B}$, from point A to point B. The destination probability is then:

$$P(\text{dest} \mid \zeta_{A \to B}, \theta) \propto P(\zeta_{A \to B} \mid \text{dest}, \theta)\,P(\text{dest}) \propto \frac{\sum_{\zeta_{B \to \text{dest}}} e^{-\text{cost}(\zeta|\theta)}}{\sum_{\zeta_{A \to \text{dest}}} e^{-\text{cost}(\zeta|\theta)}}\,P(\text{dest}) \qquad (7)$$

We use the same prior that depends on the starting point (A) that we employed in our Predestination implementation. The posterior probabilities are efficiently computed by taking the sums over paths from points A and B to each possible destination (Equation 7) using the forward pass of Algorithm 1.
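A sketch of this computation, assuming the two log-sums over paths have already been produced by the forward pass (the function and argument names are hypothetical):

import math

def procab_destination_posterior(logZ_from_B, logZ_from_A, prior):
    """Equation 7 with softmin path sums assumed precomputed:
    logZ_from_X[dest] = log sum over paths X->dest of exp(-cost(path))."""
    unnorm = {d: math.exp(logZ_from_B[d] - logZ_from_A[d]) * prior[d] for d in prior}
    z = sum(unnorm.values())
    return {d: p / z for d, p in unnorm.items()}

# Toy example with two candidate destinations and a uniform prior.
print(procab_destination_posterior({"d1": -3.0, "d2": -6.0},
                                   {"d1": -2.0, "d2": -2.5},
                                   {"d1": 0.5, "d2": 0.5}))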

Table 8 (PRO) and Figure 6 show the accuracy of PROCAB destination prediction compared with the other models. Using averaging is much more beneficial for the PROCAB model, likely because it is more sensitive to particular road segments than models based on grid cells. We find that the PROCAB model performs comparably to the Predestination model given a large percentage of observed trip, and better than the Predestination model when less of the trip is observed. PROCAB's abilities to learn non-travel-time desirability metrics, reason about road segments rather than grid cells, and incorporate additional contextual information may each contribute to this improvement.

CONCLUSIONS AND FUTURE WORK

We presented PROCAB, a novel approach for modeling observed behavior that learns context-sensitive action utilities rather than directly learning action sequences. We applied this approach to vehicle route preference modeling and showed significant performance improvements over existing models for turn and route prediction, as well as performance similar to the state-of-the-art destination prediction model. Our model has a more compact representation than those that directly model action sequences. We attribute our model's improvements to this compact representation, which increases its ability to generalize to new situations.

Though we focused on vehicle route preference modeling and smart navigation applications, PROCAB is a general approach for modeling context-sensitive behavior that we believe is well-suited to a number of application domains. In addition to improving our route preference models by incorporating additional contextual data, we plan to model elder drivers and other interesting populations, and to model behavior in other context-rich domains, including the movements of people throughout dynamic work environments and the location-dependent activities of families. We hope to demonstrate PROCAB as a domain-general framework for modeling context-dependent user behavior.

ACKNOWLEDGMENTS

The authors thank Eric Oatneal and Jerry Campolongo of Yellow Cab Pittsburgh for their assistance, Ellie Lin Ratliff for helping to conduct the study of driving habits, and John Krumm for his help in improving this paper and sharing GPS hardware hacks. Finally, we thank the Quality of Life Technology Center/National Science Foundation for supporting this research under Grant No. EEEC-0540865.

REFERENCES

1. P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, pages 1–8, 2004.
2. D. Ashbrook and T. Starner. Using GPS to learn significant locations and predict movement. Personal and Ubiquitous Computing, 7(5):275–286, 2003.
3. A. Bhattacharya and S. K. Das. LeZi-Update: An information-theoretic framework for personal mobility tracking in PCS networks. Wireless Networks, 8:121–135, 2002.
4. T. Bohnenberger, A. Jameson, A. Kruger, and A. Butz. Location-aware shopping assistance: Evaluation of a decision-theoretic approach. In Proc. Mobile HCI, pages 155–169, 2002.
5. K. Chang, M. Y. Chen, and J. Canny. Tracking free-weight exercises. In Proc. Ubicomp, pages 19–37, 2007.
6. J. Fogarty, S. E. Hudson, C. G. Atkeson, D. Avrahami, J. Forlizzi, S. B. Kiesler, J. C. Lee, and J. Yang. Predicting human interruptibility with sensors. ACM Transactions on Computer-Human Interaction, 12(1):119–146, 2005.
7. D. A. Hennesy and D. L. Wiesenthal. Traffic congestion, driver stress, and driver aggression. Aggressive Behavior, 25:409–423, 1999.
8. B. Hull, V. Bychkovsky, Y. Zhang, K. Chen, M. Goraczko, A. K. Miu, E. Shih, H. Balakrishnan, and S. Madden. CarTel: A distributed mobile sensor computing system. In Proc. ACM SenSys, pages 125–138, 2006.
9. E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
10. L. Johannesson, M. Asbogard, and B. Egardt. Assessing the potential of predictive control for hybrid vehicle powertrains using stochastic dynamic programming. IEEE Transactions on Intelligent Transportation Systems, 8(1):71–83, March 2007.
11. J. Krumm. A Markov model for driver route prediction. Society of Automotive Engineers (SAE) World Congress, 2007.
12. J. Krumm and E. Horvitz. Predestination: Inferring destinations from partial trajectories. In Proc. Ubicomp, pages 243–260, 2006.
13. J. Letchner, J. Krumm, and E. Horvitz. Trip Router with Individualized Preferences (TRIP): Incorporating personalization into route planning. In Proc. IAAI, pages 1795–1800, 2006.
14. L. Liao, D. J. Patterson, D. Fox, and H. Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311–331, 2007.
15. R. M. Mayrhofer. An architecture for context prediction. PhD thesis, Johannes Kepler University of Linz, Austria, 2004.
16. M. C. Mozer. The neural network house: An environment that adapts to its inhabitants. In AAAI Spring Symposium, pages 110–114, 1998.
17. B. Mutlu, A. Krause, J. Forlizzi, C. Guestrin, and J. Hodgins. Robust, low-cost, non-intrusive sensing and recognition of seated postures. In Proc. UIST, pages 149–158, 2007.
18. A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proc. ICML, pages 663–670, 2000.
19. K. Patel, M. Y. Chen, I. Smith, and J. A. Landay. Personalizing routes. In Proc. UIST, pages 187–190, 2006.
20. D. J. Patterson, L. Liao, D. Fox, and H. A. Kautz. Inferring high-level behavior from low-level sensors. In Proc. Ubicomp, pages 73–89, 2003.
21. M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.
22. R. Simmons, B. Browning, Y. Zhang, and V. Sadekar. Learning to predict driver route and destination intent. In Proc. Intelligent Transportation Systems Conference, pages 127–132, 2006.
23. A. Steinfeld, D. Manes, P. Green, and D. Hunter. Destination entry and retrieval with the ALI-Scout navigation system. The University of Michigan Transportation Research Institute, No. UMTRI-96-30, 1996.
24. E. M. Tapia, S. S. Intille, and K. Larson. Activity recognition in the home using simple and ubiquitous sensors. In Proc. Pervasive, pages 158–175, 2004.
25. K. Torkkola, K. Zhang, H. Li, H. Zhang, C. Schreiner, and M. Gardner. Traffic advisories based on route prediction. In Workshop on Mobile Interaction with the Real World, 2007.
26. D. Wyatt, M. Philipose, and T. Choudhury. Unsupervised activity recognition using automatically mined common sense. In Proc. AAAI, pages 21–27, 2005.
27. B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Proc. AAAI, 2008.