EFFICIENT METHOD FOR COMPUTING STRATEGIES FOR SUCCESSIVE PURSUIT DIFFERENTIAL GAMES
A Thesis Presented
by
Reed Jensen
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Master of Science
in
Electrical Engineering
Northeastern University
Boston, Massachusetts
April 2014
© Copyright 2014 by Reed Jensen
All Rights Reserved
Efficient method for computing strategies for successive
pursuit differential games
Reed Jensen
April 2014
Abstract
In successive pursuit, a pursuer seeks to capture as many evaders as possible in succession in the shortest amount of time. At the same time, a coalition of evaders seeks to maximize capture time (or prevent capture entirely) with or without knowledge of the pursuer’s control law or preferred capture order. This study seeks to obtain a control strategy for both the pursuer and the coalition of evaders that is robust to uncertainty and variation in the pursuer or evader coalition strategy and that can be computed in a reasonable amount of time. A combination of techniques from differential game theory and discrete optimization is employed to compute such a strategy. In particular, a sub-optimal numerical approach using limited lookahead and a Monte Carlo tree search algorithm are used to obtain solutions in the presence of a high-dimensional action space. Examples are presented for both simple pursuit dynamics and the dynamics of the so-called Homicidal Chauffeur game.
This work is sponsored by the Department of the Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Acknowledgments
I want to thank Dr. Mykel Kochenderfer and Dr. Bahram Shafai for their counsel and support, and my wife and family for their love and dedication.
Contents
1 Introduction
1.1 Background
1.2 History
1.3 Outline

2 Problem formulation
2.1 Two-player differential pursuit game
2.2 Simple pursuit game formulation
2.3 Homicidal Chauffeur game formulation
2.4 Differential games with multiple players

3 Optimal and approximate solutions
3.1 Solution approach for the two-player differential pursuit game
3.2 Two-player simple pursuit example
3.3 Homicidal Chauffeur example
3.4 Value function when Isaacs condition not satisfied
3.5 Limited lookahead for multi-player games
3.6 Approximating cost-to-go for limited lookahead
3.7 Example solution for simple pursuit of several evaders

4 Simulation approach
4.1 Numerical solutions to the successive pursuit game
4.2 Simulation using the limited lookahead method
4.3 Combinatorial optimization using tree search
4.4 Computational resources

5 Results and analysis
5.1 Limited lookahead performance with one pursuer and two evaders
5.2 Tree search performance with many evaders
5.3 Lookahead performance with many evaders
5.4 Limited lookahead and the Homicidal Chauffeur game

6 Conclusion and Future Work

Bibliography
List of Figures

2.1 Reduced coordinates for the Homicidal Chauffeur game. The pursuer is located at the center with its heading aligned with the x2 axis. A turn by the pursuer causes the coordinate system to rotate about the point C.

3.1 The value map and singular surfaces of a Homicidal Chauffeur game with vp = 3, ve = 1, ω = 1/3 and ε = 1. Coordinates are centered on the pursuer, with the pursuer heading aligned with the vertical (x2) axis. The contours represent capture times for various initial conditions, sampled at 0.5 time units and increasing outward from the useable part (UP) of the target set. All distances and times are normalized by the pursuer speed.

3.2 Map of optimal capture times for successive pursuit of two evaders, normalized by the separation distance between the two evaders (reproduced from Breakwell [1] with kind permission from Springer Science and Business Media). Capture times are represented by solid contours, and sample optimal trajectories of the pursuer relative to the two-evader system are represented by dashed lines. Note that initial conditions from regions 3 and 6 yield optimal trajectories that contain curved motion in inertial space.

4.1 A sample MCTS minimizing search tree for a four-evader successive pursuit game. Each node represents a simulation run and each edge an evader in a capture sequence. The number on each node is the running expected capture time.

5.1 Sample engagement using limited lookahead as compared with the optimal result (denoted by ∗ and dashed lines) in the linear motion regime.

5.2 Sample engagement in Breakwell’s “curved motion” zone (capture sequence not fixed) using limited lookahead. The final capture time is 12.5 sec shorter than the fixed sequence capture time.

5.3 Side-by-side comparison of the full two-evader solution (adapted from Breakwell [1]) with the limited lookahead results for a variety of initial pursuer locations. Solid contours represent capture times (normalized by the initial evader separation distance and pursuer speed), while dashed lines represent sample trajectories relative to the two-evader system. The focal and dispersal lines appear along the bottom of the figure.

5.4 Average number of iterations for MCTS to achieve optimal and sub-optimal (within 1% error) results as compared to brute force (N! iterations). The error bars represent one standard deviation.

5.5 Scenario with three evaders starting in the linear motion regime. The optimal solution is represented by dashed lines.

5.6 Three-evader scenario, with two starting in the curved motion regime.

5.7 Scenario with four evaders.

5.8 Limited lookahead results in inertial and pursuer-centric coordinates for a two-player Homicidal Chauffeur game. In this scenario, the optimal play for the evader is to follow the pursuer for a brief period until the pursuer can turn around. In the right figure, the game trajectory in pursuer-centric coordinates reveals several singular surfaces. The game trajectory moves along a universal line, departs from a dispersal line, moves around a barrier, and returns again to a universal line before reaching the target set.

5.9 Limited lookahead results for a three-player Homicidal Chauffeur game.
Chapter 1
Introduction
Pursuit games provide a way of modeling conflict by representing competition as a pursuer
seeking to catch an evader and minimize some objective such as capture time, and an evader
seeking to maximize the same or to avoid capture entirely. The modeling of conflicts arises
in a large variety of domains including biology, economics, operations research, navigation
and collision avoidance, military applications, and control systems and engineering design.
Conflict models are often used to determine optimal participant strategies or controls that
maximize a player’s benefits or minimize worst-case cost.
In some models of conflict processes there can be many competing parties or players
seeking to optimize their own benefits. The analysis of optimal player decisions in dynamic
games involving multiple players can be difficult. To date, a general solution to multi-player
differential pursuit games – games with the state dynamics governed by differential equations
– is not yet available. Because of the many possible player pairings, multi-player games
may also suffer from the so-called “curse of dimensionality” where large state and action
spaces can make analytical and numerical solutions difficult. As a novel contribution to the
literature, it is the goal of this work to demonstrate at least an approximate solution to
zero-sum, multi-player differential games using a modern discrete optimization technique to
manage the high-dimensionality of the multi-player problem.
This work will focus on solutions to a successive pursuit differential game where pursuers
seek sequential capture of all the evaders, and each evader attempts to delay capture as long
as possible. The evaders work together as a coalition with perfect knowledge of all evader
states to maximize the game objective, while a pursuer or team of pursuers seeks to minimize
the same objective, which for the examples presented will be capture time. The goal is to
find player strategies that can be executed independent of the opposing players’ controls,
including pursuer capture order, that guarantee at least a minimum amount of performance.
A successive pursuit game with a single pursuer and multiple evaders can be seen as an
extension to the classical Traveling Salesman Problem (TSP) where a salesman seeks the
shortest path to visit every city exactly once. TSP is known to be NP-hard,
though several efficient approximation schemes have been developed [2]. In the sense of
visiting a combination of target points, Belousov et al [3] have labeled the successive pursuit
problem the Dynamic Traveling Salesman Problem (DTSP), where the target “cities” (ner-
vous consumers?) now actively evade the pursuing salesman. The intent of this work is to
use a modern, efficient tree search method – Monte Carlo Tree Search – to demonstrate a
practical solution to the combinatorial DTSP in the presence of evader dynamics governed
by differential equations.
The following section discusses the differential pursuit game and defines the concepts of
a game solution and optimal strategies. The subsequent section includes a brief background
and summary of the latest research in successive pursuit games and the history and applica-
tion of Monte Carlo Tree Search. The final section will then introduce the remainder of the
paper.
1.1 Background
Conflict processes or games are called dynamic games if the order of the decisions made
by the different parties is important [4]. Dynamic games where the benefit to one player
exactly matches the detriment to the other are called zero-sum, and many conflict processes
have this property. Players that are at odds seek an optimal strategy that yields the largest
benefit to their party.
In conflict processes there are many ways to define optimality. For zero-sum games, one
way to define optimality is by determining the Nash equilibrium of the game. Under this
condition, no unilateral decision by one player or coalition of players can reduce the benefit
of the other player or coalition. A player strategy that guarantees a Nash equilibrium
is called a guaranteeing strategy and will be considered the definition of optimality for the
subsequent sections. Guaranteeing strategies ensure at least a minimum benefit to the player
that executes them regardless of the moves or decisions by the other players. For two-player,
zero-sum games this minimum benefit is called the game value.
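The distinction between the two players' security levels can be illustrated with a finite zero-sum game. The sketch below (an illustrative toy example, not taken from this work) computes the minimizer's and maximizer's guaranteed levels for a small cost matrix; when the two coincide, the game has a saddle point and the common value is the game value:

```python
# Illustrative example: a finite zero-sum matrix game.  cost[i][j] is the
# capture time if the pursuer plays row i (minimizer) and the evader plays
# column j (maximizer).  The matrix entries are made up for the example.
cost = [
    [4.0, 5.0],
    [3.0, 2.0],
]

# Pursuer's guaranteeing (security) level: the smallest cost it can
# guarantee regardless of the evader's choice.
pursuer_value = min(max(row) for row in cost)

# Evader's guaranteeing level: the largest cost it can force regardless
# of the pursuer's choice.
evader_value = max(min(cost[i][j] for i in range(len(cost)))
                   for j in range(len(cost[0])))

# The two security levels coincide here, so the game has a saddle point
# and the common value is the game value.
print(pursuer_value, evader_value)  # 3.0 3.0
```

Here the saddle point is at (row 1, column 0): neither player can improve its guaranteed outcome by unilaterally deviating from that pair.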
Guaranteeing strategies can be useful in applications like robust control where the con-
troller seeks to maintain a minimum level of control performance in the presence of worst-case
noise or other uncertainties. In this case the roles of pursuer and evader, minimizer and max-
imizer, can be reversed depending on the application. Because guaranteeing strategies do
not necessarily require the knowledge of the other players’ controls, they can also be useful
in conflicts where information about the adversarial processes is limited. Of course, as more
knowledge about the opponent becomes available, it may be possible to form other optimal
strategies that yield a larger payoff.
Differential games are a type of dynamic game where the game state is described by a set
of differential equations and were introduced by Isaacs in the 1950s [5]. Some advantages of
using the differential game formulation are that it may provide a continuous game solution
in time and/or space, define entire sets of game trajectories that meet a specified condition
such as capture or escape, or reveal singularities in the game that may have profound effects
on the game outcome and optimal player controls. Differential games are defined by the
state differential equations, the game state space, admissible player information and control
sets, player preferences and objectives, and the target or termination sets for each player.
Solving a differential game often involves determining the game value and associated
player controls that solve a set of partial differential equations (PDEs) called the Hamilton-
Jacobi-Bellman-Isaacs (HJI) equations. Candidate trajectories from the HJI solution are
then checked to ensure that they terminate on the target set, fill the entire game space, and
meet boundary conditions at the boundaries of the game space and other singular surfaces
that may appear. Verification of candidate solutions and the discovery and characterization
of singular surfaces contribute to the difficulty of solving differential games analytically and
numerically.
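For reference, one common statement of the stationary HJI equation for the value function V, with dynamics ẋ = f(x, a, b), running cost G, and terminal cost Q on the target set Λ, is sketched below; sign and boundary conventions vary across the literature, and the min and max commute only when the Isaacs condition holds:

```latex
\min_{a \in A} \max_{b \in B}
  \left[ \nabla V(x) \cdot f(x, a, b) + G(x, a, b) \right] = 0,
\qquad V(x) = Q(x) \ \text{for } x \in \Lambda .
```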
The challenge of solving sets of partial differential equations with possible discontinuous
solutions has been partially addressed by the identification of viscosity solutions as weak
solutions to the HJI equations and the development of several numerical solution approaches
(see [6] for an overview). However, numerical solutions to PDEs can be time consuming,
which for practical applications has motivated the development of methods for approximating
multi-player differential game solutions. In his doctoral dissertation, Li [7] presents a method
for approximating solutions to zero-sum, multi-player differential pursuit games using a
limited lookahead technique akin to the limited lookahead of optimal control. Li proves that
after a finite number of iterations, the limited lookahead technique can approach the optimal
game value. It is his technique that this work will adopt to achieve efficient differential game
solutions.
The ultimate contribution of this work is to combine the limited lookahead approach of
Li with an efficient, modern tree search method – Monte Carlo Tree Search – to solve the
differential successive pursuit game in the presence of many evaders in a practical amount of
time. This then provides an automated way to derive robust control strategies for competitive
processes.
1.2 History
The study of differential games was introduced by Isaacs [5] in the 1950’s when he devised
several games relevant to military conflicts, including the “Homicidal Chauffeur” game, and
formulated their solutions. In his work he combined concepts from classical game theory
and control theory to derive optimal control strategies for several dynamical systems that
can be represented by differential equations. To do so, he used the dynamic programming
principle in conjunction with a set of partial differential equations that now include his
name – the Hamilton-Jacobi-Bellman-Isaacs (HJI) equations. Additionally he discovered
many singular phenomena that arise within differential games that have a profound impact
on game outcomes.
His work on differential games and singular surfaces was later continued by J. V. Breakwell, P. Bernhard, A. Merz, and J. Lewin (for just a few examples, see [1, 8, 9]). In the 1980’s,
work by Crandall and Lions [10] and independently by Subbotin and Krassovski [11, 12] led
to the notion of viscosity solutions, which are weak solutions to the HJI PDEs. These con-
cepts have been developed for several variations of the HJI equations and allow for both
smooth and non-smooth solutions. This has enabled many modern numerical approaches to
solving differential games such as level set methods (see [13, 14, 15, 6, 16, 17]).
Recently there has been much interest in the study of multi-player differential pursuit
games with a variety of dynamics and objectives. Zemskov et al [18] consider a single
pursuer and two evaders with the “game of two cars” dynamics. Shevchenko [19] considers a
similar problem with two terminal manifolds but in the context of search and identification.
Bhattacharya [20] addresses non-singular solutions to a spatial jamming problem as a zero-
sum multi-player differential game. Fuchs et al [21, 22] examine cooperation among multiple
evaders through a modified cost function to encourage pursuer retreat, assuming open-loop
pursuer intent. Yeung and Petrosyan [23] consider cooperative stochastic differential games
that include non-zero-sum solutions. While this list is far from exhaustive, it does suggest
that a solution method for multi-player games that addresses varying dynamics, singular
solutions, closed-loop decision feedback, and stochastic behaviors would be of interest.
Differential games involving successive pursuit of multiple evaders were initially studied
by Breakwell et al [1], who also identify some singular surfaces within the game (see Section
3.7). They were also studied by Petrosjan [24], who proved that an infinite set of Nash
equilibria exists in non-zero-sum, many-evader games. He also demonstrated that allowing
the pursuer to change its preferred capture order over time can improve its performance
[25]. In this vein, Shevchenko has considered open-loop, alternative capture sequences and
multiple terminal manifolds for successive capture [26, 27, 28]. Determining a closed-loop
optimal strategy for choosing between terminal sets in general multi-player pursuit games is
still an open problem.
For simple successive pursuit with a known capture order, Chikrii et al [29] derive the
optimal control for the pursuer and evaders and show that straight-line motion is optimal
for each party. Belousov et al [3] demonstrate a numerically efficient method to obtain the
Chikrii solution for a known capture order, claiming efficient computation for scenarios with
11 or 12 evaders. Berdyshev finds solutions for a pursuer with nonlinear motion constraints
[30, 31], also under the fixed capture sequence assumption. It should be noted that, because
these solutions require a fixed capture order, they omit some of the interesting curved motion
solutions and singular surfaces as described by Breakwell and Petrosjan that influence the
optimal capture time.
Liu et al [32] solve for the evader optimal open-loop control for the many-evaders suc-
cessive pursuit problem that does not assume a pursuer capture sequence. They compare
results with the solution from Belousov and the optimal HJI solution for the two-evader case,
showing also the linear and curved motion regions found by Breakwell et al. To improve
their solution for time-varying capture sequences, they also implement an iterative open-loop
approach. Computation times under one second are achieved for scenarios with five evaders
or fewer. The work does not, however, determine the optimal actions of the pursuer.
No general solution to multi-player differential games has been derived to date. Some of
the challenges to solving these games are the difficulty in defining appropriate terminal con-
ditions, solving complex partial differential equations, addressing capturability, and coping
with high-dimensional game and action spaces. Stipanovic et al [33] recently have looked
at Lyapunov methods as alternatives to solving PDEs and have also looked at capturability
[34]. Shevchenko [28] has examined the selection of alternative terminal manifolds for mul-
tiple evader problems. Studies of differential pursuit games with asymmetric information
[18, 35] and non-zero-sum objectives [25] are also of recent interest.
To avoid the analytical and computational difficulty of this problem, several approxima-
tions to the multi-player pursuit evasion differential game have recently been considered.
Jang et al [36] use direct differentiation of the game value function to solve a set of ordi-
nary differential equations rather than the HJI PDEs and obtain a non-cooperative set of
pursuer strategies. Bolonkin et al [37] use a geometric method to quickly approximate the single-pursuer, multiple-evader problem. Ge et al [38], Wang et al [39], and Wei et al
[40] independently use a hierarchical approach to solve the multi-player game, dividing it
into a collection of solvable subgames to reduce communication overhead and achieve real-
time performance in some instances. While these methods consider efficient solutions to the
multi-player problem, they do not necessarily claim optimality.
In his doctoral thesis, Li ([7], see also [41, 42, 43]) introduces a framework for approxi-
mating the solution to differential games with multiple pursuers and evaders. He extends the
concept of limited lookahead and rollout policies from optimal control [44] to multi-player,
zero-sum differential games and demonstrates the finite convergence of subsequent iterations
of the limited lookahead approach to the optimal solution. Furthermore, he shows that a
hierarchical approach similar to the studies above yields a valid estimate of the cost-to-go
for the limited lookahead method for certain successive pursuit games. Thus, Li’s approach
potentially realizes some of the computational efficiency benefits of the previous studies while
also achieving near-optimal results. It is this approach – limited lookahead with hierarchical
decomposition of the cost-to-go – that will be examined in this work.
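To make the limited lookahead idea concrete, here is a one-step sketch for the reduced simple-pursuit dynamics; the discretized heading sets, step size, and straight-line cost-to-go heuristic are all illustrative assumptions for this example, not Li's formulation:

```python
import math

# One-step limited lookahead for the reduced simple-pursuit game: at a
# given state, the pursuer minimizes and the evader maximizes the one-step
# cost plus an approximate cost-to-go.  The heuristic used here is the
# remaining straight-line capture time, (|x| - eps) / (1 - nu).
nu, eps, dt = 0.5, 0.1, 0.05
HEADINGS = [2.0 * math.pi * k / 16 for k in range(16)]  # discretized controls

def cost_to_go(x):
    return max(math.hypot(*x) - eps, 0.0) / (1.0 - nu)

def step(x, phi, psi):
    return (x[0] + (nu * math.sin(psi) - math.sin(phi)) * dt,
            x[1] + (nu * math.cos(psi) - math.cos(phi)) * dt)

def lookahead(x):
    """Return the minimax (phi, value) pair for a one-step horizon."""
    best_phi, best_val = None, math.inf
    for phi in HEADINGS:
        # For each pursuer control, find the evader's best (maximizing) reply.
        worst = max(dt + cost_to_go(step(x, phi, psi)) for psi in HEADINGS)
        if worst < best_val:
            best_phi, best_val = phi, worst
    return best_phi, best_val

phi, val = lookahead((3.0, 4.0))
```

At the relative state (3, 4) the chosen pursuer heading is the discretized direction closest to the line of sight, and the lookahead value is close to the true capture time of 9.8 for these parameters, as expected when the cost-to-go approximation is exact.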
In each of the works on successive pursuit by Li, Liu, and Belousov, among others, the
difficulty of the combinatorial nature of the problem is mentioned. This study seeks to extend
the results of these works by approximating the optimal controls of both players under a
variable capture sequence, as in Li, while also solving the combinatorial problem efficiently.
This will be accomplished by combining Li’s limited lookahead approach with the Monte
Carlo Tree Search method, a tool commonly used in discrete combinatorial games with high
branching factors.
A substantial review of Monte Carlo Tree Search (MCTS) and its variants can be found in
Browne et al [45]. MCTS has typically been used in the domain of two-player, discrete games
such as Go, where the method selects the best action sequences of each player, represented
by the branches of the tree, using random sampling, a tree search policy, and rollout-based
simulation. MCTS has also been used successfully in single-player games, decision theory
applications such as Markov decision processes, and optimization problems including the
traveling salesman problem and other NP-complete problems (see [46, 47, 48, 49]). Rimmel
et al [50] and Perez et al [47] have used MCTS with some success to solve TSP with time
windows and with dynamic constraints on the salesman motion.
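As an illustration of how MCTS applies to capture-order selection, the following sketch (a toy example with made-up static target positions, not the successive pursuit game itself) runs UCT over capture sequences, scoring each completed order by total path length:

```python
import math
import random

# Toy MCTS/UCT over capture orders for a single pursuer starting at the
# origin and four static targets; the pursuer seeks the order minimizing
# total path length.  Target coordinates are made up for the example.
targets = {0: (1.0, 0.0), 1: (4.0, 0.0), 2: (4.0, 3.0), 3: (0.0, 3.0)}

def order_cost(order):
    pos, total = (0.0, 0.0), 0.0
    for t in order:
        total += math.dist(pos, targets[t])
        pos = targets[t]
    return total

class Node:
    def __init__(self, order):
        self.order = order       # capture sequence so far
        self.children = {}       # next target -> Node
        self.visits = 0
        self.total = 0.0         # accumulated rollout costs

def select_child(node, c=1.0):
    # UCT for a minimizing player: a low average cost is good, so negate
    # the mean cost and add the usual exploration bonus.
    def uct(child):
        return (-child.total / child.visits
                + c * math.sqrt(math.log(node.visits) / child.visits))
    return max(node.children.values(), key=uct)

def mcts(iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(targets) - len(node.order):
            node = select_child(node)
            path.append(node)
        # Expansion: add one untried target, if any remain.
        untried = [t for t in targets if t not in node.order and t not in node.children]
        if untried:
            child = Node(node.order + (rng.choice(untried),))
            node.children[child.order[-1]] = child
            node = child
            path.append(node)
        # Rollout: complete the order randomly and score it.
        rest = [t for t in targets if t not in node.order]
        rng.shuffle(rest)
        cost = order_cost(node.order + tuple(rest))
        # Backpropagation.
        for n in path:
            n.visits += 1
            n.total += cost
    # Extract the preferred order greedily by visit count.
    order, node = (), root
    while node.children:
        t, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        order += (t,)
    return order, order_cost(order)

best_order, best_cost = mcts()
```

With 2,000 iterations the tree over the 24 possible orders is fully expanded and the visit counts concentrate on a near-optimal sequence; the same anytime structure is what makes MCTS attractive when the branching factor is too large to enumerate.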
Because MCTS is an anytime algorithm that returns a useful result even when termi-
nating prematurely, Perez et al [48] have used MCTS in conjunction with rolling horizon
evolutionary algorithms to find TSP solutions that also navigate obstacles in real time. The
anytime nature of the MCTS approach, its ability to quickly find valuable branches in com-
binatorial trees, and the compatibility of its rollout-based simulation approach with limited
lookahead suggest that MCTS is a prime candidate for addressing the Dynamic Traveling
Salesman problem covered in this work.
1.3 Outline
Before testing the ability of limited lookahead with MCTS to solve successive pursuit games,
it is necessary to introduce the theory and examples that will be used. The next chapter
presents the formulation of a differential game for two and several players and introduces
two differential game examples – simple pursuit and the Homicidal Chauffeur game.
The subsequent chapter formulates both optimal and sub-optimal solutions to these and
similar games that will be used in later analysis. The analytical solutions to the two-player
simple pursuit and Homicidal Chauffeur game are first reviewed, followed by a review of the
development of the limited lookahead method by Li for multi-player games. An overview
of existing solutions for the successive pursuit of many evaders and the proposed solution
approach for the above examples finish the chapter. Chapter 4 provides the assumptions
and implementation details for simulating limited lookahead and Monte Carlo Tree Search,
including the chosen numerical optimization and software packages.
Finally, Chapter 5 shows the results of the proposed technique for the successive pursuit
scenario. First the results are compared with known solutions for two-evader successive
pursuit with and without a fixed capture sequence. The performance of MCTS in selecting
the optimal capture sequence is examined in the next section. Results for limited lookahead
with MCTS for the many-evader scenario are then presented. The results conclude by testing
the technique in both the two-player and multi-player Homicidal Chauffeur game. Chapter
6 offers concluding remarks and suggests future work.
Chapter 2
Problem formulation
This paper considers zero-sum differential games with the competing players or processes
modeled as pursuers (minimizers) and evaders (maximizers). This chapter begins with the
formulation of a general zero-sum, two-player differential pursuit game. It then follows with
examples of the simple pursuit game and the Homicidal Chauffeur game, illustrating how
the dimension of different dynamical models can be reduced to a minimum set to ease game
analysis. Finally, the game formulation is extended to multiple players, which is the form
that will be used throughout the rest of the paper. The subsequent chapter will then address
the construction of game solutions.
2.1 Two-player differential pursuit game
A zero-sum pursuit-evasion (PE) differential game between a single pursuer and single evader
can be formulated as follows. Let the combined state variable of the pursuer and evader be
represented by x ∈ Rn, where the dimensionality n depends on the specific dynamics of the
game. The set of all possible states in the game is called the game set, denoted by S, and
can be a subset of the n-dimensional Euclidean state space. The dynamics of the game are
represented by f : Rn × Rnp × Rne → Rn,
ẋ(t) = f(x(t), a(t), b(t)), x(0) = x0, a ∈ A, b ∈ B (2.1)
where a ∈ Rnp and b ∈ Rne are the control vectors of the pursuer and evader, respectively,
and A and B are the admissible control sets of the game. For this paper it is assumed that
f is single-valued and convex in a and b, the range of f is bounded and Lipschitz continuous
and the control sets A and B are convex. It should be noted that, while (2.1) does not
show explicitly its dependence on time t, time-dependent formulations can be considered by
including t in the state vector x. For this reason, explicit notation for time t will be generally
suppressed for the subsequent development when the state vector x is present.
A differential game terminates when the state vector reaches the target set Λ, defined as
a closed subset of the boundaries ∂S of the game set S. For this paper it will be assumed
that Λ is piecewise smooth and that termination occurs when the state velocity vector f
penetrates the target set, or
f(x, a, b) · n(x) < 0
where n(x) is a unit vector normal to the boundary ∂S. Additionally, the boundary of the
target set Λ will be denoted by a continuous and continuously differentiable scalar function
ℓ(x) = 0.
The objective of the game for the pursuer (evader) is to minimize (maximize) a cost
function of the form
J(x, a, b) = ∫₀ᵀ G(x(t), a(t), b(t)) dt + Q(x(T)) (2.2)
where G : Rn × Rnp × Rne → R is the running cost and Q : Rn → R is the terminal cost. It
is assumed that G has the same properties as f and that Q and its derivatives have at most
a finite number of jump discontinuities. In the literature, games with only a running cost
term are called games of degree, while games with only a terminal cost are called games of
kind. For the pure pursuit game, G = 1 and Q = 0, and the game cost is the capture time:
J(x, a, b) = ∫₀ᵀ dt (2.3)
where T is the capture time,
T = inf{t ∈ R+ : x(t) ∈ Λ}. (2.4)
It should be noted that, for games where the evader is guaranteed to escape (i.e., the target
set is never reached), T can be infinite.
In a two-player zero-sum game, the preference of each player is in pure conflict with
the other. Each player makes a control decision based on the game information, which for
this paper will be the true, current state vector and its histories x(τ), 0 ≤ τ ≤ t unless
otherwise stated. A policy for a pursuer (evader) that assigns a control vector a (b) from
the admissible control set A (B) to a state x(t) is called the pursuer’s (evader’s) closed-loop
strategy and will be denoted by α(x(t)) = a(t) (β(x(t)) = b(t)). The set of all admissible
strategies for the pursuer and evader will be denoted by A and B, respectively. As will be seen
in the Homicidal Chauffeur game, there may be instances where optimal play requires the
knowledge of the opponent’s control, i.e., the strategy is not admissible. In those cases, the
player’s deterministic strategy will be replaced with a mixed strategy (randomized control
selection) so as not to violate game information constraints.
The sections that follow define two example games, simple pursuit and the Homicidal
Chauffeur, that will be used in subsequent sections to demonstrate the sub-optimal ap-
proaches of the paper.
2.2 Simple pursuit game formulation
Two-player simple pursuit consists of a pursuer and evader that travel at maximum speeds
vp = 1 and ve = ν, 0 < ν < 1, respectively, and can turn in any direction instantaneously.
The dynamics of the game are
ẋp = (vp sinφ, vp cosφ)T = (sinφ, cosφ)T, −π < φ ≤ π (2.5)
ẋe = (ve sinψ, ve cosψ)T = (ν sinψ, ν cosψ)T, −π < ψ ≤ π (2.6)
where xp,e represents the two-dimensional position of the pursuer and evader, respectively,
and φ and ψ are the respective controls. To simplify the analysis of the game, one can reduce
the dimensionality of the problem dynamics by defining a new state x = xe − xp relative to
the pursuer location. The dynamics then become
ẋ = f(x, φ, ψ) = (ν sinψ − sinφ, ν cosψ − cosφ)T. (2.7)
The game set is S = {x : x ∈ R2} and the target set is a small circle around the pursuer
with radius ε, Λ = {x : x ∈ R2, x1² + x2² ≤ ε²}. The boundary of the target set can be
characterized by the scalar function ℓ(x) = x1² + x2² − ε² = 0. The admissible controls at time
t are A = {φ : −π < φ ≤ π} and B = {ψ : −π < ψ ≤ π}. The pursuer (evader) seeks to
minimize (maximize) the capture time according to the cost function J in (2.3).
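The optimal play in this game is pure pursuit along the line of sight for both players, so the relative state shrinks radially at the closing speed 1 − ν. A minimal simulation sketch (Euler integration, assumed parameter values) confirms the closed-form capture time (|x0| − ε)/(1 − ν):

```python
import math

# Euler integration of the reduced simple-pursuit dynamics (2.7) with both
# players using the line-of-sight heading, which is optimal for this game.
# The parameter values below are assumed for the example.
nu, eps, dt = 0.5, 0.1, 1e-4
x = [3.0, 4.0]                          # evader position relative to pursuer

t = 0.0
while math.hypot(*x) > eps:
    # Both optimal headings point along the line of sight, so the relative
    # state shrinks radially at the closing speed 1 - nu.
    phi = psi = math.atan2(x[0], x[1])  # heading measured from the x2 axis
    x[0] += (nu * math.sin(psi) - math.sin(phi)) * dt
    x[1] += (nu * math.cos(psi) - math.cos(phi)) * dt
    t += dt

# Closed-form capture time for simple pursuit: (|x0| - eps) / (1 - nu).
t_exact = (5.0 - eps) / (1.0 - nu)
print(t, t_exact)
```

The simulated capture time agrees with the closed form to within the integration step, which is a useful sanity check before moving to games without analytical solutions.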
2.3 Homicidal Chauffeur game formulation
In the Homicidal Chauffeur game, the evader dynamics are equivalent to simple pursuit,
while the pursuer has an additional turn rate limit ω:
ẋp1 = sin xp3        ẋe1 = ν sinψ (2.8)
ẋp2 = cos xp3        ẋe2 = ν cosψ (2.9)
ẋp3 = ωφ (2.10)
The pursuer’s control is drawn from A = {φ : φ ∈ R, |φ| ≤ 1}, while the evader’s is taken
from the set B = {ψ : ψ ∈ R,−π < ψ ≤ π}. The game set, target set, information
constraints, and objective function are as in the simple pursuit game above.
The dimensionality of the problem can be reduced from n = 5 to n = 2 by transforming
the coordinates relative to the pursuer and folding in the turn rate,
ẋ1 = −ω x2 φ + ν sin ψ  (2.11)
ẋ2 = ω x1 φ − 1 + ν cos ψ.  (2.12)
In this reduced-space formulation, the pursuer heading is fixed along the x2-axis such that
a turn causes the coordinate system to rotate. Consequently the evader control ψ adopts a
different meaning from the inertial coordinates in (2.8) (see Figure 2.1). The dynamics in
the reduced game space may be less intuitive but will make the mathematical analysis more
tractable.
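A minimal numerical sketch of the reduced dynamics (2.11)-(2.12) follows, with ω = ν = 1/3 chosen to match the example parameters used later in Section 3.3; the function name `hc_step` is illustrative, not from the thesis.

```python
import math

# Euler step of the reduced Homicidal Chauffeur dynamics (2.11)-(2.12).
# OMEGA and NU match the Section 3.3 example parameters.
OMEGA, NU = 1.0 / 3.0, 1.0 / 3.0

def hc_step(x, phi, psi, dt=1e-3):
    x1, x2 = x
    return (x1 + (-OMEGA * x2 * phi + NU * math.sin(psi)) * dt,
            x2 + (OMEGA * x1 * phi - 1.0 + NU * math.cos(psi)) * dt)

# An evader straight ahead (x = (0, 2)) fleeing directly away (psi = 0)
# is still closed on at rate 1 - nu: the "-1" term is the pursuer's own
# unit-speed translation along the x2 axis of the rotating frame.
x_next = hc_step((0.0, 2.0), phi=0.0, psi=0.0)
```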
Figure 2.1: Reduced coordinates for the Homicidal Chauffeur game. The pursuer is located at the center with its heading aligned with the x2 axis. A turn by the pursuer causes the coordinate system to rotate about the point C.
2.4 Differential games with multiple players
To formulate a multi-player differential game, a few modifications need to be made to the
definitions earlier in the section. Here the formulation proceeds as in Li [7]. Assuming M
pursuers and N evaders, where each pursuer and evader is denoted by the index i and j
respectively, the dynamics of each pursuer and evader are
ẋp^i = fp^i(xp^i(t), ai(t)),  xp^i(0) = xp0^i,  i = 1, . . . , M
ẋe^j = fe^j(xe^j(t), bj(t)),  xe^j(0) = xe0^j,  j = 1, . . . , N

with respective controls ai ∈ Ai and bj ∈ Bj. The total state vector is then x ≜ (xp^T, xe^T)^T,
xp ≜ (xp1, . . . , xpM)^T, xe ≜ (xe1, . . . , xeN)^T, with the combined dynamics fp ≜ (fp^1, . . . , fp^M)^T
and fe ≜ (fe^1, . . . , fe^N)^T.
To define the termination of the multi-player pursuit game, additional definitions of a
terminal state are required. Let Pp,e(xp,e) : Rnp,ne → Rn be a projection operator that returns
the positional elements of dimension n from the respective state vector. Capture between
pursuer i and evader j occurs when ||Pp(xp^i) − Pe(xe^j)|| ≤ ε at some t ≥ 0. The capture time of
the j-th evader is then the first such time,

Tj = min{t ≥ 0 | ∃ i such that ||Pp(xp^i) − Pe(xe^j)|| ≤ ε}  (2.13)

and the game is terminated at the final capture time,

T = max_{j=1,...,N} Tj.  (2.14)
In the successive pursuit games in this paper, the game terminates only after the final
evader is captured. Following Li, define a discrete variable zj ∈ {0, 1} that assigns a value
of 0 to evader j when it is captured and 1 otherwise. The dynamics of each zj are governed
by the algebraic equations

gj(0, x) = 0
gj(1, x) = 0 if ||Pp(xp^i) − Pe(xe^j)|| ≤ ε for some i, and 1 otherwise
z(t) = z(t⁺) = g(z(t⁻), x(t))  (2.15)

where g ≜ (g1, . . . , gN)^T, z ≜ (z1, . . . , zN)^T, z ∈ Z = Π_{j=1}^N Zj, Zj = {0, 1}, and z(t⁺), z(t⁻)
denote the right and left limits at time t, respectively. Assuming the evader stops after
capture, the dynamics can be revised to their final form as
ẋ = f(x(t), z(t), a(t), b(t)),  x(0) = x0  (2.16)

where f ≜ (fp^1, . . . , fp^M, z1 · fe^1, . . . , zN · fe^N)^T and a ∈ Aa = Π_{i=1}^M Ai, b ∈ Ba = Π_{j=1}^N Bj.
The pursuer (evader) seeks to minimize (maximize) the objective
J(x, z, a, b) = ∫_{t0}^T G(x(t), z(t), a(t), b(t)) dt + Q(x(T))  (2.17)
subject to (2.15) and (2.16), with the same restrictions on G(·) and Q(·) as in the two-player
formulation above. For the pure pursuit of multiple evaders the objective becomes
J(x, z, a, b) = ∫_t^T [ Σ_{j=1}^N zj(t) ] dt  (2.18)
which is the sum of the capture times for each evader.
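This bookkeeping can be sketched for a single pursuer: the flags zj freeze a captured evader's dynamics as in (2.16), and the objective (2.18) integrates Σj zj(t). The greedy nearest-evader pursuer and radially fleeing evaders below are illustrative placeholder policies, not the optimal strategies developed later, and all names and parameter values are example choices.

```python
import math

# Sketch of the Section 2.4 bookkeeping for one pursuer and N evaders:
# z_j freezes a captured evader (2.16), a capture event flips z_j (2.15),
# and J accumulates sum_j z_j(t) dt, the objective (2.18). Policies here
# (greedy pursuer, radially fleeing evaders) are placeholders.
def successive_pursuit(xp, evaders, nu=0.3, eps=0.1, dt=1e-3, t_max=100.0):
    xp, evaders = list(xp), [list(e) for e in evaders]
    z = [1] * len(evaders)            # z_j = 1 while evader j is free
    t = J = 0.0
    while any(z) and t < t_max:
        # unit-speed pursuer heads toward the nearest uncaptured evader
        j = min((k for k in range(len(z)) if z[k]),
                key=lambda k: math.dist(xp, evaders[k]))
        d = math.dist(xp, evaders[j])
        xp[0] += (evaders[j][0] - xp[0]) / d * dt
        xp[1] += (evaders[j][1] - xp[1]) / d * dt
        for k, e in enumerate(evaders):
            if z[k]:                                  # captured evaders stop
                r = math.dist(xp, e)
                e[0] += (e[0] - xp[0]) / r * nu * dt  # flee radially, speed nu
                e[1] += (e[1] - xp[1]) / r * nu * dt
                if math.dist(xp, e) <= eps:
                    z[k] = 0                          # capture event (2.15)
        J += sum(z) * dt                              # objective (2.18)
        t += dt
    return J

cost_one = successive_pursuit((0.0, 0.0), [(1.0, 0.0)])
cost_two = successive_pursuit((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)])
```

For a single collinear evader the returned cost reduces to the simple-pursuit capture time (1 − ε)/(1 − ν); with two evaders the integrand counts both until the first capture, so J equals the sum of the two capture times.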
Chapter 3
Optimal and approximate solutions
The solution to a differential game produces the optimal player controls and game outcome
given these controls. One advantage of the differential game formulation, though, is that
one can also obtain information about entire sets of game trajectories. For example, one
can determine the set of all initial conditions where a certain threshold objective, such as
capture, is met. Additionally, the differential solution with its continuous formulation can
reveal the “topography” of the game – conditions where certain decisions yield larger or
smaller payoffs, where or when critical control decision points occur, or when to use mixed
or behavioral strategies, for example.
It will be seen, however, that some solutions to differential games can be extremely com-
plex, even when the player dynamics are simple. Furthermore, such solutions may require
heavy numerical computation. For these reasons it is desirable to have a reliable approxi-
mation to the game solution, especially if the approximation can be realized in real-time.
The sections in this chapter outline the solution to the zero-sum formulation of a dif-
ferential game by first demonstrating the approach for solving a two-player zero-sum game.
The analytical solution to the simple pursuit and Homicidal Chauffeur games will be cov-
ered briefly, including some comments on the game topography. Approximation of the game
outcome will then be addressed, including an adaptation of the limited lookahead method
from optimal control to multi-player games, work formulated previously by Li [7]. Additional
techniques for approximating the cost-to-go estimate of the limited lookahead method, also
derived by Li, will be outlined and adapted to the present problem. Finally, the chapter
gives an example solution for successive capture of several evaders for the case where the
pursuer capture sequence is either known or unknown to the evaders.
3.1 Solution approach for the two-player differential pursuit game
The solution to a two-player, zero-sum pursuit-evasion differential game consists of the fol-
lowing elements [51]:
• The capture set Sc ⊂ S where the capture of the evader is guaranteed, ∀x ∈ Sc
• The escape set Se ⊂ S (Se ∩ Sc = ∅) where capture is prevented indefinitely, ∀x ∈ Se
• The optimal pursuer strategy α∗ which guarantees game termination, ∀x ∈ Sc
• The optimal evader strategy β∗ which guarantees that a game does not terminate,
∀x ∈ Se
• The game value function V (x) = J(x, α∗, β∗), if it exists, representing the game out-
come
Optimal play or the optimal trajectory for a PE game is defined as the triplet (x, α∗, β∗)
for games where x ∈ Sc, and the value function V (x) = J(x, α∗, β∗) is the optimal outcome
at x.
A value function is said to exist if the following is satisfied [4]. First, define the upper
value function

V̄(x) = min_{α∈A} max_{β∈B} { ∫_t^T G(x(τ), α(x(τ)), β(x(τ))) dτ + Q(x(T)) }  (3.1)

and the lower value function as

V̲(x) = max_{β∈B} min_{α∈A} { ∫_t^T G(x(τ), α(x(τ)), β(x(τ))) dτ + Q(x(T)) }.  (3.2)
Assuming V̄(x) is differentiable in t and x, it satisfies the partial differential equation

−∂V̄/∂t = min_{a∈A} max_{b∈B} [ (∂V̄/∂x) f(x, a, b) + G(x, a, b) ]  (3.3)

and analogously for V̲(x),

−∂V̲/∂t = max_{b∈B} min_{a∈A} [ (∂V̲/∂x) f(x, a, b) + G(x, a, b) ].  (3.4)
If the upper and lower values are equal,

V̄(x) = V̲(x) = V(x),  (3.5)

then the so-called Isaacs condition is satisfied and one obtains a single game value V(x)
satisfying the Hamilton-Jacobi-Isaacs equation:

−∂V/∂t = min_{a∈A} max_{b∈B} [ (∂V/∂x) f(x, a, b) + G(x, a, b) ]  (3.6)
       = max_{b∈B} min_{a∈A} [ (∂V/∂x) f(x, a, b) + G(x, a, b) ].  (3.7)
From Basar and Olsder [4], the following theorem establishes the existence of the game
value function.
Theorem 1. If a continuously differentiable function V(x) exists that (i) satisfies the HJI
equation (3.6), (ii) satisfies V(x(T)) = Q(x(T)) on the boundary of the target set Λ, and (iii) either
α* or β* generates trajectories that terminate in finite time, then V(x) is the value function
and the pair (α*, β*) satisfies the saddle condition

J(x, α*, β) ≤ J(x, α*, β*) ≤ J(x, α, β*).  (3.8)
The saddle-point condition (3.8) of the above zero-sum game constitutes the Nash equi-
librium of the differential game. Under this condition, neither player can improve their
guaranteed result, V (x), by a unilateral deviation from their optimal strategy [51]. For the
remainder of the paper, an optimal strategy is one in the sense of (3.8) and will be called
a guaranteeing strategy, since each party in the differential game can guarantee at least the
game value V (x).
It should be noted that the Isaacs condition will hold for f and G that are separable, i.e.,
f(x, a, b) = f1(x, a) + f2(x, b),
G(x, a, b) = G1(x, a) +G2(x, b).
For cases where the Isaacs condition does not hold, such as when the value function or its
derivative is discontinuous, one may solve for the upper value V (x). The formulation for
the sub-optimal solution of a pursuit-evasion game in Section 3.4 addresses this condition in
more detail.
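The separability argument can be illustrated numerically by discretizing both control sets for the simple-pursuit Hamiltonian (derived in Section 3.2) and comparing min-max with max-min. The costate value p = (0.8, 0.6), speed ratio ν = 0.5, and grid resolution below are illustrative choices, not values from the thesis.

```python
import math

# Numerical illustration that min-max and max-min coincide for a separable
# Hamiltonian: here the simple-pursuit H = p1(nu sin psi - sin phi)
#                                        + p2(nu cos psi - cos phi) + 1,
# which splits into a phi-only term plus a psi-only term.
NU = 0.5

def H(p, phi, psi):
    p1, p2 = p
    return (p1 * (NU * math.sin(psi) - math.sin(phi))
            + p2 * (NU * math.cos(psi) - math.cos(phi)) + 1.0)

grid = [-math.pi + 2.0 * math.pi * k / 360 for k in range(1, 361)]
p = (0.8, 0.6)   # |p| = 1, so the exact saddle value is 1 - |p| + NU*|p| = 0.5

minmax = min(max(H(p, a, b) for b in grid) for a in grid)
maxmin = max(min(H(p, a, b) for a in grid) for b in grid)
```

Because H separates in (φ, ψ), the inner maximization is independent of the outer minimization and the two orderings agree, which is the Isaacs condition restricted to this grid.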
It remains to show how to solve for the saddle-point equilibrium V (x) for the differential
game formulation above. The following theorem is also from Basar and Olsder [4]:
Theorem 2. Suppose the pair of feedback strategies (α∗, β∗) provides a saddle-point solution
to the differential game (2.1) - (2.4), with x∗(t) denoting the corresponding state trajectory.
Furthermore, let its open-loop representation {a(t) = α(t, x∗(t)), b(t) = β(t, x∗(t))} also
provide a saddle-point solution. Then there exists a costate function p(·) : [0, T ]→ Rn such
that the following relations are satisfied:
ẋ*(t) = f(x*(t), a*(t), b*(t)),  x*(0) = x0  (3.9a)
H(x*, p, a*, b) ≤ H(x*, p, a*, b*) ≤ H(x*, p, a, b*),  ∀a ∈ A, ∀b ∈ B,  (3.9b)
ṗ^T(t) = −(∂/∂x) H(x*(t), p(t), a*(t), b*(t)),  (3.9c)
p^T(T) = (∂/∂x) Q(x*(T)) along ℓ(x(T)) = 0,  (3.9d)

where

H(x, p, a, b) ≜ G(x, a, b) + p^T f(x, a, b)  (3.10)
is the Hamiltonian and
H(x, p, a*, b*) = min_{a∈A} max_{b∈B} H(x, p, a, b)  (3.11)

is known as the first main equation of Isaacs [5]. In pursuit-evasion games, the costate
equation is the gradient of the value function,

p^T(t) = (∂/∂x) V(x(t)).
Note that in this case, the gradient of V(x(t)) is a function of time only.¹

¹ Also recall that the state vector x(t) may contain the variable t.

The equations in (3.9) can be used to solve for the (regular) optimal trajectories, control
strategies, and value function where it is continuous and differentiable. In the section that
follows, the solution methodology will be demonstrated for simple pursuit. Later a partial
solution for the Homicidal Chauffeur game will be shown.
There are many cases in differential games where V (x) is discontinuous in the derivative
or in the function itself, or when the optimal strategies α∗ or β∗ are not unique. These
situations give rise to singular surfaces which divide the game set into mutually disjoint
regions where V (x) is continuous. Within the continuous regions – the regular part of the
game space – the costate equations above can be solved to obtain regular trajectories. At the
discontinuous boundaries, however, additional techniques must be used to find the singular
surfaces. The Homicidal Chauffeur example in Section 3.3 provides a brief example of a
singular solution and identifies a few additional singular surfaces present in the game. For a
more detailed introduction to singular surfaces, see Lewin [51].
The determination of the capture and escape sets is another important element of the
differential PE game solution. Such sets can be constructed from the game set boundaries
or from singular surfaces. Since most of the example solutions in this paper are confined
to the capture set, a detailed discussion of determining capture and escape sets will not be
covered here. The example solutions in this chapter will address capturability briefly.
3.2 Two-player simple pursuit example
As an example of how to solve a differential game using the equations in (3.9), this section
examines the two-player, zero-sum simple pursuit game defined in Section 2.2. The goal is to
find a set of candidate optimal trajectories that begin in the capture region Sc and terminate
on the target set Λ. The simple pursuit example below will demonstrate the procedure for
finding solutions in the regular part of the game space. The procedure follows the approach
from Lewin [51].
To identify a candidate regular trajectory, one must begin by partitioning the target
set into a usable part where such trajectories may terminate, and a non-usable part where
optimal trajectories cannot terminate. The usable part of the target set ΛUP are the points
along the boundary ∂Λ that satisfy

Λ_UP ≜ {x ∈ ∂Λ | min_{a∈A} max_{b∈B} [f(x, a, b) · n(x)] < 0}  (3.12)

where n(x) ∈ R^n is a unit vector normal to the target set pointing into the game set. The
non-usable part Λ_NUP can be defined analogously with the inequality reversed, and the
boundary Λ_BUP is (3.12) where the condition is an equality.
For the simple pursuit game of Section 2.2, one can evaluate the condition in (3.12) by
first determining the controls ā = φ̄ ∈ {−π < φ ≤ π}, b̄ = ψ̄ ∈ {−π < ψ ≤ π}:

φ̄ = arg min_φ [f(x, a, b) · n(x)] = arg min_φ [(ν sin ψ − sin φ) x1/ε + (ν cos ψ − cos φ) x2/ε]
ψ̄ = arg max_ψ [f(x, a, b) · n(x)] = arg max_ψ [(ν sin ψ − sin φ) x1/ε + (ν cos ψ − cos φ) x2/ε]

where n(x) = (x1/ε, x2/ε)^T and ε² = x1² + x2² on the boundary of the target set. Evaluating
these optimization conditions yields

tan φ̄ = x1/x2 = tan ψ̄,

signifying that the controls at the boundary point away from the origin. Substituting sin φ̄ =
x1/ε, cos φ̄ = x2/ε into the condition in (3.12), one obtains

f(x, ā, b̄) · n(x) = (ν − 1)(x1² + x2²)/ε² < 0,  ∀x ∈ ∂Λ
since ν < 1 by definition. Since this is satisfied for all x, the entire target set boundary is
the usable part, Λ_UP = ∂Λ. This means that the regular optimal trajectories can terminate
anywhere on the circle ℓ(x) = x1² + x2² − ε² = 0.
One now seeks the candidate optimal control laws for each player. This is obtained using
the first main equation of Isaacs (3.11):

φ* = arg min_φ H(x, p, φ, ψ*) = arg min_φ [p1(ν sin ψ* − sin φ) + p2(ν cos ψ* − cos φ) + 1]
ψ* = arg max_ψ H(x, p, φ*, ψ) = arg max_ψ [p1(ν sin ψ − sin φ*) + p2(ν cos ψ − cos φ*) + 1]
where H(x, p, φ, ψ) is the Hamiltonian from (3.10) and G(x, a, b) = 1.
Proceeding in the same manner as with the boundary condition above, one obtains

tan φ* = p1/p2 = tan ψ*  (3.13)

suggesting that the player controls are parallel.² To fully determine the controls, the costate
variables p^T = (p1, p2) = ∂V/∂x need to be determined. Referring to (3.9c) and noting that H
is independent of x, one can deduce that p is constant and therefore the player controls yield
constant, straight-line motion.
The equation in (3.9c) is known as the adjoint equation or retro-path equation (RPE), as
it signifies an integration along a path from the terminal set in reverse time. Since ṗ = 0, it
is evident that an additional condition is needed to solve for p. One condition can be found
from the terminal condition in (3.9d). Before proceeding, however, it is useful to state three
lemmas from Lewin [51]:

² Anti-parallel would imply the optimal evader control is to always approach the pursuer!
Lemma 1. At points x in the useable part where optimal trajectories terminate:
V (x) = Q(x) (3.14)
Lemma 2. If the subset of the usable part where optimal trajectories terminate is of dimen-
sion m and if κ are m vectors that span the tangent to that subset at x, then the following
m relations between the directional derivatives of V (x) and Q(x) hold:
∇V (x) · κ = ∇Q(x) · κ (3.15)
Lemma 3. For points in the capture set Sc that belong to regular parts of optimal trajectories,
Equation (3.9b) and the following relation must hold:
H(x, p, a∗, b∗) = 0 (3.16)
The first lemma simply states that the value at the usable part of the terminal set is the
cost function itself. The second lemma gives a tangent boundary condition for the gradient
of the value function at the boundary (equivalent to the costate p^T) that can be used to solve
the RPE. Equation (3.9b) from the third lemma together with (3.11) constitutes Isaacs' first
main equation, and (3.16) is the second main equation. The second main equation suggests
that, for terminal cost games, the Hamiltonian for the regular part of the solution space can
be interpreted as a measure (in the informal sense) of how much the game trajectory points
perpendicular to the gradient of the game value.
To finish the simple pursuit solution, the tangent relation (3.15) suggests that, since
Q(x) = 0,

∇V(x) · κ = p^T κ = 0,

or that p is perpendicular to the terminal surface. Since φ* and ψ* are parallel to p, they
must also point normal to the surface. With κ = (−x2/ε, x1/ε)^T at the terminal boundary,
one obtains

p1/p2 = x1/x2 = x1(0)/x2(0).
Since the player controls point in the direction defined by p1/p2, the optimal pursuer and evader
controls consist of straight-line motion away from the pursuer along the initial line connecting
the two players. Since the optimal trajectories cover the entire game space and can terminate
anywhere on the target set, it can be concluded that the locally-derived optimal controls are
globally optimal. Furthermore, since ν < 1 the pursuer will always overtake the evader for
any initial condition, so the capture set Sc is the entire game set S.
To determine the game value V(x), more conditions on p are needed. Using the second
main equation (3.16), one obtains

H(x, p, φ*, ψ*) = p1(ν sin ψ* − sin φ*) + p2(ν cos ψ* − cos φ*) + 1 = 0.

Substituting p1 = p2 tan φ* = p2 tan ψ* and noting that cos φ* = cos ψ*, one obtains after some
algebra

cos ψ* = cos φ* = p2(1 − ν)
sin ψ* = sin φ* = p1(1 − ν)

which, when substituted back into (3.16), yields

p1² + p2² = (1/(1 − ν))².
Finally, substituting p1 = (x1/x2) p2 into the above gives

p1 = ± (x1/√(x1² + x2²)) (1/(1 − ν))
p2 = ± (x2/√(x1² + x2²)) (1/(1 − ν))

which, when integrated with respect to x, returns

V(x) = √(x1² + x2²)/(1 − ν) + C.
Using (3.14) from the first lemma resolves the constant of integration at the terminal set
to finally obtain

V(x) = (√(x1² + x2²) − ε)/(1 − ν),

which is the geometrically intuitive result. For example, for an evader speed of ν = 1/2, the
pursuer captures at a location twice the initial relative distance (minus the target radius)
along the initial bearing to the evader.
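The closed-form value can also be verified by forward simulation: integrating the reduced dynamics (2.7) under the derived straight-line controls and comparing the resulting capture time with V(x). The parameters and names below are example choices, not values from the thesis.

```python
import math

# Numerical check of V(x) = (sqrt(x1^2 + x2^2) - eps)/(1 - nu) by
# integrating the reduced dynamics (2.7) under the straight-line
# saddle-point controls: both players move along the initial line of sight.
def capture_time(x0, nu=0.5, eps=0.1, dt=1e-4):
    x1, x2 = x0
    r0 = math.hypot(x1, x2)
    s, c = x1 / r0, x2 / r0          # fixed headings: sin/cos of phi* = psi*
    t = 0.0
    while math.hypot(x1, x2) > eps:
        x1 += (nu * s - s) * dt      # (2.7) with sin(phi*) = sin(psi*) = s
        x2 += (nu * c - c) * dt
        t += dt
    return t

x0, nu, eps = (1.0, 1.0), 0.5, 0.1
v_closed = (math.hypot(*x0) - eps) / (1.0 - nu)
v_sim = capture_time(x0, nu, eps)
```

The simulated capture time agrees with the closed-form value up to the integration step size, since the relative motion is exactly collinear and the range shrinks at rate 1 − ν.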
Such a result can be determined by simpler means using geometrical arguments, as Isaacs
did [5], but the result here does illustrate the basic solution procedure. The example in the
following section is more challenging and requires many of the tools presented here.
3.3 Homicidal Chauffeur example
The Homicidal Chauffeur (HC) game consists of a pursuer (a car) with a turn rate limit who
chases a pedestrian who is slower but can turn instantaneously. The game was introduced
by Isaacs, who showed that the solution exhibits a variety of singular surfaces. Merz [8] in
his dissertation discovered twelve different singular phenomena in twenty different regions
of the game's parameter space. Singular surfaces have a profound effect on the formation
of optimal player strategies, often requiring player controls to consist of several different stages
with a variety of control laws. The Homicidal Chauffeur game’s nonlinear dynamics and rich
set of singular surfaces make it a good candidate for testing the viability of the lookahead
method and its utility in computing complex strategies in an automated way.
This section demonstrates a few of the singular phenomena present in the Homicidal Chauffeur
game and addresses briefly the solution for a single set of parameters in a limited region
of the game space. The exposition below follows the works of Isaacs [5] and Merz [8].
The solution to HC begins as with the previous example by finding the usable and non-
usable parts of the circular target set. Let x = (ε sin θ, ε cos θ)T be a point on the boundary
of the target set `(x) = 0, with θ defined clockwise from the x2-axis. The vector normal to
the circle is n(x) = (n1, n2)^T = (sin θ, cos θ)^T. Using condition (3.12) one obtains

min_φ max_ψ [n1 ẋ1 + n2 ẋ2] = min_φ max_ψ [sin θ(−ω(ε cos θ)φ + ν sin ψ) + cos θ(ω(ε sin θ)φ − 1 + ν cos ψ)]
                             = max_ψ [−cos θ + ν cos(ψ − θ)]   (the φ terms cancel identically)
                             = ν − cos θ < 0,

which yields the condition for the angle θB at the boundary of the usable part,

cos θB = ν,  0 ≤ θB ≤ π/2,  (3.17)

where θB is confined to the first quadrant. The usable part is then

|θ| < θB  (3.18)
with the boundary occurring at θB. The useable part (UP) is identified in the diagram in
Figure 3.1.
Figure 3.1: The value map and singular surfaces of a Homicidal Chauffeur game with vp = 3, ve = 1, ω = 1/3 and ε = 1. Coordinates are centered on the pursuer, with the pursuer heading aligned with the vertical (x2) axis. The contours represent capture times for various initial conditions, sampled at 0.5 time units and increasing outward from the usable part (UP) of the target set. The labeled features include the barrier, universal line, equivocal line, dispersal line, and regular trajectories. All distances and times are normalized by the pursuer speed.
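The boundary angle from (3.17) is easy to evaluate for the parameters of this section: θB = arccos ν ≈ 70.5° for ν = 1/3, so most of the front half of the target circle is usable. A small sketch (the helper name `boundary_speed` is illustrative):

```python
import math

# Boundary of the usable part from (3.17): cos(theta_B) = nu.
# With nu = 1/3 (the Section 3.3 value), theta_B ~ 1.231 rad ~ 70.5 deg.
NU = 1.0 / 3.0
theta_b = math.acos(NU)

def boundary_speed(theta):
    # max over psi of [-cos(theta) + nu*cos(psi - theta)], attained at psi = theta
    return -math.cos(theta) + NU

# Negative inside |theta| < theta_b (usable part), zero at the boundary
# (BUP), positive beyond it (non-usable part).
```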
For the range of parameters used in this game, the boundary of the usable part has an
interesting property – it is the origin of a singular surface called a barrier. A barrier is a line
or surface in the game set where the game value is discontinuous and is so called because
neither player can penetrate the surface if the other plays optimally. The condition for a
barrier is similar in form to Isaacs' main equations and the equation for the usable part:

min_{a∈A} max_{b∈B} [f(x, a, b) · n(x)] = 0  (3.19)
where n(x) ∈ Rn in this case is a vector normal to the barrier surface. The derivation of the
barrier in this example is illustrative and will be given briefly; for full details, see [5].
Substituting the dynamics (2.11) into the main equation (3.16),

min_φ max_ψ (−ω(x2 n1 − x1 n2)φ − n2 + ν(n1 sin ψ + n2 cos ψ)) = 0,

and solving, one obtains

φ̄ = sgn S = σ,  σ ∈ {−1, 1},

where S = x2 n1 − x1 n2 is the switch function that determines the direction of the pursuer
control. For the evader,

cos ψ̄ = n2/ρ,  sin ψ̄ = n1/ρ,  ρ = √(n1² + n2²),

and the main equation with φ̄, ψ̄ becomes

−σωS − n2 + νρ = 0.

The RPE equation (3.9c) and trajectory equations (2.11) then become

ẋ1 = −ωσ x2 + ν n1/ρ,  ẋ2 = ωσ x1 − 1 + ν n2/ρ
ṅ1 = −ωσ n2,  ṅ2 = ωσ n1
With the additional condition S = −n1 and, on ∂Λ, σ = sgn n1 = sgn θB, the above
equations can be solved to obtain the right barrier (σ = 1):

x1 = (ε − ντ) sin(θ + σωτ) + (1 − cos σωτ)/ω  (3.20)
x2 = (ε − ντ) cos(θ + σωτ) + (sin σωτ)/ω  (3.21)
The left barrier (σ = −1) is symmetric with the right. Figure 3.1 illustrates the barrier
paths.
Note that the barrier paths terminate before reaching the x2 axis. The switch function
determines this termination point. Using the relations above, the switch function can be
found to be
S = [cos θ − cos(θ + σωτ)] .
The barrier continues until the switch function is no longer positive, or
θ + σωτ = 2π − θ.
At this point the barrier terminates and optimal trajectories can be routed around it.
An evader starting behind the barrier requires, then, that the pursuer travel away from
the evader for some period before turning full circle and finally pursuing along a straight
line in the same manner as the simple pursuit game. The evader, on the other hand, chases
the pursuer directly along their connecting line until their trajectory passes the end of the
barrier. For the rest of the pursuit, the evader flees along a straight line tangential to the
pursuer’s turning circle until capture occurs.
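The barrier curve (3.20)-(3.21) and its termination condition can be sampled directly. The sketch below uses the parameters of this section with σ = +1 and launches the curve from θ = θB; all names are illustrative.

```python
import math

# Sampling the right barrier (3.20)-(3.21) for nu = 1/3, eps = 1,
# omega = 1/3, sigma = +1, launched from theta = theta_B. The barrier
# ends where the switch function S = cos(theta) - cos(theta + sigma*omega*tau)
# returns to zero, i.e. at theta + omega*tau = 2*pi - theta.
NU, EPS, OMEGA, SIGMA = 1.0 / 3.0, 1.0, 1.0 / 3.0, 1.0
THETA_B = math.acos(NU)

def barrier_point(tau):
    ang = THETA_B + SIGMA * OMEGA * tau
    x1 = (EPS - NU * tau) * math.sin(ang) + (1.0 - math.cos(SIGMA * OMEGA * tau)) / OMEGA
    x2 = (EPS - NU * tau) * math.cos(ang) + math.sin(SIGMA * OMEGA * tau) / OMEGA
    return x1, x2

tau_end = (2.0 * math.pi - 2.0 * THETA_B) / OMEGA   # switch-function zero
points = [barrier_point(tau_end * k / 50) for k in range(51)]
```

At τ = 0 the curve starts on the target circle at angle θB, and the switch function stays positive until τ = tau_end, reproducing the termination behavior described above.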
To address the game solution, one is interested in finding the optimal player controls and
trajectories. In this problem, the regular trajectories that emanate from the terminal set are
less significant than in simple pursuit, as they originate fairly close to the target set and
fill very little of the game set. The derivation of these trajectories is similar to the previous
examples and will not be reproduced here; more details can be found in [8] and [5]. The
regular trajectories can be found to be

x1* = (ε − ντ) sin(θB + ωτ) + (1 − cos ωτ)/ω
x2* = (ε − ντ) cos(θB + ωτ) + (sin ωτ)/ω
An example trajectory emanating from the target set can be seen in Figure 3.1. Note that
this trajectory begins on the barrier. For this case, all trajectories terminating on the usable
part except at x1 = 0 begin on the barrier at a point called the dispersal point and do not
fill the entire game set. This behavior leaves a void above the target set, and other methods
must be used to obtain candidate trajectories.
Much of the game set for this example is filled with optimal trajectories that are tribu-
taries to a singular line called a universal line by Isaacs. Optimal trajectories join this line
transversely from both sides and then travel along it. In this game the x2 axis constitutes the
universal line for x2 > ε as well as a portion below the target set (see Figure 3.1). Universal
lines act as tributaries for optimal trajectories such that, should a player act sub-optimally,
the optimal next move is to return to the line. Note that in some instances a non-admissible
strategy – one where one player must know the control of the other to act optimally – is
required to remain on the line. This can result in a chatter condition where the player con-
stantly oscillates to and from the surface. An example of this will be seen in the results of
Section 5.4.
Universal surfaces are often good candidates for finding optimal trajectories within voids.
On the universal line the switch function, its retrograde derivative, and the Hamiltonian are
all zero (see Lewin [51, p. 187]). Using these relations, one can derive the following optimal
trajectories that fill much of the void of the present game [5]:

x1 = (h − ντ) sin σωτ + (1 − cos σωτ)/ω
x2 = (h − ντ) cos σωτ + (sin σωτ)/ω

and the value function

V = (h − ε)/(1 − ν) + τ
where h is the distance along x2 from ε where the optimal trajectory contacts the universal
line. To compute V one can rearrange the trajectory equations and solve for τ and h assuming
the condition x1 = 0 when τ = 0:

cos σωτ = [−R(x1 − R) + x2 √(x1² + x2² − 2x1R)] / [(x1 − R)² + x2²]
sin σωτ = [x2R + (x1 − R) √(x1² + x2² − 2x1R)] / [(x1 − R)² + x2²]
where R = 1/ω is the turn radius. It is these trajectories and the contours of this expression
for V that fill much of the region shown in Figure 3.1.
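Given a point (x1, x2) in this region, τ, h, and hence V can be recovered by the inversion above. The sketch below uses the parameters of this section and assumes σ = +1 (x1 ≥ 0); it is valid only where the square root is real and cos ωτ is nonzero, and the function name is illustrative.

```python
import math

# Recovering tau, h, and V for a point filled by the universal-line
# tributaries, using nu = 1/3, eps = 1, omega = 1/3, sigma = +1.
NU, EPS, OMEGA = 1.0 / 3.0, 1.0, 1.0 / 3.0
R = 1.0 / OMEGA                       # pursuer turn radius

def void_value(x1, x2):
    disc = math.sqrt(x1 * x1 + x2 * x2 - 2.0 * x1 * R)
    denom = (x1 - R) ** 2 + x2 * x2
    c = (-R * (x1 - R) + x2 * disc) / denom    # cos(sigma*omega*tau)
    s = (x2 * R + (x1 - R) * disc) / denom     # sin(sigma*omega*tau)
    tau = math.atan2(s, c) / OMEGA
    # invert x2 = (h - nu*tau)*cos(omega*tau) + sin(omega*tau)/omega for h
    h = (x2 - math.sin(OMEGA * tau) / OMEGA) / math.cos(OMEGA * tau) + NU * tau
    return (h - EPS) / (1.0 - NU) + tau        # value V = (h - eps)/(1 - nu) + tau

# On the universal line (x1 = 0), tau = 0 and V reduces to (x2 - eps)/(1 - nu).
v_axis = void_value(0.0, 3.0)
```

A round trip through the forward trajectory equations (pick h and τ, generate (x1, x2), then invert) reproduces the same value, which is a convenient consistency check on the inversion.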
The optimal player controls for this region, aside from the area behind the barrier, are
as described previously, where the pursuer executes a hard turn in one direction and follows
with straight simple pursuit, while the evader flees along a straight line tangentially from
the pursuer turning circle. A detailed derivation of these control strategies can be found in
Isaacs [5] or Merz [8].
It is evident from the phenomena of the barrier and universal line in this example that
singular surfaces can have a significant effect on player strategies. The presence and type
of singular surfaces in a differential game can vary according to the game parameters. The
parameters for the present section follow the example from Patsko et al. [14] with an evader-
to-pursuer speed ratio ν = 1/3, a capture radius ε = 1 and turn rate ω = 1/3. These
parameters correspond to Region IIc of Merz, wherein a barrier, a universal line, a pursuer
dispersal line, an equivocal line, and safe contact may be encountered in the course of play
(see [8], also [52]). The barrier and universal line concepts have been discussed previously.
A dispersal line for a player indicates a set of points where the player, upon reaching
that point, must decide between two equally valid optimal strategies and make an immediate
change of course using the selected control. In this Homicidal Chauffeur example, a pursuer
dispersal line occurs along the negative x2 axis (see Figure 3.1) where the pursuer faces
directly away from the evader and must choose to turn sharply either right or left. Both the
gradient of V and the switch function are discontinuous across this line.
An equivocal line for a player indicates a set of points where the optimal strategy for
the player may be either to choose to remain on the equivocal line or to deviate. In this
game, an equivocal line for the evader extends from the end of the barrier to the negative
x2 axis, joining at the junction of the universal line and the dispersal line (again refer to
Figure 3.1). If the evader chooses to remain on the line it can travel to the end of the barrier,
along which it can travel to terminate the game tangentially along the terminal set. The
behavior of traveling alongside a boundary, barrier, or terminal set is called safe contact. If
for a particular game a mere grazing of the terminal set is a result preferred by the evader,
it may elect to follow this strategy. Otherwise the evader deviates from the equivocal line
and follows one of the optimal trajectories emanating therefrom.
Travel along the equivocal line requires that the evader follow a path in pure pursuit of
the pursuer. For the pursuer, travel along an equivocal line requires a mixed strategy, where
its optimal control must be selected from two control options according to some probability.
This control, unlike the hard-turn controls described previously, has time-varying curvature,
where the direction of the curvature is selected randomly at each instant. This behavior can
result in a chattering phenomenon similar to that of the universal line mentioned previously.
However, such a condition can be avoided, as noted by Isaacs, if the pursuer plays sub-
optimally for a brief period in order to draw the evader beyond the equivocal line and onto
a regular trajectory that requires only a simple sharp turn.
It should be noted that where the equivocal line, universal line, and dispersal line meet
there is a condition where both the pursuer and evader have different control options. In this
case both parties may have to execute mixed strategies. The presence of mixed strategies
suggests that an automated numerical solution to the Homicidal Chauffeur must address the
randomized selection of player controls (see Section 4.2).
While not present in this particular instantiation of the game, other singular surfaces can
occur in the Homicidal Chauffeur and other differential games, for example a switch envelope or a focal
line. For a good review of the topography of singular surfaces, see Lewin [51, ch. 8]. It
should be noted that a focal line – similar to a universal line, but where trajectories contact
tangentially – will be seen in the two-evader simple pursuit game in Section 3.7.
Because of the complexity of the singular surfaces within the game, Homicidal Chauffeur
has been used as a test case for several numerical solution schemes such as level set methods
[52]. Such schemes are able to generate value maps, such as the contours in Figure 3.1, as
lookup tables which can be used by online approximation schemes such as limited lookahead.
This effectively enables fast and automatic generation of optimal controls without relying
on the detailed analysis of this section. The goal of the remainder of this chapter and the
simulation results of Section 5.4 is to introduce limited lookahead and examine its viability
for approximating optimal controls for games with singular value functions.
3.4 Value function when the Isaacs condition is not satisfied
Before an approximate solution to a two-player differential game can be obtained, it is
first necessary to examine the case when the Isaacs condition (3.5) is not known to hold.
This can be necessary, say, when one has only an approximate upper (lower) bound of the
upper (lower) value of the game, as will be the case for the sub-optimal approaches of the
next section. In this situation it is often the case [4] that one of the players assumes an
instantaneous informational advantage over the other. Formally, the team of pursuers can
assume a strategy α : B(t) → A(t), α ∈ Γ(t) based on a strategy from the evader set B(t).
Analogously, the evaders assume a strategy β : A(t) → B(t), β ∈ ∆(t). Sets Γ and ∆
contain all possible nonanticipative strategies – strategies based only on the opponent's current
or previous states and controls – for the pursuers and evaders, respectively.
Given these informational constraints, the following value functions can be defined:

$$V^+(x(t), z(t)) = \inf_{a \in A(t)} \sup_{\beta \in \Delta(t)} J(x(t), z(t), a(t), \beta[a](t)) = \inf_{a \in A(t)} \sup_{\beta \in \Delta(t)} \left( \int_t^T G(x(t), z(t), a(t), \beta[a](t))\, dt + Q(x(T)) \right) \tag{3.22}$$
where the evader has the informational advantage, and
$$V^-(x(t), z(t)) = \sup_{b \in B(t)} \inf_{\alpha \in \Gamma(t)} J(x(t), z(t), \alpha[b](t), b(t)) = \sup_{b \in B(t)} \inf_{\alpha \in \Gamma(t)} \left( \int_t^T G(x(t), z(t), \alpha[b](t), b(t))\, dt + Q(x(T)) \right) \tag{3.23}$$
where the pursuer has the advantage. Note that V +(x(t), z(t)) ≥ V −(x(t), z(t)).
Given the regularity conditions on f, G, and Q from earlier sections, $V^+$ and $V^-$ are
solutions to (3.3) and (3.4) and are thus equal to the upper and lower values of the game,
respectively (see [4]). This formulation is particularly useful when solving (3.3) and (3.4)
using the viscosity formulation initially derived by Crandall and Lions [10]. The viscosity
framework allows for numerical solutions to the Hamilton-Jacobi-Isaacs equations, including
value functions with discontinuities and discontinuous derivatives that commonly occur in
differential games. The level set method (see Section 4.1) uses the viscosity formulation to
solve for the value map for a variety of differential games.
The sub-optimal solutions of the following section, in addition to the viscosity solutions to
the HJI equation, will assume the informational advantages and value functions presented
here.
3.5 Limited lookahead for multi-player games
To date, a general solution to multi-player differential games has not been found. One
difficulty lies in defining terminal sets and specifying how the dynamics and objective should
change as different players reach the terminal sets at different times. In the formulation of
multi-player games in Section 2.4, a discrete variable representing the capture of each evader
was introduced to account for asynchronous capture. However, a solution to the Hamilton-
Jacobi-Isaacs equation with mixed continuous and discrete variables is not yet available. In
place of a general optimal solution, Li in his dissertation [7] developed a general methodology
for multi-player differential games to approximate the upper or lower value of the game using
the limited lookahead method. His work is summarized in this section and will be used
elsewhere in the paper to approximate the solution to the successive capture problem.
In the limited lookahead scheme, the current game value and optimal trajectories for all
of the players are computed over a small time interval [t, t + ∆t] using an estimate of the
game value from t+ ∆t to the capture time T . This game value estimate is analogous to the
cost-to-go of limited lookahead in optimal control [44] and analogous to a rollout policy. If
the cost-to-go has the improving property, then successive iterations of the lookahead scheme
will result in an approximate game value that approaches the true (upper or lower) value of
the game as the number of iterations approaches infinity. Correspondingly, the minimax (or
maximin) strategies of the players will also approach the optimal (guaranteeing) strategies.
In his dissertation, Li proves that the cost-to-go of the game formulation above has the
improving property and finite convergence to the game value under certain conditions to be
described subsequently.
To facilitate the estimation of the cost-to-go, it is beneficial to define a structured, or
restricted, control set with time-consistent elements. Imposing a structure on the control set
can, for example, reduce the number of control types that must be examined to find the
optimal strategy and ease the evaluation of different strategy combinations. In this paper,
a control set under structure S is denoted by $A_S(t, x, z)$ for the pursuers and $B_S(t, x, z)$
for the evaders.
In order to establish the improving property of the approximate cost-to-go, it is necessary
to require that the control sets be set-time-consistent. If a control is selected from a set-
time-consistent set at some time t, the control is guaranteed to be available at any later time
τ, t ≤ τ ≤ T. In particular, if a control structure S is independent of the state x, z and time
t, i.e., $A_S(t, x, z) \equiv A_S$ for all x, z, and t, then it is set-time-consistent.
It is assumed that the differential game takes place in the capture region, x ∈ S_c; that is,
for any x ∈ S_c, z ∈ Z, and any time t ≥ 0, there exists a ∈ $A_S(t, x, z)$ such that T < ∞ for
all b ∈ B(t). This assumption reduces the problem to finding optimal strategies for the players
without having to address capturability.
Under these assumptions, an approximate upper value can be defined as

$$\hat V(x, z) = \inf_{a \in A_S(t,x,z)} \sup_{\beta \in \Delta(t)} \left( \int_t^T G(x(t), z(t), a(t), \beta[a](t))\, dt + Q(x(T)) \right). \tag{3.24}$$
The definition of the lower value is analogous. It should be noted that $\hat V(x, z) \ge V^+(x, z)$,
i.e., the approximate upper value is an upper bound for the actual upper value.
It is now possible to state the theorem for limited lookahead in multi-player differential
games as proven by Li:
Theorem 3. Under a set-time-consistent control structure S, for any x ∈ S_c, z ∈ Z, and
∆t : 0 ≤ ∆t ≤ T − t, the function $\hat V(x, z)$ in (3.24) satisfies

$$\hat V(x, z) = \inf_{a \in A(t)} \sup_{\beta \in \Delta(t)} \left( \int_t^{t+\Delta t} G(x(\tau), z(\tau), a(\tau), \beta[a](\tau))\, d\tau + \hat V\big(x_{t,a,\beta[a]}(t + \Delta t),\, z_t(t + \Delta t)\big) \right) \tag{3.25}$$

where $x_{t,a,b}(\tau)$ and $z_t(\tau)$ are the continuous and discrete state vectors at time τ ≥ t as
generated from the initial states x(t) and z(t), respectively, under the controls a and b.
With the formulation in (3.25), one can use an estimate of the cost-to-go at a short time
interval ∆t later – the $\hat V$ term on the right-hand side of the equation – to estimate the game
value and then obtain the corresponding approximately optimal strategies $a^*$ and $\beta^*$ for the
game using (3.25). If $\hat V$ has the improving property, then subsequent evaluations of (3.25)
will yield approximate game values that are successively closer to the true upper value. Li
has proven that the limited lookahead value $\hat V(x, z)$ has the improving property and,
moreover, converges to the true upper value in finitely many iterations; for details, see [7].
This establishes the ability of the limited lookahead method to approximate the upper value
of a differential game and hence to determine approximately optimal controls.
3.6 Approximating cost-to-go for limited lookahead
As detailed in the previous section, the limited lookahead method requires an estimate of
the cost-to-go – the estimated game value at a future state – to refine the estimate of the
game value at the current state and determine the appropriate controls for the current time
interval. In formulating the limited lookahead method for multi-player games, Li introduced
the concept of a structured or restricted control set to facilitate estimation of the cost-to-go.
This section provides an example from Li [7] of a control structure that can be used in
running-cost games like simple pursuit to obtain a valid estimate of the cost-to-go.
Because it can be difficult to define terminal states in a general multi-player differential
game, Li proposes a hierarchical solution to the game where the game is divided into two
“levels” of optimization – an upper level where the assignment of pursuers and evaders is
optimized, and a lower level where the game value for a particular assignment is solved. For
assignments where pursuers engage more than one evader, the games are solved sequentially
with the assumption that the evaders know the strategy of the pursuer for all of the previous
engagements. In this sense, when approximating the upper value using the hierarchical
method, the evaders are given an informational advantage at the lower level in a manner
similar to that of (3.24) to derive a “local optimization” against the pursuer. The pursuer
then finds the assignment at the upper level that yields the smallest game value.
Let $s_i$ be the assigned capture sequence for pursuer i, represented by an ordered set of
evader indices, $s_i = \{s_{i1}, \cdots, s_{iN_i}\}$, $s_{ik} \in \{1, \cdots, N\}$, where $N_i$ is the number of evaders
assigned to pursuer i. Let $S_i$ be the set of all possible capture sequences for pursuer i and
$S = \prod_{i=1}^{M} S_i$ be the set of all possible pursuer team assignments. The (upper) game value
estimate for an engagement assignment $s = \{s_1, \cdots, s_M\} \in S$ at the upper level of the
hierarchical method is then given by

$$\hat V_h(x, z) = \min_{s \in S} \hat V_s(x, z) \tag{3.26}$$

where $\hat V_s(x, z)$ is the game value assuming the pursuers follow an assigned capture sequence
s; it represents the lower-level optimization of the hierarchical approach.
Assigning the team of pursuers a capture sequence s effectively imposes a control structure
on the pursuers in the sense of the structure S in Section 3.5, and hence the value $\hat V_s(x, z)$
can be obtained using the optimization from (3.24). It should be noted that, because the
number of evaders remains constant throughout the entire engagement, the set of possible
engagements S is independent of both time and state and is thus set-time-consistent, imply-
ing that $\hat V_h(x, z)$ has the improving property and is a valid starting point for an iterative
approach like limited lookahead [7].
If one assumes a game with running cost, such as a pure pursuit game with the objective

$$J(x, z, a, b) = \int_t^T \left[ \sum_{j=1}^{N} z_j(t) \right] dt$$

representing the sum of the capture times of the evaders, one can approximate $\hat V_h$ further.
First, it is assumed that each evader can be captured by at least one pursuer, and that an
evader is captured by no more than one pursuer. Then, assuming each pairing of pursuer
i and evader j can be solved as a two-player game, an upper value for the pairing can be
obtained individually as

$$V_{ij} = \inf_{a_i \in A_i(t)} \sup_{\beta_j \in \Delta_i^j(t)} J(x_{ij}, a_i, \beta_j[a_i]) \tag{3.27}$$

$$V_{ij} = \inf_{a_i \in A_i(t)} \sup_{\beta_j \in \Delta_i^j(t)} \int_t^T dt \tag{3.28}$$
where $x_{ij}$ is the combined state of pursuer i and evader j, and $A_i(t)$ and $\Delta_i^j(t)$ are defined
as in Section 3.4. The game value for the lower-level optimization within the hierarchical
method can then be formed as a sum of the two-player game capture times,
$$\hat V_s(x, z) = \sum_{i=1}^{M} \sum_{j \in s_i} V_{ij} \tag{3.29}$$
and used in the upper-level equation (3.26) to form the hierarchical estimate of the game
value. It should be noted that, since each evader is assumed to be capturable, $V_{ij} < \infty$
and, as proven by Li, $\hat V_h$ is uniformly continuous and therefore finitely convergent under
iteration, so that the limited lookahead method is valid for the hierarchical approach [7].
The simulation results from Chapter 5 will demonstrate the validity of this hierarchical
approach for single-pursuer, multiple-evader scenarios.
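The hierarchical estimate (3.26)/(3.29) can be sketched directly for the single-pursuer case. The following is a minimal illustration for simple pursuit with unit pursuer speed, under the simplifying assumption (not made in the text) that each evader is held fixed until engaged and then flees straight away from the pursuer; `pairwise_capture_time`, `capture_point`, and `hierarchical_value` are hypothetical names.

```python
import math
from itertools import permutations

def pairwise_capture_time(pursuer, evader, nu):
    """Two-player simple-pursuit subgame value: the evader flees straight
    away from the pursuer, so with unit pursuer speed the closing rate
    is (1 - nu) and the capture time is d / (1 - nu)."""
    return math.dist(pursuer, evader) / (1.0 - nu)

def capture_point(pursuer, evader, nu):
    """End point of the straight-line chase, on the ray from the pursuer
    through the evader, one capture time away at unit pursuer speed."""
    d = math.dist(pursuer, evader)
    t = d / (1.0 - nu)
    return (pursuer[0] + (evader[0] - pursuer[0]) / d * t,
            pursuer[1] + (evader[1] - pursuer[1]) / d * t)

def hierarchical_value(pursuer, evaders, nu):
    """Upper-level minimization (3.26) over capture sequences of the
    summed subgame values (3.29), for the sum-of-capture-times objective.
    Simplification: later evaders are held fixed until engaged."""
    best_value, best_seq = math.inf, None
    for seq in permutations(range(len(evaders))):
        p, elapsed, total = pursuer, 0.0, 0.0
        for j in seq:
            elapsed += pairwise_capture_time(p, evaders[j], nu)
            total += elapsed              # evader j's capture time from t = 0
            p = capture_point(p, evaders[j], nu)
        if total < best_value:
            best_value, best_seq = total, seq
    return best_value, best_seq
```

For one pursuer this enumerates all N! capture sequences; the tree search of Section 4.3 replaces this brute-force enumeration.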
3.7 Example solution for simple pursuit of several evaders
This section illustrates some of the elements of the solution to the single-pursuer, multiple-
evader scenario with successive capture – the Dynamic Traveling Salesman problem. The
dynamics are assumed to be those of simple pursuit (2.7) with the objective of pure pursuit
given by (2.18). It will be assumed that the pursuer speed is greater than any single evader
speed (νj < 1 for j = 1, · · · , N) so that capture of each evader is guaranteed (x ∈ Sc). Also,
for the examples in this section, capture refers to point capture (ε→ 0).
One of the first solutions to a simple pursuit game with successive capture of two evaders
was given by Breakwell et al [1] using geometrical arguments and numerical integration.
Breakwell demonstrates that for many initial conditions the optimal strategy of both the
pursuer and evaders is straight-line motion. The direction of each evader path is determined
numerically, with the second evader heading directly away from the first evader’s capture
point and the pursuer heading linearly to each capture point in succession. Breakwell also
shows that for a set of initial conditions where the evaders become equidistant from the
pursuer, curved motion by all players is optimal. Depending on the time when this occurs,
the pursuer maintains equal distance to the evaders until a critical time when the pursuer
must choose one or the other.
Figure 3.2 shows Breakwell’s numerical solution to the simple pursuit game with two
evaders (reproduced from the original paper [1]) for a variety of initial conditions. The
capture time of the second evader, normalized by the initial distance between the two evaders
and the pursuer speed, is indicated by the contours in the figure. The axes indicate the
pursuer position relative to the center of the evader pair, and the y-axis is fixed along a
line between the two evaders. The initial pursuer locations where curved motion occurs in
inertial space are indicated on the figure as regions 3 and 6; the remaining regions require
straight-line motion only. Some sample trajectories are also shown, overlaid as dashed curves.
The results from Figure 3.2 also reveal two singular surfaces – a focal line drawn from P ∗
to PC and a dispersal line from PC to the origin. All trajectories beginning in regions 3 and
6 are drawn to the focal surface. If the pursuer reaches the focal line before the point PC ,
then the optimal control for the pursuer is to remain on the line until PC , after which the
pursuer reaches the dispersal line. Trajectories that arrive at or begin on the dispersal line
demand an immediate choice by the pursuer to commit to either evader in order to obtain an
optimal capture time. From this it is evident that in the curved motion region, the optimal
capture sequence is not necessarily fixed throughout the engagement.
For scenarios with N > 2 evaders and time-varying capture sequences it can be surmised
that the singular surfaces in the game value map become increasingly complicated. However,
if one fixes the capture sequence, it has been shown by Chikrii [29] that linear motion
("parallel pursuit") is optimal for all parties. Chikrii and Belousov et al [3] have derived
algorithms for the linear-motion regime (see Section 4.1) that provide solutions equivalent
to the $\hat V_s$ of the previous section. Thus, one could use these algorithms to compute
the lower-level optimization for the hierarchical method. Combined with a combinatorial
optimization scheme to solve the upper-level equation (3.26), one could obtain a cost-to-go
suitable for the limited lookahead scheme.

Figure 3.2: Map of optimal capture times for successive pursuit of two evaders, normalized by the
separation distance between the two evaders (reproduced from Breakwell [1] with kind permission
from Springer Science and Business Media). Capture times are represented by solid contours, and
sample optimal trajectories of the pursuer relative to the two-evader system are represented by
dashed lines. Note that initial conditions from regions 3 and 6 yield optimal trajectories that
contain curved motion in inertial space.
Recently Liu et al [32] have examined the successive capture problem from the evader
perspective, creating open-loop controls for the N evaders and iterating these controls over
time to approximate an optimal evader response independent of capture sequence. They solve
the two-evader problem numerically using the HJI formulation and demonstrate through
numerical simulation the existence of the curved motion region described by Breakwell.
While they do not address the optimal pursuer response, they do create a heuristic control
for the pursuer to approximate the full HJI solution.
It is noted in several references [15, 7, 32] that solving the HJI equations numerically
for more than two evaders can become computationally prohibitive and is likely unsuitable
for real-time implementation. It is surmised that the limited lookahead solution method
can be realized in a near-real-time fashion for N ≥ 2, particularly when an efficient tree
search method is used to solve the upper-level combinatorial optimization. This will be
tested in Chapter 5. As a prerequisite, it is necessary to first define how the approximate
solution is obtained for the single-pursuer, many-evader simple pursuit game.
The formulation of the limited lookahead method for simple pursuit is as follows. Assume
one seeks to approximate the upper value of the game. To estimate the game value V for
the current time t using (3.25), a short interval ∆t is selected and the optimal control for
the interval is estimated in the following manner.
For the short time interval it is assumed that the pursuer and evaders travel in straight
lines, with the understanding that in the limit ∆t → 0, curved motion can be approximated.
A pursuer control a is selected from $A = \{\phi : -\pi < \phi \le \pi\}$, and a corresponding evader
control β[a] is selected from $\Delta(t) = \{\prod_{j=1}^{N} B_j \mid a\}$, $B_j = \{\psi_j : -\pi < \psi_j \le \pi\}$. Under these
controls the pursuer and evaders are propagated to $x(t + \Delta t)|_{a,\beta[a]} = x$, $z(t + \Delta t) = z$. The
cost-to-go in equation (3.25) is estimated from the state x, z using the hierarchical approach.
Since the objective is the sum of evader capture times and the pursuer speed is greater than
any evader speed, the necessary conditions for the hierarchical approach are satisfied.
For the hierarchical approach, the lower-level two-player game values $V_{ij}$ are given by a
sequence of two-player games as formulated in Section 3.2, where straight-line motion directly
away from the pursuer location is the optimal control for the first engaged evader. The next
evader in the sequence moves directly away from the capture location of the previous one,
and so on, and the pursuer follows linearly until all evaders are captured. The values for each
subgame are computed for the fixed capture sequence and summed according to (3.29). Next,
the combinatorial minimization over all possible capture sequences is computed as in (3.26)
to obtain the hierarchical estimate of the cost-to-go, $\hat V_h(x, z)$.
Finally, the optimization over the pursuer controls A(t) and evader controls ∆(t) is con-
ducted using $\hat V_h(x, z)$ and $G = \sum_j z_j(\tau)$ in (3.25) to find the approximate optimal controls
for the interval, $a_{t,t+\Delta t} = \phi_{t,t+\Delta t}$ and $\beta_{t,t+\Delta t} = \psi_{t,t+\Delta t}$, and the approximate upper value
$\hat V(x, z)$. As an approximation, the pursuer and evader strategies for the entire engagement
are formed by adjoining the strategies for each interval,

$$a^* = [a_{t,t+\Delta t}, \cdots, a_{t+\Delta t,T}]$$
$$\beta^* = [\beta_{t,t+\Delta t}, \cdots, \beta_{t+\Delta t,T}]$$
as in Li et al [43]. As time progresses and Equation (3.25) is iterated, the estimate of the
game value and hence the approximate optimal controls approach their true values.
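The adjoining of per-interval controls described above can be sketched as a simple outer loop; `solve_interval` and `propagate` are hypothetical stand-ins for the minimax optimization of (3.25) and the straight-line state propagation, respectively.

```python
def limited_lookahead_run(x0, z0, T, dt, solve_interval, propagate):
    """Iterate the one-step lookahead (3.25) over [0, T], adjoining the
    per-interval controls into overall strategies a* and beta*.

    solve_interval(x, z, dt) -> (a, beta, V_hat) stands in for the minimax
    optimization over one interval; propagate(x, z, a, beta, dt) advances
    the joint continuous/discrete state under the chosen controls.
    """
    a_star, beta_star = [], []
    x, z, t = x0, z0, 0.0
    while t < T:
        a, beta, _ = solve_interval(x, z, dt)
        a_star.append(a)
        beta_star.append(beta)
        x, z = propagate(x, z, a, beta, dt)
        t += dt
    return a_star, beta_star
```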
It should be noted that a similar process can be followed for the dynamics of the Homicidal
Chauffeur game using the same objective, assuming capturability of all evaders is ensured.
A condition for capturability within a two-player subgame is given by Isaacs [5, p. 237] as

$$\omega\varepsilon > \sqrt{1 - \nu^2} + \sin^{-1}\nu - 1, \qquad \nu < 1 \tag{3.30}$$

for a circular capture region about the pursuer with radius ε, an evader-to-pursuer speed
ratio ν, and a pursuer turn-rate limit ω. With these conditions met, the hierarchical
approach can be used to approximate the optimal solution assuming the solutions to each
subgame $V_{ij}$ can be computed. The next chapter details the simulation approaches and
implementation of the approximate solution above, including an efficient method for computing
the subgames $V_{ij}$ for games with more complex dynamics and singular surfaces such as the
Homicidal Chauffeur.
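Isaacs' capturability condition (3.30) is cheap to verify before attempting the hierarchical decomposition. A minimal sketch, with `hc_capturable` a hypothetical helper name:

```python
import math

def hc_capturable(omega, eps, nu):
    """Isaacs' sufficient condition (3.30) for capture in the two-player
    Homicidal Chauffeur subgame: circular capture region of radius eps,
    evader-to-pursuer speed ratio nu < 1, pursuer turn-rate limit omega."""
    if not (0.0 <= nu < 1.0):
        return False                  # capture requires a slower evader
    return omega * eps > math.sqrt(1.0 - nu * nu) + math.asin(nu) - 1.0
```

Note that as ν → 0 the right-hand side vanishes, so any positive capture radius suffices against a stationary evader, while faster evaders require a larger ωε product.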
Chapter 4
Simulation approach
In the previous chapter it was shown that a solution to the multi-player differential game
can be approximated using the limited lookahead method. The motivation for this work is
to demonstrate that such a solution can be simulated efficiently for a single-pursuer, multi-
evader pursuit game with N > 2 evaders. The simulation efforts in this work serve both
to efficiently compute lookahead results and to validate the results against known solutions
which, due to the nature of the pursuit games explored in this work, must also be computed
numerically for comparison. The following sections describe the simulation methods used to
compute both the known solutions and the lookahead solutions. As the novel contribution of
this work, the application of Monte Carlo Tree Search and table lookups to compute
the cost-to-go of the lookahead method is detailed below.
4.1 Numerical solutions to the successive pursuit game
In order to demonstrate the utility of the limited lookahead simulation technique, one needs
a reference solution for comparison. For two-player games, the viscosity method has been
shown to provide adequate numerical solutions for many game variations [13, 14, 15], even
when the value function is discontinuous. It was proven by Barron et al [53] and Evans and
Souganidis [54] that the values V + and V − from (3.22) and (3.23) are the viscosity solutions
to the HJI equations (3.3) and (3.4). Level set methods for partial differential equations have
been used to solve these HJI equations for games such as the Homicidal Chauffeur [52]. For
the sake of generating reference results to compare with the approximate lookahead values,
this paper leverages qualitative results from Breakwell for the two-evader successive pursuit
game and the value map generated by level set methods by Patsko et al [14] for the Homicidal
Chauffeur game in place of solving the HJI equations explicitly.
For successive pursuit of many evaders – the Dynamic Traveling Salesman problem –
Belousov et al developed an efficient algorithm to solve for the optimal evader directions when
the player trajectories start in the linear motion region and follow a fixed capture sequence.
Belousov et al were able to transform the problem from an N -dimensional optimization
problem to a nonlinear, root-finding problem for a single variable. The algorithm requires
an initial guess for the first evader’s heading and solves for the roots of a nonlinear, iterative
function to obtain the complete vector of evader heading solutions.
It should be noted that finding an appropriate initial condition for the nonlinear
function is not completely straightforward. The root-finding solution can be sensitive to
the initial guess, and not all initial conditions yield a valid result. In implementing the
algorithm, it was discovered that different initial conditions can yield a variety of maxima,
with the number of peaks on the order of the number of evaders. In order to obtain the
global maximum capture time, multiple initial conditions must be supplied to the
root-finding routine until the best solution is found.
For this paper, the global maximum over the set of initial first-evader headings was found
using a basin-hopping algorithm (see Section 4.4). The basin-hopping algorithm randomly
selects an initial evader heading and executes a local numerical minimization algorithm that
uses the (negated) capture time returned by the Belousov equations. It then accepts or
rejects the capture time using the Metropolis criterion and repeats the process for another
initial heading until the maximum number of iterations is reached. Since the initial heading
selection is stochastic, there is a small probability of not finding the global maximum. To
improve the chances of finding the global value, it was determined empirically that
floor(c log(N)) iterations would be sufficient, where c = 5 yielded an error rate of less than
1% for N > 2 evaders.
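The basin-hopping procedure described above can be sketched with SciPy's `basinhopping`, which implements the random perturbation and Metropolis accept/reject steps. The capture-time function below is a multimodal toy surface standing in for the Belousov root-finding computation, `max_capture_time` is a hypothetical name, and the iteration count follows the floor(c log(N)) rule with c = 5.

```python
import math
from scipy.optimize import basinhopping

def negated_capture_time(psi1):
    """Stand-in for the Belousov capture-time computation seeded by the
    first evader's heading psi1; negated because basinhopping minimizes.
    A multimodal toy surface is used here purely for illustration."""
    return -(2.0 + math.sin(3.0 * psi1[0]) + 0.5 * math.cos(psi1[0]))

def max_capture_time(n_evaders, c=5):
    """Global maximization over the first evader's heading via basin
    hopping, using floor(c * log(N)) hops as chosen empirically."""
    niter = max(1, int(math.floor(c * math.log(n_evaders))))
    res = basinhopping(
        negated_capture_time,
        x0=[0.0],                     # in practice a random initial heading
        niter=niter,
        stepsize=math.pi / 2,         # heading perturbation per hop
        minimizer_kwargs={"method": "BFGS"},
    )
    return -res.fun                   # un-negate: best capture time found
```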
In Chapter 5 the limited lookahead results in linear scenarios will be validated using the
method of Belousov et al. Additionally, two-player games in the curved motion region will
be compared with the Breakwell value map from Figure 3.2 in Section 3.7.
4.2 Simulation using the limited lookahead method
In order to simulate optimal controls using limited lookahead it is necessary to estimate the
expected cost-to-go. It was shown in Section 3.6 that the hierarchical approach provides a
simple means for computing an estimate of the cost-to-go by dividing the estimate into two
optimizations – an “upper-level” combinatorial optimization and a “lower-level” combination
of two-player subgames. For this work, two simulation techniques were used to solve each
level of the hierarchical approach. At the upper level, the Monte Carlo tree search method
was used to quickly find the (approximately) optimal capture sequence and will be described
in the next section.
For simulation at the lower level, it is desirable to efficiently simulate the two-player
subgames. In a differential game with additive cost such as pure pursuit, it is possible
to chain a series of subgame values together to construct an overall game value, which is
capture time in the case of pure pursuit. Because evaders are assumed to have knowledge of
the pursuer strategy for previous subgames, the subgames are not independent and must be
solved sequentially. However, since only the value is required to estimate the cost-to-go in
(3.25) one can reduce computation by using a pre-computed table of two-player game values
to solve the lower-level subgames. In the Homicidal Chauffeur game, for example, a value
map table indexed by the relative pursuer-evader position xij, the pursuer-evader speed ratio
ν, and the pursuer turn rate ω is sufficient to generate V ij for each pursuer-evader pair (see
the contours in Figure 3.1). Using interpolated table lookups to solve complex games has the
potential to reduce on-line game computation significantly, deferring the solution of partial
differential equations to an offline simulation. Section 5.4 will use this method to solve the
multi-player Homicidal Chauffeur game and evaluate the potential of solving the game in
real-time.
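The table-lookup idea can be sketched with an interpolated grid. The value surface below is a placeholder (plain distance) standing in for an offline level-set solution of one (ν, ω) slice, and `subgame_value` is a hypothetical name.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical pre-computed value map for one (nu, omega) slice of the
# Homicidal Chauffeur game, indexed by the evader position (x, y) in the
# pursuer frame.  A real table would come from an offline level-set
# solve; the distance function below is only a placeholder surface.
xs = np.linspace(-5.0, 5.0, 101)
ys = np.linspace(-5.0, 5.0, 101)
X, Y = np.meshgrid(xs, ys, indexing="ij")
value_table = np.hypot(X, Y)          # placeholder for the solved V(x, y)

lookup = RegularGridInterpolator((xs, ys), value_table)

def subgame_value(rel_pos):
    """Interpolated V_ij for one pursuer-evader pair: a cheap online
    table lookup in place of solving the HJI equation at run time."""
    return float(lookup(rel_pos))
```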
Once the hierarchical cost-to-go $\hat V_h$ has been found using (3.26) and (3.29), the simulation
must implement the minimax optimization of (3.25) to form the value estimate $\hat V$. As
discussed in Section 3.7, in order to solve the optimization, each player is assumed to undergo
linear motion over the short interval ∆t, and the optimal strategies that are solved locally
for each interval are combined to form the overall player strategy.
To solve the minimax optimization in (3.25) for each player control it is again neces-
sary to use a global method, as multiple maxima and minima can form within each team’s
optimization space. Additionally, the method must support optimization in multiple dimen-
sions in order to support the maximization for the many evaders. Two global optimization
routines were considered initially: brute-force global optimization and basin hopping. Brute-
force optimization consists of forming a grid of points in the optimization space, evaluating
the objective function at each point, and executing a local optimization at the extreme
point. Brute-force search has the advantage that it is simple to implement, uses existing
minimization routines, and does not require random sampling. Brute-force search does,
however, require sampling a grid of points which for high-dimensional problems may use a
large amount of memory. For the current game, brute-force search will be considered for the
pursuer’s one-dimensional minimization step.
For the evaders' multi-dimensional maximization step, a basin-hopping algorithm similar to
that of the previous section was chosen. Basin hopping has a much lower memory footprint
than brute-force search, since it does not have to form a full grid, and is therefore suitable
for the many-evader problem. However, basin hopping does have a random sampling com-
ponent, which requires that enough iterations are run to keep the objective function seen by
the pursuer's outer minimization as smooth as possible. In this study, the number of itera-
tions was chosen in the same manner as described for the root-finding algorithm of Section 4.1.
For the local optimization, the BFGS algorithm [55] was selected experimentally for the
evader maximization step due to its speed. For the pursuer's minimization step, a modified
Powell's method [56] was used for its robustness in the presence of noisy objective functions.
Solver stability within the minimax optimization can be especially
important, as large errors in the inner optimization loop (maximization for minimax) can
yield unstable results in the outer loop (minimization) that may cause trajectories to diverge.
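The nested optimization can be sketched with SciPy: a 1-D brute-force grid for the pursuer's outer minimization and basin hopping with BFGS local steps for the evaders' inner maximization. The payoff below is a toy stand-in for the interval cost plus cost-to-go of (3.25), and all function names are hypothetical.

```python
import math
import numpy as np
from scipy.optimize import basinhopping, brute

def payoff(phi, psis):
    """Toy stand-in for the lookahead objective in (3.25): the pursuer
    heading phi minimizes while the evader headings psis maximize."""
    return sum(math.cos(phi - p) for p in psis)

def evader_best_response(phi, n_evaders, hops=8):
    """Inner maximization over the evader headings via basin hopping
    with BFGS local steps (maximize = minimize the negation)."""
    res = basinhopping(
        lambda psis: -payoff(phi, psis),
        x0=np.zeros(n_evaders),
        niter=hops,
        stepsize=math.pi / 2,
        minimizer_kwargs={"method": "BFGS"},
    )
    return -res.fun

def pursuer_minimax(n_evaders):
    """Outer 1-D minimization over the pursuer heading via a brute-force
    grid; a local Powell polish could replace finish=None."""
    phi = brute(
        lambda phi: evader_best_response(phi[0], n_evaders),
        ranges=[(-math.pi, math.pi)],
        Ns=16,
        finish=None,
    )
    return float(np.atleast_1d(phi)[0])
```

Because the inner basin hopping is stochastic, the outer objective is itself noisy, which is why the text prefers a noise-tolerant local method such as Powell's for the pursuer side.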
It should be noted that the random point selection of the basin hopping method can
be advantageous when dealing with certain singular game surfaces. For example, in the
case of the Homicidal Chauffeur game when a pursuer encounters the dispersal surface at
the bottom of Figure 3.1, the optimal play for the pursuer is to employ a mixed strategy
when choosing which direction to turn. Similar surfaces can also arise for the evader. The
stochastic nature of basin hopping and, as will be seen in the next section, Monte Carlo Tree
Search, provide a natural solution to decision surfaces that require mixed strategies.
4.3 Combinatorial optimization using tree search
Many games and optimization problems suffer from the so-called “curse of dimensionality”
where computing an optimal solution is either difficult or even impossible due to a high-
dimensional state or action space. For the pursuit-evasion problem, each of the works by Li,
Belousov et al, and Liu et al [7, 3, 32] note the combinatorial challenge of selecting an optimal
capture sequence – the upper level of the hierarchical method from Section 3.6. Indeed, for
the “static evader” case – a Traveling Salesman problem – the combinatorial problem has
been shown to be NP-complete. For a computational treatment of TSP, including both exact
and approximate methods, see [2]. While many of these methods have good performance
and may even be suitable for real-time applications, it was desired to find an efficient search
algorithm that would be simple to implement, flexible enough to accommodate a variety of
objectives and player dynamics, and also suitable for stochastic problems. The Monte Carlo
Tree Search method was selected as a candidate for solving the combinatorial step of the
approximate differential game solution because it meets these criteria.
Monte Carlo Tree Search has received much attention in recent years due to its success in
discrete combinatorial games¹ such as Go that have high branching factors. A good review of
MCTS and its variants can be found in [45]. MCTS works in the following manner. A search
tree is constructed by selecting nodes asymmetrically according to a tree policy that balances
exploration of new nodes with the exploitation of more promising nodes. A simulation is
run from the selected node using a default policy that reports a terminal expected value
or outcome to be used by the tree policy. Within the simulation, the default policy assigns
sequential actions randomly or, if domain knowledge is available, according to some heuristic.
The selected node and its ancestors are then updated with the results of the simulation and
the tree search is resumed using the updated node values until a maximum number of
iterations are reached.
MCTS has several salient features. The exploration and exploitation capability of the tree
policy expands promising nodes while still allowing for the discovery of even better branch
paths. MCTS is an "anytime algorithm," meaning that the process can be terminated at
any time and still return a promising path in the tree; this is especially useful when MCTS
is used for real-time applications. MCTS does not require the storage or manipulation of
intermediate states, meaning that the algorithm does not require domain knowledge and can
thus be applied to a variety of domains. The algorithm also allows for the simulation of
stochastic applications, as the tree and default policies are stochastic in nature. MCTS is
also simple to implement.

¹Combinatorial games have two players and are zero-sum, perfect-information, deterministic, discrete,
and sequential [45].
MCTS is a natural fit for computing the optimal capture sequence within the hierarchical
framework. For the single-pursuer, multiple evader pursuit game, the action states of MCTS
are modeled as the components s1, s2, . . . of the capture sequence s. A tree node represents
a partial capture sequence s = {s1, s2, . . . , sna} where na is the current number of evaders
assigned, or the current tree depth. Each node is updated with the latest expected game
value for that sequence as determined by simulation using the default policy. The tree policy
then balances exploitation of the best partial capture sequences found so far with unexplored
sequences to select the next evader sna+1 ∈ {1, · · · , N} \ s in the sequence.
A sample tree search result for a four-evader successive pursuit game is shown in Figure
4.1. Each node (aside from the root node) represents at least one simulation run using the
default policy, and the number on each node is the running expected reward. Each edge
represents an evader in the capture sequence. For the example shown, the solver sought to
minimize the capture time and found the minimum time result with a capture time of 555.3.
The minimum-time capture sequence in the example is {0, 1, 3, 2} (the node for Evader 2 is
not shown).
The tree policy chosen for the current implementation is the “plain” Upper Confidence
Bound for Trees (UCT) algorithm by Kocsis and Szepesvari [57], where each node in the tree
is treated as a multi-armed bandit problem. A child node s_k ∈ {1, . . . , N} \ s is selected to
Figure 4.1: A sample MCTS minimizing search tree for a four-evader successive pursuit game. Each node represents a simulation run and each edge an evader in a capture sequence. The number on each node is the running expected capture time.
maximize the quantity

    R_k + 2 C_p √( 2 ln n_I / n_Ik )                                    (4.1)

where R_k is the current expected reward of the child node, n_I is the number of times the
parent node has been visited, n_Ik is the number of times the child node has been visited,
and C_p > 0 is a constant. The expected reward term R is computed as the average value
of the simulation results from previous visits and is continually updated with values from
future simulations as child nodes are selected and new simulations are run. In plain UCT,
a larger expected reward for a given node encourages exploitation of the associated branch.
The second term in (4.1), however, encourages exploration and guarantees that all nodes will
be visited as n_I approaches infinity. A value of C_p = 1/√2, as used by Kocsis and Szepesvari, is
assumed for this study.
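As a concrete sketch, the selection rule (4.1) can be written as a small scoring function. This is illustrative only; the bandit bookkeeping (visit counts and running rewards) is assumed to be maintained elsewhere, and `Cp` defaults to the 1/√2 value above.

```python
import math

def uct_score(R_k, n_I, n_Ik, Cp=1 / math.sqrt(2)):
    """Plain UCT value of child k from (4.1): the child's expected reward
    plus an exploration bonus that shrinks as the child is revisited and
    grows as the parent accumulates visits."""
    return R_k + 2 * Cp * math.sqrt(2 * math.log(n_I) / n_Ik)
```

The tree policy then picks the child maximizing this score; for a minimization objective (e.g. capture time), one would negate the reward before scoring.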
The upper confidence bound in plain UCT is guaranteed to be within a constant factor of
the best possible bound on the growth of the regret – the difference between the true value
and estimated value after n_I iterations – which grows as O(log(n_I)). To meet this condition,
however, the reward R must have support on [0, 1], which is not generally the case for game
values that represent, say, the capture time of an evader. To work around this issue, the
current implementation normalizes the expected rewards for all nodes by the largest reward
encountered in the tree search. Experiments conducted with this workaround show that at
least the qualitative balance between exploration and exploitation is preserved.
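A minimal sketch of that normalization, assuming a running maximum is the only statistic kept (the bookkeeping in the actual implementation may differ):

```python
class RewardNormalizer:
    """Rescale raw game values (e.g. capture times) into [0, 1] by the
    largest magnitude encountered so far in the tree search, since the
    UCT regret bound assumes rewards supported on [0, 1]."""

    def __init__(self):
        self.r_max = 0.0

    def __call__(self, raw):
        self.r_max = max(self.r_max, abs(raw))
        return raw / self.r_max if self.r_max > 0 else 0.0
```

Note that values normalized before a new maximum appears become stale; as observed above, this perturbs the exploration–exploitation balance only qualitatively.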
The use of a default policy in MCTS is compatible with the rollout policy required
by limited lookahead. In the current implementation, the default policy for MCTS is the
computation of the expected game value V_s, where the remaining components of the partial
sequence s are selected randomly from a uniform distribution to form the full sequence s.
While uniform sampling of remaining evaders is simple, it can be inefficient for deterministic
problems – a sampled capture sequence s may be repeated unnecessarily. To prevent this
from occurring, an option was added to the current implementation to remember previously
visited nodes. Adding memory to MCTS has been done previously with Nested Monte Carlo
Tree Search [50], where a tree search is conducted at each level and the best branch at each
level is remembered. For most of the tests in this work, however, remembering already-
visited nodes did very little to improve search performance, as checking previously visited
nodes adds overhead. The ability to memorize visits to branches was kept as an option, but
for most of the results it is not used. Examining the benefits and drawbacks of memory in
MCTS for this problem is an area for future investigation.
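The uniform default policy for completing a partial capture sequence can be sketched as follows; the `game_value` callback is a hypothetical stand-in for the cost-to-go simulation described above.

```python
import random

def default_policy(partial, N, game_value, rng=random):
    """Uniform-random default policy: complete a partial capture sequence
    by shuffling the unassigned evaders, then score the full sequence with
    the problem-supplied `game_value` estimate (e.g. total capture time)."""
    remaining = [e for e in range(N) if e not in partial]
    rng.shuffle(remaining)
    full = list(partial) + remaining
    return full, game_value(full)
```

Because the completion is sampled uniformly, repeated calls with the same partial sequence may evaluate the same full sequence more than once, which is exactly the inefficiency the optional node memory discussed above is meant to address.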
4.4 Computational resources
For the algorithms and results in this paper, the following resources were used. All simula-
tions were run on an Intel Core i7-3720QM 2.6 GHz CPU with 8GB RAM. The simulation
routines have been written in CPython [58] and leverage the optimizations provided by the
Anaconda distribution [59], including Numba [60], a package that uses LLVM [61] to compile
Python code to native machine code on the fly for subsequent evaluations. Mathematical computations and
plots use the Numpy, Scipy, and Matplotlib packages [62, 63, 64]. The local optimization
and basin hopping methods [65] were provided by the Scipy optimize package. For local
minimization within basin hopping, the BFGS algorithm [55] was used for the maximization
step and Powell’s method [56] for the minimization step, as mentioned previously.
Some of the value map plots leverage the Python multiprocessing package for parallel
processing. In general, MCTS is suited for parallelization, as each default policy simulation is
evaluated independently. However, care must be taken when combining results from similar
branches; see [45] for a summary of MCTS parallelization methods. In this work, none of
the MCTS tree searches are parallelized.
Chapter 5
Results and analysis
The results in this section demonstrate both the utility of the limited lookahead method
when applied to differential games with singular surfaces and the efficiency of the Monte
Carlo Tree Search method. The section begins with limited lookahead results for the simple
pursuit of two evaders by a single pursuer, comparing simulation results with those of Break-
well [1] and the algorithm from Belousov et al [3]. Next, the MCTS method is benchmarked
against brute-force search to examine its potential in computing optimal engagements with
many evaders. Lookahead engagement results using MCTS for scenarios with more than
two evaders follow thereafter, along with a summary of typical computation times for com-
ponents of the lookahead method. Finally, the extension of the limited lookahead method
with multiple evaders to the more complex dynamics of the Homicidal Chauffeur game is
demonstrated.
5.1 Limited lookahead performance with one pursuer and two evaders
The work on approximate multi-player game solutions by Li from Chapter 4 suggests that ap-
proximations to both the upper and lower values of differential pursuit games can be obtained
iteratively using the limited lookahead method. This section demonstrates the closeness of
the approximation for the single-pursuer, two-evader simple pursuit game. Comparisons of
lookahead results with the geometrical solutions by Chikrii and Belousov (see Section 3.7)
for initial conditions in the linear motion region are given, along with examples of scenarios
that begin in Breakwell’s “curved motion” zones. Finally, the approximate upper values of
the game are compared qualitatively with the results from Breakwell (equivalent to the full
HJI solution) for a variety of initial conditions, revealing the closeness of the approximation
and highlighting several features inherent in the two-evader game.
Figure 5.1 shows the results of an engagement between a single pursuer and two evaders
(ν = 1/2) with an initial condition inside the linear solution set, which is the set of points
where linear motion by all parties is optimal and the optimal capture sequence is fixed for
all time. Empty circles represent the initial positions of the pursuer and evaders, while dots
represent positions at each time step (∆t = 0.1). Dashed lines represent the optimal linear
motion solution that one obtains using the method of Belousov et al, and filled circles rep-
resent the capture points, annotated with respective capture times. Capture times denoted
with an asterisk (∗) are the optimal linear capture times.
The cost-to-go for the limited lookahead algorithm was computed by solving the two-
player subgames sequentially. For the engagement in Figure 5.1, the second evader assumes
that the first evader seeks to maximize its own objective function independently, regardless
of the second evader’s decisions. Thus when computing the first subgame V 1, Evader 1 is
estimated to flee directly away from the pursuer. The path for the second subgame V 2 is
[Figure 5.1 data: optimal capture times T∗ = 19.03 (E0) and T∗ = 70.68 (E1); lookahead capture times T = 19.40 (E0) and T = 71.30 (E1).]
Figure 5.1: Sample engagement using limited lookahead as compared with the optimal result (denoted by ∗ and dashed lines) in the linear motion regime.
formed as Evader 2 flees directly from the estimated capture point of Evader 1. It is clear
from Figure 5.1 that this estimate is sub-optimal, as Evader 2’s initial heading does not
align with the optimal solution from Belousov et al. However, after roughly 25 iterations
(2.5 sec simulation time) the sub-optimal solution approaches the optimal one. The final
capture time is within a few time steps (1%) of the optimal result. The path in Figure 5.1
demonstrates qualitatively the ability of the limited lookahead method to approximate the
optimal solution, even when the subgames of the estimated time-to-go are sub-optimal.
The simulation of the two-player subgames above can be done very quickly, as the motion
is linear and the subgames are connected only by their initial conditions. This cost-to-go
estimate requires, however, that each evader assume a general strategy for the other evaders
in the coalition. Another way to estimate the cost-to-go is to assume that each evader
plays a strategy that is completely independent of the other evaders, or in other words, each
evader plays the game as if it is the only target, oblivious to the existence of other evaders.
This approach requires the simulation of an entire game for each evader, since each evader
executes the simple-pursuit control law throughout the entire engagement, fleeing along a
line directly away from the pursuer. As the pursuer cannot follow a straight line against
every evader at once, this results in nonlinear evader motion that in general must be solved
numerically.
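This independent-evader rollout can be sketched with forward-Euler integration; the speeds, time step, and capture radius below are illustrative choices, not the parameters used in this study.

```python
import math

def successive_pursuit_time(pursuer, evaders, seq, vp=1.0, ve=0.5,
                            dt=0.01, capture_radius=0.05, t_max=1e4):
    """Euler-integrated rollout of successive simple pursuit: the pursuer
    chases evaders in the order given by `seq`, while every surviving
    evader flees along the line directly away from the pursuer.  Returns
    the total time until the last capture (a cost-to-go estimate)."""
    p = list(pursuer)
    alive = {i: list(e) for i, e in enumerate(evaders)}
    t = 0.0
    for target in seq:
        while t < t_max:
            dx, dy = alive[target][0] - p[0], alive[target][1] - p[1]
            dist = math.hypot(dx, dy)
            if dist <= capture_radius:
                break
            # pursuer heads straight at its current target
            p[0] += vp * dt * dx / dist
            p[1] += vp * dt * dy / dist
            # every uncaptured evader flees radially away from the pursuer
            for e in alive.values():
                fx, fy = e[0] - p[0], e[1] - p[1]
                fd = math.hypot(fx, fy) or 1.0
                e[0] += ve * dt * fx / fd
                e[1] += ve * dt * fy / fd
            t += dt
        del alive[target]
    return t
```

With a single evader fleeing collinearly the gap closes at vp − ve, so the numerical capture time can be checked against the analytic value distance/(vp − ve).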
Estimating the cost-to-go in this fashion has the advantage of requiring the fewest as-
sumptions among the evaders in the coalition, but it has an added computational cost and
can result in cost-to-go estimates that are even lower than those of the linear subgames
above. However, experiments using this estimation approach yield nearly identical results
to that of Figure 5.1, suggesting that this cost-to-go estimation method is viable for lim-
ited lookahead. This fact will be exploited in the table lookup approach for the Homicidal
Chauffeur game in Section 5.4.
Figure 5.2 shows the engagement results for an example where evaders are in the “curved
motion” zone as described by Breakwell (see Section 3.7). In this region, the pursuer benefits
from delaying pursuit while it is equidistant from the evaders until a certain critical point,
after which the pursuer must pick one evader or the other. The motion of all players during
this delayed decision period is circular, as was described by Breakwell. Note that curved
motion does not occur in the optimal solution from Belousov et al when the capture sequence
is fixed for all time. The advantage given to the pursuer for delaying a capture decision sug-
gests the importance of considering time-varying capture sequences when computing optimal
strategies for multi-player games.
As stated in Section 3.7, the two-player simple pursuit game has at least two singular
surfaces, a focal line and a dispersal line. In this example, the pursuer entered the focal line
and traveled along that path for roughly one second until reaching the dispersal line, after
which it was optimal to commit to either evader. This behavior is consistent with Figure 3.2,
[Figure 5.2 data: capture times T = 4.84 (E0) and T = 13.25 (E1).]
Figure 5.2: Sample engagement in Breakwell’s “curved motion” zone (capture sequence not fixed) using limited lookahead. The final capture time is 12.5 sec shorter than the fixed-sequence capture time.
at least to the accuracy of the figure, and demonstrates that the limited lookahead method
can yield results even in the presence of singularities in the game value function.
As mentioned in Chapter 3.7, Breakwell et al [1] derived the optimal result for the two-
evader scenario, including the regions with curved motion. Figure 5.3 shows a side-by-side
comparison of the Breakwell solution with the results from limited lookahead simulations for
the same initial conditions. The axes in the figure represent the pursuer location relative to
the center of a two-evader system, normalized by the evader separation distance and pursuer
speed (refer to Figure 3.2 above). A few trajectories from the lookahead simulations are
plotted as dashed lines over the figure to reveal the focal and dispersal lines inherent in the
game. Though not exact, the value contours, trajectory paths, focal line, and dispersal line
are at least qualitatively similar and suggest that limited lookahead is able to approximate
the optimal solution in both the curved and linear motion regions.
[Figure 5.3 data: capture-time contours from 2.0 to 9.0; evader pairs E1, E2 shown at multiple initial conditions.]
Figure 5.3: Side-by-side comparison of the full two-evader solution (adapted from Breakwell [1]) with the limited lookahead results for a variety of initial pursuer locations. Solid contours represent capture times (normalized by the initial evader separation distance and pursuer speed), while dashed lines represent sample trajectories relative to the two-evader system. The focal and dispersal lines appear along the bottom of the figure.
5.2 Tree search performance with many evaders
Now that the efficacy of the lookahead method has been demonstrated for two evaders, it
remains to be shown whether games with many evaders can be solved efficiently.
As noted by previous authors [7, 3, 32], as the number of evaders increases the combinatorial
optimization of the capture sequence becomes the primary factor in limiting computational
efficiency. It is necessary, then, to examine the performance of the proposed solution for
searching the optimal capture sequence – Monte Carlo Tree Search.
To summarize the assumptions of Section 4.3, MCTS will be used to select the optimal
capture sequence – represented by a branch in a search tree – by simulating a complete
game from each node in the branch using a rollout policy – a default choice of the remaining
capture sequence. For this implementation, the default policy will be to select remaining
evaders to capture from a uniform distribution. The rollout policy then gives an estimate of
the total capture time – the game reward or value. This estimate is used in limited lookahead
as the cost-to-go estimate for a given choice for the player controls (see Equation (3.25)).
Nodes are updated with the reward estimate to inform the tree policy for the next branch
selection. In this implementation, the tree policy selects capture sequences using the plain
UCT algorithm.
To illustrate the benefit of MCTS and plain UCT as compared to a brute-force search for
several evaders, the conditions of the two-evader game in the previous section are extended to
N evaders. Figure 5.4 shows the average number of iterations of the rollout policy needed to
achieve optimal and near-optimal results for N = 1, . . . , 10. The average is needed because
MCTS uses randomized branch selection. To remove dependence on initial conditions, the
starting locations of the evaders are also randomized within a 20 x 20 grid surrounding the
pursuer. The number of iterations until the optimal solution is reached is recorded for each
trial; for the curves in Figure 5.4, 100 trials per scenario are used. The MCTS results are
compared with a brute force search, which requires N ! iterations.
For each trial, the number of iterations required to achieve a capture time less than one
percent from the optimal is also recorded; the statistics from these results are also shown in
Figure 5.4. For the simulations tested, the mean number of iterations tends to follow the
trend N log(N). Figure 5.4 also shows a single standard deviation around the mean for both
the optimal and approximate results, indicating that most of the approximate results are
within a constant factor of N log(N), which is a significant improvement over brute-force
search.
As noted in Section 4.3, the tree search logic used here incorporates no memory of
previously run simulations. This allows for the possibility of simulating a single pursuit order
multiple times. While not demonstrated here, preliminary experiments show that adding the
[Figure 5.4 axes: number of evaders N = 2–10 vs. number of iterations (log scale, 10⁻¹ to 10⁷); curves: Optimal, < 1% Error, N!, N log N.]
Figure 5.4: Average number of iterations for MCTS to achieve optimal and sub-optimal (within 1% error) results as compared to brute force (N! iterations). The error bars represent one standard deviation.
ability to remember previously visited nodes reduces the standard deviation of the number
of iterations needed.
It is apparent from Figure 5.4 that using MCTS with plain UCT can reduce the number
of rollout evaluations by an order of magnitude or more for simulations with more than
four evaders. Later in the section it will be shown how this improvement facilitates the fast
execution of the lookahead technique.
5.3 Lookahead performance with many evaders
To fully exercise the limited lookahead and MCTS methods, this section presents results for
scenarios with three or more evaders. As noted earlier in Section 3.7, an HJI solution for
more than two evaders is not available for comparison, but for regions with linear motion
(i.e., where a static capture sequence is optimal) the lookahead method can be compared
to the optimal results from Chikrii and Belousov. Figure 5.5 shows a scenario with three
evaders starting in the linear region, again showing lookahead results as dotted paths and
the optimal linear motion solution as solid lines. The time step used for this simulation was
∆t = 0.05 and the discrepancy from the optimal capture time is about ten time steps (about
1% error).
[Figure 5.5 data: optimal capture times T∗ = 1.69, 9.38, and 45.34; lookahead capture times T = 45.40 (E0), 1.61 (E1), and 9.33 (E2).]
Figure 5.5: Scenario with three evaders starting in the linear motion regime. The optimal solution is represented by dashed lines.
Figure 5.6 shows a three-evader example with the first two evaders starting in the curved-
motion region for a two-evader game. The third evader is placed at a location well within the
linear regime for either of the other evaders. Again the lookahead results capture qualitatively
the curved motion of the first two evaders, followed by linear pursuit of the third. This
resulted in a shorter capture time than if the pursuer had fixed the capture sequence from
the start of the engagement.
[Figure 5.6 data: evaders E0, E1, E2; capture times T = 7.93 (E1) and T = 26.47 (E2).]
Figure 5.6: Three-evader scenario, with two starting in the curved motion regime.
Finally, Figure 5.7 shows a four-evader scenario with the evaders placed arbitrarily around
the pursuer. Evaders 1 and 3 begin equidistant from the pursuer, possibly on a singular
surface. The equidistant condition occurs again after the capture of Evader 0, during the
pursuit of Evaders 1 and 2. After Evader 2 is captured, the players follow a linear strategy
for the remainder of the engagement. Similar results were found for five- and six-evader
engagements.
To realize the utility of the limited lookahead approximation one must also examine the
computation time of the algorithm. Table 5.1 shows the minimum¹ run time for

¹Only the minimum time is reported to avoid the inconsistencies of computer clock timing.
Figure 5.7: Scenario with four evaders.
several components of the lookahead algorithm, averaged over many randomly generated
scenarios. In the “lower level” of the hierarchical limited lookahead algorithm, an estimate of
the capture time (the single-sequence cost-to-go, or V_s) is needed for each possible capture
sequence s so that the minimum can be found using discrete optimization (Equation (3.26)).
The first row in Table 5.1 represents the single-sequence cost-to-go estimate using the linear
evader subgame strategy from Section 5.1, where each evader flees linearly from the estimated
capture location of the previous evader. Because the evader motion is assumed to be linear
the computation can be done quickly.
The next section in Table 5.1 represents the combined time to compute the evader max-
imization strategy in (3.25) and to search for the minimum-time capture sequence given the
evader strategy for a single time step. The evader maximization is computed first for a
single sequence as stated in Section 4.2, using the BFGS algorithm combined with the gradient of
the value function. Adding a gradient computation increased the single-sequence run time
70
Table 5.1: Average minimum run time vs. number of evaders for different lookahead algorithm components

Algorithm component                              N=2    N=3    N=4    N=5    N=6    Units
Single-sequence cost-to-go estimate              56     85     110    130    160    µs
. . . using table lookup (HC)                    55     81     110    140    170    µs
Minimum-sequence max evader strategy (Brute)     6.6    39     270    2200   1.9e4  ms
Minimum-sequence max evader strategy (MCTS)      10     53     150    360    690    ms
Limited lookahead for single time step (Brute)   0.23   1.4    9.8    91     1800   s
Limited lookahead for single time step (MCTS)    0.38   1.9    4.7    14     44     s
slightly but also reduced the number of maximizer function evaluations, resulting in a net
speed improvement.
Using the maximization result for each sequence, the tree search algorithm examines
different pursuit sequences and reports the minimum-time sequence for a given initial pursuer
heading, which the pursuer will use in its own minimization step. The minimum-sequence
rows of the table above compare the search performance (plus evader maximization) for both
brute-force search and plain UCT Monte Carlo Tree Search. Though initially MCTS has some
overhead due to the extra search logic, the benefits of MCTS are apparent for N > 3.
For the simulation runs above, the number of iterations used in the MCTS algorithm was
cN log(N), where c is a constant of approximately 3, chosen empirically to balance execution
speed with accuracy. If c is set too low, the MCTS algorithm finds the optimal time less often
and the capture time profile presented to the pursuer for minimization becomes too noisy.
Note that the chosen iteration number is consistent with the results of Figure 5.4, where the
approximate optimal value is reached on average within a constant factor of N log(N).
As was mentioned in Section 4.2, the cost-to-go function evaluated by the evader can have multiple
extrema, particularly when singular surfaces are encountered. To help the maximizer find
the global maximum, the basin-hopping technique of Section 4.2 was tried. Unfortunately, due to a
memory error associated with the basin-hopping routine, not all scenarios in Table 5.1 could
be computed. Furthermore, brute-force global optimization by sampling an entire grid of
points also became computationally prohibitive. Thus, a new approach was needed.
Instead of brute-force sampling an entire grid of points in the optimization space, MCTS
is used to sparsely sample the grid. To capture a grid as a tree, the space is first divided
evenly across each dimension, and each grid unit is represented by a node in a tree. Next, each
grid unit is itself sampled in an identical manner, and a set of nodes representing the newly
divided grid squares is assigned as children to the parent grid node. Division is continued
until the desired number of samples for each dimension is reached. MCTS then samples the
grid squares as tree nodes in the manner of Section 4.3, preferring grids with higher (lower)
rewards while also maintaining exploration of new maxima (minima) in other grids. This
MCTS grid-sampling method uses Latin Hypercube Sampling during the simulation step to
explore newly selected grid squares, ensuring a more uniform sampling of the space. Using
this sampling method, each global maximization step could be executed to completion.
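Latin Hypercube Sampling itself can be sketched as follows, independently of the tree logic; `bounds` is a list of per-dimension (lo, hi) intervals, and the implementation here is a generic illustration rather than the one used in this study.

```python
import random

def latin_hypercube(n, bounds, rng=random):
    """Draw n points in the box given by `bounds` so that each dimension's
    n equal-width strata each receive exactly one sample, giving a more
    uniform spread over the space than plain random sampling."""
    dims = len(bounds)
    pts = [[0.0] * dims for _ in range(n)]
    for d, (lo, hi) in enumerate(bounds):
        strata = list(range(n))
        rng.shuffle(strata)             # random pairing of strata across dims
        for i, s in enumerate(strata):
            u = (s + rng.random()) / n  # uniform draw inside stratum s
            pts[i][d] = lo + u * (hi - lo)
    return pts
```

Each grid square selected by the tree policy would then be explored with such a sample before its reward is backed up.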
The pursuer minimization routine is the final step of the limited lookahead algorithm.
The time to compute a one-second limited lookahead step can be found in the final rows of
Table 5.1, where the advantages of reducing the search space using sampling are apparent.
Iterations for the MCTS optimization step were roughly 3N log(N), which resulted in good
accuracy for the linear motion scenarios and fair accuracy for the curved motion scenarios. It
is apparent from the table results that for near-real-time applications, say, within a second,
the limited lookahead method as currently implemented would only be suitable for the two-
evader scenario. However, additional modification options exist that could bring run times
to within a second (see Chapter 6).
While more needs to be done to define accuracy and stability measures and to fine-
tune the algorithms for speed, the results of this section demonstrate, at least as a proof
of concept, the viability of limited lookahead with Monte Carlo Tree Search to efficiently
compute automated player controls for successive pursuit games.
5.4 Limited lookahead and the Homicidal Chauffeur game
The Homicidal Chauffeur game provides an interesting and valuable test case for the limited
lookahead method with multiple evaders. First, the game provides a rich set of singular
surfaces and simple nonlinear dynamics that can exercise Li’s theory in the presence of
a discontinuous game value or value gradient. This is especially important in multi-player
games where, as shown in Section 3.7 and the results of this chapter, singular surfaces readily
appear. Second, if successful, the limited lookahead method would provide an automatic and
efficient way to derive the optimal control strategies of each player in a complex game. As was
shown by Merz [8], the Homicidal Chauffeur game can require one to fifteen stages within a
player strategy depending on the initial conditions. The test scenarios in this section exhibit
several of the singular surfaces described in Section 3.3 in order to test the viability of limited
lookahead and MCTS.
In the two-player scenario in Figure 5.8, the evader is positioned to the left of the pursuer,
inside the pursuer’s turning circle but just outside the capture region. This corresponds to a
location in pursuer-centric coordinates just behind the barrier and to the left of the capture
region (see Section 3.3 and Figure 3.1 for a description of the relevant singular surfaces). The
player controls in this scenario consist of four stages. First, the evader heads tangentially
toward the pursuer’s turning circle whilst the pursuer turns at maximum rate away from
the evader. Then, once the game trajectory reaches the (bottom) universal line, the evader
follows the pursuer directly while the pursuer “evades” along the same line. Once the game
trajectory reaches the dispersal point, the pursuer chooses a hard turn to the right and the
evader flees along a tangent to the pursuer’s turning circle. These motions send the game
trajectory around the barrier and finally to the (top) universal line, where the game ends in
simple pursuit. The motion along and around singular surfaces can be seen in Figure 5.8b.
[Figure 5.8 panels: (a) Inertial coordinates, showing capture time T = 10.88 for E0; (b) Pursuer-centric coordinates and value map, with capture-time contours from 2.0 to 10.0.]
Figure 5.8: Limited lookahead results in inertial and pursuer-centric coordinates for a two-player Homicidal Chauffeur game. In this scenario, the optimal play for the evader is to follow the pursuer for a brief period until the pursuer can turn around. In the right figure, the game trajectory in pursuer-centric coordinates reveals several singular surfaces. The game trajectory moves along a universal line, departs from a dispersal line, moves around a barrier, and returns again to a universal line before reaching the target set.
Figure 5.8b also shows the game value (capture time) for the first scenario overlaid with
the game trajectory in pursuer-centric coordinates. For this example, the capture time of
10.88 obtained in the simulation matches closely with the capture time contour at the initial
relative location of the evader. These coordinates also reveal the singular game surfaces
encountered in this engagement. Directly behind the pursuer on the negative x2 axis is a
universal line to which the optimal trajectory is initially drawn. Note that the trajectory
reveals the chattering behavior that is characteristic of this surface. Likewise, the universal
line along the positive x2 axis also exhibits the chattering effect.
The pursuer dispersal line begins where the trajectory leaves the universal line and is
the point where the pursuer must commit to a sharp turn in one direction or the other. In
this case, the pursuer “choice” is made according to a mixed strategy which arises naturally
from the stochastic solver. The trajectory then proceeds around the barrier as the pursuer
turns and the evader flees tangent to the turning circle, drawing the trajectory around the
barrier and onto the universal line until capture. Using only a table of capture time values,
the limited lookahead method was able to generate a game trajectory with all of the major
game features for this scenario.
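A table lookup of this kind can be sketched with bilinear interpolation over a uniformly spaced grid in pursuer-centric coordinates. The grid layout and function names here are illustrative assumptions, not the format used in this study.

```python
def bilinear_lookup(table, x1, x2, x1_axis, x2_axis):
    """Interpolate a value map at (x1, x2).  `table[i][j]` holds the capture
    time at grid point (x1_axis[i], x2_axis[j]); the axes must be uniformly
    spaced and the query point must lie inside the grid."""
    dx1 = x1_axis[1] - x1_axis[0]
    dx2 = x2_axis[1] - x2_axis[0]
    # enclosing cell indices, clamped so (i+1, j+1) stays in range
    i = min(int((x1 - x1_axis[0]) / dx1), len(x1_axis) - 2)
    j = min(int((x2 - x2_axis[0]) / dx2), len(x2_axis) - 2)
    # fractional position within the cell
    t1 = (x1 - x1_axis[i]) / dx1
    t2 = (x2 - x2_axis[j]) / dx2
    return ((1 - t1) * (1 - t2) * table[i][j]
            + t1 * (1 - t2) * table[i + 1][j]
            + (1 - t1) * t2 * table[i][j + 1]
            + t1 * t2 * table[i + 1][j + 1])
```

Note that straight interpolation smooths across discontinuities such as the barrier, so a finer grid (or discontinuity-aware lookup) would be needed near singular surfaces.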
To illustrate performance for a multiple-evader Homicidal Chauffeur game, Figure 5.9
shows an example with three evaders, where two evaders begin in front of the barrier. Here
the optimal play of the first is to flee directly away while the second, anticipating the first
capture, eventually flees tangent to a turning circle approximately from that point. The
third evader follows the pursuer in simple pursuit, trying to remain behind the barrier as
long as possible before fleeing tangent to the pursuer turning circle.
In the current implementation of the cost-to-go estimate using independent table lookup,
the evaders assume no information is exchanged between them and thus act independently,
which is sub-optimal. However, even with a sub-optimal estimate the solution converges to
behavior that considers the motion and capture locations of the other evaders. So, while no
solution to the full HJI problem currently exists for this problem as a reference, the numer-
ical results here suggest that play using sub-optimal, independent subgames can approach
optimal play behavior that considers the moves of other evaders.
Regarding computational performance, the solutions to the Homicidal Chauffeur game scenarios
could be computed very quickly using the table-lookup method. The second row of
Table 5.1 shows that the run time for a single-sequence cost-to-go estimate for N evaders
is comparable to the estimate for the simple pursuit. Limited lookahead run times are thus
comparable to those in the last section of the table. These results suggest that fast solutions
to complex pursuit games like the Homicidal Chauffeur are possible using the techniques of
this paper.
[Figure 5.9 panels: (a) Inertial coordinates, showing capture times T = 1.01 (E0), T = 3.43 (E1), and T = 12.66 (E2); (b) Pursuer-centric coordinates and value map, with capture-time contours from 2.0 to 10.0.]
Figure 5.9: Limited lookahead results for a three-player Homicidal Chauffeur game.
Chapter 6
Conclusion and Future Work
The primary goal of this work is to establish whether the limited lookahead method combined
with Monte Carlo Tree Search is indeed a viable and efficient way to solve multi-player games,
including the simple successive pursuit game with several evaders (the Dynamic Traveling
Salesman problem) and a many-evader Homicidal Chauffeur game. The results of Chapter
5 demonstrate that one can obtain approximate game trajectories and capture times for the
two-evader simple pursuit game that agree well with the optimal solutions in both the linear
(fixed capture sequence) and curved motion regions of the game space. Furthermore, the
two-evader game can be solved within a second of real time.
For the many-evader simple pursuit game, the limited lookahead results are able to approximate
the optimal player paths within the fixed capture sequence region. Though no reference
solution is available for the multi-player game in the curved motion region for comparison,
the trajectories for the constrained scenarios in Section 5.3 at least appear reasonable.
Automated solutions to the multi-player differential pursuit game, like many other multi-
agent problems, suffer from the “curse of dimensionality.” For the single-pursuer, N -evader
pursuit game the number of possible capture combinations grows as N !. This study pro-
poses Monte Carlo Tree Search as a means to reduce the number of iterations needed to
achieve an optimal or near-optimal solution. To achieve capture times within one percent
of optimal, MCTS required O(N log(N)) iterations on average to converge. Optimal results
were also attained much more quickly than brute-force, though not as quickly as N log(N).
Furthermore, the limited lookahead results in Chapter 5 demonstrate that it is possible to
use an approximate MCTS result to generate approximate game trajectories.
Though MCTS significantly reduces the number of iterations needed to find an optimal
capture sequence, the execution time of the current limited lookahead implementation for
the many-evader scenarios still does not meet real-time requirements. This is due primarily
to the global minimization routines required for the minimax portion of limited lookahead.
Attempts were made to reduce the compute time, such as supplying the gradient of the cost
function and using Monte Carlo Tree Search as a grid-sampling mechanism, but these efforts
were not enough to bring run times within the one-second goal. Bringing multi-player
game solutions into the domain of real-time execution will require more efficient numerical
optimization schemes. Should such a technique be found, MCTS can still provide a
significant improvement for the combinatorial step.
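For reference, one minimax limited-lookahead step can be sketched on a coarse control grid. Here the one-step separation stands in for the cost-to-go table, and exhaustive grid evaluation replaces the global optimizer whose cost dominates the run time; the function name, grid size, and dynamics are assumptions for illustration.

```python
import numpy as np

def lookahead_step(p, e, vp, ve, dt=0.1, n_grid=24):
    """One minimax limited-lookahead step on a discretized control grid.

    The pursuer picks the heading minimizing, and the evader the heading
    maximizing, the one-step cost-to-go (here simply the resulting
    separation, standing in for the table-lookup game value).
    """
    p, e = np.asarray(p, float), np.asarray(e, float)
    angles = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
    headings = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    p_next = p + vp * dt * headings        # candidate pursuer moves
    e_next = e + ve * dt * headings        # candidate evader moves
    # separation matrix: rows = pursuer controls, cols = evader controls
    sep = np.linalg.norm(p_next[:, None, :] - e_next[None, :, :], axis=2)
    # minimax: evader maximizes for each pursuer choice, pursuer minimizes
    i = np.argmin(sep.max(axis=1))
    j = np.argmax(sep[i])
    return p_next[i], e_next[j]
```

The grid evaluation is O(n_grid²) per step and per evader pair, which makes plain the trade-off the thesis describes: a finer grid or a continuous optimizer is more accurate but is exactly what pushes run times past real time.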
Finally, it was shown in Section 5.4 that limited lookahead can produce an automated
solution to the Homicidal Chauffeur game. By computing a two-player subgame offline and
storing the game values as a lookup table, the limited lookahead method is able to produce
trajectories for a many-evader engagement, even in the presence of singular game surfaces
such as barriers, universal lines, and dispersal lines.
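The table-lookup step can be sketched as bilinear interpolation of precomputed game values on a pursuer-centric grid. The grid layout and the clamping of out-of-range queries are assumptions for illustration; the table contents here are arbitrary rather than actual Homicidal Chauffeur values.

```python
import numpy as np

def bilinear_lookup(table, x_grid, y_grid, x, y):
    """Bilinearly interpolate a precomputed value table at (x, y).

    In the thesis setting the table would hold two-player game values on
    a pursuer-centric grid, computed offline; queries outside the grid
    are clamped to its edges.
    """
    x = np.clip(x, x_grid[0], x_grid[-1])
    y = np.clip(y, y_grid[0], y_grid[-1])
    # index of the grid cell containing (x, y)
    i = np.clip(np.searchsorted(x_grid, x) - 1, 0, len(x_grid) - 2)
    j = np.clip(np.searchsorted(y_grid, y) - 1, 0, len(y_grid) - 2)
    tx = (x - x_grid[i]) / (x_grid[i + 1] - x_grid[i])
    ty = (y - y_grid[j]) / (y_grid[j + 1] - y_grid[j])
    return ((1 - tx) * (1 - ty) * table[i, j]
            + tx * (1 - ty) * table[i + 1, j]
            + (1 - tx) * ty * table[i, j + 1]
            + tx * ty * table[i + 1, j + 1])
```

One caveat the text hints at: near barriers and other singular surfaces the true value function is discontinuous, so smooth interpolation across a grid cell straddling such a surface is only an approximation.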
The results of this study suggest a number of avenues for further work. Finding a more
efficient continuous optimization scheme for the minimax operation has already been men-
tioned. The numerical stability of the minimax optimization in conjunction with the stochas-
tic sampling of MCTS should be studied in more detail. Of particular interest is the stability
near singular surfaces and in the presence of multiple minima as the number of players be-
comes large. The stability and convergence of the MCTS grid sampling method introduced
here are also of interest.
More could be done to fine-tune the MCTS implementation, including better memory
management, a compiled language, and parallelization techniques. Additional performance
gains could be had in several auxiliary routines, as not all of them were optimized with
the Numba / LLVM framework. As mentioned by previous authors [7], [31],
it is also worth exploring the many combinatorial techniques that are used in the Traveling
Salesman problem.
It would be valuable to extend the work of Li to games with non-zero-sum objectives,
asymmetric player information, or stochastic processes. Li initially addresses stochastic
conditions for the lookahead method in [7]. Since MCTS is well suited to stochastic
simulations, the work here could likely be adapted to such settings.
To date there is no general solution to the multi-player differential pursuit game. Some
of the challenges include defining game termination, solving complex PDEs to obtain the
value function, addressing capturability, and, as shown here, coping with high dimensionality.
Each of these challenge areas has open questions that warrant future study.
Bibliography
[1] J. Breakwell and P. Hagedorn, "Point capture of two evaders in succession," Journal of Optimization Theory and Applications 27(1) (1979).
[2] D. Applegate, R.E. Bixby, V. Chvatal, and W.J. Cook, The Traveling Salesman Problem: A Computational Study, Princeton University Press (2006).
[3] A. Belousov, Y.I. Berdyshev, A. Chentsov, and A. Chikrii, "Solving the dynamic traveling salesman problem," Cybernetics and Systems Analysis 46(5) (2010).
[4] T. Basar and G.J. Olsder, Dynamic Noncooperative Game Theory, 2nd ed., SIAM (1999).
[5] R. Isaacs, Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization, Dover (1965).
[6] M. Falcone, "Numerical methods for differential games based on partial differential equations," International Game Theory Review 8(2), 231–272 (2006).
[7] D. Li, Multi-player Pursuit-Evasion Differential Games, Dissertation, The Ohio State University (2006).
[8] A.W. Merz, The Homicidal Chauffeur–A Differential Game, Ph.D. thesis, Stanford University (1971).
[9] J. Lewin and G. Olsder, "Conic surveillance evasion," Journal of Optimization Theory and Applications (1979).
[10] M.G. Crandall and P.L. Lions, "Viscosity solutions of Hamilton-Jacobi equations," Transactions of the American Mathematical Society 277(1), 1–42 (1983).
[11] A. Subbotin, "Generalization of the main equation of differential game theory," Journal of Optimization Theory and Applications 43, 103–133 (1984).
[12] N. Krasovskii and A. Subbotin, Game Theoretical Control Problems, Springer (1984).
[13] M. Bardi, M. Falcone, and P. Soravia, "Numerical methods for pursuit-evasion games via viscosity solutions," in M. Bardi, T.E.S. Raghavan, and T. Parthasarathy (eds.), Stochastic and Differential Games: Theory and Numerical Methods, Annals of the International Society of Dynamic Games, vol. 4, pp. 289–303 (2000).
[14] V. Patsko, "Level sets of the value function in differential games with the homicidal chauffeur dynamics," International Game Theory Review 3(1), 67–112 (2001).
[15] I.M. Mitchell, A.M. Bayen, and C.J. Tomlin, "A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games," IEEE Transactions on Automatic Control (2005).
[16] S. Shankaran, D.M. Stipanovic, and C.J. Tomlin, "Collision avoidance strategies for a three-player game," in Advances in Dynamic Games, Annals of the International Society of Dynamic Games 11, Springer Science+Business Media (2011).
[17] N.D. Botkin, K.H. Hoffmann, and V.L. Turova, "Stable numerical schemes for solving Hamilton-Jacobi-Bellman-Isaacs equations," SIAM Journal on Scientific Computing 33(2), 992–1007 (2011).
[18] K. Zemskov and A. Pashkov, "Construction of optimal position strategies in a differential pursuit-evasion game with one pursuer and two evaders," Journal of Applied Mathematics and Mechanics 61(3), 391–399 (1997).
[19] I. Shevchenko, "Successive pursuit with a bounded detection domain," Journal of Optimization Theory and Applications 95(1), 25–48 (1997).
[20] S. Bhattacharya and T. Basar, "Differential game-theoretic approach to a spatial jamming problem," in P. Cardaliaguet and R. Cressman (eds.), Advances in Dynamic Games, Annals of the International Society of Dynamic Games 12, Springer Science+Business Media (2012).
[21] Z.E. Fuchs, P.P. Khargonekar, and J. Evers, "Cooperative defense within a single-pursuer, two-evader pursuit evasion differential game," in 49th IEEE Conference on Decision and Control (2010).
[22] Z.E. Fuchs and P.P. Khargonekar, "Encouraging attacker retreat through defender cooperation," in 2011 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC) (2011).
[23] D.W. Yeung and L.A. Petrosyan, Cooperative Stochastic Differential Games, Springer Science+Business Media (2006).
[24] L. Petrosjan and V. Shirjaev, Hierarchical Games, Saransk (1986).
[25] S.I. Tarashnina, "Nash equilibria in differential pursuit game with one pursuer and m evaders," in L.A. Petrosjan and V.V. Mazalov (eds.), Game Theory and Applications III, Nova Science Publishers, Inc. (1997).
[26] I. Shevchenko, "Minimizing the distance to one evader while chasing another," Computers and Mathematics with Applications 47 (2004).
[27] I. Shevchenko, "Approaching coalitions of evaders on the average," in Advances in Dynamic Game Theory, Birkhauser Boston (2007).
[28] I. Shevchenko, "Strategies for alternative pursuit games," in P. Bernhard et al. (eds.), Advances in Dynamic Games and Their Applications, Annals of the International Society of Dynamic Games 10, Birkhauser Boston (2009).
[29] A. Chikrii and S. Kalashnikova, "Pursuit of a group of evaders by a single controlled object," Kibernetika 4, 1–8 (1987).
[30] Y.I. Berdyshev, "On a nonlinear problem of a sequential control with a parameter," Journal of Computer and Systems Sciences International 47(3), 380–385 (2008).
[31] Y.I. Berdyshev, "Choosing the sequence of approach of a nonlinear object to a group of moving points," Journal of Computer and Systems Sciences International 50(1), 30–37 (2011).
[32] S.Y. Liu, Z. Zhou, C. Tomlin, and K. Hedrick, "Evasion as a team against a faster pursuer," in 2013 American Control Conference (ACC) (2013).
[33] D.M. Stipanovic, A. Melikyan, and N. Hovakimyan, "Guaranteed strategies for nonlinear multi-player pursuit-evasion games," International Game Theory Review 12(1) (2010).
[34] D.M. Stipanovic, A. Melikyan, and N. Hovakimyan, "Some sufficient conditions for multi-player pursuit-evasion games with continuous and discrete observations," in Advances in Dynamic Games and Their Applications, Annals of the International Society of Dynamic Games 10, Birkhauser Boston (2009).
[35] T. Abramyants, M. Ivanov, E. Maslov, and V. Yakhno, "A detection evasion problem," Automation and Remote Control 65(10), 1523–1530 (2004).
[36] J.S. Jang and C.J. Tomlin, "Control strategies in multi-player pursuit and evasion game," in AIAA Guidance, Navigation, and Control Conference and Exhibit (2005).
[37] A. Bolonkin and R. Murphey, "Geometry-based parametric modeling for single-pursuer/multiple-evader problems," Journal of Guidance, Control, and Dynamics 28(1) (2005).
[38] J. Ge, L. Tang, J. Reimann, and G. Vachtsevanos, "Hierarchical decomposition approach for pursuit-evasion differential game with multiple players," in 2006 IEEE Aerospace Conference (2006).
[39] X. Wang, J.B. Cruz, Jr., G. Chen, K. Pham, and E. Blasch, "Formation control in multi-player pursuit evasion game with superior evaders," in Defense and Security Symposium, International Society for Optics and Photonics (2007).
[40] M. Wei, G. Chen, J.B. Cruz, Jr., L.S. Haynes, K. Pham, and E. Blasch, "Multi-pursuer multi-evader pursuit-evasion games with jamming confrontation," Journal of Aerospace Computing, Information, and Communication 4 (2007).
[41] D. Li and J.B. Cruz, Jr., "A hierarchical approach to multi-player pursuit-evasion differential games," in Proceedings of the 44th IEEE Conference on Decision and Control (2005).
[42] D. Li and J. Cruz, "Better cooperative control with limited look-ahead," in American Control Conference, IEEE (2006).
[43] D. Li and J.B. Cruz, Jr., "Improvement with look-ahead on cooperative pursuit games," in Proceedings of the 45th IEEE Conference on Decision and Control (2006).
[44] D.P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific (2000).
[45] C.B. Browne, E. Powley, D. Whitehouse, S.M. Lucas, P.I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games 4(1) (2012).
[46] M.P. Schadd, M.H. Winands, H.J. van den Herik, and H. Aldewereld, "Addressing NP-complete puzzles with Monte-Carlo methods," in Proceedings of the AISB 2008 Symposium on Logic and the Simulation of Interaction and Reasoning (2008).
[47] D. Perez, P. Rohlfshagen, and S.M. Lucas, "Monte Carlo tree search for the physical travelling salesman problem," Applications of Evolutionary Computation (2012).
[48] D. Perez, S. Samothrakis, P. Rohlfshagen, and S.M. Lucas, "Rolling horizon evolution versus tree search for navigation in single-player real-time games," in Proceedings of the Fifteenth Annual Conference on Genetic and Evolutionary Computation (2013).
[49] E.J. Powley, D. Whitehouse, and P.I. Cowling, "Monte Carlo tree search with macro-actions and heuristic route planning for the physical travelling salesman problem," in 2012 IEEE Conference on Computational Intelligence and Games (CIG) (2012).
[50] A. Rimmel, F. Teytaud, and T. Cazenave, "Optimization of the nested Monte-Carlo algorithm on the traveling salesman problem with time windows," Applications of Evolutionary Computation (2011).
[51] J. Lewin, Differential Games: Theory and Methods for Solving Game Problems with Singular Surfaces, Springer-Verlag (1994).
[52] V.S. Patsko and V.L. Turova, "Homicidal chauffeur game: History and modern studies," in Advances in Dynamic Games, Annals of the International Society of Dynamic Games 11, Springer Science+Business Media (2011).
[53] E. Barron, L. Evans, and R. Jensen, "Viscosity solutions of Isaacs' equations and differential games with Lipschitz controls," Journal of Differential Equations 53, 213–233 (1984).
[54] L. Evans and P. Souganidis, "Differential games and representation formulas for solutions of Hamilton-Jacobi-Isaacs equations," Indiana University Mathematics Journal 33, 773–797 (1984).
[55] J. Nocedal and S. Wright, Numerical Optimization, Springer New York (2006).
[56] M. Powell, "An efficient method for finding the minimum of a function of several variables without calculating derivatives," Computer Journal 7(2), 155–162 (1964).
[57] L. Kocsis and C. Szepesvari, "Bandit based Monte-Carlo planning," in Proceedings of the European Conference on Machine Learning (2006).
[58] "CPython, version 2.7.5," http://www.python.org, [Online; accessed 12-March-2014].
[59] "Anaconda Python - Continuum Analytics," http://www.continuum.io, [Online; accessed 12-March-2014].
[60] "Numba - Continuum Analytics," http://numba.pydata.org, [Online; accessed 12-March-2014].
[61] "LLVM," http://www.llvm.org, [Online; accessed 12-March-2014].
[62] "NumPy," http://www.numpy.org, [Online; accessed 12-March-2014].
[63] "SciPy," http://www.scipy.org, [Online; accessed 12-March-2014].
[64] "Matplotlib," http://matplotlib.org, [Online; accessed 12-March-2014].
[65] D. Wales and J. Doye, "Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms," Journal of Physical Chemistry A 101, 5111 (1997).