Uncertain Multiagent Systems: Games and Learning
H. Jin Kim, Songhwai Oh and Shankar Sastry
University of California, Berkeley
July 17, 2002
Decision-Making under Uncertainty ONR MURI
Outline
• Hierarchical architecture for multiagent operations
• Partial observation Markov games (POMGame)
• Berkeley pursuit-evasion game (PEG) setup
• From PEG to unmanned dynamic battlefield
– Model predictive techniques for dynamic replanning
– Multi-target tracking (detect, ID, track)
– Dynamic model selection for estimating adversarial intent
Partial-observation Probabilistic Pursuit-Evasion Game (PEG) with 4 UGVs & 1 UAV
A prototype system of fully autonomous mobile teams of intelligent and networked sensing agents deployed to discover and track mobile targets in unmapped environments
Uncertainty pervades every layer!
Hierarchy in Berkeley Platform
[Block diagram: a ground-based Strategy Planner and Map Builder exchange agent positions, detected obstacles and targets, and desired agent actions over the Communications Network. On each vehicle, a Tactical Planner & Regulation layer (tactical planner, trajectory planner, regulation) issues control signals (linear acceleration, angular velocity) to the UAV/UGV dynamics, which interact with targets, terrain, and exogenous disturbances. Vehicle-level sensor fusion combines INS, GPS, ultrasonic altimeter, vision, and actuator encoders to supply actuator and inertial positions, height over terrain, and detected obstacles and targets; positions of targets, obstacles, and agents flow up to the planners.]
Pursuit-Evasion Game Experiment Setup
[Diagram: a Ground Command Post sends waypoint commands to the pursuer UAV and the pursuer UGVs and receives position and vehicle status from each; the evader UGV's location is detected by the vision system.]
Information Flow in UC Berkeley PEG Platform
[Diagram: all components communicate over a wireless network. The ground-based Strategy Planner hosts the Map Builder, Policy Calculator, and Probability Map; it receives processed vision input and current positions from the agents, serves agent-position requests, issues waypoint requests reflecting the current coordination of agents, and forwards current-position display info to the ground station display. The pursuer UAV carries a flight computer and a vision computer; each pursuer and evader UGV carries a vision computer (with an onboard map builder) and a motion controller. Vision data, current positions, and waypoint requests flow between the vehicles and the strategy planner.]
Lessons Learned and UAV/UGV Objective
• Scalable/replicable system that delivers missions reliably under uncertainty, and evaluation of its performance
• Hierarchical architecture design and analysis
– High-level decision making in a discrete space
– Physical-layer control in a continuous space
• Hierarchical decomposition requires tight interaction between layers to achieve cooperative behavior, to deconflict and to support constraints.
• Confronting uncertainty arising from partially observable, dynamically changing environments and intelligent adversaries
Representing and Managing Uncertainty
• Uncertainty is introduced through various channels
– Sensing: unable to determine the current state of the world
– Prediction: unable to infer the future state of the world
– Actuation: unable to take the desired action to properly affect the state of the world
• Different types of uncertainty can be addressed by different approaches
– Nondeterministic uncertainty: Robust Control
– Probabilistic uncertainty: (Partially Observable) Markov Decision Processes
– Adversarial uncertainty: Game Theory
• POMGames combine all three: probabilistic dynamics, partial observations, and adversaries
Partial Observation Markov Games (POMGame)
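A standard formalization, with the notation below assumed rather than taken from the slides, is the tuple
$$\mathcal{G} \;=\; \big\langle\, \mathcal{S},\; \{\mathcal{A}^i\}_{i=1}^{n},\; \{\mathcal{O}^i\}_{i=1}^{n},\; P,\; \{H^i\}_{i=1}^{n},\; \{r^i\}_{i=1}^{n} \,\big\rangle,$$
where $\mathcal{S}$ is the state space, $\mathcal{A}^i$ and $\mathcal{O}^i$ are agent $i$'s action and observation spaces, $P(s_{t+1} \mid s_t, a_t^1, \ldots, a_t^n)$ is the state transition kernel, $H^i(o_t^i \mid s_t)$ is agent $i$'s observation kernel, and $r^i$ is agent $i$'s reward. Because the state is never observed directly, each agent must act on its own action-observation history.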
Policy for POMGames
• Optimal value function of a state: the expected sum of rewards that an agent will gain by executing the optimal policy starting from that state,
$$V^*(s) \;=\; \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r_t \;\Big|\; s_0 = s,\; \pi^* \Big]$$
• Poorly understood: analysis exists only for very specially structured games, such as games with complete information on one side
• Special case: partially observable Markov decision processes (POMDPs)
Berkeley Pursuit-Evasion Game (PEG) Setup
Abstraction of Pursuit-Evasion Game
• A partial-observation stochastic pursuit-evasion game in a 2-D grid world, between (heterogeneous) teams of $n_e$ evaders and $n_p$ pursuers.
• At each time $t$, each evader and pursuer, located at $x_e(t)$ and $x_p(t)$ respectively,
– takes an observation over its visibility region
– updates its belief state
– chooses an action from its admissible action set
• Goal: capture of the evader (pursuers) or survival (evaders)
• Performance measure: capture time
• Optimal policy minimizes the expected capture time
$$\pi^* :\; \min_{\pi}\; \mathbb{E}\big[T_{\mathrm{fnd}}\big] \;=\; \min_{\pi}\; \sum_{T \ge 0} \Pr\big(\{y(1), \ldots, y(T)\} \in \bar{\mathcal{Y}}_{\mathrm{fnd}}^{T}\big),$$
where $\bar{\mathcal{Y}}_{\mathrm{fnd}}^{T}$ is the set of all observation sequences $\{y(1), \ldots, y(T)\}$ associated with an evader not being found up to time $T$.
Optimal Pursuit Policy
Optimal Pursuit Policy – Dynamic Programming Formulation
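As a hedged sketch of such a formulation (symbols assumed: $b$ is the pursuers' belief state, $\tau$ the Bayes update after observing $y$, and $c$ the per-stage cost of one time step until capture):
$$V^*(b) \;=\; \min_{a_p}\; \max_{a_e}\; \Big[\, c(b, a_p, a_e) \;+\; \sum_{y} \Pr\big(y \mid b, a_p, a_e\big)\; V^*\big(\tau(b, a_p, y)\big) \Big].$$
The recursion ranges over beliefs $b$, i.e., distributions over the grid, which is what makes exact dynamic programming intractable and motivates the persistent and approximate policies below.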
Persistent Pursuit Policies
• Solving for the optimal policy of a partial observation Markov game of non-trivial size using dynamic programming is computationally intractable.
• If the pursuit policy is persistent with a period T, then the expected capture time is bounded.
Example of Persistent Pursuit Policies
Greedy Policy
– Pursuer moves to the neighboring cell with the highest probability of containing an evader at the next instant
– The strategy planner assigns more importance to local, immediate considerations
Global-Maximum Policy
– Pursuer moves toward the global location with the highest probability of containing an evader at the next instant, weighted by some distance metric
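A minimal sketch of both policies on a probability (evader) map; the grid API, function names, and parameters below are illustrative, not from the Berkeley implementation:

```python
import numpy as np

def greedy_move(pos, evader_map):
    """Greedy policy: move to the neighboring cell with the highest
    probability of containing an evader at the next instant."""
    rows, cols = evader_map.shape
    r, c = pos
    neighbors = [(r + dr, c + dc)
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)
                 and 0 <= r + dr < rows and 0 <= c + dc < cols]
    return max(neighbors, key=lambda cell: evader_map[cell])

def global_max_move(pos, evader_map, weight=0.1):
    """Global-max policy: head toward the cell with the highest
    probability, discounted by a distance metric."""
    rows, cols = evader_map.shape
    def score(cell):
        dist = abs(cell[0] - pos[0]) + abs(cell[1] - pos[1])  # Manhattan metric
        return evader_map[cell] - weight * dist
    goal = max(((r, c) for r in range(rows) for c in range(cols)), key=score)
    # Take a single grid step in the direction of the goal cell.
    return (pos[0] + int(np.sign(goal[0] - pos[0])),
            pos[1] + int(np.sign(goal[1] - pos[1])))
```

The `weight` parameter trades probability mass against travel distance; the distance metric actually used in the experiments is not specified here.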
Experimental Results: Pursuit Evasion Games with Four UGVs and a UAV
Game-theoretic Policy Search Paradigm
• A large number of variables affect the solution
• Many interesting games, including pursuit-evasion, are large games with partial information, and finding optimal solutions is well outside the capability of current algorithms
• An approximate solution is not necessarily bad: there may be simple policies with satisfactory performance
Choose a good policy from a restricted class of policies!
• We can find approximately optimal solutions from restricted classes, using sparse sampling and a provably convergent policy search algorithm
Constructing a Policy Class
• Given a mission with specific goals, we
– decompose the problem in terms of the functions that need to be achieved for success and the means that are available
– analyze how a human team would solve the problem
– determine a list of important factors that complicate task performance, such as safety or physical constraints:
• Maximize aerial coverage,
• Stay within communications range,
• Penalize actions that lead an agent into a danger zone,
• Maximize the explored region,
• Minimize fuel usage, …
Policy Representation
• Quantize the above features and define a feature vector $f(a, h)$ consisting of the estimates of the above quantities for each action $a$ given the agents' history $h$
• Estimate the 'goodness' of each action by a function $Q_w(a, h) = w^{\top} f(a, h)$, where $w$ is the weighting vector to be learned.
• Choose the action that maximizes $Q_w(a, h)$, or choose a randomized action according to the distribution $\pi_w(a \mid h) \propto \exp\big(Q_w(a, h)\big)$.
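A minimal sketch of this representation; the action names, feature values, and weights below are illustrative, not from the actual system:

```python
import numpy as np

def best_action(w, features):
    """Deterministic policy: pick the action maximizing Q_w(a) = w . f(a)."""
    return max(features, key=lambda a: float(np.dot(w, features[a])))

def sample_action(w, features, temperature=1.0, rng=None):
    """Randomized policy: sample an action from the Boltzmann (softmax)
    distribution induced by the linear scores Q_w(a)."""
    rng = rng or np.random.default_rng()
    actions = list(features)
    q = np.array([np.dot(w, features[a]) for a in actions]) / temperature
    p = np.exp(q - q.max())          # subtract max for numerical stability
    p /= p.sum()
    return actions[rng.choice(len(actions), p=p)]

# Illustrative usage: three candidate actions, two features each
# (e.g., coverage gain and evader-map value).
features = {"N": np.array([0.2, 0.7]),
            "E": np.array([0.5, 0.4]),
            "stay": np.array([0.0, 0.1])}
w = np.array([1.0, 2.0])             # weighting vector to be learned
print(best_action(w, features))      # -> "N" (score 1.6)
```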
Example: Policy Feature
• Maximize collective aerial coverage -> maximize the distance between agents:
$$f_1(a) \;=\; \min_{j \neq i} \big\| x_i(a) - x_j \big\|,$$
where $x_i(a)$ is the location at which pursuer $i$ will land by taking action $a$ from its current position
• Try to visit an unexplored region with a high possibility of detecting an evader: a feature measuring progress toward the position reached by the action that maximizes the evader-map value along the frontier
• Prioritize actions that are more compatible with the dynamics of the agents
• Policy representation: stack these quantities into the feature vector $f(a, h)$ scored by $Q_w(a, h)$ as above
Example: Policy Feature (Continued)
Benchmarking Experiments
• Performance of two pursuit policies compared in terms of capture time
• Experiment 1: two pursuers against an evader who moves greedily with respect to the pursuers' locations
• Experiment 2: when the position of the evader at each step is detected by the sensor network with only 10% accuracy, two optimized pursuers took 24.1 steps on average, while the one-step greedy pursuers took over 146 steps, to capture the evader in a 30-by-30 grid.
Grid size   1-Greedy pursuers   Optimized pursuers
10 by 10    (7.3, 4.8)*         (5.1, 2.7)
20 by 20    (42.3, 19.2)        (12.3, 4.3)
* (mean, standard deviation) of capture time in steps
Why General-sum Games?
"All too often in OR dealing with military problems, war is viewed as a zero-sum two-person game with perfect information. Here I must state as forcibly as I know that war is not a zero-sum two-person game with perfect information. Anybody who sincerely believes it is a fool. Anybody who reaches conclusions based on such an assumption and then tries to peddle these conclusions without revealing the quicksand they are constructed on is a charlatan....There is, in short, an urgent need to develop positive-sum game theory and to urge the acceptance of its precepts upon our leaders throughout the world."
Joseph H. Engel, Retiring Presidential Address to the Operations Research Society of America, October 1969
General-sum Games
• Depending on the cooperation between the players:
– Noncooperative
– Cooperative
• Depending on the least expected payoff that a player is willing to accept:
– Nash's special/general bargaining solution
• By restricting the blue and red policy classes to finite size, we reduce the POMGame to a bimatrix game.
From PEG to Combat Scenarios
• Adversarial attack
– Reds do not just evade, but also attack -> blues cannot blindly pursue reds.
• Unknown number/capability of adversaries
-> Dynamic selection of the relevant red model from unstructured observations
• Deconfliction between layers and teams
• Increased number of features
-> Diversify possible solutions when the uncertainty is high
From POMGame To Bimatrix Game
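A minimal sketch of the endpoint of this reduction: with finite blue and red policy classes, the expected payoffs form two matrices, and equilibria can be searched directly. The payoff entries below are made up for illustration:

```python
import numpy as np

# Payoff matrices over finite blue (rows) and red (columns) policy classes.
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])   # blue's expected payoff
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # red's expected payoff

def pure_nash_equilibria(A, B):
    """Enumerate pure-strategy Nash equilibria of the bimatrix game:
    (i, j) such that row i is a best response to column j and vice versa."""
    return [(i, j)
            for i in range(A.shape[0]) for j in range(A.shape[1])
            if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()]

print(pure_nash_equilibria(A, B))   # -> [(0, 0), (1, 1)]
```

For the zero-sum PEG special case, B = -A and the equilibrium reduces to a minimax (security) strategy pair.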
Dynamic Bayesian Model Selection
• Dynamic Bayesian model selection (DBMS) is a generalized model selection approach for time-series data in which the number of components can vary with time
• If K is the number of components at any instant and T is the length of the time series, then there are $O(2^{KT})$ possible models, which demands an efficient algorithm
• The problem is formulated using Bayesian hierarchical modeling and solved using suitably adapted reversible jump MCMC methods.
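As a hedged sketch of the Bayesian formulation (symbols assumed, not from the slides), the posterior over a candidate model $\mathcal{M}$ given data $y_{1:T}$ is
$$p\big(\mathcal{M} \mid y_{1:T}\big) \;\propto\; p\big(y_{1:T} \mid \mathcal{M}\big)\, p\big(\mathcal{M}\big),$$
and reversible jump MCMC samples from this posterior by proposing moves between model spaces of different dimension (e.g., adding, deleting, splitting, or merging components), so the $O(2^{KT})$ candidates never need to be enumerated.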
DBMS: Graphical Representation
[Graphical model. Legend: Dirichlet priors on the transition matrix A for m_t and on the component weights w_t; z_t – allocation variable; F – transition dynamics.]
DBMS: Multi-target Tracking Example
[Two panels: estimated target positions (+), true target trajectories, and observations.]
Summary
• Decomposition of complex multiagent operation problems requires tighter interaction between subsystems and human intervention
• Partial observation Markov games provide a mathematical representation of a hierarchical multiagent system operating under adversarial and environmental uncertainty
• The policy class framework provides a setup for including human experience
• Policy search methods and sparse sampling produce computationally tractable algorithms that generate approximate solutions to partially observable Markov games.
• Model predictive (receding horizon) techniques can be used for dynamic replanning to deconflict/coordinate between vehicles, layers, or subtasks
Acting under Partial Observations
• We need to use the memory of previous actions and observations to disambiguate the current state.
• The state estimate, or belief state:
– Posterior probability distribution over states
– The likelihood that the world is actually in state x at time t, given the agent's past experience (i.e., action and observation histories)
Updating Belief State
– The belief state can be updated recursively using the estimated world model and Bayes' rule:
$$b_t(x) \;\propto\; \underbrace{\Pr\big(y_t \mid x\big)}_{\text{new info on the state of the world}} \;\; \underbrace{\sum_{x'} \Pr\big(x \mid x', a_{t-1}\big)\, b_{t-1}(x')}_{\text{new info on prediction}}$$
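A minimal sketch of this recursion for a grid-world evader map; the motion and sensor models below are placeholders, not the experimental ones:

```python
import numpy as np

def update_belief(belief, predict, likelihood):
    """One step of the recursive Bayes update:
    b_t(x) ∝ P(y_t | x) * Σ_x' P(x | x') b_{t-1}(x')."""
    predicted = predict(belief)          # new info on prediction
    posterior = likelihood * predicted   # new info on the state of the world
    return posterior / posterior.sum()   # normalize (absorbs boundary loss)

def diffuse(b, stay=0.6):
    """Placeholder evader motion model: with probability `stay` the evader
    stays put, otherwise it moves to one of its 4 neighbors."""
    moved = np.zeros_like(b)
    moved[1:, :] += b[:-1, :]
    moved[:-1, :] += b[1:, :]
    moved[:, 1:] += b[:, :-1]
    moved[:, :-1] += b[:, 1:]
    return stay * b + (1 - stay) * moved / 4.0

# Illustrative usage on a 10-by-10 grid with a "nothing seen" observation
# that rules out the pursuer's own visibility region.
belief = np.full((10, 10), 1.0 / 100)
likelihood = np.ones((10, 10))
likelihood[0:3, 0:3] = 0.0               # evader not in the observed cells
belief = update_belief(belief, diffuse, likelihood)
```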
Pursuit-Evasion Game Experiment
PEG with four UGVs
• Global-Max pursuit policy
• Simulated camera view (radius 7.5 m with 50-degree conic view)
• Pursuer: 0.3 m/s, Evader: 0.5 m/s max
Experimental Results: Evaluation of Policies for Different Visibility
• The global-max policy performs better than greedy, since the greedy policy selects movements based only on local considerations.
• Both policies perform better with the trapezoidal view, since the camera rotates fast enough to compensate for the narrow field of view.
[Figure: capture times of the greedy and global-max policies for different pursuer visibility regions; three pursuers with trapezoidal or omni-directional view; randomly moving evader.]
Experimental Results: Evader’s Speed vs. Intelligence
• Having a more intelligent evader increases the capture time
• Harder to capture an intelligent evader at a higher speed
• When the evader's speed is only slightly higher than the pursuers', a fast random evader is captured sooner than a slower random evader.
[Figure: capture time for different speeds and levels of intelligence of the evader; three pursuers with a trapezoidal view and the global-maximum policy; max pursuer speed 0.3 m/s.]
Coordination under Multiple Sources of Commands
• When different agents or layers specify multiple, possibly conflicting goals or actions, how can the system prioritize or resolve them?
– A priori assignment of degrees of authority
– Surge in coordination demand when the situation deviates from textbook cases: can the overall system adapt in real time?
• Intermediate, cooperative modes of interaction between layers, agents, and the human operator, based on anticipatory reasoning, are desirable