Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games
Richard Mealing and Jonathan L. Shapiro
mealingr,[email protected]
Machine Learning and Optimisation Group, School of Computer Science
University of Manchester, UK
The Problem
You play against an opponent
The opponent’s actions are based on previous actions
How can you maximise your reward?
Applications
Heads-up poker
Auctions
P2P networking
Path finding
etc.
Possible Approaches
You could use reinforcement learning to learn to take actions with high expected discounted rewards
However we propose to:
Model the opponent using sequence prediction methods
Look ahead and take actions which, according to the opponent model, probabilistically lead to the highest reward
Which approach gives us the highest rewards?
Opponent Modelling using Sequence Prediction
Observe the opponent's action and the player's action $(a_{\text{opp}}, a)$
Form a sequence over time t (memory size n)
$$(a^{t}_{\text{opp}}, a^{t}), (a^{t-1}_{\text{opp}}, a^{t-1}), \ldots, (a^{t-n+1}_{\text{opp}}, a^{t-n+1})$$
Predict the opponent’s next action based on this sequence
$$\Pr\left(a^{t+1}_{\text{opp}} \,\middle|\, (a^{t}_{\text{opp}}, a^{t}), (a^{t-1}_{\text{opp}}, a^{t-1}), \ldots, (a^{t-n+1}_{\text{opp}}, a^{t-n+1})\right)$$
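To make this concrete, here is a minimal sketch (our assumption, not the paper's implementation) of a frequency-based order-n predictor over joint (opponent, player) actions; the names NGramPredictor, observe and predict are hypothetical:

```python
from collections import Counter, defaultdict

class NGramPredictor:
    """Order-n frequency model: counts which opponent action follows each
    length-n context of (opponent action, player action) pairs."""

    def __init__(self, n):
        self.n = n                        # memory size
        self.counts = defaultdict(Counter)
        self.history = []                 # sequence of (a_opp, a) pairs

    def observe(self, a_opp, a):
        context = tuple(self.history[-self.n:])
        if len(context) == self.n:        # only count full-length contexts
            self.counts[context][a_opp] += 1
        self.history.append((a_opp, a))

    def predict(self):
        """Estimate Pr(opponent's next action | last n joint actions)."""
        seen = self.counts.get(tuple(self.history[-self.n:]))
        if not seen:
            return {}                     # unseen context: no prediction
        total = sum(seen.values())
        return {act: c / total for act, c in seen.items()}

# Example: after seeing the opponent alternate C,D, the model expects
# the pattern to continue
m = NGramPredictor(n=1)
for a_opp, a in [("C", "D"), ("D", "D"), ("C", "D"), ("D", "D")]:
    m.observe(a_opp, a)
print(m.predict())   # {'C': 1.0}
```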
Sequence Prediction Methods
We tested a variety of sequence prediction methods...
Lempel-Ziv-1978 (LZ78) [1]: unbounded contexts
Knuth-Morris-Pratt (KMP) [2]: unbounded contexts
Prediction by Partial Matching C (PPMC) [3]: context blending
ActiveLeZi [4]: context blending
Transition Directed Acyclic Graph (TDAG) [5]: context pruning
Entropy Learned Pruned Hypothesis Space (ELPH) [6]: context pruning
N-Gram [7]
Hierarchical N-Gram (H. N-Gram) [7]: collection of 1 to N-Grams (see the sketch below)
Long Short-Term Memory (LSTM) [8]: implicit blending and pruning
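As one example from this list, a sketch of the hierarchical N-gram idea [7], reusing the NGramPredictor sketch above; the min_count support threshold and the fallback rule are our assumptions, not the paper's settings:

```python
class HierarchicalNGram:
    """Collection of 1- to N-gram models; predicts with the longest
    context seen often enough, falling back to shorter contexts."""

    def __init__(self, max_n, min_count=3):
        self.models = [NGramPredictor(n) for n in range(1, max_n + 1)]
        self.min_count = min_count        # assumed support threshold

    def observe(self, a_opp, a):
        for m in self.models:             # every order sees every observation
            m.observe(a_opp, a)

    def predict(self):
        for m in reversed(self.models):   # try the highest order first
            seen = m.counts.get(tuple(m.history[-m.n:]))
            if seen and sum(seen.values()) >= self.min_count:
                total = sum(seen.values())
                return {act: c / total for act, c in seen.items()}
        return {}                         # no order has enough support
```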
Sequence Prediction Method Lookahead
Predict k steps ahead given a hypothesised context, i.e.
$$\Pr\left(a^{t+k}_{\text{opp}} \,\middle|\, (a^{t+k-1}_{\text{opp}}, a^{t+k-1}), (a^{t+k-2}_{\text{opp}}, a^{t+k-2}), \ldots, (a^{t+k-n}_{\text{opp}}, a^{t+k-n})\right)$$
A hypothesised context may contain unobserved (predicted) symbols
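A sketch of how such a rollout might look with the NGramPredictor above: predicted opponent symbols are appended to a hypothesised context (the learned counts are left untouched) while stepping through a plan of our own actions; predict_k_ahead is a hypothetical name:

```python
def predict_k_ahead(model, my_plan):
    """Predict the opponent's action k = len(my_plan) steps ahead, assuming
    we play my_plan and the opponent follows the model's top predictions."""
    history = list(model.history)         # copy: hypotheses stay out of the model
    dist = {}
    for a in my_plan:
        seen = model.counts.get(tuple(history[-model.n:]))
        if not seen:
            return {}                     # hypothesised context never observed
        total = sum(seen.values())
        dist = {act: c / total for act, c in seen.items()}
        a_opp_hat = max(dist, key=dist.get)
        history.append((a_opp_hat, a))    # unobserved (predicted) symbol
    return dist    # Pr(a_opp at step t+k | hypothesised context)
```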
Reinforcement Learning: Q-Learning
Learns an action-value function that, given a state-action pair (s, a), outputs the expected value of taking that action in that state and following a fixed strategy thereafter [9]
$$Q(\underbrace{s^{t}}_{\text{state}}, \underbrace{a^{t}}_{\text{action}}) \leftarrow \underbrace{(1 - \overbrace{\alpha}^{\text{learning rate}})\, Q(s^{t}, a^{t})}_{\text{fraction of old value}} + \underbrace{\alpha \big[ \overbrace{r^{t}}^{\text{reward}} + \overbrace{\gamma}^{\text{discount}} \max_{a^{t+1}} Q(s^{t+1}, a^{t+1}) \big]}_{\text{fraction of reward and next max-valued action}}$$
Select actions with high q-values with some exploration
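A minimal tabular sketch of this update and ε-greedy selection (the α, γ and ε values are placeholders, not the paper's settings):

```python
import random
from collections import defaultdict

Q = defaultdict(float)                    # Q[(state, action)], default 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step [9]: blend the old value with the reward
    plus the discounted value of the best next action."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def select_action(s, actions, eps=0.1):
    """Pick a high-valued action with some exploration."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```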
Need for Lookahead (Prisoner’s Dilemma Example)
      D      C
D    1,1    4,0
C    0,4    3,3
Defect is the dominant action
Cooperate-Cooperate is socially optimal (highest sum of rewards)
Tit-for-tat (copy opponent’s last move) is good for iterated play
Can we learn tit-for-tat?
[Figure: one-step lookahead. If the opponent is predicted to play C, playing D yields 4 and C yields 3; if predicted to play D, playing D yields 1 and C yields 0.]
A lookahead of 1 shows D has the highest reward
With a lookahead of 2, (D,C,D,C) has the highest total reward (unlikely)
Assume the opponent copies the player's last move (i.e. tit-for-tat)
[Figure: two-step lookahead tree against tit-for-tat. If the opponent is predicted to play C: playing D (4) then D or C gives totals 5 or 4; playing C (3) then D or C gives totals 7 or 6. If predicted to play D: playing D (1) then D or C gives totals 2 or 1; playing C (0) then D or C gives totals 4 or 3.]
A lookahead of 2 against tit-for-tat shows C has the highest reward
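The depth-1 versus depth-2 choice can be checked in a few lines; a sketch assuming the row player's payoffs from the matrix above and a tit-for-tat opponent:

```python
PAYOFF = {"D": {"D": 1, "C": 4}, "C": {"D": 0, "C": 3}}  # payoff[mine][opponent's]

def best_total(opp_action, depth):
    """Max total reward over `depth` moves against tit-for-tat,
    given the opponent's current (predicted) action."""
    if depth == 0:
        return 0
    # tit-for-tat: after we play a, the opponent's next action is a
    return max(PAYOFF[a][opp_action] + best_total(a, depth - 1) for a in "DC")

def best_first_action(opp_action, depth):
    return max("DC", key=lambda a: PAYOFF[a][opp_action] + best_total(a, depth - 1))

print(best_first_action("C", depth=1))   # 'D': greedy defection (4 > 3)
print(best_first_action("C", depth=2))   # 'C': cooperating wins (3+4=7 > 4+1=5)
```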
Q-Learning’s Implicit Lookahead
$$Q(\underbrace{s^{t}}_{\text{state}}, \underbrace{a^{t}}_{\text{action}}) \leftarrow \underbrace{(1 - \overbrace{\alpha}^{\text{learning rate}})\, Q(s^{t}, a^{t})}_{\text{fraction of old value}} + \underbrace{\alpha \big[ \overbrace{r^{t}}^{\text{reward}} + \overbrace{\gamma}^{\text{discount}} \max_{a^{t+1}} Q(s^{t+1}, a^{t+1}) \big]}_{\text{fraction of reward and next max-valued action}}$$
Assume each state is an opponent action, i.e. $s = a_{\text{opp}}$. The learner then values (player action, opponent action) pairs as:
γ = 0: the payoff matrix ($\arg\max_a Q(a^{t+1}_{\text{opp}}, a)$ is the same as a max over a lookahead of 1)
0 < γ < 1: the payoff matrix plus future rewards with exponential decay
γ = 1: the payoff matrix plus future rewards
Increasing γ increases lookahead
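For example (our derivation, consistent with the update above): with γ = 0 the bootstrap term vanishes, so each Q-value is a running average of immediate rewards whose fixed point is the corresponding payoff matrix entry:

```latex
% gamma = 0: the discounted max-term vanishes
Q(a^{t}_{\mathrm{opp}}, a^{t}) \leftarrow (1-\alpha)\, Q(a^{t}_{\mathrm{opp}}, a^{t}) + \alpha\, r^{t}
% an exponential moving average converging (for suitably decaying alpha) to
Q(a_{\mathrm{opp}}, a) \to \mathbb{E}\left[ r \mid a_{\mathrm{opp}}, a \right]
```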
Exhaustive Explicit Lookahead
We use exhaustive explicit lookahead with the opponent model and action values to greedily select actions (to a limited depth) maximising total reward
[Figure: exhaustive depth-2 lookahead tree for the Prisoner's Dilemma, alternating the player's actions with the opponent's predicted actions. Leaves show cumulative rewards: e.g. playing D,D totals 2 if the opponent plays D,D and 8 if it plays C,C; playing C,D totals 7 if the opponent plays C,C.]
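A recursive sketch of this search built on the NGramPredictor above (our reconstruction, not the authors' code): each of our actions is scored in expectation under the model's predicted opponent distribution, recursing to a fixed depth, and the first action of the best sequence is returned:

```python
def exhaustive_lookahead(model, payoff, actions, depth):
    """Pick the action starting the sequence with the highest expected
    total reward, predicting the opponent with `model` at every step."""

    def search(history, d):
        """Return (best expected total reward, best action) from here."""
        seen = model.counts.get(tuple(history[-model.n:]))
        if d == 0 or not seen:
            return 0.0, None              # horizon reached or unseen context
        total = sum(seen.values())
        dist = {o: c / total for o, c in seen.items()}
        best_v, best_a = float("-inf"), None
        for a in actions:
            # expected reward of a under the predicted opponent distribution,
            # plus the value of continuing optimally afterwards
            v = sum(p * (payoff[a][o] + search(history + [(o, a)], d - 1)[0])
                    for o, p in dist.items())
            if v > best_v:
                best_v, best_a = v, a
        return best_v, best_a

    return search(list(model.history), depth)[1]

# e.g. exhaustive_lookahead(m, {"D": {"D": 1, "C": 4}, "C": {"D": 0, "C": 3}},
#                           ["D", "C"], depth=2)
```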
Experiments
Iterated Rock-Paper-Scissors
The opponent's actions depend on its previous actions
Iterated Prisoner's Dilemma
The opponent's actions depend on both players' previous actions
Littman's Soccer [10]
Direct competition
Which approach has better performance?
Rock-Paper-Scissors payoffs:
      R       P       S
R    0,0    -1,1    1,-1
P    1,-1    0,0    -1,1
S   -1,1    1,-1    0,0

Prisoner's Dilemma payoffs:
      D      C
D    1,1    4,0
C    0,4    3,3
Iterated Rock-Paper-Scissors
Columns: Name, Avg Payoff, Avg Time. Rows within each block run from best (top) to worst (bottom).

Memory size 1 vs opponent pattern R,P,S (order 1):
  ELPH         1 ± 0           14.7 ± 0.6
  WoLF-PHC     1 ± 0           27 ± 2
  PGA-APP      0.973 ± 0.009   24 ± 2
  ε Q-Learner  0.97 ± 0.01     29 ± 2
  WPL          0.87 ± 0.01     74 ± 6

Memory size 1 vs R,R,P,P,S,S (order 2):
  WoLF-PHC     0.645 ± 0.006   89 ± 5
  PGA-APP      0.644 ± 0.008   59 ± 5
  ε Q-Learner  0.635 ± 0.008   22 ± 3
  ELPH         0.617 ± 0.002   210 ± 0
  WPL          0.374 ± 0.007   143 ± 7

Memory size 1 vs R,R,R,P,P,P,S,S,S (order 3):
  ELPH         0.666 ± 0.0003  56 ± 4
  PGA-APP      0.652 ± 0.005   62 ± 4
  WoLF-PHC     0.646 ± 0.004   71 ± 4
  ε Q-Learner  0.582 ± 0.008   48 ± 6
  WPL          0.393 ± 0.008   139 ± 7

Memory size 2 vs R,P,S (order 1):
  ELPH         1 ± 0           10 ± 0
  WoLF-PHC     0.98 ± 0.008    91 ± 3
  ε Q-Learner  0.97 ± 0.01     28 ± 2
  PGA-APP      0.92 ± 0.01     52 ± 3
  WPL          0.65 ± 0.02     105 ± 7

Memory size 2 vs R,R,P,P,S,S (order 2):
  ELPH         1 ± 0           10 ± 0
  ε Q-Learner  0.92 ± 0.01     45 ± 4
  WoLF-PHC     0.91 ± 0.01     147 ± 8
  PGA-APP      0.86 ± 0.01     109 ± 6
  WPL          0.54 ± 0.01     71 ± 6

Memory size 2 vs R,R,R,P,P,P,S,S,S (order 3):
  WoLF-PHC     0.68 ± 0.01     173 ± 6
  ε Q-Learner  0.64 ± 0.01     56 ± 5
  PGA-APP      0.61 ± 0.01     120 ± 7
  ELPH         0.6 ± 0.002     58 ± 4
  WPL          0.375 ± 0.009   139 ± 7

Memory size 3 vs R,P,S (order 1):
  ELPH         1 ± 0           10 ± 0
  WoLF-PHC     0.95 ± 0.01     181 ± 6
  ε Q-Learner  0.94 ± 0.01     37 ± 4
  PGA-APP      0.9 ± 0.02      144 ± 6
  WPL          0.63 ± 0.01     98 ± 6

Memory size 3 vs R,R,P,P,S,S (order 2):
  ELPH         1 ± 0           17.2 ± 0.7
  WoLF-PHC     0.89 ± 0.01     205 ± 3
  ε Q-Learner  0.87 ± 0.01     71 ± 5
  PGA-APP      0.87 ± 0.01     179 ± 6
  WPL          0.69 ± 0.01     208 ± 2

Memory size 3 vs R,R,R,P,P,P,S,S,S (order 3):
  ELPH         1 ± 0           16.3 ± 0.7
  WoLF-PHC     0.85 ± 0.01     210 ± 0
  ε Q-Learner  0.84 ± 0.01     84 ± 6
  PGA-APP      0.77 ± 0.01     198 ± 3
  WPL          0.76 ± 0.01     210 ± 0
Agents cannot learn a best response when their memory size is smaller than the opponent's model order
Our approach gains the highest payoffs, generally at the fastest rates
Iterated Prisoner’s Dilemma
Columns: Name, Avg Payoff, Avg Time, Position. Rows within each block run from best (top) to worst (bottom).

Discount = 0 and Depth = 1:

Memory size 1:
  PGA-APP      2.03 ± 0.01     30 ± 3   13
  ε Q-Learner  1.94 ± 0.01     30 ± 4   16
  WPL          1.932 ± 0.007   20 ± 1   17
  TDAG         1.93 ± 0.01     30 ± 2   16
  WoLF-PHC     1.89 ± 0.01     20 ± 2   18

Memory size 2:
  PGA-APP      2.01 ± 0.01     30 ± 4   14
  WPL          1.949 ± 0.008   20 ± 1   17
  WoLF-PHC     1.92 ± 0.01     30 ± 4   17
  TDAG         1.902 ± 0.008   20 ± 2   16
  ε Q-Learner  1.822 ± 0.007   20 ± 2   18

Memory size 3:
  ε Q-Learner  2.02 ± 0.01     30 ± 3   14
  TDAG         1.958 ± 0.008   20 ± 3   17
  WPL          1.945 ± 0.009   20 ± 3   17
  PGA-APP      1.92 ± 0.009    20 ± 2   16
  WoLF-PHC     1.773 ± 0.007   20 ± 1   18

Discount = 0.99 and Depth = 2:

Memory size 1:
  ε Q-Learner       2.68 ± 0.01     180 ± 5   1
  TDAG + Q-Learner  2.63 ± 0.01     60 ± 4    1
  TDAG              2.607 ± 0.008   20 ± 1    1
  WPL               2.31 ± 0.01     30 ± 4    12
  PGA-APP           2.17 ± 0.02     30 ± 3    13
  WoLF-PHC          2.1 ± 0.02      40 ± 5    13

Memory size 2:
  TDAG + Q-Learner  2.828 ± 0.009   120 ± 6   1
  ε Q-Learner       2.74 ± 0.01     180 ± 5   1
  TDAG              2.72 ± 0.01     20 ± 1    1
  WPL               2.34 ± 0.01     40 ± 4    12
  PGA-APP           2.18 ± 0.02     40 ± 5    13
  WoLF-PHC          2.14 ± 0.01     30 ± 3    13

Memory size 3:
  TDAG + Q-Learner  2.847 ± 0.009   130 ± 5   1
  TDAG              2.74 ± 0.01     30 ± 3    1
  ε Q-Learner       2.65 ± 0.01     170 ± 5   1
  WPL               2.32 ± 0.01     30 ± 4    12
  PGA-APP           2.18 ± 0.02     40 ± 4    12
  WoLF-PHC          2.14 ± 0.02     40 ± 4    13
Increasing lookahead (via discounting and search depth) increases rewards
Combining our approach with Q-Learning increases rewards but also time
Our approach gains the highest payoffs, generally at the fastest rates
Soccer
Columns: Name, Avg Payoff; one block per reinforcement learning opponent. Rows within each block run from best (top) to worst (bottom).

vs ε Q-Learner:
  PPMC        0.687 ± 0.006
  LSTM        0.635 ± 0.004
  TDAG        0.63 ± 0.004
  H. N-Gram   0.628 ± 0.003
  LZ78        0.621 ± 0.004
  N-Gram      0.62 ± 0.003
  ActiveLeZi  0.618 ± 0.003
  ELPH        0.601 ± 0.004
  FP          0.536 ± 0.003
  KMP         0.524 ± 0.002

vs WoLF-PHC:
  PPMC        0.701 ± 0.006
  LSTM        0.638 ± 0.005
  FP          0.637 ± 0.004
  N-Gram      0.614 ± 0.003
  H. N-Gram   0.612 ± 0.003
  ActiveLeZi  0.606 ± 0.004
  TDAG        0.606 ± 0.004
  LZ78        0.602 ± 0.004
  ELPH        0.576 ± 0.003
  KMP         0.564 ± 0.003

vs WPL:
  PPMC        0.717 ± 0.004
  H. N-Gram   0.674 ± 0.002
  N-Gram      0.665 ± 0.001
  LSTM        0.659 ± 0.003
  TDAG        0.659 ± 0.002
  FP          0.655 ± 0.003
  LZ78        0.653 ± 0.002
  ActiveLeZi  0.651 ± 0.002
  ELPH        0.637 ± 0.002
  KMP         0.62 ± 0.002

vs PGA-APP:
  PPMC        0.648 ± 0.006
  H. N-Gram   0.608 ± 0.003
  ActiveLeZi  0.599 ± 0.004
  FP          0.593 ± 0.003
  TDAG        0.589 ± 0.004
  LSTM        0.585 ± 0.004
  N-Gram      0.582 ± 0.003
  LZ78        0.574 ± 0.003
  ELPH        0.565 ± 0.003
  KMP         0.553 ± 0.003
Our approach wins over 50% of the games using any predictor
PPMC achieves the highest performance
Conclusions
We proposed sequence prediction and lookahead to accurately model, and effectively respond to, opponents with memory
Empirical results show that, given sufficient memory and lookahead, our approach outperforms reinforcement learning algorithms
Future Work
We will apply our approach to domains with:
Larger state spaces
Hidden information
Where the challenges are:
Deeper lookahead (e.g. sampling techniques)
Sequence predictor configuration (e.g. one predictor per state)
References
[1] Jacob Ziv and Abraham Lempel. “Compression of Individual Sequences via Variable-Rate Coding”. In: IEEE Transactions on Information Theory 24.5 (1978), pp. 530–536.
[2] Byron Knoll. “Text Prediction and Classification Using String Matching”. 2009.
[3] Alistair Moffat. “Implementing the PPM Data Compression Scheme”. In: IEEE Transactions on Communications 38 (1990), pp. 1917–1921.
[4] Karthik Gopalratnam and Diane J. Cook. “ActiveLeZi: An Incremental Parsing Algorithm for Sequential Prediction”. In: 16th Int. FLAIRS Conf. 2003, pp. 38–42.
[5] Philip Laird and Ronald Saul. “Discrete Sequence Prediction and Its Applications”. In: Machine Learning 15 (1994), pp. 43–68.
[6] Jensen et al. “Non-stationary Policy Learning in 2-Player Zero Sum Games”. In: Proc. of 20th Int. Conf. on AI. 2005, pp. 789–794.
[7] Ian Millington. Artificial Intelligence for Games. Ed. by David H. Eberly. Morgan Kaufmann, 2006. Chap. Learning, pp. 583–590.
[8] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. “Learning Precise Timing with LSTM Recurrent Networks”. In: JMLR 3 (2002), pp. 115–143.
[9] C. J. C. H. Watkins. “Learning from Delayed Rewards”. PhD thesis. Cambridge, 1989.
[10] Michael L. Littman. “Markov Games as a Framework for Multi-Agent Reinforcement Learning”. In: Proc. of 11th ICML. Morgan Kaufmann, 1994, pp. 157–163.