Data Biased Robust Counter Strategies
Michael Johanson, Michael Bowling
April 14, 2009
University of Alberta Computer Poker Research Group
Introduction
Data Biased Robust Counter Strategies
In a competitive two player game, we have a tradeoff:
Playing to win: taking advantage of our opponent's mistakes
Playing well: being robust against any opponent
Our technique:
Partially satisfies both goals
Requires only observations of the opponent's actions
Demonstrated in the challenging game of 2-player Limit Texas Hold'em Poker
Outline
Motivation: Playing strong Poker
Background: Games as POMDPs
Strategies for playing games
Existing techniques and their problems
Building Counter-Strategies from Data
Results against human opponents
Conclusion
Motivation
Polaris - the world's strongest program for playing Two-Player Limit Texas Hold'em Poker
In 2008, won competitions against top human and computer players
Our approach revolves around being able to find Nash equilibrium strategies for large abstractions of Poker
Nash equilibrium strategies are guaranteed not to lose, but don't exploit opponent errors
Games as POMDPs
Games like Poker can be thought of as repeated POMDPs
When cards are dealt, state transitions are drawn from a known distribution
When the opponent acts, state transitions depend on:
The opponent's strategy
The opponent's cards (which we can't see)
Nash Equilibrium: A strategy that does well no matter how the state transitions at opponent nodes are set
But what if we have observations of opponent state transitions? We can do better.
Types of Strategies
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) plotted against Worst Case Performance (mb/g), marking two points:]

Game Theory: Nash equilibrium. Low exploitiveness, low exploitability
Decision Theory: Best response. High exploitiveness, high exploitability
Types of Strategies
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g), marking Nash Equilibrium, Best Response, and the region of other possible strategies between them]
Types of Strategies
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g), marking Nash Equilibrium, Best Response, and the Restricted Nash Response curve: the set of best possible tradeoffs]
Restricted Nash Response (Johanson, Zinkevich, Bowling – NIPS 2007)
Choose a value p and solve an unusual game
Opponent is forced to follow a model with probability p
[Diagram: the restricted game. Our strategy plays the full game tree. The opponent's strategy mixes two parts: with probability p it must follow the model; with probability 1-p it plays the game tree freely.]
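The mixture at an opponent decision point can be sketched as follows. This is a minimal illustration, not the authors' implementation; the action names and policies are made up for the example.

```python
def restricted_policy(model_policy, free_policy, p):
    """Mix the fixed model (weight p) with the unrestricted
    strategy (weight 1-p) at one opponent decision point.
    Policies are dicts mapping actions to probabilities."""
    actions = set(model_policy) | set(free_policy)
    return {a: p * model_policy.get(a, 0.0) + (1 - p) * free_policy.get(a, 0.0)
            for a in actions}

# Illustrative policies: the model mostly calls; the free part,
# choosing adversarially, always folds here.
model = {"call": 0.7, "raise": 0.3}
free = {"fold": 1.0}
mix = restricted_policy(model, free, 0.8)
# mix["call"] = 0.56, mix["raise"] = 0.24, mix["fold"] = 0.2
```

At p = 1 the opponent is exactly the model (solving yields a best response); at p = 0 the restriction vanishes (solving yields a Nash equilibrium).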
Restricted Nash Response
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g). Each datapoint shows a different value of p: 0, 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99, 1]
Restricted Nash Response with Models
[Diagram: frequency counts at opponent information sets, e.g. observed action frequencies 4/10 and 6/10 holding 2♦2♥, 1/4 and 3/4 holding K♦K♥, and 0/0 at unobserved nodes]
Restricted Nash requires the opponent's strategy
Can we use a model instead?
Observe 1 million games played by the opponent
Do frequency counts on opponent actions taken at information sets
Model assumes the opponent takes actions with observed frequencies
Need a default policy when there are no observations
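The frequency-count model described above can be sketched like this. The class and information-set labels are illustrative assumptions, not the paper's code; the default policy here is uniform, one simple choice among many.

```python
from collections import defaultdict

class FrequencyModel:
    """Opponent model built from observed action frequencies."""

    def __init__(self, actions=("fold", "call", "raise")):
        self.actions = actions
        self.counts = defaultdict(lambda: {a: 0 for a in actions})

    def observe(self, infoset, action):
        """Record one observed opponent action at an information set."""
        self.counts[infoset][action] += 1

    def policy(self, infoset):
        """Observed action frequencies; falls back to a uniform
        default policy when the infoset was never observed."""
        c = self.counts[infoset]
        total = sum(c.values())
        if total == 0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: c[a] / total for a in self.actions}

model = FrequencyModel()
for _ in range(4):
    model.observe("preflop:KK", "raise")
model.observe("preflop:KK", "call")
print(model.policy("preflop:KK"))   # {'fold': 0.0, 'call': 0.2, 'raise': 0.8}
```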
Problems with Models
Problem 1: Overfitting to the model
[Figure: Performance against opponent (mb/g) vs. worst case performance (mb/g). Counter-strategies perform far better against the opponent model than against the actual opponent.]
Problems with Models
Problem 2: Requires a lot of training data
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for models trained on 100, 1k, 10k, 100k, and 1m observed games.]
Problems with Models
Problem 3: Source of observations matters
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for models built from self-play observations and from random samples.]
Data Biased Response
Restricted Nash had one parameter p that represented model accuracy, but...
The model wasn't accurate in states we never observed
The model was more accurate in some states than in others
New approach: use the model's accuracy as part of the restricted game
Data Biased Response
Let's set up another restricted game. Instead of one p value for the whole tree, we'll set one p value for each choice node, p(i)
More observations → more confidence in the model → higher p(i)
Set a maximum p(i) value, Pmax, that we vary to produce a range of strategies
Data Biased Response
Three examples:
1-Step: p(i) = 0 if 0 observations, p(i) = Pmax otherwise
10-Step: p(i) = 0 if less than 10 observations, p(i) = Pmax otherwise
0-10 Linear: p(i) = 0 if 0 observations, p(i) = Pmax if 10 or more, and p(i) grows linearly in between
By setting p(i) = 0 in unobserved states, our prior is that the opponent will play as strongly as possible
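The three confidence schedules above can be written directly as functions of the observation count at a node. This is a sketch of the rules as stated; the Pmax value in the example is an arbitrary illustration.

```python
def p_one_step(n_obs, p_max):
    """1-Step: trust the model fully after a single observation."""
    return p_max if n_obs > 0 else 0.0

def p_ten_step(n_obs, p_max):
    """10-Step: require at least 10 observations before trusting the model."""
    return p_max if n_obs >= 10 else 0.0

def p_linear(n_obs, p_max):
    """0-10 Linear: confidence grows linearly from 0 to p_max
    over the first 10 observations."""
    return p_max * min(n_obs, 10) / 10.0

for n in (0, 5, 10, 50):
    print(n, p_one_step(n, 0.8), p_ten_step(n, 0.8), p_linear(n, 0.8))
# e.g. at n=5 with Pmax=0.8: 1-Step gives 0.8, 10-Step gives 0.0, Linear gives 0.4
```

All three agree at n = 0 (never trust an unobserved node) and for n ≥ 10; they differ only in how quickly confidence ramps up.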
DBR doesn’t overfit to the model
RNR and several DBR curves:
[Figure: Exploitation (mb/h) vs. Exploitability (mb/h), comparing the RNR curve with the 1-Step, 10-Step, and 0-10 Linear DBR curves.]
DBR works with fewer observations
0-10 Linear DBR curve:
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for the 0-10 Linear DBR trained on 100, 1k, 10k, 100k, and 1m observed games.]
DBR accepts any type of observation data
0-10 Linear DBR curve:
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for the 0-10 Linear DBR trained on self-play observations and on random samples.]
Experiments against humans
DBR trained on 1.8 million hands of human observations, playing on Poker-Academy.com:
[Figure: Total winnings vs. games played (up to 35,000), for the Data Biased Response and Equilibrium strategies.]
Conclusion
Data Biased Response technique:
Generate a range of strategies, trading off exploitation and worst-case performance
Take advantage of observed information
Avoid overfitting to parts of the model that may be inaccurate
Questions?