Data Biased Robust Counter Strategies
Michael Johanson, Michael Bowling
April 14, 2009
University of Alberta Computer Poker Research Group
Introduction
Data Biased Robust Counter Strategies
In a competitive two player game, we have a tradeoff:
Playing to win: taking advantage of our opponent's mistakes
Playing well: being robust against any opponent
Our technique:
Partially satisfies both goals
Requires only observations of the opponent's actions
Demonstrated in the challenging game of 2-player Limit Texas Hold'em Poker
Outline
Motivation: Playing strong Poker
Background: Games as POMDPs
Strategies for playing games
Existing techniques and their problems
Building Counter-Strategies from Data
Results against human opponents
Conclusion
Motivation
Polaris - the world's strongest program for playing Two-Player Limit Texas Hold'em Poker
In 2008, won competitions against top human and computer players
Our approach revolves around being able to find Nash equilibrium strategies for large abstractions of Poker
Nash equilibrium strategies are guaranteed not to lose, but don't exploit opponent errors
Games as POMDPs
Games like Poker can be thought of as repeated POMDPs
When cards are dealt, state transitions are drawn from a known distribution
When the opponent acts, state transitions depend on:
The opponent's strategy
The opponent's cards (which we can't see)
Nash Equilibrium: A strategy that does well no matter how the state transitions at opponent nodes are set
But what if we have observations of opponent state transitions? We can do better.
Types of Strategies
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) plotted against Worst Case Performance (mb/g), marking two points:]

Game Theory: Nash equilibrium. Low exploitiveness, low exploitability
Decision Theory: Best response. High exploitiveness, high exploitability
Types of Strategies
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g), marking Nash Equilibrium, Best Response, and the region of other possible strategies between them]
Types of Strategies
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g), marking Nash Equilibrium, Best Response, and the Restricted Nash Response curve: the set of best possible tradeoffs]
Restricted Nash Response (Johanson, Zinkevich, Bowling – NIPS 2007)
Choose a value p and solve an unusual game
Opponent is forced to follow a model with probability p
[Diagram: the restricted game. Our strategy plays the full game tree. The opponent's strategy mixes two parts: with probability p it must follow the model; with probability 1-p it plays the game tree freely.]
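The mixture at an opponent decision point can be sketched as follows. This is a minimal illustration, not the authors' implementation; the action names and policies are made up for the example.

```python
def restricted_policy(model_policy, free_policy, p):
    """Mix the fixed model (weight p) with the unrestricted
    strategy (weight 1-p) at one opponent decision point.
    Policies are dicts mapping actions to probabilities."""
    actions = set(model_policy) | set(free_policy)
    return {a: p * model_policy.get(a, 0.0) + (1 - p) * free_policy.get(a, 0.0)
            for a in actions}

# Illustrative policies: the model mostly calls; the free part,
# choosing adversarially, always folds here.
model = {"call": 0.7, "raise": 0.3}
free = {"fold": 1.0}
mix = restricted_policy(model, free, 0.8)
# mix["call"] = 0.56, mix["raise"] = 0.24, mix["fold"] = 0.2
```

At p = 1 the opponent is exactly the model (solving yields a best response); at p = 0 the restriction vanishes (solving yields a Nash equilibrium).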
Restricted Nash Response
Performance against a static Poker opponent, in millibets per game:

[Figure: Exploitation of Opponent (mb/g) vs. Worst Case Performance (mb/g). Each datapoint shows a different value of p: 0, 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99, 1]
Restricted Nash Response with Models
[Diagram: frequency counts at opponent information sets, e.g. observed action frequencies 4/10 and 6/10 holding 2♦2♥, 1/4 and 3/4 holding K♦K♥, and 0/0 at unobserved nodes]
Restricted Nash requires the opponent's strategy
Can we use a model instead?
Observe 1 million games played by the opponent
Do frequency counts on opponent actions taken at information sets
Model assumes the opponent takes actions with observed frequencies
Need a default policy when there are no observations
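The frequency-count model described above can be sketched like this. The class and information-set labels are illustrative assumptions, not the paper's code; the default policy here is uniform, one simple choice among many.

```python
from collections import defaultdict

class FrequencyModel:
    """Opponent model built from observed action frequencies."""

    def __init__(self, actions=("fold", "call", "raise")):
        self.actions = actions
        self.counts = defaultdict(lambda: {a: 0 for a in actions})

    def observe(self, infoset, action):
        """Record one observed opponent action at an information set."""
        self.counts[infoset][action] += 1

    def policy(self, infoset):
        """Observed action frequencies; falls back to a uniform
        default policy when the infoset was never observed."""
        c = self.counts[infoset]
        total = sum(c.values())
        if total == 0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: c[a] / total for a in self.actions}

model = FrequencyModel()
for _ in range(4):
    model.observe("preflop:KK", "raise")
model.observe("preflop:KK", "call")
print(model.policy("preflop:KK"))   # {'fold': 0.0, 'call': 0.2, 'raise': 0.8}
```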
Problems with Models
Problem 1: Overfitting to the model
[Figure: Performance against opponent (mb/g) vs. worst case performance (mb/g). Counter-strategies perform far better against the opponent model than against the actual opponent.]
Problems with Models
Problem 2: Requires a lot of training data
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for models trained on 100, 1k, 10k, 100k, and 1m observed games.]
Problems with Models
Problem 3: Source of observations matters
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for models built from self-play observations and from random samples.]
Data Biased Response
Restricted Nash had one parameter p that represented model accuracy, but...
The model wasn't accurate in states we never observed
The model was more accurate in some states than in others
New approach: use the model's accuracy as part of the restricted game
Data Biased Response
Let's set up another restricted game. Instead of one p value for the whole tree, we'll set one p value for each choice node, p(i)
More observations → more confidence in the model → higher p(i)
Set a maximum p(i) value, Pmax, that we vary to produce a range of strategies
Data Biased Response
Three examples:
1-Step: p(i) = 0 if 0 observations, p(i) = Pmax otherwise
10-Step: p(i) = 0 if less than 10 observations, p(i) = Pmax otherwise
0-10 Linear: p(i) = 0 if 0 observations, p(i) = Pmax if 10 or more, and p(i) grows linearly in between
By setting p(i) = 0 in unobserved states, our prior is that the opponent will play as strongly as possible
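The three confidence schedules above can be written directly as functions of the observation count at a node. This is a sketch of the rules as stated; the Pmax value in the example is an arbitrary illustration.

```python
def p_one_step(n_obs, p_max):
    """1-Step: trust the model fully after a single observation."""
    return p_max if n_obs > 0 else 0.0

def p_ten_step(n_obs, p_max):
    """10-Step: require at least 10 observations before trusting the model."""
    return p_max if n_obs >= 10 else 0.0

def p_linear(n_obs, p_max):
    """0-10 Linear: confidence grows linearly from 0 to p_max
    over the first 10 observations."""
    return p_max * min(n_obs, 10) / 10.0

for n in (0, 5, 10, 50):
    print(n, p_one_step(n, 0.8), p_ten_step(n, 0.8), p_linear(n, 0.8))
# e.g. at n=5 with Pmax=0.8: 1-Step gives 0.8, 10-Step gives 0.0, Linear gives 0.4
```

All three agree at n = 0 (never trust an unobserved node) and for n ≥ 10; they differ only in how quickly confidence ramps up.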
DBR doesn’t overfit to the model
RNR and several DBR curves:
[Figure: Exploitation (mb/h) vs. Exploitability (mb/h), comparing the RNR curve with the 1-Step, 10-Step, and 0-10 Linear DBR curves.]
DBR works with fewer observations
0-10 Linear DBR curve:
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for the 0-10 Linear DBR trained on 100, 1k, 10k, 100k, and 1m observed games.]
DBR accepts any type of observation data
0-10 Linear DBR curve:
[Figure: Exploitation (mb/g) vs. Exploitability (mb/g) for the 0-10 Linear DBR trained on self-play observations and on random samples.]
Experiments against humans
DBR trained on 1.8 million hands of human observations, playing on Poker-Academy.com:
[Figure: Total winnings vs. games played (up to 35,000), for the Data Biased Response and Equilibrium strategies.]
Conclusion
Data Biased Response technique:
Generate a range of strategies, trading off exploitation and worst-case performance
Take advantage of observed information
Avoid overfitting to parts of the model that may be inaccurate
Questions?