data biased robust counter strategies€¦ · motivation: playing strong poker background: games as...

23
Data Biased Robust Counter Strategies Michael Johanson, Michael Bowling April 14, 2009 U V A A C K K P Q Q R J J G 10 10 University of Alberta Computer Poker Research Group

Upload: others

Post on 26-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Data Biased Robust Counter Strategies

Michael Johanson, Michael Bowling

April 14, 2009

U

VA♠

A♠

CK♥

K♥

PQ♣

Q♣

RJ♦

J♦

G10♠

10 ♠

University of AlbertaComputer Poker Research Group

Page 2: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Introduction

Data Biased Robust Counter Strategies

In a competitive two player game, we have a tradeoff:

Playing to win: taking advantage of our opponent’s mistakesPlaying well: being robust against any opponent

Our technique:

Partially satisfies both goalsRequires only observations of the opponent’s actions

Demonstrated in the challenging game of 2-player Limit TexasHold’em Poker

Page 3: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Outline

Motivation: Playing strong Poker

Background: Games as POMDP’s

Strategies for playing games

Existing techniques and their problems

Building Counter-Strategies from Data

Results against human opponents

Conclusion

Page 4: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Motivation

University of Alberta Computer Poker Research Group

Polaris - the world’s strongest program for playing Two-Player LimitTexas Hold’em PokerIn 2008, won competitions against top human and computer players

Our approach revolves around being able to find Nash equilibriumstrategies for large abstractions of Poker

Nash equilibrium strategies are guaranteed to not lose, but don’texploit opponent errors

Page 5: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Games as POMDP’s

Games like Poker can be thought of as repeated POMDPs

When cards are dealt, state transitions are drawn from a knowndistributionWhen the opponent acts, state transitions depend on:

Opponent’s strategyOpponent’s cards (that we can’t see)

Nash Equilibrium: A strategy that does well, no matter what statetransitions at opponent node are set to

But what if we have observations of opponent state transitions? Wecan do better.

Page 6: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Types of Strategies

Performance against a static Poker opponent, inmillibets per game

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000

Exp

loita

tion

of O

ppon

ent (

mb/

g)

Worst Case Performance (mb/g)

Nash Equilibrium

Best ResponseGame Theory:Nash equilibrium.Lowexploitiveness,low exploitability

Decision Theory:Best response.Highexploitiveness,high exploitability

Page 7: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Types of Strategies

Performance against a static Poker opponent, inmillibets per game

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000

Exp

loita

tion

of O

ppon

ent (

mb/

g)

Worst Case Performance (mb/g)

Nash Equilibrium

Best Response

Other possiblestrategies

Page 8: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Types of Strategies

Performance against a static Poker opponent, inmillibets per game

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000

Exp

loita

tion

of O

ppon

ent (

mb/

g)

Worst Case Performance (mb/g)

Nash Equilibrium

Best Response

Restricted NashResponse: Set ofbest possibletradeoffs

Page 9: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Restricted Nash Response (Johanson, Zinkevich, Bowling – NIPS 2007)

Choose a value p and solve an unusual game

Opponent is forced to follow a model with probability p

GameTree

GameTree

Model

p1-p

Our Strategy Opponent's Strategy

Page 10: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Restricted Nash Response

Performance against a static Poker opponent, inmillibets per game

0

10

20

30

40

50

60

0 1000 2000 3000 4000 5000 6000

Exp

loita

tion

of O

ppon

ent (

mb/

g)

Worst Case Performance (mb/g)

(0)

(0.5)(0.7)(0.8)

(0.9)(0.93)

(0.97)(0.99)

(1)

Restricted NashResponse: Eachdatapoint showsa different valuep

Page 11: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Restricted Nash Response with Models

2♦2♥ K♦K♥

4/10 6/10 1/4 3/4

0/00/0

Restricted Nash requires theopponent’s strategy

Can we use a model instead?

Observe 1 million gamesplayed by the opponent

Do frequency counts onopponent actions taken atinformation sets

Model assumes opponenttakes actions with observedfrequencies

Need a default policy whenthere are no observations

Page 12: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Problems with Models

Problem 1: Overfitting to the model

0

5

10

15

20

25

30

35

40

45

50

0 200 400 600 800 1000

Per

form

ance

aga

inst

opp

onen

t (m

b/g)

Worst case performance (mb/g)

Opponent Model

Actual Opponent

Page 13: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Problems with Models

Problem 2: Requires a lot of training data

-120

-100

-80

-60

-40

-20

0

20

0 200 400 600 800 1000

Exp

loita

tion

(mb/

g)

Exploitability (mb/g)

1m

100k

10k

1k

100

Page 14: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Problems with Models

Problem 3: Source of observations matters

-120

-100

-80

-60

-40

-20

0

20

0 200 400 600 800 1000

Exp

loita

tion

(mb/

g)

Exploitability (mb/g)

Self-Play

Random Samples

Page 15: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Data Biased Response

Restricted Nash had one parameter p that represented modelaccuracy, but...

Model wasn’t accurate in states we never observedModel was more accurate in some states than in others

New approach: Use model’s accuracy as part of the restricted game

Page 16: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Data Biased Response

Lets set up another restricted game. Instead of one p value for thewhole tree, we’ll set one p value for each choice node, p(i)

More observations → more confidence in the model → higher p(i)

Set a maximum p(i) value, Pmax, that we vary to produce a range ofstrategies

Page 17: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Data Biased Response

Three examples:

1-Step: p(i) = 0 if 0 observations, p(i) = Pmax otherwise10-Step: p(i) = 0 if less than 10 observations, p(i) = Pmax otherwise0-10 Linear: p(i) = 0 if 0 observations, p(i) = Pmax if 10 or more, andp(i) grows linearly in between

By setting p(i) = 0 in unobserved states, our prior is that theopponent will play as strongly as possible

Page 18: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

DBR doesn’t overfit to the model

RNR and several DBR curves:

-5

0

5

10

15

20

25

0 200 400 600 800 1000

Exp

loita

tion

(mb/

h)

Exploitability (mb/h)

1-Step

RN

10-Step

0-10 Linear

Page 19: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

DBR works with fewer observations

0-10 Linear DBR curve:

0

5

10

15

20

0 50 100 150 200 250

Exp

loita

tion

(mb/

g)

Exploitability (mb/g)

1m

100k

10k

1k100

Page 20: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

DBR accepts any type of observation data

0-10 Linear DBR curve:

2 4 6 8

10 12 14 16 18 20

0 50 100 150 200 250

Exp

loita

tion

(mb/

g)

Exploitability (mb/g)

Self-Play

Random Samples

Page 21: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Experiments against humans

DBR trained on 1.8 million hands of human observations,playing on Poker-Academy.com:

0

2000

4000

6000

8000

10000

12000

0 5000 10000 15000 20000 25000 30000 35000

Tot

al w

inni

ngs

Games played

Data Biased Response

Equilibrium

Page 22: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Conclusion

Data Biased Response technique:

Generate a range of strategies, trading off exploitation and worst-caseperformanceTake advantage of observed informationAvoid overfitting to parts of the model that may be inaccurate

Page 23: Data Biased Robust Counter Strategies€¦ · Motivation: Playing strong Poker Background: Games as POMDP’s Strategies for playing games Existing techniques and their problems Building

Questions?