Adaptive Learning in Games

DESCRIPTION
This is the project presentation for the Adaptive Filter Theory course I took this winter.

TRANSCRIPT
EECS 463 Course Project
ADAPTIVE LEARNING IN GAMES
Suvarup Saha
3/11/2010
Outline
• Motivation
• Games
• Learning in Games
• Adaptive Learning
• Example
• Gradient Techniques
• Conclusion
Motivation
• Adaptive filtering techniques generalize to many applications outside filtering:
  • Gradient-based iterative search
  • Stochastic gradient
  • Least squares
• Applications of game theory to less-than-rational multi-agent scenarios demand self-learning mechanisms
• Adaptive techniques can be applied in such instances to help the agents learn the game and play intelligently
Games
• A game is an interaction between two or more self-interested agents
• Each agent i chooses a strategy s_i from a set of strategies S_i
• A (joint) strategy profile s is the set of chosen strategies, also called an outcome of the game in a single play
• Each agent has a utility function u_i(s), specifying its preference for each outcome in terms of a payoff
• An agent's best response is the strategy with the highest payoff, given its opponents' choice of strategies
• A Nash equilibrium is a strategy profile in which every agent's strategy is a best response to the others' choices
A Normal Form Game

            B
          b1    b2
  A  a1   4,4   5,2
     a2   0,1   4,3

• This is a 2-player game with SA = {a1, a2}, SB = {b1, b2}
• The u_i(s) are given explicitly in matrix form; for example, uA(a1, b2) = 5, uB(a1, b2) = 2
• The best response of A to B playing b2 is a1
• In this game, (a1, b1) is the unique Nash equilibrium
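The claim that (a1, b1) is the game's unique Nash equilibrium can be checked by enumeration. A minimal sketch (variable and function names are my own) encoding the payoff matrix above:

```python
# Encode the 2x2 game from the slide and enumerate its pure Nash equilibria.
import itertools

# Payoffs indexed by (A's action, B's action), copied from the matrix above.
uA = {('a1', 'b1'): 4, ('a1', 'b2'): 5, ('a2', 'b1'): 0, ('a2', 'b2'): 4}
uB = {('a1', 'b1'): 4, ('a1', 'b2'): 2, ('a2', 'b1'): 1, ('a2', 'b2'): 3}
SA, SB = ['a1', 'a2'], ['b1', 'b2']

def is_nash(a, b):
    # (a, b) is Nash iff a is a best response to b AND b is a best response to a.
    return (uA[(a, b)] == max(uA[(x, b)] for x in SA) and
            uB[(a, b)] == max(uB[(a, y)] for y in SB))

nash = [(a, b) for a, b in itertools.product(SA, SB) if is_nash(a, b)]
print(nash)  # [('a1', 'b1')]
```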
Learning in Games
• Classical approach: compute an optimal/equilibrium strategy
• Some criticisms of this approach:
  • Other agents' utilities might be unknown to an agent, preventing it from computing an equilibrium strategy
  • Other agents might not be playing an equilibrium strategy
  • Computing an equilibrium strategy might be hard
• Another approach: learn how to 'optimally' play a game by
  • playing it many times
  • updating the strategy based on experience
Learning Dynamics

[Diagram: a spectrum of learning dynamics ordered by the rationality/sophistication of the agents – Evolutionary Dynamics, Adaptive Learning, Bayesian Learning – with Adaptive Learning marked as the focus of our discussion]
Evolutionary Dynamics
• Inspired by evolutionary biology, with no appeal to rationality of the agents
• An entire population of agents is programmed to use some strategy; players are randomly matched to play with each other
• Strategies with high payoff spread within the population by
  • Learning
  • Copying or inheriting strategies – Replicator Dynamics
  • Infection
• Stability analysis – Evolutionarily Stable Strategies (ESS): players playing an ESS must have strictly higher payoffs than a small group of invaders playing a different strategy
Bayesian Learning
• Assumes 'informed agents' playing repeated games with a finite action space
• Payoffs depend on characteristics of the agents represented by types – each agent's type is private information
• The agents' initial beliefs are given by a common prior distribution over agent types
• This belief is updated according to Bayes' rule to a posterior distribution at each stage of the game
• Every finite Bayesian game has at least one Bayesian Nash equilibrium, possibly in mixed strategies
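To make the prior-to-posterior step concrete, here is a small illustrative sketch. The types, actions, and likelihood numbers are invented for illustration (they are not from the presentation); only the Bayes'-rule mechanics matter:

```python
# One Bayes'-rule update of a belief over two hypothetical opponent types.
prior = {'aggressive': 0.5, 'cautious': 0.5}       # common prior over types
likelihood = {'aggressive': {'raise': 0.8, 'fold': 0.2},   # P(action | type),
              'cautious':   {'raise': 0.3, 'fold': 0.7}}   # made-up numbers

def bayes_update(prior, observed):
    # posterior(type) ∝ prior(type) * P(observed action | type)
    unnorm = {t: prior[t] * likelihood[t][observed] for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

posterior = bayes_update(prior, 'raise')
print(posterior)  # aggressive ≈ 0.727, cautious ≈ 0.273
```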
Adaptive Learning
• Agents are not fully rational, but can learn through experience and adapt their strategies
• Agents do not know the reward structure of the game
• Agents are only able to take actions and observe their own rewards (and possibly their opponents' rewards as well)
• Popular examples:
  • Best Response Update
  • Fictitious Play
  • Regret Matching
  • Infinitesimal Gradient Ascent (IGA)
  • Dynamic Gradient Play
  • Adaptive Play Q-learning
Fictitious Play
• The learning process is used to develop a 'historical distribution' of the other agents' play
• In fictitious play, agent i has an exogenous initial weight function k_i^0 : S_{-i} → R+
• The weight is updated by adding 1 to the weight of each opponent strategy, each time it is played
• The probability that player i assigns to player -i playing s_{-i} at date t is given by
  q_i^t(s_{-i}) = k_i^t(s_{-i}) / Σ_{s'_{-i}} k_i^t(s'_{-i})
• The 'best response' of agent i in this fictitious play is given by
  s_i^{t+1} = arg max_{s_i} Σ_{s_{-i}} q_i^t(s_{-i}) u_i(s_i, s_{-i})
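The weight update, empirical distribution, and best-response steps above can be sketched in a few lines (function names are my own):

```python
# Fictitious-play building blocks: weights over opponent strategies,
# the empirical distribution q, and the best response against q.
def update_weights(k, opponent_move):
    k[opponent_move] += 1          # add 1 to the weight of the played strategy
    return k

def empirical_dist(k):
    total = sum(k.values())
    return {s: w / total for s, w in k.items()}   # q_i^t(s_{-i})

def best_response(q, my_strategies, u):
    # arg max over own strategies of the expected payoff under q
    return max(my_strategies,
               key=lambda si: sum(q[sj] * u[(si, sj)] for sj in q))
```

With uniform initial weights on the example game, `best_response` returns a1 for A, matching the calculation on the next slide.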
An Example
• Consider the same 2x2 game example as before:

            B
          b1    b2
  A  a1   4,4   5,2
     a2   0,1   4,3

• Suppose we assign kA^0(b1) = kA^0(b2) = kB^0(a1) = kB^0(a2) = 1
• Then qA^0(b1) = qA^0(b2) = qB^0(a1) = qB^0(a2) = 0.5
• For A, choosing a1 gives
  qA^0(b1)·uA(a1, b1) + qA^0(b2)·uA(a1, b2) = 0.5*4 + 0.5*5 = 4.5
  while choosing a2 gives
  qA^0(b1)·uA(a2, b1) + qA^0(b2)·uA(a2, b2) = 0.5*0 + 0.5*4 = 2
• For B, choosing b1 gives
  qB^0(a1)·uB(a1, b1) + qB^0(a2)·uB(a2, b1) = 0.5*4 + 0.5*1 = 2.5
  while choosing b2 gives
  qB^0(a1)·uB(a1, b2) + qB^0(a2)·uB(a2, b2) = 0.5*2 + 0.5*3 = 2.5
• Clearly A plays a1; B can choose either b1 or b2 – assume B plays b2
Game proceeds.

                     stage 0
  A's selection      a1
  B's selection      b2
  A's payoff         5
  B's payoff         2

                      t = 0     t = 1
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33
Game proceeds..

                     stage 0   stage 1
  A's selection      a1        a1
  B's selection      b2        b1
  A's payoff         5         4
  B's payoff         2         4

                      t = 0     t = 1     t = 2
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33   2, 0.5
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67   2, 0.5
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67   3, 0.75
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33   1, 0.25
Game proceeds…

                     stage 0   stage 1   stage 2
  A's selection      a1        a1        a1
  B's selection      b2        b1        b1
  A's payoff         5         4         4
  B's payoff         2         4         4

                      t = 0     t = 1     t = 2     t = 3
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33   2, 0.5    3, 0.6
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67   2, 0.5    2, 0.4
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67   3, 0.75   4, 0.8
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33   1, 0.25   1, 0.2
Game proceeds….

                     stage 0   stage 1   stage 2   stage 3
  A's selection      a1        a1        a1        a1
  B's selection      b2        b1        b1        b1
  A's payoff         5         4         4         4
  B's payoff         2         4         4         4

                      t = 0     t = 1     t = 2     t = 3     t = 4
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33   2, 0.5    3, 0.6    4, 0.67
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67   2, 0.5    2, 0.4    2, 0.33
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67   3, 0.75   4, 0.8    5, 0.83
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33   1, 0.25   1, 0.2    1, 0.17
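The four stages tabulated above can be reproduced in a few lines of fictitious play; the only free choice is tie-breaking, set here to match the slides (B plays b2 when indifferent at stage 0). Function and variable names are my own:

```python
# Replay the worked fictitious-play example on the 2x2 game from the slides.
uA = {('a1', 'b1'): 4, ('a1', 'b2'): 5, ('a2', 'b1'): 0, ('a2', 'b2'): 4}
uB = {('a1', 'b1'): 4, ('a1', 'b2'): 2, ('a2', 'b1'): 1, ('a2', 'b2'): 3}
kA = {'b1': 1, 'b2': 1}   # A's weights on B's strategies
kB = {'a1': 1, 'a2': 1}   # B's weights on A's strategies

def br(weights, strategies, payoff):
    # Best response to the empirical distribution implied by the weights.
    total = sum(weights.values())
    q = {s: w / total for s, w in weights.items()}
    return max(strategies, key=lambda s: sum(q[o] * payoff(s, o) for o in q))

history = []
for stage in range(4):
    a = br(kA, ['a1', 'a2'], lambda s, o: uA[(s, o)])
    b = br(kB, ['b2', 'b1'], lambda s, o: uB[(o, s)])  # b2 first: slide's tie-break
    history.append((a, b))
    kA[b] += 1
    kB[a] += 1

print(history)  # [('a1', 'b2'), ('a1', 'b1'), ('a1', 'b1'), ('a1', 'b1')]
```

The final weights match the last column of the table: kA = {b1: 4, b2: 2} and kB = {a1: 5, a2: 1}, and play settles on the Nash equilibrium (a1, b1).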
Gradient Based Learning
• Fictitious play assumes unbounded computation is allowed in every step – the arg max calculation
• An alternative is to perform gradient ascent on some objective function – the expected payoff
• Two players – row and column – have payoff matrices

  R = | r11  r12 |        C = | c11  c12 |
      | r21  r22 |            | c21  c22 |

• The row player chooses action 1 with probability α, while the column player chooses action 1 with probability β
• The expected payoffs are
  Vr(α, β) = r11·αβ + r12·α(1−β) + r21·(1−α)β + r22·(1−α)(1−β)
  Vc(α, β) = c11·αβ + c12·α(1−β) + c21·(1−α)β + c22·(1−α)(1−β)
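The expected payoffs Vr and Vc are the same bilinear form in (α, β) applied to the two matrices. A direct transcription (the helper name V is mine), instantiated with the earlier example game's payoffs, with action 1 meaning a1/b1:

```python
# Expected payoff of a 2x2 matrix game under mixed strategies (alpha, beta).
def V(M, alpha, beta):
    (m11, m12), (m21, m22) = M
    return (m11 * alpha * beta + m12 * alpha * (1 - beta)
            + m21 * (1 - alpha) * beta + m22 * (1 - alpha) * (1 - beta))

R = [[4, 5], [0, 4]]   # row player's (A's) payoffs
C = [[4, 2], [1, 3]]   # column player's (B's) payoffs
print(V(R, 0.5, 0.5))  # 3.25: A's expected payoff under uniform play
```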
Gradient Ascent
• Each player repeatedly adjusts her half of the current strategy pair in the direction of the current gradient, with some step size η:
  α_{k+1} = α_k + η ∂Vr(α_k, β_k)/∂α
  β_{k+1} = β_k + η ∂Vc(α_k, β_k)/∂β
• In case these equations take the strategies outside the probability simplex, they are projected back to the boundary
• The gradient ascent algorithm assumes a full-information game – both players know the game matrices and can see the mixed strategy of their opponent in the previous step
• Writing u = (r11 + r22) − (r21 + r12) and u' = (c11 + c22) − (c21 + c12), the gradients are
  ∂Vr(α, β)/∂α = βu − (r22 − r12)
  ∂Vc(α, β)/∂β = αu' − (c22 − c21)
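A minimal sketch of this projected gradient ascent, using the closed-form gradients above; clipping each probability to [0, 1] stands in for the projection onto the simplex, and the step is run on the earlier 2x2 example, where it heads to the pure equilibrium (a1, b1):

```python
# One projected-gradient-ascent step for a 2x2 game, then iterate it.
def iga_step(alpha, beta, R, C, eta):
    u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])
    up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])
    dVr = beta * u - (R[1][1] - R[0][1])    # ∂Vr/∂α
    dVc = alpha * up - (C[1][1] - C[1][0])  # ∂Vc/∂β
    clip = lambda x: min(1.0, max(0.0, x))  # project back into [0, 1]
    return clip(alpha + eta * dVr), clip(beta + eta * dVc)

R = [[4, 5], [0, 4]]; C = [[4, 2], [1, 3]]
alpha, beta = 0.5, 0.5
for _ in range(200):
    alpha, beta = iga_step(alpha, beta, R, C, eta=0.05)
print(alpha, beta)  # 1.0 1.0 – the pure Nash equilibrium (a1, b1)
```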
Infinitesimal Gradient Ascent
• It is interesting to see what happens to the strategy pair and to the expected payoffs over time
• The strategy pair sequence produced by following a gradient ascent algorithm may never converge
• The average payoff of both players, however, always converges to that of some Nash pair
• Consider a small step size assumption – in the limit η → 0, the update equations become the continuous dynamics
  | ∂α/∂t |   | 0   u  | | α |   | −(r22 − r12) |
  |        | = |        | |   | + |              |
  | ∂β/∂t |   | u'  0  | | β |   | −(c22 − c21) |
• The point where the gradient is zero,
  (α*, β*) = ((c22 − c21)/u', (r22 − r12)/u),
  is a Nash equilibrium
• This point might even lie outside the probability simplex
IGA dynamics
• Denote by U the off-diagonal matrix containing u and u', i.e. U = [[0, u], [u', 0]]
• Depending on the nature of U (non-invertible, real or imaginary eigenvalues), the convergence dynamics vary
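The three cases follow directly from the eigenvalues of U = [[0, u], [u', 0]], which are ±√(u·u'): real when u·u' > 0, purely imaginary when u·u' < 0 (trajectories orbit the center), and zero when either entry vanishes (U not invertible). A small check (the u, u' values here are arbitrary illustrations):

```python
# Eigenvalues of the off-diagonal IGA matrix U = [[0, u], [u', 0]].
import cmath

def eigenvalues(u, up):
    # Characteristic polynomial is λ² − u·u' = 0, so λ = ±sqrt(u·u').
    root = cmath.sqrt(u * up)
    return root, -root

print(eigenvalues(3, 4))    # real ±sqrt(12)
print(eigenvalues(3, -4))   # purely imaginary ±j·sqrt(12)
print(eigenvalues(0, 4))    # 0, 0: U is not invertible
```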
WoLF – Win or Learn Fast
• Introduces a variable learning rate l_k instead of a fixed η:
  α_{k+1} = α_k + η l_k^r ∂Vr(α_k, β_k)/∂α
  β_{k+1} = β_k + η l_k^c ∂Vc(α_k, β_k)/∂β
• Let α^e be the equilibrium strategy selected by the row player and β^e the equilibrium strategy selected by the column player; then
  l_k^r = l_min if Vr(α_k, β_k) > Vr(α^e, β_k)   (Winning)
          l_max otherwise                         (Losing)
  l_k^c = l_min if Vc(α_k, β_k) > Vc(α_k, β^e)   (Winning)
          l_max otherwise                         (Losing)
• If, in a two-person, two-action, iterated general-sum game, both players follow the WoLF-IGA algorithm (with l_max > l_min), their strategies converge to a Nash equilibrium
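A sketch of one WoLF-IGA step under these definitions; the implementation details (clipping for projection, the default rate values, and the function names) are my own choices, not from the slides:

```python
# One WoLF-IGA step: the IGA gradient step, scaled by l_min when winning
# (current expected payoff beats the equilibrium strategy's payoff against
# the opponent's current strategy) and by l_max when losing.
def V(M, a, b):
    return (M[0][0] * a * b + M[0][1] * a * (1 - b)
            + M[1][0] * (1 - a) * b + M[1][1] * (1 - a) * (1 - b))

def wolf_step(alpha, beta, R, C, alpha_e, beta_e, eta, l_min=0.1, l_max=1.0):
    u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])
    up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])
    # WoLF rule: learn slowly when winning, fast when losing.
    lr = l_min if V(R, alpha, beta) > V(R, alpha_e, beta) else l_max
    lc = l_min if V(C, alpha, beta) > V(C, alpha, beta_e) else l_max
    dVr = beta * u - (R[1][1] - R[0][1])
    dVc = alpha * up - (C[1][1] - C[1][0])
    clip = lambda x: min(1.0, max(0.0, x))
    return clip(alpha + eta * lr * dVr), clip(beta + eta * lc * dVc)
```

On the earlier example game, iterating `wolf_step` with the equilibrium pair (α^e, β^e) = (1, 1) drives the strategies to the pure Nash equilibrium (a1, b1).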
WoLF-IGA convergence

[Figure: WoLF-IGA convergence plot]
To Conclude
• Learning in games is popular in anticipation of a future in which less-than-rational agents play a game repeatedly to arrive at a stable and efficient equilibrium
• The algorithmic structure and adaptive techniques involved in such learning are largely motivated by machine learning and adaptive filtering
• Fictitious play carries a heavy computational burden (an arg max at every step); a gradient-based approach relieves this burden but might suffer from convergence issues
• A stochastic gradient method (not discussed in this presentation) makes use of the minimal information available and still performs near-optimally