Adaptive Learning in Games

DESCRIPTION
This is the project presentation for the Adaptive Filter Theory course I took this winter.

TRANSCRIPT
EECS 463 Course Project
ADAPTIVE LEARNING IN GAMES
Suvarup Saha
3/11/2010
Outline
• Motivation
• Games
• Learning in Games
• Adaptive Learning
• Example
• Gradient Techniques
• Conclusion
Motivation
• Adaptive filtering techniques generalize to many applications outside filtering:
  • Gradient-based iterative search
  • Stochastic gradient
  • Least squares
• Applications of game theory to less-than-rational multi-agent scenarios demand self-learning mechanisms
• Adaptive techniques can be applied in such instances to help the agents learn the game and play intelligently
Games
• A game is an interaction between two or more self-interested agents
• Each agent i chooses a strategy s_i from a set of strategies S_i
• A (joint) strategy profile s is the set of chosen strategies, also called an outcome of the game in a single play
• Each agent has a utility function u_i(s), specifying its preference for each outcome in terms of a payoff
• An agent's best response is the strategy with the highest payoff, given its opponents' choice of strategies
• A Nash equilibrium is a strategy profile in which every agent's strategy is a best response to the others' choices
A Normal Form Game

            B
          b1    b2
  A  a1   4,4   5,2
     a2   0,1   4,3

• This is a 2-player game with SA = {a1, a2}, SB = {b1, b2}
• The u_i(s) are given explicitly in matrix form; for example, uA(a1, b2) = 5, uB(a1, b2) = 2
• The best response of A to B playing b2 is a1
• In this game, (a1, b1) is the unique Nash equilibrium
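The claim that (a1, b1) is the game's unique Nash equilibrium can be checked by enumeration. A minimal sketch (variable and function names are my own) encoding the payoff matrix above:

```python
# Encode the 2x2 game from the slide and enumerate its pure Nash equilibria.
import itertools

# Payoffs indexed by (A's action, B's action), copied from the matrix above.
uA = {('a1', 'b1'): 4, ('a1', 'b2'): 5, ('a2', 'b1'): 0, ('a2', 'b2'): 4}
uB = {('a1', 'b1'): 4, ('a1', 'b2'): 2, ('a2', 'b1'): 1, ('a2', 'b2'): 3}
SA, SB = ['a1', 'a2'], ['b1', 'b2']

def is_nash(a, b):
    # (a, b) is Nash iff a is a best response to b AND b is a best response to a.
    return (uA[(a, b)] == max(uA[(x, b)] for x in SA) and
            uB[(a, b)] == max(uB[(a, y)] for y in SB))

nash = [(a, b) for a, b in itertools.product(SA, SB) if is_nash(a, b)]
print(nash)  # [('a1', 'b1')]
```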
Learning in Games
• Classical approach: compute an optimal/equilibrium strategy
• Some criticisms of this approach:
  • Other agents' utilities might be unknown to an agent, preventing it from computing an equilibrium strategy
  • Other agents might not be playing an equilibrium strategy
  • Computing an equilibrium strategy might be hard
• Another approach: learn how to 'optimally' play a game by
  • playing it many times
  • updating the strategy based on experience
Learning Dynamics

[Diagram: a spectrum of learning dynamics ordered by the rationality/sophistication of the agents – Evolutionary Dynamics, Adaptive Learning, Bayesian Learning – with Adaptive Learning marked as the focus of our discussion]
Evolutionary Dynamics
• Inspired by evolutionary biology, with no appeal to rationality of the agents
• An entire population of agents is programmed to use some strategy; players are randomly matched to play with each other
• Strategies with high payoff spread within the population by
  • Learning
  • Copying or inheriting strategies – Replicator Dynamics
  • Infection
• Stability analysis – Evolutionarily Stable Strategies (ESS): players playing an ESS must have strictly higher payoffs than a small group of invaders playing a different strategy
Bayesian Learning
• Assumes 'informed agents' playing repeated games with a finite action space
• Payoffs depend on characteristics of the agents represented by types – each agent's type is private information
• The agents' initial beliefs are given by a common prior distribution over agent types
• This belief is updated according to Bayes' rule to a posterior distribution at each stage of the game
• Every finite Bayesian game has at least one Bayesian Nash equilibrium, possibly in mixed strategies
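To make the prior-to-posterior step concrete, here is a small illustrative sketch. The types, actions, and likelihood numbers are invented for illustration (they are not from the presentation); only the Bayes'-rule mechanics matter:

```python
# One Bayes'-rule update of a belief over two hypothetical opponent types.
prior = {'aggressive': 0.5, 'cautious': 0.5}       # common prior over types
likelihood = {'aggressive': {'raise': 0.8, 'fold': 0.2},   # P(action | type),
              'cautious':   {'raise': 0.3, 'fold': 0.7}}   # made-up numbers

def bayes_update(prior, observed):
    # posterior(type) ∝ prior(type) * P(observed action | type)
    unnorm = {t: prior[t] * likelihood[t][observed] for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

posterior = bayes_update(prior, 'raise')
print(posterior)  # aggressive ≈ 0.727, cautious ≈ 0.273
```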
Adaptive Learning
• Agents are not fully rational, but can learn through experience and adapt their strategies
• Agents do not know the reward structure of the game
• Agents are only able to take actions and observe their own rewards (and possibly their opponents' rewards as well)
• Popular examples:
  • Best Response Update
  • Fictitious Play
  • Regret Matching
  • Infinitesimal Gradient Ascent (IGA)
  • Dynamic Gradient Play
  • Adaptive Play Q-learning
Fictitious Play
• The learning process is used to develop a 'historical distribution' of the other agents' play
• In fictitious play, agent i has an exogenous initial weight function k_i^0 : S_{-i} → R+
• The weight is updated by adding 1 to the weight of each opponent strategy, each time it is played
• The probability that player i assigns to player -i playing s_{-i} at date t is given by
  q_i^t(s_{-i}) = k_i^t(s_{-i}) / Σ_{s'_{-i}} k_i^t(s'_{-i})
• The 'best response' of agent i in this fictitious play is given by
  s_i^{t+1} = arg max_{s_i} Σ_{s_{-i}} q_i^t(s_{-i}) u_i(s_i, s_{-i})
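The weight update, empirical distribution, and best-response steps above can be sketched in a few lines (function names are my own):

```python
# Fictitious-play building blocks: weights over opponent strategies,
# the empirical distribution q, and the best response against q.
def update_weights(k, opponent_move):
    k[opponent_move] += 1          # add 1 to the weight of the played strategy
    return k

def empirical_dist(k):
    total = sum(k.values())
    return {s: w / total for s, w in k.items()}   # q_i^t(s_{-i})

def best_response(q, my_strategies, u):
    # arg max over own strategies of the expected payoff under q
    return max(my_strategies,
               key=lambda si: sum(q[sj] * u[(si, sj)] for sj in q))
```

With uniform initial weights on the example game, `best_response` returns a1 for A, matching the calculation on the next slide.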
An Example
• Consider the same 2x2 game example as before:

            B
          b1    b2
  A  a1   4,4   5,2
     a2   0,1   4,3

• Suppose we assign kA^0(b1) = kA^0(b2) = kB^0(a1) = kB^0(a2) = 1
• Then qA^0(b1) = qA^0(b2) = qB^0(a1) = qB^0(a2) = 0.5
• For A, choosing a1 gives
  qA^0(b1)·uA(a1, b1) + qA^0(b2)·uA(a1, b2) = 0.5*4 + 0.5*5 = 4.5
  while choosing a2 gives
  qA^0(b1)·uA(a2, b1) + qA^0(b2)·uA(a2, b2) = 0.5*0 + 0.5*4 = 2
• For B, choosing b1 gives
  qB^0(a1)·uB(a1, b1) + qB^0(a2)·uB(a2, b1) = 0.5*4 + 0.5*1 = 2.5
  while choosing b2 gives
  qB^0(a1)·uB(a1, b2) + qB^0(a2)·uB(a2, b2) = 0.5*2 + 0.5*3 = 2.5
• Clearly A plays a1; B can choose either b1 or b2 – assume B plays b2
Game proceeds.

                     stage 0
  A's selection      a1
  B's selection      b2
  A's payoff         5
  B's payoff         2

                      t = 0     t = 1
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33
Game proceeds..

                     stage 0   stage 1
  A's selection      a1        a1
  B's selection      b2        b1
  A's payoff         5         4
  B's payoff         2         4

                      t = 0     t = 1     t = 2
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33   2, 0.5
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67   2, 0.5
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67   3, 0.75
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33   1, 0.25
Game proceeds…

                     stage 0   stage 1   stage 2
  A's selection      a1        a1        a1
  B's selection      b2        b1        b1
  A's payoff         5         4         4
  B's payoff         2         4         4

                      t = 0     t = 1     t = 2     t = 3
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33   2, 0.5    3, 0.6
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67   2, 0.5    2, 0.4
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67   3, 0.75   4, 0.8
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33   1, 0.25   1, 0.2
Game proceeds….

                     stage 0   stage 1   stage 2   stage 3
  A's selection      a1        a1        a1        a1
  B's selection      b2        b1        b1        b1
  A's payoff         5         4         4         4
  B's payoff         2         4         4         4

                      t = 0     t = 1     t = 2     t = 3     t = 4
  kA^t(b1), qA^t(b1)  1, 0.5    1, 0.33   2, 0.5    3, 0.6    4, 0.67
  kA^t(b2), qA^t(b2)  1, 0.5    2, 0.67   2, 0.5    2, 0.4    2, 0.33
  kB^t(a1), qB^t(a1)  1, 0.5    2, 0.67   3, 0.75   4, 0.8    5, 0.83
  kB^t(a2), qB^t(a2)  1, 0.5    1, 0.33   1, 0.25   1, 0.2    1, 0.17
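The four stages tabulated above can be reproduced in a few lines of fictitious play; the only free choice is tie-breaking, set here to match the slides (B plays b2 when indifferent at stage 0). Function and variable names are my own:

```python
# Replay the worked fictitious-play example on the 2x2 game from the slides.
uA = {('a1', 'b1'): 4, ('a1', 'b2'): 5, ('a2', 'b1'): 0, ('a2', 'b2'): 4}
uB = {('a1', 'b1'): 4, ('a1', 'b2'): 2, ('a2', 'b1'): 1, ('a2', 'b2'): 3}
kA = {'b1': 1, 'b2': 1}   # A's weights on B's strategies
kB = {'a1': 1, 'a2': 1}   # B's weights on A's strategies

def br(weights, strategies, payoff):
    # Best response to the empirical distribution implied by the weights.
    total = sum(weights.values())
    q = {s: w / total for s, w in weights.items()}
    return max(strategies, key=lambda s: sum(q[o] * payoff(s, o) for o in q))

history = []
for stage in range(4):
    a = br(kA, ['a1', 'a2'], lambda s, o: uA[(s, o)])
    b = br(kB, ['b2', 'b1'], lambda s, o: uB[(o, s)])  # b2 first: slide's tie-break
    history.append((a, b))
    kA[b] += 1
    kB[a] += 1

print(history)  # [('a1', 'b2'), ('a1', 'b1'), ('a1', 'b1'), ('a1', 'b1')]
```

The final weights match the last column of the table: kA = {b1: 4, b2: 2} and kB = {a1: 5, a2: 1}, and play settles on the Nash equilibrium (a1, b1).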
Gradient Based Learning
• Fictitious play assumes unbounded computation is allowed in every step – the arg max calculation
• An alternative is to perform gradient ascent on some objective function – the expected payoff
• Two players – row and column – have payoff matrices

  R = | r11  r12 |        C = | c11  c12 |
      | r21  r22 |            | c21  c22 |

• The row player chooses action 1 with probability α, while the column player chooses action 1 with probability β
• The expected payoffs are
  Vr(α, β) = r11·αβ + r12·α(1−β) + r21·(1−α)β + r22·(1−α)(1−β)
  Vc(α, β) = c11·αβ + c12·α(1−β) + c21·(1−α)β + c22·(1−α)(1−β)
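The expected payoffs Vr and Vc are the same bilinear form in (α, β) applied to the two matrices. A direct transcription (the helper name V is mine), instantiated with the earlier example game's payoffs, with action 1 meaning a1/b1:

```python
# Expected payoff of a 2x2 matrix game under mixed strategies (alpha, beta).
def V(M, alpha, beta):
    (m11, m12), (m21, m22) = M
    return (m11 * alpha * beta + m12 * alpha * (1 - beta)
            + m21 * (1 - alpha) * beta + m22 * (1 - alpha) * (1 - beta))

R = [[4, 5], [0, 4]]   # row player's (A's) payoffs
C = [[4, 2], [1, 3]]   # column player's (B's) payoffs
print(V(R, 0.5, 0.5))  # 3.25: A's expected payoff under uniform play
```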
Gradient Ascent
• Each player repeatedly adjusts her half of the current strategy pair in the direction of the current gradient, with some step size η:
  α_{k+1} = α_k + η ∂Vr(α_k, β_k)/∂α
  β_{k+1} = β_k + η ∂Vc(α_k, β_k)/∂β
• In case these equations take the strategies outside the probability simplex, they are projected back to the boundary
• The gradient ascent algorithm assumes a full-information game – both players know the game matrices and can see the mixed strategy of their opponent in the previous step
• Writing u = (r11 + r22) − (r21 + r12) and u' = (c11 + c22) − (c21 + c12), the gradients are
  ∂Vr(α, β)/∂α = βu − (r22 − r12)
  ∂Vc(α, β)/∂β = αu' − (c22 − c21)
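A minimal sketch of this projected gradient ascent, using the closed-form gradients above; clipping each probability to [0, 1] stands in for the projection onto the simplex, and the step is run on the earlier 2x2 example, where it heads to the pure equilibrium (a1, b1):

```python
# One projected-gradient-ascent step for a 2x2 game, then iterate it.
def iga_step(alpha, beta, R, C, eta):
    u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])
    up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])
    dVr = beta * u - (R[1][1] - R[0][1])    # ∂Vr/∂α
    dVc = alpha * up - (C[1][1] - C[1][0])  # ∂Vc/∂β
    clip = lambda x: min(1.0, max(0.0, x))  # project back into [0, 1]
    return clip(alpha + eta * dVr), clip(beta + eta * dVc)

R = [[4, 5], [0, 4]]; C = [[4, 2], [1, 3]]
alpha, beta = 0.5, 0.5
for _ in range(200):
    alpha, beta = iga_step(alpha, beta, R, C, eta=0.05)
print(alpha, beta)  # 1.0 1.0 – the pure Nash equilibrium (a1, b1)
```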
Infinitesimal Gradient Ascent
• It is interesting to see what happens to the strategy pair and to the expected payoffs over time
• The strategy pair sequence produced by following a gradient ascent algorithm may never converge
• The average payoff of both players, however, always converges to that of some Nash pair
• Consider a small step size assumption – in the limit η → 0, the update equations become the continuous dynamics
  | ∂α/∂t |   | 0   u  | | α |   | −(r22 − r12) |
  |        | = |        | |   | + |              |
  | ∂β/∂t |   | u'  0  | | β |   | −(c22 − c21) |
• The point where the gradient is zero,
  (α*, β*) = ((c22 − c21)/u', (r22 − r12)/u),
  is a Nash equilibrium
• This point might even lie outside the probability simplex
IGA dynamics
• Denote by U the off-diagonal matrix containing u and u', i.e. U = [[0, u], [u', 0]]
• Depending on the nature of U (non-invertible, real or imaginary eigenvalues), the convergence dynamics vary
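The three cases follow directly from the eigenvalues of U = [[0, u], [u', 0]], which are ±√(u·u'): real when u·u' > 0, purely imaginary when u·u' < 0 (trajectories orbit the center), and zero when either entry vanishes (U not invertible). A small check (the u, u' values here are arbitrary illustrations):

```python
# Eigenvalues of the off-diagonal IGA matrix U = [[0, u], [u', 0]].
import cmath

def eigenvalues(u, up):
    # Characteristic polynomial is λ² − u·u' = 0, so λ = ±sqrt(u·u').
    root = cmath.sqrt(u * up)
    return root, -root

print(eigenvalues(3, 4))    # real ±sqrt(12)
print(eigenvalues(3, -4))   # purely imaginary ±j·sqrt(12)
print(eigenvalues(0, 4))    # 0, 0: U is not invertible
```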
WoLF – Win or Learn Fast
• Introduces a variable learning rate l_k instead of a fixed η:
  α_{k+1} = α_k + η l_k^r ∂Vr(α_k, β_k)/∂α
  β_{k+1} = β_k + η l_k^c ∂Vc(α_k, β_k)/∂β
• Let α^e be the equilibrium strategy selected by the row player and β^e the equilibrium strategy selected by the column player; then
  l_k^r = l_min if Vr(α_k, β_k) > Vr(α^e, β_k)   (Winning)
          l_max otherwise                         (Losing)
  l_k^c = l_min if Vc(α_k, β_k) > Vc(α_k, β^e)   (Winning)
          l_max otherwise                         (Losing)
• If, in a two-person, two-action, iterated general-sum game, both players follow the WoLF-IGA algorithm (with l_max > l_min), their strategies converge to a Nash equilibrium
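A sketch of one WoLF-IGA step under these definitions; the implementation details (clipping for projection, the default rate values, and the function names) are my own choices, not from the slides:

```python
# One WoLF-IGA step: the IGA gradient step, scaled by l_min when winning
# (current expected payoff beats the equilibrium strategy's payoff against
# the opponent's current strategy) and by l_max when losing.
def V(M, a, b):
    return (M[0][0] * a * b + M[0][1] * a * (1 - b)
            + M[1][0] * (1 - a) * b + M[1][1] * (1 - a) * (1 - b))

def wolf_step(alpha, beta, R, C, alpha_e, beta_e, eta, l_min=0.1, l_max=1.0):
    u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])
    up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])
    # WoLF rule: learn slowly when winning, fast when losing.
    lr = l_min if V(R, alpha, beta) > V(R, alpha_e, beta) else l_max
    lc = l_min if V(C, alpha, beta) > V(C, alpha, beta_e) else l_max
    dVr = beta * u - (R[1][1] - R[0][1])
    dVc = alpha * up - (C[1][1] - C[1][0])
    clip = lambda x: min(1.0, max(0.0, x))
    return clip(alpha + eta * lr * dVr), clip(beta + eta * lc * dVc)
```

On the earlier example game, iterating `wolf_step` with the equilibrium pair (α^e, β^e) = (1, 1) drives the strategies to the pure Nash equilibrium (a1, b1).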
WoLF-IGA convergence

[Figure: WoLF-IGA convergence plot]
To Conclude
• Learning in games is popular in anticipation of a future in which less-than-rational agents play a game repeatedly to arrive at a stable and efficient equilibrium
• The algorithmic structure and adaptive techniques involved in such learning are largely motivated by machine learning and adaptive filtering
• Fictitious play carries a heavy computational burden (an arg max at every step); a gradient-based approach relieves this burden but might suffer from convergence issues
• A stochastic gradient method (not discussed in this presentation) makes use of the minimal information available and still performs near-optimally