Nash Q-Learning for General-Sum Stochastic Games

Hu, J. and Wellman, M.P.
Lauri Lyly, [email protected]


Page 1: Nash Q-Learning for General-Sum Stochastic Games

   

Nash Q-Learning for General-Sum Stochastic Games

Hu, J. and Wellman, M.P.

Lauri Lyly, [email protected]

Page 2

   

Stochastic general-sum games

● Stochasticity: the environment is in part formed by the other agents
– nondeterministic, noncooperative in nature, no agreements between agents

● Arbitrary relation between the agents' rewards
– Extends the previous topic, zero-sum games

Page 3

   

Nash Equilibrium

● A best-response joint strategy
● The study is limited to stationary strategies (policies)
● The rewards of the other agents are perceived; their strategies are not
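In the paper's notation (reconstructed here; the slide's formula was an image), a joint policy (π¹*, …, πⁿ*) is a Nash equilibrium of the stochastic game if no agent can improve its discounted value by deviating unilaterally:

```latex
v^i(s, \pi^1_*, \ldots, \pi^n_*) \;\ge\; v^i(s, \pi^1_*, \ldots, \pi^i, \ldots, \pi^n_*)
\quad \text{for all } s,\ \text{all policies } \pi^i,\ \text{and all agents } i.
```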

Page 4

   

Nash Q-Values
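The formula on this slide survived only as an image; as defined in Hu and Wellman (2003), agent i's Nash Q-value is the current reward plus the discounted value of all agents playing a joint Nash equilibrium from the next period on:

```latex
Q^i_*(s, a^1, \ldots, a^n) \;=\; r^i(s, a^1, \ldots, a^n)
\;+\; \beta \sum_{s'} p(s' \mid s, a^1, \ldots, a^n)\, v^i(s', \pi^1_*, \ldots, \pi^n_*)
```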

Page 5

   

Stage game

● A one-period game, as opposed to a stochastic (multi-period) game

● Mainly used in the convergence proof
● A Nash equilibrium is defined for the stage game, where M is a “payoff function”:
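Reconstructed (the slide's formula was an image): a joint strategy (σ¹, …, σⁿ) is a Nash equilibrium of the stage game (M¹, …, Mⁿ) if no agent i gains by switching to an alternative strategy σ̂ⁱ:

```latex
M^i(\sigma^1, \ldots, \sigma^i, \ldots, \sigma^n) \;\ge\; M^i(\sigma^1, \ldots, \hat{\sigma}^i, \ldots, \sigma^n)
\quad \text{for all } \hat{\sigma}^i \text{ and all agents } i.
```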

Page 6

   

Update rule

● The same update rule is used for the agent itself and for its conjectures about the other agents' Q-functions

● The Q-functions can be initialized, for example, to 0
● Asynchronous updating: only the entries pertaining to the current state are updated
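The update rule itself was shown as an image on the slide; as given in the paper, each maintained Q-table j (the agent's own and its conjectures) is updated with

```latex
Q^j_{t+1}(s, a^1, \ldots, a^n) \;=\; (1 - \alpha_t)\, Q^j_t(s, a^1, \ldots, a^n)
\;+\; \alpha_t \left[ r^j_t + \beta\, \mathit{NashQ}^j_t(s') \right]
```

where NashQ^j_t(s') is agent j's payoff in a Nash equilibrium of the stage game defined by the current Q-tables at the next state s'.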

Page 7

   

The Nash Q-learning algorithm
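The algorithm listing on this slide was an image. Below is a minimal runnable sketch of one Nash-Q update step for two agents, assuming one tabular Q-function per state and restricting the stage-game solver to pure-strategy equilibria for simplicity (the paper computes mixed-strategy equilibria with the Lemke-Howson method); `pure_nash` and `nash_q_update` are names introduced here for illustration.

```python
import numpy as np

def pure_nash(M1, M2):
    """Return a pure-strategy Nash equilibrium (a1, a2) of the stage game
    with payoff matrices M1 (agent 1) and M2 (agent 2), or None."""
    n1, n2 = M1.shape
    for a1 in range(n1):
        for a2 in range(n2):
            # a1 is a best response to a2, and a2 is a best response to a1
            if M1[a1, a2] >= M1[:, a2].max() and M2[a1, a2] >= M2[a1, :].max():
                return a1, a2
    return None

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha, beta):
    """One asynchronous Nash-Q step: only the entry for the current
    state-action tuple is changed, in both maintained Q-tables."""
    eq = pure_nash(Q1[s_next], Q2[s_next])  # stage game induced at s_next
    nash1, nash2 = Q1[s_next][eq], Q2[s_next][eq]
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + beta * nash1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + beta * nash2)
```

The pure-strategy search only succeeds when such an equilibrium exists; the full algorithm also needs an exploration policy and a decaying learning-rate schedule per state-action entry.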

Page 8

   

Convergence proof requirements

● Assumption 1: every state-action tuple is visited infinitely often

● Assumption 2: the learning rate alpha(t) satisfies:
– the sum of the learning rates diverges to infinity
– the sum of their squares converges
– alpha(t) = 0 if the entry being updated does not correspond to the current state-action tuple (asynchronous updating)
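Written out, the learning-rate conditions are the standard stochastic-approximation requirements:

```latex
\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty
```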

Page 9

   

Proof basis and result

● The Q-learning process is updated by a pseudo-contraction operator P(t), using the usual form:

Q(t+1) = (1 - alpha(t)) Q(t) + alpha(t) [P(t) Q(t)]

– Contraction: the values approach the optimal Q
● This links the stage games to the stochastic game
● Goal: show that the NashQ operator is a pseudo-contraction operator
● It is in fact a real contraction operator under restricted conditions
– games with special types of Nash equilibrium points
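For reference (reconstructed from the standard convergence framework the paper builds on), a pseudo-contraction operator P_t satisfies, for some factor γ < 1 and a nonnegative sequence λ_t → 0,

```latex
\left\| P_t Q - P_t Q^* \right\| \;\le\; \gamma \left\| Q - Q^* \right\| + \lambda_t
```

with λ_t = 0 giving an ordinary contraction toward the fixed point Q*.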

Page 10

   

Different Nash equilibria

● Global optimal point (expressed in stage-game notation)

● Saddle point

● All equilibria chosen for the updates must be of the same type
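The defining conditions, reconstructed here since the slide showed them as formulas: a joint strategy σ is a global optimal point if it maximizes every agent's payoff, and a saddle point if it is a Nash equilibrium from which a deviation by the other agents can only help agent i:

```latex
\text{global optimum:}\quad M^i(\sigma) \ge M^i(\hat{\sigma}) \;\;\text{for all } i \text{ and all } \hat{\sigma};
\qquad
\text{saddle point:}\quad M^i(\sigma^i, \sigma^{-i}) \le M^i(\sigma^i, \hat{\sigma}^{-i}) \;\;\text{for all } \hat{\sigma}^{-i}.
```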

Page 11

   

Stage games compared

Two example stage games (payoff matrices shown only as images in the original slides): one whose equilibrium (Up, Left) is a global optimal point, and one whose equilibrium (Down, Right) is a saddle point.

Page 12

   

Experimentation framework

● Two grid-world games

● Motivation for grid games: state-specific action sets, qualitative state transitions, and both immediate and long-term rewards

Page 13

   

Learning process

● The experiments violate Assumption 3, the monotonic selection of global optima or saddle points

● Learning still converges in most cases regardless of the selection
● Offline and online learning are rated separately

Pages 14-16: experimental results (figures only; not reproduced in this transcript)

Conclusions

● No current method provides performance guarantees for general­sum stochastic games

● Nash Q-learning works as a starting point
– other promising variants exist
– the Nash equilibrium concept itself can be refined

Page 17

   

References

● Hu, J. and Wellman, M.P. (2003). Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research 4:1039–1069.