Slide 1: Multi-agent learning: Reinforcement Learning

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on February 13th, 2012.

Slide 2: Reinforcement learning: motivation

• Nash equilibria of repeated games constitute a static analysis. Dynamical analysis: how do (or should) players develop their strategies and behaviour in a repeated game?

“Do”: descriptive / economics; “should”: normative / agent theory.

• Reinforcement learning (RL) is a rudimentary learning technique.

1. RL is stimulus-response: it plays actions with the highest past payoff.

2. It is myopic: it is only interested in immediate success.

• Reinforcement learning can be applied to learning in games.

• When computer scientists mention RL, they usually mean multi-state RL.

• Single-state RL already has interesting and theoretically important

properties, especially when it is coupled to games.

Slide 3: Plan for today

Part I: Single-state RL. Parts of Ch. 2 of Sutton & Barto (1998), “Evaluative Feedback”: ǫ-greedy, optimistic, value-based, proportional.

Part II: Single-state RL in games. First half of Ch. 2 of Peyton Young (2004):

“Reinforcement and Regret”.

1. By average: (1/n)·r_1 + · · · + (1/n)·r_n.

2. With discounted past: γ^{n-1}·r_1 + γ^{n-2}·r_2 + · · · + γ·r_{n-1} + r_n.

3. With an aspiration level (Sutton et al.: “reference reward”).

Part III: Convergence to dominant strategies. Beginning of Beggs (2005): “On the

Convergence of Reinforcement Learning”.

                #Players   #Actions   Result
☞ Theorem 1:        1          2       Pr(dominant action) = 1
  Theorem 2:        1         ≥ 2      Pr(sub-dominant actions) = 0
  Theorem 3:       ≥ 1        ≥ 2      Pr(dom) = 1, Pr(sub-dom) = 0

Slide 4: Part I: Single-state reinforcement learning

Slide 5: Exploration vs. exploitation

Problem. You are at the beginning of a new study year. Every fellow

student is interesting as a possible new friend.

How do you divide your time between your classmates to optimise your

happiness?

Strategies:

A. You make friends when- and wherever possible. You could be called an explorer.
B. You stick to the nearest fellow student. You could be called an exploiter.
C. What most people do: first explore, then “exploit”.

We ignore:

1. How quality of friendships is measured.

2. How changing personalities of friends (so-called “moving targets”) are

dealt with.

Slide 6: An array of N slot machines

Slide 7: Exploitation vs. exploration

Given. An array of N slot

machines.

Suppose the yield of every machine is

normally distributed with mean and

variance unknown to us.

Random questions:

1. How long do you stick with your

first slot machine?

2. When do you leave the second?

3. If machine A so far yields more

than machine B, then would you

explore B ever again?

4. Try many machines, or opt for

security?

Slide 8: Experiment

Yield Machine 1   Yield Machine 2   Yield Machine 3
       2                 4                 2
       2                 5                 1
       3                 4
       6

Slide 9: The N-armed bandit problem

Barto & Sutton: the N-armed bandit.

Slide 10: Computation of the quality (offline version)

A reasonable measure for the quality of a slot machine after n tries would be the average profit.

Formula for the quality of a slot machine after n tries.

Q_n =_Def (r_1 + · · · + r_n) / n

Simple formula, but:

– Every time Qn is computed, all values r1, . . . , rn must be retrieved.

– The idea is to draw conclusions only if you have all the data.

– The data is processed in batch.

– Learning proceeds offline.

Slide 11: Computation of the quality (online version)

Q_n = (r_1 + · · · + r_n)/n

    = (r_1 + · · · + r_{n-1})/n + r_n/n

    = [ (r_1 + · · · + r_{n-1})/(n-1) ] · (n-1)/n + r_n/n

    = Q_{n-1} · (n-1)/n + r_n/n

    = Q_{n-1} - (1/n)·Q_{n-1} + (1/n)·r_n

    = Q_{n-1} + (1/n)·( r_n - Q_{n-1} ).

In words: new value = old value + learning rate · error, where Q_{n-1} is the old value, 1/n is the learning rate, r_n is the goal value, (r_n - Q_{n-1}) is the error, and the last term as a whole is the correction.
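To make the update rule concrete, here is a minimal Python sketch of the online (incremental) average; the class name and interface are illustrative, not from the slides.

```python
class OnlineQuality:
    """Tracks Q_n, the average reward seen so far, updated incrementally."""

    def __init__(self):
        self.n = 0      # number of rewards seen
        self.q = 0.0    # current estimate Q_n

    def update(self, reward: float) -> float:
        # Q_n = Q_{n-1} + (1/n) * (r_n - Q_{n-1})
        self.n += 1
        self.q += (reward - self.q) / self.n
        return self.q

# The running estimate equals the batch average at every step:
q = OnlineQuality()
for r in [2, 2, 3, 6]:
    q.update(r)
print(q.q)  # 3.25 == (2 + 2 + 3 + 6) / 4
```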

Slide 12: Progress of quality Q_n

– Amplitude of correction is determined by the learning rate.

– Here, the learning rate is 1/n and decreases through time.

Slide 13: Exploration: ǫ-greedy exploration

ǫ-greedy exploration. Let 0 < ǫ ≤ 1 be close to 0.

1. A fraction (1 − ǫ) of the time, choose an optimal (greedy) action.

2. At other times (a fraction ǫ of the time), choose a random action.

– Item 1: exploitation.

– Item 2: exploration.

– With probability one, every action is explored infinitely many times.

(Why?)

– Is it guaranteed that every action is explored infinitely many times?

– It would be an idea to explore sub-optimal actions with relatively high reward more often. However, that is not how ǫ-greedy exploration works . . .

. . . and we may lose convergence to optimal actions . . .
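A minimal Python sketch of ǫ-greedy selection on an N-armed bandit, combined with the online average from Slide 11; the Gaussian arm model and all numbers are illustrative assumptions, not part of the slides.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random arm), otherwise exploit (greedy arm)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))          # exploration
    best = max(q_values)
    return random.choice([i for i, q in enumerate(q_values) if q == best])  # exploitation, ties at random

# Tiny bandit run: arm i pays out N(means[i], 1); quality estimates are incremental averages.
means = [1.0, 2.0, 1.5]
Q = [0.0] * len(means)
counts = [0] * len(means)
for _ in range(10_000):
    a = epsilon_greedy(Q, epsilon=0.1)
    r = random.gauss(means[a], 1.0)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]      # online average from the previous slide
print(max(range(len(Q)), key=Q.__getitem__))   # almost always arm 1 (true mean 2.0)
```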

Slide 14: Optimistic initial values

An alternative to ǫ-greedy is to work with optimistic initial values.

1. At the outset, an unrealistically high quality is attributed to every slot machine:

   Q^k_0 = high,   for 1 ≤ k ≤ N.

2. As usual, for every slot machine

its average profit is maintained.

3. Without exception, always exploit

machines with highest Q-values.

Random questions:

q1: Initially, many actions are tried

⇒ all actions are tried?

q2: How high should “high” be?

q3: What to do in case of ties (more

than one optimal machine)?

q4: Can we speak of exploration?

q5: Is optimism (as a method)

suitable to explore an array of

(possibly) infinitely many slot

machines? Why (not)?

q6: ǫ-greedy: Pr( every action is

explored infinitely many

times ) = 1. Also with optimism?
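A minimal Python sketch of the optimistic-initial-values scheme, under the same illustrative Gaussian-bandit assumptions as above; the value of “high” and the treatment of the prior as one pseudo-observation are choices of this sketch, not of the slides.

```python
import random

means = [1.0, 2.0, 1.5]            # unknown to the learner
HIGH = 10.0                        # "unrealistically high" initial quality (how high is a design choice)
Q = [HIGH] * len(means)
counts = [1] * len(means)          # count the optimistic value as one pseudo-observation,
                                   # so that it is averaged away gradually (assumption of this sketch)

for _ in range(10_000):
    best = max(Q)
    a = random.choice([i for i, q in enumerate(Q) if q == best])   # always exploit; ties broken at random
    r = random.gauss(means[a], 1.0)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]   # averaging pulls the optimistic estimate down toward the true mean

print(counts)   # every arm gets tried early on; afterwards arm 1 (mean 2.0) tends to dominate
```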

Slide 15: Optimistic initial values vs. ǫ-greedy

From: “Reinforcement Learning (...)”, Sutton and Barto, Sec. 2.8, p. 41.

Slide 16: Maintaining and exploring friendships: strategies

ǫ-Greedy Spend most of your time with your best friends (greedy). Occasionally spend a fraction ǫ of your time on exploring random friendships.

Optimistic In the beginning, foster

(unreasonably) high expectations

of everyone. You will be

disappointed many times. Adapt

your expectations based on

experience. Always spend time

with your best friends.

Values (cf. Sutton et al.) Let 0 < α ≪ 1. In the beginning rate everyone with a 6, say. If a friendship rated r involves a new experience e ∈ [0, 10], then for example

   r_new = r_old + sign(e)·α.

(Watch the boundaries!) Another method:

   r_new = (1 − α)·r_old + α·e.

Proportions Give everyone equal

attention in the beginning. If

there is a positive experience,

then give that person a little more

attention in the future. (Similarly

with negative experiences.)
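As an illustration of the second value-update rule above, a minimal Python sketch of r_new = (1 − α)·r_old + α·e; the ratings and the stream of experiences are made up.

```python
def value_update(r_old: float, e: float, alpha: float = 0.1) -> float:
    """Move the rating a fraction alpha of the way toward the new experience e."""
    return (1 - alpha) * r_old + alpha * e

r = 6.0                      # initial rating
for e in [9, 9, 2, 9, 9]:    # experiences in [0, 10]
    r = value_update(r, e)
print(round(r, 2))           # the rating drifts toward the typical experience
```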

Slide 17: Part II: Single-state reinforcement learning in games

Slide 18: Proportional techniques: basic setup

• There are two players: A (the

protagonist) and B (the

antagonist, sometimes “nature”).

• Play proceeds in (possibly an

infinite number of) rounds

1, . . . , t, . . . .

• Identifiers X and Y denote finite

sets of possible actions.

• Each round, t, players A and B

choose actions x ∈ X and y ∈ Y,

respectively:

(x1, y1), (x2, y2), . . . , (xt, yt), . . . .

• A’s payoff is given by a fixed

function

u : X × Y → R.

In other words, A’s payoff matrix

is known.

• It follows that payoffs are time-homogeneous, i.e.,

   (x_s, y_s) = (x_t, y_t)  ⇒  u(x_s, y_s) = u(x_t, y_t).

Slide 19: Propensity, and mixed strategy of play

• Let t ≥ 0. The propensity of A to play x at t is denoted by θ^t_x.

• A simple model of propensity is cumulative payoff matching (CPM):

   θ^{t+1}_x = θ^t_x + u(x, y)   if x is played at round t,
   θ^{t+1}_x = θ^t_x             otherwise.

• The vector of initial propensities, θ^0, is not the result of play.

• As a vector: θ^{t+1} = θ^t + u_t·e^t, where e^t_x =_Def 1 if x is played at t, and 0 otherwise.

• A plausible mixed strategy is to play, at round t, the normalised propensities (q^t_x)_{x∈X}, where

   q^t_x =_Def θ^t_x / Σ_{x′∈X} θ^t_{x′}.
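A minimal Python sketch of cumulative payoff matching against a stationary opponent; the payoff matrix, the opponent's mixed strategy and the initial propensities are invented for illustration.

```python
import random

# Illustrative setup: A has two actions; B ("nature") plays a fixed mixed strategy q*.
U = [[4.0, 2.0],          # u(x, y): payoffs of A's action 0 against B's actions 0 and 1
     [2.0, 1.0]]          # payoffs of A's action 1
q_star = [0.5, 0.5]
theta = [1.0, 1.0]        # strictly positive initial propensities theta^0

for t in range(50_000):
    q = [th / sum(theta) for th in theta]              # mixed strategy = normalised propensities
    x = random.choices(range(2), weights=q)[0]
    y = random.choices(range(2), weights=q_star)[0]
    theta[x] += U[x][y]                                # CPM: only the played action is reinforced

print([round(th / sum(theta), 3) for th in theta])
# The weight of action 1 (not a best response to q*) should shrink toward 0, albeit slowly.
```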

Slide 20: An example

The total accumulated payoff at round t, the sum Σ_{x∈X} θ^t_x, is abbreviated v_t.

      θ^0 |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 | θ^15
x_1     1 |  8  3  0  0  0  7  4  0  1  0  0  0  1  0  0 |  29
x_2     1 |  0  0  6  0  5  0  0  0  0  6  0  0  0  8  0 |  26
x_3     1 |  0  0  0  9  0  0  0  9  0  0  2  2  0  0  8 |  31
                                                    v_15 =  86

Remarks:

• It is the cumulative payoff from each action that matters, not the average

payoff. (There is a difference!)

• In this example, it is assumed that the initial propensities θ^0_x are one. In general, they could be anything. (But ‖θ^0‖ = 0 is not very useful.)

• Alternatively, as a scalar: v_t = Σ_{x∈X} θ^0_x + Σ_{s≤t} u_s.

Slide 21: Dynamics of the mixed strategy

We can obtain further insight into the dynamics of the process through the change of the mixed strategy:

Δq^t_x = q^t_x − q^{t−1}_x = θ^t_x / v_t − θ^{t−1}_x / v_{t−1}

       = ( v_{t−1}·θ^t_x − v_t·θ^{t−1}_x ) / ( v_{t−1}·v_t )

       = ( v_{t−1}·(θ^{t−1}_x + e^t_x·u_t) − (v_{t−1} + u_t)·θ^{t−1}_x ) / ( v_{t−1}·v_t )

       = ( v_{t−1}·e^t_x·u_t − u_t·θ^{t−1}_x ) / ( v_{t−1}·v_t )      [the terms v_{t−1}·θ^{t−1}_x cancel]

       = (u_t / v_t) · ( e^t_x − θ^{t−1}_x / v_{t−1} )

       = (u_t / v_t) · ( e^t_x − q^{t−1}_x ).

Slide 22: Dynamics of the mixed strategy: convergence

The dynamics of the mixed strategy in round t is given by

   Δq^t = (u_t / v_t)·(e^t − q^{t−1}),    or, on coordinate x:   Δq^t_x = (u_t / v_t)·(e^t_x − q^{t−1}_x).

We have

   ‖Δq^t‖ = (u_t / v_t)·‖e^t − q^{t−1}‖ ≤ (u_t / v_t)·2
          = 2·u_t / (u_1 + · · · + u_t) ≤ 2·max{u_s | s ≤ t} / ( t·min{u_s | s ≤ t} ) = C·(1/t).

Since all terms except v_t are bounded, lim_{t→∞} ‖Δq^t‖ = 0 (which by itself does not imply convergence).

Does q^t converge? If so, to the “right” (e.g., a Pareto-optimal) strategy? Beggs (2005) provides more clarity in certain circumstances.
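A quick numerical sanity check in Python (with made-up play and payoff sequences) that the recursion Δq^t = (u_t/v_t)·(e^t − q^{t−1}) reproduces the normalised propensities computed directly.

```python
theta0  = [1.0, 1.0, 1.0]
plays   = [0, 0, 1, 2, 1, 0, 0, 2, 0, 1]        # which action was played each round (made up)
payoffs = [8, 3, 6, 9, 5, 7, 4, 9, 1, 6]        # payoff u_t received in that round (made up)

theta = theta0[:]
q = [th / sum(theta) for th in theta]            # q^0
for x, u in zip(plays, payoffs):
    e = [1.0 if i == x else 0.0 for i in range(len(theta))]
    v_new = sum(theta) + u                                       # v_t, the total propensity after round t
    q = [qi + (u / v_new) * (ei - qi) for qi, ei in zip(q, e)]   # recursion for q^t
    theta[x] += u                                                # CPM update of the propensities

direct = [th / sum(theta) for th in theta]
print(all(abs(a - b) < 1e-12 for a, b in zip(q, direct)))        # True
```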

Slide 23: Abstraction of past payoffs t^p

In 1991 and 1993, B. Arthur proposed the following update formula:

   Δq^t = ( u_t / (C·t^p + u_t) )·(e^t − q^{t−1}).

Consequently,

   ‖Δq^t‖ ≤ 1/t^p.

Remarks:

• Arthur’s notation differs considerably from that of Peyton Young (2004).

• If the parameter p is set to, e.g., 2, then there is convergence. However . . .

• In related research, where the value of p is determined through

psychological experiments, it is estimated that p < 1.

B. Arthur (1993): “On Designing Economic Agents that Behave Like Human Agents”. In: Journal of Evolutionary Economics 3, pp. 1-22.
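A minimal Python sketch of one step of the update rule stated above; the constants C and p and the payoff value are illustrative.

```python
def arthur_update(q, x, u, t, C=1.0, p=2.0):
    """One step: move q toward the indicator of the played action x,
    with step size u_t / (C*t**p + u_t)."""
    step = u / (C * t**p + u)
    return [qi + step * ((1.0 if i == x else 0.0) - qi) for i, qi in enumerate(q)]

q = [1/3, 1/3, 1/3]
q = arthur_update(q, x=0, u=8.0, t=1)   # with p = 2 the step size shrinks like 1/t^2
print([round(v, 3) for v in q])         # still sums to 1
```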

Slide 24: Past payoffs at discount rate λ < 1

In 1995, Erev and Roth proposed the following update formula:

   θ^{t+1} = λ·θ^t + u_t·e^t.

Consequently,

   Δq^t = ( u_t / Σ_{s≤t} λ^{t−s}·u_s ) · (e^t − q^{t−1}).

(For simplicity, we assume ‖θ^0‖ = 0.)

Since

   ( Σ_{s≤t} λ^{t−s} )·min{u_s | s ≤ t}  ≤  Σ_{s≤t} λ^{t−s}·u_s  ≤  ( Σ_{s≤t} λ^{t−s} )·max{u_s | s ≤ t}

and since

   1 + λ + λ^2 + · · · + λ^{t−1} = (1 − λ^t)/(1 − λ)   for λ ≠ 1,

the mixed strategy tends to change at a rate ∼ 1 − λ.
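A minimal Python sketch of this discounted-propensity update; the payoff model, the value of λ and the initial propensities are illustrative (the slide's simplifying assumption ‖θ^0‖ = 0 is replaced by a small positive start so that play is well defined from round 1).

```python
import random

LAMBDA = 0.95
theta = [1.0, 1.0]                 # small positive start (assumption of this sketch)
mean_payoff = [2.0, 1.0]           # expected payoff of each action (made up); noise added below

for t in range(5_000):
    q = [th / sum(theta) for th in theta]
    x = random.choices(range(len(q)), weights=q)[0]
    u = mean_payoff[x] + random.random()       # bounded, strictly positive payoff
    theta = [LAMBDA * th for th in theta]      # discount the whole propensity vector ...
    theta[x] += u                              # ... and reinforce only the played action

print([round(th / sum(theta), 3) for th in theta])
# Recent payoffs dominate: the strategy keeps changing at a rate of roughly 1 - lambda.
```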

Slide 25: Past payoffs represented by an aspiration level

Assume an aspiration level a^t ∈ R at every round. (Intuition: the payoff with which one would be satisfied.)

Idea:

   u^t_x > a^t ⇒ positively reinforce action x,
   u^t_x < a^t ⇒ negatively reinforce action x.

Correspondingly, the mixed strategy evolves according to

   Δq^t = (u^t − a^t)·(e^t − q^{t−1}).

Typical definitions of the aspiration level:

• Average past payoff: a^t =_Def v_t/t. A.k.a. satisficing play (Crandall, 2005).

• Discounted past payoffs: a^t =_Def Σ_{s≤t} λ^{t−s}·u_s (Erev & Roth, 1995).

Börgers and Sarin (2000): “Naive Reinforcement Learning with Endogenous Aspirations”. In: International Economic Review 41, pp. 921-950.
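A minimal Python sketch of aspiration-based updating with the average-past-payoff aspiration; the payoff model is made up, and the clipping/renormalisation at the end of each step is a pragmatic safeguard of this sketch, not part of the slide's formula.

```python
import random

q = [0.5, 0.5]                      # mixed strategy over two actions
mean_payoff = [2.0, 1.0]            # expected payoffs (made up)
total, t = 0.0, 0

for _ in range(5_000):
    t += 1
    x = random.choices(range(len(q)), weights=q)[0]
    u = mean_payoff[x] + random.random()
    total += u
    a = total / t                                   # aspiration = average past payoff
    e = [1.0 if i == x else 0.0 for i in range(len(q))]
    q = [qi + (u - a) * (ei - qi) for qi, ei in zip(q, e)]
    # Keep q a valid probability vector (safeguard, see lead-in):
    q = [min(max(qi, 1e-6), 1.0) for qi in q]
    s = sum(q)
    q = [qi / s for qi in q]

print([round(qi, 3) for qi in q])   # tends to favour the action whose payoff exceeds the aspiration
```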

Slide 26: Adequacy of reinforcement learning

Does reinforcement learning lead to optimal behaviour against B?

If A and B both converged to optimal behaviour, i.e., to best responses, this would yield a Nash equilibrium.

Less demanding:

Does reinforcement learning converge to optimal behaviour in a stationary

(and, perhaps, stochastic) environment?

• A history is a finite sequence of actions ξt : (x1, y1), . . . , (xt, yt).

• A strategy is a function g : H → ∆(X) that maps histories to probability

distributions over X. Write

q^{t+1} =_Def g(ξ_t).

Slide 27: Optimality against stationary opponents

• Assume that B plays a fixed probability distribution q∗ ∈ ∆(Y).

• The combination of θ0, g and q∗ yields a realisation

ω = (x1, y1), . . . , (xt, yt), . . . .

• Define B(q∗) =Def {x ∈ X | x is a best response to q∗ }.

Definition. A strategy g is called optimal against q∗ if, with probability one,

   for all x ∉ B(q∗):  lim_{t→∞} q^t_x = 0.    (1)

In this case, the phrase “with probability one” means that the set of realisations satisfying (1) has probability one.

Theorem. Given finite action sets X and Y, cumulative payoff matching on

X is optimal against every stationary distribution on Y.

Peyton Young (2004, p. 17): “Its proof is actually quite involved (. . . )”. (FSs.)

Slide 28: Part III: Beggs, 2005

Slide 29: The learning model

Single-state proportional reinforcement learning (Erev & Roth, 1995).

As usual:

   A_i(n + 1) = A_i(n) + π_i(n + 1)   if action i is chosen,
   A_i(n + 1) = A_i(n)                 otherwise.

As usual:

   Pr_i(n + 1) = A_i(n) / Σ_{j=1}^m A_j(n).

The following two assumptions are crucial:

1. All past, current and future payoffs π_i(n) are bounded away from zero and bounded from above. More precisely, there are 0 < k_1 ≤ k_2 such that all payoffs lie in [k_1, k_2].

2. The initial propensities A_i(0) are strictly positive.
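A minimal Python simulation of this scheme with two actions whose conditional expected payoffs satisfy the dominance condition of Theorem 1 below; the payoff distributions are invented and kept inside [k_1, k_2].

```python
import random

A = [1.0, 1.0]                     # strictly positive initial propensities A_i(0)
k1, k2 = 1.0, 3.0                  # payoffs bounded away from 0 and bounded above

def payoff(i):
    # Action 0 dominates action 1 in expectation by a factor gamma = 2 > 1.
    lo, hi = (2.0, 3.0) if i == 0 else (1.0, 1.5)
    return random.uniform(lo, hi)  # always inside [k1, k2]

plays_of_0 = 0
N = 200_000
for n in range(N):
    p0 = A[0] / (A[0] + A[1])
    i = 0 if random.random() < p0 else 1
    A[i] += payoff(i)              # reinforce only the chosen action
    plays_of_0 += (i == 0)

print(round(A[0] / (A[0] + A[1]), 4), round(plays_of_0 / N, 4))
# Both the choice probability and the empirical frequency of the dominant action approach 1.
```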

Slide 30: Choice of actions

Lemma 1. Each action is chosen infinitely often with probability one.

Proof. From the above assumptions it follows that

   Pr_i(n + 1) = A_i(n) / Σ_{j=1}^m A_j(n) ≥ A_i(0) / ( Σ_{j=1}^m A_j(0) + n·k_2 ).

(This is the worst case for i: as if i was never chosen and, in all of the previous n rounds, the actions ≠ i received the maximum possible payoff.)

Apply the so-called conditional Borel-Cantelli lemma*: if {E_n}_n are events and

   Σ_{n=1}^∞ Pr(E_n | X_1, . . . , X_{n−1})

is unbounded, then the probability that infinitely many of the E_n occur is one. □

*A.k.a. the second Borel-Cantelli lemma, or the Borel-Cantelli-Lévy lemma (Shiryaev, p. 518).

Slide 31: Unboundedness of propensities, and convergence

Lemma 2. For each i, Ai tends to infinity with probability one.

Proof. For each i, action i is chosen infinitely often with probability one. Since the payoff per round is bounded from below by k_1, we have Σ_j k_1 ≤ A_i, where j runs over the rounds in which i is chosen; the left-hand side diverges. □

Now Lemma 1 + Lemma 2 + martingale theory suffice to prove convergence:

Suppose there are only two possible actions: a1 and a2. The expression

E[ π(ai) | history ]

denotes the expected payoff of action a_i, given the history of play up to and including the choice to play a_i itself.

Theorem 1. If

E[ π(a1) | history ] > γE[ π(a2) | history ] (2)

for some fixed γ > 1, then the probability that a1 will be played converges to one.

Slide 32: Convergence to the dominant action: proof in a nutshell

• If a_1 is dominant (as in Eq. 2), the objective is to show that

   A_2/A_1 (n) → 0,   a.s.

• To this end, it suffices to show that

   A_2^ǫ/A_1 (n) → C,   a.s.   (3)

  for some C and for some 1 < ǫ < γ (which is possible, since γ > 1). Then

   lim_{n→∞} A_2/A_1 (n) = lim_{n→∞} ( A_2^ǫ/A_1 ) · ( 1/A_2^{ǫ−1} )
                         = lim_{n→∞} A_2^ǫ/A_1 · lim_{n→∞} 1/A_2^{ǫ−1} = C · 0 = 0.

• To this end, Beggs shows that, for some N and all n ≥ N, and for all 1 < ǫ < γ, the process in (3) is a so-called non-negative super-martingale. (Explained in a moment.)

• It is known that every non-negative super-martingale converges to a finite limit with probability one. (Explained in a moment.)

Slide 33: Super-martingale

A super-martingale is a stochastic process in which the conditional expectation of

the next value, given the current and preceding values, is less than or equal to

the current value:

E[ Zn+1 | Z1, . . . , Zn ] ≤ Zn

Think of an unfair gambling game that proceeds in rounds.

1. Expectations decrease. Taking expectations on both sides yields

E[ Zn+1 ] ≤ E[ Zn ].

2. Expectations converge. From (1) and the monotone convergence theorem* it follows that the expectations of a non-negative super-martingale converge to a limit L somewhere in [0, E[ Z_1 ]].

3. Doob’s martingale convergence theorem: values converge a.s. Let {Z_n}_n be a martingale (or sub-martingale, or super-martingale) such that E[ |Z_n| ] is bounded. Then lim_{n→∞} Z_n exists a.s. and is finite.

*Ordinary mathematics.

Slide 34: To show that A_2^ǫ/A_1 is a non-negative super-martingale

E[ Δ( A_2^ǫ/A_1 )(n + 1) | history ]

   = Pr(1 | history) · E[ Δ( A_2^ǫ/A_1 )(n + 1) | 1, history ]
     + Pr(2 | history) · E[ Δ( A_2^ǫ/A_1 )(n + 1) | 2, history ]

   = A_1(n)/(A_1(n) + A_2(n)) · E[ A_2^ǫ(n)/(A_1(n) + π_1(n + 1)) − A_2^ǫ(n)/A_1(n) ]
     + A_2(n)/(A_1(n) + A_2(n)) · E[ (A_2(n) + π_2(n + 1))^ǫ/A_1(n) − A_2^ǫ(n)/A_1(n) ].

Slide 35: To show that A_2^ǫ/A_1 is a non-negative super-martingale

Taylor expansion with Lagrange remainder:

   f(x + h) = f(x) + h·f'(x) + (h^2/2!)·f''(x) + (h^3/3!)·f'''(x) + (h^4/4!)·f''''(x + θh)

for some θ ∈ (0, 1); the last term is the Lagrange remainder. (Of course, there is nothing special about order n = 4.)

Applied to f(x) = x^{−1} with n = 2 we obtain

   (x + h)^{−1} = x^{−1} + h·(−x^{−2}) + (h^2/2!)·2·(x + θh)^{−3}
                = x^{−1} − h·x^{−2} + h^2·(x + θh)^{−3}
                = 1/x − h/x^2 + h^2/(x + θh)^3.

Slide 36: To show that A_2^ǫ/A_1 is a non-negative super-martingale

Applying the Taylor expansion to f(x) = x^{−1} with n = 2 yields

   (x + h)^{−1} = x^{−1} − h·x^{−2} + h^2·(x + θh)^{−3} = 1/x − h/x^2 + h^2/(x + θh)^3.

For non-negative x and h we have x^3 ≤ (x + θh)^3, so that

   (x + h)^{−1} ≤ 1/x − h/x^2 + h^2/x^3.

This first inequality puts an upper bound with “pure” x and h on

   1/(A_1(n) + π_1(n + 1)) ≤ 1/A_1(n) − π_1(n + 1)/A_1^2(n) + π_1^2(n + 1)/A_1^3(n).

Slide 37: To show that A_2^ǫ/A_1 is a non-negative super-martingale

Similarly, applying the Taylor expansion to f(x) = x^ǫ with n = 2 yields

   (x + h)^ǫ = x^ǫ + h·ǫ·x^{ǫ−1} + (h^2/2!)·ǫ·(ǫ − 1)·(x + θh)^{ǫ−2}.

For non-negative x and h and ǫ > 1, we have

   (ǫ − 1)·(x + θh)^{ǫ−2} ≤ C·x^{ǫ−2}

for some constant C, so that

   (x + h)^ǫ ≤ x^ǫ + h·ǫ·x^{ǫ−1} + h^2·C·ǫ·x^{ǫ−2}.

This second inequality puts an upper bound with “pure” x and h on

   (A_2(n) + π_2(n + 1))^ǫ ≤ A_2^ǫ(n) + . . .

Slide 38: To show that A_2^ǫ/A_1 is a non-negative super-martingale

• Using E[aX + b] = a·E[X] + b and factoring out common terms, Beggs obtains the upper bound

   A_1/(A_1 + A_2) · ( A_2^ǫ/A_1^2 )(n) · [ −E[π_1(n + 1)] + c_1·E[π_1(n + 1)^2]/A_1(n) ]
   + 1/(A_1 + A_2) · ( ǫ·A_2^ǫ/A_1 )(n) · [ E[π_2(n + 1)] + c_2·E[π_2(n + 1)^2]/A_2(n) ].

• Because the payoffs are bounded, E[π_1(. . .)] > γ·E[π_2(. . .)], and 1 − γ < ǫ − γ < 0, constants K_1, K_2, K_3 > 0 can be found such that the expression above is at most

   A_2^ǫ/( A_1·(A_1 + A_2) ) · ( K_1·(ǫ − γ) + K_2/A_1 + K_3/A_2 )(n).

• For ǫ ∈ (1, γ) and n large enough, this expression is non-positive. □

Slide 39: Generalisation of Beggs' Theorem 1, and application to games

Let there be m ≥ 2 alternative actions, a1, . . . , am (rather than m = 2).

Theorem 2. If the expected payoff (conditional on the history) of a_i dominates the expected payoff (conditional on the history) of a_j for all j ≠ i, then the probability that a_j will be played converges to zero, for all j ≠ i.

Applied to games:

Theorem 3. In a game with finitely many actions and players, if a player learns according to the ER scheme, then:

a. with probability 1, the probability and the empirical frequency with which he plays any action that is strictly dominated by another pure strategy converge to zero;

b. hence, if he has a strictly dominant strategy, then with probability 1 the probability and the empirical frequency with which he plays that action converge to 1.

(Beggs, 2005).

Slide 40: Summary

• There are several rules for

reinforcement learning on single

states.

• Sheer convergence is often easy

to prove.

• Proving convergence to best actions in a stationary environment is much more difficult.

• Convergence to best actions in non-stationary environments, e.g., convergence to dominant actions, or to best responses in self-play, is state-of-the-art research.

Slide 41: What next?

• No-regret learning: a generalisation of reinforcement learning.

No-regret =_Def play those actions that would have been successful in the past.

• Similarities with reinforcement learning:

1. Driven by past payoffs.

2. Not interested in (the behaviour of) the opponent.

3. Myopic.

• Differences:

a) Keeping accounts of hypothetical actions rests on the assumption that a

player is able to estimate payoffs of actions that were not actually

played. [Knowledge of the payoff matrix definitely helps, but is an

even more severe assumption.]

b) It is somewhat easier to obtain results regarding performance.
