TRANSCRIPT
Reinforcement Learning China Summer School
Multi-Agent Reinforcement Learning: From a Mean-Field Perspective
Renyuan Xu, Mathematical Institute, University of Oxford
August 8, 2020
A Toy Example: When Does the RLChina Lecture Start? (Gueant, Lasry, and Lions, 2011)
I The RLChina lecture is scheduled to start at t = 7 PM but often starts some time later than t, say at T
I The actual starting time T depends on the arrivals of the participants
I A rule is imposed: the lecture starts at time t, or once 90% of the participants have arrived, whichever is later
I Very large number of participants: we agree to consider them as a continuum of agents
I Question: What is the actual starting time T?
A Toy Example: When Does the RLChina Lecture Start?
I τ̄_i: time at which participant i decides to arrive
I τ_i: time at which participant i actually arrives:
      τ_i = τ̄_i + σ_i ε_i
  where σ_i ε_i is the uncertainty, σ_i ∼ m_0, ε_i ∼ N(0, 1)
A Toy Example: When Does the RLChina Lecture Start?
Each participant makes a decision by minimizing the expectation of the total cost
      c(t, T, τ_i) = c1(t, T, τ_i) + c2(t, T, τ_i) + c3(t, T, τ_i)
I Lateness compared to the scheduled time t:    c1(t, T, τ_i) = α [τ_i − t]+
I Lateness compared to the actual starting time T:    c2(t, T, τ_i) = β [τ_i − T]+
I Waiting time:    c3(t, T, τ_i) = γ [T − τ_i]+
A Toy Example: When Does the RLChina Lecture Start?
Resolution:
1. Anticipate an actual starting time T and solve for an optimal target time τ̄_i:
      α N((τ̄_i − t)/σ_i) + (β + γ) N((τ̄_i − T)/σ_i) = γ,
   where N is the cumulative distribution function of a standard normal.
2. From τ̄_i, together with the noise and the rule, find the corresponding actual starting time T*:
   I "Continuum of participants": 90% quantile of a distribution
   I N participants: order statistics (intractable)
3. Show that the mapping Γ : T ↦ T* has a fixed point:
   I Banach fixed point theorem
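To make the three steps concrete, here is a minimal numerical sketch of the fixed-point computation. The common σ for all participants and the cost weights α, β, γ below are illustrative assumptions, not values from the lecture; the normal CDF and its inverse are computed by elementary bisection.

```python
# A minimal sketch of the fixed-point iteration T -> T* for the toy example.
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0, tol=1e-10):
    # Inverse standard-normal CDF by bisection.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

def optimal_target_time(T, t, alpha, beta, gamma, sigma, lo=-24.0, hi=48.0, tol=1e-10):
    # Step 1: solve alpha*N((tau-t)/sigma) + (beta+gamma)*N((tau-T)/sigma) = gamma for tau.
    def g(tau):
        return (alpha * norm_cdf((tau - t) / sigma)
                + (beta + gamma) * norm_cdf((tau - T) / sigma) - gamma)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def Gamma(T, t, alpha, beta, gamma, sigma, quorum=0.9):
    # Steps 1-2: anticipated T -> optimal target time -> actual starting time T*.
    tau_bar = optimal_target_time(T, t, alpha, beta, gamma, sigma)
    # With a continuum of identical participants, actual arrivals are tau_bar + sigma*eps,
    # so the 90% quantile is tau_bar + sigma * z_{0.9}; the rule caps it below by t.
    return max(t, tau_bar + sigma * norm_quantile(quorum))

if __name__ == "__main__":
    t, alpha, beta, gamma, sigma = 19.0, 1.0, 2.0, 1.0, 0.25  # 7 PM = 19:00; illustrative weights
    T = t
    for _ in range(50):  # Step 3: iterate Gamma until the fixed point T* is reached
        T = Gamma(T, t, alpha, beta, gamma, sigma)
    print(f"Equilibrium starting time T* = {T:.3f}")
```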
Schedule
Mean-field approximation for MARL with large population
I (45 mins) Non-cooperative games ⇒ mean-field game
I (45 mins) Cooperative games ⇒ mean-field control
Learning in Mean-field Games
Motivation: A Sequential Auction Game
Figure: An Overview of Taobao Display Advertising System.
Motivation: A Sequential Auction Game
Ad auction problem for advertisers:
I Ad auction: a stochastic game on an ad exchange platform among a large number of players (the advertisers)
I Environment: in each round, a web user requests a page, and then a (Vickrey-type) second-price auction is run to incentivize advertisers to bid for a slot to display an advertisement
I Characteristics:
  I partial information (unknown conversion of clicks)
  I large population

Question: From an individual bidder's perspective, how to bid in this sequential game with a large population of competing bidders and unknown distributions of the conversion of clicks/rewards?
Motivation: A Sequential Auction Game
Attempt: the simultaneous-learning-and-decision-making problem (i.e., reinforcement learning) in a sequential auction with a large number of homogeneous bidders.

I Full model approach: solve it as an N-player reinforcement learning problem
  I Curse of "many agents" when N is large:
    I intractable interactions at the individual level
    I computational complexity grows exponentially with respect to the number of players
I Structure-dependent policies: Lauer and Riedmiller (2000), Qu, Wierman and Li (2019)
  I provable algorithms for two-player games
  I no theoretical guarantee for general non-zero-sum MARL
Motivation: A Sequential Auction Game

Attempt: the simultaneous-learning-and-decision-making problem (i.e., reinforcement learning) in a sequential auction with a large number of homogeneous bidders (homogeneity is what mean-field games exploit).

I Approximation approach: mean-field approximation
  I When N is large, consider instead the "aggregated" version of the N-player game by letting N → ∞
  I A game of small interacting individuals, with each player choosing an optimal strategy in view of the macroscopic information (the mean field)
  I By the (functional) SLLN, the aggregated version becomes an "approximation" of the N-player game
  I The aggregated version, the mean-field approximation, is analytically feasible and easy to learn
Other Applications: HFT and Crowded Shorts
Figure: (a) High-frequency trading. (b) Crowded shorts.
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
N-player Games

N-player game:
      V^i(s, π) := E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
      subject to s^i_{t+1} ∼ P^i(s_t, a_t),  a^i_t ∼ π^i_t(s_t).

I N players, state space S, action space A
I s_t = (s^1_t, ..., s^N_t) ∈ S^N is the state profile
I a_t = (a^1_t, ..., a^N_t) ∈ A^N is the action profile
I r^i is the reward function and P^i is the transition kernel of player i
I (s_t, a_t) → s_{t+1} under the joint kernel P
I admissible policy π^i : S^N → P(A), with P(A) the space of all probability measures over A
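For concreteness, here is a minimal sketch of one transition of such an N-player game under assumed tabular ingredients; the container names (policies, transition, reward) are illustrative and not from the slides.

```python
# One step of the N-player stochastic game: each player acts on the full state
# profile, receives an individual reward, and transitions via its own kernel.
import numpy as np

def one_step(state_profile, policies, transition, reward, rng):
    """policies[i](profile) and transition[i](profile, actions) return probability vectors."""
    N = len(state_profile)
    actions = tuple(rng.choice(len(dist), p=dist)
                    for dist in (policies[i](state_profile) for i in range(N)))
    rewards = [reward[i](state_profile, actions) for i in range(N)]
    next_profile = tuple(rng.choice(len(dist), p=dist)
                         for dist in (transition[i](state_profile, actions) for i in range(N)))
    return next_profile, actions, rewards
```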
From N-player Game to MFG

N-player games:
      maximize_{π^i}  V^i(s, π) := E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
      subject to s^i_{t+1} ∼ P^i(s_t, a_t),  a^i_t ∼ π^i_t(s_t)

I The mean-field approximation will not work for all N-player games
I Need assumptions to invoke the law of large numbers and the theory of propagation of chaos

Assume
1. Homogeneity: agents are identical, indistinguishable and interchangeable
2. Weak interactions: player i depends on the other agents only through the empirical measures
      (µ^{-i}_t(·), α^{-i}_t(·)) := ( (1/(N−1)) ∑_{j≠i} 1(s^j_t = ·),  (1/(N−1)) ∑_{j≠i} 1(a^j_t = ·) )
   (see the sketch after this slide)
3. Local policy: a^i_t ∼ π^i_t(s^i_t) here
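A small sketch of the "weak interaction" statistics in item 2: the empirical state and action distributions seen by player i (illustrative code, not from the slides).

```python
# Empirical state/action distributions of everyone except player i.
import numpy as np

def empirical_measures(states, actions, i, n_states, n_actions):
    """states, actions: length-N integer arrays at time t."""
    others = np.arange(len(states)) != i
    mu = np.bincount(states[others], minlength=n_states) / (len(states) - 1)
    alpha = np.bincount(actions[others], minlength=n_actions) / (len(actions) - 1)
    return mu, alpha  # player i's reward/transition depend on the others only via (mu, alpha)
```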
From N-player Game to MFG

A smaller class of N-player games:
      V^i(s^i, µ^{-i}, π^i, α^{-i}) := E[ ∑_{t≥0} γ^t r^i(s^i_t, µ^{-i}_t, π^i_t, α^{-i}_t) | (s^i_0, µ^{-i}_0, α^{-i}_0) = (s^i, µ^{-i}, α^{-i}) ]
      subject to s^i_{t+1} ∼ P^i(s^i_t, µ^{-i}_t, π^i_t, α^{-i}_t),  a^i_t ∼ π^i_t(s^i_t)

I Homogeneous agents ⇒ it suffices to look at a representative agent
I A game between agent i and the empirical measure of the other agents (µ^{-i}, α^{-i})

When the number of players goes to infinity, view the limit of
      (µ^{-i}_t(·), α^{-i}_t(·)) := ( (1/(N−1)) ∑_{j≠i} 1(s^j_t = ·),  (1/(N−1)) ∑_{j≠i} 1(a^j_t = ·) )
as the population state-action joint distribution L_t
General MFG
MFG (Representative agent)
      V(s, π, {L_t}_{t=0}^∞) := E[ ∑_{t≥0} γ^t r(s_t, a_t, L_t) | s_0 = s ]
      subject to s_{t+1} ∼ P(s_t, a_t, L_t),  a_t ∼ π_t(s_t)

I s_t ∈ S and a_t ∈ A are the state and action of a representative agent at time t
I r is the reward function, P is the transition dynamics
I L_t ∈ ∆^{|S||A|} is the population state-action distribution at time t, with state marginal µ_t and action marginal α_t
I admissible policy π_t : S → P(A) here
Nash Equilibrium in GMFGs (parallel to the definition of NE for the N-player game)
Denote
I Policy profile π = {π_t}_{t=0}^∞
I Population profile L = {L_t}_{t=0}^∞

Definition (NE for GMFGs)
In GMFGs, a policy-population pair (π*, L*) is called an NE if
1. (Representative agent side) Fix L*; for any policy π and any initial state s ∈ S,
      V(s, π*, L*) ≥ V(s, π, L*).
2. (Population side) P_{(s_t, a_t)} = L*_t for all t ≥ 0 (the law of (s_t, a_t) matches L*_t), where {s_t, a_t}_{t=0}^∞ is the dynamics under the policy π*, with a_t ∼ π*_t(s_t), s_{t+1} ∼ P(·|s_t, a_t, L*_t).
Nash Equilibrium in GMFG: Parallel Definitions

Definition (NE for GMFGs)
In GMFGs, a policy-population pair (π*, L*) is called an NE if
1. (Representative agent side) Fix L*; for any policy π and any initial state s ∈ S,
      V(s, π*, L*) ≥ V(s, π, L*).
2. (Population side) P_{(s_t, a_t)} = L*_t for all t ≥ 0, where {s_t, a_t}_{t=0}^∞ is the dynamics under the policy π*, with a_t ∼ π*_t(s_t), s_{t+1} ∼ P(·|s_t, a_t, L*_t).

Definition (NE for N-player games)
In the N-player game, a policy profile π* is called an NE if
1. (One agent side) Fix π*,−i; for any policy π^i and any initial state s ∈ S^N,
      V^i(s, π*) ≥ V^i(s, (π*,−i, π^i)).
2. (Population side) Condition 1 holds for all agents.
Nash Equilibrium in GMFG: Stationary Solution
(Definition of NE for GMFGs as above.)

I Stationary solution: there exists (L*, π*) independent of time
  I For single-agent RL: an optimal stationary policy always exists for the infinite-horizon problem
  I For games: this is more difficult due to the competition
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I General policy-based and value-based algorithms
Fixed Point / Three-Step Approach (model fully known)

I Step 1 (Γ1): given L, solve the stochastic control problem to get π*_L:
      maximize_π  V(s, π | L) := E[ ∑_{t≥0} γ^t r(s_t, a_t | L) | s_0 = s ]
      subject to s_{t+1} ∼ P(s_t, a_t, L)
  Write Γ1(L) = π*_L.
I Step 2 (Γ2): given π*_L, update L for one time step following the dynamics to get L′:
      Γ2(L, π*_L) = L′
I Step 3: check whether L′ matches L, and repeat:
      Γ2 ∘ Γ1(L) = L′

Remark: well-definedness; algorithm design; convergence analysis.
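A minimal tabular sketch of this three-step approach when the model is fully known. The helper make_P(L) (returning the transition tensor induced by a fixed population L) and the reward function r(s, a, L) are illustrative assumptions; the inner solver is plain value iteration.

```python
# Three-step fixed-point approach for a finite-state, finite-action MFG.
import numpy as np

def best_response(L, make_P, r, gamma, n_iter=500):
    """Step 1 (Gamma_1): value iteration for the MDP induced by a fixed population L."""
    P = make_P(L)                          # shape (S, A, S)
    S, A = P.shape[0], P.shape[1]
    R = np.array([[r(s, a, L) for a in range(A)] for s in range(S)])
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        Q = R + gamma * P @ Q.max(axis=1)  # Bellman optimality backup
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0   # greedy policy (unique maximizer assumed)
    return pi

def population_update(L, pi, make_P):
    """Step 2 (Gamma_2): push the state-action distribution forward one step under pi."""
    P = make_P(L)
    mu_next = np.einsum("sa,sap->p", L, P)     # next state marginal
    return mu_next[:, None] * pi               # L'(s', a') = mu'(s') * pi(a'|s')

def mean_field_equilibrium(L0, make_P, r, gamma, tol=1e-8, max_outer=1000):
    """Step 3: iterate Gamma_2 . Gamma_1 until L is (approximately) a fixed point."""
    L = L0
    for _ in range(max_outer):
        pi = best_response(L, make_P, r, gamma)
        L_next = population_update(L, pi, make_P)
        if np.abs(L_next - L).sum() < tol:
            return pi, L_next
        L = L_next
    return pi, L
```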
Existence and Uniqueness
Theorem 1 (Guo, Hu, X. & Zhang, 2019)
1. Under some "small parameter" conditions, Γ2 ∘ Γ1 is contractive;
2. For any GMFG, if Γ2 ∘ Γ1 is contractive, then there exists a unique stationary NE. In addition, the three-step approach converges.

I Uniqueness is in the sense of L
I There are explicit model conditions that guarantee the "small parameter" conditions (see the appendix)
I Parallel results hold for both stationary and non-stationary MFGs
(Partial) References on MFGs (model fully known)

I Existence:
  I PDE approach: Huang, Malhame & Caines (2006), Lasry & Lions (2006)
  I Probabilistic approach: Carmona & Delarue (2013), Carmona & Lacker (2014)
  I Weak solution approach: Lacker (2015)
I Uniqueness:
  I Small parameter condition: Huang, Malhame & Caines (2006)
  I Monotonicity condition: Lasry & Lions (2006)
I Convergence (value/policy): Lacker (2015, 2018), Fischer (2017), Cardaliaguet, Delarue, Lasry, and Lions (2019)
I Reachability with multiple equilibria: Cecchin, Fischer, and Pelino (2018), Delarue and Tchuendom (2018), Nutz, Martin, and Tan (2019)
I Tutorials: Gueant, Lasry, and Lions (2011); Daniel Lacker (IPAM Notes, 2018); Carmona & Delarue (two-volume book)
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
Bridge MFG with RL: Finding NE
Three-step approach revisited:
I Step 1: given L, solve the stochastic control problem to get π*_L:
      maximize_π  V(s, π, L) := E[ ∑_{t≥0} γ^t r(s_t, a_t, L) | s_0 = s ],
      subject to s_{t+1} ∼ P(s_t, a_t, L)
I Step 2: given π*_L, update L for one time step following the dynamics to get L′
I Step 3: check whether L′ matches L
Bridge MFG with RL: Finding NE
Three-step approach revisited (when P and the distribution of r are unknown):

I Step 1 [Inner iteration]: given L, perform Q-learning with transition P_L(s′|s, a) := P(s′|s, a, L) and reward r_L(s, a) := r(s, a, L) ⇒ Q*_L(s, a)
I Step 2 [Outer iteration]: given π*_L = argmax-e(Q*_L(s, ·)), update L for one time step following the dynamics to get L′
I Step 3: check whether L′ matches L

Remark: π*_L(s) ∈ argmax_a Q*_L(s, a). When the argmax is non-unique, replace it with argmax-e, which assigns equal probability to all maximizers.
Naive RL Algorithm for GMFG
Algorithm 1: Naive Q-learning for GMFGs
1: Input: initial population state-action pair L_0
2: for k = 0, 1, ... do
3:   Given L_k, perform Q-learning to find the Q-function Q*_k(s, a) = Q*_{L_k}(s, a) of an MDP with dynamics P_{L_k}(s′|s, a) and reward distribution R_{L_k}(s, a).
4:   Solve π_k ∈ Π with π_k(s) = argmax-e(Q*_k(s, ·)).
5:   Sample s ∼ µ_k, where µ_k is the population state marginal of L_k, and obtain L_{k+1} from G(s, π_k, L_k).
6: end for

Line 3 (inner Q-learning): fix L and iterate
      Q^{t+1}_L(s, a) ← Q^t_L(s, a) + β_t(s, a) [ r(s, a, L) + γ max_{a′} Q^t_L(s′, a′) − Q^t_L(s, a) ],
so that Q^t_L → Q*_L, with
      Q*_L(s, a) := r_L(s, a) + γ ∑_{s′∈S} P_L(s′|s, a) V*_L(s′).
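A minimal sketch of this naive loop. The simulator env_step(s, a, L) returning (s′, r) and the population update map G (simplified here to take (π, L) directly rather than G(s, π_k, L_k)) are assumptions for illustration.

```python
# Naive GMF-Q loop: plain tabular Q-learning inside, argmax-e outside.
import numpy as np

def q_learning_fixed_L(L, env_step, S, A, gamma, episodes=5000, lr=0.1, eps=0.1, rng=None):
    """Inner iteration: ordinary tabular Q-learning for the MDP frozen at population L."""
    rng = rng or np.random.default_rng()
    Q = np.zeros((S, A))
    s = int(rng.integers(S))
    for _ in range(episodes):
        a = int(rng.integers(A)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = env_step(s, a, L)
        Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

def naive_gmfg_loop(L0, env_step, G, S, A, gamma, outer=100):
    L = L0
    for _ in range(outer):
        Q = q_learning_fixed_L(L, env_step, S, A, gamma)
        pi = np.zeros((S, A))
        pi[np.arange(S), Q.argmax(axis=1)] = 1.0   # argmax-e (ties ignored in this sketch)
        L = G(pi, L)                                # population update under pi
    return Q, pi, L
```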
Failure of the Naive Algorithm
Failure examples:
Figure: Fluctuations of the naive algorithm (30 sample paths). (c) Fluctuation in l∞, |L_k − L_{k+1}|_∞, over outer iterations. (d) Fluctuation in l1, |L_k − L_{k+1}|_1, over outer iterations.
Problems in the Naive Algorithm: Approximation Errors
Algorithm 1: Naive Q-learning for GMFGs
1: Input: initial population state-action pair L_0
2: for k = 0, 1, ... do
3:   Perform Q-learning to find the Q-function Q*_k(s, a) = Q*_{L_k}(s, a)   [impossible exactly]
     of an MDP with dynamics P_{L_k}(s′|s, a) and reward distribution R_{L_k}(s, a).
4:   Solve π_k ∈ Π with π_k(s) = argmax-e(Q*_k(s, ·)).   [unstable]
5:   Sample s ∼ µ_k, where µ_k is the population state marginal of L_k, and obtain L_{k+1} from G(s, π_k, L_k).   [unstable]
6: end for

I Inner iteration: the approximation error must be controlled
I argmax-e: not continuous
I The update from L_k to L_{k+1} is not controlled
Instability of argmax-e: Magnifying the Approximation Errors

argmax-e is not continuous:
I x = (1, 1), then argmax-e(x) = (1/2, 1/2)
I y = (1, 1 − ε), then for any ε > 0, argmax-e(y) = (1, 0)
I ‖argmax-e(x) − argmax-e(y)‖_2 / ‖x − y‖_2 = 1/(√2 ε)
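A tiny numerical illustration of this discontinuity, and of how a softmax on the same inputs changes smoothly (the helper definitions below are assumptions for illustration).

```python
# argmax-e jumps under an arbitrarily small perturbation; softmax does not.
import numpy as np

def argmax_e(x):
    m = (x == x.max())
    return m / m.sum()            # equal probability on all maximizers

def softmax_c(x, c=1.0):
    z = np.exp(c * (x - x.max()))
    return z / z.sum()

eps = 1e-3
x, y = np.array([1.0, 1.0]), np.array([1.0, 1.0 - eps])
print(argmax_e(x), argmax_e(y))              # [0.5 0.5] vs [1. 0.]: a jump of size ~0.7
print(softmax_c(x, c=5), softmax_c(y, c=5))  # nearly identical outputs
```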
Stable Algorithm for GMFG (GMF-Q)
Algorithm 2: Q-learning for GMFGs (GMF-Q)
1: Input: initial L_0, tolerance ε > 0
2: for k = 0, 1, ... do
3:   Perform Q-learning for T_k iterations to find the approximate Q-function Q̂*_k(s, a) ≈ Q*_{L_k}(s, a) of an MDP with dynamics P_{L_k}(s′|s, a) and reward distribution R_{L_k}(s, a).
4:   Compute π_k ∈ Π with π_k(s) = softmax_c(Q̂*_k(s, ·)).
5:   Sample s ∼ µ_k, where µ_k is the population state marginal of L_k, and obtain L̃_{k+1} from G(s, π_k, L_k).
6:   Find L_{k+1} = Proj_{S_ε}(L̃_{k+1}).
7: end for

1. T_k: carefully chosen for the PAC guarantee
2. argmax-e → softmax: softmax_c(x)_i = exp(c x_i) / ∑_{j=1}^n exp(c x_j)
3. Projection onto S_ε: an ε-net (finite cover) of the space of L
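A minimal sketch of this stabilized outer loop, reusing q_learning_fixed_L and softmax_c from the sketches above; the finite ε-net (a list of candidate distributions L) and the update map G are assumptions.

```python
# GMF-Q: smoothed policy extraction plus projection onto a finite epsilon-net.
import numpy as np

def project_to_net(L, epsilon_net):
    # Projection onto the epsilon-net: the closest element in l1 distance.
    dists = [np.abs(L - C).sum() for C in epsilon_net]
    return epsilon_net[int(np.argmin(dists))]

def gmf_q(L0, env_step, G, epsilon_net, S, A, gamma, c=5.0, outer=100, Tk=5000):
    L = L0
    for _ in range(outer):
        Q = q_learning_fixed_L(L, env_step, S, A, gamma, episodes=Tk)  # inner iteration
        pi = np.apply_along_axis(softmax_c, 1, Q, c)                    # smoothed policy
        L = project_to_net(G(pi, L), epsilon_net)                       # stabilized update
    return Q, pi, L
```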
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
Convergence and Complexity of GMF-Q
Theorem 2 (Guo, Hu, X. & Zhang, 2019)
Given the same assumptions as in the existence and uniqueness theorem, for any specified tolerances ε, δ > 0, set T_k, c and S_ε appropriately. Then with probability at least 1 − 2δ, W_1(L_{K_ε}, L*) = O(ε), and the total number of iterations T = ∑_{k=0}^{K_ε−1} T_k is bounded by
      T = O( K_ε^{19/3} (log(K_ε/δ))^{41/3} ).
Here K_ε := ⌈2 max{ (ηε)^{−1/η}, log_d( ε / max{diam(S) diam(A), 1} ) + 1 }⌉ is the number of outer iterations, and W_1 is the l1-Wasserstein distance.
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
More General Results (Guo, Hu, X., Zhang, 2020)

Key take-away: "three-step approach + smoothing + stabilizing" provides a meta-framework for learning MFGs:

I For inner iterations, any single-agent RL algorithm with a finite-sample bound can be used
  I Value-based algorithms: Q-learning
  I Policy-based algorithms: Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Conservative Policy Iteration (CPI)
I For outer iterations, different choices of smoothing method
  I Boltzmann (softmax)
  I MellowMax: MM_c(x) = log( (1/n) ∑_{i=1}^n exp(c x_i) ) / c
  I Momentum: a linear combination of L_t and L_{t−1}
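A small sketch of these smoothing choices; the MellowMax operator follows the formula on the slide, and the momentum mixing weight lam is an illustrative assumption.

```python
# Smoothing choices for the outer iteration of the meta framework.
import numpy as np

def boltzmann(q_row, c=1.0):
    z = np.exp(c * (q_row - q_row.max()))
    return z / z.sum()

def mellowmax(q_row, c=1.0):
    # MM_c(x) = log(mean(exp(c * x))) / c, computed with a max-shift for stability.
    m = q_row.max()
    return (np.log(np.mean(np.exp(c * (q_row - m)))) + c * m) / c

def momentum_update(L_prev, L_new, lam=0.5):
    # Stabilize the population update by averaging consecutive iterates.
    return lam * L_new + (1.0 - lam) * L_prev
```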
(Partial) Related Literature on Learning MFG
I Mean-field multi-agent reinforcement learning: Yang, Luo, Li, Zhou, Zhang, and Wang (2018)
  I First paper applying the mean-field approximation to MARL
  I Interaction through actions
I Model-based framework (linear-quadratic): Fu, Yang, Chen, Wang (2019)
I Local NE: Subramanian and Mahajan (2019)
I Deep deterministic policy gradient (continuous actions): Elie, Perolat, Lauriere, Geist, Pietquin (2019)
I Variants of Q-learning algorithms:
  I Fitted Q: Berkay Anahtarcı, Can Deha Karıksız, Naci Saldi (2019)
  I Regularized Q: Berkay Anahtarcı, Can Deha Karıksız, Naci Saldi (2020)
Learning Mean-Field Controls
Motivating Example: Ride-sharing Order Dispatch
Motivating Example: Ride-sharing Order Dispatch
Real-time (driver, order)-pair matching to maximize the combined rewards:
I Rider: short waiting time
I Driver:
  I short-term reward ∼ distance(origin, destination)
  I long-term reward
I Global supply-demand balance
————— Li, Qin, Jiao, Yang, Gong, Wang, Wang, Wu and Ye (2019)
Motivating Example: Ride-sharing Order Dispatch
Key features:
I Cooperative game: optimization from the platform's perspective
I Large population: N and M are large
I Interaction with a (partially) unknown environment: traffic conditions, stochastic and time-dependent demand, etc.
I Demand for (fast) real-time resolution: short waiting time

Solution: MARL with mean-field approximation (Li, Qin, Jiao, Yang, Gong, Wang, Wang, Wu, Ye, 2019)
I Cooperative order-dispatching algorithm
I Mean-field approximation for local dynamics
I Centralized training with decentralized execution
I Convergence proof of Q-learning
Other Examples
Figure: (a) Data routing. (b) Food delivery. (c) Autonomous driving.
Outline
I Mean-field control formulation
  I N-player cooperative game
  I Pareto optimality condition
  I N → ∞
I Dynamic Programming Principle (DPP) on the probability measure space
I Q-learning algorithm on the probability measure space
Cooperative MARL
At the beginning of each round t, for each agent i:
I State: agent i is at s^i_t ∈ S
I Action: he/she takes an action a^i_t ∈ A such that a^i_t ∼ π^i
I Policy: π^i : S → P(A)
I Reward: receives a reward r^i(s_t, a_t)
I Value function: V^i(s, π) = E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
I Transition: s^i_{t+1} ∼ P^i(s_t, a_t), i.e., the state changes to s^i_{t+1} according to the transition probability function P^i(s_t, a_t)
Pareto Optimality Condition
Pareto optimality
If π* is Pareto optimal, then there does not exist π such that
I V^i(s, π) ≥ V^i(s, π*) for all i and s, and
I there exists j such that V^j(s, π) > V^j(s, π*) for all s.

I No individual can be made better off without making at least one individual worse off
I The central controller's solution is Pareto optimal
Cooperative MARL
Central controller
      V(s, π) := (1/N) ∑_{i=1}^N E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
      subject to s^i_{t+1} ∼ P^i(s_t, a_t), i = 1, 2, ..., N

I Cooperative game ⇒ agreement among agents ⇒ auxiliary central controller
I Central controller:
  I observes the joint state s_t
  I decides a^i_t ∼ π^i_t(s_t) for each agent i from a global perspective
  I observes the joint reward r_t as feedback
  I P and the distribution of r are unknown
I Curse of dimensionality: the vanilla Q-learning algorithm for the central controller requires
      Ω( poly( (|S||A|)^N · (N/ε) · ln(1/(δε)) ) )
  samples.
Mean-field Approximation
Assumptions
I Homogeneity: Agents are identical and interchangeable.
I Weak interaction: each agent depends on all other agents only through the empirical distributions of states and actions.

When N → ∞, this becomes a mean-field control problem.
Mean-field Approximation
Mean-field control
      sup_π V(π, µ) := E[ ∑_{t≥0} γ^t r(s_t, a_t, µ_t) | s_0 ∼ µ, µ_0 = µ ]
      subject to s_{t+1} ∼ P(s_t, a_t, µ_t).

I s_t ∈ S: dynamics of the representative agent with randomized initial state s_0 ∼ µ
I a_t ∈ A: action
I µ_t = Law(s_t) ∈ P(S): limit of µ^N_t(·) = (1/N) ∑_{j=1}^N 1(s^j_t = ·)
I r: reward of the representative agent
I Admissible policy π_t(s, µ) : S × P(S) → P(A)
Comparison between MFG and MFC
————————Carmona, Delarue & Lachapelle (2013)
Outline
I Mean-field control formulation
  I N-player cooperative game
  I Pareto optimality condition
  I N → ∞
I Dynamic Programming Principle (DPP) on the probability measure space
I Q-learning algorithm on the probability measure space
Time Consistency (Dynamic Programming Principle)
Dynamic programming is an umbrella encompassing many algorithms (single-agent RL and MARL):
I Value-based methods: Q-learning
I Policy-based methods: actor-critic algorithms
I Planning: value iteration or policy iteration

Dynamic programming does not hold for free with an infinite number of players: one needs to work with the correct "state" and "action" spaces.
I Pham and Wei (2016), Pham and Wei (2017), and Wu and Zhang (2019) for the value function
Q-function on the probability measure space
Define the Q-function for MFC:
      Q(µ, h) := E[ r(s_0, a_0, µ) | s_0 ∼ µ, a_0 ∼ h ]    (reward of taking a_0 ∼ h)
               + E_{s_1 ∼ P(s_0, a_0, µ)}[ ∑_{t≥1} γ^t r(s_t, a_t, µ_t) | a_t ∼ π*_t ]    (reward of playing optimally afterwards, a_t ∼ π*_t)

I local policy h : S → P(A)
I V*(µ) = max_h Q(µ, h)
Bellman Equation for Integrated Q-function
Theorem 3 (Bellman equation for the integrated Q-function; Gu, Guo, Wei and X., 2019)
For any µ ∈ P(S) and h ∈ H,
      Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

I H := {h : S → P(A)}: set of "local" policies
I Aggregated dynamics:
  I Φ(µ, h)(s′) := ∑_{s,a} P(s′ | s, a, µ) µ(s) h(s, a)
  I µ_{t+1} = Φ(µ_t, h): distribution at time t + 1 (flow property)
I Aggregated reward: r(µ, h) := ∑_{s,a} µ(s) h(s, a) r(s, a, µ)
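A minimal tabular sketch of the aggregated dynamics Φ and the aggregated reward r(µ, h) appearing in the Bellman equation. Here P is assumed to be the (S, A, S)-shaped transition table already evaluated at the given µ, and r is given as an (S, A) table at µ; both are illustrative conventions.

```python
# Aggregated dynamics and reward for the lifted (measure-valued) MFC problem.
import numpy as np

def aggregated_dynamics(mu, h, P):
    """Phi(mu, h)(s') = sum_{s,a} P(s'|s,a,mu) * mu(s) * h(s,a)."""
    joint = mu[:, None] * h                  # joint state-action distribution, shape (S, A)
    return np.einsum("sa,sap->p", joint, P)  # next state distribution mu_{t+1}

def aggregated_reward(mu, h, r_table):
    """r(mu, h) = sum_{s,a} mu(s) * h(s,a) * r(s,a,mu)."""
    return float(np.sum(mu[:, None] * h * r_table))
```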
Bellman Equation for Integrated Q-function
Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

I H is the minimal space on which the Bellman equation holds
I Similar results hold for the finite-time horizon, for the Bellman equation of the value function, and for the formulation that includes the law of actions
Outline
I Mean-field control formulation
  I N-player cooperative game
  I Pareto optimality condition
  I N → ∞
I Dynamic Programming Principle (DPP) on the probability measure space
I Q-learning algorithm on the probability measure space
Q-learning Algorithm on Probability Measure Space
Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

Properties of Q(µ, h):
I defined on the probability measure space
I continuous µ and h, even when s and a are discrete
I deterministic dynamics: µ_{t+1} = Φ(µ_t, h)

⇒ a Q-learning algorithm with continuous state-action spaces and deterministic dynamics
Q-learning Algorithm on Probability Measure Space
Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

I Continuous state space: kernel regression
I Continuous action space: optimize only over a finite action space
I Deterministic dynamics: exploit this to reduce the sample complexity
ε-net and kernel regression
Figure: (d) ε-net. (e) Kernel regression.

I C_ε: ε-net on C := P(S) × H
I H_ε: induced ε-net on H
I Kernel regression on C_ε
Approximated Bellman Operator
The original Bellman operator B : R^C → R^C for the problem is
      (B q)(c) := r(c) + γ sup_{h∈H} q(Φ(c), h),    c ∈ C := P(S) × H.

To facilitate the algorithm design, we introduce an approximated Bellman operator B_K : R^{C_ε} → R^{C_ε} such that
      (B_K q)(c_i) = r(c_i) + γ max_{h∈H_ε} Γ_K q(Φ(c_i), h),
where Γ_K is the kernel regression operator.
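A minimal sketch of B_K with a kernel-regression interpolator on the ε-net. The Gaussian kernel, its bandwidth, and the representation of the net as a list of (µ_i, h_i) grid points are illustrative assumptions.

```python
# Approximated Bellman operator B_K on a finite epsilon-net of C = P(S) x H.
import numpy as np

GAMMA = 0.9  # illustrative discount factor

def kernel_weights(c, net, bandwidth=0.1):
    """Gamma_K: normalized kernel weights of an arbitrary point c = (mu, h) on the net points."""
    d = np.array([np.abs(c[0] - mu_i).sum() + np.abs(c[1] - h_i).sum() for mu_i, h_i in net])
    w = np.exp(-(d / bandwidth) ** 2)
    return w / w.sum()

def apply_BK(q_values, net, H_eps, Phi, r_agg):
    """One application of B_K to a vector of q-values indexed by the net points."""
    new_q = np.empty_like(q_values)
    for i, (mu_i, h_i) in enumerate(net):
        mu_next = Phi(mu_i, h_i)
        # Interpolate q at (mu_next, h) via kernel regression, then maximize over H_eps.
        best = max(float(kernel_weights((mu_next, h), net) @ q_values) for h in H_eps)
        new_q[i] = r_agg(mu_i, h_i) + GAMMA * best
    return new_q
```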
Sketch of the Algorithm
The algorithm consists of two steps:
1. Get sampled data on an ε-net
  I Initialize µ, an ε/2-net C_{ε/2}, and an exploration policy π taking actions from H_{ε/2}
  I At the current state µ_t, act h_t according to π, observe µ_{t+1} = Φ(µ_t, h_t) and r_t = r(µ_t, h_t)
2. Find the fixed point of B_K:
      q_{l+1}(µ_t, h_t) ← r(µ_t, h_t) + γ max_{h∈H_ε} Γ_K q_l(Φ(µ_t, h_t), h)

Comment: visiting C_{ε/2} once is enough ⇒ low sample complexity and efficient.
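A minimal sketch of the resulting loop, reusing apply_BK and kernel_weights from the sketch above. Since each net point only needs to be visited once, the tables Phi and r_agg stand in for the sampled transitions and rewards here, an illustrative simplification.

```python
# MFC-K-Q style loop: iterate the approximated Bellman operator on the epsilon-net,
# then read off a greedy local policy from the kernel-interpolated q-values.
import numpy as np

def mfc_k_q(net, H_eps, Phi, r_agg, n_sweeps=200):
    q = np.zeros(len(net))
    for _ in range(n_sweeps):
        q = apply_BK(q, net, H_eps, Phi, r_agg)   # fixed-point iteration of B_K
    def greedy_h(mu):
        # Greedy local policy at mu: maximize the kernel-interpolated q over H_eps.
        return max(H_eps, key=lambda h: float(kernel_weights((mu, h), net) @ q))
    return q, greedy_h
```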
Convergence and Complexity Results
Theorem (Gu, Guo, Wei, X., 2020)
Under mild assumptions, for any ε′ > 0, under the ε′-greedy policy, with probability 1 − δ, for any initial state distribution µ and ε-net, after
      Ω( poly( (1/ε) · (1/ε′) · log(1/δ) ) )
samples, MFC-K-Q converges linearly to some function Q_{C_ε}; and the sup distance between Γ_K Q_{C_ε} and Q_C is upper bounded by C·ε, where C is a constant independent of ε, ε′ and δ.

Comment: there are three sources of approximation error:
I kernel regression
I discretized action space
I estimated data (for both dynamics and rewards)
Comparison of Complexity Results
Work                     | MFC / N-player | Sample complexity guarantee
Our work                 | MFC            | Ω(T_cov · log(1/δ))
Carmona et al., 2019     | MFC            | Ω((T_cov · log(1/δ))^l · poly(log(1/(δε))/ε))
Vanilla N-player         | N-player       | Ω(poly((|S||A|)^N · log(1/(δε)) · N/ε))
Qu & Li, 2019            | N-player       | Ω(poly((|S||A|)^{f(log(1/ε))} · log(1/δ) · N/ε))

Table: Comparison of algorithms

I Here T_cov is the covering time of the exploration policy.
I l = max{3 + 1/κ, 1/(1 − κ)} > 4 for some κ ∈ (0.5, 1).
I f(log(1/ε)) is a structure-dependent quantity and can scale as N when agents are not interacting locally.
Mean-Field Theory for Machine Learning

On the theory front
I Theoretical development is (sometimes) easier on the lifted probability measure space
I Addresses the curse of dimensionality

Applications
I Training neural networks: Mei, Montanari, and Nguyen (2018); E, Han and Li (2018); Sirignano and Spiliopoulos (2019); Hu, Ren, Siska, and Szpruch (2019); Bo, Capponi and Liao (2019); Lu, Ma, Lu, Lu, and Ying (2020); Wojtowytsch and E (2020)
I GANs: Cao, Guo, Lauriere (2019); Lin, Fung, Li, Nurbekyan, and Osher (2020); Conforti, Kazeykina, and Ren (2020); Domingo-Enrich, Jelassi, Mensch, Rotskoff, and Bruna (2020)
I MARL: (as mentioned before)
Thank you!
References: Mean-field Games
I Large Population Stochastic Dynamic Games: Closed-Loop McKean-Vlasov Systems and the Nash Certainty Equivalence Principle (Huang, Malhame & Caines, 2006)
I Mean Field Games (Lasry & Lions, 2007)
I Mean Field Games and Applications (Gueant, Lasry & Lions, 2011)
I Probabilistic Theory of Mean Field Games with Applications I & II (Carmona & Delarue, 2018)
I Probabilistic Analysis of Mean-Field Games (Carmona & Delarue, 2013)
I A Probabilistic Weak Formulation of Mean Field Games and Applications (Carmona & Lacker, 2014)
I A General Characterization of the Mean Field Limit for Stochastic Differential Games (Lacker, 2014)
References: Mean-field Games
I On the Convergence of Closed-Loop Nash Equilibria to the Mean Field Game Limit (Lacker, 2018)
I Mean Field Games via Controlled Martingale Problems: Existence of Markovian Equilibria (Lacker, 2018)
I On the Connection Between Symmetric n-Player Games and Mean Field Games (Fischer, 2017)
I The Master Equation and the Convergence Problem in Mean Field Games (Cardaliaguet, Delarue, Lasry, and Lions, 2015)
I On the Convergence Problem in Mean Field Games: A Two State Model without Uniqueness (Cecchin, Pra, Fischer, and Pelino, 2018)
I Selection of Equilibria in a Linear Quadratic Mean-Field Game (Delarue and Tchuendom, 2018)
I Convergence to the Mean Field Game Limit: A Case Study (Nutz, Martin and Tan, 2019)
References: Mean-field Games
I Mean Field Multi-Agent Reinforcement Learning (Yang, Luo, Li, Zhou, Zhang, and Wang, 2018)
I Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games (Fu, Yang, Chen, and Wang, 2019)
I Mean Field Equilibria of Dynamic Auctions with Learning (Iyer, Johari and Sundararajan, 2014)
I Reinforcement Learning in Stationary Mean-Field Games (Subramanian and Mahajan, 2019)
I Approximate Fictitious Play for Mean Field Games (Elie, Perolat, Lauriere, Geist, Pietquin, 2019)
I Fitted Q-Learning in Mean-Field Games (Anahtarcı, Karıksız, Saldi, 2019)
I Q-Learning in Regularized Mean-Field Games (Anahtarcı, Karıksız, Saldi, 2020)
References: Mean-field Controls
I Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning (Li, Qin, Jiao, Yang, Gong, Wang, Wang, Wu and Ye, 2019)
I Dynamic Programming for Optimal Control of Stochastic McKean-Vlasov Dynamics (Pham and Wei, 2017)
I Bellman Equation and Viscosity Solutions for Mean-Field Stochastic Control Problem (Pham and Wei, 2018)
I Viscosity Solutions to Parabolic Master Equations and McKean-Vlasov SDEs with Closed-Loop Controls (Wu and Zhang, 2020)
I Model-Free Mean-Field Reinforcement Learning: Mean-Field MDP and Mean-Field Q-Learning (Carmona, Lauriere and Tan, 2019)
I Scalable Reinforcement Learning of Localized Policies for Multi-Agent Networked Systems (Qu, Wierman, Li, 2020)
References: Mean-field Theory for Deep Learning
I A Mean Field View of the Landscape of Two-Layer Neural Networks (Mei, Montanari, and Nguyen, 2018)
I Mean Field Analysis of Neural Networks: A Law of Large Numbers (Sirignano and Spiliopoulos, 2019)
I Mean-Field Langevin Dynamics and Energy Landscape of Neural Networks (Hu, Ren, Siska and Szpruch, 2020)
I Relaxed Control and Gamma-Convergence of Stochastic Optimization Problems with Mean Field (Bo, Capponi and Liao, 2019)
I A Mean-Field Analysis of Deep ResNet and Beyond: Towards Provable Optimization via Overparameterization from Depth (Lu, Ma, Lu, Lu, and Ying, 2020)
I Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective (Wojtowytsch and E, 2020)
I Connecting GANs, MFGs, and OT (Cao, Guo and Lauriere, 2020)
I APAC-Net: Alternating the Population and Agent Control via Two Neural Networks to Solve High-Dimensional Stochastic Mean Field Games (Lin, Fung, Li, Nurbekyan, Osher, 2020)
I A Mean-Field Analysis of Two-Player Zero-Sum Games (Jelassi, Mensch, Rotskoff, and Bruna, 2020)
References
I Guo, Hu, X., and Zhang (2019). Learning Mean-Field Games. NeurIPS 2019.
I Guo, Hu, X., and Zhang (2020). A General Framework for Learning Mean-Field Games. https://arxiv.org/pdf/2003.06069.pdf
I Gu, Guo, Wei, X. (2019). Dynamic Programming Principles for Learning Mean-Field Controls. arxiv.org/abs/1911.07314
I Gu, Guo, Wei, X. (2020). Q-Learning Algorithm for Mean-Field Controls, with Convergence and Complexity Analysis. https://arxiv.org/pdf/2002.04131.pdf
Local Policy vs General Policy
For player i,
I Local policy: π^i(s^i) : X → P(A)
  I Individual agent learns the NE: the population state distribution is not observable
  I Example: sequential ad auction
I General (non-local) policy: π^i(s) : X^N → P(A)
  I Individual agent learns the NE: population states are observable
  I Platform coordinates agents to learn the NE: the population state distribution is observable
  I Example: routing recommendation
  I Our framework also works under general policies
Local Policy vs General Policy
For the representative agent,
I Local policy: π(s) : X → P(A)
  I Individual agent learns the NE: the population state distribution is not observable
  I Example: sequential ad auction
I General (non-local) policy: π(s, µ) : X × P(X) → P(A)
  I Individual agent learns the NE: population states are observable
  I Platform coordinates agents to learn the NE: the population state distribution is observable
  I Example: routing recommendation
  I Our framework also works under general policies
“Small parameter” condition
Assumption 1
There exists a constant d1 ≥ 0 such that for any L, L′ ∈ {P(S × A)}_{t=0}^∞,
      D(Γ1(L), Γ1(L′)) ≤ d1 W1(L, L′),
where
      D(π, π′) := sup_{s∈S} W1(π(s), π′(s)),    W1(L, L′) := sup_{t∈N} W1(L_t, L′_t),
and W1 is the l1-Wasserstein distance.

Assumption 2
There exist constants d2, d3 ≥ 0 such that for any admissible policy sequences π, π1, π2 and joint distribution sequences L, L1, L2,
      W1(Γ2(π1, L), Γ2(π2, L)) ≤ d2 D(π1, π2),
      W1(Γ2(π, L1), Γ2(π, L2)) ≤ d3 W1(L1, L2).

Assume d1 d2 + d3 < 1.