TRANSCRIPT
Reinforcement Learning China Summer School
Multi-Agent Reinforcement Learning: From a Mean-Field Perspective
Renyuan Xu, Mathematical Institute, University of Oxford
August 8, 2020
A Toy Example: When Does the RLChina Lecture Start? (Gueant, Lasry, and Lions, 2011)
I The RLChina lecture is scheduled to start at t = 7 PM but often starts some time later than t, say at T
I The actual starting time T depends on the arrivals of the participants
I A rule is imposed: the lecture starts at time t, or once 90% of the participants have arrived, whichever is later
I Very large number of participants: we agree to consider them as a continuum of agents
I Question: What is the actual starting time T?
A Toy Example: When Does the RLChina Lecture Start?
I τ̄_i: time at which participant i decides to arrive
I τ_i: time at which participant i actually arrives:
      τ_i = τ̄_i + σ_i ε_i
  where σ_i ε_i is the uncertainty, σ_i ∼ m_0, ε_i ∼ N(0, 1)
A Toy Example: When Does the RLChina Lecture Start?
Each participant makes a decision by minimizing the expectation of the total cost
      c(t, T, τ_i) = c1(t, T, τ_i) + c2(t, T, τ_i) + c3(t, T, τ_i)
I Lateness compared to the scheduled time t:    c1(t, T, τ_i) = α [τ_i − t]+
I Lateness compared to the actual starting time T:    c2(t, T, τ_i) = β [τ_i − T]+
I Waiting time:    c3(t, T, τ_i) = γ [T − τ_i]+
A Toy Example: When Does the RLChina Lecture Start?
Resolution:
1. Anticipate an actual starting time T and solve for an optimal target time τ̄_i:
      α N((τ̄_i − t)/σ_i) + (β + γ) N((τ̄_i − T)/σ_i) = γ,
   where N is the cumulative distribution function of a standard normal.
2. From τ̄_i, together with the noise and the rule, find the corresponding actual starting time T*:
   I "Continuum of participants": 90% quantile of a distribution
   I N participants: order statistics (intractable)
3. Show that the mapping Γ : T ↦ T* has a fixed point:
   I Banach fixed point theorem
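To make the three steps concrete, here is a minimal numerical sketch of the fixed-point computation. The common σ for all participants and the cost weights α, β, γ below are illustrative assumptions, not values from the lecture; the normal CDF and its inverse are computed by elementary bisection.

```python
# A minimal sketch of the fixed-point iteration T -> T* for the toy example.
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0, tol=1e-10):
    # Inverse standard-normal CDF by bisection.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

def optimal_target_time(T, t, alpha, beta, gamma, sigma, lo=-24.0, hi=48.0, tol=1e-10):
    # Step 1: solve alpha*N((tau-t)/sigma) + (beta+gamma)*N((tau-T)/sigma) = gamma for tau.
    def g(tau):
        return (alpha * norm_cdf((tau - t) / sigma)
                + (beta + gamma) * norm_cdf((tau - T) / sigma) - gamma)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def Gamma(T, t, alpha, beta, gamma, sigma, quorum=0.9):
    # Steps 1-2: anticipated T -> optimal target time -> actual starting time T*.
    tau_bar = optimal_target_time(T, t, alpha, beta, gamma, sigma)
    # With a continuum of identical participants, actual arrivals are tau_bar + sigma*eps,
    # so the 90% quantile is tau_bar + sigma * z_{0.9}; the rule caps it below by t.
    return max(t, tau_bar + sigma * norm_quantile(quorum))

if __name__ == "__main__":
    t, alpha, beta, gamma, sigma = 19.0, 1.0, 2.0, 1.0, 0.25  # 7 PM = 19:00; illustrative weights
    T = t
    for _ in range(50):  # Step 3: iterate Gamma until the fixed point T* is reached
        T = Gamma(T, t, alpha, beta, gamma, sigma)
    print(f"Equilibrium starting time T* = {T:.3f}")
```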
Schedule
Mean-field approximation for MARL with large population
I (45 mins) Non-cooperative games ⇒ mean-field game
I (45 mins) Cooperative games ⇒ mean-field control
Learning in Mean-field Games
Motivation: A Sequential Auction Game
Figure: An Overview of Taobao Display Advertising System.
Motivation: A Sequential Auction Game
Ad auction problem for advertisers:
I Ad auction: a stochastic game on an ad exchange platform among a large number of players (the advertisers)
I Environment: in each round, a web user requests a page, and then a (Vickrey-type) second-price auction is run to incentivize advertisers to bid for a slot to display an advertisement
I Characteristics:
  I partial information (unknown conversion of clicks)
  I large population

Question: From an individual bidder's perspective, how to bid in this sequential game with a large population of competing bidders and unknown distributions of the conversion of clicks/rewards?
Motivation: A Sequential Auction Game
Attempt: the simultaneous-learning-and-decision-making problem (i.e., reinforcement learning) in a sequential auction with a large number of homogeneous bidders.

I Full model approach: solve it as an N-player reinforcement learning problem
  I Curse of "many agents" when N is large:
    I intractable interactions at the individual level
    I computational complexity grows exponentially with respect to the number of players
I Structure-dependent policies: Lauer and Riedmiller (2000), Qu, Wierman and Li (2019)
  I provable algorithms for two-player games
  I no theoretical guarantee for general non-zero-sum MARL
Motivation: A Sequential Auction Game

Attempt: the simultaneous-learning-and-decision-making problem (i.e., reinforcement learning) in a sequential auction with a large number of homogeneous bidders (homogeneity is what mean-field games exploit).

I Approximation approach: mean-field approximation
  I When N is large, consider instead the "aggregated" version of the N-player game by letting N → ∞
  I A game of small interacting individuals, with each player choosing an optimal strategy in view of the macroscopic information (the mean field)
  I By the (functional) SLLN, the aggregated version becomes an "approximation" of the N-player game
  I The aggregated version, the mean-field approximation, is analytically feasible and easy to learn
Other Applications: HFT and Crowded Shorts
Figure: (a) High-frequency trading. (b) Crowded shorts.
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
N-player Games

N-player game:
      V^i(s, π) := E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
      subject to s^i_{t+1} ∼ P^i(s_t, a_t),  a^i_t ∼ π^i_t(s_t).

I N players, state space S, action space A
I s_t = (s^1_t, ..., s^N_t) ∈ S^N is the state profile
I a_t = (a^1_t, ..., a^N_t) ∈ A^N is the action profile
I r^i is the reward function and P^i is the transition kernel of player i
I (s_t, a_t) → s_{t+1} under the joint kernel P
I admissible policy π^i : S^N → P(A), with P(A) the space of all probability measures over A
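For concreteness, here is a minimal sketch of one transition of such an N-player game under assumed tabular ingredients; the container names (policies, transition, reward) are illustrative and not from the slides.

```python
# One step of the N-player stochastic game: each player acts on the full state
# profile, receives an individual reward, and transitions via its own kernel.
import numpy as np

def one_step(state_profile, policies, transition, reward, rng):
    """policies[i](profile) and transition[i](profile, actions) return probability vectors."""
    N = len(state_profile)
    actions = tuple(rng.choice(len(dist), p=dist)
                    for dist in (policies[i](state_profile) for i in range(N)))
    rewards = [reward[i](state_profile, actions) for i in range(N)]
    next_profile = tuple(rng.choice(len(dist), p=dist)
                         for dist in (transition[i](state_profile, actions) for i in range(N)))
    return next_profile, actions, rewards
```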
From N-player Game to MFG

N-player games:
      maximize_{π^i}  V^i(s, π) := E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
      subject to s^i_{t+1} ∼ P^i(s_t, a_t),  a^i_t ∼ π^i_t(s_t)

I The mean-field approximation will not work for all N-player games
I Need assumptions to invoke the law of large numbers and the theory of propagation of chaos

Assume
1. Homogeneity: agents are identical, indistinguishable and interchangeable
2. Weak interactions: player i depends on the other agents only through the empirical measures
      (µ^{-i}_t(·), α^{-i}_t(·)) := ( (1/(N−1)) ∑_{j≠i} 1(s^j_t = ·),  (1/(N−1)) ∑_{j≠i} 1(a^j_t = ·) )
   (see the sketch after this slide)
3. Local policy: a^i_t ∼ π^i_t(s^i_t) here
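A small sketch of the "weak interaction" statistics in item 2: the empirical state and action distributions seen by player i (illustrative code, not from the slides).

```python
# Empirical state/action distributions of everyone except player i.
import numpy as np

def empirical_measures(states, actions, i, n_states, n_actions):
    """states, actions: length-N integer arrays at time t."""
    others = np.arange(len(states)) != i
    mu = np.bincount(states[others], minlength=n_states) / (len(states) - 1)
    alpha = np.bincount(actions[others], minlength=n_actions) / (len(actions) - 1)
    return mu, alpha  # player i's reward/transition depend on the others only via (mu, alpha)
```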
From N-player Game to MFG

A smaller class of N-player games:
      V^i(s^i, µ^{-i}, π^i, α^{-i}) := E[ ∑_{t≥0} γ^t r^i(s^i_t, µ^{-i}_t, π^i_t, α^{-i}_t) | (s^i_0, µ^{-i}_0, α^{-i}_0) = (s^i, µ^{-i}, α^{-i}) ]
      subject to s^i_{t+1} ∼ P^i(s^i_t, µ^{-i}_t, π^i_t, α^{-i}_t),  a^i_t ∼ π^i_t(s^i_t)

I Homogeneous agents ⇒ it suffices to look at a representative agent
I A game between agent i and the empirical measure of the other agents (µ^{-i}, α^{-i})

When the number of players goes to infinity, view the limit of
      (µ^{-i}_t(·), α^{-i}_t(·)) := ( (1/(N−1)) ∑_{j≠i} 1(s^j_t = ·),  (1/(N−1)) ∑_{j≠i} 1(a^j_t = ·) )
as the population state-action joint distribution L_t
General MFG
MFG (Representative agent)
      V(s, π, {L_t}_{t=0}^∞) := E[ ∑_{t≥0} γ^t r(s_t, a_t, L_t) | s_0 = s ]
      subject to s_{t+1} ∼ P(s_t, a_t, L_t),  a_t ∼ π_t(s_t)

I s_t ∈ S and a_t ∈ A are the state and action of a representative agent at time t
I r is the reward function, P is the transition dynamics
I L_t ∈ ∆^{|S||A|} is the population state-action distribution at time t, with state marginal µ_t and action marginal α_t
I admissible policy π_t : S → P(A) here
Nash Equilibrium in GMFGs (parallel to the definition of NE for the N-player game)
Denote
I Policy profile π = {π_t}_{t=0}^∞
I Population profile L = {L_t}_{t=0}^∞

Definition (NE for GMFGs)
In GMFGs, a policy-population pair (π*, L*) is called an NE if
1. (Representative agent side) Fix L*; for any policy π and any initial state s ∈ S,
      V(s, π*, L*) ≥ V(s, π, L*).
2. (Population side) P_{(s_t, a_t)} = L*_t for all t ≥ 0 (the law of (s_t, a_t) matches L*_t), where {s_t, a_t}_{t=0}^∞ is the dynamics under the policy π*, with a_t ∼ π*_t(s_t), s_{t+1} ∼ P(·|s_t, a_t, L*_t).
Nash Equilibrium in GMFG: Parallel Definitions

Definition (NE for GMFGs)
In GMFGs, a policy-population pair (π*, L*) is called an NE if
1. (Representative agent side) Fix L*; for any policy π and any initial state s ∈ S,
      V(s, π*, L*) ≥ V(s, π, L*).
2. (Population side) P_{(s_t, a_t)} = L*_t for all t ≥ 0, where {s_t, a_t}_{t=0}^∞ is the dynamics under the policy π*, with a_t ∼ π*_t(s_t), s_{t+1} ∼ P(·|s_t, a_t, L*_t).

Definition (NE for N-player games)
In the N-player game, a policy profile π* is called an NE if
1. (One agent side) Fix π*,−i; for any policy π^i and any initial state s ∈ S^N,
      V^i(s, π*) ≥ V^i(s, (π*,−i, π^i)).
2. (Population side) Condition 1 holds for all agents.
Nash Equilibrium in GMFG: Stationary Solution
(Definition of NE for GMFGs as above.)

I Stationary solution: there exists (L*, π*) independent of time
  I For single-agent RL: an optimal stationary policy always exists for the infinite-horizon problem
  I For games: this is more difficult due to the competition
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I General policy-based and value-based algorithms
Fixed Point / Three-Step Approach (model fully known)

I Step 1 (Γ1): given L, solve the stochastic control problem to get π*_L:
      maximize_π  V(s, π | L) := E[ ∑_{t≥0} γ^t r(s_t, a_t | L) | s_0 = s ]
      subject to s_{t+1} ∼ P(s_t, a_t, L)
  Write Γ1(L) = π*_L.
I Step 2 (Γ2): given π*_L, update L for one time step following the dynamics to get L′:
      Γ2(L, π*_L) = L′
I Step 3: check whether L′ matches L, and repeat:
      Γ2 ∘ Γ1(L) = L′

Remark: well-definedness; algorithm design; convergence analysis.
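A minimal tabular sketch of this three-step approach when the model is fully known. The helper make_P(L) (returning the transition tensor induced by a fixed population L) and the reward function r(s, a, L) are illustrative assumptions; the inner solver is plain value iteration.

```python
# Three-step fixed-point approach for a finite-state, finite-action MFG.
import numpy as np

def best_response(L, make_P, r, gamma, n_iter=500):
    """Step 1 (Gamma_1): value iteration for the MDP induced by a fixed population L."""
    P = make_P(L)                          # shape (S, A, S)
    S, A = P.shape[0], P.shape[1]
    R = np.array([[r(s, a, L) for a in range(A)] for s in range(S)])
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        Q = R + gamma * P @ Q.max(axis=1)  # Bellman optimality backup
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0   # greedy policy (unique maximizer assumed)
    return pi

def population_update(L, pi, make_P):
    """Step 2 (Gamma_2): push the state-action distribution forward one step under pi."""
    P = make_P(L)
    mu_next = np.einsum("sa,sap->p", L, P)     # next state marginal
    return mu_next[:, None] * pi               # L'(s', a') = mu'(s') * pi(a'|s')

def mean_field_equilibrium(L0, make_P, r, gamma, tol=1e-8, max_outer=1000):
    """Step 3: iterate Gamma_2 . Gamma_1 until L is (approximately) a fixed point."""
    L = L0
    for _ in range(max_outer):
        pi = best_response(L, make_P, r, gamma)
        L_next = population_update(L, pi, make_P)
        if np.abs(L_next - L).sum() < tol:
            return pi, L_next
        L = L_next
    return pi, L
```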
Existence and Uniqueness
Theorem 1 (Guo, Hu, X. & Zhang, 2019)
1. Under some "small parameter" conditions, Γ2 ∘ Γ1 is contractive;
2. For any GMFG, if Γ2 ∘ Γ1 is contractive, then there exists a unique stationary NE. In addition, the three-step approach converges.

I Uniqueness is in the sense of L
I There are explicit model conditions that guarantee the "small parameter" conditions (see the appendix)
I Parallel results hold for both stationary and non-stationary MFGs
(Partial) References on MFGs (model fully known)

I Existence:
  I PDE approach: Huang, Malhame & Caines (2006), Lasry & Lions (2006)
  I Probabilistic approach: Carmona & Delarue (2013), Carmona & Lacker (2014)
  I Weak solution approach: Lacker (2015)
I Uniqueness:
  I Small parameter condition: Huang, Malhame & Caines (2006)
  I Monotonicity condition: Lasry & Lions (2006)
I Convergence (value/policy): Lacker (2015, 2018), Fischer (2017), Cardaliaguet, Delarue, Lasry, and Lions (2019)
I Reachability with multiple equilibria: Cecchin, Fischer, and Pelino (2018), Delarue and Tchuendom (2018), Nutz, Martin, and Tan (2019)
I Tutorials: Gueant, Lasry, and Lions (2011); Daniel Lacker (IPAM Notes, 2018); Carmona & Delarue (two-volume book)
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
Bridge MFG with RL: Finding NE
Three-step approach revisited:
I Step 1: given L, solve the stochastic control problem to get π*_L:
      maximize_π  V(s, π, L) := E[ ∑_{t≥0} γ^t r(s_t, a_t, L) | s_0 = s ],
      subject to s_{t+1} ∼ P(s_t, a_t, L)
I Step 2: given π*_L, update L for one time step following the dynamics to get L′
I Step 3: check whether L′ matches L
Bridge MFG with RL: Finding NE
Three-step approach revisited (when P and the distribution of r are unknown):

I Step 1 [Inner iteration]: given L, perform Q-learning with transition P_L(s′|s, a) := P(s′|s, a, L) and reward r_L(s, a) := r(s, a, L) ⇒ Q*_L(s, a)
I Step 2 [Outer iteration]: given π*_L = argmax-e(Q*_L(s, ·)), update L for one time step following the dynamics to get L′
I Step 3: check whether L′ matches L

Remark: π*_L(s) ∈ argmax_a Q*_L(s, a). When the argmax is non-unique, replace it with argmax-e, which assigns equal probability to all maximizers.
Naive RL Algorithm for GMFG
Algorithm 1: Naive Q-learning for GMFGs
1: Input: initial population state-action pair L_0
2: for k = 0, 1, ... do
3:   Given L_k, perform Q-learning to find the Q-function Q*_k(s, a) = Q*_{L_k}(s, a) of an MDP with dynamics P_{L_k}(s′|s, a) and reward distribution R_{L_k}(s, a).
4:   Solve π_k ∈ Π with π_k(s) = argmax-e(Q*_k(s, ·)).
5:   Sample s ∼ µ_k, where µ_k is the population state marginal of L_k, and obtain L_{k+1} from G(s, π_k, L_k).
6: end for

Line 3 (inner Q-learning): fix L and iterate
      Q^{t+1}_L(s, a) ← Q^t_L(s, a) + β_t(s, a) [ r(s, a, L) + γ max_{a′} Q^t_L(s′, a′) − Q^t_L(s, a) ],
so that Q^t_L → Q*_L, with
      Q*_L(s, a) := r_L(s, a) + γ ∑_{s′∈S} P_L(s′|s, a) V*_L(s′).
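A minimal sketch of this naive loop. The simulator env_step(s, a, L) returning (s′, r) and the population update map G (simplified here to take (π, L) directly rather than G(s, π_k, L_k)) are assumptions for illustration.

```python
# Naive GMF-Q loop: plain tabular Q-learning inside, argmax-e outside.
import numpy as np

def q_learning_fixed_L(L, env_step, S, A, gamma, episodes=5000, lr=0.1, eps=0.1, rng=None):
    """Inner iteration: ordinary tabular Q-learning for the MDP frozen at population L."""
    rng = rng or np.random.default_rng()
    Q = np.zeros((S, A))
    s = int(rng.integers(S))
    for _ in range(episodes):
        a = int(rng.integers(A)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = env_step(s, a, L)
        Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

def naive_gmfg_loop(L0, env_step, G, S, A, gamma, outer=100):
    L = L0
    for _ in range(outer):
        Q = q_learning_fixed_L(L, env_step, S, A, gamma)
        pi = np.zeros((S, A))
        pi[np.arange(S), Q.argmax(axis=1)] = 1.0   # argmax-e (ties ignored in this sketch)
        L = G(pi, L)                                # population update under pi
    return Q, pi, L
```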
Failure of the Naive Algorithm
Failure examples:
Figure: Fluctuations of the naive algorithm (30 sample paths). (c) Fluctuation in l∞, |L_k − L_{k+1}|_∞, over outer iterations. (d) Fluctuation in l1, |L_k − L_{k+1}|_1, over outer iterations.
Problems in the Naive Algorithm: Approximation Errors
Algorithm 1: Naive Q-learning for GMFGs
1: Input: initial population state-action pair L_0
2: for k = 0, 1, ... do
3:   Perform Q-learning to find the Q-function Q*_k(s, a) = Q*_{L_k}(s, a)   [impossible exactly]
     of an MDP with dynamics P_{L_k}(s′|s, a) and reward distribution R_{L_k}(s, a).
4:   Solve π_k ∈ Π with π_k(s) = argmax-e(Q*_k(s, ·)).   [unstable]
5:   Sample s ∼ µ_k, where µ_k is the population state marginal of L_k, and obtain L_{k+1} from G(s, π_k, L_k).   [unstable]
6: end for

I Inner iteration: the approximation error must be controlled
I argmax-e: not continuous
I The update from L_k to L_{k+1} is not controlled
Instability of argmax-e: Magnifying the Approximation Errors

argmax-e is not continuous:
I x = (1, 1), then argmax-e(x) = (1/2, 1/2)
I y = (1, 1 − ε), then for any ε > 0, argmax-e(y) = (1, 0)
I ‖argmax-e(x) − argmax-e(y)‖_2 / ‖x − y‖_2 = 1/(√2 ε)
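A tiny numerical illustration of this discontinuity, and of how a softmax on the same inputs changes smoothly (the helper definitions below are assumptions for illustration).

```python
# argmax-e jumps under an arbitrarily small perturbation; softmax does not.
import numpy as np

def argmax_e(x):
    m = (x == x.max())
    return m / m.sum()            # equal probability on all maximizers

def softmax_c(x, c=1.0):
    z = np.exp(c * (x - x.max()))
    return z / z.sum()

eps = 1e-3
x, y = np.array([1.0, 1.0]), np.array([1.0, 1.0 - eps])
print(argmax_e(x), argmax_e(y))              # [0.5 0.5] vs [1. 0.]: a jump of size ~0.7
print(softmax_c(x, c=5), softmax_c(y, c=5))  # nearly identical outputs
```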
Stable Algorithm for GMFG (GMF-Q)
Algorithm 2: Q-learning for GMFGs (GMF-Q)
1: Input: initial L_0, tolerance ε > 0
2: for k = 0, 1, ... do
3:   Perform Q-learning for T_k iterations to find the approximate Q-function Q̂*_k(s, a) ≈ Q*_{L_k}(s, a) of an MDP with dynamics P_{L_k}(s′|s, a) and reward distribution R_{L_k}(s, a).
4:   Compute π_k ∈ Π with π_k(s) = softmax_c(Q̂*_k(s, ·)).
5:   Sample s ∼ µ_k, where µ_k is the population state marginal of L_k, and obtain L̃_{k+1} from G(s, π_k, L_k).
6:   Find L_{k+1} = Proj_{S_ε}(L̃_{k+1}).
7: end for

1. T_k: carefully chosen for the PAC guarantee
2. argmax-e → softmax: softmax_c(x)_i = exp(c x_i) / ∑_{j=1}^n exp(c x_j)
3. Projection onto S_ε: an ε-net (finite cover) of the space of L
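A minimal sketch of this stabilized outer loop, reusing q_learning_fixed_L and softmax_c from the sketches above; the finite ε-net (a list of candidate distributions L) and the update map G are assumptions.

```python
# GMF-Q: smoothed policy extraction plus projection onto a finite epsilon-net.
import numpy as np

def project_to_net(L, epsilon_net):
    # Projection onto the epsilon-net: the closest element in l1 distance.
    dists = [np.abs(L - C).sum() for C in epsilon_net]
    return epsilon_net[int(np.argmin(dists))]

def gmf_q(L0, env_step, G, epsilon_net, S, A, gamma, c=5.0, outer=100, Tk=5000):
    L = L0
    for _ in range(outer):
        Q = q_learning_fixed_L(L, env_step, S, A, gamma, episodes=Tk)  # inner iteration
        pi = np.apply_along_axis(softmax_c, 1, Q, c)                    # smoothed policy
        L = project_to_net(G(pi, L), epsilon_net)                       # stabilized update
    return Q, pi, L
```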
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
Convergence and Complexity of GMF-Q
Theorem 2 (Guo, Hu, X. & Zhang, 2019)
Given the same assumptions as in the existence and uniqueness theorem, for any specified tolerances ε, δ > 0, set T_k, c and S_ε appropriately. Then with probability at least 1 − 2δ, W_1(L_{K_ε}, L*) = O(ε), and the total number of iterations T = ∑_{k=0}^{K_ε−1} T_k is bounded by
      T = O( K_ε^{19/3} (log(K_ε/δ))^{41/3} ).
Here K_ε := ⌈2 max{ (ηε)^{−1/η}, log_d( ε / max{diam(S) diam(A), 1} ) + 1 }⌉ is the number of outer iterations, and W_1 is the l1-Wasserstein distance.
Outline
I When does the mean-field approximation work?
  I Model set-up for mean-field games (MFG)
  I Existence and uniqueness of MFG solutions
I How to apply mean-field theory in MARL algorithmic design?
  I Q-learning for GMFG
  I Smoothing and stabilizing techniques
  I Convergence and complexity analysis
  I More general results: policy-based and value-based algorithms
More General Results (Guo, Hu, X., Zhang, 2020)

Key take-away: "three-step approach + smoothing + stabilizing" provides a meta-framework for learning MFGs:

I For inner iterations, any single-agent RL algorithm with a finite-sample bound can be used
  I Value-based algorithms: Q-learning
  I Policy-based algorithms: Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Conservative Policy Iteration (CPI)
I For outer iterations, different choices of smoothing method
  I Boltzmann (softmax)
  I MellowMax: MM_c(x) = log( (1/n) ∑_{i=1}^n exp(c x_i) ) / c
  I Momentum: a linear combination of L_t and L_{t−1}
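A small sketch of these smoothing choices; the MellowMax operator follows the formula on the slide, and the momentum mixing weight lam is an illustrative assumption.

```python
# Smoothing choices for the outer iteration of the meta framework.
import numpy as np

def boltzmann(q_row, c=1.0):
    z = np.exp(c * (q_row - q_row.max()))
    return z / z.sum()

def mellowmax(q_row, c=1.0):
    # MM_c(x) = log(mean(exp(c * x))) / c, computed with a max-shift for stability.
    m = q_row.max()
    return (np.log(np.mean(np.exp(c * (q_row - m)))) + c * m) / c

def momentum_update(L_prev, L_new, lam=0.5):
    # Stabilize the population update by averaging consecutive iterates.
    return lam * L_new + (1.0 - lam) * L_prev
```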
(Partial) Related Literature on Learning MFG
I Mean-field multi-agent reinforcement learning: Yang, Luo, Li, Zhou, Zhang, and Wang (2018)
  I First paper applying the mean-field approximation to MARL
  I Interaction through actions
I Model-based framework (linear-quadratic): Fu, Yang, Chen, Wang (2019)
I Local NE: Subramanian and Mahajan (2019)
I Deep deterministic policy gradient (continuous actions): Elie, Perolat, Lauriere, Geist, Pietquin (2019)
I Variants of Q-learning algorithms:
  I Fitted Q: Berkay Anahtarcı, Can Deha Karıksız, Naci Saldi (2019)
  I Regularized Q: Berkay Anahtarcı, Can Deha Karıksız, Naci Saldi (2020)
Learning Mean-Field Controls
Motivating Example: Ride-sharing Order Dispatch
Motivating Example: Ride-sharing Order Dispatch
Real-time (driver, order)-pair matching to maximize the combined rewards:
I Rider: short waiting time
I Driver:
  I short-term reward ∼ distance(origin, destination)
  I long-term reward
I Global supply-demand balance
————— Li, Qin, Jiao, Yang, Gong, Wang, Wang, Wu and Ye (2019)
Motivating Example: Ride-sharing Order Dispatch
Key features:
I Cooperative game: optimization from the platform's perspective
I Large population: N and M are large
I Interaction with a (partially) unknown environment: traffic conditions, stochastic and time-dependent demand, etc.
I Demand for (fast) real-time resolution: short waiting time

Solution: MARL with mean-field approximation (Li, Qin, Jiao, Yang, Gong, Wang, Wang, Wu, Ye, 2019)
I Cooperative order-dispatching algorithm
I Mean-field approximation for local dynamics
I Centralized training with decentralized execution
I Convergence proof of Q-learning
Other Examples
Figure: (a) Data routing. (b) Food delivery. (c) Autonomous driving.
Outline
I Mean-field control formulation
  I N-player cooperative game
  I Pareto optimality condition
  I N → ∞
I Dynamic Programming Principle (DPP) on the probability measure space
I Q-learning algorithm on the probability measure space
Cooperative MARL
At the beginning of each round t, for each agent i:
I State: agent i is at s^i_t ∈ S
I Action: he/she takes an action a^i_t ∈ A such that a^i_t ∼ π^i
I Policy: π^i : S → P(A)
I Reward: receives a reward r^i(s_t, a_t)
I Value function: V^i(s, π) = E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
I Transition: s^i_{t+1} ∼ P^i(s_t, a_t), i.e., the state changes to s^i_{t+1} according to the transition probability function P^i(s_t, a_t)
Pareto Optimality Condition
Pareto optimality
If π* is Pareto optimal, then there does not exist π such that
I V^i(s, π) ≥ V^i(s, π*) for all i and s, and
I there exists j such that V^j(s, π) > V^j(s, π*) for all s.

I No individual can be made better off without making at least one individual worse off
I The central controller's solution is Pareto optimal
Cooperative MARL
Central controller
      V(s, π) := (1/N) ∑_{i=1}^N E[ ∑_{t≥0} γ^t r^i(s_t, a_t) | s_0 = s ]
      subject to s^i_{t+1} ∼ P^i(s_t, a_t), i = 1, 2, ..., N

I Cooperative game ⇒ agreement among agents ⇒ auxiliary central controller
I Central controller:
  I observes the joint state s_t
  I decides a^i_t ∼ π^i_t(s_t) for each agent i from a global perspective
  I observes the joint reward r_t as feedback
  I P and the distribution of r are unknown
I Curse of dimensionality: the vanilla Q-learning algorithm for the central controller requires
      Ω( poly( (|S||A|)^N · (N/ε) · ln(1/(δε)) ) )
  samples.
Mean-field Approximation
Assumptions
I Homogeneity: Agents are identical and interchangeable.
I Weak interaction: each agent depends on all other agents only through the empirical distributions of states and actions.

When N → ∞, this becomes a mean-field control problem.
Mean-field Approximation
Mean-field control
      sup_π V(π, µ) := E[ ∑_{t≥0} γ^t r(s_t, a_t, µ_t) | s_0 ∼ µ, µ_0 = µ ]
      subject to s_{t+1} ∼ P(s_t, a_t, µ_t).

I s_t ∈ S: dynamics of the representative agent with randomized initial state s_0 ∼ µ
I a_t ∈ A: action
I µ_t = Law(s_t) ∈ P(S): limit of µ^N_t(·) = (1/N) ∑_{j=1}^N 1(s^j_t = ·)
I r: reward of the representative agent
I Admissible policy π_t(s, µ) : S × P(S) → P(A)
Comparison between MFG and MFC
————————Carmona, Delarue & Lachapelle (2013)
Outline
I Mean-field control formulation
  I N-player cooperative game
  I Pareto optimality condition
  I N → ∞
I Dynamic Programming Principle (DPP) on the probability measure space
I Q-learning algorithm on the probability measure space
Time Consistency (Dynamic Programming Principle)
Dynamic programming is an umbrella encompassing many algorithms (single-agent RL and MARL):
I Value-based methods: Q-learning
I Policy-based methods: actor-critic algorithms
I Planning: value iteration or policy iteration

Dynamic programming does not hold for free with an infinite number of players: one needs to work with the correct "state" and "action" spaces.
I Pham and Wei (2016), Pham and Wei (2017), and Wu and Zhang (2019) for the value function
Q-function on the probability measure space
Define the Q-function for MFC:
      Q(µ, h) := E[ r(s_0, a_0, µ) | s_0 ∼ µ, a_0 ∼ h ]    (reward of taking a_0 ∼ h)
               + E_{s_1 ∼ P(s_0, a_0, µ)}[ ∑_{t≥1} γ^t r(s_t, a_t, µ_t) | a_t ∼ π*_t ]    (reward of playing optimally afterwards, a_t ∼ π*_t)

I local policy h : S → P(A)
I V*(µ) = max_h Q(µ, h)
Bellman Equation for Integrated Q-function
Theorem 3 (Bellman equation for the integrated Q-function; Gu, Guo, Wei and X., 2019)
For any µ ∈ P(S) and h ∈ H,
      Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

I H := {h : S → P(A)}: set of "local" policies
I Aggregated dynamics:
  I Φ(µ, h)(s′) := ∑_{s,a} P(s′ | s, a, µ) µ(s) h(s, a)
  I µ_{t+1} = Φ(µ_t, h): distribution at time t + 1 (flow property)
I Aggregated reward: r(µ, h) := ∑_{s,a} µ(s) h(s, a) r(s, a, µ)
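A minimal tabular sketch of the aggregated dynamics Φ and the aggregated reward r(µ, h) appearing in the Bellman equation. Here P is assumed to be the (S, A, S)-shaped transition table already evaluated at the given µ, and r is given as an (S, A) table at µ; both are illustrative conventions.

```python
# Aggregated dynamics and reward for the lifted (measure-valued) MFC problem.
import numpy as np

def aggregated_dynamics(mu, h, P):
    """Phi(mu, h)(s') = sum_{s,a} P(s'|s,a,mu) * mu(s) * h(s,a)."""
    joint = mu[:, None] * h                  # joint state-action distribution, shape (S, A)
    return np.einsum("sa,sap->p", joint, P)  # next state distribution mu_{t+1}

def aggregated_reward(mu, h, r_table):
    """r(mu, h) = sum_{s,a} mu(s) * h(s,a) * r(s,a,mu)."""
    return float(np.sum(mu[:, None] * h * r_table))
```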
Bellman Equation for Integrated Q-function
Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

I H is the minimal space on which the Bellman equation holds
I Similar results hold for the finite-time horizon, for the Bellman equation of the value function, and for the formulation that includes the law of actions
Outline
I Mean-field control formulation
  I N-player cooperative game
  I Pareto optimality condition
  I N → ∞
I Dynamic Programming Principle (DPP) on the probability measure space
I Q-learning algorithm on the probability measure space
Q-learning Algorithm on Probability Measure Space
Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

Properties of Q(µ, h):
I defined on the probability measure space
I continuous µ and h, even when s and a are discrete
I deterministic dynamics: µ_{t+1} = Φ(µ_t, h)

⇒ a Q-learning algorithm with continuous state-action spaces and deterministic dynamics
Q-learning Algorithm on Probability Measure Space
Q(µ, h) = r(µ, h) + γ sup_{h′∈H} Q(Φ(µ, h), h′).

I Continuous state space: kernel regression
I Continuous action space: optimize only over a finite action space
I Deterministic dynamics: exploit this to reduce the sample complexity
ε-net and kernel regression
Figure: (d) ε-net. (e) Kernel regression.

I C_ε: ε-net on C := P(S) × H
I H_ε: induced ε-net on H
I Kernel regression on C_ε
Approximated Bellman Operator
The original Bellman operator B : R^C → R^C for the problem is
      (B q)(c) := r(c) + γ sup_{h∈H} q(Φ(c), h),    c ∈ C := P(S) × H.

To facilitate the algorithm design, we introduce an approximated Bellman operator B_K : R^{C_ε} → R^{C_ε} such that
      (B_K q)(c_i) = r(c_i) + γ max_{h∈H_ε} Γ_K q(Φ(c_i), h),
where Γ_K is the kernel regression operator.
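A minimal sketch of B_K with a kernel-regression interpolator on the ε-net. The Gaussian kernel, its bandwidth, and the representation of the net as a list of (µ_i, h_i) grid points are illustrative assumptions.

```python
# Approximated Bellman operator B_K on a finite epsilon-net of C = P(S) x H.
import numpy as np

GAMMA = 0.9  # illustrative discount factor

def kernel_weights(c, net, bandwidth=0.1):
    """Gamma_K: normalized kernel weights of an arbitrary point c = (mu, h) on the net points."""
    d = np.array([np.abs(c[0] - mu_i).sum() + np.abs(c[1] - h_i).sum() for mu_i, h_i in net])
    w = np.exp(-(d / bandwidth) ** 2)
    return w / w.sum()

def apply_BK(q_values, net, H_eps, Phi, r_agg):
    """One application of B_K to a vector of q-values indexed by the net points."""
    new_q = np.empty_like(q_values)
    for i, (mu_i, h_i) in enumerate(net):
        mu_next = Phi(mu_i, h_i)
        # Interpolate q at (mu_next, h) via kernel regression, then maximize over H_eps.
        best = max(float(kernel_weights((mu_next, h), net) @ q_values) for h in H_eps)
        new_q[i] = r_agg(mu_i, h_i) + GAMMA * best
    return new_q
```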
Sketch of the Algorithm
The algorithm consists of two steps:
1. Get sampled data on an ε-net
  I Initialize µ, an ε/2-net C_{ε/2}, and an exploration policy π taking actions from H_{ε/2}
  I At the current state µ_t, act h_t according to π, observe µ_{t+1} = Φ(µ_t, h_t) and r_t = r(µ_t, h_t)
2. Find the fixed point of B_K:
      q_{l+1}(µ_t, h_t) ← r(µ_t, h_t) + γ max_{h∈H_ε} Γ_K q_l(Φ(µ_t, h_t), h)

Comment: visiting C_{ε/2} once is enough ⇒ low sample complexity and efficient.
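A minimal sketch of the resulting loop, reusing apply_BK and kernel_weights from the sketch above. Since each net point only needs to be visited once, the tables Phi and r_agg stand in for the sampled transitions and rewards here, an illustrative simplification.

```python
# MFC-K-Q style loop: iterate the approximated Bellman operator on the epsilon-net,
# then read off a greedy local policy from the kernel-interpolated q-values.
import numpy as np

def mfc_k_q(net, H_eps, Phi, r_agg, n_sweeps=200):
    q = np.zeros(len(net))
    for _ in range(n_sweeps):
        q = apply_BK(q, net, H_eps, Phi, r_agg)   # fixed-point iteration of B_K
    def greedy_h(mu):
        # Greedy local policy at mu: maximize the kernel-interpolated q over H_eps.
        return max(H_eps, key=lambda h: float(kernel_weights((mu, h), net) @ q))
    return q, greedy_h
```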
Convergence and Complexity Results
Theorem (Gu, Guo, Wei, X., 2020)
Under mild assumptions, for any ε′ > 0, under the ε′-greedy policy, with probability 1 − δ, for any initial state distribution µ and ε-net, after
      Ω( poly( (1/ε) · (1/ε′) · log(1/δ) ) )
samples, MFC-K-Q converges linearly to some function Q_{C_ε}; and the sup distance between Γ_K Q_{C_ε} and Q_C is upper bounded by C·ε, where C is a constant independent of ε, ε′ and δ.

Comment: there are three sources of approximation error:
I kernel regression
I discretized action space
I estimated data (for both dynamics and rewards)
Comparison of Complexity Results
Work                     | MFC / N-player | Sample complexity guarantee
Our work                 | MFC            | Ω(T_cov · log(1/δ))
Carmona et al., 2019     | MFC            | Ω((T_cov · log(1/δ))^l · poly(log(1/(δε))/ε))
Vanilla N-player         | N-player       | Ω(poly((|S||A|)^N · log(1/(δε)) · N/ε))
Qu & Li, 2019            | N-player       | Ω(poly((|S||A|)^{f(log(1/ε))} · log(1/δ) · N/ε))

Table: Comparison of algorithms

I Here T_cov is the covering time of the exploration policy.
I l = max{3 + 1/κ, 1/(1 − κ)} > 4 for some κ ∈ (0.5, 1).
I f(log(1/ε)) is a structure-dependent quantity and can scale as N when agents are not interacting locally.
Mean-Field Theory for Machine Learning

On the theory front
I Theoretical development is (sometimes) easier on the lifted probability measure space
I Addresses the curse of dimensionality

Applications
I Training neural networks: Mei, Montanari, and Nguyen (2018); E, Han and Li (2018); Sirignano and Spiliopoulos (2019); Hu, Ren, Siska, and Szpruch (2019); Bo, Capponi and Liao (2019); Lu, Ma, Lu, Lu, and Ying (2020); Wojtowytsch and E (2020)
I GANs: Cao, Guo, Lauriere (2019); Lin, Fung, Li, Nurbekyan, and Osher (2020); Conforti, Kazeykina, and Ren (2020); Domingo-Enrich, Jelassi, Mensch, Rotskoff, and Bruna (2020)
I MARL: (as mentioned before)
Thank you!
References: Mean-field Games
I Large Population Stochastic Dynamic Games: Closed-Loop McKean-Vlasov Systems and the Nash Certainty Equivalence Principle (Huang, Malhame & Caines, 2006)
I Mean Field Games (Lasry & Lions, 2007)
I Mean Field Games and Applications (Gueant, Lasry & Lions, 2011)
I Probabilistic Theory of Mean Field Games with Applications I & II (Carmona & Delarue, 2018)
I Probabilistic Analysis of Mean-Field Games (Carmona & Delarue, 2013)
I A Probabilistic Weak Formulation of Mean Field Games and Applications (Carmona & Lacker, 2014)
I A General Characterization of the Mean Field Limit for Stochastic Differential Games (Lacker, 2014)
References: Mean-field Games
I On the Convergence of Closed-Loop Nash Equilibria to the Mean Field Game Limit (Lacker, 2018)
I Mean Field Games via Controlled Martingale Problems: Existence of Markovian Equilibria (Lacker, 2018)
I On the Connection Between Symmetric n-Player Games and Mean Field Games (Fischer, 2017)
I The Master Equation and the Convergence Problem in Mean Field Games (Cardaliaguet, Delarue, Lasry, and Lions, 2015)
I On the Convergence Problem in Mean Field Games: A Two State Model without Uniqueness (Cecchin, Pra, Fischer, and Pelino, 2018)
I Selection of Equilibria in a Linear Quadratic Mean-Field Game (Delarue and Tchuendom, 2018)
I Convergence to the Mean Field Game Limit: A Case Study (Nutz, Martin and Tan, 2019)
References: Mean-field Games
I Mean Field Multi-Agent Reinforcement Learning (Yang, Luo, Li, Zhou, Zhang, and Wang, 2018)
I Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games (Fu, Yang, Chen, and Wang, 2019)
I Mean Field Equilibria of Dynamic Auctions with Learning (Iyer, Johari and Sundararajan, 2014)
I Reinforcement Learning in Stationary Mean-Field Games (Subramanian and Mahajan, 2019)
I Approximate Fictitious Play for Mean Field Games (Elie, Perolat, Lauriere, Geist, Pietquin, 2019)
I Fitted Q-Learning in Mean-Field Games (Anahtarcı, Karıksız, Saldi, 2019)
I Q-Learning in Regularized Mean-Field Games (Anahtarcı, Karıksız, Saldi, 2020)
References: Mean-field Controls
I Efficient Ridesharing Order Dispatching with Mean Field Multi-Agent Reinforcement Learning (Li, Qin, Jiao, Yang, Gong, Wang, Wang, Wu and Ye, 2019)
I Dynamic Programming for Optimal Control of Stochastic McKean-Vlasov Dynamics (Pham and Wei, 2017)
I Bellman Equation and Viscosity Solutions for Mean-Field Stochastic Control Problem (Pham and Wei, 2018)
I Viscosity Solutions to Parabolic Master Equations and McKean-Vlasov SDEs with Closed-Loop Controls (Wu and Zhang, 2020)
I Model-Free Mean-Field Reinforcement Learning: Mean-Field MDP and Mean-Field Q-Learning (Carmona, Lauriere and Tan, 2019)
I Scalable Reinforcement Learning of Localized Policies for Multi-Agent Networked Systems (Qu, Wierman, Li, 2020)
References: Mean-field Theory for Deep Learning
I A Mean Field View of the Landscape of Two-Layer Neural Networks (Mei, Montanari, and Nguyen, 2018)
I Mean Field Analysis of Neural Networks: A Law of Large Numbers (Sirignano and Spiliopoulos, 2019)
I Mean-Field Langevin Dynamics and Energy Landscape of Neural Networks (Hu, Ren, Siska and Szpruch, 2020)
I Relaxed Control and Gamma-Convergence of Stochastic Optimization Problems with Mean Field (Bo, Capponi and Liao, 2019)
I A Mean-Field Analysis of Deep ResNet and Beyond: Towards Provable Optimization via Overparameterization from Depth (Lu, Ma, Lu, Lu, and Ying, 2020)
I Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective (Wojtowytsch and E, 2020)
I Connecting GANs, MFGs, and OT (Cao, Guo and Lauriere, 2020)
I APAC-Net: Alternating the Population and Agent Control via Two Neural Networks to Solve High-Dimensional Stochastic Mean Field Games (Lin, Fung, Li, Nurbekyan, Osher, 2020)
I A Mean-Field Analysis of Two-Player Zero-Sum Games (Jelassi, Mensch, Rotskoff, and Bruna, 2020)
References
I Guo, Hu, X., and Zhang (2019). Learning Mean-Field Games. NeurIPS 2019.
I Guo, Hu, X., and Zhang (2020). A General Framework for Learning Mean-Field Games. https://arxiv.org/pdf/2003.06069.pdf
I Gu, Guo, Wei, X. (2019). Dynamic Programming Principles for Learning Mean-Field Controls. arxiv.org/abs/1911.07314
I Gu, Guo, Wei, X. (2020). Q-Learning Algorithm for Mean-Field Controls, with Convergence and Complexity Analysis. https://arxiv.org/pdf/2002.04131.pdf
Local Policy vs General Policy
For player i,
I Local policy: π^i(s^i) : X → P(A)
  I Individual agent learns the NE: the population state distribution is not observable
  I Example: sequential ad auction
I General (non-local) policy: π^i(s) : X^N → P(A)
  I Individual agent learns the NE: population states are observable
  I Platform coordinates agents to learn the NE: the population state distribution is observable
  I Example: routing recommendation
  I Our framework also works under general policies
Local Policy vs General Policy
For the representative agent,
I Local policy: π(s) : X → P(A)
  I Individual agent learns the NE: the population state distribution is not observable
  I Example: sequential ad auction
I General (non-local) policy: π(s, µ) : X × P(X) → P(A)
  I Individual agent learns the NE: population states are observable
  I Platform coordinates agents to learn the NE: the population state distribution is observable
  I Example: routing recommendation
  I Our framework also works under general policies
“Small parameter” condition
Assumption 1
There exists a constant d1 ≥ 0 such that for any L, L′ ∈ {P(S × A)}_{t=0}^∞,
      D(Γ1(L), Γ1(L′)) ≤ d1 W1(L, L′),
where
      D(π, π′) := sup_{s∈S} W1(π(s), π′(s)),    W1(L, L′) := sup_{t∈N} W1(L_t, L′_t),
and W1 is the l1-Wasserstein distance.

Assumption 2
There exist constants d2, d3 ≥ 0 such that for any admissible policy sequences π, π1, π2 and joint distribution sequences L, L1, L2,
      W1(Γ2(π1, L), Γ2(π2, L)) ≤ d2 D(π1, π2),
      W1(Γ2(π, L1), Γ2(π, L2)) ≤ d3 W1(L1, L2).

Assume d1 d2 + d3 < 1.