
Mean Field Stochastic Games with Discrete States and Mixed Players

Minyi Huang

School of Mathematics and Statistics, Carleton University, Ottawa, ON K1S 5B6, Canada
[email protected]

Abstract. We consider mean field Markov decision processes with a major player and a large number of minor players which have their individual objectives. The players have decoupled state transition laws and are coupled by the costs via the state distribution of the minor players. We introduce a stochastic difference equation to model the update of the limiting state distribution process and solve limiting Markov decision problems for the major player and minor players using local information. Under a solvability assumption of the consistent mean field approximation, the obtained decentralized strategies are stationary and have an ε-Nash equilibrium property.

Keywords: mean field game, finite states, major player, minor player.

1 Introduction

Large population stochastic dynamic games with mean field coupling have attracted substantial interest in recent years; see, e.g., [1,4,11,16,12,13,18,19,22,23,24,26,27]. To obtain low complexity strategies, consistent mean field approximations provide a powerful approach, and in the resulting solution, each agent only needs to know its own state information and the aggregate effect of the overall population, which may be pre-computed off-line. One may further establish an ε-Nash equilibrium property for the set of control strategies [12]. The technique of consistent mean field approximations is also applicable to optimization with a social objective [5,14,23]. The survey [3] on differential games presents a timely report of recent progress in mean field game theory. This general methodology has applications in diverse areas [4,20,27]. The mean field approach has also appeared in anonymous sequential games [17] with a continuum of players individually optimally responding to the mean field. However, the modeling of a continuum of independent processes leads to measurability difficulties, and the empirical frequency of the realizations of the continuum-indexed individual states cannot be meaningfully defined [2].

A recent generalization of the mean field game modeling has been introduced in [10], where a major player and a large number of minor players coexist, pursuing their individual interests. Such interaction models are often seen in economic or engineering settings, simple examples being a few large corporations and many much smaller competitors, or a network service provider and a large number of small users with their respective objectives. An extension of the modeling in [10] to dynamic games with Markovian switches in the dynamics is presented in [25]. The random switches model abrupt changes of the decision environment. Traditionally, game models differentiating vastly different strengths of players have been well studied in cooperative game theory, where static models are usually considered [6,8,9]. Such players with very different strengths are called mixed players.

The linear-quadratic-Gaussian (LQG) model in [10] shows that the presence of the major player causes an interesting phenomenon called the lack of sufficient statistics. More specifically, in order to obtain asymptotic equilibrium strategies, the major player cannot simply use a strategy as a function of its current state and time; a minor player cannot simply use the current states of the major player and itself. To overcome this lack of sufficient statistics for decision, the system dynamics are augmented by adding a new state, which approximates the mean field and is driven by the major player's state. This additional state enters the obtained decentralized strategy of each player and captures the past influence of the major player. The recent work [21] considered minor players parametrized by a continuum, which causes high complexity for the state space augmentation approach, and a backward stochastic differential equation based approach (see, e.g., [28]) was used to deal with the random mean field process. The resulting decentralized strategies are not Markovian.

In this paper, we consider the interaction modeling of a major player and a large number of minor players in the setting of discrete time Markov decision processes (MDPs). Although the major player modeling is conceptually very similar to [10], which considers an LQG game model, the lack of linearity in the MDP context will give rise to many challenges in analysis. Additionally, an important motivation to use the MDP framework is that our method may potentially be applicable to many practical problems. In relation to mean field games with discrete state and action spaces, related work can also be found in [15,23,7,17]; they all consider a population of comparably small decision makers which may be called peers.

A key step in our decentralized control design is to describe the evolution of the mean field, as the distribution of the minor players' states, by a stochastic difference equation driven by the major player's state. Given the above representation of the limiting mean field, we may approximate the original problems of the major player and a typical minor player by limiting MDPs with hybrid state spaces, where the player in question has a finite state space and the mean field process is a continuum evolving on a simplex.

The organization of the paper is as follows. Section 2 formulates the mean field Markov decision game with a major player. Section 3 proposes a stochastic representation of the update of the mean field and analyzes two auxiliary MDPs in the mean field limit. The consistency condition for mean field approximations is introduced in Section 4, and Section 5 shows an asymptotic Nash equilibrium property. Section 6 presents concluding remarks of the paper.


2 The Mean Field Game Model

We adopt the framework of Markov decision processes to formulate the mean field game, which involves a major player $\mathcal{A}_0$ and a large population of minor players $\{\mathcal{A}_i, 1 \le i \le N\}$. The state and action spaces of all players are finite, and denoted by $S_0 = \{1, \ldots, K_0\}$ and $A_0 = \{1, \ldots, L_0\}$, respectively, for the major player. For simplicity, we consider uniform minor players which share common state and action spaces denoted by $S = \{1, \ldots, K\}$ and $A = \{1, \ldots, L\}$, respectively. At time $t \in \mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, the state and action of $\mathcal{A}_j$ are, respectively, denoted by $x_j(t)$ and $u_j(t)$, $0 \le j \le N$. To model the mean field interaction of the players, we denote the random measure process
\[
I^{(N)}(t) = \bigl(I^{(N)}_1(t), \ldots, I^{(N)}_K(t)\bigr), \qquad t \ge 0,
\]
where $I^{(N)}_k(t) = (1/N)\sum_{i=1}^N 1_{(x_i(t)=k)}$. The process $I^{(N)}(t)$ describes the frequency of occurrence of the states in $S$ at time $t$.

For the major player, the state transition law is determined by the stochastic kernel
\[
Q_0(z \mid y, a_0) = P\bigl(x_0(t+1) = z \mid x_0(t) = y,\, u_0(t) = a_0\bigr), \tag{1}
\]
where $y, z \in S_0$ and $a_0 \in A_0$. Following the usual convention in Markov decision processes, the transition probability of the process $x_0$ from $t$ to $t+1$ is solely determined by $x_0(t) = y$ and $u_0(t) = a_0$ observed at $t$, even if additional state and action information before $t$ is known.
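To fix the notation computationally, here is a minimal sketch (not from the paper; the sizes, the randomly generated kernels, and helper names such as `random_kernel` are illustrative assumptions) that stores $Q_0$ and $Q$ as arrays indexed by (current state, action, next state), samples transitions from them, and computes the empirical distribution $I^{(N)}(t)$ of the minor players' states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K0, L0 for the major player; K, L for the minor players.
K0, L0 = 3, 2      # |S0|, |A0|
K, L = 4, 2        # |S|,  |A|

def random_kernel(n_states, n_actions, rng):
    """A random transition kernel P[y, a, z] = P(next = z | state = y, action = a)."""
    P = rng.random((n_states, n_actions, n_states))
    return P / P.sum(axis=2, keepdims=True)

Q0 = random_kernel(K0, L0, rng)   # major player's kernel, cf. (1)
Q = random_kernel(K, L, rng)      # minor players' common kernel, cf. (2)

def step(y, a, kernel, rng):
    """Sample the next state given current state y and action a."""
    return rng.choice(kernel.shape[2], p=kernel[y, a])

def empirical_distribution(x_minor, K):
    """I^(N)(t): fraction of the N minor players in each state of S."""
    return np.bincount(x_minor, minlength=K) / len(x_minor)

# Example: N minor players in (uniformly) random initial states.
N = 1000
x_minor = rng.integers(0, K, size=N)
print(empirical_distribution(x_minor, K))   # a point in the simplex D_K
```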

The one-stage cost of the decision problem of the major player is given by $c_0(x_0, \theta, a_0)$, where θ is the state distribution of the minor players. The infinite horizon discounted cost is
\[
J_0 = E \sum_{t=0}^{\infty} \rho^t c_0\bigl(x_0(t), I^{(N)}(t), u_0(t)\bigr),
\]

where $\rho \in (0,1)$ is the discount factor. The state transition of minor player $\mathcal{A}_i$ is specified by
\[
Q(z \mid y, a) = P\bigl(x_i(t+1) = z \mid x_i(t) = y,\, u_i(t) = a\bigr), \tag{2}
\]
where $y, z \in S$ and $a \in A$. The one-stage cost is $c(x, x_0, \theta, a)$ and the infinite horizon discounted cost is
\[
J_i = E \sum_{t=0}^{\infty} \rho^t c\bigl(x_i(t), x_0(t), I^{(N)}(t), u_i(t)\bigr).
\]

Due to the structure of the costs $J_0$ and $J_i$, the major player has a significant impact on each minor player. By contrast, each minor player has a negligible impact on another minor player or the major player. Also, from the point of view of the major player or a fixed minor player, it does not distinguish other specific individual minor players. Instead, only the aggregate state information $I^{(N)}(t)$ matters at each step, which is an important feature of mean field decision problems.

For the $N+1$ decision processes, we specify the joint distribution as follows. Given the states and actions of all players at time $t$, the transition probability to a value of $(x_0(t+1), x_1(t+1), \ldots, x_N(t+1))$ is simply given by the product of the individual transition probabilities under their respective actions.
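A simulation sketch of this product transition law (illustrative only, continuing the arrays and helpers from the previous snippet; the one-stage costs `c0_fn` and `c_fn` are placeholder functions, not the paper's): one joint step samples every player's next state independently given the current states and actions, and the discounted costs $J_0$ and $J_i$ are accumulated along a sample path, truncated at a finite horizon.

```python
def one_joint_step(x0, x_minor, a0, a_minor, Q0, Q, rng):
    """Sample (x0(t+1), x1(t+1), ..., xN(t+1)); the joint transition probability is
    the product of the individual transition probabilities under the given actions."""
    x0_next = step(x0, a0, Q0, rng)
    x_minor_next = np.array([step(x, a, Q, rng) for x, a in zip(x_minor, a_minor)])
    return x0_next, x_minor_next

# Placeholder one-stage costs, bounded and continuous in theta (illustrative only).
def c0_fn(x0, theta, a0):
    return float(x0) + theta @ np.arange(len(theta)) + 0.1 * a0

def c_fn(x, x0, theta, a):
    return abs(x - x0) + theta[x] + 0.1 * a

def simulate_discounted_costs(T, rho, x0, x_minor, policy0, policy, rng):
    """Truncated discounted costs of the major player and of minor player 1 under
    state-feedback action rules policy0(x0, theta) and policy(x, x0, theta)."""
    J0, J1 = 0.0, 0.0
    for t in range(T):
        theta_N = empirical_distribution(x_minor, K)
        a0 = policy0(x0, theta_N)
        a_minor = np.array([policy(x, x0, theta_N) for x in x_minor])
        J0 += rho ** t * c0_fn(x0, theta_N, a0)
        J1 += rho ** t * c_fn(x_minor[0], x0, theta_N, a_minor[0])
        x0, x_minor = one_joint_step(x0, x_minor, a0, a_minor, Q0, Q, rng)
    return J0, J1

# Example run with trivial constant-action rules.
J0_hat, J1_hat = simulate_discounted_costs(
    T=100, rho=0.9, x0=0, x_minor=x_minor,
    policy0=lambda x0, th: 0, policy=lambda x, x0, th: 0, rng=rng)
print(J0_hat, J1_hat)
```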

For integer $k \ge 2$, denote the simplex
\[
D_k = \Bigl\{ (\lambda_1, \ldots, \lambda_k) \in \mathbb{R}^k_+ \;\Big|\; \sum_{j=1}^{k} \lambda_j = 1 \Bigr\}.
\]

To ensure that the individual costs are finite, we introduce the following assumption.

(A1) The one-stage costs $c_0$ and $c$ are functions on $S_0 \times D_K \times A_0$ and $S \times S_0 \times D_K \times A$, respectively, and they are both continuous in θ. ♦

Remark 1. By the continuity condition in (A1), there exists a fixed constant $C$ such that $|c_0| + |c| \le C$ for all $x_0 \in S_0$, $x \in S$, $a_0 \in A_0$, $a \in A$ and $\theta \in D_K$.

We further assume the following condition on the initial state distribution of the minor players.

(A2) The initial states $x_1(0), \ldots, x_N(0)$ are independent and there exists a deterministic $\theta_0 \in D_K$ such that
\[
\lim_{N \to \infty} I^{(N)}(0) = \theta_0
\]
with probability one. ♦

2.1 The Traditional Approach and Complexity

Denote the so-called $t$-history
\[
h_t = \bigl(x_j(s), u_j(s-1),\ s \le t,\ j = 0, \ldots, N\bigr), \qquad t \ge 1, \tag{3}
\]
and $h_0 = (x_0)$. We may further specify mixed strategies (or policies; we shall use the two names strategy and policy interchangeably) of each player, as probability measures on the action space depending on $h_t$, and use the method of dynamic programming to identify Nash strategies for the mean field game. However, for a large population of minor players, this traditional approach is impractical. First, each player must use centralized information, which causes high complexity in implementation; second, numerically solving the dynamic programming equation is a prohibitive or even impossible task when the number of minor players exceeds a few dozen.


3 The Mean Field Approximation

To overcome the fundamental complexity difficulty, we use the mean field approximation approach. The basic idea is to introduce a limiting process to approximate the random measure process $I^{(N)}(t)$ and solve localized optimization problems for both the major player and a representative minor player.

Regarding the informational requirement in our decentralized strategy design, we assume (i) the limiting distribution $\theta_0$ and the state $x_0(t)$ of the major player are known to all players, (ii) each minor player knows its own state but not the state of any other particular minor player.

We use a process θ(t) with state space $D_K$ to approximate $I^{(N)}(t)$ when $N \to \infty$. Before specifying the rule governing the evolution of θ(t), we give some intuitive explanation. Due to the presence of the major player, the action of each minor player should be affected by $x_0(t)$ and its own state $x_i(t)$, and this causes the correlation of the individual state processes $\{x_i(t), 1 \le i \le N\}$ in the closed-loop system. The resulting process θ(t) should be a random process. We propose the updating rule
\[
\theta(t+1) = \psi\bigl(x_0(t), \theta(t)\bigr), \tag{4}
\]
where $\theta(0) = \theta_0$. The specific form of ψ will be determined by a procedure of consistent mean field approximations. We consider ψ from the following function class
\[
\Psi = \Bigl\{ \phi(i, \theta) = (\phi_1, \ldots, \phi_K) \;\Big|\; \phi_k \ge 0,\ \sum_{k \in S} \phi_k = 1 \Bigr\},
\]

where $\phi(i, \cdot)$ is continuous on $D_K$ for all $i \in S_0$. The structure of (4) is analogous to the stochastic ordinary differential equation (ODE) modeling of the random mean field in the mean field LQG game model in [10], where the evolution of the ODE is driven by the state of the major player.

It is possible to consider a function of the form $\psi(t, x_0, \theta)$, which is more general than in (4). For computational efficiency, we will not seek this generality. On the other hand, the consideration of a time-invariant function will be sufficient for developing our mean field approximation scheme. More specifically, by introducing (4), we may develop stationary feedback strategies for all the players, and furthermore, the mean field limit of the closed-loop system will regenerate a stationary transition law of θ(t) which is in agreement with the initial assumption of time-invariant dynamics.

3.1 The Limiting Problem of the Major Player

Suppose the function ψ in (4) has been given. The original problem of the major player is now approximated by a new Markov decision process. We will often use $x_0, x_i, \theta$ to denote a value of the corresponding processes.

Problem (P0): Minimize
\[
J_0 = E \sum_{t=0}^{\infty} \rho^t c_0\bigl(x_0(t), \theta(t), u_0(t)\bigr),
\]


where $x_0(t)$ has the transition law (1) and θ(t) satisfies (4).

Problem (P0) gives a standard Markov decision process. To solve this problem, we use the dynamic programming approach by considering a family of optimization problems associated with different initial conditions. Given the initial state $(x_0, \theta) \in S_0 \times D_K$ at $t = 0$, define the cost function
\[
J_0\bigl(x_0, \theta, u(\cdot)\bigr) = E\Bigl[ \sum_{t=0}^{\infty} \rho^t c_0\bigl(x_0(t), \theta(t), u_0(t)\bigr) \,\Big|\, x_0, \theta \Bigr].
\]

Denote the value function $v(x_0, \theta) = \inf J_0(x_0, \theta, u(\cdot))$, where the infimum is with respect to all mixed policies/strategies of the form $\pi = (\pi(0), \pi(1), \ldots)$ such that each $\pi(s)$ is a probability measure on $A_0$, indicating the probability to take a particular action, and depends on all past history $(\ldots, x_0(s-1), \theta(s-1), u_0(s-1), x_0(s), \theta(s))$. By taking two different initial conditions $(x_0, \theta)$ and $(x_0, \theta')$ and comparing the associated optimal costs, we may easily obtain the following continuity property.

Proposition 1. For each $x_0$, the value function $v(x_0, \cdot)$ is continuous on $D_K$. □

We write the dynamic programming equation
\[
\begin{aligned}
v(x_0, \theta) &= \min_{a_0 \in A_0} \bigl\{ c_0(x_0, \theta, a_0) + \rho E\, v\bigl(x_0(t+1), \theta(t+1)\bigr) \bigr\} \\
&= \min_{a_0 \in A_0} \Bigl\{ c_0(x_0, \theta, a_0) + \rho \sum_{k \in S_0} Q_0(k \mid x_0, a_0)\, v\bigl(k, \psi(x_0, \theta)\bigr) \Bigr\}.
\end{aligned}
\]
Since the action space is finite, an optimal policy $\pi_0$ solving the dynamic programming equation exists and is determined as a stationary Markov policy of the form $\pi_0(x_0, \theta)$, i.e., $\pi_0$ is a function of the current state. Let the set of optimal policies be denoted by $\Pi_0$. It is possible that $\Pi_0$ consists of more than one element.
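The dynamic programming equation above can be solved numerically once the simplex $D_K$ is discretized. The sketch below is illustrative and not an algorithm from the paper: continuing the earlier snippets (it reuses `Q0`, `K`, `K0`, `c0_fn`, `random_kernel`, and `rng`), it assumes a given update map `psi_guess`, runs value iteration on a finite grid of simplex points (mapping $\psi(x_0, \theta)$ to its nearest grid point), and reads off one pure stationary policy; handling mixed optimal policies would require keeping the whole argmin set.

```python
from itertools import product as iproduct

def simplex_grid(K, M):
    """All points (m1/M, ..., mK/M) with nonnegative integer entries summing to M."""
    pts = []
    for comp in iproduct(range(M + 1), repeat=K - 1):
        if sum(comp) <= M:
            pts.append(np.array(comp + (M - sum(comp),)) / M)
    return np.array(pts)

def nearest_grid_index(theta, grid):
    return int(np.argmin(np.linalg.norm(grid - theta, axis=1)))

def value_iteration_P0(Q0, c0_fn, psi, grid, rho, tol=1e-8, max_iter=1000):
    """Approximate v(x0, theta) and one pure stationary policy for Problem (P0),
    with theta restricted to the finite grid."""
    K0, L0, _ = Q0.shape
    G = len(grid)
    v = np.zeros((K0, G))
    # Grid index of psi(x0, theta) for each (x0, grid point).
    succ = np.array([[nearest_grid_index(psi(x0, grid[g]), grid)
                      for g in range(G)] for x0 in range(K0)])
    for _ in range(max_iter):
        q = np.empty((K0, G, L0))
        for x0 in range(K0):
            for g in range(G):
                cont = v[:, succ[x0, g]]        # v(k, psi(x0, theta)) for all k in S0
                for a0 in range(L0):
                    q[x0, g, a0] = c0_fn(x0, grid[g], a0) + rho * Q0[x0, a0] @ cont
        v_new = q.min(axis=2)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmin(axis=2)                  # value function and a pure policy

# Example: a placeholder update map psi(x0, theta) = theta P[x0], P[x0] row-stochastic.
P = np.array([random_kernel(K, 1, rng)[:, 0] for _ in range(K0)])
psi_guess = lambda x0, theta: theta @ P[x0]
grid = simplex_grid(K, M=6)
v_grid, pi0_grid = value_iteration_P0(Q0, c0_fn, psi_guess, grid, rho=0.9)
```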

3.2 The Limiting Problem of the Minor Player

Suppose a particular optimal strategy $\pi_0 \in \Pi_0$ has been fixed for the major player. The resulting state process is $x_0(t)$. The decision problem of the minor player is approximated by the following limiting problem.

Problem (P1): Minimize
\[
J_i = E \sum_{t=0}^{\infty} \rho^t c\bigl(x_i(t), x_0(t), \theta(t), u_i(t)\bigr),
\]


where $x_i(t)$ has the state transition law (2); θ(t) satisfies (4); and $x_0(t)$ is subject to the control policy $\pi_0 \in \Pi_0$. This leads to a Markov decision problem with the state $(x_i(t), x_0(t), \theta(t))$ and control action $u_i(t)$. Following the steps in Section 3.1, we define the value function $w(x_i, x_0, \theta)$.

Before analyzing the value function $w$, we specify the state transition law of the major player under any mixed strategy $\pi_0$. Suppose
\[
\pi_0 = (\alpha_1, \ldots, \alpha_{L_0}), \tag{5}
\]
which is a probability vector. By the standard convention in Markov decision processes, the strategy $\pi_0$ selects action $k$ with probability $\alpha_k$. We further define
\[
Q_0(z \mid y, \pi_0) = \sum_{l \in A_0} \alpha_l\, Q_0(z \mid y, l),
\]

where $\pi_0$ is given by (5). The dynamic programming equation is now given by
\[
\begin{aligned}
w(x_i, x_0, \theta) &= \min_{a \in A} \bigl\{ c(x_i, x_0, \theta, a) + \rho E\, w\bigl(x_i(t+1), x_0(t+1), \theta(t+1)\bigr) \bigr\} \\
&= \min_{a \in A} \Bigl\{ c(x_i, x_0, \theta, a) + \rho \sum_{j \in S,\, k \in S_0} Q(j \mid x_i, a)\, Q_0(k \mid x_0, \pi_0)\, w\bigl(j, k, \psi(x_0, \theta)\bigr) \Bigr\}.
\end{aligned}
\]

The following continuity property parallels Proposition 1.

Proposition 2. For each pair $(x_i, x_0)$, the value function $w(x_i, x_0, \cdot)$ is continuous on $D_K$. □

Again, since the action space in Problem (P1) is finite, the value function is attained by at least one optimal strategy. Let the optimal strategy set be denoted by $\Pi$. Note that $\Pi$ is determined after $\pi_0$ is selected first.

Let π be a mixed strategy of the minor player, represented in the form
\[
\pi = (\beta_1, \ldots, \beta_L).
\]
We determine the state transition law of the minor player as follows:
\[
Q(z \mid y, \pi) = \sum_{l \in A} \beta_l\, Q(z \mid y, l). \tag{6}
\]
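In code, the kernel under a mixed strategy is just the probability-weighted average of the pure-action kernels; a small illustrative sketch (reusing `Q0`, `Q`, and `L` from the earlier snippets):

```python
def mixed_kernel_row(kernel, y, probs):
    """Transition distribution from state y under a mixed strategy: cf. the
    definitions Q0(.|y, pi0) = sum_l alpha_l Q0(.|y, l) and (6)."""
    probs = np.asarray(probs, dtype=float)
    return probs @ kernel[y]          # kernel[y] has shape (n_actions, n_states)

# The major player in state y = 1 randomizing (0.3, 0.7) over its two actions.
print(mixed_kernel_row(Q0, 1, [0.3, 0.7]))
# A minor player in state y = 2 randomizing uniformly over its L actions.
print(mixed_kernel_row(Q, 2, np.ones(L) / L))
```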

We have the following theorem on the closed-loop system.

Theorem 1. Suppose $\pi_0 \in \Pi_0$ and $\pi \in \Pi$ is determined after $\pi_0$. Under the policy pair $(\pi_0, \pi)$, $(x_i(t), x_0(t), \theta(t))$ is a Markov chain with stationary transition probabilities.

Proof. It is clear that $\pi_0$ and π are stationary feedback policies as a function of the current state of the corresponding system. They may be represented as two probability vectors
\[
\pi_0 = \bigl(\pi_0^1(x_0, \theta), \ldots, \pi_0^{L_0}(x_0, \theta)\bigr), \qquad
\pi = \bigl(\pi^1(x_i, x_0, \theta), \ldots, \pi^L(x_i, x_0, \theta)\bigr).
\]


The process $(x_i(t), x_0(t), \theta(t))$ is a Markov chain since the transition probability from time $t$ to $t+1$ depends only on the value of $(x_i(t), x_0(t), \theta(t))$ and not on the past history. Suppose at time $t$, $(x_i(t), x_0(t), \theta(t)) = (j, k, \theta)$. Then at $t+1$, we have the transition probability
\[
P\bigl(x_i(t+1) = j',\, x_0(t+1) = k',\, \theta(t+1) = \theta' \,\big|\, (x_i(t), x_0(t), \theta(t)) = (j, k, \theta)\bigr)
= Q\bigl(j' \mid j, \pi(j, k, \theta)\bigr)\, Q_0\bigl(k' \mid k, \pi_0(k, \theta)\bigr)\, \delta_{\psi(k,\theta)}(\theta').
\]
We use $\delta_a(x)$ to denote the Dirac function, i.e., $\delta_a(x) = 1$ if $x = a$, and $\delta_a(x) = 0$ elsewhere. It is seen that the transition probability is determined by $(j, k, \theta)$ and does not depend on time. □

3.3 Discussions on Mixed Strategies

If Problems (P0) and (P1) are considered alone, one may always select an optimal policy which is a pure policy, i.e., given the current state, the action can be selected in a deterministic manner. However, in the mean field game setting we need to eventually determine the function ψ by a fixed point argument. For this reason, it is generally necessary to consider the optimal policies from the larger class of mixed policies. The restriction to deterministic policies may potentially lead to a nonexistence situation when the consistency requirement is imposed later on the mean field approximation.

4 Replication of the Frequency Process

This section develops the procedure to replicate the dynamics of θ(t) from the closed-loop system when the minor players apply the control strategies obtained from the limiting Markov decision problems.

We start with a system of $N$ minor players. Suppose the major player has selected its optimal policy $\pi_0(x_0, \theta)$ from $\Pi_0$. Note that for the general case of Problem (P1), there may be more than one optimal policy. We make the convention that the same optimal policy $\pi(x_i, x_0, \theta)$ is used by all the minor players, while each minor player substitutes its own state into the feedback policy π. It is necessary to make this convention since otherwise the mean field limit cannot be properly defined if there are multiple optimal policies and if each minor player can take an arbitrary one.

We have the following key theorem on the asymptotic property of the update of $I^{(N)}(t)$ when $N \to \infty$. Note that the range of $I^{(N)}(t)$ is a discrete set. For any $\theta \in D_K$, we take an approximation procedure. We suppose the vector θ has been used by the minor players (of the finite population) at time $t$ in solving their limiting control problems and used in their optimal policy.

Theorem 2. Fix any $\theta = (\theta_1, \ldots, \theta_K) \in D_K$. Suppose the major player applies $\pi_0$ and the $N$ minor players apply π, and at time $t$ the state of the major player is $x_0$ and $I^{(N)}(t) = (s_1, \ldots, s_K)$, where $(s_1, \ldots, s_K) \to \theta$ as $N \to \infty$. Then given $(x_0, I^{(N)}(t), \pi)$, as $N \to \infty$,
\[
I^{(N)}(t+1) \to \Bigl( \sum_{l=1}^{K} \theta_l\, Q\bigl(1 \mid l, \pi(l, x_0, \theta)\bigr),\ \ldots,\ \sum_{l=1}^{K} \theta_l\, Q\bigl(K \mid l, \pi(l, x_0, \theta)\bigr) \Bigr) \tag{7}
\]
with probability one.

Proof. By the assumption on $I^{(N)}(t)$, there are $s_k N$ minor players in state $k \in S$ at time $t$. In determining the distribution of $I^{(N)}(t+1)$, by symmetry of the minor players, we may assume without loss of generality that at time $t$ minor players $\mathcal{A}_1, \ldots, \mathcal{A}_{s_1 N}$ are in state 1, $\mathcal{A}_{s_1 N + 1}, \ldots, \mathcal{A}_{(s_1+s_2) N}$ are in state 2, etc. We check the contribution of $\mathcal{A}_1$ alone in generating different states in $S$. Due to the transition of $\mathcal{A}_1$, state $k \in S$ will appear with probability $Q(k \mid 1, \pi(1, x_0, \theta))$. We further obtain a probability vector $Q_1 := \bigl(Q(k \mid 1, \pi(1, x_0, \theta))\bigr)_{k=1}^{K}$ with its entries assigned on the set $S$, indicating the probability that each state appears resulting from the transition of $\mathcal{A}_1$.

An important fact is that in the closed-loop system with $x_0(t) = x_0$, conditional independence holds for the transition from $x_i(t)$ to $x_i(t+1)$ for the $N$ processes.

Thus, the distribution of $N I^{(N)}(t+1)$ given $(x_0, I^{(N)}(t), \pi)$ is obtained as the convolution of $N$ independent distributions corresponding to all $N$ minor players, and $Q_1$ is one of these $N$ distributions. We have
\[
E_{x_0, I^{(N)}(t), \pi}\, I^{(N)}(t+1) = \Bigl( \sum_{l=1}^{K} s_l\, Q\bigl(1 \mid l, \pi(l, x_0, \theta)\bigr),\ \ldots,\ \sum_{l=1}^{K} s_l\, Q\bigl(K \mid l, \pi(l, x_0, \theta)\bigr) \Bigr), \tag{8}
\]
where $E_{x_0, I^{(N)}(t), \pi}$ denotes the conditional mean given $(x_0, I^{(N)}(t), \pi)$.

So by the law of large numbers, $I^{(N)}(t+1) - E_{x_0, I^{(N)}(t), \pi}\, I^{(N)}(t+1)$ converges to zero with probability one as $N \to \infty$. We obtain (7). □

Based on the right hand side of (7), we introduce the $K \times K$ matrix
\[
Q^*(x_0, \theta) =
\begin{bmatrix}
Q\bigl(1 \mid 1, \pi(1, x_0, \theta)\bigr) & \cdots & Q\bigl(K \mid 1, \pi(1, x_0, \theta)\bigr) \\
Q\bigl(1 \mid 2, \pi(2, x_0, \theta)\bigr) & \cdots & Q\bigl(K \mid 2, \pi(2, x_0, \theta)\bigr) \\
\vdots & \ddots & \vdots \\
Q\bigl(1 \mid K, \pi(K, x_0, \theta)\bigr) & \cdots & Q\bigl(K \mid K, \pi(K, x_0, \theta)\bigr)
\end{bmatrix}. \tag{9}
\]

Theorem 2 implies that within the infinite population limit, if the random measure of the states of the minor players is θ(t) at time $t$, then $\theta(t+1)$ should be generated as
\[
\theta(t+1) = \theta(t)\, Q^*\bigl(x_0(t), \theta(t)\bigr). \tag{10}
\]
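A sketch of (9) and (10) (illustrative; it assumes the minor players' policy is available as a function `pi(x, x0, theta)` returning a probability vector over $A$, and it reuses `Q`, `L`, `K`, and `mixed_kernel_row` from the earlier snippets):

```python
def Q_star(x0, theta, pi, Q):
    """The K x K matrix in (9): row l is the transition distribution of a minor
    player in state l under its (possibly mixed) policy pi(l, x0, theta)."""
    K = Q.shape[0]
    return np.array([mixed_kernel_row(Q, l, pi(l, x0, theta)) for l in range(K)])

def theta_update(x0, theta, pi, Q):
    """One mean field step, cf. (10): theta(t+1) = theta(t) Q*(x0(t), theta(t))."""
    return theta @ Q_star(x0, theta, pi, Q)

# Example with a placeholder mixed policy that ignores its arguments.
pi_uniform = lambda x, x0, theta: np.ones(L) / L
theta = np.ones(K) / K
print(theta_update(0, theta, pi_uniform, Q))   # remains a point in D_K
```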


4.1 The Consistency Condition

The fundamental requirement of consistent mean field approximations is that the mean field initially assumed should be the same as what is replicated by the closed-loop system when the number of minor players tends to infinity. By comparing (4) with (10), this consistency requirement reduces to the following condition
\[
\psi(x_0, \theta) = \theta\, Q^*(x_0, \theta), \tag{11}
\]
where $Q^*$ is given by (9). Recall that when we introduce the class Ψ for ψ, we have a continuity requirement. By imposing (11), we implicitly require a continuity property of $Q^*$ with respect to the variable θ.

Combining the solutions to Problems (P0) and (P1) and the consistency requirement, we write the so-called mean field equation system
\[
\theta(t+1) = \psi\bigl(x_0(t), \theta(t)\bigr), \tag{12}
\]
\[
v(x_0, \theta) = \min_{a_0 \in A_0} \Bigl\{ c_0(x_0, \theta, a_0) + \rho \sum_{k \in S_0} Q_0(k \mid x_0, a_0)\, v\bigl(k, \psi(x_0, \theta)\bigr) \Bigr\}, \tag{13}
\]
\[
w(x_i, x_0, \theta) = \min_{a \in A} \Bigl\{ c(x_i, x_0, \theta, a) + \rho \sum_{j \in S,\, k \in S_0} Q(j \mid x_i, a)\, Q_0(k \mid x_0, \pi_0)\, w\bigl(j, k, \psi(x_0, \theta)\bigr) \Bigr\}, \tag{14}
\]
\[
\psi(x_0, \theta) = \theta\, Q^*(x_0, \theta). \tag{15}
\]

In the above, we use $x_i$ to denote the state of the generic minor player. Note that only a single generic minor player appears in this mean field equation system.

Definition 1. We call $(\pi_0, \pi, \psi(x_0, \theta))$ a consistent solution to the mean field equation system (12)-(15) if $\pi_0$ solves (13) and π solves (14) and if the constraint (15) is satisfied. ♦
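The paper leaves the existence and computation of a consistent solution to future work (Section 6). Purely as an illustrative heuristic, and not a method given in the paper, one can iterate between best responses and the replicated mean field map: given the current guess of ψ, solve (13)-(14) by some routine (for instance, grid-based value iteration as sketched in Section 3.1), then refresh ψ through (15), and repeat. The sketch below assumes such solvers are supplied as callables `solve_P0` and `solve_P1`, reuses `Q_star` from the previous snippet, and offers no convergence guarantee.

```python
def consistent_psi_iteration(Q, grid, K0, solve_P0, solve_P1, n_rounds=50, tol=1e-6):
    """Heuristic fixed-point iteration for the mean field equation system (12)-(15).

    Assumed interfaces (placeholders, to be provided by the user):
      solve_P0(psi)      -> pi0, with pi0(x0, theta) a probability vector over A0 (cf. (13))
      solve_P1(psi, pi0) -> pi,  with pi(x, x0, theta) a probability vector over A  (cf. (14))
    Returning probability vectors keeps mixed best responses available.
    """
    psi = lambda x0, theta: theta          # initial guess: frozen mean field
    pi0 = pi = None
    for _ in range(n_rounds):
        pi0 = solve_P0(psi)                # best response of the major player
        pi = solve_P1(psi, pi0)            # best response of a generic minor player
        # Replicated mean field dynamics, cf. (15): psi_new(x0, theta) = theta Q*(x0, theta).
        psi_new = lambda x0, theta, pi=pi: theta @ Q_star(x0, theta, pi, Q)
        # Change of psi measured on a grid of simplex points.
        gap = max(np.max(np.abs(psi_new(x0, th) - psi(x0, th)))
                  for x0 in range(K0) for th in grid)
        psi = psi_new
        if gap < tol:
            break
    return pi0, pi, psi
```

If the iteration stalls or cycles, nothing here guarantees a fixed point for this particular scheme; the solvability of (12)-(15) is exactly the assumption invoked in the abstract.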

5 Decentralized Strategies and Performance

We consider a system of $N+1$ players. We specify randomized strategies with centralized information and decentralized information, respectively.

Centralized Information. Define the $t$-history $h_t$ by (3). For any $j = 0, \ldots, N$, the admissible control set $U_j$ of player $\mathcal{A}_j$ consists of controls $(u_j(0), u_j(1), \ldots)$, where each $u_j(t)$ is a mixed strategy as a mapping from $h_t$ to $D_{L_0}$ if $j = 0$, and to $D_L$ if $1 \le j \le N$.

Decentralized Information. For the major player, denote
\[
h^{0,\mathrm{dec}}_t = \bigl(x_0(0), \theta(0), u_0(0), \ldots, x_0(t-1), \theta(t-1), u_0(t-1), x_0(t), \theta(t)\bigr).
\]


A decentralized strategy at time $t$ is such that $u_0(t)$ is a randomized strategy depending on $h^{0,\mathrm{dec}}_t$. For minor player $\mathcal{A}_i$, denote
\[
h^{i,\mathrm{dec}}_t = \bigl(x_i(0), x_0(0), \theta(0), u_i(0), \ldots, x_i(t-1), x_0(t-1), \theta(t-1), u_i(t-1), x_i(t), x_0(t), \theta(t)\bigr).
\]
A decentralized strategy at time $t$ is such that $u_i(t)$ depends on $h^{i,\mathrm{dec}}_t$.

For the mean field equation system, if a solution triple $(\pi_0, \pi, \psi)$ exists, we will obtain $\pi_0$ and π as decentralized Markov strategies as a function of the current state $(x_0(t), \theta(t))$ and $(x_i(t), x_0(t), \theta(t))$, respectively.

Suppose all the players use their decentralized strategies $\pi_0(x_0, \theta)$, $\pi(x_i, x_0, \theta)$, $1 \le i \le N$, respectively. In the setup of mean field decision problems, a central issue is to examine the performance change for player $\mathcal{A}_j$ if it unilaterally changes to a policy in $U_j$ by utilizing extra information.

For examining the performance, we have the following error estimate on the mean field approximation.

Theorem 3. Suppose (i) θ(t) is generated by (4), where $\theta_0$ is given by (A2); (ii) $(\pi_0, \pi, \psi(x_0, \theta))$ is a consistent solution to the mean field equation system (12)-(15). Then we have
\[
\lim_{N \to \infty} E\bigl|I^{(N)}(t) - \theta(t)\bigr| = 0
\]
for each given $t$.

Proof. We use the technique introduced in the proof of Theorem 2. Fix any ε > 0. We have
\[
P\bigl(|I^{(N)}(0) - \theta_0| \ge \varepsilon\bigr) \le E\bigl|I^{(N)}(0) - \theta(0)\bigr| / \varepsilon.
\]
We take a sufficiently large $N_0$ such that for all $N \ge N_0$, we have
\[
P\bigl(|I^{(N)}(0) - \theta_0| < \varepsilon\bigr) > 1 - \varepsilon. \tag{16}
\]
Then following the method for (8), we may estimate $I^{(N)}(1)$. By the consistency condition (11), we further obtain
\[
\lim_{N \to \infty} E\bigl|I^{(N)}(1) - \theta(1)\bigr| = 0.
\]
Carrying out the estimates recursively, we obtain the desired result for each fixed $t$. □
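Theorem 3 can also be checked empirically. The sketch below is illustrative (it presumes some triple (`pi0`, `pi`, `psi`) is available, for instance from the heuristic iteration in Section 4.1, and reuses the helpers and placeholder policies from the earlier snippets): it runs the finite-$N$ closed loop and the recursion (4) side by side along the same $x_0$ path and averages $|I^{(N)}(T) - \theta(T)|$ over independent runs.

```python
def mean_field_gap(N, T, x0_init, theta0, pi0, pi, psi, Q0, Q, rng, n_runs=20):
    """Monte Carlo estimate of E|I^(N)(T) - theta(T)| under decentralized strategies."""
    gaps = []
    for _ in range(n_runs):
        x0, theta = x0_init, theta0.copy()
        x_minor = rng.choice(K, size=N, p=theta0)   # i.i.d. initial minor states, cf. (A2)
        for t in range(T):
            # All players act on (own state, x0(t), theta(t)); mixed actions are sampled.
            a0 = rng.choice(L0, p=pi0(x0, theta))
            a_minor = np.array([rng.choice(L, p=pi(x, x0, theta)) for x in x_minor])
            theta = psi(x0, theta)                  # mean field recursion (4), driven by x0(t)
            x0, x_minor = one_joint_step(x0, x_minor, a0, a_minor, Q0, Q, rng)
        gaps.append(np.abs(empirical_distribution(x_minor, K) - theta).sum())
    return float(np.mean(gaps))

# Example with the placeholder uniform policies and the psi they replicate via (15).
psi_rep = lambda x0, th: th @ Q_star(x0, th, pi_uniform, Q)
pi0_unif = lambda x0, th: np.ones(L0) / L0
print(mean_field_gap(500, 5, 0, np.ones(K) / K, pi0_unif, pi_uniform, psi_rep, Q0, Q, rng))
```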

For $j = 0, \ldots, N$, denote $u_{-j} = (u_0, u_1, \ldots, u_{j-1}, u_{j+1}, \ldots, u_N)$.

Definition 2. A set of strategies $u_j \in U_j$, $0 \le j \le N$, for the $N+1$ players is called an ε-Nash equilibrium with respect to the costs $J_j$, $0 \le j \le N$, where $\varepsilon \ge 0$, if for any $j$, $0 \le j \le N$, we have $J_j(u_j, u_{-j}) \le J_j(u'_j, u_{-j}) + \varepsilon$, when any alternative $u'_j$ is applied by player $\mathcal{A}_j$. ♦


Theorem 4. Assume the conditions in Theorem 3 hold. Then the set of decentralized strategies $\hat u_j$, $0 \le j \le N$, for the $N+1$ players is an $\varepsilon_N$-Nash equilibrium, i.e., for $0 \le j \le N$,
\[
J_j(\hat u_j, \hat u_{-j}) - \varepsilon_N \le \inf_{u_j \in U_j} J_j(u_j, \hat u_{-j}) \le J_j(\hat u_j, \hat u_{-j}),
\]
where $0 \le \varepsilon_N \to 0$ as $N \to \infty$ and $u_j$ is a centralized information based strategy.

Proof. The theorem may be proven by following the usual argument in our previous work [12,10]. First, by using Theorem 3, we may approximate $I^{(N)}(t)$ in the original game by θ(t). Then the optimization problems of the major player and any minor player are approximated by Problems (P0) and (P1), respectively. Finally, it is seen that each player can gain little if it deviates from the decentralized strategy determined from the mean field equation system. □

6 Concluding Remarks and Future Work

This paper considers a class of Markov decision processes involving a major player and a large population of minor players. The players have independent dynamics for fixed actions and have mean field coupling in their costs according to the state distribution process of the minor players. We introduce a stochastic difference equation depending on the state of the major player to characterize the evolution of the minor players' state distribution process in the infinite population limit and solve local Markov decision problems. This approach provides decentralized stationary strategies and offers a low complexity solution.

This paper presents the main conceptual framework for decentralized decision making in the setting of Markov decision processes. The existence analysis and the associated computation of a solution to the mean field equation system are more challenging than in linear models. It is of interest to develop fixed point analysis to study the existence of solutions. Also, the development of iterative computation procedures for solutions is of practical interest.

References

1. Adlakha, S., Johari, R., Weintraub, G., Goldsmith, A.: Oblivious equilibrium for large-scale stochastic games with unbounded costs. In: Proc. IEEE CDC 2008, Cancun, Mexico, pp. 5531–5538 (December 2008)

2. Al-Najjar, N.I.: Aggregation and the law of large numbers in large economies. Games and Economic Behavior 47(1), 1–35 (2004)

3. Buckdahn, R., Cardaliaguet, P., Quincampoix, M.: Some recent aspects of differential game theory. Dynamic Games and Appl. 1(1), 74–114 (2011)

4. Dogbe, C.: Modeling crowd dynamics by the mean field limit approach. Math. Computer Modelling 52, 1506–1520 (2010)

5. Gast, N., Gaujal, B., Le Boudec, J.-Y.: Mean field for Markov decision processes: from discrete to continuous optimization (2010) (Preprint)


6. Galil, Z.: The nucleolus in games with major and minor players. Internat. J. Game Theory 3, 129–140 (1974)

7. Gomes, D.A., Mohr, J., Souza, R.R.: Discrete time, finite state space mean field games. J. Math. Pures Appl. 93, 308–328 (2010)

8. Haimanko, O.: Nonsymmetric values of nonatomic and mixed games. Math. Oper. Res. 25, 591–605 (2000)

9. Hart, S.: Values of mixed games. Internat. J. Game Theory 2, 69–86 (1973)

10. Huang, M.: Large-population LQG games involving a major player: the Nash certainty equivalence principle. SIAM J. Control Optim. 48(5), 3318–3353 (2010)

11. Huang, M., Caines, P.E., Malhame, R.P.: Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In: Proc. 42nd IEEE CDC, Maui, HI, pp. 98–103 (December 2003)

12. Huang, M., Caines, P.E., Malhame, R.P.: Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Autom. Control 52(9), 1560–1571 (2007)

13. Huang, M., Caines, P.E., Malhame, R.P.: The NCE (mean field) principle with locality dependent cost interactions. IEEE Trans. Autom. Control 55(12), 2799–2805 (2010)

14. Huang, M., Caines, P.E., Malhame, R.P.: Social optima in mean field LQG control: centralized and decentralized strategies. IEEE Trans. Autom. Control (in press, 2012)

15. Huang, M., Malhame, R.P., Caines, P.E.: On a class of large-scale cost-coupled Markov games with applications to decentralized power control. In: Proc. 43rd IEEE CDC, Paradise Island, Bahamas, pp. 2830–2835 (December 2004)

16. Huang, M., Malhame, R.P., Caines, P.E.: Nash equilibria for large-population linear stochastic systems of weakly coupled agents. In: Boukas, E.K., Malhame, R.P. (eds.) Analysis, Control and Optimization of Complex Dynamic Systems, pp. 215–252. Springer, New York (2005)

17. Jovanovic, B., Rosenthal, R.W.: Anonymous sequential games. Journal of Mathematical Economics 17, 77–87 (1988)

18. Lasry, J.-M., Lions, P.-L.: Mean field games. Japan. J. Math. 2(1), 229–260 (2007)

19. Li, T., Zhang, J.-F.: Asymptotically optimal decentralized control for large population stochastic multiagent systems. IEEE Trans. Automat. Control 53(7), 1643–1660 (2008)

20. Ma, Z., Callaway, D., Hiskens, I.: Decentralized charging control for large populations of plug-in electric vehicles. IEEE Trans. Control Systems Technol. (to appear, 2012)

21. Nguyen, S.L., Huang, M.: Mean field LQG games with a major player: continuum-parameters for minor players. In: Proc. 50th IEEE CDC, Orlando, FL, pp. 1012–1017 (December 2011)

22. Nourian, M., Malhame, R.P., Huang, M., Caines, P.E.: Mean field (NCE) formulation of estimation based leader-follower collective dynamics. Internat. J. Robotics Automat. 26(1), 120–129 (2011)

23. Tembine, H., Le Boudec, J.-Y., El-Azouzi, R., Altman, E.: Mean field asymptotics of Markov decision evolutionary games and teams. In: Proc. International Conference on Game Theory for Networks, Istanbul, Turkey, pp. 140–150 (May 2009)

24. Tembine, H., Zhu, Q., Basar, T.: Risk-sensitive mean-field stochastic differential games. In: Proc. 18th IFAC World Congress, Milan, Italy (August 2011)


25. Wang, B.-C., Zhang, J.-F.: Distributed control of multi-agent systems with random parameters and a major agent (2012) (Preprint)

26. Weintraub, G.Y., Benkard, C.L., Van Roy, B.: Markov perfect industry dynamics with many firms. Econometrica 76(6), 1375–1411 (2008)

27. Yin, H., Mehta, P.G., Meyn, S.P., Shanbhag, U.V.: Synchronization of coupled oscillators is a game. IEEE Trans. Autom. Control 57(4), 920–935 (2012)

28. Yong, J., Zhou, X.Y.: Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York (1999)