
Infinite-State Discrete-time Markov Decision Processes

Eugene A. Feinberg

Stony Brook University

Eurandom, Eindhoven, September 5, 2012

Motivation

Early studies of inventory control problems by Arrow, Harris, and Marschak (1951), Dvoretzky, Kiefer, and Wolfowitz (1952), Arrow, Karlin, and Scarf (1958), and Scarf (1960) were among the major motivating factors for the development of the theory of Markov Decision Processes (MDPs).

Inventory control models lead to MDPs with infinite state sets, noncompact action sets, and unbounded costs. We describe recent developments for such MDPs.

Plan of the talk

1. Finite state-action MDPs

2. Discounted MDPs with Borel state and compact action sets

3. Weak and setwise continuity assumptions for transition probabilities

4. Countable state MDPs with average costs

5. Average-cost MDPs with Borel state and compact action sets

6. MDPs with setwise continuous transition probabilities

7. MDPs with weakly continuous transition probabilities

8. Berge’s theorem

9. Fatou’s lemma

Finite state and action MDPs

(i) X is a state set;

(ii) A is an action set;

(iii) A(x) is the set of actions available at state x ∈ X;

(iv) c(x, a) is the one step cost;

(v) p(y | x, a) is the transition probability;

At time n = 0, 1, . . ., the history $h_n = (x_0, a_0, \ldots, x_{n-1}, a_{n-1}, x_n)$ is observed and the action $a_n$ is selected.

A policy is a sequence $\pi_0, \pi_1, \ldots$ of regular transition probabilities $\pi_n$ from $H_n = (X \times A)^n \times X$ to $A$ such that $\pi_n(A(x_n) \mid x_0, a_0, \ldots, x_{n-1}, a_{n-1}, x_n) = 1$.

According to the Ionescu Tulcea theorem, any initial state $x$ and any policy $\pi$ define a probability measure $P^\pi_x$ on $((X \times A)^\infty, \mathcal{B}((X \times A)^\infty))$; the expectation with respect to $P^\pi_x$ is denoted by $E^\pi_x$.

A policy $\phi$ is called Markov if decisions are non-randomized and depend only on the current state and time, $a_n = \phi_n(x_n)$. A policy $\phi$ is called stationary if decisions are non-randomized and depend only on the current state, $a_n = \phi(x_n)$.

Let Π be the set of all policies.

Cost criterion

Finite-horizon expected total cost:

$$v^\pi_{\alpha,n}(x) = E^\pi_x \sum_{t=0}^{n-1} \alpha^t c(x_t, a_t),$$

where α ≥ 0 is a discount factor.

Infinite-horizon expected total cost:

$$v^\pi_\alpha(x) = E^\pi_x \sum_{t=0}^{\infty} \alpha^t c(x_t, a_t),$$

where 0 ≤ α < 1.

Average costs per unit time:

$$w^\pi(x) = \limsup_{n \to \infty} \frac{1}{n}\, v^\pi_{1,n}(x).$$

For any objective function $g^\pi(x)$, let $g(x) = \inf_{\pi \in \Pi} g^\pi(x)$ be the value function. A policy π is called optimal if $g^\pi(x) = g(x)$ for all x ∈ X.

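The value functions $v_{\alpha,n}$ and $v_\alpha$ satisfy the optimality equations

$$v_{\alpha,n+1}(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \alpha \sum_{y \in X} p(y \mid x, a)\, v_{\alpha,n}(y) \Big\}, \quad x \in X,$$

$$v_\alpha(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \alpha \sum_{y \in X} p(y \mid x, a)\, v_\alpha(y) \Big\}, \quad x \in X. \tag{2}$$

A stationary policy $\phi$ is discount optimal if and only if, for all x ∈ X, the minimum in (2) is achieved for $a = \phi(x)$.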
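The discounted optimality equation can be solved numerically by successive approximation (value iteration) when X and A are finite. The following is a minimal sketch on a hypothetical two-state, two-action MDP; the costs, transition probabilities, and the names `q_value`, `value_iteration`, and `greedy_policy` are illustrative assumptions, not part of the talk.

```python
# Hypothetical finite MDP (all numbers are illustrative assumptions).
STATES = [0, 1]
ACTIONS = {0: [0, 1], 1: [0, 1]}                      # A(x)
COST = {(0, 0): 1.0, (0, 1): 2.0,                     # c(x, a)
        (1, 0): 3.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],          # P[(x, a)][y] = p(y | x, a)
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}


def q_value(x, a, v, alpha):
    """Return c(x, a) + alpha * sum_y p(y | x, a) v(y)."""
    return COST[(x, a)] + alpha * sum(P[(x, a)][y] * v[y] for y in STATES)


def value_iteration(alpha, tol=1e-10, max_iter=100_000):
    """Iterate v_{alpha,n+1} = T v_{alpha,n}, starting from v_{alpha,0} = 0."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iter):
        v_next = [min(q_value(x, a, v, alpha) for a in ACTIONS[x]) for x in STATES]
        if max(abs(v_next[x] - v[x]) for x in STATES) < tol:
            return v_next
        v = v_next
    return v


def greedy_policy(v, alpha):
    """Pick, in each state, an action attaining the minimum in the optimality equation."""
    return {x: min(ACTIONS[x], key=lambda a: q_value(x, a, v, alpha)) for x in STATES}


v = value_iteration(alpha=0.9)
phi = greedy_policy(v, alpha=0.9)
print("v_alpha:", v)
print("stationary policy:", phi)
```

Because the operator on the right-hand side is a contraction with modulus α < 1 in the finite case, the iterates converge geometrically, and the greedy policy with respect to the limit attains the minimum in each state.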

The value function w for average costs per unit time satisfies the canonical equations

$$w(x) = \min_{a \in A(x)} \sum_{y \in X} p(y \mid x, a)\, w(y), \tag{3}$$

$$w(x) + u(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \sum_{y \in X} p(y \mid x, a)\, u(y) \Big\}. \tag{4}$$

The canonical equations define a stationary optimal policy. For many problems w(x) = const. Then (3) and (4) lead to

$$w + u(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \sum_{y \in X} p(y \mid x, a)\, u(y) \Big\}.$$
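When w is constant, the pair (w, u) can be approximated on a finite model by relative value iteration, which is a swapped-in standard technique and not a method named in the talk. The sketch below assumes a hypothetical two-state MDP with strictly positive transition probabilities (so the iteration converges) and normalizes u at a reference state z.

```python
# Hypothetical finite MDP with strictly positive transitions (illustrative numbers).
STATES = [0, 1]
ACTIONS = {0: [0, 1], 1: [0, 1]}
COST = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}


def bellman(u):
    """Apply u -> min_a { c(x, a) + sum_y p(y | x, a) u(y) } in every state."""
    return [min(COST[(x, a)] + sum(P[(x, a)][y] * u[y] for y in STATES)
                for a in ACTIONS[x])
            for x in STATES]


def relative_value_iteration(z=0, tol=1e-11, max_iter=100_000):
    """Return (w, u) with u(z) = 0 approximately solving w + u = T u."""
    u = [0.0 for _ in STATES]
    w = 0.0
    for _ in range(max_iter):
        t = bellman(u)
        w = t[z]                                  # current estimate of the average cost
        u_next = [t[x] - w for x in STATES]       # renormalize so that u(z) = 0
        if max(abs(u_next[x] - u[x]) for x in STATES) < tol:
            return w, u_next
        u = u_next
    return w, u


w, u = relative_value_iteration()
residual = max(abs(w + u[x] - bellman(u)[x]) for x in STATES)
print("w =", w, "u =", u, "residual =", residual)
```

The residual printed at the end measures how closely the computed pair satisfies the constant-w equation; a stationary policy choosing a minimizing action in each state is then average-cost optimal for this finite model.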

Vanishing discount optimality approach

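The existence of a stationary discount optimal policy follows from the optimality equation.

The existence of a stationary average-cost optimal policy follows from

$$w^\phi(x) = \lim_{\alpha \uparrow 1} (1 - \alpha)\, v^\phi_\alpha(x), \quad x \in X,$$

for any stationary policy $\phi$. In general, by the Tauberian theorem, for any policy π,

$$\liminf_{n \to \infty} \frac{1}{n}\, v^\pi_{1,n}(x) \le \liminf_{\alpha \uparrow 1} (1 - \alpha)\, v^\pi_\alpha(x) \le \limsup_{\alpha \uparrow 1} (1 - \alpha)\, v^\pi_\alpha(x) \le w^\pi(x).$$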
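The vanishing discount relation can be checked numerically for a fixed stationary policy on a finite chain. The sketch below uses a hypothetical two-state Markov chain (the transition matrix `P` and cost vector `C` are illustrative assumptions) and compares the Abel means (1 − α)v_α(x), the Cesàro means (1/n)v_{1,n}(x), and the average cost computed from the stationary distribution.

```python
# Hypothetical two-state Markov chain induced by a fixed stationary policy.
P = [[0.9, 0.1], [0.5, 0.5]]     # transition matrix (illustrative numbers)
C = [1.0, 3.0]                   # one-step costs c(x, phi(x))


def discounted_cost(alpha):
    """Solve (I - alpha P) v = c for the 2x2 case by Cramer's rule."""
    m00, m01 = 1 - alpha * P[0][0], -alpha * P[0][1]
    m10, m11 = -alpha * P[1][0], 1 - alpha * P[1][1]
    det = m00 * m11 - m01 * m10
    v0 = (C[0] * m11 - m01 * C[1]) / det
    v1 = (m00 * C[1] - m10 * C[0]) / det
    return [v0, v1]


def undiscounted_finite_horizon(n):
    """Compute v_{1,n} by the recursion v_{1,k+1} = c + P v_{1,k}, v_{1,0} = 0."""
    v = [0.0, 0.0]
    for _ in range(n):
        v = [C[x] + sum(P[x][y] * v[y] for y in range(2)) for x in range(2)]
    return v


def average_cost():
    """Average cost from the stationary distribution: pi0 * P01 = pi1 * P10."""
    pi0 = P[1][0] / (P[0][1] + P[1][0])
    return pi0 * C[0] + (1 - pi0) * C[1]


w = average_cost()
abel = [(1 - a) * discounted_cost(a)[0] for a in (0.9, 0.99, 0.999, 0.9999)]
n = 10_000
cesaro = undiscounted_finite_horizon(n)[0] / n
print("average cost:", w)
print("(1 - alpha) v_alpha(0):", abel)
print("(1/n) v_{1,n}(0):", cesaro)
```

For this chain both means approach the same limit, which is the situation in which all inequalities of the Tauberian chain hold with equality.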

Infinite state space

(i) X and A are standard Borel spaces, (X, B(X)) and (A, B(A)).

(ii) The graph Gr(A) = {(x, a) : x ∈ X, a ∈ A(x)} is a Borel subset of X × A such that there exists a stationary policy ϕ (a measurable mapping ϕ : X → A such that ϕ(x) ∈ A(x) for all x ∈ X).

(iii) The cost function c is measurable and bounded below.

(iv) p is a regular transition probability from X × A to X; that is, p(· | x, a) is a probability measure on (X, B(X)) for all (x, a) ∈ X × A, and p(E | x, a) is a measurable function on X × A for each E ∈ B(X).

Discounted costs

For x ∈ X and n = 0, 1, . . ., the optimality equations can be written as

$$v_{\alpha,n+1}(x) = \inf_{a \in A(x)} \Big\{ c(x, a) + \alpha \int_X v_{\alpha,n}(y)\, p(dy \mid x, a) \Big\},$$

$$v_\alpha(x) = \inf_{a \in A(x)} \Big\{ c(x, a) + \alpha \int_X v_\alpha(y)\, p(dy \mid x, a) \Big\}.$$

The relevant questions are:

(i) Can “inf” be replaced with “min”?

(ii) Is it true that $v_{\alpha,n}(x) \to v_\alpha(x)$?

(iii) Does an optimal policy exist?

Conditions for discount optimality for compact action sets (Schäl, 1975)

Let P(X) be the set of probability measures on X and C(A) be the set of nonempty compact subsets of A.

Assumption (S):

1. A(x) ∈ C(A) for all x ∈ X,

2. For x ∈ X, the function c(x, a) is lower semi-continuous in a,

3. For each x ∈ X, the transition probability p(· | x, a) is continuous in a with respect to setwise convergence in P(X); that is, if $a_n \to a$, then

$$\int_X f(y)\, p(dy \mid x, a_n) \to \int_X f(y)\, p(dy \mid x, a)$$

for any bounded measurable function f on X.

Conditions for discount optimality for compact action sets (Schäl, 1975)

Assumption (W):

1. A(x) ∈ C(A) for all x ∈ X,

2. c is lower semi-continuous on Gr(A),

3. A : X → C(A) is upper semi-continuous ($A^{-1}(B)$ is closed in X for every closed set B ⊂ A),

4. p : Gr(A) → P(X) is continuous with respect to weak convergence in P(X); that is, if $x_n \to x$ and $a_n \to a$, then

$$\int_X f(y)\, p(dy \mid x_n, a_n) \to \int_X f(y)\, p(dy \mid x, a)$$

for any bounded continuous function f on X.
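The difference between setwise and weak convergence is visible already for point masses. As a minimal sketch, take the Dirac measures concentrated at 1/n, which converge weakly to the Dirac measure at 0 but not setwise; the test functions below are illustrative choices, not taken from the talk.

```python
def integrate_dirac(f, point):
    """Integral of f with respect to the Dirac measure concentrated at `point`."""
    return f(point)


def f_continuous(y):
    """A bounded continuous test function."""
    return min(1.0, abs(y))


def f_indicator_zero(y):
    """A bounded measurable but discontinuous test function: indicator of {0}."""
    return 1.0 if y == 0 else 0.0


points = [1.0 / n for n in range(1, 10_001)]          # supports of mu_n = delta_{1/n}

cont_vals = [integrate_dirac(f_continuous, p) for p in points]
ind_vals = [integrate_dirac(f_indicator_zero, p) for p in points]

limit_cont = integrate_dirac(f_continuous, 0.0)       # integral under the limit delta_0
limit_ind = integrate_dirac(f_indicator_zero, 0.0)

print("continuous f:", cont_vals[-1], "->", limit_cont)
print("indicator of {0}:", ind_vals[-1], "vs limit", limit_ind)
```

The integrals of the continuous function converge to the integral under the limit measure, while the integrals of the indicator stay at 0 although the limit integral is 1. This is why the setwise condition in Assumption (S) is more restrictive than the weak condition in Assumption (W).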

Why is A : X → C(A) assumed to be upper semi-continuous?

Answer: Berge’s Theorem

Let X and Y be Hausdorff topological spaces. Berge’s Theorem: If $u : X \times Y \to \mathbb{R}$ is a lower semi-continuous function and $\varphi : X \to C(Y)$ is an upper semi-continuous set-valued mapping, then the function

$$v(x) = \inf_{y \in \varphi(x)} u(x, y)$$

is lower semi-continuous on X.
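A small numerical illustration shows why upper semi-continuity of the set-valued mapping matters. The sketch assumes X = Y = ℝ, the continuous function u(x, y) = y, and two hypothetical compact-valued mappings that differ only at x = 0; the sampled minimum along ±1/n is a crude stand-in for the lower limit at 0.

```python
def u(x, y):
    """A continuous (hence lower semi-continuous) function on X x Y."""
    return y


def phi_usc(x):
    """Upper semi-continuous mapping: the preimage of {0} is [0, inf), a closed set."""
    return [0.0, 1.0] if x >= 0 else [1.0]


def phi_not_usc(x):
    """Not upper semi-continuous: the preimage of {0} is (0, inf), not closed."""
    return [0.0, 1.0] if x > 0 else [1.0]


def v(phi, x):
    """v(x) = min over y in phi(x) of u(x, y)."""
    return min(u(x, y) for y in phi(x))


def sampled_liminf_at_zero(phi, n_max=1000):
    """Proxy for liminf of v(x) as x -> 0, sampled along the points +1/n and -1/n."""
    return min(v(phi, s / n) for n in range(1, n_max + 1) for s in (-1.0, 1.0))


for name, phi in (("usc", phi_usc), ("not usc", phi_not_usc)):
    is_lsc_at_zero = v(phi, 0.0) <= sampled_liminf_at_zero(phi)
    print(name, "v(0) =", v(phi, 0.0), "lower semi-continuous at 0:", is_lsc_at_zero)
```

With the upper semi-continuous mapping, v(0) does not exceed the lower limit, as Berge's theorem predicts. With the other mapping, v jumps up at 0 and lower semi-continuity fails.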

$$v_{\alpha,n+1}(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \alpha \int_X v_{\alpha,n}(y)\, p(dy \mid x, a) \Big\} \tag{6}$$

Both summands should be lower semi-continuous for the minimum to be achieved. Thus, if $v_{\alpha,n}$ is lower semi-continuous, then the minimum is achieved. But we also need lower semi-continuity of $v_{\alpha,n+1}$ for (6) with n increased by 1.

Conditions for average cost optimality

Sennott (1986-2002) introduced conditions for average-cost optimality when X is countable and A is finite.

Sennott’s assumptions: There exists a “distinguished state” z such that:

(1) The function $(1 - \alpha) v_\alpha(z)$ is bounded for α ∈ (0, 1);

(2) There exist a non-negative finite constant L and a non-negative finite function M(x) such that

$$-L \le v_\alpha(x) - v_\alpha(z) \le M(x)$$

for all x ∈ X and for all α ∈ (0, 1).

Sennott used the vanishing discount optimality approach and showed that the optimality inequality

$$w + u(x) \ge \min_{a \in A(x)} \Big\{ c(x, a) + \int_X u(y)\, p(dy \mid x, a) \Big\}, \quad x \in X, \tag{7}$$

holds. Then a stationary policy $\phi$ such that

$$w + u(x) \ge c(x, \phi(x)) + \int_X u(y)\, p(dy \mid x, \phi(x)), \quad x \in X,$$

is optimal.

Cavazos-Cadena (1991) constructed an example in which the optimality inequality holds but the optimality equality does not.

Conditions for average-cost optimality for Borel state and compact action sets (Schäl, 1993)

Assumption (G) (general): $g := \inf_{x \in X} w(x) < \infty$.

Assumption (B): Assumption (G) holds and

$$\sup_{\alpha \in [0,1)} u_\alpha(x) < \infty \quad \text{for all } x \in X,$$

where $u_\alpha(x) = v_\alpha(x) - m_\alpha$ and $m_\alpha = \inf_{x \in X} v_\alpha(x)$.

Assumption (B), together with either Assumption (S) or Assumption (W), implies the validity of the optimality inequality, the existence of stationary optimal policies, and the convergence of discount optimal policies to average-cost optimal policies.

Conditions for average-cost optimality for general action sets (Hernández-Lerma, 1991)

Condition S∗:

(i) For each x ∈ X, the function c(x, a) is inf-compact in a; that is, the set $A_r(x) = \{a \in A(x) \mid c(x, a) \le r\}$ is compact for each finite constant r.

(ii) For each x ∈ X, the transition probability p(· | x, a) is continuous in a with respect to setwise convergence on P(X).

Assumption (B) and Condition S∗ imply the validity of the optimality inequality, the existence of stationary optimal policies, and the convergence of discount optimal policies to average-cost optimal policies.

Conditions for discounted optimality for general action sets (Feinberg and Lewis, 2007)

Assumption (Wu):

1. c is inf-compact on Gr(A); that is, the set

$$\{(x, a) \in \mathrm{Gr}(A) : c(x, a) \le r\}$$

is compact for any finite constant r.

2. p : Gr(A) → P(X) is continuous with respect to weak convergence on P(X).

Under Assumption (Wu), there exist stationary discount optimal policies that satisfy the optimality equations, and $v_{\alpha,n} \to v_\alpha$.

In fact, Feinberg and Lewis introduced a new version of Berge’s theorem.

Conditions for average-cost optimality (Feinberg and Lewis, 2007)

Let $r_{\alpha_0}(x) := \sup_{\alpha_0 \le \alpha < 1} u_\alpha(x)$, where $\alpha_0 \in [0, 1)$ and

$$u_\alpha(x) = v_\alpha(x) - m_\alpha.$$

Assumption (LB): Assumption (B) holds and there exists $\alpha_0 \in [0, 1)$ such that the function $r_{\alpha_0}$ is locally bounded on X; that is, for any x ∈ X there is a neighbourhood of x on which $r_{\alpha_0}$ is bounded.

Assumptions (LB) and (Wu) imply the validity of the optimality inequality, the existence of stationary optimal policies, and the convergence of discount optimal policies to average-cost optimal policies.

Applications: Inventory control, cash management problem.
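To make the inventory control application concrete, here is a minimal sketch of a periodic-review model with backorders, solved by value iteration after truncating the state space. Everything specific in it is a hypothetical assumption chosen for illustration: the truncation level `N`, the demand distribution, the fixed ordering cost `K`, the unit cost `C_UNIT`, the holding and backordering rates `H` and `B`, and the discount factor. The truncation is a numerical device and is not the method of the cited results, which treat the infinite state space directly.

```python
N = 10                                        # inventory levels truncated to {-N, ..., N}
DEMAND = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}     # P(D = d), illustrative
K, C_UNIT, H, B = 2.0, 1.0, 1.0, 5.0          # fixed, unit, holding, backorder costs
ALPHA = 0.9
STATES = list(range(-N, N + 1))


def actions(x):
    """Order quantities that keep the post-order level at or below N."""
    return range(0, N - x + 1)


def one_step_cost(x, a):
    """Ordering cost plus expected holding/backordering cost after demand."""
    ordering = (K if a > 0 else 0.0) + C_UNIT * a
    expected = sum(p * (H * max(x + a - d, 0) + B * max(d - x - a, 0))
                   for d, p in DEMAND.items())
    return ordering + expected


def next_state(x, a, d):
    """Inventory after ordering a and facing demand d, truncated below at -N."""
    return max(x + a - d, -N)


def q_value(x, a, v):
    return one_step_cost(x, a) + ALPHA * sum(
        p * v[next_state(x, a, d)] for d, p in DEMAND.items())


def solve(tol=1e-9, max_iter=10_000):
    v = {x: 0.0 for x in STATES}
    for _ in range(max_iter):
        v_next = {x: min(q_value(x, a, v) for a in actions(x)) for x in STATES}
        done = max(abs(v_next[x] - v[x]) for x in STATES) < tol
        v = v_next
        if done:
            break
    policy = {x: min(actions(x), key=lambda a: q_value(x, a, v)) for x in STATES}
    return v, policy


v, policy = solve()
for x in STATES:
    print(f"x = {x:3d}  order = {policy[x]:2d}  v = {v[x]:.3f}")
```

In the untruncated model the cost is unbounded and the set of order quantities is not compact, which is exactly the setting the inf-compactness assumptions are designed for.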

Average-cost optimality (Feinberg, Kasyanov, Zadoianchuk, 2012)

Assumption (W∗):

1. c is lower semi-continuous on Gr(A).

2. If a sequence $\{x_n\}_{n=1,2,\ldots}$ with values in X converges and its limit x belongs to X, then any sequence $\{a_n\}_{n=1,2,\ldots}$ with $a_n \in A(x_n)$, n = 1, 2, . . ., satisfying the condition that the sequence $\{c(x_n, a_n)\}_{n=1,2,\ldots}$ is bounded above, has a limit point a ∈ A(x).

3. p : Gr(A) → P(X) is continuous with respect to weak convergence on P(X).

Lemma:

(i) Assumption (W) implies Assumption (W∗),

(ii) Assumption (Wu) implies Assumption (W∗).

Average-cost optimality

Let $w = \limsup_{\alpha \uparrow 1} (1 - \alpha)\, m_\alpha$, where $m_\alpha = \inf_{x \in X} v_\alpha(x)$.

Recall Assumption (G): $g := \inf_{x \in X} w(x) < \infty$.

Optimality inequality: Let Assumption (G) hold. If there exist a measurable function u : X → [0,∞) and a stationary policy $\phi$ such that

$$w + u(x) \ge c(x, \phi(x)) + \int_X u(y)\, p(dy \mid x, \phi(x)),$$

then $\phi$ is average-cost optimal and

$$w(x) = w^\phi(x) = \limsup_{\alpha \uparrow 1} (1 - \alpha)\, v_\alpha(x) = w = g.$$

Assumption $(\underline{B})$: Assumption (G) holds and $\liminf_{\alpha \uparrow 1} u_\alpha(x) < \infty$ for all x ∈ X.

Assumption $(\underline{B})$ is weaker than Assumption (B), which requires $\limsup_{\alpha \uparrow 1} u_\alpha(x) < \infty$.

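Define

$$U_\beta(x) = \inf_{\alpha \in [\beta, 1)} u_\alpha(x), \qquad \underline{u}_\beta(x) = \liminf_{y \to x} U_\beta(y), \qquad \beta \in [0, 1),\ x \in X,$$

and

$$u(x) = \lim_{\beta \uparrow 1} \underline{u}_\beta(x).$$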

Theorem (1)

Suppose Assumptions (W∗) and (B) hold. Then there exists a stationary policy $\phi$ satisfying the optimality inequality; therefore, this stationary policy is optimal and

$$w(x) = w^\phi(x) = \limsup_{\alpha \uparrow 1} (1 - \alpha)\, v_\alpha(x) = \limsup_{\alpha \uparrow 1} (1 - \alpha)\, m_\alpha = g.$$

Define, for all x ∈ X,

$$A^*(x) := \Big\{ a \in A(x) : w + u(x) \ge c(x, a) + \int_X u(y)\, p(dy \mid x, a) \Big\}.$$

Any stationary policy $\phi$ such that $\phi(x) \in A^*(x)$ for all x ∈ X is optimal.

Theorem (2)

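Suppose Assumptions (W∗) and (B) hold. Then, for the optimal policy $\phi$ satisfying the optimality inequality, the following additional properties hold:

$$w^\phi(x) = \liminf_{\alpha \uparrow 1} (1 - \alpha)\, m_\alpha = \lim_{\alpha \uparrow 1} (1 - \alpha)\, v_\alpha(x) = \lim_{n \to \infty} \frac{1}{n}\, v^\phi_{1,n}(x).$$

There are many open questions. Theorem (2) implies that, for the optimal policy $\phi$ and for any policy π,

$$w^\phi(x) = \lim_{n \to \infty} \frac{1}{n}\, v^\phi_{1,n}(x) \le \limsup_{n \to \infty} \frac{1}{n}\, v^\pi_{1,n}(x).$$

But we do not know whether

$$w^\phi(x) \le \liminf_{n \to \infty} \frac{1}{n}\, v^\pi_{1,n}(x).$$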

Discount-optimality

Theorem (Feinberg, Kasyanov, Zadoianchuk)

Let assumption (W∗) hold. Then:

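(i) The functions $v_{\alpha,n}$, n = 0, 1, . . ., and $v_\alpha$ are lower semi-continuous, and $v_{\alpha,n}(x) \to v_\alpha(x)$ as n → ∞ for all x ∈ X.

(ii) $v_{\alpha,n+1}(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \alpha \int_X v_{\alpha,n}(y)\, p(dy \mid x, a) \Big\}$, x ∈ X, n = 0, 1, . . .

(iii) $v_\alpha(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \alpha \int_X v_\alpha(y)\, p(dy \mid x, a) \Big\}$.

(iv) Let

$$A_\alpha(x) = \Big\{ a \in A(x) : v_\alpha(x) = c(x, a) + \alpha \int_X v_\alpha(y)\, p(dy \mid x, a) \Big\}.$$

A stationary policy $\phi$ is discount optimal if and only if $\phi(x) \in A_\alpha(x)$ for all x ∈ X.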

Approximations of average cost optimal policies

Consider the upper topological limit of the sets Aα(x) as α ↑ 1.

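$$\lim_{\alpha \uparrow 1} \mathrm{Gr}(A_\alpha) = \Big\{ (x, a) \in X \times A : \exists\, \alpha_n \uparrow 1 \text{ as } n \to \infty,\ \exists\, (x_n, a_n) \in \mathrm{Gr}(A_{\alpha_n}),\ n \ge 1, \text{ such that } (x, a) = \lim_{n \to \infty} (x_n, a_n) \Big\}.$$

Let $A_{\mathrm{app}}(x) = \{ a \in A^*(x) : (x, a) \in \lim_{\alpha \uparrow 1} \mathrm{Gr}(A_\alpha) \}$, x ∈ X.

Theorem

Under Assumptions (W∗) and (B), the sets $A_{\mathrm{app}}(x)$ are nonempty and compact. Furthermore, there exists a stationary policy $\phi_{\mathrm{app}}$ such that $\phi_{\mathrm{app}}(x) \in A_{\mathrm{app}}(x)$, and any such policy is average-cost optimal.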

Extension of Berge’s theorem to non-compact action sets

Why is the function

$$v_{\alpha,n+1}(x) = \min_{a \in A(x)} \Big\{ c(x, a) + \alpha \int_X v_{\alpha,n}(y)\, p(dy \mid x, a) \Big\}$$

lower semi-continuous? This does not follow from Berge’s theorem, which states the lower semi-continuity of the function

$$v(x) := \inf_{y \in \varphi(x)} u(x, y)$$

for compact action sets, because we do not assume that A(x) is compact.

Berge’s theorem

Definition

A function $u : X \times Y \to \mathbb{R}$ is called K-inf-compact on $\mathrm{Gr}_X(\varphi)$ if, for every K ∈ C(X), this function is inf-compact on $\mathrm{Gr}_K(\varphi) = \{(x, y) : x \in K,\ y \in \varphi(x)\}$.

Theorem (Generalization of Berge’s theorem, Feinberg, Kasyanov, Zadoianchuk, 2012)

If the function $u : X \times Y \to \mathbb{R}$ is K-inf-compact on $\mathrm{Gr}_X(\varphi)$, then the function $v(x) = \inf_{y \in \varphi(x)} u(x, y)$ is lower semi-continuous on X.

(i) Under the assumptions of Berge’s theorem, the function u(x, y) is K-inf-compact.

(ii) If u(x, y) is inf-compact on GrX(φ) then it is K-inf-compact.
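The converse of (ii) fails in general. As a short worked example under illustrative assumptions (X = Y = ℝ and φ(x) = ℝ for all x, so the image sets are not compact), the function u(x, y) = y² is K-inf-compact but not inf-compact:

```latex
% Assumed setting for illustration: X = Y = \mathbb{R}, \varphi(x) = \mathbb{R}, u(x, y) = y^2.
\begin{align*}
\{(x, y) \in \mathrm{Gr}_X(\varphi) : u(x, y) \le r\}
  &= \mathbb{R} \times [-\sqrt{r}, \sqrt{r}\,]
  && \text{(unbounded in } x \text{, hence not compact for } r \ge 0\text{)}, \\
\{(x, y) \in \mathrm{Gr}_K(\varphi) : u(x, y) \le r\}
  &= K \times [-\sqrt{r}, \sqrt{r}\,]
  && \text{(compact for every compact } K \subset \mathbb{R}\text{)}, \\
v(x) = \inf_{y \in \mathbb{R}} y^2 &= 0
  && \text{(lower semi-continuous, with the infimum attained at } y = 0\text{)}.
\end{align*}
```

The level sets become compact only after x is restricted to a compact set, which is the distinction the K-inf-compactness definition captures.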

Berge’s theorem

(i) If u(·, ·) is a K-inf-compact function on $\mathrm{Gr}_X(\varphi)$, then, for every x ∈ X, the function u(x, ·) is inf-compact on $\varphi(x)$.

(ii) A K-inf-compact function u(·, ·) on $\mathrm{Gr}_X(\varphi)$ is lower semi-continuous on $\mathrm{Gr}_X(\varphi)$.

Properties (i) and (ii) typically hold for inventory control problems.

Let X and Y be metrizable spaces. Then u is K-inf-compact on $\mathrm{Gr}_X(\varphi)$ if and only if conditions 1 and 2 of Assumption (W∗) hold, with u and $\varphi$ in place of c and A.

Fatou’s lemma for weakly converging measures

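Fatou’s Lemma: For any µ ∈ P(X) and for any sequence of nonnegative measurable functions f1, f2, . . .,

$$\int_X \liminf_{n \to \infty} f_n(x)\, \mu(dx) \le \liminf_{n \to \infty} \int_X f_n(x)\, \mu(dx).$$

(Royden, 1968) If $\{\mu_n\} \subset P(X)$ converges setwise to µ ∈ P(X), then

$$\int_X \liminf_{n \to \infty} f_n(x)\, \mu(dx) \le \liminf_{n \to \infty} \int_X f_n(x)\, \mu_n(dx).$$

(Schäl, 1993; Feinberg, Kasyanov, Zadoianchuk, 2012) If $\mu_n$ converges weakly to µ, then

$$\int_X \underline{f}(x)\, \mu(dx) \le \liminf_{n \to \infty} \int_X f_n(x)\, \mu_n(dx),$$

where $\underline{f}(x) = \liminf_{n \to \infty,\ y \to x} f_n(y)$, x ∈ X.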
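The need to replace the pointwise lower limit by a lower limit taken jointly in n and in the argument can be seen on point masses. The sketch below assumes the Dirac measures at 1/n, which converge weakly to the Dirac measure at 0, and the constant sequence f_n equal to the indicator of {0}; the small symmetric grid used for the joint lower limit is a crude numerical proxy, not an exact computation.

```python
def f(y):
    """f_n = f for all n: the indicator of {0}, nonnegative and measurable."""
    return 1.0 if y == 0.0 else 0.0


def lower_envelope_at(x, radius=1e-3, grid=201):
    """Proxy for liminf of f(y) as y -> x: minimum of f over a small grid around x."""
    ys = [x + radius * (2 * k / (grid - 1) - 1) for k in range(grid)]
    return min(f(y) for y in ys)


# Right-hand side: liminf over n of the integrals of f_n with respect to delta_{1/n}.
tail = [f(1.0 / n) for n in range(1000, 2001)]
rhs = min(tail)

# Left-hand side with the pointwise liminf, integrated against the limit delta_0.
lhs_pointwise = f(0.0)

# Left-hand side with the joint lower limit, integrated against delta_0.
lhs_joint = lower_envelope_at(0.0)

print("pointwise version holds:", lhs_pointwise <= rhs)
print("joint-limit version holds:", lhs_joint <= rhs)
```

With the pointwise lower limit the inequality fails (1 on the left, 0 on the right), while with the joint lower limit both sides equal 0, consistent with the weak-convergence version of the lemma.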

Conclusions

This talk covered:

(i) General conditions of optimality for discounted and average-cost MDPs with Borel state and action sets, unbounded costs, and weakly continuous transition probabilities. The results are applicable to inventory control problems, workload control in queues, . . .

(ii) Extension of Berge’s theorem to noncompact action sets

(iii) Proof and extensions of Fatou’s lemma for weakly converging measures formulated by Schäl (1993); related results were established by Serfozo (1982).
