
Process-Based Risk Measures for Observable and Partially Observable Discrete-Time Controlled Systems

Jingnan Fan∗ Andrzej Ruszczyński†

November 5, 2014; revised April 15, 2015

Abstract

For controlled discrete-time stochastic processes we introduce a new class of dynamic risk measures, which we call process-based. Their main feature is that they measure the risk of processes that are functions of the history of the base process. We introduce a new concept of conditional stochastic time consistency and we derive the structure of process-based risk measures enjoying this property. We show that they can be equivalently represented by a collection of static law-invariant risk measures on the space of functions of the state of the base process. We apply this result to controlled Markov processes and we derive dynamic programming equations. Next, we consider partially observable processes and we derive the structure of stochastically conditionally time-consistent risk measures in this case. We prove that they can be represented by a sequence of law-invariant risk measures on the space of functions of the observable part of the state. We also prove corresponding dynamic programming equations.

Keywords: Dynamic Risk Measures; Time Consistency; Partially Observable Markov Processes

1 Introduction

The main objective of this paper is to provide theoretical foundations of the theory of dynamic risk measures for controlled stochastic processes, in particular, Markov processes, with only partial observation of the state.

The theory of dynamic risk measures in discrete time has been intensively developed in the last 10 years (see [12, 43, 37, 39, 20, 10, 42, 2, 35, 26, 25, 11] and the references therein). The basic setting is the following: we have a probability space (Ω, F, P), a filtration {Ft}t=1,...,T with a trivial F1, and we define appropriate spaces Zt of Ft-measurable random variables, t = 1, . . . , T. For each t = 1, . . . , T, a mapping ρt,T : ZT → Zt is called a conditional risk measure. The central role in the theory is played by the concept of time consistency, which regulates relations between the mappings ρt,T and ρs,T for different s and t.

∗Rutgers University, Department of Management Science and Information Systems, Piscataway, NJ 08854, USA, Email: [email protected]

†Rutgers University, Department of Management Science and Information Systems, Piscataway, NJ 08854, USA, Email: [email protected]


One definition employed in the literature is the following: for all Z, W ∈ ZT, if ρt,T(Z) ≤ ρt,T(W), then ρs,T(Z) ≤ ρs,T(W) for all s < t. This can be used to derive the recursive relation

ρt,T(Z) = ρt(ρt+1,T(Z)),

with simpler one-step conditional risk mappings ρt : Zt+1 → Zt, t = 1, . . . , T − 1. Much effort has been devoted to deriving dual representations of the conditional risk mappings and to studying their evolution in various settings.

When applied to controlled processes, and in particular to Markov decision problems, the theory of dynamic measures of risk encounters difficulties. The spaces Zt become larger when t grows, and each one-step mapping ρt has different domain and range spaces. There is no easy way to exploit stationarity of the process, nor does a clear path exist to extend the theory to infinite-horizon problems. Moreover, no satisfactory theory of law-invariant dynamic risk measures exists which would be suitable for Markov control problems (the restrictive definitions of law invariance employed in [27] and [44] lead to conclusions of limited practical usefulness, while the translation of the approach of [46] to the Markov case appears to be difficult).

Motivated by these issues, in [41] we introduced a specific class of dynamic risk measures which is well suited for Markov problems. We postulated that the one-step conditional risk mappings ρt have a special form, which allows for their representation in terms of risk measures on a space of functions defined on the state space of the Markov process. This restriction allowed for the development of dynamic programming equations and corresponding solution methods, which generalize the well-known results for expected-value problems. Our ideas were successfully extended in [8, 7, 30, 45].

In this paper, we introduce and analyze a general class of risk measures, which we call process-based. We consider a controlled process {Xt}t=1,...,T taking values in a Polish space X (the state space), whose conditional distributions are described by controlled transition kernels Qt : X^t × U → P(X), t = 1, . . . , T − 1, where U is a certain control space. Any history-dependent (measurable) control ut = πt(x1, . . . , xt) is allowed. In this setting, we are only interested in measuring the risk of stochastic processes of the form Zt = ct(Xt, ut), t = 1, . . . , T, where ct : X × U → R can be any bounded measurable function. This restriction of the class of stochastic processes for which risk needs to be measured is one of the two cornerstones of our approach. The other cornerstone is our new concept of stochastic conditional time consistency. It is more restrictive than the usual time consistency, because it involves conditional distributions and uses stochastic dominance rather than the pointwise order. These two foundations allow for the development of a theory of dynamic risk measures which can be fully described by a sequence of law-invariant risk measures on a space V of measurable functions on the state space X. In the special case of controlled Markov processes, we derive the structure postulated in [41], thus providing its solid theoretical foundations. We also derive dynamic programming equations in a much more general setting than that of [41].

In the extant literature, three basic approaches to introducing risk aversion in Markov decision processes have been employed: utility functions (see, e.g., [23, 24, 15]), mean–variance models (see, e.g., [47, 19, 31, 1]), and entropic (exponential) models (see, e.g., [22, 32, 13, 17, 29, 5]). Our approach generalizes the utility and exponential models; the mean–variance models do not satisfy, in general, the monotonicity and time-consistency conditions, which are relevant for our theory. However, a version satisfying these conditions has been recently proposed in [9].

In the second part of the paper, we extend our theory to partially observable controlled Markov processes. In the expected-value case, this classical topic is covered in many monographs (see, e.g., [21, 6, 4] and the references therein). The standard approach is to consider the belief state space, involving the space of probability distributions of the unobserved part of the state. The recent report [18] provides the state-of-the-art setting. The risk-averse case has so far been dealt with by means of the entropic risk measure (see [36, 5]). Our main result in the risk-averse case is that the conditional risk mappings can be equivalently modeled by risk measures on the space of functions defined on the observable part of the state only. We derive this observation in two parallel, but equivalent, ways: by considering a history-dependent description of the evolution of the observable part, and by considering belief states, that is, posterior distributions of the unobservable part. We also derive risk-averse dynamic programming equations for partially observable Markov models.

The paper is organized as follows. In sections 2.1–2.3, we formalize the basic model and review the concepts of risk measures and their time consistency. The first set of original results is presented in section 2.4; we introduce a new concept of stochastic conditional time consistency and we characterize the structure of dynamic risk measures enjoying this property (Theorem 2.14). In section 3, we extend these ideas to the case of controlled processes, and we prove Theorem 3.3 on the structure of measures of risk in this case. These results are further specialized to controlled Markov processes in section 4. We introduce the concept of a Markov risk measure and we derive its structure (Theorem 4.4). In section 4.2, we prove an analog of dynamic programming equations in this case. Section 5 is devoted to the analysis of the structure of risk measures for partially observable controlled Markov processes. We attack the problem in two different ways. In section 5.3, we focus on the observable part of the process and its history-dependent dynamics. The key result is Theorem 5.6, which simplifies the modeling of conditional risk measures in this case. In section 5.5, we consider the Markov model in the extended state space, involving the posterior distribution of the unobservable part (the "belief state"). In Theorem 5.8, we derive an equivalent representation of risk measures, which also reduces to law-invariant risk models on the space of functions of the next observation. Finally, in section 5.6, we derive an analog of dynamic programming equations for risk-averse partially observable models.

2 Risk Measures Based on Observable Processes

In this section, we introduce fundamental concepts and properties of dynamic risk measures for uncontrolled stochastic processes in discrete time. In subsection 2.1, we set up our probabilistic framework, and in subsections 2.2 and 2.3 we revisit some important concepts existing in the literature. Beginning from subsection 2.4, we introduce the new notions of conditional time consistency and stochastic conditional time consistency, which are strictly stronger requirements on the dynamic risk measure than the standard concept of time consistency, and which are particularly useful for controlled stochastic processes. Based on these two concepts, we derive the structure of dynamic risk measures, which is based on so-called transition risk mappings.

2.1 Preliminaries

In all subsequent considerations, we work with a Borel subset (X, B(X)) of a Polish space and the canonical measurable space (X^T, B(X)^T), where T is a natural number and B(X)^T is the product σ-algebra. We use {Xt}t=1,...,T to denote the discrete-time process of canonical projections. We also define Ht = X^t to be the space of possible histories up to time t, and we use ht for a generic element of Ht: a specific history up to time t. The random vector (X1, . . . , Xt) will be denoted by Ht.

We assume that for all t = 1, . . . , T − 1, the transition kernels, which describe the conditional distribution of Xt+1 given X1, . . . , Xt, are measurable functions

Qt : X^t → P(X), t = 1, . . . , T − 1, (1)


where P(X) is the set of probability measures on (X, B(X)). These kernels, along with the initial distribution of X1, define a unique probability measure P on the product space X^T with the product σ-algebra.
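The way the kernels and the initial distribution determine a unique measure on X^T can be sketched numerically for a finite state space. The dynamics below (the set X, the distribution P1, and the kernel Q) are invented purely for illustration; the product rule itself is the standard construction.

```python
# Illustrative sketch (not from the paper): on a finite state space, the
# kernels Q_t and the distribution of X_1 determine the probability of
# every history (x_1, ..., x_T) by the product rule
#   P(x_1, ..., x_T) = P_1(x_1) * Q_1(x_1)(x_2) * Q_2(x_1, x_2)(x_3) * ...
from itertools import product

X = [0, 1]              # finite stand-in for the Polish state space X
P1 = {0: 0.5, 1: 0.5}   # initial distribution of X_1

def Q(t, history):
    """History-dependent kernel Q_t: distribution of X_{t+1} given h_t."""
    p = 0.9 if history[-1] == 0 else 0.2   # arbitrary example dynamics
    return {0: p, 1: 1.0 - p}

def path_probability(path):
    prob = P1[path[0]]
    for t in range(1, len(path)):
        prob *= Q(t, path[:t])[path[t]]
    return prob

T = 3
total = sum(path_probability(p) for p in product(X, repeat=T))
# total == 1: the path probabilities form a valid measure on X^T
```

Because every kernel row sums to one, the probabilities of all 2^T histories sum to one, which is exactly why the construction yields a probability measure on the product space.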

For the stochastic system described above, we consider a sequence of random variables {Zt}t=1,...,T taking values in R; we assume that lower values of Zt are preferred (e.g., Zt represents a "cost" at time t). We require that {Zt}t=1,...,T is bounded and adapted to {Ft}t=1,...,T, the natural filtration generated by the process X. In order to facilitate our discussion, we introduce the following spaces:

Zt = { Z : X^T → R | Z is Ft-measurable and bounded }, t = 1, . . . , T. (2)

It is then equivalent to say that Zt ∈ Zt. We also introduce the spaces

Zt,T = Zt × · · · × ZT, t = 1, . . . , T.

Since Zt is Ft-measurable, a measurable function φt : X^t → R exists such that Zt = φt(X1, . . . , Xt). With a slight abuse of notation, we still use Zt to denote this function.

2.2 Dynamic risk measures

In this subsection, we quickly review some definitions related to risk measures. Note that all relations (e.g., equality, inequality) between random variables are understood in the "everywhere" sense.

Definition 2.1. A mapping ρt,T : Zt,T → Zt, where 1 ≤ t ≤ T, is called a conditional risk measure if it satisfies the monotonicity property: for all (Zt, . . . , ZT) and (Wt, . . . , WT) in Zt,T, if Zs ≤ Ws for all s = t, . . . , T, then

ρt,T(Zt, . . . , ZT) ≤ ρt,T(Wt, . . . , WT).

Definition 2.2. A conditional risk measure ρt,T : Zt,T → Zt

(i) is normalized if ρt,T(0, . . . , 0) = 0;

(ii) is translation-invariant if for all (Zt, . . . , ZT) ∈ Zt,T,

ρt,T(Zt, . . . , ZT) = Zt + ρt,T(0, Zt+1, . . . , ZT).

Throughout the paper, we assume all conditional risk measures to be at least normalized. Translation invariance is a fundamental property, which will also be frequently used; under normalization, it implies that ρt,T(Zt, 0, . . . , 0) = Zt.

Definition 2.3. A conditional risk measure ρt,T has the local property if for any (Zt, . . . , ZT) ∈ Zt,T and for any event A ∈ Ft, we have

1A ρt,T(Zt, . . . , ZT) = ρt,T(1A Zt, . . . , 1A ZT).

The local property is important because it means that the conditional risk measure at time t, restricted to any Ft-event A, is not influenced by the values that Zt, . . . , ZT take on the complement A^c.

Definition 2.4. A dynamic risk measure ρ = {ρt,T}t=1,...,T is a sequence of conditional risk measures ρt,T : Zt,T → Zt. We say that ρ is translation-invariant or has the local property if all ρt,T, t = 1, . . . , T, satisfy the respective conditions of Definition 2.2 or 2.3.


2.3 Time Consistency

The notion of time consistency can be formulated in different ways, with weaker or stronger assumptions; the key idea is that if one sequence of costs, compared to another sequence, has the same current cost and lower risk in the future, then it should have lower current risk. In this and the next subsection, we discuss three formulations of time consistency, from the weakest to the strongest one. We also show how the tower property (the recursive relation between ρt,T and ρt+1,T that time consistency implies) improves with more refined time-consistency concepts. The following definition of time consistency was employed in [41].

Definition 2.5. A dynamic risk measure {ρt,T}t=1,...,T is time-consistent if and only if for any 1 ≤ τ < θ ≤ T and for any (Zτ, . . . , ZT) and (Wτ, . . . , WT) in Zτ,T, the conditions

Zt = Wt, for all t = τ, . . . , θ − 1,
ρθ,T(Zθ, . . . , ZT) ≤ ρθ,T(Wθ, . . . , WT),

imply
ρτ,T(Zτ, . . . , ZT) ≤ ρτ,T(Wτ, . . . , WT).

The following proposition simplifies the definition of a time-consistent measure; it will be useful in further derivations. The proof is omitted; it can be found in [41].

Proposition 2.6. A dynamic risk measure {ρt,T}t=1,...,T is time-consistent if for any 1 ≤ t < T and for all (Zt, . . . , ZT), (Wt, . . . , WT) ∈ Zt,T, the conditions

Zt = Wt,
ρt+1,T(Zt+1, . . . , ZT) ≤ ρt+1,T(Wt+1, . . . , WT),

imply that ρt,T(Zt, . . . , ZT) ≤ ρt,T(Wt, . . . , WT).

It turns out that a translation-invariant and time-consistent dynamic risk measure can bedecomposed into and then reconstructed from so-called one-step conditional risk mappings.

Theorem 2.7 ([41]). A dynamic risk measure {ρt,T}t=1,...,T is translation-invariant and time-consistent if and only if there exist mappings ρt : Zt+1 → Zt, t = 1, . . . , T − 1, satisfying the monotonicity property and such that ρt(0) = 0, called one-step conditional risk mappings, such that for all t = 1, . . . , T − 1,

ρt,T(Zt, . . . , ZT) = Zt + ρt(ρt+1,T(Zt+1, . . . , ZT)). (3)
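The nested recursion (3) can be sketched numerically on a small tree. This is our own illustration, not the paper's: the cost function Z and the one-step mapping (here the worst-case mapping ρt(V)(ht) = max over successor states x of V(ht, x), which is monotone and maps 0 to 0) are chosen only as a simple example.

```python
# A minimal sketch of the backward evaluation in (3):
#   rho_{t,T}(Z_t, ..., Z_T) = Z_t + rho_t( rho_{t+1,T}(Z_{t+1}, ..., Z_T) ),
# using the worst-case one-step mapping as an example.
X = [0, 1]   # finite state space
T = 3

def Z(t, h):
    """Cost at time t as a function of the history h = (x_1, ..., x_t)."""
    return float(sum(h))             # arbitrary example cost

def rho(t, h):
    """Evaluates rho_{t,T}(Z_t, ..., Z_T) at a history h of length t."""
    if t == T:
        return Z(T, h)
    # recursion (3): current cost plus one-step risk of the tail risk
    return Z(t, h) + max(rho(t + 1, h + (x,)) for x in X)

risk = rho(1, (0,))   # nested worst-case risk of the whole cost process
```

With these example costs, the value rho(1, (0,)) is obtained by two nested applications of the one-step mapping, exactly as the tower property prescribes.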

As an immediate comment, we would like to point out that translation invariance and time consistency do not imply the local property, as demonstrated by the following example.

Example 2.8. An example of a dynamic risk measure which is translation-invariant and time-consistent but fails to have the local property: T = 2, X = {1, 2},

ρ1,2(Z1, Z2)(x1) = Z1(x1) + E[Z2 | X1 ≠ x1], x1 ∈ {1, 2},
ρ2,2(Z2)(x1, x2) = Z2(x1, x2), (x1, x2) ∈ {1, 2}².

Conceptually, the one-step conditional risk mappings play a role similar to one-step conditional expectations, and will be very useful when the tower property is involved. At this stage, without further refinement of the assumptions, a one-step conditional risk mapping remains a fairly abstract and general object that is hard to characterize. In [41], a special variational form of the one-step conditional risk mappings is imposed, which is well suited for Markovian applications, while it is unclear whether other forms of such mappings exist. In order to gain a deeper understanding of these concepts, in the following we introduce two stronger notions of time consistency, and we argue that any one-step conditional risk mapping is of the variational form postulated in [41]. To this end, we use the particular structure of the space (X^T, B(X)^T) and the way a probability measure is defined on this space.

2.4 Stochastic Conditional Time Consistency and Transition Risk Mappings

We now refine the concept of time consistency for process-based risk measures.

Definition 2.9. A dynamic risk measure {ρt,T}t=1,...,T is conditionally time-consistent if for any 1 ≤ t ≤ T − 1, for any ht ∈ X^t, and for all (Zt, . . . , ZT), (Wt, . . . , WT) ∈ Zt,T, the conditions

Zt(ht) = Wt(ht),
ρt+1,T(Zt+1, . . . , ZT)(ht, ·) ≤ ρt+1,T(Wt+1, . . . , WT)(ht, ·), (4)

imply
ρt,T(Zt, . . . , ZT)(ht) ≤ ρt,T(Wt, . . . , WT)(ht).

Proposition 2.10. Conditional time consistency implies time consistency and the local property.

Proof. If {ρt,T}t=1,...,T is conditionally time-consistent, then it satisfies the conditions of Proposition 2.6 and is thus time-consistent.

Let us prove by induction on t, from T down to 1, that ρt,T satisfies the local property. Clearly, ρT,T does: if A ∈ FT, then

[1A ρT,T](ZT) = 1A ZT = ρT,T(1A ZT).

Suppose ρt+1,T satisfies the local property for some 1 ≤ t < T, and consider any A ∈ Ft, any ht ∈ X^t, and any (Zt, . . . , ZT) ∈ Zt,T. Two cases may occur.

• If 1A(ht) = 0, then [1A Zt](ht) = 0 and the local property for t + 1 yields

[ρt+1,T(1A Zt+1, . . . , 1A ZT)](ht, ·) = [1A ρt+1,T(Zt+1, . . . , ZT)](ht, ·) = 0.

By conditional time consistency,

ρt,T(1A Zt, 1A Zt+1, . . . , 1A ZT)(ht) = ρt,T(0, . . . , 0)(ht) = 0.

• If 1A(ht) = 1, then [1A Zt](ht) = Zt(ht) and the local property for t + 1 implies that

[ρt+1,T(1A Zt+1, . . . , 1A ZT)](ht, ·) = [1A ρt+1,T(Zt+1, . . . , ZT)](ht, ·) = ρt+1,T(Zt+1, . . . , ZT)(ht, ·).

By conditional time consistency, ρt,T(1A Zt, . . . , 1A ZT)(ht) = ρt,T(Zt, . . . , ZT)(ht).

In both cases, ρt,T(1A Zt, . . . , 1A ZT)(ht) = [1A ρt,T(Zt, . . . , ZT)](ht).

Conditional time consistency is thus a stronger property than time consistency; it allows us to exclude risk measures which do not satisfy the local property, such as the one in Example 2.8.


Generally, the definition of dynamic risk measures in Section 2.2 and the definition of the time-consistency property in Section 2.3 are valid for any filtration on an underlying probability space (Ω, F, P), instead of the process-generated filtration. However, from Section 2.4 to the end of the paper, we only consider dynamic risk measures defined with the filtration generated by a specific process, because we are interested in those risk measures which can be evaluated on each specific history path. That is why we call these risk measures "process-based."

The following theorem shows that conditional time consistency also yields a refined version of the one-step risk mappings. Throughout the paper, we denote by V the set of all bounded measurable functions on X; it turns out that such mappings can be equivalently represented by risk measures on V.

Theorem 2.11. A dynamic risk measure {ρt,T}t=1,...,T is translation-invariant and conditionally time-consistent if and only if there exist functionals

σt : X^t × V → R, t = 1, . . . , T − 1,

called transition risk mappings, such that

(i) for all t = 1, . . . , T − 1 and all ht ∈ X^t, the functional σt(ht, ·) is a normalized and monotonic risk measure on V;

(ii) for all t = 1, . . . , T − 1, for all (Zt, . . . , ZT) ∈ Zt,T, and for all ht ∈ X^t, we have the following recursive relation:

ρt,T(Zt, . . . , ZT)(ht) = Zt(ht) + σt(ht, ρt+1,T(Zt+1, . . . , ZT)(ht, ·)). (5)

Moreover, for all t = 1, . . . , T − 1, the functional σt is uniquely determined by ρt,T as follows: for every ht ∈ X^t and every v ∈ V,

σt(ht, v) = ρt,T(0, V, 0, . . . , 0)(ht), (6)

where V ∈ Zt+1 satisfies the equation V(ht, ·) = v(·), and can be arbitrary elsewhere.

Proof. Assume {ρt,T}t=1,...,T is translation-invariant and conditionally time-consistent. For any ht ∈ X^t, formula (6) defines a normalized and monotonic risk measure on the space V. Setting

v(x) = ρt+1,T(Zt+1, . . . , ZT)(ht, x), for all x ∈ X,
V(ht+1) = v(x) if ht+1 = (ht, x), and V(ht+1) = 0 otherwise,

we obtain, by conditional time consistency,

ρt+1,T(V, 0, . . . , 0)(ht, ·) = ρt+1,T(Zt+1, . . . , ZT)(ht, ·).

Thus, by the translation property,

ρt,T(Zt, . . . , ZT)(ht) = Zt(ht) + ρt,T(0, Zt+1, . . . , ZT)(ht)
= Zt(ht) + ρt,T(0, V, 0, . . . , 0)(ht)
= Zt(ht) + σt(ht, v).

This chain of relations also proves the uniqueness of σt.


On the other hand, if such transition risk mappings exist, then {ρt,T}t=1,...,T is conditionally time-consistent by the monotonicity of σt(ht, ·). We can now use (5) to obtain, for any t = 1, . . . , T − 1 and for all ht ∈ X^t, the following identity:

ρt,T(0, Zt+1, . . . , ZT)(ht) = σt(ht, ρt+1,T(Zt+1, . . . , ZT)(ht, ·)) = ρt,T(Zt, . . . , ZT)(ht) − Zt(ht),

which is the translation invariance of ρt,T.

A further refinement of the concept of time consistency can be obtained by the use of the transition kernels Qt and stochastic, rather than pointwise, orders. We stress that, unlike the previous notions of time consistency, the new one will depend on the underlying kernels Qt.

Definition 2.12. A dynamic risk measure {ρt,T}t=1,...,T is stochastically conditionally time-consistent with respect to {Qt}t=1,...,T−1 if for any 1 ≤ t ≤ T − 1, for any ht ∈ X^t, and for all (Zt, . . . , ZT), (Wt, . . . , WT) ∈ Zt,T, the conditions

Zt(ht) = Wt(ht),
(ρt+1,T(Zt+1, . . . , ZT) | Ht = ht) ⪯st (ρt+1,T(Wt+1, . . . , WT) | Ht = ht), (7)

imply
ρt,T(Zt, . . . , ZT)(ht) ≤ ρt,T(Wt, . . . , WT)(ht), (8)

where the relation ⪯st stands for the conditional stochastic order, understood as follows: for all η ∈ R,

Qt(ht)({x : ρt+1,T(Zt+1, . . . , ZT)(ht, x) > η}) ≤ Qt(ht)({x : ρt+1,T(Wt+1, . . . , WT)(ht, x) > η}).

When the choice of the underlying transition kernels is clear from the context, we will simply say that the dynamic risk measure is stochastically conditionally time-consistent.

We first note that this new notion of time consistency is even stronger than the conditional time consistency introduced in Definition 2.9.

Proposition 2.13. The stochastic conditional time-consistency implies the conditional time-consistency.

Proof. Suppose {ρt,T}t=1,...,T is stochastically conditionally time-consistent. Fix t ∈ {1, . . . , T − 1}, and let ht and (Zt, . . . , ZT), (Wt, . . . , WT) satisfy the conditions (4) of Definition 2.9. Then for all η ∈ R we obtain

{x : ρt+1,T(Zt+1, . . . , ZT)(ht, x) > η} ⊂ {x : ρt+1,T(Wt+1, . . . , WT)(ht, x) > η},

and thus (8) follows.

It comes as no surprise that, with this stronger notion of time consistency, the mapping σt appearing in Theorem 2.11 has a more refined characterization. The following theorem proves that this additional characterization is the law-invariance property.


Theorem 2.14. A process-based dynamic risk measure {ρt,T}t=1,...,T is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals

σt : graph(Qt) × V → R, t = 1, . . . , T − 1,

such that

(i) for all t = 1, . . . , T − 1 and all ht ∈ X^t, the functional σt(ht, Qt(ht), ·) is a normalized, monotonic, and law-invariant risk measure on V with respect to the distribution Qt(ht);

(ii) for all t = 1, . . . , T − 1, for all (Zt, . . . , ZT) ∈ Zt,T, and for all ht ∈ X^t,

ρt,T(Zt, . . . , ZT)(ht) = Zt(ht) + σt(ht, Qt(ht), ρt+1,T(Zt+1, . . . , ZT)(ht, ·)). (9)

Moreover, for all t = 1, . . . , T − 1, σt is uniquely determined by ρt,T as follows: for every ht ∈ X^t and every v ∈ V,

σt(ht, Qt(ht), v) = ρt,T(0, V, 0, . . . , 0)(ht), (10)

where V ∈ Zt+1 satisfies the equation V(ht, ·) = v(·), and can be arbitrary elsewhere.

Proof. On the one hand, if {ρt,T}t=1,...,T is translation-invariant and stochastically conditionally time-consistent, then by Proposition 2.13 and Theorem 2.11 there exist normalized and monotonic risk measures σt(ht, Qt(ht), ·), ht ∈ X^t, t = 1, . . . , T − 1, that satisfy (9) and (10). We need only verify the postulated law invariance of σt(ht, Qt(ht), ·). If V, V′ ∈ Zt+1 have the same conditional distribution given ht, then Definition 2.12 implies that ρt,T(0, V, 0, . . . , 0)(ht) = ρt,T(0, V′, 0, . . . , 0)(ht), and law invariance follows from (10). On the other hand, if such {σt}t=1,...,T−1 exist, the law invariance of σt(ht, Qt(ht), ·) implies the stochastic conditional time consistency.

Remark 2.15. With a slight abuse of notation, we include the distribution Qt(ht) as an argument of the transition risk mapping, in view of the application to controlled processes.

Example 2.16. In the theory of risk-sensitive Markov decision processes, the following family of entropic risk measures is employed (see [22, 36, 32, 14, 40, 13, 17, 29, 5]):

ρt,T(Zt, . . . , ZT) = (1/γ) ln( E[ exp(γ ∑_{s=t}^{T} Zs) | Ft ] ), t = 1, . . . , T, γ > 0.

It is stochastically conditionally time-consistent, and corresponds to the transition risk mapping

σt(ht, q, v) = (1/γ) ln( Eq[ e^{γv} ] ) = (1/γ) ln( ∫_X e^{γv(x)} q(dx) ), γ > 0. (11)

We could also make γ in (11) dependent on the time t, the current state xt, or even the entire history ht, and still obtain a stochastically conditionally time-consistent dynamic risk measure.
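The entropic transition risk mapping (11) is easy to evaluate on a finite distribution. The sketch below is our own numeric illustration of its defining formula; the distribution q and the function v are made up. It also exhibits the two limiting regimes: for small γ the mapping approaches the plain expectation, and for large γ it approaches the worst-case value.

```python
# Numeric sketch of the entropic transition risk mapping (11) on a finite
# distribution q:  sigma(q, v) = (1/gamma) * ln( sum_x q(x) exp(gamma v(x)) ).
import math

def entropic(q, v, gamma):
    return math.log(sum(p * math.exp(gamma * v[x]) for x, p in q.items())) / gamma

q = {"a": 0.5, "b": 0.5}
v = {"a": 0.0, "b": 10.0}

mean = sum(p * v[x] for x, p in q.items())     # plain expectation: 5.0
risk_small = entropic(q, v, gamma=1e-6)        # near the expectation
risk_large = entropic(q, v, gamma=5.0)         # near the maximum of v
# Normalization holds exactly: entropic(q, 0, gamma) == 0 for any gamma > 0.
```

The mapping is monotone and normalized, and its value always lies between the mean and the essential supremum of v, with γ tuning the degree of risk aversion.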

Example 2.17. The following transition risk mapping satisfies the conditions of Theorem 2.14 and corresponds to a stochastically conditionally time-consistent dynamic risk measure:

σt(ht, q, v) = ∫_X v(s) q(ds) + ϰt(ht) ( ∫_X [ (v(s) − ∫_X v(s′) q(ds′))_+ ]^p q(ds) )^{1/p}, (12)

where ϰt : X^t → [0, 1] is a measurable function and p ∈ [1, +∞). It is an analogue of the static mean–semideviation measure of risk, whose consistency with stochastic dominance is well known [33, 34]. In the construction of a dynamic risk measure, we use q = Qt(ht). If ϰt depends on xt only, the mapping (12) corresponds to a Markov risk measure, discussed in sections 4.1 and 4.2.
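For a discrete distribution, formula (12) becomes a finite sum. The sketch below is our own illustration: the constant kappa stands in for the paper's ϰt(ht), and the distribution and costs are invented.

```python
# Discrete sketch of the mean--semideviation mapping (12):
#   sigma(q, v) = E_q[v] + kappa * ( E_q[ ((v - E_q[v])_+)^p ] )^(1/p),
# with kappa in [0, 1] playing the role of varkappa_t(h_t).
def mean_semideviation(q, v, kappa=0.5, p=2):
    mean = sum(prob * v[x] for x, prob in q.items())
    # upper semideviation of order p: only outcomes above the mean penalized
    upper = sum(prob * max(v[x] - mean, 0.0) ** p for x, prob in q.items())
    return mean + kappa * upper ** (1.0 / p)

q = {"a": 0.5, "b": 0.5}
v = {"a": 0.0, "b": 4.0}

risk = mean_semideviation(q, v)   # mean 2.0 plus half the upper semideviation
```

With kappa = 0 the mapping reduces to the plain expectation, and increasing kappa adds a penalty only for outcomes above the mean, which is what keeps the mapping consistent with stochastic dominance.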


Example 2.18. Our use of the stochastic dominance relation in the definition of stochastic conditional time consistency rules out some candidates for transition risk mappings. Suppose σt(ht, q, v) = v(x1), where x1 ∈ X is a selected state. Such a mapping is a coherent measure of risk as a function of its last argument, and may be law invariant. In particular, it is law invariant when X = {x1, x2} and q(x1) = 1/3, q(x2) = 2/3. Consider v(x1) = 3, v(x2) = 1, w(x1) = 2, w(x2) = 4. For this mapping, we have v ⪯st w under q, but σt(ht, q, v) > σt(ht, q, w), and thus the condition of stochastic conditional time consistency is violated. We consciously exclude such cases because, in the controlled systems discussed in the next section, the second argument (q) is the only one that depends on our decisions. It should be included in the definition of our preferences if practically meaningful results are to be obtained.
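The counterexample in Example 2.18 can be verified numerically. The code below is a sketch in our own notation, using the data stated in the example (q(x1) = 1/3, q(x2) = 2/3, v = (3, 1), w = (2, 4)); the helper function name is ours.

```python
# Numeric check of Example 2.18: the mapping sigma(q, v) = v(x1) violates
# stochastic conditional time consistency.
def survival(f, q, eta):
    """q-probability that f exceeds the level eta."""
    return sum(p for x, p in q.items() if f[x] > eta)

q = {"x1": 1 / 3, "x2": 2 / 3}
v = {"x1": 3.0, "x2": 1.0}
w = {"x1": 2.0, "x2": 4.0}

# v is stochastically dominated by w under q (check all relevant levels) ...
dominated = all(survival(v, q, eta) <= survival(w, q, eta)
                for eta in [0.0, 1.0, 2.0, 3.0, 4.0])
# ... yet sigma assigns v the HIGHER risk, since v(x1) = 3 > 2 = w(x1)
sigma_v, sigma_w = v["x1"], w["x1"]
```

This confirms the example: monotonicity with respect to ⪯st fails even though the mapping is monotone with respect to the pointwise order.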

3 Risk Measures for Controlled Stochastic Processes

We now extend the setting of Section 2 by making the kernels (1) depend on some control variables ut.

3.1 The Model

We still work with the process {X_t}_{t=1,...,T} on the space X^T and introduce a measurable control space (U, B(U)). At each time t, we observe the state x_t and then apply a control u_t ∈ U. We assume that the admissible control sets and the transition kernels (conditional distributions of the next state) depend on all currently known state and control values. More precisely, we make the following assumptions:

1. For all t = 1, ..., T, we require that u_t ∈ U_t(x1, u1, ..., x_{t−1}, u_{t−1}, x_t), where U_t : G_t ⇒ U is a measurable multifunction, and G_1, ..., G_T are the sets of histories of all currently known state and control values before applying each control:

G_1 = X,   G_{t+1} = graph(U_t) × X ⊆ (X × U)^t × X,   t = 1, ..., T − 1;

2. For all t = 1, ..., T − 1, the control-dependent transition kernels

Q_t : graph(U_t) → P(X),   t = 1, ..., T − 1,   (13)

are measurable, and for all t = 1, ..., T − 1 and all (x1, u1, ..., xt, ut) ∈ graph(U_t), the measure Q_t(x1, u1, ..., xt, ut) describes the conditional distribution of X_{t+1}, given the currently known states and controls.

The reason why we allow U_t and Q_t to depend on all known state and control values is that in partially observable Markov decision problems, which we study in Section 5, the conditional distribution of the next observable part of the state depends on all past controls. We plan to apply the results of the present section to the partially observable case.

For this controlled process, a (deterministic) history-dependent admissible policy π = (π1, ..., πT) is a sequence of measurable selectors, called decision rules, π_t : G_t → U, such that π_t(g_t) ∈ U_t(g_t) for all g_t ∈ G_t. We can easily prove by induction on t that for such an admissible policy π, each π_t reduces to a measurable function on X^t, because u_s = π_s(h_s) for all s = 1, ..., t − 1. We are still using π_s to denote the decision rule, although it is formally a different function; it will not lead to any misunderstanding. The set of admissible policies is

Π := { π = (π1, ..., πT) | ∀t, π_t(x1, ..., xt) ∈ U_t( x1, π1(x1), ..., x_{t−1}, π_{t−1}(x1, ..., x_{t−1}), x_t ) }.   (14)



For any fixed policy π ∈ Π, the transition kernels can be rewritten as measurable functions from X^t to P(X):

Q^π_t : (x1, ..., xt) ↦ Q_t( x1, π1(x1), ..., x_t, π_t(x1, ..., x_t) ),   t = 1, ..., T − 1,   (15)

just like the transition kernels of the uncontrolled case given in (1), but indexed by π. Thus, for any policy π ∈ Π, we can consider {X_t}_{t=1,...,T} as an "uncontrolled" process on the probability space (X^T, B(X)^T, P^π), with P^π defined by {Q^π_t}_{t=1,...,T−1}, which is adapted to the policy-independent filtration {F_t}_{t=1,...,T}. As before and throughout this paper, h_t ∈ X^t will stand for (x1, ..., xt).

We still use the same spaces Z_t, t = 1, ..., T, as defined in (2) for the costs incurred at each stage; these spaces also allow us to consider control-dependent costs as collections of policy-indexed costs in Z_{1,T}. Thus, we are able to classify (time-consistent) dynamic risk measures ρ^π for each fixed π ∈ Π, as in Section 2. Note that the ρ^π are defined on the same spaces independently of π, because the filtration and the spaces Z_t, t = 1, ..., T, do not depend on π; however, we do need to index the measures of risk by the policy π, because the transition kernels and, consequently, the probability measure on the space X^T, depend on π.

3.2 Stochastic Conditional Time-Consistency and Transition Risk Mappings

In the previous subsection we discussed a family of dynamic risk measures (ρ^π)_{π∈Π} such that for each fixed π ∈ Π, ρ^π is a dynamic risk measure. In future considerations, we will compare risk levels among different policies, so a meaningful order among (ρ^π)_{π∈Π} is needed. It turns out that our concept of stochastic conditional time-consistency can be extended to this setting and helps us achieve this goal.

Definition 3.1. A family of process-based dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T−1} is stochastically conditionally time-consistent if for any π, π′ ∈ Π, for any 1 ≤ t < T, for all h_t ∈ X^t, all (Z_t, ..., Z_T) ∈ Z_{t,T} and all (W_t, ..., W_T) ∈ Z_{t,T}, the conditions

Z_t(h_t) = W_t(h_t),
( ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T) | H^π_t = h_t ) ⪯_st ( ρ^{π′}_{t+1,T}(W_{t+1}, ..., W_T) | H^{π′}_t = h_t ),

imply

ρ^π_{t,T}(Z_t, ..., Z_T)(h_t) ≤ ρ^{π′}_{t,T}(W_t, ..., W_T)(h_t).

Remark 3.2. As in Definition 2.12, the conditional stochastic order "⪯_st" should be understood as follows: for all η ∈ R we have

Q^π_t(h_t)( { x : ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T)(h_t, x) > η } ) ≤ Q^{π′}_t(h_t)( { x : ρ^{π′}_{t+1,T}(W_{t+1}, ..., W_T)(h_t, x) > η } ).

The above definition of stochastic conditional time-consistency of a family of dynamic risk measures (ρ^π)_{π∈Π} plays a pivotal role in building an intrinsic connection among them, as we explain below. If a family of process-based dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T−1} is



stochastically conditionally time-consistent, then for each fixed π ∈ Π the process-based dynamic risk measure {ρ^π_{t,T}}_{t=1,...,T−1} is stochastically conditionally time-consistent, as defined in Definition 2.12. By virtue of Theorem 2.14, for each π ∈ Π, there exist functionals

σ^π_t : graph(Q^π_t) × V → R,   t = 1, ..., T − 1,

such that for all t = 1, ..., T − 1 and all h_t ∈ X^t, the functional σ^π_t(h_t, Q^π_t(h_t), ·) is a law-invariant risk measure on V with respect to the distribution Q^π_t(h_t), and

ρ^π_{t,T}(Z_t, ..., Z_T)(h_t) = Z_t(h_t) + σ^π_t( h_t, Q^π_t(h_t), ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T)(h_t, ·) ),   ∀ h_t ∈ X^t.

Consider any π, π′ ∈ Π, h_t ∈ X^t, and (Z_t, ..., Z_T) ∈ Z_{t,T}, (W_t, ..., W_T) ∈ Z_{t,T} such that

Z_t(h_t) = W_t(h_t),
Q^π_t(h_t) = Q^{π′}_t(h_t),
ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T)(h_t, ·) = ρ^{π′}_{t+1,T}(W_{t+1}, ..., W_T)(h_t, ·).

Then we have

( ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T) | H^π_t = h_t ) ∼_st ( ρ^{π′}_{t+1,T}(W_{t+1}, ..., W_T) | H^{π′}_t = h_t ),

where the relation ∼_st means that both ⪯_st and ⪰_st are true; in other words, equality in law. Because of the stochastic conditional time-consistency,

ρ^π_{t,T}(Z_t, ..., Z_T)(h_t) = ρ^{π′}_{t,T}(W_t, ..., W_T)(h_t),

whence

σ^π_t( h_t, Q^π_t(h_t), ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T)(h_t, ·) ) = σ^{π′}_t( h_t, Q^{π′}_t(h_t), ρ^{π′}_{t+1,T}(W_{t+1}, ..., W_T)(h_t, ·) ).

This proves that in fact σ^π does not depend on π, and all dependence on π is carried by the controlled kernel Q^π_t. This is a highly desirable property when we apply dynamic risk measures to a control problem. We summarize this important observation in the following theorem, which extends Theorem 2.14 to the case of controlled processes.

Theorem 3.3. A family of process-based dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T} is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals

σ_t : ( ⋃_{π∈Π} graph(Q^π_t) ) × V → R,   t = 1, ..., T − 1,

such that:
(i) For all t = 1, ..., T − 1 and all h_t ∈ X^t, σ_t(h_t, ·, ·) is normalized and satisfies the following strong monotonicity with respect to stochastic dominance:

∀ q1, q2 ∈ { Q^π_t(h_t) : π ∈ Π },   ∀ v1, v2 ∈ V,   (v1; q1) ⪯_st (v2; q2) ⟹ σ_t(h_t, q1, v1) ≤ σ_t(h_t, q2, v2),

where (v; q) = q ∘ v^{−1} means "the distribution of v under q";



(ii) For all π ∈ Π, for all t = 1, ..., T − 1, for all (Z_t, ..., Z_T) ∈ Z_{t,T}, and for all h_t ∈ X^t,

ρ^π_{t,T}(Z_t, ..., Z_T)(h_t) = Z_t(h_t) + σ_t( h_t, Q^π_t(h_t), ρ^π_{t+1,T}(Z_{t+1}, ..., Z_T)(h_t, ·) ).   (16)

Moreover, for all t = 1, ..., T − 1, σ_t is uniquely determined by ρ_{t,T} as follows: for every h_t ∈ X^t, for every q ∈ { Q^π_t(h_t) : π ∈ Π }, and for every v ∈ V,

σ_t(h_t, q, v) = ρ^π_{t,T}(0, V, 0, ..., 0)(h_t),   (17)

where π is any admissible policy such that q = Q^π_t(h_t), and V ∈ Z_{t+1} satisfies the equation V(h_t, ·) = v(·) and can be arbitrary elsewhere.

Proof. We have shown the existence of {σ_t}_{t=1,...,T−1} satisfying (16) and (17) in the discussion preceding the theorem. We can verify the strong monotonicity with respect to stochastic dominance by (17) and Definition 3.1.

3.3 Strong monotonicity with respect to stochastic dominance

In Theorem 3.3 we introduced the notion of strong monotonicity with respect to stochastic dominance, and in this subsection we shed more light on this notion. For a measurable space X, we use P(X) to denote the space of probability measures on X, and we use S to denote a subset of P(X). Recall that V stands for the set of bounded measurable functions on X. We say that a function σ : S × V → R is strongly monotonic with respect to stochastic dominance on S if

∀ q1, q2 ∈ S,   ∀ v1, v2 ∈ V,   (v1; q1) ⪯_st (v2; q2) ⟹ σ(q1, v1) ≤ σ(q2, v2).   (18)

The property of strong monotonicity with respect to stochastic dominance is strictly stronger than the usual monotonicity and law-invariance of σ(q, ·) for all q ∈ S. We remind the readers that the usual law-invariance means that for all v1 and v2 having the same distribution under q, we must have σ(q, v1) = σ(q, v2). Furthermore, a mapping σ which is strongly monotonic with respect to stochastic dominance is nothing else but a function that evaluates the distributions of v under q, for all v ∈ V and q ∈ S, and preserves the (first-order) stochastic dominance. Thus we can rewrite it as a function of distributions, or a function of cumulative distribution functions, F_{v;q}(α) = q{ x : v(x) ≤ α }, α ∈ R.
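On finite spaces, the relation (v1; q1) ⪯_st (v2; q2) can be tested directly through the cumulative distribution functions F_{v;q}. A sketch, with function names of our own choosing:

```python
import numpy as np

def cdf_on_grid(values, probs, grid):
    """F_{v;q}(alpha) = q{ x : v(x) <= alpha }, evaluated on a grid of alphas."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return np.array([probs[values <= a].sum() for a in grid])

def dominated_st(v1, q1, v2, q2):
    """True iff (v1; q1) <=_st (v2; q2), i.e. F_{v1;q1} >= F_{v2;q2} pointwise.
    For step CDFs it suffices to compare at the jump points of either CDF."""
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    grid = np.unique(np.concatenate([v1, v2]))
    return bool(np.all(cdf_on_grid(v1, q1, grid) >= cdf_on_grid(v2, q2, grid)))
```
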

Proposition 3.4. A function σ : S × V → R is strongly monotonic with respect to stochastic dominance on S if and only if there exists σ̃ : { F_{v;q} : v ∈ V, q ∈ S } → R such that σ(q, v) = σ̃(F_{v;q}) for all v ∈ V and q ∈ S, and for all F1, F2 ∈ { F_{v;q} : v ∈ V, q ∈ S },

F1 ≤ F2 ⟹ σ̃(F1) ≥ σ̃(F2).   (19)

Consequently, to construct risk measures which are strongly monotonic with respect to stochastic dominance, one possibility is to take any function of distributions satisfying (19) and the normalization condition. Such functions include the expected value, the mean–upper-semideviations, as in Example 2.17, or, more generally, coherent risk measures given by the Kusuoka representation:

σ(q, v) = sup_{µ∈M} ∫_0^1 AVaR_{1−α}(v; q) µ(dα),   q ∈ P(X), v ∈ V,   (20)

where M is a convex subset of P((0, 1]), and AVaR is defined by

AVaR_{1−α}(v; q) = (1/α) ∫_{1−α}^1 F^{−1}_{v;q}(s) ds,

where F^{−1}_{v;q} is the quantile function of v under q. Even more general mappings are monotonic and normalized risk measures on the space of bounded quantile functions, as discussed in [16]. We can also include all non-decreasing transformations of such measures, provided that they satisfy the normalization condition, such as the entropic risk measure of Example 2.16.
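For a discrete distribution, the integral defining AVaR can be computed exactly as a tail average of the sorted scenario costs; the following sketch (our own implementation, not taken from the paper) may clarify the formula:

```python
import numpy as np

def avar(values, probs, alpha):
    """AVaR_{1-alpha}(v; q) = (1/alpha) * integral_{1-alpha}^{1} F^{-1}_{v;q}(s) ds,
    computed exactly for a discrete distribution (average of the worst alpha-tail)."""
    order = np.argsort(values)                 # sort scenarios by cost
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cum = np.concatenate([[0.0], np.cumsum(p)])
    # overlap of each probability interval [cum[i], cum[i+1]) with [1 - alpha, 1]
    weights = np.clip(cum[1:], 1 - alpha, 1) - np.clip(cum[:-1], 1 - alpha, 1)
    return float(v @ weights) / alpha

# alpha = 1 gives the plain expectation; small alpha approaches the worst case.
print(avar([1.0, 3.0], [0.5, 0.5], 1.0))   # 2.0
print(avar([1.0, 3.0], [0.5, 0.5], 0.5))   # 3.0
```
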

Note that { F_{v;q} : v ∈ V, q ∈ S } does not always span all bounded distributions on R, especially when X is a discrete space; thus, we also allow different forms of σ for each q, as shown in the following example:

Example 3.5. Let X = {1, 2, 3}, S = { q1 = (0.1, 0.1, 0.8); q2 = (1/3, 1/3, 1/3) }, and

σ(q, v) = E_{q1}(v) + κ E_{q1}( [v − E_{q1}(v)]_+ ),   if q = q1,
σ(q, v) = 0.4 max{ v(1), v(2), v(3) } + 0.6 E_{q2}(v),   if q = q2,

with any 0 ≤ κ ≤ 1. Then σ(·, ·) is strongly monotonic with respect to stochastic dominance on S. Indeed, each σ(q_i, ·) is a coherent risk measure on V. We fix any v2 ∈ V and denote min{ v2(1), v2(2), v2(3) } = a and max{ v2(1), v2(2), v2(3) } = c. Then we have

0.4c + 0.2(a + a + c) ≤ σ(q2, v2) ≤ 0.4c + 0.2(a + c + c),

and
(i) for any v1, v2 ∈ V such that (v1; q1) ⪯_st (v2; q2), we have v1 ≤ w := (c, c, a), and thus

σ(q1, v1) ≤ E_{q1}(w) + κ E_{q1}( [w − E_{q1}(w)]_+ ) = 0.2c + 0.8a + 0.16κ(c − a) ≤ 0.6c + 0.4a ≤ σ(q2, v2);

(ii) for any v1, v2 ∈ V such that (v1; q1) ⪰_st (v2; q2), we have v1 ≥ w′ := (a, a, c), and thus

σ(q1, v1) ≥ E_{q1}(w′) + κ E_{q1}( [w′ − E_{q1}(w′)]_+ ) = 0.8c + 0.2a + 0.16κ(c − a) ≥ 0.8c + 0.2a ≥ σ(q2, v2).

4 Application to Controlled Markov Systems

Our results can be further specialized to the case when {X_t} is a controlled Markov system, in which we assume the following conditions:
(i) The admissible control sets are measurable multifunctions of the current state, i.e., U_t : X ⇒ U, t = 1, ..., T;
(ii) The dependence of the transition kernels (13) on the history is carried only through the last state and control:

Q_t : graph(U_t) → P(X),   t = 1, ..., T − 1;

(iii) The step-wise costs depend only on the current state and control:

Z_t = c_t(x_t, u_t),   t = 1, ..., T,

where c_t : graph(U_t) → R, t = 1, ..., T, are measurable bounded functions.



Let Π be the set of admissible history-dependent policies:

Π := { π = (π1, ..., πT) | ∀t, π_t(x1, ..., xt) ∈ U_t(x_t) }.

To alleviate notation, for all π ∈ Π and for all measurable c = (c1, ..., cT), we write

v^{c,π}_t(h_t) := ρ^π_{t,T}( c_t(X_t, π_t(H_t)), ..., c_T(X_T, π_T(H_T)) )(h_t).

The following result is the direct translation of Theorem 3.3 to the Markovian case:

Corollary 4.1. For a controlled Markov system, a family of process-based dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T} is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals

σ_t : { ( h_t, Q_t(x_t, u) ) : h_t ∈ X^t, u ∈ U_t(x_t) } × V → R,   t = 1, ..., T − 1,

such that:
(i) For all t = 1, ..., T − 1 and all h_t ∈ X^t, σ_t(h_t, ·, ·) is normalized and strongly monotonic with respect to stochastic dominance on { Q_t(x_t, u) : u ∈ U_t(x_t) };
(ii) For all π ∈ Π, for all bounded measurable c, for all t = 1, ..., T − 1, and for all h_t ∈ X^t,

v^{c,π}_t(h_t) = c_t(x_t, π_t(h_t)) + σ_t( h_t, Q_t(x_t, π_t(h_t)), v^{c,π}_{t+1}(h_t, ·) ).   (21)

Proof. To verify the "if and only if" statement, we can show that (17) is true if σ_t satisfies (21) for all measurable bounded c.

4.1 Markov Risk Measures

Because of the Markov property of the transition kernels, for a fixed Markov policy¹ π, the future evolution of the process {X_τ}_{τ=t,...,T} depends solely on the current state x_t, and so does the distribution of the future costs c_τ(X_τ, π_τ(X_τ)), τ = t, ..., T. Therefore, it is reasonable to assume that the dependence of the conditional risk measure on the history is also carried by the current state only.

Definition 4.2. A family of process-based dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T} for a controlled Markov system is Markov if for all Markov policies π ∈ Π, for all measurable c = (c1, ..., cT), and for all h_t = (x1, ..., xt) and h′_t = (x′1, ..., x′t) in X^t such that x_t = x′_t, we have

v^{c,π}_t(h_t) = v^{c,π}_t(h′_t).

Proposition 4.3. Under the translation invariance and the stochastic conditional time consistency, {ρ^π_{t,T}}_{π∈Π, t=1,...,T} is Markov if and only if the dependence of σ_t on h_t is carried only by x_t, for all t = 1, ..., T − 1.

Proof. Suppose the family is Markov. For all t = 1, ..., T − 1, for all h_t, h′_t ∈ X^t such that x_t = x′_t, for all u ∈ U_t(x_t), and for all v ∈ V, there exists a Markov policy π ∈ Π such that π_t(x_t) = u. By setting c = (0, ..., 0, c_{t+1}, 0, ..., 0) with c_{t+1} : (x′, u′) ↦ v(x′), we obtain

σ_t(h_t, Q_t(x_t, u), v) = v^{c,π}_t(h_t) = v^{c,π}_t(h′_t) = σ_t(h′_t, Q_t(x_t, u), v).

¹ A Markov policy π is composed of state-dependent measurable decision rules π_t : X → U, t = 1, ..., T.



Therefore, σ_t is indeed memoryless, that is, its dependence on h_t is carried by x_t only.

If σ_t, t = 1, ..., T − 1, are all memoryless, we can prove by induction backward in time that for all t = T, ..., 1, v^{c,π}_t(h_t) = v^{c,π}_t(h′_t) for all Markov π and all h_t, h′_t ∈ X^t such that x_t = x′_t.

Theorem 4.4. For a controlled Markov system, a family of process-based dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T} is translation-invariant, stochastically conditionally time-consistent, and Markov if and only if there exist functionals

σ_t : { ( x, Q_t(x, u) ) : x ∈ X, u ∈ U_t(x) } × V → R,   t = 1, ..., T − 1,

where V is the set of bounded measurable functions on X, such that:
(i) For all t = 1, ..., T − 1 and all x ∈ X, σ_t(x, ·, ·) is normalized and strongly monotonic with respect to stochastic dominance on { Q_t(x, u) : u ∈ U_t(x) };
(ii) For all π ∈ Π, for all measurable bounded c, for all t = 1, ..., T − 1, and for all h_t ∈ X^t,

v^{c,π}_t(h_t) = c_t(x_t, π_t(h_t)) + σ_t( x_t, Q_t(x_t, π_t(h_t)), v^{c,π}_{t+1}(h_t, ·) ).   (22)

Remark 4.5. In [41], formula (22) was postulated as a definition of a Markov risk measure.

Theorem 4.4 provides us with a simple recursive formula (22) for the evaluation of the risk of a Markov policy π:

v^{c,π}_T(x) = c_T(x, π_T(x)),   x ∈ X,
v^{c,π}_t(x) = c_t(x, π_t(x)) + σ_t( x, Q_t(x, π_t(x)), v^{c,π}_{t+1} ),   x ∈ X,   t = T − 1, ..., 1.

It involves the calculation of the values of the functions v^{c,π}_t(·) on the state space X.
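On finite state and control spaces, this recursion is a straightforward backward pass. The sketch below uses our own data layout (`c[t][x][u]`, `Q[t][x][u]`, `pi[t][x]`) and accepts any transition risk mapping σ; with the risk-neutral choice it reduces to expected-cost policy evaluation:

```python
import numpy as np

def evaluate_policy(c, Q, pi, sigma):
    """Backward recursion of Theorem 4.4 for a Markov policy pi:
        v_T(x) = c_T(x, pi_T(x)),
        v_t(x) = c_t(x, pi_t(x)) + sigma(x, Q_t(x, pi_t(x)), v_{t+1}).
    c[t][x][u] is the stage cost, Q[t][x][u] the next-state distribution,
    pi[t][x] the decision rule, sigma(x, q, v) the transition risk mapping.
    """
    T, n = len(c), len(c[0])
    v = np.array([c[T - 1][x][pi[T - 1][x]] for x in range(n)], dtype=float)
    for t in range(T - 2, -1, -1):  # backward in time
        v = np.array([c[t][x][pi[t][x]]
                      + sigma(x, np.asarray(Q[t][x][pi[t][x]]), v)
                      for x in range(n)], dtype=float)
    return v  # v[x] = risk of the policy when the process starts at x

# The risk-neutral mapping recovers the expected total cost.
expectation = lambda x, q, v: float(q @ v)
```
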

4.2 Dynamic Programming

In this section, we fix the cost functions c1, ..., cT and consider a family of dynamic risk measures {ρ^π_{t,T}}_{π∈Π, t=1,...,T} which is translation-invariant, stochastically conditionally time-consistent, and Markov. Our objective is to analyze the risk minimization problem:

min_{π∈Π} v^π_1(x1),   x1 ∈ X.

For this purpose, we introduce the family of value functions:

v*_t(h_t) = inf_{π∈Π_{t,T}} v^π_t(h_t),   t = 1, ..., T,   h_t ∈ X^t,   (23)

where Π_{t,T} is the set of feasible deterministic policies π = (π_t, ..., π_T). As stated in Theorem 4.4, transition risk mappings {σ_t}_{t=1,...,T−1} exist such that

v^π_t(h_t) = c_t(x_t, π_t(h_t)) + σ_t( x_t, Q_t(x_t, π_t(h_t)), v^π_{t+1}(h_t, ·) ),   t = 1, ..., T − 1,   π ∈ Π,   h_t ∈ X^t.   (24)

Our intention is to prove that the value functions v*_t(·) are memoryless, that is, for all h_t = (x1, ..., xt) and h′_t = (x′1, ..., x′t) such that x_t = x′_t, we have v*_t(h_t) = v*_t(h′_t). In this case, with a slight abuse of notation, we shall simply write v*_t(x_t).

In order to formulate the main result of this subsection, we equip the space P(X) of probability measures on X with the topology of weak convergence.



Theorem 4.6. We assume in addition that the following conditions are satisfied:
(i) The transition kernels Q_t(·, ·), t = 1, ..., T − 1, are continuous;
(ii) For every lower semicontinuous v ∈ V, the transition risk mappings σ_t(·, ·, v), t = 1, ..., T − 1, are lower semicontinuous;
(iii) The functions c_t(·, ·), t = 1, ..., T, are lower semicontinuous;
(iv) The multifunctions U_t(·), t = 1, ..., T, are measurable, compact-valued, and upper semicontinuous.
Then the functions v*_t, t = 1, ..., T, are measurable, memoryless, lower semicontinuous, and satisfy the following dynamic programming equations:

v*_T(x) = min_{u∈U_T(x)} c_T(x, u),   x ∈ X,   (25)
v*_t(x) = min_{u∈U_t(x)} { c_t(x, u) + σ_t( x, Q_t(x, u), v*_{t+1} ) },   x ∈ X,   t = T − 1, ..., 1.   (26)

Moreover, an optimal Markov policy π exists and satisfies the equations:

π_T(x) ∈ argmin_{u∈U_T(x)} c_T(x, u),   x ∈ X,   (27)
π_t(x) ∈ argmin_{u∈U_t(x)} { c_t(x, u) + σ_t( x, Q_t(x, u), v*_{t+1} ) },   x ∈ X,   t = T − 1, ..., 1.   (28)

Proof. We prove the memoryless property of v*_t(·) and construct the optimal Markov policy by induction backward in time. For all h_T ∈ X^T we have

v*_T(h_T) = inf_{π∈Π} c_T(x_T, π_T(h_T)) = inf_{u∈U_T(x_T)} c_T(x_T, u).   (29)

Since c_T(·, ·) is lower semicontinuous, it is a normal integrand, that is, its epigraphical mapping

x ↦ { (u, α) ∈ U × R : c_T(x, u) ≤ α }

is a closed-valued and measurable multifunction [38, Ex. 14.31]. Due to assumption (iv), the mapping

c̄_T(x, u) = c_T(x, u), if u ∈ U_T(x); +∞, otherwise,

is a normal integrand as well. By virtue of [38, Thm. 14.37], the infimum in (29) is attained and is a measurable function of x_T. Hence, v*_T(·) is measurable and memoryless. By assumptions (iii) and (iv) and the Berge theorem, it is also lower semicontinuous (see, e.g., [3, Thm. 1.4.16]). Moreover, the optimal solution mapping Ψ_T(x) = { u ∈ U_T(x) : c_T(x, u) = v*_T(x) } is measurable and has nonempty and closed values. Therefore, a measurable selector π_T of Ψ_T exists [28], [3, Thm. 8.1.3].

Suppose v*_{t+1}(·) is measurable, memoryless, and lower semicontinuous, and Markov decision rules π_{t+1}, ..., π_T exist such that

v*_{t+1}(x_{t+1}) = v^{π_{t+1},...,π_T}_{t+1}(x_{t+1}),   ∀ h_{t+1} ∈ X^{t+1}.

Then for any h_t ∈ X^t we have

v*_t(h_t) = inf_{π∈Π} v^π_t(h_t) = inf_{π∈Π} { c_t(x_t, π_t(h_t)) + σ_t( x_t, Q_t(x_t, π_t(h_t)), v^π_{t+1}(h_t, ·) ) }.



On the one hand, since v^π_{t+1}(h_t, ·) ≥ v*_{t+1}(·) and σ_t is nondecreasing with respect to the last argument, we obtain

v*_t(h_t) ≥ inf_{π∈Π} { c_t(x_t, π_t(h_t)) + σ_t( x_t, Q_t(x_t, π_t(h_t)), v*_{t+1} ) } = inf_{u∈U_t(x_t)} { c_t(x_t, u) + σ_t( x_t, Q_t(x_t, u), v*_{t+1} ) }.   (30)

By assumptions (i)–(iii), the mapping (x, u) ↦ c_t(x, u) + σ_t(x, Q_t(x, u), v*_{t+1}) is lower semicontinuous. Invoking [38, Thm. 14.37] and assumption (iv) again, exactly as in the case of t = T, we conclude that the optimal solution mapping

Ψ_t(x) = { u ∈ U_t(x) : c_t(x, u) + σ_t( x, Q_t(x, u), v*_{t+1} ) = inf_{u′∈U_t(x)} [ c_t(x, u′) + σ_t( x, Q_t(x, u′), v*_{t+1} ) ] }

is measurable and has nonempty and closed values; hence, a measurable selector π_t of Ψ_t exists [3, Thm. 8.1.3]. Substituting this selector into (30), we obtain

v*_t(h_t) ≥ c_t(x_t, π_t(x_t)) + σ_t( x_t, Q_t(x_t, π_t(x_t)), v^{π_{t+1},...,π_T}_{t+1} ) = v^{π_t,...,π_T}_t(x_t).

In the last equation, we used (24) and the fact that the decision rules π_t, ..., π_T are Markov. On the other hand,

v*_t(h_t) = inf_{π∈Π} v^π_t(h_t) ≤ v^{π_t,...,π_T}_t(x_t).

Therefore, v*_t(h_t) = v^{π_t,...,π_T}_t(x_t) is measurable and memoryless, and

v*_t(x_t) = min_{u∈U_t(x_t)} { c_t(x_t, u) + σ_t( x_t, Q_t(x_t, u), v*_{t+1} ) } = c_t(x_t, π_t(x_t)) + σ_t( x_t, Q_t(x_t, π_t(x_t)), v*_{t+1} ).

By assumptions (ii), (iii), (iv), and the Berge theorem, v*_t(·) is lower semicontinuous (see, e.g., [3, Thm. 1.4.16]). This completes the induction step.
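On finite spaces, the dynamic programming equations (25)-(26) and the selectors (27)-(28) become a backward pass with an argmin at each state; a sketch under our own data layout (`c[t][x][u]`, `Q[t][x][u]`):

```python
import numpy as np

def risk_averse_dp(c, Q, sigma):
    """Dynamic programming equations (25)-(26) with policy extraction (27)-(28).
    c[t][x][u]: stage costs, Q[t][x][u]: next-state distributions,
    sigma(x, q, v): transition risk mapping; finite states and controls.
    """
    T, n = len(c), len(c[0])
    policy = [None] * T
    policy[T - 1] = [int(np.argmin(c[T - 1][x])) for x in range(n)]
    v = np.array([min(c[T - 1][x]) for x in range(n)], dtype=float)
    for t in range(T - 2, -1, -1):
        scores = [[c[t][x][u] + sigma(x, np.asarray(Q[t][x][u]), v)
                   for u in range(len(c[t][x]))] for x in range(n)]
        policy[t] = [int(np.argmin(s)) for s in scores]  # selector of Psi_t
        v = np.array([min(s) for s in scores], dtype=float)
    return v, policy
```

With a risk-neutral `sigma` this is the classical finite-horizon Bellman recursion; plugging in a mean-semideviation mapping makes the same pass risk-averse.
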

Let us verify the weak lower semicontinuity assumption (ii) for the mean–semideviation transition risk mapping of Example 2.17. To make the mapping Markovian, we assume that the parameter κ depends on x only, that is,

σ(x, q, v) = ∫_X v(s) q(ds) + κ(x) ( ∫_X [ ( v(s) − ∫_X v(s′) q(ds′) )_+ ]^p q(ds) )^{1/p}.   (31)

As before, p ∈ [1, ∞). For simplicity, we skip the subscript t of σ and κ.

Lemma 4.7. Suppose κ(·) is continuous. Then for every lower semicontinuous function v, the mapping (x, q) ↦ σ(x, q, v) is lower semicontinuous.

Proof. Let q_k → q weakly and x_k → x. For all s ∈ X we have the inequality

0 ≤ [ v(s) − ∫_X v(s′) q(ds′) ]_+ ≤ [ v(s) − ∫_X v(s′) q_k(ds′) ]_+ + [ ∫_X v(s′) q_k(ds′) − ∫_X v(s′) q(ds′) ]_+.



By the triangle inequality for the norm in L_p(X, B(X), q_k),

( ∫_X [ v(s) − ∫_X v(s′) q(ds′) ]_+^p q_k(ds) )^{1/p} ≤ ( ∫_X [ v(s) − ∫_X v(s′) q_k(ds′) ]_+^p q_k(ds) )^{1/p} + [ ∫_X v(s′) q_k(ds′) − ∫_X v(s′) q(ds′) ]_+.

Adding ∫_X v(s) q(ds) to both sides, we obtain

∫_X v(s) q(ds) + ( ∫_X [ v(s) − ∫_X v(s′) q(ds′) ]_+^p q_k(ds) )^{1/p} ≤ ( ∫_X [ v(s) − ∫_X v(s′) q_k(ds′) ]_+^p q_k(ds) )^{1/p} + max[ ∫_X v(s) q_k(ds), ∫_X v(s) q(ds) ].

By the lower semicontinuity of v and the weak convergence of q_k to q, we have

∫_X v(s) q(ds) ≤ lim inf_{k→∞} ∫_X v(s) q_k(ds),

that is, for every ε > 0, we can find k_ε such that for all k ≥ k_ε,

∫_X v(s) q_k(ds) ≥ ∫_X v(s) q(ds) − ε.

Therefore, for these k we obtain

∫_X v(s) q(ds) + ( ∫_X [ v(s) − ∫_X v(s′) q(ds′) ]_+^p q_k(ds) )^{1/p} ≤ ∫_X v(s) q_k(ds) + ( ∫_X [ v(s) − ∫_X v(s′) q_k(ds′) ]_+^p q_k(ds) )^{1/p} + ε.

Taking the "lim inf" of both sides, and using the weak convergence of q_k to q and the lower semicontinuity of the integrated functions, we conclude that

∫_X v(s) q(ds) + ( ∫_X [ v(s) − ∫_X v(s′) q(ds′) ]_+^p q(ds) )^{1/p} ≤ lim inf_{k→∞} { ∫_X v(s) q_k(ds) + ( ∫_X [ v(s) − ∫_X v(s′) q_k(ds′) ]_+^p q_k(ds) )^{1/p} } + ε.

As ε > 0 was arbitrary, the last relation proves the lower semicontinuity of σ in the case when κ(x) ≡ 1. The case of a continuous κ(x) ∈ [0, 1] can now be easily analyzed by noticing that σ(x, q, v) is a convex combination of the expected value and the risk measure of the last displayed relation:

σ(x, q, v) = ( 1 − κ(x) ) ∫_X v(s) q(ds) + κ(x) { ∫_X v(s) q(ds) + ( ∫_X [ v(s) − ∫_X v(s′) q(ds′) ]_+^p q(ds) )^{1/p} }.

As both components are lower semicontinuous in (x, q), so is their sum.



5 Application to Partially Observable Markov Decision Processes

In this section, we apply our previous results to a partially observable Markov decision process {X_t, Y_t}_{t=1,...,T}, in which {X_t}_{t=1,...,T} is observable and {Y_t}_{t=1,...,T} is not. We use the term "partially observable Markov decision process" in a more general way than the extant literature, because we consider dynamic risk measures as the objective of control, rather than just the expected value of the cost.

5.1 The Model

In order to develop our subsequent theory, it is essential to define our partially observable Markov decision process (POMDP) in a clear and rigorous way. This subsection mostly follows Chapter 5 of [4], and readers are encouraged to consult [4] for more details.

The state space of the model is X × Y, where (X, B(X)) and (Y, B(Y)) are two Polish spaces. From the modeling perspective, x ∈ X is the part of the state that we can observe at each step, while y ∈ Y is unobservable. The measurable space that we will work with is then (X × Y)^T, endowed with the canonical product σ-field, and we use x_t and y_t to denote the canonical projections at time t. The control space is still U, and since only X is observable, the set of admissible controls at step t is still given by a multifunction U_t : X ⇒ U. The transition kernel at time t is now given by

Q_t : graph(U_t) × Y → P(X × Y);

in other words, given that at time t the state is (x, y) and we apply control u, the distribution of the next state is Q_t( · | x, y, u). Finally, the cost incurred at each stage t is given by the functional c_t : graph(U_t) → R, which is assumed to be measurable and bounded.

We would also like to comment on the notion of history in the current POMDP setting. At time t, all the information available for making a decision is given by g_t = (x1, u1, ..., xt), because y_t is unobservable. In the following, the notion of a history will always correspond to this observable history.

5.2 Bayes Operator

In a POMDP, the Bayes operator provides a way to update from a prior belief to a posterior belief. More precisely, assume that our current state observation is x, our action is u, and our current best guess of the distribution of the unobservable state is ξ. Given a new observation x′, we can find a formula to determine the posterior distribution of the unobservable state.

Let us start with a fairly general construction of the Bayes operator. Assuming the above setup, for given (x, ξ, u) ∈ X × P(Y) × U, define a new measure m_t(x, ξ, u) on X × Y, initially on all measurable rectangles A × B, as

m_t(x, ξ, u)(A × B) = ∫_Y Q_t(A × B | x, y, u) ξ(dy).

We readily verify that this uniquely defines a probability measure on X × Y. If the measurable space (Y, B(Y)) is standard, i.e., isomorphic to a Borel subspace of R, we can disintegrate m_t(x, ξ, u) into its marginal λ_t(x, ξ, u)(dx′) on X and a transition kernel K_t(x, ξ, u)(x′, dy′) from X to Y, which reads:

m_t(x, ξ, u)(dx′, dy′) = λ_t(x, ξ, u)(dx′) K_t(x, ξ, u)(x′, dy′).

For all C ∈ B(Y), we define the Bayes operator of the POMDP as follows:

Φ_t(x, ξ, u, x′)(C) = K_t(x, ξ, u)(x′, C).

The above argument shows that the Bayes operator exists and is unique as long as the space Y is standard, which is almost always the case in applications of POMDPs. In the following considerations, we always assume that the Bayes operator exists.

Example 5.1 (Bayes operator with kernels given by density functions). Assume that each transition kernel Q_t(x, y, u) has a density q_t(·, · | x, y, u) with respect to a finite product measure µ_X ⊗ µ_Y on X × Y. Then the Bayes operator has the form

[Φ_t(x, ξ, u, x′)](A) = ( ∫_A ∫_Y q_t(x′, y′ | x, y, u) ξ(dy) µ_Y(dy′) ) / ( ∫_Y ∫_Y q_t(x′, y′ | x, y, u) ξ(dy) µ_Y(dy′) ),   ∀ A ∈ B(Y).

In particular, if Y is a finite space, then

[Φ_t(x, ξ, u, x′)](y′) = ( Σ_{y∈Y} q_t(x′, y′ | x, y, u) ξ(y) ) / ( Σ_{z∈Y} Σ_{y∈Y} q_t(x′, z | x, y, u) ξ(y) ),   y′ ∈ Y.

If the formulas above have a zero denominator for some (x, ξ, u, x′), we can formally define Φ_t(x, ξ, u, x′) to be an arbitrarily selected distribution on Y.
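For finite X and Y, the finite-space formula above is a matrix-vector product followed by a normalization. The array layout `q[x'][y'][y]` below is our own convention for storing q_t(x′, y′ | x, y, u) with (x, u) fixed:

```python
import numpy as np

def bayes_update(q, xi, x_next):
    """Posterior distribution of the unobservable state, finite Y:
        Phi(x, xi, u, x')(y')  is proportional to  sum_y q[x'][y'][y] * xi[y].
    q[x'][y'][y] stores q_t(x', y' | x, y, u) for a fixed pair (x, u),
    and xi is the prior probability vector on Y.
    """
    weights = np.asarray(q[x_next], dtype=float) @ np.asarray(xi, dtype=float)
    total = weights.sum()
    if total == 0.0:
        # zero denominator: return an arbitrarily selected distribution on Y
        return np.full(len(xi), 1.0 / len(xi))
    return weights / total
```
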

5.3 Risk Measures Based on the Observable Part of the Process

If Ξ1 : X → P(Y) is the conditional distribution of Y1 given X1, we can consider the POMDP as a special case of the fully observable non-Markov model discussed in Section 3.1, with the process X (and omitting Y), because the admissible control sets U_t(x_t), the costs c_t(x_t, u_t), and the kernel Υ_t(dx_{t+1} | x1, u1, ..., xt, ut) do not depend on Y. To justify this transformation, we make the following observations, valid for all t = 1, ..., T:

1. The control u_t ∈ U_t(x_t) can be applied after observing x_t;

2. After observing x_t, with the help of the Bayes operator, the conditional distribution of Y_t is a function of g_t = (x1, u1, ..., x_{t−1}, u_{t−1}, x_t) and can be calculated iteratively, with Ξ1(g1) = Ξ1(x1) and

Ξ_t(g_t) = Φ_{t−1}( x_{t−1}, Ξ_{t−1}(g_{t−1}), u_{t−1}, x_t ),   t = 2, ..., T;   (32)

3. After applying a control u_t, the conditional distribution of X_{t+1} is given by the integral:

Υ_t(g_t, u_t) = ∫_Y Q^X_t(x_t, y, u_t) [Ξ_t(g_t)](dy),   (33)

where Q^X_t(x_t, y, u_t) is the marginal distribution of Q_t(x_t, y, u_t) on X. The precise meaning (e.g., setwise or weak) of the integral depends on the selected topology on the space P(X);



4. We can define an admissible decision rule as a measurable selector π_t : g_t ↦ U_t(x_t), because g_t encompasses all information available before making a decision at time t. As we discussed in Section 3.1, for an admissible policy π = (π1, ..., πT), each π_t reduces to a function of h_t = (x1, ..., xt), and the set of admissible policies is as in (14), but with the reduced form of U_t:

Π = { π = (π1, ..., πT) | ∀t, ∀(x1, ..., xt) ∈ X^t, π_t(x1, ..., xt) ∈ U_t(x_t) }.   (34)

We denote, for each π ∈ Π,

Ξ^π_t(h_t) = Ξ_t( x1, π1(x1), ..., x_{t−1}, π_{t−1}(x1, ..., x_{t−1}), x_t )

and

Υ^π_t(h_t) = Υ_t( x1, π1(x1), ..., x_t, π_t(x1, ..., x_t) );

The integral in (33) raises delicate issues, especially for the dynamic programming part, where semicontinuity properties of various elements of the model are relevant. To avoid serious technical complications, we assume from now on that the unobserved state space Y is finite; thus all convergence concepts in (33) become equivalent.
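For finite spaces, the recursion (32) and the predictive distribution (33) reduce to elementary array operations. The following Python sketch is illustrative only: the names `Q`, `nX`, `nY`, `nU` and the array layout are our assumptions, not notation from the paper; Y is finite as assumed above, and X is taken finite solely to keep the sketch elementary.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nY, nU = 3, 2, 2
# Q[u, x, y] plays the role of the joint kernel Q_t(x, y, u): a probability
# distribution over the next pair (x', y'), stored as an (nX, nY) array.
Q = rng.random((nU, nX, nY, nX, nY))
Q /= Q.sum(axis=(3, 4), keepdims=True)

def predictive(x, xi, u):
    """Distribution of X_{t+1} as in (33): sum over y of Q^X_t(x, y, u) xi(y)."""
    joint = np.einsum('y,yij->ij', xi, Q[u, x])   # P(x', y') under belief xi
    return joint.sum(axis=1)                      # marginal on x'

def bayes(x, xi, u, x_next):
    """Bayes operator Phi_t(x, xi, u, x'): updated belief over the hidden state."""
    joint = np.einsum('y,yij->ij', xi, Q[u, x])
    row = joint[x_next]                           # unnormalized posterior over y'
    return row / row.sum()
```

Iterating `xi = bayes(x, xi, u, x_next)` along an observed trajectory implements the recursion (32), while `predictive` supplies the kernel Υ_t driving the observable process.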

Therefore, without any model limitation or information loss, at each time t we can simply consider a POMDP as a system with the observable process X only, whose dynamics are described in (33). Consequently, we can construct dynamic risk measures as in Sections 2 and 3. To alleviate notation, for such a family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T}, for all π ∈ Π and all measurable c = (c_1, ..., c_T), we still write

v^{c,π}_t(h_t) := ρ^π_{t,T}( c_t(X_t, π_t(H_t)), ..., c_T(X_T, π_T(H_T)) )(h_t).

The following statement specializes Corollary 4.1 to the partially observable case.

Corollary 5.2. For a POMDP, a family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals

σ_t : { (h_t, Υ^π_t(h_t)) : π ∈ Π, h_t ∈ X^t } × V → R,   t = 1, ..., T−1,

in which V is the set of measurable and bounded functions on X, such that:

(i) for all t = 1, ..., T−1 and all h_t ∈ X^t, σ_t(h_t, ·, ·) is normalized and strongly monotonic with respect to stochastic dominance on { Υ^π_t(h_t) : π ∈ Π };

(ii) for all π ∈ Π, all bounded measurable c, all t = 1, ..., T−1, and all h_t ∈ X^t,

v^{c,π}_t(h_t) = c_t(x_t, π_t(h_t)) + σ_t( h_t, Υ^π_t(h_t), v^{c,π}_{t+1}(h_t, ·) ).   (35)
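Corollary 5.2 leaves the concrete choice of σ_t open, subject to normalization and strong monotonicity with respect to stochastic dominance. One classical mapping with both properties on a finite state space is the entropic mapping σ(m, w) = γ^{-1} log E_m[e^{γw}]: constants are preserved, and e^{γ(·)} is strictly increasing. This is offered purely as an illustration, not as a construction from the paper:

```python
import numpy as np

def entropic_sigma(m, w, gamma=1.0):
    """Entropic transition risk mapping: (1/gamma) * log E_m[exp(gamma * w)].

    m: probability vector of the transition law (e.g. a finite-support
    Upsilon_t^pi(h_t)); w: vector of next-stage values.  Normalized
    (constants map to themselves) and strongly monotonic with respect to
    first-order stochastic dominance.
    """
    m, w = np.asarray(m, float), np.asarray(w, float)
    wmax = w.max()          # log-sum-exp shift for numerical stability
    return wmax + np.log(m @ np.exp(gamma * (w - wmax))) / gamma
```

With this choice, the one-step formula (35) reads as the stage cost plus `entropic_sigma` applied to the distribution Υ^π_t(h_t) and the next-stage values v^{c,π}_{t+1}(h_t, ·).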

5.4 Markov Property

For finite Y, the set P(Y) can be represented as follows:

P(Y) = { ξ ∈ R^{|Y|} : ξ ≥ 0, ‖ξ‖_1 = 1 },   (36)


which is obviously a Borel subset of a finite-dimensional real space. We can deduce from (32) and (33) that (X_t, Ξ_t)_{t=1,...,T} is a Markov process with the following transition rules:

[Υ_t(g_t, u_t)](dx′) = Σ_{y∈Y} [Q^X_t(x_t, y, u_t)](dx′) [Ξ_t(g_t)](y),

Ξ_{t+1}(g_{t+1}) = Φ_t( x_t, Ξ_t(g_t), u_t, x_{t+1} ).   (37)

We say that an admissible policy is Markov if each decision rule depends only on the current observed state x and the current belief state ξ.

Definition 5.3. In a POMDP, a policy π ∈ Π is Markov if π_t(h_t) = π_t(h′_t) for all t = 1, ..., T and all h_t, h′_t ∈ X^t such that x_t = x′_t and Ξ^π_t(h_t) = Ξ^π_t(h′_t).

For a fixed Markov policy π, the future evolution of the process (X_τ, Ξ_τ)_{τ=t,...,T} depends solely on the current pair (x_t, Ξ^π_t(h_t)), and so does the distribution of the future costs c_τ(X_τ, π_τ(X_τ, Ξ_τ)), τ = t, ..., T. Therefore, we can define an analogous Markov property of risk measures for POMDPs.

Definition 5.4. A family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} for a POMDP is Markov if for all Markov policies π ∈ Π, for all measurable c = (c_1, ..., c_T), and for all h_t = (x_1, ..., x_t) and h′_t = (x′_1, ..., x′_t) in X^t such that x_t = x′_t and Ξ^π_t(h_t) = Ξ^π_t(h′_t),

v^{c,π}_t(h_t) = v^{c,π}_t(h′_t).

Proposition 5.5. A translation-invariant and stochastically conditionally time-consistent family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} is Markov if and only if the dependence of σ_t on h_t is carried by (x_t, Ξ^π_t(h_t)) only, for all t = 1, ..., T−1.

Proof. A minor modification of the proof of Proposition 4.3 is sufficient.

The following theorem summarizes our observations.

Theorem 5.6. A family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} for a POMDP is translation-invariant, stochastically conditionally time-consistent, and Markov if and only if there exist functionals

σ_t : { (x_t, Ξ^π_t(h_t), Υ^π_t(h_t)) : π ∈ Π, h_t ∈ X^t } × V → R,   t = 1, ..., T−1,

such that:

(i) for all t = 1, ..., T−1 and all (x, ξ) ∈ { (x_t, Ξ^π_t(h_t)) : π ∈ Π, h_t ∈ X^t }, σ_t(x, ξ, ·, ·) is normalized and strongly monotonic with respect to stochastic dominance on { Υ^π_t(h_t) : π ∈ Π, h_t ∈ X^t such that x_t = x, Ξ^π_t(h_t) = ξ };

(ii) for all π ∈ Π, all measurable bounded c, all t = 1, ..., T−1, and all h_t ∈ X^t,

v^{c,π}_t(h_t) = c_t(x_t, π_t(h_t)) + σ_t( x_t, Ξ^π_t(h_t), Υ^π_t(h_t), v^{c,π}_{t+1}(h_t, ·) ).   (38)


5.5 Risk Measures Based on the Extended State Process

As (X_t, Ξ_t)_{t=1,...,T} is a Markov process, another way to define risk measures is to apply the results for fully observable Markov models (Section 4) to this process. This approach is interesting to investigate, because we will show that it leads to the same transition risk mappings σ_t as in Theorem 5.6, and is therefore equivalent to the approach of Section 5.4. However, an additional feature of the extended state space approach is that it allows us to include a broader class of cost functions.

As we build risk measures based on the process (X_t, Ξ_t)_{t=1,...,T}, we use for each time t the space

Z_t = { Z : (X × P(Y))^t → R | Z is measurable and bounded }

of measurable and bounded functionals of the extended state histories

h_t = (x_1, ξ_1, ..., x_t, ξ_t),

both for the costs incurred and for the risk evaluated. Due to the finiteness of Y, the set (36) is a Borel subset of a finite-dimensional space, and thus the measurability property here is well-defined. We also define Π to be the set of admissible policies composed of measurable decision rules dependent on the extended state histories.

To alleviate notation, for a family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T}, for all π ∈ Π and all measurable c = (c_1, ..., c_T), we still write

v^{c,π}_t(h_t) := ρ^π_{t,T}( c_t(X_t, π_t(H_t)), ..., c_T(X_T, π_T(H_T)) )(h_t).

Thus, the direct translation of Theorem 4.4 to the POMDP setting is the following:

Corollary 5.7. For a POMDP, a family of extended-process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} is translation-invariant, stochastically conditionally time-consistent, and Markov if and only if for t = 1, ..., T−1 there exist functionals

σ_t : { (x, ξ, (dx_{t+1}, dξ_{t+1} | x, ξ, u)) : x ∈ X, ξ ∈ P(Y), u ∈ U_t(x) } × V → R,

where

(dx_{t+1}, dξ_{t+1} | x, ξ, u) = [ Σ_{y∈Y} [Q^X_t(x, y, u)](dx′) ξ(y) ] ⊗ δ_{Φ_t(x, ξ, u, x′)},   (39)

and V is the set of bounded measurable functions on X × P(Y), such that:

(i) for all t = 1, ..., T−1, all x ∈ X, and all ξ ∈ P(Y), σ_t(x, ξ, ·, ·) is normalized and strongly monotonic with respect to stochastic dominance on { (dx_{t+1}, dξ_{t+1} | x, ξ, u) : u ∈ U_t(x) };

(ii) for all π ∈ Π, all measurable bounded c, all t = 1, ..., T−1, and all h_t ∈ (X × P(Y))^t,

v^{c,π}_t(h_t) = c_t(x_t, π_t(h_t)) + σ_t( x_t, ξ_t, (dx_{t+1}, dξ_{t+1} | x_t, ξ_t, π_t(h_t)), v^{c,π}_{t+1}(h_t, ·, ·) ).   (40)

The strong monotonicity of σ_t with respect to stochastic dominance is understood in exactly the same way as in Section 3.3; only the state space becomes X × P(Y).


Owing to the special form of the transition kernel (39) and the strong monotonicity of σ_t with respect to stochastic dominance, we can ignore the values of the function v^{c,π}_{t+1}(h_t, x′, ξ′) for ξ′ ≠ Φ_t( x_t, ξ_t, π_t(h_t), x′ ). Therefore, mappings

σ^red_t : { (x, ξ, (dx_{t+1} | x, ξ, u)) : x ∈ X, ξ ∈ P(Y), u ∈ U_t(x) } × V → R,   t = 1, ..., T−1,   (41)

exist, where V is the set of bounded measurable functions on X, such that for every w ∈ V we have

σ_t( x_t, ξ_t, (dx_{t+1}, dξ_{t+1} | x_t, ξ_t, π_t(h_t)), w(·, ·) ) = σ^red_t( x_t, ξ_t, (dx_{t+1} | x_t, ξ_t, π_t(h_t)), x′ ↦ w( x′, Φ_t(x_t, ξ_t, π_t(h_t), x′) ) ).
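The identity behind σ^red_t can be verified numerically in the finite case: under the kernel (39) the belief coordinate is the point mass δ_{Φ_t(...)}, so integrating w(x′, ξ′) against (39) is the same as integrating x′ ↦ w(x′, Φ_t(..., x′)) against the marginal. A small sketch, with an illustrative array layout and w taken as the posterior expectation of a hypothetical score s on Y:

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nY = 4, 3
# P[y] plays the role of Q_t(x, y, u) for one fixed (x, u): an (nX, nY)
# distribution over the next pair (x', y'); an illustrative layout only.
P = rng.random((nY, nX, nY))
P /= P.sum(axis=(1, 2), keepdims=True)
xi = rng.random(nY); xi /= xi.sum()      # current belief on Y

joint = np.einsum('y,yij->ij', xi, P)    # P(x', y'), shape (nX, nY)
marginal = joint.sum(axis=1)             # sum_y [Q^X_t](dx') xi(y)
phi = joint / marginal[:, None]          # Bayes operator Phi_t(..., x'), row-wise

s = rng.random(nY)                       # hypothetical score defining w

# Reduced form: integrate x' -> w(x', Phi_t(..., x')) against the marginal.
reduced = marginal @ (phi @ s)
# Direct form: expectation of s under the joint law, which is what the
# kernel (39) assigns to this particular w, since xi' is a point mass.
direct = (joint @ s).sum()
```

The two numbers coincide, which is exactly the distributional identity used in the proof of Theorem 5.8 below (here with σ taken as plain expectation, the simplest normalized monotone choice).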

Theorem 5.8. A family of process-based dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} for a POMDP is translation-invariant, stochastically conditionally time-consistent, and Markov if and only if functionals (41) exist such that:

(i) for all t = 1, ..., T−1, all x ∈ X, and all ξ ∈ P(Y), σ^red_t(x, ξ, ·, ·) is normalized and strongly monotonic with respect to stochastic dominance on { Σ_{y∈Y} [Q^X_t(x, y, u)](dx′) ξ(y) : u ∈ U_t(x) };

(ii) for all π ∈ Π, all measurable bounded c, all t = 1, ..., T−1, and all h_t ∈ (X × P(Y))^t,

v^{c,π}_t(h_t) = c_t(x_t, π_t(h_t)) + σ^red_t( x_t, ξ_t, Σ_{y∈Y} [Q^X_t(x_t, y, π_t(h_t))](dx′) ξ_t(y), x′ ↦ v^{c,π}_{t+1}( h_t, x′, Φ_t(x_t, ξ_t, π_t(h_t), x′) ) ).   (42)

Proof. The existence of the mapping σ^red_t has been established in the discussion preceding the theorem. The strong monotonicity with respect to stochastic dominance follows from the fact that the distribution of w(·, ·) with respect to the measure (39) is the same as the distribution of x′ ↦ w( x′, Φ_t(x, ξ, u, x′) ) under the marginal measure Σ_{y∈Y} [Q^X_t(x, y, u)](dx′) ξ(y).

Let us observe that the statement of Theorem 5.8 corresponds to the statement of Theorem 5.6: the mapping σ^red_t coincides with σ_t for those (x_t, ξ_t) which can be parts of legitimate histories h_t. In this way, we have proved the equivalence of the two approaches to risk-averse partially observable Markov models: the history-dependent approach and the "belief state" approach.

We may also remark here that our results allow for the use of a broader class of cost functions, namely, bounded measurable functions c_t : X × P(Y) × U → R, which explicitly depend on the belief state ξ_t. In this way, we may incorporate immediate costs of the uncertainty about the actual state Y_t. All derivations remain unchanged; only the cost part in formula (42) becomes c_t(x_t, ξ_t, π_t(h_t)).


5.6 Dynamic Programming

As in Subsection 4.2, we fix the cost functions c_1, ..., c_T and consider a family of dynamic risk measures { ρ^π_{t,T} }_{π∈Π, t=1,...,T} which is translation-invariant, stochastically conditionally time-consistent, and Markov. Our objective is to analyze the risk minimization problem:

min_{π∈Π} v^π_1(x_1, ξ_1),   x_1 ∈ X, ξ_1 ∈ P(Y).

For this purpose, we introduce the family of value functions:

v*_t(h_t) = inf_{π∈Π_{t,T}} v^π_t(h_t),   t = 1, ..., T,  h_t ∈ H_t,   (43)

where Π_{t,T} is the set of feasible deterministic policies π = (π_t, ..., π_T). As stated in Theorem 5.6, there exist transition risk mappings { σ_t }_{t=1,...,T−1} such that equations (38) hold.

We assume that the space P(X) is equipped with the topology of weak convergence of probability measures, and the space V is equipped with the topology of pointwise convergence. Recall that the space Y is assumed finite, and thus all convergence concepts in P(Y) coincide. All continuity statements are made with respect to these topologies.

We also assume that the kernels Q_t(x, y, u) have densities q_t(·, · | x, y, u) with respect to a finite product measure μ_X ⊗ μ_Y on X × Y, as in Example 5.1. In this case,

[ ∫_Y Q^X_t(x, ·, u) dξ ](dx′) = [ Σ_{y′∈Y} Σ_{y∈Y} q_t(x′, y′ | x, y, u) ξ(y) ] μ_X(dx′).   (44)

The following theorem is an analogue of Theorem 4.6. Its main result is that the value functions (43) are memoryless, that is, they depend on (x_t, ξ_t) only, and that they satisfy a generalized form of the dynamic programming equation. The equation also allows us to identify an optimal policy.

Theorem 5.9. In addition to the general assumptions, we assume the following conditions:

(i) the densities q_t(x′, y′ | x, y, u) are uniformly bounded and continuous with respect to (x, u);

(ii) the transition risk mappings σ_t(·, ·, ·, ·), t = 1, ..., T−1, are lower semicontinuous;

(iii) the functions c_t(·, ·), t = 1, ..., T, are lower semicontinuous;

(iv) the multifunctions U_t(·), t = 1, ..., T, are compact-valued and upper semicontinuous.

Then the functions v*_t, t = 1, ..., T, are memoryless and lower semicontinuous, and they satisfy the following dynamic programming equations:

v*_T(x, ξ) = min_{u∈U_T(x)} c_T(x, u),   x ∈ X, ξ ∈ P(Y),   (45)

v*_t(x, ξ) = min_{u∈U_t(x)} { c_t(x, u) + σ_t( x, ξ, ∫_Y Q^X_t(x, ·, u) dξ, x′ ↦ v*_{t+1}( x′, Φ_t(x, ξ, u, x′) ) ) },   x ∈ X, ξ ∈ P(Y), t = T−1, ..., 1.   (46)

Moreover, an optimal Markov policy π exists and satisfies the equations:

π_T(x, ξ) ∈ argmin_{u∈U_T(x)} c_T(x, u),   x ∈ X, ξ ∈ P(Y),   (47)

π_t(x, ξ) ∈ argmin_{u∈U_t(x)} { c_t(x, u) + σ_t( x, ξ, ∫_Y Q^X_t(x, ·, u) dξ, x′ ↦ v*_{t+1}( x′, Φ_t(x, ξ, u, x′) ) ) },   x ∈ X, ξ ∈ P(Y), t = T−1, ..., 1.   (48)


Proof. We follow the pattern of the proof of Theorem 4.6, with adjustments and refinements due to the specific structure of the partially observable case. Recall that X^t is the space of observable state histories up to time t; we denote its generic elements by h_t = (x_1, ..., x_t). For all h_T ∈ X^T we have

v*_T(h_T) = inf_{π_T∈Π_{T,T}} c_T(x_T, π_T(h_T)) = inf_{u∈U_T(x_T)} c_T(x_T, u).   (49)

By assumptions (iii) and (iv), owing to the Berge theorem (see [3, Theorem 1.4.16]), the infimum in (49) is attained and is a lower semicontinuous function of x_T. Hence, v*_T is memoryless. Moreover, the optimal solution mapping Ψ_T(x) = { u ∈ U_T(x) : c_T(x, u) = v*_T(x) } has nonempty and closed values and is measurable. Therefore, a measurable selector π_T of Ψ_T exists (see [28], [3, Thm. 8.1.3]).

Suppose v*_{t+1}(·) is memoryless and lower semicontinuous, and Markov decision rules π_{t+1}, ..., π_T exist such that

v*_{t+1}(x_{t+1}, ξ_{t+1}) = v^{π_{t+1},...,π_T}_{t+1}(x_{t+1}, ξ_{t+1}),   ∀ h_{t+1} ∈ X^{t+1}.

Then for any h_t ∈ X^t formula (38) yields

v*_t(h_t) = inf_{π∈Π_{t,T}} v^π_t(h_t) = inf_{π∈Π_{t,T}} { c_t(x_t, π_t(h_t)) + σ_t( x_t, ξ_t, ∫_Y Q^X_t(x_t, ·, π_t(h_t)) dξ_t, v^π_{t+1}(h_t, ·) ) }.

Since v^π_{t+1}(h_t, x′) ≥ v*_{t+1}( x′, Φ_t(x_t, ξ_t, π_t(h_t), x′) ) for all x′ ∈ X, and σ_t is nondecreasing with respect to the last argument, we obtain

v*_t(h_t) ≥ inf_{π∈Π_{t,T}} { c_t(x_t, π_t(h_t)) + σ_t( x_t, ξ_t, ∫_Y Q^X_t(x_t, ·, π_t(h_t)) dξ_t, x′ ↦ v*_{t+1}( x′, Φ_t(x_t, ξ_t, π_t(h_t), x′) ) ) }
= inf_{u∈U_t(x_t)} { c_t(x_t, u) + σ_t( x_t, ξ_t, ∫_Y Q^X_t(x_t, ·, u) dξ_t, x′ ↦ v*_{t+1}( x′, Φ_t(x_t, ξ_t, u, x′) ) ) }.   (50)

In order to complete the induction step, we need to establish lower semicontinuity of the mapping

(x, ξ, u) ↦ σ_t( x, ξ, ∫_Y Q^X_t(x, ·, u) dξ, x′ ↦ v*_{t+1}( x′, Φ_t(x, ξ, u, x′) ) ).   (51)

To this end, suppose x^{(k)} → x, ξ^{(k)} → ξ, and u^{(k)} → u, as k → ∞. First, we verify that the mapping (x, ξ, u) ↦ ∫_Y Q^X_t(x, ·, u) dξ appearing in the third argument of σ_t is continuous. By formula (44), assumption (i), and the Lebesgue dominated convergence theorem, for any


bounded continuous function f : X → R we have

∫_X f(x′) [ ∫_Y Q^X_t(x^{(k)}, ·, u^{(k)}) dξ^{(k)} ](dx′) = ∫_X f(x′) [ Σ_{y′∈Y} Σ_{y∈Y} q_t(x′, y′ | x^{(k)}, y, u^{(k)}) ξ^{(k)}(y) ] μ_X(dx′)
→ ∫_X f(x′) [ Σ_{y′∈Y} Σ_{y∈Y} q_t(x′, y′ | x, y, u) ξ(y) ] μ_X(dx′) = ∫_X f(x′) [ ∫_Y Q^X_t(x, ·, u) dξ ](dx′),   as k → ∞.

Thus, the third argument of σ_t in (51) is continuous with respect to (x, ξ, u).

Let us examine the last argument of σ_t in (51). Owing to assumption (i), the operator

[Φ_t(x, ξ, u, x′)](y′) = ( Σ_{y∈Y} q_t(x′, y′ | x, y, u) ξ(y) ) / ( Σ_{z∈Y} Σ_{y∈Y} q_t(x′, z | x, y, u) ξ(y) ),   y′ ∈ Y,

is continuous with respect to (x, ξ, u) at all points (x, ξ, u, x′) at which

Σ_{z∈Y} Σ_{y∈Y} q_t(x′, z | x, y, u) ξ(y) > 0.   (52)

Consider the sequence of functions V^{(k)} : X → R, k = 1, 2, ..., and the function V : X → R, defined for x′ ∈ X as follows:

V^{(k)}(x′) = v*_{t+1}( x′, Φ_t(x^{(k)}, ξ^{(k)}, u^{(k)}, x′) ),
V(x′) = v*_{t+1}( x′, Φ_t(x, ξ, u, x′) ).

Since v*_{t+1}(·, ·) is lower semicontinuous and Φ_t(·, ·, ·, x′) is continuous whenever condition (52) is satisfied, we infer that

V(x′) ≤ liminf_{k→∞} V^{(k)}(x′)

for all x′ ∈ X satisfying (52). As v*_{t+1} and Φ_t are measurable, both V and liminf_{k→∞} V^{(k)} are measurable as well. By the fact that σ_t preserves the stochastic order of the last argument with respect to the measure ∫_Y Q^X_t(x, ·, u) dξ, and by assumption (ii), in view of the already established continuity of the third argument, we obtain the following chain of relations:

σ_t( x, ξ, ∫_Y Q^X_t(x, ·, u) dξ, V ) ≤ σ_t( x, ξ, ∫_Y Q^X_t(x, ·, u) dξ, liminf_{k→∞} V^{(k)} )
= σ_t( x, ξ, lim_{k→∞} ∫_Y Q^X_t(x^{(k)}, ·, u^{(k)}) dξ^{(k)}, liminf_{k→∞} V^{(k)} )
≤ liminf_{k→∞} σ_t( x^{(k)}, ξ^{(k)}, ∫_Y Q^X_t(x^{(k)}, ·, u^{(k)}) dξ^{(k)}, V^{(k)} ).

Consequently, the mapping (51) is lower semicontinuous.


Using assumptions (ii) and (iv) and invoking the Berge theorem again (see, e.g., [3, Theorem 1.4.16]), we deduce that the infimum in (50) is attained and is a lower semicontinuous function of (x_t, ξ_t). Moreover, the optimal solution mapping, that is, the set of u ∈ U_t(x_t) at which the infimum in (50) is attained, is nonempty, closed-valued, and measurable. Therefore, a minimizer π_t in (50) exists and is a measurable function of (x_t, ξ_t) (see, e.g., [28], [3, Thm. 8.1.3]). Substituting this minimizer into (50), we obtain

v*_t(h_t) ≥ c_t( x_t, π_t(x_t, ξ_t) ) + σ_t( x_t, ξ_t, ∫_Y Q^X_t( x_t, ·, π_t(x_t, ξ_t) ) dξ_t, x′ ↦ v*_{t+1}( x′, Φ_t(x_t, ξ_t, π_t(x_t, ξ_t), x′) ) ) = v^{π_t,...,π_T}_t(x_t, ξ_t).

In the last equation, we used (38) and the fact that the decision rules π_t, ..., π_T are Markov. On the other hand, we have

v*_t(h_t) = inf_{π∈Π_{t,T}} v^π_t(h_t) ≤ v^{π_t,...,π_T}_t(x_t, ξ_t).

Therefore v*_t(h_t) = v^{π_t,...,π_T}_t(x_t, ξ_t); it is memoryless, lower semicontinuous, and

v*_t(x_t, ξ_t) = min_{u∈U_t(x_t)} { c_t(x_t, u) + σ_t( x_t, ξ_t, ∫_Y Q^X_t(x_t, ·, u) dξ_t, x′ ↦ v*_{t+1}( x′, Φ_t(x_t, ξ_t, u, x′) ) ) }
= c_t( x_t, π_t(x_t, ξ_t) ) + σ_t( x_t, ξ_t, ∫_Y Q^X_t( x_t, ·, π_t(x_t, ξ_t) ) dξ_t, x′ ↦ v*_{t+1}( x′, Φ_t(x_t, ξ_t, π_t(x_t, ξ_t), x′) ) ).

This completes the induction step.

We remark that Theorem 5.9 remains valid for more general bounded, measurable, and lower semicontinuous cost functions c_t : X × P(Y) × U → R, which explicitly depend on the belief state ξ_t. The formulation is almost the same; only the cost part in formulas (45)–(48) becomes c_t(x, ξ, u). Formally, we could also have made the sets U_t dependent on ξ_t, but this is hard to justify from a practical point of view.
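For a finite model, the recursion (45)–(46) can be run directly over the finitely many beliefs reachable from ξ_1. The following Python sketch is ours, not the paper's: the stationary layout and all names are illustrative assumptions, and the transition risk mapping is taken entropic purely as one admissible choice satisfying condition (ii) of Theorem 5.6.

```python
import numpy as np
from functools import lru_cache

rng = np.random.default_rng(2)
nX, nY, nU, T = 3, 2, 2, 4
# Illustrative stationary data: Q[u, x, y] is a distribution over (x', y');
# cost[u, x] plays the role of c_t(x, u) for every t.
Q = rng.random((nU, nX, nY, nX, nY))
Q /= Q.sum(axis=(3, 4), keepdims=True)
cost = rng.random((nU, nX))

def sigma(m, w, gamma=1.0):
    """Entropic transition risk mapping: one admissible choice of sigma_t."""
    wmax = w.max()
    return wmax + np.log(m @ np.exp(gamma * (w - wmax))) / gamma

@lru_cache(maxsize=None)
def v(t, x, xi_key):
    """Value function v*_t(x, xi) of (45)-(46); xi passed as a hashable tuple."""
    xi = np.array(xi_key)
    if t == T:
        return cost[:, x].min()                        # (45): min_u c_T(x, u)
    best = np.inf
    for u in range(nU):
        joint = np.einsum('y,yij->ij', xi, Q[u, x])    # P(x', y') under xi
        marginal = joint.sum(axis=1)                   # int_Y Q^X(x,.,u) d xi
        phi = joint / np.maximum(marginal, 1e-300)[:, None]   # Bayes operator
        w = np.array([v(t + 1, xn, tuple(phi[xn])) for xn in range(nX)])
        best = min(best, cost[u, x] + sigma(marginal, w))     # (46)
    return best

val = v(1, 0, tuple(np.full(nY, 1.0 / nY)))            # v*_1(x_1, xi_1)
```

Memoization over the finitely many reachable beliefs keeps the recursion finite; for continuous X one would instead rely on the structural properties established in Theorem 5.9 or on a discretization.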

References

[1] A. Arlotto, N. Gans, and J. M. Steele. Markov decision problems where means bound variances. Operations Research, 62(4):864–875, 2014.
[2] P. Artzner, F. Delbaen, J.-M. Eber, D. Heath, and H. Ku. Coherent multiperiod risk adjusted values and Bellman's principle. Annals of Operations Research, 152:5–22, 2007.
[3] J.-P. Aubin and H. Frankowska. Set-Valued Analysis. Birkhäuser, Boston, MA, 2009.
[4] N. Bäuerle and U. Rieder. Markov Decision Processes with Applications to Finance. Universitext. Springer, Heidelberg, 2011.
[5] N. Bäuerle and U. Rieder. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2013.
[6] D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control, volume 139 of Mathematics in Science and Engineering. Academic Press, New York–London, 1978.
[7] Ö. Çavuş and A. Ruszczyński. Computational methods for risk-averse undiscounted transient Markov models. Operations Research, 62(2):401–417, 2014.
[8] Ö. Çavuş and A. Ruszczyński. Risk-averse control of undiscounted transient Markov models. SIAM Journal on Control and Optimization, 52(6):3935–3966, 2014.
[9] Z. Chen, G. Li, and Y. Zhao. Time-consistent investment policies in Markovian markets: a case of mean-variance analysis. Journal of Economic Dynamics and Control, 40:293–316, 2014.
[10] P. Cheridito, F. Delbaen, and M. Kupper. Dynamic monetary risk measures for bounded discrete-time processes. Electronic Journal of Probability, 11:57–106, 2006.
[11] P. Cheridito and M. Kupper. Composition of time-consistent dynamic monetary risk measures in discrete time. International Journal of Theoretical and Applied Finance, 14(1):137–162, 2011.
[12] F. Coquet, Y. Hu, J. Mémin, and S. Peng. Filtration-consistent nonlinear expectations and related g-expectations. Probability Theory and Related Fields, 123(1):1–27, 2002.
[13] S. P. Coraluppi and S. I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.
[14] P. Dai Pra, L. Meneghini, and W. J. Runggaldier. Explicit solutions for multivariate, discrete-time control problems under uncertainty. Systems & Control Letters, 34(4):169–176, 1998.
[15] E. V. Denardo and U. G. Rothblum. Optimal stopping, exponential utility, and linear programming. Mathematical Programming, 16(2):228–244, 1979.
[16] D. Dentcheva and A. Ruszczyński. Risk preferences on the space of quantile functions. Mathematical Programming, 148(1–2):181–200, 2014.
[17] G. B. Di Masi and Ł. Stettner. Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM Journal on Control and Optimization, 38(1):61–78, 1999.
[18] E. A. Feinberg, P. O. Kasyanov, and M. Z. Zgurovsky. Partially observable total-cost Markov decision processes with weakly continuous transition probabilities. Preprint, arXiv:1401.2168v1, 2014.
[19] J. A. Filar, L. C. M. Kallenberg, and H.-M. Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161, 1989.
[20] H. Föllmer and I. Penner. Convex risk measures and the dynamics of their penalty functions. Statistics & Decisions, 24(1):61–96, 2006.
[21] K. Hinderer. Foundations of Non-Stationary Dynamic Programming with Discrete Time Parameter. Lecture Notes in Operations Research and Mathematical Systems, Vol. 33. Springer-Verlag, Berlin–New York, 1970.
[22] R. A. Howard and J. E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18:356–369, 1971/72.
[23] S. C. Jaquette. Markov decision processes with a new optimality criterion: discrete time. Annals of Statistics, 1:496–505, 1973.
[24] S. C. Jaquette. A utility criterion for Markov decision processes. Management Science, 23(1):43–49, 1975/76.
[25] A. Jobert and L. C. G. Rogers. Valuations and dynamic convex risk measures. Mathematical Finance, 18(1):1–22, 2008.
[26] S. Klöppel and M. Schweizer. Dynamic indifference valuation via convex risk measures. Mathematical Finance, 17(4):599–627, 2007.
[27] M. Kupper and W. Schachermayer. Representation results for law invariant time consistent functions. Mathematics and Financial Economics, 2(3):189–210, 2009.
[28] K. Kuratowski and C. Ryll-Nardzewski. A general theorem on selectors. Bull. Acad. Polon. Sci. Sér. Sci. Math. Astronom. Phys., 13(1):397–403, 1965.
[29] S. Levitt and A. Ben-Israel. On modeling risk in Markov decision processes. In Optimization and Related Topics (Ballarat/Melbourne, 1999), volume 47 of Appl. Optim., pages 27–40. Kluwer Academic Publishers, Dordrecht, 2001.
[30] K. Lin and S. I. Marcus. Dynamic programming with non-convex risk-sensitive measures. In American Control Conference (ACC), 2013, pages 6778–6783. IEEE, 2013.
[31] S. Mannor and J. N. Tsitsiklis. Algorithmic aspects of mean-variance optimization in Markov decision processes. European Journal of Operational Research, 231(3):645–653, 2013.
[32] S. I. Marcus, E. Fernández-Gaucherand, D. Hernández-Hernández, S. Coraluppi, and P. Fard. Risk sensitive Markov decision processes. In Systems and Control in the Twenty-First Century (St. Louis, MO, 1996), volume 22 of Progr. Systems Control Theory, pages 263–279. Birkhäuser, Boston, MA, 1997.
[33] W. Ogryczak and A. Ruszczyński. From stochastic dominance to mean-risk models: semideviations as risk measures. European Journal of Operational Research, 116(1):33–50, 1999.
[34] W. Ogryczak and A. Ruszczyński. On consistency of stochastic dominance and mean–semideviation models. Mathematical Programming, 89(2):217–232, 2001.
[35] G. Ch. Pflug and W. Römisch. Modeling, Measuring and Managing Risk. World Scientific, Singapore, 2007.
[36] Z. Porosiński, K. Szajowski, and S. Trybuła. Bayes control for a multidimensional stochastic system. System Sciences, 11:51–64, 1985.
[37] F. Riedel. Dynamic coherent risk measures. Stochastic Processes and Their Applications, 112:185–200, 2004.
[38] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer, Berlin, 1998.
[39] B. Roorda, J. M. Schumacher, and J. Engwerda. Coherent acceptability measures in multiperiod models. Mathematical Finance, 15(4):589–612, 2005.
[40] W. J. Runggaldier. Concepts and methods for discrete and continuous time control under uncertainty. Insurance: Mathematics and Economics, 22(1):25–39, 1998.
[41] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2, Ser. B):235–261, 2010.
[42] A. Ruszczyński and A. Shapiro. Conditional risk mappings. Mathematics of Operations Research, 31:544–561, 2006.
[43] G. Scandolo. Risk Measures in a Dynamic Setting. PhD thesis, Università degli Studi di Milano, Milan, Italy, 2003.
[44] A. Shapiro. Time consistency of dynamic risk measures. Operations Research Letters, 40(6):436–439, 2012.
[45] Y. Shen, W. Stannat, and K. Obermayer. Risk-sensitive Markov control processes. SIAM Journal on Control and Optimization, 51(5):3652–3672, 2013.
[46] S. Weber. Distribution-invariant risk measures, information, and dynamic consistency. Mathematical Finance, 16(2):419–441, 2006.
[47] D. J. White. Mean, variance, and probabilistic criteria in finite Markov decision processes: a review. Journal of Optimization Theory and Applications, 56(1):1–29, 1988.