
TRABAJOS DE ESTADISTICA Y DE INVESTIGACION OPERATIVA, Vol. 33, Núm. 3, 1982, pp. 73 a 85

CALCULATING THE VARIANCE IN MARKOV-PROCESSES WITH RANDOM REWARD

Francisco Benito, Institut für Operations Research, Eidgenössische Technische Hochschule, Zürich

ABSTRACT

In this article we present a generalization of Markov decision processes in discrete time where the immediate rewards in every period are not deterministic but random, with the first two moments of their distributions given.

Formulas are developed to calculate the expected value and the variance of the total reward of the process; these formulas generalize and partially correct earlier results. We make some observations about the distribution of rewards for processes with limited or unlimited horizon and with or without discounting.

Applications with risk-sensitive policies are possible; this is illustrated in a numerical example whose results are validated by simulation.

Key words: Valued Markov Chains / Markov decision processes / dynamic programming.

Classification: 90 C40, 60 J10, 90 C39.

RESUMEN

This article presents a generalization of Markov decision processes in discrete time: the rewards for the transition from one state to another are not deterministic but random, and only the first two moments of their distribution functions are assumed known.

Formulas are derived to calculate the expectation and the variance of the total reward of the process over a finite or infinite horizon, with or without discounting. Some observations are made on the distribution function of the total reward.


The results are of interest for introducing the notion of risk into the search for optimal policies.

This work extends and corrects results of other authors and illustrates them with a numerical example.

Key words: valued Markov chains / Markov decision processes / dynamic programming.

1. Introduction and summary

With his optimization algorithm based on dynamic programming, Howard [4] made the application of the theory of Markov decision processes (MDP) in practical fields possible. In this article we present a generalization of the MDP in discrete time, allowing for the fact that the immediate rewards in every period are not deterministic but random, with arbitrary distributions of which only the first two moments are known.

We give formulas to calculate the expected value and the variance of the process reward. These formulas generalize and partially revise earlier results (Brown [1] and Goldwerger [3]). Knowledge of the variance permits the notion of risk to be included in the search for an optimal policy: not only to maximize the expected reward, but also to take into account the possibility that a single realization of the process yields a value very different from the expected one.

In section 2 we describe the MDP and fix the notation we use. There follows the study of three cases: total reward with limited horizon (section 3), total discounted reward with unlimited horizon (section 4), and average reward per period (section 5). In section 6 we solve a numerical example from the literature.

2. MDP with random immediate reward

With a notation in accordance with Howard [4], a Markov decision process is characterized by a set of $N$ states $i = 1, \dots, N$; a finite set of actions $k \in K_i$ for every state; the corresponding transition probabilities $p_{ij}^{k}$ for $i \to j$, $k \in K_i$, $i, j = 1, \dots, N$, with $\sum_{j=1}^{N} p_{ij}^{k} = 1$ for every $i, k$; transition rewards $r_{ij}^{k}$, $k \in K_i$, $i, j = 1, \dots, N$; and a discounting factor $\beta$ ($0 \le \beta \le 1$).

In our case the $r_{ij}^{k}$ will be independent random variables with distributions $F_{ij}^{k}$, so that $E[r_{ij}^{k}] = R_{ij}^{k}$ and $\mathrm{Var}[r_{ij}^{k}] = (\sigma_{ij}^{k})^{2}$, with $|R_{ij}^{k}| < \infty$ and $\sigma_{ij}^{k} < \infty$.

The assumption of bounded $R_{ij}^{k}$ and $\sigma_{ij}^{k}$ implies no limitation in applications and simplifies the reasoning. Every argument for the general case with random rewards also applies when the rewards are deterministic; it is enough to take $\sigma_{ij}^{k} = 0$.

The random transition reward $r_{ij}^{k}$ is independent of the previous and of the future evolution of the system. The actual transition $i \to j$ only determines the distribution $F_{ij}^{k}$ governing $r_{ij}^{k}$.
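
As a point of reference for the computations in the following sections, here is a minimal sketch of how such a process might be stored: for every state $i$ and action $k$, the transition probabilities $p_{ij}^{k}$, the reward means $R_{ij}^{k}$ and the reward variances $(\sigma_{ij}^{k})^{2}$. The class and field names are ours, not the paper's, and for simplicity every state is assumed to admit the same number of actions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class RandomRewardMDP:
    """MDP whose transition rewards are random; only the first two
    moments of each reward distribution are stored."""
    P: np.ndarray      # P[k, i, j] = transition probability p_ij^k
    R: np.ndarray      # R[k, i, j] = reward mean R_ij^k
    S: np.ndarray      # S[k, i, j] = reward variance (sigma_ij^k)^2
    beta: float = 1.0  # discounting factor, 0 <= beta <= 1

    def under_policy(self, policy):
        """N x N matrices p_ij, R_ij, sigma_ij^2 for a deterministic
        policy given as one action index per state."""
        idx = np.arange(self.P.shape[1])
        return self.P[policy, idx], self.R[policy, idx], self.S[policy, idx]
```

A deterministic policy is then just a choice of one action per state, and `under_policy` extracts the matrices used in the sketches below.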

For a realization of the MDP with an arbitrary but fixed policy and initial state $i$, the trajectory followed can be represented schematically:

Period: $i \to j \to l \to \cdots \to m \to o$

Action: $k,\ k',\ \dots,\ k^{(n)}$

Transition probabilities: $p_{ij}^{k},\ p_{jl}^{k'},\ \dots,\ p_{mo}^{k^{(n)}}$

Transition rewards: $r_{ij}^{k},\ r_{jl}^{k'},\ \dots,\ r_{mo}^{k^{(n)}}$

The probability that the process follows this particular trajectory is

\[
p_{ij}^{k} \times p_{jl}^{k'} \times \cdots \times p_{mo}^{k^{(n)}} \times \cdots \tag{1}
\]

Once the trajectory has been fixed, the total reward $G$ in this realization, with discounting factor $\beta$ ($\beta = 1$ in the case without discounting), is

\[
G = r_{ij}^{k} + \beta\, r_{jl}^{k'} + \cdots + \beta^{n} r_{mo}^{k^{(n)}} + \cdots
\]


Because of the random character of the immediate rewards, the total reward $G$ for a given trajectory of the process is a random variable, and its distribution is the convolution of the distributions of the immediate rewards:

\[
F_{ij}^{k} \otimes \beta F_{jl}^{k'} \otimes \cdots \otimes \beta^{n} F_{mo}^{k^{(n)}} \otimes \cdots \tag{2}
\]

(the meaning of $\beta F_{jl}^{k'}$ is evident: $[\beta F](x) = F(x/\beta)$, $x \in \mathbb{R}$, and analogously for the other terms of the convolution).

If the immediate rewards have normal distributions, this convolution is again a normal distribution, which governs the total reward for the given trajectory (or for a series of equivalent trajectories: for example, with $\beta = 1$, because of the commutative property of convolution, trajectories which realize the same transitions in a different order are equivalent, since they generate the same distribution for the total reward).
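
For instance, if along a fixed trajectory the reward of the $t$-th transition is distributed as $N(R_t, \sigma_t^2)$ (writing $R_t$, $\sigma_t^2$ as shorthand for the corresponding $R_{ij}^{k}$, $(\sigma_{ij}^{k})^2$), then $\beta^{t} r_t \sim N(\beta^{t} R_t, \beta^{2t} \sigma_t^2)$, and the convolution (2) yields

\[
G \sim N\Bigl( \sum_{t \ge 0} \beta^{t} R_t,\ \sum_{t \ge 0} \beta^{2t} \sigma_t^{2} \Bigr).
\]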

Nevertheless, in a new realization of the MDP with the same policy and the same initial state, the trajectory followed may be different from the earlier one, and the total reward will obey another distribution, different from (2), to be calculated analogously as a convolution of distributions which depend on the trajectory. For a horizon $h$ ($h$ transitions) the number of different trajectories starting from a fixed initial state is $N^h$ (some may be equivalent in the sense indicated before), and each one has an occurrence probability which can be calculated with a formula such as (1).

Thus we obtain a mixture of distributions, and it is not easy to characterize the resulting distribution completely as a function of the parameters of the MDP.

For $\beta = 1$ and a stationary policy with only one ergodic class, the distribution of the $h$-period total reward becomes asymptotically ($h \to \infty$) normal. This convergence has a statistical interpretation: in (2) with $\beta = 1$ the order of the factors $F_{mo}^{k}$ is irrelevant, and their relative frequency tends to the limit probability of the corresponding transition $m \to o$, namely $\pi_m\, p_{mo}^{k}$ ($\pi_m$ being the probability of being in state $m$ once the stationary situation has been reached). For a given horizon $h$, large enough, the total reward (the sum of $h$ transition rewards) therefore has a distribution which is approximately normal.


The case with discounting ($\beta < 1$) is equivalent to a limitation of the horizon in which the average duration of a realization of the MDP is $1/(1-\beta)$ periods. For values of $\beta$ not near 1, the approximation of the distribution of the total discounted reward by a normal distribution cannot be used.

With a given policy (homogeneous and deterministic) the action corresponding to every state is unique, so hereafter we drop the action superscript: for example, we write $p_{ij}$ instead of $p_{ij}^{k}$.

3. Total reward with limited horizon

With a fixed deterministic policy, let $v_i(n)$ be the total reward in a realization of the MDP with initial state $i$ and $n$ periods ($n < \infty$). $v_i(n)$ is a random variable with a distribution we do not analyse. We write $E[v_i(n)] = V_i(n)$ and $\mathrm{Var}[v_i(n)] = \sigma_i^2(n)$; both exist and are bounded, according to the bounds assumed in section 2. In this section let $0 \le \beta \le 1$.

In the first transition, starting in state $i$, we reach state $j = 1, \dots, N$ with probability $p_{ij}$:

\[
v_i(n) = r_{ij} + \beta\, v_j(n-1) \quad \text{with probability } p_{ij}, \quad j = 1, \dots, N \tag{3}
\]

N Also E [v i (n)] = .~_,

1=1 Pij {E [rijl + ( 3 E l v j ( n - I)])

It means

N - Pi i ( R i / + (3 1I/(n - 1)) , i = 1, N Vi(n) --i~=1 ..., (4)

From (3) it follows that

\[
v_i^2(n) = r_{ij}^2 + 2\beta\, r_{ij}\, v_j(n-1) + \beta^2\, v_j^2(n-1) \quad \text{with probability } p_{ij}, \quad j = 1, \dots, N.
\]


Note that $r_{ij}$ and $v_j(n-1)$ are independent random variables, because we suppose that the transition reward has no influence on the future evolution of the system; therefore

\[
E[v_i^2(n)] = \sum_{j=1}^{N} p_{ij} \bigl\{ E[r_{ij}^2] + 2\beta\, E[r_{ij}]\, E[v_j(n-1)] + \beta^2\, E[v_j^2(n-1)] \bigr\}
\]

Taking into account that $E[X^2] = \mathrm{Var}[X] + \{E[X]\}^2$, this becomes

\[
\sigma_i^2(n) + V_i^2(n) = \sum_{j=1}^{N} p_{ij} \bigl\{ \sigma_{ij}^2 + R_{ij}^2 + 2\beta\, R_{ij}\, V_j(n-1) + \beta^2 \bigl( \sigma_j^2(n-1) + V_j^2(n-1) \bigr) \bigr\}
\]

or

\[
\sigma_i^2(n) = \sum_{j=1}^{N} p_{ij} \bigl\{ \sigma_{ij}^2 + \beta^2 \sigma_j^2(n-1) + \bigl( R_{ij} + \beta\, V_j(n-1) \bigr)^2 \bigr\} - V_i^2(n) \tag{5}
\]

Formula (4) corresponds to the "value iteration" of Howard [4]. In the case of non-random immediate rewards ($\sigma_{ij} = 0$; $i, j = 1, \dots, N$), (5) still gives a non-zero variance of the total reward: the total reward is random because it depends on the trajectory followed, and this varies from one realization to another. Expressions (4) and (5) allow us to calculate the expected value and the variance by iteration.
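
A minimal sketch of this iteration, assuming a fixed policy already encoded as $N \times N$ matrices of probabilities $p_{ij}$, reward means $R_{ij}$ and reward variances $\sigma_{ij}^2$; the numerical data below are arbitrary illustration values, not taken from the paper's example.

```python
import numpy as np

def reward_moments(P, R, S, beta, n):
    """Iterate formulas (4) and (5): expected total reward V_i(n) and
    variance sigma_i^2(n) over n periods, for a fixed policy."""
    N = P.shape[0]
    V = np.zeros(N)      # V_i(0) = 0
    var = np.zeros(N)    # sigma_i^2(0) = 0
    for _ in range(n):
        M = R + beta * V[None, :]                       # R_ij + beta V_j(n-1)
        V_next = (P * M).sum(axis=1)                    # formula (4)
        var = (P * (S + beta**2 * var[None, :] + M**2)).sum(axis=1) - V_next**2  # formula (5)
        V = V_next
    return V, var

# Arbitrary two-state illustration (not the toymaker data of section 6)
P = np.array([[0.6, 0.4], [0.3, 0.7]])
R = np.array([[5.0, -2.0], [1.0, 3.0]])
S = np.array([[4.0, 1.0], [2.0, 3.0]])
V, var = reward_moments(P, R, S, beta=1.0, n=3)
print(V, np.sqrt(var))   # V_i(3) and sigma_i(3)
```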

4. Total discounted reward with unlimited horizon

In this section we assume that the discounting factor is smaller than one ($0 \le \beta < 1$); therefore, according to the bounds on the moments of the immediate rewards assumed in section 2, the following limits exist and are bounded:

\[
-\infty < \lim_{n \to \infty} V_i(n) = V_i < +\infty, \qquad \lim_{n \to \infty} \sigma_i^2(n) = \sigma_i^2 < \infty
\]

In the limit as $n \to \infty$, (5) becomes

\[
\sigma_i^2 = \sum_{j=1}^{N} p_{ij} \bigl\{ \sigma_{ij}^2 + \beta^2 \sigma_j^2 + ( R_{ij} + \beta V_j )^2 \bigr\} - V_i^2, \quad i = 1, \dots, N \tag{6}
\]

where, according to (4):

\[
V_i = \sum_{j=1}^{N} p_{ij} \bigl\{ R_{ij} + \beta V_j \bigr\}, \quad i = 1, \dots, N \tag{7}
\]

It is known that in the case $|\beta| < 1$ the system of equations (7) has one and only one solution $V_i$, $i = 1, \dots, N$, which is finite (see for example Denardo [2]). In the same way, writing

\[
Q(i, \beta) = \sum_{j=1}^{N} p_{ij} \bigl\{ \sigma_{ij}^2 + ( R_{ij} + \beta V_j )^2 \bigr\} - V_i^2
\]

(note that $Q$ depends on $\beta$ directly and through the $V_j$, $j = 1, \dots, N$, which also depend on $\beta$), (6) becomes

\[
\sigma_i^2 = Q(i, \beta) + \beta^2 \sum_{j=1}^{N} p_{ij}\, \sigma_j^2, \quad i = 1, \dots, N,
\]

and with $\beta^2 < 1$ this system of equations has a unique finite solution $\sigma_i^2$, $i = 1, \dots, N$.

As was expected, (7) is a repetition of the "value determination" in Howard's algorithm [4], and (6) is an analogous expression for the variance of the total discounted reward. The application of (6) to the case of deterministic immediate rewards ($\sigma_{ij} = 0$, $i, j = 1, \dots, N$) is trivial.

$\sigma_i$ allows one to calculate confidence intervals for the total discounted reward in one realization, and therefore to compare given policies.
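
With an unlimited horizon and $\beta < 1$ the two systems are linear and can be solved directly; a sketch under the same conventions as before (matrices $P$, $R$, $S$ describe the fixed policy, and the numbers are arbitrary): (7) is $(I - \beta P)V = r$ with $r_i = \sum_j p_{ij} R_{ij}$, and (6) rearranges to $(I - \beta^2 P)\sigma^2 = Q$.

```python
import numpy as np

def discounted_moments(P, R, S, beta):
    """Solve (7) for V_i and then (6) for sigma_i^2, with 0 <= beta < 1."""
    N = P.shape[0]
    I = np.eye(N)
    r = (P * R).sum(axis=1)                   # expected one-step reward per state
    V = np.linalg.solve(I - beta * P, r)      # formula (7)
    M = R + beta * V[None, :]
    Q = (P * (S + M**2)).sum(axis=1) - V**2   # Q(i, beta) of section 4
    var = np.linalg.solve(I - beta**2 * P, Q) # formula (6) as a linear system
    return V, var

# Arbitrary two-state illustration (not the toymaker data)
P = np.array([[0.6, 0.4], [0.3, 0.7]])
R = np.array([[5.0, -2.0], [1.0, 3.0]])
S = np.array([[4.0, 1.0], [2.0, 3.0]])
V, var = discounted_moments(P, R, S, beta=0.9)
print(V, np.sqrt(var))   # V_i and sigma_i for the discounted process
```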

5. Average reward per period

In this case there is no discounting ($\beta = 1$). Assume that the transition probabilities determine only one ergodic class among the states of the system: the average reward per period is then independent of the initial state.

We suppose the horizon to be sufficiently large for the system to be in the steady state, and denote by $\pi_i$, $i = 1, \dots, N$, the steady-state probabilities. Let $\tilde{g}$ be the immediate reward in a transition; the state of the system is $i$ with probability $\pi_i$, and the conditional probability that the transition is $i \to j$ is $p_{ij}$. Therefore

\[
\tilde{g} = r_{ij} \quad \text{with probability } \pi_i\, p_{ij},
\]

and consequently

\[
E[\tilde{g}] = \sum_{i=1}^{N} \pi_i \sum_{j=1}^{N} p_{ij}\, E[r_{ij}].
\]

Taking $g$ as the expected reward per period, this becomes

\[
g = \sum_{i=1}^{N} \pi_i \sum_{j=1}^{N} p_{ij}\, R_{ij} \tag{8}
\]

Analogously, $\tilde{g}^2 = r_{ij}^2$ with probability $\pi_i\, p_{ij}$, and it follows that

\[
E[\tilde{g}^2] = \sum_{i=1}^{N} \pi_i \sum_{j=1}^{N} p_{ij}\, E[r_{ij}^2]
\]

and finally

\[
\mathrm{Var}[\tilde{g}] + g^2 = \sum_{i=1}^{N} \pi_i \sum_{j=1}^{N} p_{ij} \bigl\{ \sigma_{ij}^2 + R_{ij}^2 \bigr\},
\]

\[
\mathrm{Var}[\tilde{g}] = \sum_{i=1}^{N} \pi_i \sum_{j=1}^{N} p_{ij} \bigl\{ \sigma_{ij}^2 + R_{ij}^2 \bigr\} - g^2 \tag{9}
\]

In the case of non-random immediate rewards, it is sufficient to substitute $\sigma_{ij} = 0$, $i, j = 1, \dots, N$, in (9).

This expression for the variance of the immediate reward per period in the steady state allows policies to be compared with each other by the oscillation of the rewards in each period (for example, comparing the optimal policy with respect to $g$ and to the relative values, obtained by Howard's algorithm [4, ch. 4], with other suboptimal policies). In any case, determining the state probabilities $\pi_i$ requires solving a system of $N$ equations in $N$ unknowns. The distribution of $\tilde{g}$ is not necessarily normal, even if the distributions of the $r_{ij}$, $i, j = 1, \dots, N$, are normal.
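
A sketch of the corresponding computation: the steady-state probabilities are obtained from $\pi P = \pi$, $\sum_i \pi_i = 1$, and (8) and (9) then follow directly. As before, $P$, $R$, $S$ are the fixed policy's transition probabilities, reward means and reward variances, and the numbers are arbitrary.

```python
import numpy as np

def per_period_moments(P, R, S):
    """Steady-state probabilities pi, expected reward per period g (8),
    and variance of the per-period reward (9)."""
    N = P.shape[0]
    # pi P = pi together with the normalization sum(pi) = 1
    A = np.vstack([P.T - np.eye(N), np.ones(N)])
    b = np.append(np.zeros(N), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    g = pi @ (P * R).sum(axis=1)                      # formula (8)
    var_g = pi @ (P * (S + R**2)).sum(axis=1) - g**2  # formula (9)
    return pi, g, var_g

# Arbitrary two-state illustration (not the toymaker data)
P = np.array([[0.6, 0.4], [0.3, 0.7]])
R = np.array([[5.0, -2.0], [1.0, 3.0]])
S = np.array([[4.0, 1.0], [2.0, 3.0]])
pi, g, var_g = per_period_moments(P, R, S)
print(pi, g, np.sqrt(var_g))
```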

6. A numerical example

The well-known toymaker example represents an MDP with two states and two actions. As Goldwerger [3] does, we assume that the transition rewards are random, with the normal distributions shown in Table 1.

The policy which maximizes the expected total reward without discounting is shown in Table 2 for a horizon of twenty periods.

With this optimal policy we realize the process by simulation. A statistical test can show the compatibility of the analytical and simulated results. Table 3 shows the values corresponding to 500 repetitions with horizons $n = 3$ and $n = 40$ and initial state $i = 1$. Figure 1 shows the simulated one-period rewards with initial state $i = 1$ and action $k = 1$. The histograms of the cumulative rewards over 3 and 40 periods, obtained by simulation using the time-homogeneous deterministic policy (2,2), appear in Figures 2 and 3.


Table 1

State i          Action k                 Transition probabilities   Distribution of the transition reward
                                          p_i1     p_i2              r_i1          r_i2
1 (success)      1 (without publicity)    0.5      0.5               N(9, ·)       N(3, 2)
1 (success)      2 (with publicity)       0.8      0.2               N(4, ·)       N(4, 1)
2 (no success)   1 (without research)     0.4      0.6               N(3, 2)       N(-7, 3)
2 (no success)   2 (with research)        0.7      0.3               N(1, 0.5)     N(-19, 2)

Table 2. Optimal policy for the toymaker: maximal expected total reward without discounting

Horizon n   Optimal k, i=1   Optimal k, i=2   V1(n)    σ1(n)    V2(n)    σ2(n)
 1          1                1                 6.00     3.54    -3.00     5.16
 2          2                2                 8.20     5.48    -1.70    13.94
 3          2                2                10.22     8.97     0.23    16.37
 4          2                2                12.22    11.65     2.22    18.07
 5          2                2                14.22    13.84     4.22    19.57
 6          2                2                16.22    15.74     6.22    20.95
 7          2                2                18.22    17.43     8.22    22.25
 8          2                2                20.22    18.97    10.22    23.48
 9          2                2                22.22    20.39    12.22    24.64
10          2                2                24.22    21.72    14.22    25.75
11          2                2                26.22    22.98    16.22    26.82
12          2                2                28.22    24.17    18.22    27.84
13          2                2                30.22    25.30    20.22    28.83
14          2                2                32.22    26.38    22.22    29.79
15          2                2                34.22    27.43    24.22    30.72
16          2                2                36.22    28.43    26.22    31.62
17          2                2                38.22    29.40    28.22    32.49
18          2                2                40.22    30.34    30.22    33.34
19          2                2                42.22    31.25    32.22    34.17
20          2                2                44.22    32.13    34.22    34.98


Table 3. Values expected and obtained by simulation

Policy   Horizon n   Expected reward   Average reward in 500       Standard deviation   Average deviation in 500
                     V1(n)             realizations, v̄1(n)         σ1(n)                realizations, s1(n)
opt.      3          10                10.24                        8.97                  9.03
opt.     40          84                85.75                       46.41                 46.89
(2,2)     3           8.22              7.96                        9.21                 10.03
(2,2)    40          82                81.86                       46.46                 45.68

Fig. 1. Histogram of 500 realizations of the one-period reward, with initial state i = 1 and action k = 1 (horizontal axis marked at V1(1) − σ1(1), V1(1), V1(1) + σ1(1)).

It is obvious that the respective distributions are not normal.
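
A Monte-Carlo check of the kind reported in Table 3 can be sketched as follows: simulate many realizations of the process under a fixed policy, drawing each transition reward from a normal distribution with the given mean and variance, and compare the sample mean and standard deviation of the total reward with $V_1(n)$ and $\sigma_1(n)$ computed from (4) and (5). The matrices $P$, $R$, $S$ and the numbers are again illustrative, not the Table 1 data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_total_reward(P, R, S, start, n, reps=500, beta=1.0):
    """Sample mean and standard deviation of the total n-period reward,
    starting in state `start`, over `reps` independent realizations."""
    N = P.shape[0]
    totals = np.empty(reps)
    for rep in range(reps):
        i, total = start, 0.0
        for t in range(n):
            j = rng.choice(N, p=P[i])                        # next state
            total += beta**t * rng.normal(R[i, j], np.sqrt(S[i, j]))
            i = j
        totals[rep] = total
    return totals.mean(), totals.std(ddof=1)

# Arbitrary two-state illustration (not the Table 1 parameters)
P = np.array([[0.6, 0.4], [0.3, 0.7]])
R = np.array([[5.0, -2.0], [1.0, 3.0]])
S = np.array([[4.0, 1.0], [2.0, 3.0]])
print(simulate_total_reward(P, R, S, start=0, n=40))  # compare with (4)-(5)
```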

The generalization of the MDP to the case of random transition rewards permits a better adaptation of the models to reality. In applications where risk-indifference does not hold, knowledge of the variance helps the decision to be reached. Decision rules based only on knowledge of the expected reward (or cost) should be subordinate to those which also consider the variance.


Fig. 2. Histogram of 500 realizations of the three-period reward, with policy (2,2) and initial state i = 1.

Fig. 3. Histogram of 500 realizations of the 40-period reward, with policy (2,2) and initial state i = 1 (horizontal axis marked at V1(40) − σ1(40), V1(40), V1(40) + σ1(40)).


REFERENCES

[1] D.B. Brown, H.F. Martz Jr., A.G. Walvekar: "Dynamic programming for the conservative decision maker", Opsearch, Vol. 6, No. 4 (December 1969), pp. 283-294.

[2] E.V. Denardo: "Contraction mappings in the theory underlying dynamic programming", SIAM Rev., Vol. 9, No. 2 (April 1967), pp. 165-177.

[3] J. Goldwerger: "Dynamic programming for a stochastic Markovian process with an application to the mean variance models", Management Sci., Vol. 23, No. 6 (February 1977), pp. 612-620.

[4] R.A. Howard: "Dynamic programming and Markov processes", Wiley, New York, 1960.
