
Annals of Operations Research 30 (1991) 299-320

ON THE EVALUATION OF STRATEGIES FOR BRANCHING BANDIT PROCESSES

K.D. GLAZEBROOK and R.J. BOYS

Department of Mathematics and Statistics, University of Newcastle upon Tyne, Newcastle upon Tyne, England

N.A. FAY

Department of Mathematical Sciences, University of Durham, Durham, England

Glazebrook [1] has given an account of improved procedures for strategy evaluation for resource allocation in a stochastic environment. These methods are extended in the paper in such a way that they can be applied to problems which, for example, have precedence constraints and/or an arrivals process of new jobs. Theoretical results, backed up by numerical studies, show that quasi-myopic heuristics often perform well.

Keywords: Bandit process, Gittins' index, Markov decision process, stopping time, strategy evaluation.

1. Introduction

A single resource is available for allocation among a collection of competing options. All options evolve stochastically when in receipt of this resource. The search for allocation strategies which are optimal with respect to various cost criteria has often centred around the theory of Gittins' indices (see Gittins [2]). This theory indicates, inter alia, that under certain conditions each job should have an index which is a function of its current state and that an optimal strategy should always process a job whose index is maximal.

Of late, this theory has developed in two directions which are relevant to the current work:

(i) Whittle [3] extended Gittins' original result on the optimality of index strategies to an "open" process in which new options can arrive over time. This work has been successively generalised by Varaiya, Walrand and Buyukkoc [4] and Weiss [5]. Weiss referred to "branching bandit processes" and conceived of an open process in terms of options from which new options branched. We shall retain that terminology here.

(ii) Glazebrook [6] demonstrated the role which Gittins' indices can play in the evaluation of suboptimal strategies. This work led, inter alia, to the development by Glazebrook [7] of an approach to sensitivity analysis for stochastic scheduling.

© J.C. Baltzer A.G. Scientific Publishing Company


Latterly, an improved approach to strategy evaluation has been reported in Glazebrook [1]. This paper extends that work to open stochastic resource allocation problems of the kind described in (i).

Section 2 contains a general account of the theory. The key result, theorem 1, is used in section 3 to develop results for the evaluation of a heuristic for a class of stochastic scheduling problems with precedence constraints. In section 4 we consider the important question of when it is worth modelling a collection of independent Bernoulli streams of arriving jobs in a stochastic resource allocation problem. Sections 3 and 4 both conclude with numerical studies which show that quasi-myopic heuristics often perform well.

2. An evaluation procedure

By the term a family of branching bandit processes we shall mean a cost discounted Markovian decision process with the following constituent parts:

(a) Individual (non-branching) bandit processes. Denote bandit process j by the quintuple (x̄_j, Ω_j, ω_j, P_j, R_j) and by x_j(t) the state of j at time t ∈ ℕ. We require that x_j(t) ∈ Ω_j, the (general) state space for j. State x̄_j is j's initial state and ω_j its completion set. Bandit process j is deemed to be completed as soon as its state enters ω_j ⊆ Ω_j. Should the decision-maker choose j at decision epoch t ∈ ℕ then its state changes according to the Markovian law of motion P_j and a reward a^t R_j{x_j(t), x_j(t+1)} is earned. Discount rate a is in [0, 1) and reward function R_j: Ω_j × Ω_j → ℝ^+ is bounded. As soon as the state of j enters ω_j, no further state transitions are possible, nor are any more rewards earned.

(b) Collections of bandit processes. C_j is an infinite set of independent copies of (x̄_j, Ω_j, ω_j, P_j, R_j). We have N such sets C_1, C_2, ..., C_N, all independent of one another.

(c) Families of branching bandit processes. Denote by n_i(t) the number of (uncompleted) members of C_i present in the system at time t ∈ ℕ. The state of our family of branching bandit processes at time t records the numbers n_i(t) and the states of all individual bandit processes present. If Σ_{i=1}^N n_i(t) = 0, no decision is taken at time t. In this event denote by p(n_1, n_2, ..., n_N | 0) the probability that n_i members of C_i arrive in the system at time t+1, each in its initial state, 1 ≤ i ≤ N.

If Σ_{i=1}^N n_i(t) > 0 then at time t the decision-maker chooses one of the bandit processes present then. Suppose this is a member of C_j (j, say) in state x_j(t). This choice will have the following consequences: (i) the state x_j(t+1) of j at time t+1 is determined according to P_j; (ii) the states of all other bandit processes present at t are unchanged; (iii) a reward a^t R_j{x_j(t), x_j(t+1)} is earned; (iv) all bandit processes present at t remain present at t+1 unless x_j(t+1) ∈ ω_j, in which case j leaves the system. In addition, n_i members of C_i arrive in the system at t+1, each in its initial state, 1 ≤ i ≤ N, with probability

p_j{n_1, n_2, ..., n_N | x_j(t+1)}.

If at time t+1 one or more new bandit processes arrive in the system we say that a branch occurs then.

(d) Strategies. The decision epochs for any family of branching bandit processes are those times t ∈ ℕ at which the system is non-empty. A strategy is any rule for choosing a bandit process from among those present at each decision epoch. Such a rule can in general take account of the entire history of the process. An optimal strategy is one which maximises the total expected reward earned during [0, ∞). Standard theory (see e.g. Ross [8]) indicates the existence of an optimal strategy which is deterministic, stationary and Markov - i.e., a rule which looks at the current state of the system and specifies which bandit process to choose next.
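As a purely illustrative rendering of constituent parts (a)-(c), the sketch below organises the model as code. All class and attribute names are hypothetical, the state spaces are left abstract, and the arrivals that may occur while the system is empty are omitted for brevity.

from dataclasses import dataclass

@dataclass
class BanditClass:
    """One class C_j of bandit processes: (x-bar_j, Omega_j, omega_j, P_j, R_j)."""
    initial_state: object
    completion_set: set        # omega_j, the completion set
    transition: callable       # state -> next state (samples from P_j)
    reward: callable           # (state, next_state) -> undiscounted reward R_j
    arrivals: callable         # next_state -> dict {class index: number arriving}

@dataclass
class Bandit:
    cls: int
    state: object

class BranchingBanditFamily:
    """State: the collection of uncompleted bandit processes currently present."""
    def __init__(self, classes, initial_bandits, discount):
        self.classes = classes                 # list of BanditClass
        self.present = list(initial_bandits)   # B(t)
        self.a = discount                      # discount rate a in [0, 1)
        self.t = 0

    def step(self, choice):
        """Process the bandit at position `choice` in self.present for one period."""
        b = self.present[choice]
        c = self.classes[b.cls]
        new_state = c.transition(b.state)
        reward = (self.a ** self.t) * c.reward(b.state, new_state)   # a^t R_j{x_j(t), x_j(t+1)}
        b.state = new_state
        if new_state in c.completion_set:      # completed bandits leave the system
            self.present.pop(choice)
        for j, n in c.arrivals(new_state).items():    # a branch: new bandits arrive
            self.present.extend(Bandit(j, self.classes[j].initial_state) for _ in range(n))
        self.t += 1
        return reward

A strategy is then any rule selecting a member of `present` at each decision epoch; the index result quoted below states that always choosing a bandit of maximal Gittins index is optimal.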

Standard theory also indicates that an optimal strategy and its associated value function will satisfy Bellman's optimality equations. An important computational reduction is available via a generalisation of Gittins' celebrated index result [2] due to Varaiya et al. [4]. According to this reduction, there exist functions G_j: Ω_j → ℝ^+, 1 ≤ j ≤ N, such that at epoch t ∈ ℕ, bandit process j from class C_j present in the system and in state x_j(t) has associated value G_j{x_j(t)}. Any strategy which always chooses bandit processes with maximal index values is optimal. The index G_j(x_j) may be thought of intuitively as the best reward rate available from bandit process j starting from state x_j. More precisely, we characterise the function G_j as follows: consider the family of branching bandit processes described in (c) such that
(i) at time 0, only one bandit process is present in the system - namely bandit process j in arbitrary state x_j ∈ Ω_j; and
(ii) the action space of the family is augmented at all times t ∈ ℕ by the addition of retirement option r_G. If action r_G is taken at time t then the state of the system is unchanged and a reward a^t G(1 − a) is earned.
Hence at each decision epoch either one of the bandit processes present in the system is chosen with the consequences described in (c) (i)-(iv) or retirement option r_G is taken.

We use the notation (r_G, j, x_j) to denote the above system.

DEFINITION 1
The Gittins index of bandit process j ∈ C_j is a function G_j: Ω_j → ℝ^+ such that

G_j(x_j) ≜ inf[G : r_G is optimal at time 0 for (r_G, j, x_j)].   (1)

To prepare for the main results of this section, we need some additional notation. Denote by π an arbitrary stationary strategy for our family of branching bandit processes and by π* an optimal one. The total expected rewards associated with π and π* are denoted R(π) and R(π*) respectively. Suppose


that the family of branching bandit processes has initial state (j, x_j) as in (i) above and that the decision-maker adopts strategy π. If τ is some stopping time on this process then R_j(x_j, π, τ) denotes the expected reward earned under π during the interval [0, τ). We now write

G_j(x_j, π, τ) = R_j(x_j, π, τ)(1 − E a^τ)^{−1}.   (2)

G_j(x_j, π, τ) may be thought of as a Gittins index corresponding to a choice of strategy π and of stopping time τ in the following sense: from (1) comes readily the notion of a Gittins index as an equivalent retirement reward. Now G_j(x_j, π, τ) also has an interpretation as an equivalent retirement reward. It is that value of the retirement reward G which renders the decision-maker indifferent between retirement at t = 0 (when the system is in state (j, x_j)) and the application of strategy π to the system throughout [0, τ) and subsequent retirement at τ. With these characterisations in mind, the following result is no surprise. Lemma 1 combines two results to be found in Gittins and Glazebrook [9] and Glazebrook [10].

LEMMA 1

G_j(x_j) = sup_π sup_τ {G_j(x_j, π, τ)},   (3)

the suprema in (3) being taken over all stationary strategies and all positive-valued stopping times respectively.
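The quantity in (2), and hence via lemma 1 a lower bound on the index itself, can be estimated by straightforward simulation. In the sketch below `run_once` is a hypothetical user-supplied simulator which, for one realisation of the process started in state (j, x_j) and run under π, returns the discounted reward earned during [0, τ) together with the realised value of τ.

def index_lower_bound(run_once, a, n_runs=10_000):
    """Estimate G_j(x_j, pi, tau) = R_j(x_j, pi, tau)(1 - E a^tau)^(-1), equation (2)."""
    rewards, discounts = 0.0, 0.0
    for _ in range(n_runs):
        reward, tau = run_once()     # one realisation under pi, stopped at tau
        rewards += reward            # accumulates an estimate of R_j(x_j, pi, tau)
        discounts += a ** tau        # accumulates an estimate of E a^tau
    return (rewards / n_runs) / (1.0 - discounts / n_runs)

By lemma 1, any particular choice of π and τ yields in this way a lower bound on G_j(x_j), with equality for the maximising pair.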

The main result of this section evaluates an arbitrary policy π by putting an upper bound on R(π*) − R(π), the total expected reward lost when opting for π instead of an optimal strategy. This upper bound essentially compares the reward rates (as measured by Gittins indices) available under π* with those available under π. This has been achieved for non-branching bandit processes (i.e., such that n_i > 0 for any i ⇒ p_j(n_1, n_2, ..., n_N | ·) ≡ 0) by Glazebrook [1]. Theorem 1 is an extension of Glazebrook's main result, use of which is made in the proof.

Lemma 2 is a preparatory result. In it {σ_n(π), n ≥ 0} is the sequence of random times at which strategy π switches between bandit processes - i.e., π chooses at σ_n(π) a process which was not chosen at σ_n(π) − 1. Further, {β_n(π), n ≥ 0} is the sequence of times at which branches occur under strategy π. We adopt the convention that σ_0(π) = β_0(π) = 0.

If τ is a stopping time defined on the process under strategy π, denote by πτπ* that strategy which chooses bandit processes according to π during [0, τ) and which thereafter adopts an optimal strategy. Lastly, we always denote by B(t) the collection of bandit processes present in the system at time t.

LEMMA 2
For arbitrary stationary strategy π and stopping time τ satisfying

τ ≤ min{σ_1(π), β_1(π)}   (4)


with probability one, then

R(π*) − R(πτπ*) ≤ [max_i G_i{x_i(0)} − G_j{x_j(0), π, τ}](1 − E a^τ),   (5)

where the maximisation in (5) is over B(0) and where bandit process j is chosen by π at time 0.

Proof
Suppose that Σ_{i=1}^N n_i(0) = |B(0)| = M. Number the processes present at time 0 from 1 to M. Throughout the evolution of the system under strategy π* we are able to associate with each process i ∈ B(0) a bandit process ī defined as follows:
(a) At each decision epoch t ∈ ℕ, ī is associated with i(t) ⊆ B(t), a subset of those bandit processes present at t. Note that i(0) = {i}, i.e., at time 0, ī is associated with the single bandit process i. We shall construct the i(t), 1 ≤ i ≤ M, in such a way as to ensure that they are mutually exclusive with

∪_{i=1}^M i(t) = B(t).

(b) If at decision epoch t ∈ ℕ, strategy π* chooses a process from i(t), k say, then

i(t+1) = [i(t)\{k}] ∪ A(t+1), if k is completed at t+1,
i(t+1) = i(t) ∪ A(t+1), otherwise,   (6)

where A(t+1) is the set of new processes which arrives at t+1. If strategy π* does not choose a process from i(t) at time t then i(t+1) = i(t). We adopt the convention that if the system is empty at time t then new arrivals at t+1 enter set 1(t+1).
(c) The state of bandit process ī at time t ∈ ℕ is the vector of states of the processes in i(t). Should strategy π* choose k ∈ i(t) at time t ∈ ℕ then the transformation in the state of ī between times t and t+1 is such as is implied by (6) and P_k, the Markovian law of motion of bandit process k. Should strategy π* not choose a member of ī at time t ∈ ℕ then the state of ī remains unchanged.
(d) Should strategy π* choose k ∈ i(t) at time t ∈ ℕ then the reward accruing to ī at time t is a^t R_k{x_k(t), x_k(t+1)}. No rewards accrue at t to any other process l̄, l ≠ i.

From (a)-(d) we see that the application of strategy π* to our family of branching bandit processes has been modelled by reference to the family of (non-branching) bandit processes 1̄, 2̄, ..., M̄ in which (inter alia) the branching mechanism is accommodated within the Markovian stochastic structure of each ī.

Consider now the application of strategy πτπ* (see (4) above) to our family of branching bandit processes, and the equivalent construction to (a) and (b) above implied by this new strategy. Since τ ≤ min{σ_1(π), β_1(π)} almost surely, the following can be said of both of these constructions: both strategies π* and πτπ* always choose processes from within the respective collections i(t) according to whichever has the largest Gittins index. This statement is true for strategy πτπ* since throughout [0, τ) the corresponding collection j(t) consists of the single member j. Thereafter the conclusion is a consequence of the index result of Varaiya et al. [4].

It now follows from this statement that the application of strategy πτπ* to our family of branching bandit processes may be modelled by reference to the bandit processes 1̄, 2̄, ..., M̄ defined above with respect to π*. In terms of these processes the difference between π* and πτπ* is that (for the latter) the first τ occasions upon which j̄ is chosen are brought forward to the interval [0, τ). (N.B.: We may say j̄ is chosen at t if and only if one of the bandit processes in j(t) is chosen.) We are now in a position to use Glazebrook's [1] result for the evaluation of strategies for (non-branching) processes. A straightforward application implies that (in an obvious notation)

R(π*) − R(πτπ*) ≤ [max_i G_ī{x_ī(0)} − G_j̄{x_j̄(0), π, τ}](1 − E a^τ),   (7)

the maximisation in (7) being over i ∈ B(0). To obtain the inequality (5) we simply utilise the identification of ī with i at time 0, 1 ≤ i ≤ M.   □

The preamble to lemma 2 indicated that the evaluation of π would proceed on the basis of a comparison of reward rates available under π* and π. This comparison will essentially be conducted within successive time intervals [τ_n(π), τ_{n+1}(π)), n ≥ 0, defined inductively and with respect to the evolution of the system under strategy π as follows:
(i) τ_0(π) = 0;
(ii) if at time τ_n(π) either B(τ_n(π)) = ∅ or B(τ_n(π)) ≠ ∅ and strategy π chooses optimally (i.e., a bandit process with maximal Gittins index) then τ_{n+1}(π) is the first time after τ_n(π) at which the system is non-empty and π chooses sub-optimally;
(iii) if at time τ_n(π) the system is non-empty and π chooses j ∈ B(τ_n(π)) sub-optimally then τ_{n+1}(π) is the smallest integer greater than τ_n(π) which is in either of the sequences {σ_m(π), m ≥ 0} or {β_m(π), m ≥ 0}. Less formally, τ_{n+1}(π) is the first time after τ_n(π) at which either a switch or a branch occurs under strategy π.

In fact, the random time intervals [ν_n(π), ν_{n+1}(π)), n ≥ 0, within which a comparison of reward rates can take place must all lie within [τ_m(π), τ_{m+1}(π)) for some m ≥ 0.

DEFINITION 2
The sequence {ν_n(π), n ≥ 0} with ν_0(π) = 0 and ν_{n+1}(π) > ν_n(π), n ≥ 0, with probability one is at least as fine as {τ_n(π), n ≥ 0} if the sequence {ν_n(π), n ≥ 0} contains the sequence {τ_n(π), n ≥ 0} as a subsequence with probability one.


For each such interval [ν_n(π), ν_{n+1}(π)), n ≥ 0, we define a discrepancy measure Δ as follows:
(i) if at time ν_n(π) either B(ν_n(π)) = ∅ or B(ν_n(π)) ≠ ∅ and strategy π chooses optimally throughout [ν_n(π), ν_{n+1}(π)) we set

Δ{ν_n(π), ν_{n+1}(π)} = 0;   (8)

(ii) if at time ν_n(π) the system is non-empty and π chooses j ∈ B{ν_n(π)} sub-optimally then we set

Δ{ν_n(π), ν_{n+1}(π)} = max_i G_i[x_i{ν_n(π)}] − G_j[x_j{ν_n(π)}, π, ν_{n+1}(π) − ν_n(π)],   (9)

the maximisation in (9) being over B(ν_n(π)).

Theorem 1 establishes the theoretical basis for our evaluation procedure.

THEOREM 1
For arbitrary stationary strategy π,

R(π*) − R(π) ≤ E_π[ Σ_{n≥0} Δ{ν_n(π), ν_{n+1}(π)}(a^{ν_n(π)} − a^{ν_{n+1}(π)}) ],   (10)

where {ν_n(π), n ≥ 0} is any sequence at least as fine as {τ_n(π), n ≥ 0} and where E_π denotes an expectation taken over all realisations of the family of branching bandit processes under strategy π.

Proof
In fact, we are able to show that

R(π*) − R(π ν_{m+1}(π) π*) ≤ E_π[ Σ_{n=0}^{m} Δ{ν_n(π), ν_{n+1}(π)}(a^{ν_n(π)} − a^{ν_{n+1}(π)}) ],  m ≥ 0.   (11)

The result then follows simply by taking the limit as m goes to infinity.

We demonstrate (11) by means of an induction on m. To establish (11) when m = 0 consider the following two cases:
(a) At time 0 either B(0) = ∅ or B(0) ≠ ∅ and π chooses optimally throughout [0, ν_1(π)). In this case π ν_1(π) π* is an optimal strategy and Δ{0, ν_1(π)} = 0. Hence both sides of (11) are zero and the inequality is satisfied.
(b) At time 0 the system is non-empty and π chooses sub-optimally. It then follows from the characterisation of the sequence {ν_n(π), n ≥ 0} and from lemma 2 that

R(π*) − R(π ν_1(π) π*) ≤ Δ{0, ν_1(π)}(1 − E a^{ν_1(π)}).

We have now shown that (11) is true when m = 0.

Suppose now that (11) holds for m ≤ r − 1 and deduce that it must hold for m = r. A simple repetition of the argument in (a) and (b) above, though now centred on the system at time ν_r(π), leads via appropriate conditioning to the conclusion

R(π ν_r(π) π*) − R(π ν_{r+1}(π) π*) ≤ E_π[ Δ{ν_r(π), ν_{r+1}(π)}(a^{ν_r(π)} − a^{ν_{r+1}(π)}) ].

The inductive hypothesis for m = r is now established by a simple summation. Hence (11) is true and the result holds.   □
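The right-hand side of (10) is typically estimated by simulation under π, as in the numerical studies of sections 3 and 4. A minimal sketch follows; it assumes that problem-specific code (not shown) has already summarised each simulated realisation as a list of pairs (ν_n, Δ_n) closed by a final epoch with no discrepancy, and all function names are illustrative.

def bound_one_realisation(intervals, a):
    """intervals: [(nu_0, Delta_0), ..., (nu_m, Delta_m), (nu_final, None)] for one realisation."""
    total = 0.0
    for (nu, delta), (nu_next, _) in zip(intervals, intervals[1:]):
        total += delta * (a ** nu - a ** nu_next)   # Delta{nu_n, nu_{n+1}} (a^{nu_n} - a^{nu_{n+1}})
    return total

def estimate_bound(realisations, a):
    """Monte-Carlo estimate of the right-hand side of (10)."""
    return sum(bound_one_realisation(r, a) for r in realisations) / len(realisations)

In the applications below the exact discrepancies (9), which require Gittins indices, are replaced by simpler upper bounds, so the same computation delivers an upper bound on R(π*) − R(π).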

We now proceed to illustrate our evaluation procedure by reference to two important classes of stochastic scheduling problems.

3. Evaluating heuristics for stochastic scheduling problems with out-forest precedence constraints

The class of stochastic scheduling problems with which we shall be concerned here can most naturally be thought of in the following terms: a fixed collection J of independent (non-branching) bandit processes (in this context we shall call these "jobs") requires processing by a single machine. Technological constraints delimit the class of processing strategies. These constraints are expressed in the form of a partial ordering F on the job set, with (j, k) ∈ F indicating that job j must be completed before processing on k can begin.

The digraph representation of F is an out-forest, i.e. each job has at most one immediate predecessor. An out-forest is a disjoint union of connected components, called out-trees, each of which has a single root (a job with no predecessors). We suppose that there are M such components. Figure 1 depicts a case with M = 3. Note that it is a feature of this form of precedence constraint that at any time t during the evolution of the process, the implied set of precedence constraints on J(t), defined to be the set of jobs remaining to be processed at time t, also has a representation as an out-forest. In this context B(t) is interpreted as

Fig. 1. Precedence constraints forming an out-forest with three components.


the set of jobs available for processing at t, each of which will be a root of an out-tree corresponding to a subset of J(t).

It is clear from the general model described in section 2 that this class of stochastic scheduling problems with out-forest precedence constraints is a family of branching bandit processes. The functions p_j determining arrival probabilities must be chosen to admit jobs into the set B(t) as completions occur. Hence, as was proved by Glazebrook [10], an optimal processing strategy is determined by a collection of Gittins' indices (see definition 1). At decision epoch t each j ∈ B(t) has an associated G_j{x_j(t)}, this index representing (in the sense of definition 1) the best reward rate available from the component of which (at t) j is a root. These indices will in general be complex, and will need to reflect the current status of all jobs within a given component. Hence we shall explore the possibility of using theorem 1 as a means of evaluating some simpler heuristics for scheduling. To this end, we need the following definition.

DEFINITION 3

The non-branching Gittins index of bandit process (job) j ∈ C_j in state x_j ∈ Ω_j is a function H_j: Ω_j → ℝ^+ defined as follows: consider a system (r_G, j, x_j)′ identical to (r_G, j, x_j) (see (1)) except that no branches are allowed, i.e.,

n_i > 0 for any i ⇒ p_j(n_1, n_2, ..., n_N | x) = 0 ∀ x ∈ Ω_j;

then

H_j(x_j) ≜ inf[G : r_G is optimal at time 0 for (r_G, j, x_j)′].   (12)

The function H_j is in general much simpler than the corresponding G_j, since it reflects rewards obtainable from j only. The following preparatory result will be crucial in applying theorem 1 to heuristics based on these non-branching indices.

LEMMA 3

For any job j within a stochastic scheduling problem with out-forest precedence constraints,

H_j(x_j) ≤ G_j(x_j) ≤ max{H_j(x_j), max_{k∈F_j} H_k(x̄_k)},  x_j ∈ Ω_j,   (13)

where F_j denotes the set of successors of job j under the precedence constraints and x̄_k the initial state of job k.

Proof
The left-hand inequality is a trivial consequence of definitions 1 and 3. To establish the right-hand inequality, denote by {r_G; (j, x_j); (k, x̄_k), k ∈ F_j} a system for which:
(i) at time 0, 1 + |F_j| non-branching bandit processes are present in the system - namely bandit process j in state x_j together with the processes in F_j, each in its initial state;
(ii) at each decision epoch t ∈ ℕ either one of the above bandit processes is chosen (with the consequences described in the model description in section 2) or retirement option r_G is taken.

Since it is a trivial consequence of Gittins' [2] celebrated result that an optimal strategy for {r_G; (j, x_j); (k, x̄_k), k ∈ F_j} is determined by a collection of (non-branching) Gittins indices, it follows easily that

inf[G : r_G is optimal at time 0 for {r_G; (j, x_j); (k, x̄_k), k ∈ F_j}] = max{H_j(x_j), max_{k∈F_j} H_k(x̄_k)}.   (14)

A simple comparison of the left-hand side of (14) with the right-hand side of (1) yields the right-hand inequality of (13).   □

The interpretation of the right-hand side of (13) is of a Gittins index for the collection {j} ∪ F_j where all of the precedence constraints have been removed. We shall now use theorem 1 and lemma 3 to develop procedures for evaluating the class of heuristics described in the next definition.

DEFINITION 4
A strategy for our stochastic scheduling problem with out-forest precedence constraints is quasi-myopic if it chooses at each decision epoch t any job j ∈ B(t) satisfying

H_j{x_j(t)} = max_{k∈B(t)} H_k{x_k(t)}.   (15)
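In computational terms the rule (15) is a one-line selection among the currently available jobs; a minimal sketch, in which `H` is assumed to return the non-branching index of a job in its current state:

def quasi_myopic_choice(available, states, H):
    """available: the jobs in B(t); states: dict mapping each job to its current state."""
    return max(available, key=lambda j: H(j, states[j]))

Ties may be broken arbitrarily; note that the rule never requires the full branching indices G_j.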

In theorems 2 and 3 we proceed to an evaluation of quasi-myopic strategies for the simple case in which all jobs have identical stochastic and reward structures. They will as a consequence have identical non-branching Gittins indices H(·), say. Consider the system under any strategy which processes one particular job from its initial state x̄ at time 0 through to completion. Define stopping time T as

T = inf_{t>0}[t : H{x(t)} < H(x̄)],   (16)

i.e. the first occasion upon which the non-branching index falls below its initial value.
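The quantity E a^T enters the bound of theorem 2 below and can be estimated by simulating the index path of a single job. In the sketch, `simulate_index_path` is a hypothetical simulator returning the values H{x(0)}, H{x(1)}, ... along one realisation of processing the job; how the path is continued at and beyond completion is left to the caller.

def sample_T(simulate_index_path):
    path = simulate_index_path()
    initial = path[0]                    # H(x-bar)
    for t, h in enumerate(path):
        if t > 0 and h < initial:
            return t                     # first epoch at which the index falls below H(x-bar)
    return float("inf")                  # the index never fell below its initial value

def estimate_E_a_T(simulate_index_path, a, n_runs=10_000):
    return sum(a ** sample_T(simulate_index_path) for _ in range(n_runs)) / n_runs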

THEOREM 2
If π is a quasi-myopic strategy for this class of stochastic scheduling problems with out-forest precedence constraints and identical jobs then

R(π*) − R(π) ≤ (E a^T)^{|B(0)|} {H(x̄) − inf_x H(x)},   (17)

where the infimum in (17) is taken over the common state space.


Proof
We number the root jobs of the |B(0)| components of the out-forest 1, 2, ..., |B(0)|. With each such job i is associated a stopping time T_i identical in distribution to T defined in (16).

Consider a strategy which processes job 1 throughout the interval [0, T_1). It follows from lemma 3 and the definition of T_1 that

t ∈ [0, T_1) ⇒ H_1{x_1(t)} = max[H_1{x_1(t)}, max_{k∈F_1} H_k(x̄_k)]
   ≥ G_1{x_1(t)} ≥ H_1{x_1(t)} ≥ max[H_j{x_j(t)}, max_{k∈F_j} H_k(x̄_k)]
   ≥ G_j{x_j(t)} ≥ H_j{x_j(t)},   2 ≤ j ≤ |B(0)|.   (18)

It is a consequence of (18) that any strategy which processes job 1 throughout [0, T_1) is both quasi-myopic and optimal during that interval. This claim is also (trivially) true for any strategy in which this initial period of processing job 1 is followed by the processing of job 2 throughout [T_1, T_1 + T_2), then the processing of job 3 throughout [T_1 + T_2, T_1 + T_2 + T_3), and so on until all members of B(0) have been included. We may now deduce the following:

π is quasi-myopic ⇒ π is optimal throughout [0, S),

where S is identical in distribution to Σ_{i=1}^{|B(0)|} T_i. Refer to theorem 1. We may now choose ν_1(π) = S and

Δ{0, ν_1(π)} = 0.   (19)

Similar calculations to those summarised in (18) lead to the following conclusions:
(a) if quasi-myopic strategy π chooses job k at time t ∈ ℕ where H_k{x_k(t)} > H(x̄), then that choice is optimal;
(b) if quasi-myopic strategy π chooses job k at time t ∈ ℕ where H_k{x_k(t)} ≤ H(x̄), then no member of B(t) can have Gittins index greater than H(x̄), i.e.

max_{i∈B(t)} G_i{x_i(t)} ≤ H(x̄).   (20)

To use theorem 1, we now choose {ν_n(π), n ≥ 2} as follows:
(i) if at time ν_n(π), n ≥ 1, the system is empty, then set ν_{n+1}(π) = ∞;
(ii) if at time ν_n(π), n ≥ 1, the system is non-empty and π chooses optimally, then ν_{n+1}(π) is the first time after ν_n(π) at which the system is non-empty and π chooses sub-optimally;
(iii) if at time ν_n(π), n ≥ 1, the system is non-empty and π chooses k sub-optimally, then

ν_{n+1}(π) = inf_{t > ν_n(π)} {t : H_k{x_k(t)} ≤ H_k[x_k{ν_n(π)}]}.

In this case it is not difficult to show that

G_k[x_k{ν_n(π)}, π, ν_{n+1}(π) − ν_n(π)] = H_k[x_k{ν_n(π)}] ≥ inf_x H(x),   (21)

and hence from (a) and (b) above together with (9), (20) and (21) we deduce that in this case

Δ{ν_n(π), ν_{n+1}(π)} ≤ H(x̄) − inf_x H(x).   (22)

It now follows from theorem 1, (19) and (22) that

R(π*) − R(π) ≤ E_π[ Δ{0, ν_1(π)}(1 − a^{ν_1(π)}) + max_{n≥1} Δ{ν_n(π), ν_{n+1}(π)} a^{ν_1(π)} ]
   ≤ E_π[a^S]{H(x̄) − inf_x H(x)} = (E a^T)^{|B(0)|}{H(x̄) − inf_x H(x)},

as required.   □

Comment
It is certainly possible to improve the bound on the right-hand side of (17) by (for example) considering times of job completions more carefully. It is also possible to apply the ideas within the proof to more general models involving non-identical jobs. It is not easy, though, to obtain bounds which are both effective and easily interpretable.

For the same class of stochastic scheduling problems with out-forest precedence constraints we now undertake an analysis of non-preemptive strategies.

DEFINITION 5
A strategy is non-preemptive if it only switches between jobs at job completions.

Plainly for any problem involving identical jobs, all feasible non-preemptive strategies have the same total expected reward. In order to conduct the analysis, we need some notation. Consider the system under any strategy which processes one particular job from its initial state x̄ at time 0 through to completion. Denote by C the job's completion time and by R the expected reward earned by the job during [0, C).

THEOREM 3
If π is a non-preemptive strategy for this class of stochastic scheduling problems with out-forest precedence constraints and identical jobs then

R(π*) − R(π) ≤ {1 − (E a^C)^{|J|}}[H(x̄) − R(E a^C){1 − E a^C}^{−1}].

Proof
Choose the sequence {ν_n(π), n ≥ 1} as follows:

ν_n(π) = the time of the nth job completion under π,  1 ≤ n ≤ |J|,
ν_n(π) = ∞,  n > |J|.

It is trivial to show from (9) and lemma 3 that

Δ{ν_n(π), ν_{n+1}(π)} = H(x̄) − R(E a^C){1 − E a^C}^{−1},  0 ≤ n ≤ |J| − 1,
Δ{ν_n(π), ν_{n+1}(π)} = 0,  n ≥ |J|,

where the expression R(E a^C){1 − E a^C}^{−1} is the non-branching Gittins index of one of the identical jobs in its initial state under the non-preemptive strategy π. The result is now a simple consequence of theorem 1.   □

We conclude the section with an account of a numerical study which gives insight into the power of methods based on theorem 1 for the evaluation of strategies for stochastic scheduling problems with out-forest precedence constraints. In the problems considered by this study, each job j is simple, i.e.,
(a) the stochastic structure of j is that it has a random processing time X_j, a random variable taking values in the positive integers;
(b) a reward r_j a^t is received if job j completes at time t. No other rewards are earned by job j.

Following Glazebrook and Gittins [11], we have the following definition:

DEFINITION 6
Processing time X_j is said to be IERDR (increasing expected remaining discounted reward) if the function

m_j(x) ≜ E(a^{X_j − x} | X_j > x)

is increasing in x.

We shall assume that all the processing times are IERDR. It is easy to show that if Xj has an increasing completion (hazard) rate then it must be IERDR. This then includes the Binomial, Poisson and Negative Binomial families of processing time distributions.


Glazebrook and Gittins [11] were able to demonstrate the existence of a non-preemptive strategy which was (globally) optimal for any stochastic scheduling problem involving simple jobs with IERDR processing times and general precedence constraints. That fact notwithstanding, the problem of finding the optimal non-preemptive strategy can be computationally exceedingly complex. We shall use theorem 1 to evaluate quasi-myopic strategies. In this, we follow Morton and Dharan [12] who evaluated the performance of certain heuristics for a related problem involving linear costs with the objective of reducing computational complexity.

It is trivial to show that in this context, H_j(x_j), the non-branching Gittins index for job j which has received x_j units of processing but which has yet to complete, is given by

H_j(x_j) = r_j m_j(x_j){1 − m_j(x_j)}^{−1},  x_j ≥ 0.   (23)

It then follows from definition 6 that there exists a quasi-myopic strategy π which is non-preemptive. In passing we also note that in the calculation of Gittins indices according to lemma 1 the suprema over strategies and stopping times in (3) are always attained respectively by a non-preemptive strategy and by a stopping time which takes values in the set of times at which jobs complete.
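A minimal sketch of (23), computed directly from a processing-time distribution; the pmf and parameter values below are purely illustrative.

def m(pmf, a, x):
    """m_j(x) = E(a^(X_j - x) | X_j > x), for a pmf mapping processing times to probabilities."""
    tail = {t: p for t, p in pmf.items() if t > x}
    return sum(p * a ** (t - x) for t, p in tail.items()) / sum(tail.values())

def H(pmf, r, a, x):
    """H_j(x_j) = r_j m_j(x_j){1 - m_j(x_j)}^(-1), equation (23)."""
    mx = m(pmf, a, x)
    return r * mx / (1.0 - mx)

# Example: processing time uniform on {1, 2, 3}, discount a = 0.5, terminal reward r = 1.
pmf = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
print([round(m(pmf, 0.5, x), 4) for x in range(3)])       # non-decreasing, so the job is IERDR
print([round(H(pmf, 1.0, 0.5, x), 4) for x in range(3)])  # the index rises as processing proceeds

For IERDR processing times the printed index values are non-decreasing in x, which is why a quasi-myopic strategy can be taken to be non-preemptive.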

Denote by b(π) the value of the bound on the right-hand side of (10) when the sequence {ν_n(π), n ≥ 1} is chosen as follows:

ν_n(π) = the time of the nth completion under π,  1 ≤ n ≤ |J|,
ν_n(π) = ∞,  n > |J|.

This choice is permissible since π may be assumed non-preemptive.

Our study consists of an evaluation of R(π*) − R(π) and b(π) for 27,000 stochastic scheduling problems with out-forest precedence constraints and simple jobs with IERDR processing times. These problems come in 27 collections of 1000 problems, each collection corresponding to a choice of parameters M and a where M = 3, 4 or 5 and where a takes the values 0.1, 0.2, ..., 0.9. Further details are as follows:
(i) M is the number of components in the out-forest.
(ii) Each component is generated as a birth process with at most 3 generations, with a single individual (the root job) in generation 1. In the birth process, each parent gives rise to 0, 1 or 2 children (independently for different parents) with probabilities 0.02, 0.49 and 0.49 respectively. Hence, inter alia, each component has at most 7 jobs. All components are generated independently of one another (a sketch of this generation scheme follows the list).
(iii) All terminal rewards r_j are generated independently from a uniform distribution on the interval [0, 1].
(iv) The values m_j(0) are generated independently from a uniform distribution on the interval [0, a]. Please note that in the context envisaged here the strategies π*, π, the associated rewards R(π*), R(π) and the b(π) depend upon the processing times X_j only through m_j(0). Note also that a decreasing m_j(0) corresponds to increasing processing times.
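A sketch of the instance generator just described; the tree representation and function names are illustrative.

import random

def generate_component(max_generations=3):
    """One out-tree: each parent has 0, 1 or 2 children with probabilities 0.02, 0.49, 0.49."""
    tree, current, next_id = {0: []}, [0], 1
    for _ in range(max_generations - 1):
        new_generation = []
        for parent in current:
            n_children = random.choices([0, 1, 2], weights=[0.02, 0.49, 0.49])[0]
            for _ in range(n_children):
                tree[parent].append(next_id)
                tree[next_id] = []
                new_generation.append(next_id)
                next_id += 1
        current = new_generation
    return tree                            # job -> list of immediate successors

def generate_problem(M, a):
    """M independent components; r_j ~ U[0, 1]; m_j(0) ~ U[0, a]."""
    components = [generate_component() for _ in range(M)]
    rewards = [{j: random.uniform(0.0, 1.0) for j in comp} for comp in components]
    m0 = [{j: random.uniform(0.0, a) for j in comp} for comp in components]
    return components, rewards, m0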

Table 1
An evaluation of quasi-myopic strategy π for some stochastic scheduling problems with out-forest precedence constraints

                        a = 0.1    0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9

10^3 {R(π*) − R(π)}{R(π)}^{-1}
LQ   M = 3              0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.001   0.001
     M = 4              0.000   0.000   0.000   0.000   0.000   0.002   0.038   0.091   0.235
     M = 5              0.000   0.000   0.000   0.000   0.001   0.012   0.192   0.349   1.375
ME   M = 3              0.000   0.000   0.000   0.001   0.018   0.130   0.836   2.718   3.523
     M = 4              0.000   0.000   0.000   0.009   0.077   0.653   2.475   4.137   8.221
     M = 5              0.000   0.000   0.000   0.028   0.169   0.833   3.551   7.565  14.993
UQ   M = 3              0.000   0.000   0.041   0.852   3.452   9.817  17.597  33.628  43.307
     M = 4              0.000   0.001   0.105   0.112   3.711  13.189  25.967  35.685  50.719
     M = 5              0.000   0.003   0.155   1.504   4.408  10.302  28.983  42.000  59.650

10^2 [b(π) − {R(π*) − R(π)}]{R(π)}^{-1}
LQ   M = 3              0.003   0.020   0.089   0.226   0.424   1.056   1.562   2.485   3.939
     M = 4              0.001   0.015   0.064   0.148   0.332   0.646   1.304   2.123   3.556
     M = 5              0.000   0.008   0.030   0.114   0.235   0.542   0.140   1.740   3.448
ME   M = 3              0.103   0.370   0.765   1.500   2.268   4.386   5.689   7.488  11.284
     M = 4              0.047   0.255   0.638   1.083   1.879   2.783   4.193   6.422   8.674
     M = 5              0.019   0.126   0.341   0.731   1.471   2.100   4.189   5.264   8.738
UQ   M = 3              1.618   2.871   4.599   6.735   7.775  12.250  14.459  16.653  21.897
     M = 4              0.711   1.484   2.797   5.103   6.639   8.308  10.350  13.935  18.186
     M = 5              0.336   1.186   1.957   3.375   4.598   6.043   9.964  12.090  18.286

Hence for each pair (M, a) we have 1000 independently generated problems. In table 1 find summary statistics for the distributions of {R(π*) − R(π)}{R(π)}^{−1} and [b(π) − {R(π*) − R(π)}]{R(π)}^{−1}. Because of the highly skewed nature of these distributions (lots of zeroes) the distributions have been summarised by the order statistics LQ (lower quartile), ME (median) and UQ (upper quartile).

Comment
The results clearly indicate that as a decreases the performance of quasi-myopic strategy π improves. This is not surprising since the larger individual processing times are, the less impact jobs other than those currently available should have on current preferences. It should be noted, though, that π seems to perform well almost always. Even when a = 0.9, the proportion lost ({R(π*) − R(π)}{R(π)}^{−1}) only exceeds 0.06 in about 25% of cases with five components. The performance of π deteriorates as M increases from 3 to 5, i.e. as the number of jobs in individual problems increases. The performance of the bound as measured by [b(π) − {R(π*) − R(π)}]{R(π)}^{−1} improves with increasing M.

Lastly note that if we drop the condition of IERDR processing times the above still yields an evaluation of π among non-preemptive strategies.

4. Sensitivity analysis - assessing the need to model an arrivals process

The quasi-myopic strategies described in section 3 may be thought of as strategies constructed on the assumption that no branching takes place - indeed they are optimal when that is the case. Should this class of strategies ever perform well then plainly their simplicity makes them very attractive. Computational evidence cited by Glazebrook and Fay [13] suggests that quasi-myopic strategies perform well when new bandit processes/jobs arrive over time according to a collection of independent Bernoulli streams. Hence (in the sense described above) in this case there seems little practical need to model the process of arrivals carefully. A study of these issues was conducted by Fay and Glazebrook [14]. We take the opportunity which theorem 1 affords to improve and extend evaluation procedures used in their earlier study.

For simplicity of discussion we shall specialise to the following model:

(a) Individual (non-branching) bandit processes will be simple jobs. As in section 3 this means that each job j has a random processing requirement X_j, a positive integer-valued random variable with completion rate function

ρ_j(t) ≜ P(X_j = t + 1 | X_j > t),  t ∈ ℕ,

and that only terminal rewards r_j are earned.


(b) Collections of bandit processes. As before, C_j is an infinite set of independent copies of job j. New jobs arrive from sets C_1, C_2, ..., C_n according to independent Bernoulli streams with associated probabilities p_1, p_2, ..., p_n.

It is plain from the general model described in section 2 that any member of this class of stochastic scheduling problems with arrivals is a family of branching bandit processes. Hence, as before, optimal strategies are determined by collections of Gittins indices.

It has been known for some time (see, for example, Whittle [3]) that in the case where all completion rates are monotone non-increasing, quasi-myopic strategies are optimal. The case with all completion rates monotone non-decreasing is arguably more important since it takes in most of the standard processing time distributions (see section 3). In this case, Fay and Glazebrook [14] were able to show that quasi-myopic strategies are optimal when n = 2 but not generally when n ≥ 3. Hence we specialise to this case.

(c) Monotone non-decreasing completion rates. Each processing requirement X_j is such that ρ_j(·) is monotone non-decreasing; a small computational check of this condition is sketched below.
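A minimal check of the completion-rate condition in (a) and (c) for a given processing-time distribution; the pmf representation is illustrative.

def completion_rates(pmf):
    """rho_j(t) = P(X_j = t+1 | X_j > t) for t = 0, 1, ..., max(X_j) - 1."""
    rates = []
    for t in range(max(pmf)):
        tail = sum(p for x, p in pmf.items() if x > t)    # P(X_j > t)
        rates.append(pmf.get(t + 1, 0.0) / tail)
    return rates

def is_non_decreasing(seq):
    return all(x <= y for x, y in zip(seq, seq[1:]))

# Example: the uniform distribution on {1, ..., 5} has rates 1/5, 1/4, 1/3, 1/2, 1.
print(completion_rates({t: 0.2 for t in range(1, 6)}))
print(is_non_decreasing(completion_rates({t: 0.2 for t in range(1, 6)})))   # True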

Note that here the non-branching Gittins indices H_j(x_j) are given by expression (23). Suppose that the classes C_j are ordered as follows:

H_1(0) ≥ H_2(0) ≥ ... ≥ H_n(0).

A result due to Fay and Glazebrook [14] is an analogue of lemma 3 in this context and states that bounds may be put on Gittins indices as follows:

H_j(x) ≤ G_j(x) ≤ max{H_j(x), H_1(0)}.   (24)

It is then a simple consequence of (24) that quasi-myopic strategies may be assumed to make optimal choices whenever a member of class C_1 is present.

In order to use theorem 1 we need to introduce an appropriate sequence {ν_n(π), n ≥ 0} where π denotes a quasi-myopic strategy. In this context we make this choice as follows:
(i) ν_0(π) = 0;
(ii) if at time ν_n(π) either B{ν_n(π)} = ∅ or B{ν_n(π)} ≠ ∅ and π chooses optimally then ν_{n+1}(π) is the first time after ν_n(π) at which the system is non-empty and π chooses sub-optimally;
(iii) if at time ν_n(π) the system is non-empty and π chooses sub-optimally then ν_{n+1}(π) is the first time after ν_n(π) at which either an arrival occurs or π switches to a different job.

To describe an appropriate discrepancy measure, we expand the notations G_j(x_j), H_j(x_j) to G_j(x_j, a), H_j(x_j, a) and thereby register the dependence of these indices upon the discount rate a. In what follows x denotes an arbitrary state for the process. Define discrepancy measures δ(x), η(x) as follows:

π*(x) = j ⇒ δ(x) = G_j(x_j, a) − H_j(x_j, a),   (25)

π(x) = j ⇒ η(x) = H_j(x_j, a) − {1 − a(1 − p)}(1 − a)^{−1} H_j{x_j, a(1 − p)},   (26)

where

p = 1 − Π_{j=1}^{n} (1 − p_j)   (27)

is the global arrival probability. Functions δ and η may be thought of as measures of the impact of the arrivals process on jobs chosen by π* and π respectively.
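The quantities (26) and (27) are readily computed for simple jobs; a minimal sketch, in which `H` is assumed to return the non-branching index of (23) as a function of the state and of the discount rate:

def global_arrival_probability(p_classes):
    """p = 1 - prod_j (1 - p_j), equation (27)."""
    q = 1.0
    for pj in p_classes:
        q *= 1.0 - pj
    return 1.0 - q

def eta(H, x, a, p):
    """eta(x) for the job chosen by pi, equation (26)."""
    a_mod = a * (1.0 - p)                  # the modified discount rate a(1 - p)
    return H(x, a) - (1.0 - a_mod) / (1.0 - a) * H(x, a_mod)

The measure δ in (25) additionally requires the branching index G_j and is therefore typically bounded, rather than computed exactly, by means of (24).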

THEOREM 4
If π is a quasi-myopic strategy for this class of stochastic scheduling problems with arrivals then

R(π*) − R(π) ≤ E_π[ Σ_{n≥0} (δ[x{ν_n(π)}] + η[x{ν_n(π)}]) I(π*[x{ν_n(π)}] ≠ π[x{ν_n(π)}]) (a^{ν_n(π)} − a^{ν_{n+1}(π)}) ],   (28)

where I(A) denotes the indicator function for set A and E_π an expectation taken over all realisations of the process under strategy π.

Proof
Note initially that the sequence {ν_n(π), n ≥ 0} conforms to the requirements of theorem 1. We now consider two cases:
(a) π*[x{ν_n(π)}] = π[x{ν_n(π)}]. In this case it follows trivially from (8) that

I(π*[x{ν_n(π)}] ≠ π[x{ν_n(π)}]) = 0 = Δ{ν_n(π), ν_{n+1}(π)}.   (29)

(b) π*[x{ν_n(π)}] = j ≠ k = π[x{ν_n(π)}]. In this case it follows trivially from (9) that

Δ{ν_n(π), ν_{n+1}(π)} = G_j[x_j{ν_n(π)}, a] − G_k[x_k{ν_n(π)}, π, ν_{n+1}(π) − ν_n(π)].   (30)

However, some simple calculations together with the characterisation of the sequence {ν_n(π), n ≥ 0} yield

G_k[x_k{ν_n(π)}, π, ν_{n+1}(π) − ν_n(π)] = {1 − a(1 − p)}(1 − a)^{−1} H_k[x_k{ν_n(π)}, a(1 − p)].   (31)

Now the characterisation of quasi-myopic strategies given in definition 4 implies that

H_k[x_k{ν_n(π)}, a] ≥ H_j[x_j{ν_n(π)}, a],   (32)

and it then follows from (30)-(32) that

Δ{ν_n(π), ν_{n+1}(π)} ≤ G_j[x_j{ν_n(π)}, a] − H_j[x_j{ν_n(π)}, a] + H_k[x_k{ν_n(π)}, a] − {1 − a(1 − p)}(1 − a)^{−1} H_k[x_k{ν_n(π)}, a(1 − p)]
   = δ[x{ν_n(π)}] + η[x{ν_n(π)}].   (33)

Theorem 4 now follows from theorem 1, together with (29) and (33).   □

Comments
(1) It is a trivial consequence of theorem 4 that R(π*) − R(π) will converge to zero for a suitably defined sequence of problems in which either p → 0 or p_1 → 1. To see the former, note that from (26) the discrepancy measure η goes to zero as p → 0. Further, if p → 0, we must have p_j → 0 and for this case G_j and H_j coincide. Hence the discrepancy measure δ also goes to zero as p → 0. To see the latter, recall that the quasi-myopic π chooses optimally whenever a member of C_1 is present. If we require that a member of C_1 is present initially, then, as p_1 → 1, the first occasion upon which π* and π differ tends to ∞ with probability one. The conclusion then follows simply from (28). Hence the conclusion is that little is lost by ignoring the process of arrivals either when arrivals are sparse or when arrivals of good jobs are frequent.
(2) Inequalities (24) hold generally, not only for the non-decreasing completion rate case considered here. In fact the section describes an approach to the assessment of the importance of modelling the arrivals process which has quite general validity.

We conclude the section with a small numerical study. It is very difficult to make a study of the kind reported at the end of section 3 meaningful in the context of this section since quasi-myopic strategies are optimal so much of the time. We report just six examples of the simple jobs model with increasing completion rates incorporating Bernoulli arrivals. In each example there are four classes of jobs, equal arrival probabilities for all classes and all processing times are no larger than 5 with probability one. The examples embody a variety of assumptions with regard to terminal rewards and discount rates. For instance, in problem number 3 we have

(r_1, r_2, r_3, r_4) = (7.5, 8.5, 5.0, 6.0),  a = 0.9.


Table 2
An evaluation of quasi-myopic strategy π for six problems with four job classes and common arrival probabilities

Arrival        Problem no.
probability    1             2             3             4             5             6
0.05           0.00 (0.00)   0.00 (0.02)   0.00 (0.40)   0.00 (0.80)   0.00 (0.97)   0.00 (0.00)
0.10           0.00 (0.00)   0.87 (4.19)   1.75 (1.19)   0.15 (1.32)   0.00 (1.67)   0.00 (0.00)
0.25           0.08 (0.05)   0.76 (0.77)   0.03 (0.09)   4.54 (5.85)   0.01 (0.01)   3.10 (3.20)
0.40           0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.04)   0.00 (0.00)
0.70           0.00 (0.05)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
1.00           0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)   0.00 (0.00)

This problem has been included (along with the other five) because for it, quasi-myopic strategy π fails to be optimal for some values of the common arrival probability. For the large majority of problems studied (but not reported) there were no such differences found.

For each of the six problems studied, the upper bound on R(π*) − R(π) given in theorem 4 has been estimated from 5000 Monte-Carlo simulations and then expressed as a percentage of R(π). This has been done for a range of values of the common arrival probability. In all cases it is assumed that at time 0 one member of each class is present in its initial state. The results are reported in table 2. The figures in brackets are the equivalent quantities derived from an earlier bound based on work by Glazebrook [6].
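A minimal sketch of the estimation procedure just described. It assumes that problem-specific simulation code (not shown, and with illustrative names) returns, for one realisation under the quasi-myopic strategy π, the intervals [ν_n, ν_{n+1}) together with the corresponding values of δ, η and the indicator in (28), and also the discounted reward earned under π.

def estimate_percentage_bound(simulate_intervals, simulate_reward, a, n_runs=5000):
    """Estimate the right-hand side of (28), expressed as a percentage of R(pi)."""
    bound_total, reward_total = 0.0, 0.0
    for _ in range(n_runs):
        for nu, nu_next, delta_val, eta_val, strategies_differ in simulate_intervals():
            if strategies_differ:                   # the indicator I(pi*[x] != pi[x])
                bound_total += (delta_val + eta_val) * (a ** nu - a ** nu_next)
        reward_total += simulate_reward()           # one realisation of the reward under pi
    return 100.0 * bound_total / reward_total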

Comment
The values in the table are consistent with comment (1) above which implies zero in the limit as the common arrival probability approaches zero or one for all problems. Although it would be unwise to make too much of the modest evidence in table 2, it would seem that quasi-myopic strategies lose most for intermediate values of a common arrival probability.

It seems, too, that the bound proposed here usually performs better than the earlier one. It is actually possible to show that it has improved properties in the limit as a tends to 1, but the evidence of table 2 is that enhanced performance is not confined to problems with large discount rates.

References

[1] K.D. Glazebrook, Procedures for the evaluation of strategies for resource allocation in a stochastic environment, J. Appl. Prob. (1990), to appear.
[2] J.C. Gittins, Bandit processes and dynamic allocation indices, J. Roy. Statist. Soc. B41 (1979) 148 (with discussion).
[3] P. Whittle, Arm-acquiring bandits, Ann. Prob. 9 (1981) 284.
[4] P. Varaiya, J. Walrand and C. Buyukkoc, Extensions of the multi-armed bandit problem: the discounted case, IEEE Trans. Aut. Control AC-30 (1985) 425.
[5] G. Weiss, Branching bandit processes, Technical Report, Georgia Institute of Technology.
[6] K.D. Glazebrook, On the evaluation of suboptimal strategies for families of alternative bandit processes, J. Appl. Prob. 19 (1982) 716.
[7] K.D. Glazebrook, Sensitivity analysis for stochastic scheduling problems, Math. Oper. Res. 12 (1987) 205.
[8] S.M. Ross, Applied Probability Models with Optimisation Applications (Holden-Day, 1970).
[9] J.C. Gittins and K.D. Glazebrook, On Bayesian models in stochastic scheduling, J. Appl. Prob. 14 (1977) 556.
[10] K.D. Glazebrook, Stochastic scheduling with order constraints, Int. J. Sys. Sci. 7 (1976) 657.
[11] K.D. Glazebrook and J.C. Gittins, On single machine scheduling with precedence relations and linear or discounted costs, Oper. Res. 29 (1981) 161.
[12] T.E. Morton and B.C.T. Dharan, Algoristics for single-machine sequencing with precedence constraints, Mgmt. Sci. 24 (1978) 1011.
[13] K.D. Glazebrook and N.A. Fay, Evaluating strategies for Markov decision processes in parallel, Math. Oper. Res. (1990), to appear.
[14] N.A. Fay and K.D. Glazebrook, Stochastic scheduling - do we need to model arrivals?, submitted.