INTERNATIONAL JOURNAL OF ROBUST AND NONLINEAR CONTROL
Int. J. Robust Nonlinear Control 2007; 17:1214–1231
Published online 11 January 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/rnc.1164

A Q-Learning-based method applied to stochastic resource constrained project scheduling with new project arrivals

Jaein Choi, Matthew J. Realff and Jay H. Lee*,†

School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA 30332, U.S.A.

SUMMARY

In many resource-constrained project scheduling problems (RCPSP), the set of candidate projects is not fixed a priori but evolves with time. For example, while performing an initial set of projects according to a certain decision policy, a new promising project can emerge. To make an appropriate resource allocation decision for such a problem, project cancellation and resource idling decisions should complement the conventional scheduling decisions. In this study, the problem of stochastic RCPSP (sRCPSP) with dynamic project arrivals is addressed with the added flexibility of project cancellation and resource idling. To solve the problem, a Q-Learning-based approach is adopted. To use the approach, the problem is formulated as a Markov Decision Process with appropriate definitions of the state, including an information state, and the action variables. The Q-Learning approach enables us to derive empirical state transition rules from simulation data, so that analytical calculation of potentially exorbitantly complicated state transition rules can be circumvented. To maximize the advantage of using the empirically learned state transition rules, special types of actions, including project cancellation and resource idling, which are difficult to incorporate into heuristics, were randomly added in the simulation. The random actions are filtered during the Q-Value iteration and properly utilized in the online decision making to maximize the total expected reward.

Received 11 September 2006; Accepted 11 November 2006

KEY WORDS: approximate dynamic programming; Q-Learning; stochastic project scheduling

INTRODUCTION

A large number of candidate products in the agricultural and pharmaceutical industries must undergo a series of tests related to safety, efficacy and environmental impact in order to obtain certification. Depending on the nature of the product, the tests may last up to 15 years. Given the limited resources and the highly competitive market environment, product development projects should be selected and managed with the goal of minimizing the time to market and the cost of the testing. The problem of selecting projects and scheduling the associated tasks can be considered as a generalization of the well-known job shop scheduling problem. The case in which all the problem data have known values belongs to the NP-hard class of combinatorial problems [1]. In general, task outcome (i.e. success or failure) is uncertain and project reward varies with time, adding complexity to the scheduling problem. As soon as a product fails a required test, all the remaining work on that product is likely to be halted and the prior investment in the project would be lost. In a specialized R&D pipeline management problem, the time value of project reward decreases as the time to introduce the product to the market increases, due to incoming competitive products and fixed patent periods. Hence, a company has to manage its various resources, including capital, manpower, lab space and pilot facilities, in order to ensure the best return on its new product pipeline. Besides the uncertainty about the outcome of a task, there can be several additional uncertain parameters in real problems, such as the task duration and resource requirement.

*Correspondence to: Jay H. Lee, School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA 30332, U.S.A.
†E-mail: [email protected]

Contract/grant sponsor: National Science Foundation; contract/grant number: CTS-#0301993

Various abstract versions of the problem have been studied previously and there has been significant progress in solution methods [2–9] for the problem with resource constraints as well as with uncertain task outcome. In reality, however, the set of candidate projects is dynamic. For example, while performing an initial set of projects according to a certain decision policy, a promising new project can emerge. To make appropriate resource allocation decisions in such cases, project cancellation decisions [8] must complement the conventional scheduling decisions.

In this study, a stochastic RCPSP (sRCPSP) with dynamic project arrivals is addressed with the added flexibility of project cancellation and resource idling. The proposed solution strategy is based on the simulation-based approximate dynamic programming (DP) approach, which we have previously developed and applied to small sRCPSPs [10, 11]. From an algorithmic standpoint, the previous approach is modified to handle an extended problem structure that includes dynamic project arrivals. The previous strategy has limitations in handling complicated sRCPSPs: the analytical calculation of all the possible state transition probabilities is not practically feasible for a large-size problem due to complex interactions among states, actions and uncertain parameters. This bottleneck is overcome by developing an appropriate Q-Learning algorithm [12–14], which avoids explicit evaluation of the transition rules in the learning phase.

The Q-Learning approach can be viewed as simultaneous identification of the probabilistic state transition rule and the cost-to-go function. Hence, it removes the need to evaluate the transition probabilities analytically, which can be painstakingly tedious. Stochastic simulation under a set of known heuristics is used here to obtain a set of relevant state–action pairs, for which the value (Q-Value) iteration is performed.

Next, we present a mathematical formulation of the sRCPSP with new project arrivals and its decision problem structure. Then, a generalized Q-Learning algorithm for the problem is discussed with appropriate definitions of the state, action, state transition rules, and objective function. Finally, the proposed approach is verified by solving an sRCPSP with dynamic project arrivals under billions of possible scenarios.

PROBLEM DESCRIPTION: STOCHASTIC RCPSP WITH NEW PROJECT ARRIVALS

We consider an RCPSP with M projects, each of which consists of m_i tasks, for i = 1, ..., M. There are N resources (laboratories), and a specific resource type among them may have to be used to perform each task. On top of the basic structure of the RCPSP, there are L potential projects that can emerge on a random basis while performing tasks of the initial M projects. The 'new project arrival' changes the decision structure of the problem dramatically because of new types of decisions, such as cancellation of an on-going project or idling of an available resource for future usage. These decisions can be made upon arrival of a new project in order to improve the overall profit. Arrivals of the L potential projects are governed by distribution functions for their arrival times and their realization probabilities.

The uncertain parameters of each task are the result (success or failure), the duration and the cost. The uncertainty is modelled by a discrete-time Markov chain designed to represent correlations among the uncertain parameters. A brief description of the Markov chain model is given in 'Uncertain Parameter Modelling: Markov Chain and Conditional Probability' and a more detailed description is available in our previous paper [10]. A time-varying reward function over a discrete time index k is given for each project to represent the decreasing value of the project with time. The reward function (Equation (1)) is characterized by three parameters: the 'initial value', R_0; the 'stiffness parameter', λ; and the 'project deadline indicator', PD. β is the 'final value', which takes the value of R_0 - e^{λ PD}:

R(0) = R_0 \quad \text{at } k = 0
R(k) = R_0 - e^{\lambda k} \quad \text{for } 1 \le k \le PD
R(k) = \beta \quad \text{for } k > PD \qquad (1)

Figure 1 shows the reward function with R_0 = 5000, λ = 0.235 and PD = 34. For the initial projects, the reward function starts at the initial time, k = 0; for the potential projects, the reward function is defined in the same manner but starts at the time of project arrival.

Figure 1. Decreasing reward function.
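For concreteness, the following minimal Python sketch evaluates the reward profile of Equation (1) for the parameter set used in Figure 1 (R_0 = 5000, λ = 0.235, PD = 34); the function name and structure are illustrative and not part of the original formulation.

import math

def project_reward(k, R0=5000.0, lam=0.235, PD=34):
    """Time-varying project reward of Equation (1):
    R(0) = R0, R(k) = R0 - exp(lam*k) for 1 <= k <= PD,
    and R(k) = beta = R0 - exp(lam*PD) for k > PD."""
    if k <= 0:
        return R0
    return R0 - math.exp(lam * min(k, PD))

# Example: the reward early on, at the project deadline, and past it (the final value beta).
print(project_reward(1), project_reward(34), project_reward(50))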


Uncertain parameter modelling: Markov chain and conditional probability

The probabilistic correlations among the uncertain parameters are modelled with discrete-time Markov chains. The nth task of a project i has r_{ni} realizations, and each realization consists of the values of the result, duration, and cost of the task from a discrete set, as shown in Figure 2. For example, 'F, D_{11i}, C_{11i}' (the first candidate realization of task 1 in Figure 2) represents a failed outcome with a duration of D_{11i} and a cost of C_{11i}. The discrete values for the parameters may represent the actual values or, more likely, the mean values of the random parameter outcomes. Furthermore, to represent the quality of the task result, multiple success levels may be introduced. For example, the result of a task can be categorized as 'failure (F)', 'moderate success (S1)' and 'high success (S2)', as shown in Figure 2. In the case of 'high success', the probability of success in the next task can be made significantly higher by specifying the underlying Markov state transition probabilities accordingly. An explicit representation of the probabilistic correlations of the uncertain parameters in a project with three tasks is shown in Figure 2. In the figure, PI_i, PM_{1i} and PM_{2i} are the transition probabilities. Notice that the elements in all the columns except for the first one (corresponding to a project failure) add to one.
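To make this conditional structure concrete, the following Python sketch (illustrative, not the authors' code) samples one task realization given the realized outcome of the preceding task from a column-wise conditional probability matrix of the kind shown in Figure 2 and Table I; the matrix values below are in the style of PM21 of project 1.

import random

# Column j holds the distribution over the next task's realizations, conditioned on
# realization j of the preceding task; column 0 (prior failure) is all zeros because
# a failed project spawns no further task.
PM21 = [
    [0.0, 0.20, 0.08],
    [0.0, 0.30, 0.02],
    [0.0, 0.50, 0.90],
]

def sample_next_realization(P, prev_realization, rng=random):
    """Return the index of the next task's realization, or None if the project ends."""
    column = [row[prev_realization] for row in P]
    if sum(column) == 0.0:
        return None                      # previous task failed: no next realization
    r, acc = rng.random(), 0.0
    for idx, p in enumerate(column):
        acc += p
        if r < acc:
            return idx
    return len(column) - 1

print(sample_next_realization(PM21, prev_realization=2))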

Q-LEARNING FOR THE STOCHASTIC RCPSP

In our previous work [10], we gave a prototypical formulation of stochastic DP for the RCPSP with uncertainty. Furthermore, we developed an appropriate algorithmic framework designed to circumvent the 'curse of dimensionality' associated with the conventional DP solution procedure. However, application of the algorithmic framework to larger-size RCPSPs, for example, ones with a more complicated structure, is limited by the heavy computational load associated with the evaluation of state transition rules, in which all the possible next states and their conditional probabilities of realization are calculated.

Figure 2. An example of uncertain parameter modelling for a project with three tasks.


From a programming perspective, analytical state transition rules are awkward to use, since the state is represented as an integer indicating a certain 'status' of a project and the transition rules as many logical constraints. Especially with the possibility of new project arrivals, the analytical state transition becomes much more complicated due to the various types of decisions that can be made. We are therefore motivated to develop a more powerful solution method based on the Q-Learning approach, which is a model-free version of the simulation-based approximate DP we used in our previous work. The overall procedure for applying the Q-Learning algorithm to the given problem is similar to our previous work [10] and is shown in Figure 3.

The objective of the simulation with heuristic policies is to explore the system under a large number of realizations and identify a relevant set of state–action pairs. As a result of the simulation, an initial Q-Value table is obtained as a function of the state–action pair. At the beginning of the simulation, the state–action table is an empty set; new state–action pairs are added as the simulation goes on. The heuristic simulation can also be viewed as an empirical model building process because it extends the model to new state–action pairs and refines the current Q-Values in the table for revisited state–action pairs.

Since heuristic policies are applied in the simulation, the initial Q-Value table is not optimal. A generalized Q-Value iteration equation is shown below:

Q(x(k), u(k)) = (1 - \gamma)\, Q(x(k), u(k)) + \gamma \left\{ g(x(k), x(k+1), u(k)) + \alpha \max_{u(k+1) \in U_{x(k+1)}} E\left[ Q(x(k+1), u(k+1)) \right] \right\} \qquad (2)

In the above, x is the state variable, u is the decision variable and U_{x(k+1)} denotes the set of all candidate decisions for a given state x(k+1). α is a discount factor between 0 and 1, typically chosen close to 1. The 'forgetting factor' γ represents the relative weight of 'previous information' (previously recorded Q-Values) versus 'new information' (the current Q-Value estimate) used in updating the Q-Value while exploring the system. In this work, however, we propose to set γ = 1, because simultaneous exploration and updating of the Q-Value is not all that meaningful for the given problem due to the significant stochastic complexity. In other words, no Q-Value is reliable until a significant amount of simulation has been performed, and that is why we propose to perform the Q-Value iteration after the completion of the simulation, as shown in Figure 3. The initial Q-Values are then iterated over the state–action pairs collected in the simulation stage until the Q-Values eventually converge. With γ = 1, the iteration equation becomes

Q(x(k), u(k)) = E\left[ g(x(k), u(k)) + \alpha \max_{u(k+1) \in U_{x(k+1)}} Q(x(k+1), u(k+1)) \right] \qquad (3)

Figure 3. Q-Learning approach.


The converged Q-Value table can be utilized for decision-making in the following manner:

u^{*}(k) = \arg\max_{u(k) \in U_{x(k)}} Q(x(k), u(k)) \qquad (4)
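As an illustration, Equation (4) amounts to a table lookup followed by a maximization. A minimal Python sketch, assuming the converged Q-Values are stored in a dictionary keyed by (state, action) pairs (a data layout assumed here for illustration), is:

def online_decision(Q, state, candidate_actions):
    """Return the action u*(k) that maximizes the learned Q-Value at the current state."""
    return max(candidate_actions, key=lambda u: Q.get((state, u), float("-inf")))

Unvisited state–action pairs are given a value of minus infinity, so only actions recorded during the simulation are considered.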

The major difference between the stochastic DP-based approach and the Q-Learning approach is that 'state–action pairs', instead of just 'states', are recorded during the stochastic simulation stage with heuristic policies. Any state visited in the simulation is recorded together with the action taken at that state, as well as the resulting next state and its transition frequency. Since the simulation is performed over many realizations, the transition frequency from one state to another as a result of an action approximates the conditional probability of the corresponding state transition. Different state transitions from the same state under the same action are due to the stochastic nature governed by the underlying Markov chains. Thus, in the Q-Learning approach, the objective of the simulation is not only to obtain relevant states (or state–action pairs) but also to explore the system and identify empirical state transition rules.

Another major difference between the previous algorithmic framework and the Q-Learning approach is the objective function. In the Q-Learning approach, the cost-to-go value is a function of the state and action rather than just the state. As a result, the Q-Learning approach requires more memory and computation in its iteration stage than the previous DP-based approach, because the iteration has to be done over every state–action pair instead of every state. However, this apparent computational drawback is more than offset by the empirical state transition rule embedded into the Q function, which avoids the costly analytical evaluation of all next states. Figure 4 shows a conceptual diagram of the state–action pair and the Q-Value representation.

The state transition probabilities, P_1, P_2, ..., P_n, in Figure 4 are empirical conditional probabilities obtained via simulation. Suppose that, for a state x(k), an action u(k) was taken N times in the simulation. Also suppose that there were n different next states from this state–action pair (denoted as x_1(k+1), ..., x_n(k+1)) and that the transition frequency from state x(k) to state x_i(k+1) is N_i for i = 1, 2, ..., n. From the definitions of N and N_i, it follows that \sum_i N_i = N and P_i = N_i / N.
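The following Python sketch (an illustration under assumed data structures, not the authors' implementation) ties these pieces together: the empirical transition counts N_i collected during the heuristic simulation are normalized to P_i = N_i/N and used as the expectation weights in the γ = 1 iteration of Equation (3).

from collections import defaultdict

def q_value_iteration(counts, profits, actions_of, alpha=0.99, tol=1e-4, max_iters=200):
    """Batch Q-Value iteration over the empirically visited state-action pairs.

    counts[(x, u)][x_next] = N_i, number of observed transitions (x, u) -> x_next;
    profits[(x, u)]        = average one-stage profit g(x, u) observed in simulation;
    actions_of[x]          = list of actions recorded for state x (empty if terminal).
    """
    Q = defaultdict(float)
    for _ in range(max_iters):
        max_rel_change = 0.0
        for (x, u), next_counts in counts.items():
            N = sum(next_counts.values())
            new_q = profits[(x, u)]
            for x_next, N_i in next_counts.items():
                next_actions = actions_of.get(x_next, [])
                if next_actions:                       # terminal states contribute nothing
                    best_next = max(Q[(x_next, u_next)] for u_next in next_actions)
                    new_q += alpha * (N_i / N) * best_next
            old_q = Q[(x, u)]
            max_rel_change = max(max_rel_change,
                                 abs(new_q - old_q) / max(abs(old_q), 1e-12))
            Q[(x, u)] = new_q
        if max_rel_change < tol:
            break
    return Q

The discount factor and the stopping tolerance above are assumed values chosen for the sketch; the paper states only that α is chosen close to 1.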

Figure 4. State–action pair and Q-Value.


Given the conceptual similarity between the stochastic DP and the Q-Learning approach, one can expect that the mathematical formulation of the Q-Learning approach will also closely resemble that of the stochastic DP-based approach. Detailed definitions of the state, action, and state transition rules used for the Q-Learning algorithm are discussed in the following sections. It should be noted that all the following definitions are direct extensions of the stochastic DP formulation presented in our previous work [10].

Definition of state

In defining the state of a system, it is important to adopt as parsimonious a representation as possible, because any redundancy will increase the computational complexity. For an RCPSP with M projects, L types of resources (laboratories), and R additional potential projects that may emerge in the future, the state can be defined as follows:

X = [s_1, s_2, \ldots, s_M, s_{M+1}, \ldots, s_{M+R}, z_1, z_2, \ldots, z_M, z_{M+1}, \ldots, z_{M+R}, L_1, L_2, \ldots, L_L, a_1, a_2, \ldots, a_R, k]^{\mathrm{T}} \qquad (5)

In (5), s_i for i = 1, 2, ..., M+R represents the current status of project i, containing the information on which tasks are finished and which task is on-going for project i. Because each project consists of a finite number of tasks, s_i can be represented as an integer variable. For example, there are seven possible statuses (the circled numbers) for a project with three tasks, as illustrated in Figure 5. z_i for i = 1, 2, ..., M+R represents the information state of project i, which indicates the result of the most recent task in the project. As explained in the problem description, the uncertain parameter values (i.e. the duration, cost and result) of each task are realized according to the conditional probabilities given by the corresponding Markov chain. Once a task in project i is completed, z_i is updated according to the realized result of the task. z_i is an integer variable ranging from 1 to r_{ni}, where r_{ni} is the number of possible realizations of the nth task, the task most recently completed for project i. The third set of state variables, L_j for j = 1, 2, ..., L, represents the time for which resource j has been used for its on-going task; L_j = 0 indicates that the resource is idle. The next set of state variables, a_r for r = 1, 2, ..., R, represents the realized arrival time of potential project r. Finally, the time k is added as a state variable in order to account for the time-varying reward of each project.

Actions

With the state defined as in Equation (5), the action, U, can be defined as

U = [d_1, d_2, \ldots, d_M]^{\mathrm{T}} \qquad (6)

d_i is an integer variable that indicates whether to perform a task (d_i = 1), not to perform a task (d_i = 0), or to cancel a task (d_i = 2) of project i. A decision can be made only when the necessary resource for the task is available, that is, when L_j = 0 for some j = 1, 2, ..., L.

Figure 5. Possible statuses of a project with three tasks.
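A minimal Python sketch of how the state of Equation (5) and the action vector of Equation (6) might be encoded; the class and field names, and the feasibility check, are assumptions made for illustration rather than the authors' implementation.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class State:
    s: Tuple[int, ...]   # status of each initial and potential project
    z: Tuple[int, ...]   # information state: index of the most recently realized outcome
    L: Tuple[int, ...]   # time each resource has been used on its on-going task (0 = idle)
    a: Tuple[int, ...]   # realized arrival time of each potential project
    k: int               # current discrete time

# Action of Equation (6): one entry per project,
# 1 = perform the next task, 0 = do not perform, 2 = cancel.
Action = Tuple[int, ...]

def has_idle_resource(state: State) -> bool:
    """Scheduling decisions are only possible when some resource is idle (L_j = 0)."""
    return any(Lj == 0 for Lj in state.L)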


State transition rules

In the Q-Value iteration of the Q-Learning approach, the need to evaluate the complicated state transition rules is avoided by embedding them into the value function. However, simple state transition rules do have to be defined for the simulation. According to the definition of the state, there is only one initial state at time k = 0:

X(0) = [\,\underbrace{1, \ldots, 1}_{M \text{ initial projects}},\ \underbrace{0, \ldots, 0}_{R \text{ potential projects}},\ \underbrace{0, \ldots, 0}_{M+R \text{ information state variables}},\ \underbrace{0, \ldots, 0}_{L \text{ types of resource}},\ \underbrace{0, \ldots, 0}_{R \text{ arrival times}},\ \underbrace{0}_{\text{time}}\,]^{\mathrm{T}} \qquad (7)

The initial state evolves with the actions taken by the heuristics and the realizations of the uncertain parameters until it reaches a terminal state. One characteristic of the RCPSP is that the task network of the problem is not deterministic, due to the uncertain outcomes (success or failure) of the tasks in the problem. Even though there is a unique initial state, the problem can end in one of numerous terminal states according to the realization of the uncertainty. The number of possible terminal states depends on the stochastic complexity of the problem. We define the terminal state as the state in which all the projects are terminated. The termination condition for each project is either the successful completion of the final task or the failure of an intermediate task.
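As a small illustration of this termination condition, a hypothetical helper (the status encoding is an assumption made for the sketch) could be:

def is_terminal(project_statuses, completed, failed):
    """True when every project has either completed its final task successfully,
    failed an intermediate task, or been cancelled."""
    return all(s in completed or s in failed for s in project_statuses)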

Objective function: Q-Value

The objective of the RCPSP is the maximization of the final reward after finishing all the projects. The Q-Value iteration equation (3) represents the Q-Value as a recursive addition of the one-stage profit function g(x(k), u(k)), so that it naturally reflects the final reward. The one-stage profit g(x(k), u(k)) is the sum of the cost incurred by an action, u(k), and any reward (profit) collected by the completion of a project at the state x(k) (Figure 6). Since the Q-Value table is expanded by the simulation, the Q-Value at the initial state with the optimal action, Q(x(0), u^{*}(0)), represents the optimal expected final reward.

Figure 6. Definition of Q-Value.

SUBOPTIMAL POLICIES

The choice of the suboptimal policy for the simulation is very important, since it affects the state–action pairs examined and ultimately determines the quality of the final solution. In this section, three greedy heuristics that use information from the state (as defined in 'Definition of State') to make decisions are developed for the sRCPSP. These heuristics emphasize different information about the problem, and hence combining their merits into a single policy could lead to better overall performance. On top of the greedy heuristics, special types of actions (cancellation and idling), which cannot be taken by the heuristics, are randomly added.

Greedy heuristics

1. Heuristic 1: high success probability task first. In the sRCPSP, the result (success or failure) of a task is a very important factor affecting the final reward as well as the future schedule. Heuristic 1 is based on maximizing the success probability of the next allocated task. Whenever a resource frees up, the expected success probability of each potential next task is calculated according to the current information state, z_i(k) for i = 1, 2, ..., M. The success probability of a task is calculated by summing the probabilities of all successful outcomes. This heuristic can be modified for the case of multiple levels of success by assigning appropriate weighting factors to the different levels of success.

2. Heuristic 2: short duration task first. Another way to increase the final reward of the projects in the sRCPSP is to finish the projects as quickly as possible in order to minimize the loss of reward with time. Heuristic 2 considers the time value of the projects in a greedy way by performing the task with the shortest expected duration first whenever a resource conflict exists. The expected duration can be calculated by utilizing the current information state, z_i(k), as in the calculation of the probability of success.

3. Heuristic 3: high reward project first. Heuristic 3 gives priority to the impending task of the project that has the highest potential reward. This is a greedy decision aimed at obtaining the highest reward in the shortest time with the smallest decrease in reward. Heuristic 3 may work well if the project with the highest reward is completed successfully. Its drawback is that the other projects can be delayed too long; therefore, if the project with the highest reward fails, the total reward can be poor. In the R&D pipeline management problem, the priority of each project is decided on the basis of its initial reward value. The decision-making procedure is straightforward: all the tasks in the project with the highest reward are performed first, then the tasks in the project with the next highest reward, and so on. If there is an idle resource after assigning a pending task in the current target project, the resource is used to perform a task in the next-priority project. (A sketch of the priority scores used by the three heuristics is given below.)
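All three greedy rules above can be cast as priority scores computed from the current information state; the Python sketch below (assumed data structures and names, scoring step only, with resource matching and tie-breaking omitted) illustrates this.

def success_probability(cond_column, results):
    """Heuristic 1 score: total probability of the successful outcomes,
    where cond_column is the conditional probability column selected by z_i(k)
    and results lists the outcome labels ('F', 'S1', 'S2', ...) of the candidate task."""
    return sum(p for p, r in zip(cond_column, results) if r != "F")

def expected_duration(cond_column, durations):
    """Heuristic 2 score (smaller is better): expected duration of the candidate task."""
    return sum(p * d for p, d in zip(cond_column, durations))

def reward_priority(initial_reward):
    """Heuristic 3 score: the project's initial reward R0; higher values get priority."""
    return initial_reward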

Random perturbation

The sRCPSP addressed in 'Problem Description: Stochastic RCPSP with New Project Arrivals' includes new project arrivals that can be realized while the initial projects are on-going. To maximize the total reward under new project arrivals, one may reserve resources (available laboratories) for the potential new projects instead of utilizing them for currently on-going projects that may not be all that profitable. Furthermore, complete cancellation of a currently on-going project also has to be considered in order to free resources for more profitable new projects. These 'idling' and 'cancellation' actions are not considered in the three heuristics. In order to explore their potential benefits, the cancellation and idling actions are added randomly in the simulation of the heuristic policies, as sketched below. The random actions are applied with very small probabilities, so as to avoid too many perturbations that could cause the overall reward to deteriorate significantly.
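A possible Python sketch of this random perturbation (illustrative function names; the 0.5% and 1% probabilities are the values used later in the illustrative example):

import random

def perturbed_action(heuristic_action, ongoing_projects,
                     p_idle=0.005, p_cancel=0.01, rng=random):
    """With small probabilities, replace the heuristic decision by an idling or a
    cancellation action so that these otherwise unreachable decisions appear among
    the simulated state-action pairs."""
    r = rng.random()
    if r < p_idle:
        return ("idle",)
    if r < p_idle + p_cancel and ongoing_projects:
        return ("cancel", rng.choice(ongoing_projects))
    return heuristic_action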

ILLUSTRATIVE EXAMPLE

As an illustrative example of the sRCPSP, we consider an R&D pipeline problem that has three initially given projects and two potential new project candidates, which may arrive in the future. The activity-on-node (AoN) graph of the example is shown in Figure 7. The AoN graph displays the sequence of tasks involved in each project together with the resources required to complete the tasks (e.g. task 'I1' has to be performed in laboratory 1 (Lab. 1)). A parenthesized number over each task represents the possible number of outcomes (in terms of duration, cost, and result) of the task. A Markov chain is given for each project to represent the probabilistic correlations among the outcomes of adjacent tasks in the project. For example, for task I2, which has three possible outcomes, a 3 × 1 probability vector is given to represent the probabilities of the three possible outcomes. The conditional probabilities for the potential outcomes of task P1 are assigned based on the realized outcome of I2. Since both I2 and P1 have three possible realizations, the size of the probability transition matrix of P1 is 3 × 3; each column represents the conditional probability vector for the possible outcomes of P1. All the probabilities and parameters of the example are summarized in Table I. R_i, for i = 1, ..., 5, indicates the initial reward of project i at time k = 0. After time k = 0, the rewards of the three initial projects decrease as shown in Figure 8.

Figure 7. RCPSP example.


Table I. Example 1, probabilities and parameters. For each task, the candidate realizations are listed as (result, duration, cost); the transition matrices are given row by row, with each column conditioned on a realization of the preceding task (the first column corresponds to a prior failure).

Project 1
  I1: (F, 3, 300), (S1, 5, 600)
  I2: (F, 4, 350), (S1, 4, 300), (S2, 5, 200)
  P1: (F, 4, 300), (F, 7, 650), (S1, 5, 400)
  PI1 = [0.30; 0.70]
  PM11 = [0 0.25; 0 0.50; 0 0.25]
  PM21 = [0 0.20 0.08; 0 0.30 0.02; 0 0.50 0.90]

Project 2
  I3: (F, 3, 300), (S1, 4, 450), (S2, 5, 600)
  I4: (F, 5, 700), (S1, 7, 500)
  P2: (F, 4, 400), (S1, 6, 600)
  PI2 = [0.25; 0.40; 0.35]
  PM12 = [0 0.30 0.05; 0 0.70 0.95]
  PM22 = [0 0.25; 0 0.75]

Project 3
  I5: (F, 3, 400), (S1, 4, 300)
  I6: (F, 5, 700), (S1, 7, 400)
  I7: (F, 4, 500), (S1, 7, 600), (S2, 5, 300)
  P3: (F, 5, 250), (S1, 3, 300)
  PI3 = [0.15; 0.85]
  PM13 = [0 0.30; 0 0.70]
  PM23 = [0 0.25; 0 0.60; 0 0.15]
  PM33 = [0 0.25 0.10; 0 0.75 0.90]

Project 4
  I8: (F, 5, 500), (S1, 7, 450), (S2, 5, 600)
  I9: (F, 3, 400), (S1, 6, 300)
  P4: (F, 2, 800), (S1, 6, 450)
  PI4 = [0.35; 0.45; 0.20]
  PM14 = [0 0.25 0.15; 0 0.75 0.85]
  PM24 = [0 0.20; 0 0.80]

Project 5
  I10: (F, 5, 600), (S1, 5, 400), (S2, 6, 900)
  I11: (F, 5, 700), (S1, 7, 400)
  I12: (F, 3, 400), (S1, 5, 800), (S2, 6, 950)
  P5: (F, 4, 1000), (S1, 5, 700)
  PI5 = [0.30; 0.55; 0.15]
  PM15 = [0 0.25 0.02; 0 0.75 0.98]
  PM25 = [0 0.20; 0 0.65; 0 0.15]
  PM35 = [0 0.20 0.05; 0 0.80 0.95]


Those reward profiles represent the 'time value' of each project due to the competitive market situation. If a project is delayed for too long (longer than PD_i, the 'project deadline'), the reward earned from completing the project can be insignificant, because similar drugs (products) developed by competitors may have taken large market shares. The reward profiles of the potential project candidates are introduced at the time of their arrival. Also shown in Figure 8 is one of the possible realizations of the reward profiles for the remaining two projects, in which both of the potential projects 4 and 5 arrive at time k = 10. The probabilities of the various scenarios for the new project arrivals are summarized in Table II. There are 16 new project arrival scenarios, including the case where neither of the projects is realized.

Table II. Probabilities for the timing of project appearance.

Arrival time k    10     20     30     Never
P4                0.3    0.5    0.1    0.1
P5                0.4    0.3    0.2    0.1

Figure 8. Realized reward profiles in the illustrative example.

Stochastic complexity of the example. The illustrative example is a small-size sRCPSP, which consists of only five projects: three initial projects and two potential new project candidates. Despite being a small-size problem, its stochastic complexity is quite high. One measure of the stochastic complexity of a problem is the total number of possible scenarios under different parameter realizations. Figure 9 shows project 1 of the illustrative example and its realization data. According to the realization data, there are six scenarios, 3 × 2 × 1 (three realizations of P1, two realizations of I2 linked to P1 and one realization of I1 linked to I2), in the case that P1 is completed, i.e. all three tasks in the project are performed. Similarly, there is one scenario each for the cases of project termination after the completion of I2 only or I1 only. Thus, the total number of scenarios for project 1 is 8. In summary, with the realization data given in Table I, the total numbers of scenarios for projects 2–5 are 7, 7, 7 and 13, respectively. Furthermore, the total number of new project arrival scenarios is 16. The probabilities for the scenarios of each project and for the new project arrivals are independent of one another; thus, the total number of scenarios of the problem is 570 752, which is obtained by multiplying the numbers of scenarios of all the projects.
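As a quick arithmetic check of the scenario count quoted above:

# 8 scenarios for project 1, 7 each for projects 2-4, 13 for project 5,
# and 16 new-project-arrival scenarios, all independent of one another.
total_scenarios = 8 * 7 * 7 * 7 * 13 * 16
print(total_scenarios)   # 570752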

Simulation with the three heuristics and random perturbation

The three heuristics introduced in 'Suboptimal Policies' are implemented for the illustrative example. For the simulation, 30 000 realizations of the Markov chains are performed for each heuristic. The simulation results for Heuristics 1, 2, and 3 are shown in Figure 10. They show that all three heuristics can generate large losses in the worst-case realizations, in which all the projects progress successfully until the last task fails, so that a large cost is incurred for performing all the tasks without earning any reward. The simulation results (in Tables III and IV and Figure 10) indicate that none of the heuristics is uniformly superior across different realizations; the relative performance of the different heuristics varies by realization.

Figure 9. Project 1 in the illustrative example.

Figure 10. Simulation results for 30 000 realizations under the three heuristic policies.


Since the heuristics are not able to take 'unusual' actions such as 'cancelling' and 'idling', those actions are randomly mixed in with the actions chosen by the heuristics during the simulation. At each decision point, the idling and cancellation decisions replace the heuristic decision with probabilities of 0.5 and 1.0%, respectively. The cancellation decision is applied to the on-going projects, and a cancelled project is considered 'failed' with zero action cost and zero reward. The idling decision for an available resource is also treated as a cost-free action, and the idling continues until the next event (state). Table V shows the performance of the heuristics with the randomly introduced 'cancelling' and 'idling' actions for the same set of 30 000 realizations. Since these actions are introduced with small probabilities, the overall performance of the heuristics is similar to that without the random actions (Table III). The simulation is performed over 10 sets of 30 000 realizations by applying the three heuristics with randomly mixed 'cancellation' and 'idling' actions.

Implementation of the DP in heuristically restricted state space

The state of the illustrative example consists of 15 state variables according to the state definition in 'Definition of State', as shown in Equation (8):

X = [s_1, s_2, s_3, s_4, s_5, z_1, z_2, z_3, z_4, z_5, L_1, L_2, k, p_{4k}, p_{5k}] \qquad (8)

The calculation of the total state space size is complicated by the fact that certain combinations of completed tasks and event times cannot be realized. The duration of the whole schedule is expected to be about 40 time units; based on estimates of the longest task duration and of the possible task parameter sets, we calculate that approximately 950 million states could potentially be experienced.

Table III. Simulation results for 30 000 realizations under the heuristic policies.

Total profit    Heuristic 1    Heuristic 2    Heuristic 3
Mean            5914.00        3967.85        7276.98
Max.            30340.20       28726.64       32097.61
Min.            −9000          −9100          −8750

Table IV. Performance of the heuristics.

                H1 > H2    H1 > H3    H2 > H1    H2 > H3    H3 > H1    H3 > H2
# of cases      7188       13575      6839       4942       10881      16234
Mean            6190.36    2695.00    3662.54    2537.51    5441.92    6860.18

Table V. Simulation results for 30 000 realizations under the heuristic policies with 0.5% idling and 1% cancellation actions.

Total profit    Heuristic 1    Heuristic 2    Heuristic 3
Mean            5932.52        3927.43        7270.81
Max.            30340.20       28733.22       32050.05
Min.            −9000          −9100          −8750

1. State–action pairs. As a result of the simulation, 263 053 non-redundant state–action pairs are obtained. Each of the 263 053 state–action pairs has a Q-Value representing the expected total reward from the current state to the terminal state. Among the 263 053 state–action pairs, 29 599 states have 'no action' to choose because they are identified as terminal states.

2. Q-Value iteration. For the 263 053 state–action pairs, the Q-Value iteration, as expressed by Equation (3), is performed starting with the initial Q-Values obtained in the previous step. For a given state–action pair, the Q-Value iteration equation finds an optimal action. The iteration scheme converged within a chosen error tolerance level for \|(Q_{i+1} - Q_i)/Q_i\| after the 21st iteration and took about 3.1 h per iteration; a sketch of this stopping test is given after item 3 below.

3. Improved solution: online decision making. The decision policy obtained by the proposed approach is represented by the converged Q-Value table, which can be used for online decisions through Equation (4). Thus, after the converged Q-Values are obtained, we can make a valid decision for any realization generated by the underlying Markov chain model. In an actual decision-making process, the eventual outcomes of currently on-going tasks are not known at any point of decision. The transition from a current state is a consequence of both the decision made according to Equation (4) and the parameter values realized by the random process.
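The stopping test mentioned in item 2 can be implemented, for example, as a relative change between successive Q-Value tables; the sketch below is an assumed realization of the criterion \|(Q_{i+1} - Q_i)/Q_i\|, not the authors' code, and it assumes both tables share the same keys.

def relative_change(Q_new, Q_old, eps=1e-12):
    """Largest relative change in the Q-Value table between two successive iterations."""
    return max(abs(Q_new[key] - Q_old[key]) / max(abs(Q_old[key]), eps) for key in Q_old)

# Iterate Equation (3) until relative_change(Q_new, Q_old) drops below the chosen tolerance.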

Computational results

To verify the performance of the policy obtained by the proposed approach, it is compared to the heuristic solutions for the 30 000 realizations used to synthesize the policy. The results shown in Table VI indicate that the proposed approach improves the mean performance by about 39.13% compared to the best heuristic policy, Heuristic #3. This significant improvement can be explained by the appropriate 'cancelling' and 'idling' decisions made by the policy. Although those actions are randomly mixed with the heuristics during the simulation, some of them are selected appropriately to maximize the total reward in the Q-Value iteration. The 'cancellation' and 'idling' actions are mainly chosen to prevent the 'worst' cases, in which a net loss is expected due to major project failures. The results in Table VI show that the minimum reward, the worst case, is increased to −8050. On the other hand, the maximum reward of the online policy is in the same range as those of the heuristics. Hence, the significant improvement of the mean value is mainly due to reducing the worst-case losses with appropriate use of the cancellation and idling actions.

Table VI. Simulation results for 30 000 realizations under the heuristic policies with 0.5% idling and 1% cancellation actions vs online decision making with the learned Q-Value table.

Total profit    H1          H2          H3          Online
Mean            5914.00     3967.85     7276.98     10124.63
Max.            30340.20    28726.64    32097.61    30385.45
Min.            −9000       −9100       −8750       −8050


The stochastic complexity of the problem is very high, with 642 096 potential scenarios. The policy should be robust with respect to any of these scenarios, even though not all of them were seen during the learning stage. To test the robustness of the policy, it is applied to 10 000 new realizations that were not included in the training set. The computational results summarized in Table VII and Figure 11 show that the policy obtained by the proposed approach is indeed robust with respect to unseen scenarios in this example. Figure 11 shows an obvious shift of the negative-reward cases toward the positive direction.

Table VII. Simulation results for 10 000 new realizations under the heuristic policies with 0.5% idling and 1% cancellation actions vs online decision making with the learned Q-Value table.

Total profit    H1          H2          H3          Online
Mean            6054.31     4021.65     7339.34     10321.67
Max.            29874.99    28762.29    29974.99    30856.05
Min.            −8600       −8800       −8350       −7450

Figure 11. Evaluation of the online decision making performance for a new set of 10 000 realizations.

Figure 12 shows how the online policy improves the total reward dramatically for a particular realization, realization #8863, which is one of the 10 000 realizations used for the policy evaluation. In realization #8863, two of the three initial projects, project 1 and project 3, turn out to fail in their second tasks. Meanwhile, both of the potential project candidates, project 4 and project 5, arrive at time 10 and are successfully completed. The best heuristic, Heuristic #3, allocates the resources to project 1 and project 3 until they end in failure. The Q-Value table-based policy, however, cancels project 3 after the successful completion of its first task. Furthermore, after the new projects arrive at t = 10, it allocates the resources to the new projects and cancels project 1. All these decisions are made by the Q-Value table-based policy of Equation (4). As a result of the appropriate use of cancellation actions for the less profitable projects, the online policy boosts the final reward to 20588.36, which is more than a 60% improvement compared to the result of 12752.99 achieved by the best of the heuristics.

Figure 12. Gantt charts: Heuristic #3 vs the Q-Value table-based policy for realization #8863.

CONCLUSION AND FUTURE STUDY

A stochastic resource-constrained project scheduling problem (sRCPSP) has been addressed. Markov chains are used to model the key uncertainties, including the duration, cost and result of a task. To the basic problem structure of the sRCPSP, the additional feature of new project arrivals is added. To solve the problem, a Q-Learning-based approach has been developed with appropriate definitions of the state and actions. The Q-Learning approach enables us to induce empirical state transition rules from the simulation data, so that the costly evaluation of analytical transition rules can be avoided. To maximize the advantages of using the empirical state transition rules, special types of actions, project cancellation and resource idling, which are difficult to incorporate into heuristics, were randomly added in the simulation. The random actions are filtered during the Q-Value iteration, and the final resulting policy uses these actions appropriately in order to maximize the total reward of the system. The proposed solution method has been tested by solving an illustrative sRCPSP with 642 096 scenarios. The solution obtained by the policy on average outperforms all the heuristics used to generate the learning data. Furthermore, by utilizing the cancellation and idling actions appropriately, the resulting policy reduces the worst-case losses. The robustness of the policy is confirmed by solving the problem with a new set of realizations whose data were not used in the learning phase.

ACKNOWLEDGEMENTS

JHL gratefully acknowledges the financial support from the U.S. National Science Foundation (CTS-#0301993).

REFERENCES

1. Blazewicz J, Lenstra JK, Rinnooy Kan AHG. Scheduling subject to resource constraints: classification and complexity. Discrete Applied Mathematics 1983; 5:11–24.
2. Schmidt CW, Grossmann IE. Optimization models for the scheduling of testing tasks in new product development. Industrial and Engineering Chemistry Research 1996; 35(10):3498–3510.
3. Jain V, Grossmann IE. Resource-constrained scheduling of tests in new product development. Industrial and Engineering Chemistry Research 1999; 38(8):3013–3026.
4. Blau GE, Mehta B, Bose S, Pekny JF, Sinclair G, Kuenker K, Bunch P. Risk management in the development of new products in highly regulated industries. Computers and Chemical Engineering 2000; 24(2–7):659–664.
5. Subramanian D, Pekny JF, Reklaitis GV. A simulation-optimization framework for addressing combinatorial and stochastic aspects of an R&D pipeline management problem. Computers and Chemical Engineering 2000; 24:1005–1011.
6. Maravelias CT, Grossmann IE. Simultaneous planning for new product development and batch manufacturing facilities. Industrial and Engineering Chemistry Research 2001; 40(26):6147–6164.
7. Subramanian D, Pekny JF, Reklaitis GV. A simulation-optimization framework for research and development pipeline management. AIChE Journal 2001; 47(10):2226–2241.
8. Rogers MJ, Gupta A, Maranas CD. Real options based analysis of optimal pharmaceutical research and development portfolios. Industrial and Engineering Chemistry Research 2002; 41:6607–6620.
9. Subramanian D, Pekny JF, Reklaitis GV. Simulation-optimization framework for stochastic optimization of R&D pipeline management. AIChE Journal 2003; 49(1):96–112.
10. Choi J, Lee JH, Realff MJ. Dynamic programming in a heuristically confined state space: a stochastic resource-constrained project scheduling application. Computers and Chemical Engineering 2004; 28(6–7):1039–1058.
11. Choi J, Lee JH, Realff MJ. Simulation based approach for improving heuristics in stochastic resource-constrained project scheduling problem. 8th International Symposium on Process Systems Engineering, Kunming, China, 2003.
12. Watkins CJ. Learning from delayed rewards. Ph.D. Thesis, Cambridge University, 1989.
13. Sutton RS, Barto AG. Reinforcement Learning (3rd edn). The MIT Press: Cambridge, MA, 2000.
14. Bertsekas DP, Tsitsiklis JN. Neuro-Dynamic Programming (1st edn), vol. 1. Athena Scientific: Belmont, Massachusetts, 1996.
