
This article was downloaded by: [143.215.33.36] On: 01 August 2018, At: 11:52. Publisher: Institute for Operations Research and the Management Sciences (INFORMS). INFORMS is located in Maryland, USA.

INFORMS Journal on Computing

Publication details, including instructions for authors and subscription information: http://pubsonline.informs.org

Gradient-Based Adaptive Stochastic Search for Simulation Optimization Over Continuous Space
Enlu Zhou, Shalabh Bhatnagar

To cite this article: Enlu Zhou, Shalabh Bhatnagar (2018) Gradient-Based Adaptive Stochastic Search for Simulation Optimization Over Continuous Space. INFORMS Journal on Computing 30(1):154-167. https://doi.org/10.1287/ijoc.2017.0771

Full terms and conditions of use: http://pubsonline.informs.org/page/terms-and-conditions

This article may be used only for the purposes of research, teaching, and/or private study. Commercial use or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher approval, unless otherwise noted. For more information, contact [email protected].

The Publisher does not warrant or guarantee the article’s accuracy, completeness, merchantability, fitness for a particular purpose, or non-infringement. Descriptions of, or references to, products or publications, or inclusion of an advertisement in this article, neither constitutes nor implies a guarantee, endorsement, or support of claims made of that product, publication, or service.

Copyright © 2018, INFORMS

Please scroll down for article—it is on subsequent pages

INFORMS is the largest professional society in the world for professionals in the fields of operations research, management science, and analytics. For more information on INFORMS, its publications, membership, or meetings visit http://www.informs.org

INFORMS JOURNAL ON COMPUTING, Vol. 30, No. 1, Winter 2018, pp. 154–167

http://pubsonline.informs.org/journal/ijoc/ ISSN 1091-9856 (print), ISSN 1526-5528 (online)

Gradient-Based Adaptive Stochastic Search for Simulation Optimization Over Continuous Space

Enlu Zhou,^a Shalabh Bhatnagar^b

^a H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332; ^b Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India
Contact: [email protected], http://orcid.org/0000-0001-5399-6508 (EZ); [email protected] (SB)

Received: March 3, 2014
Revised: November 17, 2014; December 2, 2015; November 29, 2016; June 18, 2017
Accepted: June 19, 2017
Published Online: January 23, 2018

https://doi.org/10.1287/ijoc.2017.0771

Copyright: © 2018 INFORMS

Abstract. We extend the idea of model-based algorithms for deterministic optimization to simulation optimization over continuous space. Model-based algorithms iteratively generate a population of candidate solutions from a sampling distribution and use the performance of the candidate solutions to update the sampling distribution. By viewing the original simulation optimization problem as another optimization problem over the parameter space of the sampling distribution, we propose to use a direct gradient search on the parameter space to update the sampling distribution. To improve the computational efficiency, we further develop a two-timescale updating scheme that updates the parameter on a slow timescale and estimates the quantities involved in the parameter update on the fast timescale. We analyze the convergence properties of our algorithms through techniques from stochastic approximation, and demonstrate good empirical performance by comparing with two state-of-the-art model-based simulation optimization methods.

History: Accepted by Marvin Nakayama, Area Editor for Simulation.
Funding: The authors gratefully acknowledge the support of the National Science Foundation [Grants CMMI-1413790 and CAREER CMMI-1453934], the Air Force Office of Scientific Research [Grant FA-9550-14-1-0059], Xerox Corporation, USA, the Department of Science and Technology, Government of India, and the Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science (IISc).

Supplemental Material: The online supplement is available at https://doi.org/10.1287/ijoc.2017.0771.

Keywords: simulation optimization • model-based optimization • two-timescale stochastic approximation

1. Introduction
Many engineering systems require optimization over continuous decision variables to improve system performance. Due to the scale and complexity of those systems, the system performance is often evaluated through simulation or experimentation that has noise, so only samples of system performance are observed. Such problems are often referred to as simulation optimization problems in the literature. As characterized by Fu et al. (2008), there are four main classes of approaches to simulation optimization over continuous space: (i) sample average approximation, e.g., Rubinstein and Shapiro (1993), de Mello et al. (1999), Kleywegt et al. (2002); (ii) stochastic gradient methods or stochastic approximation, e.g., Kiefer and Wolfowitz (1952), Spall (1992), Kushner and Yin (2004); (iii) sequential response surface methodology, e.g., Barton and Meckesheimer (2006), Huang et al. (2006), Chang et al. (2013); and (iv) metaheuristics, a broad category of methods that generalize deterministic metaheuristics to the simulation optimization setting, e.g., Ólafsson (2006), Shi and Ólafsson (2000), Andradóttir (2006), Scott et al. (2011). As noted in the survey papers by Fu (2002a, b), when it comes to solving practical problems, the optimization algorithms actually implemented in simulation software are evolutionary algorithms and other metaheuristics from deterministic nonlinear optimization, such as genetic algorithms (Goldberg 1989), tabu search (Glover 1990), and random search (Zabinsky 2003, Andradóttir 2006).

Among deterministic metaheuristics, a recent class of methods under the name of "model-based methods" has proven to work very well for deterministic nonlinear optimization problems (Zlochin et al. 2004, Hu et al. 2012). Model-based methods typically assume a sampling distribution (i.e., a probabilistic model), often within a parameterized family of distributions, over the solution space, and iteratively carry out two interrelated steps: (1) draw candidate solutions from the sampling distribution, and (2) use the evaluations of these candidate solutions to update the sampling distribution. The hope is that at every iteration the sampling distribution is biased toward the more promising regions of the solution space, and will eventually concentrate on one or more of the optimal solutions. Examples of model-based algorithms include ant colony optimization (Dorigo and Gambardella 1997), annealing adaptive search (AAS) (Romeijn and Smith 1994), probability collectives (PCs) (Wolpert 2004), estimation of distribution algorithms (EDAs) (Larranaga et al. 1999, Muhlenbein and Paaß 1996), the cross-entropy (CE) method (Rubinstein 2001), model reference adaptive search (MRAS) (Hu et al. 2007), the interacting-particle algorithm (Molvalioglu et al. 2009, 2010), and gradient-based adaptive stochastic search (GASS) (Zhou and Hu 2014). These algorithms retain the primary strengths of population-based approaches such as genetic algorithms, while providing more flexibility and robustness in exploring the entire solution space. There have been a few attempts to extend model-based methods to simulation optimization, mostly with a focus on how to allocate the simulation budget to candidate solutions in each iteration of the algorithm: Chepuri and de Mello (2005) proposed a simple heuristic sampling scheme to determine the number of simulation replications in each iteration of the CE method; Hu et al. (2008) studied sufficient conditions on the simulation allocation rule in MRAS to guarantee convergence of the algorithm and proposed the method of stochastic model reference adaptive search (SMRAS); more recently, He et al. (2010) developed the algorithm named cross-entropy with optimal computing budget allocation (CEOCBA), which nicely incorporates the idea of optimal computing budget allocation (OCBA) (Chen et al. 2000, Chen and Lee 2010) into each iteration of the CE method.

Motivated by model-based methods but different from the aforementioned approaches, we propose a framework for incorporating gradient search into model-based optimization: by viewing the original problem as another optimization problem over the parameter space of the sampling distribution, we use a direct gradient search to update the parameter while drawing new solutions from the sampling distribution. This idea was exploited for deterministic nondifferentiable optimization in our earlier work (Zhou and Hu 2012, 2014). In this paper, we extend it to the setting of simulation optimization, where the performance of a candidate solution can only be estimated through simulation or experimentation. As a result, the gradient and Hessian terms involved in the gradient search need to be estimated jointly from the candidate solutions and their performance estimates. With the aim of reducing the sample size to further improve efficiency, we develop a two-timescale scheme that updates the parameter on a slow timescale and estimates the gradient and Hessian terms using the samples on a fast timescale. The resulting algorithm is essentially a two-timescale stochastic approximation scheme, which allows us to use theory and tools from the multi-timescale stochastic approximation literature (Borkar 1997, 2008; Bhatnagar and Borkar 1998; Bhatnagar et al. 2013) to analyze the convergence of the algorithm.

The rest of the paper is organized as follows. Section 2 introduces and develops the main idea behind the proposed algorithm. Section 3 presents our proposed algorithm and its two-timescale variant. Section 4 presents the main convergence result; the entire convergence analysis of the algorithms is presented in the accompanying online supplement. Section 5 illustrates the performance of our algorithms by comparing with the CEOCBA method on several benchmark problems and with SMRAS on a simulation optimization problem. Finally, we conclude the paper in Section 6.

2. Main Idea and Algorithm Development
We consider the following simulation optimization problem:

max_{x∈X} H(x) ≜ E[h(x, ξ_x)],   (1)

where X ⊆ R^{d_x} is a compact set, ξ_x represents the randomness in the system (the subscript x signifies that the distribution of ξ_x may depend on x), and the expectation is taken with respect to the distribution of ξ_x. Assume that H is bounded on X, i.e., ∃ H_lb > −∞, H_ub < ∞ s.t. H_lb ≤ H(x) ≤ H_ub. Due to the complexity of the system, the analytical form of H is often not available. For a given x, its feasibility can be determined easily, but its performance H(x) can only be estimated through stochastic simulation or experimentation, which returns a sample of h(x, ξ_x). We also assume H(x) is Lipschitz continuous. Continuity of the objective function is needed because one underlying assumption behind our algorithms is that solutions in a neighborhood have similar performance.

2.1. Main Idea
In this section, we explain our main idea at a high level and focus on the intuition, leaving the technical details to the next section on the formal development of the algorithms.

As in many model-based methods, we introduce a parameterized family of densities {f(x; θ)} as the sampling distribution, where θ is the parameter that will be updated over iterations. We will illustrate our idea by first considering the function

L(θ) ≜ ∫ H(x) f(x; θ) dx.

It is easy to see that

L(θ) ≤ H(x*),

and the equality is achieved if and only if all the probability mass of f(x; θ) concentrates on a subset of the set of global optima. If such a θ exists, it will recover the optimal solution and the optimal function value. For example, assuming the optimal solution x* is unique and f(x; θ) is a Gaussian distribution, then x* will be recovered by a degenerate distribution that concentrates only at x*, which corresponds to the mean of the Gaussian being equal to x* and the variance being zero.

In contrast to H(x), which is not necessarily differentiable, the function L(θ) is continuous and differentiable in θ under some mild conditions on f(x; θ), and hence direct gradient methods may be used to update θ by taking advantage of this structural property. Moreover, the updated sampling distribution f(x; θ) guides the stochastic search on the solution space of H(x), and hence maintains a global exploration of the solution space. Viewed from the perspective of balancing exploration and exploitation, the direct gradient search on θ tries to exploit the promising region in the θ-space, while the sampling distribution keeps exploring the x-space. This leads to a natural idea of incorporating direct gradient search within the framework of model-based optimization, with the hope of combining the relatively fast convergence of gradient search with the global exploration of model-based optimization. More specifically, such an incorporation can be done by iteratively carrying out the following two steps:
1. Generate candidate solutions from the sampling distribution.
2. Based on the evaluations of the candidate solutions, update the parameter of the sampling distribution via direct gradient search.

For the second step above, computing the gradient and (approximate) Hessian is easy and only requires evaluation or estimation of the objective function H(x). Assuming that the derivative and integral can be interchanged (which will be formally justified later), we can derive an expression for the gradient of L(θ) as follows:

∇_θ L(θ) = ∇_θ ∫ H(x) f(x; θ) dx
         = ∫ H(x) (∇_θ f(x; θ) / f(x; θ)) f(x; θ) dx
         = E_{f_θ}[H(x) ∇_θ ln f(x; θ)].

Throughout the paper, we will use the notation E_{f_θ} to denote the expectation taken with respect to the distribution f(·; θ), which is parameterized by θ. We note that any unbiased estimator Ĥ(x) of H(x) leads to an unbiased estimate of E_{f_θ}[H(x) ∇_θ ln f(x; θ)], since

E[Ĥ(x) ∇_θ ln f(x; θ)] = E_{f_θ}[E[Ĥ(x)] ∇_θ ln f(x; θ)] = E_{f_θ}[H(x) ∇_θ ln f(x; θ)],

where the inner expectation is taken with respect to the randomness in Ĥ(x). Hence, to estimate the gradient ∇_θ L(θ), we can first draw a sample x from f(·; θ) and then plug it into Ĥ(x) ∇_θ ln f(x; θ).
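To make this estimator concrete, the sketch below (all specifics are illustrative assumptions: a one-dimensional Gaussian sampling distribution with mean parameter mu and fixed variance, and a made-up noisy objective with optimum x* = 2) estimates ∇_μ L(μ) = E_{f_μ}[H(x) ∇_μ ln f(x; μ)] by Monte Carlo, using only noisy evaluations Ĥ(x):

```python
import numpy as np

def grad_L(H_noisy, mu, sigma, n, seed=1):
    """Score-function (likelihood-ratio) estimate of grad_mu L(mu) for
    f = N(mu, sigma^2) with fixed sigma:
    grad L(mu) = E_f[H(x) * d/dmu ln f(x; mu)], estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    score = (x - mu) / sigma**2          # d/dmu ln f(x; mu) for a Gaussian
    return np.mean(H_noisy(x) * score)   # unbiased even though H is noisy

# Hypothetical noisy objective with optimum at x* = 2.
def H_noisy(x):
    noise = np.random.default_rng(0).normal(0.0, 0.1, np.shape(x))
    return -(x - 2.0)**2 + noise

g = grad_L(H_noisy, mu=0.0, sigma=1.0, n=200_000)
# The true gradient at mu = 0 is -2*(mu - 2) = 4, so g should be close to 4.
```

Note that only samples of the objective are needed; the estimator never differentiates H itself, which is exactly what makes this approach suitable for simulation optimization.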

2.2. Formal Development
For the development of a simple and yet flexible algorithm, we introduce a shape function S_θ: R → R⁺, where the subscript θ signifies the possible dependence of the shape function on the parameter θ. The function S_θ satisfies the following condition: for every fixed θ, S_θ(y) is strictly increasing in y and bounded from above and below for any y ∈ (−∞, +∞); moreover, for every fixed y, S_θ(y) is continuous and nondecreasing in θ.

The shape function makes the objective function value positive, and the strictly increasing property preserves the order of solutions and, in particular, the optimal solution. The other conditions on the shape function are regularity conditions that facilitate the derivation and analysis later. The shape function also imposes a weighting scheme on the samples based on sample function evaluations, and hence guides the sampling distribution toward the more promising regions where the candidate solutions have better performance; this point will be discussed in detail in Section 3. Then, for an arbitrary but fixed θ′, we define

L(θ; θ′) = ∫ S_{θ′}(H(x)) f(x; θ) dx.

By the condition on the shape function, it is easy to see that for a fixed θ′,

L(θ; θ′) ≤ S_{θ′}(H(x*)),   (2)

and the equality is achieved if and only if all the probability mass of f(x; θ) is concentrated on a subset of the set of global optima. We further define

l(θ; θ′) ≜ ln L(θ; θ′).

Since ln(·) is a strictly increasing function, l(θ; θ′) has the same set of optimal solutions as L(θ; θ′).

Following our main idea outlined before, we propose a stochastic search algorithm that carries out the following two steps at each iteration: letting θ_k be the parameter obtained at the kth iteration,
1. Generate candidate solutions from f(x; θ_k).
2. Update the parameter to θ_{k+1} using a Newton-like iteration for sup_{θ∈Θ} l(θ; θ_k).

The second step requires one to compute the gradient and Hessian of l(θ; θ_k), whose analytical expressions are provided in Proposition 1 below when the sampling distribution is chosen to be an exponential family of densities (see Proposition 2 in Zhou and Hu 2014 for the same result and the result for the case of general distributions; we provide the proof here for completeness). The definition of exponential families (cf. Barndorff-Nielsen 1978) is stated as follows.

Definition 1. A family {f(x; θ): θ ∈ Θ ⊂ R^{d_θ}} is an exponential family of densities if its probability density function (pdf) satisfies

f(x; θ) = exp{θᵀT(x) − φ(θ)},  φ(θ) = ln{∫ exp(θᵀT(x)) dx},   (3)

where T(x) = [T₁(x), T₂(x), ..., T_d(x)]ᵀ is the vector of sufficient statistics, θ = [θ₁, θ₂, ..., θ_d]ᵀ is the vector of natural parameters, and Θ = {θ ∈ R^{d_θ}: |φ(θ)| < ∞} is the parameter space, which is an open set with a nonempty interior.
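To make Definition 1 concrete, the following sketch (for illustration only) writes the one-dimensional Gaussian N(μ, σ²) in the form (3), with sufficient statistics T(x) = [x, x²]ᵀ, natural parameters θ₁ = μ/σ², θ₂ = −1/(2σ²), and log-partition function φ(θ) = μ²/(2σ²) + ½ ln(2πσ²), and compares the two parameterizations of the density:

```python
import numpy as np

def gaussian_natural_form(mu, sigma2):
    """N(mu, sigma2) written in the exponential-family form (3) with
    T(x) = [x, x^2]: natural parameters and log-partition function phi."""
    theta1 = mu / sigma2                  # coefficient of T1(x) = x
    theta2 = -1.0 / (2.0 * sigma2)        # coefficient of T2(x) = x^2
    phi = mu**2 / (2.0 * sigma2) + 0.5 * np.log(2.0 * np.pi * sigma2)
    return theta1, theta2, phi

t1, t2, phi = gaussian_natural_form(mu=1.5, sigma2=0.7)
x = np.linspace(-3.0, 3.0, 13)
pdf_natural = np.exp(t1 * x + t2 * x**2 - phi)   # exp{theta' T(x) - phi(theta)}
pdf_standard = np.exp(-(x - 1.5)**2 / (2 * 0.7)) / np.sqrt(2 * np.pi * 0.7)
```

The two arrays agree termwise, since θ₁x + θ₂x² − φ(θ) = −(x − μ)²/(2σ²) − ½ ln(2πσ²).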


Proposition 1. Assume that f(x; θ) is twice differentiable in θ and that ∇_θ f(x; θ) and ∇²_θ f(x; θ) are bounded on X for any θ. Furthermore, if f(x; θ) is in an exponential family of densities, then

∇_θ l(θ; θ′)|_{θ=θ′} = E_{g_{θ′}}[T(X)] − E_{f_{θ′}}[T(X)],
∇²_θ l(θ; θ′)|_{θ=θ′} = Cov_{g_{θ′}}[T(X)] − Cov_{f_{θ′}}[T(X)],

where E_{f_{θ′}} and Cov_{f_{θ′}} denote the expectation and covariance with respect to f(·; θ′), and E_{g_{θ′}} and Cov_{g_{θ′}} denote the expectation and covariance taken with respect to the pdf g(·; θ′), which is defined by

g(x; θ) = S_θ(H(x)) f(x; θ) / ∫ S_θ(H(y)) f(y; θ) dy.   (4)

Proof. Consider the gradient of l(θ; θ′) with respect to θ:

∇_θ l(θ; θ′)|_{θ=θ′} = (∇_θ L(θ; θ′) / L(θ; θ′))|_{θ=θ′}
 = (∫ S_{θ′}(H(x)) f(x; θ) ∇_θ ln f(x; θ) dx / L(θ; θ′))|_{θ=θ′}
 = E_{g_{θ′}}[∇_θ ln f(X; θ′)].   (5)

Differentiating (5) with respect to θ, we obtain the Hessian

∇²_θ l(θ; θ′)|_{θ=θ′} = [ ∫ S_{θ′}(H(x)) f(x; θ) ∇²_θ ln f(x; θ) dx / L(θ; θ′)
 + ∫ S_{θ′}(H(x)) ∇_θ ln f(x; θ) (∇_θ f(x; θ))ᵀ dx / L(θ; θ′)
 − (∫ S_{θ′}(H(x)) f(x; θ) ∇_θ ln f(x; θ) dx)(∇_θ L(θ; θ′))ᵀ / L(θ; θ′)² ]|_{θ=θ′}.

Using ∇_θ f(x; θ) = f(x; θ) ∇_θ ln f(x; θ) in the second term on the right-hand side, the above expression can be rewritten as

∇²_θ l(θ; θ′)|_{θ=θ′} = E_{g_{θ′}}[∇²_θ ln f(X; θ′)] + E_{g_{θ′}}[∇_θ ln f(X; θ′)(∇_θ ln f(X; θ′))ᵀ]
 − E_{g_{θ′}}[∇_θ ln f(X; θ′)] E_{g_{θ′}}[∇_θ ln f(X; θ′)]ᵀ
 = E_{g_{θ′}}[∇²_θ ln f(X; θ′)] + Cov_{g_{θ′}}[∇_θ ln f(X; θ′)].   (6)

Furthermore, if f(x; θ) = exp{θᵀT(x) − φ(θ)}, then

∇_θ ln f(x; θ) = T(x) − E_{f_θ}[T(X)].   (7)

Plugging (7) into (5) yields

∇_θ l(θ; θ′)|_{θ=θ′} = E_{g_{θ′}}[T(X)] − E_{f_{θ′}}[T(X)].

Differentiating (7) with respect to θ, we obtain

∇²_θ ln f(x; θ) = −Cov_{f_θ}[T(X)].   (8)

Plugging (7) and (8) into (6) yields

∇²_θ l(θ; θ′)|_{θ=θ′} = Cov_{g_{θ′}}[T(X)] − Cov_{f_{θ′}}[T(X)]. □

With the analytical expressions of the gradient and Hessian, we are ready to use a gradient-based approach to update the parameter θ. To ensure the parameter update is along an ascent direction of l(θ; θ′), we approximate the Hessian by the slightly perturbed second term in ∇²_θ l(θ′; θ′), i.e., the negative definite term −(Cov_{f_{θ′}}[T(X)] + εI), where I is the identity matrix and ε is a small positive constant. The reason for this approximation is that Cov_{f_θ}[T(X)] = E[−∇²_θ ln f(X; θ)] is the Fisher information matrix, whose inverse provides a lower bound on the variance of an unbiased estimator of the parameter, leading to the fact that Cov_{f_θ}[T(X)]⁻¹ is the minimum-variance stepsize for stochastic approximation (as we will show later, our algorithm can be interpreted as a stochastic approximation procedure on θ). In summary, we update the parameter as follows:

θ_{k+1} = Π_Θ{θ_k + α_k (Cov_{f_{θ_k}}[T(X)] + εI)⁻¹ ∇_θ l(θ_k; θ_k)}
        = Π_Θ{θ_k + α_k (Cov_{f_{θ_k}}[T(X)] + εI)⁻¹ (E_{g_{θ_k}}[T(X)] − E_{f_{θ_k}}[T(X)])},   (9)

where α_k is a positive stepsize, and Π_Θ denotes the projection operator that projects an iterate back onto the parameter space Θ.

To have an implementable algorithm, we still need to evaluate or estimate the expectation and covariance terms in (9). The expectation term E_{f_{θ_k}}[T(X)] can be evaluated analytically in most cases. For example, if f(·; θ_k) belongs to the family of Gaussian distributions, then E_{f_{θ_k}}[T(X)] reduces to the mean and second moment of the Gaussian distribution. The covariance term Cov_{f_{θ_k}}[T(X)] may not be directly available or could be too complicated to compute analytically, but it can be approximated by the sample covariance using the candidate solutions drawn from f(·; θ_k). Recalling (4), the remaining term E_{g_{θ_k}}[T(X)] can be rewritten as

E_{g_{θ_k}}[T(X)] = E_{f_{θ_k}}[ T(X) S_{θ_k}(H(X)) / E_{f_{θ_k}}[S_{θ_k}(H(X))] ].

Hence we can draw i.i.d. samples {x_k^i, i = 1, ..., N_k} from f(x; θ_k) and compute normalized weights of the samples according to

w_k^i ∝ S_{θ_k}(H(x_k^i)), i = 1, ..., N_k;  Σ_{i=1}^{N_k} w_k^i = 1,

and approximate E_{g_{θ_k}}[T(X)] by Σ_{i=1}^{N_k} w_k^i T(x_k^i). Since the exact function value is not available in simulation optimization, the algorithm will use a performance estimate Ĥ(x_k^i) (instead of H(x_k^i)) to compute the weights as above.
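The self-normalized estimate of E_{g_{θ_k}}[T(X)] can be sketched as follows (the particular shape function S(y) = exp(y), the sufficient statistics T(x) = [x, x²], and the input values are all illustrative assumptions):

```python
import numpy as np

def estimate_Eg_T(xs, H_hat, S, T):
    """Self-normalized estimate of E_g[T(X)]: weights w_i proportional to
    S(H_hat_i) and summing to one, applied to sufficient statistics T(x_i)."""
    s = S(np.asarray(H_hat, dtype=float))
    w = s / s.sum()                      # normalized weights, sum(w) == 1
    Ts = np.array([T(x) for x in xs])    # N x d matrix of sufficient statistics
    return w @ Ts                        # sum_i w_i T(x_i)

# Illustration with made-up samples and estimates: S(y) = exp(y), T(x) = [x, x^2].
xs = [0.0, 1.0, 2.0]
H_hat = [0.0, 1.0, 2.0]
est = estimate_Eg_T(xs, H_hat, np.exp, lambda x: np.array([x, x**2]))
```

Because the weights are normalized within the sample, only relative values of S matter; better-performing candidates receive proportionally larger weight in the moment estimate.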


3. Algorithms: Gradient-Based Adaptive Stochastic Search for Simulation Optimization (GASSO) and Two-Timescale Gradient-Based Adaptive Stochastic Search for Simulation Optimization (GASSO-2T)

With the prior derivations, we propose the following algorithm.

Algorithm 1 (GASSO)
1. Initialization: choose an exponential family of densities {f(·; θ), θ ∈ Θ}, and specify a small positive constant ε, an initial parameter θ₀, a sample size sequence {N_k} that satisfies N_k → ∞ as k → ∞, a simulation budget sequence {M_k} that satisfies M_k → ∞ as k → ∞, and a stepsize sequence {α_k} that satisfies Σ_{k=0}^∞ α_k = ∞ and Σ_{k=0}^∞ α_k² < ∞. Set k = 0.
2. Sampling: draw samples x_k^i ∼ f(x; θ_k) i.i.d., i = 1, 2, ..., N_k. For each x_k^i, evaluate the performance by simulation to generate h(x_k^i, ξ_k^{i,j}), j = 1, ..., M_k, and compute the performance estimate

Ĥ(x_k^i) = (1/M_k) Σ_{j=1}^{M_k} h(x_k^i, ξ_k^{i,j}).

3. Estimation: compute the normalized weights w_k^i according to

w_k^i = S_{θ_k}(Ĥ(x_k^i)) / Σ_{j=1}^{N_k} S_{θ_k}(Ĥ(x_k^j)),

and then estimate E_{g_{θ_k}}[T(X)] and Cov_{f_{θ_k}}[T(X)] as follows:

Ê_{g_{θ_k}}[T(X)] = Σ_{i=1}^{N_k} w_k^i T(x_k^i),

Ĉov_{f_{θ_k}}[T(X)] = (1/(N_k − 1)) Σ_{i=1}^{N_k} T(x_k^i) T(x_k^i)ᵀ − (1/(N_k² − N_k)) (Σ_{i=1}^{N_k} T(x_k^i)) (Σ_{i=1}^{N_k} T(x_k^i))ᵀ.

4. Updating: update the parameter θ according to

θ_{k+1} = Π_{Θ̃}{θ_k + α_k (Ĉov_{f_{θ_k}}[T(X)] + εI)⁻¹ (Ê_{g_{θ_k}}[T(X)] − E_{f_{θ_k}}[T(X)])},

where Θ̃ ⊆ Θ is a nonempty compact and convex set that closely approximates Θ, and Π_{Θ̃} denotes the projection operator that projects an iterate back onto the set Θ̃ by choosing the closest point in Θ̃.

In the initialization step (step 1), the conditions on the sample size sequence and stepsize sequence are imposed to guarantee convergence, which will be shown in the next section. For continuous problems, some common choices of sampling distribution include the Gaussian family, truncated Gaussian family, Gamma family, and Dirichlet family. The specific choice is largely driven by the constraint set (in other words, the solution space X) of the problem. For example, if the solution space is unbounded (although we assume X is compact for the purpose of convergence analysis, the algorithm can be applied to problems with unbounded solution spaces), then Gaussian is a good choice; if the solution space is a (scaled) simplex (with or without its interior), then Dirichlet is a good choice, since a sample from a Dirichlet distribution of appropriate dimension lies in the solution space; if the solution space has an irregular shape, we may use a truncated Gaussian family and use accept-reject sampling to generate candidate solutions that lie within the solution space.

The sampling step (step 2) draws candidate solutions from the current sampling distribution and runs simulation to obtain their performance estimates. The estimation step (step 3) computes the gradient and Hessian estimates using the samples. One common choice of the shape function S_θ is the level/indicator function used in the CE method and MRAS: I{H(x) ≥ γ_θ}, where γ_θ is the (1 − ρ)-quantile γ_θ ≜ sup_r{r: P_{f_θ}{x ∈ X: H(x) ≥ r} ≥ ρ}, and P_{f_θ} denotes the probability with respect to f(·; θ). Another similar choice is (H(x) − H_l) · I{H(x) ≥ γ_θ}, where H_l is a lower bound on H(x), so that S_θ(H(x)) is always nonnegative. These shape functions essentially prune the level sets below γ_θ and give equal or more weight to the elite solutions in the level set above γ_θ. By varying ρ, we can adjust the percentile of elite samples that are selected to update the next sampling distribution: the smaller ρ is, the fewer elite samples are selected, and hence more emphasis is put on exploiting the neighborhood of the current best solutions. These shape functions are found to work well in terms of balancing the trade-off between solution accuracy and convergence speed when the parameter ρ is relatively small (e.g., ρ ∈ [0.1, 0.2]); intuitively, this is because the inferior candidate solutions usually carry no information about the promising regions that contain good solutions. Note that the indicator function does not satisfy the regularity conditions on S_θ (such as continuity) needed for the analysis, so instead we use the continuous approximation

S_θ(H(x)) = 1 / (1 + e^{−S₀(H(x) − (1−δ)γ_θ)}),   (10)

where S₀ is a large positive constant and δ is a constant in (0, 1). This continuous approximation can be made arbitrarily close to the indicator function I{H(x) ≥ γ_θ} by choosing sufficiently large S₀ and small δ. When the approximation is close enough, we observe no difference in numerical results between using the exact indicator function and the continuous approximation. Other possible choices of the shape function include S_θ(H(x)) = exp(H(x)) and S_θ(H(x)) = exp(H(x) I{H(x) ≥ γ_θ}), which work well for some problems.

The projection operator in the updating step (step 4) is used to ensure the iterate stays inside the parameter space Θ. To make sure the projection yields a unique point inside Θ, we take a compact and convex approximation Θ̃ of Θ. Moreover, to make the projection operation computationally easy, for most of the sampling distributions mentioned above, Θ̃ can be a hyperrectangle of the form Θ̃ = {θ ∈ Θ: a_i ≤ θ_i ≤ b_i} for constants a_i < b_i, i = 1, ..., d, and these constants can be set so as to ensure Θ̃ is a close approximation to Θ; other, more general choices of Θ̃ may also be used (see, e.g., Kushner and Yin 2004, Section 4.3).
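As a deliberately simplified illustration of Algorithm 1, the sketch below runs a GASSO-style loop on a one-dimensional toy problem. Everything problem-specific is an assumption for illustration only: the noisy objective H(x) = −(x − 2)² + noise with optimum x* = 2, the indicator shape function with ρ = 0.2 (one of the choices discussed above, used in place of (10) for simplicity), the Gaussian family with T(x) = [x, x²]ᵀ, and a crude hyperrectangle projection implemented by clipping the mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def H_noisy(x):                              # hypothetical noisy objective, x* = 2
    return -(x - 2.0)**2 + rng.normal(0.0, 0.3, np.shape(x))

def moments_to_theta(mu, s2):                # (mu, sigma^2) -> natural parameters
    return np.array([mu / s2, -1.0 / (2.0 * s2)])

def theta_to_moments(theta):                 # natural parameters -> (mu, sigma^2)
    s2 = -1.0 / (2.0 * theta[1])
    return theta[0] * s2, s2

def gasso(n_iter=150, eps=1e-3, rho=0.2, Mk=20):
    theta = moments_to_theta(0.0, 25.0)      # broad initial N(0, 25)
    for k in range(n_iter):
        Nk = 50 + k                          # growing sample size, N_k -> infinity
        alpha = 2.0 / (k + 10.0)             # sum = inf, sum of squares < inf
        mu, s2 = theta_to_moments(theta)
        x = rng.normal(mu, np.sqrt(s2), Nk)  # step 2: sampling
        Hbar = np.mean([H_noisy(x) for _ in range(Mk)], axis=0)  # M_k replications
        gamma = np.quantile(Hbar, 1.0 - rho)  # empirical (1 - rho)-quantile
        elite = (Hbar >= gamma)               # indicator shape function
        w = elite / elite.sum()               # step 3: normalized weights
        T = np.stack([x, x**2], axis=1)       # sufficient statistics T(x) = [x, x^2]
        Eg_T = w @ T                          # estimate of E_g[T(X)]
        Ef_T = np.array([mu, s2 + mu**2])     # E_f[T(X)] in closed form
        cov = np.cov(T, rowvar=False) + eps * np.eye(2)
        theta = theta + alpha * np.linalg.solve(cov, Eg_T - Ef_T)  # step 4
        mu, s2 = theta_to_moments(theta)      # crude projection: clip the moments
        theta = moments_to_theta(np.clip(mu, -50.0, 50.0), np.clip(s2, 1e-3, 100.0))
    return theta_to_moments(theta)[0]

mu_final = gasso()
```

Under these assumptions, the mean of the sampling distribution should drift toward the optimum while the variance shrinks; the clipping stands in for the projection Π_{Θ̃} onto a hyperrectangle.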

3.1. GASSO-2T
The parameter updating step in GASSO can be interpreted as a stochastic approximation scheme for evaluating the zeros of Π_Θ̃{(Cov_{f_θ}[T(X)] + εI)⁻¹ ∇_θ l(θ; θ)}. To reduce the sample size per iteration and further improve the efficiency, we can also use a stochastic approximation scheme to update the Hessian and gradient estimates, but on a faster timescale. That is, at the kth iteration we carry out the following steps:
1. Draw candidate solutions x_k^i ∼ f(x; θ_k) i.i.d., i = 1, ..., N, and simulate to obtain h(x_k^i, ξ_k^i), i = 1, ..., N.
2. Update the gradient and Hessian estimates on the fast timescale with stepsize β_k.
3. Update the parameter θ_k on the slow timescale with stepsize α_k.

The stepsizes satisfy the standard conditions in two-timescale stochastic approximation (Borkar 2008), as shown in the following assumption.

Assumption 1. The stepsizes satisfy the following conditions:

Σ_k α_k = Σ_k β_k = ∞,   (11)
Σ_k (α_k² + β_k²) < ∞,   (12)
lim_{k→∞} α_k / β_k = 0.   (13)
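For instance (an illustrative, not prescribed, choice), polynomial stepsizes α_k = (k+1)⁻¹ and β_k = (k+1)^{−0.6} satisfy (11)-(13): both raw sums diverge (exponents at most 1), the squared sums converge (exponents 2 and 1.2 are both greater than 1), and α_k/β_k = (k+1)^{−0.4} → 0. A quick numerical sanity check:

```python
import numpy as np

k = np.arange(1.0, 1_000_001.0)      # k = 1, ..., 1e6
alpha = k**-1.0                      # sum alpha_k diverges (harmonic series)
beta = k**-0.6                       # sum beta_k diverges (exponent <= 1)
ratio = alpha / beta                 # equals k**-0.4, tends to zero
sq = np.sum(alpha**2 + beta**2)      # partial sum of a convergent series
```

The slow/fast separation in (13) is what lets the parameter iterate treat the gradient and Hessian estimates as having already equilibrated.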

By Assumption 1, as k goes to infinity, β_k dominates α_k. Therefore the parameter θ is updated on the slow timescale with stepsize α_k, while the gradient and Hessian estimates are updated on the fast timescale with stepsize β_k. When k is large, θ_k remains almost unchanged compared with the gradient and Hessian estimates; therefore θ_k is often said to be quasi-static relative to the fast timescale. This implies that the sampling distribution can be viewed as fixed over many iterations while the gradient and Hessian estimates are updated, and therefore a much smaller sample size is often needed at every iteration.

With the above idea, we propose the following two-timescale variant of GASSO, named GASSO-2T.

Algorithm 2 (GASSO-2T)
1. Initialization: choose an exponential family of densities {f(·; θ)}, and specify a small positive constant ε, an initial parameter θ₀, a sample size N (integer ≥ 1), a simulation budget sequence {M_k} that satisfies M_k → ∞ as k → ∞, and stepsize sequences {α_k} and {β_k} that satisfy Assumption 1. Set k = 0. Set L₀(1) = 0, G₀(1) = 0, P₀(1) = 0, Q₀(1) = 0.
2. Sampling: draw samples x_k^i ∼ f(x; θ_k) i.i.d., i = 1, 2, ..., N. For each x_k^i, evaluate the performance by simulation to generate h(x_k^i, ξ_k^{i,j}), j = 1, ..., M_k, and compute the performance estimate Ĥ(x_k^i) = (1/M_k) Σ_{j=1}^{M_k} h(x_k^i, ξ_k^{i,j}).
3. Estimation:
(a) First, estimate L(θ_k; θ_k) in the following manner: for i = 1, ..., N, update

L_k(i + 1) = L_k(i) + β_k (S_{θ_k}(Ĥ(x_k^i)) − L_k(i)).   (14)

Set L_{k+1}(1) := L_k(N + 1).
(b) Next, estimate E_{g_{θ_k}}[T(X)] as follows: for i = 1, ..., N, update

G_k(i + 1) = G_k(i) + β_k ((S_{θ_k}(Ĥ(x_k^i)) / L_k(N + 1)) T(x_k^i) − G_k(i)).   (15)

Set Ê_{g_{θ_k}}[T(X)] := G_k(N + 1). Set G_{k+1}(1) := G_k(N + 1).
(c) Finally, estimate Cov_{f_{θ_k}}[T(X)] as follows: for i = 1, ..., N, update

P_k(i + 1) = P_k(i) + β_k (T(x_k^i) − P_k(i)),   (16)
Q_k(i + 1) = Q_k(i) + β_k (T(x_k^i) T(x_k^i)ᵀ − Q_k(i)).   (17)

Set Ĉov_{f_{θ_k}}[T(X)] := Q_k(N + 1) − P_k(N + 1) P_k(N + 1)ᵀ. Set P_{k+1}(1) := P_k(N + 1) and Q_{k+1}(1) := Q_k(N + 1).
4. Updating: update the parameter θ according to

θ_{k+1} = Π_{Θ̃}{θ_k + α_k (Ĉov_{f_{θ_k}}[T(X)] + εI)⁻¹ (Ê_{g_{θ_k}}[T(X)] − E_{f_{θ_k}}[T(X)])},   (18)

where Θ̃ ⊆ Θ is a nonempty compact and convex set that closely approximates Θ, and Π_{Θ̃} denotes the projection operator that projects an iterate back onto the set Θ̃ by choosing the closest point in Θ̃.
5. Stopping: check whether some stopping criterion is satisfied. If yes, stop and return the current best sampled solution; otherwise, set k := k + 1 and go back to step 2.

GASSO-2T differs from GASSO in the estimation step (step 3), where the expectation and covariance terms are estimated through stochastic approximation iterations instead of sample average estimates. For example, in step 3(a) the samples generated at the current iteration are used to update the estimate of L(θ_k; θ_k) sequentially, and the final estimate L_k(N+1) of the current iteration becomes the initial estimate L_{k+1}(1) of the next iteration. In contrast to GASSO, which estimates L(θ_k; θ_k) by the sample average of the samples generated at the current iteration, GASSO-2T uses all the samples that have been generated so far.

Zhou and Bhatnagar: GASSO Over Continuous Space. INFORMS Journal on Computing, 2018, vol. 30, no. 1, pp. 154–167, © 2018 INFORMS

Because GASSO-2T uses all the samples generated in the current and previous iterations for estimation, it needs only a fixed and usually small sample size at each iteration. To see this, consider the case where only a single sample x_k is drawn in iteration k (i.e., N = 1 for all iterations). Then, for example, the iterate in step 3(a) can be written as

    L_{k+1} = L_k + β_k (S_{θ_k}(H(x_k)) − L_k) = (1 − β_k) L_k + β_k S_{θ_k}(H(x_k)),

where we suppress the index i because N = 1. By induction, we can write L_{k+1} as a weighted sum of all sample values S_{θ_0}(H(x_0)), ..., S_{θ_k}(H(x_k)) from the first iteration to the current one, with smaller weights on samples further in the past and all weights summing to one. The intuitive reason this weighted-sum estimate works is that θ_k is updated on the slow timescale, so there is little change in θ_k from iteration to iteration as seen by the L_k-iterate updated on the fast timescale; hence, samples from previous iterations can be used to estimate the expected value under the current distribution f(·; θ_k), but their weights need to be discounted to account for the small difference between previous distributions and the current one. Therefore, the effective sample size for estimating the gradient and Hessian in GASSO-2T is roughly the total sample size over all iterations, and this effective sample size automatically increases to infinity as the iteration number goes to infinity, which drives the estimation error to zero in the limit. Theoretically, the sample size per iteration can be N = 1 in GASSO-2T, but in practice we usually use a sample size of 50 to 100 for numerical stability. In contrast, since GASSO only uses the samples generated at the current iteration to estimate the gradient and Hessian, it requires the sample size per iteration to increase to infinity (i.e., N_k → ∞ as k → ∞) to ensure the estimation error goes to zero so that convergence of the algorithm can be guaranteed. This intuition will be made formal in the convergence analysis.
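As an illustration of this weighting, the following sketch (our own, not the paper's implementation) unrolls the fast-timescale recursion for N = 1 with the stepsizes later used in Section 5.1 and checks that the weights telescope to one:

```python
# Illustrative sketch: unroll L_{k+1} = (1 - beta_k) L_k + beta_k S_k over K
# iterations. Sample S_k ends up with weight beta_k * prod_{j>k} (1 - beta_j),
# and the initial estimate L_0 keeps the residual weight prod_j (1 - beta_j).
K = 200
beta = [1.0 / (k + 1500) ** 0.55 for k in range(K)]  # stepsizes from Section 5.1

weights = []
for k in range(K):
    w = beta[k]
    for j in range(k + 1, K):
        w *= 1.0 - beta[j]
    weights.append(w)

residual = 1.0
for b in beta:
    residual *= 1.0 - b

# The weights (plus the residual on L_0) sum to one, and more recent samples
# are weighted more heavily than older ones, as described above.
assert abs(sum(weights) + residual - 1.0) < 1e-9
assert weights[-1] > weights[0]
```

The assertions simply verify the telescoping identity behind the "weighted sum with discounted past samples" interpretation.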

4. Main Convergence Results
In this section, we first present Assumptions 2 and 3. Due to lack of space, the detailed convergence proof is given in the accompanying online supplement. Nonetheless, we state here our main convergence results, viz., Theorem 1, Lemma 1, and Corollary 1, on the convergence of the iterates obtained from GASSO-2T. The proofs of Lemma 1 and Corollary 1 are presented here, as they are independent of the other results. Our analysis draws on techniques from the stochastic approximation literature (Kushner and Yin 2004, Borkar 2008). The convergence proof for GASSO is straightforward and is briefly discussed in Remark 2 of the online supplement.

We do not analyze the rate of convergence of our algorithm. In general, it is hard to obtain rate-of-convergence results for two-timescale stochastic approximation; see, however, Konda and Tsitsiklis (2004), where rate results for a two-timescale scheme with linear (faster and slower) recursions are obtained.

We make the following regularity assumptions.

Assumption 2.
(i) The set X is compact.
(ii) For a fixed θ ∈ Θ, the function S_θ(H(·)) is Lipschitz continuous with Lipschitz constant K_θ < ∞.
(iii) The set of functions {S_θ(·), θ ∈ Θ} is uniformly bounded.
(iv) For any fixed θ′ ∈ Θ, the function L(θ; θ′) is twice continuously differentiable in θ ∈ Θ.

Assumption 3.
(i) The map x ↦ T(x) is bounded on X.
(ii) For any fixed x, the pdf f(x; θ) is continuous in θ ∈ Θ.

Note that Assumption 2(i) is imposed for the convergence analysis; in practice, we may apply the algorithm to more general problems where X is not compact. We use the shorthand notation ∇_θ l(θ′; θ′) for ∇_θ l(θ; θ′)|_{θ=θ′}. The ordinary differential equation (ODE) corresponding to the θ-update in step 4 of Algorithm 2 is as follows:

    dθ(t)/dt = Π̄_Θ̃{ V(θ(t))^{-1} ∇_θ l(θ(t); θ(t)) },   (19)

where the operator Π̄_Θ̃(·) is defined as follows: for any continuous function v: R^d → R^d,

    Π̄_Θ̃(v(y)) = lim_{η→0} ( Π_Θ̃(y + η v(y)) − Π_Θ̃(y) ) / η.

Let Z denote the set of equilibria of (19), where

    Z := { θ ∈ Θ̃ | Π̄_Θ̃{ V(θ)^{-1} ∇_θ l(θ; θ) } = 0 }.

Theorem 1. Under Assumptions 1–3, θ_k → Z w.p.1 as k → ∞. Note that θ_k → Z means lim_{k→∞} dist(θ_k, Z) = 0, where dist(θ, Z) = min_{θ′∈Z} |θ − θ′|.

Lemma 1. The stationary points {θ : ∇_θ l(θ; θ) = 0} lie in the set Z.

Proof. Let W(θ) := −l(θ; θ). Then W(θ) serves as a Lyapunov function for the ODE (19). This can be seen as follows. Note that

    dW(θ)/dt = (∇_θ W(θ))′ (dθ/dt) = −∇_θ l(θ; θ)′ Π̄_Θ̃{ V(θ)^{-1} ∇_θ l(θ; θ) }.

If θ ∈ Θ̃° (the interior of Θ̃), then for η > 0 small enough,

    F_{θ,η} := θ + η V(θ)^{-1} ∇_θ l(θ; θ) ∈ Θ̃°

as well; hence, by the definition of Π̄_Θ̃(·),

    Π̄_Θ̃( V(θ)^{-1} ∇_θ l(θ; θ) ) = V(θ)^{-1} ∇_θ l(θ; θ).

This is also true in the case of θ ∈ ∂Θ̃ (the boundary of Θ̃) when F_{θ,η} ∈ Θ̃° for small η > 0. Since V(θ)^{-1} is positive definite and symmetric,

    −∇_θ l(θ; θ)′ V(θ)^{-1} ∇_θ l(θ; θ) < 0 for θ such that ∇_θ l(θ; θ) ≠ 0, and = 0 otherwise.

Thus the points where ∇_θ l(θ; θ) = 0 lie in the set Z. □

Remark 1. In principle, because of the projection, there may exist spurious fixed points in Z (see Kushner and Yin 2004) for which

    −∇_θ l(θ; θ)′ Π̄_Θ̃{ V(θ)^{-1} ∇_θ l(θ; θ) } = 0

even when ∇_θ l(θ; θ) ≠ 0. Such fixed points, however, would lie on the boundary ∂Θ̃ of Θ̃. If the equilibria of the unprojected ODE (i.e., (19) without the Π̄_Θ̃ operator) lie outside Θ̃, then the equilibria of the projected ODE lie in ∂Θ̃ itself. Such equilibria would also be close to those of the unprojected ODE, since Θ̃ closely approximates Θ.

As argued in Remark 2 of the online supplement, Theorem 1 and Lemma 1 also hold for GASSO. Assuming that the set Z consists only of isolated equilibria, these results together imply that GASSO and GASSO-2T converge to a stationary point that solves ∇_θ l(θ; θ) = 0. We assume this stationary point is a local optimal solution (not a saddle point). Next, we show the relationship between such a point and a local optimal solution to the original problem (1).

Corollary 1. If θ∗ is a local optimal solution to the problem sup_{θ∈Θ} l(θ; θ) and f(·; θ∗) is a degenerate distribution, then the support of f(·; θ∗), denoted by x∗, is a local optimum of problem (1).

Proof. We argue by contradiction. Suppose the claim is false; then there exists a degenerate distribution f(·; θ′) such that its support x′ is in the neighborhood of x∗ and H(x′) > H(x∗). Note that θ∗ and θ′ are in ∂Θ (the boundary of Θ) since both distributions are degenerate, and hence θ′ is in the neighborhood (confined to ∂Θ) of θ∗. Denote by {θ′_n ∈ Θ, n = 1, 2, ...} a sequence in Θ such that θ′_n → θ′ as n → ∞, and similarly let {θ∗_n ∈ Θ, n = 1, 2, ...} be a sequence in Θ such that θ∗_n → θ∗ as n → ∞. Therefore,

    lim_{n→∞} l(θ′_n; θ′_n) = ln S_{θ′}(H(x′)) > ln S_{θ∗}(H(x∗)) = lim_{n→∞} l(θ∗_n; θ∗_n),

where the inequality follows from the strictly increasing property of the logarithm and the forms of S_θ(·) for all the choices considered in this paper. This contradicts the assumption that θ∗ is a local optimal solution to the problem sup_{θ∈Θ} l(θ; θ). Therefore, x∗ must be a local optimal solution of the original problem. □

Remark 2. Corollary 1 implies that GASSO and GASSO-2T converge to a local optimal solution of the original problem if the limiting sampling distribution is a degenerate distribution. Note that the parameter of a nondegenerate distribution may also be a stationary point, but such a θ∗ is an unstable limiting point in practice. The reason is that a sample estimate of the gradient ∇_θ l(θ; θ) will not be exactly zero, due to the variability caused by sampling from a nondegenerate distribution. Hence, the θ-iterate, driven by the gradient-based algorithm, will not rest at such a point. This intuition has also been verified empirically in our numerical experiments, where we have always observed the covariance matrix of the sampling distribution converge to the zero matrix.

5. Numerical Experiments
In this section, we demonstrate the performance of GASSO and GASSO-2T on several benchmark problems and a simulation optimization problem, and conclude with our recommendations on algorithm parameter choices for practitioners.

5.1. Benchmark Problems
We test the algorithms on six benchmark continuous optimization problems with additive noise ξ_x. We consider three cases for the noise ξ_x: (i) stationary noise, where ξ_x is normally distributed with mean 0 and variance 100; (ii) increasing noise, where ξ_x is normally distributed with mean zero and variance ‖x‖₂²; (iii) decreasing noise, where ξ_x is normally distributed with mean zero and variance 100/(‖x‖₂² + 1). The test functions are all 10 dimensional, and their expressions are listed below. Specifically, Powell (H1) is a badly scaled function; Trigonometric (H2) and Rastrigin (H3) are multimodal functions with a large number of local optima, and the number of local optima increases exponentially with the problem dimension; Pintér (H4) and Levy (H5) are multimodal and badly scaled problems; and the weighted sphere function (H6) is a multidimensional concave function.

(1) Powell singular function (n = 10):

    h1(x, ξ_x) = −1 − Σ_{i=2}^{n−2} [ (x_{i−1} + 10x_i)² + 5(x_{i+1} − x_{i+2})² + (x_i − 2x_{i+1})⁴ + 10(x_{i−1} − x_{i+2})⁴ ] + ξ_x,

where x∗ = (0, ..., 0)ᵀ, H1(x∗) = E[h1(x∗, ξ_x)] = −1.

(2) Trigonometric function (n = 10):

    h2(x, ξ_x) = −1 − Σ_{i=1}^{n} [ 8 sin²(7(x_i − 0.9)²) + 6 sin²(14(x_i − 0.9)²) + (x_i − 0.9)² ] + ξ_x,

where x∗ = (0.9, ..., 0.9)ᵀ, H2(x∗) = E[h2(x∗, ξ_x)] = −1.

(3) Rastrigin function (n = 10):

    h3(x, ξ_x) = −Σ_{i=1}^{n} (x_i² − 10 cos(2πx_i)) − 10n − 1 + ξ_x,

where x∗ = (0, ..., 0)ᵀ, H3(x∗) = E[h3(x∗, ξ_x)] = −1.

(4) Pintér's function (n = 10):

    h4(x, ξ_x) = −[ Σ_{i=1}^{n} i x_i² + Σ_{i=1}^{n} 20 i sin²(x_{i−1} sin x_i − x_i + sin x_{i+1}) + Σ_{i=1}^{n} i log₁₀(1 + i(x_{i−1}² − 2x_i + 3x_{i+1} − cos x_i + 1)²) ] − 1 + ξ_x,

where x∗ = (0, ..., 0)ᵀ, H4(x∗) = E[h4(x∗, ξ_x)] = −1.

(5) Levy function (n = 10):

    h5(x, ξ_x) = −1 − sin²(πy₁) − Σ_{i=1}^{n−1} [ (y_i − 1)² (1 + 10 sin²(πy_i + 1)) ] − (y_n − 1)² (1 + 10 sin²(2πy_n)) + ξ_x,

where y_i = 1 + x_i/4, x∗ = (0, ..., 0)ᵀ, H5(x∗) = E[h5(x∗, ξ_x)] = −1.

(6) Weighted sphere function (n = 10):

    h6(x, ξ_x) = −1 − Σ_{i=1}^{n} i x_i² + ξ_x,

where x∗ = (0, ..., 0)ᵀ, H6(x∗) = E[h6(x∗, ξ_x)] = −1.

We compare our algorithms to the CEOCBA method
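As a concrete instance, here is a minimal Python sketch (our own illustration, not the authors' code) of the noisy Rastrigin objective h3 under the three noise regimes described above:

```python
import math
import random

# Minimal sketch of the noisy 10-D Rastrigin objective h3 with the three
# additive-noise regimes (stationary, increasing, decreasing). Each call
# returns one noisy evaluation h3(x, xi_x); the true objective H3(x) is
# its expectation.
def h3(x, noise="stationary", rng=random):
    n = len(x)
    deterministic = -sum(xi * xi - 10.0 * math.cos(2.0 * math.pi * xi)
                         for xi in x) - 10.0 * n - 1.0
    norm_sq = sum(xi * xi for xi in x)
    if noise == "stationary":
        var = 100.0
    elif noise == "increasing":
        var = norm_sq
    elif noise == "decreasing":
        var = 100.0 / (norm_sq + 1.0)
    else:
        raise ValueError(noise)
    return deterministic + rng.gauss(0.0, math.sqrt(var))

# At the optimum x* = (0, ..., 0), the deterministic part equals -1, so
# averaging repeated noisy evaluations recovers H3(x*) = -1.
```

The other five benchmarks can be coded the same way by swapping in the corresponding deterministic part.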

proposed by He et al. (2010). CEOCBA is a state-of-the-art stochastic optimization algorithm. It incorporates the successful ranking-and-selection procedure OCBA (Chen et al. 2000) into each iteration of the well-known CE method by smartly allocating the simulation budget to candidate solutions at every iteration. He et al. (2010) have shown that CEOCBA is a great improvement on the standard CE method with equal allocation. Theoretically, the OCBA approach applies only to the case of additive Gaussian noise in the objective function, although heuristic extensions to other additive noises are possible. In contrast, our method does not require such additive and/or Gaussian assumptions on the simulation noise and is valid for general forms of objective functions.

In all three algorithms, we use the independent multivariate normal distribution N(µ_k, Σ_k) as the parameterized distribution f(·; θ_k), where the covariance matrix is diagonal. We also test an instance of GASSO that uses the correlated multivariate normal with a full covariance matrix as the sampling distribution. The two GASSO implementations are denoted GASSO (independent Normal) and GASSO (multivariate Normal), respectively, in the following tables and figures. The initial mean µ_0 is chosen randomly according to a uniform distribution on [−30, 30]ⁿ, and the initial covariance matrix is set to Σ_0 = 1,000 I_{n×n}, where n is the dimension of the problem and I_{n×n} is the identity matrix of size n. We observe in the experiments that the performance of the algorithms is insensitive to the initial candidate solutions if the initial variance is large enough.

Since CEOCBA uses the elite samples at each iteration to update the sampling distribution, a comparable choice in our algorithms is to set the shape function to the indicator function S_{θ_k}(h(x, ξ)) = I{h(x, ξ) ≥ γ_{θ_k}}. However, the indicator function does not satisfy our assumption on the shape function in the convergence analysis, so we use a continuous approximation of the form (10) with S₀ = 10⁵ and δ = 10⁻³, which makes S_{θ_k}(h(x, ξ)) a very close approximation to I{h(x, ξ) ≥ γ_{θ_k}}. In all three algorithms, the quantile parameter ρ is set to 0.1, and the (1 − ρ)-quantile γ_{θ_k} is estimated by the (1 − ρ) sample quantile of the function values of all the candidate solutions generated at the kth iteration. In our experiments, we observe that the results using these two shape functions are indistinguishable (which is not surprising, given the close approximation). In practice, to simplify implementation, we can simply use the indicator function.
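For readers implementing this step, the practical indicator shape function with the batch-quantile threshold can be sketched as follows (our own illustration; the function name is ours):

```python
# Sketch of the practical shape-function step: score each candidate by the
# indicator I{h >= gamma}, where gamma is the (1 - rho) sample quantile of
# the current batch of noisy function values (rho = 0.1 in the experiments).
def indicator_shape(values, rho=0.1):
    ranked = sorted(values)
    gamma = ranked[int((1.0 - rho) * (len(ranked) - 1))]  # (1 - rho) sample quantile
    return [1.0 if v >= gamma else 0.0 for v in values], gamma
```

Roughly the top ρ fraction of the batch receives weight one, mirroring the elite-sample selection used by CEOCBA.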

The sample size for each iteration is set to N = 1,000 for GASSO and CEOCBA; the sample size is N = 100 for GASSO-2T, since GASSO-2T can use a small sample size thanks to the two-timescale updating scheme. Each candidate solution is evaluated M = 10 times to estimate its performance in GASSO and GASSO-2T; for a fair comparison, the simulation budget per iteration in CEOCBA is set to T = NM = 10,000. Note that although asymptotic convergence requires the sample size N_k and simulation budget M_k to go to infinity as the iteration count k goes to infinity, it is sufficient to use constant N and M in practice when the total number of iterations is finite. Other parameters in GASSO and GASSO-2T are set as follows: the small constant ε = 10⁻¹⁰, and the stepsizes α_k = 50/(k + 1,500)^0.6 and β_k = 1/(k + 1,500)^0.55, which satisfy Assumption 1. Other parameters in CEOCBA are set as follows: the initial number of function evaluations for each candidate solution n₀ = 5 and the budget increment ∆ = 100; we also apply the smoothing-parameter updating procedure of de Boer et al. (2005) to the parameter update to prevent premature convergence, with the smoothing parameter ν = 0.5 as suggested in He et al. (2010). For all algorithms, we have implemented versions that use common random numbers (CRN) in evaluating the performance across different candidate solutions of the same iteration; these versions


Table 1. Stationary Noise—Average Performance of GASSO (Independent Normal), GASSO (Multivariate Normal), GASSO-2T, and CEOCBA

                         GASSO (indep. Nor.)   GASSO (multi. Nor.)   GASSO-2T            CEOCBA
Variable            H*   H̄*       std_err     H̄*       std_err     H̄*       std_err   H̄*       std_err
Powell H1           −1   −1.000    0.000       −1.001    0.000       −1.001    0.000     −5.144    3.271
Trigonometric H2    −1   −1.000    0.000       −1.000    0.000       −1.000    0.000     −1.000    0.000
Rastrigin H3        −1   −1.000    0.000       −1.000    0.000       −1.040    0.028     −1.000    0.000
Pintér H4           −1   −1.002    0.044       −1.000    0.000       −1.565    0.017     −1.809    0.015
Levy H5             −1   −1.012    0.000       −1.002    0.000       −1.000    0.000     −1.002    0.000
Sphere H6           −1   −1.000    0.000       −1.000    0.000       −1.000    0.000     −1.000    0.000

have demonstrated noticeable improvement over versions without CRN. The results presented next are the versions with CRN. We run each of the three algorithms 50 times independently. The average performance of the algorithms is presented in Table 1 and Figure 1 in this paper and in Tables 2–3 and Figures 2–3 in the online supplement. In particular, Table 1/Figure 1, Table 2/Figure 2 (online supplement), and Table 3/Figure 3 (online supplement) show the results for stationary noise, increasing noise, and decreasing noise, respectively. In the tables, H* denotes the true optimal value of the test function, H̄* the average of the function values H(µ_K) returned by the 50 runs of an algorithm (where K is the number of iterations), and std_err the standard error of the returned function values. In the experiments, we found that the computation time of function evaluations dominates that of the other steps, so the figures compare the performance of the algorithms with respect to the total number of function evaluations.

For all the test functions, the average optimal function values returned by GASSO (independent Normal) and GASSO (multivariate Normal) are very close, but GASSO (multivariate Normal) usually converges slightly faster, probably because it uses a larger family of sampling distributions. GASSO-2T converges faster than all the other algorithms and often reduces the total number of function evaluations needed for convergence by a factor of 3–4, which confirms the benefit of the two-timescale scheme. Overall, CEOCBA and GASSO have similar performance in terms of solution accuracy and convergence speed, but GASSO is more accurate on the Powell and Pintér functions. It is worth

Table 2. Product Promotion Example—Average Performance of GASSO, GASSO-2T, and SMRAS

                     GASSO              GASSO-2T           SMRAS
Variable             H̄*      std_err   H̄*      std_err   H̄*      std_err
Q = 10, S = 5        305.77   0.28      305.95   0.23      301.39   0.89
Q = 10, S = 10       352.62   0.78      352.88   0.61      344.75   1.33

mentioning that although GASSO currently allocates the simulation budget equally across candidate solutions, its performance is very comparable to CEOCBA's and is sometimes more accurate. The OCBA scheme may also be incorporated into GASSO and GASSO-2T to further improve their empirical performance.

5.2. A Simulation Optimization Example
We also test our algorithms on a simulation optimization problem of product promotion management, adapted from Jacko (2016). Retail stores or supermarkets need to decide how to allocate products of Q different categories to S promotion shelves, where these products are more likely to attract customers and bring in more revenue to the retailers. Each promotion shelf has a limited space B_s. There is a random customer demand D_q for each product category q. Each sold product of category q yields a profit r_q, and each unsold product remains in the promotion space and incurs a holding cost h_q per day. The goal is to determine x_{s,q}, the allocation of product category q (= 1, ..., Q) to promotion shelf s (= 1, ..., S), so as to maximize the total profit over a promotion period of T days. This optimization problem can be written as follows:

    max_{x_{s,q}}  Σ_{t=1}^{T} Σ_{q=1}^{Q} E[ r_q min(D_q(t), I_q(t)) − h_q max(I_q(t) − D_q(t), 0) ]   (20)

    s.t.  I_q(1) = Σ_{s=1}^{S} x_{s,q};  I_q(t+1) = max(I_q(t) − D_q(t), 0),  t = 1, ..., T,   (21)

          Σ_{q=1}^{Q} x_{s,q} ≤ B_s,  s = 1, ..., S,   (22)

          x_{s,q} ≥ 0,  s = 1, ..., S,  q = 1, ..., Q.   (23)

In the constraint (21), I_q(t) represents the inventory level of product category q at time t. The constraints (22) and (23) specify the budget constraint and the nonnegativity constraint on the allocation, respectively. We ignore the integrality constraint on the allocation, assuming any solution can be rounded to a nearest integer solution with similar performance. Notice that the dimension of the problem is Q × S. The objective function (20) can only be evaluated through simulation, via


[Figure 1. (Color online) Stationary Noise—Average Performance of GASSO (Independent Normal), GASSO (Multivariate Normal), GASSO-2T, and CEOCBA. Six panels (10-D Powell, Trigonometric, Rastrigin, Pintér, Levy, and Sphere functions) plot the function value against the number of function evaluations on a log scale for each algorithm, together with the true optimal value.]

simulating the evolution of I_q(t) according to (21). In our numerical experiment, we use the following problem setting: T = 7 (i.e., the time horizon is one week), Q = 10, S = 5 or 10, B_s = 5 for all s, and D_q follows a uniform distribution on [0, 20]; r_q is randomly generated from a uniform distribution on [0, 20], and the specific instance presented here is [15.2, 15.4, 6.4, 7.2, 17.1, 18.2, 13.8, 19.6, 15.8, 15.0]; h_q is generated from a uniform distribution on [0, 10], and the specific instance presented here is [2.9, 7.6, 2.5, 3.9, 6.8, 1.5, 6.1, 7.1, 1.9, 4.4].
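The single-run simulator implied by (20) and (21) can be sketched as follows (our own illustration; variable and function names are ours):

```python
import random

# Hedged sketch of one noisy evaluation of objective (20): roll the inventory
# recursion (21) forward for T days, accumulating sales revenue minus holding
# cost. The true objective is the expectation of this sample value.
def simulate_profit(x, r, h, T=7, rng=random):
    """x[s][q]: allocation of category q to shelf s; r, h: profit/holding cost."""
    Q = len(r)
    S = len(x)
    inventory = [sum(x[s][q] for s in range(S)) for q in range(Q)]  # I_q(1)
    profit = 0.0
    for _ in range(T):
        for q in range(Q):
            d = rng.uniform(0.0, 20.0)                 # demand D_q(t) ~ U[0, 20]
            profit += r[q] * min(d, inventory[q])      # sales revenue
            inventory[q] = max(inventory[q] - d, 0.0)  # recursion (21)
            profit -= h[q] * inventory[q]              # holding cost on unsold stock
    return profit

# Instance from the text (assumed here for illustration).
r_q = [15.2, 15.4, 6.4, 7.2, 17.1, 18.2, 13.8, 19.6, 15.8, 15.0]
h_q = [2.9, 7.6, 2.5, 3.9, 6.8, 1.5, 6.1, 7.1, 1.9, 4.4]

random.seed(0)
profit = simulate_profit([[1.0] * 10 for _ in range(5)], r_q, h_q)  # one noisy sample
```

Note that the holding-cost term uses max(I_q(t) − D_q(t), 0), which equals the next day's inventory I_q(t+1), so it is applied directly to the post-demand stock.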


[Figure 2. (Color online) Product Promotion Example—Average Performance of GASSO, GASSO-2T, and SMRAS. Two panels (Q = 10, S = 5 and Q = 10, S = 10) plot the function value against the number of function evaluations on a log scale for each algorithm.]

The constraints (22) and (23) imply that, for each s, the solution vector x_s = [x_{s,1}, ..., x_{s,Q}] lies inside a (Q + 1)-dimensional simplex (multiplied by B_s). Therefore, in the implementation of GASSO and GASSO-2T, we use the Dirichlet distribution (which is an exponential family of distributions) as the parameterized sampling distribution, because the support of a Dirichlet distribution is a simplex, which means that any sample generated from an appropriate Dirichlet distribution satisfies the constraints (22) and (23). More specifically, the pdf of the sampling distribution is f(y; θ) = ∏_{s=1}^{S} f(y_s; θ_s), where for each s,

    f(y_s; θ_s) = ( ∏_{i=1}^{Q+1} y_{s,i}^{θ_{s,i} − 1} ) / B(θ_s),   y_{s,i} ≥ 0 ∀ i,  Σ_{i=1}^{Q+1} y_{s,i} = 1,

where B(θ_s) is a multinomial Beta function. By taking the transformation x_{s,i} = B_s × y_{s,i}, i = 1, ..., Q, we note that the x_{s,i} satisfy the constraints (22) and (23). We set the initial parameter θ to an all-one vector, the stepsizes α_k = 50/(k + 1,000)^0.6 and β_k = 1/(k + 1,000)^0.501, and the sample size N = 200 per iteration for GASSO-2T; all other parameter settings are the same as in the previous section.

For this problem, although we can still apply

CEOCBA numerically, it is not a fair comparison because CEOCBA is derived under the assumption of additive Gaussian noise in the objective function, which is clearly far from satisfied by the objective function in this example. Instead, we compare performance with SMRAS, proposed by Hu et al. (2008), which extends the very competitive model-based algorithm MRAS to the stochastic optimization setting. For a fair comparison with GASSO and GASSO-2T, we use the same sample size N = 1,000 for each iteration and M = 10 evaluations of each candidate solution, and we use the Dirichlet distribution above as the sampling distribution as well. All the other parameters are set as recommended in Hu et al. (2008).

Table 2 and Figure 2 show the average performance of the three algorithms based on 10 independent runs, for problems with 50 (Q = 10, S = 5) and 100 (Q = 10, S = 10) dimensions, respectively. The results show that GASSO and GASSO-2T converge to better solutions than SMRAS; SMRAS converges faster than GASSO but slower than GASSO-2T; and GASSO-2T again demonstrates a considerable computational saving compared with GASSO. Lastly, we should mention that many resource allocation problems take a form similar to this problem, with the same budget and nonnegativity constraints on the resource allocation, for example, users' power allocation for transmitting signals through different channels; see Hayashi and Luo (2009). If GASSO or GASSO-2T is applied to these problems, the Dirichlet distribution would be a natural choice for the sampling distribution because its support coincides with the constraint set.
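A sketch of this constraint-respecting sampling step (our illustration, using the standard Gamma representation of the Dirichlet; function names are ours):

```python
import random

# Sketch of the Dirichlet sampling step: draw y_s on the (Q+1)-simplex and
# scale the first Q coordinates by the shelf budget B_s, so the resulting
# allocation satisfies constraints (22)-(23) by construction. A Dirichlet
# draw is obtained by normalizing independent Gamma(theta_i, 1) variates.
def sample_shelf_allocation(theta_s, B_s, rng=random):
    g = [rng.gammavariate(a, 1.0) for a in theta_s]  # theta_s has Q + 1 entries
    total = sum(g)
    y = [gi / total for gi in g]                      # y lies on the simplex
    return [B_s * yi for yi in y[:-1]]                # last coordinate is slack

random.seed(1)
x_s = sample_shelf_allocation([1.0] * 11, B_s=5.0)    # Q = 10, all-one theta
```

Since the slack coordinate absorbs the unused budget, sum(x_s) ≤ B_s holds automatically and no projection step is needed for these constraints.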

5.3. Recommendation on Parameter Choice
Based on theoretical intuition and our numerical experience, we give some general recommendations on the parameter settings in the algorithms.

Theoretically, the sample size per iteration needs to increase to infinity as the number of iterations goes to infinity in GASSO, whereas it can be constant for all iterations in GASSO-2T. In practice, since all algorithms stop after a finite number of iterations, we often set the sample size large enough. If the sample size is too small, there is too much noise in the estimates and the algorithm may not converge at all; as the sample size increases, convergence usually improves, but the improvement starts to diminish after a certain point. We recommend N ∈ [500, 1,000] for GASSO and N ∈ [50, 100] for GASSO-2T for problems of dimension under 100; in general, a larger sample size is needed as the problem dimension increases. To find the sweet spot, a practitioner can start with a relatively small sample size and gradually increase the


sample size until no further improvement in convergence is observed. How to adaptively and automatically set the sample size in these algorithms is one of our future research directions, along the lines of prior work; see Byrd et al. (2012) and Hashemi et al. (2014).

The stepsize is probably the most difficult parameter to set in these algorithms, as is common with many stochastic approximation algorithms. Usually, a smaller stepsize sequence gives a more accurate solution but takes more computation time; conversely, a larger stepsize may cause the algorithm to converge too fast and prematurely to a poor local solution. Therefore, we need to balance the initial stepsize and the decay rate of the sequence. We use a stepsize of the form α_k = α₀/(C + k)^α, so, roughly speaking, the ratio α₀/C determines the initial stepsize and the denominator (C + k)^α determines the decay rate. For GASSO, we recommend a large C ∈ [1,000, 20,000] and a small α ∈ (0.5, 0.65) for a slow decay, and then choosing α₀ as large as possible to increase the convergence speed while still maintaining convergence to a good solution (once α₀ becomes too large, the algorithm starts to become nonconvergent). For GASSO-2T, we additionally need to set the stepsize β_k, and we recommend the same form as α_k, i.e., β_k = β₀/(C + k)^β, where the exponent β should be smaller than α and lie in [0.501, 0.55].
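These recommendations can be written down directly; the sketch below (our own, using the constants from the benchmark experiments in Section 5.1) also checks the timescale-separation property required by Assumption 1:

```python
# Recommended stepsize schedules: alpha_k = a0 / (C + k)**a on the slow
# timescale and beta_k = b0 / (C + k)**b on the fast timescale, with b < a
# so that alpha_k / beta_k -> 0 as k -> infinity (Assumption 1).
def alpha_k(k, a0=50.0, C=1500.0, a=0.6):
    return a0 / (C + k) ** a

def beta_k(k, b0=1.0, C=1500.0, b=0.55):
    return b0 / (C + k) ** b

# The ratio alpha_k / beta_k = (a0 / b0) * (C + k)**(b - a) decays like
# (C + k)**(-0.05): the theta-update eventually becomes much slower than
# the gradient/Hessian estimate updates.
ratios = [alpha_k(k) / beta_k(k) for k in (0, 10**4, 10**6, 10**8)]
```

Note that the separation is asymptotic: with these constants, α_k can exceed β_k for small k even though their ratio decays to zero.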

The performance of the algorithms is not very sensitive to the other parameter settings. We recommend the following: the shape function set to I{H(x) > γ_ρ} or (H(x) − H_l) I{H(x) > γ_ρ} (the former is more robust, while the latter is often faster), the quantile parameter ρ ∈ [0.1, 0.2], and the initial parameter of the sampling distribution set to be close to a uniform distribution over the solution space.

6. Conclusion
In this paper, we developed a gradient-based adaptive stochastic search approach to simulation optimization over a continuous solution space. Our proposed algorithm iteratively draws candidate solutions from a parameterized sampling distribution and updates the parameter of the sampling distribution using a direct gradient search over the parameter space. To reduce the number of candidate solutions that need to be generated per iteration, we further incorporated into the algorithm a two-timescale updating scheme, which updates the parameter on the slow timescale and the quantities involved in the parameter update on the fast timescale. We showed asymptotic convergence of the proposed algorithms. Numerical results on several benchmark problems and a simulation optimization problem showed that our algorithms have superior empirical performance and that the two-timescale algorithm improves computational efficiency severalfold. We also provided detailed recommendations for practitioners on the parameter settings of the algorithms.

Acknowledgments
The authors thank Xi Chen, graduate student in the Department of Industrial and Enterprise Systems Engineering at the University of Illinois Urbana–Champaign, for assisting with the numerical experiments in the preliminary version (Zhou et al. 2014) of this work. The authors also thank the Area Editor Marvin Nakayama, the associate editor, and two anonymous referees for their careful reading of the paper and very constructive comments that led to a substantially improved paper.

ReferencesAndradóttir S (2006) An overview of simulation optimization with

random search. Henderson SG, Nelson BL, eds. Handbooks inOperations Research and Management Science: Simulation, Vol. 13(Elsevier, Amsterdam), 617–632.

Barndorff-Nielsen OE (1978) Information and Exponential Families inStatistical Theory (Wiley, New York).

Barton RR, Meckesheimer M (2006) Metamodel-based simulationoptimization. Henderson SG, Nelson BL, eds. Handbooks in Oper-ations Research and Management Science: Simulation, Vol. 13 (Else-vier, Amsterdam), 535–574.

Bhatnagar S, Borkar VS (1998) A two-time-scale stochastic approx-imation scheme for simulation based parametric optimization.Probab. Engrg. Inform. Sci. 12(4):519–531.

Bhatnagar S, Prasad HL, Prashanth LA (2013) Stochastic Recur-sive Algorithms for Optimization: Simultaneous Perturbation Meth-ods. Lecture Notes in Control and Information Sciences Series,Vol. 434 (Springer, London).

Borkar VS (1997) Stochastic approximation with two time scales. Sys-tems Control Lett. 29(5):291–294.

Borkar VS (2008) Stochastic Approximation: A Dynamical Systems View-point (Cambridge University Press, Cambridge, UK).

Byrd RH, Chin GM, Nocedal J, Wu Y (2012) Sample size selection inoptimization methods for machine learning. Math. Programming134(1):127–155.

Chang K-H, Hong LJ, Wan H (2013) Stochastic trust-region response-surface method (STRONG)—A new response-surface frame-work for simulation optimization. INFORMS J. Comput. 25(2):230–243.

Chen C-H, Lee LH (2010) Stochastic Simulation Optimization: An Opti-mal Computing Budget Allocation. System Engineering and Oper-ations Research (World Scientific Publishing, Singapore).

Chen C-H, Lin J, Yucesan E, Chick SE (2000) Simulation budget allo-cation for further enhancing the efficiency of ordinal optimiza-tion. Discrete Event Dynam. Systems: Theory Appl. 10(3):251–270.

Chepuri K, de Mello TH (2005) Solving the vehicle routing problemwith stochastic demands using the cross entropy method. Ann.Oper. Res. 134(1):153–181.

DeBoer PT, Kroese DP, Mannor S, Rubinstein RY (2005) A tutorial onthe cross-entropy method. Ann. Oper. Res. 134(1):19–67.

Homem-de-Mello T, Shapiro A, Spearman ML (1999) Finding optimal material release times using simulation-based optimization. Management Sci. 45(1):86–102.

Dorigo M, Gambardella LM (1997) Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Trans. Evolutionary Comput. 1(3):53–66.

Fu MC (2002a) Optimization for simulation: Theory vs. practice. INFORMS J. Comput. 14(3):192–215.

Fu MC (2002b) Simulation optimization in the future: Evolution or revolution? INFORMS J. Comput. 14(3):226–227.

Fu MC, Chen C-H, Shi L (2008) Some topics for simulation optimization. Mason SJ, Hill RR, Mönch L, Rose O, Jefferson T, Fowler JW, eds. Proc. 2008 Winter Simulation Conf., Miami, FL, 27–38.

Glover F (1990) Tabu search: A tutorial. Interfaces 20(4):74–94.

Goldberg DE (1989) Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley Longman Publishing Co., Inc., Boston).

Zhou and Bhatnagar: GASSO Over Continuous Space. INFORMS Journal on Computing, 2018, vol. 30, no. 1, pp. 154–167, © 2018 INFORMS.

Hashemi F, Ghosh S, Pasupathy R (2014) On adaptive sample rules for stochastic recursions. Tolk A, Diallo SY, Ryzhov IO, Yilmaz L, Buckley S, Miller JA, eds. Proc. 2014 Winter Simulation Conf. (IEEE, Piscataway, NJ), 3959–3970.

Hayashi S, Luo Z (2009) Spectrum management for interference-limited multiuser communication systems. IEEE Trans. Inform. Theory 55(3):1153–1175.

He D, Lee LH, Chen C-H, Fu MC, Wasserkrug S (2010) Simulation optimization using the cross-entropy method with optimal computing budget allocation. ACM Trans. Modeling Comput. Simulation 20(1):4:1–4:22.

Hu J, Fu MC, Marcus SI (2007) A model reference adaptive search method for global optimization. Oper. Res. 55(3):549–568.

Hu J, Fu MC, Marcus SI (2008) A model reference adaptive search method for stochastic global optimization. Comm. Inform. Systems 8(3):245–276.

Hu J, Wang Y, Zhou E, Fu MC, Marcus SI (2012) A survey of some model-based methods for global optimization. Hernández-Hernández D, Minjárez-Sosa A, eds. Optimization, Control, and Applications of Stochastic Systems: In Honor of Onésimo Hernández-Lerma (Birkhäuser, Basel, Switzerland), 157–179.

Huang D, Allen TT, Notz WI, Zeng N (2006) Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Global Optim. 34(3):441–466.

Jacko P (2016) Resource capacity allocation to stochastic dynamic competitors: Knapsack problem for perishable items and index-knapsack heuristic. Ann. Oper. Res. 241(1–2):83–107.

Kiefer J, Wolfowitz J (1952) Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23(3):462–466.

Kleywegt A, Shapiro A, Homem-de-Mello T (2002) The sample average approximation method for stochastic discrete optimization. SIAM J. Optim. 12(2):479–502.

Konda VR, Tsitsiklis JN (2004) Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab. 14(2):796–819.

Kushner HJ, Yin GG (2004) Stochastic Approximation Algorithms and Applications, 2nd ed. (Springer-Verlag, New York).

Larranaga P, Etxeberria R, Lozano JA, Pena JM (1999) Optimization by learning and simulation of Bayesian and Gaussian networks. Technical Report EHU-KZAA-IK-4/99, Department of Computer Science and Artificial Intelligence, University of the Basque Country, Bilbao, Spain.

Molvalioglu O, Zabinsky ZB, Kohn W (2009) The interacting-particle algorithm with dynamic heating and cooling. J. Global Optim. 43(2–3):329–356.

Molvalioglu O, Zabinsky ZB, Kohn W (2010) Meta-control of an interacting-particle algorithm. Nonlinear Anal.: Hybrid Systems 4(4):659–671.

Muhlenbein H, Paaß G (1996) From recombination of genes to the estimation of distributions: I. Binary parameters. Voigt HM, Ebeling W, Rechenberg I, Schwefel HP, eds. Parallel Problem Solving from Nature-PPSN IV (Springer-Verlag, Berlin), 178–187.

Ólafsson S (2006) Metaheuristics. Henderson SG, Nelson BL, eds. Handbooks in Operations Research and Management Science: Simulation, Vol. 13 (Elsevier, Amsterdam), 633–654.

Romeijn HE, Smith RL (1994) Simulated annealing for constrained global optimization. J. Global Optim. 5(2):101–126.

Rubinstein RY (2001) Combinatorial optimization, ants and rare events. Uryasev S, Pardalos PM, eds. Stochastic Optimization: Algorithms and Applications (Kluwer Academic Publishers, Dordrecht, Netherlands), 304–358.

Rubinstein RY, Shapiro A (1993) Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method (John Wiley & Sons, New York).

Scott W, Frazier PI, Powell WB (2011) The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM J. Optim. 21(3):996–1026.

Shi L, Ólafsson S (2000) Nested partitions method for stochastic optimization. Methodol. Comput. Appl. Probab. 2(3):271–291.

Spall JC (1992) Multivariate stochastic approximation using simultaneous perturbation gradient approximation. IEEE Trans. Automatic Control 37(3):332–341.

Wolpert DH (2004) Finding bounded rational equilibria part I: Iterative focusing. Vincent T, ed. Proc. Eleventh Internat. Sympos. Dynam. Games Appl., Tucson, AZ.

Zabinsky ZB (2003) Stochastic Adaptive Search for Global Optimization. Nonconvex Optimization and Its Applications, Vol. 72 (Springer, New York).

Zhou E, Hu J (2012) Combining gradient-based optimization with stochastic search. Rose O, Uhrmacher AM, eds. Proc. 2012 Winter Simulation Conf., Berlin.

Zhou E, Hu J (2014) Gradient-based adaptive stochastic search for non-differentiable optimization. IEEE Trans. Automatic Control 59(7):1818–1832.

Zhou E, Bhatnagar S, Chen X (2014) Simulation optimization via gradient-based stochastic search. Buckley SJ, Miller JA, eds. Proc. 2014 Winter Simulation Conf. (IEEE, Piscataway, NJ), 3869–3879.

Zlochin M, Birattari M, Meuleau N, Dorigo M (2004) Model-based search for combinatorial optimization: A critical survey. Ann. Oper. Res. 131(1–4):373–395.