Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule
Faming Liang
Faming Liang Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule
Abstract
Simulated annealing has been widely used for solving optimization problems. As is well known, it cannot be guaranteed to locate the global optima unless a logarithmic cooling schedule is used; however, the logarithmic schedule is so slow that the required CPU time is prohibitive. We propose a new stochastic optimization algorithm, the so-called simulated stochastic approximation annealing algorithm. Under the framework of stochastic approximation Markov chain Monte Carlo, we show that the new algorithm can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, e.g., a square-root cooling schedule, while still guaranteeing that the global optima are reached as the temperature tends to zero. The new algorithm has been tested on a few benchmark optimization problems, including feed-forward neural network training and protein folding. The numerical results indicate that the new algorithm significantly outperforms simulated annealing and other competitors.
The problem
The optimization problem can be stated as a minimization problem:

min_{x∈X} U(x),

where X is the domain of U(x).

Minimizing U(x) is equivalent to sampling from the Boltzmann distribution

f_τ∗(x) ∝ exp(−U(x)/τ∗)

at a very small value (close to 0) of τ∗.
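This equivalence is easy to see numerically: as τ∗ shrinks, the Boltzmann mass concentrates on the minimizer. A minimal sketch on a toy discrete landscape (the energy values are made up for illustration):

```python
import math

def boltzmann(energies, tau):
    """Normalized Boltzmann probabilities p_i ∝ exp(-U_i / tau)."""
    # subtract the minimum energy first for numerical stability
    u_min = min(energies)
    w = [math.exp(-(u - u_min) / tau) for u in energies]
    z = sum(w)
    return [wi / z for wi in w]

# toy energy landscape: state 2 (0-based) is the global minimum
U = [3.0, 1.5, 0.2, 2.7]
for tau in (10.0, 1.0, 0.1, 0.01):
    p = boltzmann(U, tau)
    print(tau, [round(x, 4) for x in p])
# as tau -> 0, virtually all mass sits on argmin U
```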
Simulated Annealing (Kirkpatrick et al., 1983)
It simulates from a sequence of Boltzmann distributions,

f_τ1(x), f_τ2(x), . . . , f_τm(x),

in a sequential manner, where the temperatures τ1, . . . , τm form a decreasing ladder

τ1 > τ2 > · · · > τm = τ∗ > 0

with τ∗ ≈ 0 and τ1 reasonably large, such that most uphill Metropolis-Hastings (MH) moves at that level can be accepted.
Simulated Annealing: Algorithm
1. Initialize the simulation at temperature τ1 and an arbitrary sample x0 ∈ X.
2. At each temperature τi, simulate the distribution f_τi(x) for ni iterations using the MH sampler. Pass the final sample to the next lower temperature level as the initial sample.
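The two steps above can be sketched as follows; the double-well objective, proposal scale, and geometric demo ladder are illustrative choices, not the slide's settings:

```python
import math, random

def simulated_annealing(U, x0, temps, n_iters, propose):
    """Minimal simulated annealing: at each level of a decreasing
    temperature ladder, run an MH chain targeting f_tau(x) ∝
    exp(-U(x)/tau), passing the last state down the ladder."""
    x, best = x0, x0
    for tau, n in zip(temps, n_iters):
        for _ in range(n):
            y = propose(x)                          # symmetric proposal assumed
            accept_p = math.exp(min(0.0, -(U(y) - U(x)) / tau))
            if random.random() < accept_p:          # MH accept/reject
                x = y
            if U(x) < U(best):                      # track best state seen
                best = x
    return best

# usage: minimize a 1-D double-well U(x) = (x^2 - 1)^2 with minima at ±1
random.seed(1)
U = lambda x: (x * x - 1.0) ** 2
temps = [2.0 * 0.9 ** k for k in range(60)]         # geometric ladder (demo only)
best = simulated_annealing(U, 5.0, temps, [50] * 60,
                           lambda x: x + random.gauss(0.0, 0.5))
print(round(best, 3), round(U(best), 4))
```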
Simulated Annealing: Difficulty
The major difficulty with simulated annealing is in choosing the cooling schedule:
▶ Logarithmic cooling schedule O(1/log(t)): it ensures that the simulation converges to the global minima of U(x) with probability 1. However, it is so slow that the required running time is unaffordable.
▶ Linear or geometric cooling schedule: a linear or geometric cooling schedule is commonly used, but, as shown in Holley et al. (1989), these schedules can no longer guarantee that the global minima are reached.
Stochastic Approximation Monte Carlo (SAMC)
SAMC is a general-purpose MCMC algorithm. To be precise, it is an adaptive MCMC algorithm and also a dynamic importance sampling algorithm. Its self-adjusting mechanism makes it immune to local traps.
▶ Let E1, . . . , Em denote a partition of the sample space X, made according to the energy function as follows:

E1 = {x : U(x) ≤ u1}, E2 = {x : u1 < U(x) ≤ u2}, . . . , Em−1 = {x : um−2 < U(x) ≤ um−1}, Em = {x : U(x) > um−1},   (1)

where u1 < u2 < · · · < um−1 are prespecified numbers.
▶ Let {γt} be a positive, non-increasing sequence satisfying the conditions

∑_{t=1}^∞ γt = ∞,  ∑_{t=1}^∞ γt² < ∞.
Stochastic Approximation Monte Carlo: Algorithm
1. (Sampling) Simulate a sample x_{t+1} with a single MH update, which starts with x_t and leaves the following distribution invariant:

f_{θt,τ∗}(x) ∝ ∑_{i=1}^m exp{−U(x)/τ∗ − θt^(i)} I(x ∈ Ei),   (2)

where I(·) is the indicator function.
2. (θ-updating) Set

θ_{t+1} = θt + γ_{t+1} H_{τ∗}(θt, x_{t+1}),   (3)

where H_{τ∗}(θt, x_{t+1}) = e_{t+1} − π, e_{t+1} = (I(x_{t+1} ∈ E1), . . . , I(x_{t+1} ∈ Em)), and π = (π1, . . . , πm).
Obviously, the sampler mixes poorly over the domain X when the temperature τ∗ is very low: in this case, only very few points will be sampled from each subregion.
Space Annealing SAMC (Liang, 2007)
Suppose that the sample space has been partitioned as in (1), with u1, . . . , um−1 arranged in ascending order. Let κ(u) denote the index of the subregion that a sample x with energy u belongs to; for example, if x ∈ Ej, then κ(U(x)) = j. Let X^(t) denote the sample space at iteration t.

Space annealing SAMC starts with X^(1) = ∪_{i=1}^m Ei, and then iteratively shrinks the sample space by setting

X^(t) = ∪_{i=1}^{κ(u_min^(t) + ℵ)} Ei,   (4)

where u_min^(t) is the minimum energy value obtained by iteration t, and ℵ is a user-specified parameter.

A major shortcoming of this algorithm is that it tends to get trapped in local energy minima when ℵ is small and the proposal is relatively local.
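For illustration, the subregion index κ(u) and the shrunken space in (4) can be computed as below; the cut points and the value of ℵ are made-up demo numbers:

```python
import bisect

def kappa(u, cutpoints):
    """1-based index of the subregion that energy u falls in, for the
    partition E_1 = {U <= u_1}, E_i = {u_{i-1} < U <= u_i}, ...,
    E_m = {U > u_{m-1}}; bisect_left respects the '<= upper cut' rule."""
    return bisect.bisect_left(cutpoints, u) + 1

def shrunken_space(u_min_t, cutpoints, aleph):
    """Subregion indices kept at iteration t:
    X^(t) = E_1 ∪ ... ∪ E_{kappa(u_min^(t) + aleph)}."""
    return list(range(1, kappa(u_min_t + aleph, cutpoints) + 1))

cuts = [0.0, 1.0, 2.0, 3.0]        # u_1 < ... < u_{m-1}, so m = 5
print(kappa(1.5, cuts))            # energy 1.5 lies in E_3
print(shrunken_space(0.4, cuts, 1.0))
```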
SAA Algorithm
Simulated stochastic approximation annealing, or SAA for short, is a combination of simulated annealing and stochastic approximation.
▶ Let {Mk, k = 0, 1, . . .} be a sequence of positive numbers increasing to infinity, which serve as truncation bounds for {θt}.
▶ Let σt be a counter for the number of truncations up to iteration t, and σ0 = 0.
▶ Let θ̃0 be a fixed point in Θ.
▶ E1, . . . ,Em is the partition of the sample space.
▶ π = (π1, . . . , πm) is the desired sampling distribution of the m subregions.
▶ {γt} is a gain factor sequence.
▶ {τt} is a temperature sequence.
SAA Algorithm
1. (Sampling) Simulate a sample x_{t+1} with a single MH update, which starts with x_t and leaves the following distribution invariant:

f_{θt,τ_{t+1}}(x) ∝ ∑_{i=1}^m exp{−U(x)/τ_{t+1} − θt^(i)} I(x ∈ Ei),   (5)

where I(·) is the indicator function.
2. (θ-updating) Set

θ_{t+1/2} = θt + γ_{t+1} H_{τ_{t+1}}(θt, x_{t+1}),   (6)

where H_{τ_{t+1}}(θt, x_{t+1}) = e_{t+1} − π, e_{t+1} = (I(x_{t+1} ∈ E1), . . . , I(x_{t+1} ∈ Em)), and π = (π1, . . . , πm).
3. (Truncation) If ∥θ_{t+1/2}∥ ≤ M_{σt}, set θ_{t+1} = θ_{t+1/2}; otherwise, set θ_{t+1} = θ̃0 and σ_{t+1} = σt + 1.
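The three steps can be sketched on a finite state space as below; the toy energy, the constants C1 and C2, the offset in the gain factor, and the truncation bounds Mk = 10⁴·2^k are demo assumptions, not the paper's settings:

```python
import math, random

def saa(U, states, cuts, pi, n_iter, c1=10.0, c2=1.0, tau_star=0.01):
    """Minimal SAA sketch with a uniform independence proposal,
    gain factor gamma_t = C1/(t+100), and the square-root cooling
    schedule tau_t = C2/sqrt(t) + tau_star."""
    m = len(cuts) + 1
    def region(x):                              # 0-based subregion index
        return sum(U(x) > c for c in cuts)
    theta = [0.0] * m                           # theta_0 = tilde(theta)_0 = 0
    sigma = 0                                   # truncation counter
    x = random.choice(states)
    best = x
    for t in range(1, n_iter + 1):
        gamma = c1 / (t + 100.0)                # gain factor (offset keeps it < 1)
        tau = c2 / math.sqrt(t) + tau_star      # square-root cooling schedule
        # 1. sampling: one MH step leaving f_{theta_t, tau_{t+1}} invariant (5)
        y = random.choice(states)
        i, j = region(x), region(y)
        log_r = -(U(y) - U(x)) / tau - (theta[j] - theta[i])
        if random.random() < math.exp(min(0.0, log_r)):
            x, i = y, j
        if U(x) < U(best):
            best = x
        # 2. theta-updating: theta_{t+1/2} = theta_t + gamma*(e_{t+1} - pi) (6)
        half = [th + gamma * ((k == i) - pi[k]) for k, th in enumerate(theta)]
        # 3. truncation: restart from tilde(theta)_0 if the bound is exceeded
        if max(abs(th) for th in half) <= 1e4 * 2 ** sigma:
            theta = half
        else:
            theta, sigma = [0.0] * m, sigma + 1
    return best, theta

random.seed(7)
U = lambda s: (s - 3) ** 2                      # toy energy, minimum at state 3
best, theta = saa(U, list(range(10)), [1.0, 5.0], [1/3, 1/3, 1/3], 20000)
print(best, [round(th, 2) for th in theta])
```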
Features of SAA
▶ Self-adjusting mechanism: this distinguishes SAA from simulated annealing. For simulated annealing, the change of the invariant distribution is solely determined by the temperature ladder, while for SAA it is determined by both the temperature ladder and the past samples. As a result, SAA can converge with a much faster cooling schedule.
▶ Sample space shrinkage: compared to space annealing SAMC, SAA also shrinks its sample space with iterations, but in a soft way: it gradually biases sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of getting trapped in local energy minima.
▶ Convergence: from the perspective of practical applications, SAA achieves essentially the same convergence toward the global energy minima as simulated annealing.
Formulation of SAA
The SAA algorithm can be formulated as a SAMCMC algorithm aimed at solving the integral equation

h_{τ∗}(θ) = ∫ H_{τ∗}(θ, x) f_{θ,τ∗}(x) dx = 0,   (7)

where f_{θ,τ∗}(x) denotes a density function depending on θ and the limiting temperature τ∗, and h is called the mean field function.

SAA works by solving a system of equations defined along the temperature sequence {τt}:

h_{τt}(θ) = ∫ H_{τt}(θ, x) f_{θ,τt}(x) dx = 0,  t = 1, 2, . . . ,   (8)

where f_{θ,τt}(x) is a density function depending on θ and the temperature τt.
Conditions on mean field function
For SAA, the mean field function is given by

h_τ(θ) = ∫ H_τ(θ, x) f_{θ,τ}(x) dx = (S_τ^(1)(θ)/S_τ(θ) − π1, . . . , S_τ^(m)(θ)/S_τ(θ) − πm),   (9)

for any fixed value of θ ∈ Θ and τ ∈ T, where S_τ^(i)(θ) = ∫_{Ei} e^{−U(x)/τ} dx / e^{θ^(i)}, and S_τ(θ) = ∑_{i=1}^m S_τ^(i)(θ).

Further, we define

v_τ(θ) = (1/2) ∑_{i=1}^m (S_τ^(i)(θ)/S_τ(θ) − πi)²,   (10)

which is the so-called Lyapunov function in the stochastic approximation literature.

Then it is easy to verify that SAA satisfies the stability condition.
Stability Condition: (A1)
The function h_τ(θ) is bounded and continuously differentiable with respect to both θ and τ, and there exists a non-negative, upper-bounded, and continuously differentiable function v_τ(θ) such that for any ∆ > δ > 0,

sup_{δ ≤ d((θ,τ),L) ≤ ∆} ∇_θ^T v_τ(θ) h_τ(θ) < 0,   (11)

where L = {(θ, τ) : h_τ(θ) = 0, θ ∈ Θ, τ ∈ T} is the zero set of h_τ(θ), and d(z, S) = inf_y{∥z − y∥ : y ∈ S}. Further, the set v(L) = {v_τ(θ) : (θ, τ) ∈ L} is nowhere dense.
Conditions on observation noise
Observation noise: ξ_{t+1} = H_{τ_{t+1}}(θt, x_{t+1}) − h_{τ_{t+1}}(θt).
▶ One can directly impose conditions on the observation noise; see, e.g., Kushner and Clark (1978), Kulkarni and Horn (1995), and Chen (2002). These conditions are usually very weak, but difficult to verify.
▶ Alternatively, one can impose conditions on the Markov transition kernel, which lead to the required conditions on the observation noise.
Doeblin condition: (A2)
(A2) (Doeblin condition) For any given θ ∈ Θ and τ ∈ T, the Markov transition kernel P_{θ,τ} is irreducible and aperiodic. In addition, there exist an integer l, 0 < δ < 1, and a probability measure ν such that for any compact subset K ⊂ Θ,

inf_{θ∈K, τ∈T} P_{θ,τ}^l(x, A) ≥ δν(A),  ∀x ∈ X, ∀A ∈ B_X,

where B_X denotes the Borel set of X; that is, the whole support X is a small set for each kernel P_{θ,τ}, θ ∈ K and τ ∈ T.

Uniform ergodicity is slightly stronger than V-uniform ergodicity, but it serves SAA just right: the function H_τ(θ, x) is bounded, and thus the mean field function and the observation noise are bounded. If the drift function V(x) ≡ 1, then V-uniform ergodicity reduces to uniform ergodicity.
Doeblin condition
To verify (A2), one may assume that X is compact, U(x) is bounded on X, and the proposal distribution q(x, y) satisfies the local positivity condition:

(Q) There exist δq > 0 and ϵq > 0 such that, for every x ∈ X, |x − y| ≤ δq ⇒ q(x, y) ≥ ϵq.
Conditions on {γt} and {τt}: (A3)
(i) The sequence {γt} is positive, non-increasing, and satisfies the following conditions:

∑_{t=1}^∞ γt = ∞,  (γ_{t+1} − γt)/γt = O(γ_{t+1}^ι),  ∑_{t=1}^∞ γt^{(1+ι′)/2}/√t < ∞,   (12)

for some ι ∈ [1, 2) and ι′ ∈ (0, 1).

(ii) The sequence {τt} is positive, non-increasing, and satisfies the following conditions:

lim_{t→∞} τt = τ∗,  τt − τ_{t+1} = o(γt),  ∑_{t=1}^∞ γt |τt − τ_{t−1}|^{ι″} < ∞,   (13)

for some ι″ ∈ (0, 1), and

∑_{t=1}^∞ γt |τt − τ∗| < ∞.   (14)
Conditions on {γt} and {τt}
For the sequences {γt} and {τt}, one can typically set

γt = C1/t^ς,  τt = C2/√t + τ∗,   (15)

for some constants C1 > 0, C2 > 0, and ς ∈ (0.5, 1]. Then it is easy to verify that (15) satisfies (A3).
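A quick numerical look at (15), with illustrative constants, shows how much faster the square-root schedule cools than a logarithmic one:

```python
import math

def gain(t, c1=1.0, zeta=0.6):
    """Gain factor gamma_t = C1 / t^zeta, zeta in (0.5, 1]."""
    return c1 / t ** zeta

def temperature(t, c2=5.0, tau_star=0.01):
    """Square-root cooling schedule tau_t = C2 / sqrt(t) + tau_star."""
    return c2 / math.sqrt(t) + tau_star

# compare against a logarithmic schedule C/log(t) at the same constants
for t in (10, 10_000, 10_000_000):
    print(t, round(temperature(t), 5), round(5.0 / math.log(t), 5))
```

By t = 10⁷ the square-root schedule is already essentially at τ∗, while the logarithmic one is still far from it.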
Convergence Theorem
Theorem 1. Assume that T is compact and conditions (A1)-(A3) hold. If θ̃0 used in the SAA algorithm is such that sup_{τ∈T} v_τ(θ̃0) < inf_{∥θ∥=c0, τ∈T} v_τ(θ) for some c0 > 0 and ∥θ̃0∥ < c0, then the number of truncations in SAA is almost surely finite; that is, {θt} remains in a compact subset of Θ almost surely.
Convergence Theorem
Theorem 2. Assume the conditions of Theorem 1 hold. Then, as t → ∞,

d(θt, L_{τ∗}) → 0, a.s.,

where L_{τ∗} = {θ ∈ Θ : h_{τ∗}(θ) = 0} and d(z, S) = inf_y{∥z − y∥ : y ∈ S}. That is,

θt^(i) → C + log(∫_{Ei} f_{τ∗}(x) dx) − log(πi + πe), if Ei ≠ ∅,
θt^(i) → −∞, if Ei = ∅,

where C is a constant, πe = ∑_{j: Ej=∅} πj/(m − m0), and m0 is the number of empty subregions.
Strong Law of Large Numbers (SLLN)
Theorem 3. Assume the conditions of Theorem 1 hold. Let x1, . . . , xn denote a set of samples simulated by SAA in n iterations. Let g : X → R be a measurable function that is bounded and integrable with respect to f_{θ,τ}(x). Then

(1/n) ∑_{k=1}^n g(xk) → ∫_X g(x) f_{θ∗,τ∗}(x) dx, a.s.
Convergence to Global Minima
Corollary. Assume the conditions of Theorem 1 hold. Let x1, . . . , xt denote a set of samples simulated by SAA in t iterations. Then, for any ϵ > 0, as t → ∞,

(1/∑_{k=1}^t I(J(xk) = i)) ∑_{k=1}^t I(U(xk) ≤ u_i∗ + ϵ & J(xk) = i) → ∫_{{x: U(x) ≤ u_i∗+ϵ} ∩ Ei} e^{−U(x)/τ∗} dx / ∫_{Ei} e^{−U(x)/τ∗} dx, a.s.,   (16)

for i = 1, . . . , m, where I(·) denotes the indicator function. Moreover, if τ∗ goes to 0, then

P(U(Xt) ≤ u_i∗ + ϵ | J(Xt) = i) → 1,  i = 1, . . . , m.   (17)

For simulated annealing, as shown in Haario and Saksman (1991), it achieves the following convergence with a logarithmic cooling schedule: for any ϵ > 0,

P(U(Xt) ≤ u_1∗ + ϵ) → 1, a.s.,   (18)

as t → ∞.
Comparison with Simulated Annealing
▶ Simulated annealing can achieve a stronger convergence mode than SAA. As a trade-off, SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, such as the square-root cooling schedule.
▶ From the perspective of practical applications, (17) and (18) are almost equivalent: both allow one to identify a sequence of samples that converge to the global energy minima of U(x).
▶ In practice, SAA often works better than simulated annealing, because its self-adjusting mechanism makes it immune to local traps.
A 10-state Distribution
The unnormalized mass function of the 10-state distribution:

x      1    2   3  4    5   6  7    8   9  10
P(x)   5  100  40  1  125  75  1  150  50  20

The sample space X = {1, 2, . . . , 10} was partitioned according to the mass function into five subregions: E1 = {8}, E2 = {2, 5}, E3 = {6, 9}, E4 = {3}, and E5 = {1, 4, 7, 10}.
A 10-state Distribution
Convergence of θt for the 10-state distribution: the true value θn is calculated at the end temperature 0.0104472, θ̂n is the average of θn over 5 independent runs, s.d. is the standard deviation of θ̂n, and freq is the averaged relative sampling frequency of each subregion. The standard deviation of freq is nearly 0.

Subregion       E1          E2          E3           E4           E5
θn            6.3404    -11.1113    -60.0072    -120.1772    -186.5248
θ̂n            6.3404    -11.1116    -60.0009    -120.1687    -186.5044
s.d.               0   6.26×10−3   2.28×10−3    6.01×10−3    8.16×10−3
freq          20.29%      20.23%      20.05%       19.84%        19.6%
A 10-state Distribution
A thinned sample path of SAA for the 10-state distribution.
A function with multiple local minima
Consider minimizing the function

U(x) = −{x1 sin(20x2) + x2 sin(20x1)}² cosh{sin(10x1) x1} − {x1 cos(10x2) − x2 sin(10x1)}² cosh{cos(20x2) x2},

where x = (x1, x2) ∈ [−1.1, 1.1]².
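For reference, the function is easy to code directly; the coarse grid scan below is an illustration only (not part of the experiment), but it already finds low-energy values on the surface:

```python
import math

def U(x1, x2):
    """The multimodal test function on [-1.1, 1.1]^2 (to be minimized)."""
    a = (x1 * math.sin(20 * x2) + x2 * math.sin(20 * x1)) ** 2 \
        * math.cosh(math.sin(10 * x1) * x1)
    b = (x1 * math.cos(10 * x2) - x2 * math.sin(10 * x1)) ** 2 \
        * math.cosh(math.cos(20 * x2) * x2)
    return -a - b

# coarse grid scan over the domain (step 0.005; illustration only)
grid = [-1.1 + 0.005 * k for k in range(441)]
best = min(U(p, q) for p in grid for q in grid)
print(round(best, 3))
```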
A function with multiple local minima
Comparison of SAA and simulated annealing for the multi-modal example: average of minimum energy values after the given numbers of iterations (parenthesized values are the associated standard deviations). SA (sr) uses the square-root schedule and SA (geo) the geometric schedule.

                 20000        40000        60000        80000       100000    prop    cpu
SAA            -8.1145      -8.1198      -8.1214      -8.1223      -8.1229   92.0%   0.17
             (3.0×10−4)   (1.5×10−4)   (1.0×10−4)   (7.5×10−5)   (5.9×10−5)
SA (sr)        -5.9227      -5.9255      -5.9265      -5.9269      -5.9271    3.5%   0.14
             (1.3×10−2)   (1.3×10−2)   (1.3×10−2)   (1.3×10−2)   (1.3×10−2)
SA (geo)       -6.5534      -6.5598      -6.5611      -6.5617      -6.5620   30.7%   0.13
             (3.3×10−2)   (3.3×10−2)   (3.3×10−2)   (3.3×10−2)   (3.3×10−2)
A function with multiple local minima
(a) Contour of U(x), (b) sample path of SAA, (c) sample path of simulated annealing with a square-root cooling schedule, and (d) sample path of simulated annealing with a geometric cooling schedule. The white circles show the global minima of U(x).
Feed-forward Neural Networks
A fully connected one-hidden-layer MLP network with four input units (I1, I2, I3, I4), one bias unit (B), three hidden units (H1, H2, H3), and one output unit (O). The arrows show the direction of data feeding.
Two spiral Problem
The two-spiral problem is to learn a feedforward neural network that distinguishesbetween points on two intertwined spirals.
This is a benchmark feedforward neural network training problem. The objectivefunction is high-dimensional, highly nonlinear, and consists of a multitude of localenergy minima separated by high energy barriers.
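The benchmark data themselves are easy to generate; the construction below (97 points per spiral, radius decaying with the angle) is one common version of the task, an assumption rather than the slide's exact data:

```python
import math

def two_spirals(n=97):
    """One common construction of the two-spiral benchmark: n points
    per spiral, the second spiral being the point reflection of the
    first; each point is (x, y, class label)."""
    data = []
    for i in range(n):
        alpha = i * math.pi / 16.0          # angle grows with the index
        r = 6.5 * (104 - i) / 104.0         # radius shrinks with the index
        x, y = r * math.sin(alpha), r * math.cos(alpha)
        data.append((x, y, 0))              # spiral 1
        data.append((-x, -y, 1))            # spiral 2 (point-reflected)
    return data

pts = two_spirals()
print(len(pts), pts[0])
```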
Two spiral Problem
Classification maps learned by SAA with an MLP of 30 hidden units. The black and white points show the training data for the two intertwined spirals. (a) Classification map learned in one run of SAA. (b) Classification map averaged over 20 runs. This figure shows the success of SAA in optimizing complex functions.
Two spiral Problem
Comparison of SAA, space annealing SAMC, simulated annealing, and BFGS for the two-spiral example. Notation: vi denotes the minimum energy value obtained in the i-th run, i = 1, . . . , 20; "Mean" is the average of the vi; "SD" is the standard deviation of "Mean"; "Min" = min_{i} vi; "Max" = max_{i} vi; "Prop" = #{i : vi ≤ 0.21}; "Iter" is the average number of iterations performed in each run. SA-1 employs the linear cooling schedule, and SA-2 employs the geometric cooling schedule with a decreasing rate of 0.9863.

Algorithm               Mean     SD      Min     Max    Prop  Iter(×10⁶)
SAA                     0.341   0.099   0.201    2.04    18      5.82
Space annealing SAMC    0.620   0.191   0.187    3.23    15      7.07
Simulated annealing-1  17.485   0.706   9.02    22.06     0      10.0
Simulated annealing-2   6.433   0.450   3.03    11.02     0      10.0
BFGS                   15.50    0.899  10.00    24.00     0       —
Protein Folding
The AB model consists of only two types of monomers, A and B, which behave as hydrophobic (σi = +1) and hydrophilic (σi = −1) monomers, respectively. The monomers are linked by rigid bonds of unit length to form linear chains living in two- or three-dimensional space.

For the 2D case, the energy function consists of two types of contributions, bond-angle and Lennard-Jones, and is given by

U(x) = ∑_{i=1}^{N−2} (1/4)(1 − cos x_{i,i+1}) + 4 ∑_{i=1}^{N−2} ∑_{j=i+2}^{N} [r_{ij}^{−12} − C2(σi, σj) r_{ij}^{−6}],   (19)

where x = (x_{1,2}, . . . , x_{N−2,N−1}), x_{i,j} ∈ [−π, π] is the angle between the i-th and j-th bond vectors, and r_{ij} is the distance between monomers i and j. The constant C2(σi, σj) is +1, +1/2, and −1/2 for AA, BB, and AB pairs, respectively.
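Energy (19) is straightforward to evaluate; the sketch below assumes the chain is already laid out as 2D coordinates with unit bonds (a coordinate-based form, so the angle terms can be read off the bond vectors):

```python
import math

def ab_energy(coords, sigma):
    """Energy (19) for a 2D AB-model chain given unit-bond monomer
    coordinates [(x, y), ...] and the sequence sigma (+1 = A, -1 = B)."""
    n = len(coords)
    def c2(si, sj):                      # Lennard-Jones prefactor C2
        if si == sj:
            return 1.0 if si == +1 else 0.5   # AA: +1, BB: +1/2
        return -0.5                            # AB: -1/2
    e = 0.0
    for i in range(n - 2):               # bond-angle term, (1/4)(1 - cos)
        (x0, y0), (x1, y1), (x2, y2) = coords[i], coords[i + 1], coords[i + 2]
        b1 = (x1 - x0, y1 - y0)          # consecutive bond vectors
        b2 = (x2 - x1, y2 - y1)
        cos_a = b1[0] * b2[0] + b1[1] * b2[1]  # unit bonds: dot = cos angle
        e += 0.25 * (1.0 - cos_a)
    for i in range(n - 2):               # Lennard-Jones term over |i-j| >= 2
        for j in range(i + 2, n):
            r2 = (coords[i][0] - coords[j][0]) ** 2 \
                 + (coords[i][1] - coords[j][1]) ** 2
            e += 4.0 * (r2 ** -6 - c2(sigma[i], sigma[j]) * r2 ** -3)
    return e

# straight 3-monomer AAA chain: zero angle energy, one LJ pair at r = 2
print(ab_energy([(0, 0), (1, 0), (2, 0)], [+1, +1, +1]))
```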
Protein Folding
Comparison of SAA and simulated annealing for the 2D AB models. (a) The minimum energy value obtained by SAA (subject to a post conjugate-gradient minimization procedure starting from the best configurations found in each run). (b) The averaged minimum energy value sampled by the algorithm, with the standard deviation of the average in parentheses. (c) The minimum energy value sampled by the algorithm in all runs.

                       SAA                                Simulated Annealing
N    Post(a)     Average(b)          Best(c)     Average(b)          Best(c)
13    -3.2941    -3.2833 (0.0011)    -3.2881     -3.1775 (0.0018)    -3.2012
21    -6.1980    -6.1578 (0.0020)    -6.1712     -5.9809 (0.0463)    -6.1201
34   -10.8060   -10.3396 (0.0555)   -10.7689     -9.5845 (0.1260)   -10.5240
Protein Folding
Minimum energy configurations produced by SAA (subject to post conjugate-gradient optimization) for (a) the 13-mer sequence with energy value -3.2941, (b) the 21-mer sequence with energy value -6.1980, and (c) the 34-mer sequence with energy value -10.8060. The solid and open circles indicate the hydrophobic and hydrophilic monomers, respectively.
Summary
▶ We have developed the SAA algorithm for global optimization. Under the framework of stochastic approximation, we show that SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, e.g., a square-root cooling schedule, while guaranteeing that the global energy minima are reached as the temperature tends to 0.
▶ Compared to simulated annealing, an added advantage of SAA is its self-adjusting mechanism, which makes it immune to local traps.
▶ Compared to space annealing SAMC, SAA shrinks its sample space in a soft way, gradually biasing the sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of SAA getting trapped in local energy minima.
▶ SAA provides a more general framework of stochastic approximation than the current stochastic approximation MCMC algorithms. By including an additional control parameter τt, stochastic approximation may find new applications or improve its performance in existing ones.
Acknowledgments
▶ Collaborators: Yichen Cheng and Guang Lin.
▶ NSF grants
▶ KAUST grant