arXiv:1806.08815v3 [physics.comp-ph] 7 Apr 2019


Physics-Inspired Optimization for Quadratic Unconstrained Problems Using a Digital Annealer

Maliheh Aramon, Gili Rosenberg, Elisabetta Valiante,¹ Toshiyuki Miyazawa, Hirotaka Tamura,² and Helmut G. Katzgraber³,⁴,⁵

¹1QB Information Technologies Inc. (1QBit), 458-550 Burrard Street, Vancouver, BC V6C 2B5, Canada
²Fujitsu Laboratories Ltd., 4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki, Kanagawa 211-8588, Japan
³Microsoft Quantum, Microsoft, Redmond, WA 98052, USA
⁴Department of Physics and Astronomy, Texas A&M University, College Station, TX 77843-4242, USA
⁵Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

(Dated: April 9, 2019)

The Fujitsu Digital Annealer is designed to solve fully connected quadratic unconstrained binary optimization (QUBO) problems. It is implemented on application-specific CMOS hardware and currently solves problems of up to 1024 variables. The Digital Annealer’s algorithm is currently based on simulated annealing; however, it differs from it in its utilization of an efficient parallel-trial scheme and a dynamic escape mechanism. In addition, the Digital Annealer exploits the massive parallelization that custom application-specific CMOS hardware allows. We compare the performance of the Digital Annealer to simulated annealing and parallel tempering with isoenergetic cluster moves on two-dimensional and fully connected spin-glass problems with bimodal and Gaussian couplings. These represent the respective limits of sparse versus dense problems, as well as high-degeneracy versus low-degeneracy problems. Our results show that the Digital Annealer currently exhibits a time-to-solution speedup of roughly two orders of magnitude for fully connected spin-glass problems with bimodal or Gaussian couplings, over the single-core implementations of simulated annealing and parallel tempering Monte Carlo used in this study. The Digital Annealer does not appear to exhibit a speedup for sparse two-dimensional spin-glass problems, which we explain on theoretical grounds. We also benchmarked an early implementation of the Parallel Tempering Digital Annealer. Our results suggest an improved scaling over the other algorithms for fully connected problems of average difficulty with bimodal disorder. The next generation of the Digital Annealer is expected to be able to solve fully connected problems up to 8192 variables in size. This would enable the study of fundamental physics problems and industrial applications that were previously inaccessible using standard computing hardware or special-purpose quantum annealing machines.

I. INTRODUCTION

Discrete optimization problems have ubiquitous applications in various fields and, in particular, many NP-hard combinatorial optimization problems can be mapped to a quadratic Ising model [1] or, equivalently, to a quadratic unconstrained binary optimization (QUBO) problem. Such problems arise naturally in many fields of research, including finance [2], chemistry [3, 4], biology [5, 6], logistics and scheduling [7, 8], and machine learning [9–12]. For this reason, there is much interest in solving these problems efficiently, both in academia and in industry.

The impending end of Moore’s law [13] signals that relying on traditional silicon-based computer devices is not expected to sustain the current computational performance growth rate. In light of this, interest in novel computational technologies has been steadily increasing. The introduction of a special-purpose quantum annealer by D-Wave Systems Inc. [14] was an effort in this direction, aimed at revolutionizing how computationally intensive discrete optimization problems are solved using quantum fluctuations.

Despite continued efforts to search for a scaling advantage of quantum annealers over algorithms on conventional off-the-shelf CMOS hardware, there is as yet no consensus. Efforts to benchmark quantum annealers against classical counterparts such as simulated annealing (SA) [15] have abounded [14, 16–30]. Although for some classes of synthetic problems a large speedup was initially found, those problems were subsequently shown to have a trivial logical structure, such that they can be solved more efficiently by more-powerful classical algorithms [31]. To the best of our knowledge, the only known case of speedup is a constant speedup for a class of synthetic problems [29] and, so far, there is no evidence of speedup for an industrial application. The hope is that future improvements to the quantum annealer and, in particular, to its currently sparse connectivity and low precision due to analog noise, will demonstrate the power of quantum effects in solving optimization problems [32, 33]. With the same goal in mind, researchers have been inspired to push the envelope for such problems on novel hardware, such as the coherent Ising machine [32], as well as on graphics processing units (GPU) [27, 28] and application-specific CMOS hardware [34, 35]. Similarly, efforts to emulate quantum effects in classical algorithms—often referred to as quantum- or physics-inspired methods—run on off-the-shelf CMOS hardware have resulted in sizable advances in the optimization domain (see, e.g., Ref. [30] for an overview of different algorithms).

Fujitsu Laboratories has recently developed application-specific CMOS hardware designed to solve fully connected QUBO problems (i.e., on complete graphs), known as the Digital Annealer (DA) [34, 35]. The DA hardware is currently able to treat Ising-type optimization problems of a size up to 1024 variables, with 26 and 16 bits of (fixed) precision for the biases and variable couplers, respectively. The DA’s algorithm, which we refer to as “the DA”, is based on simulated annealing, but differs in several ways (see Sec. II), as well as in its ability to take advantage of the massive parallelization possible when using a custom, application-specific CMOS hardware. Previous efforts to run simulated annealing in parallel include executing different iterations in



parallel on an AP1000 massively parallel distributed-memory multiprocessor [36, 37]. In addition to the DA, a version of the Digital Annealer, which we refer to as “the PTDA”, and which uses parallel tempering Monte Carlo [38–42] for the algorithmic engine, is now available. In particular, it has been shown that physics-inspired optimization techniques such as simulated annealing and parallel tempering Monte Carlo typically outperform specialized quantum hardware [30] such as the D-Wave devices.

Much of the benchmarking effort has centred around spin glasses, a type of constraint satisfaction problem, in part due to their being the simplest of the hard Boolean optimization problems. Furthermore, application-based benchmarks from, for example, industry tend to be structured and, therefore, systematic benchmarking is difficult. As such, spin glasses have been used extensively to benchmark algorithms on off-the-shelf CPUs [43, 44], novel computing technologies such as quantum annealers [18, 20, 45, 46], and coherent Ising machines [32]. In this paper, we benchmark the DA and the PTDA on spin-glass problems, comparing them to simulated annealing [47] and parallel tempering Monte Carlo with isoenergetic cluster moves [48, 49] (a variant of Houdayer cluster updates [50] within the context of optimization and not the thermal simulation of spin-glass systems), both state-of-the-art, physics-inspired optimization algorithms. For other alternative classical optimization techniques used in the literature to solve QUBO problems, the interested reader is referred to Refs. [22, 30, 51].

The paper is organized as follows. Section II describes the algorithms we have benchmarked. In Sec. III we probe the advantage of parallel-trial over single-trial Monte Carlo moves, and in Sec. IV we discuss the methodology we have used for measuring time to solution. In Sec. V we introduce the problems benchmarked. The experimental results are presented and discussed in Sec. VI. Finally, our conclusions are presented in Sec. VII. The parameters used for our benchmarking are given in Appendix A.

II. ALGORITHMS

In this paper, we compare several Monte Carlo (MC) algorithms and their use for solving optimization problems.

A. Simulated Annealing

Simulated annealing (SA) [15] is a generic algorithm with a wide application scope. The SA algorithm starts from a random initial state at a high temperature. Monte Carlo updates at decreasing temperatures are then performed. Note that the temperatures used follow a predefined schedule.

Algorithm 1 Simulated Annealing (SA)

1: for each run do
2:     initialize to random initial state
3:     for each temperature do
4:         for each MC sweep at this temperature do
5:             for each variable do
6:                 propose a flip
7:                 if accepted, update the state and effective fields
8:             end for
9:         end for
10:        update the temperature
11:    end for
12: end for

When the simulation stops, one expects to find a low-temperature state, ideally the global optimum (see Algorithm 1 for details). The high-temperature limit promotes diversification, whereas the low-temperature limit promotes intensification. To increase the probability of finding the optimum, this process is repeated multiple times (referred to as “runs”), returning the best state found. The computational complexity of each Monte Carlo sweep in SA is O(N²) for fully connected problems with N variables, because each sweep includes N update proposals, and each accepted move requires updating N effective fields, at a constant computational cost.
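The sweep structure and the O(N²) cost per sweep described above can be sketched as follows. This is a minimal illustration, not the benchmarked implementation; the Ising convention H = Σ_{i<j} J_ij s_i s_j + Σ_i h_i s_i, the function name, and the schedule handling are our assumptions.

```python
import math
import random

def simulated_annealing(J, h, temperatures, sweeps_per_temp, seed=0):
    """Minimal SA sketch for an Ising problem H = sum_{i<j} J[i][j] s_i s_j
    + sum_i h[i] s_i, with a symmetric coupling matrix J (J[i][i] = 0),
    biases h, and spins in {-1, +1}.  Illustrative, not the paper's code."""
    rng = random.Random(seed)
    N = len(h)
    s = [rng.choice([-1, 1]) for _ in range(N)]
    # Effective field on spin i: f_i = h_i + sum_{j != i} J_ij s_j,
    # so flipping spin i changes the energy by dE = -2 s_i f_i.
    f = [h[i] + sum(J[i][j] * s[j] for j in range(N)) for i in range(N)]
    for T in temperatures:                       # predefined schedule
        for _ in range(sweeps_per_temp):
            for i in range(N):                   # one sweep = N proposals
                dE = -2.0 * s[i] * f[i]
                if dE <= 0 or rng.random() < math.exp(-dE / T):
                    s[i] = -s[i]
                    for j in range(N):           # O(N) effective-field update
                        if j != i:
                            f[j] += 2.0 * J[j][i] * s[i]
    return s
```

Note that the inner field update is what makes each accepted move cost O(N) on a fully connected graph, matching the complexity argument above.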

B. The Digital Annealer’s Algorithm

The DA’s algorithmic engine [34, 35] is based on SA, but differs from it in three main ways (see Algorithm 2). First, it starts all runs from the same arbitrary state, instead of starting each run from a random state. This results in a small speedup due to its avoiding the calculation of the initial N effective fields and the initial energy for each run. Second, it uses a parallel-trial scheme in which each Monte Carlo step considers a flip of all variables (separately), in parallel. If at least one flip is accepted, one of the accepted flips is chosen uniformly at random and it is applied. Recall that in SA, each Monte Carlo step considers a flip of a single variable only (i.e., single trial). The advantage of the parallel-trial scheme is that it can boost the acceptance probability, because the likelihood of accepting a flip out of N flips is typically much higher than the likelihood of flipping a particular variable (see Sec. III). Parallel rejection algorithms on GPU [52, 53] are examples of similar efforts in the literature to address the low acceptance probability problem in Monte Carlo methods. Finally, the DA employs an escape mechanism called a dynamic offset, such that if no flip was accepted, the subsequent acceptance probabilities are artificially increased by subtracting a positive value from the difference in energy associated with a proposed move. This can help the algorithm to surmount short, narrow barriers.


Algorithm 2 The Digital Annealer’s Algorithm

1: initial state ← an arbitrary state
2: for each run do
3:     initialize to initial state
4:     Eoffset ← 0
5:     for each MC step (iteration) do
6:         if due for temperature update, update the temperature
7:         for each variable j, in parallel do
8:             propose a flip using ∆Ej − Eoffset
9:             if accepted, record
10:        end for
11:        if at least one flip accepted then
12:            choose one flip uniformly at random amongst them
13:            update the state and effective fields, in parallel
14:            Eoffset ← 0
15:        else
16:            Eoffset ← Eoffset + offset increase rate
17:        end if
18:    end for
19: end for
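One Monte Carlo step of Algorithm 2 (parallel trials plus the dynamic offset) can be sketched as follows. The acceptance rule min(1, e^{−(∆E − E_offset)/T}) and all names here are our assumptions, not the hardware’s exact implementation.

```python
import math
import random

def da_step(dE, E_offset, T, offset_increase, rng):
    """One parallel-trial step with the dynamic offset: evaluate all N flips
    (`dE` holds the energy change of flipping each variable), accept each
    with probability min(1, exp(-(dE_j - E_offset) / T)), then apply one
    accepted flip chosen uniformly at random.  Illustrative sketch."""
    accepted = [j for j, d in enumerate(dE)
                if rng.random() < math.exp(-max(d - E_offset, 0.0) / T)]
    if accepted:
        # Apply one flip and reset the offset.
        return rng.choice(accepted), 0.0
    # No flip accepted: raise the offset so later steps escape more easily.
    return None, E_offset + offset_increase
```

On the hardware all N trials run concurrently; here the list comprehension stands in for that parallel evaluation.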

Furthermore, the application-specific CMOS hardware allows for massive parallelization that can be exploited for solving optimization problems faster. For example, in the DA, evaluating a flip of all variables is performed in parallel, and when a flip is accepted and applied, the effective fields of all neighbours are updated in parallel. Note that this step requires a constant time, regardless of the number of neighbours, due to the parallelization on the hardware, whereas the computational time of the same step in SA increases linearly in the number of neighbours.

In order to understand the logic behind the DA, it is helpful to understand several architectural considerations that are specific to the DA hardware. In the DA, each Monte Carlo step takes the same amount of time, regardless of whether a flip was accepted (and therefore applied) or not. In contrast, in a CPU implementation of SA, accepted moves are typically much more computationally costly than rejected moves, that is, O(N) vs. O(1), due to the need to update N effective fields versus none if the flip is rejected. As a result, in the DA, the potential boost in acceptance probabilities (from using the parallel-trial scheme) is highly desirable. In addition, in the DA, the computational complexity of updating the effective fields is constant regardless of the connectivity of the graph. Comparing this with SA, the computational complexity of updating the effective fields is O(N) for fully connected graphs, but it is O(d) for fixed-degree graphs (in which each node has d neighbours). Therefore, running SA on a sparse graph is typically faster than on a dense graph, but the time is the same for the DA. For this reason, it is expected that the speedup of the DA over SA be, in general, higher for dense graphs than for sparse ones.

Finally, it is worth mentioning that the pseudorandom number generator algorithm implemented in the hardware is similar to a twisted generalized feedback shift register algorithm and gives a sufficiently long period of 2^19937 − 1 [54].

C. Parallel Tempering with Isoenergetic Cluster Moves

In parallel tempering (PT) [38–42, 55] (also known as replica-exchange Monte Carlo), multiple replicas of the system are simulated at different temperatures, with periodic exchanges based on a Metropolis criterion between neighbouring temperatures. Each replica, therefore, performs a random walk in temperature space, allowing it to overcome energy barriers by temporarily moving to a higher temperature. The higher-temperature replicas are typically at a high enough temperature that they inject new random states into the system, essentially re-seeding the algorithm continuously, obviating (at least partially) the need to perform multiple runs. PT has been used effectively in multiple research fields [41], and often performs better than SA, due to the increased mixing.
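The exchange step can be written down directly. The following sketch uses the standard Metropolis exchange criterion, accepting a swap of two neighbouring replicas with probability min(1, exp[(1/T1 − 1/T2)(E1 − E2)]); the function and variable names are ours.

```python
import math
import random

def pt_exchange(E1, T1, E2, T2, rng):
    """Metropolis criterion for exchanging two neighbouring parallel-tempering
    replicas with energies E1, E2 at temperatures T1, T2.  Returns True if
    the swap is accepted.  Illustrative sketch, not the paper's code."""
    arg = (1.0 / T1 - 1.0 / T2) * (E1 - E2)
    return arg >= 0 or rng.random() < math.exp(arg)
```

Swaps that move the higher-energy state to the higher temperature are always accepted; the reverse is accepted only occasionally, which is what drives the random walk in temperature space.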

The addition of isoenergetic cluster moves (ICM) [48, 50] to PT, which flip multiple variables at a time, can allow for better exploration of the phase space, but only if the variable clusters do not span the whole system [48, 49]. ICM is a generalization of Houdayer’s cluster algorithm, which was tailored for two-dimensional spin-glass problems [50]. To perform ICM, two copies (or more) of the system are simulated at the same temperature. The states of those two replicas are then compared, to find a cluster of variables (i.e., a connected component) that are opposite. In the case of QUBO problems, opposite variables are defined as having a product of zero. Finally, the move is applied by swapping the states of the opposite variables in the two replicas. The total energy of the two replicas is unchanged by this move, such that it is rejection free. The combination of PT and ICM, PT+ICM (also known as borealis; see Algorithm 3 [49]), has been shown to be highly effective for low-dimensionality spin-glass-like problems [20, 30, 56], but it does not provide a benefit for problems defined on fully connected graphs. This can be understood by noting that when the clusters span the system, ICM essentially results in swapping the states completely.

Algorithm 3 Parallel Tempering with Isoenergetic Cluster Moves (PT+ICM)

1: initialize all replicas with random initial states
2: for each MC sweep do
3:     for each replica, for each variable do
4:         propose a flip
5:         if accepted, update the state and effective fields
6:     end for
7:     for each pair of sequential replicas do
8:         propose a replica exchange
9:         if accepted, swap the temperatures between the replicas
10:    end for
11:    perform ICM update, swapping the states of a cluster of variables that have opposite states in the two replicas; update the states and the effective fields for both replicas
12: end for
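For Ising spins, the cluster construction and swap in the ICM update can be sketched as a breadth-first search over sites where the two replicas disagree. The `neighbours` adjacency structure and the seeding convention are our assumptions for illustration.

```python
from collections import deque

def icm_move(s1, s2, neighbours, seed_site):
    """Isoenergetic cluster move sketch for two Ising replicas s1, s2 (spins
    in {-1, +1}) simulated at the same temperature: grow the connected
    cluster of sites on which the replicas disagree, starting from
    `seed_site`, then swap the cluster's states between the replicas.
    The combined energy of the two replicas is unchanged, so the move is
    rejection free.  `neighbours[i]` lists the graph neighbours of site i."""
    if s1[seed_site] == s2[seed_site]:
        return set()                  # replicas agree at the seed: no cluster
    cluster, queue = {seed_site}, deque([seed_site])
    while queue:
        i = queue.popleft()
        for j in neighbours[i]:
            if j not in cluster and s1[j] != s2[j]:
                cluster.add(j)
                queue.append(j)
    for i in cluster:                 # swap the opposite-state cluster
        s1[i], s2[i] = s2[i], s1[i]
    return cluster
```

In practice the effective fields of both replicas must also be updated after the swap, as noted in line 11 of Algorithm 3.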

D. The Parallel Tempering Digital Annealer’s Algorithm

Because the DA’s algorithm is based on SA, and given the often superior results that PT gives over SA (see, e.g., Ref. [30]), Fujitsu Laboratories has also developed a Parallel Tempering Digital Annealer (PTDA). We had access to an early implementation of a PTDA. In the PTDA, the sweeps in each replica are performed as in the DA, including the parallel-trial scheme, parallel updates, and using the dynamic offset mechanism, but the PT moves are performed on a CPU. The temperatures are set automatically based on an adaptive scheme by Hukushima et al. [57]. In this scheme, the high and low temperatures are fixed, and intermediate temperatures are adjusted with the objective of achieving an equal replica-exchange probability for all adjacent temperatures. Having equal replica-exchange acceptance probabilities is a common target, although other schemes exist [42].

The next generation of the Digital Annealer is expected to be able to simulate problems on complete graphs up to 8192 variables in size, to have faster annealing times, and to perform the replica-exchange moves on the hardware, rather than on a CPU. This is significant, because when performing a computation in parallel, if a portion of the work is performed sequentially, it introduces a bottleneck that eventually dominates the overall run time (as the number of parallel threads is increased). Amdahl’s Law [58] quantifies this by stating that if the sequential part is a fraction α of the total work, the speedup is limited to 1/α asymptotically.
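Amdahl's Law at finite parallelism reads speedup(p) = 1 / (α + (1 − α)/p) for a sequential fraction α and p parallel workers, which approaches 1/α as p grows; a one-line sketch:

```python
def amdahl_speedup(alpha, p):
    """Amdahl's Law: speedup with a sequential work fraction `alpha` and `p`
    parallel workers.  Tends to 1/alpha as p -> infinity."""
    return 1.0 / (alpha + (1.0 - alpha) / p)
```

For example, with a 10% sequential fraction the speedup can never exceed a factor of ten, no matter how many replicas run in parallel.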

III. PARALLEL-TRIAL VERSUS SINGLE-TRIAL MONTE CARLO

To illustrate the advantage of parallel-trial Monte Carlo updates as implemented in the DA over single-trial Monte Carlo updates, let us calculate their respective acceptance probabilities. The acceptance probability for a particular Monte Carlo move is given by the Metropolis criterion A(∆E_i, T) ≡ e^{−∆E_i/T}, where ∆E_i denotes the difference in energy associated with flipping variable i, and T is the temperature. The single-trial acceptance probability is then given by

    P_s(T) = (1/N) ∑_i A(∆E_i, T) ,    (1)

where N is the number of variables. In contrast, the parallel-trial acceptance probability is given by the complement probability of not accepting a move,

    P_p(T) = 1 − ∏_i [1 − A(∆E_i, T)] .    (2)

At low temperatures, we expect the acceptance probability to reach zero, in general. In the limit A → 0, a first-order approximation of the parallel-trial acceptance probability gives

    P_p(T) ≈ ∑_i A(∆E_i, T) .    (3)

This indicates that in the best case, there is a speedup by a factor of N at low temperatures. In contrast, at a high enough temperature, all moves are accepted, hence A → 1. In this limit, it is clear that both the single-trial and parallel-trial acceptance probabilities reach 1, so parallel-trial Monte Carlo does not have an advantage over single-trial Monte Carlo.
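Equations (1) and (2) are straightforward to evaluate numerically. In the sketch below we cap the acceptance at one, i.e., A = min(1, e^{−∆E/T}), as in the usual Metropolis rule; the helper name is ours.

```python
import math

def acceptance_probabilities(dE, T):
    """Single-trial vs. parallel-trial acceptance probabilities of Eqs. (1)
    and (2), given the list `dE` of energy changes for flipping each of the
    N variables, with A(dE, T) = min(1, exp(-dE / T)).  Illustrative helper."""
    A = [min(1.0, math.exp(-d / T)) for d in dE]
    Ps = sum(A) / len(A)                      # Eq. (1): average over variables
    Pp = 1.0 - math.prod(1.0 - a for a in A)  # Eq. (2): at least one accepted
    return Ps, Pp
```

In the small-A limit the ratio Pp/Ps approaches N, reproducing the factor-of-N boost derived above.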

To quantify the difference between parallel-trial and single-trial Monte Carlo, we perform a Monte Carlo simulation at constant temperature for a sufficiently large number of sweeps to reach thermalization. Once the system has thermalized, we measure the single-trial and parallel-trial acceptance probabilities at every move. This has been repeated for a number of sweeps, and for multiple temperatures and multiple problems.

The results of such an experiment are presented in Fig. 1, for problems of size N = 64 of the four problem classes described in detail in Sec. V. The problem classes include two-dimensional (2D) and fully connected [Sherrington–Kirkpatrick (SK)] spin-glass problems with bimodal and Gaussian disorder. The results for all the problem classes except for the 2D-bimodal class follow the expected pattern of the acceptance probabilities reaching zero at low temperatures and one at high temperatures. In the 2D-bimodal case, there is a huge ground-state degeneracy, such that even at the ground state there are single variables for which a flip does not result in a change in energy. This results in a positive single-trial acceptance probability even at very low temperatures. For the same reason, the parallel-trial probability reaches one even for very low temperatures.

To quantify the acceptance probability advantage of parallel-trial over single-trial updates, it is instructive to study the parallel-trial acceptance probability divided by the single-trial acceptance probability, as presented in Fig. 2. For all problem classes except 2D-bimodal, the advantage at low temperatures is indeed a factor of N, as suggested by dividing Eq. (3) by Eq. (1). As explained above, in the 2D-bimodal case the single-trial acceptance probability is nonnegligible at low temperatures, leading to a reduced advantage. It is noteworthy that the advantage of the parallel-trial scheme is maximal at low temperatures, where the thermalization time is longer. As such, the parallel-trial scheme provides an acceptance probability boost where it is most needed.

IV. SCALING ANALYSIS

The primary objective of benchmarking is to quantify how the computational effort in solving problems scales as the size of the problem (e.g., the number of variables) increases. The algorithms we consider here are all stochastic, and a common approach to measuring the scaling of a probabilistic algorithm is to measure the total time required for the algorithm to find a reference energy (cost) at least once with a probability of 0.99. The reference energy is represented by the optimal energy if available or, otherwise, by the best known energy. We denote this time to solution by “TTS”, and explain how it is calculated in the rest of this section.

We consider the successive runs of a probabilistic algorithm as being a sequence of binary experiments that might succeed in returning the reference energy with some probability. Let us formally define X1, X2, . . . , Xr as a sequence of random, independent outcomes of r runs (experiments), where



FIG. 1: Mean single-trial and parallel-trial acceptance probabilities (P) vs. the temperature (T), for four problem classes: (a) 2D-bimodal, (b) 2D-Gaussian, (c) SK-bimodal, and (d) SK-Gaussian. Error bars are given by the 5th and 95th percentiles of the acceptance probabilities in the configuration space. A constant-temperature Monte Carlo simulation has been run for 10^5 sweeps to thermalize, after which the acceptance probabilities are measured for 5 · 10^3 sweeps. For each problem class, the simulation has been performed on 100 problem instances of size N = 64 with 10 repeats per problem instance. The horizontal dashed lines show the minimum and maximum attainable probabilities, zero and one, respectively. Acceptance probabilities for the parallel-trial scheme approach unity as a function of simulation time considerably faster than in the single-trial scheme.

P(X_i = 1) = θ denotes the probability of success, that is, of observing the reference energy at the i-th run. Defining

    Y = ∑_{i=1}^{r} X_i    (4)

as the number of successful observations in r runs, we have

    P(Y = y | θ, r) = C(r, y) (1 − θ)^{r−y} θ^y ,    (5)

where C(r, y) is the binomial coefficient. That is, Y has a binomial distribution with parameters r and θ. We denote the number of runs required to find the reference energy with a probability of 0.99 as R99, which equals r such that P(Y ≥ 1 | θ, r) = 0.99. It can be verified that

    R99 = log(1 − 0.99) / log(1 − θ)    (6)

and, consequently, that

    TTS = τ R99 ,    (7)


FIG. 2: Ratio of the mean parallel-trial acceptance probability (P_p) to the mean single-trial acceptance probability (P_s) vs. the temperature (T), for four problem classes: (a) 2D-bimodal, (b) 2D-Gaussian, (c) SK-bimodal, and (d) SK-Gaussian. A constant-temperature Monte Carlo simulation has been run for 10^5 sweeps to thermalize, after which the acceptance probabilities are measured for 5 · 10^3 sweeps. For each problem class, the simulation has been performed on 100 problem instances of size N = 64 with 10 repeats per problem instance. The upper horizontal dashed line indicates N, the number of variables, which is the expected theoretical value of the ratio at low temperatures, as given by Eq. (3) divided by Eq. (1). The lower horizontal line indicates the minimum value of the ratio, which is one.

where τ is the time it takes to run the algorithm once. Because the probability of success θ is unknown, the challenge is in estimating θ.
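In code, Eqs. (6) and (7) reduce to two one-line functions (a minimal sketch; the success probability and 10 ms run time below are illustrative, not measured values):

```python
import math

def r99(theta: float) -> float:
    """Number of runs needed to observe the reference energy
    at least once with probability 0.99, per Eq. (6)."""
    return math.log(1 - 0.99) / math.log(1 - theta)

def tts(theta: float, tau: float) -> float:
    """Time-to-solution, per Eq. (7); tau is the time per run."""
    return tau * r99(theta)

# A run that succeeds half the time must be repeated ~6.64 times.
print(r99(0.5))        # ~6.64
print(tts(0.5, 0.01))  # ~0.066 s for a hypothetical 10 ms run
```

Note that R99 need not be an integer; it is a statistical repetition count entering the TTS product.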

Instead of using the sample success proportion as a point estimate for θ, we follow the Bayesian inference technique to estimate the distribution of the probability of success for each problem instance [22]. Having distributions of the success probabilities would be helpful in more accurately capturing the variance of different statistics of the TTS. In the Bayesian inference framework, we start with a guess on the distribution of θ, known as a prior, and update it based on the observations from consecutive runs of the algorithm in order to obtain the posterior distribution. Since the consecutive runs have a binomial distribution, the suitable choice of prior is a beta distribution [59], which is the conjugate prior of the binomial distribution. This choice guarantees that the posterior will also have a beta distribution. The beta distribution with parameters α = 0.5 and β = 0.5 (the Jeffreys prior) is chosen as a prior because it is invariant under reparameterization of the space and it learns the most from the data [60]. Updating the Jeffreys prior based on the observations from consecutive runs, the posterior distribution, denoted by π(θ), is π(θ) ∼ Beta(α + y, β + r − y), where r is the total number of


runs and y is the number of runs in which the reference energy is found.
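The conjugate update described above is a one-liner; the sketch below (with hypothetical counts, y = 3 successes out of r = 20 runs) shows the Jeffreys-prior posterior:

```python
def posterior_params(y: int, r: int, alpha: float = 0.5, beta: float = 0.5):
    """Update a Beta(alpha, beta) prior (Jeffreys by default) with
    y successes out of r binomial trials. Conjugacy keeps the
    posterior in the beta family: Beta(alpha + y, beta + r - y)."""
    return alpha + y, beta + r - y

# Hypothetical: 3 successes observed in 20 runs.
a, b = posterior_params(3, 20)
print(a, b)        # 3.5 17.5
print(a / (a + b)) # posterior mean of theta, ~0.167
```

The posterior mean (α + y)/(α + β + r) differs slightly from the raw proportion y/r, which is the point of the prior: it regularizes instances with few observed successes.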

To estimate the TTS for the entire population of instances with similar parameters, let us assume that there are I instances with the same number of variables. After finding the posterior distribution πi(θ) for each instance i in the set {1, 2, ..., I}, we use bootstrapping to estimate the distribution of the q-th percentile of the TTS. This procedure is described in Algorithm 4.

Algorithm 4 Estimating the Distribution of the q-th Percentile of the TTS

1: fix the number of bootstrap resamples to B (B = 5000)
2: for b = 1, ..., B do
3:   sample I instances with replacement
4:   for each sampled instance j do
5:     sample a value, pjb, from its posterior probability πj(θ)
6:     calculate R99,jb = log(1 − 0.99)/log(1 − pjb)
7:   end for
8:   find the q-th percentile of the set {R99,jb} and denote it by R99,qb
9: end for
10: consider the empirical distribution of (τR99,q1, ..., τR99,qB) as an approximation of the true TTSq distribution
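A direct transcription of Algorithm 4 might look as follows (a sketch: the Beta parameters, run time τ, and resample count are illustrative, and the nearest-rank percentile used here is one of several conventions):

```python
import math
import random

def tts_percentile_distribution(posteriors, tau, q=50, B=5000, seed=0):
    """Bootstrap the distribution of the q-th TTS percentile
    (Algorithm 4). `posteriors` is a list of (alpha, beta) Beta
    posterior parameters, one per instance; `tau` is the time per run."""
    rng = random.Random(seed)
    n = len(posteriors)
    out = []
    for _ in range(B):
        # step 3: resample instances with replacement
        sample = [posteriors[rng.randrange(n)] for _ in range(n)]
        r99s = []
        for a, b in sample:
            # step 5: draw a success probability from the posterior
            p = rng.betavariate(a, b)
            # step 6: convert it to an R99 value
            r99s.append(math.log(1 - 0.99) / math.log(1 - p))
        r99s.sort()
        # step 8: nearest-rank q-th percentile of the resample
        idx = min(n - 1, max(0, math.ceil(q / 100 * n) - 1))
        out.append(tau * r99s[idx])
    return out

# Ten hypothetical instances sharing posterior Beta(3.5, 17.5), 1 ms per run:
dist = tts_percentile_distribution([(3.5, 17.5)] * 10, tau=1e-3, B=200)
print(len(dist))  # 200
```

The spread of the returned list is what the error bars on the TTS plots summarize; `random.betavariate` from the standard library suffices to sample the Beta posteriors.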

The procedure for deriving the TTS is slightly different for the DA and the PTDA. The anneal time of the algorithmic engine of the DA is not a linear function of the number of runs for a given number of sweeps. We therefore directly measure the anneal time for a given number of iterations and a given number of runs, where the latter is equal to the R99. Each iteration (Monte Carlo step) in the DA represents one potential update, and each Monte Carlo sweep corresponds to N iterations.

It is important to note that the correct scaling is only observed if the parameters of the solver are tuned such that the TTS is minimized. Otherwise, a suboptimal scaling might be observed and incorrect conclusions could be drawn. Recall that the TTS is the product of the R99 and the time taken per run, τ. Consider a parameter that affects the computational effort, such as the number of sweeps. Increasing the number of sweeps makes the algorithm more likely to find the reference solution, lowering the R99. On the other hand, it also lengthens the runtime, increasing τ. For this reason, the TTS typically grows without bound for a very low or a very high number of sweeps, and the goal is to experimentally find a number of sweeps at which the TTS is minimized.
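The tuning loop this describes can be sketched as follows; the success-probability and runtime curves passed in below are entirely hypothetical stand-ins for measured data:

```python
import math

def optimal_sweeps(sweep_grid, theta_of, tau_of):
    """Scan a grid of sweep counts and return (sweeps, TTS) minimizing
    TTS = tau * R99. `theta_of` and `tau_of` stand in for the measured
    success-probability and per-run-time curves."""
    best = None
    for s in sweep_grid:
        theta = theta_of(s)
        if not 0.0 < theta < 1.0:
            continue  # R99 is undefined (or infinite) at this setting
        tts = tau_of(s) * math.log(1 - 0.99) / math.log(1 - theta)
        if best is None or tts < best[1]:
            best = (s, tts)
    return best

# Toy model: success probability saturates with sweeps, runtime grows linearly.
grid = [10 ** k for k in range(2, 7)]
best = optimal_sweeps(grid,
                      theta_of=lambda s: 1 - math.exp(-(s / 1e4) ** 2),
                      tau_of=lambda s: 1e-6 * s)
print(best[0])  # 10000 under this toy model
```

In practice θ(s) is estimated from repeated runs at each grid point, so the scan is over empirical TTS distributions rather than closed-form curves.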

V. BENCHMARKING PROBLEMS

A quadratic Ising model can be represented by a Hamiltonian (i.e., cost function) of the form

$$\mathcal{H} = -\sum_{(i,j)\in E} J_{ij}\, s_i s_j \;-\; \sum_{i\in V} h_i s_i\,. \qquad (8)$$

Here, si ∈ {−1, 1} are spin variables, and the problem is encoded in the biases hi and couplers Jij. The sums are over the vertices V and weighted edges E of a graph

G = (V, E). It can be shown that the problem of finding a spin configuration {si} that minimizes H is, in general, equivalent to the NP-hard weighted max-cut problem [61–64]. Spin glasses defined on nonplanar graphs fall into the NP-hard complexity class. However, for the special case of planar graphs, exact, polynomial-time methods exist [65].

The algorithmic engine of the Digital Annealer can optimize instances of QUBO problems in which the variables xi take values from {0, 1} instead of {−1, 1}. To solve a quadratic Ising problem described by the Hamiltonian represented in Eq. (8), we can transform it into a QUBO problem by taking si = 2xi − 1.
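As a sanity check on this transformation, the following sketch expands si = 2xi − 1 in Eq. (8) and verifies, by brute force on a small hypothetical instance, that the Ising and QUBO energies agree on every configuration:

```python
from itertools import product

def ising_energy(J, h, s):
    """H = -sum_{(i,j)} J_ij s_i s_j - sum_i h_i s_i, per Eq. (8)."""
    return (-sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())
            - sum(hi * s[i] for i, hi in h.items()))

def qubo_from_ising(J, h, n):
    """Rewrite the Ising Hamiltonian over s in {-1,+1} as a QUBO over
    x in {0,1} via s_i = 2*x_i - 1. Returns (Q, linear, const) with
    energy = sum Q_ij x_i x_j + sum linear_i x_i + const."""
    Q = {key: -4 * Jij for key, Jij in J.items()}
    linear = {i: -2 * h.get(i, 0) for i in range(n)}
    const = -sum(J.values()) + sum(h.values())
    for (i, j), Jij in J.items():
        linear[i] += 2 * Jij   # cross terms from expanding s_i s_j
        linear[j] += 2 * Jij
    return Q, linear, const

def qubo_energy(Q, linear, const, x):
    return (sum(q * x[i] * x[j] for (i, j), q in Q.items())
            + sum(c * x[i] for i, c in linear.items()) + const)

# Check the two energies agree on every configuration of a toy instance.
J, h, n = {(0, 1): 1, (1, 2): -1}, {0: 0.5}, 3
Q, linear, const = qubo_from_ising(J, h, n)
for x in product([0, 1], repeat=n):
    s = [2 * xi - 1 for xi in x]
    assert abs(ising_energy(J, h, s) - qubo_energy(Q, linear, const, x)) < 1e-12
print("all 8 configurations agree")
```

The constant offset does not affect which configuration is optimal, so solvers may drop it and report energies shifted by that amount.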

In the following, we explain the spin-glass problems used for benchmarking.

2D-bimodal – Two-dimensional spin-glass problems on a torus (periodic boundaries), where couplings are chosen according to a bimodal distribution, that is, they take values from {−1, 1} with equal probability.

2D-Gaussian – Two-dimensional spin-glass problems where couplings are chosen from a Gaussian distribution with a mean of zero and a standard deviation of one, scaled by 10^5.

SK-bimodal – Spin-glass problems on a complete graph, also known as Sherrington–Kirkpatrick (SK) spin-glass problems [66], where couplings are chosen according to a bimodal distribution, that is, they take values from {−1, 1} with equal probability.

SK-Gaussian – SK spin-glass problems where couplings are chosen from a Gaussian distribution with a mean of zero and a standard deviation of one, scaled by 10^5.

In all the problems, the biases are zero. The coefficients of the 2D-Gaussian and SK-Gaussian problems are beyond the precision limit of the current DA. In order to solve these problems using the DA, we have used a simple scheme to first scale the coefficients up to their maximum limit and then round them to the nearest integer values. The maximum values for the linear and quadratic coefficients are given by the precision limits of the current DA hardware, that is, 2^25 − 1 and 2^15 − 1, respectively. The problem instances are not scaled when solving them using SA or PT (PT+ICM).
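The scaling-and-rounding scheme might be sketched as follows (shown for couplers only, using the quadratic-coefficient limit quoted above; the input values are illustrative):

```python
def scale_to_precision(couplings, max_coupler=2 ** 15 - 1):
    """Scale real-valued couplings so the largest magnitude hits the
    hardware's integer limit, then round to the nearest integer
    (the simple scheme described above)."""
    largest = max(abs(c) for c in couplings)
    factor = max_coupler / largest
    return [round(c * factor) for c in couplings]

print(scale_to_precision([0.5, -1.0, 0.25]))  # [16384, -32767, 8192]
```

Rounding after scaling perturbs the relative coupler values by at most one part in 2^15 − 1, which is why the effect matters only when energy gaps are comparably small.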

Our benchmarking experiment has been parametrized by the number of variables. Specifically, we have considered nine different problem sizes in each problem category and generated 100 random instances for each problem size. We have used the instance generator provided by the University of Cologne Spin Glass Server [67] to procure the 2D-bimodal and 2D-Gaussian instances. SK instances with bimodal and Gaussian disorder have been generated as described above. Each problem instance has then been solved by the different Monte Carlo algorithms. Optimal solutions to the 2D-bimodal and 2D-Gaussian problems have been obtained by a branch-and-cut [68] technique available via the Spin Glass Server [67]. The SK problems are harder than the two-dimensional problems and the server does not find the optimal solution


within the 15-minute time limit. For the SK-bimodal and SK-Gaussian problems with 64 variables, we have used a semidefinite branch-and-bound technique through the BiqMac Solver [69] and BiqCrunch [70], respectively, to find the optimal solutions. For problems of a size greater than 64, the solution obtained by PT with a large number of sweeps (5·10^5 sweeps) is considered a good upper bound for the optimal solution. We refer to the optimal solution (or its upper bound) as the reference energy.

VI. RESULTS AND DISCUSSION

In this paper, we have used an implementation of the PT (and PT+ICM) algorithm based on the work of Zhu et al. [48, 49]. The DA and PTDA algorithms are run on Fujitsu's Digital Annealer hardware. For the SA simulations, we have used the highly optimized, open source code by Isakov et al. [47].

The DA solves only fully connected problems, where the coefficients of the vertices and edges absent from the original problem graph are set to zero. In our benchmarking study, we have included both two-dimensional and SK problems to represent the two cases of sparsity and full connectivity, respectively. Furthermore, we have considered both bimodal and Gaussian disorder in order to account for problems with high or low ground-state degeneracy. Bimodal disorder results in an energy landscape that has a large number of free variables with zero effective local fields. As a result, the degeneracy of the ground state increases exponentially in the number of free variables, making it easier for any classical optimization algorithm to reach a ground state. Problem instances that have Gaussian coefficients further challenge the DA, due to its current limitations in terms of precision.

In what follows, we discuss our benchmarking results, comparing the performance of the different algorithms using two-dimensional and SK spin-glass problems with bimodal and Gaussian disorder. We further investigate how problem density affects the DA's performance. The parameters of the algorithms used in this benchmarking study are presented in Appendix A.

A. 2D Spin-Glass Problems

Figure 3 illustrates the TTS results of the DA, SA, PT+ICM, and the PTDA for 2D spin-glass problems with bimodal and Gaussian disorder. In all TTS plots in this paper, points and error bars represent the mean and the 5th and 95th percentiles of the TTS distribution, respectively. PT+ICM has the lowest TTS for problems with bimodal and Gaussian disorder and has a clear scaling advantage. In problems with bimodally distributed couplings, although SA results in a lower TTS for small-sized problems, the DA and SA demonstrate similar TTSs as the problem size increases. The performance of both SA and the DA degrades when solving harder problem instances with Gaussian disorder, where the degeneracy of the ground states is significantly reduced. However, in this case, the DA outperforms SA even with its current precision limit.

The PTDA yields higher TTSs than the DA in both cases of bimodal and Gaussian disorder, likely due to the CPU overhead of performing parallel tempering moves. Considering the number of problems solved to optimality, the PTDA outperforms the DA when solving 2D spin-glass problems with bimodal couplings; however, as shown in Fig. 3, the PTDA solves fewer problems with Gaussian disorder.

In order to estimate the q-th percentile of the TTS distribution, at least q% of the problems should be solved to their corresponding reference energies. If there is no point for a given problem size and algorithm in Fig. 3, it means that not enough instances have been solved to their reference energies, and we therefore could not estimate the TTS percentiles. Even after increasing the number of iterations to 10^7 in the DA and the PTDA, we could not solve more than 80% of the 2D-Gaussian problem instances with a size greater than or equal to 400. Therefore, to gather enough statistics to estimate the 80th percentile of the TTS, we have increased the number of iterations to 10^8 in the DA. However, because of the excessive resources needed, we have not run the PTDA with such a large number of iterations.

2D-bimodal

In Fig. 4a, we observe that for a given problem size and number of sweeps, the DA reaches higher success probabilities than SA. As the problem size increases, the difference between the mean success probability curves of the DA and SA becomes less pronounced. Figure 4b illustrates the success probability correlation of 100 problem instances of size 1024. The DA yields higher success probabilities for 52 of the 100 instances solved.

For 2D-bimodal problems, the boost in the probability of updating a single variable due to the parallel-trial scheme is not effective enough to decrease the TTS or to result in better scaling (Fig. 3a). Since both the DA and SA update at most one variable at a time, increasing the probability of updating a variable in a problem with bimodal disorder, where there are a large number of free variables, likely results in a new configuration without lowering the energy value (see Sec. III for details).

2D-Gaussian

The performance of the DA and SA significantly degrades when solving the problems with Gaussian disorder, which are harder; however, the DA demonstrates clear superiority over SA. Figure 5 shows the residual energy (E), which is the relative energy difference (in percent) between the lowest-energy solution found and the reference-energy solution, for the largest problem size, which has 1024 variables. We observe that the DA outperforms SA, as it results in a lower residual energy for a given number of Monte Carlo sweeps (S ≤ 10^5). Furthermore, Fig. 6 illustrates that the parallel-trial scheme is more effective for this class of problems, which could be due to the decrease in the degeneracy of the ground



FIG. 3: Mean, 5th, and 95th percentiles of the TTS distributions (TTS50 and TTS80) for (a) 2D-bimodal and (b) 2D-Gaussian spin-glass problems. For some problem sizes N, the percentiles are smaller than the symbols.


FIG. 4: (a) Mean success probabilities (P) of 2D-bimodal instances vs. the number of Monte Carlo sweeps (S) for different problem sizes N solved by the DA and SA. The error bars are not included, for better visibility. (b) Success probability correlation plot of 100 2D-bimodal instances with N = 1024 variables, where the number of sweeps is 10^4 for SA and the number of iterations is 10^7 for the DA. Because each sweep corresponds to N = 1024 iterations, the number of sweeps is ≈ 10^4 for the DA, as well.

states. In Fig. 6a, we observe that the difference between the mean success probabilities, for a problem of size 144 with 10^4 Monte Carlo sweeps, is larger compared to the bimodal disorder (Fig. 4a). The success probability correlation of 100 2D-Gaussian problem instances with 400 variables in Fig. 6b further demonstrates that the DA reaches higher success probabilities, which results in a lower TTS (Fig. 3b).

B. SK Spin-Glass Problems

We have solved the SK problem instances with the DA, SA, PT, and the PTDA. As explained in Sec. II, for the fully connected problems, the cluster moves have not been included in PT because the clusters of variables span the entire system. Our initial experiments verified that adding ICM to PT increases the computational cost without demonstrating any scaling benefit for this problem class.

The statistics of the TTS distribution of the DA, the PTDA,



FIG. 5: Mean, 5th, and 95th percentiles of the residual energy (E) vs. the number of Monte Carlo sweeps (S) for 2D-Gaussian problem instances with N = 1024 variables solved by the DA and SA. As the number of Monte Carlo sweeps approaches infinity, the residual energy of both algorithms will eventually reach zero. We have run SA for up to 10^6 sweeps and the DA for up to 10^8 iterations (10^8/1024 ∼ 10^5 sweeps). Therefore, the data up to 10^5 sweeps is presented for both algorithms. We expect that increasing the number of iterations to 10^9 (10^6 sweeps) in the DA would further decrease the residual energy.

and the SA and PT algorithms are shown in Fig. 7a for SK-bimodal problem instances. Comparing the DA to SA and PT, we observe that the DA yields a noticeable, consistent speedup of at least two orders of magnitude as we approach the largest problem size. In the fully connected problems, accepting a move and updating the effective local fields in a CPU implementation of a Monte Carlo algorithm is computationally more expensive than for sparse problems.

Figure 7a shows that each algorithm has solved at least 80% of the SK instances for all problem sizes. We attribute this to the fact that the reference energy for the complete-graph problems is an upper bound on the exact optimal solution. We do not know how tight the upper bound is, but it represents, to the best of our knowledge, the best known solution.

To obtain insights on scaling, for each algorithm we have fit an exponential function of the form y = 10^(α+βN), where y is the mean of the TTS distribution and N is the number of variables. Figure 7b shows the 90% confidence interval of the estimated scaling exponent β for the algorithms, based on the statistics of the 50th and the 80th percentiles of the TTS distribution. For the 50th percentile, we observe that the PTDA yields superior scaling over the other three algorithms for the problem class with bimodal disorder. For the 80th percentile, there is not enough evidence to draw a strong conclusion on which algorithm scales better, because the 90% confidence intervals of the estimated scaling exponents overlap. However, the PTDA has the lowest point estimate.
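The fit itself is ordinary least squares on log10 TTS; a self-contained sketch on synthetic data (the α and β values below are made up for illustration, not the paper's estimates):

```python
import math

def fit_scaling(sizes, tts_means):
    """Least-squares fit of log10(TTS) = alpha + beta * N,
    i.e., TTS = 10^(alpha + beta * N)."""
    ys = [math.log10(t) for t in tts_means]
    n = len(sizes)
    mx = sum(sizes) / n
    my = sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(sizes, ys))
            / sum((x - mx) ** 2 for x in sizes))
    return my - beta * mx, beta  # (alpha, beta)

# Synthetic data drawn exactly from alpha = -3, beta = 0.002 is recovered.
N = [576, 676, 784, 900, 1024]
alpha, beta = fit_scaling(N, [10 ** (-3 + 0.002 * n) for n in N])
print(round(alpha, 6), round(beta, 6))  # -3.0 0.002
```

With real data, bootstrapped TTS distributions (Algorithm 4) would be fed through this fit repeatedly to obtain a confidence interval on β rather than a point estimate.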

For the DA, SA, and PT algorithms, we have searched over a large number of parameter combinations to experimentally determine a good set of parameters (see Appendix A), while the parameters of the PTDA have been determined automatically by the hardware. We have further experimentally determined the optimal number of sweeps for all four algorithms. However, we do not rule out the possibility that the scaling of the algorithms might be suboptimal due to non-optimal tuning of parameters. For example, the scaling of the PTDA might improve after tuning its parameters, and PT might exhibit better scaling using a more optimized temperature schedule.

Figure 8 illustrates the TTS statistics and the confidence interval of the scaling exponent for SK-Gaussian problem instances solved by the DA, SA, PT, and the PTDA. We observe that the DA continues to exhibit a constant speedup of at least two orders of magnitude over the other algorithms, with no strong scaling advantage, in solving spin-glass problems with Gaussian disorder.

The DA versus SA

The DA results in lower TTSs than SA for both SK-bimodal and SK-Gaussian problem instances. The reasons for this behaviour are twofold. First, the anneal time for the DA is independent of the number of variables and the density of the problem, whereas the computation time of a sweep in SA increases with the problem size and the problem density. Second, as shown in Fig. 9, the parallel-trial scheme significantly improves the success probability in fully connected spin-glass problems of size 1024 with both bimodal and Gaussian disorder. As expected, the boost in the low-degeneracy problem instances (with Gaussian coefficients) is higher.

Although the confidence intervals of the scaling exponents overlap, considering the statistics of the TTS80, the DA yields lower point estimates than SA for SK-bimodal and SK-Gaussian problems. In particular, β = 0.0021(7) [0.0019(3)] for the DA with bimodal [Gaussian] disorder, whereas β = 0.0027(6) [0.0028(5)] for SA with bimodal [Gaussian] disorder, thus providing a weak scaling advantage.

Our results on spin-glass problems with Gaussian disorder further indicate that the 16-bit precision of the hardware used in this study is not a limiting factor, because the DA outperforms SA on instances of these problems. Since there is a high variance in the couplers of spin-glass instances with Gaussian disorder, we expect that the energy gap between the ground state and the first excited state is likely greater than 10^−5 and, as a result, the scaling/rounding effect is not significant [71]. Our experimentation data on the prototype of the second-generation Digital Annealer, which has 64 bits of precision on both biases and couplers, also confirms that the higher precision by itself does not have a significant impact on the results presented in this paper. We leave a presentation of our experimental results using the second-generation hardware for future work.



FIG. 6: (a) Mean success probabilities (P) vs. the number of Monte Carlo sweeps (S) for 2D-Gaussian problem instances with N = 144 and N = 400 variables solved by the DA and SA. The error bars are not included, for better visibility. (b) The success probability correlation of 100 2D-Gaussian instances with N = 400 variables. The data points are obtained considering 10^6 and 2.5·10^5 Monte Carlo sweeps for SA and the DA, respectively.


FIG. 7: (a) Mean, 5th, and 95th percentiles of the TTS distributions (TTS50 and TTS80) for fully connected spin-glass problems with bimodal disorder. For some problem sizes, the percentiles are smaller than the symbols. (b) The 90% confidence interval of the estimated scaling exponent β based on the mean of the TTS50 and TTS80 for the SK-bimodal problem class. The label below each rectangle represents the TTS percentile on which the confidence interval is based. For each algorithm, the five largest problem sizes have been used to estimate the scaling exponent by fitting a linear model to log10 TTS. The R^2 of the fitted model is greater than or slightly lower than 90% for each algorithm.

C. Spin-Glass Problems with Different Densities

Our results for the two limits of the problem-density spectrum suggest that the DA exhibits TTSs similar to SA's on sparse problems, and outperforms SA on fully connected problems by a TTS speedup of approximately two orders of magnitude. To obtain a deeper understanding of the relation between the performance and the density, we have performed an experiment using random problem graphs with nine different densities. For each problem density, 100 problem instances with 1024 variables have been generated based on the Erdős–Rényi model [72], with bimodally distributed coupling coefficients and zero biases. The parameters of the DA and SA have been experimentally tuned for each of the nine prob-



FIG. 8: (a) Mean, 5th, and 95th percentiles of the TTS distributions (TTS50 and TTS80) for fully connected spin-glass problems with Gaussian disorder. For some problem sizes, the percentiles are smaller than the symbols. (b) The 90% confidence interval of the estimated scaling exponent β based on the mean of the TTS50 and TTS80 for the SK-Gaussian problem class. The label below each rectangle represents the TTS percentile on which the confidence interval is based. For each algorithm, the five largest problem sizes have been used to estimate the scaling exponent by fitting a linear model to log10 TTS. The R^2 of the fitted model is greater than or slightly lower than 90% for each algorithm.


FIG. 9: Success probability correlation (P) of 100 SK instances of size 1024 with bimodal and Gaussian disorder. The number of Monte Carlo sweeps in SA is 10^4 and the number of iterations in the DA is 10^7, corresponding to ≈ 10^4 Monte Carlo sweeps.

lem densities (see Appendix A).

Figure 10 shows the statistics of the TTS of the DA and SA for different problem densities. The TTS results for 2D-bimodal (d = 0.4%) and SK-bimodal (d = 100%) for a problem size of 1024, representing the limits of the density spectrum, are also included. The DA has lower TTSs than SA for all problem densities except for the sparsest problem set, 2D-bimodal. Not enough 2D-bimodal instances were solved to optimality using the DA and SA to estimate the statistics of the TTS80 distributions.
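The random-density instances used in this experiment can be generated along these lines (a sketch: the seed and the use of Python's random module are incidental choices, not the generator actually used):

```python
import random
from itertools import combinations

def erdos_renyi_ising(n, density, seed=0):
    """Random bimodal spin-glass instance: each of the n*(n-1)/2
    possible edges is present with probability `density`, carrying a
    coupling drawn uniformly from {-1, +1}; biases are zero."""
    rng = random.Random(seed)
    return {(i, j): rng.choice([-1, 1])
            for i, j in combinations(range(n), 2)
            if rng.random() < density}

J = erdos_renyi_ising(1024, 0.10)
print(len(J) / (1024 * 1023 / 2))  # empirical density, close to 0.10
```

The empirical density fluctuates around the target, so an experiment like this one typically reports the nominal edge probability d rather than the realized edge count.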


FIG. 10: Mean, 5th, and 95th percentiles of the TTS distributions (TTS50 and TTS80) for spin-glass problems with different densities (d).

Figure 11 shows the success probabilities of 100 spin-glass problem instances of size 1024 with different densities solved by the DA and SA. The DA has higher success probabilities than SA by a statistically significant margin for all of the densities except for the sparsest problem set, 2D-bimodal. We interpret these results as being due to both the increase in the


success probabilities from using a parallel-trial scheme and the constant time required to perform each Monte Carlo step on the DA hardware architecture.


FIG. 11: Success probabilities (P) of 100 spin-glass problem instances with different densities solved by the SA and DA algorithms. The number of Monte Carlo sweeps in SA is 10^4 and the number of iterations in the DA is 10^7, corresponding to ≈ 10^4 Monte Carlo sweeps.

VII. CONCLUSIONS AND OUTLOOK

In this work, we have compared the performance of the Digital Annealer (DA) and the Parallel Tempering Digital Annealer (PTDA) to parallel tempering Monte Carlo with and without isoenergetic cluster moves (PT+ICM and PT, respectively) and simulated annealing (SA), using random instances of sparse and fully connected spin-glass problems with bimodal and Gaussian disorder.

Our results demonstrate that the DA is approximately two orders of magnitude faster than SA and PT in solving dense problems, while it does not exhibit a speedup for sparse problems. In the latter problem class, the addition of cluster updates to the PT algorithm is very effective in traversing the energy barriers, outperforming algorithms that act on a single-flip neighbourhood, such as the DA and SA. For dense problems, the efficiency of the cluster moves diminishes such that the DA is faster, due to the parallel-trial scheme combined with the massive parallelization that is possible on application-specific CMOS hardware. Our results further support the position that the DA has an advantage over SA on random spin-glass problems with densities of 10% or higher.

In Sec. III we demonstrate that parallel-trial Monte Carlo can offer a significant boost to the acceptance probabilities over standard updating schemes. Furthermore, we show that this boost vanishes at high temperatures and is diminished for problems with high ground-state degeneracy. Our benchmarking results further support the view that the parallel-trial scheme is more effective in solving problems with low

ground-state degeneracy, because an accepted move is more likely not only to change the state configuration, but also to lower the energy value.

In the current, early implementation of the PTDA, the TTS is higher than it is likely to be in the future, due to the CPU overhead in performing PT moves. However, the PTDA algorithm demonstrates better scaling than the other three algorithms for fully connected spin-glass problems of average computational difficulty with bimodal couplings.

In the next generation of the Digital Annealer, the hardware architecture is expected to allow the optimization of problems with up to 8192 fully connected variables. In addition, the annealing time is expected to decrease, and we conjecture that the TTS might decrease accordingly. Finally, we expect the replica-exchange moves in the PTDA to be performed on the hardware, which could improve the performance of the PTDA.

Our results demonstrate that pairing application-specific CMOS hardware with physics-inspired optimization methods results in extremely efficient, special-purpose optimization machines. Because of their fully connected topology and high digital precision, these machines have the potential to outperform current analog quantum optimization machines. Pairing such application-specific CMOS hardware with a fast interconnect could result in large-scale transformative optimization devices. We thus expect future generations of the Digital Annealer to open avenues for the study of fundamental physics problems and industrial applications that were previously inaccessible with conventional CPU hardware.

CONFLICT OF INTEREST STATEMENT

Authors MA, GR, and EV are employed by the company 1QBit. Authors TM and HT are employed by the company Fujitsu Laboratories Ltd. Author HGK was employed by 1QBit when this research was completed and is currently employed by the company Microsoft Research.

The authors declare that there is no further commercial or financial relationship that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

Authors MA, GR, and HGK developed the methodology, implemented the code, performed the experiments, analyzed the results, and wrote the manuscript. Author EV partially contributed to implementing the code and conducting the experiments. Authors TM and HT carried out the experiments related to the PTDA algorithm.

FUNDING

This research was supported by 1QBit, Fujitsu Laboratories Ltd., and Fujitsu Ltd. The research of HGK was supported by the National Science Foundation (Grant No. DMR-1151387).


HGK's research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Interagency Umbrella Agreement No. IA1-1198. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

ACKNOWLEDGEMENT

The authors would like to thank Salvatore Mandra for helpful discussions; Lester Szeto, Brad Woods, Rudi Plesch, Shawn Wowk, and Ian Seale for software development and technical support; Marko Bucyk for editorial help; and Clemens Adolphs for reviewing the manuscript. We thank Zheng Zhu for providing us with his implementation of the PT+ICM algorithm [48, 49], and Sergei Isakov for the use of his implementation of the SA algorithm [47]. HGK would like to thank Bastani Sonnati for inspiration.

Appendix A: Simulation Parameters

In this section, we present the parameters of the algorithms used to solve two-dimensional and SK spin-glass problems with bimodal and Gaussian disorder, and then explain the parameters used to calculate the TTSs of the DA and SA when solving spin-glass problems with different densities.

2D & SK Spin-Glass Problems

For the DA, SA, and PT (PT+ICM) and each problem class, we have performed a grid search in the parameter space to determine the best parameters, using a subset of problem instances. The subset of instances used for parameter tuning includes instances of size 576, 784, and 1024. The number of instances solved to optimality, the success probability, and the residual energy have been used to select the best parameter combination for the benchmarking study.

In order to find the optimal TTS for a given problem size, we have varied the number of sweeps (iterations in the DA and the PTDA) and have used the procedure shown in Algorithm 4 to find the empirical distribution of the TTS for each number of sweeps. We have then found the number of sweeps for which the TTS distribution has the lowest mean and report the statistics of that distribution as the optimal TTS results. In the calculation of the TTS, we have excluded the initialization and post-processing times. The time spent on replica-exchange moves, currently performed via CPU, has been considered to be part of the PTDA's execution time. The time that it takes to execute one run of SA and PT (and PT+ICM) has been measured using r4.8xlarge Amazon EC2 instances, which consist of Intel Xeon E5-2686 v4 (Broadwell) processors.
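The sweep-selection loop described above can be sketched in code. The conventional TTS definition, R_99 = t_run · ln(1 − 0.99)/ln(1 − p), with p the per-run success probability, is assumed here; the bootstrap resampling, data layout, and function names are our own illustrative choices, not the authors' Algorithm 4:

```python
import numpy as np

def tts99(run_time, p_success):
    """Time-to-solution at 99% confidence for one (run time, success rate) pair."""
    if p_success >= 1.0:
        return run_time          # a single run already suffices
    if p_success <= 0.0:
        return np.inf
    return run_time * np.log(1 - 0.99) / np.log(1 - p_success)

def optimal_tts(results, n_boot=1000, rng=None):
    """results: {n_sweeps: (run_time, boolean array of per-run successes)}.
    Bootstraps an empirical TTS distribution for each sweep count and
    returns the sweep count whose distribution has the lowest mean."""
    rng = rng or np.random.default_rng(0)
    best = None
    for n_sweeps, (t_run, hits) in results.items():
        # resample runs with replacement to obtain a TTS distribution
        samples = [tts99(t_run, rng.choice(hits, size=len(hits)).mean())
                   for _ in range(n_boot)]
        mean_tts = np.mean(samples)
        if best is None or mean_tts < best[1]:
            best = (n_sweeps, mean_tts, samples)
    return best  # (best sweep count, mean TTS, TTS distribution)
```

More sweeps raise the per-run time but also the success probability, so the mean bootstrap TTS trades the two off, as in the procedure above.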

To set the grid for the high and the low temperatures, we have simulated the distribution of the energy differences associated with proposed moves and searched in the vicinity of the 5th (80th) to the 10th (85th) percentiles of this distribution to find the best-performing low (high) temperature value. In PT (PT+ICM), because we are able to measure different quantities during the simulation, we have further ensured that the highest temperature has been chosen such that the Monte Carlo acceptance probabilities are between 0.6 and 0.8. The parameter values used in each algorithm are outlined below.
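A minimal sketch of this percentile heuristic follows. We assume single-spin-flip proposals at uniformly random states of an Ising model with coupler matrix J; the exact move distribution the authors simulated is not specified, so this is illustrative only:

```python
import numpy as np

def energy_difference_distribution(J, n_samples=10000, rng=None):
    """Sample |dE| for random single-spin-flip proposals on an Ising model
    with symmetric, zero-diagonal coupler matrix J, at random states."""
    rng = rng or np.random.default_rng(1)
    n = J.shape[0]
    deltas = np.empty(n_samples)
    for k in range(n_samples):
        s = rng.choice([-1, 1], size=n)
        i = rng.integers(n)
        # flipping spin i changes the energy by +/- 2 * s_i * (local field)
        deltas[k] = abs(2.0 * s[i] * (J[i] @ s))
    return deltas

def temperature_grid(deltas):
    """Candidate search ranges near the percentiles quoted in the text."""
    t_low = np.percentile(deltas, [5, 10])    # search range for T_l
    t_high = np.percentile(deltas, [80, 85])  # search range for T_h
    return t_low, t_high
```

The intuition is that T_l should be comparable to the smallest typical energy barriers (so late-stage moves freeze out) while T_h should accept most proposed moves.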

The DA parameters

For all experiments, we have used 100 runs, each starting at a vector of zeros. The temperature schedule is linear in the inverse temperature, and the temperature has been adjusted after every iteration. The DA uses the Metropolis criterion to accept Monte Carlo moves. It is worth mentioning that our early experimentation suggested better performance for the linear inverse temperature schedule than for an exponential temperature schedule.
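A schedule that is linear in the inverse temperature ramps beta = 1/T uniformly from 1/Th to 1/Tl. A minimal sketch (our own helper, not the DA's internal implementation):

```python
import numpy as np

def inverse_linear_schedule(t_high, t_low, n_steps):
    """Temperatures for a schedule that is linear in beta = 1/T:
    beta increases uniformly from 1/t_high to 1/t_low over n_steps."""
    betas = np.linspace(1.0 / t_high, 1.0 / t_low, n_steps)
    return 1.0 / betas
```

Because beta is ramped linearly, the temperature itself falls fastest at the start of the anneal and slowly near Tl, where most of the equilibration effort is spent.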

Our investigation of different parameter combinations has further shown that the performance of the DA on spin-glass problems is insensitive to the dynamic offset mechanism. Therefore, we have turned this feature off for our final experimentation. The high (Th) and the low (Tl) temperatures used for each problem class are shown in Table I.

TABLE I: High (Th) and low (Tl) temperatures used in the DA, SA, and PT runs. Note that the row marked "PT" also includes the PT+ICM parameters. The low and the high temperatures of SA for 2D-bimodal and SK-bimodal are selected according to Refs. [46, 47]. The temperature values used in our simulations are unitless.

      2D-bimodal     2D-Gaussian       SK-bimodal     SK-Gaussian
      Th     Tl      Th       Tl       Th     Tl      Th       Tl
DA    2      0.66    2·10^4   4·10^2   40     4       2·10^5   10^4
SA    10     0.33    10^6     10^4     √N     1       10^7     10^5
PT    2      0.33    10^6     2·10^4   80     2       10^7     5·10^5

SA parameters

Each instance has been solved 100 times using SA, and the temperature schedule has been set to be linear in the inverse temperature, which is a typical choice for SA in the literature [46, 47]. The high (Th) and the low (Tl) temperatures used for each problem class are shown in Table I.

PT (PT+ICM) parameters

Although the performance of replica-exchange algorithms is significantly dependent on the choice of temperature schedule, the temperatures at each replica have been set based on the commonly used geometric schedule [42]. After determining the low and the high temperatures, the number of replicas has been chosen such that the replica-exchange acceptance probabilities are above 0.2. The number of replicas has been set to 25, 60, 50, and 60, respectively, for the 2D-bimodal, SK-bimodal, 2D-Gaussian, and SK-Gaussian instances. In contrast to the runs of the DA and SA, the replicas in PT (PT+ICM) are not independent, and to calculate the TTS, the whole PT (PT+ICM) algorithm has been repeated 30 times for each instance. Because this is time consuming, each run (repeat) has been stopped immediately once the reference solution has been found. The high (Th) and the low (Tl) temperatures used for each problem class are shown in Table I.
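The geometric ladder and the replica-exchange acceptance criterion it is tuned against can be sketched as follows. The standard Metropolis swap probability min(1, exp[(beta_i − beta_j)(E_i − E_j)]) is assumed; function names are our own:

```python
import numpy as np

def geometric_temperatures(t_low, t_high, n_replicas):
    """Geometric temperature ladder: T_k = t_low * r**k with a constant ratio r."""
    r = (t_high / t_low) ** (1.0 / (n_replicas - 1))
    return t_low * r ** np.arange(n_replicas)

def swap_acceptance(beta_i, beta_j, e_i, e_j):
    """Metropolis acceptance probability of exchanging two replicas'
    configurations, given their inverse temperatures and energies."""
    return min(1.0, np.exp((beta_i - beta_j) * (e_i - e_j)))
```

Adding replicas narrows the gap between neighboring temperatures, which raises the swap acceptance; the text's criterion (acceptance above 0.2) thus fixes the replica counts quoted above.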

The PTDA parameters

The number of replicas has been set to 40 for both the SK-bimodal and SK-Gaussian problem instances, and the dynamic offset feature has been turned off. The high and the low temperatures and the temperature schedule are set internally by an automatic parameter-tuning strategy (see also Sec. II). As for PT, since the replicas are dependent, the whole algorithm has been repeated 30 times to obtain enough observations to calculate the TTS for each instance.

Spin-Glass Problems with Different Densities

A grid-search approach on a subset of instances has been used to tune the parameters of the DA and SA for spin-glass problems, as explained in the previous section, separately for each density. The parameters of the DA and SA are the same as the ones used for the 2D and SK spin-glass problems, except for the temperature values, which are given in Table II.

TABLE II: High (Th) and low (Tl) temperatures used in the DA and SA runs of spin-glass problems with different densities. The temperature values used in our simulations are unitless.

         DA             SA
d (%)    Th     Tl      Th     Tl
10       20     2.0     14     0.5
20       50     2.0     20     0.5
30       60     2.0     24     0.5
40       65     2.5     28     1.0
50       75     3.0     30     1.0
60       75     3.0     34     1.0
70       60     3.5     36     1.0
80       65     3.5     38     1.0
90       70     4.0     40     1.0

[1] A. Lucas, Ising formulations of many NP problems, Front. Physics 12, 5 (2014).

[2] G. Rosenberg, P. Haghnegahdar, P. Goddard, P. Carr, K. Wu, and M. L. de Prado, Solving the optimal trading trajectory problem using a quantum annealer, IEEE Journal of Selected Topics in Signal Processing 10, 1053 (2016).

[3] M. Hernandez, A. Zaribafiyan, M. Aramon, and M. Naghibi, A novel graph-based approach for determining molecular similarity (2016), (arXiv:1601.06693).

[4] M. Hernandez and M. Aramon, Enhancing quantum annealing performance for the molecular similarity problem, Quantum Information Processing 16, 133 (2017).

[5] A. Perdomo-Ortiz, N. Dickson, M. Drew-Brook, G. Rose, and A. Aspuru-Guzik, Finding low-energy conformations of lattice protein models by quantum annealing, Sci. Rep. 2, 571 (2012).

[6] R. Y. Li, R. Di Felice, R. Rohs, and D. A. Lidar, Quantum annealing versus classical machine learning applied to a simplified computational biology problem, NPJ Quantum Inf. 4, 14 (2018).

[7] D. Venturelli, D. J. J. Marchand, and G. Rojo, Quantum annealing implementation of job-shop scheduling (2015), (arXiv:1506.08479v2).

[8] F. Neukart, D. Von Dollen, G. Compostella, C. Seidel, S. Yarkoni, and B. Parney, Traffic flow optimization using a quantum annealer, Frontiers in ICT 4, 29 (2017).

[9] D. Crawford, A. Levit, N. Ghadermarzy, J. S. Oberoi, and P. Ronagh, Reinforcement learning using quantum Boltzmann machines (2016), (arXiv:1612.05695v2).

[10] A. Khoshaman, W. Vinci, B. Denis, E. Andriyash, and M. H. Amin, Quantum variational autoencoder, Quantum Sci. Technol. 4, 014001 (2019).

[11] M. Henderson, J. Novak, and T. Cook, Leveraging adiabatic quantum computation for election forecasting (2018), (arXiv:1802.00069).

[12] A. Levit, D. Crawford, N. Ghadermarzy, J. S. Oberoi, E. Zahedinejad, and P. Ronagh, Free energy-based reinforcement learning using a quantum processor (2017), (arXiv:1706.00074).

[13] G. Moore, Cramming more components onto integrated circuits, Electronics Magazine 38, 114 (1965).

[14] M. W. Johnson, M. H. S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. J. Berkley, J. Johansson, P. Bunyk, et al., Quantum annealing with manufactured spins, Nature 473, 194 (2011).

[15] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, Optimization by simulated annealing, Science 220, 671 (1983).

[16] N. G. Dickson, M. W. Johnson, M. H. Amin, R. Harris, F. Altomare, A. J. Berkley, P. Bunyk, J. Cai, E. M. Chapple, P. Chavez, et al., Thermally assisted quantum annealing of a 16-qubit problem, Nat. Commun. 4, 1903 (2013).

[17] S. Boixo, T. F. Rønnow, S. V. Isakov, Z. Wang, D. Wecker, D. A. Lidar, J. M. Martinis, and M. Troyer, Evidence for quantum annealing with more than one hundred qubits, Nat. Phys. 10, 218 (2014).

[18] H. G. Katzgraber, F. Hamze, and R. S. Andrist, Glassy Chimeras could be blind to quantum speedup: Designing better benchmarks for quantum annealing machines, Phys. Rev. X 4, 021008 (2014).

[19] T. F. Rønnow, Z. Wang, J. Job, S. Boixo, S. V. Isakov, D. Wecker, J. M. Martinis, D. A. Lidar, and M. Troyer, Defining and detecting quantum speedup, Science 345, 420 (2014).


[20] H. G. Katzgraber, F. Hamze, Z. Zhu, A. J. Ochoa, and H. Munoz-Bauza, Seeking quantum speedup through spin glasses: The good, the bad, and the ugly, Phys. Rev. X 5, 031026 (2015).

[21] B. Heim, T. F. Rønnow, S. V. Isakov, and M. Troyer, Quantum versus classical annealing of Ising spin glasses, Science 348, 215 (2015).

[22] I. Hen, J. Job, T. Albash, T. F. Rønnow, M. Troyer, and D. A. Lidar, Probing for quantum speedup in spin-glass problems with planted solutions, Phys. Rev. A 92, 042325 (2015).

[23] T. Albash, T. F. Rønnow, M. Troyer, and D. A. Lidar, Reexamining classical and quantum models for the D-Wave One processor, Eur. Phys. J. Spec. Top. 224, 111 (2015).

[24] V. Martin-Mayor and I. Hen, Unraveling quantum annealers using classical hardness, Nat. Sci. Rep. 5, 15324 (2015).

[25] J. Marshall, V. Martin-Mayor, and I. Hen, Practical engineering of hard spin-glass instances, Phys. Rev. A 94, 012320 (2016).

[26] V. S. Denchev, S. Boixo, S. V. Isakov, N. Ding, R. Babbush, V. Smelyanskiy, J. Martinis, and H. Neven, What is the computational value of finite range tunneling?, Phys. Rev. X 6, 031015 (2016).

[27] J. King, S. Yarkoni, J. Raymond, I. Ozfidan, A. D. King, M. M. Nevisi, J. P. Hilton, and C. C. McGeoch, Quantum annealing amid local ruggedness and global frustration (2017), (arXiv:quant-ph/1701.04579v2).

[28] T. Albash and D. A. Lidar, Demonstration of a scaling advantage for a quantum annealer over simulated annealing, Phys. Rev. X 8, 031016 (2018).

[29] S. Mandra and H. G. Katzgraber, A deceptive step towards quantum speedup detection, Quantum Sci. Technol. 3, 04LT01 (2018).

[30] S. Mandra, Z. Zhu, W. Wang, A. Perdomo-Ortiz, and H. G. Katzgraber, Strengths and weaknesses of weak-strong cluster problems: A detailed overview of state-of-the-art classical heuristics versus quantum approaches, Phys. Rev. A 94, 022337 (2016).

[31] S. Mandra and H. G. Katzgraber, The pitfalls of planar spin-glass benchmarks: Raising the bar for quantum annealers (again), Quantum Sci. Technol. 2, 038501 (2017).

[32] R. Hamerly, T. Inagaki, P. L. McMahon, D. Venturelli, A. Marandi, T. Onodera, E. Ng, C. Langrock, K. Inaba, T. Honjo, et al., Scaling advantages of all-to-all connectivity in physical annealers: The coherent Ising machine vs. D-Wave 2000Q (2018), (arXiv:quant-ph/1805.05217).

[33] H. G. Katzgraber and M. A. Novotny, How small-world interactions can lead to improved quantum annealer designs, Phys. Rev. Applied 10, 054004 (2018).

[34] S. Matsubara, H. Tamura, M. Takatsu, D. Yoo, B. Vatankhahghadim, H. Yamasaki, T. Miyazawa, S. Tsukamoto, Y. Watanabe, K. Takemoto, et al., in Complex, Intelligent, and Software Intensive Systems – Proceedings of the 11th International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS-2017), Torino, Italy, July 10–12, 2017 (2017), p. 432.

[35] S. Tsukamoto, M. Takatsu, S. Matsubara, and H. Tamura, An accelerator architecture for combinatorial optimization problems, FUJITSU Sci. Tech. J. 53, 8 (2017).

[36] A. Sohn, Parallel n-ary speculative computation of simulated annealing, IEEE Trans. Parallel Distrib. Syst. 6, 997 (1995).

[37] A. Sohn, Parallel satisfiability test with synchronous simulated annealing on distributed-memory multiprocessor, J. Parallel Distrib. Comput. 36, 195 (1996).

[38] R. H. Swendsen and J.-S. Wang, Replica Monte Carlo simulation of spin-glasses, Phys. Rev. Lett. 57, 2607 (1986).

[39] C. Geyer, in 23rd Symposium on the Interface, edited by E. M. Keramidas (Interface Foundation, Fairfax Station, VA, 1991), p. 156.

[40] K. Hukushima and K. Nemoto, Exchange Monte Carlo method and application to spin glass simulations, J. Phys. Soc. Jpn. 65, 1604 (1996).

[41] D. J. Earl and M. W. Deem, Parallel tempering: Theory, applications, and new perspectives, Phys. Chem. Chem. Phys. 7, 3910 (2005).

[42] H. G. Katzgraber, S. Trebst, D. A. Huse, and M. Troyer, Feedback-optimized parallel tempering Monte Carlo, J. Stat. Mech. P03018 (2006).

[43] W. Wang, J. Machta, and H. G. Katzgraber, Population annealing: Theory and application in spin glasses, Phys. Rev. E 92, 063307 (2015).

[44] W. Wang, J. Machta, and H. G. Katzgraber, Comparing Monte Carlo methods for finding ground states of Ising spin glasses: Population annealing, simulated annealing, and parallel tempering, Phys. Rev. E 92, 013303 (2015).

[45] H. Karimi, G. Rosenberg, and H. G. Katzgraber, Effective optimization using sample persistence: A case study on quantum annealers and various Monte Carlo optimization methods, Phys. Rev. E 96, 043312 (2017).

[46] D. Venturelli, S. Mandra, S. Knysh, B. O'Gorman, R. Biswas, and V. Smelyanskiy, Quantum optimization of fully connected spin glasses, Phys. Rev. X 5, 031040 (2015).

[47] S. V. Isakov, I. N. Zintchenko, T. F. Rønnow, and M. Troyer, Optimized simulated annealing for Ising spin glasses, Comput. Phys. Commun. 192, 265 (2015), (see also ancillary material to arXiv:cond-mat/1401.1084).

[48] Z. Zhu, A. J. Ochoa, and H. G. Katzgraber, Efficient cluster algorithm for spin glasses in any space dimension, Phys. Rev. Lett. 115, 077201 (2015).

[49] Z. Zhu, C. Fang, and H. G. Katzgraber, borealis - A generalized global update algorithm for Boolean optimization problems (2016), (arXiv:1605.09399).

[50] J. Houdayer, A cluster Monte Carlo algorithm for 2-dimensional spin glasses, Eur. Phys. J. B 22, 479 (2001).

[51] G. Rosenberg, M. Vazifeh, B. Woods, and E. Haber, Building an iterative heuristic solver for a quantum annealer, Comput. Optim. Appl. 65, 845 (2016).

[52] J. Niemi and M. Wheeler, Efficient Bayesian inference in stochastic chemical kinetic models using graphical processing units (2011), (arXiv:1101.4242).

[53] E. E. Ferrero, A. B. Kolton, and M. Palassini, Parallel kinetic Monte Carlo simulation of Coulomb glasses, AIP Conference Proceedings 1610, 71 (2014).

[54] M. Matsumoto and Y. Kurita, Twisted GFSR generators, ACM Trans. Model. Comput. Simul. 2, 179 (1992).

[55] H. G. Katzgraber, Introduction to Monte Carlo methods (2009), (arXiv:0905.1629).

[56] Z. Zhu, A. J. Ochoa, F. Hamze, S. Schnabel, and H. G. Katzgraber, Best-case performance of quantum annealers on native spin-glass benchmarks: How chaos can affect success probabilities, Phys. Rev. A 93, 012317 (2016).

[57] K. Hukushima, Domain-wall free energy of spin-glass models: Numerical method and boundary conditions, Phys. Rev. E 60, 3606 (1999).

[58] G. M. Amdahl, in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (ACM, 1967), p. 483.

[59] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables (Dover, New York, 1964).


[60] B. S. Clarke and A. R. Barron, Jeffreys' prior is asymptotically least favorable under entropy risk, J. Stat. Plan. Inference 41, 37 (1994).

[61] M. Junger, G. Reinelt, and S. Thienel, in DIMACS Series in Discrete Mathematics and Theoretical Computer Science, edited by W. Cook, L. Lovasz, and P. Seymour (American Mathematical Society, 1995), vol. 20.

[62] G. Pardella and F. Liers, Exact ground states of large two-dimensional planar Ising spin glasses, Phys. Rev. E 78, 056705 (2008).

[63] F. Liers and G. Pardella, Partitioning planar graphs: A fast combinatorial approach for max-cut, Computational Optimization and Applications (2010), p. 1.

[64] M. Elf, C. Gutwenger, M. Junger, and G. Rinaldi, in Computational Combinatorial Optimization, edited by M. Junger and D. Naddef, Lecture Notes in Computer Science, vol. 2241 (Springer Verlag, Heidelberg, 2001).

[65] M. Grotschel, M. Junger, and G. Reinelt, in Heidelberg Colloquium on Glassy Dynamics (Springer, 1987), p. 325.

[66] D. Sherrington and S. Kirkpatrick, Solvable model of a spin glass, Phys. Rev. Lett. 35, 1792 (1975).

[67] http://informatik.uni-koeln.de/spinglass.

[68] F. Liers, M. Junger, G. Reinelt, and G. Rinaldi, Computing Exact Ground States of Hard Ising Spin Glass Problems by Branch-and-Cut (Wiley-Blackwell, 2005), chap. 4, p. 47.

[69] Information about the Biq Mac solver, offering a semidefinite-based branch-and-bound algorithm for solving unconstrained binary quadratic programs, is available at http://biqmac.uni-klu.ac.at/.

[70] Information about BiqCrunch, providing a semidefinite-based solver for binary quadratic problems, can be found at http://lipn.univ-paris13.fr/BiqCrunch/.

[71] We have used 10^-5 as the precision in our numerical simulation.

[72] P. Erdos and A. Renyi, On random graphs, Publicationes Mathematicae Debrecen 6, 290 (1959).