Quantum Annealing amid Local Ruggedness and Global Frustration James King,∗ Sheir Yarkoni, Jack Raymond, Isil Ozfidan, Andrew D. King, Mayssam Mohammadi Nevisi, Jeremy P. Hilton, and Catherine C. McGeoch D-Wave Systems (Dated: March 2, 2017) ArecentGooglestudy[Phys.Rev.X,6:031015(2016)]comparedaD-Wave2Xquantumprocess- ing unit (QPU) to two classical Monte Carlo algorithms: simulated annealing (SA) and quantum Monte Carlo (QMC). The study showed the D-Wave 2X to be up to 100 million times faster than the classical algorithms. The Google inputs are designed to demonstrate the value of collective multiqubittunneling,aresourcethatisavailabletoD-WaveQPUsbutnottosimulatedannealing. 7 But the computational hardness in these inputs is highly localized in gadgets, with only a small 1 amount of complexity coming from global interactions, meaning that the relevance to real-world 0 problems is limited. 2 InthisstudyweprovideanewsyntheticproblemclassthataddressesthelimitationsoftheGoogle inputswhileretainingtheirstrengths. Weusesimpleclustersinsteadofmorecomplexgadgetsand r a more emphasis is placed on creatingcomputational hardness through frustrated global interactions M like those seen in interesting real-world inputs. The logical spin-glass backbones used to generate these inputs are planar Ising models without fields and the problems can therefore be solved in 1 polynomial time [J. Phys. A, 15:10 (1982)]. However, for general heuristic algorithms that are unaware of the planted problem class, the frustration creates meaningful difficulty in a controlled ] environment ideal for study. h We use these inputs to evaluate the new 2000-qubit D-Wave QPU. We include the HFS p algorithm—the best performer in a broader analysis of Google inputs—and we include state-of- - t the-artGPUimplementationsofSAandQMC.TheD-WaveQPUsolidlyoutperformsthesoftware n solvers: whenweconsiderpureannealingtime(computationtime),theD-WaveQPUreachesground a states up to 2600 times faster than the competition. In the task of zero-temperature Boltzmann u samplingfromchallengingmultimodalinputs,theD-WaveQPUholdsasimilaradvantageanddoes q not see significant performance degradation due to quantum sampling bias. [ 2 v CONTENTS ruggedness and classical hardness 6 9 7 IV. Software solvers 7 5 I. Introduction 2 4 A. Proposing a new problem class 2 0 V. Optimization 7 1. B. Evaluation of the 2000-qubit D-Wave A. Varying ruggedness via logical 0 QPU 2 complexity 7 7 1 B. Varying ruggedness by scaling 8 II. D-Wave quantum processing units 3 : v A. Ising minimization 3 i X VI. Sampling 9 B. Chimera topology 3 r A. Sampling from all valleys 9 a C. Quantum annealing 3 B. Mining for interesting valley structure 9 D. Modeling performance 4 C. Sampling results 10 III. Frustrated Cluster Loop problems 5 VII. Conclusions 12 A. Ruggedness and clusters 5 B. FCL problem generation 5 References 13 C. Problem class parameters 6 A. Calculation of decorrelation 14 D. Confirming correlation between B. Details of software solvers 15 ∗ [email protected] 1. Classical hardware 15 2 2. Included software solvers 15 recently introduced an input class designed to ben- efit immensely from quantum tunneling. We refer 3. Excluded software solvers 15 to these inputs as Google problems. Their study 4. Parameter tuning 16 showed a massive speed increase (up to 100 million times faster) of a D-Wave 2X system over simulated annealing (SA) and quantum Monte Carlo (QMC), also known as simulated quantum annealing. This I. INTRODUCTION provided strong evidence for the ability of quan- tum annealing to leverage quantum tunneling in a Quantum annealers are designed to take advantage computationally relevant way. However, the spin- ofquantumtunnelingtofindgoodsolutionstohard glass backbones of the Google problems are easy optimization problems. When constructing a family to solve, meaning that a) the problems have lim- ofsyntheticinputstotestthepotentialofaquantum itedrelevancetoreal-worldproblems,andb)certain annealingplatform,oneshouldthereforeensurethat cluster-detectingalgorithmscansolvethemwithrel- the inputs a) are such that solvers can benefit from ative ease [4]. quantum tunneling, and b) are hard optimization In this study we provide a problem class that aims problems with global frustration. to retain the advantages of Google problems while For a solver to benefit from quantum tunneling, the being more reflective of real-world problems. They energy landscape associated with the input must are more reflective of real-world problems because, have tall, thin energy barriers. For an input to rather than relying too heavily on finely-tuned gad- be computationally hard, the input must have con- gets, they derive much of their computational hard- straints that interact with each other in nontrivial ness from larger spin-glass backbones with planted ways. frustration. Quantum processing units (QPUs) developed by Our problems are synthetic and are easy to solve D-Wave Systems that use the quantum annealing using knowledge of the problem class.1 More specif- algorithm have been commercially available since ically,sincethelogicalproblemsareIsingmodelson 2011. These QPUs solve Ising model inputs defined a planar lattice without fields, they are solvable in on the underlying working graph of the chip. There polynomialtime[5]. However,frustrationinthelog- have been various efforts to evaluate the perfor- icalproblemscreatesmeaningfuldifficultyforheuris- manceoftheD-Wavesystemsusingsyntheticinputs tic methods that are unaware of the planted prob- generated randomly from different distributions, or lem class. Further, the inputs have properties such input classes. astunableruggednessthatmakethemusefulforthe evaluation of quantum annealing and classical ap- This study has two main contributions: to propose proximations thereof. In this way, they are similar a new problem class ideal for evaluating D-Wave toKauffman’sNKmodelthathasprovedveryuseful QPUs, and to use this problem class to evaluate the in the analysis of evolutionary algorithms [6–8]. 2000-qubit D-Wave QPU. B. Evaluation of the 2000-qubit D-Wave QPU A. Proposing a new problem class Weusethisnewproblemclasstoevaluatethelatest- Previous evaluations of D-Wave QPUs have used generation D-Wave QPU. We measure its perfor- problem classes that benefit either too little or too manceinabsolutetermsandweanalyzeitsresponse much from quantum tunneling to be ideal for evalu- to the ruggedness parameters of the problem class. ating quantum annealers. The software competition we consider is much Ononesideofthisspectrumwehaveproblemssuch stronger than that considered by Denchev et al. [3], as random unstructured ±1 problems on the Chi- and includes GPU implementations of SA, QMC, meratopologynativetoD-WaveQPUs. Thesewere usedbyRønnowetal.[1]intheirevaluationoftheD- WaveTwoQPUin2014,buttheyarenowknown[2] to lack a finite-temperature phase transition, mean- 1 Forexample,thesuper-spinheuristic[4]thatreliesonhard- ingthatquantumtunnelingisunlikelytoplayasig- coded knowledge of clusters would be far faster than the nificant role when solving them. software solvers we consider. However, such heuristics do not generalize to other problem classes and it would not Ontheothersideofthespectrum,Denchevetal.[3] makesensetoincludethemascompetitionsolvers. 3 and SVMC, and also includes Selby’s implementa- is antiferromagnetic, the optimal solution is called tion [9, 10] of the Hamze-de Freitas-Selby (HFS) al- a ground state, and nonoptimal solutions are ex- gorithm [9, 11]. In the study of Mandrà et al. [4] cited states. IM instances can be trivially trans- thatusedawidearrayofalgorithmstosolveGoogle formed to Quadratic Unconstrained Boolean Op- problems, Selby’s implementation of HFS was the timization (QUBO) instances defined on integers fastest software solver in terms of both scaling and s={0,1}, ortoMaximumWeighted2-Satisfiability absolute speed. (MAX W2SAT) instances defined on Booleans s = {true,false}, all of which are NP-hard. WefindthattheD-WaveQPUisabletofindground statesupto2600timesfasterthanthesoftwarecom- petition. We also consider the problem of sampling from ground states and find that the D-Wave QPU maintainsasimilaradvantageanddoesnotstruggle B. Chimera topology to find a diverse set of optimal solutions. The remainder of the paper is organized as fol- The native connectivity topology for the D-Wave lows. In Section II we provide a description of the QPU is based on a C Chimera graph containing 16 2000-qubitD-WavesystemandahistoryofD-Wave 2048 vertices (qubits) and 6016 edges (couplers). QPUs. In Section III we present the problem class A Chimera graph of size C is an s × s grid of analyzed in this paper and discuss the concept of Chimera cells (also called unsit tiles or unit cells), ruggedness and its relevance to optimization prob- eachcontainingacompletebipartitegraphon8ver- lems. In Section IV we discuss the software solvers tices (a K ). Each vertex is connected to its four used in our evaluations, as well as notable solvers neighbors4i,n4side the cell as well as two neighbors that were not suitable. In Section V we present (north/south or east/west) outside the cell: there- our experimental results on optimization. In Sec- fore every vertex has degree 6 excluding boundary tion VI we present our experimental results on sam- vertices. pling from ground states. In Section VII we provide In this study, as in others, we vary the problem size further discussion and conclude the paper. using square subgraphs of the full graph, from size C (128 vertices) up to C (2048 vertices). Note 4 16 thatthenumberofproblemvariablesn=8s2 grows II. D-WAVE QUANTUM PROCESSING quadratically with Chimera size. The reason we UNITS measure algorithm performance as a function of the Chimera size and not the number of qubits is that WestartwithanoverviewofD-Wavedesignfeatures problem difficulty tends to scale exponentially with andintroducenotationthatwillbeusedthroughout. the Chimera size, i.e., with the square root of the FordetailsaboutunderlyingtechnologiesseeBunyk number of qubits, since the treewidth of a Chimera etal.[12],Dicksonetal.[13],Harrisetal.[14],John- graph C is linear in s [17, 18]. s son et al. [15] or Lanting et al. [16]. Because the chip fabrication and trapped magnetic flux leave some small number of qubits unusable, eachprocessorhasaspecifichardwareworkinggraph A. Ising minimization H ⊂ C16. The qubit yield—the fraction of qubits thatareoperational—istypicallyaround98%forthe 2000-qubitD-Wavesystemwhereas95%wastypical D-Waveannealing-basedquantumprocessorsarede- for the D-Wave 2X. The working graph used in this signed to find minimum-cost solutions to the Ising study has 2035 working qubits out of 2048. minimization (IM) problem, defined on a graph G = (V,E) as follows. Given a collection of fields h = {h : i ∈ V} and couplings J = {J : (i,j) ∈ i ij E}, assign values from {−1,+1} to n spin variables C. Quantum annealing s={s } so as to minimize the energy function i E(s)=(cid:88)h s + (cid:88) J s s . (1) D-WaveprocessorssolveIsingproblemsbyquantum i i ij i j annealing (QA) in the form proposed by Kadowaki i∈V (i,j)∈E and Nishimori [19]. The QA algorithm is imple- The spin variables s can be interpreted as mag- mented in hardware using a framework of analog neticpolesinaphysicalparticlesystem; inthiscon- control devices to manipulate a collection of qubit text, negative J is ferromagnetic and positive J states according to a time-dependent Hamiltonian ij ij 4 shown below. a. Program. Load (h,J) onto the chip; denote the elapsed programming/initialization time t . H(t)=A(t)·H +B(t)·H . (2) i init prob QA carries out a gradual transition in time t : b. Anneal. CarryouttheQAalgorithm. Anneal 0 → t , from an initial ground state in H , to a timeta canbesetbyusertosomevalue5µs≤ statedaescribedbytheproblemHamiltonianinHit = ta ≤1000µs. prob (cid:80) h σz +(cid:80) J σzσz. The problem Hamiltonian i i i ij ij i j matches the energy function (1), so that a ground c. Read. Record qubit states to obtain a solu- state for H is a minimum-cost solution to E(s). prob tion; denote the elapsed readout time t . r QA is closely related to adiabatic quantum comput- ing (AQC). The AQC model of computation was proposed by Farhi et al. [20] who showed that if the d. Repeat. Repeat steps b and c k times to ob- transitioniscarriedoutslowlyenoughthealgorithm tain a sample of k solutions. will find a ground state (i.e., an optimal solution) with high probability. Wedefinesampletimet andtotaltimeT asfollows: Theoretical guarantees about solution times for s quantumalgorithms(foundin[20])assumethatthe computation takes place in an ideal closed system, t =(t +t ) (3) s a r perfectlyisolatedfromenergyinterferencefromam- T =t +kt . bient surroundings. The 2000-qubit D-Wave chip is i s housed in a highly shielded chamber and cooled to near absolute zero; nevertheless, as is the case with For the D-Wave system studied in this paper, the any real-world quantum device, it must suffer some amount of interference, which has the general effect median programming time ti is 9.5ms and the me- of reducing the probability of landing in a ground dian readout time tr is 123µs. state. Thus, theoretical guarantees on performance In this study, both for software solvers and for the may not apply to these systems. We consider any D-Wave QPU, we typically report annealing time D-WaveQPUtobeaheuristicsolver,whichrequires rather than total time. Annealing time is the mea- empirical approaches to performance analysis. sure of the algorithm proper, and measuring total time often obscures trends in data. Scaling plots The D-Wave QPU studied here contains 2035 ac- areparticularlysusceptibletothisbecausetheover- tive qubits (quantum bits) and 5912 active cou- head of programming time makes scaling—typically plers made of microscopic loops of niobium con- presented on a semilog plot—look totally flat ex- nectedtoalargeandcomplexanalogcontrolsystem cept for an uptick at the very largest problem sizes. via an arrangement of Josephson Junctions. Ther- Further, we are most interested in the future po- mometry on the refrigerator of the D-Wave QPU tential of D-Wave QPUs, and we expect that pro- and fits of single qubit measurements to a thermo- gramming time and readout time will be reduced to dynamic model indicate that T (cid:46) 15mK. When smallfractionsoftheircurrentvalues; minimuman- cooled to temperatures below 9.3K, niobium be- nealing times will similarly be reduced, allowing us comesasuperconductorandiscapableofdisplaying better control over the algorithm parameters. For quantum properties including superposition, entan- reference,sincemanypeoplewillbeinterestedinto- glement, and quantum tunneling. Because of these tal wall clock time, rather than annealing time, a properties, the qubits on the chip behave as a quan- 1000 times speedup over software solvers in anneal- tum mechanical particle process that carries out a ing time, typical for the D-Wave QPU in this study, transition from initial state described by H to a init translatesroughlytoa30timesspeedupintotalwall problem state described by H [13, 16, 21]. prob clock time including programming and readout. System characteristics of D-Wave QPUs such as yield can vary within a generation. If we compare this specific 2000-qubit D-Wave system to the spe- D. Modeling performance cificD-Wave2XQPUstudiedin2015[22],program- ming time has decreased by 20%, readout is three Given input instance (h,J), a D-Wave computation times faster, and yield has improved from 95% to involves the following steps. 99%. 5 III. FRUSTRATED CLUSTER LOOP fields; annealers must then go over or through an PROBLEMS energy barrier to reach the gadget’s ground state. Instead of using two-cluster gadgets, we simply use single-cell ferromagnetic clusters to induce rugged- A. Ruggedness and clusters ness, leaving us with a simpler problem class. Ruggedness is a feature of certain optimiza- tion problems—more specifically their energy landscapes—characterized by tall energy barriers B. FCL problem generation and many local optima [23, 24]. Typically, rugged problems are harder to solve, particularly with We create local ruggedness by treating unit cells Markov chain Monte Carlo (MCMC) methods [25, of the Chimera graph as ferromagnetically-coupled 26]. Inthelate1980s,whenruggednesswasfirstbe- clusters. We create global frustration by joining ing explored in the context of evolutionary biology these clusters together using a problem generated and bio-inspired computing, Kauffman’s NK model on the logical graph of clusters. This creates an wasputforwardasamodelwithtunableruggedness energylandscape thatismacroscopicallyinteresting inspired by genetic fitness functions under varying and in which the clusters induce wells separated by degrees of epistasis, or how many other genetic loci tall, thin energy barriers. affect the fitness contribution of a given locus [6–8]. ThetunableruggednessoftheNKmodelhasproved Thelogicalgraphofclustersisasquarelattice,with veryvaluableinthestudyofoptimizationheuristics, alogical16×16latticeofclustersspanningthework- particularly evolutionary algorithms [27]. inggraphofa2000-qubitD-WaveQPU.2 Theprob- Closely related to ruggedness is the analysis of spin lems we generate on the logical graph are frustrated overlap, in which landscape features are inferred loop problems, constraint satisfaction problems first used in the evaluation of D-Wave QPUs by Hen et fromthedistributionofoverlapoftworandomstates al.[35]andmodifiedtoallowprecisionlimitsbyKing sampled from the Boltzmann distribution [28, 29]. et al. [36]. Tall, thin peaks in the spin overlap distribution tend to correspond to tall, thin energy barriers; the Werefertothefinalinputsasfrustrated cluster loop presence of these features correlates not only with (FCL) problems. For a given Chimera graph G C ruggedness and classical hardness, but also with thatmayormaynothavemissingqubitsorcouplers, applicability of quantum annealing, since quantum anFCLproblemisgeneratedfromthreeparameters, tunneling is likely to be a useful computational re- α (the clauses-to-variables ratio), ρ (the range, or source in the presence of these tall, thin barriers. precision), and R≥ρ (the ruggedness) as follows: Zhuetal.[30]haveusedspinoverlapfeaturestopre- dictwhetheraproblemcanbesolvedbyQAmoreef- ficientlythanbySA,showingpromisingpreliminary a. Defineeachunitcellasalogicalspin ifithasno results for optimization problems such as weighted missing qubits or couplers. Use c(v) to denote the logical spin index corresponding to qubit partial MAX-2SAT, minimum vertex cover, satis- fiability, graph partitioning, circuit fault diagnosis, v. and certain spin-glass instances [30, 31]. This work b. Whereverallfourcouplersconnectingtwolog- pointstothepotentialofquantumannealerstohave ical spins are present, define these couplers as aplaceinportfoliosolvers[32]andhybridalgorithms runningonheterogeneouscomputingsystemsalong- a logical coupler. side CPUs, GPUs, and other coprocessors [33, 34]. c. DefinethelogicalgraphG asthegraphcom- To induce ruggedness using tall, thin energy barri- L prising the logical spins and logical couplers. ers,Denchevetal.[3]usedferromagneticallycoupled unit tiles as clusters. Flipping such a cluster in the absenceoffieldsorexternalcouplingsinvolvesjump- ingoverortunnelingthroughanenergybarrierthat is 16 Ising units high and has a width of 8 in Ham- 2 Notethata16×16logicallatticeissignificantlylargerthan mingspace. Denchevetal.[3]actuallygobeyondus- thelargestlogicallattice,4×4,oftheGoogleproblemscon- sideredbyDenchevetal.[3]—theirtwo-tilegadgetstakeup ingsingle-tileclustersandusetwo-tilegadgetsstud- more space than our one-tile clusters and the D-Wave 2X ied previously by Boixo et al. [21]. This gadget is has a smaller working graph than the 2000-qubit D-Wave made up of two clusters that form a deceptive trap system. Theselargerlogicalgraphsintheproblemswecon- to draw annealers into a local minimum using local sidermeanthatthespin-glassbackbonesoftheseproblems aresignificantlymorecomputationallychallenging. 6 d. Generate a range-bounded frustrated loop 102 problemHamiltonian(h ,J )onG usingpa- n L L L o rameters α and ρ as per King et al. [36] (note ti u that h is the zero vector). ol Range(ρ) L s 3 o e. Define the native Chimera Hamiltonian t 4 s (hC,JC) with hC as the zero vector and JC ple 5 as: m 6 a (cid:40) −1, if c(u)=c(v) S 1010.5 0.6 0.7 0.8 0.9 1.0 1.1 ∞ J (u,v)= C 1 ·J (c(u),c(v)), otherwise. α R L It is worth repeating that these Hamiltonians have FIG. 1. Logical problem difficulty as measured by ex- no fields (i.e., h and h are both zero vectors). pected samples to solution for simulated annealing. Er- L C Note also that in-tile couplings in J are all −1 and ror bars show the 95% confidence intervals for the me- C inter-tile couplings take values in dians, grouped over α and ρ. Difficulty is maximized at α=0.65forprecision3,α=0.75forprecision4,α=0.8 {j/R | j ∈{−R,−R+1,...,R}}. for precision 5, and α=0.85 for precision 6. Since we ensure that ρ ≤ R, and ρ ≥ |j| for any logicalcouplingj,allcouplingsinJ areintherange value of α, problems with a lower ρ value have their C [−1,1] loops spread out more evenly over the logical spins. Finally, for the native problem, coupler values are Thelogicalfrustratedloopproblemsmaybediscon- scaled down by a factor of R ≥ ρ so that inter-tile nectedandhavemultiplecomponents;werejectsuch couplings are in the range [−1,1] (in-tile couplings disconnected inputs at generation time. arealways−1). ThushighervaluesofρconstrainR While these problems are large enough to span the tobehigher,andmakeproblemsmorelocallyrugged entire working graph of the latest D-Wave QPUs, relativetotheglobalHamiltonian. InSectionV,we the repetition code inherent in logical couplers and attempt to deconvolve the impacts of ρ and R. spins makes them relatively robust to analog errors The ability to tune the ruggedness of the inputs by [37]. varyingR,eitherbyspecifyingR=ρandvaryingρ, or by varying R independently, gives FCL problems anadditionaldegreeofutility, particularlywhenas- C. Problem class parameters sessing the value of quantum tunneling and the po- tential of quantum annealing. Varying ρ and spec- ifying R = ρ makes problem generation simpler by The FCL problem class has three parameters: the reducingthenumberoffreeparameterswhereasfix- clauses-to-variables ratio α, the range ρ, and the ing ρ and varying R allows us to isolate the impact ruggedness R. We would like to restrict our experi- ofruggednesswithoutalteringthecomplexityofthe ments to the most interesting region of the parame- logical problem. ter space. First we aim to determine the value of α that max- imizes the difficulty of the logical problem. If α is too low, a problem is underconstrained and is easy D. Confirming correlation between ruggedness to solve. If α is too high, the planted solution is and classical hardness expressed too strongly and the problem’s features approach those of a ferromagnet, making it easy to We expect to see a positive correlation between solve. The difficulty of the logical problem depends ruggedness and classical hardness. Here we charac- only on α and ρ, not on R. For various values of ρ, terizeclassicalhardnessusingthedecorrelationtime weperformasweepofαtodeterminethevaluethat of an MCMC procedure. To validate this assump- maximizes the hardness of the logical problem (see tionwemeasurethedecorrelationtimeforaparallel Figure 1). tempering (PT) procedure that uses the Metropolis The impacts of ρ are more nuanced. First, the algorithm in combination with the standard replica value of ρ provides an upper bound on the limit of exchange rule [38]; we use the autocorrelation of α because packing in more loops eventually raises temperature as our measure of decorrelation [39]. the maximum coupler range. Second, for a fixed For more details of this method, see Appendix A. 7 competitive in an absolute sense. Indeed, of the 105 three classes of solvers that they study, only se- quential algorithms, which they find to have the e ur worst performance, have the massive parallelizabil- t a ityandlowmemoryrequirementsthatmakeefficient r pe GPU implementations possible. It is also possible m to implement SA in a field-programmable gate ar- e t ray (FPGA), but the additional speedup over GPU f o n 104 implementation is limited and generally not worth o the increased cost of hardware. ti a el InAppendixBwegivefurtherdetailsofthesoftware r r solvers and parameterizations we used. We also dis- o c o cuss algorithms we omitted because of prohibitive t u runtimes. A 103 3.0 3.5 4.0 4.5 5.0 Ruggedness (R) V. OPTIMIZATION FIG. 2. Box plots showing ruggedness versus classical We measure the expected time to solution (TTS) of hardness. We hold ρ and α fixed at 3 and 0.65, respec- different solvers on the inputs, calculated as tively,andvarytheruggednessRofthenativeChimera problem. AteachvalueofRwegenerated100instances; time per anneal points indicate outliers. Classical hardness is measured TTS= ground state probability. using the autocorrelation of temperature. There is a clearpositivecorrelationbetweenruggednessandclassi- We consider only annealing times and exclude pro- cal hardness. gramming and readout times from our analysis as these are not part of the algorithms proper. Figure2illustratestherelationshipbetweenrugged- For a given value of ρ, we choose α to maximize ness and classical hardness. Confirming our intu- the difficulty of the logical problem (see Figure 1). ition, FCL problems with greater ruggedness are For each selection of ρ and R, we generate 100 FCL characterized by greater classical hardness. problems at each problem size and solve each prob- lem with each solver. IV. SOFTWARE SOLVERS A. Varying ruggedness via logical complexity The four software solvers we consider are GPU im- In our first experiment, we vary ρ and set R=ρ. In plementationsofSA,QMC,andSVMC,andSelby’s thiscasetheonlyfreeparameterρcontrolsboththe CPU implementation of HFS [9, 10]. ruggedness and the logical complexity of the inputs. Recent studies of D-Wave QPUs have not included Time-to-solution plots are shown in Figure 3. At GPU-basedsoftwaresolversdespitethefactthatSA the largest problem size, the D-Wave QPU is three is very amenable to GPU implementation [40]. The orders of magnitude faster than the fastest software addition of GPU solvers is a significant raising of solver for each value of ρ. D-Wave’s speedup over thebarintermsofsoftwarecompetition,andmeans software peaks at 2600 times for ρ=4. thatsolversthatcanbeimplementedonGPUshave As ρ increases, the impact of local ruggedness in- taken a leap forward relative to solvers that cannot. creasesasthelogicalHamiltonianiscompressedrel- Runonmodernhardware,ourGPU-basedalgorithm ative to the local wells induced by the clusters. The implementations are roughly 1000 times faster than performance of SA drops off sharply while the per- thecorrespondingsingle-coreCPUimplementations. formance of DW and QMC declines gracefully. The Mandrà et al. [4] analyzed the performance of a performance of HFS decreases only very slightly. diverse set of solvers on the inputs of Denchev et HFS is not affected by the local ruggedness because al. [3]. However the lack of GPU implementations it is tailored to the Chimera topology and uses up- of these solvers means that most are unlikely to be dates that contain entire clusters; the performance 8 ρ = 3 ρ = 4 ρ = 5 ρ = 6 ) 108 s µ 107 ( n 106 tio 105 DW olu 104 SA os 103 SVMC t 102 QMC me 101 HFS Ti 100 4 8 12 16 4 8 12 16 4 8 12 16 4 8 12 16 Logical lattice size FIG. 3. Time to solution for D-Wave and software solvers with range values ρ ∈ {3,4,5,6}. For each value of ρ, α ischosentomaximizelogicalhardness. Shownaremedianvalues(over100inputsateachsize)with95%confidence intervals. degradation is due to the slight increase in logical problem hardness. 102 DW All solvers except HFS have strictly convex scaling SA curves because the anneal lengths are optimized for SVMC the largest problem size and are too long for the k QMC r smaller problems. HFS does not use fixed-length o w annealsandendsupusingshorterannealsonsmaller ve 101 inputs. ti a Thoughtruescalingismaskedbytheinabilitytoop- el R timizeparametersforsmallerinputs[1,41], wenote that the performance of the D-Wave QPU scales at least as well as the software solvers between the two 100 largest problem sizes. 2.5 3.0 3.5 4.0 4.5 5.0 5.5 Ruggedness (R) B. Varying ruggedness by scaling FIG.4. Ruggedness(increasingfromlefttoright)versus relative work for various solvers. Relative work for each solver is calculated as TTS divided by median TTS at In our second experiment, we fix ρ = 3 and vary R = 3.0. The solvers have notably different responses theruggednessR. Thiskeepsthelogicalcomplexity to increasing ruggedness, with SA struggling the most, constant,allowingustoisolatetheimpactofrugged- followed by SVMC, then QMC, then the D-Wave QPU. ness on the various solvers. Here we consider only HFS deals with these energy barriers using exponential the largest problem size having a 16×16 logical lat- brute force; therefore the parameter R does not affect itsperformance. Markersindicatemedians(over100in- tice. puts) and error bars indicate 95% confidence intervals Consistent with our findings when varying ρ, tun- for the median. ing the ruggedness directly by varying R increases difficulty dramatically for simulated annealing and less so for other solvers (see Figure 4). Excluding over SVMC indicates that crucial information is be- HFS,whosebehaviorisconstantinthisexample,the ing lost in the mean-field approximation. The im- workrequiredbyasolveressentiallyscalesaccording proved scaling of QA (i.e., the D-Wave QPU) over to its quantumness. The D-Wave QPU is most ca- QMC may indicate that QMC is failing to faithfully pable of dealing with ruggedness. QMC—the most simulate the dynamics of the QA processor, or it faithful classical simulation of quantum annealing— may simply be an artifact of our inability to use comes next, followed by SVMC, which is a mean- fasterD-Waveanneals. Thisbearsfurtherinvestiga- field approximation to QMC. Bringing up the rear tion using future D-Wave QPUs with faster anneal- is SA, a simulation of a fully classical process. ing times, again utilizing the tunable ruggedness of The improved scaling (versus ruggedness) of QMC FCL problems. 9 VI. SAMPLING statescomeinantipodalpairs,3 andbyextensionso do valleys. We treat each pair of antipodal valleys as a single valley since it is trivial to move from one The ability of an Ising solver to sample diverse op- to the other. tima has both practical and theoretical importance. Ground state sampling in combinatorial problems Withvalleysdefinedinthisway,wedefinethetime- is the basis for construction of space-efficient SAT- to-all-valleys(TTAV)metricastheexpectedamount based membership filters [42, 43]. The associated of annealing time required to draw at least one complexityclass,#P—thecountinganalogofNP— sample from each valley. This metric captures the hasbeenthesubjectofextensiveresearchintheoret- hardest part of sampling from these distributions— icalcomputersciencesincethe1970s[44]. Sampling finding ground states in every mode—and ensures from the Boltzmann distribution, in which states a diverse set of solutions. With at least one sam- with equal energy are sampled with equal proba- ple from each valley, it is possible to find all ground bility, is of particular interest in machine learning. states using only a modest amount of postprocess- BoltzmannsamplesareusedtotrainBoltzmannma- ing. The TTAV metric is most meaningfully inter- chines,ataskknowntobebothhardanduseful[45]. preted relative to TTS since hitting valleys directly depends on hitting ground states. While machine learning applications typically de- pendonfinite-temperatureBoltzmannsampling,us- ing near-optimal states as well as optimal states, we focus on zero temperature sampling to sim- B. Mining for interesting valley structure plify our investigation. This saves us from hav- ing an additional input parameter β—the inverse Samplingfromallvalleysisnotalwaysmuchharder temperature—that we would have to either set ar- than finding a single ground state—an input may bitrarily or determine empirically. Empirical esti- have only a single valley or may have valleys that mation of β can be challenging [46] and basing the are all very close in Hamming space. We wish to target β on the output of a solver would arguably generate inputs with multiple valleys that are well- give that solver an unfair advantage. separated. Sampling from distributions with mul- tiple, well-separated valleys is particularly hard [48] andhasimportantapplicationssuchasclassification using deep Boltzmann machines [49]. Becausewedefinevalleysasclustersofgroundstates in the logical space, analyzing the valley structure A. Sampling from all valleys of an input is tractable. The largest logical graphs are 16×16 lattices having treewidth 16, so solving a logical problem using dynamic programming and The expected time required for a solver to find all returning some fixed number of ground states typ- ground states of a problem is known, both in the ically takes less than a second. This allows us to equiprobable case and the biased case [47]. In the mine for inputs having interesting valley structure. caseofanIsingspinproblem,groundstatesoftenlie in connected valleys in Hamming space, and given We quantify interesting valley structure using the one ground state in the cluster it is easy to find the distribution of spin overlap P(q) [28, 29] at zero rest. We therefore adopt a more practical metric temperature,i.e.,fortwogroundstatessampleduni- based on the time required to sample all valleys of formly with replacement, what fraction of spins do ground states. they have in common? The random variable P(q) In ground states of FCL problems, all clusters have takes values in the range [−1,1]. For inputs with- out fields the distribution is symmetric about zero; their spins in agreement; therefore the distance be- we can therefore consider the distribution of the ab- tweenanytwogroundstatesinthenativeHamming space is a multiple of 8. However, ground states can solute value P(|q|). We define the mean overlap as be adjacent (i.e., differ by a single spin) in the log- theexpectationofP(|q|). Inputswithmeanoverlap near 1 tend to resemble ferromagnets—if there are ical space. We define a valley as a set of ground multiple valleys they will be close together. Inputs states that are connected in the logical Hamming space. While it is nontrivial to move from one state to another in the same valley because of the single tall, thin energy barrier, it can still be done with 3 ForHamiltonianswithnofields,flippingallspinsofastate a modest amount of postprocessing. We also note doesnotchangetheenergy. Thereforetheantipode(nega- that, sinceFCLproblemsdonothavefields, ground tion)ofanygroundstateisalsoagroundstate. 10 with lower mean overlap tend to have valleys that 2. KL-divergence of valley distributions are well-separated. Inputs that are hard to sample from have multiple TheTTAVresultsshowninFigure5failtoaddressa valleys that are well-separated. We mine for such specificfear—thatinasignificantminorityofinputs inputs as follows. First we reject any input with there are valleys that the D-Wave QPU would be more than 1000 ground states, as these slow down simplyunabletofindduetoquantumsamplingbias. our analysis and may be too easy. Second, we reject To address this, we calculate for each (solver, prob- any input that does not have at least 4 valleys since lem) pair the KL-divergence between empirical val- we want valley collection to be nontrivial. Finally, ley distributions and exact valley distributions (i.e., we reject any input with a mean overlap of 0.7 or relative valley sizes). KL-divergence is an asymmet- higher since we want valleys to be well-separated. ric measure of the distance between two probability distributions; we calculate it such that it is infinite if the solver fails to see all valleys, i.e., (cid:88) P(v) C. Sampling results KLD= P(v)log , P(cid:98)(v) valleysv We generated problems at the 16×16 lattice size where P(v) is the true Boltzmann probability of with α = 0.85 and ρ = R = 6 and mined them valley v and P(cid:98)(v) is the sample estimate of P(v), for interesting valley structure as described above. conditioned on samples being ground states. This We generated 50,000 inputs and rejected all but 74. KL-divergence measure includes two types of error. This gave us an acceptance rate of roughly 0.15% First,thereisadistributionalerror,sinceeachsolver of inputs. We sought to answer the question, after samples from a distribution that differs from the x seconds of annealing, what fraction of the valleys Boltzmann distribution. Second, there is a sample has each solver seen? For each problem we drew a size error, since our sample estimate has finite size numberofsamplesaccordingtothesolverasfollows: and therefore differs from the solver’s true distribu- tion. Inthiscontextitisappropriatetoincludeboth types of error. D-Wave QPU: 100,000 samples at 5µs (in batches Figure 6 shows histograms of KL-divergence for the of 100 per spin-reversal transform) different solvers. For these FCL problems, fears of SA: 100,000 samples valleys suppressed by quantum sampling bias are QMC: 5000 samples unfounded. The D-Wave QPU has a superior KL- divergence distribution than any of the software HFS: 5000 samples solversevenwhenannealingforthreeordersofmag- nitude less time. On the single input for which the D-Wave KLD was infinite because at least one val- ley was never seen, it was also infinite for all other solvers. 1. Time to all valleys Results for the TTAV metric are shown in Figure 3. Raw error on model marginals 5. The 2000-qubit D-Wave QPU is the fastest of the solvers, hitting all valleys in a median time of The TTAV metric and valley distributions can be roughly 30ms. The fastest software competition thought of as representing what sample quality was HFS, which hit all valleys in a median time of would look like with postprocessing. We would roughly 30s. also like a more raw metric that does not have In certain situations quantum annealing in the this implicit postprocessing. For this we consider transverse-field Ising model is subject to inherent marginals of the zero-temperature Boltzmann dis- sampling bias [50–53], although that does not prove tribution. Specifically, we consider the spin-spin ex- to be a significant problem here. While the slope of pectations, i.e., for each coupler, what is the ex- theD-WavecurveisslightlylesssteepthantheHFS pectedproductofthetwoincidentspinsinthezero- curve, indicating that its samples might be less di- temperature Boltzmann distribution? The L error 1 verse,theD-WaveQPUstillmanagestooutperform on these marginals (i.e., the empirical estimates of thecompetitionbyaboutthreeordersofmagnitude. spin-spin expectations minus true expectations) is