Table Of Content

Simulations of Shor’s Algorithm using Matrix Product States D. S. Wang,∗ C. D. Hill,† and L. C. L. Hollenberg‡ Centre for Quantum Computation and Communication Technology, School of Physics, The University of Melbourne, Parkville, Victoria 3010, Australia Weshow thatunderthematrix product state formalism thestates produced in Shor’salgorithm can be represented using O(max(4lr2,22l)) space, where l is the number of bits in the number to factorise, andr istheorderandthesolutiontotherelated order-findingproblem. Thereductionin spacecompared to an amplitudeformalism approach is significant, allowing simulations as large as 42 qubits to be run on a single processor with 32GB RAM. This approach is readily adapted to a 5 distributed memory environment, and we have simulated a 45 qubit case using 8 cores with 16GB 1 RAMin approximately one hour. 0 2 PACSnumbers: 03.67.Ac,03.67.Lx n a J I. INTRODUCTION pared to the O(23l) of amplitude formalism simulators allows us to simulate a 42 qubit case on a single proces- 0 3 Running large-scale quantum algorithms on a quan- sor with 32GB RAM in a matter of hours. We discuss tum computer is still a distant goal. The qubit [1] is a optimisation techniques that arise from circuit. We also ] two-level system which can be in superpositions of both reasonthatMPSsimulatorsarewellsuitedtoparallelisa- h p states, and is the fundamental unit of information for tion, anddemonstratenear idealscalingusing MPI.Our - quantum computation. Not only is creating large num- largest simulation instance consisted of 45 qubits, using nt bers ofqubits with precisecontroldifficult, sotoo is pre- 8 processorcoreswith 16GBRAM in approximatelyone a serving the delicate entanglement. Thus, classical sim- hour (real time). u ulation is essential to study the tolerance of quantum This paper is organised as follows. Section II is a re- q algorithms againstqubit decoherence and controlerrors. viewofShor’salgorithm. SectionIII isa reviewofMPS. [ Shor’s algorithm for prime number factorisation [2, 3] Section IV analyses the space performance of Shor’s al- 1 is a well known quantum algorithm, and has been simu- gorithm when stored as a MPS. Section V describes our v lated extensively in the amplitude formalism [4–8]. The implementation in detail, and tabulates single process 4 computationalresourcesgrowexponentiallywith system benchmarks. Section VI discusses a multi-processor im- 4 sizeandthesimulationresultsareeasilyverified,making plementation, and benchmarks indicate good scaling be- 6 it an ideal test candidate. Amplitude formalism simula- haviour. 7 0 torshavesimulatedsystemswithasmanyas42qubitson . massive supercomputers [7, 8], using highly parallelised 1 II. REVIEW OF SHOR’S ALGORITHM software. These calculations represent the current limit 0 using the amplitude formalism. 5 1 Hereweconsideradifferentapproachtoclassicalquan- We shall give a brief review of Shor’s algorithm [2, 3]. : tum algorithm simulation. The matrix product state Given an l-bit number N = p q, where the integers p v × (MPS) formalism [9] is a specialised form of tensor net- and q are co-prime, the task is to determine the factors i X work where the interacting bodies are confined to one p and q. This is canonically achieved by reduction to r dimension. Briefly, each quantum state is represented a related order-finding problem: for a chosen integer x, a asaproductofmatrices. Thematrixdimensionsdepend where 1<x<N, find the order r of x modulo N. The ontheentanglementinthesystem,andexceptionalcases order r is the smallest positive integer satisfying mayleadtoanefficientsimulation. Fortunatelyforquan- xr 1 (mod N). (1) tumcomputing,this isnotthe caseforShor’salgorithm. ≡ Nonetheless, MPS can be useful for circuit simulation. It can be shown that for even valued r and xr/2 = 1 In this paper, we shall show that the space require- 6 − (mod N), one can find the factors of N by computing ments for running Shor’s algorithm using a MPS simu- thegreatestcommondenominatorgcd(xr/2 1,N). This lator is O(max(4lr2,22l)), where l is the number of bits ± can be performed efficiently by Euclid’s algorithm. The in the number N to factorise, and r is the order of the quantum order-finding circuit (Fig 1) that lies at the relatedorder-findingproblem. Thespacereductioncom- heart of Shor’s algorithm allows a quantum computer toefficiently determinetheorderr,usingqubitsandcir- cuitrygrowingonly polynomiallywithl. Itconsistsof3l ∗ [email protected] qubitsdividedintotworegisters: the upper2l qubitsare † [email protected] initialisedin 0 ,andtheremaininglqubitsareinitialised ‡ [email protected] in 1 . In the|fiigure, H =(σ +σ )/√2 is the Hadamard x z | i 2 q3 |0i H • by the size of the system, regardless of the state stored. The MPS formalismis a different way to store the state, q2 |0i H • where the space requiredgrows as the state becomes en- QFT tangled, by using the ability to write a general quantum q1 |0i H • state ψ as | i q0 |0i H • ψ = (A B C ) a b c . (5) a b c R |1i / U1 U2 U4 U8 | i ··· | i| i| i··· abc··· X FIG.1. Schematicoftheorder-findingcircuitwhichefficiently Here a , b , c , etc. are the bases of the subsystems | i | i | i determines the order r. To factorise an l-bit number N, one of the state, and Aa, Bb, Cc, etc. are matrices associ- usesanupperregisterwith2lqubits,alowerregisterRwithl atedwith the respective subsystems. Thatis, the ampli- qubits, and U|ai=|axmodNi for some integer 1<x<N. tude α of the state is stored as the matrix product abc··· QFT is thequantum Fourier transform. A B C . Theindicesa,b,c,etc. spanthedimensions a b c ··· oftheirrespectivesubsystems(e.g. foraqubit,theindex canbezeroorone),allowingforstraightforwardgeneral- operation, where σ and σ are the Pauli matrices, the x z isations to d-level subsystems (i.e. qudits). Indeed, any unitary U performs twoormoreneighbouringquditsinaMPS—thatis,their matrices are adjacent in the product—can be combined U a = ax mod N , (2) | i | i intoasinglehigher-dimensionalquditsimplybytheeval- uating the different matrix products and redefining the and QFT denotes the quantum Fourier transform [10]. subsystems (e.g. D D A B ). Neglecting normalisation constants, one can show that d ab a b ≡ ≡ A single-qudit gate U is applied by mapping U be- the state created after the controlled-Us have been ap- 1 1 tweenthe correspondingelements inthe matricesassoci- plied is ated with that qudit only. Many-qudit gates (assuming 22l−1 all involved qudits form a contiguous sequence) can be ψ = i xi mod N . (3) applied by first contracting those qudits together, thus | i | i forming a single qudit, and then applying the equivalent i=0 X (cid:12)(cid:12) (cid:11) single-qudit gate on the combined system. Swap gates WeshallalwayssuppresstheKroneckerortensorprod- can be used to rearrange qudits amongst one another uctbetweentwokets(i.e. a b a b ). Substituting wheninteractinginitiallydistantqudits. Finally,onecan | i| i≡| i⊗| i in Eqn 1 allows one to rewrite this as separatethecompositesystembackintotheoriginalpar- titions,orasdesired,byapplyingamatrixdecomposition r−1 algorithm on an specifically arrangedmatrix: ψ = jr+i xi mod N . (4) | i  | i Xi=0 Xj=0 (cid:12) (cid:11) D00 D01 D0j A0  (cid:12) D D ··· D A surAeftaenrdadpipsclyairndgthaellloofwtehrerecgoinstterro,llleeda-vUinsg, aonseysctaenmmcoena--  ...10 ...11 ·.·.·. ...1j= ...1 B0 B1 ··· Bj . (6)     taining only 2l qubits. The QFT produces a probabil- Di0 Di1 Dij Ai(cid:2) (cid:3)  ···    ity distribution in the upper register with high probabil-     ity densities in states around integer multiples of 22l/r. Theleft-handsideshowsthearrangementofthematrices The results of a few measurements from this probabil- D , where 0 a i and 0 b j, of the composite ab ≤ ≤ ≤ ≤ ity distribution can be classically processed into r via qudit within a single m n matrix, for decomposition × the continued fractions algorithm. Since it is possible to into a left (i+1)-dimensional qudit and a right (j+1)- construct circuits to efficiently perform all powers of U dimensional qudit. A matrix decomposition algorithm [11, 12] and the QFT [10], it is therefore possible for a returns the two right-hand side matrices (or something quantum computer to efficiently determine the order r similar), which have dimensions m k and k n, where × × and hence factorise large numbers. k is ideally the rank of the left-hand side matrix (the number of linearly independent rows or columns). After the decomposition, one can readily extract the matrices III. REVIEW OF MATRIX PRODUCT STATES A and B . Rank-revealing decompositions such as the a b singular value decomposition (SVD) and the pivoted QR Here we briefly review the MPS formalism [9]. The decomposition (RRQRF),althoughnotstrictlynecessary, reader may be familiar with amplitude formalism state arebeneficialastheyminimisekand,hence,therequired vector simulators, where one stores the state as a col- storage space. lection of complex coefficient-index pairs. Typically this Note that, due to numerical imprecision, the rank is is implemented as a complex number along a contigu- normally understood to be the number of eigenvalues ous array. This approach requires space governed only above a given threshold returned by the SVD, and one 3 can introduce a form of approximation by raising the Increasing Unitary Order cut-off value. MPSwas developedas anefficient method U1 1 ··· 1 1 1 1 1 2 1 for simulating systems where the number of non-trivial U2 1 ··· 1 1 1 1 2 4 1 eigenvalues is limited. However, for the controlled-U U4 1 ··· 1 1 1 2 4 8 1 gates in the order-finding circuit, the eigenvalues are either clearly zero or non-zero, thus increasing the cut-off U8 1 ··· 1 1 2 4 8 12 1 value does not provide any benefit (without invalidating U16 1 ··· 1 2 4 8 12 12 1 the results). U8192 1 2 4 8 12 12 ··· 12 1 TABLE I. The ranks across the MPS after completing the order-finding circuit up to and including the indicated IV. ORDER-FINDING CIRCUIT SPACE controlled-unitary for N =65, x=2, l =7. Increasing time ANALYSIS flows down the table. Adjacent pairs of numbers define the dimensions of the intermediate matrix. The lower register is Amplitude formalism state vector simulators are stored at the rightmost end, so that the second last number severely limited by the number of qubits in the circuit. ineachrowisalsoequaltothenumberofnon-trivialstatesin Inthecaseoftheorder-findingcircuit,themostresource the lower register. The underlined value is the order r =12. intensivestateundertheamplitudeformalismisthestate The finalorder of qubitsis q0,q1,··· ,q13. as one applies the final controlled-U operation (Eqn 4), requiring O(23l) space for a standard implementation, Decreasing Unitary Order and O(22l) space for one using sparse storage. U8192 1 ··· 1 1 1 1 1 2 1 We now show that the state in Eqn 4 can be stored U4096 1 ··· 1 1 1 1 2 3 1 much more economically by using the MPS formalism. U2048 1 ··· 1 1 1 2 3 3 1 We wish to first divide this state into two qudits: a left- U4 1 2 3 3 ··· 3 3 3 1 handsideketwhichisthecontractionoftheupper2lcon- U2 1 2 3 3 ··· 3 3 6 1 trol qubits, and a right-hand side ket for lower l qubits. U1 1 2 3 3 ··· 3 6 12 1 A general state with this division can be written as 22l−12l−1 TABLEII.Applyingcontrolled-unitariesindecreasingpowers oftwoleadtoequalorlowerranksacrosstheMPS(cf. Table ψ = A B a b . (7) | i a b| i| i I),thusimprovesbothspaceandtimeperformance. Thefinal Xa=0 Xb=0 order of qubitsis q13,q12,···,q0. Here A and B are complex row and column vectors a b respectively, so that the product A B is a scalar. We a b The remainderof the circuitcanbe completedby first note that in Eqn 4, the summation index i spans r val- measuring and discarding the lower register, leaving be- ues only, correspondingto the r unique values generated hind a 2l qubit system. All of the control qubits can byxi mod N. Fromthiscomparison,oneconcludesthat then be contracted to a single qudit. At this point, the thereareonlyrnon-trivialB matricesinEqn7. Forlin- b MPSrepresentationbecomesequivalenttotheamplitude ear independence, the B matrices must contain at least b formalism representation, thus it requires O(22l) space. r rows. The A matrices must possess as many columns a Therefore, the entire order-finding circuit can be com- as B does rows to satisfy the matrix product. It follows b pleted using O(max(4lr2,22l)) space. Further space re- that for this particular decomposition, the state can be duction is possible by using sparse matrices—after read- expressed with O(22l[1 r]+r[r 1]) space. × × ing Section V, it should become obvious that the space Onecanfurtherdecomposetheleft-handsideketdown required before the QFT is O(lr). to individual control qubits. Each of the 2l control Note that the of value r will arise naturally as the qubits’ two bases can be represented by matrices no simulationis run. It is interesting to observethat in r < greaterthan r r. This is easily seen fromthe structure × log N cases,onecanefficientlysimulateShor’salgorithm of the order-finding circuit; the control qubits only in- 2 by using the semi-classical QFT [13, 14]. However, it is teractwiththe lowerregister,andnot withone another. exceedingly rare to choose an x with r < log N [15]. Usingthisdecomposition,Eqn4canberepresentedusing 2 Furthermore, for the case x = N 1, it is easy to check O(2l 2[r r]) space. − · × that r = 2, but the algorithm does not return any non- To illustrate the above analysis, see Table I. Here trivial factors. we have graphed the ranks across the MPS after each controlled-U gate is applied for the case N = 65, x = 2, and r = 12. We shall describe our implementation in V. IMPLEMENTATION AND BENCHMARKS detail in Section V. For now, it suffices to know that adjacent pairs of numbers define the dimensions of the intermediate matrix. One can see that no matrix exceeds We now describe the details of our simulations and dimensions r r. present our benchmarks. Preliminary processing steps × 4 required before the order-finding circuit available else- there arek columns(we canthink ofthis aspivoting the where (e.g. see [1]) and will not be repeated here. We trivial columns to the end of the matrix). For the sake shall divide the circuit into three distinct steps: (1) the of speed yet still maintaining the space performance of application of the controlled-U gates; (2) the measure- MPS, whenever one suspects an m n matrix A having × ment of the lower register; and, (3) the final quantum rank k min(m,n), as is the case here, one can use the ≈ Fourier transform. trivially simple decomposition: The circuit consists of 2l control qubits, labelled q , i 0U2≤i. iTh<e r2elm, asuincihngthlaqtutbhitesqfourbmittqhiecloonwterrolrsegtihsetergaRte. Am×n =(AImm××mnAInm××nn,, mm<≥nn. (8) They are always contracted into a single 2l-dimensional qudit (technically, the circuit also works with a single The above discusses how a single controlled-U gate is N-dimensional qudit). Following our space analysis, we implemented. Thecircuitconsistsof2lsuchgates,which chooseonlytostorethenon-trivialmatricesofR(atmost differ only in their exponents and, therefore, commute. r), and furthermore, we choose to always keep it at the We find that it is never worse and often better to apply rightmost end of the product, so that these matrices are the Us in decreasing powers of two. This is because the always column vectors, each containing at most r rows. sequence generated by U2i = x2i mod N, i 0, will ≥ The state is initialised in 0 1 . The H gates may eventually lead to a cyclic subsequence. Applying the | i| i be applied immediately, or combined into the following Us in increasing powers saturates the rank in the fewest controlled-U gates. We have opted to implement the number of steps, which occurs just before the cycle re- latter. In both cases, all of the matrices representing peats. In contrast,applying the Us in decreasing powers the state are, at present, simply scalars—either 0, 1, leads to performing the same operations on R at the be- or 1/√2—and thus the internalorderingamongstqubits ginning. Repeated operations will tend to populate the andquditsisnegligible. Note,however,thatforageneral samestatesinRaspreviousapplications,thusleadstoa entangled state, the ordering is of critical importance. slower increase in ranks. Other orderings are also possible, but we have not consideredthem due to foresightof the forthcoming QFT. The rank between the rightmost A. The Controlled-U Gates control qubit and R reaches r once all unique Us have been applied. Inordertoapplythecontrolled-U2i gate,wefirstshift Table I shows the ranks after each controlled-U operation is applied, for the example N = 64, x = 2, and thecontrolqubitq directlytotheleftoflowerregisterR. i r = 12. Time flows down the table, reflecting applying Sinceq isinaproductstate,ormorecorrectly,itsmatri- i the Us in increasing powers of two. Our implementation cesarecurrentlyscalars,onecancommuteitthroughthe placesthemostrecentlyinteractedcontrolqubitfurthest matrix product with ease, and resize it (i.e. multiplying totheright,whichleadstoorderedqubitsuponcomplet- it by an appropriately sized square identity matrix) to ingthecontrolled-U gatesequence: q ,q , ,q . Using retain a well-formed MPS. 0 1 13 ··· this ordering, the state can be stored using 3300 com- Nowone cancontractq with R,andsubsequentlyap- i plex numbers, or approximately 50KB. Table II shows ply the controlled operation. Let this combined system the same simulation parameters, now with Us applied havematricesdenotedbyD ,wherea 0,1 andbare ab ∈{ } in decreasing powers, and hence reverse ordered qubits. thenon-trivialstatesofR. Then,inordertoseparatethe Under this ordering, the same state can be stored using system into two partitions, with q on the left and R on i only 520 complex numbers, or approximately 8KB. theright,onearrangesthematricesD intoasinglema- ab trixasshownonthe left-handsideofEqn6. Inpractice, instead of performing each of these steps independently, B. Measurement of Lower Register one can perform the contraction, apply the controlled-U gate,andarrangethe D matricesinasingleswiftblow ab Measurement of the lower register requires the proba- forefficiency. Additionally,the H operationonq ,which i simply takes the state from 0 to (0 + 1 )/√2, is also bility of each state i in R, which can be calculated from | i | i | i first principles using Eqn 5. One can show that this is readily combined into the above steps. A matrix decomposition algorithm takes the matrix onthe left-hand side ofEqn6 andproduces the two ma- pr(i)= α 2 =R† B† A†A B R . trices on the right-hand side. These contain the new | ab···i| i (···" b a a! b#···) i b a matrices associated with the states of qi and R. Note X X (9) thatthe decompositionisnotunique. As mentionedear- lier, the intermediate dimension is minimised by using Itispossibletoenforce A†A =1forallqubitsand a a a a rank-revealing decomposition, and is at best the rank qudits in the MPS by using the SVD exclusively. This P k of the left-hand side matrix. However, in this case, allowsone to calculatethe probabilityof a givenstate of we know k is exactly the number of non-trivial states of any qubit or qudit by using its own matrices only. How- R. Furthermore, we have arrangedthe matrix such that ever,thisisnottrueinoursimulations;wehavesacrificed 5 measurement speed for simpler decompositions, namely q3 H 1 2 3 × Eqn 8. Having computed the probabilities, one can apply a q2 • H 1 2 × projection and rescale the remaining amplitudes. Since the projectionwill reduce the entanglementin the entire q1 • • H 1 × system, one should take this opportunity to reduce the spaceconsumption. WeapplytheRRQRFpairwisefrom q0 • • • H × righttoleftforthispurpose. WehavechosentheRRQRF only for its rank-revealing property and its speed over FIG. 2. Standard circuit implementing the quantum Fourier the SVD, and not for any structure in the results. We transform on four qubits. Controlled-phase gates performing have also avoided the faster LU decomposition as it is |11i → exp(iπ/2n)|11i are denoted by the number n in a considered less numerically stable. square. Swapgatesaredenotedbytwocrossesjoinedtogether byaverticalline. Long-rangegatesmakethiscircuitlessthan ideal on a MPS simulator. C. The Quantum Fourier Transform q3 H 1 2 3 × ThetypicalcircuitfortheQFTisshowninFig2. The useoflong-rangecontrolled-phasegatesmakethiscircuit q2 • H 1 2 × less than ideal when using a MPS simulator. Note that the final swap gates can be implemented classically and q1 • • H 1 × thus do not pose a problem. To overcome this, one can be rearrange the QFT into q0 • • • H × 2l subcircuits, each of which includes exactly one more qubit than its predecessor, as shown in Fig 3. One con- tracts a single qubit into a growing qudit at the be- FIG. 3. QFT rearranged to be more compatible with MPS. ginning of each subcircuit, and applies all gates in the Dashed boxes show the repetitive pattern of the subcircuits. subcircuit directly onto this qudit. This arrangement is No matrix decompositions are required, and all qubits in- preferable because one eliminates the need to perform volvedarecontractedintoasinglequditatthecompletionof any matrix decompositions. Upon completion, a single thecircuit. 22l-dimensional qudit remains (i.e. a 2l qubit amplitude formalism state vector). q3 H 1 ×q2 H 1 ×q1 H 1 ×q0 H Another approach is to add swap gates after each controlled-phasegate,akinto implementing the QFT on q2 • ×q3 2 ×q1 • ×q2 2 ×q0 • ×q1 a 1d quantum computer with only nearest neighbour in- teractions, as shown in Fig 4. This approach is initially q1 • ×q3 3 ×q0 • ×q2 slower,however,itmayleadtobetterspaceperformance when coupled to the SVD due to truncation of insignifi- q0 • ×q3 cant eigenvalues. FIG. 4. QFT for a 1d nearest-neighbour quantum computer is also suitable for MPS. Matrix decompositions are only required after every swap operation. D. Benchmarks Table III shows the CPU times in seconds to simu- (IBMBlueGene/L,4096processes)onveryhighlyparal- late Shor’s algorithm for various N and r for each of lelisedamplitudeformalismsimulators[7],corresponding the described stages in the algorithm: t to apply all U to days in CPU time. The same software has been used controlled-U gates, t to measure the lower register meas tosimulatel =14cases,orequivalently42qubits(64TB andminimisetheranks,andt toapplytheQFT(us- QFT RAM) [8]. ing Fig 3). As the memory and time depends heavily on r, x has been chosen ahead of time to maximise r for the given N, and we have also chosen cases where VI. MULTI-PROCESSOR IMPLEMENTATION r N/2. All simulations use a single core and double ≈ precisioncomplexnumbers;thel =12casewouldrequire 1TB RAM on an amplitude formalism state vector sim- Simulations of Shor’s algorithm benefit immensely ulator. The l 12casestakeminutes to runona laptop fromstoringthe stateasaMPS.However,singleprocess ≤ (Intel Core 2 Duo P8400 @ 2.26GHz, 2GB RAM), and simulations will ultimately be memory limited, which is the l 13 cases take hours on a cluster (AMD Opteron typically resolved by distributed computing. We show ≥ @ 2.5GHz, 32GB RAM). For comparison, l = 12 cases that MPS is well suited to this regime. We use the take 970s(IBM Regatta p690+,512 processes)and 170s Message Passing Interface (MPI) [16] to manage inter- 6 l 11 12 13 14 l N nproc tU tmeas tQFT ttotal N 2033 4063 8189 16351 13 8189 1 3.4 6763 1539 8305 (19×107) (17×239) (19×431) (83×197) 2 2.1 3378 511 3891 x 2 3 10 2 4 1.3 1636 253 1890 r 954 1904 3870 8036 8 0.7 886 128 1015 tU 0.2 0.1 2.6 3.8 16 0.5 452 76 529 tmeas 49 76 4843 12,799 14 16,351 1 6.3 13,173 3909 17,088 tQFT 11 6.7 962 1949 2 3.6 5588 1430 7022 ttotal 60 83 5808 14,752 4 1.8 2769 788 3558 8 1.2 1471 356 1829 16 0.7 741 172 914 TABLE III. CPU time in seconds the distinct steps of the order-finding circuit (Fig 1): tU to complete all controlled- 15 32,663 4 5.2 7510 1934 9449 Us;tmeastomeasureoutthelowerregister;and,tQFTperform 8 3.1 3239 788 4031 theQFT.ttotal isthetotaltime. xischosentomaximise the order r for the given N. r determines the maximum size of thematrices before theQFT. TABLEIV.Wall-clocktimeinsecondsforthedistinctstepsof the order-finding circuit for three cases: l =13, N =8189= 19×431, x = 10, r = 3870; l = 14, N = 16351 = 83×197, x=2, r =8036; and, l =15, N =32663 =89×367, x=6, processcommunication,thePBLASlibrary[17]forbasic distributed matrix operations, and ScaLAPACK [18] for r = 16104. nproc is the number of MPI processes or CPU cores. linear algebra. We assume that our processes are arranged into a 2d grid. For simplicity, we shall also assume that the number of processes n is a power of two. The matrices proc associatedwiththe quditsaredividedanddistributedto these processes. For our simulations, we have chosen to asisthecasefortheamplitudeformalism,thecontrolled- dividematricesinto16 16blocks,whicharedistributed phase gates can also be applied without additional data × cyclically. It is straightforward to port the program to transfer. thisenvironment;many-quditoperationsunderMPSare simply some combination of matrix multiplications, re- Table IV shows the wall-clock times in seconds for arranging or adding sub-matrices, and matrix decompo- our implementation. These were run on the same clus- sition. ter mentioned in the single process simulations (AMD The simple data distribution described above works Opteron @ 2.5GHz). The largest case simulated was well for the majority of the simulation as the matrices l = 15, using approximately 16GB RAM spread over 8 are large and relatively square (Table I). However, one processes and taking approximately one hour. Because must take care during the QFT (Fig 3). This is because thesesimulationswererunonasharedcluster,theexecu- the circuit combines many qubits to form a very high tion time can deviate by 25%or more, depending on the dimensional qudit, with the most extreme case appear- loadonthenodesandtheassignmentofnodes. Thusthe inguponcompletion,asingle22l-dimensionalqudit. One listedtimes arethe lowestfromeachcase,obtainedfrom can easily produce unevenly distributed data, for exam- asmallnumberofruns. Theresultsindicategoodscaling ple, by creating either 22l 1 1 matrices (all data stored in a single process), or a si×ngle 22l-dimensional row or behaviour, doubling the number of processes leads to a 40–60%decrease in the total execution time. column vector (data distributed along a single row or column only). To ensure a balanced data distribution throughout the circuit, we set the matrix’s origin (i.e. We also mention here that there is another difference the process containing the first element) to be the pro- between these simulations and those of Table III. Re- cess whose rank is equal to the first log2(nproc) bits of call that after the measurement projection, we apply a the state. This isanalogoustothe distributionmechanic rank-revealing decomposition across the MPS to reduce oflargeamplitude formalismsimulators,where the most the matrix sizes. Unfortunately, the QR decomposition significantlog2(nproc)bitsofthestatedeterminethepro- fromScaLAPACK1.8.0(pzgeqpf)appearstonot always cess in which it is stored. choosepivotssuchthatthediagonal(leading)elementsof We have further divided the matrices into two aligned R are in order of descending magnitude, as described in matrices,forthecaseswherethemostsignificantqubitis [18]. Thisappearstobe abug,andthe behaviourmeans 0 and 1 . Since the most significant qubit is the qubit we are unable to determine the numerical rank through | i | i to which the H gate is applied in each of the subcircuits this routine. Thus in these benchmarks,we haveinstead in Fig 3, this allows H to be applied locally. Note that, opted to use the SVD (pzgesvd). 7 VII. CONCLUSION 1/nproc scaling. Onecanextendthese simulations,for example,byim- Wehaveshownthat,inordertofactoriseanl-bitnum- plementing the circuits for the controlled-unitaries, and ber N with order r, a MPS approach reduces the space by incorporating decoherence. Other quantum circuits complexity from O(23l) to O(max(4lr2,22l)) in the case with the same structure as the order-finding circuit are ofdensestorage. Thisisasignificantimprovementasris likely to benefit from an MPS approach, and many of typically much smaller than N, allowing one to simulate the optimisations described here would be applicable to circuitswithasmanyas36qubitsonastandardlaptops. these and other MPS simulations. Our largestsimulation on a single processorof 42 qubits This research was conducted by the Australian Re- required less than 32GB RAM. search Council Centre of Excellence for Quantum Com- We have also ported our simulations to use MPI. We putation and Communication Technology (project num- have simulated circuits with as many as 45 qubits over ber CE110001027). This research was supported in part 8 CPU cores. Since MPS is built around matrix prod- by the U.S. Army ResearchOffice (W911NF-08-1-0527), ucts, the distributed matrix approach to parallelisation the US National Security Agency and the Albert Shim- understandably works well. Our benchmarks show near mins Memorial Fund. [1] M. A. Nielson, I. L. Chuang, Quantum computation (2010). and quantum information, Cambridge University Press, [9] G. Vidal, Efficient classical simulation of slightly entan- Cambridge, 2000. gled quantum computations, Phys. Rev. Lett. 91 (2003) [2] P. W. Shor, Algorithms for quantum computation: dis- 147902–147905. crete logarithms and factoring, in: Proc. 35th Annu. [10] D.Coppersmith,AnapproximateFouriertransformuse- Symp.Found.of Comput. Sci., 1994, pp.124–134. ful in quantum factoring, Tech. Rep. RC 19642, T. J. [3] P.W.Shor,Polynomial-timealgorithmsforprimefactor- Watson Research Center, IBM Corporation (1994). ization anddiscretelogarithms on aquantumcomputer, [11] C.Zalka,FastversionsofShor’squantumfactoringalgo- SIAMJ. Comput. 26 (1997) 1484–1509. rithm, quant-ph/9806084. [4] K. M. Obenland, A. M. Despain, A parallel quan- [12] S. Beauregard, Circuit for Shor’s algorithm using 2n+3 tumcomputersimulator,presentedatHighPerformance qubits, QuantumInf. Comput. 3 (2003) 175–185. Computing 1998. [13] D. E. Browne, Efficient classical simulation of the quan- [5] J. Niwa, K. Matsumoto, H. Imai, General-purpose par- tum Fourier transform, New J. Phys. 9 (2007) 146. allel simulator for quantumcomputing, Phys.Rev.A 66 [14] N.Yoran,A.J.Short,Classical simulability andthesig- (2002) 062317. nificanceofmodularexponentiationin Shor’salgorithm, [6] F.Tabakin,B.Juliá-D´ıaz,Qcmpi: aparallelenvironment Phys. Rev.A 76 (2007) 060302(R). for quantum computing, Comput. Phys. Commun. 180 [15] R. Jozsa, N. Linden, On the role of entanglement in (2009) 948–964. quantum-computationalspeed-up,Proc.Roy.Soc.Lond. [7] K. D. Raedt, K. Michielsen, H. D. Raedt, B. Trieu, A 459 (2003) 2011–2032. G.Arnold,M.Richter,T.Lippert,H.Watanabe,N.Ito, [16] Message passing interface forum, Massive parallel quantum computer simulator, Comput. http://www.mpi-forum.org/. Phys.Commun. 176 (2007) 121–136. [17] ParallelBLAS,http://www.netlib.org/scalapack/pblas qref.html. [8] Helmholtz Association of German Research Centres, [18] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, World record: Ju¨lich supercomputer simulates quantum J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, computer, Press release, retrieved from http://www2.fz- G.Henry,A.Petitet,K.Stanley,D.Walker,R.C.Wha- juelich.de/portal/index.php?cmd=show&index=163&mid=761 ley,ScaLAPACKUsers’Guide,SocietyforIndustrialand Applied Mathematics, Philadelphia, PA,1997.