Software Pipelining for Quantum Loop Programs

Guo Jingzhe
Department of Computer Science and Technology, Tsinghua University, China

Mingsheng Ying
CQSI, FEIT, University of Technology Sydney, Australia
Institute of Software, CAS, China
Tsinghua University
Abstract
We propose a method for performing software pipelining on quantum for-loop programs, exploiting parallelism in and across iterations. We redefine concepts that are useful in program optimization, including array aliasing, instruction dependency and resource conflict, this time in optimization of quantum programs. Using the redefined concepts, we present a software pipelining algorithm exploiting instruction-level parallelism in quantum loop programs. The optimization method is then evaluated on some test cases, including popular applications like QAOA, and compared with several baseline results. The evaluation results show that our approach outperforms loop optimizers exploiting only in-loop optimization chances by reducing total depth of the loop program to close to the optimal program depth obtained by full loop unrolling, while generating much smaller code in size. This is the first step towards optimization of a quantum program with such loop control flow as far as we know.
Keywords: quantum program scheduling, quantum program compilation
1 Introduction
Quantum computer hardware has reached the so-called quantum supremacy, showing that quantum computation can actually outperform classical computation for certain tasks, but it is still in the NISQ (Noisy Intermediate-Scale Quantum) era, where there are not enough quantum bits (qubits, for short) for quantum error correction.
Program optimization is particularly important for executing a quantum program on NISQ hardware in order to reduce the number of required qubits and the length of the gate pipeline, and to mitigate quantum noise. Indeed, there has already been plenty of work on optimization and parallelization of quantum programs. Theoretically, it was proved in [5] that compilation of quantum circuits with discretized time and parallel execution can be NP-complete. Practically, quantum hardware architectures, especially those based on superconducting qubits, provide instruction-level support for exploiting parallelism in quantum programs; for example, Rigetti's Quil [20] allows programmers to explicitly specify multiple instructions that do not involve the same qubits to be executed together, while in Qiskit, ASAP or ALAP scheduling is performed implicitly [23]. Furthermore, several compilers have been implemented that can optimize quantum circuits by exploiting instruction-level parallelism; for example, ScaffCC [11] introduces critical path analysis to find the "depth" of a quantum program efficiently, revealing how much parallelism there is in a quantum circuit; commutativity-aware logic scheduling is proposed in [18] to adopt a more relaxed quantum dependency graph than "qubit dependency" by taking into account commutativity between the $R_Z$ gates and CNOT gates as well as high-level commutative blocks while scheduling circuits. There are also some more sophisticated optimization strategies reported in previous works [10, 13, 19, 22].

Quantum hardware will soon be able to execute quantum programs with more complex program constructs, e.g. for-loops. However, most of the optimization techniques in previous work only deal with sequential quantum circuits. Some methods allow loop programs as their input, but those loops are unrolled immediately and optimization is performed on the unrolled code. Loop unrolling allows optimization across all iterations of a loop, but comes at the price of long compilation time, redundant final code and run-time compulsory cache misses. As quantum hardware in the near future may allow up to hundreds of qubits, it will often be helpful to preserve the loop structure during optimization, since the growth in the number of qubits will also lead to an increase in total gate count, as well as in the difficulty of unrolling the entire program.
Software pipelining [12] is a common technique in optimizing classical loop programs. Inspired by the execution of an unrolled loop on an out-of-order machine, software pipelining reorganizes the loop by a software compiler instead of by hardware. There are two major approaches for software pipelining:
• Unrolling-based software pipelining usually unrolls the loop for several iterations and finds a repeating pattern in the unrolled part; see for example [2].
• Modulo scheduling guesses an initiation interval first and tries to schedule instructions one by one under dependency constraints and resource constraints; see for example [12].
Our Contributions: We hereby present a software pipelining algorithm for parallelizing a certain kind of quantum loop programs. Our parallelization technique is based on a novel and more relaxed set of dependency rules on a CZ-architecture (Theorems 1 and 2). The algorithm is essentially a combination of unrolling-based software pipelining and modulo scheduling [12], with several modifications to make it work on quantum loop programs.

arXiv:2012.12700v1 [quant-ph] 23 Dec 2020
A Preprint, December 23, 2020. Guo, et al.
We carried out experiments on several examples and compared the results with the baseline result obtained by loop unrolling. Our approach proves to be a steady step toward bridging the gap between optimization results that do not consider across-loop optimization and full unrolling results, while restraining the increase in code size.

Organization of the Paper: In Section 2, we review some basic definitions used in this paper. The theoretical tools for defining and exploiting parallelism in quantum loop programs are developed in Section 3. In Section 4, we present our approach of rescheduling instructions across loops, extracting prologue and epilogue so that the depth of the loop kernel can be reduced. The evaluation results of our experiments are given in Section 5. The conclusion is drawn in Section 6. [For conciseness, all proofs are given in the Appendices.]
2 Preliminaries and Examples
This section provides some background [14, 25] on quantum computing and quantum programming.

2.1 Basics of quantum computing
The quantum counterparts of bits are qubits. Mathematically, a state of a single qubit is represented by a 2-dimensional complex column vector $(\alpha, \beta)^T$, where $T$ stands for transpose. It is often written in Dirac's notation as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with $|0\rangle = (1, 0)^T$, $|1\rangle = (0, 1)^T$ corresponding to classical bits 0 and 1, respectively. It is required that $|\psi\rangle$ be a unit vector: $\|\alpha\|^2 + \|\beta\|^2 = 1$. Intuitively, the qubit is in a superposition of 0 and 1, and when measuring it, we will get 0 with probability $\|\alpha\|^2$ and 1 with probability $\|\beta\|^2$. A gate on the qubit is then modelled by a $2 \times 2$ complex matrix $U$. The output of $U$ on an input $|\psi\rangle$ is the quantum state $|\psi'\rangle$. Its mathematical representation as a vector is obtained by the ordinary matrix multiplication $U|\psi\rangle$. To guarantee that $|\psi'\rangle$ is always a unit vector, $U$ must be unitary in the sense that $U^\dagger U = I$, where $U^\dagger$ is the adjoint of $U$ obtained by transposing and then complex conjugating $U$. In general, a state of $n$ qubits is represented by a $2^n$-dimensional unit vector, and a gate on $n$ qubits is described by a $2^n \times 2^n$ unitary matrix. [For the convenience of the reader, we present the basic gates used in this paper in Appendix A.]
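The definitions above can be tried out numerically. The following is a minimal sketch using NumPy; the helper names `is_unitary` and `measurement_probabilities` are illustrative and do not come from the paper.

```python
import numpy as np

# A single-qubit state alpha|0> + beta|1> as a vector (here alpha = beta = 1/sqrt(2)).
psi = np.array([1, 1], dtype=complex) / np.sqrt(2)

# The Hadamard gate as a 2x2 matrix.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def is_unitary(U, tol=1e-9):
    """Check the unitarity condition U^dagger U = I."""
    return np.allclose(U.conj().T @ U, np.eye(U.shape[0]), atol=tol)

def measurement_probabilities(state):
    """Probability of observing each computational basis state."""
    return np.abs(state) ** 2

# Applying a gate is an ordinary matrix-vector product.
psi_out = H @ psi
```

Here `psi_out` is the state obtained by applying the gate `H` to `psi`; because `H` is unitary, the output is again a unit vector.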
2.2 Quantum execution environment
Software pipelining is a highly machine-dependent optimization approach, so we must state the basic assumptions about the underlying machine that our algorithm requires. State-of-the-art universal quantum computers differ in many ways:
• Instruction set: A quantum computer chooses a universal set of quantum gates as its low-level instructions. For example, IBM Q [4] uses controlled-NOT $CNOT$ and three one-qubit gates $U_1, U_2, U_3$, but Rigetti Quil [20] uses controlled-Z $CZ$ and one-qubit rotations $R_X, R_Z$. We use the universal gate set $\{U_3, CZ\}$ because $U_3$ itself is universal for single qubit gates, which allows us to merge single qubit gates at compile time. [See Appendix A for the definition of these gates.]
• Instruction parallelism: Different quantum computers are implemented on different technologies, which constrains their power to execute multiple instructions simultaneously. Usually superconducting quantum computers support parallelism while ion-trap ones do not. We assume qubit-level parallelism: instructions on different qubits can always be executed simultaneously.
• Timing: Different quantum computers may use different timing strategies, using continuous time or discrete time. The execution time of different instructions may also differ and is highly machine-dependent. Usually a two-qubit gate (e.g. $CZ$ and $CNOT$) takes much longer than single qubit gates. We use a discrete time model in which every gate takes exactly 1 tick.
• Qubit connectivity: Different machines may have different qubit topologies. However, we assume that all gates in the input are directly executable, which may require a layout synthesis step before our optimization.
• Classical control: The support for classical control flow varies among different quantum computers; for example, IBM Q does not support any complex control flow, while Rigetti Quil supports branch statements. We assume such classical controls [see Appendix C].

The above assumptions do not fit the existing quantum hardware architectures perfectly (for instance, IBM Q requires $CNOT$ and Quil disallows $U_3$), while the architecture of Google's devices [22] fits these requirements best. With some slight modifications, however, our method can be easily adapted to unsupported architectures [see Appendix L].
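To make the timing and parallelism assumptions concrete, the following minimal sketch computes the depth of a straight-line gate list under the model above (every gate takes 1 tick, gates on disjoint qubits run together). The function name `asap_depth` and the tuple encoding of gates are illustrative assumptions; no commutation rules are applied, so gates stay in program order on every qubit.

```python
from collections import defaultdict

def asap_depth(gates):
    """Depth of a gate list under the unit-time, qubit-level-parallel model.

    `gates` is a list of tuples of qubit indices, e.g. (0,) for a one-qubit
    gate or (0, 1) for a CZ gate.
    """
    free_at = defaultdict(int)   # first tick at which each qubit is free
    depth = 0
    for qubits in gates:
        tick = max(free_at[q] for q in qubits)   # earliest tick all operands are free
        for q in qubits:
            free_at[q] = tick + 1                # every gate takes exactly 1 tick
        depth = max(depth, tick + 1)
    return depth
```

For example, `asap_depth([(0,), (1,), (0, 1), (2,)])` places the two one-qubit gates and the gate on qubit 2 in the first tick and the two-qubit gate in the second.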
2.3 Quantum loop programs
We focus on a special family of quantum loop programs, called one-dimensional for-loop programs, defined as below:

    program   := header statement*
    header    := [(qdef | udef)*]
    qdef      := qubit ident[N];
    udef      := defgate ident[N] = gate;
    gate      := [(C^{2×2})*] | R_Z | R_Z^+ | Unknown
    gateref   := ident[expr]
    qubit     := ident[expr]
    op        := SQ (gateref) qubit | CZ qubit, qubit;
    statement := op
               | for ident in Z to Z {op*}
               | for ident in ident to ident {op*}
    expr      := Z * ident + Z
where:
• The loop involves a group of one-dimensional qubit array variables defined by qdef.
• The loop has only one iteration variable $i$ running from $m$ to $n$ with stride 1. The range $[m, n]$ is either completely known at compile time or completely unknown until execution. This allows our algorithm to be performed on a program with a parametric loop range.
• All array index expressions are in the form $(ki + b)$, where $i$ is the iteration variable, and $k, b \in \mathbb{Z}$ are known constants.
• All operations in the loop body are either a one-qubit gate or a $CZ$ gate on two qubits. We don't consider measurement operations.
• One-qubit gates are defined by udef. They are given as known matrices, or as "an element in an array of unknown matrices", for which a hint on whether the matrix array is diagonal or antidiagonal can be given. This allows our algorithm to be performed on a program with parametric gates or one that performs different gates on different iterations.

At the very start of the entire program, all qubit arrays are initialized as $|0\rangle$. Our optimization may introduce some branch statements if the endpoints $m$ and $n$ are unknown before code execution. As a result, the output language of the compiler is a superset of the input language above, with support for branch statements [see Appendix C for one possible definition of the output language]. To show the versatility of the above loop, let us consider several popular quantum algorithms.
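One way to picture the input language is as a small data structure: a loop body is a list of operations whose operands are linear index expressions in the iteration variable. The sketch below is only an illustration; the class names `Ref` and `Op`, the function `unroll`, and the example body are hypothetical and not part of the paper's compiler.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Ref:
    """Qubit reference array[k*i + b] with compile-time constants k and b."""
    array: str
    k: int
    b: int

    def at(self, i: int) -> Tuple[str, int]:
        """Concrete qubit addressed in iteration i."""
        return (self.array, self.k * i + self.b)

@dataclass(frozen=True)
class Op:
    """A one-qubit gate (one operand) or a CZ gate (two operands)."""
    gate: str
    operands: Tuple[Ref, ...]

def unroll(body: List[Op], m: int, n: int):
    """Fully unroll `for i in m to n { body }` into concrete gate applications."""
    return [(op.gate, tuple(r.at(i) for r in op.operands))
            for i in range(m, n + 1) for op in body]

# Example body: H q[i]; CZ q[i], q[i+1]; H q[i+1];
example_body = [
    Op("H", (Ref("q", 1, 0),)),
    Op("CZ", (Ref("q", 1, 0), Ref("q", 1, 1))),
    Op("H", (Ref("q", 1, 1),)),
]
```

Calling `unroll(example_body, 0, 3)` produces the flat gate list that a full-unrolling baseline would optimize, which is exactly the code growth that keeping the loop structure avoids.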
Example 1. The Grover algorithm [9] is designed for the black-box searching problem: given a function $f : \{0, 1\}^N \to \{0, 1\}$, find a bitstring $x \in \{0, 1\}^N$ such that $f(x) = 1$. While a classical algorithm requires $\Omega(N)$ calls to the oracle, Grover search can find a solution in $O(\sqrt{N})$ calls of the quantum oracle $U_f(|x\rangle \otimes |b\rangle) = |x\rangle \otimes |b \oplus f(x)\rangle$. This is done by repeating a series of quantum gates, called the Grover iteration. Grover search can be written as the loop program:

    for i in 0 to N-1 do
        H[q[i]]
    end for
    H[q_work]
    for i in 1 to O(√N) do
        U_f[q, q_work]; (2|ψ⟩⟨ψ| − I)[q]
    end for
Example 2. A Quantum Approximate Optimization Algorithm (QAOA for short) is designed in [8] to solve the MaxCut problem on a given graph $G = \langle V, E \rangle$. It can be written as a parametric quantum loop program:

    for i=0 to (N-1) do
        H[q[i]]
    end for
    for i=1 to p do
        for (a, b) ∈ E do
            CNOT[q[a], q[b]]; U_B[i][q[b]]; CNOT[q[a], q[b]]
        end for
        for j=0 to (N-1) do
            U_C[j][q[j]]
        end for
    end for

Here, we use parametric gate arrays $U_C$ and $U_B$ of $R_X$ and $R_Z$ rotations, whose angles are determined by the parameters $\beta_i$ and $\gamma_i$, respectively. The two innermost loops can be unrolled to satisfy our input language requirements. Since QAOA repeatedly executes the circuit, each time with different sets of angles $\{\beta_i\}$ and $\{\gamma_i\}$, an optimizer has to support compilation of the circuit above without knowing all parameters in advance. Note that the compiler can know in advance that the $U_B[i]$ are diagonal matrices, and this hint might be used during optimization. [For a further explanation of QAOA see Appendix B.]
3 Theoretical tools
In this section, we develop a handful of theoretical techniques required in our optimization. To start, let us identify some of the most critical challenges in optimizing quantum loop programs:
• Instructions may be merged together at compile time, potentially reducing the total depth. However, merging instructions requires knowing which instructions may be adjacent in the unrolled pattern, thus requiring us to resolve all possible qubit aliasings.
• The data dependency graph of a quantum program is usually much denser than that of a classical program, since in general two matrices do not commute, that is, $AB \neq BA$.
• The resource constraint, which prevents instructions that have no dependency from executing together, is quite different in the quantum case from the classical case.

We will show how much optimization can be done by mitigating these challenges in loop reordering.
3.1 Gate merging
Our assumptions allow several instructions to be merged into a single instruction with the same effect:
• Two adjacent one-qubit gates on the same qubit can be merged, since we are using $U_3$.
• Two adjacent $CZ$ gates on the same qubits can cancel each other.
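Both merging rules are statements about matrix products and can be checked numerically. The snippet below is a small sketch with NumPy; `merge_one_qubit` is an illustrative helper name, and the matrices are the standard Hadamard and controlled-Z gates.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def merge_one_qubit(first, second):
    """Single matrix equivalent to applying `first` and then `second` on one qubit."""
    return second @ first

# Two adjacent Hadamard gates merge into the identity.
hh = merge_one_qubit(H, H)

# Two adjacent CZ gates on the same pair of qubits cancel.
czcz = CZ @ CZ
```

Because the product of two unitary matrices is again unitary, a merged one-qubit gate is still a single instruction in a gate set that can express arbitrary one-qubit unitaries.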
Example 3. Figure 1 is a simple case of a periodic gate merging pattern. The one-qubit gates from different iterations may merge with each other, thus simplifying the dependency graph and introducing new opportunities for optimization.

Figure 1. Single qubit gates can be merged periodically.
(a) Loop program:
    for i=0 to 3 do
        U q[i]; V q[i+1]; W q[i+2];
    end for
(b) Unrolled circuit (gates on each qubit, in time order):
    |q0⟩: U
    |q1⟩: V U
    |q2⟩: W V U
    |q3⟩: W V U
    |q4⟩: W V
    |q5⟩: W
(c) Merged:
    |q0⟩: U
    |q1⟩: UV
    |q2⟩: UVW
    |q3⟩: UVW
    |q4⟩: VW
    |q5⟩: W

Figure 2. The $CZ$ gate prevents the two Hadamard gates from merging, due to potential qubit aliasing. (a) $i \neq 0 \wedge i \neq -1$: the $CZ$ acts on qubits $|i\rangle$ and $|j\rangle$, and the two H gates on $|o\rangle$ are adjacent. (b) $i = 0$: the $CZ$ acts on $|o\rangle$ and $|j\rangle$, between the two H gates. (c) $i = -1$: the $CZ$ acts on $|i\rangle$ and $|o\rangle$, between the two H gates.
Gate merging allows us to decrease the gate count, and thus reduce total execution time. However, the existence of potential aliasing adds to the difficulty of finding "adjacent" pairs of gates. Figuring out pairs of gates that can be safely merged is one of the critical problems when scheduling the program.
Example 4. Even for a simple program, it can be hard to decide whether two adjacent instructions on a qubit can be merged. Consider the simple program:

    for i=a to b do
        H[q[0]]; CZ[q[i], q[i+1]]; H[q[0]];
    end for

We can merge the Hadamard gates if and only if $\forall i,\ i \neq 0 \wedge (i + 1) \neq 0$. Three possible cases of $i$ lead to three different results, as Figure 2 shows.
The above example reveals that resolving qubit aliasings is crucial in gate merging.

Figure 3. Unrolled loop does not reveal periodic feature due to qubit aliasing.
(a) Loop program:
    for i=0 to 3 do
        H q[1]; CZ q[i], q[i+1]; H q[1];
    end for
(b) Unrolled circuit (• marks an endpoint of a CZ gate):
    |q0⟩: •
    |q1⟩: H • H H • H H H H H
    |q2⟩: • •
    |q3⟩: • •
    |q4⟩: •

Figure 4. Periodic feature in the unrolled loop can be captured.
(a) Loop program:
    for i=0 to 3 do
        H q[i]; CZ q[i], q[i+1]; H q[i+1];
    end for
(b) Unrolled circuit (• marks an endpoint of a CZ gate):
    |q0⟩: H •
    |q1⟩: • H H •
    |q2⟩: • H H •
    |q3⟩: • H H •
    |q4⟩: • H
3.2 Qubit aliasing resolution
Allowing arbitrary linear expressions to index qubit arrays introduces the problem of qubit aliasing, both within a single iteration and across iterations. Potential aliasing in quantum programs leads to two kinds of problems: lack of periodic features in the unrolled schedule, and extra complexity in detecting aliasings.

The first problem is that non-periodic features cannot be captured using software pipelining (or other loop scheduling methods). For example, in Figure 3, the situation where $CZ$ blocks two Hadamards from merging only occurs in one or two iterations of the loop program, but it prevents the merging in all iterations, since software pipelining can only generate a periodic pattern and has to generate conservative code. The only kind of aliasing (two different qubit expressions referring to the same qubit) that software pipelining can capture is that between expressions on the same qubit array and with the same slope, as shown in Figure 4.

Figure 5. An example for across-loop qubit aliasing with $k_1 = 3$ and $k_2 = 2$. For $S = \mathbb{Z}$, $\Delta i = 1$, while for $S = [4, 10]$, $\Delta i = 2$.
To see the second problem, we note that detection of memory aliasing [1] is usually solved by an Integer Linear Programming (ILP) problem solver such as Z3 [7]. However, a general ILP problem is NP-complete in theory and may take a long time to solve in practice. Fortunately, we will see that all problems that we are facing can be solved efficiently in $O(1)$ time without an ILP solver.

We consider two references to the same qubit array: $q[k_1 i + b_1]$, $q[k_2 i + b_2]$, $i \in S$, where $S$ is the loop interval when the loop range is known and $\mathbb{Z}$ when unknown.
Definition 1. In-loop qubit aliasing: To check whether two instructions can always be executed together, we have to check if one qubit reference may be an alias of another, that is, $(\exists i \in S)\,(k_1 i + b_1 = k_2 i + b_2)$.

When $k_1 \neq k_2$, this problem can be easily solved by checking whether $(b_2 - b_1)$ is a multiple of $(k_1 - k_2)$ and $\frac{b_2 - b_1}{k_1 - k_2}$ lies in $S$. When $k_1 = k_2$, the two references alias exactly when $b_1 = b_2$.
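The in-loop check can be sketched in a few lines. The function below is an illustration only; the name `in_loop_alias` and the convention that `lo = hi = None` stands for an unknown range (the whole of Z) are assumptions made for this example.

```python
def in_loop_alias(k1, b1, k2, b2, lo=None, hi=None):
    """Return True if q[k1*i + b1] and q[k2*i + b2] can name the same qubit
    for some iteration i in [lo, hi]; lo = hi = None models an unknown range."""
    if lo is not None and hi is not None and lo > hi:
        return False                 # empty loop range: no iteration at all
    if k1 == k2:
        return b1 == b2              # same slope: alias in every iteration or in none
    num, den = b2 - b1, k1 - k2      # k1*i + b1 = k2*i + b2  <=>  i = num / den
    if num % den != 0:
        return False                 # the solution is not an integer
    i = num // den
    return (lo is None or lo <= i) and (hi is None or i <= hi)
```

For the references `q[0]` and `q[i]`, `in_loop_alias(0, 0, 1, 0, 1, 5)` is false while `in_loop_alias(0, 0, 1, 0, 0, 5)` is true, matching the observation that merging can hinge on whether iteration 0 lies in the loop range.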
Definition 2. Across-loop qubit aliasing: To check whether there is an across-loop dependency between two instructions, we have to check if one qubit reference may be an alias of another qubit reference several iterations later. Thus, we need to find the minimal increment $\Delta i \geqslant 1$ such that

$(\exists i \in S)\,\big((i + \Delta i \in S) \wedge (k_1 i + b_1 = k_2 (i + \Delta i) + b_2)\big)$.  (1)

This issue can be reduced to the Diophantine equation

$(k_2 - k_1)\, i + k_2\, \Delta i = b_1 - b_2, \quad i \in S,\ i + \Delta i \in S,\ \Delta i \geqslant 1$,  (2)

which can be solved in $O(1)$ time [see Appendix D]. We solve the equation every time it is needed rather than memorizing its solution. A visualization of across-loop qubit aliasing is presented in Figure 5.
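For intuition, the minimal increment can also be found by exhaustive search when the loop range is known. The sketch below is deliberately naive and is not the constant-time method of Appendix D; the function name `min_across_loop_delta` is an assumption for this example.

```python
def min_across_loop_delta(k1, b1, k2, b2, lo, hi):
    """Smallest delta >= 1 such that q[k1*i + b1] in iteration i and
    q[k2*(i + delta) + b2] in iteration i + delta are the same qubit,
    with both i and i + delta inside [lo, hi]. Returns None if there is none.

    Exhaustive search over a known range, for illustration only.
    """
    for delta in range(1, hi - lo + 1):
        for i in range(lo, hi - delta + 1):
            if k1 * i + b1 == k2 * (i + delta) + b2:
                return delta
    return None
```

With slopes 3 and 2 and equal offsets (the offsets are chosen here for illustration), the search returns 2 on the range [4, 10] and 1 on a wide range such as [-100, 100], which agrees with the two cases quoted in the caption of Figure 5.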
3.3 Instruction data dependency
One of the most important steps in rescheduling a loop is to find the data dependencies, that is, the instructions that cannot be reordered while scheduling. Previous work mostly defined instruction dependency according to matrix commutativity: the order of two instructions can change if their unitary matrices satisfy $AB = BA$. This captures most commutativity between gates, but not all. Here, we relax this requirement by establishing several novel and more relaxed commutativity rules between quantum instructions. Since the $CZ$ gate is the only two-qubit gate we use and any two $CZ$ gates commute with each other, what we need to care about is commutativity between $CZ$ gates and one-qubit gates.
Definition 3. (CZ conjugation) If for one-qubit gates $U_A$, $U_B$, $V_A$ and $V_B$ we have $CZ\,(U_A \otimes U_B)\,CZ = V_A \otimes V_B$, we say $CZ$ conjugates $U_A \otimes U_B$ into $V_A \otimes V_B$.

Conjugation allows us to swap a $CZ$ gate with a pair of one-qubit gates, at the price of changing $U_A$ and $U_B$ to $V_A$ and $V_B$ correspondingly. The following theorem identifies all possible conjugations.

Theorem 1. (CZ conjugation of single qubit gates) $CZ$ conjugates $U_A \otimes U_B$ into some $V_A \otimes V_B$ if and only if $U_A$ and $U_B$ are diagonal or anti-diagonal: $U_i = R_Z(\theta)$ or $U_i = R_Z^+(\theta)$ for $i \in \{A, B\}$.

Note 1. The antidiagonal rule has been named "EjectPhasedPaulis" in [22]. However, we establish the rules as both necessary and sufficient: no more commutation rules can be obtained at gate level.

Since the identity matrix $I$ is diagonal, $U_A$ and $U_B$ can be thought of as going under conjugation separately. Thus, we only need to consider two special cases: $I \otimes R_Z$ and $I \otimes R_Z^+$.

Note that in the conjugation rules $R_Z^+$ will always introduce a $Z$ gate on the other qubit. This inspires us to generalize Theorem 1 for a generalized form of $CZ$ defined in the following:
Definition 4. (Generalized $CZ$ gates) For $x, y \in \{0, 1\}$, we define the following variants of the $CZ$ gate:

$CZ_{11}[a, b] = CZ[a, b]$,  $CZ_{00}[a, b] = -Z[a]\,Z[b]\,CZ[a, b]$,
$CZ_{10}[a, b] = Z[a]\,CZ[a, b]$,  $CZ_{01}[a, b] = Z[b]\,CZ[a, b]$.

Equivalently, $CZ_{xy}$ can be defined as follows: $CZ_{xy}|ab\rangle = (-1)^{\delta_{ax}\delta_{by}}|ab\rangle$, where $\delta_{ij}$ is the Kronecker delta. Now we have the following commutativity rules for generalized $CZ$:

Theorem 2. (Generalized $CZ$ conjugation of single qubit gates) When exchanged with $R_Z^+$, a $CZ$ gate changes into one of its variants by toggling the corresponding bit.
1. $R_Z(\alpha)[b]\;CZ_{xy}[a, b] = CZ_{xy}[a, b]\;R_Z(\alpha)[b]$;
2. $R_Z^+(\alpha)[b]\;CZ_{xy}[a, b] = CZ_{x(1-y)}[a, b]\;R_Z^+(\alpha)[b]$.

Since generalized $CZ$ gates are also diagonal, they commute with each other and can be scheduled just as ordinary $CZ$ gates and converted back to $CZ$ by adding $Z$ gates.
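The two rules of Theorem 2 can be checked numerically for the gate acting on the second qubit. In the sketch below, `diagonal` and `antidiagonal` are generic stand-ins for the diagonal and antidiagonal one-qubit gates; their exact parametrization is an assumption of this example (the paper's own definitions are in its Appendix A), and all function names are illustrative.

```python
import itertools
import numpy as np

def cz_variant(x, y):
    """CZ_xy: diagonal gate that flips the sign of |xy> only (CZ_11 is the usual CZ)."""
    d = np.ones(4, dtype=complex)
    d[2 * x + y] = -1            # basis order |ab>, with a the first qubit
    return np.diag(d)

def on_second_qubit(u):
    """Lift a one-qubit matrix u to act on qubit b of the pair (a, b)."""
    return np.kron(np.eye(2), u)

def diagonal(alpha):
    """A generic diagonal one-qubit unitary (assumed form)."""
    return np.diag([np.exp(-0.5j * alpha), np.exp(0.5j * alpha)])

def antidiagonal(alpha):
    """A generic antidiagonal one-qubit unitary (assumed form): X times a diagonal."""
    return np.array([[0, 1], [1, 0]], dtype=complex) @ diagonal(alpha)

def theorem2_holds(alpha):
    d = on_second_qubit(diagonal(alpha))
    a = on_second_qubit(antidiagonal(alpha))
    for x, y in itertools.product((0, 1), repeat=2):
        # Rule 1: a diagonal gate commutes with every CZ variant.
        if not np.allclose(d @ cz_variant(x, y), cz_variant(x, y) @ d):
            return False
        # Rule 2: an antidiagonal gate on b toggles the bit y of the variant.
        if not np.allclose(a @ cz_variant(x, y), cz_variant(x, 1 - y) @ a):
            return False
    return True
```

The same `cz_variant` helper also reproduces the four identities of Definition 4 when multiplied by the appropriate $Z$ gates.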
3.4 Instruction resource constraint
Qubits have properties that resemble both data and resources: qubits work as quantum data registers and carry quantum data; meanwhile, qubit-level parallelism allows all instructions, if they operate on different qubits, to be executed simultaneously. This results in a surprising property of quantum programs: the resources should be described using linear expressions, instead of by a static "resource reservation table" as in the classical case. Using the rules for detecting qubit aliasings, we simply check if there is an aliasing between the qubit references of two instructions, that is, whether the two instructions share the same qubit at some iteration and cannot be executed simultaneously.

Figure 6. The entire compilation flow of our approach. Stages shown: Kernel Compaction, Loop Unrolling, Loop Rotation, Modulo Scheduling #1, Modulo Scheduling #2. Generated code: Loop Rotation Prologue, MS #1 Prologue, MS #2 Prologue, Kernel, MS #2 Epilogue, MS #1 Epilogue, Loop Rotation Epilogue, Loop Unrolling Epilogue, with a branch by $m \bmod C$.
4 Rescheduling loop body
Now we are ready to present the main algorithm for pipelining quantum loop programs. It is based on modulo scheduling via hierarchical reduction [3], but several modifications to the original algorithm are required to fit scheduling quantum instructions on qubits. The entire flow of our approach is depicted in Figure 6. For simplicity, we suppose the number of iterations is large enough that we don't need to worry about generating a long prologue/epilogue.
4.1 Loop body compaction
At first we compact the loop kernel to merge the gates that can be trivially merged, including: (a) adjacent single qubit gates; (b) diagonal or antidiagonal single qubit gates and their nearby single qubit gates, maybe on the other side of a $CZ$ gate; and (c) adjacent $CZ$ gates. To this end, we define the following compaction procedure, which considers the potential aliasing between qubits:

Definition 5. A greedy procedure for compacting the loop kernel:
• Initialize all qubits with an identity gate.
• Place all instructions one by one. Initialize operation to "Blocked". Check the new instruction (A) against all placed instructions (B). Update operation according to Table 1.
• Perform the last operation according to the table.
  – "Blocked" means the instruction is put at the end of the instruction list.
  – "Merge with B" means the single qubit instruction is merged with the placed single qubit gate B. If the placed gate is an antidiagonal, $Z$ gates should be added for uncancelled $CZ$ gates that occur earlier but are placed after the antidiagonal.
  – "Cancelled" means two $CZ$ gates are cancelled. Note that the added $Z$ gates are not cancelled. Also, a third arriving $CZ$ can "uncancel" a cancelled $CZ$, which we also call "Cancelled".
This compaction can be done in two directions: compacting to the left or to the right. These can be seen as the results of ASAP and ALAP scheduling, respectively. However, a single application of this procedure is not guaranteed to converge: not all outputs of the procedure are fixpoints of the procedure. For example, the circuit in Figure 7 only converges after three applications of left compaction. In general, we have the following:

Theorem 3. Compacting three times results in a fixpoint of the compaction procedure.

Note that we allow using unknown single-qubit gates. If all components are known to be diagonal or antidiagonal, the product of these matrices is also diagonal or antidiagonal [see Appendix F]. Otherwise, we can only treat the product as a general matrix. However, this does not affect our result on three-time compaction. Also, compacting in one direction does not capture all chances of merging. Figure 8 shows that some single-qubit merging chances are missed. In practice we perform a left compaction after a right compaction.
4.1.1 Loop unrolling and rotation. Loop kernel compaction can only discover gate merging and cancellation within one iteration. However, gate merging and cancellation can also occur across iterations. For example, in Figure 4 the last $H$ gate in the previous iteration can be merged and cancelled with the first $H$ gate in the next iteration. This kind of cancellation cannot be discovered by software pipelining either, since it is a reordering technique and cannot cancel instructions out.

An instruction $u$ in one iteration may merge or cancel with an instruction $v$ from $t \geqslant 1$ iterations later. All potential mergings of single qubit gates and cancellable $CZ$ gates can be written out by enumerating all pairs of instructions. Loop rotation [15] is an optimization technique that converts across-loop dependency to in-loop dependency (so that some variables can be privatized and optimized out). Consider a loop ranging from $m$ to $n$: $\{A_i B_i C_i\}_{i=m}^{n}$. Here, $A_i$ can be rotated to the tail of the loop: $A_m \{B_i C_i A_{i+1}\}_{i=m}^{n-1} B_n C_n$, and $C_i$ and $A_{i+1}$ are now in one iteration. If $C_i$ writes into a temporary variable and $A_{i+1}$ reads from it, this variable can be privatized. For merging candidates with $t = 1$, we can use a similar procedure:
Definition 6. An instruction is considered movable if it satisfies one of the following conditions:
• The instruction is a single-qubit gate, and there are no gates on the same qubit or on an aliasing qubit before it; in this case the instruction can be rotated to the right.
• The instruction is a $CZ$ gate, and there are no single-qubit gates on the same qubit or on aliasing qubits; in this case the instruction can be rotated to the right.
• The instruction is a $CZ$ gate, and there are no single-qubit gates on the same qubit or on aliasing qubits, except that the $CZ$ gate has only one linear offset reference with $k = 0$ and there is a single-qubit gate on this qubit. In this case, the instruction will be rotated to the right along with this single qubit gate.

Table 1. Operation table for loop kernel compaction. An empty cell (shown as "-") means using the previous operation. The check is performed from left to right, so an antidiagonal can pass through a $CZ$ with a same qubit and an aliasing qubit.

    A \ B           | SQ with same qubit | SQ with in-loop aliasing | CZ with same qubit           | CZ with aliasing qubit
    Diagonal SQ     | Merge with B       | Blocked                  | -                            | -
    AntiDiagonal SQ | Merge with B       | Blocked                  | -                            | Blocked
    General SQ      | Blocked            | Blocked                  | Blocked                      | Blocked
    CZ              | Blocked            | Blocked                  | If exactly-same then Cancel  | -

Figure 7. Compacting more than once yields a better result (gates on each qubit in time order; • marks an endpoint of a CZ gate).
(a) Original circuit: |a⟩: Z •   |b⟩: X • H Z H
(b) Compacting #1:    |a⟩: Z •   |b⟩: X • X
(c) Compacting #2:    |a⟩: Z • Z   |b⟩: •
(d) Compacting #3:    |a⟩: •   |b⟩: •

Figure 8. Left compaction will miss the chance of compacting the $Z$ gate and the $H$ gate. Circuit: |a⟩: •   |b⟩: Z • H
This definition of movable instructions guarantees that the programs before and after the rotation are equivalent. We use the following procedure to rotate one instruction from left to right:
1. Find the first unmarked movable instruction for which there exists another instruction to merge or cancel with at $t = 1$.
2. Mark the chosen instruction, and rotate the instruction to the right. The instruction is added to the prologue and the others are added to the epilogue.
3. Perform left compaction on the new loop kernel. Note that the left-compaction algorithm is modified, so that merging single-qubit gates or cancelling $CZ$ gates will clear the mark.
4. If there is no rotatable instruction, stop the procedure.

Corollary 4. If the original loop has only candidates with $t = 1$ and no one-qubit gate merges with itself, this procedure eliminates all across-loop merging or cancellation. That is, if we unroll the loop after rotation, the unrolled quantum "circuit" should be a fixpoint of the compaction procedure.
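The classical loop rotation identity used above, $\{A_i B_i C_i\}_{i=m}^{n} = A_m \{B_i C_i A_{i+1}\}_{i=m}^{n-1} B_n C_n$, can be checked on instruction traces. The sketch below only models the order of instructions (labels `A`, `B`, `C` tagged with their iteration); the function names are illustrative.

```python
def original(m, n):
    """Instruction trace of the loop {A_i B_i C_i} for i = m..n."""
    return [(g, i) for i in range(m, n + 1) for g in "ABC"]

def rotated(m, n):
    """Trace of A_m {B_i C_i A_{i+1}} (i = m..n-1) B_n C_n."""
    prologue = [("A", m)]
    kernel = [t for i in range(m, n)
              for t in (("B", i), ("C", i), ("A", i + 1))]
    epilogue = [("B", n), ("C", n)]
    return prologue + kernel + epilogue
```

Both functions produce the same trace, but in the rotated form `C` of iteration `i` and `A` of iteration `i + 1` sit in the same kernel iteration, which is what lets a kernel-level compaction see a candidate with $t = 1$.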
However, loop rotation can only handle potential gate merging across one iteration (i.e. between neighbouring iterations). To handle potential merging across many iterations, we adopt loop unrolling from classical loop optimization. While the major objective of loop unrolling is usually to reduce branch delay, Aiken et al. [2] also used loop unrolling to unroll the first few iterations of a loop and schedule them ASAP, so that repeating patterns can be recognized into an optimal software pipelining schedule. Our approach uses modulo scheduling instead of kernel recognition, but we can still exploit the power of loop unrolling to capture patterns that require many iterations to reveal. The key point is that unrolling decreases $t$. Suppose we use a graph to represent all "candidates for instruction merging", with an edge $A \xrightarrow{t} B$ indicating that instruction $A$ will merge with or cancel out instruction $B$ from $t$ iterations later. If we unroll the loop $C$ times, the weight of the edges in the graph will decrease.

Example 5. Figure 9 gives an example showing the connection between the "merging graph" before unrolling and the one after unrolling: if $\forall t,\ C \geqslant t$, there are no edges with $t > 1$.
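The effect of unrolling on edge weights follows from simple index arithmetic, sketched below under the assumption that copy $j$ of the unrolled body in unrolled iteration $I$ stands for original iteration $CI + j$. The function name `unrolled_edges` is illustrative.

```python
def unrolled_edges(t, C):
    """Edges of the merging graph after unrolling by C, for one original edge A --t--> B.

    Copy j of A (original iteration C*I + j) pairs with B in original
    iteration C*I + j + t, i.e. copy (j + t) % C of B, located
    (j + t) // C unrolled iterations later.
    Returns a list of (copy of A, copy of B, new weight).
    """
    return [(j, (j + t) % C, (j + t) // C) for j in range(C)]
```

For an original weight of 4, `unrolled_edges(4, 2)` still contains weight-2 edges, while `unrolled_edges(4, 4)` and `unrolled_edges(4, 5)` contain only weights 0 or 1, so every remaining candidate is within reach of kernel compaction or loop rotation.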
There is a tradeoff between generated code length (deter-mined byπΆ) and remaining π‘ > 1 edges. For example, if thereis an edge with π‘ = 10000, we are not likely to unroll theloop for 10000 times just to merge the two single qubit gates.Also for eliminating self-cancelling πΆπ gates (i.e. πΆπ gateson a pair of constant qubits), we may want πΆ β©Ύ 2 and πΆ
even. In the following discussion we use πΆ as a configurablevariable in our algorithm determining the maximal allowedunroll time (and the minimal time of iterations of the loop).The new unrolled loop will be in the form
π ππ (π =π; π β©½ π; π+ = πΆ) {ππ (ππ‘ π + ππ‘ )}π ππ (π =πβ²; π β©½ π; π + 1) {ππ (ππ‘ π + ππ‘ )}
(3)
and the first loop should be written intoπ ππ (π = 0; π β©½ πβ²; π+ = 1) {ππ (πΆππ‘ π + ππ‘ +πππ‘ )} (4)
where πβ² =βπβπ+1
πΆ
ββ 1 andπβ² = πΆ (πβ² + 1) +π. This step
of transformation makes sure the loop stride is still 1 after
A Preprint, December 23, 2020 Guo, et al.
Figure 9. Example for the QDGs of loop βπ΄ 4ββ π΅β unrolled2, 3, 4 and 5 times. Unrolling the loop decreases the edgeweight π‘ . When πΆ =πππ₯ {π‘} all edges will be decreased toweight 1.
loop unrolling. Note that the term $(m a_t)$ appears in every offset of the loop body. If $m$ is unknown we cannot proceed with our algorithm. Fortunately, since $m = kC + r$ with $r = m \bmod C$, we have $C a_t i + b_t + m a_t = C a_t (i + k) + b_t + r a_t$, showing that when the range is unknown, the results of the array dependency analysis depend only on the Euclidean remainder $r = m \bmod C$. In this case, we can generate $C$ copies of the code, one for each value of $r$, and perform the following parts of the algorithm on each copy.
Let us briefly summarize our compilation flow so far: we compact the loop kernel, unroll the loop by $C$, and rotate some instructions in the unrolled loop kernel. The unrolling step may copy the loop $C$ times, and the steps after unrolling (including rotation) are performed on each copy.
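The loop bounds used for unrolling can be sanity-checked with a small sketch. The Python below is illustrative only: the function names are hypothetical, and it assumes a single operand of the form `a*i + b` per iteration, so that copy `j` of the unrolled body carries the offset `a*j + b`.

```python
def unroll_bounds(m, n, C):
    """Bounds for unrolling `for i = m to n` by a factor C (assumed C >= 1)."""
    n_prime = (n - m + 1) // C - 1      # last index of the stride-1 unrolled loop
    m_prime = C * (n_prime + 1) + m     # first iteration left to the remainder loop
    return n_prime, m_prime


def unrolled_indices(m, n, C, a, b):
    """Qubit indices visited by the unrolled loop followed by the remainder loop."""
    n_prime, m_prime = unroll_bounds(m, n, C)
    visited = []
    for i in range(0, n_prime + 1):     # stride-1 loop over groups of C iterations
        for j in range(C):              # the C copies of the original body
            visited.append(C * a * i + (a * j + b) + m * a)
    for i in range(m_prime, n + 1):     # leftover iterations, original indexing
        visited.append(a * i + b)
    return visited
```

For example, `unrolled_indices(3, 12, 4, 2, 5)` visits the same indices, in the same order, as the original loop `2*i + 5` for `i = 3..12`.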
4.2 Modulo scheduling
Our next step is modulo scheduling, borrowed from [12]:
1. Find in-loop and loop-carried dependencies.
2. Estimate an initiation interval $II$. For simplicity we use binary search, with the total instruction count as the maximum $II$. Use the Floyd algorithm to check validity.
3. Use the Tarjan algorithm to find strongly connected components (SCCs) and schedule each SCC according to the in-loop dependency subgraph.
4. Merge every SCC in the DDG into one node, obtaining a new DDG.
5. Schedule the new DDG by list scheduling.
There are some major differences between quantum programs and the classical programs considered in [12]:
4.2.1 Quantum dependency graph. The instruction dependency for quantum programs is described by a QDG (Quantum Dependency Graph), a generalization of the DDG (Data Dependency Graph), where vertices represent instructions and edges represent precedence constraints that must be satisfied while reordering. In modulo scheduling, a dependency edge is described by two integers: $min$ and $dif$. An edge pointing from instruction $A$ to instruction $B$ with parameters $(min, dif)$ means that “instruction $B$ from $dif$ iterations later should be scheduled at least $min$ ticks later than instruction $A$ in this iteration”. Recall from Sections 3.2 and 3.3 that our dependency is defined by the following rules:
1. There are no dependencies between $CZ$ gates, or between a $CZ$ and a diagonal single-qubit gate.
2. In-loop dependency: if two offsets are on the same qubit array and reveal in-loop qubit aliasing, there is a dependency edge $(1, 0)$ between the corresponding instructions. To unify with the across-loop case, we set $\Delta i = 0$.
3. Across-loop dependency: if two offsets are on the same qubit array and reveal across-loop qubit aliasing with $\Delta i$, there is a dependency edge $(1, \Delta i)$ between the corresponding instructions.
4. Exception on antidiagonal gates: if the qubit $(a_1 i + b_1)$ of an antidiagonal gate aliases with one operand $(a_2 i + b_2)$ of a $CZ$ gate and $a_1 = a_2$, we remove the edge if there is no aliasing on the other operand.
5. Exception on single-qubit gates: if two single-qubit gates operate on the same qubit array, their offsets $(a_1 i + b_1)$ and $(a_2 i + b_2)$ alias with each other, and $a_1 = a_2$, we set the dependency edge to $(0, \Delta i)$, that is, $min = 0$ rather than $min = 1$.
There may be multiple edges in the graph connecting the same pair of instructions, for example an in-loop dependency and an across-loop dependency between the two instructions. Since we are going to use the Floyd algorithm on the graph to compute the largest distance in modulo scheduling, we only need the edge with the maximal $(min - II \cdot dif)$ after assigning $II$. Fortunately we do not need to keep all the multiple edges, since the following theorem guarantees that we can compare $(min - II \cdot dif)$ before $II$ is assigned.
Theorem 5. Suppose $(min_1, dif_1)$ and $(min_2, dif_2)$ are two edges with $min_1 \leqslant 1$, $min_2 \leqslant 1$ and $dif_1 > dif_2$. Then for all $II \geqslant 1$, we have $min_1 - II \cdot dif_1 \leqslant min_2 - II \cdot dif_2$.
This theorem allows us to sort multiple edges by the lexical ordering of $(dif, -min)$ (i.e., compare $dif$ first, and compare $(-min)$ if $dif_1 = dif_2$); the smallest one is exactly the edge with maximal $(min - II \cdot dif)$.
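A minimal sketch of how the edge selection and the validity check of an initiation interval might fit together is given below. It is not the authors' implementation: the tuple layout `(src, dst, min_delay, dif)` and all function names are assumptions, and "valid" is read here as "no dependency cycle has positive total weight", with each edge weighing `min_delay - ii * dif`.

```python
def pick_edge(parallel_edges):
    """Keep the edge that dominates for every II >= 1 (lexical order on (dif, -min)).

    parallel_edges: (min_delay, dif) pairs between the same two instructions,
    with min_delay assumed to be 0 or 1.
    """
    return min(parallel_edges, key=lambda e: (e[1], -e[0]))


def ii_is_valid(num_instr, edges, ii):
    """Floyd-style longest-path pass; valid iff no cycle has positive weight."""
    neg = float("-inf")
    dist = [[neg] * num_instr for _ in range(num_instr)]
    for u, v, mn, dif in edges:
        dist[u][v] = max(dist[u][v], mn - ii * dif)
    for k in range(num_instr):
        for i in range(num_instr):
            for j in range(num_instr):
                if dist[i][k] > neg and dist[k][j] > neg:
                    dist[i][j] = max(dist[i][j], dist[i][k] + dist[k][j])
    return all(dist[i][i] <= 0 for i in range(num_instr))


def smallest_valid_ii(num_instr, edges):
    """Binary search over II in [1, num_instr]; validity is monotone in II."""
    lo, hi = 1, max(1, num_instr)
    while lo < hi:
        mid = (lo + hi) // 2
        if ii_is_valid(num_instr, edges, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

With two instructions linked by an in-loop edge `(1, 0)` and a loop-carried back edge `(1, 1)`, the cycle weight is `2 - II`, so the search settles on `II = 2`.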
4.2.2 Resource conflict handling. Another important issue when inserting an instruction into the modulo scheduling table or merging two strongly connected components is resource conflict: there is no dependency between two $CZ$ gates, yet they may not be executed together because they may share a qubit. To address this issue, let us first introduce some notation:
for x=m to n do
  CNOT q1[x-50], q0[x+0];
  CNOT q1[x-50], q0[x+0];
end for
(a) Loop program.
(b) Corresponding QDG.
Figure 10. Quantum dependency graph example. Tuples represent $(min, dif)$.
1. $II$ is the current initiation interval being tested.
2. $L$ is the length of the original loop kernel.
3. The $k$-th instruction in the original loop is placed in the modulo scheduling table at tick $t = pII + q$, where $p \geqslant 0$ and $0 \leqslant q < II$.
Example 6. Figure 11 is a simple example of modulo scheduling. In this case, $II = 2$ and $L = 4$. Instructions are placed at time slots 0, 2, 3, 4. Thus, $A$ from one iteration, $B$ from the previous iteration, and $D$ from two iterations earlier are executed simultaneously, while $C$ is executed alone.
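The placement in Example 6 can be reproduced with a few lines of Python. This is only a sketch with hypothetical names: each tick is split into a quotient (how many iterations earlier the instruction comes from) and a remainder (its slot inside the kernel).

```python
def kernel_table(ticks, ii):
    """Group instructions by kernel slot.

    ticks: dict mapping an instruction name to its tick t = p*II + q.
    Returns a dict q -> list of (name, p), where p counts how many
    iterations earlier the instruction originates from.
    """
    table = {q: [] for q in range(ii)}
    for name, t in ticks.items():
        p, q = divmod(t, ii)
        table[q].append((name, p))
    return table
```

Calling `kernel_table({"A": 0, "B": 2, "C": 3, "D": 4}, 2)` puts `A`, `B` and `D` in slot 0 (from 0, 1 and 2 iterations earlier) and `C` alone in slot 1.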
We use a retrying scheme: if a resource conflict is detected, we try the next tick. The basic approach to detecting resource conflicts is to detect in-loop qubit aliasing. This leads to two new problems that do not exist in the classical case:
1. The array offsets of instruction operands may change. As $t$ increases, $p$ also increases, and the instruction comes from one more iteration earlier, which changes the array offsets.
2. The pair of instructions being checked for resource conflict may not both exist in some iterations. Increasing $t$ leads to a longer prologue and epilogue, shrinking the range of the loop kernel, and may eliminate a resource conflict that once existed (when the loop range is known).
(a) Rescheduled single iteration. $II = 2$.
(b) Issuing each iteration reveals the loop kernel.
(c) Modulo scheduling table. The column index represents the original iteration.
Figure 11. Example of modulo scheduling the loop $A_i B_i C_i D_i$. In this case $II = 2$, $L = 4$, and the iteration range is $R = [0, 2]$.
Example 7. Suppose that, when generating the schedule in Figure 11, we have inserted instructions $A$, $B$ and $C$, and are ready to insert $D$ at time slot 4.
1. Since $4 = 2II + 0$, the $D$ in the loop kernel is from two iterations earlier than the iteration that $A$ is in. We have to decrease the offsets of the operands of $D$ by $2a$. The shifted index may no longer conflict with $A$.
2. When checking whether there is a resource conflict between $D$ and $A$, we only need to check the case where both iterations are valid, that is, $i = 2$. This means the scheduling is still valid even if $A_0$ has a resource conflict with $D_{-2}$, since $D_{-2}$ does not even exist.
In the original modulo scheduling and other classical scheduling algorithms, the retry strategy only allows $II$ retries. For example, if there is not enough $ALU$ or $FPU$ for instruction $A_i$ at tick $q$ of the modulo scheduling table, there is also not enough resource for instruction $A_{i-1}$ from the previous iteration. However, this is not true in our case, and we have to modify the strategy.
Example 8. Suppose we perform modulo scheduling on the program in Figure 12. Since the three $CZ$s are exactly the same, we may expect $II = 3$ due to resource conflict. However, if we allow more retries, these $CZ$s can be separated into different iterations and executed concurrently with $CZ$s from other iterations.
We consider the general case where the loop range is unknown. When placing an instruction in the modulo scheduling table, we check its operands against all operands scheduled at this tick. Suppose we now check operand $(a_2 (i - p_2) + b_2)$ against operand $(a_1 (i - p_1) + b_1)$, and we find an aliasing, that is, $\exists i_0 \in \mathbb{Z}$, $a_2 (i_0 - p_2) + b_2 = a_1 (i_0 - p_1) + b_1$. In case $a_1 = a_2$, $\forall i \in \mathbb{Z}$, $a_2 (i - p_2) + b_2 = a_1 (i - p_1) + b_1$. When $a_1 = 0$,
for x=0 to 6 do
  CZ q[x], q[x+1];
  CZ q[x], q[x+1];
  CZ q[x], q[x+1];
end for
(a) Original program.
(b) Unrolled program, for a clearer view. [Circuit diagram on qubits q0 to q7.]
CZ q[0], q[1];
CZ q[1], q[2];
CZ q[0], q[1]; CZ q[2], q[3];
CZ q[1], q[2]; CZ q[3], q[4];
for x=4 to 6 parallel do
  CZ q[x-4], q[x-3];
  CZ q[x-2], q[x-1];
  CZ q[x], q[x+1];
end for
CZ q[3], q[4]; CZ q[5], q[6];
CZ q[4], q[5]; CZ q[6], q[7];
CZ q[5], q[6];
CZ q[6], q[7];
(c) Software-pipelined version.
(d) Software-pipelined version, unrolled. [Circuit diagram on qubits q0 to q7.]
Figure 12. Three $CZ$ gates in a row. Although there seem to be resource conflicts, the minimal $II = 1$.
this is the same as classical resource scheduling; otherwise, $\forall \Delta p \neq 0, \forall i \in \mathbb{Z}$, $a_2 (i - p_2 - \Delta p) + b_2 \neq a_1 (i - p_1) + b_1$. This means that if we delay the instruction by $\Delta p \cdot II$ ticks, the conflict will be resolved. We call it a false conflict. In case $a_1 \neq a_2$, after $\Delta p \cdot II$ ticks the instruction falls in the same time slot. There is still a conflict iff $\exists i_1 \in \mathbb{Z}$, $a_2 (i_1 - p_2 - \Delta p) + b_2 = a_1 (i_1 - p_1) + b_1$; that is, $i_1 = i_0 + \frac{\Delta p\, a_2}{a_2 - a_1}$, which means $(a_2 - a_1) \mid \Delta p\, a_2$. The conflict appears periodically as $\Delta p$ increases. However, in the worst case where $(a_2 - a_1) \mid a_2$, there is always a conflict, and the situation can be treated as classical resource scheduling. We call it, together with the case where $a_1 = a_2 = 0$, a true conflict.
We insert an instruction or an entire schedule into the modulo scheduling table in the following way: if there is no conflict, we insert the instructions; if there is only a false conflict, we try the next tick (as an exception, false conflicts between two single-qubit gates are treated as no conflict); and if there is a true conflict, we start a “death countdown” before trying the next tick: if the next $(II - 1)$ retries do not succeed, we give up, as in the classical retry scheme.
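The case analysis above can be condensed into a short sketch. The function names are hypothetical, and both functions assume that an aliasing between the two operands `a1*(i - p1) + b1` and `a2*(i - p2) + b2` has already been found for some integer `i`.

```python
def conflict_after_delay(a1, a2, dp):
    """Does the aliasing persist once the second instruction is delayed by dp*II ticks?"""
    if a1 == a2:
        # Equal coefficients: the aliasing holds for every i, and a nonzero
        # coefficient shifts the delayed operand away from the other one.
        return a1 == 0 or dp == 0
    # Different coefficients: an integer meeting point exists iff (a2 - a1) | dp*a2.
    return (dp * a2) % (a2 - a1) == 0


def classify(a1, a2):
    """Return "true" for a conflict that no delay removes, "false" otherwise."""
    if a1 == a2:
        return "true" if a1 == 0 else "false"
    return "true" if a2 % (a2 - a1) == 0 else "false"
```

For instance, `classify(1, 1)` is `"false"` (the three identical gates of Figure 12 fall in this case), whereas `classify(1, 2)` is `"true"` because `(2 - 1)` divides `2`.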
4.2.3 Inversion pair correction. The commutativity between antidiagonal $R^+_Z$ gates and $CZ$ gates comes at the price of a $Z$ gate. In the modulo scheduling stage we allowed them to commute freely, ignoring the generated $Z$ gates. Now we have to add them back to ensure equivalence. By the term “inversion”, we mean that our scheduling alters the execution order of instructions compared with the original ordering:
Definition 7. If the original $k$-th instruction is modulo-scheduled at $t = pII + q$ in the new loop (where the $i$-th original iteration is issued), we define the absolute order of the instruction to be $o = (i - p)L + k = iL + (k - pL)$.
Example 9. Suppose $L = 4$ and $B$ in Figure 11 is the second instruction in the original loop ($k = 1$). $B$ is placed in the modulo scheduling table at $p = 1$ and $q = 0$.
1. The first $B$ instruction is issued in the prologue (an incomplete loop kernel) where the second ($i = 1$) iteration is issued. Thus the absolute order of the instruction is $o = 1$.
2. The second $B$ instruction is issued in the loop kernel where the third ($i = 2$) iteration is issued. Thus the absolute order is $o = 5$.
3. The third $B$ instruction is issued in the epilogue (again an incomplete loop kernel) where the fourth ($i = 3$) iteration is issued (or should be issued). The absolute order is $o = 9$.
We see that the absolute order is exactly the time at which the instruction is executed in the original loop.
Our idea is to check all inversion pairs in the modulo schedule. There are two kinds of order inversions:
Figure 13. An example of inverted pairs of instructions across loop iterations.
Definition 8. 1. In-loop inversion: for two instructions in the $i$-iteration of the new schedule (i.e., the iteration where the $i$-th iteration of the original loop is issued), if the first precedes the second while its absolute order succeeds that of the second instruction, i.e., $iL + (k_1 - p_1 L) > iL + (k_2 - p_2 L)$, there is an in-loop inversion.
2. Loop-carried inversion: for two instructions in the $i$-iteration and the $(i + j)$-iteration ($j \geqslant 1$), if $iL + (k_1 - p_1 L) > (i + j)L + (k_2 - p_2 L)$, there is an across-loop inversion.
Since the $iL$ term can be cancelled, inversion pairs in the modulo schedule also exhibit periodicity. Figure 13 shows an example with periodic $j = 1$ inversions and $j = 2$ inversions. Since the term $(i + j)L + (k_2 - p_2 L)$ increases as $j$ increases, there exists $j_0$ such that $\forall j > j_0$ there is no across-loop inversion. We can increase $j$ and look for inversion pairs between iterations $i$ and $(i + j)$ until there is no inversion pair. Having found all inversion pairs, we check each pair to see whether one instruction is a $CZ$ and the other is an antidiagonal gate on one of the operands of that $CZ$. If so, we add a $Z$ gate at the tick where the $CZ$ is placed.
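The absolute order and the search for across-loop inversions can be sketched as follows. The names and the `(k, p)` encoding of a placement are assumptions made for illustration; the search bound uses the fact that an inversion at distance `j` requires `j` to be at most the largest delay `p` in the schedule.

```python
def absolute_order(i, k, p, L):
    """Original execution time of instruction k issued in new iteration i
    after being delayed by p whole initiation intervals."""
    return (i - p) * L + k


def loop_carried_inversions(placements, L):
    """placements: dict name -> (k, p). Returns (first, second, j) triples where
    `first` sits in new iteration i and `second` in new iteration i + j."""
    max_p = max(p for _, p in placements.values())
    found = []
    for n1, (k1, p1) in placements.items():
        for n2, (k2, p2) in placements.items():
            for j in range(1, max_p + 1):
                # The i*L terms cancel, so i = 0 is representative.
                if absolute_order(0, k1, p1, L) > absolute_order(j, k2, p2, L):
                    found.append((n1, n2, j))
    return found
```

With the placements of Figure 11 (`A: (0, 0)`, `B: (1, 1)`, `C: (2, 1)`, `D: (3, 2)` and `L = 4`), the only across-loop inversion is between `A` and the `D` issued one new iteration later.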
4.2.4 Code generation for kernel, prologue and epilogue. We generate the prologue and epilogue by removing non-existing instructions from the loop kernel.
Example 10. Consider Figure 11 (recall that the iteration range is $R = [0, 2]$) and the iteration in which the $i$-th original iteration is issued (or should be issued), enumerating $i$ from $-\infty$ to $\infty$:
1. For $i < 0$, $\{i, i-1, i-2\} \cap R = \emptyset$; no instruction is emitted.
2. For $i = 0$, $\{i, i-1, i-2\} \cap R = \{i\}$; only $A$ is emitted.
3. For $i = 1$, $\{i, i-1, i-2\} \cap R = \{i, i-1\}$; $A$, $B$, $C$ are emitted.
4. For $i = 2$, $\{i, i-1, i-2\} \cap R = \{i, i-1, i-2\}$. This is the complete loop kernel.
5. For $i = 3$, $\{i, i-1, i-2\} \cap R = \{i-1, i-2\}$; $B$, $C$, $D$ are emitted.
6. For $i = 4$, $\{i, i-1, i-2\} \cap R = \{i-2\}$; $D$ is emitted.
7. For $i > 4$, $\{i, i-1, i-2\} \cap R = \emptyset$; no instruction is emitted.
For the prologue and epilogue, we have to remove instructions from iterations that do not exist; for the extra $Z$ gates arising from the inversion of a $CZ$ and an antidiagonal gate, removing either gate makes the $Z$ gate disappear. After removing non-existing instructions, we perform compaction and ASAP scheduling on the two parts.
For the loop kernel, we need to merge the single-qubit gates on the same qubit in the same time slot (from the resource conflict exception) according to their absolute order.
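The removal of non-existing instructions can be illustrated with a short sketch. The function name and the encoding of the schedule (a delay `p` per instruction) are hypothetical; an instruction delayed by `p` intervals belongs to original iteration `i - p`, and it is emitted only if that iteration lies in the range.

```python
def emitted(delays, lo, hi, i):
    """Instructions emitted in the new iteration where original iteration i is issued.

    delays: dict name -> p, the number of initiation intervals by which the
    instruction is delayed; [lo, hi] is the original iteration range.
    """
    return [name for name, p in delays.items() if lo <= i - p <= hi]
```

For the schedule of Figure 11, `emitted({"A": 0, "B": 1, "C": 1, "D": 2}, 0, 2, i)` yields `["A"]` for `i = 0`, the full kernel for `i = 2`, and `["D"]` for `i = 4`.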
4.3 Modulo scheduling again
In the first round of modulo scheduling, the inversion of $CZ$ and antidiagonal gates may introduce $Z$ gates that overlap $CZ$s, resulting in an illegal schedule. To generate an executable schedule, we perform modulo scheduling again, but this time we no longer allow “commutativity” between antidiagonal gates and $CZ$s, and thus the inversion-fix step can be skipped. The loop scheduled by this second round of modulo scheduling is directly executable on the device.
[An analysis of the complexity of the algorithm presented in this section is given in Appendix K.]
5 Evaluation
We have implemented our method and carried out experiments on several quantum programs. Some of them are intrinsically parallel, while others are not. The baselines for our evaluation come from the following sources:
• Kernel-ASAP performs compaction and ASAP scheduling on the loop kernel. We expect our work to outperform this naive approach.
• Unroll unrolls the loop and performs compaction as well as ASAP scheduling on the unrolled circuit. The software-pipelined version should generate a program with similar depth but much smaller code size.
• Cirq uses the optimization passes in [22] to unroll the loop. This gives another perspective on loop unrolling besides our implementation.
The experimental results are shown in Table 2. We analyze some of the important examples below:
5.1 Grover Search
Grover search is a test case with a long dependency chain and little room for optimization. Yet our approach can reduce the overall depth by merging adjacent gates within an iteration and across iterations. We use the $CCNOT$ case from [6] and the Sudoku solver from [4]. Since Grover search is a hard-to-optimize case, we inspected the optimized code and found the following:
Although the examples do not reveal many optimization opportunities, there is a pitfall for ASAP optimizers that may cause a diagonal $T^\dagger$ gate to be scheduled alone at the first tick. This
Test case | Input Loop: ASAP | C | C-ASAP | Output Loop: Pre | K | Post | Known range results: #Iter | K-ASAP | Unroll | Cirq | QSP#Iter | QSP
Cluster | 4 | 2 | 5 | 4 | 1 | 4 | 200 | 800 | 203 | 203 | 96 | 104
Array 1 | 5 | 2 | 10 | 8 | 4 | 5 | 100 | 500 | 500 | 500 | 48 | 205
Array 2 | 3 | 2 | 5 | 4 | 1 | 4 | 100 | 300 | 201 | 201 | 46 | 54
Array 3 | 11 | 2 | 17 | 12 | 12 | 17 | 100 | 1100 | 605 | 606 | 48 | 605
Grover 1 | 13 | 2 | 26 | 26 | 24 | 871 | 99 | 1287 | 1287 | 1288 | 15 | 1257
Grover 2 | 71 | 2 | 141 | 141 | 135 | 40881 | 1000 | 71000 | 70001 | 71001 | 207 | 68967
QAOA-Hard 1 | 21 | 2 | 41 | 41 | 40 | 2021 | 1001 | 21021 | 20021 | 20021 | 449 | 20022
QAOA-Hard 2 | 21 | 2 | 41 | 41 | 40 | 2061 | 1001 | 21021 | 20021 | 20021 | 448 | 20022
QAOA-Hard 3 | 16 | 2 | 27 | 41 | 18 | 1121 | 1001 | 16016 | 11016 | 11016 | 448 | 9226
QAOA-Hard 4 | 33 | 2 | 47 | 60 | 31 | 3882 | 1000 | 33000 | 14019 | 14019 | 360 | 15102
QAOA-Par 1 | 15 | 2 | 26 | 46 | 20 | 943 | 201 | 3015 | 2215 | 2215 | 56 | 2109
QAOA-Par 2 | 15 | 2 | 26 | 45 | 20 | 1009 | 201 | 3015 | 2215 | 2215 | 53 | 2114
QAOA-Par 3 | 18 | 2 | 29 | 43 | 18 | 1080 | 201 | 3618 | 2218 | 2218 | 50 | 2023
QAOA-Par 4 | 15 | 2 | 29 | 29 | 25 | 3668 | 1000 | 15000 | 14001 | 14001 | 368 | 12897
Table 2. Evaluation results. ASAP is the minimal depth of the original loop body. $C$-ASAP is the minimal depth of the original loop body unrolled $C$ times. Pre, K and Post denote the prologue, kernel and epilogue. For each test case a range of size #Iter is assigned, and the span of the output loop is QSP#Iter.
(a) Original program, depth=3. [Circuit diagram on qubits a, b, c.]
(b) New program obtained by accidental inversion of two $CZ$s, depth=2. [Circuit diagram on qubits a, b, c.]
Figure 14. The accidental inversion of $CZ$s reduced the kernel depth by 1.
is prevented in our approach by performing bidirectional compaction. Moreover, the depth reduction mainly comes from the inversion of a pair of $CZ$s during scheduling, which our approach does not in fact consider (see Figure 14). This inspires us to look for more optimization opportunities when placing instructions without dependencies, as in a program with many $CZ$s.
5.2 QAOA
The QAOA programs in [8] (in Figure 15), as well as the QAOA example in [22], are used in our experiment, but with a $p$ (i.e., the number of iterations) large enough. Since the decomposition of QAOA into gates affects how it can be optimized on our architecture, we consider two different ways: QAOA-Par, where QAOA is decomposed to expose more commutativity (see the details in Appendix J), and QAOA-Hard, where QAOA is decomposed into a harder form, with a long dependency chain formed by cross-qubit operations that cannot be detected by gate-level optimizers.
(a) (b) (c)
Figure 15. QAOA-MaxCut examples in [8].
The evaluation results in Table 2 show that in all cases our approach reduces the loop kernel size compared with Kernel-ASAP, and sometimes outperforms the unrolling results. This advantage is more evident in the QAOA-Par cases than in the QAOA-Hard cases, since QAOA-Par exposes more commutativity than QAOA-Hard. Another finding is that QAOA-Hard generates larger code than QAOA-Par, and thus requires more iterations for software pipelining to take effect.
[More discussions on examples are in Appendix M.]
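As a cross-check of Table 2, the sketch below recomputes two relations from the reported numbers. Both relations are inferred from the table rather than stated in the text, so they should be read as assumptions: the total depth QSP appears to equal Pre + K × QSP#Iter + Post, and K-ASAP appears to equal ASAP × #Iter. The last check compares the kernel depth K with the unrolled body depth C-ASAP.

```python
# (name, ASAP, C, C-ASAP, Pre, K, Post, #Iter, K-ASAP, Unroll, Cirq, QSP#Iter, QSP)
ROWS = [
    ("Cluster", 4, 2, 5, 4, 1, 4, 200, 800, 203, 203, 96, 104),
    ("Array 1", 5, 2, 10, 8, 4, 5, 100, 500, 500, 500, 48, 205),
    ("Array 2", 3, 2, 5, 4, 1, 4, 100, 300, 201, 201, 46, 54),
    ("Array 3", 11, 2, 17, 12, 12, 17, 100, 1100, 605, 606, 48, 605),
    ("Grover 1", 13, 2, 26, 26, 24, 871, 99, 1287, 1287, 1288, 15, 1257),
    ("Grover 2", 71, 2, 141, 141, 135, 40881, 1000, 71000, 70001, 71001, 207, 68967),
    ("QAOA-Hard 1", 21, 2, 41, 41, 40, 2021, 1001, 21021, 20021, 20021, 449, 20022),
    ("QAOA-Hard 2", 21, 2, 41, 41, 40, 2061, 1001, 21021, 20021, 20021, 448, 20022),
    ("QAOA-Hard 3", 16, 2, 27, 41, 18, 1121, 1001, 16016, 11016, 11016, 448, 9226),
    ("QAOA-Hard 4", 33, 2, 47, 60, 31, 3882, 1000, 33000, 14019, 14019, 360, 15102),
    ("QAOA-Par 1", 15, 2, 26, 46, 20, 943, 201, 3015, 2215, 2215, 56, 2109),
    ("QAOA-Par 2", 15, 2, 26, 45, 20, 1009, 201, 3015, 2215, 2215, 53, 2114),
    ("QAOA-Par 3", 18, 2, 29, 43, 18, 1080, 201, 3618, 2218, 2218, 50, 2023),
    ("QAOA-Par 4", 15, 2, 29, 29, 25, 3668, 1000, 15000, 14001, 14001, 368, 12897),
]


def table_is_consistent(rows):
    """Check the inferred relations on every row of the table."""
    for _, asap, _, c_asap, pre, k, post, n_iter, k_asap, _, _, qsp_iter, qsp in rows:
        if qsp != pre + k * qsp_iter + post:   # total depth of the pipelined loop
            return False
        if k_asap != asap * n_iter:            # depth of the naive baseline
            return False
        if not k < c_asap:                     # kernel shorter than unrolled body
            return False
    return True
```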
6 Conclusion
We proposed a compilation flow for optimizing quantum programs with for-loop control flow. In particular, data dependencies and resource dependencies are redefined to expose more opportunities for optimization algorithms. Our approach is tested on several important quantum algorithms, revealing code-size advantages over existing approaches while keeping the depth close to that of loop unrolling. Yet there is still room for optimizing more complex quantum programs, on different architectures, and with lower complexity, which could be addressed in future work.
References
[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., USA.
[2] Alexander Aiken and Alexandru Nicolau. 1988. Optimal Loop Parallelization. Technical Report. 308–317 pages. https://doi.org/10.1145/53990.54021
[3] Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. 1995. Software Pipelining. ACM Comput. Surv. 27, 3 (1995), 367–432. https://doi.org/10.1145/212094.212131
[4] Abraham Asfaw, Luciano Bello, Yael Ben-Haim, Sergey Bravyi, Nicholas Bronn, Lauren Capelluto, Almudena Carrera Vazquez, Jack Ceroni, Richard Chen, Albert Frisch, Jay Gambetta, Shelly Garion, Leron Gil, Salvador De La Puente Gonzalez, Francis Harkins, Takashi Imamichi, David McKay, Antonio Mezzacapo, Zlatko Minev, Ramis Movassagh, Giacomo Nannicni, Paul Nation, Anna Phan, Marco Pistoia, Arthur Rattew, Joachim Schaefer, Javad Shabani, John Smolin, Kristan Temme, Madeleine Tod, Stephen Wood, and James Wootton. 2020. Learn Quantum Computation Using Qiskit. http://community.qiskit.org/textbook
[5] Adi Botea, Akihiro Kishimoto, and Radu Marinescu. 2018. On the Complexity of Quantum Circuit Compilation. In Proceedings of the Eleventh International Symposium on Combinatorial Search, SOCS 2018, Stockholm, Sweden - 14-15 July 2018, Vadim Bulitko and Sabine Storandt (Eds.). AAAI Press, 138–142. https://aaai.org/ocs/index.php/SOCS/SOCS18/paper/view/17959
[6] Patrick J. Coles, Stephan J. Eidenbenz, Scott Pakin, Adetokunbo Adedoyin, John Ambrosiano, Petr M. Anisimov, William Casper, Gopinath Chennupati, Carleton Coffrin, Hristo Djidjev, David Gunter, Satish Karra, Nathan Lemons, Shizeng Lin, Andrey Y. Lokhov, Alexander Malyzhenkov, David Dennis Lee Mascarenas, Susan M. Mniszewski, Balu Nadiga, Dan O’Malley, Diane Oyen, Lakshman Prasad, Randy Roberts, Philip Romero, Nandakishore Santhi, Nikolai Sinitsyn, Pieter Swart, Marc Vuffray, Jim Wendelberger, Boram Yoon, Richard J. Zamora, and Wei Zhu. 2018. Quantum Algorithm Implementations for Beginners. CoRR abs/1804.03719 (2018). arXiv:1804.03719 http://arxiv.org/abs/1804.03719
[7] Leonardo de Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Tools and Algorithms for the Construction and Analysis of Systems, C. R. Ramakrishnan and Jakob Rehof (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 337–340.
[8] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A Quantum Approximate Optimization Algorithm. arXiv:quant-ph/1411.4028
[9] Lov K. Grover. 1996. A Fast Quantum Mechanical Algorithm for Database Search. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (Philadelphia, Pennsylvania, USA) (STOC ’96). Association for Computing Machinery, New York, NY, USA, 212–219. https://doi.org/10.1145/237814.237866
[10] Gian Giacomo Guerreschi and Jongsoo Park. 2018. Two-step approach to scheduling quantum circuits. Quantum Science and Technology 3, 4 (Jul 2018), 045003. https://doi.org/10.1088/2058-9565/aacf0b
[11] Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov, Frederic T. Chong, and Margaret Martonosi. 2015. ScaffCC: Scalable compilation and analysis of quantum programs. Parallel Comput. 45 (2015), 2–17. https://doi.org/10.1016/j.parco.2014.12.001
[12] Monica S. Lam. 1988. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the ACM SIGPLAN’88 Conference on Programming Language Design and Implementation (PLDI), Atlanta, Georgia, USA, June 22-24, 1988, Richard L. Wexelblat (Ed.). ACM, 318–328. https://doi.org/10.1145/53990.54022
[13] Prakash Murali, David C. McKay, Margaret Martonosi, and Ali Javadi-Abhari. 2020. Software Mitigation of Crosstalk on Noisy Intermediate-Scale Quantum Computers. In ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020 [ASPLOS 2020 was canceled because of COVID-19], James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1001–1016. https://doi.org/10.1145/3373376.3378477
[14] Michael A. Nielsen and Isaac L. Chuang. 2011. Quantum Computation and Quantum Information: 10th Anniversary Edition (10th ed.). Cambridge University Press, USA.
[15] Bill Pottenger. [n.d.]. Loop Rotation. http://polaris.cs.uiuc.edu/projects/rec/node8.html
[16] Robert Raussendorf, Daniel E Browne, and Hans J Briegel. 2003. Measurement-based quantum computation on cluster states. Physical review A 68, 2 (2003), 022312.
[17] Vivek V. Shende, Stephen S. Bullock, and Igor L. Markov. 2006. Synthesis of quantum-logic circuits. IEEE Trans. on CAD of Integrated Circuits and Systems 25, 6 (2006), 1000–1010. https://doi.org/10.1109/TCAD.2005.855930
[18] Yunong Shi, Nelson Leung, Pranav Gokhale, Zane Rossi, David I. Schuster, Henry Hoffmann, and Frederic T. Chong. 2019. Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019, Iris Bahar, Maurice Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 1031–1044. https://doi.org/10.1145/3297858.3304018
[19] Seyon Sivarajah, Silas Dilkes, Alexander Cowtan, Will Simmons, Alec Edgington, and Ross Duncan. 2020. t|ket⟩: a retargetable compiler for NISQ devices. Quantum Science and Technology 6, 1 (nov 2020), 014003. https://doi.org/10.1088/2058-9565/ab8e92
[20] Robert S. Smith, Michael J. Curtis, and William J. Zeng. 2016. A Practical Quantum Instruction Set Architecture. CoRR abs/1608.03355 (2016). arXiv:1608.03355 http://arxiv.org/abs/1608.03355
[21] Bochen Tan and Jason Cong. 2020. Optimal Layout Synthesis for Quantum Computing. arXiv:cs.AR/2007.15671
[22] Quantum AI team and collaborators. 2020. Cirq. https://doi.org/10.5281/zenodo.4062499
[23] Qiskit Development Team. [n.d.]. Qiskit Terra basic schedulers. https://qiskit.org/documentation/stubs/qiskit.scheduler.methods.basic.html#module-qiskit.scheduler.methods.basic
[24] Mingsheng Ying. 2009. Commutativity between CNOT and one-qubit gates (Unpublished notes). (2009).
[25] Mingsheng Ying. 2016. Foundations of Quantum Programming. Morgan Kaufmann, Boston. https://doi.org/10.1016/B978-0-12-802306-8.00002-1
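A Basic quantum gates
The following are the frequently used one-qubit gates, represented as $2 \times 2$ unitary matrices:
$$\text{Pauli gates:}\quad X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\quad Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix},\quad Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix},$$
$$\text{Hadamard gate:}\quad H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$
$$\text{Phase and } \tfrac{\pi}{8} \text{ gates:}\quad S = \begin{bmatrix} 1 & 0 \\ 0 & i \end{bmatrix},\quad T = \begin{bmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{bmatrix},$$
$$\text{Pauli rotations:}\quad R_X(\alpha) = \begin{bmatrix} \cos\frac{\alpha}{2} & -i\sin\frac{\alpha}{2} \\ -i\sin\frac{\alpha}{2} & \cos\frac{\alpha}{2} \end{bmatrix},\quad R_Y(\alpha) = \begin{bmatrix} \cos\frac{\alpha}{2} & -\sin\frac{\alpha}{2} \\ \sin\frac{\alpha}{2} & \cos\frac{\alpha}{2} \end{bmatrix},\quad R_Z(\alpha) = \begin{bmatrix} e^{-i\alpha/2} & 0 \\ 0 & e^{i\alpha/2} \end{bmatrix}.$$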
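Combined with one of the (two-qubit) controlled gates
$$CNOT = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix},\quad CZ = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix},$$
they are universal for quantum computing; that is, they can be used to construct an arbitrary quantum gate of any size.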
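Besides the above, we will use the following auxiliary gates to simplify the presentation of our approach:
$$R^-_X(\alpha) = \begin{bmatrix} \cos\frac{\alpha}{2} & -i\sin\frac{\alpha}{2} \\ i\sin\frac{\alpha}{2} & -\cos\frac{\alpha}{2} \end{bmatrix},$$
$$R^+_Z(\alpha) = \begin{bmatrix} 0 & e^{i\alpha/2} \\ e^{-i\alpha/2} & 0 \end{bmatrix} = X R_Z(\alpha),$$
$$H(\alpha) = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ e^{i\alpha} & -e^{i\alpha} \end{bmatrix} = R_Z(\alpha) H,$$
$$H^-(\alpha) = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ e^{i\alpha} & e^{i\alpha} \end{bmatrix} = R_Z(\alpha) H Z,$$
where the last two equalities hold up to a global phase.
Note that the parameter $\alpha$ in the above gates is a real number. The $R^+_Z(\alpha)$ gate can represent all single-qubit gates that are antidiagonal, i.e., only the antidiagonal entries are nonzero. The other three notations are used in Appendix I.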
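For real-world quantum computers, a quantum device may only support a discrete or continuous set of single-qubit gates while keeping the device universal. For example, IBM's devices allow the following three kinds of single-qubit gates to be executed directly [4]:
$$U_1(\lambda) = \begin{bmatrix} 1 & 0 \\ 0 & e^{i\lambda} \end{bmatrix},$$
$$U_2(\phi, \lambda) = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -e^{i\lambda} \\ e^{i\phi} & e^{i(\lambda + \phi)} \end{bmatrix},$$
$$U_3(\theta, \phi, \lambda) = \begin{bmatrix} \cos\frac{\theta}{2} & -e^{i\lambda}\sin\frac{\theta}{2} \\ e^{i\phi}\sin\frac{\theta}{2} & e^{i(\lambda + \phi)}\cos\frac{\theta}{2} \end{bmatrix}.$$
Note that $U_2(\phi, \lambda) = U_3(\frac{\pi}{2}, \phi, \lambda)$ and $U_1(\lambda) = U_3(0, 0, \lambda)$. Also note that the gate $U_3$ itself is universal for single-qubit gates, and the main reason for supporting $U_1$ and $U_2$ is to mitigate error, which is beyond our consideration.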
B More examples of quantum loop programs
We present here more quantum algorithms that can be written as quantum loop programs and can thus potentially be optimized by our approach.
B.1 One-way quantum computing
The preparation circuit for simulating one-way quantum computation on a quantum circuit is another example that allows each iteration to be performed on different qubits.
Example 11. One-way quantum computing $QC_{\mathcal{C}}$ [16] is a quantum computing scheme that is quite different from the commonly used quantum-circuit-based schemes. Instead of starting from $|0\rangle$, $QC_{\mathcal{C}}$ initializes all qubits (on a 2-dimensional qubit grid) in a highly entangled state, called the cluster state. After the preparation step, $QC_{\mathcal{C}}$ performs single-qubit measurements on all qubits and extracts the computation result from these measurement outcomes.
To simulate one-way quantum computing with a quantum circuit, we first need to prepare the cluster state from $|0\rangle$. This can be done by first performing Hadamard gates on all qubits, then performing a $CZ$ gate on each pair of adjacent qubits on the qubit grid.
The preparation circuit can be written in a nested-loop manner. If we assume the grid has a fixed width (3 in our case), we can unroll the innermost loop to get the flattened loop:
H[q[0]]
H[q[1]]
H[q[2]]
CZ[q[0], q[1]]
CZ[q[1], q[2]]
for i=1 to (L-1) do
  H[q[3i]]
  H[q[3i+1]]
  H[q[3i+2]]
  CZ[q[3i], q[3i+1]]
  CZ[q[3i+1], q[3i+2]]
  CZ[q[3i], q1[3i-3]]
  CZ[q[3i+1], q2[3i-2]]
  CZ[q[3i+2], q3[3i-1]]
end for
Figure 16 shows the gates and qubits involved in each iteration where $L = 5$. The optimization of this program will be discussed in Appendix M.
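The flattened loop can be mimicked in a few lines of Python to confirm that it touches every grid edge exactly once. This is a sketch under stated assumptions: the function name is hypothetical, and the arrays printed as `q1`, `q2`, `q3` in the listing are treated as the same array `q`, which the source does not confirm.

```python
def cluster_prep(length, width=3):
    """Gate list preparing a cluster state on a width x length grid.

    Qubit (row i, column c) is stored at index width*i + c.
    """
    gates = []
    for i in range(length):
        base = width * i
        gates += [("H", base + c) for c in range(width)]
        # CZ between horizontal neighbours inside row i
        gates += [("CZ", base + c, base + c + 1) for c in range(width - 1)]
        if i >= 1:
            # CZ between each qubit of row i and the qubit above it in row i-1
            gates += [("CZ", base + c, base + c - width) for c in range(width)]
    return gates
```

For `length = 5` this yields 15 Hadamard gates and 22 `CZ` gates, one per edge of the 3 × 5 grid.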
B.2 Quantum Approximate Optimization Algorithm
Example 12. The Quantum Approximate Optimization Algorithm (QAOA) [8] can be used to solve MaxSat problems, for example MaxCut problems on 3-regular graphs, say $G = \langle V, E \rangle$. QAOA performs quantum computation and classical computation alternately. In the quantum part, it requires us to create the state
$$|\gamma, \beta\rangle = \prod_{i=1}^{p} U(B, \beta_i)\, U(C, \gamma_i)\, |+\rangle \tag{5}$$
where
$$U(C, \gamma_i) = \prod_{(a,b) \in E} \begin{bmatrix} 1 & & & \\ & e^{-i\omega_{ab}\gamma_i} & & \\ & & e^{-i\omega_{ab}\gamma_i} & \\ & & & 1 \end{bmatrix} \tag{6}$$
$$U(B, \beta_i) = \prod_{j=0}^{N-1} R_X(\beta_i, j). \tag{7}$$
The sets of parameters $\{\beta_i\}$ and $\{\gamma_i\}$ are computed in the classical computation between every two quantum epochs. This requires the optimizer to support compilation of the circuit above without knowing all parameters in advance.
$U(B, \beta_i)$ is a product of Pauli-$X$ rotations on all qubits. Since in our case each factor of $U(C, \gamma_i)$ can be decomposed (up to a global phase) in the following way:
$$\begin{bmatrix} 1 & & & \\ & e^{-i\omega_{ab}\gamma_i} & & \\ & & e^{-i\omega_{ab}\gamma_i} & \\ & & & 1 \end{bmatrix} = CNOT[a, b]\; R_Z(-\omega_{ab}\gamma_i)[b]\; CNOT[a, b], \tag{8}$$
we can define parametric gate arrays $G_C[i] = R_X(\beta_i, j)$ and $G_B[i] = R_Z(-\omega_{ab}\gamma_i)$, and the quantum part of QAOA can be written as a parametric quantum loop program:
for i=0 to (N-1) do
  H[q[i]]
end for
for i=1 to p do
  for (a, b) ∈ E do
    CNOT[q[a], q[b]]
    G_B[i][q[b]]
    CNOT[q[a], q[b]]
  end for
  for j=0 to (N-1) do
    G_C[j][q[j]]
  end for
end for
The two nested loops can be fully unrolled by hand, and the resulting loop satisfies our requirements for optimization.
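The decomposition of one cost factor into two CNOTs around a $Z$ rotation can be verified numerically. The sketch below is illustrative: the helper names are hypothetical, the basis order is assumed to be |ab⟩ = |00⟩, |01⟩, |10⟩, |11⟩ with `a` as the control, and the comparison is made up to a global phase.

```python
import cmath


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


CNOT = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]


def rz_on_target(theta):
    """I (x) RZ(theta): the rotation acts on the second (target) qubit."""
    lo, hi = cmath.exp(-1j * theta / 2), cmath.exp(1j * theta / 2)
    return [[lo, 0, 0, 0], [0, hi, 0, 0], [0, 0, lo, 0], [0, 0, 0, hi]]


def cost_block(omega, gamma):
    """CNOT, then RZ(-omega*gamma) on the target, then CNOT."""
    return matmul(CNOT, matmul(rz_on_target(-omega * gamma), CNOT))
```

Dividing `cost_block(omega, gamma)` by its top-left entry gives the diagonal matrix with entries `1, exp(-1j*omega*gamma), exp(-1j*omega*gamma), 1`.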
C Output language
If the input range of the loop program is unknown, we may have to add guard statements to the original program, for example when we want to check whether the range is large enough for us to use the software-pipelined version. Features such as guard statements, unfortunately, are not supported in our definition of the input language. So we have to define the following language for the optimization result:
program := header statement*
header := [(qdef | udef)*]
qdef := qubit ident[N];
udef := defgate ident[N] = gate;
gate := [(C^{2×2})*] | R_Z | R^+_Z | Unknown
gateref := ident[expr]
qubit := ident[expr]
op := SQ(gateref) qubit;
    | CZ qubit, qubit;
statement := op
    | for ident in expr to expr {statement*}
    | parallel {statement*}
    | guard {(compare => {statement*})* otherwise => {statement*}}
expr := ident | expr + expr | expr - expr | expr * expr | expr / expr | expr % expr | Z
compare := expr ordering expr
ordering := == | != | > | < | >= | <=
The main differences between the input language and the output language are:
1. The parallel notation is added to explicitly indicate which instructions are scheduled together.
2. The guard statement is added to check whether the input range is suitable for the software-pipelined version if the range is unknown at compilation time, and to separate cases with different $(m \bmod C)$. The guard statement executes the first statement block whose guard condition is satisfied.
Figure 16. Converting the cluster-state preparation circuit into a loop program. Fig. (a) is a $3 \times 5$ two-dimensional qubit network. The preparation is done by performing a layer of Hadamard gates (Fig. (b)) and a layer of $CZ$ gates (Fig. (c)). One way to perform those $CZ$ gates without qubit conflicts is to split them into four non-overlapping groups and execute each group separately, as in Fig. (d) to Fig. (g). The procedure can also be written as a loop program, as in Fig. (h) to Fig. (l).
3. The expr nonterminal allows for more general indexing into qubit arrays and gate arrays. Note that the division and modulo operators are Euclidean, i.e. it always holds that
$$\begin{cases} \operatorname{sign}(a \% b) = \operatorname{sign}(b) \\ a \% b + (a / b) * b = a \end{cases} \tag{9}$$
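As a small illustration of the two properties in Equation 9, the following sketch checks them for a range of operands. It assumes that the remainder takes the sign of the divisor, which coincides with Python's built-in `//` and `%`; the helper names `sign` and `check_div_mod` are hypothetical and not part of the output language.

```python
def sign(x: int) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def check_div_mod(a: int, b: int) -> bool:
    """Check both properties of Equation 9 for one pair (a, b) with b != 0."""
    q, r = a // b, a % b  # Python's floor division and its matching remainder
    same_sign = (r == 0) or (sign(r) == sign(b))  # a zero remainder carries no sign
    reconstructs = (r + q * b == a)
    return same_sign and reconstructs


# Exhaustive check on a small grid of dividends and (nonzero) divisors.
all_ok = all(check_div_mod(a, b)
             for a in range(-10, 11)
             for b in (-3, -2, -1, 1, 2, 3))
```

For a positive divisor this gives the behaviour needed for index expressions with negative offsets, e.g. `(-1) % 3` is `2` and `(-1) // 3` is `-1`.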
D Solving Diophantine equations
In this appendix we focus on solving the Diophantine equation:
$$(k_2 - k_1)\,i + k_2\,(\Delta i) = b_1 - b_2, \quad i \in R,\ i + \Delta i \in R,\ \Delta i \geqslant 1. \tag{10}$$
We rewrite it as:
$$ax + by = c, \quad x \in R,\ x + y \in R,\ y \geqslant 1. \tag{11}$$
We recall the solution set $S$ of a linear Diophantine equation in two variables:
Lemma 1. Solutions of the linear Diophantine equation in two variables
$$ax + by = c, \quad x \in \mathbb{Z},\ y \in \mathbb{Z}. \tag{12}$$
1. If $a = 0$ and $b = 0$: $S = \emptyset$ if $c \neq 0$, and $S = \mathbb{Z} \times \mathbb{Z}$ if $c = 0$.
2. If $a = 0$ but $b \neq 0$ (similarly for $b = 0$ but $a \neq 0$):
a. If $b \mid c$, $S = \mathbb{Z} \times \{c/b\}$.
b. Otherwise, $S = \emptyset$.
3. If $a \neq 0$ and $b \neq 0$:
a. If $\gcd(a, b) \mid c$:
• A special solution $(x_0, y_0)$ of Equation 12 is obtained by solving
$$a x' + b y' = \gcd(a, b) \tag{13}$$
with the extended Euclidean algorithm and scaling $(x', y')$ by $c / \gcd(a, b)$.
• The general solution $\left(\frac{b}{\gcd(a,b)}, -\frac{a}{\gcd(a,b)}\right)$ of the equation
$$ax + by = 0 \tag{14}$$
is known.
• The total solution space is
$$S = \left\{ \left( x_0 + t\,\frac{b}{\gcd(a,b)},\ y_0 - t\,\frac{a}{\gcd(a,b)} \right) \,\middle|\, t \in \mathbb{Z} \right\}. \tag{15}$$
We rewrite the solution space as:
$$S = \{ (x_0 + t\Delta x,\ y_0 + t\Delta y) \mid t \in \mathbb{Z} \}. \tag{16}$$
b. Otherwise, $S = \emptyset$.
For our original question with constraints, we only consider the cases where $a \neq 0$ and $b \neq 0$.
When $R = \mathbb{Z}$, the constraints no longer exist and we only need to find the minimal positive integer in the set $\{y_0 + t\Delta y\}$, which can be found by a Euclidean division. Without loss of generality, we can let $t = 0$ by choosing $y_0$ to be exactly the smallest positive integer in $\{y_0 + t\Delta y\}$ and adjusting $x_0$ accordingly, without affecting the solution set $S$.
When $R = [m, n]$, the corresponding $x_0$ may not lie in $R$. In this case we look for the next-smallest positive $y$ whose $x$ lies in $R$. Without loss of generality we assume $\Delta y > 0$ (otherwise replace $\Delta x$ by $-\Delta x$ and $\Delta y$ by $-\Delta y$). Then the problem becomes: find the minimal $t \in \mathbb{N}^+$ s.t.
$$\begin{cases} x_0 + t\Delta x \geqslant m \\ x_0 + t\Delta x \leqslant n \end{cases} \tag{17}$$
which is equivalent to
$$\begin{cases} t\Delta x \geqslant m - x_0 \\ t\Delta x \leqslant n - x_0 \end{cases} \tag{18}$$
which can thus be solved by a routine calculation: either a minimal $t$ exists, or no solution exists at all.
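The procedure described in this appendix can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our tool: the function names (`ceil_div`, `ext_gcd`, `min_positive_y`) and the parameters `lo`, `hi` for the range bounds are hypothetical, $a$ and $b$ are assumed nonzero, and only the constraint on $x$ is enforced (the constraint on $x + y$ would need one more check).

```python
def ceil_div(p: int, q: int) -> int:
    """Ceiling of p / q for q > 0, using integer arithmetic only."""
    return -((-p) // q)


def ext_gcd(a: int, b: int):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b) > 0 (a, b not both 0)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    if old_r < 0:  # normalise the gcd to be positive
        old_r, old_x, old_y = -old_r, -old_x, -old_y
    return old_r, old_x, old_y


def min_positive_y(a, b, c, lo=None, hi=None):
    """Solution (x, y) of a*x + b*y == c with the smallest y >= 1.

    If lo and hi are given, x must also satisfy lo <= x <= hi.
    Returns None when no such solution exists. Assumes a != 0 and b != 0.
    """
    g, x0, y0 = ext_gcd(a, b)
    if c % g != 0:
        return None
    x0, y0 = x0 * (c // g), y0 * (c // g)   # scale to a solution of a*x + b*y == c
    dx, dy = b // g, -(a // g)              # step of the general solution
    if dy < 0:                              # make y increase with t
        dx, dy = -dx, -dy
    t = ceil_div(1 - y0, dy)                # shift so y0 is the smallest positive y
    x0, y0 = x0 + t * dx, y0 + t * dy
    if lo is None or hi is None:
        return x0, y0
    # Smallest t >= 0 that can bring x into [lo, hi]; x moves monotonically with t.
    if dx > 0:
        t = max(0, ceil_div(lo - x0, dx))
    else:
        t = max(0, ceil_div(x0 - hi, -dx))
    x, y = x0 + t * dx, y0 + t * dy
    return (x, y) if lo <= x <= hi else None
```

For example, `min_positive_y(1, 2, 7, 0, 3)` first finds $(5, 1)$, sees that $x = 5$ is outside $[0, 3]$, and moves one step to $(3, 2)$.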
E Proof of Theorem 1 (CZ conjugation rules)
In this section we give the proof of our new rules for instruction data dependency. We will show that our definition of dependency is "sufficient and necessary" for quantum gate sets using $CZ$.
We first restate Theorem 1 as follows:
$$CZ\,U_A U_B\,CZ = V_A V_B,$$
if and only if $U_A$ and $U_B$ are diagonal or anti-diagonal. That is, $U_P = R_Z(\theta)$ or $U_P = R_Z^+(\theta)$ for $P \in \{A, B\}$.
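Proof. We first introduce our methodology for proving quantum gate algebra equations: we derive a necessary condition by trying several input states, and then show that the condition is also sufficient for the equation to hold.
The first lemma is a criterion for deciding whether a state is separable or entangled:
Lemma 2. A two-qubit state $|\psi\rangle = (a, b, c, d)^T$ is separable if and only if:
$$ad - bc = 0. \tag{19}$$
Proof. (Necessity) If $|\psi\rangle$ is separable, there exist two single-qubit states $|\psi_1\rangle$ and $|\psi_2\rangle$, s.t.
$$|\psi\rangle = |\psi_1\rangle \otimes |\psi_2\rangle. \tag{20}$$
Suppose
$$|\psi_1\rangle = (\alpha_1, \beta_1)^T, \tag{21}$$
$$|\psi_2\rangle = (\alpha_2, \beta_2)^T. \tag{22}$$
We have
$$|\psi\rangle = (\alpha_1\alpha_2,\ \alpha_1\beta_2,\ \beta_1\alpha_2,\ \beta_1\beta_2)^T, \tag{23}$$
and it can be easily verified that $ad - bc = 0$.
(Sufficiency) If
$$|\psi\rangle = (a, b, c, d)^T \tag{24}$$
with $ad - bc = 0$,
1. If $a = 0$, this indicates $b = 0$ or $c = 0$. If $b = 0$, let
$$\begin{cases} |\psi_1\rangle = |1\rangle \\ |\psi_2\rangle = c|0\rangle + d|1\rangle \end{cases}; \tag{25}$$
otherwise $c = 0$, and let
$$\begin{cases} |\psi_1\rangle = b|0\rangle + d|1\rangle \\ |\psi_2\rangle = |1\rangle \end{cases}. \tag{26}$$
2. If $d = 0$, this indicates $b = 0$ or $c = 0$. If $b = 0$, let
$$\begin{cases} |\psi_1\rangle = a|0\rangle + c|1\rangle \\ |\psi_2\rangle = |0\rangle \end{cases}; \tag{27}$$
otherwise $c = 0$, and let
$$\begin{cases} |\psi_1\rangle = |0\rangle \\ |\psi_2\rangle = a|0\rangle + b|1\rangle \end{cases}. \tag{28}$$
3. Otherwise $a, b, c, d \neq 0$. Let
$$|\psi_1\rangle = \left( \frac{b}{\sqrt{\|b\|^2 + \|d\|^2}},\ \frac{d}{\sqrt{\|b\|^2 + \|d\|^2}} \right)^T, \quad |\psi_2\rangle = \left( \frac{a/b}{\sqrt{\|a/b\|^2 + \|1\|^2}},\ \frac{1}{\sqrt{\|a/b\|^2 + \|1\|^2}} \right)^T. \tag{29}$$
It can be verified that $\| |\psi_1\rangle \| = \| |\psi_2\rangle \| = 1$, and, using $c = ad/b$, that
$$|\psi_1\rangle \otimes |\psi_2\rangle = \frac{(a, b, c, d)^T}{\sqrt{(\|b\|^2 + \|d\|^2)(\|a/b\|^2 + \|1\|^2)}}, \tag{30}$$
which is exactly $(a, b, c, d)^T$ since the tensor product preserves norm.
□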
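The criterion of Lemma 2 is easy to check numerically. The sketch below is only an illustration: the helper names `kron2`, `is_separable` and `factor` are hypothetical, and `factor` recovers the two single-qubit states by projecting onto one nonzero row of the amplitude matrix, which is a different construction from the case analysis in the proof.

```python
import math


def kron2(u, v):
    """Tensor product of two single-qubit states, ordered |00>, |01>, |10>, |11>."""
    return [u[0] * v[0], u[0] * v[1], u[1] * v[0], u[1] * v[1]]


def is_separable(psi, tol=1e-12):
    """Lemma 2 criterion: (a, b, c, d) is separable iff a*d - b*c == 0."""
    a, b, c, d = psi
    return abs(a * d - b * c) < tol


def factor(psi, tol=1e-12):
    """Split a separable, nonzero state into (psi1, psi2) with kron2(psi1, psi2) == psi.

    One nonzero row of [[a, b], [c, d]] is normalised and used as psi2;
    projecting both rows onto it yields the two components of psi1.
    """
    a, b, c, d = (complex(z) for z in psi)
    row = (a, b) if abs(a) > tol or abs(b) > tol else (c, d)
    norm = math.hypot(abs(row[0]), abs(row[1]))
    psi2 = (row[0] / norm, row[1] / norm)
    psi1 = (a * psi2[0].conjugate() + b * psi2[1].conjugate(),
            c * psi2[0].conjugate() + d * psi2[1].conjugate())
    return psi1, psi2


# A product state and a Bell state as test inputs.
product = kron2((1 / math.sqrt(2), 1j / math.sqrt(2)), (0.6, 0.8))
bell = [1 / math.sqrt(2), 0, 0, 1 / math.sqrt(2)]
```

Here `is_separable(product)` holds and `factor(product)` returns two states whose tensor product reproduces `product`, while `is_separable(bell)` fails because $ad - bc = 1/2$.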
Lemma 3. (Necessity) For the equation to hold, $U_A$ and $U_B$ have to be diagonal or anti-diagonal. This means $U_P$ transforms $|0\rangle$ to $|0\rangle$ or $|1\rangle$, up to a global phase.
Proof. Suppose $|\psi\rangle = U_A|0\rangle = (a, b)^T$, thus
$$CZ\,U_A U_B\,CZ\,\big(|0\rangle \otimes (U_B^\dagger|\psi\rangle)\big) \tag{31}$$
$$= CZ\,|\psi\rangle \otimes |\psi\rangle \tag{32}$$
$$= (a^2,\ ab,\ ab,\ -b^2)^T, \tag{33}$$
which should be a separable state since it also equals $V_A V_B\,\big(|0\rangle \otimes (U_B^\dagger|\psi\rangle)\big)$, which is separable. By Lemma 2, $a^2 \cdot (-b^2) - ab \cdot ab = -2a^2b^2 = 0$. Thus $a^2 b^2 = 0$, so $a = 0$ (the $R_Z^+$ case) or $b = 0$ (the $R_Z$ case). The same holds for $U_B$. □
Lemma 4. (Sufficiency) $R_Z$ and $R_Z^+$ satisfy the conjugation rules.
Proof. Note that $R_Z^+ = X R_Z$ and $CZ\,X_A = X_A Z_B\,CZ$. By simple computation we can see that the conjugation holds. □
□
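The conjugation rules proved above can be checked numerically with plain matrix arithmetic. The following sketch is an illustration under assumptions: it takes $R_Z(\theta) = \mathrm{diag}(1, e^{i\theta})$ as the convention for the diagonal gate (the proof only needs diagonality), and the helper names `matmul`, `kron`, `close`, `rz` and `rz_plus` are hypothetical.

```python
import cmath
import math


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product of two 2x2 matrices (first factor acts on qubit A)."""
    return [[a[i][j] * b[k][l] for j in range(2) for l in range(2)]
            for i in range(2) for k in range(2)]


def close(a, b, tol=1e-12):
    """Entry-wise comparison of two matrices."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]


def rz(theta):
    """Assumed convention: R_Z(theta) = diag(1, e^{i*theta})."""
    return [[1, 0], [0, cmath.exp(1j * theta)]]


def rz_plus(theta):
    """Antidiagonal gate R_Z^+(theta) = X * R_Z(theta)."""
    return matmul(X, rz(theta))


theta = 0.7
# Diagonal gate on A: CZ (R_Z x I) CZ = R_Z x I.
diag_ok = close(matmul(matmul(CZ, kron(rz(theta), I2)), CZ), kron(rz(theta), I2))
# Antidiagonal gate on A: CZ (R_Z^+ x I) CZ = R_Z^+ x Z.
anti_ok = close(matmul(matmul(CZ, kron(rz_plus(theta), I2)), CZ),
                kron(rz_plus(theta), Z))
# Identity used in the proof of Lemma 4: CZ X_A = X_A Z_B CZ.
swap_ok = close(matmul(CZ, kron(X, I2)), matmul(kron(X, Z), CZ))

# Counterexample: a Hadamard on A is neither diagonal nor antidiagonal.
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
M = matmul(matmul(CZ, kron(H, I2)), CZ)
inp = [s, s, 0, 0]  # the product state |0>|+>
out = [sum(M[i][j] * inp[j] for j in range(4)) for i in range(4)]
# By Lemma 2 the output is entangled, so M cannot be a product V_A x V_B.
entangled = abs(out[0] * out[3] - out[1] * out[2]) > 1e-9
```

The last check mirrors the necessity argument: a product of single-qubit gates maps product states to product states, so one entangled output is enough to rule out the conjugation.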
F Proof of Theorem 3 (Convergence of compaction)
We show that the compaction procedure converges after being applied three times.
There are two main factors that prevent the compaction procedure from reaching its fixpoint in a single pass:
1. Single qubit merging results in new diagonal or antidiagonal gates, which are not recognized when the first gate is placed. Compacting #1 in Figure 7 shows an example where three gates merge into an antidiagonal $X$ gate, which can move through the $CZ$ gate on the next compaction.
2. Swapping the order of an antidiagonal gate and a $CZ$ gate adds $Z$ gates to the circuit. Compacting #2 in Figure 7 shows an example.
Fortunately, these problems will not occur at the third compaction. This is because diagonal gates and antidiagonal gates form a subgroup of $U(2)$:
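Lemma 5. Let
$$G_Z = \{ R_Z(\theta) \mid \theta \in [0, 2\pi) \}, \tag{34}$$
$$G_Z^+ = \{ R_Z^+(\theta) \mid \theta \in [0, 2\pi) \}, \tag{35}$$
$$G = G_Z \cup G_Z^+, \tag{36}$$
then $G_Z$ and $G$ are subgroups of $U(2)$, while $\forall U_1, U_2 \in G_Z^+$, $U_1 U_2 \in G_Z$.
Corollary 6. $\forall U_1 \in U(2) \setminus G,\ U_2 \in G$: $U_2 U_1 \in U(2) \setminus G$.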
On compaction #2, single qubit gates can only merge when they are on different sides of a $CZ$ gate and one of them is diagonal or antidiagonal (otherwise they would have been merged on compaction #1). According to Corollary 6, this merging will not create new diagonal or antidiagonal gates, and all new gates from compaction #2 come from moving antidiagonal gates through $CZ$. The last compaction merges these additional $Z$ gates to their left.
G Proof of Theorem 5 (Remove multiple edges)
In the QDG defined in Section 4, Theorem 5 is proposed so that multiple edges can be removed before $II$ is assigned. The proof of Theorem 5 is listed below:
Proof. Since $dif_1$ and $dif_2$ are integers,
$$1 + dif_2 \leqslant dif_1. \tag{37}$$
Since $II \geqslant 1$,
$$-II \cdot dif_1 \leqslant -II - II \cdot dif_2 \leqslant -1 - II \cdot dif_2. \tag{38}$$
Since $min_1 \leqslant 1$ and $min_2 \leqslant 1$,
$$min_1 \leqslant min_2 + 1. \tag{39}$$
Adding up Equations 38 and 39 shows the result. □
H Resource scheduling complexity analysis
In Section 4 we mentioned that we can keep retrying if there is a "resource conflict" and the death countdown has not timed out (i.e. the resource conflicts are all false conflicts). This may lead to so many retries that they dominate the complexity of the algorithm, so we need an upper bound on the maximum number of retries to estimate the total complexity.
Recall how we perform resource checking when inserting instructions into the schedule:
• For every time slot, we keep the set of instructions scheduled in that time slot.
• When adding an instruction or a group of instructions, we check the operands of each instruction to be added against the instructions in the time slot where it will be added.
• If there is a resource conflict, we have to try the next tick (and perhaps start a death countdown).
We first show that if there are only false conflicts, the loop can be rewritten into an equivalent form where all $k = 1$. In fact, this follows from:
$$ki + b = k\,(i + (b/k)) + (b \bmod k), \tag{40}$$
where
$$(b \bmod k) \in [0, \|k\|), \quad k\,(b/k) + (b \bmod k) = b. \tag{41}$$
According to this fact, the array can be split into $\|k\|$ slices, and a resource conflict can occur only if the two qubit references fall into the same slice. Figure 17 is an example for $k = 3$. Offsets $3i$ and $(3i - 1)$ will never conflict with each other, since they fall into different slices $S_0$ and $S_2$.
This splitting allows us to use one integer $b' = (b/k)$ to represent an expression in the slice: in the Figure 17 case we can use $0$ for $q[3i]$ in slice $S_0$, $0$ for $q[3i + 1]$ in slice $S_1$, and $(-1)$ for $q[3i - 1]$ in slice $S_2$.
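The slicing above can be sketched in a few lines. The function names `slice_repr` and `may_conflict` are hypothetical, and the sketch assumes that both references index the same array with the same coefficient of the iteration variable.

```python
def slice_repr(k: int, b: int):
    """Return (slice index, b') for the linear index k*i + b, with k != 0.

    The slice index is the Euclidean remainder of b by k, always in [0, |k|),
    and b' is the matching quotient, so that k*b' + slice index == b.
    """
    r = b % abs(k)           # remainder in [0, |k|) for either sign of k
    return r, (b - r) // k   # b - r is an exact multiple of k


def may_conflict(k: int, b1: int, b2: int) -> bool:
    """Two references with the same coefficient k can only alias inside one slice."""
    return slice_repr(k, b1)[0] == slice_repr(k, b2)[0]
```

With $k = 3$ this reproduces the representations quoted for Figure 17: `slice_repr(3, 0)` is `(0, 0)`, `slice_repr(3, 1)` is `(1, 0)` and `slice_repr(3, -1)` is `(2, -1)`, so `may_conflict(3, 0, -1)` is false.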
Corollary 7. For modulo scheduling, if a resource is scheduled $II$ ticks later, the integer $b'$ representing the resource decreases by 1.
This allows us to use a stricter model for upper-bound estimation:
• For the entire schedule, we use a universal set to store all integer representations $\{b'\}$ of linear expressions.
• When adding an instruction or a group of instructions, we check the operands to be added against the universal set, rather than the time-slot set. This means two instructions with the same operand but scheduled at different ticks will also be seen as conflicting.
• If the integer representation of an operand is already in the set, there is a resource conflict. To find the worst case, we suppose the next $(II - 1)$ tries will definitely fail. The next retry that can possibly succeed is the $II$-th retry, where the instruction is placed in the same time slot again.
• The array index $q$ and slice index $b \bmod k$ are ignored. For example, operands $q[3i]$ and $q[3i + 1]$ will be seen as conflicting since they have the same representation $0$, even though the two expressions will never be equal to each other.
Figure 17. Example for splitting the qubit array when $k = 3$. Resource conflict can only occur inside each slice, and resources in each slice can be represented by one integer.
This strict set of rules reduces our upper bound problem to a clearer problem:
Theorem 8. Let the finite set $A \subset \mathbb{Z}$ stand for the resources (integers representing each resource) already scheduled, and $B \subset \mathbb{Z}$ for the resources to be scheduled. Define
$$B - t = \{ x - t \mid x \in B \} \quad (t \in \mathbb{N}) \tag{42}$$
to be the resource set of $B$ after $t \cdot II$ retries. Let $t_{min}$ be the minimal $t$ s.t.
$$A \cap (B - t) = \emptyset, \tag{43}$$
then at most $t_{min} \cdot II$ retries are required in our algorithm.
Figure 18. Resources 3 ($q[x + 3]$) and 5 ($q[x + 5]$) are now occupied, and resources 4 to 6 are required to be scheduled. Now $t_{min} = 4$.
A naive estimation of $t_{min}$ would be
$$t_{min} \leqslant \max(B) - \min(A) + 1, \tag{44}$$
which is not acceptable. Fortunately, we can give a more precise estimation that depends not on the values in $A$ or $B$, but only on the sizes of the sets.
Theorem 9. Let $\|S\|$ be the size of set $S$, then
$$t_{min} \leqslant \|A\|\,\|B\|. \tag{45}$$
Proof. Consider the set
$$D = \{ b - a \mid a \in A,\ b \in B,\ (b - a) \geqslant 0 \}, \tag{46}$$
thus $t \notin D$ if and only if $A \cap (B - t) = \emptyset$. Thus $t_{min}$ is the first natural number not appearing in $D$. However, $\|D\| \leqslant \|A\|\,\|B\|$ according to its definition, so $t_{min} \leqslant \|A\|\,\|B\|$. □
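The bound of Theorem 9 can be illustrated on the sets of Figure 18. The sketch below computes $t_{min}$ in two ways, directly from the definition in Theorem 8 and through the difference set $D$ used in the proof; the function names `t_min_direct` and `t_min_via_differences` are hypothetical.

```python
def t_min_direct(A: set, B: set) -> int:
    """Smallest natural t such that A and B - t are disjoint (Theorem 8)."""
    t = 0
    while A & {x - t for x in B}:
        t += 1
    return t


def t_min_via_differences(A: set, B: set) -> int:
    """Same value, computed as the first natural number missing from D (Theorem 9)."""
    D = {b - a for a in A for b in B if b - a >= 0}
    t = 0
    while t in D:
        t += 1
    return t


# Figure 18: resources 3 and 5 are occupied, resources 4 to 6 are to be scheduled.
A = {3, 5}
B = {4, 5, 6}
t_fig18 = t_min_direct(A, B)
```

Both functions return 4 here, within the bound $\|A\|\,\|B\| = 6$; the set $D$ is $\{0, 1, 2, 3\}$.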
Corollary 10. Inserting $m$ instructions at one time (e.g. when merging two scheduled blocks) into a schedule with $n$ instructions requires at most $O(mn \cdot II)$ retries. If each retry takes $O(mn)$ queries to find a conflict, the total complexity is at most $O(m^2 n^2 \cdot II)$.
From the theorem we obtain several important results on the complexity:
Corollary 11.
1. Inserting one instruction into a modulo scheduling table of size $n$ requires $O(n \cdot II)$ retries and $O(n^2 \cdot II)$ time. Thus inserting all $n$ instructions requires $O(n^3 \cdot II)$ time.
2. The span of the modulo scheduling table above is bounded by $O(n^2 \cdot II)$.
3. Suppose the loop kernel of size $N$ is split into $m \geqslant 2$ strongly connected components of size $n$. The total complexity for scheduling all SCCs is $m \cdot O(n^3 \cdot II) = O(mn^3 \cdot II) = O(N^4)$, and the total time required to merge all SCCs together is
$$\sum_{i=1}^{m-1} O(n^2 (in)^2 \cdot II) = O(m^3 n^4 \cdot II) = O(N^5). \tag{47}$$
4. The span of the total schedule is
$$m \cdot O(n^2 \cdot II) + \sum_{i=1}^{m-1} O(in) \cdot II = O(mn^2 \cdot II + m^2 n^2 \cdot II) = O(N^2 \cdot II). \tag{48}$$
Thus we expect the length of prologue and epilogue to be
π (π2)βοΈπ=1
π Β· πΌ πΌ = π (π3). (49)
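I CNOT conjugation rules
These results are taken directly from [24].
Theorem 12. ($CNOT$ conjugation) $CNOT$ conjugates single qubit gates if and only if the conjugation satisfies one of the following eight cases. Each case lists the gates on wires $|a\rangle$ and $|b\rangle$ from left to right, where • is a control and ⊕ is a target:
1. $|a\rangle$: • $R_Z(\alpha)$ • ; $|b\rangle$: ⊕ ⊕  =  $|a\rangle$: $R_Z(\alpha)$ ; $|b\rangle$: (no gate)  (50)
2. $|a\rangle$: • $R_Z^+(\alpha)$ • ; $|b\rangle$: ⊕ ⊕  =  $|a\rangle$: $R_Z^+(\alpha)$ ; $|b\rangle$: $X$  (51)
3. $|a\rangle$: • • ; $|b\rangle$: ⊕ $R_X(\alpha)$ ⊕  =  $|a\rangle$: (no gate) ; $|b\rangle$: $R_X(\alpha)$  (52)
4. $|a\rangle$: • • ; $|b\rangle$: ⊕ $R_X^-(\alpha)$ ⊕  =  $|a\rangle$: $Z$ ; $|b\rangle$: $R_X^-(\alpha)$  (53)
5. $|a\rangle$: ⊕ $H(\alpha)$ • ; $|b\rangle$: • $H(\beta)^\dagger$ ⊕  =  $|a\rangle$: $H(\alpha)$ ; $|b\rangle$: $H(\beta)^\dagger$  (54)
6. $|a\rangle$: ⊕ $H^-(\alpha)$ • ; $|b\rangle$: • $H(\beta)^\dagger$ ⊕  =  $|a\rangle$: $H^-(\alpha)$ ; $|b\rangle$: $H(\beta + \pi)^\dagger$  (55)
7. $|a\rangle$: ⊕ $H(\alpha)$ • ; $|b\rangle$: • $H^-(\beta)^\dagger$ ⊕  =  $|a\rangle$: $H(\alpha + \pi)$ ; $|b\rangle$: $H^-(\beta)^\dagger$  (56)
8. $|a\rangle$: ⊕ $H^-(\alpha)$ • ; $|b\rangle$: • $H^-(\beta)^\dagger$ ⊕  =  $|a\rangle$: $H^-(\alpha + \pi)$ ; $|b\rangle$: $H^-(\beta + \pi)^\dagger$  (57)
It is easy to check that the $CNOT$ conjugation rules and the $CZ$ conjugation rules are equivalent to each other, by converting $CNOT$ to $CZ$ and vice versa.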
J Parallel QAOA Decomposition
QAOA is one of the most popular algorithms in the NISQ era. We use the QAOA program for solving MaxCut problems as our optimization test cases.
However, we face the problem of lacking commutativity when optimizing QAOA programs: our device can't execute the $U(B, \beta_i)$ operation directly, so it has to be decomposed into basic gates according to Equation 8, and the block-level optimization chances offered by commutativity between $U(B, \beta_i)$ matrices are missed.
There have been different ways to optimize QAOA circuits with the mutual commutativity of the $U(B, \beta_i)$ in mind. For example, [18] detects all two-qubit diagonal structures in the circuit and aggregates them, so that commutativity detection can be performed on aggregated blocks. Another layout synthesis algorithm (scheduling that considers device layout), QAOA-OLSQ [21], schedules QAOA circuits twice, the first time at a large granularity (named TB-OLSQ) and the second time at a small granularity (named OLSQ). The large-granularity pass allows block commutativity to be considered, and gates are placed in blocks. The small-granularity pass finishes the scheduling.
However, these two approaches both require the optimization algorithm to perform coarse-grain block-level scheduling in addition to fine-grain gate-level scheduling. We may want to find another way to give commutativity hints to a gate-scheduling algorithm without modifying the algorithm itself.
Equation 8 inspires us with the fact that the decomposed form of $U(B, \beta_i)$ is shaped a bit like a $CNOT$ gate: it has a "controller" qubit and a "controlled" qubit; multiple blocks with the same "controller" qubit can be commuted and interleaved freely at gate level, and can be finished in 2 ticks on average instead of 3, as in Figure 19.
Figure 19. The two blocks, which share the controller qubit $|b\rangle$ and act on the controlled qubits $|a\rangle$ and $|c\rangle$, can be executed in an interleaved manner.
The layering of "blocks" suggested by the observation above can be derived by directing and colouring all edges in the undirected graph $G = \langle V, E \rangle$:
• First, we assign every edge the direction in which we would perform the decomposition of Equation 8 (i.e. we assign the graph an orientation). Suppose the direction points from the controller qubit to the controlled qubit.
• Then, we colour all edges with the minimal number of colours under the following constraints:
1. All in-degree edges of a vertex should be coloured differently from each other.
2. Out-degree edges of a vertex should be coloured differently from all in-degree edges of the vertex.
Figure 20. Example of one possible orientation and layering of a graph. (a) Graph for QAOA. (b) One orientation for the graph. (c) One colouring satisfying the constraints. (d) The equivalent vertex-colouring problem.
The minimal number of required colours over all possible orientations is the minimal number of layers we can put these gates into.
Note that finding the minimal edge colouring under the constraints can be reduced to the problem of finding a minimal vertex colouring of a new graph. In the new graph, vertices represent original edges; vertices for in-degree edges of the same original vertex are fully connected (constraint 1); vertices for in-degree edges are connected with those for out-degree edges of the same original vertex (constraint 2). Figure 20 is an example of assigning directions and colours to edges in the graph, and of the vertex-colouring problem equivalent to the edge-colouring one.
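The two colouring constraints can be sketched as a conflict test between directed edges followed by a colouring pass. This is an illustration only: it uses a greedy colouring for one fixed orientation, which gives a valid but not necessarily minimal number of layers and does not search over orientations. The function names `conflicts`, `greedy_layers` and `is_valid` are hypothetical.

```python
from itertools import combinations


def conflicts(e1, e2) -> bool:
    """Return True if two directed edges (controller, controlled) need different colours."""
    (u1, v1), (u2, v2) = e1, e2
    same_target = v1 == v2                # constraint 1: two in-edges of one vertex
    in_meets_out = v1 == u2 or v2 == u1   # constraint 2: in-edge and out-edge of one vertex
    return same_target or in_meets_out


def greedy_layers(edges):
    """Assign each edge the smallest colour not used by a conflicting, already coloured edge."""
    colour = {}
    for e in edges:
        used = {colour[f] for f in colour if conflicts(e, f)}
        colour[e] = next(c for c in range(len(edges)) if c not in used)
    return colour


def is_valid(colour) -> bool:
    """Check that no two conflicting edges share a colour."""
    return all(colour[e] != colour[f]
               for e, f in combinations(colour, 2) if conflicts(e, f))


# Two blocks sharing the controller b (as in Figure 19) fit into one layer,
# while a directed path a -> b -> c needs two layers.
shared_controller = greedy_layers([("b", "a"), ("b", "c")])
path = greedy_layers([("a", "b"), ("b", "c")])
```

An exact minimum would replace `greedy_layers` by a search over colourings and orientations, for instance with a constraint solver.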
One direct way to compute the block placement strategyis to use an SMT solver, for example, ππ΄ππ΄ β πππ test casesin our evaluation are generated using Z3 Solver[7]. We leaveit as an open problem whether there is an efficient approach.
K Complexity Analysis
In this section we give a rough estimation of the complexity of the scheduling algorithm above. We put the main complexity results in Table 3, with some notes below to explain.
K.1 Complexity of loop compaction
The complexity of compacting a loop program of size $O(n)$ once is $O(n^2)$, since when adding each instruction we check it against all previously added instructions.
K.2 Complexity of loop unrolling
Finding merging or cancelling candidates requires $O(n^2)$ time. If the loop range is unknown, we have to perform the following steps on $C$ loops of size $N = O(Cn)$.
Step          Time                       Code Size
Compaction    $O(n^2)$                   $O(n)$
Unrolling     $O(n^2 + C^2 n)$           $C$ loops sized $O(Cn)$
For each loop sized $O(N)$:
Rotation      $O(N^3)$                   $O(N^3)$
Try $II$      $O(\log N)$                -
Tarjan        $O(N^2)$                   -
Floyd         $O(N^3)$                   -
Scheduling    $O(N^5)$                   Span = $O(N^3)$
Add $Z$       $O(N^4)$                   -
Codegen       $O(N^6)$                   $O(N^3)$
Total         $O(N^6 \log N)$            $O(N^3)$
In total:
Overall       $O(C^6 n^6 \log(Cn))$      $O(C^4 n^3)$
Table 3. Complexity of our software pipelining approach.
K.3 Complexity of loop rotation
A loop of size $O(N)$ can be rotated at most $O(N^2)$ times, since loop rotation will not introduce new qubits into the loop, and the $O(N)$ qubits can be placed in a partial order: $q_i \prec q_j$ if a single qubit gate on $q_i$ will be on $q_j$ after rotation.
This will create a prologue of size $O(N^2)$, an epilogue of size $O(N^3)$ and a new loop of size $O(N)$. Each rotation requires $O(N^2)$ time (to find a rotatable gate), so the total complexity is $O(N^4)$.
K.4 Complexity of modulo scheduling
We need $O(\log N)$ retries to binary-search the minimal $II$. The complexity of the Tarjan algorithm on a dense graph is $O(N^2)$, and the complexity of the Floyd algorithm is $O(N^3)$.
We leave the analysis of the complexity of retrying due to resource conflicts to Appendix H.
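The binary search for the minimal $II$ can be sketched as follows. This is an illustration under assumptions: the predicate `feasible` stands in for a full scheduling attempt and is assumed to be monotone in $II$ (once it succeeds, it succeeds for all larger values), and the example predicate only checks dependency cycles given as hypothetical pairs of total minimum delay and total iteration distance.

```python
def minimal_ii(lo: int, hi: int, feasible):
    """Smallest II in [lo, hi] with feasible(II) true, or None if there is none.

    Uses O(log(hi - lo)) calls to feasible, assuming feasibility is monotone in II.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid        # mid works, so the answer is mid or smaller
        else:
            lo = mid + 1    # mid fails, so the answer is larger
    return lo if feasible(lo) else None


def cycles_ok(cycles):
    """Build a predicate: every cycle (total_min, total_dif) needs total_min - II * total_dif <= 0."""
    return lambda ii: all(total_min - ii * total_dif <= 0
                          for total_min, total_dif in cycles)


# Hypothetical dependency cycles: the first one forces II >= 3, the second II >= 2.5.
example_ii = minimal_ii(1, 16, cycles_ok([(3, 1), (5, 2)]))
```

In a full scheduler the predicate would also include resource checking, for which monotonicity is not guaranteed; the sketch only covers the search itself.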
K.5 Inversion pair detection
The complexity for detecting in-loop inversion pairs is $O(N^2)$. The complexity for detecting across-loop inversion depends on the span of the total schedule. Note that according to Definition 8:
π β€ (π2 β π1) +π1 β π1
πΏ, (58)
where π1, π2 = π (π2). Thus
π = π (π2). (59)
The total complexity of checking π (π2) pairs of instruc-tions across π iterations is π (π4).
K.6 Code generation
The complexity for code generation is just the length of prologue and epilogue, $O(N^3)$. The compaction is of quadratic complexity, so the total complexity is $O(N^6)$. However, for cases where the loop range is known, using a hash set to store the last operation on each qubit can reduce the complexity to $O(N^3)$.
Theorem 13. The total time complexity of our algorithm is
$$O(C^6 n^6 \log(Cn)), \tag{60}$$
and the size of the generated code is
$$O(C^4 n^3). \tag{61}$$
L Adapting to existing architectures
Note that we build our optimization approach on a specific quantum circuit model, as specified in Section 2.2. Recall some of the features of the model that we use:
• Classical computation and loop guards can be carried out instantly.
• The hardware can execute arbitrary single qubit operations and $CZ$ gates between arbitrary qubit pairs. All instructions finish in one cycle.
• Instructions on totally different qubits can be carried out at the same time.
L.1 Powerful classical control
A quantum processor is usually split into a classical part and a quantum part, and all the classical logic (e.g. branch statements) runs on the classical part.
To implement fast classical guards for for-loops, we can use several classical architecture mechanisms, such as superscalar execution, classical branch prediction and speculative execution. As long as the classical part commits instructions faster than the quantum part executes them, we can keep the quantum part fully loaded without introducing unnecessary bubbles.
If we want classical operations that affect the control flow of the quantum part (e.g. classical branch statements), one way is to convert them to their quantum versions. One practical example is measurement with feedback: if we want to use the measurement outcome to control the following operations, we can use a qubit array to replace classical memory, a $CNOT$ gate to replace measurement, and controlled gates to replace classical control. The classical trick of register renaming can be adopted when converting measurements to quantum gates: different iterations can "measure to" different qubits to prevent unnecessary name dependency.
Also, on real quantum processors full parallelism is unlikely to be achieved; for example, there may be a limit on the instruction issue width of the device. In this case, we can simply limit the maximal issue width in resource conflict checking.
L.2 CNOT-based instruction set
One major difference between our assumptions and real-world architectures is that most existing models and architectures adopt a $CNOT$-based instruction set instead of a $CZ$-based one. We provide two possible approaches for extending our method to the $CNOT$-architecture case.
One approach is to convert the original circuit to the $CZ$ version directly, using the equation $H[b]\,CZ[a, b]\,H[b] = CNOT[a, b]$. After optimization, an additional step is required to convert each $CZ$ gate into a $CNOT$ gate by adding Hadamard gates. Note that the way of adding Hadamard gates can affect the depth of the kernel.
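The conversion identity between $CZ$ and $CNOT$ can be checked with plain matrix arithmetic. The sketch below is an illustration with hypothetical helper names (`matmul`, `kron`, `close`); qubit $a$ is taken as the first tensor factor and qubit $b$ as the second.

```python
import math


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product of two 2x2 matrices (first factor acts on qubit a)."""
    return [[a[i][j] * b[k][l] for j in range(2) for l in range(2)]
            for i in range(2) for k in range(2)]


def close(a, b, tol=1e-12):
    """Entry-wise comparison of two matrices."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
I2 = [[1, 0], [0, 1]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]
CNOT_AB = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control a, target b
CNOT_BA = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control b, target a

# Hadamard gates on qubit b turn CZ into a CNOT targeting b.
on_b = matmul(matmul(kron(I2, H), CZ), kron(I2, H))
# Hadamard gates on qubit a give the CNOT with the roles of the qubits swapped.
on_a = matmul(matmul(kron(H, I2), CZ), kron(H, I2))
```

Since $CZ$ is symmetric in its two qubits, the choice of the qubit that receives the Hadamard gates is exactly the choice of the $CNOT$ direction.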
Example 13. Adding Hadamard gates on the same qubit of two adjacent $CZ$ gates saves gate depth by 1, compared to the version adding Hadamard gates on different qubits of the two $CZ$ gates.
[Circuit identity on qubits $|a\rangle$, $|b\rangle$, $|c\rangle$: two adjacent $CZ$ gates and their two conversions into $CNOT$ gates with Hadamard gates added.] (62)
However, deciding the directions of all $CNOT$ gates can be a hard problem. We can formulate it as an ILP problem. A rough description is as follows:
• Each $CZ$ is given a boolean variable, indicating the direction of the $CNOT$ (and where to add Hadamard gates).
• If a $CZ$ is adjacent to a single qubit gate, the $H$ can be absorbed.
• If a $CZ$ is adjacent to another $CZ$ and they add Hadamard gates on the same qubit, the two Hadamard gates cancel and no depth is added.
• Otherwise the depth increases by 1 due to the Hadamard gate. If there is aliasing, the depth needs to increase by more than 1 so that $H$ gates on aliased qubits are placed at two different ticks.
• The objective is to minimize the depth on all qubits.
We leave the best conversion from a $CZ$ program into a $CNOT$ program with minimal depth as an open problem.
Another way to port our approach is to adapt our QDG definition to the $CNOT$-based instruction set. In fact, the most commonly used $CNOT$ commutation rules, which are based on intuition, are only part of the complete $CNOT$ conjugation rules:
Lemma 6. ($CNOT$ conjugation rules) [24] There are 8 rules in total for $CNOT$ conjugation, similar to the $CZ$ rules. See Appendix I.
If we want to exploit the full power of these rules, we have to consider all of them while building the QDG, instead of considering only the intuitive rules (usually the first 4 rules).
(a) Before. (b) After. The numbers correspond to the intercept $b$ in expression $q[6i + b]$.
Figure 21. Loop kernel for cluster state preparation (π = 3).Shaded dots are qubits for Hadamard operands and closeddots are πΆπ operands.
But this time, the rewriting trick in Theorem 2 no longer works for the $CNOT$ rules. How to use these rules directly for QDG construction remains an open problem.
L.3 Working with device topology
One problem with a controlled-Z architecture is that it can be hard to perform long-distance $CZ$ operations. For the $CNOT$ case, a long-distance $CNOT$ gate of length $n$ can be implemented using $(4n - 4)$ gates according to [17]. However, this is not true for $CZ$ gates, as "amplitude" can't propagate through $CZ$ gates.
A direct conversion approach is to convert $CZ$ to $CNOT$ and back. Since every $CNOT$ is on the critical path and no adjacent controlled bits can be found on the critical path, this requires $(8n - 8 + 1) = (8n - 7)$ gates on the critical path. The exception is $n = 2$, since the last Hadamard on the critical path should be removed and the total depth is 8.
M Optimization of Cluster State Preparation, etc.
This appendix introduces the Cluster and Array test cases used in our evaluation.
Cluster is an example of a cluster-state preparation program, which is a for-all loop: increasing the iteration count does not add to the overall depth of the program, which on the 2-dimensional grid is a constant 5 (4 for $CZ$s in four directions and 1 for Hadamard). Despite that, we can still perform loop optimization on this program to get a loop with a kernel of size 1.
For $C = 2$, the loop kernels before and after rotation followed by software pipelining are given in Figure 21. Our approach splits $CZ$ gates that conflict with each other into different iterations so that they can be executed together, and the kernel size is reduced to 1, the best result for any loop-optimization approach except full unrolling.
The Array series are several artificially crafted loop programs on qubit arrays. Array 1 performs three $CZ$ gates as in Figure 12, while two Hadamard gates are added between $CZ$s to prevent cancellation. Array 2 performs non-cancelling $CZ$ gates so that they can be parallelized maximally. Array 3 constructs a huge Toffoli gate using Toffoli gates and ancillas: in each iteration, a Toffoli is performed on a source qubit, an ancilla and the next ancilla.
The instruction operands of these examples contain the iteration variable and are thus simpler to optimize compared with those on a fixed set of qubits.