
Grammar-Aware Question-Answering on Quantum Computers

Konstantinos Meichanetzidis, Alexis Toumi, Giovanni de Felice, Bob Coecke
Cambridge Quantum Computing Ltd. and Department of Computer Science, University of Oxford

arXiv:2012.03756v1 [quant-ph], 7 December 2020

Natural language processing (NLP) is at the forefront of great advances in contemporary AI, and it is arguably one of the most challenging areas of the field. At the same time, with the steady growth of quantum hardware and notable improvements towards implementations of quantum algorithms, we are approaching an era when quantum computers perform tasks that cannot be done on classical computers with a reasonable amount of resources. This provides a new range of opportunities for AI, and for NLP specifically. Earlier work has already demonstrated potential quantum advantage for NLP in a number of manners: (i) algorithmic speedups for search-related or classification tasks, which are the most dominant tasks within NLP, (ii) exponentially large quantum state spaces allow for accommodating complex linguistic structures, (iii) novel models of meaning employing density matrices naturally model linguistic phenomena such as hyponymy and linguistic ambiguity, among others.

In this work, we perform the first implementation of an NLP task on noisy intermediate-scale quantum (NISQ) hardware. Sentences are instantiated as parameterised quantum circuits. We encode word-meanings in quantum states and we explicitly account for grammatical structure, which even in mainstream NLP is not commonplace, by faithfully hard-wiring it as entangling operations. This makes our approach to quantum natural language processing (QNLP) particularly NISQ-friendly. Our novel QNLP model shows concrete promise for scalability as the quality of the quantum hardware improves in the near future.

NLP is a rapidly evolving area of AI of both theoretical importance and practical interest [1, 2]. State-of-the-art language models, such as GPT-3 with 175 billion parameters [3], show impressive results on general NLP tasks, and one dares to claim that humanity is entering Turing-test territory [4]. NLP technology becomes increasingly entangled with everyday life as part of search engines, personal assistants, information extraction and data-mining algorithms, medical diagnoses, and even bioinformatics [5, 6]. Despite success in both language understanding and language generation, under the hood of mainstream NLP models one exclusively finds deep neural networks, which famously suffer the criticism of being uninterpretable black boxes [7].

One way to bring transparency to said black boxes is to incorporate linguistic structure [8–10] into distributional language models. A prominent approach attempting this merge is the Distributional Compositional Categorical model of natural language meaning (DisCoCat) [11–13], which pioneered the paradigm of combining grammatical (or syntactic) structure and statistical methods for encoding and computing meaning (or semantics). This approach also provides the tools for modelling linguistic phenomena such as lexical entailment and ambiguity, as well as the transparent construction of syntactic structures like relative and possessive pronouns [14, 15], conjunction, disjunction, and negation [16]. From a modern lens, DisCoCat is a tensor network language model. Recently, the motivation for designing interpretable AI systems has caused a surge in the use of tensor networks in language modelling [17–21].

Quantum computing (QC) is a field which, in parallel with NLP, is growing at an extremely fast pace. The importance of QC is now well-established, especially after the demonstration of quantum supremacy [22], and reaches the whole range of human interests, from the foundations of physics and computer science to applications in engineering, finance, chemistry, and optimisation problems.

In the last half-decade, the natural conceptual fusion of QC with AI, and especially machine learning (ML), has led to a plethora of novel and exciting advancements. The quantum machine learning (QML) literature has reached an immense size considering its young age, with the cross-fertilisation of ideas and methods between fields of research, as well as between academia and industry, being a dominant driving force. The landscape includes using quantum computers for subroutines in ML algorithms for executing linear algebra operations [23], or quantising classical machine learning algorithms based on neural networks [24], support vector machines, clustering [25], or artificial agents who learn from interacting with their environment [26], and even quantum-inspired and dequantised classical algorithms which nevertheless retain a complexity-theoretic advantage [27]. Small-scale classification experiments have also been implemented with quantum technology [28, 29].

From this collection of ingredients there organically emerges the interdisciplinary field of Quantum Natural Language Processing (QNLP), a research area still in its infancy [30–34]. QNLP combines NLP and QC and seeks algorithms and novel quantum language models. Building on the recently established methodology of QML, one aims to import QC algorithms to obtain theoretical speedups for specific NLP tasks, or to use the quantum Hilbert space as a feature space in which NLP tasks are to be executed.

The first paper on QNLP, by Zeng and Coecke [30], introduced an approach to QNLP where NLP tasks modelled in the DisCoCat framework are instantiated as quantum computations, and, remarkably, a quadratic speedup was obtained for the task of sentence similarity. The mapping's simplicity is attributed to the mathematical similarity of the structures underlying DisCoCat and quantum theory, the latter as formulated by categorical quantum mechanics [35]. This similarity becomes apparent when both are expressed in the graphical language of string diagrams of monoidal categories or process theories [36], which are equivalent to tensor networks. This is what makes DisCoCat 'quantum-native'.

In this work, we bring quantum DisCoCat to the current age of noisy intermediate-scale quantum (NISQ) devices [37] by adopting the paradigm of parameterised quantum circuits (PQCs) [38, 39]. We argue that DisCoCat can justifiably be termed 'NISQ-friendly', as it allows for the execution of proof-of-concept experiments involving a non-trivial corpus, which moreover involves complex grammatical structures. We present results on the first-ever QNLP experiment on NISQ devices.

The model: DisCoCat is based on the algebraic model of pregroup grammars (Appendix A) developed by Lambek [40]. A sentence is a finite product of words, σ = ∏_i w_i. A parser tags a word w ∈ σ with its part of speech.

FIG. 1. Diagram for "Romeo who loves Juliet dies". The grammatical reduction is generated by the nested pattern of non-crossing cups, which connect words through wires of types n or s. Grammaticality is verified by only one s-wire being left open. The diagram represents the meaning of the whole sentence from a process-theoretic point of view. The relative pronoun 'who' is modelled by the Kronecker tensor. Interpreting the diagram in CQM, it represents a quantum state.

Accordingly, w is assigned a pregroup type t_w = ∏_i b_i^{κ_i}, comprising a product of basic (or atomic) types b_i from the finite set B. Each type carries an adjoint order κ_i ∈ Z. Pregroup parsing is efficient; specifically, it is linear time under reasonable assumptions [41]. The type of a sentence is the product of the types of its words, and it is deemed grammatical iff it type-reduces to the special type s^0 ∈ B, i.e. the sentence-type, t_σ = ∏_w t_w → s^0. Reductions are performed by iteratively applying pairwise annihilations of basic types with adjoint orders of the form b^i b^{i+1}. As an example, consider the reduction:
t_{Romeo who loves Juliet dies} = t_Romeo t_who t_loves t_Juliet t_dies = (n^0)(n^1 n^0 s^{-1} n^0)(n^1 s^0 n^{-1})(n^0)(n^1 s^0) = n^0 n^1 n^0 s^{-1} n^0 n^1 s^0 n^{-1} n^0 n^1 s^0 → n^0 s^{-1} s^0 n^1 s^0 → n^0 n^1 s^0 → s^0.
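For illustration, the reduction above can be reproduced by a short Python sketch (our own illustration, not code from this work): types are (basic type, adjoint order) pairs and contractions b^k b^{k+1} → ε are cancelled greedily.

# Minimal sketch: contraction-only pregroup reduction for the example above.
def reduce_types(types):
    """Iteratively cancel adjacent pairs (b, k)(b, k+1); return what remains."""
    types = list(types)
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            (b1, k1), (b2, k2) = types[i], types[i + 1]
            if b1 == b2 and k2 == k1 + 1:      # contraction b^k b^{k+1} -> unit
                del types[i:i + 2]
                changed = True
                break
    return types

# Dictionary for "Romeo who loves Juliet dies", with the types quoted in the text.
dictionary = {
    "Romeo":  [("n", 0)],
    "Juliet": [("n", 0)],
    "who":    [("n", 1), ("n", 0), ("s", -1), ("n", 0)],
    "loves":  [("n", 1), ("s", 0), ("n", -1)],
    "dies":   [("n", 1), ("s", 0)],
}

sentence = "Romeo who loves Juliet dies".split()
sentence_type = [t for w in sentence for t in dictionary[w]]
print(reduce_types(sentence_type))   # [('s', 0)] -> grammatical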

Crucial for our work is acknowledging that at the core of DisCoCat is a process-theoretic model of natural language meaning. Process theories are alternatively known as symmetric monoidal (or tensor) categories [42]. Process networks such as those that manifest in DisCoCat can be represented graphically as string diagrams [43]. String diagrams are not just convenient graphical notation; they constitute a formal graphical type-theoretic language for reasoning about complex process networks (Appendix B), and are key to our QNLP methods. String diagrams are generated by boxes with input and output wires, with each wire carrying a type. Boxes can be composed to form process networks by wiring outputs to inputs and making sure the types are respected. Output-only processes are called states and input-only processes are called effects.

FIG. 2. Example instance of the mapping from sentence diagrams to PQCs where q_n = 1 and q_s = 0. (a) The dashed square is the empty diagram. In this example, (b) unary word-states are prepared by parameterised R_x rotations followed by R_z rotations, and (c) k-ary word-states are prepared by parameterised word-circuits of width k and depth d = 2. (d) The cup is mapped to a Bell effect, i.e. a CNOT followed by a Hadamard on the control and postselection on ⟨00|. (e) The Kronecker tensor modelling the relative pronoun is mapped to a GHZ state.

A grammatical reduction is viewed as a process, and so it can be represented as a string diagram. Words are represented as states, and pairwise type-reductions are represented by a pattern of nested cup effects (wires bent in a U-shape) and identities (straight wires). Wires in the string diagram carry the label of the basic type being reduced. In Fig. 1 we show the string diagram representing the pregroup reduction for "Romeo who loves Juliet dies". Only the s-wire is left open, which is the witness of grammaticality.

As described in Ref. [37], the diagram of a sentence σ can be canonically mapped to a PQC C_σ(θ_σ) over the parameter set θ_σ. The key idea here is that such circuits inherit their architecture, in terms of a particular connectivity of entangling gates, from the grammatical reduction of the sentence.

Quantum circuits, also being part of pure quantum theory, enjoy a graphical language in terms of string diagrams. The mapping from sentence diagram to quantum circuit begins simply by reinterpreting a sentence diagram, such as that of Fig. 1, as a diagram in categorical quantum mechanics (CQM). The word-state of a word w in a sentence diagram is mapped to a pure quantum state prepared from a trivial reference product-state by a PQC, |w(θ_w)⟩ = C_w(θ_w)|0⟩^{⊗q_w}.

FIG. 3. The PQC to which "Romeo who loves Juliet dies" of Fig. 1 is mapped, with the choices of hyperparameters of Fig. 2. As q_s = 0, the circuit represents a scalar.

The width of the circuit depends on the number of qubits assigned to each pregroup type b ∈ B from which the word-types are composed, q_w = Σ_{b∈t_w} q_b, and cups are mapped to Bell effects.

Given a sentence σ, we instantiate its quantum circuit by first concatenating in parallel the word-circuits of each word as they appear in the sentence, corresponding to performing a tensor product, C_σ(θ_σ) = ⊗_w C_w(θ_w), which prepares the state |σ(θ_σ)⟩ from the all-zeros basis state. As such, a sentence is parameterised by the concatenation of the parameters of its words, θ_σ = ∪_{w∈σ} θ_w. The parameters θ_w determine the word-embedding |w(θ_w)⟩. In other words, we use the Hilbert space as a feature space [28, 44, 45] in which the word-embeddings are defined. Finally, we apply Bell effects as dictated by the cup pattern in the grammatical reduction, a function whose result we shall denote g_σ(|σ(θ_σ)⟩). Note that in general this procedure prepares an unnormalised quantum state. In the special case where no qubits are assigned to the sentence type, i.e. q_s = 0, it is an amplitude, which we write as ⟨g_σ|σ(θ_σ)⟩. Formally, this mapping constitutes a parameterised functor from the pregroup grammar category to the category of quantum circuits. The parameterisation is defined via a function from the set of parameters to functors between the aforementioned source and target categories.
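As a concrete illustration of this contraction (our own sketch, not the authors' code), the snippet below computes ⟨g_σ|σ(θ_σ)⟩ for the two-word sentence "Romeo dies" with q_n = 1 and q_s = 0; the word-state function and the parameter values are placeholders.

import numpy as np

# Each word contributes one qubit; the single cup (on the n-wires) becomes the
# Bell effect <00|(H x I)CNOT = (<00| + <11|)/sqrt(2).
def word_state(theta):                                # placeholder one-qubit word-circuit
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

bell_effect = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

romeo, dies = word_state(0.3), word_state(1.1)        # hypothetical parameters
sentence_state = np.kron(romeo, dies)                 # parallel composition of word-circuits
amplitude = bell_effect @ sentence_state              # apply the cup as a Bell effect
print(abs(amplitude) ** 2)                            # predicted label, Eq. (1)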

Our model has hyperparameters (Appendix E). The wires of the DisCoCat diagrams we consider carry types n or s. The numbers of qubits that we assign to each pregroup type are q_n and q_s. These determine the arity of each word, i.e. the width of the quantum circuit that prepares each word-state. We set q_s = 0 throughout this work, which establishes that the sentence-circuits represent scalars. For a unary word w, i.e. a word-state on 1 qubit, we choose as its word-circuit the Euler parametrisation R_z(θ^1_w) R_x(θ^2_w) R_z(θ^3_w). For a word w of arity k ≥ 2, we use a depth-d IQP-style parameterisation [28] consisting of d layers, where each layer consists of a layer of Hadamard gates followed by a layer of controlled-Z rotations CRz(θ^i_w), such that i ∈ {1, 2, . . . , d(k − 1)}. Such circuits are in part motivated by the conjecture that circuits involving them are classically hard to evaluate [28]. The relative pronoun 'who' is mapped to the GHZ circuit, i.e. the circuit that prepares a GHZ state on the number of qubits determined by q_n and q_s. This is justified by prior work where relative pronouns and other functional words are modelled by a Kronecker tensor [14, 15].
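For concreteness, the following numpy sketch (our illustration; parameter values are arbitrary) builds the two word-circuit families just described: the Euler-parameterised single-qubit state and the depth-d IQP-style state on k qubits.

import numpy as np

def rz(t):
    return np.diag([np.exp(-1j * t / 2), np.exp(1j * t / 2)])

def rx(t):
    c, s = np.cos(t / 2), -1j * np.sin(t / 2)
    return np.array([[c, s], [s, c]])

def euler_word_state(t1, t2, t3):
    """Unary word-state: Euler-decomposed one-qubit unitary applied to |0>."""
    return rz(t1) @ rx(t2) @ rz(t3) @ np.array([1.0, 0.0])

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def crz(theta, k, i):
    """Controlled-Rz(theta) with control i and target i+1, embedded in k qubits."""
    u = np.eye(2 ** k, dtype=complex)
    for x in range(2 ** k):
        bits = [(x >> (k - 1 - q)) & 1 for q in range(k)]
        if bits[i] == 1:
            u[x, x] = np.exp(1j * (theta / 2 if bits[i + 1] else -theta / 2))
    return u

def iqp_word_state(thetas, k, d):
    """k-ary word-state: d layers of Hadamards followed by neighbouring CRz gates."""
    assert len(thetas) == d * (k - 1)
    state = np.zeros(2 ** k, dtype=complex)
    state[0] = 1.0
    hadamards = H
    for _ in range(k - 1):
        hadamards = np.kron(hadamards, H)
    it = iter(thetas)
    for _ in range(d):
        state = hadamards @ state
        for i in range(k - 1):
            state = crz(next(it), k, i) @ state
    return state

print(euler_word_state(0.1, 0.2, 0.3))
print(iqp_word_state(np.random.rand(4), k=3, d=2))    # hypothetical parameters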

In Fig. 2 we show an example of choices of word-circuits for specific numbers of qubits assigned to each basic pregroup type (Appendix E). In Fig. 3 we show the corresponding circuit for "Romeo who loves Juliet dies".

Here, a motivating remark is in order. In classical implementations of DisCoCat, sentence diagrams represent tensor contractions. Meanings of words are encoded in terms of co-occurrence frequencies or other vector-space word-embeddings such as those produced by neural networks [46]. Tensor contractions are exponentially expensive in the dimension of the vector spaces carried by the wires, which for NLP applications can become prohibitively large. In the quantum case, however, the tensor product structure defined by a collection of qubits provides an exponentially large Hilbert space, leading to an exponential space gain. Consequently, we adopt the paradigm of QML in terms of PQCs to carry out near-term QNLP tasks.

Question Answering: Now that we have established our construction of sentence circuits, we describe a simple QNLP task. The dataset, or 'labelled corpus', K = {(D_σ, l_σ)}_σ, is a finite set of sentence-diagrams {D_σ}_σ constructed from a finite vocabulary of words V. Each sentence has a binary label l_σ ∈ {0, 1} interpreted as its truth value. We split K into the training set ∆, containing the first ⌊p |{D_σ}_σ|⌉ of the sentences, where p ∈ (0, 1), and the test set E, containing the rest.

The sentences are generated randomly using a context-free grammar (CFG). Each sentence is represented as a syntax tree, which can also be cast in string-diagram form. Each CFG-generated sentence diagram can then be canonically transformed into a DisCoCat diagram (Appendix C).

FIG. 4. Convergence of the mean cost function ⟨L(θ)⟩ vs number of SPSA iterations for corpus K30. A lower minimum is reached for larger d. (Top) q_n = 1 and |θ| = 8 + 2d. Results are averaged over 20 realisations. (Bottom) q_n = 2 and |θ| = 10d. Results are averaged over 5 realisations. (Insets) Mean training and test errors ⟨e_tr⟩, ⟨e_te⟩ vs d. Using the global optimisation basinhopping with local optimisation Nelder-Mead (red), the errors decrease with d.

Even though the data is synthetic, we curate the data by assigning labels by hand so that the truth values among the sentences are consistent, rendering the QA task non-trivial.

We define the predicted label as

l^pr_σ(θ_σ) = |⟨g_σ|σ(θ_σ)⟩|² ∈ [0, 1],    (1)

from which we obtain the binary label by rounding to the nearest integer, ⌊l^pr_σ⌉ ∈ {0, 1}. Now the parameters of the words need to be optimised (or trained) so that the predicted labels match the labels in the training set. The optimiser we invoke is SPSA [47], which has shown adequate performance in noisy settings (Appendix F). The cost function we define is

L(θ) = Σ_{σ∈∆} (l^pr_σ(θ_σ) − l_σ)².    (2)

Minimising the cost function returns the optimal parameters θ* = argmin_θ L(θ), from which the model predicts the labels l^pr_σ(θ*). Essentially, this constitutes learning a functor from the grammar category to the category of quantum circuits. We then quantify the performance by the training and test errors e_∆ and e_E, defined as the proportion of labels predicted incorrectly:

e_A = (1/|A|) Σ_{σ∈A} |⌊l^pr_σ(θ*)⌉ − l_σ|,    A = ∆, E.

This supervised learning task of binary classification for sentences is a special case of question answering (QA) [48–50]. Questions are posed as statements and the truth labels are the answers. After training on ∆, the model predicts the answer to a previously unseen question from E, which comprises sentences containing words all of which have appeared in ∆. The optimisation is performed over the parameters of all the sentences in the training set, θ = ∪_{σ∈∆} θ_σ. In our experiments, each word appears at least once in the training set, and so θ = ∪_{w∈V} θ_w. Note that what is being learned are the inputs, i.e. the quantum word-embeddings, to an entangling process corresponding to the grammar. Recall that a given sentence-circuit does not necessarily involve the parameters of every word. However, every word appears in at least one sentence, which introduces correlations between the sentences and makes such a learning task possible.
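To make the optimisation objective explicit, here is a minimal Python sketch (our illustration; `amplitude` stands for any routine that evaluates ⟨g_σ|σ(θ_σ)⟩, e.g. by classical simulation or on hardware, and is not part of the paper's code).

def predicted_label(amplitude_value):
    """Eq. (1): squared modulus of the sentence amplitude."""
    return abs(amplitude_value) ** 2

def cost(theta, train_set, amplitude):
    """Eq. (2): sum of squared differences between predicted and true labels."""
    return sum((predicted_label(amplitude(sentence, theta)) - label) ** 2
               for sentence, label in train_set)

def error(theta, data, amplitude):
    """Proportion of sentences whose rounded predicted label is wrong."""
    wrong = sum(round(predicted_label(amplitude(s, theta))) != l for s, l in data)
    return wrong / len(data)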

Classical Simulation: We first show results from classical simulations of the QA task. The sentence circuits are evaluated exactly on a classical computer to compute the predicted labels in Eq. (1). We consider the corpus K30 of 30 sentences sampled from a vocabulary of 7 words (Appendix D) and we set p = 0.5. In Fig. 4 we show the convergence of the cost function, for q_n = 1 and q_n = 2, for increasing word-circuit depth d. The insets clearly show the decrease in training and test errors as a function of d when invoking the global optimiser basinhopping (Appendix F).

Experiments on IBMQ: We now turn to readily available NISQ devices provided by IBMQ in order to estimate the predicted labels in Eq. (1). Before each circuit can be run on a backend, in this case a superconducting quantum processor, it first needs to be compiled.

FIG. 5. Convergence of the cost L(θ) evaluated on quantum computers vs SPSA iterations for corpus K16. For q_n = 1, d = 2, for which |θ| = 10, on ibmq_montreal (blue) we obtain e_tr = 0.125 and e_te = 0.5. For q_n = 1, d = 3, where |θ| = 13, on ibmq_toronto (green) we get e_tr = 0.125 and a lower testing error e_te = 0.375. On ibmq_montreal (red) we get both lower training and testing errors, e_tr = 0 and e_te = 0.375, than for d = 2. In all cases, the CNOT-depth of any sentence-circuit after t|ket⟩-compilation is at most 3. Classical simulations (dashed), averaged over 20 realisations, agree with the behaviour on IBMQ for both cases d = 2 (yellow) and d = 3 (purple).

A quantum compiler takes as input a circuit and a backend and outputs an equivalent circuit which is compatible with the backend's topology. A quantum compiler also aims to minimise the most noisy operations. For IBMQ, the gates most prone to errors are the entangling CNOT gates. The compiler we use in this work is t|ket⟩ [51] by Cambridge Quantum Computing (CQC), and for each circuit-run on a backend, we use the maximum allowed number of shots (Appendix G).

We consider the corpus K16 from 6 words (Appendix D) and set p = 0.5. For every evaluation of the cost function under optimisation, the circuits were run on the IBMQ quantum computers ibmq_montreal and ibmq_toronto. In Fig. 5 we show the convergence of the cost function under SPSA optimisation and report the training and testing errors for different choices of hyperparameters. This constitutes the first non-trivial QNLP experiment on a programmable quantum processor. According to Fig. 4, scaling up the word-circuits results in improvement in training and testing errors, and, remarkably, we observe this on the quantum computer as well. This is important for the scalability of our experiment when future hardware allows for greater circuit sizes and thus richer quantum-enhanced feature spaces and grammatically more complex sentences.

Discussion and Outlook: We have performed the first-ever quantum natural language processing experiment by means of question answering on quantum hardware. We used a compositional model of meaning constructed by a structure-preserving mapping from grammatical sentences to PQCs.

Here a remark on postselection is in order. QML-based QNLP tasks such as the one implemented in this work rely on the optimisation of a scalar cost function. In general, evaluating a scalar encoded in an amplitude on a quantum computer requires either postselection or coherent control over arbitrary circuits so that a Hadamard test [52] can be performed (Appendix H). Notably, in special cases of interest to QML, the Hadamard test can be adapted to NISQ technologies [53, 54]. In its general form, however, the depth-cost resulting after compilation of controlled circuits becomes prohibitive with current quantum devices. However, given the rapid improvement in quantum computing hardware, we envision that such operations will be within reach in the near term.

Future work includes the implementation of more complex QNLP tasks such as sentence similarity, and work with real-world data using a pregroup parser. In that context, we plan to explore regularisation techniques for the QML aspect of our work, which is an increasingly relevant topic that in general deserves more attention [39]. In addition, our DisCoCat-based QNLP framework is naturally generalisable to accommodate mapping sentences to quantum circuits involving mixed states and quantum channels. This is useful as mixed states allow for modelling lexical entailment and ambiguity [55, 56].

Finally, looking beyond the DisCoCat model, it is well-motivated to adopt the recently introduced 'augmented' DisCoCirc model [57] of meaning and its mapping to PQCs [58], which allows for QNLP experiments on text-scale real-world data in a fully-compositional framework. Motivated by interpretability in AI, word-meanings in DisCoCirc are built bottom-up as parameterised states defined on specific tensor factors. Nouns are treated as the 'entities' of a text, and sentence composition is made explicit. Entities go through gates which act as modifiers on them, modelling for example the application of adjectives or verbs. This interaction structure, viewed as a process network, can be mapped to quantum circuits, with entities as density matrices carried by wires and their modifiers as quantum channels.

Acknowledgments. KM thanks Vojtech Havlicek and Chris Self for discussions on QML, Robin Lorenz and Marcello Benedetti for comments on the manuscript, and the t|ket⟩ team at CQC for technical advice on quantum compilation. KM is grateful to the Royal Commission for the Exhibition of 1851 for financial support under a Postdoctoral Research Fellowship. AT thanks Simon Harrison for financial support through the Wolfson Harrison UK Research Council Quantum Foundation Scholarship.

[1] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 1st ed. (Prentice Hall PTR, USA, 2000).
[2] Patrick Blackburn and Johan Bos, Representation and Inference for Natural Language: A First Course in Computational Semantics (Center for the Study of Language and Information, Stanford, CA, 2005).

[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei, "Language models are few-shot learners," (2020), arXiv:2005.14165 [cs.CL].

[4] A. M. Turing, "I.—Computing Machinery and Intelligence," Mind LIX, 433–460 (1950), https://academic.oup.com/mind/article-pdf/LIX/236/433/30123314/lix-236-433.pdf.
[5] David B. Searls, "The language of genes," Nature 420, 211–217 (2002).
[6] Zhiqiang Zeng, Hua Shi, Yun Wu, and Zhiling Hong, "Survey of natural language processing techniques in bioinformatics," Computational and Mathematical Methods in Medicine 2015, 674296 (2015).

[7] Vanessa Buhrmester, David Munch, and Michael Arens, "Analysis of explainers of black box deep neural networks for computer vision: A survey," (2019), arXiv:1911.12116 [cs.AI].

[8] Joachim Lambek, "The mathematics of sentence structure," American Mathematical Monthly, 154–170 (1958).
[9] Richard Montague, "Universal grammar," Theoria 36, 373–398 (2008).

[10] Noam Chomsky, Syntactic Structures (Mouton, 1957).
[11] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark, "Mathematical foundations for a compositional distributional model of meaning," (2010), arXiv:1003.4394 [cs.CL].
[12] E. Grefenstette and M. Sadrzadeh, "Experimental support for a categorical compositional distributional model of meaning," in The 2014 Conference on Empirical Methods on Natural Language Processing (2011) pp. 1394–1404, arXiv:1106.4058.
[13] D. Kartsaklis and M. Sadrzadeh, "Prior disambiguation of word tensors for constructing sentence vectors," in The 2013 Conference on Empirical Methods on Natural Language Processing (ACL, 2013) pp. 1590–1601.
[14] M. Sadrzadeh, S. Clark, and B. Coecke, "The Frobenius anatomy of word meanings I: subject and object relative pronouns," Journal of Logic and Computation 23, 1293–1317 (2013).
[15] Mehrnoosh Sadrzadeh, Stephen Clark, and Bob Coecke, "The Frobenius anatomy of word meanings II: possessive relative pronouns," Journal of Logic and Computation 26, 785–815 (2014).
[16] Martha Lewis, "Towards logical negation for compositional distributional semantics," (2020), arXiv:2005.04929 [cs.CL].
[17] Vasily Pestun and Yiannis Vlassopoulos, "Tensor network language model," (2017), arXiv:1710.10248 [cs.CL].
[18] Angel J. Gallego and Roman Orus, "Language design as information renormalization," (2019), arXiv:1708.01525 [cs.CL].
[19] Tai-Danae Bradley, E. Miles Stoudenmire, and John Terilla, "Modeling sequences with quantum states: A look under the hood," (2019), arXiv:1910.07425 [quant-ph].
[20] Stavros Efthymiou, Jack Hidary, and Stefan Leichenauer, "TensorNetwork for machine learning," (2019), arXiv:1906.06329 [cs.LG].
[21] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, Seattle, Washington, USA, 2013) pp. 1631–1642.

[22] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando G. S. L. Brandao, David A. Buell, Brian Burkett, Yu Chen, Zijun Chen, Ben Chiaro, Roberto Collins, William Courtney, Andrew Dunsworth, Edward Farhi, Brooks Foxen, Austin Fowler, Craig Gidney, Marissa Giustina, Rob Graff, Keith Guerin, Steve Habegger, Matthew P. Harrigan, Michael J. Hartmann, Alan Ho, Markus Hoffmann, Trent Huang, Travis S. Humble, Sergei V. Isakov, Evan Jeffrey, Zhang Jiang, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Paul V. Klimov, Sergey Knysh, Alexander Korotkov, Fedor Kostritsa, David Landhuis, Mike Lindmark, Erik Lucero, Dmitry Lyakh, Salvatore Mandra, Jarrod R. McClean, Matthew McEwen, Anthony Megrant, Xiao Mi, Kristel Michielsen, Masoud Mohseni, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, Murphy Yuezhen Niu, Eric Ostby, Andre Petukhov, John C. Platt, Chris Quintana, Eleanor G. Rieffel, Pedram Roushan, Nicholas C. Rubin, Daniel Sank, Kevin J. Satzinger, Vadim Smelyanskiy, Kevin J. Sung, Matthew D. Trevithick, Amit Vainsencher, Benjamin Villalonga, Theodore White, Z. Jamie Yao, Ping Yeh, Adam Zalcman, Hartmut Neven, and John M. Martinis, "Quantum supremacy using a programmable superconducting processor," Nature 574, 505–510 (2019).

[23] Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd, "Quantum algorithm for linear systems of equations," Physical Review Letters 103 (2009), 10.1103/physrevlett.103.150502.
[24] Kerstin Beer, Dmytro Bondarenko, Terry Farrelly, Tobias J. Osborne, Robert Salzmann, Daniel Scheiermann, and Ramona Wolf, "Training deep quantum neural networks," Nature Communications 11 (2020), 10.1038/s41467-020-14454-2.
[25] Iordanis Kerenidis, Jonas Landman, Alessandro Luongo, and Anupam Prakash, "q-means: A quantum algorithm for unsupervised machine learning," in Advances in Neural Information Processing Systems, Vol. 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 4134–4144.
[26] Vedran Dunjko, Jacob M. Taylor, and Hans J. Briegel, "Quantum-enhanced machine learning," Physical Review Letters 117 (2016), 10.1103/physrevlett.117.130501.
[27] Nai-Hui Chia, Andras Gilyen, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang, "Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning," Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (2020), 10.1145/3357713.3384314.

[28] Vojtech Havlicek, Antonio D. Corcoles, Kristan Temme, Aram W. Harrow, Abhinav Kandala, Jerry M. Chow, and Jay M. Gambetta, "Supervised learning with quantum-enhanced feature spaces," Nature 567, 209–212 (2019).

[29] Zhaokai Li, Xiaomei Liu, Nanyang Xu, and Jiangfeng Du, "Experimental realization of a quantum support vector machine," Physical Review Letters 114 (2015), 10.1103/physrevlett.114.140504.
[30] William Zeng and Bob Coecke, "Quantum algorithms for compositional natural language processing," Electronic Proceedings in Theoretical Computer Science 221, 67–75 (2016).
[31] Lee James O'Riordan, Myles Doyle, Fabio Baruffa, and Venkatesh Kannan, "A hybrid classical-quantum workflow for natural language processing," Machine Learning: Science and Technology (2020), 10.1088/2632-2153/abbd2e.
[32] Nathan Wiebe, Alex Bocharov, Paul Smolensky, Matthias Troyer, and Krysta M. Svore, "Quantum language processing," (2019), arXiv:1902.05162 [quant-ph].
[33] Johannes Bausch, Sathyawageeswar Subramanian, and Stephen Piddock, "A quantum search decoder for natural language processing," (2020), arXiv:1909.05023 [quant-ph].
[34] Joseph C. H. Chen, "Quantum computation and natural language processing," (2002).
[35] S. Abramsky and B. Coecke, "A categorical semantics of quantum protocols," in Proceedings of the 19th Annual IEEE Symposium on Logic in Computer Science, 2004 (2004) pp. 415–425.
[36] B. Coecke and A. Kissinger, Picturing Quantum Processes. A First Course in Quantum Theory and Diagrammatic Reasoning (Cambridge University Press, 2017).
[37] Konstantinos Meichanetzidis, Stefano Gogioso, Giovanni De Felice, Nicolo Chiappori, Alexis Toumi, and Bob Coecke, "Quantum natural language processing on near-term quantum computers," (2020), arXiv:2005.04147 [cs.CL].
[38] Maria Schuld, Alex Bocharov, Krysta M. Svore, and Nathan Wiebe, "Circuit-centric quantum classifiers," Physical Review A 101 (2020), 10.1103/physreva.101.032308.
[39] Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini, "Parameterized quantum circuits as machine learning models," Quantum Science and Technology 4, 043001 (2019).

[40] Joachim Lambek, "From word to sentence."
[41] Anne Preller, "Linear processing with pregroups," Studia Logica: An International Journal for Symbolic Logic 87, 171–197 (2007).

[42] John C. Baez and Mike Stay, "Physics, topology, logic and computation: A Rosetta Stone," (2009), arXiv:0903.0340 [quant-ph].
[43] P. Selinger, "A survey of graphical languages for monoidal categories," Lecture Notes in Physics, 289–355 (2010).
[44] Maria Schuld and Nathan Killoran, "Quantum machine learning in feature Hilbert spaces," Physical Review Letters 122 (2019), 10.1103/physrevlett.122.040504.
[45] Seth Lloyd, Maria Schuld, Aroosa Ijaz, Josh Izaac, and Nathan Killoran, "Quantum embeddings for machine learning," (2020), arXiv:2001.03622 [quant-ph].
[46] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," (2013), arXiv:1301.3781 [cs.CL].
[47] J. C. Spall, "Implementation of the simultaneous perturbation algorithm for stochastic optimization," IEEE Transactions on Aerospace and Electronic Systems 34, 817–823 (1998).
[48] Giovanni de Felice, Konstantinos Meichanetzidis, and Alexis Toumi, "Functorial question answering," Electronic Proceedings in Theoretical Computer Science 323, 84–94 (2020).
[49] Yiwei Chen, Yu Pan, and Daoyi Dong, "Quantum language model with entanglement embedding for question answering," (2020), arXiv:2008.09943 [cs.CL].
[50] Qin Zhao, Chenguang Hou, Changjian Liu, Peng Zhang, and Ruifeng Xu, "A quantum expectation value based language model with application to question answering," Entropy 22, 533 (2020).
[51] Seyon Sivarajah, Silas Dilkes, Alexander Cowtan, Will Simmons, Alec Edgington, and Ross Duncan, "t|ket⟩: A retargetable compiler for NISQ devices," Quantum Science and Technology (2020), 10.1088/2058-9565/ab8e92.
[52] Dorit Aharonov, Vaughan Jones, and Zeph Landau, "A polynomial quantum algorithm for approximating the Jones polynomial," Algorithmica 55, 395–421 (2008).
[53] Kosuke Mitarai and Keisuke Fujii, "Methodology for replacing indirect measurements with direct measurements," Physical Review Research 1 (2019), 10.1103/physrevresearch.1.013006.
[54] Marcello Benedetti, Mattia Fiorentini, and Michael Lubasch, "Hardware-efficient variational quantum algorithms for time evolution," (2020), arXiv:2009.12361 [quant-ph].
[55] Robin Piedeleu, Dimitri Kartsaklis, Bob Coecke, and Mehrnoosh Sadrzadeh, "Open system categorical quantum semantics in natural language processing," (2015), arXiv:1502.00831 [cs.CL].
[56] Desislava Bankova, Bob Coecke, Martha Lewis, and Daniel Marsden, "Graded entailment for compositional distributional semantics," (2016), arXiv:1601.04908 [cs.CL].
[57] Bob Coecke, "The mathematics of text structure," (2020), arXiv:1904.03478 [cs.CL].
[58] Bob Coecke, Giovanni De Felice, Konstantinos Meichanetzidis, and Alexis Toumi, "Foundations for near-term quantum natural language processing (unpublished)," (2020).

[59] Wojciech Buszkowski and Katarzyna Moroz, "Pregroup grammars and context-free grammars."

[60] M. Pentus, "Lambek grammars are context free," in [1993] Proceedings Eighth Annual IEEE Symposium on Logic in Computer Science (1993) pp. 429–433.

[61] https://github.com/andim/noisyopt.
[62] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne, "JAX: composable transformations of Python+NumPy programs," (2018).

[63] Brian Olson, Irina Hashmi, Kevin Molloy, and Amarda Shehu, "Basin hopping as a general and versatile optimization framework for the characterization of biological macromolecules," Advances in Artificial Intelligence 2012, 1–19 (2012).
[64] Fuchang Gao and Lixing Han, "Implementing the Nelder-Mead simplex algorithm with adaptive parameters," Computational Optimization and Applications 51, 259–277 (2010).

[65] https://pypi.org/project/scipy.
[66] Alexander Cowtan, Silas Dilkes, Ross Duncan, Alexandre Krajenbrink, Will Simmons, and Seyon Sivarajah, "On the Qubit Routing Problem," in 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2019), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 135, edited by Wim van Dam and Laura Mancinska (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2019) pp. 5:1–5:32.

[67] https://github.com/CQCL/pytket.


Appendix

In this supplementary material we begin by briefly reviewing pregroup grammar. We then provide the necessary background on the graphical language of process theories and describe our procedure for generating random sentence diagrams using a context-free grammar. For completeness, we include the three labelled corpora of sentences we used in this work. Furthermore, we show details of our mapping from sentence diagrams to quantum circuits. Finally, we give details on the optimisation methods we used for our supervised quantum machine learning task and the specific compilation pass we used from CQC's compiler, t|ket⟩.

Appendix A: Pregroup Grammar

Pregroup grammars were introduced by Lambek as an algebraic model for grammar [40].

A pregroup grammar G is freely generated by the basic types b in a finite set B. Basic types are decorated by an integer k ∈ Z, which signifies their adjoint order. Negative integers −k, with k ∈ N, are called left adjoints of order k, and positive integers k ∈ N are called right adjoints. We shall refer to a basic type raised to some adjoint order (including the zeroth order) simply as a 'type'. The zeroth order k = 0 signifies no adjoint action on the basic type, and so we often omit it in notation, b^0 = b.

The pregroup algebra is such that the two kinds of adjoint (left and right) act as left and right inverses under multiplication of basic types:

b^k b^{k+1} → ε → b^{k+1} b^k,

where ε ∈ B is the trivial or unit type. The left-hand side of this reduction is called a contraction and the right-hand side an expansion. Pregroup grammar also accommodates induced steps a → b for a, b ∈ B. The symbol '→' is to be read as 'type-reduction', and the pregroup grammar sets the rules for which reductions are valid.

Now, to go from word to sentence, we consider a finite set of words called the vocabulary V. We call the dictionary (or lexicon) the finite set of entries D ⊆ V × (B × Z)*. The star symbol A* denotes the set of finite strings that can be generated by the elements of the set A. Each dictionary entry assigns a product (or string) of types to a word, t_w = ∏_i b_i^{k_i}, k_i ∈ Z.

Finally, a pregroup grammar G generates a language L_G ⊆ V* as follows. A sentence is a sequence (or list) of words σ ∈ V*. The type of a sentence is the product of the types of its words, t_σ = ∏_i t_{w_i}, where w_i ∈ V and i ≤ |σ|. A sentence is grammatical, i.e. it belongs to the language generated by the grammar, σ ∈ L_G, if and only if there exists a sequence of reductions so that the type of the sentence reduces to the special sentence-type s ∈ B, as t_σ → · · · → s. Note that it is in fact possible to type-reduce grammatical sentences using only contractions.

Appendix B: String Diagrams

String diagrams describing process theories are generated by states, effects, and processes. In Fig. 6 we comprehensively show these generators along with the constraining equations on them. String diagrams for process theories formally describe process networks where only connectivity matters, i.e. which outputs are connected to which inputs. In other words, the length of the wires carries no meaning and the wires are freely deformable as long as the topology of the network is respected.

It is beyond the purposes of this work to provide a comprehensive exposition of diagrammatic languages. We provide the necessary elements which are used for the implementation of our QNLP experiments.

Appendix C: Random Sentence Generation with CFG

A context-free grammar generates a language from a set of production (or rewrite) rules applied on symbols. Symbols belong to a finite set Σ and there is a special type S ∈ Σ called initial. Production rules belong to a finite set R and are of the form T → ∏_i T_i, where T, T_i ∈ Σ. The application of a production rule results in substituting a symbol with a product (or string) of symbols. Randomly generating a sentence amounts to starting from S and randomly applying production rules uniformly sampled from the set R. The production ends when all types produced are terminal types, which are none other than words in the finite vocabulary V.
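As an illustration of this procedure, the following Python sketch (our own, with production rules in the spirit of Fig. 7 and the vocabulary of corpus K16) samples one random sentence; recursion depth is not bounded in this toy version.

import random

rules = {  # symbol -> list of possible right-hand sides
    "S":     [["NP", "VP"]],
    "VP":    [["TV", "NP"], ["IV"]],
    "NP":    [["N"], ["N", "RPRON", "VP"]],
    # terminal symbols expand directly to words of the vocabulary
    "N":     [["Romeo"], ["Juliet"]],
    "TV":    [["loves"], ["kills"]],
    "IV":    [["dies"]],
    "RPRON": [["who"]],
}
vocabulary = {"Romeo", "Juliet", "loves", "kills", "dies", "who"}

def generate(symbol="S"):
    """Expand symbols with uniformly sampled production rules until only words remain."""
    if symbol in vocabulary:
        return [symbol]
    rhs = random.choice(rules[symbol])
    return [word for part in rhs for word in generate(part)]

print(" ".join(generate()))   # e.g. "Romeo who loves Juliet dies"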

From a process-theory point of view, we represent symbols as types carried by wires.

FIG. 6. Diagrams are read from top to bottom. States have only outputs, effects have only inputs, and processes (boxes) have both input and output wires. All wires carry types. Placing boxes side by side is allowed by the monoidal structure and signifies parallel processes. Sequential process composition is represented by composing outputs of a box with inputs of another box. A process transforms a state into a new state. There are special kinds of states called caps and effects called cups, which satisfy the snake equation relating them to the identity wire (trivial process). Process networks freely generated by these generators need not be planar, and so there exists a special process that swaps wires and acts trivially on caps and cups.

Production rules are represented as boxes with input and output wires labelled by the appropriate types. The process network (or string diagram) describing the production of a sentence ends with a production rule whose output is the S-type. Then we randomly pick boxes and compose them backwards, always respecting type-matching when inputs of production rules are fed into outputs of other production rules. The generation terminates when production rules which have no inputs (i.e. they are states) are applied; these correspond to the words in the finite vocabulary.

FIG. 7. CFG generation rules used to produce the corpora K30, K6, K16 used in this work, represented as string-diagram generators, where wN ∈ V_N, wTV ∈ V_TV, wIV ∈ V_IV, wRPRON ∈ V_RPRON. They are mapped to pregroup reductions by mapping CFG symbols to pregroup types, and so CFG-states are mapped to DisCoCat word-states and production boxes are mapped to products of cups and identities. Note that the pregroup unit ε is the empty wire and so it is never drawn. Pregroup type contractions correspond to cups and expansions to caps. Since grammatical reductions are achievable only with contractions, only cups are required for the construction of sentence diagrams.

In Fig. 7 (on the left-hand side of the arrows) we show the string-diagram generators we use to randomly produce sentences from a vocabulary of words composed of nouns, transitive verbs, intransitive verbs, and relative pronouns. The corresponding types of these parts of speech are N, TV, IV, RPRON. The vocabulary is the union of the words of each type, V = V_N ∪ V_TV ∪ V_IV ∪ V_RPRON.

Having randomly generated a sentence from the CFG, its string diagram can be translated into a pregroup sentence diagram. To do so we use the translation rules shown in Fig. 7. Note that a cup labelled by the basic type b is used to represent a contraction b^k b^{k+1} → ε. Pregroup grammars are weakly equivalent to context-free grammars, in the sense that they generate the same language [59, 60].

Appendix D: Corpora

Here we present the sentences and their labels used in the experiments presented in the main text.

The types assigned to the words are as follows. Nouns are typed t_{w∈V_N} = n^0, transitive verbs are given type t_{w∈V_TV} = n^1 s^0 n^{-1}, intransitive verbs are typed t_{w∈V_IV} = n^1 s^0, and the relative pronoun is typed t_who = n^1 n^0 s^{-1} n^0.

Corpus K30 of 30 labelled sentences from the vocabulary V_N = {'Dude', 'Walter'}, V_TV = {'loves', 'annoys'}, V_IV = {'abides', 'bowls'}, V_RPRON = {'who'}:
[('Dude who loves Walter bowls', 1),
('Dude bowls', 1),
('Dude annoys Walter', 0),
('Walter who abides bowls', 0),
('Walter loves Walter', 1),
('Walter annoys Dude', 1),
('Walter bowls', 1),
('Walter abides', 0),
('Dude loves Walter', 1),
('Dude who bowls abides', 1),
('Walter who bowls annoys Dude', 1),
('Dude who bowls bowls', 1),
('Dude who abides abides', 1),
('Dude annoys Dude who bowls', 0),
('Walter annoys Walter', 0),
('Dude who abides bowls', 1),
('Walter who abides loves Walter', 0),
('Walter who bowls bowls', 1),
('Walter loves Walter who abides', 0),
('Walter annoys Walter who bowls', 0),
('Dude abides', 1),
('Dude loves Walter who bowls', 1),
('Walter who loves Dude bowls', 1),
('Dude loves Dude who abides', 1),
('Walter who abides loves Dude', 0),
('Dude annoys Dude', 0),
('Walter who annoys Dude bowls', 1),
('Walter who annoys Dude abides', 0),
('Walter loves Dude', 1),
('Dude who bowls loves Walter', 1)]

Corpus K6 of 6 labelled sentences from the vocabulary V_N = {'Romeo', 'Juliet'}, V_TV = {'loves'}, V_IV = {'dies'}, V_RPRON = {'who'}:
[('Romeo dies', 1.0),
('Romeo loves Juliet', 0.0),
('Juliet who dies dies', 1.0),
('Romeo loves Romeo', 0.0),
('Juliet loves Romeo', 0.0),
('Juliet dies', 1.0)]

Corpus K16 of 16 labelled sentences from the vocabulary V_N = {'Romeo', 'Juliet'}, V_TV = {'loves', 'kills'}, V_IV = {'dies'}, V_RPRON = {'who'}:
[('Juliet kills Romeo who dies', 0),
('Juliet dies', 1),
('Romeo who loves Juliet dies', 1),
('Romeo dies', 1),
('Juliet who dies dies', 1),
('Romeo loves Juliet', 1),
('Juliet who dies loves Juliet', 0),
('Romeo kills Juliet who dies', 0),
('Romeo who kills Romeo dies', 1),
('Romeo who dies dies', 1),
('Romeo who loves Romeo dies', 0),
('Romeo kills Juliet', 0),
('Romeo who dies kills Romeo', 1),
('Juliet who dies kills Romeo', 0),
('Romeo loves Romeo', 0),
('Romeo who dies kills Juliet', 0)]

Appendix E: Sentence to Circuit mapping

Quantum theory has formally been shown to be a process theory. Therefore it enjoys a diagrammatic language in terms of string diagrams. Specifically, in the context of the quantum circuits we construct in our experiments, we use pure quantum theory. In the case of pure quantum theory, processes are unitary operations, or quantum gates in the context of circuits. The monoidal structure allowing for parallel processes is instantiated by the tensor product, and sequential composition is instantiated by sequential composition of quantum gates.

In Fig. 8 we show the generic construction of the mapping from sentence diagrams to parameterised quantum circuits for the hyperparameters and parameterised word-circuits we use in this work.

A wire carrying basic pregroup type b is given q_b qubits. A word-state with only one output wire becomes a one-qubit state prepared from |0⟩. For the preparation of such unary states we choose the sequence of gates defining an Euler decomposition of one-qubit unitaries, R_z(θ_1) ∘ R_x(θ_2) ∘ R_z(θ_3). Word-states with more than one output wire become multiqubit states on k > 1 qubits prepared by an IQP-style circuit from |0⟩^{⊗k}. Such a word-circuit is composed of d-many layers. Each layer is composed of a layer of Hadamard gates followed by a layer in which every neighbouring pair of qubit wires is connected by a CRz(θ) gate,

(⊗_{i=1}^{k} H) ∘ (⊗_{i=1}^{k−1} CRz(θ_i)_{i,i+1}).

Since all CRz gates commute with each other, it is justified to consider this as a single layer, at least abstractly.

FIG. 8. Mapping from sentence diagrams to parameterised quantum circuits. Here we show how the generators of sentence diagrams are mapped to generators of circuits, for the hyperparameters we consider in this work.

The Kronecker tensor with n-many output wires of type b is mapped to a GHZ state on n·q_b qubits. Specifically, GHZ is the circuit that prepares the state Σ_{x=0}^{2^{q_b}} ⊗_{i=1}^{n} |bin(x)⟩, where bin(x) is the binary expression of the integer x. The cup of pregroup type b is mapped to q_b-many nested Bell effects, each of which is implemented as a CNOT followed by a Hadamard gate on the control qubit and postselection on ⟨00|.

Appendix F: Optimisation Method

The gradient-free optimisation method we use, Simultaneous Perturbation Stochastic Approximation (SPSA), works as follows. Start from a random point in parameter space. At every iteration, pick a random direction and estimate the derivative along it by a finite difference with step-size depending on c. This requires only two cost function evaluations per iteration, which provides a significant speed-up in the evaluation of L(θ). Then take a step of size depending on a towards (opposite to) that direction if the derivative has negative (positive) sign. In our experiments we use minimizeSPSA from the Python package noisyopt [61], and we set a = 0.1 and c = 0.1, except for the experiment on IBMQ for d = 3, for which we set a = 0.05 and c = 0.05.
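A minimal sketch of how such an SPSA run can be set up with noisyopt is given below (our illustration; the cost function shown is a dummy stand-in for L(θ)).

import numpy as np
from noisyopt import minimizeSPSA

def cost(theta):
    # placeholder for L(theta): evaluate the sentence circuits and compare the
    # predicted labels with the training labels, Eq. (2)
    return float(np.sum((np.sin(theta) - 0.5) ** 2))

theta0 = np.random.uniform(0, 2 * np.pi, size=13)       # e.g. |theta| = 13 parameters
result = minimizeSPSA(cost, x0=theta0, a=0.1, c=0.1, niter=100, paired=False)
print(result.x, result.fun)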

Note that for classical simulations, we use just-in-time compilation of the cost function by invoking jit from jax [62]. In addition, the choice of the squares-of-differences cost defined in Eq. (2) is not unique. One can equally well use the binary cross entropy

L_BCE(θ) = −(1/|∆|) Σ_{σ∈∆} [ l_σ log l^pr_σ(θ_σ) + (1 − l_σ) log(1 − l^pr_σ(θ_σ)) ],

which can be minimised as well, as shown in Fig. 9.

FIG. 9. Minimisation of the binary cross entropy cost function L_BCE with SPSA for the question-answering task on corpus K30.

In our classical simulation of the experiment we also used basinhopping [63] in combination with Nelder-Mead [64] from the Python package SciPy [65]. Nelder-Mead is a gradient-free local optimisation method. basinhopping hops (or jumps) between basins (or local minima) and then returns the minimum over the local minima of the cost function, where each minimum is found by Nelder-Mead. The hop direction is random. The hop is accepted according to a Metropolis criterion depending on the cost function to be minimised and a temperature. We used the default temperature value (1) and the default number of basin hops (100).
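A corresponding sketch with SciPy (our illustration, again with a dummy stand-in for L(θ)):

import numpy as np
from scipy.optimize import basinhopping

def cost(theta):
    return float(np.sum((np.sin(theta) - 0.5) ** 2))    # dummy stand-in for L(theta)

theta0 = np.random.uniform(0, 2 * np.pi, size=8)
result = basinhopping(cost, theta0, niter=100, T=1.0,   # default temperature and hops
                      minimizer_kwargs={"method": "Nelder-Mead"})
print(result.x, result.fun)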

1. Error Decay

In Fig. 10 we show the decay of the mean training and test errors for the question-answering task for corpus K30 simulated classically, which is shown as an inset in Fig. 4. Plotting in log-log scale reveals, at least initially, an algebraic decay of the errors with the depth of the word-circuits.

FIG. 10. Algebraic decay of the mean training and testing errors for the data displayed in Fig. 4 (bottom), obtained by basinhopping. Increasing the depth of the word-circuits results in an algebraic decay of the mean training and testing errors. The slopes are log e_tr ∼ −1.2 log d and log e_te ∼ −0.3 log d. We attribute the plateau for e_tr at large depths to the small scale of our experiment and the small values of our hyperparameters determining the size of the quantum-enhanced feature space.

2. On the influence of noise on the cost function landscape

Regarding optimisation on a quantum computer, we comment on the effect of noise on the optimal parameters. Consider a successful optimisation of L(θ) performed on a NISQ device, returning θ*_NISQ. However, if we instantiate the circuits C_σ(θ*_NISQ) and evaluate them on a classical computer to obtain the predicted labels l^CC_pr(θ*_NISQ), we observe that these can in general differ from the labels l^NISQ_pr(θ*_NISQ) predicted by evaluating the circuits on the quantum computer. In the context of a fault-tolerant quantum computer, this should not be the case. However, since there is a non-trivial coherent-noise channel that our circuits undergo, it is expected that the optimiser's results are affected in this way.

Appendix G: Quantum Compilation

In order to perform quantum compilation we use pytket [51]. It is a Python module for interfacing with CQC's t|ket⟩, a toolset for quantum programming. From this toolbox, we need to make use of compilation passes.

At a high level, quantum compilation can be described as follows. Given a circuit and a device, quantum operations are decomposed in terms of the device's native gateset.

FIG. 11. Circuit for the Hadamard test. Measuring the control qubit in the computational basis allows one to estimate ⟨Z⟩ = Re(⟨ψ|U|ψ⟩) if b = 0, and ⟨Z⟩ = Im(⟨ψ|U|ψ⟩) if b = 1. The state ψ can be a multiqubit state, and in this work we are interested in the case ψ = |0 . . . 0⟩.

Furthermore, the quantum circuit is reshaped in order to make it compatible with the device's topology [66]. Specifically, the compilation pass that we use is default_compilation_pass(2). The integer option is set to 2 for maximum optimisation under compilation [67].

Circuits written in pytket can be run on other devices by simply changing the backend being called, regardless of whether the hardware might be fundamentally different in terms of what physical systems are used as qubits. This makes t|ket⟩ platform-agnostic. We stress that on IBMQ machines specifically, the native gates are arbitrary single-qubit unitaries ('U3' gate) and entangling controlled-not gates ('CNOT' or 'CX'). Importantly, CNOT gates show error rates which are one or even two orders of magnitude larger than the error rates of U3 gates. Therefore, we measure the depth of our circuits in terms of the CNOT-depth. Using pytket, this can be obtained by invoking the command depth_by_type(OpType.CX).
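A minimal sketch of this compilation workflow (our illustration; it assumes pytket with the qiskit extension and valid IBMQ credentials, and the two-qubit circuit is a toy placeholder, not one of our sentence-circuits):

from pytket import Circuit, OpType
from pytket.extensions.qiskit import IBMQBackend

circuit = Circuit(2).H(0).CX(0, 1).Rz(0.25, 1)        # toy circuit (angles in half-turns)
backend = IBMQBackend("ibmq_montreal")                # requires IBMQ credentials
backend.default_compilation_pass(2).apply(circuit)    # optimisation level 2
print(circuit.depth_by_type(OpType.CX))               # CNOT-depth after compilation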

For both backends used in this work, ibmq_montreal and ibmq_toronto, the reported quantum volume is 32 and the maximum allowed number of shots is 2^13 = 8192.

Appendix H: Hadamard Test

In our binary classification NLP task, the predicted label is the norm squared of the zero-to-zero transition amplitude ⟨0 . . . 0|U|0 . . . 0⟩, where the unitary U includes the word-circuits and the circuits that implement the Bell effects as dictated by the grammatical structure. Estimating these amplitudes can be done by postselecting on ⟨0 . . . 0|. However, postselection costs exponential time in the number of postselected qubits; in our case one needs to discard all bitstrings sampled from the quantum computer that have Hamming weight other than zero.

FIG. 12. Use of the Hadamard test to estimate the amplitude represented by the postselected circuit of Fig. 3.

This is the procedure we follow in this proof-of-concept experiment, as we can afford to do so due to the small circuit sizes.

In such a setting, postselection can be avoided by using the Hadamard test [52]. See Fig. 11 for the circuit allowing for the estimation of the real and imaginary parts of an amplitude. In Fig. 12 we show how the Hadamard test can be used to estimate the amplitude represented by the postselected quantum circuit of Fig. 3.
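The following numpy sketch (our illustration) checks the relation stated in the caption of Fig. 11 for a toy single-qubit unitary, by simulating the Hadamard-test circuit exactly.

import numpy as np

def hadamard_test_expectation(U, psi, imaginary=False):
    """<Z> on the control qubit of the Hadamard test: Re<psi|U|psi> (or Im if requested)."""
    n = len(psi)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    S_dag = np.diag([1, -1j]) if imaginary else np.eye(2)
    state = np.kron(np.array([1.0, 0.0]), psi)           # control |0> times |psi>
    state = np.kron(H, np.eye(n)) @ state                 # H on control
    state = np.kron(S_dag, np.eye(n)) @ state             # (S^dagger)^b on control
    controlled_U = np.block([[np.eye(n), np.zeros((n, n))],
                             [np.zeros((n, n)), U]])      # U applied if control is 1
    state = controlled_U @ state
    state = np.kron(H, np.eye(n)) @ state                 # final H on control
    p0 = np.linalg.norm(state[:n]) ** 2                   # probability of measuring 0
    return 2 * p0 - 1

U = np.diag([1, np.exp(1j * 0.7)])                        # toy unitary
psi = np.array([1.0, 1.0]) / np.sqrt(2)
print(hadamard_test_expectation(U, psi), (psi.conj() @ U @ psi).real)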