


Neural Syntactic Preordering for Controlled Paraphrase Generation

Tanya Goyal and Greg Durrett
Department of Computer Science
The University of Texas at Austin

[email protected], [email protected]

Abstract

Paraphrasing natural language sentences is a multifaceted process: it might involve replacing individual words or short phrases, local rearrangement of content, or high-level restructuring like topicalization or passivization. Past approaches struggle to cover this space of paraphrase possibilities in an interpretable manner. Our work, inspired by pre-ordering literature in machine translation, uses syntactic transformations to softly “reorder” the source sentence and guide our neural paraphrasing model. First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model. This model operates over a partially lexical, partially syntactic view of the sentence and can reorder big chunks. Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order. Our evaluation, both automatic and human, shows that the proposed system retains the quality of the baseline approaches while giving a substantial increase in the diversity of the generated paraphrases.[1]

1 Introduction

Paraphrase generation (McKeown, 1983; Barzilay and Lee, 2003) has seen a recent surge of interest, both with large-scale dataset collection and curation (Lan et al., 2017; Wieting and Gimpel, 2018) and with modeling advances such as deep generative models (Gupta et al., 2018; Li et al., 2019). Paraphrasing models have proven to be especially useful if they expose control mechanisms that can be manipulated to produce diverse paraphrases (Iyyer et al., 2018; Chen et al., 2019b; Park et al., 2019), which allows these models to be employed for data augmentation (Yu et al., 2018) and

[1] Data and code are available at https://github.com/tagoyal/sow-reap-paraphrasing

[Figure 1 depicts the pipeline: the parse of “Clippers won the game” is abstracted to “X_NP won Y_NP”, rewritten by the source order rewriting step to “Y_NP won by X_NP”, yielding the source order 4 3 1 2; this source order encoding feeds a Transformer seq2seq model, which produces “The game was won by the Clippers.”]

Figure 1: Overview of our paraphrase model. First, we choose various pairs of constituents to abstract away in the source sentence, then use a neural transducer to generate possible reorderings of the abstracted sentences. From these, we construct a guide reordering of the input sentence which then informs the generation of output paraphrases.

adversarial example generation (Iyyer et al., 2018). However, prior methods involving syntactic control mechanisms do not effectively cover the space of paraphrase possibilities. Using syntactic templates covering the top of the parse tree (Iyyer et al., 2018) is inflexible, and using fully-specified exemplar sentences (Chen et al., 2019b) poses the problem of how to effectively retrieve such sentences. For a particular input sentence, it is challenging to use these past approaches to enumerate the set of reorderings that make sense for that sentence.

In this paper, we propose a two-stage approach to address these limitations, outlined in Figure 1. First, we use an encoder-decoder model (SOW, for Source Order reWriting) to apply transduction operations over various abstracted versions of the input sentence. These transductions yield possible reorderings of the words and constituents, which can be combined to obtain multiple feasible rearrangements of the input sentence. Each rearrangement specifies an order in which we should visit words of the source sentence; note that such orderings could encourage a model to passivize (visit the object before the subject), topicalize, or reorder clauses. These orderings are encoded for our encoder-decoder paraphrase model (REAP, for REarrangement Aware Paraphrasing) by way of position embeddings, which are added to the source sentence encoding to specify the desired order of generation (see Figure 2). This overall workflow is inspired by the pre-ordering literature in machine translation (Xia and McCord, 2004; Collins et al., 2005); however, our setting explicitly requires entertaining a diverse set of possible orderings corresponding to different paraphrasing phenomena.

arXiv:2005.02013v1 [cs.CL] 5 May 2020

We train and evaluate our approach on the large-scale English paraphrase dataset PARANMT-50M (Wieting and Gimpel, 2018). Results show that our approach generates considerably more diverse paraphrases while retaining the quality exhibited by strong baseline models. We further demonstrate that the proposed syntax-based transduction procedure generates a feasible set of rearrangements for the input sentence. Finally, we show that position embeddings provide a simple yet effective way to encode reordering information, and that the generated paraphrases exhibit high compliance with the desired reordering input.

2 Method

Given an input sentence x = {x1, x2, ..., xn}, our goal is to generate a set of structurally distinct paraphrases Y = {y1, y2, ..., yk}. We achieve this by first producing k diverse reorderings for the input sentence, R = {r1, r2, ..., rk}, that guide the generation order of each corresponding y. Each reordering is represented as a permutation of the source sentence indices.
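As an illustration of this representation (a sketch, not the authors' code; the helper `apply_reordering` is hypothetical), a reordering r assigns each source position a visit rank, and sorting positions by rank previews the order in which the paraphrase model should attend to the source content:

```python
def apply_reordering(tokens, r):
    """Return source tokens in the order given by permutation r.

    Here r[i] is interpreted as the visit rank (1-indexed, as in
    Figure 1) assigned to source position i, so we sort positions
    by their assigned rank.
    """
    assert sorted(r) == list(range(1, len(tokens) + 1)), "r must be a permutation"
    return [tok for _, tok in sorted(zip(r, tokens))]

tokens = ["Clippers", "won", "the", "game"]
r = [4, 3, 1, 2]  # the reordering shown in Figure 1
print(apply_reordering(tokens, r))  # ['the', 'game', 'won', 'Clippers']
```

Under this reading, the order 4 3 1 2 visits "the game" before "Clippers", matching the passivized output "The game was won by the Clippers."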

Our method centers around a sequence-to-sequence model which can generate a paraphrase roughly respecting a particular ordering of the input tokens. Formally, this is a model P(y | x, r). First, we assume access to the set of target reorderings R and describe this rearrangement aware paraphrasing model (REAP) in Section 2.2. Then, in Section 2.3, we outline our reordering approach, including the source order rewriting (SOW) model, which produces the set of reorderings appropriate for a given input sentence x during inference (x → R).

2.1 Base Model

The models discussed in this work build on a standard sequence-to-sequence transformer model (Vaswani et al., 2017) that uses stacked layers of self-attention to both encode the input tokens x and decode the corresponding target sequence y. This model is pictured in the gray block of Figure 2.

[Figure 2 shows the REAP architecture: input tokens x (“Clippers won the game”) with original-order position embeddings (1 2 3 4) feed the encoder; position embeddings for the target order r (4 3 1 2) are added to the encoder output E_M, giving the new encoder output E, over which the decoder attends while producing the output tokens y (“The game ...”).]

Figure 2: Rearrangement aware paraphrasing (REAP) model. The gray area corresponds to the standard transformer encoder-decoder system. Our model adds position embeddings corresponding to the target reordering to encoder outputs. The decoder attends over these augmented encodings during both training and inference.

Throughout this work, we use byte pair encoding (BPE) (Sennrich et al., 2016) to tokenize our input and output sentences. These models are trained in the standard way, maximizing the log likelihood of the target sequence using teacher forcing. Additionally, in order to ensure that the decoder does not attend to the same input tokens repeatedly at each step of the decoding process, we include a coverage loss term, as proposed in See et al. (2017).
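The coverage loss of See et al. (2017) can be sketched as follows (an illustrative pure-Python version over attention distributions, not the paper's implementation): at each decoder step, the penalty is the overlap between the current attention distribution and the running sum of all previous ones.

```python
def coverage_loss(attentions):
    """Coverage loss over a decoding run.

    attentions: list of per-step attention distributions, each a list of
    weights over the source positions. At step t the penalty is
    sum_i min(a_t[i], c_t[i]), where c_t accumulates prior attention.
    """
    n_src = len(attentions[0])
    coverage = [0.0] * n_src          # c_t: running sum of past attention
    total = 0.0
    for a_t in attentions:
        total += sum(min(a, c) for a, c in zip(a_t, coverage))
        coverage = [c + a for c, a in zip(coverage, a_t)]
    return total

# Re-attending to the same source token is penalized:
print(coverage_loss([[1.0, 0.0], [1.0, 0.0]]))  # 1.0
print(coverage_loss([[1.0, 0.0], [0.0, 1.0]]))  # 0.0
```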

Note that since the architecture of the transformer model is non-recurrent, it adds position embeddings to the input word embeddings in order to indicate the correct sequence of the words in both x and y (see Figure 2). In this work, we propose using an additional set of position embeddings to indicate the desired order of words during generation, described next.

2.2 Rearrangement aware Paraphrasing Model (REAP)

Let r = {r1, r2, ..., rn} indicate the target reordering corresponding to the input tokens x. We want the model to approximately attend to tokens in this specified order when generating the final output paraphrase. For instance, in the example in Figure 1, the reordering specifies that when producing the paraphrase, the model should generate content related to the game before content related to Clippers in the output. In this case, based on the rearrangement being applied, the model will most likely use passivization in its generation, although this is not strictly enforced.

The architecture for our model P(y | x, r) is outlined in Figure 2.

[Figure 3 illustrates the source sentence rearrangement workflow on the input “If it continues to rain I will carry an umbrella”: SELECTSEGMENTPAIRS chooses constituents to abstract, REORDERPHRASE uses the seq2seq model to map the SOW input “If S I will VP” to the output “I will VP if S” (reordering 4 5 1 2 3), and recursive REORDER calls on the abstracted nodes yield derived reorderings for the full sentence (e.g. 6 7 10 8 9 1 2 3 4 5), which serve as input r to REAP and produce final paraphrases such as “I will carry an umbrella if rain continues.”]

Figure 3: Overview of the source sentence rearrangement workflow for one level of recursion at the root node. First, candidate tree segment pairs contained within the input node are selected. A transduction operation is applied over the abstracted phrase, giving the reordering 4 5 1 2 3 for the case shown in red, then the process recursively continues for each abstracted node. This results in a reordering for the full source sentence; the reordering indices serve as additional input to the REAP model.

Consider an encoder-decoder architecture with a stack of M layers in the encoder and N layers in the decoder. We make the target reordering r accessible to this transformer model through an additional set of positional embeddings PE_r. We use the sinusoidal function to construct these, following Vaswani et al. (2017).

Let E_M = encoder_M(x) be the output of the M-th (last) layer of the encoder. The special-purpose position embeddings are added to the output of this layer (see Figure 2): E = E_M + PE_r. Note that these are separate from the standard position embeddings added at the input layer; such embeddings are also used in our model to encode the original order of the source sentence. The transformer decoder model attends over E while computing attention, and the presence of the position embeddings should encourage the generation to obey the desired ordering r, while still conforming to the decoder language model. Our experiments in Section 4.3 show that this position embedding method is able to successfully guide the generation of paraphrases, conditioning on both the input sentence semantics as well as the desired ordering.

2.3 Sentence Reordering

We now outline our approach for generating these desired reorderings r. We do this by predicting phrasal rearrangements with the SOW model at various levels of syntactic abstraction of the sentence. We combine multiple such phrase-level rearrangements to obtain a set R of sentence-level rearrangements. This is done using a top-down approach, starting at the root node of the parse tree. The overall recursive procedure is outlined in Algorithm 1.

Algorithm 1 REORDER(t)

Input: Sub-tree t of the input parse tree
Output: Top-k list of reorderings for t's yield
  T = SELECTSEGMENTPAIRS(t)                    // Step 1
  R = INITIALIZEBEAM(size = k)
  for (A, B) in T do
      z = REORDERPHRASE(t, A, B)               // Step 2
      RA(1, ..., k) = REORDER(tA)              // k orderings
      RB(1, ..., k) = REORDER(tB)              // k orderings
      for ra, rb in RA × RB do
          r = COMBINE(z, ra, rb)               // Step 3
          score(r) = score(z) + score(ra) + score(rb)
          R.push(r, score(r))
      end for
  end for
  return R
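The control flow of Algorithm 1 can be sketched in Python as follows. This is only a skeleton under simplifying assumptions, not the paper's implementation: trees are binary tuples, `select_segment_pairs` proposes a single split, `reorder_phrase` returns fixed scores for MONOTONE and FLIP, and the combine step just splices child orderings rather than mapping full index permutations.

```python
import heapq
from itertools import product

def reorder(t, k=2):
    """Return up to k (score, ordering) pairs for the yield of tree t."""
    if isinstance(t, str):                     # leaf: a single token
        return [(0.0, [t])]
    best = []                                  # beam of scored candidates
    for (A, B) in select_segment_pairs(t):
        for z_score, flip in reorder_phrase(t, A, B):          # Step 2
            for (sa, ra), (sb, rb) in product(reorder(A, k), reorder(B, k)):
                ordering = rb + ra if flip else ra + rb        # Step 3 (toy combine)
                heapq.heappush(best, (z_score + sa + sb, ordering))
    return heapq.nlargest(k, best)[:k]

def select_segment_pairs(t):
    return [t]                                 # toy: the node's two children

def reorder_phrase(t, A, B):
    # (score, flip?) pairs; here MONOTONE is always preferred over FLIP.
    return [(0.0, False), (-1.0, True)]

top = reorder(("Clippers", ("won", ("the", "game"))), k=2)
print(top[0])  # highest-scoring candidate keeps the monotone order
```

Scores add across the phrase-level decision and the two sub-reorderings, mirroring score(r) = score(z) + score(ra) + score(rb) in Algorithm 1 (the paper later averages rather than sums; see Step 3).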

One step of the recursive algorithm has three major steps; Figure 3 shows the overall workflow for one iteration (here, the root node of the sentence is selected for illustration). First, we select sub-phrase pairs of the input phrase that respect parse-tree boundaries, where each pair consists of non-overlapping phrases (Step 1). Since the aim is to learn generic syntax-governed rearrangements, we abstract out the two sub-phrases and replace them with non-terminal symbols, retaining only the constituent tag information. For example, we show three phrase pairs in Figure 3 that can be abstracted away to yield the reduced forms of the sentences. We then use a seq2seq model to obtain rearrangements for each abstracted phrase (Step 2). Finally, this top-level rearrangement is combined with recursively-constructed phrase rearrangements within the abstracted phrases to obtain sentence-level rearrangements (Step 3).

Step 1: SELECTSEGMENTPAIRS

We begin by selecting phrase tuples that form the input to our seq2seq model. A phrase tuple (t, A, B) consists of a sub-tree t with the constituents A and B abstracted out (replaced by their syntactic categories). For instance, in Figure 3, the S0, S, and VP2 nodes circled in red form a phrase tuple. Multiple distinct combinations of A and B are possible.[2]

Step 2: REORDERPHRASE

Next, we obtain rearrangements for each phrase tuple (t, A, B). We first form an input consisting of the yield of t with A and B abstracted out; e.g. If S I will VP, shown in red in Figure 3. We use a sequence-to-sequence model (the SOW model) that takes this string as input and produces a corresponding output sequence. We then perform word-level alignment between the input and generated output sequences (using cosine similarity between GloVe embeddings) to obtain the rearrangement that must be applied to the input sequence.[3] The log probability of the output sequence serves as a score for this rearrangement.
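The post-hoc alignment step can be sketched as below. This is a hedged, simplified version: the paper states only that cosine similarity between GloVe embeddings is used, so the greedy one-best matching, the `align` helper, and the toy vectors here are all illustrative assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align(src_vecs, out_vecs):
    """For each output position, pick the most similar source position."""
    return [max(range(len(src_vecs)), key=lambda i: cosine(v, src_vecs[i]))
            for v in out_vecs]

# Toy 2-d embeddings standing in for GloVe vectors (hypothetical values):
src = [[1.0, 0.0], [0.0, 1.0]]   # e.g. chunk "If S", chunk "I will VP"
out = [[0.1, 0.9], [0.9, 0.1]]   # the generated output flips the two chunks
print(align(src, out))           # [1, 0]
```

The resulting index sequence is the rearrangement applied to the input; real GloVe vectors are 50- to 300-dimensional, but the matching logic is the same.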

SOW model. The SOW model is a sequence-to-sequence model P(y′ | x′, o), following the transformer framework in Section 2.1.[4] Both x′ and y′ are encoded using the word pieces vocabulary; additionally, embeddings corresponding to the POS tags and constituent labels (for non-terminals) are added to the input embeddings.

For instance, in Figure 3, If S I will VP and I will VP if S is an example of an (x′, y′) pair. While not formally required, Algorithm 1 ensures that there are always exactly two non-terminal labels in these sequences. o is a variable that takes values MONOTONE or FLIP. This encodes a preference to keep the two abstracted nodes in the same order or to “flip” them in the output.[5] o is encoded in the model with additional positional encodings of the form {..., 0, 0, 1, 0, ..., 2, 0, ...} for monotone and {..., 0, 0, 2, 0, ..., 1, 0, ...} for flipped, wherein the non-zero positions correspond to the positions of the abstracted non-terminals in the phrase. These positional embeddings for the SOW model are handled analogously to the r embeddings for the REAP model. During inference, we use both the monotone rearrangement and the flip rearrangement to generate two reorderings, one of each type, for each phrase tuple.

[2] In order to limit the number of such pairs, we employ a threshold on the fraction of non-abstracted words remaining in the phrase, outlined in more detail in the Appendix.

[3] We experimented with a pointer network to predict indices directly; however, the approach of generating and then aligning post hoc resulted in a much more stable model.

[4] See Appendix for the SOW model architecture diagram.

[5] In syntactic translation systems, rules similarly can be divided by whether they preserve order or invert it (Wu, 1997).

We describe training of this model in Section 3.
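The order-variable encoding described above can be sketched as follows (the helper `order_encoding` is hypothetical; only the shape of the vector, zeros everywhere except 1 and 2 at the two abstracted non-terminals, comes from the paper):

```python
def order_encoding(phrase_tokens, nonterminals, flip):
    """Positional input encoding the order variable o.

    The two abstracted non-terminals receive 1 and 2 in input order for
    o = MONOTONE (flip=False) and 2 and 1 for o = FLIP (flip=True); all
    other positions are 0.
    """
    positions = [p for p, tok in enumerate(phrase_tokens) if tok in nonterminals]
    assert len(positions) == 2, "exactly two abstracted non-terminals expected"
    first, second = positions
    enc = [0] * len(phrase_tokens)
    enc[first], enc[second] = (2, 1) if flip else (1, 2)
    return enc

phrase = ["If", "S", "I", "will", "VP"]
print(order_encoding(phrase, {"S", "VP"}, flip=False))  # [0, 1, 0, 0, 2]
print(order_encoding(phrase, {"S", "VP"}, flip=True))   # [0, 2, 0, 0, 1]
```

At inference time, running the SOW model once with each encoding yields the monotone and flipped reorderings for the phrase tuple.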

Step 3: COMBINE

The previous step gives a rearrangement for the subtree t. To obtain a sentence-level rearrangement from this, we first recursively apply the REORDER algorithm on subtrees tA and tB, which returns the top-k rearrangements of each subtree. We iterate over each rearrangement pair (ra, rb), applying these reorderings to the abstracted phrases A and B. This is illustrated on the left side of Figure 3. The sentence-level representations thus obtained are scored by taking a mean over all the phrase-level rearrangements involved.

3 Data and Training

We train and evaluate our model on the PARANMT-50M paraphrase dataset (Wieting and Gimpel, 2018), constructed by backtranslating the Czech sentences of the CzEng (Bojar et al., 2016) corpus. We filter this dataset to remove shorter sentences (less than 8 tokens), low-quality paraphrase pairs (quantified by a translation score included with the dataset), and examples that exhibit low reordering (quantified by a reordering score based on the position of each word in the source and its aligned word in the target sentence). This leaves us with over 350k paraphrase pairs.

3.1 Training Data for REAP

To train our REAP model (outlined in Section 2.2), we take existing paraphrase pairs (x, y∗) and derive pseudo-ground truth rearrangements r∗ of the source sentence tokens based on their alignment with the target sentence. To obtain these rearrangements, we first get contextual embeddings (Devlin et al., 2019) for all tokens in the source and target sentences. We follow the strategy outlined in Lerner and Petrov (2013) and perform reorderings as we traverse down the dependency tree. Starting at the root node of the source sentence, we determine the order between the head and its children (independent of other decisions) based on the order of the corresponding aligned words in the target sentence. We continue this traversal recursively to get the sentence-level rearrangement. This mirrors the rearrangement strategy from Section 2.3, which operates over the constituency parse tree instead of the dependency parse.

[Figure 4 shows a paraphrase sentence pair, “If it continues to rain I will carry an umbrella” and “I will carry an umbrella if rain continues”, with aligned constituent pairs marked by colored shapes.]

Figure 4: Paraphrase sentence pair and its aligned tuples A → B, C and A′ → B′, C′. These produce the training data for the SOW model.

Given triples (x, r∗, y∗), we can train our REAP model to generate the final paraphrases conditioning on the pseudo-ground truth reorderings.
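The dependency-tree derivation of r∗ described above can be sketched as follows. This is a hedged simplification of the Lerner and Petrov (2013)-style traversal: the tree encoding, the `derive_reordering` helper, and the tie-free word-level alignment are illustrative assumptions.

```python
def derive_reordering(node, align):
    """Derive a source visit order from target-side alignments.

    node: (source_index, [child_nodes]) in the source dependency tree;
    align: dict mapping each source index to its aligned target position.
    The head and each child subtree are sorted by the (earliest) target
    position of their aligned words, then recursion flattens the result.
    """
    idx, children = node
    units = [(align[idx], [idx])]              # the head itself
    for child in children:
        sub = derive_reordering(child, align)  # recursively ordered span
        units.append((min(align[i] for i in sub), sub))
    ordered = []
    for _, span in sorted(units):
        ordered.extend(span)
    return ordered

# Toy example: "Clippers(0) won(1) the(2) game(3)" aligned to the
# passivized target "the(0) game(1) was(2) won(3) by(4) ... Clippers(5)".
tree = (1, [(0, []), (3, [(2, [])])])   # won -> {Clippers, game -> the}
align = {0: 5, 1: 3, 2: 0, 3: 1}
print(derive_reordering(tree, align))   # [2, 3, 1, 0]
```

The output says to visit "the game" first, then "won", then "Clippers", i.e. the passivizing reordering from Figure 1.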

3.2 Training Data for SOW

The PARANMT-50M dataset contains sentence-level paraphrase pairs. However, in order to train our SOW model (outlined in Section 2.3), we need to see phrase-level paraphrases with syntactic abstractions in them. We extract these from the PARANMT-50M dataset using the following procedure, shown in Figure 4. We follow Zhang et al. (2020) and compute a phrase alignment score between all pairs of constituents in a sentence and its paraphrase.[6] From this set of phrase alignment scores, we compute a partial one-to-one mapping between phrases (colored shapes in Figure 4); that is, not all phrases get aligned, but the subset that do are aligned one-to-one. Finally, we extract aligned chunks similar to rule alignment in syntactic translation (Galley et al., 2004): when aligned phrases A and A′ subsume aligned phrase pairs (B, C) and (B′, C′) respectively, we can extract the aligned tuples (t_A, B, C) and (t_A′, B′, C′). The phrases (B, C) and (B′, C′) are abstracted out to construct training data for the phrase-level transducer, including supervision of whether o = MONOTONE or FLIP. Using the above alignment strategy, we were able to obtain over 1 million aligned phrase pairs.

4 Evaluation

Setup. As our main goal is to evaluate our model's ability to generate diverse paraphrases, we obtain a set of paraphrases and compare these to sets of paraphrases produced by other methods. To obtain 10 paraphrases, we first compute a set of 10 distinct reorderings r1, ..., r10 with the SOW method from Section 2.3 and then use the REAP model to generate a 1-best paraphrase for each. We use top-k decoding to generate the final set of paraphrases corresponding to the reorderings. Our evaluation is done over 10k examples from PARANMT-50M.

[6] The score is computed using a weighted mean of the contextual similarity between individual words in the phrases, where the weights are determined by the corpus-level inverse document frequency of the words. Details in the Appendix.

4.1 Quantitative Evaluation

Baselines. We compare our model against the Syntactically Controlled Paraphrase Network (SCPN) model proposed in prior work (Iyyer et al., 2018). It produces 10 distinct paraphrase outputs conditioned on a pre-enumerated list of syntactic templates. This approach has been shown to outperform other paraphrase approaches that condition on interpretable intermediate structures (Chen et al., 2019b). Additionally, we report results on the following baseline models: i) A copy-input model that outputs the input sentence exactly. ii) A vanilla seq2seq model that uses the same transformer encoder-decoder architecture from Section 2.1 but does not condition on any target rearrangement. We use top-k sampling (Fan et al., 2018) to generate 10 paraphrases from this model.[7] iii) A diverse-decoding model that uses the above transformer seq2seq model with diverse decoding (Kumar et al., 2019) during generation. Here, the induced diversity is uncontrolled and aimed at maximizing metrics such as distinct n-grams and edit distance between the generated sentences. iv) An LSTM version of our model where the REAP model uses LSTMs with attention (Bahdanau et al., 2014) and copy (See et al., 2017) instead of transformers. We still use the transformer-based phrase transducer to obtain the source sentence reorderings, and still use positional encodings in the LSTM attention.

Similar to Cho et al. (2019), we report two types of metrics:

1. Quality: Given k generated paraphrases Y = {y1, y2, ..., yk} for each input sentence in the test set, we select y_best that achieves the best (oracle) sentence-level score with the ground truth paraphrase y. The corpus-level evaluation is performed using pairs (y_best, y).

2. Diversity: We calculate BLEU or WER between all pairs (yi, yj) generated by a single model on a single sentence, then macro-average these values at a corpus level.

[7] Prior work (Wang et al., 2019; Li et al., 2019) has shown that such a transformer-based model provides a strong baseline and outperforms previous LSTM-based (Hasan et al., 2016) and VAE-based (Gupta et al., 2018) approaches.

                       oracle quality (over 10 sentences, no rejection) ↑   pairwise diversity (post-rejection)
Model                  BLEU   ROUGE-1   ROUGE-2   ROUGE-L                   % rejected   self-BLEU ↓   self-WER ↑
copy-input             18.4   54.4      27.2      49.2                      0            −             −
SCPN                   21.3   53.2      30.3      51.0                      40.6         35.9          63.4
Transformer seq2seq    32.8   63.1      41.4      63.3                      12.7         50.7          35.4
+ diverse-decoding     24.8   56.8      33.2      56.4                      21.3         34.2          58.1
SOW-REAP (LSTM)        27.0   57.9      34.8      57.5                      31.7         46.2          53.9
SOW-REAP               30.9   62.3      40.2      61.7                      15.9         38.0          57.9

Table 1: Quality and diversity metrics for the different models. Our proposed approach outperforms other diverse models (SCPN and diverse decoding) in terms of all the quality metrics. These models exhibit higher diversity, but with many more rejected paraphrases, indicating that these models more freely generate bad paraphrases.
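The pairwise diversity computation can be sketched as follows (an illustrative stdlib-only version; the `pairwise_self_wer` helper and the normalization by the first sentence's length are assumptions, shown here for WER since BLEU needs an n-gram library):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[len(a)][len(b)]

def pairwise_self_wer(paraphrase_sets):
    """Mean WER over all pairs per input, macro-averaged over inputs."""
    per_input = []
    for ys in paraphrase_sets:
        pairs = [(ys[i], ys[j]) for i in range(len(ys))
                 for j in range(i + 1, len(ys))]
        dists = [edit_distance(a.split(), b.split()) / len(a.split())
                 for a, b in pairs]
        per_input.append(sum(dists) / len(dists))
    return sum(per_input) / len(per_input)
```

Higher self-WER (or lower self-BLEU) between a model's own outputs indicates a more diverse paraphrase set.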

In addition to these metrics, we use the paraphrase similarity model proposed by Wieting et al. (2017) to compute a paraphrase score for generated outputs with respect to the input. Similar to Iyyer et al. (2018), we use this score to filter out low-quality paraphrases. We report the rejection rate according to this criterion for all models. Note that our diversity metric is computed after filtering, as it is easy to get high diversity by including nonsensical paraphrase candidates that differ semantically.

Table 1 outlines the performance of the different models. The results show that our proposed model substantially outperforms the SCPN model across all quality metrics.[8] Furthermore, our LSTM model also beats the performance of the SCPN model, demonstrating that the gain in quality cannot completely be attributed to the use of transformers. The quality of our full model (with rearrangements) is also comparable to the quality of the vanilla seq2seq model (without rearrangements). This demonstrates that the inclusion of rearrangements from the syntax-based neural transducer does not hurt quality, while leading to substantially improved diversity.

The SCPN model has a high rejection rate of 40.6%. This demonstrates that out of the 10 templates used to generate paraphrases for each sentence, on average 4 were not appropriate for the given sentence and therefore get rejected. On the other hand, for our model, only 15.9% of the generated paraphrases get rejected, implying that the rearrangements produced were generally meaningful. This is comparable to the 12.7% rejection rate exhibited by the vanilla seq2seq model, which does not condition on any syntax or rearrangement and is therefore never obliged to conform to an inappropriate structure.

[8] The difference in performance between our proposed model and baseline models is statistically significant according to a paired bootstrap test.

Finally, our model exhibits much higher diversity within the generated paraphrases compared to the transformer seq2seq baseline. As expected, the SCPN model produces slightly more diverse paraphrases, as it explicitly conditions the generations on templates with very different top-level structures. However, this is often at the cost of semantic equivalence, as demonstrated by both quantitative and human evaluation (next section). A similar trend was observed with the diverse-decoding scheme. Although it leads to more diverse generations, there is a substantial decrease in quality compared to SOW-REAP and the seq2seq model. Moreover, the paraphrases have a higher rejection rate (21.3%), suggesting that diverse decoding is more likely to produce nonsensical paraphrases. A similar phenomenon is also reported by Iyyer et al. (2018), wherein diverse decoding resulted in paraphrases with different semantics than the input.

Syntactic Exemplars. In addition to SCPN, we compare our proposed model against the controllable generation method of Chen et al. (2019b). Their model uses an exemplar sentence as a syntactic guide during generation; the generated paraphrase is trained to incorporate the semantics of the input sentence while emulating the syntactic structure of the exemplar (see Appendix D for examples). However, their proposed approach depends on the availability of such exemplars at test time; they manually constructed these for their test set (800 examples). Since we do not have such example sentences available for our test data, we report results of our model's performance on their test data.


Input: if at any time in the preparation of this product the integrity of this container is compromised it should not be used .
  SOW-REAP: this container should not be used if any time in the preparation of this product is compromised
  SOW-REAP: if the integrity of the packaging is impaired at any time , the product should not be used .
  SOW-REAP: if the product integrity of this container is compromised it should not be used .
  SCPN: in the preparation of this product , the integrity of this container is compromised , but it should not be used .
  SCPN: where is the integrity of this product of this container the integrity of this container should not be used .
  SCPN: i should not use if at any time in the preparation of this product , it should not be used .

Input: i was the first grower to use hydroponics .
  SOW-REAP: to use hydroponics , i was the first one .
  SOW-REAP: i used hydroponics for the first time .
  SOW-REAP: to use hydroponics the first time i was .
  SCPN: where did i have the first tendency to use hydroponics ?
  SCPN: i used to use hydroponics .
  SCPN: first i was the first grower to use hydroponics

Table 2: Examples of paraphrases generated by our system and the baseline SCPN model. Our model successfully rearranges the different structural components of the input sentence to obtain meaningful rearrangements. SCPN conforms to pre-enumerated templates that may not align with a given input.

Note that Chen et al. (2019b) carefully curated the exemplar to be syntactically similar to the actual target paraphrase. Therefore, for fair comparison, we report results using the ground truth ordering (that similarly leverages the target sentence to obtain a source reordering), followed by the REAP model. This model (ground truth order + REAP) achieves a 1-best BLEU score of 20.9, outperforming both of the prior works: Chen et al. (2019b) (13.6 BLEU) and SCPN (17.8 BLEU with template, 19.2 BLEU with full parse). Furthermore, our full SOW-REAP model gets an oracle-BLEU (across 10 sentences) score of 23.8. These results show that our proposed formulation outperforms other controllable baselines, while being more flexible.

4.2 Qualitative Evaluation

Table 2 provides examples of paraphrase outputs produced by our approach and SCPN. The examples show that our model exhibits syntactic diversity while producing reasonable paraphrases of the input sentence. On the other hand, SCPN tends to generate non-paraphrases in order to conform to a given template, which contributes to increased diversity but at the cost of semantic equivalence. In Table 3, we show the corresponding sequence of rules that apply to an input sentence, and the final generated output according to that input rearrangement. Note that for our model, on average, 1.8 phrase-level reorderings were combined to produce sentence-level reorderings (we restrict to a maximum of 3). More examples along with the input rule sequence (for our model) and syntactic templates (for SCPN) are provided in the Appendix.

Input Sentence: if at any time in the preparation of this product the integrity of this container is compromised it should not be used .

Rule Sequence:
if S it should not VB used . → should not VB used if S (parse tree level: 0)
at NP the integrity of this container VBZ compromised → this container VBZ weakened at NP (parse tree level: 1)
the NN of NP → NP NN (parse tree level: 2)

Generated Sentence: this container should not be used if the product is compromised at any time in preparation .

Table 3: Examples of our model's rearrangements applied to a given input sentence. Parse tree level indicates the rule subtree's depth from the root node of the sentence. The REAP model's final generation considers the rule reordering at the higher levels of the tree but ignores the rearrangement within the lower sub-tree.

Human Evaluation  We also performed human evaluation on Amazon Mechanical Turk to evaluate the quality of the generated paraphrases. We randomly sampled 100 sentences from the development set. For each of these sentences, we obtained 3 generated paraphrases from each of the following models: i) SCPN, ii) vanilla sequence-to-sequence and iii) our proposed SOW-REAP model. We follow earlier work (Kok and Brockett, 2010; Iyyer et al., 2018) and obtain quality annotations on a 3-point scale: 0 denotes not a paraphrase, 1 denotes that the input sentence and the generated sentence are paraphrases but the generated sentence might contain grammatical errors, and 2 indicates that the input and the candidate are paraphrases. To emulate the human evaluation design in Iyyer et al. (2018), we sample paraphrases after filtering using the criterion outlined in the previous section and obtain three judgments per sentence and its 9 paraphrase candidates. Table 4 outlines the results from the human evaluation. As we can see, the results indicate


Model                       2     1     0

SCPN (Iyyer et al., 2018)   35.9  24.8  39.3
Transformer seq2seq         45.1  20.6  34.3
SOW-REAP                    44.5  22.6  32.9

Table 4: Human-annotated quality across different models. The evaluation was done on a 3-point quality scale: 2 = grammatical paraphrase, 1 = ungrammatical paraphrase, 0 = not a paraphrase.

Ordering       oracle-ppl ↓   oracle-BLEU ↑

Monotone       10.59          27.98
Random          9.32          27.10
SOW             8.14          30.02
Ground Truth    7.79          36.40

Table 5: Comparison of different source reordering strategies. Our proposed approach outperforms baseline monotone and random rearrangement strategies.

that the quality of the paraphrases generated from our model is substantially better than the SCPN model.9 Furthermore, similar to the quantitative evaluation, the human evaluation also demonstrates that the performance of this model is similar to that of the vanilla sequence-to-sequence model, indicating that the inclusion of target rearrangements does not hurt performance.

4.3 Ablations and Analysis

4.3.1 Evaluation of SOW Model

Next, we intrinsically evaluate the performance of our SOW model (Section 2.3). Specifically, given a budget of 10 reorderings, we want to understand how close our SOW model comes to covering the target ordering. We do this by evaluating the REAP model in terms of oracle perplexity (of the ground truth paraphrase) and oracle BLEU over these 10 orderings.
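The oracle metrics above can be sketched as follows: given a set of candidates (one per reordering), the oracle score is the best score any candidate achieves against the reference. The `unigram_f1` stand-in below is illustrative only; the paper's actual metrics are BLEU and the REAP model's perplexity.

```python
def oracle_score(candidates, reference, score_fn, higher_is_better=True):
    """Return the best (oracle) score achieved by any candidate."""
    best = max if higher_is_better else min
    return best(score_fn(c, reference) for c in candidates)

def unigram_f1(cand, ref):
    """Toy unigram-overlap F1 as an illustrative stand-in for BLEU."""
    c, r = set(cand.split()), set(ref.split())
    common = len(c & r)
    if common == 0:
        return 0.0
    p, rec = common / len(c), common / len(r)
    return 2 * p * rec / (p + rec)
```

For oracle perplexity, the same helper would be called with `higher_is_better=False`, since lower perplexity is better.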

We evaluate our proposed approach against 3 systems: a) Monotone reordering {1, 2, . . . , n}; b) Random permutation, obtained by randomly permuting the children of each node as we traverse down the constituency parse tree; c) Ground Truth, using the pseudo-ground truth rearrangement (outlined in Section 3) between the source and the ground-truth target sentence. This serves as an upper bound for the reorderings' performance, as obtained by the recursive phrase-level transducer.
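The random-permutation baseline can be sketched as a recursive shuffle over a constituency tree. The `(label, children)` tuple representation is an assumption for illustration, not the format used in our implementation.

```python
import random

def random_reorder(tree, rng):
    """Randomly permute the children of each node, recursively.

    A tree is either a token string (leaf) or a (label, children) tuple.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    shuffled = children[:]
    rng.shuffle(shuffled)
    return (label, [random_reorder(c, rng) for c in shuffled])

def leaves(tree):
    """Yield the surface tokens of a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in leaves(c)]
```

Because only sibling order changes, the reordered tree always contains exactly the original tokens.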

9 The difference of our model's performance with SCPN is statistically significant, while that with the baseline seq2seq is not, according to a paired bootstrap test.

[Figure 5 plot: achieved degree of rearrangement between input and generated output, plotted against the target degree of rearrangement between input and ground truth output, for the Monotone r and Ground Truth r∗ reorderings.]

Figure 5: The degree of rearrangement (Kendall's Tau) achieved by conditioning on monotone and pseudo-ground truth reorderings (r∗). The dotted line denotes the ideal performance (in terms of reordering compliance) of the REAP model when supplied with the perfect reordering r∗. The actual performance of the REAP model mirrors the ideal performance.

Table 5 outlines the results for 10 generated paraphrases from each rearrangement strategy. Our proposed approach outperforms the baseline monotone and random reordering strategies. Furthermore, the SOW model's oracle perplexity is close to that of the ground truth reordering's perplexity, showing that the proposed approach is capable of generating a diverse set of rearrangements such that one of them often comes close to the target rearrangement. The comparatively high performance of the ground truth reorderings demonstrates that the positional embeddings are effective at guiding the REAP model's generation.

4.3.2 Compliance with target reorderings

Finally, we evaluate whether the generated paraphrases follow the target reordering r. Note that we do not expect or want our REAP model to be absolutely compliant with this input reordering, since the model should be able to correct for the mistakes made by the SOW model and still generate valid paraphrases. Therefore, we perform reordering compliance experiments on only the monotone reordering and the pseudo-ground truth reorderings (r∗, construction outlined in Section 3), since these certainly correspond to valid paraphrases.

For sentences in the test set, we generate paraphrases using monotone reordering and pseudo-ground truth reordering as inputs to REAP. We get the 1-best paraphrase and compute the degree of rearrangement10 between the input sentence and

10 Quantified by Kendall's Tau rank correlation between the original source order and the targeted/generated order. Higher


the generated sentence. In Figure 5, we plot this as a function of the target degree of rearrangement, i.e., the rearrangement between the input sentence x and the ground truth sentence y∗. The dotted line denotes the ideal performance of the model in terms of agreement with the perfect reordering r∗. The plot shows that the REAP model performs as desired; the monotone generation results in high Kendall's Tau between input and output. Conditioning on the pseudo-ground truth reorderings (r∗) produces rearrangements that exhibit the same amount of reordering as the ideal rearrangement.
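A minimal sketch of the Kendall's Tau computation used to quantify the degree of rearrangement (footnote 10), assuming the generated order is given as a permutation of source token positions:

```python
def kendalls_tau(order):
    """Kendall's Tau between the identity order 0..n-1 and `order`.

    `order[i]` is the output position of the i-th source token.
    Returns 1.0 for monotone order, -1.0 for a full reversal.
    """
    n = len(order)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if order[i] < order[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Under this convention, higher Tau means the output preserves more of the source order, i.e., less rearrangement.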

5 Related Work

Paraphrase Generation  Compared to prior seq2seq approaches for paraphrasing (Hasan et al., 2016; Gupta et al., 2018; Li et al., 2018), our model is able to achieve much stronger controllability with an interpretable control mechanism. Like these approaches, we can leverage a wide variety of resources to train on, including backtranslation (Pavlick et al., 2015; Wieting and Gimpel, 2018; Hu et al., 2019) or other curated data sources (Fader et al., 2013; Lan et al., 2017).

Controlled Generation  Recent work on controlled generation aims at controlling attributes such as sentiment (Shen et al., 2017), gender or political slant (Prabhumoye et al., 2018), topic (Wang et al., 2017), etc. However, these methods cannot achieve fine-grained control over a property like syntax. Prior work on diverse paraphrase generation can be divided into three groups: diverse decoding, latent variable modeling, and syntax-based. The first group uses heuristics such as Hamming distance or distinct n-grams to preserve diverse options during beam search decoding (Vijayakumar et al., 2018; Kumar et al., 2019). The second group includes approaches that use uninterpretable latent variables to separate syntax and semantics (Chen et al., 2019a), perturb latent representations to enforce diversity (Gupta et al., 2018; Park et al., 2019) or condition on latent codes used to represent different re-writing patterns (Xu et al., 2018; An and Liu, 2019). Qian et al. (2019) use distinct generators to output diverse paraphrases. These methods achieve some diversity, but do not control generation in an interpretable manner. Finally, methods that use explicit syntactic structures (Iyyer et al., 2018; Chen et al., 2019b) may try to force a

Kendall’s Tau indicates lower rearrangement and vice-versa.

sentence to conform to unsuitable syntax. Phrase-level approaches (Li et al., 2019) are inherently less flexible than our approach.

Machine Translation  Our work is inspired by the pre-ordering literature in machine translation. These systems either use hand-crafted rules designed for specific languages (Collins et al., 2005; Wang et al., 2007) or automatically learn rewriting patterns based on syntax (Xia and McCord, 2004; Dyer and Resnik, 2010; Genzel, 2010; Khalilov and Simaan, 2011; Lerner and Petrov, 2013). There also exist approaches that do not rely on syntactic parsers, but induce hierarchical representations to leverage for pre-ordering (Tromble and Eisner, 2009; DeNero and Uszkoreit, 2011). In the context of translation, there is often a canonical reordering that should be applied to align better with the target language; for instance, head-final languages like Japanese exhibit highly regular syntax-governed reorderings compared to English. However, in diverse paraphrase generation, there is no single canonical reordering, making our problem quite different.

In concurrent work, Chen et al. (2020) similarly use an additional set of position embeddings to guide the order of generated words for machine translation. This demonstrates that the REAP technique is effective for other tasks as well. However, they do not tackle the problem of generating plausible reorderings, and therefore their technique is less flexible than our full SOW-REAP model.

6 Conclusion

In this work, we propose a two-step framework for paraphrase generation: construction of diverse syntactic guides in the form of target reorderings, followed by actual paraphrase generation that respects these reorderings. Our experiments show that this approach can be used to produce paraphrases that achieve a better quality-diversity trade-off compared to previous methods and strong baselines.

Acknowledgments

This work was partially supported by NSF Grant IIS-1814522, a gift from Arm, and an equipment grant from NVIDIA. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Thanks as well to the anonymous reviewers for their helpful comments.


References

Zhecheng An and Sicong Liu. 2019. Towards Diverse Paraphrase Generation Using Multi-Class Wasserstein GAN. arXiv preprint arXiv:1909.13827.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 16–23. Association for Computational Linguistics.

Ondrej Bojar, Ondrej Dusek, Tom Kocmi, Jindrich Libovicky, Michal Novak, Martin Popel, Roman Sudarikov, and Dusan Varis. 2016. CzEng 1.6: enlarged Czech-English parallel corpus with processing tools Dockered. In International Conference on Text, Speech, and Dialogue, pages 231–238. Springer.

Kehai Chen, Rui Wang, Masao Utiyama, and Eiichiro Sumita. 2020. Explicit Reordering for Neural Machine Translation. arXiv preprint arXiv:2004.03818.

Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019a. A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2453–2464.

Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019b. Controllable Paraphrase Generation with a Syntactic Exemplar. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5972–5984, Florence, Italy. Association for Computational Linguistics.

Jaemin Cho, Minjoon Seo, and Hannaneh Hajishirzi. 2019. Mixture Content Selection for Diverse Sequence Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3112–3122.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 531–540. Association for Computational Linguistics.

John DeNero and Jakob Uszkoreit. 2011. Inducing sentence structure from parallel corpora for reordering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 193–203. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Chris Dyer and Philip Resnik. 2010. Context-free reordering, finite-state translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 858–866, Los Angeles, California. Association for Computational Linguistics.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1608–1618.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 273–280, Boston, Massachusetts, USA. Association for Computational Linguistics.

Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 376–384.

Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. 2018. A deep generative framework for paraphrase generation. In Thirty-Second AAAI Conference on Artificial Intelligence.

Sadid A. Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, Oladimeji Farri, et al. 2016. Neural Paraphrase Generation with Stacked Residual LSTM Networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2923–2934.

J. Edward Hu, Rachel Rudinger, Matt Post, and Benjamin Van Durme. 2019. ParaBank: Monolingual


Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation. In Proceedings of AAAI.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. In Proceedings of NAACL-HLT, pages 1875–1885.

Maxim Khalilov and Khalil Simaan. 2011. Context-sensitive syntactic source-reordering by statistical transduction. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 38–46.

Stanley Kok and Chris Brockett. 2010. Hitting the right paraphrases in good time. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 145–153. Association for Computational Linguistics.

Ashutosh Kumar, Satwik Bhattamishra, Manik Bhandari, and Partha Talukdar. 2019. Submodular Optimization-based Diverse Paraphrasing and its Effectiveness in Data Augmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3609–3619.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. A Continuously Growing Dataset of Sentential Paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234, Copenhagen, Denmark. Association for Computational Linguistics.

Uri Lerner and Slav Petrov. 2013. Source-side classifier preordering for machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 513–523.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase Generation with Deep Reinforcement Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878, Brussels, Belgium. Association for Computational Linguistics.

Zichao Li, Xin Jiang, Lifeng Shang, and Qun Liu. 2019. Decomposable Neural Paraphrase Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3403–3414, Florence, Italy. Association for Computational Linguistics.

Kathleen R. McKeown. 1983. Paraphrasing questions using given and new information. Computational Linguistics, 9(1):1–10.

Sunghyun Park, Seung-won Hwang, Fuxiang Chen, Jaegul Choo, Jung-Woo Ha, Sunghun Kim, and Jinyeong Yim. 2019. Paraphrase Diversification Using Counterfactual Debiasing. In Proceedings of

the AAAI Conference on Artificial Intelligence, volume 33, pages 6883–6891.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430, Beijing, China. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style Transfer Through Back-Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876.

Lihua Qian, Lin Qiu, Weinan Zhang, Xin Jiang, and Yong Yu. 2019. Exploring Diverse Expressions for Paraphrase Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3164–3173.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Roy Tromble and Jason Eisner. 2009. Learning linear ordering problems for better translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 1007–1016. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.


Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Thirty-Second AAAI Conference on Artificial Intelligence.

Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745.

Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering Output Style and Topic in Neural Response Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2140–2150.

Su Wang, Rahul Gupta, Nancy Chang, and Jason Baldridge. 2019. A task in a suit and a tie: paraphrase generation with semantic augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7176–7183.

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Melbourne, Australia. Association for Computational Linguistics.

John Wieting, Jonathan Mallinson, and Kevin Gimpel. 2017. Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 274–285.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of the 20th International Conference on Computational Linguistics, page 508. Association for Computational Linguistics.

Qiongkai Xu, Juyan Zhang, Lizhen Qu, Lexing Xie, and Richard Nock. 2018. D-PAGE: Diverse Paraphrase Generation. CoRR, abs/1808.04364.

Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In International Conference on Learning Representations.

Shiyue Zhang and Mohit Bansal. 2019. Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing

(EMNLP-IJCNLP), pages 2495–2509, Hong Kong, China. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.

APPENDIX

A SELECTSEGMENTPAIRS: Limiting number of segment pairs

As outlined in Section 2.3, the SELECTSEGMENTPAIRS subroutine returns a set of non-overlapping sub-phrases (A,B). In order to limit the number of sub-phrase pairs during inference, we employ the following heuristics:

1. We compute a score based on the number of non-abstracted tokens divided by the total number of tokens in the yield of the parent sub-phrase t. We reject pairs (A,B) that have a score of more than 0.6. This reduces spurious ambiguity by encouraging the model to rearrange big constituents hierarchically rather than only abstracting out small pieces.

2. We maintain a list of tags that are never individually selected as sub-phrases. These include constituents that would be trivial to the reordering, such as determiners (DT), prepositions (IN), cardinal numbers (CD), and modals (MD). However, these may be a part of larger constituents that form A or B.
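The two heuristics above can be sketched as a single filter. The function and variable names are illustrative, not from the released code.

```python
FORBIDDEN_TAGS = {"DT", "IN", "CD", "MD"}  # trivial constituents (heuristic 2)

def keep_segment_pair(parent_len, abstracted_len, tag_a, tag_b, max_ratio=0.6):
    """Return True if the candidate sub-phrase pair (A, B) passes both filters.

    parent_len: number of tokens in the yield of the parent sub-phrase t.
    abstracted_len: number of tokens covered by the abstracted pair (A, B).
    """
    # Heuristic 2: never select trivial constituents individually.
    if tag_a in FORBIDDEN_TAGS or tag_b in FORBIDDEN_TAGS:
        return False
    # Heuristic 1: reject if too much of the parent remains non-abstracted.
    score = (parent_len - abstracted_len) / parent_len
    return score <= max_ratio
```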

B Training Data for SOW MODEL

In Section 3.2, we outlined our approach for obtaining phrase-level alignments from the PARANMT-50M dataset used to train the SOW MODEL. In the described approach, an alignment score is computed between each pair of phrases p, p′ belonging to sentences s and s′ respectively. We use the exact procedure in Zhang and Bansal (2019) to compute the alignment score, outlined below:

1. First, we compute an inverse document frequency (idf) score for each token in the training set. Let M = |{s^{(i)}}| be the total number of sentences. Then the idf of a word w is computed as:

$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}[w \in s^{(i)}]$$
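A direct implementation of the idf formula above, assuming tokenized sentences:

```python
import math

def compute_idf(sentences):
    """idf(w) = -log((1/M) * sum_i 1[w in s_i]) over M training sentences."""
    M = len(sentences)
    doc_freq = {}
    for s in sentences:
        for w in set(s):  # count each word once per sentence
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: -math.log(c / M) for w, c in doc_freq.items()}
```

A word that appears in every sentence gets idf 0, and rarer words get larger weights.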

2. Next, we extract a contextual representation of each word in the two sentences s and s′. We use ELMo (Peters et al., 2018) in our approach.


SOW Input: removing the NN from NP
SOW Output: excluding this NN from NP

SOW Input: they might consider VP if NP were imposed
SOW Output: in the case of imposition of NP , they would consider VP

SOW Input: NP lingered in the deserted NNS .
SOW Output: in the abandoned NNS , there was NP .

SOW Input: PP was a black NN archway .
SOW Output: was a black NN passage PP .

SOW Input: there is already a ring NN PP .
SOW Output: PP circular NN exist .

Table 6: Examples of aligned phrase pairs with exactly two sub-phrases abstracted out and replaced with constituent labels. These phrase pairs are used to train the SOW MODEL.

3. In order to compute a similarity score between each pair of phrases (p, p′), we use greedy matching to first align each token in the source phrase to its most similar word in the target phrase. To compute phrase-level similarity, these word-level similarity scores are combined by taking a weighted mean, with weights specified by the idf scores. Formally,

$$R_{p,p'} = \frac{\sum_{w_i \in p} \mathrm{idf}(w_i) \max_{w_j \in p'} w_i^{\top} w_j}{\sum_{w_i \in p} \mathrm{idf}(w_i)}$$

$$P_{p,p'} = \frac{\sum_{w_j \in p'} \mathrm{idf}(w_j) \max_{w_i \in p} w_i^{\top} w_j}{\sum_{w_j \in p'} \mathrm{idf}(w_j)}$$

$$F_{p,p'} = \frac{2 P_{p,p'} R_{p,p'}}{P_{p,p'} + R_{p,p'}}$$

This scoring procedure is exactly the same as the one proposed by Zhang et al. (2020) to evaluate sentence and phrase similarities.
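The greedy-matching F-score above can be sketched as follows; `sim` stands in for the dot product of contextual (ELMo) word embeddings, and the exact-match similarity used in the usage note is only to keep the example self-contained.

```python
def phrase_similarity(p, p2, idf, sim):
    """Greedy-matching F-score between token lists p and p2.

    idf: dict mapping each token to its idf weight.
    sim: word-level similarity function, standing in for w_i^T w_j.
    """
    def directed(src, tgt):
        # Each source word greedily matches its most similar target word;
        # scores are combined by an idf-weighted mean.
        num = sum(idf[w] * max(sim(w, w2) for w2 in tgt) for w in src)
        den = sum(idf[w] for w in src)
        return num / den

    R = directed(p, p2)   # recall direction
    P = directed(p2, p)   # precision direction
    if P + R == 0:
        return 0.0
    return 2 * P * R / (P + R)
```

With exact-match similarity and uniform idf weights, two phrases sharing half their tokens score 0.5.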

4. Finally, the phrases p ∈ s and p′ ∈ s′ are aligned if:

$$p = \operatorname*{argmax}_{p_i \in s} F_{p_i, p'} \quad \text{and} \quad p' = \operatorname*{argmax}_{p_j \in s'} F_{p, p_j}$$

These aligned sets of phrase pairs (p, p′) are used to construct tuples (t_A, B, C) and (t′_A, B′, C′), as outlined in Section 3.2. Table 6 provides examples of such phrase pairs.
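The mutual-argmax alignment criterion can be sketched as:

```python
def align_phrases(src_phrases, tgt_phrases, score):
    """Keep (p, p') only if each is the other's argmax under `score`."""
    pairs = []
    for p in src_phrases:
        best_tgt = max(tgt_phrases, key=lambda q: score(p, q))
        best_src = max(src_phrases, key=lambda q: score(q, best_tgt))
        if best_src == p:  # mutual best match
            pairs.append((p, best_tgt))
    return pairs
```

This one-to-one filter discards phrases whose best match prefers some other phrase, which keeps only high-confidence alignments.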

C SOW Model Architecture

Figure 6 provides an overview of the SOW seq2seq model. We add POS tag embeddings (or corresponding constituent label embeddings for abstracted X and Y) to the input token embeddings and original order position embeddings. As outlined in Section 2.3, another set of position embeddings corresponding to the order preference, either MONOTONE or FLIP, is further added to the output of the final layer of the encoder. The decoder attends over these augmented encodings during both training and inference.

Figure 6: Source Order reWriting (SOW) model. Our model encodes the order preference MONOTONE or FLIP through position embeddings added to the encoder output.
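The order-preference position embeddings can be sketched in a few lines of numpy; the real model adds learned embeddings inside a Transformer encoder, so this is purely illustrative.

```python
import numpy as np

def add_order_embeddings(encoder_out, order, embed_table):
    """Add position embeddings for a target ordering to the encoder output.

    encoder_out: (n, d) final-layer encoder states.
    order: length-n list; order[i] is the desired output position of token i.
    embed_table: (max_pos, d) position embedding table (learned in practice).
    """
    return encoder_out + embed_table[np.asarray(order)]
```

The decoder then attends over these shifted encodings, which softly biases it toward emitting source content in the requested order.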

D Syntactic Exemplars

Table 7 provides an example from the test set of Chen et al. (2019b). The output retains the semantics of the input sentence while following the structure of the exemplar.

I: his teammates eyes got an ugly, hostile expression.
E: the smell of flowers was thick and sweet.
O: the eyes of his teammates had turned ugly and hostile.

Table 7: Example of input (I), syntactic exemplar (E), and the reference output (O) from the evaluation test set of Chen et al. (2019b).

E Example Generations

In Table 8, we provide examples of paraphrases generated by our system (SOW-REAP) and the baseline SCPN (Iyyer et al., 2018) system. We additionally include the phrase-level transductions applied to obtain the sentence-level reordering by our system (column 1) and the input template that the corresponding SCPN generation was conditioned on (column 3).


Input: the public tender result message normally contains the following information :

Rules (SOW): NP normally contains the following NN : → the following NN usually contains in NP :
Output (REAP): the following information shall normally be included in the public procurement report :
Template (SCPN): SBARQ ( WHADVP SQ . )
Output (SCPN): where is the public procurement report report usually contains the following information .

Rules (SOW): NP normally VP : → usually VP , NP ; VBZ the following NN → the NN VBZ
Output (REAP): normally the following information shall be included in the public procurement result report :
Template (SCPN): S ( PP , NP VP . )
Output (SCPN): in the public competition , the report on competition contains the following information .

Input: the story of obi-wan kenobi ends here .

Rules (SOW): NP VP . → VP is NP ; the NN of NP → NP NN .
Output (REAP): end of the obi-wan kenobi story .
Template (SCPN): S ( VP . )
Output (SCPN): tell the story of obi-wan kenobi .

Rules (SOW): the story PP NNS here . → there NNS a story PP .
Output (REAP): here ends the story of obi-wan kenobi .
Template (SCPN): S ( S , CC S . )
Output (SCPN): the story of obi-wan kenobi is here , and it ends here .

Input: i leased it before i knew where the money came from .

Rules (SOW): i VBN it before i VP . → before i VP , i VBN it .
Output (REAP): before i knew where the money came from , i rented it .
Template (SCPN): SBARQ ( WHADVP SQ . )
Output (SCPN): where did you learn that it was the money ?

Rules (SOW): NP knew SBAR . → SBAR , S knew .
Output (REAP): where the money came from , i lent it to me before i knew .
Template (SCPN): S ( NP VP . )
Output (SCPN): i borrowed money before i knew where the money came from .

Input: priority actions should be more clearly specified in future reviews .

Rules (SOW): NP should be more clearly specified PP . → PP , NP should be clearly specified .
Output (REAP): in future reviews , priority measures should be more clearly specified .
Template (SCPN): S ( S , CC S . )
Output (SCPN): priority actions should be more clearly specified in future reviews , and they should be informed .

Rules (SOW): ADVP VBN in future reviews → VBN in future reviews ADVP
Output (REAP): priority measures should be specified in future reviews clearly .
Template (SCPN): SBARQ ( WHADVP SQ . )
Output (SCPN): where should priority actions are more clearly specified in future reviews ?

Input: okay , well , tonight the occasion is calling .

Rules (SOW): ADJP , S . → S , ADJP . ; well , NN the occasion VP → the occasion VP , NN
Output (REAP): the occasion is calling today , okay ?
Template (SCPN): S ( NP VP . )
Output (SCPN): the opportunity is calling .

Rules (SOW): ADJP , S . → S , ADJP . ; well , NP VBZ calling → VBZ calling NP
Output (REAP): we 'll call it tonight , okay ?
Template (SCPN): S ( ADVP NP VP . )
Output (SCPN): of course , the occasion is calling .

Input: a minor risk considering the number of telephones in new york .

Rules (SOW): a JJ risk considering NP . → NP is a JJ risk . ; the NN of NP → NP NN
Output (REAP): phones in new york are a minor risk considering .
Template (SCPN): SBARQ ( WHADVP SQ . )
Output (SCPN): when do you consider the number of telephones in new york ?

Rules (SOW): NP1 considering NP2 . → considering NP2 for NP1 ; NN of NP → NP NN ; NP in JJ york → JJ york NP
Output (REAP): in new york , the number of phones is a minor risk .
Template (SCPN): FRAG ( SBAR ) .
Output (SCPN): that minor risk is the number of telephones in new york .

Input: that dress gets me into anywhere i want .

Rules (SOW): that S i VBP . → i VBP S .
Output (REAP): i want that dress gets me into the place .
Template (SCPN): NP ( NP . )
Output (SCPN): that dress gets me in there , i wish .

Rules (SOW): that S i VBP . → i VBP S . ; NN gets me PP → PP , NN gets me .
Output (REAP): i want a dress in front of me .
Template (SCPN): S ( VP . )
Output (SCPN): i want everywhere .

Table 8: Examples of paraphrases generated by our system and the baseline SCPN model. The outputs from our model successfully rearrange the different structural components of the input sentence to obtain meaningful rearrangements. SCPN, on the other hand, tends to conform to pre-specified templates that are often not aligned with a given input.
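The rule format in Table 8 can be made concrete with a minimal sketch. This is a hypothetical helper, not the paper's code: each rule is a (source side, target side) pair over abstracted constituent labels, and applying it amounts to filling each label on the target side with the surface text of the corresponding source chunk.

```python
def apply_rule(chunks, rule):
    """Minimal sketch (hypothetical helper) of applying one phrase-level
    transduction rule from Table 8. chunks maps each abstracted
    constituent label to its surface text; tokens of the rule's target
    side that match a label are replaced by that chunk, all other
    tokens are copied through."""
    _, target = rule
    return " ".join(chunks.get(tok, tok) for tok in target.split())
```

For example, the rule "NP VP . → VP is NP" applied to "the story of obi-wan kenobi ends here ." with NP = "the story of obi-wan kenobi" and VP = "ends here" yields "ends here is the story of obi-wan kenobi".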


F Implementation Details

Table 9 and Table 10 list the hyperparameter values used in the REAP and SOW models, respectively. Note that we do not use the coverage loss for the SOW model.

Seq2seq transformer architecture
  Hidden size: 256
  Num layers: 2
  Num heads: 8
  Dropout: 0.1

Training
  Optimizer: Adam, β = (0.9, 0.999), ε = 10^-8
  Learning rate: 0.0001
  Batch size: 32
  Epochs: 50 (maximum)
  Coverage loss coeff.: 1 (first 10 epochs), 0.5 (epochs 10-20), 0 (rest)

Inference
  k in top-k: 20
  Beam size: 10

Table 9: Hyperparameters used in the implementation of the REAP model.
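The staged coverage-loss coefficient from Table 9 can be written as a small schedule function. This is a sketch of the schedule as stated in the table, assuming 0-indexed epochs; the paper does not specify the indexing convention.

```python
def coverage_loss_coeff(epoch):
    """Coverage-loss coefficient schedule from Table 9 (REAP training):
    1 for the first 10 epochs, 0.5 for epochs 10-20, and 0 afterwards.
    Epochs are assumed to be 0-indexed."""
    if epoch < 10:
        return 1.0
    if epoch < 20:
        return 0.5
    return 0.0
```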

Seq2seq transformer architecture
  Hidden size: 256
  Num layers: 2
  Num heads: 8
  Dropout: 0.1

Training
  Optimizer: Adam, β = (0.9, 0.999), ε = 10^-8
  Learning rate: 0.0001
  Batch size: 32
  Epochs: 50 (maximum)

Recombination of rules/transductions
  Ignored tags: DT, IN, CD, MD, TO, PRP
  Max. no. of rules: 3

Table 10: Hyperparameters used in the implementation of the SOW model.
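The recombination constraints in Table 10 can be sketched as a filtering step. This is a hypothetical simplification, not the paper's implementation: rules whose abstracted slots consist only of the ignored function-word tags are discarded, and at most three rules are combined per sentence.

```python
# Function-word tags excluded from rule recombination (Table 10).
IGNORED_TAGS = {"DT", "IN", "CD", "MD", "TO", "PRP"}
MAX_RULES = 3  # maximum number of rules combined per sentence

def select_rules(candidate_rules):
    """Hypothetical sketch of the recombination constraints: each rule
    is represented by the list of its abstracted slot labels. Rules
    whose slots are all ignored tags are dropped, and at most
    MAX_RULES rules are kept, preserving the candidate order."""
    kept = [r for r in candidate_rules
            if any(tag not in IGNORED_TAGS for tag in r)]
    return kept[:MAX_RULES]
```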