lattice transformer for speech translation

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6475–6484Florence, Italy, July 28 - August 2, 2019. c©2019 Association for Computational Linguistics

6475

Lattice Transformer for Speech Translation

Pei Zhang∗, Boxing Chen∗, Niyu Ge∗, Kai Fan∗†Alibaba Group Inc.

{xiaoyi.zp,boxing.cbx,niyu.ge,k.fan}@alibaba-inc.com

Abstract

Recent advances in sequence modeling havehighlighted the strengths of the transformer ar-chitecture, especially in achieving state-of-the-art machine translation results. However, de-pending on the up-stream systems, e.g., speechrecognition, or word segmentation, the inputto translation system can vary greatly. Thegoal of this work is to extend the attentionmechanism of the transformer to naturallyconsume the lattice in addition to the tradi-tional sequential input. We first propose ageneral lattice transformer for speech transla-tion where the input is the output of the au-tomatic speech recognition (ASR) which con-tains multiple paths and posterior scores. Toleverage the extra information from the lat-tice structure, we develop a novel control-lable lattice attention mechanism to obtain la-tent representations. On the LDC Spanish-English speech translation corpus, our exper-iments show that lattice transformer general-izes significantly better and outperforms botha transformer baseline and a lattice LSTM.Additionally, we validate our approach on theWMT 2017 Chinese-English translation taskwith lattice inputs from different BPE segmen-tations. In this task, we also observe the im-provements over strong baselines.

1 Introduction

Transformer based encoder-decoder framework(Vaswani et al., 2017) for Neural Machine Trans-lation (NMT) has currently become the state-of-the-art in many translation tasks, significantly im-proving translation quality in text (Bojar et al.,2018; Fan et al., 2018) as well as in speech (Janet al., 2018). Most NMT systems fall into the cat-egory of Sequence-to-Sequence (Seq2Seq) model(Sutskever et al., 2014), because both the input and

∗indicates equal contribution.†corresponding author.

0: <s>

1: iban

2: ivan 6: esquinas

5: así

4: esquinas

3: espinas

7: así

8:entonces 9: </s>

0.87 0.87 1

0.13 0.13 1

0.13 0.11 1

0.87 0.76 0.87

1 0.11 0.11

1 0.13 0.13

1 0.89 0.89

1 1 1 1 1 1

1 1 1

x0

x1

x2 x6

x5

x4

x3

x7

x8 x9

0: <s> 1: iban 2: esquinas 3: así 4:entonces 5: </s>

x0 x1 x2 x3 x4 x5

Attention in Standard Transformer Encoder

Attention in Lattice Transformer Encoder

Figure 1: Illustration of our proposed attention mech-anism (best viewed in color). Our attention dependson the tokens of common paths and forward (blue) /marginal (grey) / backward (orange) probability scores.

output consist of sequential tokens. Therefore, inmost neural speech translation, such as that of (Bo-jar et al., 2018), the input to the translation sys-tem is usually the 1-best hypothesis from the ASRinstead of the word lattice output with its corre-sponding probability scores.

How to consume word lattice rather than se-quential input has been substantially researched inseveral natural language processing (NLP) tasks,such as language modeling (Buckman and Neu-big, 2018), Chinese Named Entity Recognition(NER) (Zhang and Yang, 2018), and NMT (Suet al., 2017). Additionally, some pioneering works(Adams et al., 2016; Sperber et al., 2017; Osamuraet al., 2018) demonstrated the potential improve-ments in speech translation by leveraging the ad-ditional information and uncertainty of the packedlattice structure produced by ASR acoustic model.

Efforts have since continued to push theboundaries of long short-term memory (LSTM)(Hochreiter and Schmidhuber, 1997) models.More precisely, most previous works are in linewith the existing method Tree-LSTMs (Tai et al.,

6476

2015), adapting to task-specific variant Lattice-LSTMs that can successfully handle lattices androbustly establish better performance than theoriginal models. However, the inherently sequen-tial nature still remains in Lattice-LSTMs due tothe topological representation of the lattice graph,precluding long-path dependencies (Khandelwalet al., 2018) and parallelization within trainingexamples that are the fundamental constraint ofLSTMs.

In this work, we introduce a generalization ofthe standard transformer architecture to acceptlattice-structured network topologies. The stan-dard transformer is a transduction model relyingentirely on attention modules to compute latentrepresentations, e.g., the self-attention requires tocalculate the intra-attention of every two tokensfor each sequence example. Latest works such as(Yu et al., 2018; Devlin et al., 2018; Lample et al.,2018; Su et al., 2018) empirically find that trans-former can outperform LSTMs by a large mar-gin, and the success is mainly attributed to self-attention. In our lattice transformer, we proposea lattice relative positional attention mechanismthat can incorporate the probability scores of ASRword lattices. The major difference with the self-attention in transformer encoder is illustrated inFigure 1.

We first borrow the idea from the relative po-sitional embedding (Shaw et al., 2018) to maxi-mally encode the information of the lattice graphinto its corresponding relative positional matrix.This design essentially does not allow a token topay attention to any token that has not appearedin a shared path. Secondly, the attention weightsdepend not only on the query and key represen-tations in the standard attention module, but alsoon the marginal / forward / backward probabil-ity scores (Rabiner, 1989; Post et al., 2013) de-rived from the upstream systems (such as ASR).Instead of 1-best hypothesis alone (though it isbased on forward scores), the additional probabil-ity scores have rich information about the distri-bution of each path (Sperber et al., 2017). It is inprinciple possible to use them, for example in at-tention weights reweighing, to increase the uncer-tainty of the attention for other alternative tokens.

Our lattice attention is controllable and flexi-ble enough for the utilization of each score. Thelattice transformer can readily consume the lat-tice input alone if the scores are unavailable. A

common application is found in the Chinese NERtask, in which a Chinese sentence could possi-bly have multiple word segmentation possibilities(Zhang and Yang, 2018). Furthermore, differentBPE operations (Sennrich et al., 2016) or proba-bilistic subwords (Kudo, 2018) can also bring sim-ilar uncertainty to subword candidates and form acompact lattice structure.

In summary, this paper makes the followingmain contributions. i) To our best knowledge, weare the first to propose a novel attention mecha-nism that consumes a word lattice and the proba-bility scores from the ASR system. ii) The pro-posed approach is naturally applied to both theencoder self-attention and encoder-decoder atten-tion. iii) Another appealing feature is that the lat-tice transformer can be reduced to standard lattice-to-sequence model without probability scores, fit-ting the text translation task. iv) Extensive ex-periments on speech translation datasets demon-strate that our method outperforms the previoustransformer and Lattice-LSTMs. The experimenton the WMT 2017 Chinese-English translationtask shows the reduced model can improve manystrong baselines such as the transformer.

2 Background

We first briefly describe the standard transformerthat our model is built upon, and then elaborate onour proposed approach in the next section.

2.1 Transformer

The Transformer follows the typical encoder-decoder architecture using stacked self-attention,point-wise fully connected layers, and theencoder-decoder attention layers. Each layer isin principle wrapped by a residual connection (Heet al., 2016) and a postprocessing layer normal-ization (Ba et al., 2016). Although in principle,it is not necessary to mask for self-attentions inthe encoder, in practical implementation it is re-quired to mask the padding positions. However,self-attention in the decoder only allows positionsup to the current one to be attended to, preventinginformation flow from the left and preserving theauto-regressive property. The illegal connectionswill be masked out by setting as −109 before thesoftmax operation.

6477

2.2 Dot-product Attention

Suppose that for each attention layer in the trans-former encoder and decoder, we have two inputsequences that can be presented as two matricesX ∈ Rn×d and Y ∈ Rm×d, where n,m arethe lengths of source and target sentences respec-tively, and d is the hidden size (usually equal toembedding size), the output is h new sequencesZi ∈ Rn×d/h or ∈ Rm×d/h, where h is the num-ber of heads in attention. In general, the result ofmulti-head attention is calculated according to thefollowing procedure.

Q = XWQ or YWQ or YWQ (1)

K = XWK or YWK or XWK (2)

V = XW V or YW V or XW V (3)

Zi = Softmax

(QK>√d/h

+ IdM

)V (4)

Z = Concat(Z1, ..., Zh)WO (5)

where the matrices WQ,WK ,W V ∈ Rd×d/h andWO ∈ Rd×d represent the learnable projectionparameters, and the masking matrix M ∈ Rm×m

is an upper triangular matrix with zero on the di-agonal and non-zero (−109) everywhere else.

Note that i) the three columns in the right-sideof Eq (1,2,3) are used to compute the encoderself-attention, the decoder self-attention, and theencoder-decoder attention respectively, ii) Id is theindicator function that returns 1 if it computes de-coder self-attention and 0 otherwise, iii) the pro-jection parameters are unique per layer and head,iv) the Softmax in Eq (4) means a row-wise ma-trix operation, computing the attention weights byscaled dot product and resulting in a simplex ∆n

for each row.

3 Lattice Transformer

As motivated in the introduction, our goal is to en-hance the standard transformer architecture, whichis limited to sequential inputs, to consume latticeinputs with additional information from the up-stream ASR systems.

3.1 Lattice Representation

Without loss of generality, we assume a word lat-tice from ASR system to be a directed, connectedand acyclic graph following a topological orderingsuch that a child node comes after its parent nodes.

x0

x1

x2 x6

x5

x4

x3

x7

x8 x9

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

x0 0 1 1 2 2 3 2 3 4 5

x1 -1 0 -inf 1 1 2 X 2 3 4

x2 -1 X 0 -inf -inf X 1 2 3 4

x3 -2 -1 -inf 0 -inf 1 -inf -inf 2 3

x4 -2 -1 -inf -inf 0 -inf -inf 1 2 3

x5 -3 -2 -inf -1 -inf 0 -inf -inf 1 2

x6 -2 -inf -1 -inf inf -inf 0 1 2 3

x7 -3 -2 -2 -inf -1 -inf -1 0 1 2

x8 -4 -3 -3 -2 -2 -1 -2 -1 0 1

x9 -5 -4 -4 -3 -3 -2 -3 -2 -1 0

0

2

-1

-1-3

1

-2

-2

-inf-inf

Figure 2: An example of the lattice relative positionmatrix, where “-inf” in the matrix is a special num-ber denoting that no relative position exists between thecorresponding two tokens.

We add two special tokens to each path of the lat-tice, which represent the start of sentence and theend of sentence (e.g., Figure 1), so that the graphhas a single source node and a single end node,where each node is assigned a token.

Given the definition and property describedabove, we propose to use a relative positional lat-tice matrix L ∈ N n×n to encode the graph infor-mation, where n is number of nodes in the graph.For any two nodes i, j in the lattice graph, the ma-trix entry Lij is the minimum relative distance be-tween them. In other words, if the nodes i, j shareat least one path, then we have

Lij = minp∈common paths for i, j

Lpi0 − Lp

j0, (6)

where Lp·0 is the distance to the source node in path

p. If no common path exists for two nodes, wedenote the relative distance as −∞ (−109 in prac-tice) for subsequent masking in the lattice atten-tion. The reason for choosing the “min” in Eq (6)is that in our dataset we find about 70% of Lijscomputed by “min” and “max” are identical, andabout 20% entries just differ by 1. Empirically, ourexperiments also show no significant difference inthe performance of either one.

An illustration to compute the lattice matrix forthe example in the introduction is shown in Fig-ure 2. Since we can deterministically reconstructthe lattice graph from those matrix elements thatare equal to 1, it indicates the relation informationbetween the parent and child nodes.

6478

3.2 Controllable Lattice Attention

Besides the lattice graph representation, the pos-terior probability scores can be simultaneouslyproduced from the acoustic model and languagemodel in most ASR systems. We deliberately de-sign a controllable lattice attention mechanism toincorporate such information to make the attentionencode more uncertainties.

In general, we denote the posterior probabilityof a node i as the forward score fi, where the sum-mation of the forward scores for its child nodes is1. Following the recursion rule in (Rabiner, 1989),we can further derive another two useful probabil-ities, the marginal score mi = fi

∑j∈Pa(i)mj and

the backward score bi = mi/∑

k∈Ch(i)mk, wherePa(i) or Ch(i) denotes node i’s predecessor or suc-cessor set, respectively. Intuitively, the marginalscore measures the global importance of the cur-rent token compared with its substitutes given allpredecessors; the backward score is analogous tothe forward score, which is only locally associ-ated with the importance of different parents totheir children, where the summation of its parentnodes’ scores is 1. Therefore, our controllable at-tention aims to employ marginal scores and for-ward / backward scores.

3.2.1 Lattice EmbeddingWe first construct the latent representations of therelative positional lattice matrix L. The matrix Lcan be straightforwardly decomposed into two ma-trices: one is the mask LM with only 0 and −∞values, and the other is the matrix with regular val-ues i.e., LR = L − LM . Given a 2D embeddingmatrix WL, the embedded vector of LR

ij can bewritten as WL[LR

ij , :] with the NumPy style index-ing. In order to prevent the the lattice embeddingfrom dynamically changing, we have to clip ev-ery entry of LR with a positive integer c1, suchthat WL ∈ R(2c+1)×d/h has a fixed dimensional-ity and becomes learnable parameters.

3.2.2 Attention with Probability ScoresOur proposed controllable lattice attention is de-picted in the left panel of Figure 3. It shows thecomputational graph with detailed network mod-ules. More concretely, we first denote the latticeembedding for LR as a 3D array E ∈ Rn×n×d/h.Then, the attention weights adapted from tradi-tional transformer are integrated with marginal

1clip(l, c) = max(−c,min(l, c))

scores that capture the distribution of each path inthe lattice. The logits in Eq (4) will become the ad-dition of three individual terms (if we temporarilyomit the mask matrix),

QK> + einsum(’ik,ijk->ij’, Q,E)√d/h

+ wmm . (7)

The original QK> will remain since the word em-beddings have the majority of significant seman-tic information. The difficult part in Eq (7) is thenew dot product term involving the lattice embed-ding by einsum2 operation, where einsum is amulti-dimensional linear algebraic array operationin Einstein summation convention. In our case, ittries to sum out the dimension of the hidden size,resulting in a new 2D array ∈ Rn×n, which is fur-ther be scaled by 1√

d/has well. In addition, we ag-

gregate the scaled marginal score vector m ∈ Rn

together to obtain the logits. With the new param-eterization, each term has an intuitive meaning:term i) represents semantic information, term ii)governs the lattice-dependent positional relation,term iii) encodes the global uncertainty of the ASRoutput.

The attention logits associated with the for-ward or backward scores are much different frommarginal scores, since they govern the local in-formation between the parent and child nodes.They are represented as a matrix rather than a vec-tor, where the matrix has only non-zero values ifnodes i, j have a parent-child relation in the latticegraph. First, an upper or lower triangular maskmatrix is used to enforce every token’s attention tothe forward scores of its successors or the back-ward scores of its predecessors. It seems counter-intuitive but the reason is that the summation ofthe forward scores for each token’s child nodes is1. So is the backward scores of each token’s par-ent nodes. Secondly, before applying the softmaxoperation, the lattice mask matrix LM is addedto each logits to prevent attention from crossingpaths. Eventually, the final attention vector used tomultiply the value representation V is a weighedaveraging of the three proposed attention vectors

2This op is available in NumPy, TensorFlow, or PyTorch.In our example, Q and E are 2D and 3D arrays, and the re-sult of this op is a 2D array, with the element in ith row, jthcolumn is

∑k QikEijk.

6479

MatMul

Q K

MatMul

Q Lattice Embedding

Add

Lattice Mask

SoftMax

Marginal Scores

Scale

Add

Forward / Backward Scores

MatMul

V

Upper triangular Lower triangular

Lattice Mask Lattice Mask

SoftMax SoftMax

!"

!# !$

ScaleScaleInput

Embedding

Lattice Word Inputs

Output Embedding

OutputsLattice Matrix

Lattice Embedding

Controllable Lattice MaskedMulti-Head Attention Scale

marginal scores

Add & Norm

Feed Froward

Add & Norm

F/B scores

Split & Mask

Multi-Head Attention

Masked Multi-Head Attention

Add & Norm

Positional Encoding

Add & Norm

Feed Froward

Add & Norm

Linear

Softmax

Output Probabilities

Scale

N x

N x

Figure 3: Left panel: the controllable lattice attention, where sm, sf , sb are learnable scalars and sm+sf +sb = 1.Right panel: the overall model architecture of lattice transformer.

A· with different probability scores s·,

Afinal =smAm + sfAf + sbAb, (8)

s.t. sm + sf + sb = 1 .

In summary, the overall architecture of latticetransformer is illustrated in the right of Figure 3.

3.2.3 DiscussionA critical point for the lattice transformer iswhether the model can generalize to other com-mon lattice-based inputs. More specifically, howdoes the model apply to the lattice input withoutprobability scores? And to what extent can wetrain the lattice model on a regular sequential in-put? If probability scores are unavailable, we canuse the lattice graph representations alone by set-ting the scalar wm = 0 in Eq (7) and sf = sb = 0,sm = 1 in Eq (8) as non-trainable constants.We validate this viewpoint on the Chinese-Englishtranslation task, where the Chinese input is a purelattice structure derived from different tokeniza-tions. As to sequential inputs, it is just a specialcase of the lattice graph with only one path.

An interesting point to mention is that ourencoder-decoder attention also takes the key andvalue representations from the lattice input and ag-gregates the marginal scores, though the sequen-tial target forbids us to use lattice self-attention inthe decoder. However, we can still visualize howthe sequential target attends to the lattice input.

A practical point for the lattice transformer iswhether the training or inference time for sucha seemingly complicated architecture is accept-able. In our implementation, we first preprocess

the lattice input to obtain the position matrix forthe whole dataset, thus the one-time preprocess-ing will bring almost no over-head to our trainingand inference. In addition, the extra enisum oper-ation in controllable lattice attention is the mosttime-consuming computation, but remaining thesame computational complexity as QK>. Em-pirically, in the ASR experiments, we found thatthe training and inference of the most complicatedlattice transformer (last row in the ablation study)take about 100% and 40% more time than standardtransformer; in the text translation task, our algo-rithm takes about 30% and 20% more time duringtraining and inference.

4 Experiments

We mainly validate our model in two scenarios,speech translation with word lattices and poste-rior scores, and Chinese to English text translationwith different BPE operations on the source side.

4.1 Speech TranslationFor the speech translation experiment, we usethe Fisher and Callhome Spanish-English SpeechTranslation Corpus from LDC (Post et al., 2013),which is produced from telephone conversations.Our baseline models are the vanilla Transformerwith relative positional embeddings (Vaswaniet al., 2017; Shaw et al., 2018), and Lattice-LSTMs (Sperber et al., 2017).

4.1.1 DatasetsThe Fisher corpus includes the contents betweenstrangers, while the Callhome corpus is primarily

6480

between friends and family members. The num-bers of sentence pairs of the two datasets are re-spectively 138,819 and 15,080. The source sideSpanish corpus consists of four data types: ref-erence (human transcripts), oracle of ASR lat-tices (the optimal path with the lowest word errorrate (WER)), ASR 1-best hypothesis, and ASRlattice. For the data processing, we make case-insensitive tokenization with the standard moses3

tokenizer for both the source and target transcripts,and remove the punctuation in source sides. Thesentences of the other three types have been al-ready been lowercased and punctuation-removed.To keep consistent with the lattices, we add a to-ken “<s>” at the beginning for all cases.

Setting DescriptionR baseline, trained with human transcripts only

R+1 fine-tuned on 1-best hypothesisR+L fine-tuned on lattices without probability scores

R+L+S fine-tuned on lattices with probability scores

Table 1: 4 systems for comparison

4.1.2 Training and Cross-Evaluation4 systems in Table 1 are trained for both Lattice-LSTMs and Lattice Transformer. For fair andcomprehensive comparison, we also evaluate allalgorithms on the inputs of four types. We ini-tially train the baseline of our lattice transformerwith the human transcripts on Fisher/Train dataalone, which is equivalent to the modified trans-former (Shaw et al., 2018). Then we fine-tune thepre-trained model with 1-best hypothesis or wordlattices (and probability scores) for either Fisheror Callhome dataset.

The source and target vocabularies are builtrespectively from the transcripts of Fisher/Trainand Callhome/Train corpus, with vocabulary sizes32000 and 20391. The hyper-parameters of ourmodel are the same as Transformer-base with512 hidden size, 6 attention layers, 8 atten-tion heads and beam size 4. We use the sameoptimization strategy as (Vaswani et al., 2017)for pre-training with 4 GPU cards, and applySGD with constant learning rate 0.15 for fine-tuning. We select the best performed modelbased on Fisher/Dev or Callhome/Dev, and test onFisher/Dev2, Fisher/Test or Callhome/Test.

To better analyze the performance of our ap-proach, we use an intensive cross-evaluation

3https://github.com/moses-smt/mosesdecoder

method, i.e., we feed 4 possible inputs to test dif-ferent models. The cross-evaluation results are putinto several 4 × 4 blocks in Table 2 and 3. As theaforementioned discussion, if the input is not ASRlattice, the evaluation on the model R+L+S needsto set wm = sf = sb = 0, sm = 1. If the input isan ASR lattice but fed into the other three models,the probability scores are in fact discarded.

4.1.3 Results on Fisher and CallhomeWe mainly compare our architecture with the pre-vious Lattice-LSTMs (Sperber et al., 2017) andthe transformer (Shaw et al., 2018) in Table 2.Since the transformer itself is a powerful architec-ture for sequence modeling, the BLEU scores ofthe baseline (R) have significant improvement ontest sets. In addition, fine-tuning without scoreshasn’t outperformed the 1-best hypothesis fine-tuning, but has about 0.5 BLEU improvement onoracle and transcript inputs. We suspect this maybe due to the high ASR WER and if the ASR sys-tem has a lower WER, the lattice without scorefine-tuning may get a better translation. We willleave this as a future research direction on otherdatasets from better ASR systems. For now, wejust validate this argument in the BPE lattice ex-periments, and detailed discussion sees next sec-tion. As to fine-tuning with both lattices andprobability scores, it increases the BLEU with arelatively large margin of 0.9/1.0/0.7 on FisherDev/Dev2/Test sets. Besides, for ASR 1-best in-puts, it is still comparable with the R+1 systems,while for oracle and transcript inputs, there areabout 0.5-0.9 BLEU score improvements.

The results of Callhome dataset are all fine-tuned from the pre-trained model based onFisher/Train corpus, since the data size of Call-home is too small to train a large deep learningmodel. This is the reason why we adopt the strat-egy for domain adaption. We use the same methodfor model selection and test. The detailed resultsin Table 3 show the consistent performance im-provement.

4.1.4 Inference AnalysisOn the test datasets of Fisher and Callhome, wemake an inference for predicting the translations,and some examples are shown in Table 4. Wealso visualize the alignment for both encoder self-attention and encoder-decoder attention for the in-put and predicted translation. Two examples areillustrated in Figure 4 and 5. As expected, the to-

6481

Architecture Inference InputsFisher dev Fisher dev2 Fisher test

R R+1 R+L R+L+S R R+1 R+L R+L+S R R+1 R+L R+L+S

Lattice LSTMreference - - - - 53.9 53.8 53.7 54 52.2 51.8 52.2 52.7

oracle - - - - 44.9 45.6 45.2 45.2 44.4 44.6 44.6 44.8ASR 1-best - - - - 35.8 37.1 36.2 36.2 35.9 36.6 36.2 36.4ASR Lattice - - - - 25.9 25.8 36.9 38.5 26.2 25.8 36.1 38

Lattice Transformerreference 57.1 55.0 55.5 55.5 58.0 56.1 56.4 56.6 56.0 53.7 54.1 54.2

oracle 46.3 46.2 46.8 46.7 47.1 47.0 47.5 47.9 46.8 46.4 46.9 46.9ASR 1-best 36.5 37.4 37.6 37.4 37.4 38.4 38.3 38.6 37.7 38.5 38.2 38.4ASR Lattice 32.9 33.8 37.7 38.3 33.4 34.0 38.6 39.4 33.5 33.7 37.9 39.2

Table 2: Cross-Evaluation of BLEU on Fisher. Note that for the lattice transformer architecture with R or R+1 set-ting, the resulted model is equivalent to a standard transformer with relative positional embeddings. The evaluationof oracle inputs is similar to ASR 1-best, but it can indicate an upper bound of the performance. The evaluationresults of Lattice LSTM on Fisher dev are not reported in (Sperber et al., 2017).

Architecture Inference InputsCallhome devtest

R R+1 R+L R+L+S

Lattice Transformer

reference 28.3 29.6 30.0 30.4oracle 17.7 19.7 19.5 19.6

ASR 1-best 13.4 15.2 14.8 15.1ASR Lattice 13.4 13.4 15.6 15.7

Callhome evltestR R+1 R+L R+L+S

Lattice LSTM



Lattice Transformer



Table 3: Cross-Evaluation of BLEU on Callhome.

kens from different paths will not attend to eachother, e.g., “pero” and “perdon” in Figure 4 or“hoy” and “y” in Figure 5. In Figure 4, we ob-serve that the 1-best hypothesis can even result inerroneous translation “sorry, sorry”, which is sup-posed to be “but in peru”. In Figure 5, the transla-tion from 1-best hypothesis obviously misses theimportant information “i heard it”. We primar-ily attribute such errors to the insufficient infor-mation within 1-best hypothesis, but if the latticetransformer is appropriately trained, the transla-tions from lattice inputs can possibly correct them.Due to limited space, more visualization examplescan be found in supplementary material.

4.1.5 Model Ablation Study

We conduct an ablation study to examine the ef-fectiveness of every module in the lattice trans-former. We gradually add one module from astandard transformer model to our most compli-cated lattice transformer. From the results in Ta-ble 5, we can see that the application of marginalscores in encoder or decoder has the most influ-ential impact on the lattice fine-tuning. Further-

more, the superimposed application of marginalscores in both encoder and decoder can gain anadditional promotion, compared to individual ap-plications. However, the use of forward and back-ward scores has no significantly extra rewards inthis situation. Perhaps due to overfitting, the mostcomplicated lattice transformer on the Callhomeof smaller data size cannot achieve better BLEUsthan simpler models.

4.2 Chinese English Text Translation

In this experiment, we demonstrate the perfor-mance of our lattice transformer when the prob-ability scores are unavailable. The compari-son baseline method is the vanilla transformer(Vaswani et al., 2017) in both base and big set-tings.

4.2.1 Datasets and SettingsThe Chinese to English parallel corpus for WMT2017 news task contains about 20 million sen-tences after deduplication. For Chinese word seg-mentation, we use Jieba4 as the baseline (Zhanget al., 2018; Hassan et al., 2018), while the Englishsentences are tokenized by moses tokenizer. Somedata filtering tricks have been applied, such as theratio within [1/3, 3] of lengths between source andtarget sentence pairs and the count of tokens inboth sides (≤ 200).

Then for the Chinese source corpus, we learnthe BPE tokenization with 16K / 32K / 48K op-erations, while for the English target corpus, weonly learn the BPE tokenization with 32K opera-tions. In this way, each Chinese input can be rep-resented as three different tokenized results, thusbeing ready to construct a word lattice.

The hyper-parameters of our model are the

4https://github.com/fxsjy/jieba

6482

src transcript que tal , eh , yo soy guillermo , ¿ como estas ? porque como esto tiene que ir avanzando ¿ no ? pues , ¿ y llevas muchos anos aquı en atlanta ? quererlo y tener fe .tgt reference how are you , eh i ’m guillermo , how are you ? because like this has to be moving forward , no ? well . and you ’ve been many years here in atlanta ? to love him and have faith .ASR 1-best quedar eh yo soy guillermo como estas porque como esto tiene que ir avanzando no pas lleva muchos aos aqu en atlanta quieren lo y tener femt from R+1 stay . eh , i ’m guillermo . how are you ? why do you have to move forward or not ? country has been many years here in atlanta they want to have faith

ASR latticequedar que que eh yo soy dar eh yo tal eh yo soyguillermo cmo comprar con como esta estas

porque como esto tiene que ir avanzando no paıs well lleva lleva muchos anos aquı en atlantaquieren quererlo lo y tenertenerse fe y tener tenerse fe

mt from R+L+S how are you ? i ’m guillermo . how are you ? because since this has to move forward , right ? well , you ’ve been here many years in atlanta loving him and having faith

Table 4: Translation examples on test sets. Note that the presented ASR lattice does not include lattice information.

Figure 4: Visualization of Lattice Transformer encoderself-attention and encoder-decoder attention for infer-ence. Top panel: ASR 1-best. Bottom panel: ASRlattice. Target reference: “But in Peru, I’ve heard thereare parts where it really gets cold.”

same as the setting with the speech translation inprevious experiments. We follow the optimizationconvention in (Vaswani et al., 2017) to use ADAMoptimizer with Noam invert squared decay. Allof our lattice transformers are trained on 4 P-100GPU cards. Similar to our comparison method,detokenized cased-sensitive BLEU is reported inour experiment.

4.2.2 Results

For our lattice transformer, we have three modelstrained for comparison. First we use the 32K BPEChinese corpus alone to train our lattice trans-former, which is equivalent to the standard trans-

Figure 5: Left panel: ASR 1-best. Right panel: ASRlattice. Target reference: “Yes, yes, I heard it.”

Model Fisher dev2 Fisher test Callhome evltestLSTM (1-best input) 37.1 36.6 13.3Lattice LSTM (lattice input) 36.9 36.1 13.7+lattice prob scores 38.5 38 14.1Transformer (1-best input) 38.4 38.5 14.5Lattice Transformer (lattice input) 38.6 37.9 14.2+ marginal scores in decoder 39.0 38.7 14.4+ marginal scores in encoder 38.8 38.2 14.7+ marginal scores in encoder and decoder 39.5 39.0 14.8+ marginal scores in encoder and decoder,and forward / backward scores onlyin encoder self-attention layer 0 and layer 1

39.6 39.1 14.9

+ marginal scores in encoder and decoder,and forward / backward scoresin all encoder self-attention layers

39.4 39.2 14.7

Table 5: Ablation Experiment BLEU Results. Therows of the Lattice LSTM and the Lattice Transformerrepresent the 1-best hypothesis fine-tuning, and theBLEUs are evaluated on 1-best inputs and on latticeinputs for the others. The colored BLEU values comefrom Table 2 and 3.

former with relative positional embeddings. Sec-ondly, we train another lattice transformer withthe word lattice corpus from scratch. In addition,we follow the convention of the speech transla-tion task in previous experiments by fine-tuningthe first model with word lattice corpus. For eachsetting, the model evaluated on test 2017 datasetis selected from the best model performed on thedev2017 data. The fine-tuning of Lattice Model 3starts from the best checkpoint of Lattice Model1. The BLEU evaluation is shown in Table 6, andtwo examples of attention visualization are shownin Figure 6. Notice that the first two results oftransformer-base and -big are directly copied fromthe relevant references. From the result, we can

6483

Figure 6: Attention visualization for Chinese Englishtranslation task.

see that our Model 1 can be comparable with thevanilla transformer-big model in a base setting,and significantly better than the transformer-basemodel. We also validate the argument that train-ing from scratch can also achieve a better resultthan most baselines. Empirically, we find an inter-esting phenomena that training from scratch con-verges faster than other settings.

5 Conclusions

In this paper, we propose a novel lattice trans-former architecture with a controllable lattice at-tention mechanism that can consume a word lat-tice and probability scores from the ASR system.The proposed approach is naturally applied to both

Architecture Inference Inputs test2017Transformer (Zhang et al., 2018) BPE 32K 23.01Transformer-big (Hassan et al., 2018) BPE 32K 24.201. Transformer with BPE 32K BPE 32K 24.262. Lattice Transformer from scratch lattice 24.713. Lattice Transformer with fine-tuning lattice 24.81

Table 6: BLEU on WMT 2017 Chinese English

the encoder self-attention and encoder-decoder at-tention. We mainly validate our lattice trans-former on speech translation task, and addition-ally demonstrate its generalization to text transla-tion on the WMT 2017 Chinese-English transla-tion task. In general, the lattice transformer canincrease the metric BLEU for translation tasks bya significant margin over many baselines.

Acknowledgments

We thank Nguyen Bach to provide the script forattention visualization.

References

Oliver Adams, Graham Neubig, Trevor Cohn, andSteven Bird. 2016. Learning a translation modelfrom word lattices. In Interspeech.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin-ton. 2016. Layer normalization. arXiv preprintarXiv:1607.06450.

Ondrej Bojar, Christian Federmann, Mark Fishel,Yvette Graham, Barry Haddow, Philipp Koehn, andChristof Monz. 2018. Findings of the 2018 confer-ence on machine translation (wmt18). In Proceed-ings of the Third Conference on Machine Transla-tion: Shared Task Papers, pages 272–303.

Jacob Buckman and Graham Neubig. 2018. Neural lat-tice language models. Transactions of the Associa-tion for Computational Linguistics, 6:529–541.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2018. Bert: Pre-training of deepbidirectional transformers for language understand-ing. arXiv preprint arXiv:1810.04805.

Kai Fan, Bo Li, Fengming Zhou, and Jiayi Wang. 2018.” bilingual expert” can find translation errors. arXivpreprint arXiv:1807.09433.

Hany Hassan, Anthony Aue, Chang Chen, VishalChowdhary, Jonathan Clark, Christian Feder-mann, Xuedong Huang, Marcin Junczys-Dowmunt,William Lewis, Mu Li, et al. 2018. Achieving hu-man parity on automatic chinese to english newstranslation. arXiv preprint arXiv:1803.05567.

6484

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and JianSun. 2016. Deep residual learning for image recog-nition. In Proceedings of the IEEE conference oncomputer vision and pattern recognition, pages 770–778.

Sepp Hochreiter and Jurgen Schmidhuber. 1997.Long short-term memory. Neural computation,9(8):1735–1780.

Niehues Jan, Roldano Cattoni, Stuker Sebastian,Mauro Cettolo, Marco Turchi, and Marcello Fed-erico. 2018. The iwslt 2018 evaluation campaign. InInternational Workshop on Spoken Language Trans-lation, pages 2–6.

Urvashi Khandelwal, He He, Peng Qi, and Dan Ju-rafsky. 2018. Sharp nearby, fuzzy far away: Howneural language models use context. arXiv preprintarXiv:1805.04623.

Taku Kudo. 2018. Subword regularization: Improv-ing neural network translation models with multiplesubword candidates. In Proceedings of the 56th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), volume 1,pages 66–75.

Guillaume Lample, Myle Ott, Alexis Conneau, Lu-dovic Denoyer, et al. 2018. Phrase-based & neu-ral unsupervised machine translation. In Proceed-ings of the 2018 Conference on Empirical Methodsin Natural Language Processing, pages 5039–5049.

Kaho Osamura, Takatomo Kano, Sakriani Sakti, Kat-suhito Sudoh, and Satoshi Nakamura. 2018. Usingspoken word posterior features in neural machinetranslation. architecture, 21:22.

Matt Post, Gaurav Kumar, Adam Lopez, DamianosKarakos, Chris Callison-Burch, and Sanjeev Khu-danpur. 2013. Improved speech-to-text translationwith the fisher and callhome spanish–english speechtranslation corpus. In International Workshop onSpoken Language Translation.

Lawrence R Rabiner. 1989. A tutorial on hiddenmarkov models and selected applications in speechrecognition. Proceedings of the IEEE, 77(2):257–286.

Rico Sennrich, Barry Haddow, and Alexandra Birch.2016. Neural machine translation of rare words withsubword units. In Proceedings of the 54th AnnualMeeting of the Association for Computational Lin-guistics (Volume 1: Long Papers), volume 1, pages1715–1725.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani.2018. Self-attention with relative position represen-tations. In Proceedings of the 2018 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies, Volume 2 (Short Papers), volume 2, pages464–468.

Matthias Sperber, Graham Neubig, Jan Niehues, andAlex Waibel. 2017. Neural lattice-to-sequence mod-els for uncertain inputs. In Proceedings of the 2017Conference on Empirical Methods in Natural Lan-guage Processing, pages 1380–1389.

Jinsong Su, Zhixing Tan, Deyi Xiong, Rongrong Ji, Xi-aodong Shi, and Yang Liu. 2017. Lattice-based re-current neural network encoders for neural machinetranslation. In Thirty-First AAAI Conference on Ar-tificial Intelligence.

Yuanhang Su, Kai Fan, Nguyen Bach, C-C JayKuo, and Fei Huang. 2018. Unsupervised multi-modal neural machine translation. arXiv preprintarXiv:1811.11365.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.Sequence to sequence learning with neural net-works. In Advances in neural information process-ing systems, pages 3104–3112.

Kai Sheng Tai, Richard Socher, and Christopher DManning. 2015. Improved semantic representationsfrom tree-structured long short-term memory net-works. In Proceedings of the 53rd Annual Meet-ing of the Association for Computational Linguisticsand the 7th International Joint Conference on Natu-ral Language Processing (Volume 1: Long Papers),volume 1, pages 1556–1566.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in Neural Information Pro-cessing Systems, pages 5998–6008.

Adams Wei Yu, David Dohan, Minh-Thang Luong, RuiZhao, Kai Chen, Mohammad Norouzi, and Quoc VLe. 2018. Qanet: Combining local convolutionwith global self-attention for reading comprehen-sion. arXiv preprint arXiv:1804.09541.

Yue Zhang and Jie Yang. 2018. Chinese ner using lat-tice lstm. In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguistics(Volume 1: Long Papers), volume 1, pages 1554–1564.

Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, MingZhou, and Enhong Chen. 2018. Regularizing neu-ral machine translation by target-bidirectional agree-ment. arXiv preprint arXiv:1808.04064.

lattice transformer for speech translation

Documents