
Exploiting Deep Representations for Neural Machine Translation

Zi-Yi Dou, Nanjing University

[email protected]

Zhaopeng Tu*, Tencent AI Lab

[email protected]

Xing Wang, Tencent AI Lab

[email protected]

Shuming Shi, Tencent AI Lab

[email protected]

Tong Zhang, Tencent AI Lab

[email protected]

Abstract

Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanisms. In addition, we introduce an auxiliary regularization term to encourage different layers to capture diverse information. Experimental results on the widely-used WMT14 English⇒German and WMT17 Chinese⇒English translation data demonstrate the effectiveness and universality of the proposed approach.

1 Introduction

Neural machine translation (NMT) models have advanced the machine translation community in recent years (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). NMT models generally consist of two components: an encoder network to summarize the input sentence into sequential representations, based on which a decoder network generates the target sentence word by word with an attention model (Bahdanau et al., 2015; Luong et al., 2015).

Nowadays, advanced NMT models generally implement encoder and decoder as multiple layers, regardless of the specific model architecture, such as RNN (Zhou et al., 2016; Wu et al., 2016), CNN (Gehring et al., 2017), or self-attention network (Vaswani et al., 2017; Chen et al., 2018).

* Zhaopeng Tu is the corresponding author of the paper. This work was conducted when Zi-Yi Dou was interning at Tencent AI Lab.

Several researchers have revealed that different layers are able to capture different types of syntax and semantic information (Shi et al., 2016; Peters et al., 2018; Anastasopoulos and Chiang, 2018). For example, Shi et al. (2016) find that both local and global source syntax are learned by the NMT encoder and that different types of syntax are captured at different layers.

However, current NMT models only leverage the top layers of encoder and decoder in the subsequent process, which misses the opportunity to exploit useful information embedded in other layers. Recently, aggregating layers to better fuse semantic and spatial information has proven to be of profound value in computer vision tasks (Huang et al., 2017; Yu et al., 2018). In the natural language processing community, Peters et al. (2018) have shown that simultaneously exposing all layer representations outperforms methods that utilize just the top layer for transfer learning tasks.

Inspired by these findings, we propose to exploit deep representations for NMT models. Specifically, we investigate two types of strategies to better fuse information across layers, ranging from layer aggregation to multi-layer attention. While layer aggregation strategies combine hidden states at the same position across different layers, multi-layer attention allows the model to combine information in different positions. In addition, we introduce an auxiliary objective to encourage different layers to capture diverse information, which we believe makes the deep representations more meaningful.

We evaluated our approach on the two widely-used WMT14 English⇒German and WMT17 Chinese⇒English translation tasks. We employed TRANSFORMER (Vaswani et al., 2017) as the baseline system since it has proven to outperform other architectures on the two tasks (Vaswani et al., 2017; Hassan et al., 2018).


Experimental results show that exploiting deep representations consistently improves translation performance over the vanilla TRANSFORMER model across language pairs. It is worth mentioning that TRANSFORMER-BASE with deep representations exploitation outperforms the vanilla TRANSFORMER-BIG model with less than half of the parameters.

2 Background: Deep NMT

Deep representations have proven to be of profound value in machine translation (Meng et al., 2016; Zhou et al., 2016). Multiple-layer encoders and decoders are employed to perform the translation task through a series of nonlinear transformations from the representation of input sequences to the final output sequences. Each layer can be implemented as an RNN (Wu et al., 2016), CNN (Gehring et al., 2017), or self-attention network (Vaswani et al., 2017). In this work, we take the advanced Transformer as an example, which will be used in the experiments later. However, we note that the proposed approach is generally applicable to any other type of NMT architecture.

Specifically, the encoder is composed of a stack of L identical layers, each of which has two sub-layers. The first sub-layer is a self-attention network, and the second one is a position-wise fully connected feed-forward network. A residual connection (He et al., 2016) is employed around each of the two sub-layers, followed by layer normalization (Ba et al., 2016). Formally, the output of the first sub-layer C^l_e and the second sub-layer H^l_e are calculated as

C^l_e = LN(ATT(Q^l_e, K^{l-1}_e, V^{l-1}_e) + H^{l-1}_e),
H^l_e = LN(FFN(C^l_e) + C^l_e),    (1)

where ATT(·), LN(·), and FFN(·) are the self-attention mechanism, layer normalization, and a feed-forward network with ReLU activation in between, respectively. {Q^l_e, K^{l-1}_e, V^{l-1}_e} are the query, key, and value vectors that are transformed from the (l-1)-th encoder layer H^{l-1}_e.

The decoder is also composed of a stack of L identical layers. In addition to the two sub-layers in each decoder layer, the decoder inserts a third sub-layer D^l_d to perform attention over the output of the encoder stack H^L_e:

C^l_d = LN(ATT(Q^l_d, K^{l-1}_d, V^{l-1}_d) + H^{l-1}_d),
D^l_d = LN(ATT(C^l_d, K^L_e, V^L_e) + C^l_d),
H^l_d = LN(FFN(D^l_d) + D^l_d),    (2)

where {Q^l_d, K^{l-1}_d, V^{l-1}_d} are transformed from the (l-1)-th decoder layer H^{l-1}_d, and {K^L_e, V^L_e} are transformed from the top layer of the encoder. The top layer of the decoder H^L_d is used to generate the final output sequence.
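To make the computation in Eqns. 1-2 concrete, the following is a minimal PyTorch sketch of one encoder layer. It is an illustrative re-implementation, not the authors' THUMT code; the dimensions (d_model, n_heads, d_ff) are assumptions matching the Base configuration, and a decoder layer would add the encoder-decoder attention sub-layer of Eqn. 2 between the two sub-layers shown here.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer following Eqn. (1)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h_prev):                        # h_prev: [batch, length, d_model]
        # C^l_e = LN(ATT(Q, K, V) + H^{l-1}_e); Q, K, V all come from H^{l-1}_e
        c, _ = self.self_attn(h_prev, h_prev, h_prev)
        c = self.ln1(c + h_prev)
        # H^l_e = LN(FFN(C^l_e) + C^l_e)
        return self.ln2(self.ffn(c) + c)
```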

A multi-layer network can be considered a strong feature extractor with extended receptive fields capable of linking salient features from the entire sequence (Chen et al., 2018). However, one potential problem with the vanilla Transformer, as shown in Figure 1a, is that both the encoder and decoder stack layers in sequence and only utilize the information in the top layer. While studies have shown that deeper layers extract more semantic and more global features (Zeiler and Fergus, 2014; Peters et al., 2018), this does not prove that the last layer is the ultimate representation for any task. Although residual connections have been incorporated to combine layers, these connections are themselves "shallow" and only fuse by simple, one-step operations (Yu et al., 2018). We investigate here how to better fuse information across layers for NMT models.

In the following sections, we simplify the equations to H^l = LAYER(H^{l-1}) for brevity.

3 Proposed Approaches

In this section, we first introduce how to exploit deep representations by simultaneously exposing the signals from all layers (Sec 3.1). Then, to explicitly encourage different layers to incorporate various information, we propose one way to measure the diversity between layers and add a regularization term to our objective function to maximize the diversity across layers (Sec 3.2).

3.1 Deep Representations

To exploit deep representations, we investigate two types of strategies to fuse information across layers, from layer aggregation to multi-layer attention. While layer aggregation strategies combine hidden states at the same position across different layers, multi-layer attention allows the model to combine information in different positions.


Figure 1: Illustration of (a) the vanilla model without any aggregation and the (b) dense, (c) linear, and (d) iterative layer aggregation strategies. Aggregation nodes are represented by green circles.

3.1.1 Layer Aggregation

While the aggregation strategies are inspired by previous work, there are several differences since we have simplified and generalized from the original models, as described below.

Dense Connection. The first strategy is to allow all layers to directly access previous layers:

H^l = f(H^1, ..., H^{l-1}).    (3)

In this work, we mainly investigate whether densely connected networks, which have proven successful in computer vision tasks (Huang et al., 2017), also work for NMT. The basic strategy of densely connected networks is to connect each layer to every previous layer with a residual connection:

H^l = LAYER(H^{l-1}) + Σ_{i=1}^{l-1} H^i.    (4)

Figure 1b illustrates the idea of this approach. Our implementation differs from Huang et al. (2017) in that we use an addition instead of a concatenation operation in order to keep the state size constant. Another reason is that the concatenation operation is computationally expensive, while residual connections are more efficient.
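A minimal sketch of the dense connection in Eqn. 4, assuming a list of generic layer modules (e.g. the EncoderLayer sketched earlier); the function and variable names are illustrative, not from the released code.

```python
def densely_connected_stack(layers, h0):
    """layers: modules mapping [batch, length, d] -> [batch, length, d];
    h0: input embeddings H^0. Returns the layer states H^1..H^L per Eqn. (4)."""
    states = []
    prev = h0
    for layer in layers:
        h = layer(prev)       # LAYER(H^{l-1})
        for s in states:      # + sum of all previous layer states H^1..H^{l-1}
            h = h + s
        states.append(h)
        prev = h              # the next layer is applied to the densely connected state
    return states

# usage sketch: states = densely_connected_stack([EncoderLayer() for _ in range(6)], embeddings)
```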

While dense connection directly feeds previous layers to the subsequent layers, the following mechanisms maintain additional layers to aggregate the standard layers, from shallow linear combination to deep non-linear aggregation.

Linear Combination. As shown in Figure 1c, an intuitive strategy is to linearly combine the outputs of all layers:

H = Σ_{l=1}^{L} W^l H^l,    (5)

where {W^1, ..., W^L} are trainable matrices. While the strategy is similar in spirit to Peters et al. (2018), there are two main differences: (1) they use normalized weights, while we directly use parameters that can be either positive or negative, which may benefit from more modeling flexibility; (2) they use a scalar that is shared by all elements in the layer states, while we use learnable matrices. The latter offers more precise control of the combination by allowing the model to be more expressive than scalars (Tu et al., 2017).
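The linear combination of Eqn. 5 can be sketched as below; the per-layer weight matrices and their initialization near simple averaging are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LinearCombination(nn.Module):
    """H = sum_l W^l H^l (Eqn. 5), one trainable d x d matrix per layer."""
    def __init__(self, n_layers=6, d_model=512):
        super().__init__()
        # unnormalized, possibly negative weights; initialized near simple averaging
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.eye(d_model) / n_layers) for _ in range(n_layers)]
        )

    def forward(self, layer_states):       # list of L tensors [batch, length, d]
        out = 0
        for w, h in zip(self.weights, layer_states):
            out = out + h @ w              # apply W^l at every position of H^l
        return out
```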

We also investigate strategies that iteratively and hierarchically merge layers by incorporating more depth and sharing, which have proven effective for computer vision tasks (Yu et al., 2018).

Iterative Aggregation. As illustrated in Figure 1d, iterative aggregation follows the iterated stacking of the backbone architecture. Aggregation begins at the shallowest, smallest scale and then iteratively merges deeper, larger scales. The iterative deep aggregation function I for a series of layers H^l_1 = {H^1, ..., H^l} with increasingly deeper and more semantic information is formulated as

Ĥ^l = I(H^l_1) = AGG(H^l, Ĥ^{l-1}),    (6)

where we set Ĥ^1 = H^1 and AGG(·, ·) is the aggregation function:

AGG(x, y) = LN(FFN([x; y]) + x + y).    (7)

As seen, in this work we first concatenate x and y into z = [x; y], which is subsequently fed to a feed-forward network with a sigmoid activation in between. Residual connections and layer normalization are also employed; specifically, both x and y have residual connections to the output. The choice of the aggregation function is further studied in the experiment section.
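A sketch of the aggregation function AGG(x, y) of Eqn. 7 and the iterative aggregation of Eqn. 6, assuming the sigmoid-activated feed-forward network described above; the hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Agg2(nn.Module):
    """AGG(x, y) = LN(FFN([x; y]) + x + y) with a sigmoid activation (Eqn. 7)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(2 * d_model, d_ff), nn.Sigmoid(),
                                 nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x, y):
        z = torch.cat([x, y], dim=-1)          # concatenate along the feature dimension
        return self.ln(self.ffn(z) + x + y)    # residual connections from both inputs

def iterative_aggregation(layer_states, agg):
    """Eqn. (6): H_hat^1 = H^1, then H_hat^l = AGG(H^l, H_hat^{l-1})."""
    h_hat = layer_states[0]
    for h in layer_states[1:]:
        h_hat = agg(h, h_hat)
    return h_hat
```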


Figure 2: Hierarchical aggregation (b), which aggregates layers through a CNN-like tree structure (a).

Hierarchical Aggregation. While iterative aggregation deeply combines states, it may still be insufficient to fuse the layers due to its sequential architecture. Hierarchical aggregation, on the other hand, merges layers through a tree structure to preserve and combine feature channels, as shown in Figure 2. The original model proposed by Yu et al. (2018) requires the number of layers to be a power of two, which limits the applicability of the method to a broader range of NMT architectures (e.g. the six layers in Vaswani et al. (2017)). To solve this problem, we introduce a CNN-like tree with a filter size of two, as shown in Figure 2a. Following Yu et al. (2018), we first merge aggregation nodes of the same depth for efficiency, so that there is at most one aggregation node at each depth. Then, we further feed the output of an aggregation node back into the backbone as the input to the next sub-tree, instead of only routing intermediate aggregations further up the tree, as shown in Figure 2b. The interaction between aggregation and backbone nodes allows the model to better preserve features.

Formally, each aggregation node Ĥ^i is calculated as

Ĥ^i = AGG(H^{2i-1}, H^{2i}),            i = 1,
Ĥ^i = AGG(H^{2i-1}, H^{2i}, Ĥ^{i-1}),   i > 1,

where AGG(H^{2i-1}, H^{2i}) is computed via Eqn. 7, and AGG(H^{2i-1}, H^{2i}, Ĥ^{i-1}) is computed as

AGG(x, y, z) = LN(FFN([x; y; z]) + x + y + z).

The aggregation node at the top layer Ĥ^{L/2} serves as the final output of the network.
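Under our reading of the description above, hierarchical aggregation interleaves aggregation nodes with the backbone as sketched below. Agg2 refers to the two-input AGG module from the previous sketch, and the interleaving detail (feeding Ĥ^{i-1} back in as the input to the next pair of backbone layers) is an assumption drawn from Figure 2b, not the released implementation.

```python
import torch
import torch.nn as nn

class Agg3(nn.Module):
    """AGG(x, y, z) = LN(FFN([x; y; z]) + x + y + z) with a sigmoid activation."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(3 * d_model, d_ff), nn.Sigmoid(),
                                 nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x, y, z):
        return self.ln(self.ffn(torch.cat([x, y, z], dim=-1)) + x + y + z)

def hierarchical_stack(layers, agg2, agg3, h0):
    """layers: even-length list of backbone layers; agg2/agg3: AGG modules.
    Returns the top aggregation node H_hat^{L/2}, used as the final output."""
    h_hat = None
    backbone_in = h0
    for i in range(0, len(layers), 2):
        h_odd = layers[i](backbone_in)        # H^{2i-1}
        h_even = layers[i + 1](h_odd)         # H^{2i}
        # the first node merges two backbone layers; later nodes also take H_hat^{i-1}
        h_hat = agg2(h_odd, h_even) if h_hat is None else agg3(h_odd, h_even, h_hat)
        backbone_in = h_hat                   # feed the aggregation back into the backbone
    return h_hat
```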

3.1.2 Multi-Layer Attention

Figure 3: Multi-layer attention allows the model to attend to multiple layers to construct each hidden state. We use two-layer attention for illustration, while the approach is applicable to any layers lower than l.

Partially inspired by Meng et al. (2016), we also propose to introduce a multi-layer attention mechanism into deep NMT models, for more power in transforming information across layers. In other words, to construct each hidden state in any layer l, we allow the self-attention model to attend to any layer lower than l, instead of just layer l-1:

C^l_{-1} = ATT(Q^l, K^{l-1}, V^{l-1}),
C^l_{-2} = ATT(Q^l, K^{l-2}, V^{l-2}),
...
C^l_{-k} = ATT(Q^l, K^{l-k}, V^{l-k}),
C^l = AGG(C^l_{-1}, ..., C^l_{-k}),    (8)

where C^l_{-i} is the sequence of vectors queried from layer l-i using a separate attention model, and AGG(·) is similar to the pre-defined aggregation function, transforming the k vectors {C^l_{-1}, ..., C^l_{-k}} into a d-dimensional vector that is subsequently used to construct the encoder and decoder layers via Eqn. 1 and 2, respectively. Note that multi-layer attention only modifies the self-attention blocks in both the encoder and decoder, and does not revise the encoder-decoder attention blocks.
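A sketch of multi-layer attention (Eqn. 8) for a single encoder layer. The choice of H^{l-1} as the source of the query and the concatenate-then-project merge follow Eqns. 1 and 7, but the exact parameterization is an assumption for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiLayerSelfAttention(nn.Module):
    """Eqn. (8): attend to the k layers below layer l with separate attention
    models, then merge the k context sequences into one d_model-dim sequence."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, k=2):
        super().__init__()
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(k)]
        )
        self.merge = nn.Sequential(nn.Linear(k * d_model, d_ff), nn.Sigmoid(),
                                   nn.Linear(d_ff, d_model))   # AGG-style merge
        self.ln = nn.LayerNorm(d_model)

    def forward(self, query, lower_layers):
        """query: H^{l-1}, the source of Q^l; lower_layers: [H^{l-1}, ..., H^{l-k}]."""
        contexts = [attn(query, h, h)[0]                        # C^l_{-i}
                    for attn, h in zip(self.attns, lower_layers)]
        merged = self.merge(torch.cat(contexts, dim=-1))
        return self.ln(merged + sum(contexts))   # C^l, fed to the FFN sub-layer of Eqn. (1)
```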

3.2 Layer Diversity

Intuitively, combining layers would be more meaningful if different layers are able to capture diverse information. Therefore, we explicitly add a regularization term to encourage diversity between layers:

L = L_likelihood + λ L_diversity,    (9)

where λ is a hyper-parameter that is set to 1.0 in this paper. Specifically, the regularization term measures the average of the distance between every two adjacent layers:


L_diversity = (1 / (L - 1)) Σ_{l=1}^{L-1} D(H^l, H^{l+1}).    (10)

Here D(H^l, H^{l+1}) is the averaged cosine-squared distance between the states in layers H^l = {h^l_1, ..., h^l_N} and H^{l+1} = {h^{l+1}_1, ..., h^{l+1}_N}:

D(H^l, H^{l+1}) = (1/N) Σ_{n=1}^{N} (1 - cos²(h^l_n, h^{l+1}_n)).

The cosine-squared distance between two vectors is maximized when the two vectors are linearly independent and minimized when they are linearly dependent, which satisfies our goal.¹

¹ We use the cosine-squared distance instead of the cosine distance, since the latter is maximized when two vectors point in opposite directions; in that case the two vectors are in fact linearly dependent, while we aim at encouraging the vectors to be independent of each other.
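The diversity term of Eqns. 9-10 reduces to a few lines, as in the sketch below; how the term enters a standard minimize-NLL training loop (subtracted, so that diversity is maximized) is our assumption about sign handling.

```python
import torch
import torch.nn.functional as F

def diversity_term(layer_states):
    """Eqn. (10): average over adjacent layer pairs of the mean cosine-squared
    distance 1 - cos^2 between states at the same position.
    layer_states: list of L tensors of shape [batch, length, d]."""
    dists = []
    for h_l, h_next in zip(layer_states[:-1], layer_states[1:]):
        cos = F.cosine_similarity(h_l, h_next, dim=-1)   # [batch, length]
        dists.append((1.0 - cos ** 2).mean())            # D(H^l, H^{l+1})
    return torch.stack(dists).mean()

# In a minimize-NLL setup the term is subtracted (lambda = 1.0 in the paper),
# which maximizes diversity while maximizing likelihood:
# loss = nll_loss - 1.0 * diversity_term(encoder_states)
```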

4 Experiments

4.1 Setup

Dataset. To compare with the results reported by previous work (Gehring et al., 2017; Vaswani et al., 2017; Hassan et al., 2018), we conducted experiments on both Chinese⇒English (Zh⇒En) and English⇒German (En⇒De) translation tasks. For the Zh⇒En task, we used all of the available parallel data with maximum length limited to 50, consisting of about 20.62 million sentence pairs. We used newsdev2017 as the development set and newstest2017 as the test set. For the En⇒De task, we trained on the widely-used WMT14 dataset consisting of about 4.56 million sentence pairs. We used newstest2013 as the development set and newstest2014 as the test set. Byte-pair encoding (BPE) was employed to alleviate the out-of-vocabulary problem (Sennrich et al., 2016), with 32K merge operations for both language pairs. We used the 4-gram NIST BLEU score (Papineni et al., 2002) as the evaluation metric, and the sign-test (Collins et al., 2005) to test for statistical significance.

Models. We evaluated the proposed approaches on the advanced Transformer model (Vaswani et al., 2017), implemented on top of the open-source toolkit THUMT (Zhang et al., 2017). We followed Vaswani et al. (2017) to set the configurations and train the models, and have reproduced their reported results on the En⇒De task. The parameters of the proposed models were initialized by the pre-trained model. We tried k = 2 and k = 3 for the multi-layer attention model, which allows the model to attend to the lower two or three layers.

We tested both Base and Big models, which differ in hidden size (512 vs. 1024), filter size (2048 vs. 4096), and the number of attention heads (8 vs. 16).² All models were trained on eight NVIDIA P40 GPUs, each allocated a batch size of 4096 tokens. In consideration of computation cost, we studied model variations with the Base model on the En⇒De task, and evaluated overall performance with the Big model on both the Zh⇒En and En⇒De tasks.

² Here "filter size" refers to the hidden size of the feed-forward network in the Transformer model.

4.2 Results

Table 1 shows the results on the WMT14 En⇒De translation task. As seen, the proposed approaches improve translation quality in all cases, although there are still considerable differences among the variations.

Model Complexity Except for the dense connection, all other deep representation strategies introduce new parameters, ranging from 14.7M to 33.6M. Accordingly, the training speed decreases due to the additional effort needed to train the new parameters. Layer aggregation mechanisms only marginally decrease decoding speed, while multi-layer attention decreases decoding speed by 21% due to an additional attention process for each layer.

Layer Aggregation (Rows 2-5): Although dense connection and linear combination only marginally improve translation performance, iterative and hierarchical aggregation strategies achieve more significant improvements, which are up to +0.99 BLEU points better than the baseline model. This indicates that deep aggregations outperform their shallow counterparts by incorporating more depth and sharing, which is consistent with the results in computer vision tasks (Yu et al., 2018).

Multi-Layer Attention (Rows 6-7): Benefiting from the power of attention models, the multi-layer attention model also significantly outperforms the baseline, although it only attends to one or two additional layers.


#  | Model                                          | # Para. | Train | Decode | BLEU  | ∆
1  | TRANSFORMER-BASE                               | 88.0M   | 1.82  | 1.33   | 27.64 | –
2  |  + Dense Connection                            | +0.0M   | 1.69  | 1.28   | 27.94 | +0.30
3  |  + Linear Combination                          | +14.7M  | 1.59  | 1.26   | 28.09 | +0.45
4  |  + Iterative Aggregation                       | +31.5M  | 1.32  | 1.22   | 28.61 | +0.97
5  |  + Hierarchical Aggregation                    | +23.1M  | 1.46  | 1.25   | 28.63 | +0.99
6  |  + Multi-Layer Attention (k=2)                 | +33.6M  | 1.19  | 1.05   | 28.58 | +0.94
7  |  + Multi-Layer Attention (k=3)                 | +37.8M  | 1.12  | 1.00   | 28.62 | +0.98
8  |  + Iterative Aggregation + L_diversity         | +31.5M  | 1.28  | 1.22   | 28.76 | +1.12
9  |  + Hierarchical Aggregation + L_diversity      | +23.1M  | 1.41  | 1.25   | 28.78 | +1.14
10 |  + Multi-Layer Attention (k=2) + L_diversity   | +33.6M  | 1.12  | 1.05   | 28.75 | +1.11

Table 1: Evaluation of translation performance on the WMT14 English⇒German ("En⇒De") translation task. "# Para." denotes the number of parameters, and "Train" and "Decode" respectively denote the training speed (steps/second) and decoding speed (sentences/second) on a Tesla P40.

System | Architecture | En⇒De # Para. | En⇒De BLEU | Zh⇒En # Para. | Zh⇒En BLEU

Existing NMT systems:
(Wu et al., 2016)      | RNN with 8 layers  | N/A  | 26.30 | N/A  | N/A
(Gehring et al., 2017) | CNN with 15 layers | N/A  | 26.36 | N/A  | N/A
(Vaswani et al., 2017) | TRANSFORMER-BASE   | 65M  | 27.3  | N/A  | N/A
(Vaswani et al., 2017) | TRANSFORMER-BIG    | 213M | 28.4  | N/A  | N/A
(Hassan et al., 2018)  | TRANSFORMER-BIG    | N/A  | N/A   | N/A  | 24.2

Our NMT systems:
this work | TRANSFORMER-BASE        | 88M  | 27.64  | 108M | 24.13
this work |  + Deep Representations | 111M | 28.78† | 131M | 24.76†
this work | TRANSFORMER-BIG         | 264M | 28.58  | 304M | 24.56
this work |  + Deep Representations | 356M | 29.21† | 396M | 25.10†

Table 2: Comparison with existing NMT systems on the WMT14 English⇒German and WMT17 Chinese⇒English tasks. "+ Deep Representations" denotes "+ Hierarchical Aggregation + L_diversity". "†" indicates a statistically significant difference (p < 0.01) from the TRANSFORMER baseline.

However, increasing the number of lower layers to be attended from k = 2 to k = 3 only gains a marginal improvement, at the cost of slower training and decoding speeds. In the following experiments, we set k = 2 for the multi-layer attention model.

Layer Diversity (Rows 8-10): The introduced diversity regularization consistently improves performance in all cases by encouraging different layers to capture diverse information. Our best model outperforms the vanilla Transformer by +1.14 BLEU points. In the following experiments, we used hierarchical aggregation with diversity regularization (Row 9) as the default strategy.

Main Results Table 2 lists the results on both the WMT17 Zh⇒En and WMT14 En⇒De translation tasks. As seen, exploiting deep representations consistently improves translation performance across model variations and language pairs, demonstrating the effectiveness and universality of the proposed approach. It is worth mentioning that TRANSFORMER-BASE with deep representations exploitation outperforms the vanilla TRANSFORMER-BIG model with less than half of the parameters.

4.3 Analysis

We conducted extensive analyses from different perspectives to better understand our model. All results are reported on the En⇒De task with TRANSFORMER-BASE.


Figure 4: BLEU scores on the En⇒De test set with respect to various input sentence lengths (panels plot "Ours" against "Base", and "Hier.+Div." / "Hier." against "Base", over source-length bins). "Hier." denotes hierarchical aggregation and "Div." denotes diversity regularization.

4.3.1 Length Analysis

Following Bahdanau et al. (2015) and Tu et al. (2016), we grouped sentences of similar lengths together and computed the BLEU score for each group, as shown in Figure 4. Generally, the performance of TRANSFORMER-BASE goes up with the increase of input sentence length, which is superior to the performance of RNN-based NMT models on long sentences reported by Bentivogli et al. (2016). We attribute this to the strength of the self-attention mechanism in modeling global dependencies without regard to their distance.

Clearly, the proposed approaches outperform the baseline model in all length segments, while there are still considerable differences between the two variations. Hierarchical aggregation consistently outperforms the baseline model, and the improvement grows on long sentences. One possible reason is that long sentences indeed require deep aggregation mechanisms. Introducing diversity regularization further improves performance on most sentences (e.g. ≤ 45), while the improvement degrades on long sentences (e.g. > 45). We conjecture that complex long sentences may need to store duplicate information across layers, which conflicts with the diversity objective.

4.3.2 Effect on Encoder and Decoder

Both encoder and decoder are composed of a stack of L layers, which may benefit from the proposed approach. In this experiment, we investigated how our models affect the two components, as shown in Table 3.

Model | Applied to Encoder | Applied to Decoder | BLEU
BASE  | N/A                | N/A                | 26.13
OURS  | ✓                  | ×                  | 26.32
OURS  | ×                  | ✓                  | 26.41
OURS  | ✓                  | ✓                  | 26.69

Table 3: Experimental results of applying hierarchical aggregation to different components on the En⇒De validation set.

Exploiting deep representations of the encoder or decoder individually consistently outperforms the vanilla baseline model, and exploiting both components further improves performance. These results support the claim that exploiting deep representations is useful for both understanding the input sequence and generating the output sequence.

4.3.3 Impact of Aggregation Choices

Model | RESIDUAL | AGGREGATE | BLEU
BASE  | N/A      | N/A       | 26.13
OURS  | None     | SIGMOID   | 25.48
OURS  | Top      | SIGMOID   | 26.59
OURS  | All      | SIGMOID   | 26.69
OURS  | All      | RELU      | 26.56
OURS  | All      | ATTENTION | 26.54

Table 4: Impact of residual connections and aggregation functions for hierarchical layer aggregation.

As described in Section 3.1.1, the function of hierarchical layer aggregation is defined as

AGG(x, y, z) = LN(FFN([x; y; z]) + x + y + z),

where FFN(·) is a feed-forward network with a sigmoid activation in between. In addition, all the input layers {x, y, z} have residual connections to the output. In this experiment, we evaluated the impact of the residual connection options, as well as different choices for the aggregation function, as shown in Table 4.

Concerning residual connections, if none of the input layers are connected to the output layer ("None"), the performance decreases. The translation performance is improved when the output is connected to only the top level of the input layers ("Top"), while connecting to all input layers ("All") achieves the best performance.


This indicates that cross-layer connections are necessary to avoid the vanishing gradient problem.

Besides the feed-forward network with sigmoid activation, we also tried two other aggregation functions for FFN(·): (1) a feed-forward network with a ReLU activation in between; and (2) the multi-head self-attention layer that constitutes the encoder and decoder layers in the TRANSFORMER model. As seen, all three functions consistently improve translation performance, demonstrating the robustness of the proposed approaches.

4.4 Visualization of Aggregation

𝐻"# 𝐻"$ 𝐻"%

𝐻"&'#𝐻$&'#𝐻$& 0.40 0.20 0.27

0.60 0.35 0.33

0.45 0.40

(a) No regularization.

𝐻"# 𝐻"$ 𝐻"%

𝐻"&'#𝐻$&'#𝐻$& 0.49 0.30 0.32

0.51 0.33 0.32

0.37 0.36

(b) With regularization.

Figure 5: Visualization of the exploitation of inputrepresentations for hierarchical aggregation. x-axis is the aggregation node and y-axis is the in-put representation. H i denotes the i-th aggrega-tion layer, and H i denotes the i-th encoder layer.The rightmost and topmost position in x-axis andy-axis respectively represent the highest layer.

To investigate the impact of diversity regularization, we visualized the exploitation of the input representations for hierarchical aggregation on the encoder side, as shown in Figure 5. Let ℋ^i = {H^{2i}, H^{2i-1}, Ĥ^{i-1}} be the input representations of an aggregation node; we calculated the exploitation of the j-th input as

s_j = ( Σ_{w ∈ W_j} |w| ) / ( Σ_{H_{j'} ∈ ℋ^i} Σ_{w' ∈ W_{j'}} |w'| ),    (11)

where W_j is the parameter matrix associated with the input H_j. The score s_j is a rough estimate of the contribution of H_j to the aggregation Ĥ^i.
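Eqn. 11 can be computed directly from the aggregation-node parameters, as in the sketch below; how the per-input matrices W_j are sliced out of the concatenated FFN weights is an assumption for illustration.

```python
import torch

def exploitation_scores(weight_matrices):
    """Eqn. (11): s_j = sum of |w| over W_j, normalized over all inputs of the node.
    weight_matrices: list of parameter tensors W_j, one per input representation."""
    masses = torch.stack([w.abs().sum() for w in weight_matrices])
    return (masses / masses.sum()).tolist()
```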

We have two observations. First, the model tends to utilize the bottom layers more than the top one, indicating the necessity of fusing information across layers. Second, the diversity regularization in Figure 5(b) encourages each layer to contribute more equally to the aggregation. We hypothesize that this is because the diversity regularization term encourages the different layers to contain diverse and equally important information.

5 Related Work

Representation learning is at the core of deep learning. Our work is inspired by technological advances in representation learning, specifically in the fields of deep representation learning and representation interpretation.

Deep Representation Learning Deep neural networks have advanced the state of the art in various communities, such as computer vision and natural language processing. One key challenge of training deep networks lies in how to transform information across layers, especially when the network consists of hundreds of layers.

In response to this problem, ResNet (He et al., 2016) uses skip connections to combine layers by simple, one-step operations. Densely connected networks (Huang et al., 2017) are designed to better propagate features and losses through skip connections that concatenate all the layers in stages. Yu et al. (2018) design structures that iteratively and hierarchically merge the feature hierarchy to better fuse information in a deep fusion.

Concerning machine translation, Meng et al. (2016) and Zhou et al. (2016) have shown that deep networks with advanced connecting strategies outperform their shallow counterparts. Due to its simplicity and effectiveness, the skip connection has become a standard component of state-of-the-art NMT models (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). In this work, we show that deep representation exploitation can further improve performance over simply using skip connections.

Representation Interpretation Several researchers have tried to visualize the representation of each layer to help better understand what information each layer captures (Zeiler and Fergus, 2014; Li et al., 2016; Ding et al., 2017). Concerning natural language processing tasks, Shi et al. (2016) find that both local and global source syntax are learned by the NMT encoder and that different types of syntax are captured at different layers. Anastasopoulos and Chiang (2018) show that higher-level layers are more representative than lower-level layers. Peters et al. (2018) demonstrate that higher-level layers capture context-dependent aspects of word meaning while lower-level layers model aspects of syntax. Inspired by these observations, we propose to expose all of these representations to better


fuse information across layers. In addition, we introduce a regularization term to encourage different layers to capture diverse information.

6 Conclusion

In this work, we propose to better exploit the deep representations that are learned by multiple layers for neural machine translation. Specifically, hierarchical aggregation with diversity regularization achieves the best performance by incorporating more depth and sharing across layers and by encouraging layers to capture different information. Experimental results on WMT14 English⇒German and WMT17 Chinese⇒English show that the proposed approach consistently outperforms the state-of-the-art TRANSFORMER baseline by +0.54 and +0.63 BLEU points, respectively. By visualizing the aggregation process, we find that our model indeed utilizes lower layers to effectively fuse information across layers.

Future directions include validating our approach on other architectures such as RNN-based (Bahdanau et al., 2015) or CNN-based (Gehring et al., 2017) NMT models, as well as combining it with other advanced techniques (Shaw et al., 2018; Shen et al., 2018; Yang et al., 2018; Li et al., 2018) to further improve the performance of TRANSFORMER.

Acknowledgments

We thank the anonymous reviewers for their insightful comments.

References

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In NAACL.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In EMNLP.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In ACL.

Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Visualizing and understanding neural machine translation. In ACL.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In ICML.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In CVPR.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-head attention with disagreement regularization. In EMNLP.

Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E. Hopcroft. 2016. Convergent learning: Do different neural networks learn the same representations? In ICLR.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.

Fandong Meng, Zhengdong Lu, Zhaopeng Tu, Hang Li, and Qun Liu. 2016. A deep memory-based architecture for sequence-to-sequence learning. In ICLR Workshop.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.


Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In NAACL.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In AAAI.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In EMNLP.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017. Context gates for neural machine translation. TACL.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling localness for self-attention networks. In EMNLP.

Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. 2018. Deep layer aggregation. In CVPR.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV.

Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. THUMT: An open source toolkit for neural machine translation. arXiv preprint arXiv:1706.06415.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. TACL.