Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers, pages 118–126, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Effective Domain Mixing for Neural Machine Translation

Reid Pryzant∗
Stanford University
[email protected]

Denny Britz∗
Google Brain
[email protected]

Quoc V. Le
Google Brain
[email protected]

Abstract

Neural Machine Translation (NMT) models are often trained on heterogeneous mixtures of domains, from news to parliamentary proceedings, each with unique distributions and language. In this work we show that training NMT systems on naively mixed data can degrade performance versus models fit to each constituent domain. We demonstrate that this problem can be circumvented, and propose three models that do so by jointly learning domain discrimination and translation. We demonstrate the efficacy of these techniques by merging pairs of domains in three languages: Chinese, French, and Japanese. After training on composite data, each approach outperforms its domain-specific counterparts, with a model based on a discriminator network doing so most reliably. We obtain consistent performance improvements and an average increase of 1.1 BLEU.

1 Introduction

Neural Machine Translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) is an end-to-end approach for automated translation. NMT has shown impressive results (Bahdanau et al., 2015; Luong et al., 2015a; Wu et al., 2016), often surpassing those of phrase-based systems while addressing shortcomings such as the need for hand-engineered features.

In many translation settings (e.g. web translation, assistant translators), input may

∗Equal Contribution.

come from more than one domain. Each domain has unique properties that could confound models not explicitly fitted to it. Thus, an important problem is to effectively mix a diversity of training data in a multi-domain setting.

Our problem space is as follows: how can we train a translation model on multi-domain data to improve test-time performance in each constituent domain? This setting differs from the majority of work in domain adaptation, which explores how models trained on some source domain can be effectively applied to outside target domains. It is an important setting because previous research has shown that both standard NMT and adaptation methods degrade performance on the original source domain(s) (Farajian et al., 2017; Haddow and Koehn, 2012). We seek to show that this problem can be overcome, and hypothesize that leveraging the heterogeneity of composite data rather than dampening it will allow us to do so.

To this end, we propose three new models for multi-domain machine translation. These models are based on discriminator networks, adversarial learning, and target-side domain tokens. We evaluate on pairs of linguistically disparate corpora in three translation tasks (EN-JA, EN-ZH, EN-FR), and observe that unlike naively training on mixed data (as per current best practices), the proposed techniques consistently improve translation quality in each individual setting. The most significant of these tasks is EN-JA, where we obtain state-of-the-art performance in the process of examining the ASPEC corpus (Nakazawa et al., 2016) of scientific papers and SubCrawl, a new corpus based on an anonymous manuscript (Anonymous, 2017). In summary,

our contributions are as follows:

• We show that mixing data from heterogeneous domains leads to suboptimal results compared to the single-domain setting, and that the more distant these domains are, the more their merger degrades downstream translation quality.

• We demonstrate that this problem can be circumvented and propose novel, general-purpose techniques that do so.

2 Neural Machine Translation

Neural machine translation (Sutskever et al., 2014) directly models the conditional log probability $\log p(y|x)$ of producing some translation $y = y_1, \ldots, y_m$ of a source sentence $x = x_1, \ldots, x_n$. It models this probability through the encoder-decoder framework. In this approach, an encoder network encodes the source into a series of vector representations $H = h_1, \ldots, h_n$. The decoder network uses this encoding to generate a translation one target token at a time. At each step, the decoder casts an attentional distribution over source encodings (Luong et al., 2015b; Bahdanau et al., 2014). This allows the model to focus on parts of the input before producing each translated token. In this way the decoder is decomposing the conditional log probability into

$$\log p(y|x) = \sum_{t=1}^{m} \log p(y_t | y_{<t}, H) \qquad (1)$$

In practice, stacked networks with recurrent Long Short-Term Memory (LSTM) units are used for both the encoder and decoder. Such units can effectively distill structure from sequential data (Elman, 1990).

The cross-entropy training objective in NMT is formulated as

$$\mathcal{L}_{gen} = \sum_{(x, y) \in D} -\log p(y|x) \qquad (2)$$

where $D$ is a set of (source, target) sequence pairs $(x, y)$.
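To make the decomposition in Equations (1) and (2) concrete, the following is a minimal sketch of the objective in plain Python/NumPy; the decoder's per-step softmax distributions are taken as given here, since producing them requires the full encoder-decoder described above, and the function names are ours, not the authors'.

```python
import numpy as np

def sequence_log_prob(step_probs, target_ids):
    """Eq. (1): log p(y|x) = sum_t log p(y_t | y_<t, H).

    step_probs: array of shape [m, vocab_size]; row t is the decoder's
                softmax distribution over the vocabulary at target step t.
    target_ids: length-m list of gold target token ids y_1..y_m.
    """
    return sum(np.log(step_probs[t, y_t]) for t, y_t in enumerate(target_ids))

def generator_loss(batch):
    """Eq. (2): cross-entropy objective summed over (source, target) pairs.

    batch: iterable of (step_probs, target_ids) tuples, one per sentence pair;
           how step_probs is produced by the encoder-decoder is assumed.
    """
    return -sum(sequence_log_prob(p, y) for p, y in batch)
```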

Figure 1: The novel mixing paradigms under consideration. Discriminative mixing (A), adversarial discriminative mixing (B), and target-side token mixing (C) are depicted.

3 Models

We now describe three models we are proposing that leverage the diversity of information in heterogeneous corpora. They are summarized in Figure 1. We assume dataset $D$ consists of source sequences $X$, target sequences $Y$, and domain class labels $D$ that are only known at training time.

3.1 Discriminative Mixing

In the Discriminative Mixing approach, we add a discriminator network on top of the source encoder that takes a single vector encoding $c$ of the source as input. This network maximizes $p(d|H)$, the predicted probability of the correct domain class label $d$ conditioned on the hidden states of the encoder $H$. It does so by minimizing the negative log-likelihood $\mathcal{L}_{disc} = -\log p(d|H)$. In other words, the discriminator uses the encoded representation of the source sequence to predict the correct domain. Intuitively, this forces the encoder to encode domain-related information into the features it generates. We hypothesize that this information will be useful during the decoding process.

The encoder can employ an arbitrary mechanism to distill the source into a single-vector representation $c$. In this work, we use an attention mechanism over the encoder states $H$, followed by a fully connected layer. We set $c$ to be the attention context, and calculate it according to Bahdanau et al. (2015):

$$c = \sum_j a_j h_j, \qquad a = \mathrm{softmax}(\bar{a}), \qquad \bar{a}_i = v_a^\top \tanh(W_a h_i)$$

The discriminator can be an arbitrary neural network. For this work, we fed $c$ into a fully connected layer with a tanh nonlinearity, then passed the result through a softmax to obtain probabilities for each domain class label.

The discriminator is optimized jointly with the rest of the sequence-to-sequence network. If $\mathcal{L}_{gen}$ is the standard sequence generator loss described in Section 2, then the final loss we are optimizing is the sum of the generator and discriminator losses, $\mathcal{L} = \mathcal{L}_{gen} + \mathcal{L}_{disc}$.
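As an illustration of this branch, here is a sketch of the computation described above (attention-pooled context, fully connected tanh layer, softmax over domain labels, joint loss); the weight names and NumPy formulation are ours, not the authors' TensorFlow code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def discriminator_loss(H, d, W_a, v_a, W_fc, W_out):
    """L_disc = -log p(d | H) for encoder states H (shape [n, hidden]) and
    correct domain label index d. Illustrative weight shapes:
    W_a: [attn, hidden], v_a: [attn], W_fc: [fc, hidden], W_out: [domains, fc].
    """
    # Attention-pooled context: c = sum_j a_j h_j, with a_i = v_a^T tanh(W_a h_i)
    scores = np.array([v_a @ np.tanh(W_a @ h) for h in H])
    a = softmax(scores)
    c = (a[:, None] * H).sum(axis=0)
    # Fully connected layer with tanh, then softmax over domain labels
    p_domain = softmax(W_out @ np.tanh(W_fc @ c))
    return -np.log(p_domain[d])

# Joint objective, optimized end to end with the translation model:
#   L = L_gen + L_disc
# total_loss = generator_loss(batch) + discriminator_loss(H, d, W_a, v_a, W_fc, W_out)
```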

3.2 Adversarial Discriminative Mixing

We also experiment with an adversarial approach to domain mixing. This approach is similar to that of Section 3.1, except that when backpropagating from the discriminator network to the encoder, we reverse the gradients by multiplying them by $-1$. Though the discriminator still uses $\nabla \mathcal{L}_{disc}$ to update its parameters, with the inclusion of the reversal layer we are implicitly directing the encoder to optimize with $-\nabla \mathcal{L}_{disc}$. This has the opposite effect of what we described above. The discriminator still learns to distinguish between domains, but the encoder is forced to compute domain-invariant representations that are not useful to the discriminator. We hope that such representations lead to better generalization across domains.
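The gradient reversal layer described here can be implemented in a few lines; the sketch below uses TensorFlow's custom-gradient mechanism and is our illustration of the idea (Ganin et al., 2015), not the authors' code. The `discriminator` callable in the usage comment is assumed, not a real API.

```python
import tensorflow as tf

@tf.custom_gradient
def gradient_reversal(x):
    """Identity in the forward pass; multiplies incoming gradients by -1 in
    the backward pass, so the encoder is pushed toward -grad(L_disc) while
    the discriminator itself still minimizes L_disc."""
    def grad(dy):
        return -dy
    return tf.identity(x), grad

# Usage sketch: place the layer between the pooled encoder context and the
# discriminator input.
# logits = discriminator(gradient_reversal(context_vector))
```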

Note the connections between this technique and that of the Generative Adversarial Network (GAN) paradigm (Goodfellow et al., 2014). GANs optimize two networks with two objective functions (one being the negation of the other) and periodically freeze the parameters of each network during training. We are training a single network without freezing any of its components. Furthermore, we reverse gradients in lieu of explicitly defining a second, negated loss function. Last, the adversarial parts of this model are trained jointly with translation in a multitask setting.

Note also that the representations computed by this model are likely to be applicable to unseen, outside domains. However, this setting is outside the scope of this paper and we leave its exploration to future work. For our setting, we hypothesize that the domain-agnostic encodings encouraged by the discriminator may yield improvements in mixed-domain settings as well.

3.3 Target Token Mixing

A simpler alternative to adding a discriminator network is to prepend a domain token to the target sequence. Such a technique can be readily incorporated into any existing NMT pipeline and does not require changes to the model. In particular, we add a single special vocabulary token per domain, such as “domain=subtitles”, and prepend this token to each target sequence in that domain.
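A sketch of this preprocessing step, following the token spelling given above (the helper name is ours):

```python
def prepend_domain_token(target_tokens, domain):
    """Prepend a per-domain special token (e.g. "domain=subtitles") to the
    target-side token sequence; the token is also added to the vocabulary."""
    return ["domain=" + domain] + list(target_tokens)

# Example: prepend_domain_token(["I", "saw", "it", "."], "subtitles")
#   -> ["domain=subtitles", "I", "saw", "it", "."]
# At inference time, the first predicted token (the domain token) is dropped.
```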

The decoder must learn, similar to the more complex discriminator above, to predict the correct domain token based on the source representation at the first step of decoding. We hypothesize that this technique has a regularizing effect similar to that of adding a discriminator network. During inference, we remove the first predicted token, which corresponds to the domain.

The advantage of this approach over the similar techniques discussed in related work (Section 5) is that in our proposed method, the model must learn to predict the domain based on the source sequence alone. It does not need to know the domain a priori.

4 Experiments

4.1 Datasets

For the Japanese translation task we evaluate our domain mixing techniques on the standard ASPEC corpus (Nakazawa et al., 2016), consisting of 3M scientific document sentence pairs, and the SubCrawl corpus, consisting of 3.2M colloquial sentence pairs harvested from freely available subtitle repositories on the World Wide Web. We use standard train/dev/test splits (3M, 1.8k, and 1.8k examples, respectively) and preprocess the data

using subword units¹ (Sennrich et al., 2015) to learn a shared English-Japanese vocabulary of size 32,000. To allow for fair comparisons, we use the same vocabulary and sentence segmentation for all experiments, including single-domain models.

To demonstrate the generality of our approach, we also evaluate our techniques on small sets of about 200k/1k/1k training/dev/test examples for the English-Chinese (EN-ZH) and English-French (EN-FR) language pairs. For EN-ZH, we use a news commentary corpus from WMT'17² and a 2012 database dump of TED talk subtitles (Tiedemann, 2012). For EN-FR, we use professional translations of European Parliament proceedings (Koehn, 2005) and a 2016 dump of the OpenSubtitles database (Lison and Tiedemann, 2016).

The premise of evaluating on mixed-domain data is that the domains undergoing mixing are in fact disparate. We need to quantifiably measure this disparity to obtain fair, valid, and explainable results. Thus, we measured the distances between the domains of each language pair with the A-distance, an important part of the upper generalization bounds for domain adaptation (Ben-David et al., 2007). Because computing exact A-distances is intractable, we instead compute a proxy for the A-distance, $d_A$, which is given theoretical justification in Ben-David et al. (2007) and used to measure domain distance in Ganin et al. (2015) and Glorot et al. (2011). The proxy A-distance is obtained by measuring the generalization error $\epsilon$ of a linear bag-of-words SVM classifier trained to discriminate between the two domains, and setting $d_A = 2(1 - 2\epsilon)$. Note that by nature of its formulation, $d_A$ is only useful in comparative settings, and means little in isolation (Ben-David et al., 2007). However, it has a minimum value of 1, implying an exact domain match, and a maximum of 2, implying that the domains are polar opposites.
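The procedure described here reduces to training a linear classifier and plugging its held-out error into $d_A = 2(1 - 2\epsilon)$. Below is a minimal sketch, assuming scikit-learn for the bag-of-words features and linear SVM (our formulation, not the authors' script):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(sentences_a, sentences_b):
    """Estimate d_A = 2(1 - 2*eps), where eps is the held-out error of a
    linear bag-of-words SVM trained to separate the two domains."""
    texts = list(sentences_a) + list(sentences_b)
    labels = [0] * len(sentences_a) + [1] * len(sentences_b)
    features = CountVectorizer().fit_transform(texts)   # bag-of-words features
    x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2)
    clf = LinearSVC().fit(x_tr, y_tr)
    eps = 1.0 - clf.score(x_te, y_te)                    # generalization error
    return 2.0 * (1.0 - 2.0 * eps)
```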

4.2 Experimental Protocol

All models are implemented using the TensorFlow framework and are based on the sequence-to-sequence implementation of Britz et al. (2017)³. We use a 4-layer bidirectional LSTM encoder with 512 units, and a 4-layer LSTM decoder. Recall from Section 3 that we use Bahdanau-style attention (Bahdanau et al., 2015). Dropout of 0.2 (0.8 keep probability) is applied to the input of each cell. We optimize using Adam with a learning rate of 0.0001 (Kingma and Ba, 2014; Abadi et al., 2016). Each model is trained on 8 Nvidia K40m GPUs with a batch size of 128. The combined Japanese dataset took approximately a week to reach convergence.

¹ Using https://github.com/google/sentencepiece
² http://www.statmt.org/wmt17/translation-task.html

During training, we save model checkpoints every hour and choose the best one using the BLEU score on the validation set. To calculate BLEU scores for the EN-JA task, we follow the instructions from WAT⁴ and use the KyTea tokenizer (Neubig et al., 2011). For the EN-FR and EN-ZH tasks, we follow the WMT '16 guidelines and tokenize with the Moses tokenizer.perl script (Koehn et al., 2007).
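For reference, the training configuration described in this section can be summarized as follows; this is an illustrative restatement with ad hoc keys, not a file from the authors' codebase.

```python
# Hyperparameters as reported in Section 4.2 (illustrative summary only).
HPARAMS = {
    "encoder": "4-layer bidirectional LSTM, 512 units",
    "decoder": "4-layer LSTM",
    "attention": "Bahdanau (additive)",
    "dropout": 0.2,            # i.e. 0.8 keep probability on cell inputs
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 128,
    "hardware": "8x Nvidia K40m GPUs",
    "checkpoint_selection": "hourly checkpoints, best validation BLEU",
}
```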

4.3 Results

The results of our proxy A-distance experiment are given in Table 1. $d_A$ is a purely comparative metric that has little meaning in isolation (Ben-David et al., 2007); what is evident is that the EN-JA and EN-ZH domain pairs are more disparate, while the EN-FR domains are more similar.

Language    Domain 1    Domain 2    d_A
Japanese    ASPEC       SubCrawl    1.89
Chinese     News        TED         1.73
French      Europarl    OpenSubs    1.23

Table 1: Proxy A-distances (d_A) for each domain pair.

To understand the interactions between these models and mixed-domain data, we train and test on ASPEC, SubCrawl, and their concatenation. We do the same for the French and Chinese baselines.

In general, our results support the hypothesis that the naive concatenation of data from disparate domains can degrade in-domain translation quality (Table 2). In both the EN-JA and EN-ZH settings, the domains undergoing mixing are disparate enough to degrade

³ https://github.com/google/seq2seq
⁴ http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/

Figure 2: Comparative performance and domain distance. Trends corresponding to a least-squares fit are indicated with dashed lines.
(a) Comparing the mixed-domain and individual-domain baselines (BLEU_mixed − BLEU_individual) while varying domain distance. The more different two domains are, the more their mixture degrades performance.
(b) Comparing the proposed discriminator and individual-domain baseline (BLEU_discriminator − BLEU_individual) while varying domain distance. Compared to Figure 2a, performance is less degraded when using the discriminator.
(c) Comparing the proposed discriminator approach and mixed-domain baseline (BLEU_discriminator − BLEU_mixed) while varying domain distance. The discriminator always improves over the baseline, and this is accentuated when the merged domains are more distant.

performance when mixed, and the proposed techniques recover some of this performance drop. In the EN-ZH setting we observe this drop even though its domains are less distant than those of EN-JA. Notably, in this setting, the proposed techniques successfully improve performance over single-domain training.

For a more detailed perspective on this result, Figure 2a depicts the mixed-domain/individual-domain performance differential as a function of domain distance. The two share a negative association, suggesting that the more distant two domains are, the more their merger degrades performance. This degradation is particularly strong in Japanese due to the vast structural differences between formal and casual language. The vocabularies, conjugational patterns, and word attachments all follow different rules in this case (Hori, 1986).

We then trained and tested our proposed methods on the same mixed data (Table 2). Our results generally agree with the hypothesis that the diversity of information in heterogeneous data can be leveraged to improve in-domain translation. Overall, we find that all of the proposed methods outperform their respective baselines in most settings, but that the discriminator appears the most reliable. It bested its counterparts in 4 of 6 trials, and was

EN-JA Model          ASPEC     SubCrawl
ASPEC                38.87     3.85
SubCrawl             2.74      16.91
ASPEC + SubCrawl     33.85     14.34
Discriminator        35.01     15.38
Adv. Discriminator   29.87     13.31
Target Token         35.05     14.92

EN-FR Model          Europarl  OpenSubs
Europarl             34.51     13.36
OpenSubtitles        13.12     15.2
Europarl + OpenSubs  38.26     27.9
Discriminator        39.03     27.91
Adv. Discriminator   38.38     25.67
Target Token         39.1      25.32

EN-ZH Model          News      TED
News                 12.75     3.12
TED                  2.79      8.41
News + TED           11.36     6.67
Discriminator        12.88     8.64
Adv. Discriminator   12.15     8.16
Target Token         11.98     7.69

Table 2: BLEU scores for models trained on various domains and languages (both mixed and unmixed). Rows correspond to training domains and columns correspond to test domains. Note that our single-domain ASPEC results are state-of-the-art, indicating the strength of these baselines.

the only approach that outperformed both the individually fit and naively mixed baselines in every trial.

Figure 2c depicts the dynamics of the discriminator approach. More specifically, this figure shows the discriminator/naive-mixing performance differential as a function of domain distance. The two share a positive association, suggesting that the more distant two domains are, the more the discriminator helps performance. This may be because it is easier to classify distant domains, so the discriminator can fit the data better and its gradients encourage the upstream encoder to include more useful domain-related structure.

The adversarial discriminator architecture yielded improvements on the small datasets, but underperformed on EN-JA. It is possible that the grammatical differences inherent to the casual and polite domains are such that semantic information was lost in the process of forcing their encoded distributions to match. Additionally, adversarial objective functions are notoriously difficult to optimize, and this model was prone to falling into poor local optima during training.

The simpler target token approach also yields improvement over the baselines, just barely surpassing that of the discriminator for ASPEC. This approach has the practical benefit of requiring no architectural changes to an off-the-shelf NMT system.

Our EN-FR results are particularly interesting. Though the data seem like they should come from sufficiently distant domains (parliament proceedings and subtitles), the domains are actually quite close according to $d_A$ (Table 1). Since these domains are so close, their merger is able to improve baseline performance. Thus, if the constituent domains are sufficiently close, their merger does indeed help.

Next, we investigated the optimization dynamics of these models by examining their learning curves. Curves for the baselines and discriminative models trained on EN-JA data are depicted in Figure 3a. Single-domain training clearly outperforms mixed training, and it appears that adding a discriminative strategy provides additional gains. From Figure 3b we can see that the discriminator approach (not reversing gradients) learns to fit the domain distribution quickly, implying that the Japanese domains were in fact quite distant and easily classifiable.

5 Related Work

Our work builds on a recent literature on domain adaptation strategies in Neural Machine Translation. Prior work in this space has proposed two general categories of methods.

The first proposed method is to take models trained on the source domain and finetune them on target-domain data. Luong and Manning (2015) and Zoph et al. (2016) explore how to improve transfer learning for a low-resource language pair by finetuning only parts of the network. Chu et al. (2017) empirically evaluate domain adaptation methods and propose mixing source and target domain data during finetuning. Freitag and Al-Onaizan (2016) explored finetuning using only a small subset of target domain data. Note that we did not compare directly against these techniques because they are intended to transfer knowledge to a new domain and perform well on only the target domain. We are concerned with multi-domain settings, where performance on all constituent domains is important.

A second strain of “multi-domain” thought in NMT involves appending a domain indicator token to each source sequence (Kobus et al., 2016). Similarly, Johnson et al. (2016) use a token for cross-lingual translation instead of domain identification. This idea was further refined by Chu et al. (2017), who integrated source tokenization into the domain finetuning paradigm. While they require no changes to the NMT architecture, these approaches are inherently limited because they stipulate that domain information for unseen test examples be known. For example, when using a trained model to translate user-generated sentences, we do not know the domain a priori, and this approach cannot be used.

Apart from the recent progress in domain adaptation for NMT, we draw on work that transfers knowledge between domains in semisupervised settings. Our strongest influence is adversarial domain adaptation (Ganin et al., 2015), where feature distributions in the source and target domains are matched

Figure 3: Training curves for domain mixing and discriminator loss.
(a) Log perplexity evaluated on the ASPEC validation set (x-axis: training steps; curves: train_combined, target_token, train_aspec, discriminator). Single-domain training outperforms combined training. The discriminator and target token approaches improve over the naive combined data.
(b) Discriminator training loss over time on the EN-JA data (x-axis: training steps). The discriminator learns to fit the data almost perfectly after a few hundred thousand iterations.

with a Domain-Adversarial Neural Network (DANN). Another approach to this problem is that of Long et al. (2015), which measures and minimizes the distance between domain distribution means before training, thereby negating any unique properties.

There is some overlap between past research in multi-domain statistical machine translation (SMT) and the ideas of this paper. Farajian et al. (2017) compared the efficacy of phrase-based SMT and NMT on multi-domain data, observing performance degradations similar to ours in mixed-domain settings. However, that study did not seek to understand the issue and offered no explanation, analysis, or solution to the problem. Another line of work merged data by selecting only examples with a propensity for relevance in a multi-domain setting (Mandal et al., 2008; Axelrod et al., 2011). In a strategy that echoes NMT fine-tuning, Pecina et al. (2012) used a variety of in-domain development sets to tune hyperparameters to a generalized setting. Similar to our domain discriminator network, Clark et al. (2012) crafted domain-specific features that are used by the decoder. However, some of these systems' features are downstream of binary indicators for domain identity. This approach, then, faces the same inherent limitation as source tokenization: domain knowledge is required for inference. Furthermore, the domain features of this system are integral to the decoding process, while our discriminator network is an independent module that can be detached during inference.

6 Conclusion

We presented three novel models for applying Neural Machine Translation to multi-domain settings, and demonstrated their efficacy across six domains in three language pairs, achieving a new state of the art in EN-JA translation in the process. Unlike the naive combining of training data, these models improve translation quality on each constituent domain. Furthermore, these models are the first of their kind to not require knowledge of each example's domain at inference time. All the proposed approaches outperform the naive combining of training data, so we advise practitioners to implement whichever most easily fits into their pre-existing pipelines; however, the approach based on a discriminator network offered the most reliable results.

In future work we hope to explore the dynamics of adversarial discriminative training objectives, which force the model to learn domain-agnostic features, in the related problem of adaptation to unseen test-time domains.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Anonymous. 2017. SubCrawl: A colloquial parallel corpus for English-Japanese translation. Manuscript submitted for publication.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 355–362.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. 2007. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems 19:137.

D. Britz, A. Goldie, T. Luong, and Q. Le. 2017. Massive exploration of neural machine translation architectures. ArXiv e-prints.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of simple domain adaptation methods for neural machine translation. CoRR abs/1701.03214. http://arxiv.org/abs/1701.03214.

Jonathan H. Clark, Alon Lavie, and Chris Dyer. 2012. One system, many domains: Open-domain statistical machine translation via feature augmentation.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

M. Amin Farajian, Marco Turchi, Matteo Negri, Nicola Bertoldi, and Marcello Federico. 2017. Neural vs. phrase-based machine translation in a multi-domain scenario. EACL 2017, page 280.

Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. CoRR abs/1612.06897. http://arxiv.org/abs/1612.06897.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2015. Domain-adversarial training of neural networks. arXiv preprint arXiv:1505.07818.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Barry Haddow and Philipp Koehn. 2012. Analysing the effect of out-of-domain data on SMT systems. In Proceedings of the Seventh Workshop on Statistical Machine Translation. Association for Computational Linguistics, pages 422–432.

Motoko Hori. 1986. A sociolinguistic analysis of the Japanese honorifics. Journal of Pragmatics 10(3):373–386.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR abs/1611.04558. http://arxiv.org/abs/1611.04558.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Catherine Kobus, Josep Crego, and Jean Senellart. 2016. Domain control for neural machine translation. arXiv preprint arXiv:1612.06140.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, pages 177–180.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. 2015. Learning transferable features with deep adaptation networks. In ICML, pages 97–105.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In EMNLP.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015b. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Arindam Mandal, Dimitra Vergyri, Wen Wang, Jing Zheng, Andreas Stolcke, Gokhan Tur, D. Hakkani-Tur, and Necip Fazil Ayan. 2008. Efficient data selection for machine translation. In Spoken Language Technology Workshop, 2008. SLT 2008. IEEE, pages 261–264.

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2016), pages 2204–2208.

Graham Neubig, Yosuke Nakata, and Shinsuke Mori. 2011. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, pages 529–533.

Pavel Pecina, Antonio Toral, and Josef van Genabith. 2012. Simple and effective parameter tuning for domain adaptation of statistical machine translation. In COLING, pages 2209–2224.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214–2218.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. CoRR abs/1604.02201. http://arxiv.org/abs/1604.02201.