integrating weakly supervised word sense disambiguation into...

15
Integrating Weakly Supervised Word Sense Disambiguation into Neural Machine Translation Xiao Pu * Nuance Communications Jülicher Str. 376 52070 Aachen, Germany [email protected] Nikolaos Pappas Idiap Research Institute Rue Marconi 19, CP 592 1920 Martigny, Switzerland [email protected] James Henderson Idiap Research Institute Rue Marconi 19, CP 592 1920 Martigny, Switzerland [email protected] Andrei Popescu-Belis HEIG-VD / HES-SO Route de Cheseaux 1, CP 521 1401 Yverdon-les-Bains, Switzerland [email protected] Abstract This paper demonstrates that word sense disambiguation (WSD) can improve neural machine translation (NMT) by widening the source context considered when modeling the senses of potentially ambiguous words. We first introduce three adaptive cluster- ing algorithms for WSD, based on k-means, Chinese restaurant processes, and random walks, which are then applied to large word contexts represented in a low-rank space and evaluated on SemEval shared-task data. We then learn word vectors jointly with sense vectors defined by our best WSD method, within a state-of-the-art NMT sys- tem. We show that the concatenation of these vectors, and the use of a sense selec- tion mechanism based on the weighted av- erage of sense vectors, outperforms several baselines including sense-aware ones. This is demonstrated by translation on five lan- guage pairs. The improvements are above one BLEU point over strong NMT base- lines, +4% accuracy over all ambiguous nouns and verbs, or +20% when scored manually over several challenging words. 1 Introduction The correct translation of polysemous words re- mains a challenge for machine translation (MT). While some translation options may be in- terchangeable, substantially different senses of * Work conducted while at the Idiap Research Institute. source words must generally be rendered by dif- ferent words in the target language. Hence, an MT system should identify – implicitly or explicitly – the correct sense conveyed by each occurrence in order to generate an appropriate translation. For instance, in the following sentence from Europarl, the translation of ‘deal’ should convey the sense ‘to handle’ (in French ‘traiter’) and not ‘to cope’ (in French ‘remédier’, which is wrong): Source: How can we guarantee the system of prior notification for high-risk products at ports that have the necessary facilities to deal with them? Reference translation: Comment pouvons-nous garantir le système de notification préalable pour les produits présentant un risque élevé dans les ports qui disposent des installations nécessaires pour traiter ces produits ? Baseline NMT: [...] les ports qui disposent des moyens nécessaires pour y remédier ? Sense-aware NMT: [...] les ports qui disposent des installations nécessaires pour les traiter ? Current MT systems perform word sense disam- biguation implicitly, based on co-occurring words in a rather limited context. In phrase-based sta- tistical MT, the context size is related to the or- der of the language model (often between 3 and 5) and to the length of n-grams in the phrase ta- ble (seldom above 5). In attention-based neu- ral MT (NMT), the context extends to the entire

Upload: others

Post on 23-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

Integrating Weakly Supervised Word Sense Disambiguationinto Neural Machine Translation

Xiao Pu∗Nuance Communications

Jülicher Str. 37652070 Aachen, [email protected]

Nikolaos PappasIdiap Research InstituteRue Marconi 19, CP 592

1920 Martigny, [email protected]

James HendersonIdiap Research InstituteRue Marconi 19, CP 592

1920 Martigny, [email protected]

Andrei Popescu-BelisHEIG-VD / HES-SO

Route de Cheseaux 1, CP 5211401 Yverdon-les-Bains, [email protected]

Abstract

This paper demonstrates that word sensedisambiguation (WSD) can improve neuralmachine translation (NMT) by widening thesource context considered when modelingthe senses of potentially ambiguous words.We first introduce three adaptive cluster-ing algorithms for WSD, based on k-means,Chinese restaurant processes, and randomwalks, which are then applied to large wordcontexts represented in a low-rank spaceand evaluated on SemEval shared-task data.We then learn word vectors jointly withsense vectors defined by our best WSDmethod, within a state-of-the-art NMT sys-tem. We show that the concatenation ofthese vectors, and the use of a sense selec-tion mechanism based on the weighted av-erage of sense vectors, outperforms severalbaselines including sense-aware ones. Thisis demonstrated by translation on five lan-guage pairs. The improvements are aboveone BLEU point over strong NMT base-lines, +4% accuracy over all ambiguousnouns and verbs, or +20% when scoredmanually over several challenging words.

1 Introduction

The correct translation of polysemous words re-mains a challenge for machine translation (MT).While some translation options may be in-terchangeable, substantially different senses of

∗ Work conducted while at the Idiap Research Institute.

source words must generally be rendered by dif-ferent words in the target language. Hence, an MTsystem should identify – implicitly or explicitly –the correct sense conveyed by each occurrence inorder to generate an appropriate translation. Forinstance, in the following sentence from Europarl,the translation of ‘deal’ should convey the sense‘to handle’ (in French ‘traiter’) and not ‘to cope’(in French ‘remédier’, which is wrong):

Source: How can we guarantee the system ofprior notification for high-risk products atports that have the necessary facilities to dealwith them?

Reference translation: Comment pouvons-nousgarantir le système de notification préalablepour les produits présentant un risque élevédans les ports qui disposent des installationsnécessaires pour traiter ces produits ?

Baseline NMT: [. . .] les ports qui disposent desmoyens nécessaires pour y remédier ?

Sense-aware NMT: [. . .] les ports qui disposentdes installations nécessaires pour les traiter ?

Current MT systems perform word sense disam-biguation implicitly, based on co-occurring wordsin a rather limited context. In phrase-based sta-tistical MT, the context size is related to the or-der of the language model (often between 3 and5) and to the length of n-grams in the phrase ta-ble (seldom above 5). In attention-based neu-ral MT (NMT), the context extends to the entire

Page 2: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

sentence, but multiple word senses are not mod-eled explicitly. The implicit sense informationcaptured by word representations used in NMTleads to a bias in the attention mechanism towardsdominant senses. Therefore, the NMT decoderscannot clearly identify the contexts in which oneword sense should be used instead of another one.Hence, while NMT can use local constraints totranslate ‘great rock band’ into French as ‘su-perbe groupe de rock’ rather than ‘grande bandede pierre’ – thus correctly assigning the musicalrather than geological sense to ‘rock’ – it fails todo so for word senses which require larger con-texts.

In this paper, we demonstrate that the explicitmodeling of word senses can be helpful to NMTby using combined vector representations of wordtypes and senses, which are inferred from contextsthat are larger than that of state-of-the-art NMTsystems. We make the following contributions:

• Weakly supervised word sense disambigua-tion (WSD) approaches integrated into NMT,based on three adaptive clustering methodsand operating on large word contexts.

• Three sense selection mechanisms for inte-grating WSD into NMT, respectively basedon top, average, and weighted average (i.e.,attention) of word senses.

• Consistent improvements against baselineNMT on five language pairs: from Englishinto Chinese, Dutch, French, German, andSpanish.

The paper is organized as follows. In Section 2,we present three adaptive WSD methods basedon k-means clustering, the Chinese restaurant pro-cess, and random walks. In Section 3, we presentthree sense selection mechanisms which integratethe word senses into NMT. The experimental de-tails appear in Section 4, while the results concern-ing the optimal parameter settings are presented inSection 5, where we also show that our WSD com-ponent is competitive on the SemEval 2010 sharedtask. Section 6 presents our results: the BLEUscores increase by about one point with respect toa strong NMT baseline, the accuracy of ambigu-ous noun and verb translation improves by about4%, while a manual evaluation of several challeng-ing and frequent words shows an improvement ofabout 20%. A discussion of related work appearsfinally in Section 7.

2 Adaptive Sense Clustering for MT

In this section, we present the three unsupervisedor weakly supervised WSD methods used in ourexperiments, which aim at clustering different oc-currences of the same word type according to theirsenses. We first consider all nouns and verbs inthe source texts that have more than one sense inWordNet, and extract from there the definition ofeach sense and, if available, the example. For eachoccurrence of such nouns or verbs in the trainingdata, we use word2vec to build word vectors fortheir contexts, i.e. neighboring words. All vec-tors are passed to an unsupervised clustering al-gorithm, possibly instantiated with WordNet def-initions or examples. The resulting clusters canbe numbered and used as labels, or their centroidword vector can be used as well, as explained inSection 3.

This approach answers several limitations ofprevious supervised or unsupervised WSD meth-ods. On the one hand, supervised methods requiredata with manually sense-annotated labels and arethus limited to typically small subsets of all wordtypes, e.g. up to a hundred of content words tar-geted in SemEval 20101 (Manandhar et al., 2010)and up to a thousand words in SemEval 2015(Moro and Navigli, 2015). In contrast, our methoddoes not require labeled texts for training, and ap-plies to all word types with multiple senses inWordNet (e.g. nearly four thousands for some datasets, see Table 1). On the other hand, unsupervisedmethods often pre-define the number of possiblesenses for all ambiguous words before clusteringtheir occurrences, and do not adapt to what is ac-tually observed in the data; as a result, the sensesare often too fine-grained for the needs of MT, es-pecially for a particular domain. In contrast, ourmodel learns the number of senses for each ana-lyzed ambiguous word directly from the data.

2.1 Definitions and Notations

For each noun or verb type Wt appearing in thetraining data, as identified by the Stanford POStagger,2 we extract the senses associated to it inWordNet3 (Fellbaum, 1998) using NLTK.4 Specif-ically, we extract the set of definitions Dt ={dtj |j = 1, . . . ,mt} and the set of examples of

1www.cs.york.ac.uk/semeval2010_WSI2nlp.stanford.edu/software3wordnet.princeton.edu/4www.nltk.org/howto/wordnet.html

Page 3: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

use Et = {etj |j = 1, . . . , nt}, each of them con-taining multiple words. While most of the sensesare accompanied by a definition, only about halfof them also include an example of use.

Definitions dtj and examples etj are repre-sented by vectors defined as the average of theword embeddings over all the words constitutingthem (except stopwords). Formally, these vec-tors are dtj = (

∑wl∈dtj wl)/|dtj | and etj =

(∑

wl∈e′tjwl)/|e′tj |, respectively, where |dtj | is the

number of tokens of the definition. While the en-tire definition dtj is used to build the dtj vector,we do not consider all words in the example etj ,but limit the sum to a fragment e′tj contained in awindow of size c centered around the consideredword, to avoid noise from long examples. Hence,we divide by the number of words in this window,noted |e′tj |. All the word vectors wl above arepre-trained word2vec embeddings from Google5

(Mikolov et al., 2013). If dim is the dimensional-ity of the word vector space, then all vectors wl,dtj , and etj are in Rdim . Each definition vectordtj or example vector etj for a word type Wt isconsidered as a center vector for each sense dur-ing the clustering procedure.

Turning now to tokens, each word occurrencewi in a source sentence is represented by the av-erage vector ui of the words from its context, i.e.a window of c words centered on wi, c being aneven number. We calculate the vector ui for wi byaveraging vectors from c/2 words before wi andfrom c/2 words after it. We stop nevertheless atthe sentence boundaries, and filter out stopwordsbefore averaging.

2.2 Clustering Word Occurrences by Sense

We adapt three clustering algorithms to our needsfor WSD applied to NMT. The objective is to clus-ter all occurrences wi of a given word type Wt,represented as word vectors ui, according to thesimilarity of their senses, as inferred from the sim-ilarity of the context vectors. We compare the al-gorithms empirically in Section 5.k-means Clustering. The original k-means al-

gorithm (MacQueen, 1967) aims to partition a setof items, which are here tokens w1, w2, . . . , wn ofthe same word type Wt, represented through theirembeddings u1,u2, . . . ,un where ui ∈ Rdim .The goal of k-means is to partition (or cluster)these vectors into k sets S = {S1, S2, . . . , Sk} so

5code.google.com/archive/p/word2vec/

as to minimize the within-cluster sum of squareddistances to each centroid µi:

S = argminS

k∑i=1

∑u∈Si

||u− µi||2. (1)

At the first iteration, when there are no clusters yet,the algorithm selects k random points as centroidsof the k clusters. Then, at each subsequent itera-tion t, the algorithm calculates for each candidatecluster a new centroid of the observations, definedas their average vector, as follows:

µ t+1i =

1

|Sti |

∑uj∈St

i

uj . (2)

In an earlier application of k-means to phrase-based statistical MT, but not neural MT, we madeseveral modifications to the original k-means algo-rithm, to make it adaptive to the word senses ob-served in training data (Pu et al., 2017). We main-tain these changes and summarize them brieflyhere. The initial number of clusters kt for eachambiguous word type Wt is set to the number ofits senses in WordNet, either considering only thesenses that have a definition or those that have anexample. The centroids of the clusters are initial-ized to the vectors representing the senses fromWordNet, either using their definition vectors dtj

or their example vectors etj . These initializationsare thus a form of weak supervision of the cluster-ing process.

Finally, and most importantly, after running thek-means algorithm, the number of clusters foreach word type is reduced by removing the clus-ters that contain fewer than 10 tokens and assign-ing their tokens to the closest large cluster. ‘Clos-est’ is defined in terms of the cosine distance be-tween ui and their centroids. The final numberof clusters thus depends on the observed occur-rences in the training data (which is the same dataas for MT), and avoids modeling infrequent senseswhich are difficult to translate anyway. When usedin NMT, in order to assign each new token fromthe test data to a cluster, i.e. to perform WSD, weselect the closest centroid, again in terms of cosinedistance.

Chinese Restaurant Process. The ChineseRestaurant Process (CRP) is an unsupervisedmethod considered as a practical interpretationof a Dirichlet process (Ferguson, 1973) for non-parametric clustering. In the original analogy,

Page 4: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

each token is compared to a customer in a restau-rant, and each cluster is a table where customerscan be seated. A new customer can choose to sitat a table with other customers, with a probabilityproportional to the numbers of customers at thattable, or sit at a new, empty table. In an applicationto multi-sense word embeddings, Li and Jurafsky(2015) proposed that the probability to “sit at a ta-ble” should also depend on the contextual similar-ity between the token and the sense modeled bythe table. We build upon this idea and bring sev-eral modifications that allow for an instantiationwith sense-related knowledge from WordNet, asfollows.

For each word typeWt appearing in the data, westart by fixing the maximal number kt of senses orclusters as the number of senses of Wt in Word-Net. This avoids an unbounded number of clus-ters (as in the original CRP algorithm) and the riskof cluster sparsity by setting a non-arbitrary limitbased on linguistic knowledge. Moreover, we de-fine the initial centroid of each cluster as the wordvector corresponding either to the definition dtj ofthe respective sense, or alternatively to the exam-ple etj illustrating the sense.

For each token wi and its context vector ui thealgorithm decides whether the token is assigned toone of the sense clusters Sj to which previous to-kens have been assigned, or whether it is assignedto a new empty cluster, by selecting the optionwhich has the highest probability, which is com-puted as follows:

P ∝

Nj(λ1s(ui,dtj) + λ2s(ui,µj))

if Nj 6= 0 (non-empty sense)γs(ui,dtj)

if Nj = 0 (empty sense).

(3)

In other words, for a non-empty sense, the proba-bility is proportional to the popularity of the sense(number of tokens it already contains, Nj) and tothe weighted sum of two cosine similarities s(·, ·):one between the context vector ui of the token andthe definition of the sense dtj , and another onebetween ui and the average context vector of thetokens already assigned to the sense (µj). Theseterms are weighted by the two hyper-parametersλ1 and λ2. For an empty sense, only the secondterm is used, weighted by the γ hyper-parameter.

Random Walks. Finally, we also considerfor comparison a WSD method based on randomwalks on the WordNet knowledge graph (Agirre

and Soroa, 2009; Agirre et al., 2014) availablefrom the UKB toolkit.6 In the graph, senses cor-respond to nodes and the relationships or depen-dencies between pairs of senses correspond to theedges between those nodes. From each input sen-tence, we extract its content words (nouns, verbs,adjectives, and adverbs) that have an entry in theWordNet weighted graph. The method calculatesthe probability of a random walk over the graphfrom a target word’s sense ending on any othersense in the graph, and determines the sense withthe highest probability for each analyzed word. Inthis case, the random walk algorithm is PageRank(Grin and Page, 1998), which computes a relativestructural importance or ‘rank’ for each node.

3 Integration with Neural MT

3.1 Baseline Neural MT model

We now present several models integrating WSDinto NMT, starting from an attention-based NMTbaseline (Bahdanau et al., 2015; Luong et al.,2015). Given a source sentence X with words wx,X = (wx

1 , wx2 , ..., w

xT ), the model computes a con-

ditional distribution over translations, expressedas p(Y = (wy

1 , wy2 , ..., w

yT ′)|X). The neural net-

work model consists of an encoder, a decoder andan attention mechanism. First, each source wordwxt ∈ V is projected from a one-hot word vec-

tor into a continuous vector space representationxt. Then, the resulting sequence of word vectorsis read by the bidirectional encoder which con-sists of forward and backward recurrent networks(RNNs). The forward RNN reads the sequence inleft-to-right order, i.e.

−→h t =

−→φ (−→h t−1,xt), while

the backward RNN reads it right-to-left:←−h t =←−

φ (←−h t+1,xt).

The hidden states from the forward and back-ward RNNs are concatenated at each time stept to form an ‘annotation’ vector ht = [

−→ht;←−ht].

Taken over several time steps, these vectors formthe ‘context’ i.e. a tuple of annotation vectors C =(h1,h2, ...,hT). The recurrent activation func-tions

−→φ and

←−φ are either long short-term memory

units (LSTM) or gated recurrent units (GRU).The decoder RNN maintains an internal hidden

state zt′ . After each time step t′, it first uses the

6ixa2.si.ehu.es/ukb. Strictly speaking, this is theonly genuine WSD method, as the two previous ones pertainto sense induction rather than disambiguation. However, forsimplicity, we will refer to all of them as WSD.

Page 5: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

attention mechanism to weight the annotation vec-tors in the context tuple C. The attention mecha-nism takes as input the previous hidden state of thedecoder and one of the annotation vectors, and re-turns a relevance score et′,t = fATT(zt′−1,ht).These scores are normalized to obtain attentionscores:

αt′,t = exp(et′,t)/

T∑k=1

exp(et′,k). (4)

These scores serve to compute a weighted sum ofannotation vectors ct′ =

∑Tt=1 αt′,tht, which are

used by the decoder to update its hidden state:

zt′ = φz(zt′−1,yt′−1, ct′). (5)

Similarly to the encoder, φz is implemented as ei-ther an LSTM or GRU and yt′−1 is the target-side word embedding vector corresponding toword wy.

3.2 Sense-aware Neural MT ModelsTo model word senses for NMT, we concatenatethe embedding of each token with a vector rep-resentation of its sense, either obtained from oneof the clustering methods presented in Section 2,or learned during encoding, as we will explain.In other words, the new vector w′i represent-ing each source token wi consists of two parts:w′i = [wi ; µi], where wi is the word em-bedding learned by the NMT, and µi is the senseembedding obtained from WSD or learned by theNMT. To represent these senses, we create twodictionaries, one for words and the other one forsense labels, which will be embedded in a low-dimensional space, before the encoder. We pro-pose several models for using and/or generatingsense embeddings for NMT, named and defined asfollows.

Top sense (TOP). In this model, we directly usethe sense selected for each token by one of theWSD systems above, and use the embeddings ofthe respective sense as generated by NMT aftertraining.

Weighted average of senses (AVG). Instead offully trusting the decision of a WSD system (evenone adapted to MT), we consider all listed sensesand the respective cluster centroids learned by theWSD system. Then we convert the distances dlbetween the input token vector and the centroid ofeach sense Sl into a normalized weight distribu-tion either by a linear or a logistic normalization:

ωj =1− dj∑1≤l≤k dl

or ωj =e−d

2j∑

1≤l≤k e−d2l

, (6)

where k is the total number of senses of token wi.The sense embedding for each token is computedas the weighted average of all sense embeddings:

µi =∑

1≤j≤kωjµij . (7)

Attention-based sense weights (ATT). Insteadof obtaining the weight distribution from the cen-troids computed by WSD, we also propose to dy-namically compute the probability of relatednessto each sense based on the current word and senseembeddings during encoding, as follows. Givena token wi, we consider all the other tokens inthe sentence (w1, . . . , wi−1, wi+1, . . . , wL) as thecontext of wi, where L is the length of the sen-tence. We define the context vector of wi as themean of all the embeddings uj of the words wj ,i.e. ui = (

∑l 6=i ul)/(L − 1). Then, we compute

the similarity f(ui,µij) between each sense em-bedding µij and the context vector ui using an ad-ditional attention layer in the network, with twopossibilities which will be compared empirically:

f(ui,µij) = υT tanh(Wui + Uµij) (8)

or f(ui,µij) = uTi Wµij . (9)

The weights ωj are now obtained through the fol-lowing softmax normalization:

ωj =ef(ui,µij)∑

1≤l≤k ef(ui,µil)

. (10)

Finally, the average sense embedding is obtainedas in Eq. (7) above, and is concatenated to theword vector ui.

ATT model with initialization of embeddings(ATTini ). The fourth model is similar to the ATTmodel, with the difference that we initialize theembeddings of the source word dictionary usingthe word2vec vectors of the word types, and theembeddings of the sense dictionary using the cen-troid vectors obtained from k-means.

4 Data, Metrics and Implementation

Datasets. We train and test our sense-aware MTsystems on the data shown in Table 1: the UN

Page 6: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

TL Train Dev TestLabels

WordsNouns Verbs

FR0.5M 5k 50k 3,910 1,627 2,0065.3M 4,576 6,003 8,276 3,059 3,876

DE0.5M 5k 50k 3,885 1,576 1,9764.5M 3,000 5,172 7,520 1,634 3,194

ES0.5M 5k 50k 3,862 1,627 1,9873.9M 4,576 6,003 7,549 2,798 3,558

ZH 0.5M 5K 50K 3,844 1,475 1,915NL 0.5M 5K 50K 3,915 1,647 2,210

Table 1: Size of data sets used for machine translationfrom English to five different target languages (TL).

Corpus7 (Rafalovitch and Dale, 2009) and the Eu-roparl Corpus8 (Koehn, 2005). We first experi-ment with our models using the same dataset andprotocol as in our previous work (Pu et al., 2017),to enable comparisons with phrase-based statis-tical MT systems, for which the sense of eachambiguous source word was modeled as a factor.Moreover, in order to make a better comparisonwith other related approaches, we train and testour sense-aware NMT models on large datasetsfrom WMT shared tasks over three language pairs(EN/DE, EN/ES and EN/FR).

The dataset used in our previous work consistsof 500k parallel sentences for each language pair,5k for development and 50k for testing. The dataoriginates from UN for EN/ZH, and from Europarlfor the other pairs. The source sides of thesesets contain around 2,000 different English wordforms (after lemmatization) that have more thanone sense in WordNet. Our WSD system gener-ates ca. 3.8k different noun labels and 1.5k verblabels for these word forms.

The WMT datasets additionally used in this pa-per are the following ones. First, we use the com-plete EN/DE set from WMT 2016 (Bojar et al.,2016) with a total of ca. 4.5M sentence pairs. Inthis case, the development set is NewsTest 2013,and the testing set is made of NewsTest 2014 and2015. Second, for EN/FR and EN/ES, we usedata from WMT 2014 (Bojar et al., 2014)9 with5.3M sentences for EN/FR and 3.8M sentences forEN/ES. Here, the development sets are NewsTest2008 and 2009, while the testing sets are NewsTest

7www.uncorpora.org8www.statmt.org/europarl9We selected the data from different years of WMT be-

cause the EN/FR and EN/ES pairs were only available inWMT 2014.

2012 and 2013 for both language pairs. The sourcesides of these larger additional sets contain around3,500 unique English word forms with more thanone sense in WordNet, and our system generatesca. 8k different noun labels and 2.5k verb labelsfor each set.

Finally, for comparison purposes and model se-lection, we use the WIT3 Corpus10 (Cettolo et al.,2012), a collection of transcripts of TED talks. Weuse 150k sentence pairs for training, 5k for devel-opment and 50k for testing.

Pre-processing. Before assigning sense labels,we tokenize all the texts and identify the parts ofspeech using the Stanford POS tagger11. Then,we filter out the stopwords and the nouns whichare proper names according to the Stanford NameEntity Recognizer11. Furthermore, we convert theplural forms of nouns to their singular forms andthe verb forms to infinitives using the stemmerand lemmatizer from NLTK12, which is essentialbecause WordNet has description entries only forbase forms. The pre-processed text is used for as-signing sense labels to each occurrence of a nounor verb that has more than one sense in WordNet.

K-means settings. Unless otherwise stated, weadopt the following settings in the k-means algo-rithm, with the implementation provided in Scikit-learn (Pedregosa et al., 2011). We use the defi-nition of each sense for initializing the centroids,and later compare this choice with the use of ex-amples. We set kt, the initial number of clusters,to the number of WordNet senses of each ambigu-ous word type Wt, and set the window size for thecontext surrounding each occurrence to c = 8.

Neural MT. We build upon the attention-basedneural translation model (Bahdanau et al., 2015)from the OpenNMT toolkit (Klein et al., 2017)13.We use long short-term memory units (LSTM) andnot gated recurrent units (GRU). For the proposedATT and ATT ini models, we add an external atten-tion layer before the encoder, but do not otherwisealter the internals of the NMT model.

We set the source and target vocabulary sizesto 50,000 and the dimension of word embeddingsto 500, which is recommended for OpenNMT, soas to reach a strong baseline. For the ATT ini

model, since the embeddings from word2vec usedfor initialization have only 300 dimensions, we

10wit3.fbk.eu11nlp.stanford.edu/software12www.nltk.org13www.opennmt.net

Page 7: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

randomly pick up a vector with 200 dimensionswithin range [-0.1,0.1] and concatenate it with thevector from word2vec to reach the required num-ber of dimensions, ensuring a fair comparison.

It takes around 15 epochs (25-30 hours onIdiap’s GPU cluster) to train each of the five NMTmodels: the baseline and our four proposals. TheAVG model takes more time for training (around40 hours) since we use additional weights andsenses for each token. In fact, we limit the numberof senses for AVG to 5 per word type, after ob-serving that in WordNet there are fewer than 100words with more than 5 senses.

Evaluation metrics. For the evaluation of in-trinsic WSD performance, we use the V -score, theF1-score, and their average, as used for instanceat SemEval 2010 (Manandhar et al., 2010). TheV -score is the weighted harmonic mean of homo-geneity and completeness (favoring systems gen-erating more clusters than the reference), whilethe F1-score measures the classification perfor-mance (favoring systems generating fewer clus-ters). Therefore, the ranking metric for SemEval2010 is the average of the two.

We select the optimal model configurationbased on MT performance on development sets,as measured with the traditional multi-bleu score(Papineni et al., 2002). Moreover, to estimate theimpact of WSD on MT, we also measure the actualimpact on the nouns and verbs that have severalWordNet senses, by counting how many of themare translated exactly as in the reference transla-tion. To quantify the difference with the baseline,we use the following coefficient. First, for a cer-tain set of tokens in the source data, we note asNimproved the number of tokens which are trans-lated by our system with the same token as in thereference translation, but are translated differentlyby the baseline system. Conversely, we note asNdegraded the number of tokens which are trans-lated by the baseline system as in the reference, butdifferently by our system14. We use the normal-ized coefficient ρ = (Nimproved − Ndegraded)/T ,where T is the total number of tokens, as a met-ric to specifically evaluate the translation of wordssubmitted to WSD.

For all tables we mark in bold the best score percondition. For MT scores in Tables 5, 7, and 8, we

14The values of Nimproved and Ndegraded are obtainedusing automatic word alignment. They do not capture, ofcourse, the intrinsic correctness of a candidate translation, butonly its identity or not with one reference translation.

show the improvement over the baseline and itssignificance based on two confidence levels: ei-ther p < 0.05 (indicated with a ‘†’) or p < 0.01(‘‡’). P -values larger than 0.05 are treated as notsignificant and are left unmarked.

5 Optimal Values of the Parameters

5.1 Best WSD Method Based on BLEUWe first select the optimal clustering method andits initialization settings, in a series of experimentswith statistical MT over the WIT3 corpus, extend-ing and confirming our previous results (Pu et al.,2017). In Table 2, we present the BLEU and ρscores of our previous WSD+SMT system for thethree clustering methods, initialized with vectorseither from the WordNet definitions or from ex-amples, for two language pairs. We also provideBLEU scores of baseline systems and of oracleones, i.e. using correct senses as factors. The bestmethod is k-means and the best initialization iswith the vectors of definitions. All values of ρshow improvements over the baseline, with up to4% for k-means on DE/EN.

Moreover, we found that random initializationsunder-perform with respect to definitions or exam-ples. For a fair comparison, we set the number ofclusters equal either to the number of synsets withdefinitions or with examples, for each word type,and obtained BLEU scores on EN/ZH of 15.34and 15.27 respectively – hence lower than 15.54and 15.41 in Table 2. We investigated earlier (Puet al., 2017) the effect of the context window sur-rounding each ambiguous token, and found withthe WSD+SMT factored system on EN/ZH WIT3

data that the optimal size was 8, which we use hereas well.

5.2 Best WSD Method Based on V/F1 ScoresTable 3 shows our WSD results in terms of V -score and F1-score, comparing our methods (sixlines at the bottom) with other significant systemsthat participated in the SemEval 2010 shared task(Manandhar et al., 2010).15 The adaptive k-meansinitialized with definitions has the highest averagescore (35.20) and ranks among the top systemsfor most of the metrics individually. Moreover,the adaptive k-means method finds on average 4.5senses per word type, which is very close to theground-truth value of 4.46. Overall, we observed

15We provide comparisons with more systems from Se-mEval in our previous paper (Pu et al., 2017).

Page 8: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

Pair Initialization BLEU ρ (%)Baseline Graph CRP k-means Oracle Graph CRP k-means

EN/ZH Definitions 15.23 15.31 15.31 15.54 16.24 +0.20 +0.27 +2.25Examples 15.28 15.41 15.85 +0.13 +1.60

EN/DE Definitions 19.72 19.74 19.69 20.23 20.99 -0.07 -0.19 +3.96Examples 19.74 19.87 20.45 -0.12 +2.15

Table 2: Performance of the WSD+SMT factored system for two language pairs from WIT3, with three clusteringmethods and two initializations.

System V-score F1-score Average CAll Nouns Verbs All Nouns Verbs All Nouns VerbsUoY 15.70 20.60 8.50 49.80 38.20 66.60 32.75 29.40 37.50 11.54KCDC-GD 6.90 5.90 8.50 59.20 51.60 70.00 33.05 28.70 39.20 2.78Duluth-Mix-Gap 3.00 2.90 3.00 59.10 54.50 65.80 31.05 29.70 34.40 1.61k-means+definitions 13.65 14.70 12.60 56.70 53.70 59.60 35.20 34.20 36.10 4.45k-means+examples 11.35 11.00 11.70 53.25 47.70 58.80 32.28 29.30 35.25 3.58CRP + definitions 1.45 1.50 1.45 64.80 56.80 72.80 33.13 29.15 37.10 1.80CRP + examples 1.20 1.30 1.10 64.75 56.80 72.70 32.98 29.05 36.90 1.66Graph + definitions 11.30 11.90 10.70 55.10 52.80 57.40 33.20 32.35 34.05 2.63Graph + examples 9.05 8.70 9.40 50.15 45.20 55.10 29.60 26.96 32.25 2.08

Table 3: WSD results from three SemEval 2010 systems and our six systems, in terms of V -score, F1 score andtheir average. ‘C’ is the average number of clusters. The adaptive k-means using definitions outperforms the otherson the average of V and F1, when considering both nouns and verbs, or nouns only. The SemEval systems are UoY(Korkontzelos and Manandhar, 2010); KCDC-GD (Kern et al., 2010); and Duluth-Mix-Gap (Pedersen, 2010).

that k-means infers fewer senses per word typethan WordNet. These results show that k-meansWSD is effective and provides competitive per-formance against other weakly supervised alterna-tives (CRP or Random Walk) and even against Se-mEval WSD methods, but using additional knowl-edge not available to SemEval participants.

System and settings BLEUBaseline 29.55TOP 29.63 (+0.08)AVG with linear norm. in Eq. 6 29.67 (+0.12)AVG with logistic norm. in Eq. 6 30.15 (+0.60)ATT with NULL label 29.80 (+0.33)ATT with word used as label 30.23 (+0.68)ATTini with uT

i Wµij in Eq. 8 29.94 (+0.39)ATTini with tanh in Eq. 8 30.61 (+1.06)

Table 4: Performance of various WSD+NMT configu-rations on a EN/FR subset of Europarl, with variationswrt. baseline. We select the settings with the best per-formance (bold) for our final experiments in Section 6.

5.3 Selection of WSD+NMT Model

To compare several options of the WSD+NMTsystems, we trained and tested them on a subset ofEN/FR Europarl (a smaller dataset shortened thetraining times). The results are shown in Table 4.For the AVG model, the logistic normalization inEq. 6 works better than the linear one. For the

ATT model, we compared two different labelingapproaches for tokens which do not have multiplesenses: either use the same NULL label for all to-kens, or use the word itself as a label for its sense;the second option appeared to be the best. Finally,for the ATT ini model, we compared the two op-tions for the attention function in Eq. 8, and foundthat the formula with tanh is the best. In what fol-lows, we use these settings for the AVG and ATTsystems.

6 Results

We first evaluate our sense-aware models withsmaller data sets (ca. 500k lines) for five languagepairs with English as source. We evaluate themthrough both automatic measures and human as-sessment. Later on, we evaluate our sense-awareNMT models with larger WMT data sets to enablea better comparison with other related approaches.

BLEU scores. Table 5 displays the per-formance of both sense-aware phrase-based andneural MT systems with the training sets of500k lines listed in Table 1 on five languagepairs. Specifically, we compare several ap-proaches which integrate word sense informationin SMT and NMT. The best hyper-parameters arethose found above, for each of the WSD+NMTcombination strategies, in particular the k-means

Page 9: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

EN/FR EN/DE EN/ZH EN/ES EN/NLSMT baseline 31.96 20.78 23.25 39.95 23.56Graph 32.01 (+.05) 21.17 (+.39) 23.47 (+.22) 40.15 (+.20) 23.74 (+.18)CRP 32.08 (+.12) 21.19 (+.41) † 23.55 (+.29) 40.14 (+.19) 23.79 (+.23)k-means 32.20 (+.24) 21.32 (+.54) † 23.69 (+.44) † 40.37 (+.42) † 23.84 (+.26)NMT baseline 34.60 25.80 27.07 44.09 24.79k-means + TOP 34.52 (−.08) 25.84 (+.04) 26.93 (−.14) 44.14 (+.05) 24.71 (−.08)k-means + AVG 35.17 (+.57) † 26.47 (+.67) † 27.44 (+.37) 45.05 (+.97) ‡ 25.04 (+.25)None + ATT 35.32 (+.72) ‡ 26.50 (+.70) ‡ 27.56 (+.49) † 44.93 (+.84) ‡ 25.36 (+.57) †k-means + ATTini 35.78 (+1.18) ‡ 26.74 (+.94) ‡ 27.84 (+.77) ‡ 45.18 (+1.09) ‡ 25.65 (+.86) ‡

Table 5: BLEU scores of our sense-aware NMT systems over five language pairs: ATTini is the best one amongSMT and NMT systems. Significance testing is indicated by † for p < 0.05 and ‡ for p < 0.01.

method for WSD+SMT, and the ATT ini methodfor WSD+NMT, i.e. the attention-based model ofsenses initialized with the output of k-means clus-tering.

Comparisons with baselines. Table 5 showsthat our WSD+NMT systems perform consistentlybetter than the baselines, with the largest improve-ments achieved by NMT on EN/FR and EN/ES.The neural systems outperform the phrase-basedstatistical ones (Pu et al., 2017), which are shownfor comparison in the upper part of the table.

We compare our proposal to the recent systemproposed by Yang et al. (2017), on the 500k-lineEN/FR Europarl dataset (the differences betweentheir system and ours are listed in Section 7). Wecarefully implemented their model by followingtheir paper, since their code is not available. Us-ing the sense embeddings of the multi-sense skip-gram model (MSSG) (Neelakantan et al., 2014)as they do, and training for 6 epochs as in theirstudy, our implementation of their model reachesonly 31.05 BLEU points. When increasing thetraining stage until convergence (15 epochs), thebest BLEU score is 34.52, which is still below ourNMT baseline of 34.60. We also found that the ini-tialization of embeddings with MSSG brings lessthan 1 BLEU point improvement with respect torandom initializations (which scored 30.11 over 6epochs and 33.77 until convergence), while Yanget al. found a 1.3–2.7 increase on two differenttest sets. In order to better understand the differ-ence, we tried several combinations of their modelwith ours. We obtain a BLEU score of 35.02 byreplacing their MSSG sense specification modelwith our adaptive k-means approach, and a BLEUscore of 35.18 by replacing our context calcula-tion method (averaging word embeddings withinone sentence) with their context vector generationmethod, which is computed from the output of a

bi-directional RNN. In the end, the best BLEUscore on this EN/FR data set (35.78 as shown inTable 5, column 1, last line) is reached by our sys-tem with its best options.

Lexical choice. Using word alignment, we as-sess the improvement brought by our systems withrespect to the baseline in terms of the number ofwords – here, WSD-labeled nouns and verbs – thatare translated exactly as in the reference transla-tion (modulo alignment errors). These numberscan be arranged in a confusion matrix with fourvalues: the words translated correctly (i.e., as inthe reference) by both systems, those translatedcorrectly by one system but incorrectly by theother one, and vice-versa, and those translated in-correctly by both.

Table 6 shows the confusion matrix for oursense-aware NMT with the ATT ini model versusthe NMT baseline over the Europarl test data. Thenet improvement, i.e. the fraction of words im-proved by our system minus those degraded,16

appears to be +2.5% for EN/FR and +3.6% forEN/ES. For comparison, we show the results ofthe WSD+SMT system versus the SMT baselinein the lower part of Table 6: the improvementis smaller, at +1.4% for EN/FR and +1.5% forEN/ES. Therefore, the ATT ini NMT model bringshigher benefits over the NMT baseline than theWSD+SMT factored model, although the NMTbaseline is stronger than the SMT one (see Ta-ble 5).

Human assessment. To compare our systemsagainst baselines we also consider a human eval-uation of the translation of words with multiplesenses (nouns or verbs). The goal is to capturemore precisely the correct translations that are,

16Explicitly, improvements are (system-correct &baseline-incorrect) minus (system-incorrect & baseline-correct), and degradations the converse difference.

Page 10: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

BaselinesEN/FR EN/ES

Correct Incorrect Correct IncorrectWSD+ C. 134,552 17,145 146,806 16,523NMT I. 10,551 101,228 8,183 58,387WSD+ C. 124,759 13,408 139,800 11,194SMT I. 9,676 115,633 7,559 71,346

Table 6: Confusion matrix for our WSD+NMT(ATTini ) system and our WSD+SMT system againsttheir respective baselines (NMT and SMT), over theEuroparl test data, for two language pairs.

deal face mark subjectCandidate

0.0

0.2

0.4

0.6

0.8

1.0

Pro

por

tion

2 - Good

1 - Acceptable

0 - Wrong

Baseline

Our method

deal face mark subjectCandidate

0.0

0.2

0.4

0.6

0.8

1.0

Pro

por

tion

O > B

O = B

O < B

(a) System ratings. (b) Comparative scores.

Figure 1: Human comparison of the EN/FR transla-tions of four word types. (a) Proportion of good (lightgray), acceptable (middle gray) and wrong (dark gray)translations per word and system (baseline left, ATTini

right, for each word). (b) Proportion of translations inwhich ATT ini is better (light gray), equal (middle gray)or worse (dark gray) than the baseline.

however, different from the reference.Given the cost of the procedure, one evalua-

tor with good knowledge of EN and FR rated thetranslations of four word types that appear fre-quently in the test set and have multiple possiblesenses and translations into French. These wordsare: ‘deal’ (101 tokens), ‘face’ (84), ‘mark’ (20),and ‘subject’ (58). Two translations of ‘deal’ areexemplified in Section 1.

For each occurrence, the evaluator sees thesource sentence, the reference translation, and theoutputs of the NMT baseline and the ATT ini inrandom order, so that the system cannot be identi-fied. The two translations of the considered wordare rated as good, acceptable, or wrong. We sub-mit only cases in which the two translations differ,to minimize the annotation effort with no impacton the comparison between systems.

Firstly, Figure 1 (a) shows that ATT ini has ahigher proportion of good translations, and a lowerproportion of wrong ones, for all four words. Thelargest difference is for ‘subject’, where ATT ini

has 75% good translations and the baseline only46%; moreover, the baseline has 22% errors and

ATT ini has only 9%. Secondly, Figure 1 (b) showsthe proportions of tokens, for each type, for whichATT ini was respectively better, equal, or worsethan the baseline. Again, for each of the fourwords, there are far more improvements broughtby ATT ini than degradations. On average, 40% ofthe occurrences are improved and only 10% aredegraded.

Results on WMT datasets. To demonstratethat our findings generalize to larger datasets,we report results on three datasets provided bythe WMT conference (see Section 4), namelyfor EN/DE, EN/ES and EN/FR. Tables 7 and 8show the results of our proposed NMT modelson these test sets. The results in Table 7 con-firm that our sense-aware NMT models improvesignificantly the translation quality also on largerdatasets, which permit stronger baselines. Com-paring these results to the ones from Table 5, weeven conclude that our models trained on larger,mixed-domain datasets achieve higher improve-ments than the models trained on smaller, domain-specific datasets (Europarl). This clearly showsthat our sense-aware NMT models are beneficialon both narrow and broad domains.

Finally, we compare our model to several recentNMT models which make use of contextual infor-mation, thus sharing a similar overall goal to ourstudy. Indeed, the model proposed by Choi et al.(2017) attempts to improve NMT by integratingcontext vectors associated to source words into thegeneration process during decoding. The modelproposed by Zhang et al. (2017) is aware of previ-ous attended words on the source side in order tobetter predict which words will be attended in fu-ture. The self-attentive residual decoder designedby Werlen et al. (2018) leverages the contextual in-formation from previously translated words on thetarget side. BLEU scores on the English-Germanpair shown in Table 8 demonstrate that our base-line is strong and that our model is competitivewith respect to recent models that leverage con-textual information in different ways.

7 Related Work

Word sense disambiguation aims to identify thesense of a word appearing in a given context(Agirre and Edmonds, 2007). Resolving wordsense ambiguities should be useful, in particular,for lexical choice in MT. An initial investigationfound that a statistical MT system which makes

Page 11: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

EN/FR EN/ESNT12 NT13 NT12 NT13

Baseline 29.09 29.60 32.66 29.57None + ATT 29.47 (+.38) 30.21 (+.61) † 33.15 (+.49) † 30.27 (+.70) ‡k-means + ATTini 30.26 (+1.17) ‡ 30.95 (+.1.35) ‡ 34.14 (+1.48) ‡ 30.67 (+1.1) ‡

Table 7: BLEU scores on WMT NewsTest 2012 and 2013 (NT) test sets for two language pairs. Significancetesting is indicated by † for p < 0.05 and ‡ for p < 0.01.

NMT model NT14 NT15Context-dependent (Choi et al., 2017) - 21.99Context-aware (Zhang et al., 2017) 22.57 -Self-attentive (Werlen et al., 2018) 23.2 25.5Baseline 22.79 24.94None + ATT 23.34 † 25.28k-means + ATTini 23.85 (+1.14) ‡ 25.71 (+0.77) ‡

Table 8: BLEU score on English-to-German translation over the WMT NewsTest (NT) 2014 and 2015 test sets.Significance testing is indicated by † for p < 0.05 and ‡ for p < 0.01. The highest score per column is in bold.

use of off-the-shelf WSD does not yield signifi-cantly better quality translations than a SMT sys-tem not using it (Carpuat and Wu, 2005). How-ever, several studies (Vickrey et al., 2005; Cabezasand Resnik, 2005; Chan et al., 2007; Carpuat andWu, 2007) reformulated the task of WSD for SMTand showed that integrating the ambiguity infor-mation generated from modified WSD improvedSMT by 0.15–0.57 BLEU points compared tobaselines.

Recently, Tang et al. (2016) used only the super-senses from WordNet (coarse-grained semantic la-bels) for automatic WSD, using maximum entropyclassification or sense embeddings learned usingword2vec. When combining WSD with SMT us-ing a factored model, Tang et al. improved BLEUscores by 0.7 points on average, though with largedifferences between their three test subsets (ITQ&A pairs).

Although the above-mentioned reformulationsof the WSD task proved helpful for SMT, they didnot determine whether actual source-side sensesare helpful or not for end-to-end SMT. Xiong andZhang (2014) attempted to answer this questionby performing self-learned word sense inductioninstead of using pre-specified word senses as tra-ditional WSD does. However, they created the riskof discovering sense clusters which do not cor-respond to the senses of words actually neededfor MT. Hence, they left open an important ques-tion, namely whether WSD based on semantic re-sources such as WordNet (Fellbaum, 1998) can besuccessfully integrated with SMT.

Several studies integrated sense information as

features to SMT, either obtained from the sensegraph provided by WordNet (Neale et al., 2016) orgenerated from both sides of word dependencies(Su et al., 2015). However, apart from the sensegraph, WordNet provides also textual informationsuch as sense definitions and examples, whichshould be useful for WSD, but were not used inthe above studies. In previous work (Pu et al.,2017), we used this information to perform senseinduction on source-side data using k-means anddemonstrated improvement with factored phrase-based SMT but not NMT.

Neural MT became the state of the art(Sutskever et al., 2014; Bahdanau et al., 2015).Instead of working directly at the discrete sym-bol level as SMT, it projects and manipulates thesource sequence of discrete symbols in a con-tinuous vector space. However, NMT generatesonly one embedding for each word type, regard-less of its possibly different senses, as analyzede.g. by Hill et al. (2017). Several studies pro-posed efficient non-parametric models for mono-lingual word sense representation (Neelakantanet al., 2014; Li and Jurafsky, 2015; Bartunov et al.,2016; Liu et al., 2017), but left open the ques-tion whether sense representations can help neu-ral MT by reducing word ambiguity. Recent stud-ies integrate the additional sense assignment withneural MT based on these approaches, either byadding such sense assignments as additional fea-tures (Rios et al., 2017) or by merging the contextinformation on both sides of parallel data for en-coding and decoding (Choi et al., 2017).

Yang et al. (2017) recently proposed to add

Page 12: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

sense information by using weighted sense em-beddings as input to neural MT. The sense la-bels were generated by a multi-sense skip-grammodel (MSSG) (Neelakantan et al., 2014), and thecontext vector used for sense weight generationwas computed from the output of a bi-directionalRNN. Finally, the weighted average sense embed-dings were used in place of the word embeddingfor the NMT encoder. The numerical results givenin Section 6 above show that our options for usingsense embeddings outperform Yang et al.’s pro-posal. In fact, their approach even performed be-low the NMT baseline on our EN/FR dataset. Weconclude that adaptive k-means clustering is betterthan MSSG for use in NMT, and that concatenat-ing the word embedding and its sense vector asinput for the RNN encoder is better than just usingthe sense embedding for each token. In terms ofefficiency, Yang et al. (2017) need an additional bi-directional RNN to generate the context vector foreach input token, while we compute the contextvector by averaging the embeddings of the neigh-boring tokens. This slows down the training of theencoder by a factor of 3, which may explain whythey only trained their model for 6 epochs.

8 Conclusion

We presented a neural MT system enhanced withan attention-based method to represent multipleword senses, making use of a larger context to dis-ambiguate words that have various possible trans-lations. We proposed several adaptive context-dependent clustering algorithms for WSD andcombined them in several ways with NMT – fol-lowing our earlier experiments with SMT (Puet al., 2017) – and found that they had competi-tive WSD performance on data from the SemEval2010 shared task.

For NMT, the best performing method used theoutput of k-means to initialize the sense embed-dings that are learned by our system. In partic-ular, it appeared that learning sense embeddingsfor NMT is better than using embeddings learnedseparately by other methods, although such em-beddings may be useful for initialization. Our ex-periments with five language pairs showed that oursense-aware NMT systems consistently improveover strong NMT baselines, and that they specifi-cally improve the translation of words with multi-ple senses.

In the future, our approach to sense-aware NMT

could be extended to other NMT architecturessuch as the Transformer network proposed byVaswani et al. (2017). As was the case withthe LSTM-based architecture studied here, theTransformer network does not explicitly modelor utilize the sense information of words, and,therefore, we hypothesize that its performancecould also be improved by using our sense in-tegration approaches. To encourage further re-search in sense-aware NMT, our code is madeavailable at https://github.com/idiap/sense_aware_NMT.

Acknowledgments

The authors are grateful for support from theSwiss National Science Foundation through theMODERN Sinergia project on Modeling Dis-course Entities and Relations for CoherentMachine Translation, grant n. 147653 (www.idiap.ch/project/modern), and from theEuropean Union through the SUMMA Hori-zon 2020 project on Scalable Understandingof Multilingual Media, grant n. 688139 (www.summa-project.eu). The authors would liketo thank the TACL editors and reviewers for theirhelpful comments and suggestions.

References

Eneko Agirre and Philip Edmonds. 2007. WordSense Disambiguation: Algorithms and Appli-cations. Springer-Verlag, Berlin, Germany.

Eneko Agirre, Oier López de Lacalle, and AitorSoroa. 2014. Random walks for knowledge-based word sense disambiguation. Computa-tional Linguistics, 40(1):57–84.

Eneko Agirre and Aitor Soroa. 2009. Person-alizing PageRank for word sense disambigua-tion. In Proceedings of the 12th Conferenceof the European Chapter of the Association forComputational Linguistics (EACL), pages 33–41, Athens, Greece.

Dzmitry Bahdanau, Kyunghyun Cho, and YoshuaBengio. 2015. Neural machine translation byjointly learning to align and translate. InProceedings of the International Conferenceon Learning Representations, San Diego, CA,USA.

Page 13: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

Sergey Bartunov, Dmitry Kondrashkin, Anton Os-okin, and Dmitry Vetrov. 2016. Breaking sticksand ambiguities with adaptive skip-gram. InArtificial Intelligence and Statistics, pages 130–138.

Ondrej Bojar, Christian Buck, Christian Feder-mann, Barry Haddow, Philipp Koehn, JohannesLeveling, Christof Monz, Pavel Pecina, MattPost, Herve Saint-Amand, Radu Soricut, LuciaSpecia, and Aleš Tamchyna. 2014. Findings ofthe 2014 workshop on statistical machine trans-lation. In Proceedings of the Ninth Workshopon Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA.

Ondrej Bojar, Rajen Chatterjee, Christian Feder-mann, Yvette Graham, Barry Haddow, MatthiasHuck, Antonio Jimeno Yepes, Philipp Koehn,Varvara Logacheva, Christof Monz, Matteo Ne-gri, Aurelie Neveol, Mariana Neves, MartinPopel, Matt Post, Raphael Rubino, CarolinaScarton, Lucia Specia, Marco Turchi, KarinVerspoor, and Marcos Zampieri. 2016. Find-ings of the 2016 conference on machine transla-tion. In Proceedings of the First Conference onMachine Translation, pages 131–198, Berlin,Germany.

Clara Cabezas and Philip Resnik. 2005. UsingWSD techniques for lexical selection in sta-tistical machine translation. Technical report,DTIC Document.

Marine Carpuat and Dekai Wu. 2005. Word sensedisambiguation vs. statistical machine transla-tion. In Proceedings of the 43rd Annual Meet-ing of the Association for Computational Lin-guistics, pages 387–394, Michigan, MI, USA.

Marine Carpuat and Dekai Wu. 2007. Improvingstatistical machine translation using word sensedisambiguation. In Proceedings of the 2007Joint Conference on Empirical Methods in Nat-ural Language Processing and ComputationalNatural Language Learning (EMNLP-CoNLL),pages 61–72, Prague, Czech Republic.

Mauro Cettolo, Christian Girardi, and MarcelloFederico. 2012. WIT3: Web inventory of tran-scribed and translated talks. In Proceedings ofthe 16th Conference of the European Associ-ation for Machine Translation (EAMT), pages261–268, Trento, Italy.

Yee Seng Chan, Hwee Tou Ng, and David Chi-ang. 2007. Word sense disambiguation im-proves statistical machine translation. In Pro-ceedings of the 45th Annual Meeting of the As-sociation for Computational Linguistics, pages33–40, Prague, Czech Republic.

Heeyoul Choi, Kyunghyun Cho, and Yoshua Ben-gio. 2017. Context-dependent word representa-tion for neural machine translation. ComputerSpeech & Language, 45:149–160.

Christiane Fellbaum, editor. 1998. WordNet: AnElectronic Lexical Database. The MIT Press,Cambridge, MA, USA.

Thomas S. Ferguson. 1973. A Bayesian analysisof some nonparametric problems. The Annalsof Statistics, 1(2):209–230.

Sergey Grin and Lawrence Page. 1998. Theanatomy of a large-scale hypertextual Websearch engine. Computer Networks and ISDNSystems, 30(1-7):107–117.

Felix Hill, Kyunghyun Cho, Sébastien Jean, andYoshua Bengio. 2017. The representational ge-ometry of word meanings acquired by neuralmachine translation models. Machine Transla-tion, 31(1):3–18.

Roman Kern, Markus Muhr, and Michael Gran-itzer. 2010. KCDC: Word sense induction byusing grammatical dependencies and sentencephrase structure. In Proceedings of the 5th In-ternational Workshop on Semantic Evaluation(SemEval-2010), pages 351–354, Los Angeles,CA, USA.

Guillaume Klein, Yoon Kim, Yuntian Deng, JeanSenellart, and Alexander M. Rush. 2017. Open-NMT: Open-source toolkit for neural machinetranslation. CoRR, abs/1701.02810v2.

Philipp Koehn. 2005. Europarl: A parallel cor-pus for statistical machine translation. In Pro-ceedings of MT Summit X, pages 79–86, Phuket,Thailand.

Ioannis Korkontzelos and Suresh Manandhar.2010. UoY: graphs of ambiguous vertices forword sense induction and disambiguation. InProceedings of the 5th International Workshopon Semantic Evaluation (SemEval 2010), pages355–358, Los Angeles, California.

Page 14: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

Jiwei Li and Dan Jurafsky. 2015. Do multi-senseembeddings improve natural language under-standing? In Proceedings of the 2015 Con-ference on Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 1722–1732,Lisbon, Portugal.

Frederick Liu, Han Lu, and Graham Neubig. 2017.Handling homographs in neural machine trans-lation. CoRR, abs/1708.06510v2.

Thang Luong, Hieu Pham, and Christopher D.Manning. 2015. Effective approaches toattention-based neural machine translation. InProceedings of the 2015 Conference on Empir-ical Methods in Natural Language Processing(EMNLP), pages 1412–1421, Lisbon, Portugal.

James MacQueen. 1967. Some methods for clas-sification and analysis of multivariate observa-tions. In Proceedings of the 5th Berkeley Sym-posium on Mathematical Statistics and Proba-bility, pages 281–297. Oakland, CA, USA.

Suresh Manandhar, Ioannis P. Klapaftis, DmitriyDligach, and Sameer S. Pradhan. 2010.SemEval-2010 task 14: Word sense inductionand disambiguation. In Proceedings of the 5thInternational Workshop on Semantic Evalua-tion (SemEval 2010), pages 63–68, Los Ange-les, CA, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg SCorrado, and Jeff Dean. 2013. Distributedrepresentations of words and phrases and theircompositionality. In Advances in Neural In-formation Processing Systems (NIPS), pages3111–3119.

Andrea Moro and Roberto Navigli. 2015.Semeval-2015 task 13: Multilingual all-wordssense disambiguation and entity linking. InProceedings of the 9th International Workshopon Semantic Evaluation (SemEval 2015), pages288–297, Denver, Colorado. Association forComputational Linguistics.

Steven Neale, Luıs Gomes, Eneko Agirre,Oier Lopez de Lacalle, and António Branco.2016. Word sense-aware machine translation:Including senses as contextual features for im-proved translation models. In Proceedings ofthe 10th International Conference on LanguageResources and Evaluation (LREC 2016), pages2777–2783, Portoroz, Slovenia.

Arvind Neelakantan, Jeevan Shankar, AlexandrePassos, and Andrew McCallum. 2014. Efficientnon-parametric estimation of multiple embed-dings per word in vector space. In Proceed-ings of the 2014 Conference on Empirical Meth-ods in Natural Language Processing (EMNLP),pages 1059–1069, Doha, Qatar.

Kishore Papineni, Salim Roukos, Todd Ward, andWei-Jing Zhu. 2002. BLEU: A method for au-tomatic evaluation of machine translation. InProceedings of the 40th Annual Meeting of As-sociation for Computational Linguistics, pages311–318, Philadelphia, USA.

Ted Pedersen. 2010. Duluth-WSI: SenseClustersapplied to the sense induction task of SemEval-2. In Proceedings of the 5th InternationalWorkshop on Semantic Evaluation (SemEval2010), pages 363–366, Los Angeles, CA, USA.

Fabian Pedregosa, Gaël Varoquaux, AlexandreGramfort, Vincent Michel, Bertrand Thirion,Olivier Grisel, Mathieu Blondel, Peter Pret-tenhofer, Ron Weiss, Vincent Dubourg, JakeVanderplas, Alexandre Passos, David Courna-peau, Matthieu Brucher, Matthieu Perrot, andÉdouard Duchesnay. 2011. Scikit-learn: Ma-chine learning in Pmaython. Journal of Ma-chine Learning Research, 12:2825–2830.

Xiao Pu, Nikolaos Pappas, and Andrei Popescu-Belis. 2017. Sense-aware statistical machinetranslation using adaptive context-dependentclustering. In Proceedings of the Second Con-ference on Machine Translation, pages 1–10,Copenhagen, Denmark.

Alexandre Rafalovitch and Robert Dale. 2009.United Nations general assembly resolutions: Asix-language parallel corpus. In Proceedings ofMT Summit XII, pages 292–299, Ontario, ON.Canada.

Annette Rios, Laura Mascarell, and Rico Sen-nrich. 2017. Improving word sense disam-biguation in neural machine translation withsense embeddings. In Second Conference onMachine Translation, pages 11–19, Copen-hagen, Denmark.

Jinsong Su, Deyi Xiong, Shujian Huang, Xian-pei Han, and Junfeng Yao. 2015. Graph-based

Page 15: Integrating Weakly Supervised Word Sense Disambiguation into …publications.idiap.ch/downloads/papers/2018/Pu_TACL_2018.pdf · 2018. 10. 8. · tj and examples e tj are repre-sented

collective lexical selection for statistical ma-chine translation. In Proceedings of the Con-ference on Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 1238–1247,Lisbon, Portugal.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.2014. Sequence to sequence learning with neu-ral networks. In Advances in Neural Informa-tion Processing Systems (NIPS), pages 3104–3112.

Haiqing Tang, Deyi Xiong, Oier Lopez de Lacalle,and Eneko Agirre. 2016. Improving transla-tion selection with supersenses. In Proceedingsof the 26th International Conference on Com-putational Linguistics (COLING), pages 3114–3123, Osaka, Japan.

Ashish Vaswani, Noam Shazeer, Niki Parmar,Jakob Uszkoreit, Llion Jones, Aidan N Gomez,Łukasz Kaiser, and Illia Polosukhin. 2017. At-tention is all you need. In Advances in NeuralInformation Processing Systems, pages 5998–6008.

David Vickrey, Luke Biewald, Marc Teyssier, andDaphne Koller. 2005. Word-sense disambigua-tion for machine translation. In Proceedingsof the Conference on Human Language Tech-nology and Empirical Methods in Natural Lan-guage Processing (HLT-EMNLP), pages 771–778, Vancouver, BC, Canada.

Lesly Miculicich Werlen, Nikolaos Pappas,Dhananjay Ram, and Andrei Popescu-Belis.2018. Self-attentive residual decoder for neuralmachine translation. In Proceedings of the 2018Conference of the North American Chapter ofthe Association for Computational Linguistics:Human Language Technologies (NAACL-HLT),pages 1366–1379, New Orleans, LA, USA.

Deyi Xiong and Min Zhang. 2014. A sense-based translation model for statistical machinetranslation. In Proceedings of the 52nd AnnualMeeting of the Association for ComputationalLinguistics, pages 1459–1469, Baltimore MD,USA.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu.2017. Multi-sense based neural machine trans-lation. In International Joint Conference onNeural Networks, pages 3491–3497, Anchor-age, AK, USA.

Biao Zhang, Deyi Xiong, Jinsong Su, and HongDuan. 2017. A context-aware recurrent encoderfor neural machine translation. IEEE/ACMTransactions on Audio, Speech and LanguageProcessing (TASLP), pages 2424–2432.