

Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 445–456, Baltimore, Maryland USA, June 26–27, 2014. © 2014 Association for Computational Linguistics

Dynamic Topic Adaptation for SMT using Distributional Profiles

Eva Hasler1 Barry Haddow1 Philipp Koehn1,2

1 School of Informatics, University of Edinburgh
2 Center for Language and Speech Processing, Johns Hopkins University
[email protected], {bhaddow,pkoehn}@inf.ed.ac.uk

Abstract

Despite its potential to improve lexical selection, most state-of-the-art machine translation systems take only minimal contextual information into account. We capture context with a topic model over distributional profiles built from the context words of each translation unit. Topic distributions are inferred for each translation unit and used to adapt the translation model dynamically to a given test context by measuring their similarity. We show that combining information from both local and global test contexts helps to improve lexical selection and outperforms a baseline system by up to 1.15 BLEU. We test our topic-adapted model on a diverse data set containing documents from three different domains and achieve competitive performance in comparison with two supervised domain-adapted systems.

1 Introduction

The task of lexical selection plays an important role in statistical machine translation (SMT). It strongly depends on context and is particularly difficult when the domain of a test document is unknown, for example when translating web documents from diverse sources. Selecting translations of words or phrases that preserve the sense of the source words is closely related to the field of word sense disambiguation (WSD), which has been studied extensively in the past.

Most approaches to WSD model context at the sentence level and do not take the wider context of a word into account. Some of the ideas from the field of WSD have been adapted for machine translation (Carpuat and Wu, 2007b; Carpuat and Wu, 2007a; Chan et al., 2007). For example, Carpuat and Wu (2007a) extend word sense disambiguation to phrase sense disambiguation and show improved performance due to the better fit with multiple possible segmentations in a phrase-based system. Carpuat (2009) test the "one sense per discourse" hypothesis (Gale et al., 1992) for MT and find that enforcing it as a constraint at the document level could potentially improve translation quality. Our goal is to make correct lexical choices in a given context without explicitly enforcing translation consistency.

More recent work in SMT uses latent representations of the document context to dynamically adapt the translation model with either monolingual topic models (Eidelman et al., 2012; Hewavitharana et al., 2013) or bilingual topic models (Hasler et al., 2014), thereby allowing the translation system to disambiguate source phrases using document context. Eidelman et al. (2012) also apply a topic model to each test sentence and find that sentence context is sufficient for picking good translations, but they do not attempt to combine sentence and document level information. Sentence-level topic adaptation for SMT has also been employed by Hasler et al. (2012). Other approaches to topic adaptation for SMT include Zhao and Xing (2007) and Tam et al. (2008), both of which use adapted lexical weights.

In this paper, we present a topic model that learns latent distributional representations of the context of a phrase pair which can be applied to both local and global contexts at test time. We introduce similarity features that compare latent representations of phrase pair types to test contexts to disambiguate senses for improved lexical selection. We also propose different strategies for combining local and global topical context and show that using clues from both levels of context is beneficial for translation model adaptation. We evaluate our model on a dynamic adaptation task where the domain of a test document is unknown and hence the problem of lexical selection is harder.



2 Related work

Most work in the WSD literature has modelled disambiguation using a limited window of context around the word to disambiguate. Cai et al. (2007), Boyd-graber and Blei (2007) and Li et al. (2010) further tried to integrate the notion of latent topics to address the sparsity problem of the lexicalised features typically used in WSD classifiers. The most closely related work in the area of sense disambiguation is by Dinu and Lapata (2010) who propose a disambiguation method for solving lexical similarity and substitution tasks. They measure word similarity in context by learning distributions over senses for each target word in the form of lower-dimensional distributional representations. Before computing word similarities, they contextualise the global sense distribution of a word using the sense distribution of words in the test context, thereby shifting the sense distribution towards the test context. We adopt a similar distributional representation, but argue that our representation does not need this disambiguation step because at the level of phrase pairs the ambiguity is already much reduced.

Our model performs adaptation using similarity features, which is similar to the approach of Costa-jussa and Banchs (2010) who learn a vector space model that captures the source context of every training sentence. In Banchs and Costa-jussa (2011), the vector space model is replaced with representations inferred by Latent Semantic Indexing. However, because their latent representations are learned over training sentences, they have to compare the current test sentence to the latent vector of every training instance associated with a translation unit. The highest similarity value is then used as a feature value. Instead, our model learns latent distributional representations of phrase pairs that can be directly compared to test contexts and are likely to be more robust. Because context words of a phrase pair are tied together in the distributional representations, we can use sparse priors to cluster context words associated with the same phrase pair into few topics.

Recently, Chen et al. (2013) have proposed a vector space model for domain adaptation where phrase pairs are assigned vectors that are defined in terms of the training corpora. A similar vector is built for an in-domain development set and the similarity to the development set is used as a feature during translation. While their vector representations are similar to our latent topic representations, their model has no notion of structure beyond corpus boundaries and is adapted towards a single target domain (cross-domain). Instead, our model learns the latent topical structure automatically and the translation model is adapted dynamically to each test instance.

We are not aware of prior work in the field of MT that investigates combinations of local and global context. In their recent work on neural language models, Huang et al. (2012) combine the scores of two neural networks modelling the word embeddings of previous words in a sequence as well as those of words from the surrounding document by averaging over all word embeddings occurring in the same document. The score of the next word in a sequence is computed as the sum of the scores of both networks, but they do not consider alternative ways of combining contextual information.

3 Phrase pair topic model (PPT)

Our proposed model aims to capture the relationship between phrase pairs and source words that frequently occur in the local context of a phrase pair, that is, context words occurring in the same sentence. It therefore follows the distributional hypothesis (Harris, 1954) which states that words that occur in the same contexts tend to have similar meanings. For a phrase pair, the idea is that words that occur frequently in its context are indicative of the sense that is captured by the target phrase translating the source phrase.

We assume that all phrase pairs share a global set of topics and during topic inference the distribution over topics for each phrase pair is induced from the latent topic of its context words in the training data. In order to learn topic distributions for each phrase pair, we represent phrase pairs as documents containing all context words from the source sentence context in the training data. These distributional profiles of phrase pairs are the input to the topic modelling algorithm which learns topic clusters over context words.
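To make the profile construction concrete, the following is a minimal sketch (not the authors' implementation) of how such distributional profiles could be assembled, assuming phrase pair occurrences are already available as (source phrase, target phrase, source sentence) triples and assuming a toy stop-word list:

# Sketch of building distributional profiles; the occurrence triples and
# stop-word list are toy assumptions, and phrase pair extraction itself
# (normally done by the SMT pipeline) is not shown.
from collections import defaultdict

STOP_WORDS = {"le", "la", "les", "de", "des", "et", "un", "une", "est"}

def build_profiles(phrase_pair_occurrences):
    """Map each phrase pair type to the bag of source-side context words
    seen around its occurrences (its 'document' for topic modelling)."""
    profiles = defaultdict(list)
    for src_phrase, tgt_phrase, src_sentence in phrase_pair_occurrences:
        phrase_tokens = set(src_phrase.split())
        context = [w for w in src_sentence.split()
                   if w not in phrase_tokens and w not in STOP_WORDS]
        profiles[(src_phrase, tgt_phrase)].extend(context)
    return profiles

occurrences = [
    ("noyau", "kernel", "le noyau contient de nombreux pilotes"),
    ("noyau", "kernel", "compiler le noyau de linux"),
    ("noyau", "nucleus", "le noyau de la cellule contient les chromosomes"),
]
profiles = build_profiles(occurrences)
print(profiles[("noyau", "kernel")])   # context words from both IT-flavoured sentences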

Figure 1a shows a graphical representation of the following generative process for training. For each of P phrase pairs pp_i in the collection:

1. Draw a topic distribution from an asymmetric Dirichlet prior, θ_p ∼ Dirichlet(α_0, α, ..., α).

2. For each position c in the distributional profile of pp_i, draw a topic from that distribution, z_{p,c} ∼ Multinomial(θ_p).



3. Conditioned on topic z_{p,c}, choose a context word w_{p,c} ∼ Multinomial(ψ_{z_{p,c}}).

Figure 1: Graphical representation of the phrase pair topic (PPT) model. (a) Inference on phrase pair documents (training). (b) Inference on local test contexts (test).

α and β are parameters of the Dirichlet distributions and ψ_k denotes the topic-dependent vocabularies over context words. Test contexts are generated similarly by drawing topic mixtures θ_l for each test context[1] as shown in Figure 1b, drawing topics z for each context position and then drawing context words w for each z. The asymmetric prior on topic distributions (α_0 for topic 0 and α for all other topics) encodes the intuition that there are words occurring in the context of many phrase pairs which can be grouped under a topic with higher a priori probability than the other topics. Figure 1a shows the model for training inference on the distributional representations for each phrase pair, where C_{l-all} denotes the number of context words in all sentence contexts that the phrase pair was seen in the training data, P denotes the number of phrase pairs and K denotes the number of latent topics. The model in Figure 1b has the same structure but shows inference on test contexts, where C_l denotes the number of context words in the test sentence context and L denotes the number of test instances. θ_p and θ_l denote the topic distribution for a phrase pair and a test context, respectively.

[1] A local test context is defined as all words in the test sentence excluding stop words, while contexts of phrase pairs in training do not include the words belonging to the source phrase. The naming in the figure refers to local test contexts L, but global test contexts will be defined similarly.
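The generative story can be illustrated with a small sampling sketch; the topic count, vocabulary and hyperparameter values below are toy choices for illustration, not the settings used in the experiments:

# Minimal generative sketch of the PPT model with an asymmetric Dirichlet
# prior over topics; alpha0, alpha, beta, K and the vocabulary are made up.
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha0, alpha, beta = 4, 8, 2.0, 0.1, 0.01
vocab = ["kernel", "driver", "linux", "cell", "protein", "policy", "market", "tax"]

# topic-dependent vocabularies psi_k ~ Dirichlet(beta)
psi = rng.dirichlet([beta] * V, size=K)

def generate_profile(num_context_words):
    # theta_p ~ Dirichlet(alpha0, alpha, ..., alpha): topic 0 gets extra prior mass
    theta_p = rng.dirichlet([alpha0] + [alpha] * (K - 1))
    words = []
    for _ in range(num_context_words):
        z = rng.choice(K, p=theta_p)    # draw a topic for this position
        w = rng.choice(V, p=psi[z])     # draw a context word given the topic
        words.append(vocab[w])
    return theta_p, words

theta, profile = generate_profile(10)
print(theta.round(2), profile)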

3.1 Inference for PPT model

We use collapsed variational Bayes (Teh et al., 2006) to infer the parameters of the PPT model. The posterior distribution over topics is computed as shown below:

P(z_{p,c} = k \mid \mathbf{z}^{-(p,c)}, \mathbf{w}_c, p, \alpha, \beta) \;\propto\; \frac{E_q[n^{-(p,c)}_{\cdot,k,w_c}] + \beta}{E_q[n^{-(p,c)}_{\cdot,k,\cdot}] + W_c \cdot \beta} \cdot \big(E_q[n^{-(p,c)}_{p,k,\cdot}] + \alpha\big) \qquad (1)

where z_{p,c} denotes the topic at position c in the distributional profile p, \mathbf{w}_c denotes all context word tokens in the collection, W_c is the total number of context words and E_q is the expectation under the variational posterior. n^{-(p,c)}_{\cdot,k,w_c} and n^{-(p,c)}_{p,k,\cdot} are counts of topics occurring with context words and distributional profiles, respectively, and n^{-(p,c)}_{\cdot,k,\cdot} is a topic occurrence count.

Before training the topic model, we remove stop words from all documents. When inferring topics for test contexts, we ignore unseen words because they do not contribute information for topic inference. In order to speed up training inference, we limit the documents in the collection to those corresponding to phrase pairs that are needed to translate the test set.[2] Inference was run for 50 iterations on the distributional profiles for training and for 10 iterations on the test contexts. The output of the training inference step is a model file with all the necessary statistics to compute posterior topic distributions (which are loaded before running test inference), and the set of topic vectors for all phrase pairs. The output of test inference is the set of induced topic vectors for all test contexts.

[2] Reducing the training contexts by scaling or sampling would be expected to speed up inference considerably.
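As an illustration of this inference scheme, the following sketch implements a simplified zero-order (CVB0-style) variant of the update in Equation (1), in which the expected counts are plain sums of the current topic responsibilities; the paper uses the full collapsed variational Bayes update of Teh et al. (2006), so this is only a rough approximation of the actual procedure:

# Simplified CVB0-style inference sketch; profiles are lists of word ids.
import numpy as np

def cvb0(profiles, K, alpha, beta, iters=50, seed=0):
    """Return one normalised topic vector per distributional profile."""
    rng = np.random.default_rng(seed)
    W = max(w for doc in profiles for w in doc) + 1
    # gamma[d][c] holds the topic responsibilities for position c of profile d
    gamma = [rng.dirichlet(np.ones(K), size=len(doc)) for doc in profiles]
    for _ in range(iters):
        n_kw = np.full((K, W), beta)                   # topic-word counts (+ prior)
        n_k = np.full(K, W * beta)                     # topic totals (+ prior)
        n_dk = [np.full(K, alpha) for _ in profiles]   # profile-topic counts (+ prior)
        for d, doc in enumerate(profiles):
            for c, w in enumerate(doc):
                n_kw[:, w] += gamma[d][c]
                n_k += gamma[d][c]
                n_dk[d] += gamma[d][c]
        for d, doc in enumerate(profiles):
            for c, w in enumerate(doc):
                g = gamma[d][c]
                # remove this position's own contribution, then apply Eq. (1)
                prob = (n_kw[:, w] - g) / (n_k - g) * (n_dk[d] - g)
                gamma[d][c] = prob / prob.sum()
    theta = [np.sum(g, axis=0) + alpha for g in gamma]
    return [t / t.sum() for t in theta]

print(cvb0([[0, 1, 0, 2], [3, 4, 3]], K=2, alpha=0.1, beta=0.01)[0].round(2))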

3.2 Modelling local and global context

At training time, our model has access to context words only from the local contexts of each phrase pair in their distributional profiles, that is, other words in the same source sentence as the phrase pair. This is useful for reducing noise and constraining the semantic space that the model considers for each phrase pair during training. At test time, however, we are not limited to applying the model only to the immediate surroundings of a source phrase to disambiguate its meaning. We can potentially take any size of test context into account to disambiguate the possible senses of a source phrase, but for simplicity we consider two sizes of context here, which we refer to as local and global context.

Local context: Words appearing in the sentence around a test source phrase, excluding stop words.

Global context: Words appearing in the document around a test source phrase, excluding stop words.
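A minimal sketch of the two context extractions, assuming tokenised test sentences and a toy stop-word list, could look as follows:

# Sketch of local vs. global test context extraction; at test time the
# phrase itself is kept, unlike in the training-time profiles (footnote 1).
STOP_WORDS = {"the", "a", "of", "to", "is", "and", "in"}

def local_context(sentence_tokens):
    """All non-stop words of the test sentence."""
    return [w for w in sentence_tokens if w.lower() not in STOP_WORDS]

def global_context(document_sentences):
    """All non-stop words of the surrounding test document."""
    return [w for sent in document_sentences for w in local_context(sent)]

doc = [["the", "kernel", "loads", "a", "driver"],
       ["users", "compile", "the", "kernel", "sources"]]
print(local_context(doc[0]))
print(global_context(doc))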

4 Similarity features

We define similarity features that compare the topic vector θ_p assigned to a phrase pair[3] to the topic vector assigned to a test context. The feature is defined for each source phrase and all its possible translations in the phrase table, as shown below:

sim(pp_i, \text{test context}) = \text{cosine}(\theta_{p_i}, \theta_c), \quad \forall pp_i \in \{pp_i \mid s \rightarrow t_i\} \qquad (2)

Unlike Banchs and Costa-jussa (2011), we do not learn topic vectors for every training sentence, which would result in a topic vector per phrase pair token; instead, we learn topic vectors for each phrase pair type. This is more efficient but also more appealing from a modelling point of view, as the topic distributions associated with phrase pairs can be thought of as expected latent contexts. The application of the similarity feature is visualised in Figure 2. On the left, there are two applicable phrase pairs for the source phrase noyau, noyau → kernel and noyau → nucleus, with their distributional representations (words belonging to the IT topic versus the scientific topic) and assigned topic vectors θ_p. The local and global test contexts are similarly represented by a document containing the context words and a resulting topic vector θ_l or θ_g. The test context vector θ_c can be one of θ_l and θ_g or a combination of both. In this example, the distributional representation of noyau → kernel has a larger topical overlap with the test context and will more likely be selected during decoding.

[3] The mass of topic 0 is removed from the vectors and the vectors are renormalised before computing similarity features.

Figure 2: Similarity between the topic vectors θ_p of two applicable phrase pairs and the topic vectors θ_l and θ_g from the local and global test context during test time.
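A small sketch of the similarity feature of Equation (2), including the topic-0 removal and renormalisation mentioned in footnote 3, is shown below; the topic vectors are invented for illustration:

# Sketch of the cosine similarity feature between phrase pair and test context.
import numpy as np

def drop_topic0(theta):
    v = np.asarray(theta, dtype=float)[1:]   # remove the mass of topic 0
    return v / v.sum()                       # renormalise

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim_feature(theta_pp, theta_context):
    return cosine(drop_topic0(theta_pp), drop_topic0(theta_context))

theta_kernel  = [0.30, 0.55, 0.05, 0.05, 0.05]   # peaked on the IT topic
theta_nucleus = [0.30, 0.05, 0.55, 0.05, 0.05]   # peaked on the science topic
theta_test    = [0.40, 0.40, 0.10, 0.05, 0.05]   # IT-flavoured test context
print(sim_feature(theta_kernel,  theta_test))    # higher similarity
print(sim_feature(theta_nucleus, theta_test))    # lower similarity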

While this work focuses on exploring vector space similarity for adaptation, mostly for computational ease, it may be possible to derive probabilistic translation features from the PPT model. This could be a useful addition to the model and we leave this as an avenue for future work.

Types of similarity features

We experiment with local and global phrase similarity features, phrSim-local and phrSim-global, to perform dynamic topic adaptation. These two similarity features can be combined by adding them both to the log-linear SMT model, in which case each receives a separate feature weight. Whenever we use the + symbol in our results tables, the additional features were combined with existing features log-linearly. However, we also experimented with an alternative combination of local and global information where we combine the local and global topic vectors for each test context before computing similarity features.[4] We were motivated by the observation that there are cases where the local and global features have an opposite preference for one translation over another, but the log-linear combination can only learn a global preference for one of the features. Combining the topic vectors allows us to potentially encode a preference for one of the contexts that depends on each test instance.

[4] The combined topic vectors were renormalised before computing their similarities with each candidate phrase pair.

For similarity features derived from combined topic vectors, ⊕ denotes the additive combination of topic vectors, ⊗ denotes the multiplicative combination of topic vectors and ~ denotes a combination that favours the local context for longer sentences and backs off incrementally to the global context for shorter sentences.[5] The intuition behind this combination is that if there is already sufficient evidence in the local context, the local topic mixture may be more reliable than the global mixture.

[5] The interpolation weights between local and global topic vectors were set proportional to sentence lengths between 1 and 30. The length of longer sentences was clipped to 30.
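The three vector-level combinations could be sketched as follows; the clipping of sentence lengths to 30 follows footnote 5, while the exact interpolation formula is an assumption rather than the system's actual implementation:

# Sketch of the additive (⊕), multiplicative (⊗) and length-dependent (~)
# combinations of local and global topic vectors.
import numpy as np

def normalise(v):
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def combine_add(theta_l, theta_g):                  # ⊕
    return normalise(np.asarray(theta_l) + np.asarray(theta_g))

def combine_mul(theta_l, theta_g):                  # ⊗
    return normalise(np.asarray(theta_l) * np.asarray(theta_g))

def combine_backoff(theta_l, theta_g, sent_len):    # ~
    lam = min(sent_len, 30) / 30.0                  # longer sentence -> trust local more
    return normalise(lam * np.asarray(theta_l) + (1 - lam) * np.asarray(theta_g))

theta_local  = [0.7, 0.2, 0.1]
theta_global = [0.2, 0.6, 0.2]
print(combine_add(theta_local, theta_global).round(2))
print(combine_mul(theta_local, theta_global).round(2))
print(combine_backoff(theta_local, theta_global, sent_len=12).round(2))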

We also experiment with a combination of the phrase pair similarity features derived from the PPT model with a document similarity feature from the pLDA model described in Hasler et al. (2014). The motivation is that their model learns topic mixtures for documents and uses phrases instead of words to infer the topical context. Therefore, it might provide additional information to our similarity features.

5 Data and experimental setup

Our experiments were carried out on a mixed French-English data set containing the TED corpus (Cettolo et al., 2012), parts of the News Commentary corpus (NC) and parts of the Commoncrawl corpus (CC) from the WMT13 shared task (Bojar et al., 2013), as described in Table 1. To ensure that the baseline model does not have an implicit preference for any particular domain, we selected subsets of the NC and CC corpora such that the training data contains 2.7M English words per domain. We were guided by two constraints in choosing our data set in order to simulate an environment where very diverse documents have to be translated, which is a typical scenario for web translation engines: 1) the data has document boundaries and the content of each document is assumed to be topically related, 2) there is some degree of topical variation within each data set. This setup allows us to evaluate our dynamic topic adaptation approach because the test documents are from different domains and also differ within each domain, which makes lexical selection a much harder problem. The topic adaptation approach does not make use of the domain labels in training or test, because it infers topic mixtures in an unsupervised way. However, we compare the performance of our dynamic approach to domain adaptation methods by providing them the domain labels for each document in training and test.

In order to abstract away from adaptation effects that concern tuning of length penalties and language models, we use a mixed tuning set containing data from all three domains and train one language model on the concatenation of the target sides of the training data. Word alignments are trained on the concatenation of all training data and fixed for all models. Table 2 shows the average length of a document for each domain. While a CC document contains 29.1 sentences on average, documents from NC and TED are on average more than twice as long. The length of a document could have an influence on how reliable global topic information is but also on how important it is to have information from both local and global test contexts.

Data    Mixed          CC      NC      TED
Train   354K (6450)    110K    103K    140K
Dev     2453 (39)      818     817     818
Test    5664 (112)     1892    1878    1894

Table 1: Number of sentence pairs and documents (in brackets) in the data sets.

Data                 CC      NC      TED
Test documents       65      31      24
Avg sentences/doc    29.1    60.6    78.9

Table 2: Average number of sentences per document in the test set (per domain).

5.1 Unadapted baseline system

Our baseline is a phrase-based French-English system trained on the concatenation of all parallel data. It was built with the Moses toolkit (Koehn et al., 2007) using the 14 standard core features including a 5-gram language model. Translation quality is evaluated on a large test set, using the average feature weights of three optimisation runs with PRO (Hopkins and May, 2011).



Figure 3: Topic distributions for the source phrase noyau and three of its translations, noyau → kernel, noyau → nucleus and noyau → core (20 topics, without topic 0). Colored bars correspond to the topics IT, politics, science and economy with topic proportions ≥ 10%.

We use the mteval-v13a.pl script to compute case-insensitive BLEU scores.

5.2 Domain-adapted benchmark systems

As domain-aware benchmark systems, we use the linear mixture model (DOMAIN1) of Sennrich (2012) and the phrase table fill-up method (DOMAIN2) of Bisazza et al. (2011) (both available in the Moses toolkit). For both systems, the domain labels of the documents are used to group documents of the same domain together. We build adapted tables for each domain by treating the remaining documents as out-of-domain data and combining in-domain with out-of-domain tables. For development and test, the domain labels are used to select the respective domain-adapted model for decoding. Both systems have an advantage over our model because of their knowledge of domain boundaries in the data. This allows for much more confident lexical choices than using an unadapted system but is not possible without prior knowledge about each document.

5.3 Implementation of similarity features

After all topic vectors have been computed, a feature generation step precomputes the similarity features for all pairs of test contexts and applicable phrase pairs for translating source phrases in a test instance. The phrase table of the baseline model is filtered for every test instance (a sentence or document, depending on the context setting) and each entry is augmented with features that express its semantic similarity to the test context. We use a wrapper around the Moses decoder to reload the phrase table for each test instance, which enables us to run parameter optimisation (PRO) in the usual way to get one set of tuned weights for all test sentences. It would be conceivable to use topic-specific weights instead of one set of global weights, but this is not the focus of this work.
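A simplified sketch of this feature generation step is shown below, assuming the phrase table is held in memory as a dictionary; the actual system writes out filtered Moses phrase tables and reloads them per test instance:

# Sketch of per-test-instance phrase table augmentation with a similarity feature.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def augment_for_instance(phrase_table, source_phrases, theta_context):
    """Return the phrase-table entries needed for this test instance, each
    extended with a similarity feature against the test-context vector."""
    augmented = []
    for src in source_phrases:
        for tgt, theta_pp in phrase_table.get(src, []):
            augmented.append((src, tgt, cosine(theta_pp, theta_context)))
    return augmented

table = {"noyau": [("kernel", [0.8, 0.1, 0.1]), ("nucleus", [0.1, 0.8, 0.1])]}
for entry in augment_for_instance(table, ["noyau"], [0.7, 0.2, 0.1]):
    print(entry)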

6 Qualitative evaluation of phrase pair topic distributions

In order to verify that the topic model is learning useful topic representations for phrase pairs, we inspect the inferred topic distributions for three phrase pairs where the translation of the same source word differs depending on the topical context: noyau → kernel, noyau → nucleus and noyau → core. Figure 3 shows the topic distributions for a PPT model with 20 topics (with topic 0 removed) and highlights the most prominent topics with labels describing their content (politics, IT, science, economy).[6] The most peaked topic distribution was learned for the phrase pair noyau → kernel, which would be expected to occur mostly in an IT context, and the topic with the largest probability mass is in fact related to IT. The most prominent topic for the phrase pair noyau → nucleus is the science topic, though it seems to be occurring with the political topic as well. The phrase pair noyau → core was assigned the most ambiguous topic distribution, with peaks at the politics, economy and IT topics. Note also that its topic distribution overlaps with those of the other translations; for example, like the phrase pair noyau → kernel, it can occur in IT contexts. This shows that the model captures the fact that even within a given topic there can still be ambiguity about the correct translation (both target phrases kernel and core are plausible translations in an IT context).

[6] Topic labels were assigned by inspecting the most probable context words for each topic according to the model.



Ambiguity of phrase pair topic vectors

The examples in the previous section show that the level of ambiguity differs between phrase pairs that constitute translations of the same source phrase. It is worth noting that introducing bilingual information into topic modelling reduces the sense ambiguity present in monolingual text by preserving only the intersection of the senses of source and target phrases. For example, the distributional profiles of the source phrase noyau would contain words that belong to the senses IT, politics, science and economy, while the words in the context of the target phrase kernel can belong to the senses IT and food (with source context words such as grain, proteines, produire). Thus, the monolingual representations would still contain a relatively high level of ambiguity while the distributional profile of the phrase pair noyau → kernel preserves only the IT sense.

7 Results and discussion

In this section we present experimental results of our model with different context settings and against different baselines. We used bootstrap resampling (Koehn, 2004) to measure significance on the mixed test set and marked all statistically significant results compared to the respective baselines with an asterisk (*: p ≤ 0.01).
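For reference, paired bootstrap resampling in the spirit of Koehn (2004) can be sketched as below; the corpus-level metric is a stand-in dummy so that the example is self-contained, whereas the actual evaluation uses BLEU:

# Sketch of paired bootstrap resampling over a test set.
import random

def paired_bootstrap(hyps_a, hyps_b, refs, score, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B."""
    rng = random.Random(seed)
    n, wins_a = len(refs), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]     # resample with replacement
        if score([hyps_a[i] for i in idx], [refs[i] for i in idx]) > \
           score([hyps_b[i] for i in idx], [refs[i] for i in idx]):
            wins_a += 1
    return wins_a / samples

# dummy metric: average token overlap with the reference (stand-in for BLEU)
def overlap(hyps, refs):
    return sum(len(set(h.split()) & set(r.split())) / max(len(r.split()), 1)
               for h, r in zip(hyps, refs)) / len(refs)

refs  = ["the kernel loads drivers", "consult the man pages"]
sys_a = ["the kernel loads drivers", "read the man pages"]
sys_b = ["the core loads pilots", "consult manual pages"]
print(paired_bootstrap(sys_a, sys_b, refs, overlap))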

7.1 Local context

In Table 3 we compare the results of the concatenation baseline and a model containing the phrSim-local feature in addition to the baseline features, for different numbers of latent topics. We show results for the mixed test set containing documents from all three domains as well as the individual results on the documents from each domain. While all topic settings yield improvements over the baseline, the largest improvement on the mixed test set (+0.48 BLEU) is achieved with 50 topics. Topic adaptation is most effective on the TED portion of the test set where the increase in BLEU is 0.59.

Model        Mixed     CC      NC      TED
Baseline     -26.86    19.61   29.42   31.88
10 topics    *27.15    19.87   29.63   32.36
20 topics    *27.19    19.92   29.76   32.31
50 topics    *27.34    20.13   29.70   32.47
100 topics   *27.26    20.02   29.75   32.40
>Baseline    +0.48     +0.52   +0.34   +0.59

Table 3: BLEU scores of the baseline system + phrSim-local feature for different numbers of topics.

7.2 Global context

Table 4 shows the results of the baseline plus the phrSim-global feature that takes into account the whole document context of a test sentence. While the largest overall improvement on the mixed test set is equal to the improvement of the local feature, there are differences in performance for the individual domains.

For Commoncrawl documents, the results vary slightly but the largest improvement is still achieved with 50 topics and is almost the same for both. For News Commentary, the scores with the local feature are consistently higher than the scores with the global feature (0.20 and 0.22 BLEU higher for 20 and 50 topics). For TED, the trend is the opposite, with the global feature performing better than the local feature for all topics (0.28 and 0.40 BLEU higher for 10 and 20 topics). The best improvement over the baseline for TED is 0.83 BLEU, which is higher than the improvement with the local feature.

Model        Mixed     CC      NC      TED
Baseline     -26.86    19.61   29.42   31.88
10 topics    *27.30    20.01   29.61   32.64
20 topics    *27.34    20.07   29.56   32.71
50 topics    *27.27    20.12   29.48   32.55
100 topics   *27.24    19.95   29.66   32.52
>Baseline    +0.48     +0.51   +0.24   +0.83

Table 4: BLEU scores of the baseline system + phrSim-global feature for different numbers of topics.

7.3 Relation to properties of test documents

To make these results more interpretable, Table 5 lists some of the properties of the test documents per domain. Of the three domains, CC has the shortest documents on average and TED the longest. To understand how this affects topic inference, we measure topical drift as the average divergence (cosine distance) of the local topic distributions for each test sentence to the global topic distribution of their surrounding document. There seems to be a correlation between document length and topical drift, with CC documents showing the least topical drift and TED documents showing the most. This makes sense intuitively because the longer a document is, the more likely it is that the content of a given sentence diverges from the overall topical structure of the document. While this can explain why for CC documents using local or global context results in similar performance, it does not explain the better performance of the local feature for NC documents. The last row of Table 5 shows that sentences in the NC documents are on average the longest, and longer sentences would be expected to yield more reliable topic estimates than shorter sentences. Thus, we assume that local context yields better performance for NC because on average the sentences are long enough to yield reliable topic estimates. When local context provides reliable information, it may be more informative than global context because it can be more specific.

For TED, we see the largest topical drift per document, which could lead us to believe that the document topic mixtures do not reflect the topical content of the sentences too well. But considering that the sentences are on average shorter than for the other two domains, it is more likely that the local context in TED documents can be unreliable when the sentences are too short. TED documents contain transcribed speech and are probably less dense in terms of information content than News Commentary documents. Therefore, the global context may be more informative for TED, which could explain why relying on the global topic mixtures yields better results.

Property (per document)      CC      NC      TED
Avg number of sentences      29.1    60.6    78.9
Avg topical divergence       0.35    0.43    0.49
Avg sentence length          26.2    31.5    21.7

Table 5: Properties of test documents per domain. Average topical divergence is defined as the average cosine distance of local to global topic distributions in a document.
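The topical divergence measure in Table 5 can be sketched as follows, with invented topic vectors standing in for the inferred sentence-level and document-level distributions:

# Sketch of average topical divergence: mean cosine distance of each
# sentence's local topic vector to its document's global topic vector.
import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_topical_divergence(local_thetas, global_theta):
    return sum(cosine_distance(t, global_theta) for t in local_thetas) / len(local_thetas)

doc_theta = [0.5, 0.3, 0.2]
sent_thetas = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.2, 0.3]]
print(round(avg_topical_divergence(sent_thetas, doc_theta), 3))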

7.4 Combinations of local and global context

In Table 6 we compare a system that already contains the global feature from a model with 50 topics to the combinations of local and global similarity features described in Section 4.

Model       Mixed     CC      NC      TED
Baseline    -26.86    19.61   29.42   31.88
+ global    -27.27    20.12   29.48   32.55
+ local     *27.43    20.18   29.65   32.79
⊕ local     *27.49    20.30   29.66   32.76
⊗ local     -27.34    20.24   29.61   32.50
~ local     *27.45    20.22   29.51   32.79
⊕ >BL       +0.63     +0.69   +0.24   +0.88

Table 6: BLEU scores of the baseline and combinations of phrase pair similarity features with local and global context (significance compared to baseline+global). All models were trained with 50 topics.

Of the four combinations, the additive combination of topic vectors (⊕) yields the largest improvement over the baseline, with +0.63 BLEU on the mixed test set and +0.88 BLEU on TED. The improvements of the combined model are larger than the improvements for each context on its own, with the only exception being the NC portion of the test set where the improvement is not larger than using just the local context. A possible reason is that when one feature is consistently better for one of the domains (local context for NC), the log-linear combination of both features (tuned on data from all domains) would result in a weaker overall model for that domain. However, if both features encode similar information, as we assume to be the case for CC documents, the presence of both features would reinforce the preference of each and result in equal or better performance. For the additive combination, we expect a similar effect because adding together two topic vectors that have peaks at different topics would make the resulting topic vector less peaked than either of the original vectors.

The additive topic vector combination is slightly better than the log-linear feature combination, though the difference is small. Nevertheless, it shows that combining topic vectors before computing similarity features is a viable alternative to log-linear combination, with the potential to design more expressive combination functions. The multiplicative combination performs slightly worse than the additive combination, which suggests that the information provided by the two contexts is not always in agreement. In some cases, the global context may be more reliable while in other cases the local context may have more accurate topic estimates, and a voting approach does not take advantage of complementary information.



Source: Le noyau contient de nombreux pilotes, afin de fonctionner chez la plupart des utilisateurs.
Reference: The precompiled kernel includes a lot of drivers, in order to work for most users.

Source: Il est prudent de consulter les pages de manuel ou les faq specifiques a votre os.
Reference: It's best to consult the man pages or faqs for your os.

Source: Nous fournissons nano (un petit editeur), vim (vi ameliore), qemacs (clone de emacs), elvis, joe.
Reference: Nano (a lightweight editor), vim (vi improved), qemacs (emacs clone), elvis and joe.

Source: Elle a introduit des politiques [..] a cote des relations de gouvernement a gouvernement traditionnelles.
Reference: She has introduced policies [..] alongside traditional government-to-government relations.

Figure 4: Examples of test sentences and reference translations with the ambiguous source words and their translations in bold.

The combination of topic vectors depending on sentence length (~) performs well for CC and TED but less well for NC, where we would expect that it helps to prefer the local information. This indicates that the rather ad-hoc way in which we encoded the dependency on sentence length may need further refinement to make better use of the local context information.

Model            noyau →     os →
Baseline         nucleus     bones
global           kernel*     os*
local            nucleus     bones
global⊕local     kernel*     os*

Table 7: Translations of ambiguous source words where global context yields the correct translation (* denotes the correct translation).

Model            elvis →      relations →
Baseline         elvis*       relations*
global           the king     relationship
local            elvis*       relations*
global⊕local     the king     relations*

Table 8: Translations of ambiguous source words where local context yields the correct translation (* denotes the correct translation).

7.5 Effect of contexts on translation

To give an intuition of how lexical selection is affected by contextual information, Figure 4 shows four test sentences with an ambiguous source word and its translation in bold. The corresponding translations with the baseline, the global and local similarity features and the additive combination are shown in Table 7 for the first two examples, where the global context yields the correct translation (as indicated by *), and in Table 8 for the last two examples, where the local context yields the correct translation.[7] In Table 7, the additive combination preserves the choice of the global model and yields the correct translations, while in Table 8 only the second example is translated correctly by the combined model. A possible explanation is that the topical signal from the global context is stronger and results in more discriminative similarity values. In that case, the preference of the global context would be likely to have a larger influence on the similarity values in the combined model. A useful extension could be to try to detect for a given test instance which context provides more reliable information (beyond encoding sentence length) and boost the topic distribution from that context in the combination.

[7] For these examples, the local model happens to yield the same translations as the baseline model.

7.6 Comparison with domain adaptation

Table 9 compares the additive model (⊕) to the two domain-adapted systems that know the domain label of each document during training and test. Our topic-adapted model yields overall competitive performance, with improvements of +0.37 and +0.25 BLEU on the mixed test set, respectively. While it yields slightly lower performance on the NC documents, it achieves equal performance on TED documents and improves by up to +0.94 BLEU on Commoncrawl documents. This can be explained by the fact that Commoncrawl is the most diverse of the three domains, with documents crawled from all over the web, thus we expect topic adaptation to be most effective in comparison to domain adaptation in this scenario. Our dynamic approach allows us to adapt the similarity features to each test sentence and test document individually and is therefore more flexible than cross-domain adaptation approaches, while requiring no information about the domain of a test instance.



Type of adaptation    Model             Mixed     CC      NC      TED
Domain-adapted        DOMAIN1           -27.24    19.61   29.87   32.73
Domain-adapted        DOMAIN2           -27.12    19.36   29.78   32.71
Topic-adapted         global ⊕ local    *27.49    20.30   29.66   32.76
                      >DOMAIN1          +0.25     +0.69   -0.21   +0.03
                      >DOMAIN2          +0.37     +0.94   -0.12   +0.05

Table 9: BLEU scores of the translation model using similarity features derived from the PPT model (50 topics) in comparison with two (supervised) domain-adapted systems.

Model                              Mixed     CC      NC      TED
Baseline                           -26.86    19.61   29.42   31.88
+ docSim                           -27.22    20.11   29.63   32.40
+ phrSim-global ⊕ phrSim-local     *27.58    20.34   29.71   32.96
+ phrSim-global ~ phrSim-local     *27.60    20.35   29.70   33.03
global~local >BL                   +0.74     +0.74   +0.38   +1.15

Table 10: BLEU scores of the baseline, baseline + document similarity feature and additional phrase pair similarity features (significance compared to baseline+docSim). All models were trained with 50 topics.


7.7 Combination with an additional document similarity feature

To find out whether similarity features derived from different types of topic models can provide complementary information, we add the phrSim features to a system that already includes a document similarity feature (docSim) derived from the pLDA model (Hasler et al., 2014), which learns topic distributions at the document level and uses phrases instead of words as the minimal units. The results are shown in Table 10. Adding the two best combinations of local and global context from Table 6 yields the best results on TED documents, with an increase of 0.63 BLEU over the baseline + docSim model and 1.15 BLEU over the baseline. On the mixed test set, the improvement is 0.38 BLEU over the baseline + docSim model and 0.74 BLEU over the baseline. Thus, we show that combining different scopes and granularities of similarity features consistently improves translation results and yields larger gains than using each of the similarity features alone.

8 Conclusion

We have presented a new topic model for dynamic adaptation of machine translation systems that learns topic distributions for phrase pairs. These latent topic representations can be compared to latent representations of local or global test contexts and integrated into the translation model via similarity features.

Our experimental results show that it is beneficial for adaptation to use contextual information from both local and global contexts, with BLEU improvements of up to 1.15 over the baseline system on TED documents and 0.74 on a large mixed test set with documents from three domains. Among four different combinations of local and global information, we found that the additive combination of topic vectors performs best. We conclude that information from both contexts should be combined to correct potential topic detection errors in either of the two contexts. We also show that our dynamic adaptation approach performs competitively in comparison with two supervised domain-adapted systems and that the largest improvement is achieved for the most diverse portion of the test set.

In future work, we would like to experiment with more compact distributional profiles to speed up inference and explore the possibilities of deriving probabilistic translation features from the PPT model as an extension to the current model. Another avenue for future work could be to combine contextual information that captures different types of information, for example, to distinguish between semantic and syntactic aspects in the local context.



Acknowledgements

This work was supported by funding from the Scottish Informatics and Computer Science Alliance (Eva Hasler) and funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement 287658 (EU BRIDGE) and grant agreement 288769 (ACCEPT). Thanks to Annie Louis for helpful comments on a draft of this paper and thanks to the anonymous reviewers for their useful feedback.

References

Rafael E. Banchs and Marta R. Costa-jussa. 2011. A Semantic Feature for Statistical Machine Translation. In SSST-5: Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 126–134.

Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus Interpolation Methods for Phrase-based SMT Adaptation. In Proceedings of IWSLT.

Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of WMT 2013.

Jordan Boyd-graber and David Blei. 2007. A Topic Model for Word Sense Disambiguation. In Proceedings of EMNLP-CoNLL, pages 1024–1033.

Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. 2007. Improving Word Sense Disambiguation Using Topic Features. In Proceedings of EMNLP, pages 1015–1023.

Marine Carpuat and Dekai Wu. 2007a. How Phrase Sense Disambiguation outperforms Word Sense Disambiguation for Statistical Machine Translation. In International Conference on Theoretical and Methodological Issues in MT.

Marine Carpuat and Dekai Wu. 2007b. Improving Statistical Machine Translation using Word Sense Disambiguation. In Proceedings of EMNLP, pages 61–72.

Marine Carpuat. 2009. One Translation per Discourse. In Proceedings of the NAACL HLT Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 19–27.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of EAMT.

Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word Sense Disambiguation Improves Statistical Machine Translation. In Proceedings of ACL.

Boxing Chen, Roland Kuhn, and George Foster. 2013. Vector Space Model for Adaptation in Statistical Machine Translation. In Proceedings of ACL, pages 1285–1293.

Marta R. Costa-jussa and Rafael E. Banchs. 2010. A vector-space dynamic feature for phrase-based statistical machine translation. Journal of Intelligent Information Systems, 37(2):139–154, August.

Georgiana Dinu and Mirella Lapata. 2010. Measuring Distributional Similarity in Context. In Proceedings of EMNLP, pages 1162–1172.

Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. 2012. Topic Models for Dynamic Translation Model Adaptation. In Proceedings of ACL.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One Sense Per Discourse. In Proceedings of the Workshop on Speech and Natural Language.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Eva Hasler, Barry Haddow, and Philipp Koehn. 2012. Sparse Lexicalised Features and Topic Adaptation for SMT. In Proceedings of IWSLT.

Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic Topic Adaptation for Phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden.

Sanjika Hewavitharana, Dennis N. Mehay, and Sankaranarayanan Ananthakrishnan. 2013. Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation. In Proceedings of ACL, pages 697–701.

Mark Hopkins and Jonathan May. 2011. Tuning as Ranking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, United Kingdom.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of ACL, pages 873–882.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for SMT. In Proceedings of ACL: Demo and Poster Sessions.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of EMNLP.

Linlin Li, Benjamin Roth, and Caroline Sporleder. 2010. Topic Models for Word Sense Disambiguation and Token-based Idiom Detection. In Proceedings of ACL, pages 1138–1147.

Rico Sennrich. 2012. Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation. In Proceedings of EACL.

Yik-Cheung Tam, Ian Lane, and Tanja Schultz. 2008. Bilingual LSA-based adaptation for statistical machine translation. Machine Translation, 21(4):187–207, November.

Yee Whye Teh, David Newman, and Max Welling. 2006. A collapsed variational Bayesian inference algorithm for LDA. In Proceedings of NIPS.

B. Zhao and E. P. Xing. 2007. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. In Neural Information Processing.
