
    Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 166-175, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics

    Word Alignment Modeling with Context Dependent Deep Neural Network

    Nan Yang 1, Shujie Liu 2, Mu Li 2, Ming Zhou 2, Nenghai Yu 1
    1 University of Science and Technology of China, Hefei, China

    2 Microsoft Research Asia, Beijing, China
    {v-nayang, shujliu, muli, mingzhou}@microsoft.com

    [email protected]

    Abstract

    In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable of modeling the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale English-Chinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.

    1 Introduction

    In recent years, research communities have seen a strong resurgence of interest in modeling with deep (multi-layer) neural networks. This trending topic, usually referred to under the name Deep Learning, was started by ground-breaking papers such as (Hinton et al., 2006), in which innovative training procedures for deep structures are proposed. Unlike shallow learning methods, such as Support Vector Machines, Conditional Random Fields, and Maximum Entropy, which need hand-crafted features as input, DNN can learn suitable features (representations) automatically from raw input data, given a training objective.

    DNN did not achieve the expected success until 2006, when researchers discovered a proper way to initialize and train the deep architectures, which contains two phases: layer-wise unsupervised pre-training and supervised fine-tuning. For pre-training, Restricted Boltzmann Machines (RBM) (Hinton et al., 2006), auto-encoding (Bengio et al., 2007) and sparse coding (Lee et al., 2007) were proposed and are popularly used. The unsupervised pre-training trains the network one layer at a time, and helps to guide the parameters of the layer towards better regions in parameter space (Bengio, 2009). Followed by fine-tuning in this region, DNN is shown to achieve state-of-the-art or even better performance in various areas (Dahl et al., 2012; Kavukcuoglu et al., 2010). DNN also achieved breakthrough results on the ImageNet dataset for object recognition (Krizhevsky et al., 2012). For speech recognition, (Dahl et al., 2012) proposed a context-dependent neural network with large vocabulary, which achieved a 16.0% relative error reduction.

    DNN has also been applied in the Natural Language Processing (NLP) field. Most works convert atomic lexical entries into a dense, low dimensional, real-valued representation, called word embedding; each dimension represents a latent aspect of a word, capturing its semantic and syntactic properties (Bengio et al., 2006). Word embeddings are usually first learned from a huge amount of monolingual text, and then fine-tuned with task-specific objectives. (Collobert et al., 2011) and (Socher et al., 2011) further apply Recursive Neural Networks to address structural prediction tasks such as tagging and parsing, and (Socher et al., 2012) explores the compositional aspect of word representations.

    Inspired by these successful previous works, we propose a new DNN-based word alignment method, which exploits contextual and semantic similarities between words. As shown in example (a) of Figure 1, in the word pair {juda ↔ mammoth}, the Chinese word "juda" is a common word, but the English word "mammoth" is not, so it is very hard to align them correctly.


    [Figure 1: Two examples of word alignment. (a) the Chinese sentence "jiang shi yixiang juda gongcheng" aligned to the English sentence "will be a mammoth job"; (b) the Chinese sentence "nongmin yibula shuo :" aligned to the English sentence "A farmer Yibula said :".]

    If we know that "mammoth" has a similar meaning to "big" or "huge", it would be easier to find the corresponding word in the Chinese sentence. As we mentioned in the last paragraph, word embedding (trained on huge monolingual texts) has the ability to map a word into a vector space in which similar words are near each other.

    For example (b) in Figure 1, in the word pair {yibula ↔ Yibula}, both the Chinese word "yibula" and the English word "Yibula" are rare named entities, but the words around them are very common: {nongmin, shuo} on the Chinese side and {farmer, said} on the English side. The pattern of the context {nongmin X shuo ↔ farmer X said} may help to align the word pair which fills the variable X; likewise, the pattern {yixiang X gongcheng ↔ a X job} is helpful for aligning the word pair {juda ↔ mammoth} in example (a).

    Based on the above analysis, in this paper both the words on the source and the target side are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multi-layer neural network which takes rich contexts (surrounding words on both the source and target sides) into consideration; an HMM-like distortion model is then applied on top of the neural network to characterize the structural aspects of bilingual sentences.

    In the rest of this paper, related work on DNN and word alignment is first reviewed in Section 2, followed by a brief introduction to DNN in Section 3. We then introduce the details of leveraging DNN for word alignment, including the details of our network structure in Section 4 and the training method in Section 5. The merits of our approach are illustrated with the experiments described in Section 6, and we conclude our paper in Section 7.

    2 Related Work

    DNN with unsupervised pre-training was first introduced by (Hinton et al., 2006) for the MNIST digit image classification problem, in which RBM was introduced as the layer-wise pre-trainer. The layer-wise pre-training phase finds a better local optimum for the multi-layer network, thus leading to improved performance. (Krizhevsky et al., 2012) proposed to apply DNN to the object recognition task (ImageNet dataset), which brought the state-of-the-art error rate down from 26.1% to 15.3%.

    (Seide et al., 2011) and (Dahl et al., 2012) apply Context-Dependent Deep Neural Network with HMM (CD-DNN-HMM) to the speech recognition task, which significantly outperforms traditional models.

    Most methods using DNN in NLP start with a word embedding phase, which maps words into fixed-length, real-valued vectors. (Bengio et al., 2006) proposed to use a multi-layer neural network for the language modeling task. (Collobert et al., 2011) applied DNN to several NLP tasks, such as part-of-speech tagging, chunking, named entity recognition, semantic labeling and syntactic parsing, where they got similar or even better results than the state-of-the-art on these tasks. (Niehues and Waibel, 2012) shows that machine translation results can be improved by combining a neural language model with a traditional n-gram language model. (Son et al., 2012) improves the translation quality of an n-gram translation model by using a bilingual neural language model. (Titov et al., 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.

    For the related works on word alignment, the most popular methods are based on generative models such as the IBM Models (Brown et al., 1993) and HMM (Vogel et al., 1996). Discriminative approaches have also been proposed that use hand-crafted features to improve word alignment. Among them, (Liu et al., 2010) proposed to use phrase and rule pairs to model the context information in a log-linear framework. Unlike previous discriminative methods, in this work we do not resort to any hand-crafted features, but use DNN to induce features from raw words.


    3 DNN structures for NLP

    The most important and prevalent features available in NLP are the words themselves. To apply DNN to an NLP task, the first step is to transform a discrete word into its word embedding, a low dimensional, dense, real-valued vector (Bengio et al., 2006). Word embeddings often implicitly encode syntactic or semantic knowledge of the words. Assuming a finite-sized vocabulary V, word embeddings form an (L × |V|)-dimensional embedding matrix W_V, where L is a pre-determined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix W_V. The lookup process is called a lookup layer LT, which is usually the first layer after the input layer in a neural network.

    After words have been transformed to their embeddings, they can be fed into subsequent classical network layers to model highly non-linear relations:

    z^l = f^l(M^l z^{l-1} + b^l)    (1)

    where z^l is the output of the l-th layer, M^l is a |z^l| × |z^{l-1}| matrix, b^l is a |z^l|-length vector, and f^l is an activation function. Except for the last layer, f^l must be non-linear. Common choices for f^l include the sigmoid function, the hyperbolic tangent function, the hard hyperbolic tangent function, etc. Following (Collobert et al., 2011), we choose the hard hyperbolic tangent function as our activation function in this work:

    htanh(x) = 1 if x > 1; -1 if x < -1; x otherwise    (2)

    If a probabilistic interpretation is desired, a softmax layer (Bridle, 1990) can be used to do the normalization:

    z^l_i = exp(z^{l-1}_i) / Σ_{j=1}^{|z^l|} exp(z^{l-1}_j)    (3)

    The above layers can only handle fixed-size input and output. If the input must be of variable length, a convolution layer and a max layer can be used (Collobert et al., 2011), which transform variable-length input into a fixed-length vector for further processing.

    Multi-layer neural networks are trained with the standard back propagation algorithm (LeCun, 1985). As the networks are non-linear and the task-specific objectives usually contain many local optima, special care must be taken in the optimization process to obtain good parameters. Techniques such as layer-wise pre-training (Bengio et al., 2007) and many tricks (LeCun et al., 1998) have been developed to train better neural networks. Besides that, neural network training also involves some hyper-parameters such as the learning rate and the number of hidden layers. We will address these issues in Section 4.

    4 DNN for word alignment

    Our DNN word alignment model extends the classic HMM word alignment model (Vogel et al., 1996). Given a sentence pair (e, f), the HMM word alignment model takes the following form:

    P(a, e|f) = Π_{i=1}^{|e|} P_lex(e_i | f_{a_i}) P_d(a_i - a_{i-1})    (4)

    where P_lex is the lexical translation probability and P_d is the jump distance distortion probability.

    One straightforward way to integrate DNN into the HMM is to use a neural network to compute the emission (lexical translation) probability P_lex. Such an approach requires a softmax layer in the neural network to normalize over all words in the source vocabulary. As the vocabulary of a natural language is usually very large, it is prohibitively expensive to do the normalization. Hence we give up the probabilistic interpretation and resort to a non-probabilistic, discriminative view:

    s_NN(a | e, f) = Π_{i=1}^{|e|} t_lex(e_i, f_{a_i} | e, f) t_d(a_i, a_{i-1} | e, f)    (5)

    where t_lex is a lexical translation score computed by the neural network, and t_d is a distortion score.

    In the classic HMM word alignment model, context is not considered in the lexical translation probability. Although we could rewrite P_lex(e_i | f_{a_i}) as P_lex(e_i | context of f_{a_i}) to model context, this introduces too many additional parameters and leads to a serious over-fitting problem due to data sparseness. As a matter of fact, even without any contexts, the lexical translation table in the HMM already contains O(|V_e| × |V_f|) parameters, where |V_e| and |V_f| denote the source and target vocabulary sizes. In contrast, our model does not maintain a separate translation score parameter for every source-target word pair, but computes t_lex through a multi-layer network, which naturally handles contexts on both sides without an explosive growth in the number of parameters.
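
    Given per-pair lexical and distortion scores, the sentence-level score of Eq. 5 is a simple product over positions of the alignment path. The sketch below (not from the paper) makes this explicit; t_lex and t_d are stand-ins for the network and the jump-distance scheme described later, and the handling of the initial position is an assumption.

        def alignment_score(e, f, a, t_lex, t_d):
            # s_NN(a | e, f) of Eq. 5.
            # e, f  : word lists of the two sides of the sentence pair
            # a     : alignment path; a[i] is the index in f aligned to e[i]
            # t_lex : callable scoring a word pair in its sentence context
            # t_d   : callable scoring the jump from the previous aligned position
            score = 1.0
            prev = 0  # assumed start position; the paper does not specify it here
            for i in range(len(e)):
                score *= t_lex(e[i], f[a[i]], e, f) * t_d(a[i], prev, e, f)
                prev = a[i]
            return score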


    our model from raw sentence pairs, they are too computationally demanding, as the lexical translation probabilities must be computed by neural networks. Hence, we opt for a simpler supervised approach, which learns the model from sentence pairs with word alignment. As we do not have a large manually word-aligned corpus, we use traditional word alignment models such as HMM and IBM model 4 to generate word alignment on a large parallel corpus. We obtain bi-directional alignment by running the usual grow-diag-final heuristic (Koehn et al., 2003) on the uni-directional results from both directions, and use the results as our training data. A similar approach has been taken in the speech recognition task (Dahl et al., 2012), where training data for the neural network model is generated by forced decoding with traditional Gaussian mixture models.

    Tunable parameters in the neural network alignment model include: the word embeddings in the lookup table LT, the parameters W^l, b^l for the linear transformations in the hidden layers of the neural network, and the distortion parameters s_d of the jump distance. We take the following ranking loss with margin as our training criterion:

    loss(θ) = Σ_{(e,f)} max{0, 1 - s_θ(a^+ | e, f) + s_θ(a^- | e, f)}    (9)

    where θ denotes all tunable parameters, a^+ is the gold alignment path, a^- is the highest scoring incorrect alignment path under θ, and s_θ is the model score for an alignment path as defined in Eq. 5. One nuance here is that the gold alignment after grow-diag-final contains many-to-many links, which cannot be generated by any path. Our solution is that, for each source word aligned to multiple target words, we randomly choose one link among all candidates as the gold link.
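
    A minimal sketch of this criterion follows; it is illustrative rather than the paper's implementation. The score function and the search for the highest scoring incorrect path a^- are assumed to be supplied, and the gold-link selection mirrors the random choice just described.

        import random

        def ranking_loss(batch, score):
            # Margin ranking loss of Eq. 9 over a batch of sentence pairs.
            # Each element is (e, f, a_gold, a_wrong), where a_wrong is the
            # highest scoring incorrect path found by some external search.
            total = 0.0
            for e, f, a_gold, a_wrong in batch:
                total += max(0.0, 1.0 - score(e, f, a_gold) + score(e, f, a_wrong))
            return total

        def pick_gold_links(many_to_many_links, rng=random):
            # Reduce grow-diag-final many-to-many links to one link per word on
            # the e side, chosen at random, so the result is a valid path.
            by_word = {}
            for i, j in many_to_many_links:
                by_word.setdefault(i, []).append(j)
            return {i: rng.choice(js) for i, js in by_word.items()}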

    Because our multi-layer neural network is inherently non-linear and non-convex, directly training against the above criterion is unlikely to yield good results. Instead, we take the following steps to train our model.

    5.1 Pre-training initial word embedding with monolingual data

    Most parameters reside in the word embeddings. To get a good initial value, the usual approach is to pre-train the embeddings on a large monolingual corpus. We replicate the work in (Collobert et al., 2011) and train word embeddings for the source and target languages from their monolingual corpora respectively. Our vocabularies V_s and V_t contain the most frequent 100,000 words from each side of the parallel corpus, and all other words are treated as unknown words. We set the word embedding length to 20, the window size to 5, and the length of the only hidden layer to 40. Following (Turian et al., 2010), we randomly initialize all parameters to [-0.1, 0.1], and use stochastic gradient descent to minimize the ranking loss with a fixed learning rate of 0.01. Note that the embedding for the null word in either V_e or V_f cannot be trained from a monolingual corpus, and we simply leave it untouched at its initial value.
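
    The following sketch shows the shape of this pre-training loop as we understand it from (Collobert et al., 2011): a genuine window should outscore, by a margin, the same window with its centre word replaced by a random word. The window_score and sgd_update callables are hypothetical stand-ins for the small scoring network and its gradient step, which the paper does not spell out here.

        import numpy as np

        def pretrain_embeddings(corpus, vocab_size, window_score, sgd_update,
                                window=5, lr=0.01, epochs=1, seed=0):
            # corpus: iterable of sentences given as lists of word ids.
            rng = np.random.default_rng(seed)
            half = window // 2
            for _ in range(epochs):
                for sent in corpus:
                    for c in range(half, len(sent) - half):
                        win = sent[c - half:c + half + 1]
                        corrupt = list(win)
                        corrupt[half] = int(rng.integers(vocab_size))  # corrupt the centre word
                        loss = max(0.0, 1.0 - window_score(win) + window_score(corrupt))
                        if loss > 0.0:
                            sgd_update(win, corrupt, lr)  # move embeddings and weights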

    Word embeddings learned from a monolingual corpus capture strong syntactic knowledge of each word, which is not always desirable for word alignment between some language pairs like English and Chinese. For example, many Chinese words can act as a verb, noun and adjective without any change in form, while their English counterparts are distinct words with quite different word embeddings due to their different syntactic roles. Thus we have to modify the word embeddings in subsequent steps according to bilingual data.

    5.2 Training neural network based on local criteria

    Training the network against the sentence-level criterion Eq. 5 directly is not efficient. Instead, we first ignore the distortion parameters and train the neural network for lexical translation scores against the following local pairwise loss:

    max{0, 1 - t((e, f)^+ | e, f) + t((e, f)^- | e, f)}    (10)

    where (e, f)^+ is a correct word pair, (e, f)^- is a wrong word pair in the same sentence, and t is as defined in Eq. 6. This training criterion essentially means that our model suffers a loss unless it gives correct word pairs a higher score than random pairs from the same sentence pair, by some margin.

    We initialize the lookup table with the embeddings obtained from monolingual training, and randomly initialize all W^l and b^l in the linear layers to [-0.1, 0.1]. We minimize the loss using stochastic gradient descent as follows. We randomly cycle through all sentence pairs in the training data; for each correct word pair (including null alignment), we generate a positive example, and generate two negative examples by randomly corrupting either side of the pair with another word in the sentence pair. We set the learning rate to 0.01. As there is no clear stopping criterion, we simply run the stochastic optimizer through the parallel corpus for N iterations. In this work, N is set to 50.
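
    The negative-example generation just described might look like the sketch below (an illustration, not the paper's code); the index convention and the treatment of null links are assumptions.

        import random

        def make_examples(e, f, gold_links, rng=random):
            # For every correct pair (i, j) -- i indexing e, j indexing f, with
            # None standing for a null alignment -- emit one positive example and
            # two negatives obtained by corrupting either side with another word
            # from the same sentence pair (Section 5.2).
            examples = []
            for i, j in gold_links:
                examples.append(((i, j), +1))
                wrong_i = rng.choice([k for k in range(len(e)) if k != i])
                wrong_j = rng.choice([k for k in range(len(f)) if k != j])
                examples.append(((wrong_i, j), -1))   # corrupt the e side
                examples.append(((i, wrong_j), -1))   # corrupt the f side
            return examples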

    To make our model concrete, there are still hyper-parameters to be determined: the window sizes sw and tw, and the length of each hidden layer L_l. We empirically set sw and tw to 11, L_1 to 120, and L_2 to 10, which achieved the minimal loss on a small held-out data set among the several settings we tested.

    5.3 Training distortion parameters

    We fix the neural network parameters obtained in the last step, and tune the distortion parameters s_d with respect to the sentence-level loss using standard stochastic gradient descent. We use a separate parameter for each jump distance from -7 to 7, and another two parameters for longer forward/backward jumps. We initialize all parameters in s_d to 0, and set the learning rate for the stochastic optimizer to 0.001. As there are only 17 parameters in s_d, we only need to run the optimizer over a small portion of the parallel corpus.
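
    One way to realize this 17-parameter scheme is the bucketing sketch below; the exact index ordering is our assumption, not something the paper specifies.

        def distortion_score(s_d, jump):
            # s_d holds 17 parameters: index 0 for long backward jumps (< -7),
            # indices 1..15 for jump distances -7..7, and index 16 for long
            # forward jumps (> 7). All are initialized to 0 and tuned by SGD.
            if jump < -7:
                return s_d[0]
            if jump > 7:
                return s_d[16]
            return s_d[jump + 8]

        s_d = [0.0] * 17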

    5.4 Tuning neural network based on sentence level criteria

    Up to now, the parameters in the lexical translation neural network have not been trained against the sentence-level criterion Eq. 5. We could achieve this by re-using the same online training method used to train the distortion parameters, except that we now fix the distortion parameters and let the loss back-propagate through the neural network. Sentence-level training does not take larger context into account in modeling word translations, but only optimizes the parameters with respect to the sentence-level loss. This tuning is quite slow, and it did not improve alignment in an initial small-scale experiment; so we skip this step in all subsequent experiments in this work.

    6 Experiments and Results

    We conduct our experiments on the Chinese-to-English word alignment task. We use the manually aligned Chinese-English alignment corpus (Haghighi et al., 2009), which contains 491 sentence pairs, as our test set. We adapt the segmentation on the Chinese side to fit our word segmentation standard.

    6.1 Data

    Our parallel corpus contains about 26 million unique sentence pairs in total, which are mined from the web.

    The monolingual corpora used to pre-train the word embeddings are also crawled from the web, and amount to about 1.1 billion unique sentences for English and about 300 million unique sentences for Chinese. As pre-processing, we lowercase all English words, map all numbers to one special token, and map all email addresses and URLs to another special token.
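
    A possible shape for this pre-processing step is sketched below; the regular expressions and the token names <num> and <url> are illustrative choices, since the paper does not name the special tokens.

        import re

        NUM_RE   = re.compile(r'\d+([.,]\d+)*')
        URL_RE   = re.compile(r'(https?://\S+|www\.\S+)')
        EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

        def preprocess(line):
            # Lowercase, then collapse URLs/emails and numbers into special tokens.
            line = line.lower()
            line = URL_RE.sub('<url>', line)
            line = EMAIL_RE.sub('<url>', line)   # emails share the URL token here
            line = NUM_RE.sub('<num>', line)
            return line

        print(preprocess("Visit http://example.com or mail bob@example.com, room 42."))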

    6.2 Settings

    We use the classic HMM and IBM model 4 as our baselines, which are generated by GIZA++ (Och and Ney, 2000). We train our proposed model from the results of the classic HMM and IBM model 4 separately. Since the classic HMM, IBM model 4 and our model are all uni-directional, we use the standard grow-diag-final heuristic to generate bi-directional results for all models.

    Models are evaluated on the manually aligned test set using the standard metrics: precision, recall and F1-score.

    6.3 Alignment Result

    As can be seen from Table 1, the proposed model consistently outperforms its corresponding baseline, whether it is trained from the alignment of the classic HMM or of IBM model 4.

    setting    prec.   recall   F-1
    HMM        0.768   0.786    0.777
    HMM+NN     0.810   0.790    0.798
    IBM4       0.839   0.805    0.822
    IBM4+NN    0.885   0.812    0.847

    Table 1: Word alignment results. The first and third rows show baseline results obtained by the classic HMM and IBM4 models; the second and fourth rows show results of the proposed model trained from HMM and IBM4 respectively.

    It is also clear that the results of our model depend on the quality of the baseline results, which are used as the training data of our model. In the future we would like to explore whether our method can improve other word alignment models.

    We also conduct experiments to see the effect on end-to-end SMT performance. We train hier-


    word | good      | history    | british  | served     | labs         | zetian    | laggards
    LM   | bad       | tradition  | russian  | worked     | networks     | hongzhang | underperformers
         | great     | culture    | japanese | lived      | technologies | yaobang   | transferees
         | strong    | practice   | dutch    | offered    | innovations  | keming    | megabanks
         | true      | style      | german   | delivered  | systems      | xingzhi   | mutuals
         | easy      | literature | canadian | produced   | industries   | ruihua    | non-starters
    WA   | nice      | historical | uk       | offering   | lab          | hongzhang | underperformers
         | great     | historic   | britain  | serving    | laboratories | qichao    | illiterates
         | best      | developed  | english  | serve      | laboratory   | xueqin    | transferees
         | pretty    | record     | classic  | delivering | exam         | fuhuan    | matriculants
         | excellent | recording  | england  | worked     | experiments  | bingkun   | megabanks

    Table 2: Nearest neighbors of several words according to their embedding distance. LM shows neighbors of word embeddings trained by the monolingual language model method; WA shows neighbors of word embeddings trained by our word alignment model.

    improvement. Due to time constraints, we have not tuned hyper-parameters such as the lengths of the hidden layers in the 1- and 3-hidden-layer settings, nor have we tested settings with more hidden layers. It would be wise to test more settings to verify whether more layers would help.

    6.4.4 Word Embedding

    Following (Collobert et al., 2011), we show some words together with their nearest neighbors, using the Euclidean distance between their embeddings. As we can see from Table 2, after bilingual training, "bad" is no longer in the nearest neighborhood of "good", as they hold opposite semantic meanings; the nearest neighbor of "history" has changed to its related adjective "historical". Neighbors of proper nouns such as person names are relatively unchanged; for example, the neighbors of the word "zetian" are all Chinese names in both settings. As the Chinese language lacks morphology, the singular and plural forms of an English noun often correspond to the same Chinese word, so it is desirable that the two English words have similar word embeddings. While this is true for relatively frequent nouns such as "lab" and "labs", rarer nouns still remain near their monolingual embeddings, as they are only modified a few times during bilingual training. As shown in the last column, the neighborhood of "laggards" still consists of other plural forms even after bilingual training.
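
    The neighbor lists in Table 2 can be reproduced from any embedding matrix with a few lines of code; the sketch below (illustrative, with an L × |V| matrix as in Section 3) ranks vocabulary words by Euclidean distance to a query word.

        import numpy as np

        def nearest_neighbors(W_V, vocab, word, k=5):
            # Columns of W_V are word embeddings; vocab is the matching word list.
            idx = vocab.index(word)
            dist = np.linalg.norm(W_V - W_V[:, idx:idx + 1], axis=0)
            order = np.argsort(dist)
            return [vocab[j] for j in order if j != idx][:k]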

    7 Conclusion

    In this paper, we explore applying deep neural networks to the word alignment task. Our model integrates a multi-layer neural network into an HMM-like framework, where a context dependent lexical translation score is computed by the neural network, and distortion is modeled by a simple jump-distance scheme. Our model is discriminatively trained on a bilingual corpus, while huge monolingual data is used to pre-train the word embeddings. Experiments on a large-scale Chinese-to-English task show that the proposed method produces better word alignment results compared with both the classic HMM model and IBM model 4.

    For future work, we will investigate more settings of different hyper-parameters in our model. Secondly, we want to explore the possibility of unsupervised training of our neural word alignment model, without reliance on the alignment results of other models. Furthermore, our current model uses rather simple distortions; it might be helpful to use a more sophisticated model such as ITG (Wu, 1997), which can be modeled by Recursive Neural Networks (Socher et al., 2011).

    Acknowledgments

    We thank the anonymous reviewers for their insightful comments. We also thank Dongdong Zhang, Lei Cui, Chunyang Wu and Zhenyan He for fruitful discussions.

    References

    Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. Innovations in Machine Learning, pages 137-186.

    Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153.

    Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

    Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, March.

    J. S. Bridle. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures and Applications.

    Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

    David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201-228.

    Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493-2537.

    George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.

    John DeNero and Klaus Macherey. 2011. Model-based aligner combination using dual decomposition. In Proc. ACL.

    Chris Dyer, Jonathan Clark, Alon Lavie, and Noah A. Smith. 2011. Unsupervised word alignment with arbitrary features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 409-419. Association for Computational Linguistics.

    Aria Haghighi, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 923-931. Association for Computational Linguistics.

    Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554.

    Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michaël Mathieu, and Yann LeCun. 2010. Learning convolutional feature hierarchies for visual recognition. Advances in Neural Information Processing Systems, pages 1090-1098.

    Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

    Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114.

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.

    Yann LeCun. 1985. A learning scheme for asymmetric threshold networks. Proceedings of Cognitiva, 85:599-604.

    Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. 2007. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19:801.

    Shujie Liu, Chi-Ho Li, and Ming Zhou. 2010. Discriminative pruning for discriminative ITG alignment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL, volume 10, pages 316-324.

    Marc'Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. 2007. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20:1185-1192.

    Robert C. Moore. 2005. A discriminative framework for bilingual word alignment. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 81-88. Association for Computational Linguistics.

    Jan Niehues and Alex Waibel. 2012. Continuous space language models using restricted Boltzmann machines. In Proceedings of the Ninth International Workshop on Spoken Language Translation (IWSLT).

    Franz Josef Och and Hermann Ney. 2000. Giza++: Training of statistical translation models.

    Frank Seide, Gang Li, and Dong Yu. 2011. Conversational speech transcription using context-dependent deep neural networks. In Proc. Interspeech, pages 437-440.

    Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 354-362. Association for Computational Linguistics.

    Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML), volume 2, page 7.

    Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201-1211. Association for Computational Linguistics.

    Le Hai Son, Alexandre Allauzen, and François Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39-48. Association for Computational Linguistics.

    Ivan Titov, Alexandre Klementiev, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words.

    Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. Urbana, 51:61801.

    Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, pages 836-841. Association for Computational Linguistics.

    Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403.
