TRANSCRIPT
CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 18– Alignment in SMT and Tutorial on Giza++ and Moses)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
15th Feb, 2011
Going forward from word alignment
Word alignment → Phrase alignment (going to bigger units of correspondence) → Decoding (best possible translation)
Abstract Problem
Given: e0 e1 e2 e3 … en en+1 (Entities)
Goal: l0 l1 l2 l3 … ln ln+1 (Labels)
The goal is to find the best possible label sequence.
Generative Model
L* = argmax_L P(L|E)
argmax_L P(L|E) = argmax_L P(L) · P(E|L)   (by Bayes' rule; P(E) is constant with respect to L)
Simplification

Using the Markov assumption, the language model can be represented using bigrams:

P(L) = ∏_{i=0}^{n} P(L_i | L_{i-1})

Similarly, the translation model can be represented as:

P(E|L) = ∏_{i=0}^{n} P(e_i | l_i)
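The bigram decomposition above can be sketched in code: the score of a candidate label sequence is the product of transition probabilities P(l_i | l_{i-1}) and observation probabilities P(e_i | l_i). The probability tables below are invented toy values, not the lecture's numbers.

```python
# Sketch (toy values): scoring one candidate label sequence L for an
# entity sequence E under P(L) * P(E|L) with the bigram decomposition.
import math

def score(labels, entities, bigram_p, lex_p):
    """Return log P(L) + log P(E|L) for one (L, E) pair."""
    logp = 0.0
    prev = "^"  # sentence-initial marker
    for l, e in zip(labels, entities):
        logp += math.log(bigram_p[(prev, l)])  # transition P(l_i | l_{i-1})
        logp += math.log(lex_p[(e, l)])        # observation P(e_i | l_i)
        prev = l
    return logp

# Invented toy tables: two labels (A, B), two entities (x, y).
bigram_p = {("^", "A"): 0.7, ("^", "B"): 0.3,
            ("A", "A"): 0.4, ("A", "B"): 0.6,
            ("B", "A"): 0.5, ("B", "B"): 0.5}
lex_p = {("x", "A"): 0.9, ("x", "B"): 0.2,
         ("y", "A"): 0.1, ("y", "B"): 0.8}

print(score(["A", "B"], ["x", "y"], bigram_p, lex_p))
```

Working in log space avoids underflow when the products run over long sentences.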
Statistical Machine Translation
Finding the best possible English sentence given the foreign sentence
E* = argmax_E P(E|F) = argmax_E P(E) · P(F|E)
P(E) = language model; P(F|E) = translation model; E: English, F: foreign language
Problems in the framework

Labels are words of the target language, so they are very large in number.

Who do you want to_go with ?
With whom do you want to go ?
आप किस के_साथ जाना चाहते_हो (Aap kis ke_sath jaana chahate_ho)

Candidate labels at each position: who, do, you, want, to_go, with, and so on.

Each word has multiple translation options.

Preposition stranding ("with" left at the end of the English sentence) is a further difficulty.
Build a column of candidate target-language words over each source-language word:

^ Aap kis ke_sath jaana chahate_ho .

Over each source word stands a column of candidates: who, do, you, want, to_go, with, … and so on.

Find the best possible path from '^' to '.' using transition and observation probabilities.
Viterbi can be used.
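A minimal Viterbi sketch for this trellis: states are candidate target-language words, observations are the source-language words. The transition and emission numbers below are invented for illustration, not taken from the lecture.

```python
# Sketch (toy values): Viterbi search over the word-translation trellis.
def viterbi(obs, states, trans, emit):
    """Most probable state path; '^' is the start marker."""
    # best[s] = (prob of best path ending in s, that path)
    best = {s: (trans[("^", s)] * emit[(obs[0], s)], [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s in states:
            # Pick the best predecessor r for state s at this step.
            p, path = max((best[r][0] * trans[(r, s)], best[r][1])
                          for r in states)
            new[s] = (p * emit[(o, s)], path + [s])
        best = new
    return max(best.values())[1]

# Invented toy tables: two candidate English words over two Hindi words.
trans = {("^", "who"): 0.6, ("^", "with"): 0.4,
         ("who", "who"): 0.2, ("who", "with"): 0.8,
         ("with", "who"): 0.5, ("with", "with"): 0.5}
emit = {("kis", "who"): 0.9, ("kis", "with"): 0.1,
        ("ke_sath", "who"): 0.2, ("ke_sath", "with"): 0.8}

print(viterbi(["kis", "ke_sath"], ["who", "with"], trans, emit))
# ['who', 'with']
```

The dynamic program keeps only the best path into each state, so the search is linear in sentence length rather than exponential.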
TUTORIAL ON Giza++ and Moses tools (delivered by Kushal Ladha)
Word-based alignment
For each word in the source language, align the words from the target language that this word possibly produces.

Based on IBM models 1–5. Model 1 is the simplest; as we go from model 1 to model 5, the models get more complex but more realistic. This is all that Giza++ does.
Alignment
A function from target position to source position:
The alignment sequence is: 2,3,4,5,6,6,6, i.e. alignment function A with A(1) = 2, A(2) = 3, …
A different alignment function will give the sequence 1,2,1,2,3,4,3,4 for A(1), A(2), …
To allow spurious insertion, allow alignment with word 0 (NULL).
Number of possible alignments: (I+1)^J
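As a small sketch (the dictionary representation is our own choice, not the lecture's), an alignment is just a function from target position j to source position a(j), with position 0 reserved for NULL, and the number of such functions is (I+1)^J:

```python
# Sketch: an alignment as a function from target position j to source
# position a(j); position 0 is reserved for NULL.
from itertools import product

def all_alignments(I, J):
    """Every function a: {1..J} -> {0..I}; there are (I+1)**J of them."""
    return list(product(range(I + 1), repeat=J))

# The alignment sequence 2,3,4,5,6,6,6 from the slide, as a dict
# a[j] = source position for target position j.
a = {j: s for j, s in enumerate([2, 3, 4, 5, 6, 6, 6], start=1)}

# For I = 3 source words and J = 2 target words: (3+1)**2 = 16 alignments.
print(len(all_alignments(3, 2)))  # 16
```

Enumerating all alignments is only feasible for tiny sentences; the IBM models avoid it by computing expectations over alignments analytically.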
IBM Model 1: Generative Process
Training Alignment Models
Given a parallel corpus, for each (F, E) pair learn the best alignment A and the component probabilities:
t(f|e) for Model 1
lexicon probability P(f|e) and alignment probability P(a_i | a_{i-1}, I)
How do we compute these probabilities if all we have is a parallel corpus?
Intuition : Interdependence of Probabilities
If you knew which words are probable translations of each other, then you could guess which alignments are probable and which are improbable.

If you were given alignments with probabilities, then you could compute the translation probabilities.

This looks like a chicken-and-egg problem.

The EM algorithm comes to the rescue.
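The EM loop for Model 1 can be sketched in a few lines. This is the textbook Model 1 estimator run on an invented two-sentence German–English toy corpus, not Giza++'s actual implementation:

```python
# Sketch: EM for IBM Model 1 on a toy parallel corpus (invented data).
from collections import defaultdict

corpus = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split())]

# Initialise t(f|e) uniformly.
f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(10):
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)      # normaliser over alignments
            for e in es:
                c = t[(f, e)] / z               # E-step: expected count
                count[(f, e)] += c
                total[e] += c
    for (f, e) in t:                            # M-step: re-estimate t(f|e)
        t[(f, e)] = count[(f, e)] / total[e] if total[e] else 0.0

print(round(t[("haus", "house")], 3))
```

Because "das" co-occurs with "the" in both sentence pairs, EM gradually resolves the ambiguity and pushes t(haus|house) and t(das|the) towards 1, illustrating how alignments and translation probabilities bootstrap each other.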
Limitation: Only 1->Many Alignments allowed
Phrase-based alignment
More natural
Many-to-one mappings allowed
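One way to picture the difference (the span representation and the example indices are our own, not the lecture's): a phrase alignment pairs source spans with target spans, so several source words can map onto one target word, which the 1→many word-based models cannot express.

```python
# Sketch (invented example): a phrase alignment as pairs of inclusive
# (start, end) spans; the first pair maps two source words to one
# target word, a many-to-one correspondence.
phrase_alignment = [
    ((0, 1), (0, 0)),  # source words 0-1 <-> target word 0
    ((2, 3), (1, 2)),  # source words 2-3 <-> target words 1-2
]

for (s_lo, s_hi), (t_lo, t_hi) in phrase_alignment:
    print(f"source[{s_lo}:{s_hi + 1}] <-> target[{t_lo}:{t_hi + 1}]")
```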
Giza++ and Moses Package
http://cl.naist.jp/~eric-n/ubuntu-nlp/
Select your Ubuntu version, browse the nlp folder, and download the Debian packages of giza++, moses, mkcls, and srilm.
Resolve all the dependencies and they get installed.
For an alternate installation, refer to http://www.statmt.org/moses_steps.html
Steps
Input: sentence-aligned parallel corpus
Output: target-side tagged data
Training, tuning, then generate output on the test corpus (decoding)
Training

Create a folder named corpus containing the test, train, and tuning files. Giza++ is used to generate the alignment; the phrase table is generated after training. Before training, a language model needs to be built on the target side:

mkdir lm
/usr/bin/ngram-count -order 3 -interpolate -kndiscount -text $PWD/corpus/train_surface.hi -lm lm/train.lm
/usr/share/moses/scripts/training/train-factored-phrase-model.perl -scripts-root-dir /usr/share/moses/scripts -root-dir . -corpus train.clean -e hi -f en -lm 0:3:$PWD/lm/train.lm:0
Example
train.en:
h e l l o
h e l l o
w o r l d
c o m p o u n d w o r d
h y p h e n a t e d
o n e
b o o m
k w e e z l e b o t t e r

train.prh:
hh eh l ow
hh ah l ow
w er l d
k aa m p aw n d w er d
hh ay f ah n ey t ih d
ow eh n iy
b uw m
k w iy z l ah b aa t ah r
Sample from Phrase-table
b o ||| b aa ||| (0) (1) ||| (0) (1) ||| 1 0.666667 1 0.181818 2.718
b ||| b ||| (0) ||| (0) ||| 1 1 1 1 2.718
c o m p o ||| aa m p ||| (2) (0,1) (1) (0) (1) ||| (1,3) (1,2,4) (0) ||| 1 0.0486111 1 0.154959 2.718
c ||| p ||| (0) ||| (0) ||| 1 1 1 1 2.718
d w ||| d w ||| (0) (1) ||| (0) (1) ||| 1 0.75 1 1 2.718
d ||| d ||| (0) ||| (0) ||| 1 1 1 1 2.718
e b ||| ah b ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
e l l ||| ah l ||| (0) (1) (1) ||| (0) (1,2) ||| 1 1 0.5 0.5 2.718
e l l ||| eh l ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.111111 0.5 0.111111 2.718
e l ||| eh ||| (0) (0) ||| (0,1) ||| 1 0.111111 1 0.133333 2.718
e ||| ah ||| (0) ||| (0) ||| 1 1 0.666667 0.6 2.718
h e ||| hh ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.6 2.718
h ||| hh ||| (0) ||| (0) ||| 1 1 1 1 2.718
l e b ||| l ah b ||| (0) (1) (2) ||| (0) (1) (2) ||| 1 1 1 0.5 2.718
l e ||| l ah ||| (0) (1) ||| (0) (1) ||| 1 1 1 0.5 2.718
l l o ||| l ow ||| (0) (0) (1) ||| (0,1) (2) ||| 0.5 1 1 0.227273 2.718
l l ||| l ||| (0) (0) ||| (0,1) ||| 0.25 1 1 0.833333 2.718
l o ||| l ow ||| (0) (1) ||| (0) (1) ||| 0.5 1 1 0.227273 2.718
l ||| l ||| (0) ||| (0) ||| 0.75 1 1 0.833333 2.718
m ||| m ||| (0) ||| (0) ||| 1 0.5 1 1 2.718
n d ||| n d ||| (0) (1) ||| (0) (1) ||| 1 1 1 1 2.718
n e ||| eh n iy ||| (1) (2) ||| () (0) (1) ||| 1 1 0.5 0.3 2.718
n e ||| n iy ||| (0) (1) ||| (0) (1) ||| 1 1 0.5 0.3 2.718
n ||| eh n ||| (1) ||| () (0) ||| 1 1 0.25 1 2.718
o o m ||| uw m ||| (0) (0) (1) ||| (0,1) (2) ||| 1 0.5 1 0.181818 2.718
o o ||| uw ||| (0) (0) ||| (0,1) ||| 1 1 1 0.181818 2.718
o ||| aa ||| (0) ||| (0) ||| 1 0.666667 0.2 0.181818 2.718
o ||| ow eh ||| (0) ||| (0) () ||| 1 1 0.2 0.272727 2.718
o ||| ow ||| (0) ||| (0) ||| 1 1 0.6 0.272727 2.718
w o r ||| w er ||| (0) (1) (1) ||| (0) (1,2) ||| 1 0.1875 1 0.424242 2.718
w ||| w ||| (0) ||| (0) ||| 1 0.75 1 1 2.718
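Each phrase-table line has five `|||`-separated fields: source phrase, target phrase, two alignment fields, and the scores (in Moses of this vintage, typically two phrase translation probabilities, two lexical weights, and the phrase penalty, e ≈ 2.718). A small parsing sketch using the first line above:

```python
# Sketch: splitting one Moses phrase-table line into its fields.
line = ("b o ||| b aa ||| (0) (1) ||| (0) (1) ||| "
        "1 0.666667 1 0.181818 2.718")

src, tgt, align_s2t, align_t2s, score_field = \
    [f.strip() for f in line.split("|||")]
scores = [float(x) for x in score_field.split()]

print(src, "->", tgt, scores)
```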
Tuning
Not a compulsory step, but it will improve the decoding by a small percentage.

mkdir tuning
cp $WDIR/corpus/tun.en tuning/input
cp $WDIR/corpus/tun.hi tuning/reference
/usr/share/moses/scripts/training/mert-moses.pl $PWD/tuning/input $PWD/tuning/reference /usr/bin/moses $PWD/model/moses.ini --working-dir $PWD/tuning --rootdir /usr/share/moses/scripts

It will take around 1 hour on a server with 32GB RAM.
Testing
mkdir evaluation
/usr/bin/moses -config $WDIR/tuning/moses.ini -input-file $WDIR/corpus/test.en > evaluation/test.output

The output will be in the evaluation/test.output file.

Sample output (input → output):
h o t → hh aa t
p h o n e → p|UNK hh ow eh n iy
b o o k → b uw k