
The State of the Art in Phrase-Based Statistical Machine Translation (SMT)

Roland Kuhn, George Foster, Nicola Ueffing

February 2007

Tutorial Plan

A. Overview

B. Details & research topics

NOTE: the best overall reference for SMT hasn’t been published yet – Philipp Koehn’s « Statistical Machine Translation » (to be published by Cambridge University Press). Some of the material presented here is from a draft of that book.

Tutorial Plan

A. Overview
• The MT Task & Approaches to it
• Examples of SMT output
• SMT Research: Culture, Evaluations, & Metrics
• SMT History: IBM Models
• Phrase-based SMT
• Phrase-Based Search
• Loglinear Model Combination
• Target Language Model P(T)
• Flaws of Phrase-based, Loglinear Systems
• PORTAGE: a Typical SMT System

The MT Task & Approaches to it

• Core MT task: translate a sentence from a source language S to target language T

• Conventional expert system approach: hire experts to write rules for translating S to T

• Statistical approach: using a bilingual text corpus (lots of S sentences & their translations into T), train a statistical translation model that will map each new S sentence into a T sentence


[Figure: expert vs. statistical MT for the source sentence S: Mais où sont les neiges d’antan?]

Expert System: experts + manually coded rules (If « … » then … Else …)
Expert system output:
T: But where are the snows of yesteryear?

Statistical System: bilingual parallel corpus (S, T) + machine learning
Statistical rules: P(but | mais) = 0.7; P(however | mais) = 0.3; P(where | où) = 1.0; …
Statistical system output:
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
T3: Hey - where did the old snow go? P = 0.18
…

“Expert” vs. “Statistical” systems
• Expert systems incorporate deep linguistic knowledge
• They still yield top performance for well-studied language pairs in non-specialized domains
• Computationally cheap (compared to statistical MT)
BUT -
• Brittle
• Expensive to maintain (messy software engineering)
• Expensive to port to new semantic domains or new language pairs
• Typically yield only one T sentence for each S sentence


“Expert” vs. “Statistical” systems
• More E-text, better algorithms, stronger machines → quality of SMT output approaching that of expert systems
• Statistical approach has beaten expert systems in related areas - e.g., automatic speech recognition
• SMT is robust (does well on frequent phenomena)
• Easy to maintain
• Easily ported to new semantic domains or new language pairs – IF training corpora available
• For each S sentence, yields many T sentences (each with a probabilistic score) – useful for semi-supervised translation


Structure of Typical SMT System

[Figure: pipeline of a typical SMT system]
• Offline training: the bilingual parallel corpus (S, T) trains the Phrase Translation Model; target text, plus optional extra target corpora (extra LM training corpora), trains the Target Language Model. Other knowledge sources may also feed the decoder.
• Preprocessor: S: Mais où sont les neiges d’antan? → mais où sont les neiges d’ antan ?
• Decoder: produces initial N-best hypotheses:
T1: however where are the snows #d’ antan# P = 0.22
T2: but where are the snows #d’ antan# P = 0.21
T3: but where did the #d’ antan# snow go P = 0.13
…
• Reordering and Postprocessor: produce final N-best hypotheses:
T1: But where are the snows of yesteryear? P = 0.41
T2: However, where are yesterday’s snows? P = 0.33
…



Commercial Systems
• Systran, biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring possibility of adding a statistical component to their system.

• Important exception: LanguageWeaver, new company based on SMT (closely linked to researchers at ISI, U. Southern California)

• Google has superb SMT research team – but online, they still mainly use Systran (probably because of computational cost of online SMT). Seem to be gradually swapping in SMT systems for language pairs with lower traffic.


Examples of SMT output

Chinese → English output:

REF: Hong Kong citizens jumped for joy when they knew Beijing's bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid for the 2008 Olympic Games, very happy.

REF: The U.S. delegation includes a China expert from Stanford University, two Senate foreign policy aides and a former State Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising members from the Stanford University, one of the Chinese experts, two of the Senate foreign policy as well as assistant who was responsible for dealing with Pyongyang authorities of the former State Department officials.
PORTAGE Nov. 2006: The US delegation included members from Stanford University and an expert on China, two Senate foreign policy, and one who is responsible for dealing with Pyongyang authorities, a former State Department officials.

REF: Kuwait foreign minister Mohammad Al Sabah and visiting Jordan foreign minister Muasher jointly presided the first meeting of the joint higher committee of the two countries on that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day and visiting Jordan Foreign Secretary maasher co-chaired the section about the two countries mixed Committee at the inaugural meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and visiting Jordanian Foreign Minister of Malaysia, co-chaired by the two countries, the joint commission met for the first time.

REF: The Beagle 2 was scheduled to land on Mars on Christmas Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on Mars, but it signals is still unpredictable.

Examples of SMT output

And a silly English → German example from Google (Jan. 25, 2007):

the hotel has a squash court → das Hotel hat ein Kürbisgericht (think “zucchini tribunal”)

* but this kind of error – perfect syntax, never-seen word combination – isn’t typical of a statistical system, so this was probably a rule-based system

Culture
• SMT research is very engineering-oriented; driven by performance in NIST & other evaluations (see later slides) → if a heuristic yields a big improvement in BLEU scores & a wonderful new theoretical approach doesn’t, expect the former to get much more attention than the latter
• Advantages of SMT culture: open-minded to new ideas that can be tested quickly; researchers who count have working systems with reasonably well-written software (so they can participate in evaluations)
• Disadvantages of SMT culture: closed-minded to ideas not tested in a working system → if you have a brilliant theory that doesn’t show a BLEU score improvement in a reasonable baseline system, don’t expect SMT researchers to read your paper!

SMT Research: Culture, Evaluations, & Metrics

• Since 2001, US National Institute of Standards & Technology (NIST) has been evaluating MT systems

• Participants include MIT, IBM, CMU, RWTH, Hong Kong UST, ATR, IRST, others … and NRC: NRC’s system is called PORTAGE (in NIST evaluations 2005 & 2006).
• Main NIST language pairs: Chinese→English, Arabic→English
• Semantic domains: news stories & multigenre
• Training corpora released each fall, test corpus each spring; participants have 1 working week to submit target sentences
• NIST evaluates systems comparatively. In 2005 (http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html) & 2006 (http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html), statistical systems beat expert systems according to BLEU metric

The NIST MT Evaluations



Other MT Evaluations
• WPT/WMT usually organized each spring by Philipp Koehn & Christoph Monz – smaller training corpora than NIST, European language pairs. In 2006, evaluated on French <-> English, German <-> English, Spanish <-> English. http://www.statmt.org/wmt06/proceedings/

• TC-STAR Evaluation for spoken language translation. In 2006, evaluated on Chinese->English (one direction only) and Spanish <->English http://www.elda.org/tcstar-workshop/2006eval.htm

• IWSLT Evaluation for spoken language translation. In 2006, evaluated on Arabic->English, Chinese->English, Italian->English, Japanese->English http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html



GALE Project
• Huge DARPA-sponsored project: $50 million per year for 5 years. Three consortia: BBN-led « Agile », IBM-led « Rosetta », SRI-led « Nightingale ».
• NRC team is in MT working group of Nightingale.

[Figure: GALE pipeline. (Arabic or Chinese) speech → automatic speech recognition (ASR) → (Arabic or Chinese) transcriptions. Transcriptions and (Arabic or Chinese) documents → machine translation (MT) → English text → distillation, with an IR/database component.]


What is BLEU?
• Human evaluation of automatic translation quality hard & expensive. BLEU metric (invented at IBM) compares MT output with human-generated reference translations via N-gram matches.
• N-gram precision = #(N-grams in MT output seen in ref.) / #(N-grams in MT output)
• Example (from P. Koehn):

REF = Israeli officials are responsible for airport security
Sys A = Israeli officials responsibility of airport safety
Sys B = airport security Israeli officials are responsible

[Figure annotations: Sys A has a 2-gram match (Israeli officials) and a 1-gram match (airport); Sys B has a 2-gram match (airport security) and a 4-gram match (Israeli officials are responsible).]

What is BLEU?

• REF = Israeli officials are responsible for airport security

Sys A = Israeli officials responsibility of airport safety

Sys B = airport security Israeli officials are responsible

• Sys A: 1-gram precision = 3/6 (Israeli, officials, airport); 2-gram precision = 1/5 (Israeli officials); 3-gram precision = 0/4; 4-gram precision = 0/3.
• Sys B: 1-gram precision = 6/6; 2-gram precision = 4/5; 3-gram precision = 2/4; 4-gram precision = 1/3.
• BLEU-N multiplies together the N N-gram precisions – the higher the value, the better the translation. But a system could cheat by having very few words in its MT output, hence the brevity penalty.
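The precision counts in this example can be checked mechanically. Below is a minimal Python sketch, assuming whitespace tokenization and a single reference; the function names `ngrams` and `ngram_precision` are illustrative and not taken from any BLEU toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(output, reference, n):
    """Return (matches, total) for n-gram precision against one reference.

    Each output n-gram is credited at most as often as it occurs in the reference.
    """
    out_counts = Counter(ngrams(output.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in out_counts.items())
    return matches, sum(out_counts.values())

REF = "Israeli officials are responsible for airport security"
SYS_A = "Israeli officials responsibility of airport safety"
SYS_B = "airport security Israeli officials are responsible"

precisions_a = [ngram_precision(SYS_A, REF, n) for n in range(1, 5)]
precisions_b = [ngram_precision(SYS_B, REF, n) for n in range(1, 5)]
print(precisions_a)
print(precisions_b)
```

Each pair is (matched N-grams, N-grams in the output) for N = 1, …, 4.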


What is BLEU?

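$$\text{BLEU-}N = (\text{brevity-penalty}) \cdot \prod_{i=1}^{N} (\text{precision}_i)^{\lambda_i}, \quad \text{where} \quad \text{brevity-penalty} = \min\left(1, \frac{\text{output-length}}{\text{ref-length}}\right).$$

Usually, we set $N = 4$ and all $\lambda_i = 1$, so we have

$$\text{BLEU-4} = \min\left(1, \frac{\text{output-length}}{\text{ref-length}}\right) \cdot \prod_{i=1}^{4} \text{precision}_i.$$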
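As a worked instance of the BLEU-4 formula on this slide (brevity penalty times the product of the four precisions), here is a small sketch using exact fractions. The helper name `bleu` is hypothetical, and the default weights of 1 follow the slide rather than the geometric-mean weighting found in some BLEU implementations.

```python
from fractions import Fraction
from math import prod

def bleu(precisions, output_length, ref_length, weights=None):
    """BLEU-N as defined on the slide: brevity penalty times weighted product of precisions."""
    if weights is None:
        weights = [1] * len(precisions)  # all exponents set to 1, as on the slide
    brevity_penalty = min(Fraction(1), Fraction(output_length, ref_length))
    return brevity_penalty * prod(p ** w for p, w in zip(precisions, weights))

# Precisions from the Israeli-officials example; the reference has 7 words, each output 6.
sys_a = [Fraction(3, 6), Fraction(1, 5), Fraction(0, 4), Fraction(0, 3)]
sys_b = [Fraction(6, 6), Fraction(4, 5), Fraction(2, 4), Fraction(1, 3)]

bleu_a = bleu(sys_a, output_length=6, ref_length=7)
bleu_b = bleu(sys_b, output_length=6, ref_length=7)
print(bleu_a, float(bleu_b))
```

Sys A scores zero because it has no 3-gram or 4-gram match, which is why BLEU is normally computed over a whole test set rather than one sentence.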

• If any MT output has no N-grams matching ref., for some N=1, …, 4, BLEU-4 is zero. So, normally compute BLEU over whole test set of at least a hundred or so sentences.

• Multiple references: if an N-gram has K occurrences in output, look for single ref. that has K or more copies of that N-gram. If find such a single ref., that N-gram has matched K times. If not, look for a ref. that has the highest # of copies (L) of that N-gram; use L in precision calculation. Ref-length = closest length.


Does BLEU correlate with human judgment?

[Figure: quality score (0 = terrible, 3 = excellent) plotted against translator identity]

* BLEU kind of correlates with human judgment; works best with multiple references.

Why BLEU Is Controversial
• If a system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score.
• Proponents of the expert system approach argue that BLEU is biased against this approach, & favours SMT
• Partial confirmation:
1. in NIST 2006 Arabic-to-English evaluation, AppTek hybrid system (rule-based + SMT system) did best according to human evaluators, but not according to BLEU.
2. in 2006 WMT evaluation Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).



Other Automatic Metrics
• SMT systems need an automatic metric for tuning (must try out thousands of variants). Automatic metrics compare MT output with human-generated reference translations.
• Rivals of BLEU:
* translation edit rate (TER) – how many edit ops to match references? http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
* METEOR – compares MT output with references in way that’s less dependent on word choice (via stemming, WordNet, etc.). Gaining credibility: correlates better than BLEU with human scores. However, METEOR only defined for translation into English. http://www.cs.cmu.edu/~alavie/METEOR/





Manual Metrics
• Human evaluation of SMT preferable to automatic evaluation, but much slower & more expensive. Can’t use for system tuning.
• Ask humans to rank systems by adequacy and fluency. Adequacy: does MT output convey same meaning as source? Fluency: does MT output look like normal target-language text? (Good syntax & idiom).
• Metrics based on human postediting of MT output. E.g., HTER.
• Metrics based on human understanding of MT output. Related to adequacy, but less subjective. E.g., Lincoln Labs metric: give English output of Arabic MT system to unilingual English analyst, then test him with standard « Defense Language Proficiency Test » (see Jones05).


Who Uses Which Metric When?
• Many groups use BLEU for automatic system tuning
• NIST, WPT/WMT, TC-STAR, & other evaluations often have BLEU as official metric, with some human reality checks. Koehn & Monz WPT/WMT: participants do human fluency/adequacy evaluations - nice analyses!
• Many « expert/rule-based MT » researchers hate BLEU (can become excuse not to evaluate system competitively)
• In theory, manual metrics should be related to MT task: e.g., adequacy for browsing/gisting, Lincoln Labs metric for intelligence community, HTER if MT output will be post-edited. So why is HTER GALE’s official metric?
• HTER = Human Translation Edit Rate: MT output hand-edited by humans; measure # of operations performed.


• In the late 1980s, members of IBM’s speech recognition group applied statistical learning techniques to bilingual corpora. These American researchers worked mainly with the Canadian Hansard – bilingual transcription of parliamentary proceedings.

• These researchers quit IBM around 1991 for a hedge fund, Renaissance Technologies – they are now very rich!

• Renewed interest in their work sparked the revival of research into statistical learning for MT that occurred from late 1990s onward. Newer « phrase-based » approach still partially relies on IBM models.

• The IBM approach used Bayes’s Theorem to define the « Fundamental Equation » of MT (Brown et al. 1993)

SMT History: IBM Models

The best-fit translation of a source-language (French) sentence S into a target-language (English) sentence T is:

$$\hat{T} = \arg\max_{T} \left[ P(T) \cdot P(S \mid T) \right]$$

Here the argmax is the search task, P(T) is the language model, and P(S|T) is the word translation model.

Fundamental Equation of MT

• Job of language model: ensure well-formed target-language T
• Job of translation model: ensure T could have generated S
• Search task: find T maximizing product P(T)*P(S|T)
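To make the division of labour concrete, here is a toy sketch of the search task over a fixed candidate list. All candidate sentences and probability values are invented for illustration; a real system searches a vastly larger space and gets P(T) and P(S|T) from trained models.

```python
# Hypothetical scores for three candidate translations of one source sentence.
# "lm" stands for P(T) and "tm" for P(S|T); the numbers are made up.
candidates = {
    "but where are the snows of yesteryear": {"lm": 0.020, "tm": 0.30},
    "however where are yesterday's snows": {"lm": 0.015, "tm": 0.25},
    "but where are the snow of formerly": {"lm": 0.001, "tm": 0.40},
}

def best_translation(scored):
    """Return the candidate T maximizing P(T) * P(S|T)."""
    return max(scored, key=lambda t: scored[t]["lm"] * scored[t]["tm"])

best = best_translation(candidates)
print(best)
```

The third candidate has the highest translation-model score but loses because the language model considers it ill-formed.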


• The IBM researchers defined five statistical translation models (numbered in order of complexity)

• Each defines a mechanism for generation of text in one language (e.g., French or foreign = F) from another (e.g., English = E)
• Most general many-to-many case is not covered by IBM models; in this forbidden case, a group of E words generates a group of F words, e.g.:

The poor don’t have any money

Les pauvres sont démunis


• The IBM models only allow one-to-many generation, e.g.:

And the program has been implemented

Le programme a été mis en application Ø

• IBM models 1 & 2 – all lengths for F sentence equally likely
• Model 1 is « bag of words » - word order in F & E doesn’t matter
• In model 2, chance that an E word generates given F word(s) depends on position
• IBM models 3, 4, & 5 are fertility-based
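The « bag of words » property of model 1 can be seen directly from its likelihood, P(F|E) = ε/(L+1)^M · Π_j Σ_i t(f_j|e_i), where the sum over i includes a NULL word. The sketch below assumes that standard form; the translation table `t` and the value of ε are toy values invented for illustration.

```python
from math import isclose

def model1_likelihood(f_words, e_words, t, epsilon=1.0):
    """P(F | E) under IBM model 1, with a NULL word added to the E sentence."""
    e_with_null = ["NULL"] + list(e_words)
    prob = epsilon / len(e_with_null) ** len(f_words)
    for f in f_words:
        # Every F word may be generated by any E word, regardless of position.
        prob *= sum(t.get((f, e), 0.0) for e in e_with_null)
    return prob

# Hypothetical t(f | e) entries, keyed as (f, e).
t = {
    ("le", "the"): 0.5, ("programme", "program"): 0.8,
    ("le", "NULL"): 0.1, ("programme", "the"): 0.05,
}

p_original = model1_likelihood(["le", "programme"], ["the", "program"], t)
p_shuffled = model1_likelihood(["programme", "le"], ["program", "the"], t)
print(p_original, p_shuffled)
```

Reordering either sentence leaves the likelihood unchanged, which is exactly what model 2 corrects by making the link probability depend on position.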


IBM model 1: « bag of words »
[Figure: any of e1 … eL may generate any of f1 … fM; the link is drawn with uniform probability]

IBM model 2: « position-dependent bag of words »
[Figure: any of e1 … eL may generate any of f1 … fM; the link is drawn with position-dependent probability P(1→1), …, P(1→M), P(2→1), …, P(2→M), …, P(L→1), …, P(L→M)]


Parameters:
• φ(ei) = fertility of ei = prob. that ei will produce 0, 1, 2, … words in F
• t(f|ei) = probability that ei can generate f
• Π(j | i, k) = distortion prob. = prob. that kth word generated by ei ends up in pos. j of F

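IBM model 3
[Figure: each of e1 … eL first draws a fertility φ(e1), φ(e2), …, φ(eL) (example values 2, 0, and 1 are shown, along with the null word Ø); the generated words f1 … fM are then placed by the distortion model Π with probabilities P(1→1), P(1→2), …, P(M→M)]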

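IBM model 4
[Figure: each of e1 … eL draws a fertility φ(e1), φ(e2), …, φ(eL) (example values 3, 0, and 1 are shown, along with the null word Ø); words are generated via t, and the distortion model Π places f1, f2, f3, …, fM, with words generated by the same E word treated as a phrase]
NOTE: phrases can be broken up, but with lower prob. than in model 3

IBM model 5: cleaned-up version of model 4 (e.g., two F words can’t be given same position)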


Phrase-based SMT

Four key ideas
• phrase-based models (Och04, Koehn03, Marcu02)
• dynamic programming search algorithms (Koehn04)
• loglinear model combination (Och02)
• error-driven learning (Och03)

Phrase-based approach introduced around 1998 by Franz Josef Och & others (Ney, Wong, Marcu): many-words-to-many-words (improvement on IBM one-to-many)

Example: « cul de sac »
• word-based translation = « ass of bag » (N. Am.), « arse of bag » (British)
• phrase-based translation = « dead end » (N. Am.), « blind alley » (British)

This knowledge is stored in a phrase table: a collection of conditional probabilities of form P(S|T) = backward phrase table or P(T|S) = forward phrase table. Recall Bayes: $\hat{T} = \arg\max_{T} [P(T) \cdot P(S \mid T)]$ → backward table essential, forward table used for heuristics. Tables for French->English:

backward: P(S|T)
p(sac|bag) = 0.9
p(sacoche|bag) = 0.1
…
p(cul de sac|dead end) = 0.7
p(impasse|dead end) = 0.3
…

forward: P(T|S)
p(bag|sac) = 0.5
p(hand bag|sac) = 0.2
…
p(ass|cul) = 0.5
p(dead end|cul de sac) = 0.85
…

Phrase-based SMT

Overall Phrase Pair Extraction Algorithm

1. Run a sentence aligner on a parallel bilingual corpus (won’t go over this)

2. Run word aligner (e.g., one based on IBM models) on each aligned sentence pair – see next slide.

3. From each aligned sentence pair, extract all phrase pairs with no external links - see two slides ahead.

Phrase-based SMT

Symmetrized Word Alignment using IBM Models

Alignments produced by IBM models are asymmetrical: source words have at most one connection, but target words may have many connections.

To improve quality, use symmetrization heuristic (Och00):

1. Perform two separate alignments, one in each translation direction.

2. Take intersection of links as starting point.

3. Add neighbouring links from union until all words are covered.
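The three steps above can be sketched as follows. This is a simplified variant written for illustration, not the exact heuristic of Och00: it grows the intersection with union links that touch an existing link (including diagonally) and that cover a so-far unaligned word. The two directional alignments are hypothetical links for « I want to go home » / « Je veux aller chez moi », given as (English index, French index) pairs.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Start from the intersection and add neighbouring links from the union."""
    union = src_to_tgt | tgt_to_src
    alignment = set(src_to_tgt & tgt_to_src)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for i, j in sorted(union - alignment):
            src_free = all(a != i for a, _ in alignment)
            tgt_free = all(b != j for _, b in alignment)
            adjacent = any((i + di, j + dj) in alignment for di, dj in neighbours)
            # Only add a link that touches the current alignment and covers a new word.
            if adjacent and (src_free or tgt_free):
                alignment.add((i, j))
                added = True
    return alignment

# Hypothetical directional alignments (English index, French index).
e2f = {(0, 0), (1, 1), (3, 2), (4, 3), (4, 4)}   # "home" linked to both "chez" and "moi"
f2e = {(0, 0), (1, 1), (2, 2), (3, 2), (4, 4)}   # "to" and "go" both linked to "aller"

symmetric = symmetrize(e2f, f2e)
print(sorted(symmetric))
```

In this toy case the grown alignment ends up equal to the union, and every word on both sides is covered.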

[Figure: the sentence pair « I want to go home » / « Je veux aller chez moi » aligned in each direction (S = English, T = French; then S = French, T = English), followed by the symmetrized alignment]

Phrase-based SMT

Je l’ ai vu à la télévision

I saw him on television

Extract all phrase pairs with no external links, for example:

Good pairs: (Je, I) (Je l’ ai vu, I saw him) (ai vu, saw) (l’ ai vu à la, saw him on)

Bad pairs: (Je l’ ai vu, I saw) (l’ ai vu à, saw him on) (la télévision, television)

Input: aligned sentence pair

Output: set of consistent phrases
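The « no external links » rule can be sketched in a few lines. The word alignment below is an assumption: the original alignment figure is not reproduced here, so the links were inferred from the good and bad pairs listed on this slide (in particular, « à » and « la » are both taken to link to « on »). The function name and `max_len` parameter are illustrative, and this minimal version does not extend phrases over unaligned words.

```python
def extract_phrase_pairs(src, tgt, links, max_len=7):
    """Return all (source phrase, target phrase) pairs with no external links."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            tgt_positions = [j for (i, j) in links if i1 <= i <= i2]
            if not tgt_positions:
                continue
            j1, j2 = min(tgt_positions), max(tgt_positions)
            # Consistency: no target word in the span may link outside the source span.
            if all(i1 <= i <= i2 for (i, j) in links if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = ["Je", "l'", "ai", "vu", "à", "la", "télévision"]
tgt = ["I", "saw", "him", "on", "television"]
# Assumed links (source index, target index), inferred from the slide's examples.
links = {(0, 0), (1, 2), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)}

pairs = extract_phrase_pairs(src, tgt, links)
for pair in sorted(pairs):
    print(pair)
```

Under this assumed alignment the good pairs from the slide are extracted and the bad pairs are not.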

Phrase-based SMT

« Diag-And » phrase extraction

Phrase-Based Search

Generative process:
1. Split source sentence into “phrases” (N-grams).
2. Translate each source phrase (one-to-one).
3. Permute target phrases to get final translation.
→ much simpler and more intuitive than the IBM process, but the price of this is no provision for gaps, e.g., ne VERB pas

[Figure: (1) « Je l’ ai vu à la télévision » is split into phrases; (2) the phrases are translated as « I », « him », « saw », « on television »; (3) the target phrases are permuted into « I saw him on television »]

*** NOTE: XRCE’s Matrax does handle gaps

[Figure: phrase-based search]
• Source: s1 s2 s3 s4 s5 s6 s7 s8 s9
• Segmentation: the backward table P(S|T), with entries such as p(s2 s3 | t8), p(s2 s3 | t5 t3), …, p(s3 s4 | t4 t9), …, serves as phrase table: 1. suggests possible segments; 2. supplies phrase translation scores
• Order: target hypotheses grow left->right, from source segments consumed in any order. Picking s2 s3 first gives Tgt hyp: t8| … or Tgt hyp: t5 t3| …; picking s3 s4 first gives Tgt hyp: t4 t9| …; then picking s5 s6 s7 extends the first hypothesis to Tgt hyp: t8| t6 t2| …
• Language model P(T): scores growing target hypotheses left -> right

Phrase-Based Search

Loglinear Model Combination

Previous slides show basic system that ranks hypotheses by P(S|T)*P(T). Now let’s introduce an alignment/reordering variable A (aligns T & S phrases). We want

$$\hat{T} = \arg\max_{T} P(T \mid S) \approx \arg\max_{T, A} P(T, A \mid S) = \arg\max_{T, A} f_1(T,A,S)^{\lambda_1} \cdot f_2(T,A,S)^{\lambda_2} \cdots f_M(T,A,S)^{\lambda_M} = \arg\max_{T, A} \exp\Big(\sum_{i} \lambda_i \log f_i(T,A,S)\Big).$$

The $f_i$ now typically include not only functions related to P(S|T) and language model P(T), but also to A « distortion », P(T|S), length(T), etc. The $\lambda_i$ serve as reliability weights. This change in score computation doesn’t fundamentally change the search algorithm.
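A small sketch of loglinear scoring may help here. The feature names, feature values, and weights are all invented for illustration; the point is only that ranking by the weighted sum of log features gives the same winner as ranking by the product of features raised to their weights.

```python
import math

def loglinear_score(features, weights):
    """Sum of lambda_i * log f_i over the feature functions of one hypothesis."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

def product_score(features, weights):
    """Equivalent product form: prod of f_i ** lambda_i."""
    return math.prod(value ** weights[name] for name, value in features.items())

# Hypothetical weights and feature values for three hypotheses.
weights = {"backward_tm": 1.0, "lm": 0.8, "distortion": 0.3, "forward_tm": 0.5}
hypotheses = {
    "H1": {"backward_tm": 0.30, "lm": 0.020, "distortion": 0.9, "forward_tm": 0.40},
    "H2": {"backward_tm": 0.25, "lm": 0.015, "distortion": 0.7, "forward_tm": 0.35},
    "H3": {"backward_tm": 0.40, "lm": 0.001, "distortion": 0.5, "forward_tm": 0.20},
}

best_log = max(hypotheses, key=lambda h: loglinear_score(hypotheses[h], weights))
best_prod = max(hypotheses, key=lambda h: product_score(hypotheses[h], weights))
print(best_log, best_prod)
```

Working in the log domain is the usual choice in practice because products of many small probabilities underflow quickly.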

Advantages

Very flexible! Anyone can devise dozens of features.

• E.g., if lots of mismatched brackets in output, include feature function that outputs +1 if no mismatched brackets, -1 if have mismatched brackets.

• So lots of new features being tried in somewhat haphazard way.

• But systems steadily improving – outputs from NIST 2006 look much better than those from NIST 2002. SMT not good enough to replace human translators, but good enough for, e.g., most Web browsing. Using 1000 machines and massive quantities of data, Google got 45.4 BLEU for Arabic to English, 35.0 for Chinese to English – very high scores!


Typical Loglinear Components for SMT Decoding

• Joint counts C(S,T) from phrase extraction yield estimates P(S|T) stored in “backward” phrase table and estimates P(T|S) stored in “forward” phrase table. These are typically relative frequency estimates (but we’ve looked at smoothed variants).

• Distortion model D(T,A,S) assigns score to amount of phrase reordering incurred in going from S to hypothesis T. Can be based purely on displacement, or be lexicalized (identity of words in S & T is important).

• Length model L(T,S) scores probability that hypothesis of length |T| generated from source of length |S|.

• Language model P(T) gives probability of word sequence T in target language – see next few slides.

NOTE: these are just for decoding – you can use lots more components for N-best/lattice reordering!


Target Language Model P(T)

The Stupidest Thing Noam Chomsky Ever Said

« It must be recognized that the notion of a ‘probability of a sentence’ is an entirely useless one, under any interpretation of this term ». Chomsky, 1969.


• Language model helps generate fluent output by

1. assigning higher probability to correct word order – e.g., PLM(the house is small) >> PLM(small the is house)

2. assigning higher probability to correct word choices – e.g., PLM(I am going home) >> PLM(I am going house)

• Almost everyone in both SMT and ASR (automatic speech recognition) communities uses N-gram language models. Start with

P(W) = P(w1)*P(w2|w1)*P(w3|w1,w2)*…*P(wi|w1,…,wi-1)*…*P(wm|w1,…,wm-1),

then limit window to N words. E.g., for N=3, trigram LM:

P(W) = P(w1)*P(w2|w1)*P(w3|w1,w2)*…*P(wi|wi-2,wi-1)*…*P(wm|wm-2,wm-1).


• Estimation is done by relative frequency on large corpus :

P(wi|wi-2,wi-1) ≈ f(wi|wi-2,wi-1) = C(wi-2,wi-1,wi)/Σw C(wi-2,wi-1,w).

E.g., in Europarl corpus, see 225 trigrams starting « the red … »: C(the red cross)=123, C(the red tape)=31, C(the red army)=9, C(the red card)=7, C(the red ,)=5 (and 50 other trigrams). So estimate P(cross | the red) = 123/225 = 0.547 .

• But need to reserve probability mass for unseen events - maybe never saw « the red planet » in Europarl, but don’t want to have estimate P(planet | the red) = 0. Also, want estimates whose variance isn’t too high. Smoothing techniques are used to solve both problems. E.g., could linearly smooth trigrams with bigrams & unigrams:

$$P(w_i \mid w_{i-2}, w_{i-1}) \approx \lambda f(w_i \mid w_{i-2}, w_{i-1}) + \mu f(w_i \mid w_{i-1}) + (1 - \lambda - \mu) f(w_i); \quad 0 < \lambda, \mu < 1.$$
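A short sketch of both estimates follows. The trigram counts for « the red cross » are the Europarl figures quoted above; the bigram and unigram relative frequencies for « planet » and the weights λ = 0.6, μ = 0.3 are made-up values, since the slides do not give them.

```python
def relative_frequency(count, context_total):
    """f(w | context) = C(context, w) / sum over w' of C(context, w')."""
    return count / context_total if context_total else 0.0

def interpolate(f_tri, f_bi, f_uni, lam, mu):
    """Linear smoothing: lam * trigram + mu * bigram + (1 - lam - mu) * unigram."""
    return lam * f_tri + mu * f_bi + (1 - lam - mu) * f_uni

# Europarl counts from the slide: 123 of the 225 "the red ..." trigrams are "the red cross".
p_cross = relative_frequency(123, 225)

# "the red planet" was never seen, so its trigram relative frequency is 0.
f_tri_planet = relative_frequency(0, 225)
# Hypothetical lower-order relative frequencies, for illustration only.
f_bi_planet, f_uni_planet = 0.002, 0.0001
p_planet = interpolate(f_tri_planet, f_bi_planet, f_uni_planet, lam=0.6, mu=0.3)

print(round(p_cross, 3), p_planet)
```

The unseen trigram now gets a small but non-zero probability, borrowed from the bigram and unigram estimates.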

• Perplexity: metric that measures predictive power of an LM on new data as an average branching factor. E.g., model that says «any digit 0, …, 9 has equal probability of occurrence » will yield perplexity of 10.0 on digit sequence generated randomly from these 10 digits.

• Perplexity of LM measured on corpus W = (w1 … wN) is

$$\text{Perp}_{LM}(W) = \Big(\prod_{i=1}^{N} P(w_i \mid LM)\Big)^{-1/N} = \frac{1}{\text{geometric-average per-word prob.}}$$

The better the LM is as a model for W, the less « surprised » it is by words of W → higher estimated prob. → lower entropy.

Typical perplexities for well-trained English trigram LMs with lexica of about 25K words for various dictation domains:

Perp(radiology)=20, Perp(emergency medicine)=60, Perp(journalism)=105, Perp(general English)=247 .
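The digit example can be reproduced with a few lines of Python. This is a minimal sketch: `perplexity` takes the per-word probabilities the LM assigned to a corpus, and the random digit sequence and the seed are arbitrary choices made for the illustration.

```python
import math
import random

def perplexity(word_probs):
    """Inverse geometric mean of the per-word probabilities, computed in the log domain."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

random.seed(0)
digits = [random.randrange(10) for _ in range(1000)]

# Uniform model: every digit 0, ..., 9 has probability 0.1, whatever the history.
uniform_probs = [0.1 for _ in digits]
perp_uniform = perplexity(uniform_probs)
print(perp_uniform)
```

A model that assigned higher probability to the digits that actually occur would score a lower perplexity on the same sequence.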

Measuring Language Model Quality



• « A Bit of Progress in Language Modeling » (Goodman01) is good summary of state of the art in N-gram language modeling.

• Consistently superior method: Kneser-Ney. Intuition: if «Francisco» & «eggplant» each seen 10^3 times in our corpus of 10^6 words, and neither «eggplant Francisco» nor «eggplant stew» seen, which should be higher, P(Francisco|eggplant) or P(stew|eggplant)?

Interpolation answer: P(wi|wi-1) ≈ λ*f(wi|wi-1) + (1-λ)*f(wi). So P(Francisco|eggplant) ≈ λ*0 + (1-λ)*10^-3 = P(stew|eggplant).

Kneser-Ney answer: no, «Francisco» only occurs after «San», but the 1,000 occurrences of « stew » are preceded by 100 different words. So when (wi-1 wi) has never been seen before, wi = «stew» is more probable than wi = «Francisco» → P(stew|eggplant) >> P(Francisco|eggplant).


• Kneser-Ney formula (for bigrams – easily extended to N-grams):

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max[C(w_{i-1} w_i) - D,\, 0]}{C(w_{i-1})} + \lambda(w_{i-1}) \cdot \frac{\#\{v \mid C(v\, w_i) > 0\}}{\sum_{w} \#\{v \mid C(v\, w) > 0\}},$$

where $D$ is a discount factor < 1, $\lambda(w_{i-1})$ is a normalization constant, $\#\{v \mid C(v\, w_i) > 0\}$ is the number of different words that precede $w_i$ in the training corpus, and $\sum_{w} \#\{v \mid C(v\, w) > 0\}$ is the number of different bigrams in the training corpus.
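The bigram formula can be sketched directly. Two points are assumptions beyond the slide: the normalization constant is taken to be λ(w_{i-1}) = D · #{w | C(w_{i-1} w) > 0} / C(w_{i-1}), which is the usual choice that makes the distribution sum to 1, and C(w_{i-1}) is counted as the number of bigrams that start with w_{i-1}. The toy corpus and D = 0.75 are invented for illustration.

```python
from collections import Counter, defaultdict
from math import isclose

def train_kneser_ney(tokens, discount=0.75):
    """Return a function prob(word, prev) implementing bigram Kneser-Ney."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()          # C(prev), counted over bigrams starting with prev
    followers = defaultdict(set)       # distinct words seen after prev
    predecessors = defaultdict(set)    # distinct words seen before word
    for (prev, word), count in bigrams.items():
        context_count[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        continuation = len(predecessors[word]) / n_bigram_types
        if context_count[prev] == 0:
            return continuation  # unseen context: fall back to the continuation prob.
        discounted = max(bigrams[(prev, word)] - discount, 0) / context_count[prev]
        lam = discount * len(followers[prev]) / context_count[prev]
        return discounted + lam * continuation

    return prob

corpus = ("san francisco is in san francisco near san francisco "
          "beef stew fish stew lamb stew eggplant soup").split()
p_kn = train_kneser_ney(corpus)

p_stew = p_kn("stew", "eggplant")
p_francisco = p_kn("francisco", "eggplant")
total = sum(p_kn(w, "eggplant") for w in set(corpus))
print(p_stew, p_francisco, total)
```

Both words occur three times in the toy corpus, but « stew » follows three different words while « francisco » follows only « san », so « stew » gets the larger probability after an unseen context word.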

Flaws of Phrase-based, Loglinear Systems

• Loglinear feature function combination is too flexible! Makes it easy not to think about theoretical properties of models.

• The IBM models were true models: given arbitrary source sentence S and target sentence T, could estimate non-zero P(T|S). Phrase-based “models” are not models: in general, for T which is a good translation of S, they give P(T|S) = 0. They don’t guarantee existence of an alignment between T and S. Thus, the only translations T’ to which a phrase-based system is guaranteed to assign P(T’|S) > 0 are T’ output by same system.

• This has practical consequences: in general, a phrase-based MT system can’t be used for analyzing pre-existing translations. This rules out many useful forms of assistance to human translators - e.g., spotting potential errors in translations based on regions of low P(T|S).

PORTAGE: A Typical SMT System

1. Sentence-align a big bilingual corpus

2. On each sentence pair, use IBM models to align words

3. Build phrase tables from word alignments via “diag-and” or similar heuristic (Koehn03). Backwards phrase table gives P(S|T) (& is implicit segmentation model).

4. Build language model (LM) for target language: estimates P(T) , based on n-grams in T

5. P(S|T) and P(T) are sufficient for decoding, but one often adds other loglinear feature functions such as a distortion penalty

6. Use (Och03) method to find good weights λi for loglinear features

7. Optionally, include reordering step: i.e., decoder outputs many hypotheses (via N-best list or lattice) which are rescored by larger set of feature functions

[Figure: decoding and rescoring in PORTAGE]
• « Small » set of information sources – for Canoe decoder: LM (at least one language model), TM (at least 1 phrase translation model), DM (at least one distortion model), NM (number-of-words model)
• « Large » set of information sources – for Rescorer: the « small » set plus A1, A2, A3, … (any # of additional info. sources - for rescorer only)
• Canoe decoder: takes the source sentence « mais où sont les neiges d’ antan ? » and the weighted « small » info (wLM*LM, wTM*TM, …, wNM*NM, using the weights for the « small » set) and outputs N-best hypotheses:
H1: hey , where did the old snow go ? P = 0.41
H2: yet where are yesterday’s snows ? P = 0.33
H3: but where are the snows of yesteryear ? P = 0.18
…
• Rescorer: applies the weighted « large » info (kLM*LM, kTM*TM, …, kA3*A3, using the weights for the « large » set) to the feature functions and outputs the rescored N-best:
H1: but where are the snows of yesteryear ? P = 0.53
H2: however , where are yesterday’s snows ? P = 0.20
…

PORTAGE: A Typical SMT System

Training Core Components of PORTAGE

[Figure: training pipeline]
• Preprocessing: raw parallel corpus (src-lang text, tgt-lang text) → src preproc., tgt. preproc., sentence aligner → clean, aligned parallel corpus (src-lang text, tgt-lang text)
• Core Engine: IBM training (models 1 & 2) and phrase pair extraction on the aligned corpus → PT (phrase translation model); lang. model builder on tgt-lang text, including additional monolingual corpora → LM (language model); other small set models (model3, …, modelK)
• decoder wt optimizer: uses dev1 corpus (src, tgt) with small set info only → small set wts w1, …, wK
• rescorer wt optimizer: uses dev2 corpus (src, tgt) with large set info, including extra models for large set (modelK+1, …, modelM) → large set wts w1’, …, wM’

Canoe Optimization of Weights (COW)
Purpose: find weights [w1, …, ws] on « small » set of information sources (N around 100)

[Figure: COW loop]
• Dev corpus for COW (D sentences): D source-language sentences (S1: hé quoi ? S2: charmante élise , vous devenez mélancolique . … SD: la fin .) and D target-language ref. translations (T1: what’s this ? T2: charming élise , you’re becoming melancholy . … TD: the end .)
• First call to Canoe: the Canoe decoder uses the « small » set of information sources I1, I2, …, IS with initial weights [w1^i, w2^i, …, ws^i] and produces a list of D N-best hyp. (H1(S1): what’s up ? … HN(S1): are you OK ? H1(S2): cute élise , you’re bummed out . … HN(SD): all done .)
• Rescore_train: on the first call it works on this list; on the 2nd & subsequent calls it works on the expanded list, the union of old & new hypotheses (H1(S1), …, HN(S1), … with > N hyp. for S1, and likewise for S2, …, SD). Powell’s alg. is run from K random wt. vectors W1 = [w1^1, w2^1, …, ws^1], …, WK = [w1^K, w2^K, …, ws^K], giving Ŵ1, …, ŴK and from these a single Ŵ, using BLEU scoring (based on top hyp.)
• 2nd & subsequent calls to Canoe: use the new weights (from « rescore-train ») [w1^r, w2^r, …, ws^r]

Rescoring = Finding Weights on « Large » Info. Set for Rescorer (N around 1000)

[Figure: rescorer weight training]
• Dev corpus for « large » wts (D sent): D source-language sentences (S1: hé quoi ? S2: charmante élise , vous devenez mélancolique . … SD: la fin .) and D target-language ref. translations (T1: what’s this ? T2: charming élise , you’re becoming melancholy . … TD: the end .)
• Canoe decoder: uses the « small » set I1, I2, …, IS as weighted « small » info (w1*I1, …, wS*IS), with weights for « small » fixed by previous COW step, and produces a list of D N-best hyp. (H1(S1): what’s up ? … HN(S1): are you OK ? H1(S2): cute élise , you’re bummed out . … HN(SD): all done .)
• Rescore_train: applies the feature functions of the « large » set I1, I2, …, IS, IS+1, …, IL as weighted « large » info (w1*I1, …, wL*IL), starting from initial weights [w1^i, w2^i, …, wL^i]. Powell’s alg. is run from K random wt. vectors W1 = [w1^1, w2^1, …, wL^1], …, WK = [w1^K, w2^K, …, wL^K], giving Ŵ1, …, ŴK and from these a single Ŵ, using BLEU scoring (based on top hyp.)
• Output: final « large » wts [w1^f, w2^f, …, wL^f]

Tutorial Plan

B. Details & research topics
• Named entities
• Large-scale discriminative training (George Foster)
• Decoding for SMT (prepared by Nicola Ueffing)
• Hierarchical models (George Foster)
• System combination

Named entity recognition & transliteration

Chinese Example

« Secretary-General Wong appeared with Larry Ellison, Chief Executive Officer of Oracle Corporation, at a press conference to announce Oracle’s investment of $100 million in a new research centre in Szechuan Province ».

Personal names: “Wong”, “Larry Ellison”.
Titles: “Secretary-General”, “Chief Executive Officer”.
Organization name: “Oracle Corporation”.
Place name: “Szechuan Province”.

Recognition problem: detect these entities in a continuous stream of ideograms.

Transliteration problem: when ideograms are used phonetically (esp. for non-Chinese names like “Larry Ellison”), recognize that and map them onto Latin characters.


Made-up Chinese Transliteration Example

How to translate “唐纳德·拉姆斯菲尔德”?

唐 [táng] (surname) - Tang Dynasty
纳 (F 納) [nà] receive, accept, enjoy, pay, sew
德 [dé] virtue
拉 [lā] pull, drag, haul
姆 [mǔ] nurse
斯 [sī] thus (now used mostly for sound)
菲 [fěi] 菲薄 humble
尔 (F 爾) [ěr] (archaic) you
德 [dé] virtue

“After receiving virtue from the Tang Dynasty, you thus pulled the humble nurse away from virtue” (????).

No: « tang na de la mu si fei er de » = DONALD RUMSFELD.
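The phonetic reading can be illustrated with a toy lookup table. This is only a sketch: the table covers just the characters of this one name, the function name `romanize` is invented, and a real transliteration component would also have to decide when characters are being used phonetically and then map the syllables to a Latin-alphabet name.

```python
# Toy character-to-pinyin table (tone marks dropped), limited to this example.
PINYIN = {
    "唐": "tang", "纳": "na", "德": "de",
    "拉": "la", "姆": "mu", "斯": "si", "菲": "fei", "尔": "er",
}

def romanize(name):
    """Read each known character for its sound only, ignoring its meaning."""
    return " ".join(PINYIN[ch] for ch in name if ch in PINYIN)

syllables = romanize("唐纳德·拉姆斯菲尔德")
print(syllables)
```

The separator dot is not in the table, so it is skipped; the output is the syllable string that a later step would have to match against “Donald Rumsfeld”.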

Actual Chinese→English example generated by PORTAGE

“Outgoing president Iliescu has also congratulated Basescu.”
“Outgoing president of Iraq, has also been made to the road to the public.”


Other Examples

Arabic→English: Muammar Ghadafy = Moammar Khaddafi = Muamar Qadafy = …; Azeddine = Elzedine = Alsuddin = Ahzudin = … (depending on region, pronounced differently & thus transliterated into Latin alphabet differently)

English→French (Google Translate, Jan. 24, 2007):
“The Englishman John Snow thought cholera was transmitted by small, living organisms.”
“Le choléra de pensée de neige de John d'Anglais a été transmis par la petite, organique matière.”

System Combination

Introduction
• Different systems make different errors, so why not combine information? This worked well for ASR …
• But, because of reordering, synonyms, etc., system combination is not as easy for MT!
• RWTH (Aachen) is an SMT powerhouse and has recently been working on parallel system combination (Evgeny Matusov).
• NRC has been working on serial system combination.
• Both teams are now getting good results.

System Combination

Parallel System Combination (RWTH Aachen)
• Hypotheses from different systems are aligned; some word reordering is allowed; synonyms are used
• A confusion network is generated: the choices at each position are scored with system weights and word confidence scores
• N-best consensus translations are generated from the confusion network & rescored with various information sources
• A year ago, results were unimpressive. Since then, RWTH has added new information sources (e.g., LMs trained on N-best lists from contributing systems) that encourage preservation of original phrases. Nice preliminary Arabic results: improvement of +2-3 BLEU points over the best individual system in the combination.
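The voting step over a confusion network can be sketched as follows. This is a simplified illustration, not the RWTH system: it assumes the hypotheses have already been aligned position by position (with an empty string marking a gap), uses only per-system weights and no word confidence scores, and returns a single consensus string instead of an N-best list. The function name, the weights, and the example hypotheses are made up.

```python
from collections import defaultdict

EPS = ""  # empty word: this system has nothing at this position

def consensus(aligned_hyps, system_weights):
    """Pick, at each position, the word with the largest total system weight."""
    length = len(aligned_hyps[0])
    output = []
    for pos in range(length):
        votes = defaultdict(float)
        for hyp, weight in zip(aligned_hyps, system_weights):
            votes[hyp[pos]] += weight
        best = max(votes, key=votes.get)
        if best != EPS:
            output.append(best)
    return " ".join(output)

# Three made-up, pre-aligned system outputs of equal length.
aligned = [
    ["Chinese", "president", "slams",      "",    "leaders", "to", "Hong", "Kong"],
    ["Chinese", "president", "criticizes", "the", "leaders", "of", "Hong", "Kong"],
    ["China's", "president", "criticizes", "the", "leader",  "of", "Hong", "Kong"],
]
combined = consensus(aligned, [0.4, 0.3, 0.3])
print(combined)
```

Even though the first system has the highest weight, the other two outvote it wherever they agree, which is how combination can beat the best individual system.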

System Combination

Example of RWTH Parallel Combination

Ref: Chinese president directs unprecedented criticism at leaders of Hong Kong.
Best System: Chinese president slams unprecedented leaders to Hong Kong.
System Comb.: Chinese president sends unprecedented criticism of the leaders of Hong Kong.

System Combination

Serial System Combination (NRC)
• Use SMT to correct mistakes made by another method (e.g., a rule-based one)

Source text → MT1 → Initial target text → SMT → Final target text

Training Procedure
• Use MT1 to produce an initial target translation of the source half of a parallel human-translated corpus, giving a corpus of MT1 target output in parallel with good target versions of the same sentences; use this parallel corpus of (MT1 target || human target) sentences to train the SMT system.
• Even better: if humans can be asked to post-edit the MT1 output, use the MT1 target in parallel with the corrected target as the SMT training corpus.
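The training procedure above boils down to pairing MT1 output with human target sentences. A minimal sketch follows, assuming MT1 can be called as a function on one sentence at a time; `toy_mt1` and its lookup table are invented placeholders for a real first-stage system.

```python
def build_serial_training_corpus(source_sents, human_targets, mt1):
    """Pair MT1 output with the human translation of the same source sentence."""
    if len(source_sents) != len(human_targets):
        raise ValueError("source and target halves must be sentence-aligned")
    return [(mt1(src), ref) for src, ref in zip(source_sents, human_targets)]

# Hypothetical first-stage system: a fixed lookup table with made-up output.
TOY_MT1_OUTPUT = {
    "hé quoi ?": "hey what ?",
    "la fin .": "the finish .",
}

def toy_mt1(sentence):
    return TOY_MT1_OUTPUT[sentence]

sources = ["hé quoi ?", "la fin ."]
humans = ["what's this ?", "the end ."]
corpus = build_serial_training_corpus(sources, humans, toy_mt1)
for mt1_target, human_target in corpus:
    print(mt1_target, "||", human_target)
```

The resulting (MT1 target || human target) pairs are monolingual in the target language, so the second-stage SMT system learns to map MT1's typical errors to corrected wording.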


System Combination

Discussion & Future Work
• Parallel combination is probably best for similar systems of good quality, serial combination for systems that are very different
• Future work for serial combination: allow SMT both direct & indirect (via MT1) access to source text. Could do this using, e.g.: rescoring, parallel phrasetables, parallel LMs, parallel decoding, … (etc.)

Source text → MT1 → Initial target text → SMT → Final target text

References (1)

Best overall reference
Philipp Koehn, « Statistical Machine Translation », University of Edinburgh (textbook to appear 2007 or 2008, Cambridge University Press).

Papers (NOTE: short summary of key papers available from Kuhn/Foster)
Brown93 Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of Machine Translation: Parameter estimation. Computational Linguistics, 19(2):263-312, June 1993.
Chomsky69 Noam Chomsky. Quine’s Empirical Assertions. In Words and Objections – Essays on the Work of W.V. Quine (ed. D. Davidson and J. Hintikka). Dordrecht, Netherlands, 1969.
Foster06 George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006.
Germann01 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.

References (2)

Goodman01 Joshua Goodman. A Bit of Progress in Language Modeling (extended version). Microsoft Research Technical Report, Aug. 2001. Downloadable from research.microsoft.com/~joshuago/publications.htm
Jones05 Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine Generated Text: Studies in Speech Recognition and Machine Translation. In Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, March 2005 (Special Session on Human Language Technology: Applications and Challenge of Speech Processing).
Knight99 Kevin Knight. Decoding complexity in word-replacement translation models. Computational Linguistics, Squibs and Discussion, 25(4), 1999.
Koehn04 Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington D.C., October 2004. Springer-Verlag.
KoehnDec03 Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models (User Manual and Description). USC Information Sciences Institute, Dec. 2003.

References (3)

KoehnMay03 Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada, May 2003.
Marcu02 Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, 2002.
OchJHU04 Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins 2003 Summer Workshop on Syntax for Statistical Machine Translation (revised version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf), Feb. 2004.
Och04 Franz Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004.
Och03 Franz Josef Och. Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, July 2003.

References (4)

Och02 Franz Josef Och and Hermann Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002.
Och01 Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search Algorithm for Statistical Machine Translation. In Proc. Data-Driven Machine Translation Workshop, July 2001.
Och00 Franz Josef Och and Hermann Ney. A Comparison of Alignment Models for Statistical Machine Translation. Int. Conf. on Computational Linguistics (COLING), Saarbrücken, Germany, August 2000.
Papineni01 Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of Machine Translation. Technical Report RC22176, IBM, September 2001.
Ueffing02 Nicola Ueffing, Franz Josef Och, and Hermann Ney. Generation of Word Graphs in Statistical Machine Translation. Empirical Methods in Natural Language Processing, July 2002.
