Machine Translation II.2 (February 2006): Postgraduate Diploma in Translation. Example Based Machine Translation and Statistical Machine Translation.


  • Postgraduate Diploma in Translation

    Example Based Machine Translation
    Statistical Machine Translation

    Machine Translation II.2

  • Three ways to lighten the load
    - Restrict coverage to specialised domains
    - Exploit existing sources of knowledge (convert machine-readable dictionaries)
    - Try to manage without explicit representations: Example Based MT (EBMT), Statistical MT (SMT)


  • Today's Lecture
    - Example Based MT
    - Statistical MT


  • Part I: Example Based Machine Translation


  • EBMT
    The basic idea is that instead of being based on rules and abstract representations, translation should be based on a database of examples.
    Each example is a pairing of a source fragment with a target fragment.
    The original intuition came from Nagao, a well-known pioneer in the field of English/Japanese translation.


  • EBMT (Nagao 1984)
    Man does translation by:
    - properly decomposing an input sentence into certain fragmental phrases, then
    - translating these phrases into other language phrases, and finally
    - properly composing these fragmental translations into one long sentence.


  • Three Step Process
    - Match: identify relevant source language examples in the database.
    - Align: find the corresponding fragments in the target language.
    - Recombine: put the target language fragments together to form sentences.


  • EBMT

    Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University

    Based on notes by Dave Inman


  • EBMT Corpus & Index

    Corpus:
    S1: The cat eats a fish. / Le chat mange un poisson.
    S2: A dog eats a cat. / Un chien mange un chat.
    ...
    S99,999,999: ...

    Index:
    the: S1
    cat: S1, S2
    eats: S1, S2
    fish: S1
    dog: S2
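As a concrete illustration, here is a minimal Python sketch of such a corpus and inverted index (toy data and names, not the actual Pangloss implementation):

```python
# A minimal sketch (not the actual Pangloss code) of the corpus and
# index on this slide: each source-language word points back to the
# sentence pairs that contain it.
from collections import defaultdict

corpus = [
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
]

index = defaultdict(set)
for sent_id, (source, target) in enumerate(corpus, start=1):
    for word in source.lower().rstrip(".").split():
        index[word].add(f"S{sent_id}")

print(sorted(index["cat"]))   # ['S1', 'S2']
print(sorted(index["fish"]))  # ['S1']
```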


  • EBMT: find chunks
    A source language sentence is input: The cat eats a dog.
    Chunks of this sentence are matched against the corpus:
    - The cat: S1
    - The cat eats: S1
    - The cat eats a: S1
    - a dog: S2
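A hedged sketch of the chunk lookup, reusing the toy corpus and index from the previous sketch; checking adjacency by substring matching is a simplification, not Pangloss's actual matcher:

```python
# A simplified chunk finder: every span of two or more adjacent input
# words is checked against the corpus sentences suggested by the index.
def find_chunks(sentence, corpus, index):
    words = sentence.lower().rstrip(".").split()
    chunks = []
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):      # spans of >= 2 words
            span = " ".join(words[i:j])
            # candidate sentences: those containing the span's first word
            for sid in sorted(index.get(words[i], ())):
                source = corpus[int(sid[1:]) - 1][0].lower()
                if span in source:                  # crude adjacency check
                    chunks.append((span, sid))
    return chunks

print(find_chunks("The cat eats a dog.", corpus, index))
# finds e.g. ('the cat eats a', 'S1') and ('a dog', 'S2'), as on the slide
```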


  • Match and Align Chunks
    For each chunk, retrieve the target sentence:
    - the cat eats: S1 (The cat eats a fish. / Le chat mange un poisson.)
    - a dog: S2 (A dog eats a cat. / Un chien mange un chat.)
    The chunks are then aligned with fragments of the target sentences:
    - The cat eats -> Le chat mange
    Alignment is difficult.


  • Recombination
    Chunks are scored to find the good matches:
    - The cat eats / Le chat mange: score 78%
    - The cat eats / Le chat dorme: score 43%
    - a dog / un chien: score 67%
    - a dog / le chien: score 56%
    - a dog / un arbre: score 22%

    The best translated chunks are put together to make the final translation:
    The cat eats / Le chat mange + a dog / un chien
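The selection step can be sketched directly from the slide's numbers (the data structure and scores are illustrative only):

```python
# A toy sketch of recombination: keep the best-scoring candidate for
# each source chunk and concatenate in source order. The scores are
# the illustrative percentages from the slide.
candidates = {
    "the cat eats": [("le chat mange", 0.78), ("le chat dorme", 0.43)],
    "a dog": [("un chien", 0.67), ("le chien", 0.56), ("un arbre", 0.22)],
}

translation = " ".join(
    max(options, key=lambda pair: pair[1])[0]
    for options in candidates.values()
)
print(translation)  # le chat mange un chien
```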


  • What Data Are Needed?
    - A bilingual dictionary (but we can induce this from the corpus).
    - A target language root/synonym list, so we can see the similarity between words and their inflected forms (e.g. verbs).
    - Classes of words easily translated, such as numbers, towns, weekdays.
    - A large corpus of parallel sentences, if possible in the same domain as the translations.


  • How to create a bilingual lexicon
    - Take each sentence pair in the corpus.
    - For each word in the source sentence, add each word in the target sentence and increment the frequency count.
    - Repeat for as many sentences as possible.
    - Use a threshold to get the possible alternative translations.
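This recipe is plain co-occurrence counting; a minimal sketch, with the threshold handling and function names as assumptions:

```python
# Co-occurrence counting as described on this slide.
from collections import Counter, defaultdict

def induce_lexicon(pairs):
    cooc = defaultdict(Counter)
    for source, target in pairs:
        for s in source.lower().rstrip(".").split():
            for t in target.lower().rstrip(".").split():
                cooc[s][t] += 1          # count every source/target pairing
    return cooc

def translations(cooc, word, threshold):
    # keep only target words whose count clears the threshold
    return [(t, n) for t, n in cooc[word].most_common() if n >= threshold]

cooc = induce_lexicon([
    ("The cat eats a fish.", "Le chat mange un poisson."),
    ("A dog eats a cat.", "Un chien mange un chat."),
])
print(translations(cooc, "cat", threshold=2))
# [('un', 3), ('chat', 2), ('mange', 2)]
```

Note how, with only two sentences, a frequent function word like "un" outscores the real translation "chat"; this is why many sentences and a threshold are needed, as the next slides show.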


  • How to create a lexicon
    The cat eats a fish. / Le chat mange un poisson.

    the:  le,1  chat,1  mange,1  un,1  poisson,1
    cat:  le,1  chat,1  mange,1  un,1  poisson,1
    eats: le,1  chat,1  mange,1  un,1  poisson,1
    a:    le,1  chat,1  mange,1  un,1  poisson,1
    fish: le,1  chat,1  mange,1  un,1  poisson,1


  • After many sentences: the
    le, 956
    la, 925
    un, 235
    ------ Threshold ------
    chat, 47
    mange, 33
    poisson, 28
    ...
    arbre, 18


  • After many sentences: cat
    chat, 963
    ------ Threshold ------
    le, 604
    la, 485
    un, 305
    mange, 33
    poisson, 28
    ...
    arbre, 47


  • Indexing the Corpus
    For speed, the corpus is indexed on the source language sentences.

    Each word in each source language sentence is stored with information about the target sentence.

    Words can be added to the corpus and the index easily updated.

    Tokens are used for common classes of words (e.g. numbers). This makes matching more effective.
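The class-token idea might look like this in a minimal sketch (the <NUM> placeholder name is an assumption for illustration):

```python
# Replacing an easily-translated word class (numbers here) with a
# placeholder token before indexing, so that "chapter 7" and
# "chapter 12" match the same corpus entry.
import re

def tokenize_with_classes(sentence):
    words = sentence.lower().rstrip(".").split()
    return ["<NUM>" if re.fullmatch(r"\d+", w) else w for w in words]

print(tokenize_with_classes("Chapter 7 has 12 pages."))
# ['chapter', '<NUM>', 'has', '<NUM>', 'pages']
```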


  • Finding Chunks to Translate
    Look up each word in the source sentence in the index.

    Look for chunks in the source sentence (at least 2 adjacent words) which match the corpus.

    Select the last few matches against the corpus (translation memory).

    Pangloss uses the last 5 matches for any chunk.


  • Matching a chunk against the target
    For each source chunk found previously, retrieve the target sentences from the corpus (using the index).

    Try to find the translation for the source chunk from these sentences.

    This is the hard bit!

    Look for the minimum and maximum segments in the target sentences which could correspond to the source chunk. Score each of these segments.


  • Scoring a segment
    - Unmatched words: higher priority is given to sentences containing all the words in an input chunk.
    - Noise: higher priority is given to corpus sentences which have fewer extra words.
    - Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
    - Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
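A hedged sketch of how these priorities might combine into a single score; the weights are invented, and the morphology test is omitted for brevity:

```python
# Toy combination of three of the slide's heuristics (morphology omitted).
def score_segment(chunk_words, segment_words):
    chunk, seg = list(chunk_words), list(segment_words)
    matched = [w for w in chunk if w in seg]
    unmatched = len(chunk) - len(matched)   # chunk words missing from segment
    noise = len(seg) - len(matched)         # extra words in the segment
    # order: fraction of adjacent matched pairs keeping their relative order
    positions = [seg.index(w) for w in matched]
    in_order = sum(a < b for a, b in zip(positions, positions[1:]))
    order = in_order / max(len(positions) - 1, 1)
    return order - 0.5 * unmatched - 0.25 * noise

print(score_segment(["le", "chat", "mange"],
                    ["le", "chat", "mange", "un", "poisson"]))  # 0.5
```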


  • Whole Sentence Match
    If we are lucky, the whole sentence will be found in the corpus!

    In that case the target sentence is used directly, without the alignment step.

    Useful if a translation memory is available (sentences recently translated are added to the corpus).


  • Quality of Translation
    Pangloss was tested against source sentences in a different domain to the examples in the corpus.
    Pangloss covered about 70% of the sentences input. This means a match was found against the corpus, but not necessarily a good match.
    Others report that around 60% of the translation can be understood by a native speaker. Systran manages about 70%.


  • Speed of Translation
    Translations are much faster than for Systran; simple sentences are translated in seconds.
    The corpus (translation memory) can be added to at about 6 MB per minute (Sun SPARCstation).
    A 270 MB corpus takes 45 minutes to index.


  • Positive Points
    - Fast
    - Easy to add a new language pair
    - No need to analyse languages (much)
    - Can induce a dictionary from the corpus
    - Allows easy implementation of translation memory


  • Negative Points
    - Quality is second best at present
    - Depends on a large corpus of parallel, well translated sentences
    - 30% of the source has no coverage (no translation)
    - Matching of words is brittle: we can see a match where Pangloss cannot
    - The domain of the corpus should match the domain to be translated, so that chunks match


  • Conclusions
    - An alternative to Systran
    - Faster
    - Lower quality
    - Quick to develop for a new language pair, if a corpus exists!
    - Needs no linguistics
    - Might improve as bigger corpora become available?


  • Part II: Statistical Translation


  • Statistical Translation
    - Robust
    - Domain independent
    - Extensible
    - Does not require language specialists
    - Uses the noisy channel model of translation


  • Noisy Channel Model
    Sentence translation (Brown et al. 1990)
    [Diagram: a source sentence passes through a noisy channel and emerges as the observed target sentence; translation recovers the source from the target.]


  • Basic Principle
    John loves Mary (S)
    Jean aime Marie (T)
    Given T, we have to find the S such that Ptrans x Ps is greater than for any other S, where:
    - Ptrans = probability that T is a translation of S
    - Ps = probability of S
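In the standard noisy channel notation, this is the Bayes decision rule (a routine restatement of the slide's condition, not new material):

```latex
\hat{S} \;=\; \arg\max_{S}\; P(S)\, P(T \mid S)
        \;=\; \arg\max_{S}\; P_s \times P_{\mathrm{trans}}
```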


  • A Statistical MT System
    [Diagram: a source language model supplies Ps and a translation model supplies Ptrans; a decoder takes the target sentence T and outputs the source sentence S.]


  • The Three Components of a Statistical MT Model
    - A method for computing language model (Ps) probabilities
    - A method for computing translation (Ptrans) probabilities
    - A method for searching amongst source sentences for the one that maximises Ptrans x Ps


  • Simplest Language Model
    The probability Ps of any sentence is the product of the probabilities of the words in it.
    For example: P(John loves Mary) = P(John) x P(loves) x P(Mary)
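A minimal sketch of such a unigram model (the floor for unseen words is an assumption; logs avoid underflow on long sentences):

```python
# Unigram language model: word probabilities are relative frequencies,
# and a sentence's probability is the product over its words.
from collections import Counter
import math

def train_unigram(text):
    words = text.lower().split()
    total = len(words)
    return {w: n / total for w, n in Counter(words).items()}

def sentence_logprob(sentence, model, floor=1e-9):
    # unseen words get a tiny floor probability (an assumption here)
    return sum(math.log(model.get(w, floor)) for w in sentence.lower().split())

model = train_unigram("john loves mary and mary loves john")
print(sentence_logprob("john loves mary", model))
```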


  • Simplest Translation Model (1)
    Assumption: the target sentence is generated from the source sentence word by word.
    S: John loves Mary
    T: Jean aime Marie


  • Simplest Translation Model (2)
    Ptrans is just the product of the translation probabilities of each of the words:
    Ptrans = P(Jean|John) x P(aime|loves) x P(Marie|Mary)
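Transcribing the product directly, with invented toy probabilities:

```python
# Direct transcription of the slide's product; the values are
# illustrative, not estimated from data.
t = {("Jean", "John"): 0.8, ("aime", "loves"): 0.7, ("Marie", "Mary"): 0.9}
p_trans = t[("Jean", "John")] * t[("aime", "loves")] * t[("Marie", "Mary")]
print(round(p_trans, 3))  # 0.504
```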


  • More Realistic Example
    The proposal will not now be implemented
    Les propositions ne seront pas mises en application maintenant


  • More Realistic Translation Models
    Better translation models include other features, such as:
    - Fertility: the number of words in the target that are paired with each source word (0 to N)
    - Distortion: the difference in sentence position between the source word and the target word


  • Searching
    Maintain a list of hypotheses. Initial hypothesis: (Jean aime Marie | *)
    Search proceeds iteratively. At each iteration we extend the most promising hypotheses with additional words:
    - (Jean aime Marie | John(1) *)
    - (Jean aime Marie | * loves(2) *)
    - (Jean aime Marie | * Mary(3) *)
    - (Jean aime Marie | Jean(1) *)
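The hypothesis list can be sketched as a tiny beam search; the monotone word-by-word extension, the toy lexicon, and the beam width are all simplifications of Brown et al.'s actual stack decoder:

```python
# Toy beam search over source hypotheses for a given target sentence.
import heapq, math

lexicon = {  # toy P(target word | source word), invented values
    "John": {"Jean": 0.9}, "loves": {"aime": 0.9}, "Mary": {"Marie": 0.9},
}

def beam_search(target_words, source_vocab, beam=3):
    hyps = [(0.0, [])]                    # (negative log prob, source so far)
    for t in target_words:
        extended = []
        for score, words in hyps:
            for s in source_vocab:
                p = lexicon.get(s, {}).get(t, 1e-6)   # floor for unseen pairs
                heapq.heappush(extended, (score - math.log(p), words + [s]))
        hyps = [heapq.heappop(extended) for _ in range(min(beam, len(extended)))]
    return hyps[0][1]

print(beam_search(["Jean", "aime", "Marie"], ["John", "loves", "Mary"]))
# ['John', 'loves', 'Mary']
```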


  • Building Models
    In general, large quantities of data are needed.
    - For the language model, we need only source language text.
    - For the translation model, we need pairs of sentences that are translations of each other.
    Use the EM algorithm (Baum 1972) to optimise the model parameters.
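For the translation model, the EM idea can be sketched as the classic IBM Model 1 estimator, a simplification of the full Brown et al. model (which adds fertility and distortion); the data and names are toys:

```python
# Compact EM for word-translation probabilities (IBM Model 1 style).
from collections import defaultdict

def em_model1(pairs, iterations=10):
    t = defaultdict(lambda: 1.0)          # uniform-ish initialisation
    for _ in range(iterations):
        counts = defaultdict(float)       # expected co-occurrence counts
        totals = defaultdict(float)
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(sw, tw)] for sw in src)
                for sw in src:
                    frac = t[(sw, tw)] / norm     # E-step: soft alignment
                    counts[(sw, tw)] += frac
                    totals[sw] += frac
        t = defaultdict(float,                    # M-step: renormalise
                        {(sw, tw): c / totals[sw]
                         for (sw, tw), c in counts.items()})
    return t

pairs = [(["the", "cat"], ["le", "chat"]), (["the", "dog"], ["le", "chien"])]
t = em_model1(pairs)
print(round(t[("the", "le")], 2))   # rises towards 1.0 over the iterations
```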


  • Experiment 1 (Brown et al. 1990)
    Hansard: 40,000 pairs of sentences, approx. 800,000 words in each language.
    Considered the 9,000 most common words in each language.
    Assumptions (initial parameter values):
    - each of the 9,000 target words equally likely as a translation of each of the source words
    - each of the fertilities from 0 to 25 equally likely for each of the 9,000 source words
    - each target position equally likely given each source position and target length


  • English: the

    French    Probability
    le        .610
    la        .178
    l'        .083
    les       .023
    ce        .013
    il        .012
    de        .009
    à         .007
    que       .007

    Fertility   Probability
    1           .871
    0           .124
    2           .004


  • English: not

    French        Probability
    pas           .469
    ne            .460
    non           .024
    pas du tout   .003
    faux          .003
    plus          .002
    ce            .002
    que           .002
    jamais        .002

    Fertility   Probability
    2           .758
    0           .133
    1           .106


  • English: hear

    French      Probability
    bravo       .992
    entendre    .005
    entendu     .002
    entends     .001

    Fertility   Probability
    0           .584
    1           .416


  • Experiment 2
    Translation was performed using the 1,000 most frequent words in the English corpus, and the 1,700 French words most frequently used in translations of sentences completely covered by the 1,000-word English vocabulary.
    117,000 pairs of sentences were completely covered by both vocabularies.
    The parameters of the English language model were estimated from 570,000 sentences in the English part.


  • Experiment 2 (contd)
    73 French sentences from elsewhere in the corpus were tested. The results were classified as:
    - Exact: same as the actual translation
    - Alternate: same meaning
    - Different: a legitimate translation, but with a different meaning
    - Wrong: could not be interpreted as a translation
    - Ungrammatical: grammatically deficient
    Corrections to the last three categories were made, and the keystrokes were counted.


  • Results
    [Results table not reproduced in the transcript.]


  • Results: Discussion
    According to Brown et al., the system performed successfully 48% of the time (the first three categories).
    776 keystrokes were needed to repair the output, against 1,916 keystrokes to generate all 73 translations from scratch.
    According to the authors, the system therefore reduces work by about 60%.


  • Bibliography
    Statistical MT: Brown et al., "A Statistical Approach to Machine Translation", Computational Linguistics 16(2), 1990, pp. 79-85 (search the ACL Anthology).
