Morphology & Machine Translation
Eric Davis
MT Seminar
02/06/08
Professor Alon Lavie
Professor Stephan Vogel
Outline
Intro
- The Issue at Hand
- Supervised MA
- Unsupervised MA
- Integration of Morphology into MT
Papers
- Morfessor
- Bridging the Inflectional Morphology Gap --> Arabic SMT
- Unsupervised MA w/ Finnish, Swedish, & Danish
- Turkish SMT
Discussion
- The Good
- The Bad
- Future Directions
Q&A
Morfessor
- morpheme segmentation & simple morphology induction algorithm
- utilized Finnish & English data sets used in the Morpho Challenge
- unsupervised method for segmentation of words into morpheme-like units
- idea: propose substrings occurring frequently enough in several different word forms as morphs
- words = concatenation of morphs
- look for optimal balance btwn compactness of the morph lexicon & compactness of the corpus representation
  - very compact lexicon = individual letters --> as many morphs as letters in a word
  - short representation of corpus = whole words --> very large lexicon
- corpus represented as a sequence of pointers to entries in the morph lexicon
- uses a probabilistic framework or MDL to produce segmentations resembling linguistic morpheme segmentation
- 3 'flavors': Baseline, Categories ML, Categories MAP
Morfessor Baseline
- context-independent splitting algorithm
- optimization criterion: maximize P(lexicon) · P(corpus|lexicon) = ∏_α P(α) · ∏_μ P(μ)
- lexicon = all distinct morphs, spelled out as strings of letters
- α ranges over the letters spelling out the morphs: P(lexicon) = product of the probability of each letter α
- corpus --> sequence of morphs, i.e., a particular segmentation of the words in the corpus
- P(corpus|lexicon) = product of the probability of each morph token μ
- letter & morph probs are maximum likelihood estimates (cost-function sketch below)
- 3 error types:
  1) undersegmentation: frequent strings stored whole b/c that is the most concise representation
  2) oversegmentation: infrequent strings best coded in parts
  3) morphotactic violations: b/c the model is context-independent
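To make the criterion concrete, here is a minimal Python sketch of the Baseline cost (negative log probability) of a candidate segmentation, assuming maximum-likelihood letter and morph probabilities as described above; the function and variable names are mine, not Morfessor's API, and real Morfessor adds further terms to the coding length.

```python
import math
from collections import Counter

def baseline_cost(segmented_corpus):
    """Cost (in nats) of a candidate segmentation under the Baseline
    criterion: -log P(lexicon) - log P(corpus | lexicon), with letter
    and morph probabilities estimated by maximum likelihood.
    `segmented_corpus` is a list of words, each a list of morphs."""
    morph_tokens = [m for word in segmented_corpus for m in word]
    morph_counts = Counter(morph_tokens)
    lexicon = list(morph_counts)              # distinct morphs, spelled out

    # P(lexicon): each distinct morph is a string of letters.
    letters = [ch for m in lexicon for ch in m]
    letter_counts = Counter(letters)
    n_letters = len(letters)
    lexicon_cost = -sum(math.log(letter_counts[ch] / n_letters)
                        for m in lexicon for ch in m)

    # P(corpus | lexicon): corpus is a sequence of pointers to morphs.
    n_tokens = len(morph_tokens)
    corpus_cost = -sum(math.log(morph_counts[m] / n_tokens)
                       for m in morph_tokens)
    return lexicon_cost + corpus_cost

# Toy comparison: splitting shared stems & suffixes lowers the total cost.
whole = [["talot"], ["talon"], ["autot"], ["auton"]]
split = [["talo", "t"], ["talo", "n"], ["auto", "t"], ["auto", "n"]]
print(baseline_cost(whole), baseline_cost(split))   # split is cheaper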
Morfessor Categories ML
- introduce morph categories; model them w/ an HMM
  - transition probabilities between categories
  - emission probabilities of morphs from categories
- 4 categories, assigned using properties of morphs in the proposed segmentation:
  - prefix: morph preceding a large # of different morphs (high right perplexity; sketch below)
  - stem: morph that is not very short
  - suffix: morph following a large # of different morphs (high left perplexity)
  - noise: morph that is not an obvious prefix, suffix, or stem in the position it occurs in
- use heuristics & the noise category to remove some errors of the Baseline:
  - split redundant morphs in the lexicon to reduce undersegmentation
  - prohibit splitting into 'noise' & join morphs tagged as noise w/ their neighbors to reduce oversegmentation
  - introduce context sensitivity (the HMM) to reduce morphotactic violations
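As an illustration of the perplexity idea, a sketch of right perplexity (the mirror-image left perplexity flags suffixes); this is my illustration, not Morfessor's actual code.

```python
import math
from collections import Counter

def right_perplexity(segmented_corpus, morph):
    """Perplexity of the distribution of morphs immediately following
    `morph` in a segmented corpus (a list of words, each a list of
    morphs). High right perplexity suggests prefix-like behaviour."""
    followers = Counter()
    for word in segmented_corpus:
        for a, b in zip(word, word[1:]):
            if a == morph:
                followers[b] += 1
    total = sum(followers.values())
    if total == 0:
        return 1.0
    entropy = -sum((c / total) * math.log(c / total)
                   for c in followers.values())
    return math.exp(entropy)

corpus = [["un", "do"], ["un", "tie"], ["un", "fold"], ["re", "do"]]
print(right_perplexity(corpus, "un"))   # 3.0: 'un' precedes 3 distinct morphs
```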
Morfessor Categories MAP
- 2 probabilities calculated: P(lexicon) & P(representation of corpus conditioned on lexicon)
- frequent strings still represented as whole words in the lexicon, but they now have a hierarchical representation
- a morph --> either a string of letters or 2 sub-morphs (sketch below)
- expand morphs into their sub-morphs to avoid undersegmentation
- do not expand nodes in the tree if the next level = 'noise', to avoid oversegmentation
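A sketch of the hierarchical expansion idea: a morph is either a leaf (string of letters) or two sub-morphs, and expansion stops when a child is tagged 'noise'. The data structures here are hypothetical stand-ins, not Morfessor's internals.

```python
def expand(morph, subparts, categories):
    """Recursively expand a hierarchical morph into its surface morphs.
    `subparts` maps a morph to its (left, right) sub-morphs, if any;
    `categories` maps a morph to its category tag. A node is kept whole
    when one of its children is tagged 'noise'."""
    if morph not in subparts:
        return [morph]                      # leaf: a plain string of letters
    left, right = subparts[morph]
    if categories.get(left) == "noise" or categories.get(right) == "noise":
        return [morph]                      # do not expand into noise
    return (expand(left, subparts, categories)
            + expand(right, subparts, categories))

subparts = {"taloissa": ("talo", "issa"), "issa": ("i", "ssa")}
categories = {"talo": "stem", "i": "suffix", "ssa": "suffix"}
print(expand("taloissa", subparts, categories))  # ['talo', 'i', 'ssa']
```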
Experiments & Results
- Baseline entirely unsupervised; ML & MAP not fully unsupervised: a perplexity threshold is optimized separately for the 3 languages
- ran the 3 models on the Challenge data: ML & MAP > Baseline overall, but Baseline did best on English
- MAP had much higher precision than the other models BUT lower recall; MAP & ML gave a great improvement in recall BUT lower precision
- explanation: different complexities of morphology
  - Turkish/Finnish: high type/token ratio; word formation --> concatenation of morphemes; so the proportion of frequently occurring word forms is lower
  - English: word formation --> fewer morphemes; so the proportion of frequently occurring word forms is higher
BAMA & Arabic MT
- take advantage of source & target language context when conducting MA
- preprocess data w/ BAMA (the Buckwalter Arabic Morphological Analyzer)
- morphological analysis at the word level: analyze a word --> return all possible segmentations of the word
- segmentations --> prefixes, stems, suffixes
- built-in word-based heuristics --> rank the candidates
- gloss info provided by BAMA's manually constructed lexicon
- 3 methods of analysis: 1) BAMA only, 2) BAMA & context, 3) BAMA & corresponding match

BAMA & Arabic MT: 3 Methods of Analysis
1) BAMA only
- replace each Arabic word w/ the 1st possible split returned by BAMA

2) BAMA & context
- take full advantage of the gloss info provided by BAMA's lexicon
- each split = a particular prefix, stem, & suffix existing in the lexicon
- set of possible translations (glosses) for each fragment
- select the split for a source word using context: winner = split w/ the most target-side matches in the translation of the full sentence (sketch below)
- save the choice of split & use it for all occurrences of that surface form in training & testing

3) BAMA & corresponding match
- Arabic info in the surface form that is not present in English is confusing for word alignment unless those fragments are assigned to null
- remove fragments w/ lexical info not present in English
- find them b/c their English translations in the BAMA lexicon are empty
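A sketch of the context-based selection in method 2: among the candidate splits BAMA returns, pick the one whose fragment glosses overlap most with the target-side translation of the full sentence. The data structures are hypothetical stand-ins for BAMA's output, not its real interface.

```python
def pick_split(splits, target_sentence):
    """Choose the BAMA split w/ the most target-side gloss matches.
    `splits` is a list of candidate splits; each split is a list of
    fragments, each fragment carrying a set of English glosses from
    BAMA's lexicon. (Hypothetical structures, for illustration.)"""
    target_words = set(target_sentence.lower().split())

    def matches(split):
        return sum(1 for fragment_glosses in split
                   if target_words & {g.lower() for g in fragment_glosses})

    return max(splits, key=matches)

splits = [
    [{"book"}, {"his"}],      # split 1: two fragments w/ their glosses
    [{"books"}],              # split 2: unsegmented alternative
]
print(pick_split(splits, "I read his book yesterday"))  # picks split 1
```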
BAMA & Arabic MT: Data & System

- data: BTEC IWSLT05 Arabic language data; 20,000 Arabic/English sentence pairs (training)
- DevSet/Test05: 500 Arabic sentences each, w/ 16 reference translations per Arabic sentence
- also evaluated on randomly sampled dev & test sets: worried the test & dev sets were too similar
- used the Vogel system w/ reordering & future cost estimation
- baseline --> normalize (merge Alif, ta marbuta, ee)
- trained translation parameters for 10 scores (LM, word & phrase count, & 6 translation models)
- used MERT on the dev set; optimized the system (separately) for both BLEU & NIST
Results
- NIST scores: steady improvement w/ better splitting techniques (up to 5% relative)
- improvements statistically significant
- better improvements for NIST than for BLEU: NIST is sensitive to correctly translating certain high-gain words in the test corpus
- unknown-word inflectional splitting technique --> correct translations --> increased score
Unsupervised MA for Finnish, Swedish, & Danish SMT
- used morphological information found in an unsupervised way in SMT
- 3 languages: Danish, Swedish, & Finnish; Danish & Swedish very close to each other
- trained the system on a corpus containing 300,000 sentences from EuroParl; typical IBM model --> translation model & LM
- used morphs as tokens, NOT words; used Morfessor Categories MAP to find morpheme-like units
- works even w/ agglutinative languages, e.g., Finnish
- reasoning: speech recognition using a morph-based vocabulary has been shown to improve results
- used MAP because: 1) it has better segmentation accuracy than Morfessor Baseline or Categories ML, 2) it can handle unseen words
- word = (PRE* STM SUF*)+ (checker sketch below)
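The morphotactic pattern word = (PRE* STM SUF*)+ can be checked mechanically over category-tagged morph sequences; a small sketch with my own encoding, for illustration only:

```python
import re

# Tag sequences are joined w/ '/' so the regex mirrors (PRE* STM SUF*)+.
MORPHOTACTICS = re.compile(r"(?:(?:PRE/)*STM/(?:SUF/)*)+")

def valid_tag_sequence(tags):
    """tags: list like ['PRE', 'STM', 'SUF']; True iff the sequence
    matches (PRE* STM SUF*)+."""
    return bool(MORPHOTACTICS.fullmatch("".join(t + "/" for t in tags)))

print(valid_tag_sequence(["PRE", "STM", "SUF", "STM"]))  # True (compound)
print(valid_tag_sequence(["SUF", "STM"]))                # False
```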
Language Models & Phrases
- used a basic n-gram LM based on sub-word units, NOT words
- used a varigram model --> gets a smaller n-gram model w/o restricting n too much; the model grows incrementally & includes longer contexts only when necessary
- used 3 types of LM: 2 baseline 3-gram & 4-gram models trained w/ the SRILM toolkit, & a 3rd varigram model trained w/ the VariKN toolkit, based on (Siivola, 2007)
- observed --> translation quality improved by translating sequences of words (phrases)
- used Moses --> generalized the phrase-based approach to work w/ morphology; used morphs w/o modifications to Moses
- phrases constructed from morphs similarly as from words; morphs suitable for translating compound words in parts
- morph category info (pre, stm, suf) is part of the morph label; '+' --> not the last morph of a word --> necessary to reconstruct words from morphs in the output (sketch below)
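A minimal sketch of the reconstruction step implied by the '+' marker convention described above (a trailing '+' means the morph is not word-final):

```python
def morphs_to_words(tokens):
    """Rebuild words from decoder output where a trailing '+' marks a
    morph that is not the last morph of its word."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("+"):
            current += tok[:-1]        # non-final morph: keep accumulating
        else:
            words.append(current + tok)
            current = ""
    if current:                        # dangling non-final morph at the end
        words.append(current)
    return words

print(morphs_to_words(["talo+", "i+", "ssa", "on"]))  # ['taloissa', 'on']
```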
Data & Experiments
- ran all experiments w/ Moses & used BLEU to score
- data = European Parliament from 1996-2001 --> stripped the bi-texts of XML tags & converted letters to lowercase
- test --> last 3 months of 2000; dev --> sessions of September 2000; training --> the rest (excluding the above)
- trained Morfessor on the training set & used it to segment the dev & test sets
- created 2 data sets for each alignment pair --> 1 w/ words, 1 w/ morphs
- used the training sets for LM training; used the dev sets for parameter tuning
- the Moses cleaning script removed mis-aligned sentences w/ (filter sketch below):
  a) 0 tokens
  b) too many tokens
  c) a bad token ratio
- test set --> sentences had at least 5 & at most 15 words
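A sketch of a filter in the spirit of the Moses cleaning criteria listed above; the thresholds are illustrative defaults, not the values used in the paper:

```python
def keep_pair(src_tokens, tgt_tokens, max_len=100, max_ratio=9.0):
    """Drop sentence pairs w/ zero tokens, too many tokens, or a bad
    source/target length ratio (the three criteria above)."""
    if not src_tokens or not tgt_tokens:
        return False                                   # a) zero tokens
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False                                   # b) too many tokens
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio       # c) token ratio

print(keep_pair("a b c".split(), "x y".split()))       # True
```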
Results
- morphs shorter than words --> need a longer n-gram to cover the same amount of context info
- the 4-gram LM improves scores over the 3-gram LM for morphs & for words (in 3 of 4 cases) --> use the 4-gram LM
- default phrase length in Moses = 7 --> not long enough for morphs --> increased to 10
- varigram model --> mixed results
- overall --> translation based on morph phrases worse; significantly worse in 2 cases: Finnish-Swedish & Swedish-Finnish
- reasons:
  - only 1 reference translation --> hurts the score
  - Finnish has fewer words for the same text than Swedish or Danish
  - 1 mistake in the suffix of a word --> the whole word counts as an error even if it is understandable
Untranslated Words
- a word-based translation model only translates words present in the training data
- in the data, morphs have a notably lower type count: same vocabulary coverage w/ a smaller # of more frequently occurring units --> reduces the OOV problem
- results --> the morph-based system translated many more sentences fully & translated more words
- a higher # of compound words & inflected word forms was left untranslated by the word-based system
Performance on Baseforms
- translating into Finnish --> both word & morph models have trouble getting grammatical endings right
- the morph-based model translated more words: if words are restored to their baseforms, does the morph-based model improve?
- used FINTWOL (a Finnish morphological analyzer) to produce baseforms for each word in the output of the Swedish-Finnish translation
- 3.3% (word), 2.2% (morph), & 1.8% (ref) of words were not recognized by the MA & left unchanged
- BLEU scores about 5% higher for the modified data; the word-based model still outperformed the morph-based model; no test on other language pairs
Quality of Morfessor’s Segmentation
- selected 500 words from the data randomly & manually segmented them
- precision = proportion of morph boundaries proposed by Morfessor agreeing w/ the linguistic segmentation; recall = proportion of boundaries in the linguistic segmentation found by Morfessor (sketch below)
- segmentation accuracy for Danish & Swedish very similar; Finnish morphology more challenging --> results worse
- precision around 80% for all languages --> 4 of 5 morph boundaries suggested by Morfessor are correct
- prefer high precision --> proposed morph boundaries usually correct; lower recall --> words generally undersegmented (segmentation more conservative)
- difference btwn the standard word representation & Morfessor's segmentation is smaller than the difference btwn words & linguistic morphs
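Boundary precision and recall as defined above can be computed like this (a sketch; boundaries are split positions between characters):

```python
def boundary_prf(proposed, gold):
    """Precision & recall of morph boundaries. Each argument is a list
    of words, each word a list of morphs, aligned word-for-word."""
    def boundaries(word):
        cuts, pos = set(), 0
        for morph in word[:-1]:
            pos += len(morph)
            cuts.add(pos)
        return cuts

    tp = fp = fn = 0
    for p, g in zip(proposed, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp += len(pb & gb)
        fp += len(pb - gb)
        fn += len(gb - pb)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# proposed split vs. linguistic gold: precision 1.0, recall 0.5
print(boundary_prf([["talo", "issa"]], [["talo", "i", "ssa"]]))
```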
Closer Look at Segmentation
- looked for phrases not spanning entire words: at least 1 phrase boundary = a morph boundary w/in a word
- 3 categories:
  1) same structure across languages: compound words common in the 3 languages studied; Danish & Swedish have similar morphological structure; parallel structures when translating to or from Finnish N & V
  2) differing structures across languages: the morph-based model captures these fairly well; need a way to re-order phrases; interesting: Finnish (written) turns V into N
  3) lexicalized forms split into phrases: Swedish & Danish translate the phrase piece by piece even though the phrases may be very short & not morphologically productive
- data: 2/3 of translated sentences btwn Swedish & Danish have at least 1 phrase boundary w/in a word --> only 1/3 in Finnish
Conclusion
- unsupervised MA is flexible --> provides language independence
- generalization ability increased through more refined phrases
- improvements:
  - specialize the alignment algorithm for morphs instead of words
  - rescore translations w/ a word-based LM
  - combine allomorphs of the same morpheme into equivalence classes
  - use factored translation models to combine words & morphs in translation
English-Turkish SMT
- looked at sub-lexical structure b/c a Turkish word aligns to a complete phrase on the English side; the phrase on the English side may be discontinuous
- Turkish: 150 different suffixes & 30,000 root words
- use morphs to alleviate sparseness: abstract away from word-internal details w/ a morph representation; words that appear different on the surface may be similar at the morph level
- Turkish has many more distinct word forms (2X English) but fewer distinct content words
- may overload distortion mechanisms b/c they must account for both the word-internal morph sequence & sentence-level word ordering
- segmentation of a word might not be unique: generate a representation w/ lexical & morphological features for all possible segmentations & interpretations of a word, then disambiguate the analyses w/ a statistical disambiguator using morph features
Exploiting Turkish Morphology
Process the documents:
1) improve statistical alignment: segment words into lexical morphemes to remove differences due to word-internal variation
2) tag the English side w/ TreeTagger (lemma & POS for each word); remove any tags not implying a morpheme or an exceptional form
3) extract the sequence of roots for open-class content words from the morph-segmented data; remove all closed-class words as well as tags signaling morphs on open-class words
- this processing bolsters the training corpus & improves alignment
- goal: align roots w/o additional noise from morphs or function words
Framework & Systems
- used monolingual Turkish text of 100,000 sentences plus the training data for the LM; decoded & rescored n-best lists
- surface words directly recoverable from the concatenated representation of the segmentation
- used a word-based representation for the word-based LM used for rescoring
- used the phrase-based SMT framework (Koehn), the Moses toolkit (Koehn), & the SRILM toolkit (Stolcke)
- evaluated decoded translations w/ BLEU using a single reference translation
- 3 systems: Baseline, fully morphologically segmented model, & selectively segmented model
Baseline System
- trained the model using default Moses parameters w/ the word-based training corpus
- decoded the English test set w/ default decoder parameters & w/ the distortion limit set to unlimited
- also tried the distortion weight set to 0.1 to allow for long-distance distortions
- tried MERT but it did not improve scores
- added content-word data & trained a 2nd baseline model; adding content words hurt performance (16.29 vs. 16.13 & 20.16 vs. 19.77)
Fully Morphologically Segmented Model
- trained the model w/ morphs, both w/ & w/o adding content words
- used a 5-gram morpheme-based LM for decoding; goal: capture local morphotactic constraints & sentence-level ordering of words (at ~2 morphs per word, a 5-gram covers about 2 words)
- decoded 1000-best lists; converted the 1000 sentences into words & rescored w/ a 4-gram word-based LM; goal: enforce more distant word-sequence constraints
- experimented w/ parameters & various linear combos of the word-based LM & translation model w/ tuning
- default decoding parameters used by the Moses decoder gave bad results: English & Turkish word order very different --> need distortion
- allowing longer distortions w/ less penalty --> 7-point BLEU improvement
- adding content words --> 6.2% improvement (no rescoring) via better alignment
- rescoring the 1000-best output w/ the 4-gram word-based LM --> 4% relative improvement (0.79 BLEU points)
- best: allow distortion & rescore --> 1.96 BLEU points (9.4% relative)
Selectively Segmented Model
- analyzed the GIZA++ alignment files: certain morphemes on the Turkish side almost never aligned w/ anything
- derivational morphology on the Turkish side (nominalization, agreement markers, et al.) mostly unaligned
- for those cases, attach the morphemes back to the root (including intervening morphs for V too)
- case morphemes did align w/ prepositions on the English side, so they were left alone
- trained the model w/ added content words & the parameters from the best-scoring model on the previous slide
- 2.43 BLEU points (11% relative) improvement over the best model above
Model Iteration
- used an iterative approach w/ multiple models, like post-editing
- used the selective-segmentation model & decoded the English training & test sets to obtain T1 train & T1 test
- trained the next model on the T1 train & T train data & built a model; aim: T1 < model < T
- the model applied to T1 train & T1 test produces T2 train & T2 test; repeat
- did not include the content-word corpus in these experiments:
  - preliminary experiments showed word-based models perform better than morpheme-based models in later iterations
  - adding content words for word-based models not helpful
- decoded on the original test data using a 3-gram word-based LM; re-ranked the 1000-best outputs using a 4-gram LM
- 2nd iteration: 4.86 (24% relative) improvement in BLEU over the 1st fully morph-segmented model (no rescoring)
Errors & Word Repair
- an error in any translated morpheme or in the morphotactics --> the whole word is incorrect
- 1-gram precision score: almost 65% of root words correct, but only about 52% of full words w/ the best model
- mismatches: poorly formed words; the root is correct but the morphs are not applicable or are in the wrong position
- many mismatches are only 1 morpheme edit distance away from the correct word (distance sketch below)
- solution: utilize a morpheme-level 'spelling corrector' operating on the segmented representations
  - correct forms w/ minor morpheme errors, form a lattice, & use it to rescore for the contextually correct form
- used BLEU+ to investigate: recovering all words 1 or 2 morphs away raises the word BLEU score to 29.86 & 30.48 respectively
- these are oracle scores BUT very close to the root-word BLEU scores
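The "1 or 2 morphemes away" notion is a Levenshtein distance over morpheme sequences rather than characters; a sketch, not the authors' implementation:

```python
def morpheme_edit_distance(a, b):
    """Levenshtein distance over morpheme sequences `a` and `b`
    (lists of morph strings), the building block for the
    morpheme-level 'spelling corrector' idea above."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # delete a morpheme
                          d[i][j - 1] + 1,            # insert a morpheme
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return d[len(a)][len(b)]

# words w/in 1-2 morpheme edits of a reference could be recovered:
print(morpheme_edit_distance(["ev", "ler", "de"], ["ev", "ler", "den"]))  # 1
```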
Other Scoring Methods
- BLEU very harsh on Turkish & the morph-based approach: all-or-none nature of token comparison
- possible to have almost interchangeable words w/ very similar semantics; not an exact match --> BLEU marks them as wrong
- solution: alter the notion of token similarity (match sketch below)
  - use stems & synonyms (METEOR) --> score increases to 25.08
  - use root-word synonymy & Wordnet --> score increases to 25.45
  - combine rules & Wordnet --> score increases to 25.46
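A sketch of the relaxed token match behind these scores: exact match, else a crude shared-stem check, else WordNet synonymy. It assumes NLTK with the WordNet data installed; the stem heuristic is mine, not METEOR's actual stemmer, and this is not the paper's scorer.

```python
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def tokens_match(hyp, ref):
    """Relaxed token match: exact, shared 4-char prefix (crude stem
    check), or a shared WordNet synset (synonymy)."""
    if hyp == ref:
        return True
    if min(len(hyp), len(ref)) >= 4 and hyp[:4] == ref[:4]:
        return True                          # crude stem match
    return bool(set(wn.synsets(hyp)) & set(wn.synsets(ref)))

print(tokens_match("car", "automobile"))     # True via WordNet synonymy
```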
Conclusions
- morphology & rescoring --> significant boost in BLEU score
- other solutions to the morphotactics problem:
  - use the skip LM in the SRILM toolkit; content-word order directly used by the decoder
  - identify morphologically correct OOV words, or words assigned low probability by the LM, using posterior probabilities
  - generate additional 'close' morphological words & construct a lattice that can be rescored