Morphology & Machine Translation
Eric Davis
MT Seminar
02/06/08
Professor Alon Lavie
Professor Stephan Vogel
Outline
Intro
- The Issue at Hand
- Supervised MA
- Unsupervised MA
- Integration of Morphology into MT
Papers
- Morfessor
- Bridging the Inflectional Morphology Gap --> Arabic SMT
- Unsupervised MA w/ Finnish, Swedish, & Danish
- Turkish SMT
Discussion
- The Good
- The Bad
- Future Directions
Q&A
Morfessor
- morpheme segmentation & simple morphology induction algorithm
- utilized Finnish & English data sets used in the Morpho Challenge
- unsupervised method for segmentation of words into morpheme-like units
- idea: propose substrings occurring frequently enough in several different word forms as morphs
- words = concatenation of morphs
- look for optimal balance btwn compactness of the morph lexicon & compactness of the corpus representation
  - very compact lexicon = individual letters --> as many morphs as letters in a word
  - short representation of corpus = whole words --> very large lexicon
- corpus represented as a sequence of pointers to entries in the morph lexicon
- uses a probabilistic framework or MDL to produce segmentations resembling linguistic morpheme segmentation
- 3 'flavors': Baseline, Categories ML, Categories MAP
Morfessor Baseline
- context-independent splitting algorithm
- optimization criterion: maximize P(lexicon) · P(corpus|lexicon) = ∏_α P(α) · ∏_μ P(μ)
- lexicon = all distinct morphs, spelled out as strings of letters
- α ranges over the letters spelling out the morphs: P(lexicon) = product of the probability of each letter α
- corpus --> sequence of morphs, i.e., a particular segmentation of the words in the corpus
- P(corpus|lexicon) = product of the probability of each morph token μ
- letter & morph probs are maximum likelihood estimates (cost-function sketch below)
- 3 error types:
  1) undersegmentation: frequent strings stored whole b/c that is the most concise representation
  2) oversegmentation: infrequent strings best coded in parts
  3) morphotactic violations: b/c the model is context-independent
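To make the criterion concrete, here is a minimal Python sketch of the Baseline cost (negative log probability) of a candidate segmentation, assuming maximum-likelihood letter and morph probabilities as described above; the function and variable names are mine, not Morfessor's API, and real Morfessor adds further terms to the coding length.

```python
import math
from collections import Counter

def baseline_cost(segmented_corpus):
    """Cost (in nats) of a candidate segmentation under the Baseline
    criterion: -log P(lexicon) - log P(corpus | lexicon), with letter
    and morph probabilities estimated by maximum likelihood.
    `segmented_corpus` is a list of words, each a list of morphs."""
    morph_tokens = [m for word in segmented_corpus for m in word]
    morph_counts = Counter(morph_tokens)
    lexicon = list(morph_counts)              # distinct morphs, spelled out

    # P(lexicon): each distinct morph is a string of letters.
    letters = [ch for m in lexicon for ch in m]
    letter_counts = Counter(letters)
    n_letters = len(letters)
    lexicon_cost = -sum(math.log(letter_counts[ch] / n_letters)
                        for m in lexicon for ch in m)

    # P(corpus | lexicon): corpus is a sequence of pointers to morphs.
    n_tokens = len(morph_tokens)
    corpus_cost = -sum(math.log(morph_counts[m] / n_tokens)
                       for m in morph_tokens)
    return lexicon_cost + corpus_cost

# Toy comparison: splitting shared stems & suffixes lowers the total cost.
whole = [["talot"], ["talon"], ["autot"], ["auton"]]
split = [["talo", "t"], ["talo", "n"], ["auto", "t"], ["auto", "n"]]
print(baseline_cost(whole), baseline_cost(split))   # split is cheaper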
Morfessor Categories ML
- introduce morph categories; model them w/ an HMM
  - transition probabilities between categories
  - emission probabilities of morphs from categories
- 4 categories, assigned using properties of morphs in the proposed segmentation:
  - prefix: morph preceding a large # of different morphs (high right perplexity; sketch below)
  - stem: morph that is not very short
  - suffix: morph following a large # of different morphs (high left perplexity)
  - noise: morph that is not an obvious prefix, suffix, or stem in the position it occurs in
- use heuristics & the noise category to remove some errors of the Baseline:
  - split redundant morphs in the lexicon to reduce undersegmentation
  - prohibit splitting into 'noise' & join morphs tagged as noise w/ their neighbors to reduce oversegmentation
  - introduce context sensitivity (the HMM) to reduce morphotactic violations
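As an illustration of the perplexity idea, a sketch of right perplexity (the mirror-image left perplexity flags suffixes); this is my illustration, not Morfessor's actual code.

```python
import math
from collections import Counter

def right_perplexity(segmented_corpus, morph):
    """Perplexity of the distribution of morphs immediately following
    `morph` in a segmented corpus (a list of words, each a list of
    morphs). High right perplexity suggests prefix-like behaviour."""
    followers = Counter()
    for word in segmented_corpus:
        for a, b in zip(word, word[1:]):
            if a == morph:
                followers[b] += 1
    total = sum(followers.values())
    if total == 0:
        return 1.0
    entropy = -sum((c / total) * math.log(c / total)
                   for c in followers.values())
    return math.exp(entropy)

corpus = [["un", "do"], ["un", "tie"], ["un", "fold"], ["re", "do"]]
print(right_perplexity(corpus, "un"))   # 3.0: 'un' precedes 3 distinct morphs
```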
Morfessor Categories MAP
- 2 probabilities calculated: P(lexicon) & P(representation of corpus conditioned on lexicon)
- frequent strings still represented as whole words in the lexicon, but they now have a hierarchical representation
- a morph --> either a string of letters or 2 sub-morphs (sketch below)
- expand morphs into their sub-morphs to avoid undersegmentation
- do not expand nodes in the tree if the next level = 'noise', to avoid oversegmentation
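A sketch of the hierarchical expansion idea: a morph is either a leaf (string of letters) or two sub-morphs, and expansion stops when a child is tagged 'noise'. The data structures here are hypothetical stand-ins, not Morfessor's internals.

```python
def expand(morph, subparts, categories):
    """Recursively expand a hierarchical morph into its surface morphs.
    `subparts` maps a morph to its (left, right) sub-morphs, if any;
    `categories` maps a morph to its category tag. A node is kept whole
    when one of its children is tagged 'noise'."""
    if morph not in subparts:
        return [morph]                      # leaf: a plain string of letters
    left, right = subparts[morph]
    if categories.get(left) == "noise" or categories.get(right) == "noise":
        return [morph]                      # do not expand into noise
    return (expand(left, subparts, categories)
            + expand(right, subparts, categories))

subparts = {"taloissa": ("talo", "issa"), "issa": ("i", "ssa")}
categories = {"talo": "stem", "i": "suffix", "ssa": "suffix"}
print(expand("taloissa", subparts, categories))  # ['talo', 'i', 'ssa']
```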
Experiments & Results
- Baseline entirely unsupervised; ML & MAP not fully unsupervised: a perplexity threshold is optimized separately for the 3 languages
- ran the 3 models on the Challenge data: ML & MAP > Baseline overall, but Baseline did best on English
- MAP had much higher precision than the other models BUT lower recall; MAP & ML gave a great improvement in recall BUT lower precision
- explanation: different complexities of morphology
  - Turkish/Finnish: high type/token ratio; word formation --> concatenation of morphemes; so the proportion of frequently occurring word forms is lower
  - English: word formation --> fewer morphemes; so the proportion of frequently occurring word forms is higher
BAMA & Arabic MT
- take advantage of source & target language context when conducting MA
- preprocess data w/ BAMA (the Buckwalter Arabic Morphological Analyzer)
- morphological analysis at the word level: analyze a word --> return all possible segmentations of the word
- segmentations --> prefixes, stems, suffixes
- built-in word-based heuristics --> rank the candidates
- gloss info provided by BAMA's manually constructed lexicon
- 3 methods of analysis: 1) BAMA only, 2) BAMA & context, 3) BAMA & corresponding match

BAMA & Arabic MT: 3 Methods of Analysis
1) BAMA only
- replace each Arabic word w/ the 1st possible split returned by BAMA

2) BAMA & context
- take full advantage of the gloss info provided by BAMA's lexicon
- each split = a particular prefix, stem, & suffix existing in the lexicon
- set of possible translations (glosses) for each fragment
- select the split for a source word using context: winner = split w/ the most target-side matches in the translation of the full sentence (sketch below)
- save the choice of split & use it for all occurrences of that surface form in training & testing

3) BAMA & corresponding match
- Arabic info in the surface form that is not present in English is confusing for word alignment unless those fragments are assigned to null
- remove fragments w/ lexical info not present in English
- find them b/c their English translations in the BAMA lexicon are empty
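A sketch of the context-based selection in method 2: among the candidate splits BAMA returns, pick the one whose fragment glosses overlap most with the target-side translation of the full sentence. The data structures are hypothetical stand-ins for BAMA's output, not its real interface.

```python
def pick_split(splits, target_sentence):
    """Choose the BAMA split w/ the most target-side gloss matches.
    `splits` is a list of candidate splits; each split is a list of
    fragments, each fragment carrying a set of English glosses from
    BAMA's lexicon. (Hypothetical structures, for illustration.)"""
    target_words = set(target_sentence.lower().split())

    def matches(split):
        return sum(1 for fragment_glosses in split
                   if target_words & {g.lower() for g in fragment_glosses})

    return max(splits, key=matches)

splits = [
    [{"book"}, {"his"}],      # split 1: two fragments w/ their glosses
    [{"books"}],              # split 2: unsegmented alternative
]
print(pick_split(splits, "I read his book yesterday"))  # picks split 1
```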
BAMA & Arabic MT: Data & System

- data: BTEC IWSLT05 Arabic language data; 20,000 Arabic/English sentence pairs (training)
- DevSet/Test05: 500 Arabic sentences each, w/ 16 reference translations per Arabic sentence
- also evaluated on randomly sampled dev & test sets: worried the test & dev sets were too similar
- used the Vogel system w/ reordering & future cost estimation
- baseline --> normalize (merge Alif, ta marbuta, ee)
- trained translation parameters for 10 scores (LM, word & phrase count, & 6 translation models)
- used MERT on the dev set; optimized the system (separately) for both BLEU & NIST
Results
- NIST scores: steady improvement w/ better splitting techniques (up to 5% relative)
- improvements statistically significant
- better improvements for NIST than for BLEU: NIST is sensitive to correctly translating certain high-gain words in the test corpus
- unknown-word inflectional splitting technique --> correct translations --> increased score
Unsupervised MA for Finnish, Swedish, & Danish SMT
- used morphological information found in an unsupervised way in SMT
- 3 languages: Danish, Swedish, & Finnish; Danish & Swedish very close to each other
- trained the system on a corpus containing 300,000 sentences from EuroParl; typical IBM model --> translation model & LM
- used morphs as tokens, NOT words; used Morfessor Categories MAP to find morpheme-like units
- works even w/ agglutinative languages, e.g., Finnish
- reasoning: speech recognition using a morph-based vocabulary has been shown to improve results
- used MAP because: 1) it has better segmentation accuracy than Morfessor Baseline or Categories ML, 2) it can handle unseen words
- word = (PRE* STM SUF*)+ (checker sketch below)
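The morphotactic pattern word = (PRE* STM SUF*)+ can be checked mechanically over category-tagged morph sequences; a small sketch with my own encoding, for illustration only:

```python
import re

# Tag sequences are joined w/ '/' so the regex mirrors (PRE* STM SUF*)+.
MORPHOTACTICS = re.compile(r"(?:(?:PRE/)*STM/(?:SUF/)*)+")

def valid_tag_sequence(tags):
    """tags: list like ['PRE', 'STM', 'SUF']; True iff the sequence
    matches (PRE* STM SUF*)+."""
    return bool(MORPHOTACTICS.fullmatch("".join(t + "/" for t in tags)))

print(valid_tag_sequence(["PRE", "STM", "SUF", "STM"]))  # True (compound)
print(valid_tag_sequence(["SUF", "STM"]))                # False
```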
Language Models & Phrases
- used a basic n-gram LM based on sub-word units, NOT words
- used a varigram model --> gets a smaller n-gram model w/o restricting n too much; the model grows incrementally & includes longer contexts only when necessary
- used 3 types of LM: 2 baseline 3-gram & 4-gram models trained w/ the SRILM toolkit, & a 3rd varigram model trained w/ the VariKN toolkit, based on (Siivola, 2007)
- observed --> translation quality improved by translating sequences of words (phrases)
- used Moses --> generalized the phrase-based approach to work w/ morphology; used morphs w/o modifications to Moses
- phrases constructed from morphs similarly as from words; morphs suitable for translating compound words in parts
- morph category info (pre, stm, suf) is part of the morph label; '+' --> not the last morph of a word --> necessary to reconstruct words from morphs in the output (sketch below)
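A minimal sketch of the reconstruction step implied by the '+' marker convention described above (a trailing '+' means the morph is not word-final):

```python
def morphs_to_words(tokens):
    """Rebuild words from decoder output where a trailing '+' marks a
    morph that is not the last morph of its word."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("+"):
            current += tok[:-1]        # non-final morph: keep accumulating
        else:
            words.append(current + tok)
            current = ""
    if current:                        # dangling non-final morph at the end
        words.append(current)
    return words

print(morphs_to_words(["talo+", "i+", "ssa", "on"]))  # ['taloissa', 'on']
```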
Data & Experiments
- ran all experiments w/ Moses & used BLEU to score
- data = European Parliament from 1996-2001 --> stripped the bi-texts of XML tags & converted letters to lowercase
- test --> last 3 months of 2000; dev --> sessions of September 2000; training --> the rest (excluding the above)
- trained Morfessor on the training set & used it to segment the dev & test sets
- created 2 data sets for each alignment pair --> 1 w/ words, 1 w/ morphs
- used the training sets for LM training; used the dev sets for parameter tuning
- the Moses cleaning script removed mis-aligned sentences w/ (filter sketch below):
  a) 0 tokens
  b) too many tokens
  c) a bad token ratio
- test set --> sentences had at least 5 & at most 15 words
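A sketch of a filter in the spirit of the Moses cleaning criteria listed above; the thresholds are illustrative defaults, not the values used in the paper:

```python
def keep_pair(src_tokens, tgt_tokens, max_len=100, max_ratio=9.0):
    """Drop sentence pairs w/ zero tokens, too many tokens, or a bad
    source/target length ratio (the three criteria above)."""
    if not src_tokens or not tgt_tokens:
        return False                                   # a) zero tokens
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False                                   # b) too many tokens
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio       # c) token ratio

print(keep_pair("a b c".split(), "x y".split()))       # True
```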
Results
- morphs shorter than words --> need a longer n-gram to cover the same amount of context info
- the 4-gram LM improves scores over the 3-gram LM for morphs & for words (in 3 of 4 cases) --> use the 4-gram LM
- default phrase length in Moses = 7 --> not long enough for morphs --> increased to 10
- varigram model --> mixed results
- overall --> translation based on morph phrases worse; significantly worse in 2 cases: Finnish-Swedish & Swedish-Finnish
- reasons:
  - only 1 reference translation --> hurts the score
  - Finnish has fewer words for the same text than Swedish or Danish
  - 1 mistake in the suffix of a word --> the whole word counts as an error even if it is understandable
Untranslated Words
- a word-based translation model only translates words present in the training data
- in the data, morphs have a notably lower type count: same vocabulary coverage w/ a smaller # of more frequently occurring units --> reduces the OOV problem
- results --> the morph-based system translated many more sentences fully & translated more words
- a higher # of compound words & inflected word forms was left untranslated by the word-based system
Performance on Baseforms
- translating into Finnish --> both word & morph models have trouble getting grammatical endings right
- the morph-based model translated more words: if words are restored to their baseforms, does the morph-based model improve?
- used FINTWOL (a Finnish morphological analyzer) to produce baseforms for each word in the output of the Swedish-Finnish translation
- 3.3% (word), 2.2% (morph), & 1.8% (ref) of words were not recognized by the MA & left unchanged
- BLEU scores about 5% higher for the modified data; the word-based model still outperformed the morph-based model; no test on other language pairs
Quality of Morfessor’s Segmentation
- selected 500 words from the data randomly & manually segmented them
- precision = proportion of morph boundaries proposed by Morfessor agreeing w/ the linguistic segmentation; recall = proportion of boundaries in the linguistic segmentation found by Morfessor (sketch below)
- segmentation accuracy for Danish & Swedish very similar; Finnish morphology more challenging --> results worse
- precision around 80% for all languages --> 4 of 5 morph boundaries suggested by Morfessor are correct
- prefer high precision --> proposed morph boundaries usually correct; lower recall --> words generally undersegmented (segmentation more conservative)
- difference btwn the standard word representation & Morfessor's segmentation is smaller than the difference btwn words & linguistic morphs
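Boundary precision and recall as defined above can be computed like this (a sketch; boundaries are split positions between characters):

```python
def boundary_prf(proposed, gold):
    """Precision & recall of morph boundaries. Each argument is a list
    of words, each word a list of morphs, aligned word-for-word."""
    def boundaries(word):
        cuts, pos = set(), 0
        for morph in word[:-1]:
            pos += len(morph)
            cuts.add(pos)
        return cuts

    tp = fp = fn = 0
    for p, g in zip(proposed, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp += len(pb & gb)
        fp += len(pb - gb)
        fn += len(gb - pb)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# proposed split vs. linguistic gold: precision 1.0, recall 0.5
print(boundary_prf([["talo", "issa"]], [["talo", "i", "ssa"]]))
```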
Closer Look at Segmentation
- looked for phrases not spanning entire words: at least 1 phrase boundary = a morph boundary w/in a word
- 3 categories:
  1) same structure across languages: compound words common in the 3 languages studied; Danish & Swedish have similar morphological structure; parallel structures when translating to or from Finnish N & V
  2) differing structures across languages: the morph-based model captures these fairly well; need a way to re-order phrases; interesting: Finnish (written) turns V into N
  3) lexicalized forms split into phrases: Swedish & Danish translate the phrase piece by piece even though the phrases may be very short & not morphologically productive
- data: 2/3 of translated sentences btwn Swedish & Danish have at least 1 phrase boundary w/in a word --> only 1/3 in Finnish
Conclusion
- unsupervised MA is flexible --> provides language independence
- generalization ability increased through more refined phrases
- improvements:
  - specialize the alignment algorithm for morphs instead of words
  - rescore translations w/ a word-based LM
  - combine allomorphs of the same morpheme into equivalence classes
  - use factored translation models to combine words & morphs in translation
English-Turkish SMT
- looked at sub-lexical structure b/c a Turkish word aligns to a complete phrase on the English side; the phrase on the English side may be discontinuous
- Turkish: 150 different suffixes & 30,000 root words
- use morphs to alleviate sparseness: abstract away from word-internal details w/ a morph representation; words that appear different on the surface may be similar at the morph level
- Turkish has many more distinct word forms (2X English) but fewer distinct content words
- may overload distortion mechanisms b/c they must account for both the word-internal morph sequence & sentence-level word ordering
- segmentation of a word might not be unique: generate a representation w/ lexical & morphological features for all possible segmentations & interpretations of a word, then disambiguate the analyses w/ a statistical disambiguator using morph features
Exploiting Turkish Morphology
Process the documents:
1) improve statistical alignment: segment words into lexical morphemes to remove differences due to word-internal variation
2) tag the English side w/ TreeTagger (lemma & POS for each word); remove any tags not implying a morpheme or an exceptional form
3) extract the sequence of roots for open-class content words from the morph-segmented data; remove all closed-class words as well as tags signaling morphs on open-class words
- this processing bolsters the training corpus & improves alignment
- goal: align roots w/o additional noise from morphs or function words
Framework & Systems
- used monolingual Turkish text of 100,000 sentences plus the training data for the LM; decoded & rescored n-best lists
- surface words directly recoverable from the concatenated representation of the segmentation
- used a word-based representation for the word-based LM used for rescoring
- used the phrase-based SMT framework (Koehn), the Moses toolkit (Koehn), & the SRILM toolkit (Stolcke)
- evaluated decoded translations w/ BLEU using a single reference translation
- 3 systems: Baseline, fully morphologically segmented model, & selectively segmented model
Baseline System
- trained the model using default Moses parameters w/ the word-based training corpus
- decoded the English test set w/ default decoder parameters & w/ the distortion limit set to unlimited
- also tried the distortion weight set to 0.1 to allow for long-distance distortions
- tried MERT but it did not improve scores
- added content-word data & trained a 2nd baseline model; adding content words hurt performance (16.29 vs. 16.13 & 20.16 vs. 19.77)
Fully Morphologically Segmented Model
- trained the model w/ morphs, both w/ & w/o adding content words
- used a 5-gram morpheme-based LM for decoding; goal: capture local morphotactic constraints & sentence-level ordering of words (at ~2 morphs per word, a 5-gram covers about 2 words)
- decoded 1000-best lists; converted the 1000 sentences into words & rescored w/ a 4-gram word-based LM; goal: enforce more distant word-sequence constraints
- experimented w/ parameters & various linear combos of the word-based LM & translation model w/ tuning
- default decoding parameters used by the Moses decoder gave bad results: English & Turkish word order very different --> need distortion
- allowing longer distortions w/ less penalty --> 7-point BLEU improvement
- adding content words --> 6.2% improvement (no rescoring) via better alignment
- rescoring the 1000-best output w/ the 4-gram word-based LM --> 4% relative improvement (0.79 BLEU points)
- best: allow distortion & rescore --> 1.96 BLEU points (9.4% relative)
Selectively Segmented Model
- analyzed the GIZA++ alignment files: certain morphemes on the Turkish side almost never aligned w/ anything
- derivational morphology on the Turkish side (nominalization, agreement markers, et al.) mostly unaligned
- for those cases, attach the morphemes back to the root (including intervening morphs for V too)
- case morphemes did align w/ prepositions on the English side, so they were left alone
- trained the model w/ added content words & the parameters from the best-scoring model on the previous slide
- 2.43 BLEU points (11% relative) improvement over the best model above
Model Iteration
- used an iterative approach w/ multiple models, like post-editing
- used the selective-segmentation model & decoded the English training & test sets to obtain T1 train & T1 test
- trained the next model on the T1 train & T train data & built a model; aim: T1 < model < T
- the model applied to T1 train & T1 test produces T2 train & T2 test; repeat
- did not include the content-word corpus in these experiments:
  - preliminary experiments showed word-based models perform better than morpheme-based models in later iterations
  - adding content words for word-based models not helpful
- decoded on the original test data using a 3-gram word-based LM; re-ranked the 1000-best outputs using a 4-gram LM
- 2nd iteration: 4.86 (24% relative) improvement in BLEU over the 1st fully morph-segmented model (no rescoring)
Errors & Word Repair
- an error in any translated morpheme or in the morphotactics --> the whole word is incorrect
- 1-gram precision score: almost 65% of root words correct, but only about 52% of full words w/ the best model
- mismatches: poorly formed words; the root is correct but the morphs are not applicable or are in the wrong position
- many mismatches are only 1 morpheme edit distance away from the correct word (distance sketch below)
- solution: utilize a morpheme-level 'spelling corrector' operating on the segmented representations
  - correct forms w/ minor morpheme errors, form a lattice, & use it to rescore for the contextually correct form
- used BLEU+ to investigate: recovering all words 1 or 2 morphs away raises the word BLEU score to 29.86 & 30.48 respectively
- these are oracle scores BUT very close to the root-word BLEU scores
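The "1 or 2 morphemes away" notion is a Levenshtein distance over morpheme sequences rather than characters; a sketch, not the authors' implementation:

```python
def morpheme_edit_distance(a, b):
    """Levenshtein distance over morpheme sequences `a` and `b`
    (lists of morph strings), the building block for the
    morpheme-level 'spelling corrector' idea above."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,            # delete a morpheme
                          d[i][j - 1] + 1,            # insert a morpheme
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitute
    return d[len(a)][len(b)]

# words w/in 1-2 morpheme edits of a reference could be recovered:
print(morpheme_edit_distance(["ev", "ler", "de"], ["ev", "ler", "den"]))  # 1
```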
Other Scoring Methods
- BLEU very harsh on Turkish & the morph-based approach: all-or-none nature of token comparison
- possible to have almost interchangeable words w/ very similar semantics; not an exact match --> BLEU marks them as wrong
- solution: alter the notion of token similarity (match sketch below)
  - use stems & synonyms (METEOR) --> score increases to 25.08
  - use root-word synonymy & Wordnet --> score increases to 25.45
  - combine rules & Wordnet --> score increases to 25.46
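A sketch of the relaxed token match behind these scores: exact match, else a crude shared-stem check, else WordNet synonymy. It assumes NLTK with the WordNet data installed; the stem heuristic is mine, not METEOR's actual stemmer, and this is not the paper's scorer.

```python
# Requires: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def tokens_match(hyp, ref):
    """Relaxed token match: exact, shared 4-char prefix (crude stem
    check), or a shared WordNet synset (synonymy)."""
    if hyp == ref:
        return True
    if min(len(hyp), len(ref)) >= 4 and hyp[:4] == ref[:4]:
        return True                          # crude stem match
    return bool(set(wn.synsets(hyp)) & set(wn.synsets(ref)))

print(tokens_match("car", "automobile"))     # True via WordNet synonymy
```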
Conclusions
- morphology & rescoring --> significant boost in BLEU score
- other solutions to the morphotactics problem:
  - use the skip LM in the SRILM toolkit; content-word order directly used by the decoder
  - identify morphologically correct OOV words, or words assigned low probability by the LM, using posterior probabilities
  - generate additional 'close' morphological words & construct a lattice that can be rescored