Grammatical Machine Translation
Stefan Riezler & John Maxwell
Overview
1. Introduction
2. Extracting F-Structure Snippets
3. Parsing-Transfer-Generation
4. Statistical Models and Training
5. Experimental Evaluation
6. Discussion
Section 1: Introduction
Introduction
Recent approaches to SMT use
• Phrase-based SMT
• Syntactic knowledge
Phrase-based SMT is great for
• Local ordering
• Short idiomatic expressions
But not so good for
• Learning long-distance dependencies (LDDs)
• Generalising to unseen phrases that share non-overt linguistic info
Statistical Parsers
Statistical parsers can provide information to
• Resolve LDDs
• Generalise to unseen phrases that share non-overt linguistic info
Examples:
• Xia & McCord 2004
• Collins et al. 2005
• Lin 2004
• Ding & Palmer 2005
• Quirk et al. 2005
Grammar-based Generation
Could grammar-based generation be useful for MT?
Quirk et al. 2005
• Simple statistical model outperforms the grammar-based generator of Menezes & Richardson 2001 on BLEU score
Charniak et al. 2003
• Parsing-based language modelling can improve grammaticality of translations while not improving BLEU score
Perhaps BLEU score is not a sufficient way to test for grammaticality.
Further investigation is needed.
Grammatical Machine Translation
Aim: Investigate incorporating a grammar-based generator into a dependency-based SMT system
The authors present:
• A dependency-based SMT model
• Statistical components that are modelled on the phrase-based system of Koehn et al. 2003
Also used:
• Component weights adjusted using minimum error rate (MER) training (Och 2003)
• Grammar-based generator
• N-gram and distortion models
Section 2: Extracting F-Structure Snippets
Extracting F-Structure Snippets
Source-language (SL) and target-language (TL) sentences of the bilingual corpus are parsed using LFG grammars
For each English and German f-structure pair
• The two f-structures that most preserve dependencies are selected
• Many-to-many word alignments are used to create many-to-many correspondences between the substructures
• Correspondences are the basis for deciding what goes into the basic transfer rule
Extracting F-Structure Snippets: Example
German: Dafür bin ich zutiefst dankbar
Gloss: <for that> <am> <I> <deepest> <thankful>
English: I have a deep appreciation for that
Many-to-many bidirectional word alignment:
Transfer Rule Extraction: Example
From the aligned words we get the following substructure correspondences:
Transfer Rule Extraction: Example
From the correspondences two kinds of transfer rules are extracted:
1. Primitive Transfer Rules
2. Complex Transfer Rules
Transfer Contiguity Constraint
1. Source and target f-structures are each connected.
2. F-structures in the transfer source can only be aligned with f-structures in the transfer target and vice versa.
Transfer Rule Extraction: Example
Primitive Rule 1:
pred(X1, sein)    ==>  pred(X1, have)
subj(X1, X2)           subj(X1, X2)
xcomp(X1, X3)          obj(X1, X3)
Transfer Rule Extraction: Example
Primitive Rule 2:
pred(X1, ich)     ==>  pred(X1, I)
Transfer Rule Extraction: Example
Primitive Rule 3:
pred(X1, dafür)   ==>  pred(X1, for)
                       obj(X1, X2)
                       pred(X2, that)
Transfer Rule Extraction: Example
Primitive Rule 4:
pred(X1, dankbar)    ==>  pred(X1, appreciation)
adj(X1, X2)               spec(X1, X2)
in_set(X3, X2)            pred(X2, a)
pred(X3, zutiefst)        adj(X1, X3)
                          in_set(X4, X3)
                          pred(X4, deep)
Transfer Rule Extraction: Example
Complex Transfer Rules
• Primitive transfer rules that are adjacent in the f-structure are combined to form more complex rules
Example (rules 1 & 2 above):
pred(X1, sein)    ==>  pred(X1, have)
subj(X1, X2)           subj(X1, X2)
pred(X2, ich)          pred(X2, I)
xcomp(X1, X3)          obj(X1, X3)
In the worst case, there can be an exponential number of combinations of primitive transfer rules. The number of primitive rules used to form a complex rule is therefore restricted to 3, which keeps the number of extracted transfer rules at O(n^2) in the worst case.
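The combination step can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes primitive rules are identified by integers and that an adjacency map says which rules touch in the f-structure. The map used here is hypothetical, apart from rules 1 and 2 being adjacent as in the example. The sketch enumerates every connected group of two or three primitive rules, which is the cap described above.

```python
from itertools import combinations


def connected(subset, adjacency):
    """Return True if the rules in `subset` form one connected group."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, ()):
            if neighbour in subset and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == subset


def complex_rule_groups(adjacency, max_size=3):
    """Enumerate connected groups of 2..max_size primitive rules."""
    rules = sorted(adjacency)
    groups = []
    for size in range(2, max_size + 1):
        for subset in combinations(rules, size):
            if connected(subset, adjacency):
                groups.append(subset)
    return groups


# Hypothetical adjacency: rule 1 touches rules 2, 3 and 4; nothing else touches.
adjacency = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1}}
groups = complex_rule_groups(adjacency)
print(groups)
```

Each returned group would then be merged into one complex rule by taking the union of the source sides and the union of the target sides of its members.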
Section 3: Parsing-Transfer-Generation
Parsing
• LFG grammars used to parse source and target text
• A FRAGMENT grammar augments the standard grammar to increase robustness
• Correct parse determined by the fewest-chunk method
Transfer
• Rules applied to source f-structure non-deterministically and in parallel
• Each fact of German f-structure translated by exactly one transfer rule
• Default rule included that allows any fact to be translated as itself
• Chart used to encode translations
• Beam search decoding used to select the most probable translations
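A toy version of the beam search idea is sketched below. It is an illustration under strong assumptions, not the system's decoder: the real search runs over a packed transfer chart and scores hypotheses with the statistical features described later, whereas this sketch assumes a flat list of "slots", each with alternative translations and independent log-probabilities. The function name, the slot layout, and all scores are hypothetical.

```python
import heapq


def beam_search(options_per_slot, beam_size=20):
    """Keep the `beam_size` best partial hypotheses after each slot.

    options_per_slot: list of lists of (translation, log_prob) pairs.
    Returns a list of (score, chosen_translations), best first.
    """
    beam = [(0.0, [])]
    for options in options_per_slot:
        expanded = [
            (score + log_prob, chosen + [translation])
            for score, chosen in beam
            for translation, log_prob in options
        ]
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return beam


# Hypothetical alternatives and scores for three source predicates.
options = [
    [("have", -0.2), ("be", -1.6)],
    [("I", -0.1)],
    [("appreciation", -0.7), ("thankful", -0.9)],
]
beam = beam_search(options, beam_size=2)
best_score, best_choice = beam[0]
print(best_choice, best_score)
```

With independent additive scores any beam width finds the best combination; pruning only matters once scores depend on neighbouring choices, as they do with language model features.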
Generation
Method of generation has to be fault tolerant
• Transfer system can be given a fragmentary parse as input
• Transfer system can output an invalid f-structure
• Unknown predicates
– Default morphology used to inflect source stem for English
• Unknown structures
– Default grammar used that allows any attribute to be generated in any order with any category
Section 4: Statistical Models & Training
Statistical Components
Modelled on the statistical components of Pharaoh
Pharaoh integrates 8 statistical models
1. Relative frequency of phrase translations in source-to-target
2. Relative frequency of phrase translations in target-to-source
3. Lexical weighting in source-to-target
4. Lexical weighting in target-to-source
5. Phrase count
6. Language model probability
7. Word count
8. Distortion probability
Statistical Components
The following statistics are computed for each translation:
1. Log-probability of source-to-target transfer rules, where the probability r(e|f) of a rule that transfers source snippet f into target snippet e is estimated by relative frequency
2. Log-probability of target-to-source rules
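A relative-frequency estimate of this kind can be sketched as below. The sketch assumes the usual definition, r(e|f) = count(f ==> e) divided by the total count of rules with source snippet f; the transcript does not show the formula itself, so treat that as an assumption. Snippets are stood in for by plain strings, and the toy rule instances are hypothetical.

```python
import math
from collections import Counter, defaultdict


def relative_frequencies(rule_instances):
    """Estimate r(e|f) from a list of observed (f, e) rule instances."""
    pair_counts = Counter(rule_instances)
    source_totals = defaultdict(int)
    for (f, _e), count in pair_counts.items():
        source_totals[f] += count
    return {
        (f, e): count / source_totals[f]
        for (f, e), count in pair_counts.items()
    }


# Hypothetical extracted rule instances (source snippet, target snippet).
instances = [("sein", "have"), ("sein", "have"), ("sein", "be"), ("ich", "I")]
r = relative_frequencies(instances)
log_r = {pair: math.log(p) for pair, p in r.items()}
print(r[("sein", "have")], log_r[("ich", "I")])
```

The target-to-source statistic would use the same function with the pairs swapped.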
Statistical Components
3. Log-probability of lexical translations from source to target snippets, estimated from Viterbi alignments â between source word positions i = 1, …, n and target word positions j = 1, …, m for stems f_i and e_j in snippets f and e with relative word translation frequencies t(e_j|f_i)
4. Log-probability of lexical translations from target to source snippets
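The lexical statistic can be illustrated with the lexical weighting of Koehn et al. 2003, on which these components are modelled: for each target stem, average t(e_j|f_i) over the source stems aligned to it, then multiply over target positions. The transcript does not show the exact formula, so this is a sketch under that assumption. Positions are 0-based in the code, unaligned target stems are scored against a "NULL" source token (also an assumption), and all probabilities are hypothetical.

```python
import math


def lexical_weight(source, target, alignment, t):
    """Log lexical weight of `target` stems given `source` stems.

    alignment: set of (i, j) links, source position i to target position j.
    t: dict mapping (target_stem, source_stem) to a translation frequency.
    """
    log_weight = 0.0
    for j, e_j in enumerate(target):
        linked = [source[i] for (i, jj) in alignment if jj == j]
        if not linked:
            linked = ["NULL"]  # assumption: unaligned words pair with NULL
        average = sum(t.get((e_j, f_i), 0.0) for f_i in linked) / len(linked)
        if average == 0.0:
            return float("-inf")
        log_weight += math.log(average)
    return log_weight


# Hypothetical word translation frequencies t(e|f).
t = {
    ("for", "dafür"): 0.5,
    ("that", "dafür"): 0.4,
    ("appreciation", "zutiefst"): 0.2,
    ("appreciation", "dankbar"): 0.6,
}
w1 = lexical_weight(["dafür"], ["for", "that"], {(0, 0), (0, 1)}, t)
w2 = lexical_weight(["zutiefst", "dankbar"], ["appreciation"], {(0, 0), (1, 0)}, t)
print(w1, w2)
```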
Statistical Components
5. Number of transfer rules
6. Number of transfer rules with frequency 1
7. Number of default transfer rules
8. Log-probability of strings of predicates from root to frontier of target f-structure, estimated from predicate trigrams of English
9. Number of predicates in target language
10. Number of constituent movements during generation based on the original order of the head predicates of the constituents (for example, AP[2] BP[3] CP[1] counts as two movements since the head predicate of CP moved from first to third position)
Statistical Components
11. Number of generation repairs
12. Log-probability of target string as computed by trigram language model
13. Number of words in target string
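The movement count in feature 10 is only described by its example, so the sketch below shows one plausible reading rather than the authors' definition: count the pairs of constituents whose relative order differs from the original order of their head predicates (an inversion count). This reading is an assumption; it is chosen because it yields two movements for AP[2] BP[3] CP[1].

```python
def count_movements(original_positions):
    """Count constituent pairs that appear in swapped relative order.

    original_positions: for each generated constituent, left to right,
    the original position of its head predicate.
    """
    n = len(original_positions)
    return sum(
        1
        for a in range(n)
        for b in range(a + 1, n)
        if original_positions[a] > original_positions[b]
    )


# AP[2] BP[3] CP[1]: CP's head predicate started first and ends up third.
movements = count_movements([2, 3, 1])
print(movements)
```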
• 1 – 10 are used to choose the most probable parse from the transfer chart
• 1 – 7 are tests on source and target f-structure snippets related via transfer rules
• 8 – 10 are language model and distortion features on the target c- and f-structures
• 11 – 13 are computed on the strings that are generated from the target f-structure
The statistics are combined into a log-linear model whose parameters are adjusted by minimum error rate training.
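A log-linear combination simply scores each candidate by a weighted sum of its feature values and picks the highest-scoring one. The sketch below shows only that scoring step; the feature names, values, and weights are hypothetical, and the weights are fixed by hand rather than tuned by minimum error rate training.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (log-probabilities and counts)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Hypothetical weights; in the system these would come from MER training.
weights = {"rule_logprob": 1.0, "default_rules": -1.0, "lm_logprob": 0.5}
candidates = [
    {"text": "A", "features": {"rule_logprob": -2.0, "default_rules": 1, "lm_logprob": -10.0}},
    {"text": "B", "features": {"rule_logprob": -3.0, "default_rules": 0, "lm_logprob": -9.0}},
]
best = best_candidate(candidates, weights)
print(best["text"], loglinear_score(best["features"], weights))
```

Here candidate A scores -8.0 and candidate B scores -7.5, so B is selected even though its rule log-probability is lower.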
Section 5: Experimental Evaluation
Experimental Evaluation
• Europarl German to English
• Sentences of length 5 – 15 words
Training set: 163,141 sentences
Development set: 1,967 sentences
Test set: 1,755 sentences (same as Koehn et al. 2003)
• Bidirectional word alignment created from the word alignment of IBM Model 4 as implemented by GIZA++ (Och et al. 1999)
• Grammars achieve 100% coverage on unseen data
– 80% as full parses
– 20% as fragment parses
• 700,000 transfer rules extracted
• For language modelling, the trigram model of Stolcke 2002 is used
Experimental Evaluation
For translating the test set
• 1 parse for each German sentence was used
• 10 transferred f-structures
• 1,000 generated strings for each transferred f-structure
• The most probable target f-structure is obtained by a beam search on the transfer chart using features 1-10 above, with a beam size of 20.
• Features 11-13 are computed on the strings that are generated
Experimental Evaluation
• For automatic evaluation they used NIST combined with the approximate randomization test (Noreen, 1999)
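NIST scores (* marks results in the same row that are not significantly different from each other):
                     IBM Model 4   LFG     Phrase-based SMT
In-coverage (44%)    5.13          *5.82   *5.99
Full test set        *5.57         *5.62   6.40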
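The approximate randomization test used for the automatic evaluation can be sketched as follows. This is a simplified illustration, not the evaluation script: NIST is a corpus-level score that would be recomputed on each shuffled pair of outputs, whereas the sketch assumes per-sentence scores that can simply be summed. The function name, the number of trials, and the toy scores are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=1000, seed=0):
    """Estimate a p-value for the difference between two systems.

    For each trial, the two systems' scores are swapped per sentence with
    probability 0.5, and the shuffled difference is compared with the
    observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical per-sentence scores for two systems on 20 sentences.
p_different = approximate_randomization([1.0] * 20, [0.0] * 20)
p_identical = approximate_randomization([0.5] * 20, [0.5] * 20)
print(p_different, p_identical)
```

A small p-value means random swapping rarely reproduces a difference as large as the observed one, so the difference is unlikely to be due to chance.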
Experimental Evaluation
Manual Evaluation
• To separate the factors of grammaticality and translation adequacy
• 500 sentences randomly extracted from in-coverage examples
• 2 independent human judges
• Judges were presented with the output from the phrase-based SMT system and the LFG-based system in a blind test and asked to choose a preference for one of the translations based on
– Grammaticality / fluency
– Translational / semantic adequacy
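Preferences of judge 1 (j1, rows) against judge 2 (j2, columns); P = phrase-based SMT:
            adequacy             grammaticality
j1 \ j2     P     LFG   equal    P     LFG   equal
P           48    8     7        36    2     9
LFG         10    105   18       6     113   17
equal       53    60    192      51    44    223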
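One way to read the judgment counts is as two 3x3 confusion matrices with judge 1 in rows and judge 2 in columns, in the order P, LFG, equal; that layout is an assumption about the flattened table. The diagonal then holds the cases where both judges agreed. The helper below is a hypothetical convenience for pulling out those agreed counts, not part of the original evaluation.

```python
LABELS = ["P", "LFG", "equal"]

# Rows: judge 1, columns: judge 2, both in the order P, LFG, equal.
adequacy = [[48, 8, 7], [10, 105, 18], [53, 60, 192]]
grammaticality = [[36, 2, 9], [6, 113, 17], [51, 44, 223]]


def agreement_summary(matrix):
    """Return the agreed counts per label and the overall agreement rate."""
    total = sum(sum(row) for row in matrix)
    agreed = {label: matrix[k][k] for k, label in enumerate(LABELS)}
    return {"agreed": agreed, "agreement_rate": sum(agreed.values()) / total}


adequacy_summary = agreement_summary(adequacy)
grammaticality_summary = agreement_summary(grammaticality)
print(adequacy_summary["agreed"], grammaticality_summary["agreed"])
```

Where both judges agreed on a preference, the LFG-based output was chosen more often than the phrase-based output on both criteria (105 against 48 for adequacy, 113 against 36 for grammaticality).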
Experimental Evaluation
• Promising results for examples that are in-coverage of LFG grammars
• However, back-off to robustness techniques for parsing and generation results in loss of translation quality
Rule Extraction Problems
• 20% of the parses are fragmentary
• Errors occur in the rule extraction process, resulting in ill-formed transfer rules
Parsing-Transfer-Generation Problems
• Parsing errors → errors in transfer → generation errors
• In-coverage disambiguation errors in parsing and transfer → suboptimal translation
Experimental Evaluation
• Despite use of minimum error rate training and n-gram language models, the system cannot be used to maximize n-gram scores on reference translations in the same way as phrase-based systems since statistical ordering models are employed in the framework after generation
• This gives preference to grammaticality over similarity to reference translations
Conclusion
• SMT model that marries phrase-based SMT with traditional grammar-based MT
• NIST measure showed that results achieved are comparable with phrase-based SMT system of Koehn et al 2003 for in-coverage examples
• Manual evaluation showed significant improvements in both grammaticality and translational adequacy for in-coverage examples
Conclusion
• The system can determine whether or not a source sentence is in-coverage
• Possibility for hybrid system that achieves improved grammaticality at state-of-the-art translation quality
Future Work:
• Improvement of translation of in-coverage source sentences, e.g. stochastic generation
• Apply system to other language pairs and data sets
References
Miriam Butt, Helge Dyvik, Tracy King, Hiroshi Masuichi and Christian Rohrer. 2002. The Parallel Grammar Project.
Eugene Charniak, Kevin Knight and Kenji Yamada. 2003. Syntax-based Language Models for Statistical Machine Translation.
Michael Collins, Philipp Koehn and Ivona Kucerova. 2005. Clause Restructuring for Statistical Machine Translation.
Philipp Koehn, Franz Och and Daniel Marcu. 2003. Statistical Phrase-based Translation.
Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation.
Arul Menezes and Stephen Richardson. 2001. A best-first alignment for automatic extraction of transfer mappings from bilingual corpora.
Franz Och, Christoph Tillmann and Hermann Ney. 1999. Improved Alignment Models for Statistical Machine Translation.
Franz Och. 2003. Minimum error rate training in statistical machine translation.
Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation.
Stefan Riezler, Tracy King, Ronald Kaplan, Richard Crouch, John Maxwell and Mark Johnson. 2002. Parsing the Wall Street Journal using LFG and Discriminative Estimation Techniques.
Stefan Riezler and John Maxwell. 2006. Grammatical Machine Translation.
Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns.