Knowledge-Rich MT
Chris Dyer, Kevin Gimpel, Waleed Ammar, Noah Smith
November 4, 2011
Outline
•Where are we starting with end-to-end MT?
•Adapting SMT for low-resource scenarios
•What progress have we been making?
•What does Year 2 hold?
Cross-site system comparison
The SMT baseline
[Diagram: a TM learner trained on English–français parallel text and an LM learner trained on English text feed a decoder, which translates "S'il vous plaît traduire..." into "Please translate..."]
SMT Baselines
System                            BLEU
English – Kinyarwanda (Hiero)      4.7
English – Malagasy (Hiero)        25.0
English – Malagasy (Moses)        30.5
Kinyarwanda – English (Hiero)      6.8
Malagasy – English (Hiero)        24.3
Malagasy – English (Moses)        24.2
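For readers unfamiliar with the metric, the BLEU scores above can be read against the following minimal sketch of corpus-level BLEU. It assumes one reference per sentence, pre-tokenized input, n-grams up to length 4, and no smoothing; the function names are illustrative, and this is not the scoring tool used for the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    ref = "the serpent said to the woman".split()
    hyp = "now the serpent said to the woman".split()
    print(corpus_bleu([hyp], [ref]))
```

Scores this low-level sketch produces will differ from published numbers whenever tokenization, casing, or the number of references differ.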
Let’s make things better.
The problem?
[Diagram: the baseline pipeline (TM learner on English–français, LM learner on English), then the same pipeline with English–Malagasy in place of English–français.]
Low-resource!
[Diagram labels: "Small, Out of domain" (next to the TM); Malagasy verbal morphology; "Partial" language models; Unsupervised model outputs; Dependency parses; Word clusters]
Word clusters (sample):
36: dieny, fara, fiompiny, hamoaka, handehanany
37: adinina, aforeto, ahevahevao, akaiky, alao,
Year 1 MT Challenge
[Diagram: English–Malagasy. Malagasy verbal morphology, dependency parses, and word clusters (36: dieny, fara, fiompiny, hamoaka, handehanany; 37: adinina, aforeto, ahevahevao, akaiky, alao,) feed the Translation Model, which takes "henemana no hana ..." to "... something intelligible ..."]
Accomplishments
[Figure: panels labeled "Model 4" and "CMU"]
Similar pattern of improvements, no language-specific features (yet).
Malagasy - English
System            BLEU
Model 4 - GDA     24.2
Model 4 - GDFA    26.7
CMU - GDFA        26.3
Model 4 + CMU     27.6
Malagasy - English version 1.0
the sons of simeon were jemoela , jamin , jakin , and ohada zohara saul , the son of a canaanite woman .
the sons of simeon were jemuel , jamin , ohada , jakin , zohar , and shaul , the son of a canaanite woman .
the sons of simeon : jemuel , jamin , ohad , jakin , zohar , and shaul ( the son of a canaanite woman ) .
What improvements?
then the woman said to the serpent , “ no ! you will not die .
now the serpent said to the woman , “ you will not die .
the serpent said to the woman , “ surely you will not die ,
What improvements?
Feature-rich translation
•Discriminative learning on training data
•Learn much sparser features than possible with just a development set
•Update weights to improve translation probability
•Final tuning pass on development set to optimize translation metrics (BLEU, METEOR, etc.)
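One way to read "update weights to improve translation probability" is as a gradient step on the conditional log-likelihood of a log-linear model over candidate translations. The sketch below assumes a small candidate list in which each candidate is a sparse feature dictionary and one candidate is designated as the target. The feature names, the learning rate, and the function names are hypothetical, and this is not necessarily the training procedure used in the talk.

```python
import math
from collections import defaultdict


def score(weights, feats):
    """Linear model score: dot product of sparse features and weights."""
    return sum(weights[name] * value for name, value in feats.items())


def candidate_probs(weights, candidates):
    """Softmax over candidate scores (shifted by the max for stability)."""
    scores = [score(weights, feats) for feats in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gradient_step(weights, candidates, target_index, learning_rate=0.1):
    """One ascent step on log p(target | candidates).

    The gradient is the target's features minus the model's expected features.
    """
    probs = candidate_probs(weights, candidates)
    for name, value in candidates[target_index].items():
        weights[name] += learning_rate * value
    for prob, feats in zip(probs, candidates):
        for name, value in feats.items():
            weights[name] -= learning_rate * prob * value
    return weights


if __name__ == "__main__":
    # Two hypothetical candidates; only features that fire are stored.
    candidates = [
        {"context:left=NP": 1.0, "lm_logprob": -2.0},
        {"lm_logprob": -1.5},
    ]
    weights = defaultdict(float)
    before = candidate_probs(weights, candidates)[0]
    gradient_step(weights, candidates, target_index=0)
    after = candidate_probs(weights, candidates)[0]
    print(before, after)
```

Storing only the features that fire is what makes very large feature sets workable: each update touches just the handful of weights present in the candidates.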
What features?
Contexts give clues to constituents
German - English
System              BLEU    Features
baseline            25.0    11 / 11
+7-gram             25.0    13 / 13
+Context            25.2    11,194 / 80,006,646
+Context +7-gram    25.4    11,196 / 80,006,648
Phrasal dependency translation model
Source: 非国大 反对 制裁 津巴布韦
Gloss: ANC opposition sanction Zimbabwe
Reference: african national congress opposes sanctions against zimbabwe
Phrase-based output (phrases as extracted from the figure): zimbabwe | african national congress | sanctions against | opposition to
Our System (phrases as extracted from the figure): zimbabwe | african national congress | sanctions against | is opposed to
[Figure: dependency arcs drawn over the output phrases and the source sentence, including "$" nodes]
Use features from source-side parse
[Chart: % BLEU for three configurations: Target Syntax Only; Target Syntax + String-to-Tree Rules; Target Syntax + String-to-Tree Rules + Tree-to-Tree Features]
•Our best results use supervised parsers for both source and target languages
•What about unsupervised parsing?
•We use the dependency model with valence (Klein & Manning, 2004)
•With careful initialization, it gives state-of-the-art results (Gimpel & Smith, 2011):
•53.1% attachment accuracy on Penn Treebank
•44.4% on Chinese Treebank
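Attachment accuracy, as quoted above, is the percentage of words whose predicted head matches the gold-standard head. A minimal sketch, assuming each sentence is given as a list of head indices (0 for the root) and that every token is scored; the slides do not state conventions such as punctuation handling or sentence-length limits, so those are left out, and the function name is illustrative.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head equals the gold head.

    Both arguments are lists of sentences; each sentence is a list holding
    the head index of every token (0 denotes the root).
    """
    correct = total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total if total else 0.0


if __name__ == "__main__":
    gold = [[2, 0, 2]]       # tokens 1 and 3 attach to token 2, the root word
    predicted = [[2, 0, 1]]  # the third token is attached to the wrong head
    print(attachment_accuracy(gold, predicted))
```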
[Chart: % BLEU]
Year 2
•Target morphological complexity
•Generate novel word forms
•Leverage morphological resources and machine learning
•Need better language models, not just translation models
“Into other languages”
Year 2 Challenges
•Generating new word forms means a much larger search space than is usual in MT
•Inference is expensive
•Use “high-recall” linguistic tools to constrain search
•Statistics do the rest
Year 2
•Data requirements
•Large non-English monolingual corpora
•Test sets for focus languages