Finding Translation Correspondences from Parallel Parsed Corpus for
Example-based Translation
Eiji Aramaki (Kyoto-U),
Sadao Kurohashi (U-Tokyo),
Satoshi Sato (Kyoto-U),
Hideo Watanabe (IBM Japan)
Introduction
Parallel corpus → translation examples
Statistical approach (co-occurrence information): 1-2%
Our method (syntactic information, translation dictionary): 50%
Goal
This paper shows great contributions of TFP ・・・
・・・ 全要素生産性 が (TFP) case-marker
大きく 寄与して いること が (great) (contribution) case-marker
示されている (show)
Problems
• Finding many correspondences with a translation dictionary raises 2 problems:
1: Some words cannot be found in the dictionary.
2: Dictionary look-up is ambiguous, and the ambiguity has to be resolved.
Method
Step 1: Detection of Phrasal Dependency Structure
Step 2: Detection of Basic Phrasal Correspondences by Consulting Dictionary
Step 3: Discovery of New Correspondences by Handling Remaining Phrases
Step1: Phrasal Dependency Structures
I bought this car by monthly installments.
→ ESG (English Parser) + Rules →
I / bought / this car / by monthly installments
Rules
• Function words are grouped together with the following content word.
• A compound noun is treated as one phrase.
• Auxiliary verbs are grouped together with the following verb (is playing, was tired, …).
• A parallel-relation word is treated as one phrase (and, or, …).
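The grouping rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the coarse tags (`FUNC`, `AUX`, `NOUN`, `VERB`, `CONJ`) and the function `group_phrases` are hypothetical names, and the real system works on ESG parser output rather than a flat tagged token list.

```python
from typing import List, Tuple

# Hypothetical coarse tags for illustration; this is not the ESG tag set.
FUNC, AUX, NOUN, VERB, CONJ = "FUNC", "AUX", "NOUN", "VERB", "CONJ"


def group_phrases(tokens: List[Tuple[str, str]]) -> List[str]:
    """Group (word, tag) pairs into phrases following the four slide rules."""
    phrases: List[str] = []
    pending: List[str] = []  # function/auxiliary words waiting for a head word
    prev_tag = None
    for word, tag in tokens:
        if tag in (FUNC, AUX):
            # Rules 1 and 3: hold the word until the next content word or verb.
            # Simplification: an AUX attaches to whatever head comes next.
            pending.append(word)
        elif tag == CONJ:
            # Rule 4: a parallel-relation word forms a phrase of its own.
            if pending:
                phrases.append(" ".join(pending))
                pending = []
            phrases.append(word)
        elif tag == NOUN and prev_tag == NOUN and phrases:
            # Rule 2: consecutive nouns form one compound-noun phrase.
            phrases[-1] += " " + word
        else:
            phrases.append(" ".join(pending + [word]))
            pending = []
        prev_tag = tag
    if pending:  # trailing function words with no head to attach to
        phrases.append(" ".join(pending))
    return phrases


example = [("information", NOUN), ("technology", NOUN), ("in", FUNC),
           ("science", NOUN), ("technology", NOUN)]
print(group_phrases(example))
# ['information technology', 'in science technology']
```

The sketch drops the dependency links between phrases; it only shows how word-level units are merged into phrase-level units.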
Step2: Detection of Phrasal Correspondences
English phrases: … / information technology / in science technology / …
Japanese phrases: … / 科学 技術 に (Science Technology) / おける 情報 技術 (Information Technology) / …
• Criteria to choose phrasal correspondences
– Correspondences of content words
– Correspondences of neighboring phrases
(# of word-links × 2) / (# of J content-words + # of E content-words)
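The content-word ratio on this slide (twice the number of word links, divided by the total number of Japanese and English content words) can be computed as follows. This is a minimal sketch: the function name `content_word_ratio` is hypothetical, and the slides do not say how this ratio is combined with the neighboring-phrase criterion, so only the ratio itself is shown.

```python
def content_word_ratio(word_links: int, j_content_words: int,
                       e_content_words: int) -> float:
    """Return 2 * links / (J content words + E content words)."""
    total = j_content_words + e_content_words
    if total == 0:
        return 0.0  # assumption: a pair with no content words scores zero
    return 2 * word_links / total


# 科学 技術 に vs. "in science technology": both content words are linked.
print(content_word_ratio(2, 2, 2))  # 1.0
# 科学 技術 に vs. "information technology": only 技術-technology is linked.
print(content_word_ratio(1, 2, 2))  # 0.5
```

In this toy case the higher ratio picks the intended pairing, which is how the ambiguity of "technology" matching 技術 in both Japanese phrases would be resolved.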
Step3: Discovery of New Correspondences by Handling Remaining Phrases
(New)
in post Cold war years
冷戦 終結 後 に (cold-war) (end) (after) case-marker
goods and services
物 や (object)
サービス の (service)
(merge)
• Criteria to discover new correspondences
– Local and Global supports
  • Local support: other phrasal correspondences within two-phrase distance in the dependency structure.
  • Global support: phrase correspondences in the parallel sentences.
– POS Consistency
– Inner Sufficiency
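Local support can be illustrated by counting how many already-aligned phrases lie within two steps of a remaining phrase in the dependency structure. The sketch below is an assumption-laden illustration: it measures distance as the number of edges in the undirected dependency tree, and the names `local_support`, `adjacency`, and `aligned` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def local_support(adjacency: Dict[str, List[str]], phrase: str,
                  aligned: Set[str], max_distance: int = 2) -> int:
    """Count aligned phrases within max_distance edges of `phrase`."""
    seen = {phrase}
    queue = deque([(phrase, 0)])
    support = 0
    while queue:
        node, dist = queue.popleft()
        if dist == max_distance:
            continue  # do not expand beyond the distance limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                if neighbor in aligned:
                    support += 1
                queue.append((neighbor, dist + 1))
    return support


# Toy tree for "Japan / play / the role": "play" governs the other two phrases.
tree = {"play": ["Japan", "the role"], "Japan": ["play"], "the role": ["play"]}
print(local_support(tree, "play", aligned={"Japan", "the role"}))  # 2
```

Here the unaligned phrase "play" is surrounded by two aligned neighbors, which is the kind of evidence that would support pairing it with the remaining Japanese phrase.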
Step3: Discovery of New Correspondences by Handling Remaining Phrases
Japan / play / the role
日本 は (Japan) case-marker
役割 を (Role) case-marker
果たす (Achieve)
・・・ technology / has / become important ・・・
技術 が (technology) case-marker
重要 と (important)
なっている (become)
Experiments
Evaluation data:
200 sentence pairs from a White Paper and from example sentences in a Japanese-English dictionary
Gold standard data:
We manually tagged the correct correspondences in these sentences.
Correct: exactly matches a pre-aligned correspondence
Near-correct: partly matches a pre-aligned correspondence
Wrong: matches neither as Correct nor as Near-correct
Output Examples
English | Japanese | Score
is being pursued | 行われている (is doing by) | 2.75
of G7 nations | 先進 7 カ国の (advanced 7 countries) | 2.6
geographical proximity | 地理的に近い (near in geography) | 2.0
tree (become) | その木は (That tree is) | 1.2
went [to bed] | 寝る (Go to bed) | 1.0
She (held) | 彼女は (She is) | 0.5
Near-correct
Correct
[Figure: Precision – Recall plot. Axes: Recall, Precision. Curves: Correct; Correct + Near-Correct × 0.5]
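The two curves differ only in how much credit a Near-correct output receives: none for the "Correct" curve and 0.5 for the "Correct + Near-Correct × 0.5" curve. The sketch below assumes the usual definitions (credit divided by the number of system outputs for precision, and by the number of gold-standard correspondences for recall); the slides do not state the denominators, and the counts in the example are made up purely for illustration.

```python
from typing import Tuple


def precision_recall(n_correct: int, n_near: int, n_output: int,
                     n_gold: int, near_weight: float = 0.0) -> Tuple[float, float]:
    """Precision and recall with partial credit for Near-correct outputs."""
    credit = n_correct + near_weight * n_near
    precision = credit / n_output if n_output else 0.0
    recall = credit / n_gold if n_gold else 0.0
    return precision, recall


# Illustrative counts only (not taken from the experiments):
# 6 Correct, 2 Near-correct, 10 system outputs, 8 gold correspondences.
print(precision_recall(6, 2, 10, 8))                   # (0.6, 0.75)
print(precision_recall(6, 2, 10, 8, near_weight=0.5))  # (0.7, 0.875)
```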
Conclusion
• Our method finds more correspondences than a statistical approach.
• For a comparable corpus, a statistical approach seems to be effective; for a parallel corpus, however, our approach is more effective at obtaining a large number of translation examples.
Statistical approach: 1-2% of the input corpus
Our system: 51-68% of the input corpus