using parallel propbanks to enhance word-alignments
DESCRIPTION
This short paper describes the use of the linguistic annotation available in parallel PropBanks (Chinese and English) for the enhancement of automatically derived word alignments. Specifically, we suggest ways to refine and expand word alignments for verb-predicates by using predicate-argument structures. Evaluations demonstrate improved alignment accuracies that vary by corpus type.
TRANSCRIPT
Using Parallel Propbanks to Enhance Word-alignments
Jinho Choi, Martha Palmer, Nianwen Xue
Institute of Cognitive Science, University of Colorado at Boulder
Background
Propbank
- A corpus annotated with verbal propositions and their arguments.
- Adds semantic information (semantic roles) to the phrase structures.
- e.g. John opened the door with his foot
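The annotation for the example sentence was shown as a figure and is not preserved here. A minimal sketch of what such a proposition might look like follows; the `Proposition` container is hypothetical, and the role labels assume the usual PropBank frame for open.01 (ARG0 opener, ARG1 thing opened, ARG2 instrument).

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One verb-predicate with its labeled arguments (hypothetical container)."""
    predicate: str
    roleset: str
    arguments: dict = field(default_factory=dict)


# Role labels are an assumption based on the usual open.01 frame.
prop = Proposition(
    predicate="opened",
    roleset="open.01",
    arguments={
        "ARG0": "John",           # opener
        "ARG1": "the door",       # thing opened
        "ARG2": "with his foot",  # instrument
    },
)


def describe(p):
    """Render the arguments as 'ROLE=text' pairs in annotation order."""
    return ", ".join(f"{role}={text}" for role, text in p.arguments.items())


print(describe(prop))
```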
Word-alignments
- Parallel sentences: sentences s and t are called parallel if t is a translation of s.
- Word alignment: given parallel sentences, align words that are semantically close.
- GIZA++: a statistical machine translation toolkit used to train word-alignment models.
Phrase Structure
System Overview
Motivation
Issues with GIZA++ generated word-alignments
- It is hard to verify whether the alignments are correct.
- Words with low frequencies may not get aligned to any words.
- GIZA++ does not account for semantics.
Using parallel Propbanks to enhance word-alignments for verb-predicates
- Let S and T be a source and a target language, respectively.
- For each verb-predicate v_s ∈ S aligned to some word w_t ∈ T:
  - if w_t is also a verb-predicate and the arguments of v_s and w_t match, consider the alignment correct (top-down matching).
- For each verb-predicate v_s ∈ S aligned to no word in T:
  - if the arguments of v_s match the arguments of some verb-predicate v_t ∈ T, align v_s to v_t (bottom-up matching).
Propbank Annotations
Corpus Description
English Chinese Translation Treebank (ECTB)
- A parallel corpus between English and Chinese
- The corpus is divided into two parts:
  - Xinhua: Chinese newswire with literal English translations (4,363 parallel sentences)
  - Sinorama: Chinese news magazine with non-literal English translations (12,600 parallel sentences)
Predicate Matching
For each Chinese verb-predicate v_c aligned to some English word w_e, we checked whether w_e is also a verb-predicate.
pred = predicates, be = be-verbs, else = non-verbs, none = no words
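The four categories in the legend can be tallied roughly as below. This is only a sketch: the `classify` helper, the `BE_VERBS` list, and the toy tokens are assumptions for illustration, and a real implementation would work on token positions and Propbank annotations rather than bare strings.

```python
from collections import Counter

# Assumed list of be-verb forms; not taken from the slides.
BE_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def classify(aligned_word, english_predicates):
    """Return 'pred', 'be', 'else', or 'none' for the English side of one alignment."""
    if aligned_word is None:
        return "none"                 # Chinese verb aligned to no word
    w = aligned_word.lower()
    if w in BE_VERBS:
        return "be"                   # aligned to a be-verb
    if w in english_predicates:
        return "pred"                 # aligned to an annotated verb-predicate
    return "else"                     # aligned to a non-verb


# Toy data: hypothetical Chinese verb ids mapped to their aligned English word.
alignments = {"c1": "opened", "c2": "is", "c3": "development", "c4": None}
english_predicates = {"opened"}

counts = Counter(classify(w, english_predicates) for w in alignments.values())
print(counts)
```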
Top-down Argument Matching
For each Chinese verb v_c aligned to an English verb v_e
- Convert all Chinese words in the arguments of v_c to their English alignments (skip words not aligned to any English word).
- Compare the converted arguments of v_c with the arguments of v_e.
- For each argument, count how many words match. If the matching score is above a certain threshold, consider the alignment correct.
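The conversion and comparison steps above can be sketched as follows. The function names and toy tokens are hypothetical, and dividing by the number of converted words is an assumption; the slides do not state the denominator.

```python
def project_argument(chinese_words, alignment):
    """Map each Chinese word to its aligned English words; unaligned words are skipped."""
    projected = set()
    for w in chinese_words:
        projected.update(alignment.get(w, ()))
    return projected


def match_ratio(projected, english_argument):
    """Fraction of converted words that also occur in the English argument."""
    if not projected:
        return 0.0
    return len(projected & set(english_argument)) / len(projected)


# Toy word alignment: hypothetical Chinese token ids -> English words.
alignment = {"c_door": ["door"], "c_foot": ["foot"], "c_use": ["using"]}

arg1 = project_argument(["c_the", "c_door"], alignment)   # "c_the" is unaligned
arg2 = project_argument(["c_use", "c_foot"], alignment)

print(match_ratio(arg1, ["the", "door"]))          # 1.0
print(match_ratio(arg2, ["with", "his", "foot"]))  # 0.5
```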
Measurements
- CA = the set of arguments of v_c, where ca_i ∈ CA
- EA = the set of arguments of v_e, where ea_i ∈ EA
- Macro average argument matching score (formula not preserved in this transcript)
- Micro average argument matching score (formula not preserved in this transcript)
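Since the two formulas are missing from the transcript, the sketch below shows one conventional reading of macro and micro averaging over arguments. It is an assumption, not the authors' definition: each converted Chinese argument is scored against its best-overlapping English argument, the macro average is the mean of the per-argument ratios, and the micro average pools word counts over all arguments.

```python
def best_overlap(ca, EA):
    """Largest number of words that ca shares with any single English argument."""
    return max((len(ca & ea) for ea in EA), default=0)


def macro_average(CA, EA):
    """Mean of per-argument match ratios (assumed definition)."""
    scored = [ca for ca in CA if ca]          # ignore arguments with no converted words
    if not scored:
        return 0.0
    return sum(best_overlap(ca, EA) / len(ca) for ca in scored) / len(scored)


def micro_average(CA, EA):
    """Matched words over all words, pooled across arguments (assumed definition)."""
    total = sum(len(ca) for ca in CA)
    if total == 0:
        return 0.0
    return sum(best_overlap(ca, EA) for ca in CA) / total


# Toy example: converted Chinese arguments vs. English arguments, as word sets.
CA = [{"John"}, {"the", "door"}, {"foot", "using"}]
EA = [{"John"}, {"the", "door"}, {"with", "his", "foot"}]

macro = macro_average(CA, EA)   # (1/1 + 2/2 + 1/2) / 3
micro = micro_average(CA, EA)   # (1 + 2 + 1) / (1 + 2 + 2)
print(round(macro, 4), round(micro, 4))
```

Under this reading the macro average treats every argument equally, while the micro average lets long arguments weigh more.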
Evaluations
Test corpus
- English-Chinese parallel corpus provided by Wei Wang (Information Sciences Institute at the Univ. of Southern California)
- 100 parallel sentences, 273 Chinese verb-types (365 verb-tokens)
- Test whether word-alignments found in ECTB can correctly translate Chinese verbs to English verbs
Measurements
- Term coverage (TC): how many Chinese verb-types are covered by word-alignments found in ECTB
- Term expansion (TE): for each covered Chinese verb-type, how many English verb-types are suggested by the word-alignments
- Alignment accuracy (AA): how many suggested English verb-types are correct
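The three measurements might be computed along these lines. The `evaluate` helper and the toy data are hypothetical, and averaging TE and AA per covered verb-type (to obtain ATE and AAA) is an assumption about how the averages are taken.

```python
def evaluate(suggested, gold):
    """Return (TC, ATE, AAA).

    suggested / gold: dicts mapping a Chinese verb-type to a set of English verb-types.
    """
    covered = [vc for vc in gold if suggested.get(vc)]
    tc = len(covered)                                   # term coverage
    if tc == 0:
        return 0, 0.0, 0.0
    # Average term expansion: suggested English verb-types per covered verb.
    ate = sum(len(suggested[vc]) for vc in covered) / tc
    # Average alignment accuracy: share of suggestions that are correct, per verb.
    aaa = sum(len(suggested[vc] & gold[vc]) / len(suggested[vc]) for vc in covered) / tc
    return tc, ate, aaa


# Toy data with hypothetical verb ids.
gold = {"v1": {"open"}, "v2": {"develop", "grow"}, "v3": {"say"}}
suggested = {"v1": {"open", "start"}, "v2": {"develop"}}

tc, ate, aaa = evaluate(suggested, gold)
print(tc, ate, aaa)   # 2 1.5 0.75
```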
Refining word-alignments
- Apply only the word-alignments whose macro-average scores are above a certain threshold
- Thresholds: 0 (accept all alignments), 0.4 (accept alignments whose macro average scores are above 40%)
ATE = Average term expansion, AAA = Average alignment accuracy
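Refinement by threshold amounts to a simple filter. In this sketch the `refine` helper and the scores are made up for illustration; the comparison is inclusive (`>=`) so that a threshold of 0 accepts every alignment, which is an assumption about how the boundary is handled.

```python
def refine(scored_alignments, threshold):
    """Keep (Chinese verb, English verb) pairs whose macro-average score meets the threshold."""
    return [(vc, ve) for vc, ve, macro in scored_alignments if macro >= threshold]


# Toy alignments with hypothetical macro-average scores.
scored = [("v1", "open", 0.9), ("v2", "grow", 0.45), ("v3", "say", 0.0)]

print(len(refine(scored, 0.0)))   # 3, threshold 0 accepts all alignments
print(len(refine(scored, 0.4)))   # 2
print(len(refine(scored, 0.5)))   # 1
```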
Expanding word-alignments
- Apply only the word-alignments whose macro and micro average scores are above certain thresholds
Bottom-up Matching
For each Chinese verb v_c aligned to no English word
- Convert all Chinese words to their English alignments.
- Compare the converted arguments of v_c with the arguments of each English verb v_e that is not aligned to any Chinese verb, and find the one, say v_m, with the maximum micro average score.
- If the micro average score of v_c and v_m is above a certain threshold, align v_c to v_m.
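The bottom-up procedure above can be sketched as a search for the best-scoring unaligned English verb. The helper names and toy data are hypothetical, and the micro average used here is an assumed definition (matched words over all converted words, each argument scored against its best-overlapping English argument).

```python
def micro_average(CA, EA):
    """Assumed micro average: pooled matched words over pooled converted words."""
    total = sum(len(ca) for ca in CA)
    if total == 0:
        return 0.0
    matched = sum(max((len(ca & ea) for ea in EA), default=0) for ca in CA)
    return matched / total


def bottom_up(vc_args, unaligned_english, threshold):
    """Return the unaligned English verb with the highest micro average, or None.

    vc_args: converted arguments of the Chinese verb, as word sets.
    unaligned_english: dict mapping each unaligned English verb to its argument word sets.
    """
    best_verb, best_score = None, 0.0
    for ve, EA in unaligned_english.items():
        score = micro_average(vc_args, EA)
        if score > best_score:
            best_verb, best_score = ve, score
    # Inclusive comparison is an assumption about the boundary case.
    if best_verb is not None and best_score >= threshold:
        return best_verb
    return None


candidates = {
    "opened": [{"John"}, {"the", "door"}],
    "said": [{"Mary"}],
}

print(bottom_up([{"John"}, {"the", "door"}], candidates, 0.7))          # opened
print(bottom_up([{"John"}, {"a", "big", "window"}], candidates, 0.7))   # None
```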
Average Top-down Argument Matching Scores
             Xinhua    Sinorama
Macro Avg.   80.55%    53.56%
Micro Avg.   83.91%    52.62%

Average Bottom-up Argument Matching Scores
             Xinhua              Sinorama
Threshold    0.7       0.8       0.7       0.8
Macro Avg.   80.74%    83.99%    77.70%    82.86%
Micro Avg.   82.63%    86.46%    79.45%    85.07%

Refined word-alignments by macro-average threshold (TH)
        Xinhua                  Sinorama
TH      TC    ATE    AAA        TC    ATE    AAA
0.0     79    1.77   83.35%     129   2.29   57.76%
0.4     76    1.72   83.54%     93    1.8    65.88%
0.5     76    1.68   83.71%     62    1.58   78.09%

Expanded word-alignments by macro- and micro-average thresholds
                    Macro 0.7                Macro 0.8
          Micro     TC    ATE    AAA         TC    ATE    AAA
Xinhua    0.0       22    4.27   50.38%      20    3.35   57.50%
          0.6       21    3.9    54.76%      18    3.39   63.89%
          0.7       19    3.47   55.26%      17    3.12   61.76%
Sinorama  0.0       37    3.59   18.01%      29    3.14   14.95%
          0.6       31    3.06   15.11%      27    2.93   14.46%
          0.7       21    2.81   11.99%      25    2.6    11.82%

Summary and Future Work
• Top-down Argument Matching is most effective with non-literal translations that have proven difficult for GIZA++.
• Bottom-up Argument Matching shows promise for expanding the coverage of GIZA++ alignments that are based on literal translations.
• In future work, we will try to enhance word-alignments by using automatically labeled Propbanks, Nombanks, and Named-entity tags.