Using Parallel Propbanks to Enhance Word-Alignments

Jinho D. Choi (Univ. of Colorado at Boulder), Martha Palmer (Univ. of Colorado at Boulder), Nianwen Xue (Brandeis University). The 3rd Linguistic Annotation Workshop at ACL ’09, August 7th, 2009.

DESCRIPTION

This short paper describes the use of the linguistic annotation available in parallel PropBanks (Chinese and English) for the enhancement of automatically derived word alignments. Specifically, we suggest ways to refine and expand word alignments for verb-predicates by using predicate-argument structures. Evaluations demonstrate improved alignment accuracies that vary by corpus type.

TRANSCRIPT

Page 1: Using Parallel Propbanks to Enhance Word-alignments

Using Parallel Propbanks to Enhance Word-Alignments

Jinho D. Choi (Univ. of Colorado at Boulder)

Martha Palmer (Univ. of Colorado at Boulder)

Nianwen Xue (Brandeis University)

The 3rd Linguistic Annotation Workshop at ACL ’09, August 7th, 2009

Page 2: Using Parallel Propbanks to Enhance Word-alignments

Parallel Propbanks

• Propbank

- Corpus annotated with verbal propositions and their arguments (semantic roles)

• Parallel Propbanks

- Propbanks annotated on both sides of a parallel corpus

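Example: "Gansu Province also actively explored high risk business", annotated for the verb "explored" with Arg0 (explorer) and Arg1 (things explored). The parallel Chinese sentence (characters lost in extraction) carries the corresponding Arg0 and Arg1 labels.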

Page 3: Using Parallel Propbanks to Enhance Word-alignments

Word-Alignments

• Given parallel sentences, discover the translation of each word

• GIZA++: a statistical machine translation toolkit

- It is hard to verify if the alignments are correct.

- Words with low frequencies may not get aligned.

- It does not account for semantics.

Example: word alignment between a Chinese sentence (characters lost in extraction) and "Construction is a principal economic activity in developing Pudong".

Page 4: Using Parallel Propbanks to Enhance Word-alignments

Predicate Matching (based on GIZA++)

• English-Chinese Parallel Treebank (ECTB)

- Xinhua: Chinese newswire + literal translation

- Sinorama: Chinese news magazine + non-literal translation

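Figure: two pie charts showing what the Chinese verbs are aligned to in English (categories: En.verb, En.be, En.else, En.none), one for Xinhua (12,895) and one for Sinorama (40,086). Slice values as extracted: 32%, 19%, 3%, 45% and 56%, 22%, 3%, 19%.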

Page 5: Using Parallel Propbanks to Enhance Word-alignments

Top-down Argument Matching

• Verify word-alignments

- For each Chinese verb vc aligned to some English verb ve

- Verify that the alignment is correct if the arguments of vc and ve match

Example: the Chinese sentence (characters lost in extraction) and "Gansu Province also actively explored high risk business" are both labeled Arg0 ArgM ArgM Rel Arg1, so the arguments of the aligned verbs match. Bingo!
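
The verification step above can be read as a filter over the verb pairs proposed by GIZA++. The following is a minimal Python sketch under stated assumptions: `verify_top_down`, the `macro_score` callable, and the verb ids are hypothetical names, and the 0.5 default mirrors the Mac.th = 0.5 (TDAM) setting shown in the evaluation slides. Using `>=` is also an assumption; it makes a threshold of 0.0 keep every GIZA++ alignment, which agrees with the slides labeling Mac.th = 0.0 as plain GIZA++.

```python
from typing import Callable, Iterable, List, Tuple

# (Chinese verb id, English verb id) as proposed by the word aligner.
VerbPair = Tuple[str, str]


def verify_top_down(
    pairs: Iterable[VerbPair],
    macro_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[VerbPair]:
    """Keep only the verb alignments whose arguments match well enough."""
    return [(vc, ve) for vc, ve in pairs if macro_score(vc, ve) >= threshold]


# Toy data: hypothetical ids and scores, not taken from the corpora.
toy_scores = {("vc1", "explored"): 1.0, ("vc2", "funded"): 0.2}
verified = verify_top_down(toy_scores, lambda vc, ve: toy_scores[(vc, ve)])
print(verified)
```

Passing the score as a callable keeps the filter independent of how argument matching is actually computed.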

Page 6: Using Parallel Propbanks to Enhance Word-alignments

Bottom-up Argument Matching

• Expand word-alignments

- For each Chinese verb vc aligned to no English word

- Align vc to ve such that ve is an English verb that maximizes the argument matching with vc

Example: "Foreign funded enterprises in Gansu Province no longer worry about investment risk" is labeled Arg0 A.M A.M Rel Arg1; the Chinese sentence (characters lost in extraction) is labeled Arg0 A.M A.M A.M Arg1 Rel.

Page 7: Using Parallel Propbanks to Enhance Word-alignments

Bottom-up Argument Matching

Example (continued): the English sentence "Foreign funded enterprises in Gansu Province no longer worry about investment risk" carries two labelings, ArgM Rel Arg1 and Arg0 A.M A.M Rel Arg1. The Chinese labeling (characters lost in extraction), Arg0 A.M A.M A.M Arg1 Rel, is compared against both.
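
The expansion step can be sketched as picking, for one unaligned Chinese verb, the best-scoring English verb in the parallel sentence. This is only an illustration: `expand_bottom_up` and the candidate scores are hypothetical, ranking by the (macro, micro) pair is an assumption about what "maximizes the argument matching" means, and the default thresholds mirror the Mac.th = 0.8, Mic.th = 0.6 setting shown in the evaluation slides.

```python
from typing import Dict, Optional, Tuple

# (macro score, micro score) of one candidate English verb.
Scores = Tuple[float, float]


def expand_bottom_up(
    candidates: Dict[str, Scores],
    macro_th: float = 0.8,
    micro_th: float = 0.6,
) -> Optional[str]:
    """Return the best-matching English verb, or None if none is good enough."""
    if not candidates:
        return None
    # Assumption: rank by macro score first, micro score as tie-breaker.
    best = max(candidates, key=lambda ve: candidates[ve])
    macro, micro = candidates[best]
    if macro >= macro_th and micro >= micro_th:
        return best
    return None


# Toy scores (hypothetical) for the two English verbs in the example sentence.
print(expand_bottom_up({"funded": (0.0, 0.0), "worry": (0.9, 0.75)}))
```

Returning `None` when the best candidate falls below either threshold leaves the Chinese verb unaligned rather than forcing a weak alignment.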

Page 8: Using Parallel Propbanks to Enhance Word-alignments

Argument Matching Score

• Macro argument matching score

• Micro argument matching score

• Thresholds

- Top-down: thresholds on macro score

- Bottom-up: thresholds on both macro and micro scores
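
The formulas for the two scores did not survive in this transcript, so the sketch below is only one plausible reading, not the authors' definition. It assumes that an argument "matches" to the degree that its Chinese words are aligned to words inside the English argument with the same label, that the macro score averages the per-argument ratios, and that the micro score pools the word counts. All names (`per_argument_counts`, `macro_score`, `micro_score`) are hypothetical, and labels are assumed unique per predicate.

```python
from typing import Dict, List, Set, Tuple

Span = Set[int]                   # word positions covered by one argument
Args = Dict[str, Span]            # argument label -> span
Alignment = Set[Tuple[int, int]]  # (Chinese position, English position)


def per_argument_counts(
    zh_args: Args, en_args: Args, alignment: Alignment
) -> List[Tuple[int, int]]:
    """Return (matched, total) Chinese word counts for each Chinese argument."""
    counts = []
    for label, zh_span in zh_args.items():
        en_span = en_args.get(label, set())
        matched = sum(
            1 for c in zh_span if any((c, e) in alignment for e in en_span)
        )
        counts.append((matched, len(zh_span)))
    return counts


def macro_score(counts: List[Tuple[int, int]]) -> float:
    """Average of the per-argument matching ratios."""
    ratios = [m / t for m, t in counts if t]
    return sum(ratios) / len(ratios) if ratios else 0.0


def micro_score(counts: List[Tuple[int, int]]) -> float:
    """Matched words over all words, pooled across arguments."""
    total = sum(t for _, t in counts)
    return sum(m for m, _ in counts) / total if total else 0.0


# Toy predicate pair with made-up spans and alignment links.
zh = {"Arg0": {0, 1}, "Arg1": {3, 4, 5, 6}}
en = {"Arg0": {0, 1}, "Arg1": {5, 6, 7}}
links = {(0, 0), (1, 1), (3, 5)}
counts = per_argument_counts(zh, en, links)
print(macro_score(counts), micro_score(counts))
```

Under this reading the two scores diverge when arguments differ in length: a fully matched short Arg0 lifts the macro score, while a poorly matched long Arg1 dominates the micro score.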

Page 9: Using Parallel Propbanks to Enhance Word-alignments

System Overview

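- Source Language Corpus + Target Language Corpus → GIZA++ → Word Alignments

- Word Alignments: Verbs aligned to verbs → Top-down Matching → Verified Alignments

- Word Alignments: Verbs aligned to no word → Bottom-up Matching → Expanded Alignments

- Parallel Propbanks → Top-down Matching, Bottom-up Matching

- Verified Alignments + Expanded Alignments → Enhanced Alignments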
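
The routing in this diagram can be sketched as a split of the Chinese verbs by what GIZA++ aligned them to, followed by a union of the two result sets. This is a hedged illustration: `split_verb_alignments`, `enhance`, and the token ids are hypothetical, and verbs aligned only to non-verb words are assumed to enter neither matching step.

```python
from typing import Dict, Iterable, List, Set, Tuple

VerbPair = Tuple[str, str]  # (Chinese verb id, English verb id)


def split_verb_alignments(
    zh_verbs: Iterable[str],
    alignment: Dict[str, Set[str]],
    en_verbs: Set[str],
) -> Tuple[List[VerbPair], List[str]]:
    """Split Chinese verbs into top-down and bottom-up inputs."""
    to_verbs: List[VerbPair] = []  # verbs aligned to verbs
    to_none: List[str] = []        # verbs aligned to no word
    for vc in zh_verbs:
        targets = alignment.get(vc, set())
        if not targets:
            to_none.append(vc)
            continue
        to_verbs.extend((vc, ve) for ve in sorted(targets & en_verbs))
    return to_verbs, to_none


def enhance(verified: Iterable[VerbPair], expanded: Iterable[VerbPair]) -> List[VerbPair]:
    """Enhanced alignments are the union of verified and expanded ones."""
    return sorted(set(verified) | set(expanded))


# Toy input: c1 is aligned to a verb, c2 to nothing, c3 to a noun.
pairs, unaligned = split_verb_alignments(
    ["c1", "c2", "c3"],
    {"c1": {"explored"}, "c3": {"business"}},
    en_verbs={"explored"},
)
print(pairs, unaligned)
```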

Page 10: Using Parallel Propbanks to Enhance Word-alignments

Evaluations

• Test Corpus

- NIST-GALE Web Genre Test Data

- 100 parallel sentences, 365 verb tokens, 273 verb types

• Measurements

- Term Coverage: how many Chinese verb-types are covered

- Term Expansion: how many English verb-types are suggested

- Alignment Accuracy: how many suggested English verb-types are correct
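
One way to compute the three measurements is sketched below. The exact counting rules are not given in the slides, so these are assumptions: coverage counts Chinese verb-types with at least one suggestion, expansion counts all suggested English verb-types, and accuracy is averaged per covered Chinese verb-type, in line with the "Average Alignment Accuracy" label on the result charts. The function name `evaluate` and the toy verbs are hypothetical.

```python
from typing import Dict, Set, Tuple


def evaluate(
    suggested: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Tuple[int, int, float]:
    """Return (term coverage, term expansion, average alignment accuracy)."""
    covered = {vc: ves for vc, ves in suggested.items() if ves}
    term_coverage = len(covered)
    term_expansion = sum(len(ves) for ves in covered.values())
    accuracies = [
        len(ves & gold.get(vc, set())) / len(ves) for vc, ves in covered.items()
    ]
    average_accuracy = sum(accuracies) / len(accuracies) if accuracies else 0.0
    return term_coverage, term_expansion, average_accuracy


# Toy data: c1 gets one right and one wrong suggestion, c2 one right, c3 none.
suggested = {"c1": {"explore", "search"}, "c2": {"worry"}, "c3": set()}
gold = {"c1": {"explore"}, "c2": {"worry"}}
print(evaluate(suggested, gold))
```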

Page 11: Using Parallel Propbanks to Enhance Word-alignments

Evaluations: Top-down

Figure (bar charts, Xinhua vs. Sinorama; series: Mac.th = 0.0 (GIZA++) and Mac.th = 0.5 (TDAM)).

- Term Coverage, values as extracted: 62, 76, 129, 79

- Average Alignment Accuracy, values as extracted: 78.09%, 83.71%, 57.76%, 83.35%

Page 12: Using Parallel Propbanks to Enhance Word-alignments

Evaluations: Bottom-up

Figure (bar charts, Xinhua vs. Sinorama; Mac.th = 0.8, Mic.th = 0.6).

- Term Coverage, values as extracted: 27, 18

- Average Alignment Accuracy, values as extracted: 14.46%, 63.89%

- Annotations: 5.5% error-reduction, 17% abs-improvement

Page 13: Using Parallel Propbanks to Enhance Word-alignments

Conclusions & Future Work

• Conclusions

- Top-down Argument Matching is most effective for verifying word-alignments based on non-literal translations that have proven difficult for GIZA++.

- Bottom-up Argument Matching shows promise for expanding the coverage of GIZA++ alignments based on literal translations.

• We will try to enhance word-alignments by using

- Automatically labeled Propbanks

- Nombanks, Named-entity tags

- Parallel Propbanks prior to GIZA++

Page 14: Using Parallel Propbanks to Enhance Word-alignments

Acknowledgements

• We gratefully acknowledge the support of the National Science Foundation Grants IIS-0325646, Domain Independent Semantic Parsing, CISE-CRI-0551615, Towards a Comprehensive Linguistic Annotation, and a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc.

• Special thanks to Daniel Gildea and Ding Liu (University of Rochester), who provided word-alignments; Wei Wang (Information Sciences Institute at University of Southern California), who provided the test-corpus; and Hua Zhong (University of Colorado at Boulder), who performed the evaluations.
