

ACL 2005 WORKSHOP ON BUILDING AND USING PARALLEL TEXTS (WPT-05), Ann Arbor, MI. June 2005


Competitive Grouping in Integrated Segmentation and Alignment Model

Ying Zhang Stephan Vogel

Language Technologies Institute

School of Computer Science

Carnegie Mellon University


Integrated Segmentation and Alignment Model

• Phrase alignment models (Och et al., 1999; Marcu and Wong, 2002; Koehn et al., 2003)
– Many of these models rely on a pre-calculated word alignment.
– They use different heuristics to extract phrase pairs from the Viterbi word alignment path.

• Integrated Segmentation and Alignment (ISA) model (Zhang 2003)
– No such word alignments needed
– Segments source and target sentences into phrases and aligns them simultaneously
– Uses chi-square(f, e) instead of the conditional probability P(f|e) for word pair associations
– Greedy search for phrase pairs
– Key idea: competitive grouping algorithm
– Inspired by the competitive linking algorithm (Melamed 1997) for word alignment


Competitive Linking Algorithm

• A greedy word alignment algorithm.

• The word pair with the highest likelihood L(f, e) “wins” the competition.

• One-to-one assumption: once the pair {f, e} is “linked”, neither f nor e can be aligned with any other word.

• Example:
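A minimal sketch of the greedy linking step described above, written for illustration and not taken from the slides. The function name `competitive_linking`, the dictionary layout, and the toy scores are all assumptions; a real aligner would work on word positions within a sentence pair rather than on bare word strings.

```python
def competitive_linking(likelihood):
    """Greedy one-to-one word alignment.

    likelihood: dict mapping (f, e) word pairs to a score L(f, e).
    Returns the linked pairs in the order they were chosen.
    """
    links = []
    used_f, used_e = set(), set()
    # Visit candidate pairs from the highest score to the lowest.
    for (f, e), _score in sorted(likelihood.items(), key=lambda kv: -kv[1]):
        # One-to-one assumption: skip pairs whose words are already linked.
        if f in used_f or e in used_e:
            continue
        links.append((f, e))
        used_f.add(f)
        used_e.add(e)
    return links


# Toy scores, invented for illustration only.
scores = {
    ("la", "the"): 0.9,
    ("la", "session"): 0.2,
    ("session", "session"): 0.8,
    ("session", "the"): 0.3,
}
links = competitive_linking(scores)
print(links)
```

With these toy scores, the pair ("la", "the") wins first and removes both words from the competition, so the lower-scoring pairs that reuse either word are skipped.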


Competitive Grouping Algorithm

• Discard the one-to-one assumption of competitive linking, which makes the algorithm less greedy.

• When a pair {e, f} wins the competition, it invites the neighboring pairs to join the “winner’s club”.

• Introduce the locality assumption: a source phrase of adjacent words can only be aligned to a target phrase of adjacent words.
– Words inside the aligned phrase pairs cannot be aligned to other words.


Expanding the Aligned Phrase Pair

• Two criteria have to be satisfied to expand the seed word pair into a phrase pair:
1. If a new source word f is to be grouped, the best e that f is associated with should not be “blocked” by this expansion; similarly for grouping a new target word.
2. The highest word pair likelihood value in the expanded area needs to be “similar” to the seed value.

• According to the locality assumption, words in the aligned phrase pairs cannot be aligned with other words again.
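The two criteria can be sketched roughly as follows. This is one possible reading of the slides, not the authors' implementation: the likelihood matrix `L` (rows are source positions, columns are target positions), the `ratio` threshold used to decide what counts as “similar”, and all function names are assumptions made for illustration.

```python
def best_target(L, i, free_cols):
    """Free target position most strongly associated with source position i."""
    return max(free_cols, key=lambda j: L[i][j])


def best_source(L, j, free_rows):
    """Free source position most strongly associated with target position j."""
    return max(free_rows, key=lambda i: L[i][j])


def competitive_grouping(L, ratio=0.5):
    """Return phrase pairs as ((top, bottom), (left, right)), inclusive spans."""
    free_rows, free_cols = set(range(len(L))), set(range(len(L[0])))
    phrase_pairs = []
    while free_rows and free_cols:
        # Seed: the best remaining word pair wins the competition.
        si, sj = max(((i, j) for i in free_rows for j in free_cols),
                     key=lambda ij: L[ij[0]][ij[1]])
        seed = L[si][sj]
        top, bottom, left, right = si, si, sj, sj
        grown = True
        while grown:
            grown = False
            # Try to group an adjacent, still-free source word.
            for i in (top - 1, bottom + 1):
                if i not in free_rows:
                    continue
                # Criterion 1: the best target for i must lie inside the span.
                unblocked = left <= best_target(L, i, free_cols) <= right
                # Criterion 2: the new area must hold a value close to the seed.
                strongest = max(L[i][j] for j in range(left, right + 1))
                if unblocked and strongest >= ratio * seed:
                    top, bottom, grown = min(top, i), max(bottom, i), True
            # Same two checks for an adjacent, still-free target word.
            for j in (left - 1, right + 1):
                if j not in free_cols:
                    continue
                unblocked = top <= best_source(L, j, free_rows) <= bottom
                strongest = max(L[i][j] for i in range(top, bottom + 1))
                if unblocked and strongest >= ratio * seed:
                    left, right, grown = min(left, j), max(right, j), True
        phrase_pairs.append(((top, bottom), (left, right)))
        # Locality assumption: grouped words cannot be aligned again.
        free_rows -= set(range(top, bottom + 1))
        free_cols -= set(range(left, right + 1))
    return phrase_pairs


# Toy likelihood matrix, invented for illustration only.
toy = [
    [0.9, 0.1, 0.1],
    [0.1, 0.8, 0.7],
    [0.1, 0.6, 0.3],
]
groups = competitive_grouping(toy)
print(groups)
```

In this toy matrix the first word pair stays a one-word phrase pair because its neighbors prefer partners outside it, while the second seed grows into a two-by-two phrase pair.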


Exploring All Possible Phrase Pairs

• Criterion 2 controls the granularity of the aligned phrase pairs:
– Two short phrase pairs
– Or one long phrase pair

• Short phrases give better coverage for unseen testing data

• Long phrases encapsulate more context, e.g. local reordering, word sense, etc.

• Hard to decide on the optimal granularity without knowing the testing data

• Solution: for each grouping, try all possible granularities
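One way to read “try all possible granularities” is to keep every contiguous sub-block that still contains the seed pair, rather than only the largest grouping. The sketch below enumerates those candidates under that assumption; the function name and the span representation are hypothetical and are not given in the slides.

```python
def candidate_phrase_pairs(seed, bounds):
    """List every rectangle that contains the seed and fits inside bounds.

    seed:   (i, j) position of the winning word pair.
    bounds: ((top, bottom), (left, right)), the largest grouping, inclusive.
    """
    (si, sj), ((top, bottom), (left, right)) = seed, bounds
    return [((t, b), (l, r))
            for t in range(top, si + 1)
            for b in range(si, bottom + 1)
            for l in range(left, sj + 1)
            for r in range(sj, right + 1)]


# Seed at (1, 1) inside a grouping that spans rows 1..2 and columns 1..2.
pairs = candidate_phrase_pairs(seed=(1, 1), bounds=((1, 2), (1, 2)))
print(pairs)
```

Here the enumeration yields the one-word pair, the two mixed-size pairs, and the full two-by-two pair, so both short and long phrases become available at translation time.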


Exploring All Possible Phrase Pairs

French: Je déclare reprise la session

English: I declare resumed the session


The Likelihood of Word Associations

• The chi-square statistic is used to measure the likelihood of word associations for a pair {e, f}

• For each word pair {e, f}, the null hypothesis is that e and f are independent of each other.

• Calculate the chi-square statistic to measure how well this hypothesis holds

• Construct the contingency table using the counts from the corpus given the current alignment, e.g. a uniform alignment
– O11: number of times e and f are aligned
– O12: number of times e is aligned with a word other than f
– O21: number of times f is aligned with a word other than e
– O22: number of times a word other than f is aligned with a word other than e

        f      ~f
e       O11    O12
~e      O21    O22
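For a 2x2 contingency table like the one above, the standard Pearson chi-square statistic has a closed form. The slides do not show which variant was used, so the sketch below assumes the plain version without continuity correction; the function name and the example counts are illustrative only.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 table, without continuity correction."""
    n = o11 + o12 + o21 + o22
    row_e, row_not_e = o11 + o12, o21 + o22
    col_f, col_not_f = o11 + o21, o12 + o22
    denom = row_e * row_not_e * col_f * col_not_f
    if denom == 0:
        # A zero margin means the table carries no evidence of association.
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / denom


# e and f always aligned with each other: strong association.
strong = chi_square_2x2(10, 0, 0, 10)
# Counts spread evenly: consistent with independence.
none = chi_square_2x2(5, 5, 5, 5)
print(strong, none)
```

A large value argues against the independence hypothesis, which is why it can serve as the word pair likelihood in the competition.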


In WPT-05

• Submitted results for all four languages

• Training data as provided

• Language model as provided

• Decoder (Pharaoh) as provided

BLEU      German  Spanish  Finnish  French
Dev-test  18.63   26.20    12.88    26.20
Test      18.93   26.14    12.66    26.71


Conclusion

• Competitive grouping algorithm at the core of the ISA model

• Simple and efficient model

• Results comparable to those of other phrase alignment models


The Evolution of ISA


Matrix of the Likelihood


Expanding the Phrase Pairs