Combining Word-Alignment Symmetrizations in Dependency Tree Projection
David Mareček ([email protected])
Charles University in Prague, Institute of Formal and Applied Linguistics
CICLing conference, Tokyo, Japan, February 21, 2011
Motivation
Suppose we have a text in a language that is not very common...
We would like to parse it, but we have neither a parser nor a manually annotated treebank
But we do have a parallel corpus with another language: English
Our goal: to create a parser
Take the parallel corpus with English
Run word alignment on it (GIZA++)
Parse the English side of the corpus (MST dependency parser)
Transfer the dependencies from English to the target language using the word-alignment
Train the parser on the resulting trees
Previous work
Rebecca Hwa (2002, 2005): a simple algorithm for projecting trees from English to Spanish and Chinese; only one type of alignment was used, and it is not specified which one
K. Ganchev, J. Gillenwater, B. Taskar (2009): an unsupervised parser with posterior regularization, in which the inferred dependencies should correspond to the projected ones; English to Bulgarian
Our contribution: to show that utilizing several types of alignment improves the quality of dependency projection
GIZA++ [Och and Ney, 2003]: two unidirectional asymmetric alignments, combined by symmetrization methods
A simple algorithm for projecting dependencies using different types of alignment links
Training and evaluating the MST parser
Word alignment
The GIZA++ toolkit produces asymmetric output: for each word in one language, at most one counterpart in the other language is found
ENGLISH-to-X (alignment links are drawn graphically on the slide):
Coordination of fiscal policies indeed , can be counterproductive .
Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .

X-to-ENGLISH (alignment links are drawn graphically on the slide):
Coordination of fiscal policies indeed , can be counterproductive .
Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .
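The two directional alignments can be represented as simple mappings. A minimal sketch in Python; the index mappings below are illustrative assumptions, not actual GIZA++ output for this sentence pair:

```python
# Each GIZA++ directional alignment maps every word of one side to at most
# one word of the other side.  Indices below are illustrative assumptions,
# not real GIZA++ output for the example sentences.

# ENGLISH-to-X: each English position -> at most one German position
en_to_de = {0: 1, 1: 2, 2: 2, 3: 3, 4: 7, 5: None, 6: 4, 7: 9, 8: 8}

# X-to-ENGLISH: each German position -> at most one English position
de_to_en = {0: None, 1: 0, 2: 2, 3: 3, 4: 6, 5: None, 6: 4, 7: 4, 8: 8, 9: 6}

def to_links(mapping, flip=False):
    """Turn a one-to-at-most-one mapping into a set of (en, de) link pairs."""
    pairs = {(a, b) for a, b in mapping.items() if b is not None}
    return {(b, a) for a, b in pairs} if flip else pairs

en_to_de_links = to_links(en_to_de)             # stored as (en, de) pairs
de_to_en_links = to_links(de_to_en, flip=True)  # also stored as (en, de) pairs

# The two directions generally disagree; that is the asymmetry the
# symmetrization methods on the following slides try to resolve.
print(en_to_de_links == de_to_en_links)  # -> False
```

Storing both directions as sets of (english, x) pairs makes the symmetrizations on the next slides plain set operations.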
Symmetrization methods: combinations of the two unidirectional alignments
INTERSECTION (links drawn graphically on the slide):
Coordination of fiscal policies indeed , can be counterproductive .
Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .

GROW-DIAG-FINAL (links drawn graphically on the slide):
Coordination of fiscal policies indeed , can be counterproductive .
Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .
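Both symmetrizations can be computed directly from the two link sets. A minimal sketch following the standard grow-diag-final heuristic of the phrase-based MT literature, which may differ in details from the exact variant used in the paper:

```python
def intersection(e2f, f2e):
    """INTERSECTION: keep only links present in both directions."""
    return e2f & f2e

def grow_diag_final(e2f, f2e):
    """GROW-DIAG-FINAL over two sets of (e, f) link pairs (standard heuristic).

    Start from the intersection, grow it with neighbouring (incl. diagonal)
    union links that attach a so-far-unaligned word, then add any remaining
    union link that attaches an unaligned word.
    """
    union = e2f | f2e
    alignment = e2f & f2e
    neighbours = [(-1, 0), (0, -1), (1, 0), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    grown = True
    while grown:                          # GROW-DIAG step, run to a fixpoint
        grown = False
        for e, f in sorted(alignment):
            for de, df in neighbours:
                cand = (e + de, f + df)
                if cand in union and cand not in alignment:
                    e_free = all(a != cand[0] for a, _ in alignment)
                    f_free = all(b != cand[1] for _, b in alignment)
                    if e_free or f_free:
                        alignment.add(cand)
                        grown = True
    for cand in sorted(union):            # FINAL step
        if cand not in alignment:
            e_free = all(a != cand[0] for a, _ in alignment)
            f_free = all(b != cand[1] for _, b in alignment)
            if e_free or f_free:
                alignment.add(cand)
    return alignment
```

By construction the result always lies between the two extremes: INTERSECTION is a subset of GROW-DIAG-FINAL, which is a subset of the union of the two directional alignments.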
Which alignment should we use for the projection?
We have presented four types of alignment: ENGLISH-to-X, X-to-ENGLISH, INTERSECTION, GROW-DIAG-FINAL
We prefer the X-to-ENGLISH alignment: we need to find a parent for each token in language X, and we do not mind that some English words stay unaligned
We recognize three types of links:
A: links that appear in the INTERSECTION alignment (red)
B: links that appear in GROW-DIAG-FINAL and also in the X-to-ENGLISH alignment (orange)
C: links that appear only in the X-to-ENGLISH alignment (blue)
Coordination of fiscal policies indeed , can be counterproductive .
Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .
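The three link types follow directly from set membership. A minimal sketch, assuming link sets are stored as sets of (english, x) index pairs:

```python
def classify_links(x_to_en, inter, gdf):
    """Classify each X-to-ENGLISH link by how strongly it is confirmed.

    A: also in INTERSECTION (most reliable, confirmed by both directions)
    B: also in GROW-DIAG-FINAL (but not in INTERSECTION)
    C: only in the X-to-ENGLISH alignment (least reliable)
    """
    types = {}
    for link in x_to_en:
        if link in inter:
            types[link] = 'A'
        elif link in gdf:
            types[link] = 'B'
        else:
            types[link] = 'C'
    return types

# Illustrative link sets (hypothetical indices, not taken from the slide):
x2e = {(0, 1), (4, 6), (6, 9)}
print(classify_links(x2e, inter={(0, 1)}, gdf={(0, 1), (4, 6)}))
# -> {(0, 1): 'A', (4, 6): 'B', (6, 9): 'C'}
```

Since INTERSECTION is always contained in X-to-ENGLISH, every X-to-ENGLISH link receives exactly one of the three labels.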
Algorithm - example
Coordination of fiscal policies indeed , can be counterproductive .
Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein .
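The projection step itself is shown only graphically on the slide, so the following is a hypothetical reconstruction under stated assumptions: link reliability (A over B over C) decides between competing links for the same English word, and tokens whose counterpart or head cannot be resolved attach to an artificial root. The paper's exact tie-breaking rules are not in the transcript.

```python
PRIORITY = {'A': 0, 'B': 1, 'C': 2}   # A links are the most reliable

def project_heads(x_to_en, link_type, en_heads):
    """Project English dependency heads onto language X (simplified sketch).

    x_to_en:   {x_index: en_index or None}  X-to-ENGLISH alignment
    link_type: {x_index: 'A'|'B'|'C'}       reliability of each link
    en_heads:  {en_index: en_index or None} English tree (None = root)
    Returns    {x_index: x_index or None}   projected X tree (None = root)
    """
    # Invert the alignment, keeping the most reliable link per English word.
    en_to_x = {}
    for x, e in x_to_en.items():
        if e is None:
            continue
        if e not in en_to_x or PRIORITY[link_type[x]] < PRIORITY[link_type[en_to_x[e]]]:
            en_to_x[e] = x
    heads = {}
    for x, e in x_to_en.items():
        if e is None or en_heads.get(e) is None:
            heads[x] = None                      # attach to artificial root
        else:
            heads[x] = en_to_x.get(en_heads[e])  # None if the head is unaligned
    return heads
```

For a three-token sentence aligned one-to-one, with English word 2 as the root and words 0 and 1 depending on it, the projected X tree mirrors the English one: tokens 0 and 1 attach to token 2, which becomes the root.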
Results
The best results for each of the testing languages:
English parser trained on CoNLL-X data
The projection was made on the first 100,000 sentence pairs from the News Commentary (or Acquis Communautaire) parallel corpus
We used McDonald's maximum spanning tree (MST) parser
Language    Parallel Corpus    Testing Data    Accuracy
Bulgarian   Acquis             CoNLL-X         52.7 %
Czech       News               CoNLL-X         62.0 %
Dutch       Acquis             CoNLL-X         52.4 %
German      News               CoNLL-X         55.7 %
Why is the accuracy so low?
The CoNLL-X treebanks differ in their annotation guidelines: different handling of coordination structures, auxiliary verbs, noun phrases, ...
Comparison with previous work
We ran our projection method on the same datasets as the previous work by Ganchev et al. (2009): Bulgarian, OpenSubtitles parallel corpus
English parser trained on the Penn Treebank
Tested on Bulgarian CoNLL-X training sentences of up to 10 words
Method           Parser                  Accuracy
Ganchev et al.   Discriminative model    66.9 %
Ganchev et al.   Generative model        67.8 %
Our method       MST parser              68.1 %
Our results are slightly better: we did NOT use any unsupervised inference of dependency edges, and we made better use of the word alignment
Conclusions
We showed that combining different word-alignment symmetrizations improves dependency tree projection
We outperform the state-of-the-art results
The main problem in evaluation is that each treebank follows different annotation guidelines
Thank you for your attention