
Semi-supervised Dependency Parsing

Part D: Multilingual Dependency Parsing

In this part, we will survey the current work on multilingual DP and discuss some potential directions along this research line.

Motivation
- A difficult syntactic ambiguity in one language may be easy to resolve in another language (bilingual constraints)
- A more accurate parser for one language may help a less accurate one for another language (semi-supervised)
- Rich labeled resources in one language can be transferred to build parsers for another language (unsupervised)

Here, multilingual DP means trying to improve target-language dependency parsing by making use of bilingual constraints, or of source-language resources or parsers.

There are a few motivations behind this research line.

First, a difficult syntactic ambiguity in one language may be easy to resolve in another. For example, the PP-attachment problem is very challenging in English; however, in Chinese, it is quite easy to attach a prepositional phrase to the correct position.

Second, a more accurate parser for one language may help a less accurate one. For example, English dependency parsers can achieve about 90% UAS, whereas Chinese parsers can only achieve about 80%. Such an accuracy gap may be due to the intrinsic difficulty of the languages or to the scale of available labeled data. Therefore, it is reasonable to suppose that an English parser can help a Chinese parser on bilingual text.

Third, when the target language has no labeled training data, it is a natural and interesting idea to transfer labeled resources from the source language in order to build the target-language parser.

- Intrinsic complexity of languages (Chinese vs. English)
- Scale of labeled resources

To give a few examples: English vs. Chinese PP attachment; PP boundary detection in Chinese; the Chinese DE structure; attributive clauses.

Obstacles
- Syntactic non-isomorphism across languages
- Different annotation choices (guidelines)
- Partial (incomplete) parse trees resulting from projection
- Parsing errors on the source side
- Word alignment errors

However, there exist a few obstacles along this research line.

First, syntactic non-isomorphism across languages is the most severe and challenging problem, especially when the source language and the target language belong to different language families, such as English and Chinese.

The second challenge is that even for the same language, there are many different annotation choices to make during treebanking. One typical example is the noun coordination phrase: should the first conjunct noun or the last one be the head of the whole phrase? Where should the conjunction word be attached?

Third, if we project a tree of a source-language sentence onto the target-language sentence via word alignments, we usually get only a partial tree. The problem is then how to make use of target-language sentences with partial trees.

Also, parsing errors on the source side and word alignment errors make the bilingual constraints and projected structures noisy, and therefore unreliable.

Research map (two perspectives)
Methods
- Delexicalized (does not rely on bitext)
- Projection (relies on bitext)
  - Project hard edges
  - Project edge weights (distributions)
  - As training objective (bilingual constraints)

Supervision (existence of a target treebank)
- Semi-supervised
- Unsupervised

According to the methodology of the proposed methods, we divide the current work into two major categories: the delexicalized method and the projection method. The delexicalized method does not need bitext; it directly uses the source-language treebank to train an unlexicalized parser based only on POS-tag-related features. The projection method relies on bitext to transfer source-language syntactic knowledge into the target language. We can further divide the projection method into three cases: projecting hard edges, projecting edge weights (also known as distributions), or embedding bilingual constraints into the training objective.

From another perspective, depending on whether a target treebank exists, such work can be classified into semi-supervised and unsupervised.

Bilingual word reordering features (Huang+ 09)
Supervised with bilingual constraints
- Use the word reordering information in the source language as extra features when parsing target text (using automatic word alignment as a bridge)

Two characteristics
- Bilingual text with target-side labels
- No parse tree on the source side

Huang et al. (2009) propose a very simple and effective method of making use of bilingual constraints to help monolingual parsing.

They find that the relative word reordering information in the source language can help target-language parsing. Therefore, during both the training and test phases, they use such word reordering information as extra features in the parsing model. For example, suppose we want to parse an English sentence, and we have the translated Chinese sentence and the word alignments. For the English sentence, where the prepositional phrase "with a telescope" should be attached is the notorious PP-attachment problem. We can see that the Chinese-side word reordering information provides helpful clues for resolving it.

Train and test on bitext
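To make this concrete, here is a minimal sketch of how such a reordering feature might be computed; the function, the dict-based alignment format, and the feature names are our illustration, not Huang et al.'s actual templates:

```python
def reordering_feature(head_idx, mod_idx, align):
    """Bilingual reordering feature for a candidate dependency.

    head_idx, mod_idx: 0-based positions of the candidate head and
        modifier in the target (e.g., English) sentence.
    align: dict mapping each target position to a list of source
        (e.g., Chinese) positions, from automatic word alignment.
    """
    src_head = align.get(head_idx)
    src_mod = align.get(mod_idx)
    if not src_head or not src_mod:
        return "reorder=UNALIGNED"      # no bilingual evidence for this pair
    # Compare the relative order of the two words on both sides: whether
    # the source-side translations preserve or swap the target-side order
    # is a clue about the correct attachment.
    tgt_order = head_idx < mod_idx
    src_order = min(src_head) < min(src_mod)
    return "reorder=SAME" if tgt_order == src_order else "reorder=SWAPPED"
```

Such a feature would then be conjoined with the POS tags of the head and modifier and added to the arc's feature vector, in both training and test.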

Build local classifiers via projection (Jiang & Liu 10)
Semi-supervised; project edges
- Step 1: projection, to obtain dependency/non-dependency classification instances
- Step 2: build a target-language local dependency/non-dependency classifier
- Step 3: feed the outputs of the classifier into a supervised parser as extra weights during the test phase

Jiang and Liu (2010) propose a projection method.

First, they project the source-language parse tree onto the target language to obtain dependency/non-dependency instances.

Second, they build a target-language local classifier that predicts, for each word pair, dependency vs. non-dependency.

Finally, they integrate the outputs of the classifier into a supervised target-language parser as extra weights during the test phase. In this way, the target-language parser is enhanced to achieve higher accuracy.

The example shows the projection process. The blue arcs are dependency instances, and the red arcs are non-dependency instances. They only use such dependency/non-dependency instances to build a local classifier; therefore, they avoid the partial-tree problem.
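To make the projection step concrete, here is a minimal sketch with simplified one-to-one alignments; the function and data formats are our illustration, and Jiang and Liu's actual instance extraction differs in details:

```python
def projection_instances(src_heads, align):
    """Derive dependency / non-dependency word-pair instances on the
    target side from a source tree and one-to-one word alignments.

    src_heads: src_heads[m] = head position of source word m (root = -1).
    align: dict mapping source positions to target positions.
    Returns (positive, negative) lists of (head, modifier) target pairs.
    """
    positive, negative = [], []
    aligned = sorted(align)                  # aligned source positions
    for m in aligned:
        for h in aligned:
            if h == m:
                continue
            pair = (align[h], align[m])
            if src_heads[m] == h:            # projected dependency arc
                positive.append(pair)
            else:                            # source tree says: not an arc
                negative.append(pair)
    return positive, negative
```

Because the classifier is trained on word pairs rather than whole trees, partial projections are unproblematic.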


Prune projected structures with target-language marginals (Li+ 14)
Semi-supervised; project edges from English to Chinese
- The goal is to acquire extra large-scale, high-quality labeled training instances from bitext

- Step 1: parse the English sentences of the bitext
- Step 2: project the English trees onto Chinese, and filter unlikely dependencies with target-language marginal probabilities (handling cross-language non-isomorphism, as well as parsing and alignment errors)
- Step 3: train with the extra training instances with partial annotations

Next we introduce our recent work, which will be presented in detail the day after tomorrow. Our goal is to acquire extra large-scale, high-quality labeled training instances from bitext.

It works in three steps. First, we parse the English sentences of the bitext.

Prune projected structures with target-language marginals (Li+ 14)

Here is an example. In the figure above, we have an English sentence with its auto-parsed tree, a translated Chinese sentence, and the word alignments between the two sentences.

In Fig. b, we project the English tree onto Chinese and get a partial Chinese tree. Please note that there are no cross-lingual non-isomorphic structures in this example; therefore, no filtering step is needed.

Then, we convert the partial tree into a forest by attaching the unlinked words in all possible ways. Such projected sentences with forest labels are used as extra training data. During training, the objective is to maximize the probabilities of the forests, instead of single trees as in traditional parsing.
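A minimal sketch of the pruning and forest-construction steps; the threshold value, function names, and data formats are our assumptions, not the paper's:

```python
def prune_and_build_forest(n, projected, marginal, threshold=0.1):
    """Turn projected dependencies into per-word candidate-head sets.

    n: number of target-language words (positions 1..n; 0 is the root).
    projected: dict mapping a modifier position to its projected head.
    marginal: marginal(h, m) -> probability of arc h -> m under a
        baseline target-language parser.
    Returns candidates[m] = set of allowed heads for word m.
    """
    candidates = {}
    for m in range(1, n + 1):
        if m in projected and marginal(projected[m], m) >= threshold:
            # keep the projected arc only if the target-language model
            # does not consider it highly unlikely
            candidates[m] = {projected[m]}
        else:
            # unlinked (or filtered) words may attach anywhere, so the
            # sentence's label becomes a forest rather than a single tree
            candidates[m] = {h for h in range(0, n + 1) if h != m}
    return candidates
```

Training then maximizes the total probability of all trees consistent with these candidate sets, i.e., the probability of the forest.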

Delexicalized transfer (no bitext) (McDonald+ 11)
Unsupervised; delexicalized
- Direct delexicalized transfer (Zeman & Resnik 08, Søgaard 11, Cohen+ 11)
- Train a delexicalized parser on the source treebank, ignoring words and using only universal POS tags (Petrov+ 11)

From this slide on, we introduce a few papers that belong to the unsupervised approach, where no target-language labeled data is available.

McDonald et al. propose a few methods for delexicalized multilingual dependency parsing. The basic method is called direct delexicalized transfer, which was previously studied by Zeman and Resnik (2008), Søgaard (2011), and Cohen et al. (2011).

Suppose we use the English treebank as the source treebank. The idea is to ignore the words and use only POS-tag-related features when building an English parser. Such a parser is called a delexicalized parser. It can be used to parse sentences in any other language, as long as the POS tags of the input sentences are provided.

Delexicalized transfer (no bitext) (McDonald+ 11)
Direct delexicalized transfer
- Step 1: replace the language-specific POS tags in the source treebank with the universal POS tags (Petrov+ 11): NOUN (nouns); VERB (verbs); ADJ (adjectives); ADV (adverbs); PRON (pronouns); DET (determiners); ADP (pre/postpositions); NUM (numerals); CONJ (conjunctions); PRT (particles); PUNC (punctuation marks); X (a catch-all tag)
- Step 2: train a delexicalized source-language parser
  - Only use POS tag features (no lexical features)
  - 89.3 (lex) => 82.5 (delex) on the English test set
- Step 3: parse target-language sentences with universal POS tags

In detail, this method works in three steps. First, replace the language-specific POS tags in the source treebank with the universal POS tags. The idea of using universal POS tags was proposed by Petrov et al. (2011). The universal tag set contains 12 coarse-grained POS tags, which can be easily constructed for any language based on existing POS tagging resources.
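For example, step 1 amounts to a table lookup from language-specific tags to universal tags. The excerpt below is illustrative only; see Petrov+ 11 for the complete mapping files:

```python
# A small excerpt of a Penn-Treebank-to-universal-tag mapping
# (illustrative; not the complete mapping of Petrov+ 11).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": "PUNC",
}

def delexicalize(treebank):
    """Map tags to the universal tag set and drop all lexical forms."""
    for sentence in treebank:            # a sentence is a list of token dicts
        for token in sentence:
            token["pos"] = PTB_TO_UNIVERSAL.get(token["pos"], "X")
            token["form"] = None         # remove the word itself
    return treebank
```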

In step 2, we train a delexicalized source-language parser using only POS tag features.

McDonald+ show that on English, a lexicalized parser achieves an accuracy of 89.3%, while a delexicalized parser achieves 82.5%.

In step 3, the delexicalized parser can be used to parse target-language sentences with universal POS tags.

Multi-source delexicalized transfer
- Use multiple source treebanks (besides English), and concatenate them into one large training set

Can improve parsing accuracy.

Delexicalized transfer (no bitext) (McDonald+ 11)
[Figure: delexicalized data from English, Greek, and German concatenated to train a delexicalized parser for Danish]

Then, McDonald+ further propose multi-source delexicalized transfer.

The idea is to use multiple source treebanks and concatenate them into larger training data. Here is an illustration: we concatenate the treebanks of English, Greek, and German into one large treebank, where the words are ignored and only universal POS tags are used.

Then, we use this large treebank to train a delexicalized parser for the target language, say Danish.

Results show that this simple method can further improve parsing accuracy.

Delexicalized + projected transfer (with bitext) (McDonald+ 11)
Unsupervised; delexicalized + projection (as training objective)
Workflow (Danish as the target language)
- Step 1: parse a set of Danish sentences with the delexicalized transfer parser (trained on the English treebank)
- Step 2: train a lexicalized Danish parser using the predicted parses as gold standard
- Step 3: improve the lexicalized Danish parser on bitext
  - Extrinsic objective: the parser's outputs should be more consistent with the outputs of the lexicalized English parser (training with multiple objectives, Hall+ 11)

Motivated by the fact that lexicalized parsers largely outperform unlexicalized parsers, McDonald+ try to relexicalize the delexicalized parser by combining the delexicalized method with the projection method.

Here, we suppose English is the source language and Danish is the target language. The idea works in three steps. First, use the delexicalized parser to parse a set of Danish sentences, getting a 1-best parse tree for each sentence.

In step 2, train a lexicalized Danish parser on this set of Danish sentences, treating the auto-parsed trees as gold standard.

In step 3, improve the lexicalized Danish parser on some bitext by using bilingual constraints as an extrinsic objective. The bilingual constraints encourage the parser's outputs on Danish to be more consistent with the outputs of a lexicalized English parser on the English side.

Results show that this idea improves over delexicalized transfer.
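The extrinsic objective in step 3 rewards cross-lingual agreement through the word alignment. Below is a minimal sketch of one plausible agreement measure; it is our illustration, not the actual multi-objective training of Hall+ 11:

```python
def bilingual_agreement(da_arcs, en_arcs, align):
    """Fraction of Danish arcs whose aligned English word pair is also an
    arc in the English parse; usable as a consistency signal in step 3.

    da_arcs, en_arcs: sets of (head, modifier) position pairs.
    align: dict mapping Danish positions to English positions.
    """
    projected = {(align[h], align[m]) for (h, m) in da_arcs
                 if h in align and m in align}
    if not projected:
        return 0.0                 # no aligned arcs, no bilingual evidence
    return len(projected & en_arcs) / len(projected)
```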

Delexicalized transfer with cross-lingual word clusters (Tackstrom+ 12)
Unsupervised; delexicalized + cross-lingual word clusters

In delexicalized dependency parsers, only POS tag features are used.

In other previous work on dependency parsing, features based on word clusters work very well.

Tackstrom+ try to relexicalize the delexicalized parsers from another perspective.

Their work is motivated by the previous finding that word cluster features can help parsing.

Delexicalized transfer with cross-lingual word clusters (Tackstrom+ 12)
- Word clusters derived from large-scale unlabeled data can alleviate data sparseness
- Granularity: POS tags < word clusters < words

[Table: LAS of a supervised English parser with different feature sets]

Then they conduct experiments on the English treebank to study the effect of different feature types.

They show that word cluster features, combined with POS tag features, achieve accuracy very close to that of using lexical features together with POS tag features.

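As a sketch of how cluster features slot in beside POS tag features; the template names are illustrative, not Tackstrom et al.'s exact feature set:

```python
def arc_features(sent, h, m, clusters):
    """Features for a candidate arc h -> m, built from universal POS tags
    and word cluster ids instead of raw word forms.

    sent: list of token dicts with "pos" and "form" keys.
    clusters: dict mapping a word form to its (cross-lingual) cluster id.
    """
    hp, mp = sent[h]["pos"], sent[m]["pos"]
    hc = clusters.get(sent[h]["form"], "UNK")
    mc = clusters.get(sent[m]["form"], "UNK")
    return [
        f"hp={hp}|mp={mp}",       # delexicalized backbone
        f"hc={hc}|mc={mc}",       # cluster analogue of a bilexical feature
        f"hp={hp}|mc={mc}",
        f"hc={hc}|mp={mp}",
    ]
```

Because the cluster ids are consistent across languages (next slide), the very same feature strings fire on the source treebank at training time and on target sentences at test time.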

Delexicalized transfer with cross-lingual word clusters (Tackstrom+ 12)
- Re-lexicalize the delexicalized transfer parser by using cross-lingual word clusters
- Cross-lingual word clusters need to be consistent across languages
- A cross-lingual clustering algorithm that leverages:
  - large amounts of monolingual unlabeled data, and
  - parallel data, through which the cross-lingual word cluster constraints are enforced

Based on this observation, they relexicalize the delexicalized parsers by using cross-lingual word cluster features.

One key challenge is that the cross-lingual word clusters need to be consistent across languages.

Then, the authors extend monolingual word clustering algorithms into a cross-lingual clustering algorithm.

Such cross-lingual word clusters can help delexicalized transfer of dependency parsers (re-lexicalization).


Delexicalized transfer with cross-lingual word clusters (Tackstrom+ 12)

This example illustrates the outputs of the bilingual word clustering algorithm.

For example, we can see that under cluster id 60, the English side contains "was", among other words, while the Spanish side contains "estaba". These words should play similar syntactic functions in both languages.

Here is an example that shows how these cross-lingual clusters are used in the delexicalized parser.

Besides universal POS tags, the word clusters are also provided in the source-language treebank and the target-language test sentences. In this way, word cluster features can be used as extra features in the parsing model.

Selectively multi-source delexicalized transfer (Tackstrom+ 13)
Unsupervised; delexicalized (also in Naseem+ 12)
- Some delexicalized features do not transfer well across typologically different languages
- Selective parameter sharing based on typological and language-family features (Dryer and Haspelmath 11)

Tackstrom+ further extend the multi-source delexicalized transfer method. This work is motivated by the observation that some delexicalized features do not transfer well across typologically different languages. Therefore, they propose selective parameter sharing based on typological and language-family features.

This work considers four typological features to divide languages into different families, for example, the order of subject, object, and verb.

Selectively multi-source delexicalized transfer (Tackstrom+ 13)
Selectively use features for different source and target language pairs
- Indo-European (Bulgarian, Catalan, Czech, Greek, English, Spanish, Italian, Dutch, Portuguese, Swedish)
- Altaic (Japanese, Turkish)

For each source-target language pair, they fire a subset or the full set of the features of the delexicalized parser. According to the typological features on the previous slide, they divide the languages into two classes: the Indo-European family and the Altaic family. A minimal sketch of such selective feature sharing follows.
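This sketch is our illustration of the idea, with a toy typology table and a hypothetical "dir=" prefix marking order-sensitive templates; the actual feature inventory is that of Tackstrom+ 13 and Dryer and Haspelmath (2011):

```python
# Toy WALS-style typological properties per language (illustrative values).
TYPOLOGY = {
    "english":  {"order": "SVO", "adposition": "prep"},
    "italian":  {"order": "SVO", "adposition": "prep"},
    "japanese": {"order": "SOV", "adposition": "postp"},
    "turkish":  {"order": "SOV", "adposition": "postp"},
}

def shared_feature_subset(features, src, tgt):
    """Keep order-sensitive feature templates (here marked with a
    hypothetical "dir=" prefix) only when the source and target
    languages agree on basic word order."""
    same_order = TYPOLOGY[src]["order"] == TYPOLOGY[tgt]["order"]
    return [f for f in features if same_order or not f.startswith("dir=")]
```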

Ambiguity-aware transfer (Tackstrom+ 13)
Unsupervised; delexicalized
- Relexicalize delexicalized parsers using target-language unlabeled text
- First, parse the unlabeled data with the base delexicalized parser, and use the arc marginals to construct ambiguous labels (a forest) for each sentence
- Then, train a lexicalized parser on the unlabeled data via ambiguity-aware training

Tackstrom+ thus propose another method to relexicalize the delexicalized parser, based on ambiguity-aware training.


Ambiguity-aware transfer (Tackstrom+ 13)

Here is an example sentence with ambiguous labels. We can see that after filtering out highly unlikely dependencies, the resulting structure is a forest, where a word may have multiple head candidates.

During training, the objective is to maximize the likelihood of the training data with ambiguous labels, which is equivalent to maximizing the probability of the forest, instead of a single parse tree as in traditional training.
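Concretely, if $\mathcal{F}(x)$ denotes the forest of trees consistent with the ambiguous labels of sentence $x$, this objective can be written as

\[
\mathcal{L}(\theta) \;=\; \sum_{x} \log \sum_{y \in \mathcal{F}(x)} p_\theta(y \mid x),
\]

which reduces to ordinary log-likelihood training whenever $\mathcal{F}(x)$ contains a single tree.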

Ambiguity-aware transfer (Tackstrom+ 13)
- Ambiguity-aware self-training
- Ambiguity-aware ensemble training
  - Add the outputs of another parser (Naseem+ 12) into the ambiguous label set
  - Can improve accuracy

The method in the previous two slides is named ambiguity-aware self-training. Tackstrom+ show that we can also add the outputs of another parser into the ambiguous label set; the method is then called ambiguity-aware ensemble training. They show this can further improve accuracy.

Syntax projection as posterior regularization (Ganchev+ 09)
Unsupervised; project high-probability edges (used in the training objective)
Key points
- Filter unlikely projected dependencies according to source-side dependency probabilities and alignment probabilities (only project high-confidence dependencies)
- Give the model more flexibility via posterior regularization

Ganchev+ propose a projection method based on posterior regularization.

This work has two key points.

Syntax projection as posterior regularization (Ganchev+ 09)
Posterior regularization
- After training, the model should satisfy the constraint that the expected proportion of conserved edges (among the projected edges) is at least 90%

The posterior regularization term enforces that the expected proportion of conserved edges among the projected edges is at least 90%.

For example, from the lower English tree, after filtering bad projections, we get four dependencies.

Intuitively, posterior regularization means the model does not necessarily have to adopt all four dependencies; it can abandon 10% of the dependencies if needed.

This gives the model flexibility to handle potential projection errors.

Syntax projection as posterior regularization (Ganchev+ 09)

The posterior-regularized training objective combines the log-likelihood of the target-side bitext, the model prior, and a KL term towards the constraint set:

\[
\max_{\theta}\;\; \underbrace{\log p(\theta)}_{\text{model prior}} \;+\; \underbrace{\mathcal{L}(\theta)}_{\text{log-likelihood of the target-side bitext}} \;-\; \min_{q \in Q} \mathrm{KL}\big(q(\mathbf{z}) \,\big\|\, \underbrace{p_\theta(\mathbf{z} \mid \mathbf{x})}_{\text{model posterior}}\big),
\]

where $Q$ is the set of distributions satisfying the desired posterior constraints.

Syntax projection as posterior regularization (Ganchev+ 09)
- Create several simple rules to handle different annotation decisions (annotation guidelines)
- E.g., main verbs should dominate auxiliary verbs
- Can largely improve parsing accuracy

Ganchev+ also create such simple rules, and show that this can largely improve parsing accuracy.

Combining unsupervised and projection objectives (Liu+ 13)
Unsupervised; project edges
- Derive local dependency/non-dependency instances from projection (Jiang+ 10)


Surprisingly, Chinese is not the worst.

Combining unsupervised and projection objectives (Liu+ 13)
Workflow

Combining unsupervised and projection objectives (Liu+ 13)
Joint objective: a linear interpolation of
- the unsupervised objective, and
- the projection objective
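Presumably (our reading of the slide, with $\lambda$ as a hypothetical interpolation weight), the joint objective takes the form

\[
J(\theta) \;=\; \lambda\, J_{\text{unsup}}(\theta) \;+\; (1 - \lambda)\, J_{\text{proj}}(\theta),
\]

so that $\lambda = 1$ recovers pure unsupervised grammar induction and $\lambda = 0$ recovers pure projection-based training.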

Transfer distribution as bilingual guidance (Ma & Xia 14)
Unsupervised; a hybrid of projection and delexicalized methods; transfer edge weights (distributions)

[Figure: projected weights (lexicalized) vs. delexicalized weights]

Ma and Xia propose a method to transfer distributions as bilingual guidance. Their method can be understood as a hybrid of the projection and delexicalized methods. They transfer edge weights from the source language to the target language.

Here is an example where English is the source and Chinese is the target. Three dependencies are projected from English to Chinese. For these three projected dependencies, their weights equal the weights of the corresponding English dependencies according to a lexicalized parser. For the un-projected dependencies in the Chinese sentence, a delexicalized English parser provides the weights. By now, every possible dependency in the Chinese sentence has a weight, from either the projection or the delexicalized method.

- Project the weights of source-language edges onto the target language
- Train a lexicalized source parser and a delexicalized source parser

Transfer distribution as bilingual guidance (Ma & Xia 14)
- Use the target-language sentences to train a parser, by minimizing the KL divergence from the transferred distribution to the distribution of the learned model

Then, they use the target-language sentences to train a parser, trying to learn a model whose distribution is closest to the transferred distribution.

Only edge weights are projected; no tree structures are explicitly projected.
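A minimal sketch of the weight transfer and the KL objective; the function names, the normalization, and the per-word factorization are our assumptions, not the paper's exact formulation:

```python
import math

def transferred_distribution(n, projected, lex_weight, delex_weight):
    """Build a transferred head distribution for each target word m:
    projected arcs take the lexicalized source parser's weight, all other
    arcs take the delexicalized parser's weight; then normalize per word.
    """
    dist = {}
    for m in range(1, n + 1):
        scores = {}
        for h in range(0, n + 1):
            if h == m:
                continue
            if projected.get(m) == h:
                scores[h] = lex_weight(h, m)    # weight carried over via alignment
            else:
                scores[h] = delex_weight(h, m)  # delexicalized fallback
        z = sum(scores.values())
        dist[m] = {h: s / z for h, s in scores.items()}
    return dist

def kl_loss(transferred, model):
    """KL(transferred || model), summed over words; training drives the
    learned model's arc distribution towards the transferred one."""
    return sum(p * math.log(p / model[m][h])
               for m, heads in transferred.items()
               for h, p in heads.items() if p > 0)
```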

Transfer distribution as bilingual guidance (Ma & Xia 14)
- Use unlabeled data by minimizing entropy
- Not very helpful

Summary of Part D
- Help target-language parsing with source-language treebanks and bitext
- Semi-supervised vs. unsupervised
- Projection vs. delexicalized
- Project hard edges, transfer edge weights, or use bilingual constraints

OK, let me summarize Part D. In this part, we focused on the idea of using source-language treebanks and/or bitext to help target-language parsing. We distinguished the methods from different perspectives: semi-supervised vs. unsupervised; projection vs. delexicalized; and projecting hard edges, transferring edge weights, or using bilingual constraints.

Summary of Part D: future thoughts
- Word alignment errors (probabilities)
- Source-language parsing errors (probabilities)
- How to capture cross-language non-isomorphism
- Joint word alignment and bilingual parsing?

In my personal opinion, a few possible directions may be further explored along this research line. First, how to alleviate the effect of word alignment errors? A straightforward method is to use word alignment probabilities, instead of using hard alignments directly.

Second, how to alleviate the effect of source-language parsing errors? Similarly, we can use the dependency probabilities.

The third direction is the most challenging: how to design a systematic way to capture cross-language non-isomorphic structures? Previous work has done little on this, so far using only hand-crafted, language-dependent rules.

The final thought is that maybe we should work on joint word alignment and bilingual dependency parsing, trying to handle all of the above problems in a unified joint framework.

References

Matthew S. Dryer and Martin Haspelmath, editors. 2011. The World Atlas of Language Structures Online. Munich: Max Planck Digital Library. http://wals.info/.
Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of the 47th ACL.
Liang Huang, Wenbin Jiang, and Qun Liu. 2009. Bilingually-constrained (monolingual) shift-reduce parsing. In Proceedings of EMNLP, pages 1222-1231.
Wenbin Jiang and Qun Liu. 2010. Dependency parsing and projection based on word-pair classification. In Proceedings of ACL, pages 897-904.
Zhenghua Li, Min Zhang, and Wenliang Chen. 2014. Soft cross-lingual syntax projection for dependency parsing. In Proceedings of COLING, pages 783-793.
Kai Liu, Yajuan Lü, Wenbin Jiang, and Qun Liu. 2013. Bilingually-guided monolingual dependency grammar induction. In Proceedings of ACL.
Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of ACL.
Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of EMNLP.
Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. ArXiv:1104.2086.
Anders Søgaard. 2011. Data point selection for cross-language adaptation of dependency parsers. In Proceedings of ACL, pages 682-686.
Kathrin Spreyer and Jonas Kuhn. 2009. Data-driven dependency parsing of new languages using incomplete and noisy training data. In Proceedings of CoNLL, pages 12-20.
Oscar Täckström, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proceedings of NAACL.
Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of NAACL-HLT.
