
Inversion Transduction Grammar with Linguistic Constraints

Colin Cherry
University of Alberta, Nov 30, 2006

Slide 2

Edmonton Weather (Tuesday)

Slide 3

Outline

• Bitext and Bitext Parsing

• Inversion Transduction Grammar (ITG)

• ITG with Linguistic Constraints

• Discriminative ITG with Linguistic Features

• Other Projects

Slide 4

Statistical Machine Translation

• Input:
– Source language sentence E
• Goal:
– Produce a well-formed target language sentence F with same meaning as E
• Process:
– Decoding: search for an operation sequence O that transforms E into F
– Weights on individual operations are determined empirically from examples of translation

Slide 5

Bitext

• Valuable resource for training and testing statistical machine translation systems
• Large-scale examples of translation
• Needs analysis to determine small-scale operations that generalize to unseen sentences

[Figure: a text in English beside the same text in French]

Slide 6

Word Alignment

• Given a sentence and its translation, find the word-to-word connections

the minister in charge of the Canadian Wheat Board

le ministre chargé de la Commission Canadienne du blé

Slide 7

Word Alignment

• Given a sentence and its translation, find the word-to-word connections

• Link: a single word-to-word connection

the minister in charge of the Canadian Wheat Board

le ministre chargé de la Commission Canadienne du blé

Slide 8

Given a Word Alignment

• Extract bilingual phrase pairs for phrasal SMT (Koehn et al. 2003)
• Add in a parse tree and:
– Extract treelet pairs for dependency translation (Quirk et al. 2005)
– Extract rules for a tree transducer (Galley et al. 2004)
• Other fun things:
– Train monolingual paraphrasers (Quirk et al. 2004, Callison-Burch et al. 2005)

Slide 9

Bitext Parsing

• Assume a context-free grammar generates two languages at once

• Like joint models, but position of words in both languages is controlled by grammar

Slide 10

Monolingual Parsing

[Figure: parse tree over "he always verbs the adjective noun", with the terminals at the leaves, non-terminals (Det, Adj, N, Adv, V, NP, VP, S) above them, and the production NP → Adj N highlighted as an example.]

Slide 11

Another view

[Figure: the same parse tree drawn over "he always verbs the adjective noun", highlighting the productions S → NP VP and VP → V NP.]

Slides 12-18

Bitext Parsing is in 2D

[Figure, built up over several slides: a 2D parse chart with the English sentence "he always verbs the adjective noun" along one axis and its French translation "il verbe toujours le nom adjectif" along the other; non-terminals (S, VP, NP, Adv, V, Det, Adj, N) cover rectangular blocks of the grid, so the bitext parse fixes word positions in both languages at once.]

Slide 19

Why Bitext Parsing?

• Established polynomial algorithms
• Flexible framework, easy to add info:
– Parse given an alignment
– Align given a parse (this work)
• Discoveries can be ported to parser-based decoders (Zens et al. 2004, Melamed 2004)
• Advances in parsing can be ported to word alignment

Slide 20

Outline

• Bitext and Bitext Parsing

• Inversion Transduction Grammar (ITG)

• ITG with Linguistic Constraints

• Discriminative ITG with Linguistic Features

• Other Projects

Slide 21

Inversion Transduction Grammar

• Introduced by Wu (1997)
– Transduction:
• N → noun / nom
– Inversion:
• NP → [Det NP]
• NP → <Adj N>

[Figure: the transduction rule pairs "noun" with "nom" under N; the straight rule NP → [Det NP] keeps child order in both languages, while the inverted rule NP → <Adj N> swaps the children on the French side.]

Slide 22

Binary Bracketing

A → [A A]
A → <A A>
A → e/f

• No linguistic meaning to “A”
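As a small illustration of these three production types (not from the talk; the helper names are invented), straight rules keep child order in both languages while inverted rules reverse the order on the French side:

```python
# Minimal sketch of bracketing-ITG derivation nodes (illustrative only).
# A node is a pair of strings: the English yield and the French yield.

def terminal(e, f):
    """A -> e/f : emit word e in English paired with word f in French."""
    return (e, f)

def straight(left, right):
    """A -> [A A] : children keep the same order in both languages."""
    return (left[0] + " " + right[0], left[1] + " " + right[1])

def inverted(left, right):
    """A -> <A A> : children keep English order, but swap on the French side."""
    return (left[0] + " " + right[0], right[1] + " " + left[1])

# "adjective noun" vs. "nom adjectif": one inverted rule captures the swap.
adj = terminal("adjective", "adjectif")
noun = terminal("noun", "nom")
print(inverted(adj, noun))   # ('adjective noun', 'nom adjectif')
```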

Slide 23

Tree visualization

Slide 24

Pros and Cons of Bracketing

• Pros:
– Language independent
– Straightforward and fast
– Symbols are minimally restrictive
• Cons:
– Grammar is meaningless
– ITG constraint

Slide 25

ITG Constraint

[Figure: a sentence built from the phrases "Mr Burton", "are acceptable to the commission", "fully or in part", shown with an "inside-out" reordering of those phrases that no binary ITG derivation can produce.]
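The constraint can be checked directly: a permutation of aligned positions is reachable by a binary ITG exactly when no four positions form the inside-out patterns (2,4,1,3) or (3,1,4,2). A brute-force sketch of that check (illustrative code, not part of the talk):

```python
from itertools import combinations

def itg_reachable(perm):
    """True iff no four positions of the permutation form one of the
    'inside-out' patterns (2,4,1,3) or (3,1,4,2); such permutations are
    exactly the ones a binary ITG can produce.  Brute force, O(n^4)."""
    for quad in combinations(range(len(perm)), 4):
        values = [perm[i] for i in quad]
        ranks = tuple(sorted(values).index(v) + 1 for v in values)
        if ranks in {(2, 4, 1, 3), (3, 1, 4, 2)}:
            return False
    return True

print(itg_reachable([2, 4, 1, 3]))   # False: the classic inside-out case
print(itg_reachable([3, 2, 1, 4]))   # True: plain reversals are fine
```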

Slide 26

Outline

• Bitext and Bitext Parsing

• Inversion Transduction Grammar (ITG)

• ITG with Linguistic Constraints

• Discriminative ITG with Linguistic Features

• Other Projects

Slide 27

Some questions

• Those ITG constraints are kind of scary. How bad are they? Do they ever help?

• Can we inject some linguistics into this otherwise purely syntactic process?

– Linguistic grammar would limit trees that can be built - and therefore limit alignments

Slide 28

Alignment Spaces

• Set of feasible alignments for a sentence pair
• Described by how links interact
– If links don't interact, problem loses its structure
• Should encourage competition between links (Guidance)
• Should not eliminate correct alignments (Expressiveness)

Slide 29

ITG Space

• Rules out “inside-out” alignments

• Limits how concepts can be re-ordered during translation

Slide 30

Permutation Space

• One-to-one: each word in at most one link
• Allows any permutation of concepts
• Reduces to weighted maximum matching if each link can be scored independently (sketch below)

[Figure: a one-to-one word alignment between "the tax causes unrest" and "l' impôt cause le malaise"]
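With independent link scores, the one-to-one search can be run as an assignment problem; a sketch using SciPy's linear_sum_assignment (the score matrix here is made up, and letting low-scoring words stay unlinked would need zero-score dummy rows or columns):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy link-score matrix: rows = English words, columns = French words.
# Any scoring function (bilingual correlation, position, etc.) could fill it.
english = ["the", "tax", "causes", "unrest"]
french = ["l'", "impot", "cause", "le", "malaise"]
scores = np.array([
    [2.0, 0.1, 0.0, 1.5, 0.0],
    [0.2, 3.1, 0.1, 0.1, 0.0],
    [0.0, 0.2, 2.8, 0.0, 0.3],
    [0.0, 0.0, 0.2, 0.1, 2.4],
])

# Weighted maximum matching: each English word linked to at most one French
# word and vice versa.
rows, cols = linear_sum_assignment(scores, maximize=True)
for r, c in zip(rows, cols):
    print(english[r], "->", french[c], scores[r, c])
```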

Slides 31-32

Linguistic source: Dependencies

• Tree structure defines dependencies between words
• Subtrees define contiguous phrases

[Figure: dependency tree over "the minister in charge of the Canadian Wheat Board"; a subtree such as "the Canadian Wheat Board" covers a contiguous phrase.]

Slide 33

Phrasal Cohesion

• Syntactic phrases in tree tend to stay together after translation (Fox 2002)

• We can use this idea to constrain an alignment given an English dependency tree

• Shown to improve alignment quality (Lin and Cherry 2003)
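One way to picture the cohesion check (an illustrative sketch, not the paper's implementation): project each English dependency subtree onto the French positions it links to, and reject the alignment if any outside word links into that projected span.

```python
def cohesive(subtrees, links):
    """Check phrasal cohesion of a word alignment against a dependency tree.

    subtrees: sets of English positions, one per dependency subtree
    links:    set of (english_pos, french_pos) pairs
    Returns False if a French word inside a subtree's projected span is
    linked to an English word outside that subtree."""
    for tree in subtrees:
        projected = [f for e, f in links if e in tree]
        if not projected:
            continue
        lo, hi = min(projected), max(projected)
        for e, f in links:
            if e not in tree and lo <= f <= hi:
                return False      # an outside word reaches into the span
    return True

# "the tax causes unrest" / "l' impot cause le malaise":
# the subtree {the, tax} projects onto French positions {0, 1}.
subtrees = [{0, 1}, {3}, {0, 1, 2, 3}]
good = {(0, 0), (1, 1), (2, 2), (3, 4)}
bad = good | {(2, 1)}             # "causes" also linked into the l'/impot span
print(cohesive(subtrees, good))   # True
print(cohesive(subtrees, bad))    # False
```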

Slides 34-35

Example

[Figure: dependency tree over "the tax causes unrest" aligned to "l' impôt cause le malaise"; a candidate link that would break the cohesion of the subtree "the tax" is crossed out.]

We can rule out the link, even with no one-to-one violation

Slide 36

ITG & Dependency

• Both limit movement with phrasal cohesion
– ITG: cohesive in some binary tree
– Dep: cohesive in the provided dependency tree
• Not subspaces of each other

[Figure: example alignments for "the big red dog" (permitted by the dependency constraint but not by ITG) and "the dog ate it" (permitted by ITG but not by the dependency constraint), showing that neither space contains the other.]

Slide 37

D-ITG Space

• Force ITG to maintain phrasal cohesion with a provided dependency tree

• Intersects ITG and Dependency spaces

• Adds linguistic dependency tree to ITG parsing

Slide 38

Chart Modification Solution

• Eliminate structures that allow "tax" to invert away from "the"

[Figure: candidate bracketings of "the tax causes unrest"; those that separate "the" from "tax" are eliminated.]

Slides 39-40

Effect on Parser

[Figure: the 2D parse chart for "the tax causes unrest" / "l' impôt cause le malaise"; the A span that would split "the" from "tax" is marked invalid (x) and removed, leaving only cohesive spans for the parser to combine.]

Slide 41

Continuum of constraints

Unconstrained → Permutation → ITG → D-ITG   (least to most constrained)

Slide 42

Experimental Setup

• English-French Parliamentary debates
• 500-sentence labeled test set (Och and Ney, 2003)
• Dependency parses from Minipar

Slide 43

Guidance Test

• Does the space stop incorrect alignments?
• Use a weighted link score built from:
– Bilingual correlations between words
– Relative position of tokens
• Maximize summed link scores in all spaces, check alignment error rate
– AER: combined precision and recall, lower is better (formula below)
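For reference, the AER used here is the standard one from Och and Ney (2003), computed for an alignment A against sure links S and possible links P:

    AER(A; S, P) = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)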

Slide 44

Guidance Results

[Figure: bar chart of Alignment Error Rate (0-20 scale) for the Permutation, ITG and D-ITG spaces.]

Slide 45

Expressiveness Test

• Given a strong model, does the space hold us back?
• Use a cooked link score from the gold standard:
– Only correct links are given positive scores
– Best space is the unconstrained space
• Maximize summed link scores in all spaces, check recall

Slide 46

Expressiveness Results

Slide 47

Contributions

• Algorithmic:
– Method to inject ITG with linguistic constraints
• Experimental:
– ITG constraints provide guidance, with virtually no loss in expressiveness (French-English)
– Dependency cohesion constraints provide greater guidance, at the cost of some expressiveness

Slide 48

Outline

• Bitext and Bitext Parsing

• Inversion Transduction Grammar (ITG)

• ITG with Linguistic Constraints

• Discriminative ITG with Linguistic Features

• Other Projects

Slide 49

Remaining Problems

• Dependency cohesion stops correct links:
– Parse errors, paraphrase, exceptions
– Would like a soft constraint
• I'm not doing much learning: φ² competitive linking with an ITG search

Slide 50

Soft Constraint

• Invalid spans need not be disallowed
– Instead the parser could incur a penalty
• Easy to incorporate the penalty into the DP (sketched below)

[Figure: the chart for "the tax causes unrest" / "l' impôt cause le malaise"; the span that breaks "the tax" is no longer removed, but pays a penalty (-5) when used.]
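Inside the dynamic program the soft constraint only changes a chart cell's score: instead of skipping a violating span, the parser pays a fixed price. A minimal sketch of that cell update (the names and penalty value are illustrative, not the talk's implementation):

```python
COHESION_PENALTY = 5.0   # tuned (or learned) cost for breaking a subtree span

def combine_cells(left_score, right_score, span, violates_cohesion):
    """Score a chart cell built from two child cells.  Under the hard
    constraint a violating span is simply never built; the soft version
    builds it anyway and subtracts a penalty."""
    score = left_score + right_score
    if violates_cohesion(span):
        score -= COHESION_PENALTY
    return score
```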

Slide 51

ITG Learning

• Zhang and Gildea 2004, 2005, 2006…
• Expectation Maximization to parameterize a stochastic grammar, unsupervised
– Driven by expensive 2D inside-outside
– Not doing much better than I am with φ²
• Meanwhile, EMNLP'05 is happening
– Moore 2005, Taskar et al. 2005
– Suddenly it's okay to use some training data

Slide 52

Discriminative matching (Taskar et al. 05)

[Figure: the candidate link "causes"/"cause" with feature values φ² = 0.767, DIST = 0.050, LCSR = 0.833, HMM = 0.0; the dot product of the features with the learned weights gives a link score of 47.2.]

• Max matching finds the alignment that maximizes the sum of link scores
• An entire alignment y can be given a feature vector Ψ(y) according to the features of the links in y
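The link score in the figure is just a dot product of link features with learned weights; roughly (the feature names match the figure, but the weight values here are invented):

```python
# Illustrative feature vector and weights for the link "causes"/"cause".
features = {"phi2": 0.767, "DIST": 0.050, "LCSR": 0.833, "HMM": 0.0}
weights  = {"phi2": 30.0,  "DIST": 80.0,  "LCSR": 24.0,  "HMM": 10.0}

link_score = sum(weights[name] * value for name, value in features.items())
# An alignment's score is the sum of its link scores; maximum weighted
# matching then finds the best-scoring alignment.
print(link_score)
```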

Slide 53

Learning objective

• Find weights w such that, for each example i and every wrong answer y:

  w · Ψ_i(y_i) ≥ w · Ψ_i(y) + Δ(y_i, y)

  (Ψ_i: feature representation, w: learned weights, Δ: structured distance)

• Can formulate as a constrained optimization problem, do max-margin training
• Problem: exponential number of wrong answers

Slide 54

SVM Struct (Tsochantaridis et al. 2004)

[Figure: a loop; starting from an empty constraint set, solve the constrained optimization for w, use w to search for each example's most violated constraint, add it to the accumulated constraints, and repeat.]

• The theory of constraint generation in constrained optimization guarantees convergence
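The loop in the figure can be sketched as follows; solve_qp, find_most_violated and violation stand in for routines the talk does not spell out (the decoder, here an ITG parser or matcher, supplies the loss-augmented search):

```python
def svm_struct_train(examples, initial_w, solve_qp, find_most_violated,
                     violation, epsilon=1e-3, max_iters=100):
    """Sketch of SVMstruct-style constraint generation.

    solve_qp:            solves the max-margin QP (with slacks) over the
                         current working set of constraints, returns weights
    find_most_violated:  loss-augmented search with the current weights
    violation:           how badly a proposed output violates the margin
    """
    constraints = []                 # start with an empty constraint set
    w = initial_w
    for _ in range(max_iters):
        added = 0
        for x, y_gold in examples:
            y_bad = find_most_violated(w, x, y_gold)
            if violation(w, x, y_gold, y_bad) > epsilon:
                constraints.append((x, y_gold, y_bad))
                added += 1
        if added == 0:
            break                    # nothing violated by more than epsilon
        w = solve_qp(constraints)    # re-solve over all accumulated constraints
    return w
```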

Slide 55

Similarities to Averaged Perceptron

• Online method driven by comparisons of current output to correct answer
• But:
– Allows a notion of structural distance
– Returns a max-margin solution (with slacks) at each step
– Remembers all of its past mistakes

Slide 56

SVM-ITG

• Can learn ITG parameters discriminatively
• Link productions A → e/f are scored as in discriminative matching
• Non-terminal productions A → [A A] | <A A> are scored with two features:
– Is it inverted?
– Does it cover a span that would usually be illegal?

[Figure: the production A → causes/cause scored by its feature values (φ² = 0.767, DIST = 0.050, LCSR = 0.833, HMM = 0.0) and the learned weights, giving a link score of 47.2.]

Slide 57

Experimental Setup

• Identical to Taskar et al.:
– 100 training
– 37 development
– 347 test
• Same unsupervised text as before to derive features
– 50k Hansards data

Slide 58

Results: bipartite matching SVM (Permutation) vs. the same SVM weights with the hard constraint (D-ITG)

[Figure: bar chart of AER, 1-Precision and 1-Recall (0-16 scale) for the two systems.]

Slide 59

Results: bipartite matching SVM vs. SVM weights with the hard constraint vs. ITG SVM with the soft cohesion feature

[Figure: bar chart of AER, 1-Precision and 1-Recall (0-16 scale) for the three systems.]

Slide 60

Contributions

• Algorithmic:
– Discriminative learning method for ITGs
• Experimental:
– Value of hard constraints is reduced in the presence of a strong link score
– Integrating the constraint as a feature during training can recover the value of the constraints, improving AER & recall

Slide 61

Other Projects

• Applying techniques from SMT to new domains:
– Unsupervised pronoun resolution
• Discriminative structured learning:
– Discriminative parsing

Slide 62

Unsupervised Pronoun Resolution (Cherry and Bergsma, CoNLL'05)

• The president entered the arena with his family.
• Input:
– A pronoun in context, and a list of candidates
– "his family", {arena, president}
• Output: the correct candidate - president
• Big Idea:
– Formulate a generative model where a candidate generates the pronoun and context, run EM (toy sketch below)
– Similar to IBM-1: align pronouns to candidates
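A toy version of that idea (an illustrative simplification, not Cherry and Bergsma's actual model): each candidate generates the pronoun and its context words independently, and EM re-estimates p(word | candidate) from the resulting posteriors, IBM-1 style.

```python
from collections import defaultdict

def em_resolve(instances, iterations=20):
    """Toy EM where a candidate 'generates' the pronoun and its context.
    Each instance is (candidate_set, context_words); we learn
    p(word | candidate) and read candidates off the posteriors."""
    vocab = {w for _, words in instances for w in words}
    p = defaultdict(lambda: 1.0 / len(vocab))      # uniform start
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for candidates, words in instances:
            # E-step: posterior over candidates for this pronoun occurrence
            score = {c: 1.0 for c in candidates}
            for c in candidates:
                for w in words:
                    score[c] *= p[(w, c)]
            z = sum(score.values())
            for c in candidates:
                post = score[c] / z
                for w in words:                     # M-step fractional counts
                    counts[(w, c)] += post
                    totals[c] += post
        p = defaultdict(lambda: 1e-9,
                        {wc: counts[wc] / totals[wc[1]] for wc in counts})
    return p

# Unambiguous cases ("the president ... his", "the arena ... its roof") pull
# the ambiguous instance toward the right antecedent.
data = [({"president", "arena"}, ["his", "family"]),
        ({"president"}, ["his", "family"]),
        ({"arena"}, ["its", "roof"])]
p = em_resolve(data)
print(p[("his", "president")] > p[("his", "arena")])   # True
```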

Slide 63

Pronoun Resolution: Innovations

• Used linguistics to limit the candidate list:
– Binding theory, known noun genders
• Used unambiguous cases to initialize EM
• Re-weighted component models discriminatively with maximum entropy
• End result:
– Within 5% of a supervised system, with the re-weighted model matching supervised performance

Slide 64

Discriminative Parsing (Wang, Cherry, Lizotte and Schuurmans, CoNLL'06)

• Input: segmented Chinese string
• Output: dependency parse tree
• Big Idea:
– Score each link independently, with an SVM weighting features on links (McDonald 2005), but generalize without part-of-speech tags
– Learn a weight for every word pair seen in training

[Figure: dependency tree over "the tax causes unrest"]

Slide 65

Parsing Innovations

• To promote generalization:
– Altered the "large margin" portion of the SVM objective so semantically similar word pairs have similar weights
• Tried two constraint types:
– Local: link scores constrained so links present in the gold standard score higher than those absent
– Global: SVM Struct-style constraint generation

Slide 66

Others in brief

• Dependency treelet decoder (here)
• Sequence tagging:
– Biomedical term recognition
• Highlight gene names, proteins in medical texts
– Character-based syllabification
• Find syllable breaks in written words

Slide 67

Outline

• Bitext and Bitext Parsing

• Inversion Transduction Grammar (ITG)

• ITG with Linguistic Constraints

• Discriminative ITG with Linguistic Features

• Other Projects

Slide 68

Slide 69

Connecting E and F

• One language generates the other
– IBM models (Brown et al. 1993), HMM (Vogel et al. 1996), tree-to-string model (Yamada and Knight 2001)
• Both languages generated simultaneously
– Joint model (Melamed 2000), phrasal joint model (Marcu and Wong 2002)
• E and F generate an alignment
– Conditional model (Cherry and Lin 2003), discriminative models (Taskar et al. 2005, Moore 2005)

Slide 70

Phrases agree, not trees

[Figure: dependency tree over "he ran here quickly"]

• Dependencies state that "ran" is modified by "here" and by "quickly" separately
• We allow ITG to state that "ran" is modified by "here quickly"
• Also tested these additional head constraints

Slide 71

Effect on Parser

[Figure: the 2D chart for "the tax causes unrest" / "l' impôt cause le malaise" with an invalid A span marked (x).]

Slide 72

Custom Grammar Solution

• What trees force "the" and "tax" to stay together?
– Custom recursive grammar
– Same alignment space, canonical tree

[Figure: a custom grammar over "the tax causes unrest" that keeps "the tax" together as a unit, with ordinary ITG rules used within the resulting spans.]

Slide 73

Guidance Results

[Figure: bar chart of Alignment Error Rate (0-20 scale) for Permutation, ITG, Dep Beam, D-ITG and HD-ITG.]

Slide 74

Expressiveness Results

[Figure: bar chart of recall (90-100 scale) for Unconstrained, Permutation, ITG, Dep Beam, D-ITG and HD-ITG.]

Slide 75

Expressiveness Analysis

• HD-ITG has systematic violations
– Discontinuous constituents (Melamed, 2003)
– Maintains distance to head, which is not always maintained in translation

[Figure: "Canadian Wheat Board" aligned to "Commission Canadienne du blé"; the modifiers' distance to the head changes in translation, which the head constraint cannot model.]

Slide 76

Discriminative Alignment

• Alignment can be viewed as multi-class classification

[Figure: the input is the sentence pair "the tax causes unrest" / "l' impôt cause le malaise"; the correct answer is one complete alignment of the pair, and every other possible alignment is a wrong answer.]

Slide 77

Problem

• Exponential number of incorrect alignments
• One solution:
– Take advantage of properties of the matching algorithm
– Factor constraints
• Doing the same factorization on ITG could be a lot of work - need something more modular
– Averaged perceptron?
– Structured SVM

Slide 78

Final Challenge

• Need gold standard trees to train on, only have gold standard alignments
• Versatility of ITG makes this easy:
– Search for the best parse given an alignment
– Select the parse with the fewest cohesion violations and fewest inversions

Slide 79

Redundancy

• Using A → [A A] | <A A> | e/f:
– Several parses produce the same alignment
– Wu provides a canonical-form grammar
– Creates only one parse per alignment
• Useful for:
– Counting methods like EM
– Detecting arbitrary bracketing decisions

Slide 80

Results Table

Method     Prec   Rec    AER
Match      79.3   82.7   19.24
ITG        81.8   83.7   17.36
Cohesion   88.8   84.0   13.40
D-ITG      88.8   84.2   13.32
HD-ITG     89.2   84.0   13.15

Slide 81

Guidance Results

[Figure: bar chart of precision error, recall error and AER (0-25 scale) for Permutation, ITG, Dep Beam, D-ITG and HD-ITG.]

Slide 82

Expressiveness Results

[Figure: bar chart of recall error and AER (0-6 scale) for Permutation, ITG, Dep Beam, D-ITG and HD-ITG.]

Slide 83

SVM Objective

\min_{\mathbf{w},\,\boldsymbol{\xi}} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_{i}
\quad \text{s.t.} \quad \forall i:\ \xi_{i} \ge 0,
\qquad \forall i,\ \forall y:\ \xi_{i} \ge \Delta(y_{i}, y) + \mathbf{w}\cdot\Psi_{i}(y) - \mathbf{w}\cdot\Psi_{i}(y_{i})

(ξ_i: slack, Δ: structured loss, Ψ_i: feature representation)