Integrated Supertagging and Parsing (michaelauli.github.io/talks/parsing-tagging-talk.pdf)
TRANSCRIPT
Integrated Supertagging and Parsing
Michael Auli, University of Edinburgh
Marcel proved completeness

Parsing

NP   VBD   NP
      VP
   S
Marcel proved completeness

CCG Parsing

NP   (S\NP)/NP   NP
       S\NP
    S

Combinatory Categorial Grammar (CCG; Steedman 2000)

<proved, (S\NP)/NP, completeness>
<proved, (S\NP)/NP, Marcel>
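The derivation above uses CCG's two function-application combinators. A minimal Python sketch, with an assumed tuple encoding of categories (illustrative, not any parser's actual representation):

```python
# Categories: atoms are strings; complex categories are (result, slash, argument).
NP = "NP"
S = "S"
TV = ((S, "\\", NP), "/", NP)  # (S\NP)/NP, the supertag for "proved"

def forward_apply(left, right):
    """X/Y  Y  =>  X  (forward application, written '>')."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X  (backward application, written '<')."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

# "Marcel proved completeness":
vp = forward_apply(TV, NP)   # proved + completeness  =>  S\NP
s = backward_apply(NP, vp)   # Marcel + [proved completeness]  =>  S
```

Each application consumes the argument next to the slash, so the transitive verb first combines rightward with its object and the result combines leftward with the subject.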
Why CCG Parsing?

• MT: can analyse nearly any span in a sentence (Auli '09; Mehay '10; Zhang & Clark 2011; Weese et al. '12), e.g. "conjectured and proved completeness" ⊢ S\NP
• Composition of regular and context-free languages mirrors the situation in syntactic MT (Auli & Lopez, ACL 2011)
• Transparent interface to semantics (Bos et al. 2004), e.g. proved ⊢ (S\NP)/NP : λx.λy.proved′ x y
Marcel proved completeness

CCG Parsing is hard!

Over 22 tags per word (Clark & Curran 2004)

NP   (S\NP)/NP   NP
       S\NP     >
    S           <
Marcel proved completeness

Supertagging

NP   (S\NP)/NP   NP
       S\NP
    S
Supertagging

time   flies   like        an      arrow
NP     S\NP    (S\NP)/NP   NP/NP   NP

But the supertagger can rule out the needed tag: ✗ (S\NP)/NP
The Problem

• The supertagger has no sense of overall grammaticality.
• But the parser is restricted by its decisions.
• Supertagger probabilities are not used in the parser.

[Diagram: the supertagger narrows the space of supertag sequences that the parser then searches.]
This talk

• Analysis of the state-of-the-art approach: a trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with loopy belief propagation and dual decomposition (ACL 2011b)
• Training the integrated model with softmax-margin towards task-specific metrics (EMNLP 2011)

These methods achieve the most accurate CCG parsing results.
Adaptive Supertagging
time flies like an arrowNP NPS\NP (S\NP)/NP NP/NP
![Page 23: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/23.jpg)
Adaptive Supertagging
time flies like an arrowNP NPS\NP (S\NP)/NP NP/NP
((S\NP)\(S\NP))/NP....
....
NPNP/NP
... ...
... ...
Adaptive Supertagging (Clark & Curran 2004)

• Algorithm:
  • Run the supertagger.
  • Keep tags whose posterior is higher than some threshold α.
  • Parse by combining tags (CKY).
  • If parsing succeeds, stop.
  • If parsing fails, lower α and repeat.

• Q: are parses returned in early rounds suboptimal?
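The loop above can be sketched directly. `tagger` and `parser` below are hypothetical stand-ins for the trained supertagger and the CKY parser; the beam values follow the thresholds reported in the later slides:

```python
def adaptive_supertag_parse(sentence, tagger, parser,
                            betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """tagger(sentence) -> list of {tag: posterior} dicts, one per word;
    parser(sentence, tag_lists) -> parse tree, or None on failure."""
    posteriors = tagger(sentence)
    for beta in betas:                       # most -> least aggressive pruning
        # Keep only tags whose posterior clears the current threshold.
        tag_lists = [[t for t, p in d.items() if p >= beta] for d in posteriors]
        parse = parser(sentence, tag_lists)
        if parse is not None:
            return parse                     # first successful round wins
    return None                              # unparsable even at the loosest beam
```

Note that the first round to produce any parse terminates the loop, which is exactly why the question about early-round suboptimality arises.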
Answer...

[Figure: labelled F-score under a tight vs. loose supertagger beam. Oracle parsing (Huang 2008) plotted on a 92-100 scale; standard parsing (Clark and Curran 2007) on an 85-90 scale.]
Parsing

[Figure: model score (85,600-87,400) and labelled F-score (88.2-89.8) as the supertagger beam varies from 0.075 (most aggressive) to 0.00001 (least aggressive).]

Note: only sentences parsable at all beam settings.
Oracle Parsing

[Figure: model score (82,500-85,000) and labelled F-score (93.5-98.5) as the supertagger beam varies from 0.075 (most aggressive) to 0.00001 (least aggressive).]

Note: only sentences parsable at all beam settings.
What's happening here?

• The supertagger keeps the parser from making serious errors.
• But it also occasionally prunes away useful parses.
• Why not combine supertagger and parser into one model?
Overview

• Analysis of the state-of-the-art approach: a trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with loopy belief propagation and dual decomposition (ACL 2011b)
• Training the integrated model with softmax-margin towards task-specific metrics (EMNLP 2011)
Integrated Model

• Supertagger & parser are log-linear models.
• Idea: combine their features into one model.
• Problem: exact computation of marginal or maximum quantities becomes very expensive, because the parsing and tagging submodels must agree on the tag sequence.

original parsing problem:  B C → A,  O(Gn³)
new parsing problem:  qBs sCr → qAr,  O(G³n³)
(nonterminals annotated with tag-lattice states q, s, r)

Intersection of a regular and a context-free language (Bar-Hillel et al. 1964)
Approximate Algorithms

• Loopy belief propagation: approximate calculation of marginals (Pearl 1988; Smith & Eisner 2008).
• Dual decomposition: exact (sometimes) calculation of the maximum (Dantzig & Wolfe 1960; Komodakis et al. 2007; Koo et al. 2010).
Belief Propagation

Forward-backward is belief propagation (Smyth et al. 1997)

Marcel proved completeness
start → ∘ → ∘ → ∘ → stop

emission message:  e_{i,j}
forward message:   f_{i,j} = Σ_{j′} f_{i−1,j′} e_{i−1,j′} t_{j′,j}
backward message:  b_{i,j} = Σ_{j′} b_{i+1,j′} e_{i+1,j′} t_{j,j′}
belief (probability) that tag j is at position i:  p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j}
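The message equations can be computed directly. The emission scores `emit[i][j]` (e_{i,j}) and transition scores `trans[j][j2]` (t_{j,j2}) below are illustrative toy parameters, not a trained supertagger:

```python
def forward_backward(emit, trans):
    """Tag beliefs p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} for an n-word
    sentence with K candidate tags per position."""
    n, K = len(emit), len(emit[0])
    f = [[0.0] * K for _ in range(n)]       # forward messages f_{i,j}
    b = [[0.0] * K for _ in range(n)]       # backward messages b_{i,j}
    f[0] = [1.0] * K                        # uniform start message
    b[-1] = [1.0] * K                       # uniform stop message
    for i in range(1, n):                   # f_{i,j} = sum_j' f e t
        for j in range(K):
            f[i][j] = sum(f[i-1][j2] * emit[i-1][j2] * trans[j2][j]
                          for j2 in range(K))
    for i in range(n - 2, -1, -1):          # b_{i,j} = sum_j' b e t
        for j in range(K):
            b[i][j] = sum(b[i+1][j2] * emit[i+1][j2] * trans[j][j2]
                          for j2 in range(K))
    beliefs = []
    for i in range(n):                      # normalise per position
        unnorm = [f[i][j] * emit[i][j] * b[i][j] for j in range(K)]
        Z = sum(unnorm)
        beliefs.append([u / Z for u in unnorm])
    return beliefs
```

With uniform transitions the beliefs reduce to the normalised emission scores, a quick sanity check on the message equations.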
Belief Propagation

Marcel proved completeness

Notational convenience: one factor describes the whole distribution over supertag sequences.

We can also do the same for the distribution over parse trees (case-factor diagrams: McAllester et al. 2008). Inside-outside is belief propagation (Sato 2007).

span variables:
0S3   0NP3
0NP2  1S\NP3
0NP1  1(S\NP)/NP2  2NP3
Loopy Belief Propagation

Marcel proved completeness

parsing factor (messages via inside-outside)
supertagging factor (messages via forward-backward)

The factor graph is not a tree!

belief that tag j is at position i:  p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j}
Loopy Belief Propagation

• Computes approximate marginals, with no guarantees.
• Complexity is additive: O(Gn³ + Gn)
• Marginals are used to compute the minimum-risk parse (Goodman 1996).
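The only change to the tag beliefs relative to plain forward-backward is the extra outside factor o_{i,j} from the parsing side. A minimal sketch (the message values are toy placeholders, not chart quantities):

```python
def combine_beliefs(f, e, b, o):
    """p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j}: tag beliefs multiply
    the forward-backward messages with the parsing factor's outside message."""
    beliefs = []
    for fi, ei, bi, oi in zip(f, e, b, o):
        unnorm = [x * y * z * w for x, y, z, w in zip(fi, ei, bi, oi)]
        Z = sum(unnorm)                      # per-position normaliser
        beliefs.append([u / Z for u in unnorm])
    return beliefs
```

This additive structure is where the O(Gn³ + Gn) complexity comes from: each factor runs its own dynamic program and only the per-tag messages are exchanged.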
Dual Decomposition

Marcel proved completeness

parsing factor: f(y)
supertagging factor: g(z)

arg max_{y,z} f(y) + g(z)   s.t. y(i,t) = z(i,t) for all i, t
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
![Page 65: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/65.jpg)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
![Page 66: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/66.jpg)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
relaxed!original!problem
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
![Page 67: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/67.jpg)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
modified!subproblem
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
![Page 68: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/68.jpg)
Dual Decomposition

arg max_{y,z} f(y) + g(z)   s.t. y(i, t) = z(i, t) for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i, t) · y(i, t) ] + max_z [ g(z) − Σ_{i,t} u(i, t) · z(i, t) ]

Dual objective: find the assignment of u(i, t) that minimises L(u).
![Page 69: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/69.jpg)
Dual Decomposition

u(i, t) ← u(i, t) − α · [y(i, t) − z(i, t)]   (Rush et al. 2010)

arg max_{y,z} f(y) + g(z)   s.t. y(i, t) = z(i, t) for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i, t) · y(i, t) ] + max_z [ g(z) − Σ_{i,t} u(i, t) · z(i, t) ]

Dual objective: find the assignment of u(i, t) that minimises L(u).

Solution provably solves the original problem.
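The subgradient update can be sketched on a toy problem. Everything below is invented for illustration (two positions, two tags); `F` and `G` stand in for the parser and supertagger factors, y maximises f(y) + Σ u(i,t)·y(i,t), z maximises g(z) − Σ u(i,t)·z(i,t), and the step u(i,t) ← u(i,t) − α·[y(i,t) − z(i,t)] minimises L(u) until the two Viterbi solutions agree, as in Rush et al. (2010):

```python
# Toy dual decomposition sketch (hypothetical scores, not the talk's models):
# two unary factors score tags per position; the subgradient loop drives the
# dual variables u until both factors prefer the same tag sequence.

TAGS = ["NP", "VP"]
F = [{"NP": 2.0, "VP": 1.0}, {"NP": 0.5, "VP": 1.0}]  # "parser" factor f
G = [{"NP": 1.0, "VP": 1.5}, {"NP": 1.0, "VP": 0.2}]  # "supertagger" factor g

def decode(scores, u, sign):
    # per-position argmax of score(t) + sign * u(i, t)
    return [max(TAGS, key=lambda t: s[t] + sign * u[i][t])
            for i, s in enumerate(scores)]

u = [{t: 0.0 for t in TAGS} for _ in F]
for it in range(100):
    y = decode(F, u, +1)            # argmax_y f(y) + sum u(i,t) y(i,t)
    z = decode(G, u, -1)            # argmax_z g(z) - sum u(i,t) z(i,t)
    if y == z:                      # agreement: provably optimal
        break
    alpha = 1.0 / (it + 1)          # decreasing step size
    for i in range(len(F)):
        for t in TAGS:
            u[i][t] -= alpha * ((y[i] == t) - (z[i] == t))
print(y, z)
```

On this instance the loop converges after a handful of iterations with both factors agreeing on the tag sequence that maximises the combined score f + g.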
![Page 70: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/70.jpg)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
![Page 71: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/71.jpg)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
Viterbi tags
Viterbi parse
![Page 72: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/72.jpg)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
Viterbi tags
Viterbi parse
“Message passing” (Komodakis et al. 2007)
![Page 73: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/73.jpg)
Dual Decomposition

• Computes the exact maximum, if it converges.
• Otherwise: returns the best parse seen (an approximation).
• Complexity is additive: O(Gn³ + Gn).
• Used to compute Viterbi solutions.
![Page 74: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/74.jpg)
Experiments

• Standard parsing task:
  • C&C parser and supertagger (Clark & Curran 2007).
  • CCGBank standard train/dev/test splits.
• Piecewise optimisation (Sutton and McCallum 2005).
• Approximate algorithms used to decode the test set.
![Page 75: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/75.jpg)
Experiments: Accuracy over time
![Page 76: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/76.jpg)
Experiments: Accuracy over time
tight search (AST)
loose search (Rev)
![Page 77: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/77.jpg)
Experiments: Convergence
![Page 78: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/78.jpg)
Experiments: Convergence
Dual decomposition is exact in 99.7% of cases. What about belief propagation?
![Page 79: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/79.jpg)
Experiments: BP Exactness
![Page 80: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/80.jpg)
Experiments: BP Exactness
[Chart: match rate (%) against iterations (1–1000, log scale), y-axis 90–100%, comparing DD (k=1000) and BP (k=1000).]
![Page 81: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/81.jpg)
Experiments: BP Exactness
[Chart: match rate (%) against iterations (1–1000, log scale), y-axis 90–100%, comparing DD (k=1000) and BP (k=1000).]
After a single iteration, 91% of BP solutions already match the final DD solutions; it takes DD 15 iterations to reach the same level.
![Page 82: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/82.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure (y-axis 87–89) under tight and loose beams for Baseline, Belief Propagation and Dual Decomposition.]
![Page 83: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/83.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure under the tight beam: Baseline 87.7, Dual Decomposition 88.1, Belief Propagation 88.3.]
![Page 84: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/84.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure. Tight beam: Baseline 87.7, Dual Decomposition 88.1, Belief Propagation 88.3. Loose beam: Baseline 87.7, Dual Decomposition 88.8, Belief Propagation 88.9 (+1.1 over the baseline).]
Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations.
![Page 85: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/85.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure. Tight beam: Baseline 87.7, Dual Decomposition 88.1, Belief Propagation 88.3. Loose beam: Baseline 87.7, Dual Decomposition 88.8, Belief Propagation 88.9 (+1.1 over the baseline).]
Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations.
Best published result.
![Page 86: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/86.jpg)
Oracle Results Again
[Two charts: model score and labelled F-score as the supertagger beam is widened from 0.075 to 0.00001.
Belief Propagation: labelled F-score 89.4–90.0%, model score 60,000–200,000.
Dual Decomposition: labelled F-score 89.2–89.9%, model score 85,200–86,400.]
![Page 87: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/87.jpg)
Summary so far
• Supertagging efficiency comes at the cost of accuracy.
• The interaction between parser and supertagger can be exploited in an integrated model.
• Practical inference for a complex integrated model.
• First empirical comparison between dual decomposition and belief propagation on an NLP task.
• Loopy belief propagation is fast, accurate and exact.
![Page 88: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/88.jpg)
Overview
• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
![Page 89: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/89.jpg)
Overview
• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
![Page 90: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/90.jpg)
Training the Integrated Model
• So far we optimised Conditional Log-Likelihood (CLL).
• Optimise towards a task-specific metric, e.g. F1, as in SMT (Och, 2003).
• Past work used approximations to precision (Taskar et al. 2004).
• Contribution: do it exactly and verify the approximations.
![Page 91: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/91.jpg)
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
![Page 92: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/92.jpg)
Evaluate this
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
![Page 93: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/93.jpg)
Not this!
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
![Page 94: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/94.jpg)
y = dependencies in ground truth
y' = dependencies in proposed output
correct dependencies returned: |y ∩ y'| = n
all dependencies returned: |y'| = d
Parsing Metrics
![Page 95: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/95.jpg)
Parsing Metrics

y = dependencies in ground truth; y' = dependencies in proposed output
correct dependencies returned: |y ∩ y'| = n
all dependencies returned: |y'| = d

Precision: P(y, y') = |y ∩ y'| / |y'| = n / d
Recall: R(y, y') = |y ∩ y'| / |y| = n / |y|
F-measure: F1(y, y') = 2PR / (P + R) = 2|y ∩ y'| / (|y| + |y'|) = 2n / (d + |y|)
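As a quick check of the formulas, here is a small sketch over two hypothetical dependency sets; the triples are illustrative only, not taken from CCGBank:

```python
# Hypothetical gold and predicted dependency sets, each a <head, category, dependent> triple.
gold = {("proved", "(S\\NP)/NP", "Marcel"),
        ("proved", "(S\\NP)/NP", "completeness")}
pred = {("proved", "(S\\NP)/NP", "completeness"),
        ("proved", "(S\\NP)/NP", "Vincent")}

n = len(gold & pred)              # |y ∩ y'|: correct dependencies returned
d = len(pred)                     # |y'|: all dependencies returned
precision = n / d                 # n / d
recall = n / len(gold)            # n / |y|
f1 = 2 * n / (d + len(gold))      # 2n / (d + |y|)
print(precision, recall, f1)      # 0.5 0.5 0.5
```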
![Page 96: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/96.jpg)
Softmax-Margin Training
• Discriminative.
• Probabilistic.
• Convex objective.
• Minimises a bound on expected risk for a given loss function.
• Requires little change to an existing CLL implementation.
(Sha & Saul, 2006; Povey & Woodland, 2008; Gimpel & Smith, 2010)
![Page 97: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/97.jpg)
Softmax-Margin Training
CLL:
min_θ Σ_{i=1}^{m} [ −θᵀ f(x⁽ⁱ⁾, y⁽ⁱ⁾) + log Σ_{y ∈ Y(x⁽ⁱ⁾)} exp{θᵀ f(x⁽ⁱ⁾, y)} ]   (2)

Softmax-margin:
min_θ Σ_{i=1}^{m} [ −θᵀ f(x⁽ⁱ⁾, y⁽ⁱ⁾) + log Σ_{y ∈ Y(x⁽ⁱ⁾)} exp{θᵀ f(x⁽ⁱ⁾, y) + ℓ(y⁽ⁱ⁾, y)} ]   (3)

Gradient:
∂/∂θ_k = Σ_{i=1}^{m} [ −h_k(x⁽ⁱ⁾, y⁽ⁱ⁾) + Σ_{y ∈ Y(x⁽ⁱ⁾)} ( exp{θᵀ f(x⁽ⁱ⁾, y) + ℓ(y⁽ⁱ⁾, y)} / Σ_{y′ ∈ Y(x⁽ⁱ⁾)} exp{θᵀ f(x⁽ⁱ⁾, y′) + ℓ(y⁽ⁱ⁾, y′)} ) h_k(x⁽ⁱ⁾, y) ]   (4)

Figure 1: Conditional log-likelihood (Eq. 2), softmax-margin objective (Eq. 3) and gradient (Eq. 4).
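A minimal numerical sketch of the difference between the CLL and softmax-margin objectives, for a single training example with three candidate parses; the scores θᵀf(x, y) and losses ℓ(y⁽ⁱ⁾, y) below are made up:

```python
import math

# Toy comparison of CLL (Eq. 2) and softmax-margin (Eq. 3) on one example.
scores = {"y_gold": 2.0, "y_a": 1.5, "y_b": 0.5}   # theta^T f(x, y), invented
loss   = {"y_gold": 0.0, "y_a": 1.0, "y_b": 2.0}   # loss ell(y_gold, y), invented

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

cll = -scores["y_gold"] + logsumexp(list(scores.values()))
smm = -scores["y_gold"] + logsumexp([scores[y] + loss[y] for y in scores])
print(cll, smm)
```

Because the loss inflates the scores of bad candidates inside the log-partition term, the softmax-margin objective is strictly larger than CLL here, pushing training to separate the gold parse from high-loss parses by a larger margin.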
A_{i,i+1, n+(a_i:A), d+(a_i:A)} ⊕= w(a_i : A)
A_{i,j, n+n′+n+(BC→A), d+d′+d+(BC→A)} ⊕= B_{i,k,n,d} ⊗ C_{k,j,n′,d′} ⊗ w(BC → A)
GOAL ⊕= S_{0,N,n,d} ⊗ (1 − 2n / (d + |y|))

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:

DecP = d+ − n+   (6)

Recall requires the number of gold-standard dependencies, y+, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

DecR = y+ − n+   (7)

However, there is one issue with our formulation of y+ with CCG and its way of dealing with dependencies that makes our formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, when both the head and dependent are in the same span (Figure 5). This makes using the proposed decomposed recall difficult, as our gold-dependency count y+ may under- or over-state the number of correct dependencies n+. Given that this loss function is an approximation, we deal with this inconsistency by setting y+ = n+ whenever y+ < n+, to account for gold dependencies which have not been correctly classified by our method.

likes apples and pears
(S\NP)/NP  NP  CONJ  NP
Dependencies: and - pears, and - apples, likes - pears, likes - apples

Figure 5: Example illustrating the handling of conjunctions in CCG.

Finally, the decomposed F-measure is simply the sum of the two decomposed losses:

DecF1 = (d+ − n+) + (y+ − n+)   (8)
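Under stated assumptions (counts for a single state, chosen arbitrarily for illustration), the decomposed losses reduce to simple arithmetic:

```python
# Sketch of the decomposed loss with hypothetical counts for one state.
n_plus = 3   # correct dependencies introduced
d_plus = 5   # total dependencies introduced
y_plus = 4   # gold dependencies due to be recovered in this state

y_plus = max(y_plus, n_plus)     # guard: set y+ = n+ whenever y+ < n+
dec_p  = d_plus - n_plus         # DecP  (Eq. 6): incorrect dependencies
dec_r  = y_plus - n_plus         # DecR  (Eq. 7): missed dependencies
dec_f1 = dec_p + dec_r           # DecF1 (Eq. 8)
print(dec_p, dec_r, dec_f1)      # 2 1 3
```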
5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: First, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger.

Pruning the categories in advance this way has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word. However, the technique is inherently approximate: it will return a lower-probability parse under the parsing model if a higher-probability parse can only be constructed from a supertag sequence returned by a subsequent iteration. In this way it prioritizes speed over exactness, although the tradeoff can be modified by adjusting the beam step function. Regardless, the effect of the approximation is unbounded.

Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. The beam settings used for both strategies during testing are in Table 1.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP  NP  CONJ  NP

Figure 3: Example of flexible dependency realisation in CCG: Our parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts were found. The first application of the coordination rule (Φ) only notes the dependency "and - pears" (dotted line); the second application in the larger span, "apples and pears", realises it, together with "and - apples".

same span, violating the assumption used to compute y+ (Figure 3). Exceptions like this can cause mismatches between n+ and y+. We set y+ = n+ whenever y+ < n+ to account for these occasional discrepancies.

Finally, we obtain a decomposed approximation to F-measure:

DecF1 = DecP + DecR   (10)
4 Experiments

Parsing Strategy. The most successful approach to CCG parsing is based on a pipeline strategy: First, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.

Pruning the categories in advance has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word.

Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. Our beam settings for both strategies during testing are in Table 1.

Adaptive supertagging aims for speed via pruning while the reverse strategy aims for accuracy by exposing the parser to a larger search space. Although Clark and Curran (2007) found no actual improvements from the latter strategy, we will show that with some models it can have a substantial effect.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran
Figure 2: Example of flexible dependency realisation in CCG: Our parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts are found, and treats "and" as the syntactic head of coordinations. The coordination rule (Φ) does not yet establish the dependency "and - pears" (dotted line); it is the backward application (<) in the larger span, "apples and pears", that establishes it, together with "and - apples". CCG also deals with unbounded dependencies which potentially lead to more dependencies than words (Steedman, 2000); in this example a unification mechanism creates the dependencies "likes - apples" and "likes - pears" in the forward application (>).
The key idea will be to treat F1 as a non-local feature of the parse, dependent on values n and d.² To compute expectations we split each span in an otherwise usual CKY program by all pairs ⟨n, d⟩ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable.

Formally, our goal will be to compute expectations over the sentence a₁...a_N. In order to abstract away from the particulars of CCG and present the algorithm in relatively familiar terms as a variant of CKY, we will use the notation a_i : A for lexical entries and BC → A to indicate that categories B and C combine to form category A via forward or backward composition or application.³ Item A_{i,j} accumulates the inside score associated with category A spanning ⟨i, j⟩, computed with the usual inside algorithm, written here as a series of recursive equations:

A_{i,i+1} ⊕= w(a_i : A)
A_{i,j} ⊕= B_{i,k} ⊗ C_{k,j} ⊗ w(BC → A)
GOAL ⊕= S_{0,N}

Our algorithm computes expectations on state-split items A_{i,j,n,d}.⁴ Let functions n+(·) and d+(·) respectively represent the number of correct and total dependencies introduced by a parsing action. We can now present the state-split variant of the inside algorithm in Fig. 3. The final recursion simply incorporates the loss function for all derivations having a particular F-score; by running the full inside-outside algorithm on this state-split program, we obtain the desired expectations.⁵ A simple modification of the weight on the goal transition enables us to optimise precision, recall or a weighted F-measure.

² This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
³ These correspond to unary rules A → a_i and binary rules A → BC in a context-free grammar in Chomsky normal form.
⁴ Here we use state-splitting to refer to splitting an item A_{i,j} into many items A_{i,j,n,d}, one for each ⟨n, d⟩ pair.
⁵ The outside equations can be easily derived from the inside algorithm, or mechanically using the reverse values of Goodman (1999).
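The three recursions above amount to a standard CKY inside algorithm in the log semiring, where ⊕ is log-sum-exp and ⊗ is addition of log-weights. Below is a sketch on a toy CNF grammar; the grammar, weights and sentence are invented for illustration and carry no state-splitting:

```python
import math
from collections import defaultdict

# Inside algorithm over a toy CNF grammar:
#   A[i,i+1] ⊕= w(a_i : A);  A[i,j] ⊕= B[i,k] ⊗ C[k,j] ⊗ w(BC → A)
def inside(words, lexical, binary):
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))  # (i, j, A) -> log inside score

    def logadd(a, b):                           # ⊕ in the log semiring
        if a == float("-inf"): return b
        if b == float("-inf"): return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    for i, w in enumerate(words):               # width-1 spans (lexical rules)
        for A, lw in lexical.get(w, []):
            chart[i, i + 1, A] = logadd(chart[i, i + 1, A], lw)
    for width in range(2, n + 1):               # larger spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (B, C, A), rw in binary.items():
                    s = chart[i, k, B] + chart[k, j, C] + rw  # ⊗ = +
                    if s > float("-inf"):
                        chart[i, j, A] = logadd(chart[i, j, A], s)
    return chart

# Hypothetical log-weight-0 grammar for "Marcel proved completeness".
lexical = {"Marcel": [("NP", 0.0)], "proved": [("V", 0.0)],
           "completeness": [("NP", 0.0)]}
binary = {("V", "NP", "VP"): 0.0, ("NP", "VP", "S"): 0.0}
chart = inside(["Marcel", "proved", "completeness"], lexical, binary)
print(chart[0, 3, "S"])  # log inside score of a sentence-spanning S: 0.0
```

The state-split variant would key the chart on (i, j, A, n, d) instead of (i, j, A), with n and d incremented by the dependency counts of each parsing action, exactly as in the recursions of Figure 2.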
![Page 98: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/98.jpg)
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local feature of the parse, dependent on values $n$ and $d$.² To compute expectations we split each span in an otherwise usual CKY program by all pairs $\langle n, d \rangle$ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable.
Formally, our goal will be to compute expectations over the sentence $a_1 \ldots a_N$. In order to abstract away from the particulars of CCG and present the algorithm in relatively familiar terms as a variant of CKY, we will use the notation $a_i : A$ for lexical entries and $BC \Rightarrow A$ to indicate that categories $B$ and $C$ combine to form category $A$ via forward or backward composition or application.³ Item $A_{i,j}$ accumulates the inside score associated with category $A$ spanning $\langle i, j \rangle$, computed with the usual inside algorithm, written here as a series of recursive equations:

²This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
$$A_{i,i+1} \mathrel{\oplus}= w(a_i : A)$$
$$A_{i,j} \mathrel{\oplus}= B_{i,k} \otimes C_{k,j} \otimes w(BC \Rightarrow A)$$
$$\mathrm{GOAL} \mathrel{\oplus}= S_{0,N}$$
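The three recursions above translate almost directly into a CKY-style dynamic program. The sketch below instantiates the semiring concretely as probabilities (sum for ⊕, product for ⊗); `lex` and `rules` are hypothetical stand-ins for the grammar's weighted lexical entries and binary rules.

```python
from collections import defaultdict

def inside(words, lex, rules, goal_cat="S"):
    """Inside algorithm over the recursions above, in the probability semiring.
    lex maps word -> {category: weight}; rules maps (B, C) -> {A: weight}.
    Both are illustrative stand-ins for a real weighted grammar."""
    N = len(words)
    chart = [[defaultdict(float) for _ in range(N + 1)] for _ in range(N + 1)]
    for i, w in enumerate(words):                    # A[i,i+1] += w(a_i : A)
        for A, wt in lex[w].items():
            chart[i][i + 1][A] += wt
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span
            for k in range(i + 1, j):                # A[i,j] += B[i,k] * C[k,j] * w(BC => A)
                for B, bs in chart[i][k].items():
                    for C, cs in chart[k][j].items():
                        for A, wt in rules.get((B, C), {}).items():
                            chart[i][j][A] += bs * cs * wt
    return chart[0][N].get(goal_cat, 0.0)            # GOAL += S[0,N]
```

Swapping the two accumulation operators (e.g. max/plus instead of plus/times) recovers Viterbi parsing from the same program.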
Our algorithm computes expectations on state-split items $A_{i,j,n,d}$.⁴ Let functions $n^+(\cdot)$ and $d^+(\cdot)$ respectively represent the number of correct and total dependencies introduced by a parsing action. We can now present the state-split variant of the inside algorithm in Fig. 3. The final recursion simply incorporates the loss function for all derivations having a particular F-score; by running the full inside-outside algorithm on this state-split program, we obtain the desired expectations.⁵ A simple modification of the weight on the goal transition enables us to optimise precision, recall or a weighted F-measure.
³These correspond to unary rules $A \rightarrow a_i$ and binary rules $A \rightarrow BC$ in a context-free grammar in Chomsky normal form.

⁴Here we use state-splitting to refer to splitting an item $A_{i,j}$ into many items $A_{i,j,n,d}$, one for each $\langle n, d \rangle$ pair.

⁵The outside equations can be easily derived from the inside algorithm, or mechanically using the reverse values of Goodman (1999).
Softmax-Margin Training

CLL:
$$\min_\theta \sum_{i=1}^{m} \left( -\theta^T f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\{\theta^T f(x^{(i)}, y)\} \right) \tag{2}$$

SMM:
$$\min_\theta \sum_{i=1}^{m} \left( -\theta^T f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\{\theta^T f(x^{(i)}, y) + \ell(y^{(i)}, y)\} \right) \tag{3}$$

$$\frac{\partial}{\partial \theta_k} = \sum_{i=1}^{m} \left( -h_k(x^{(i)}, y^{(i)}) + \sum_{y \in \mathcal{Y}(x^{(i)})} \frac{\exp\{\theta^T f(x^{(i)}, y) + \ell(y^{(i)}, y)\}}{\sum_{y' \in \mathcal{Y}(x^{(i)})} \exp\{\theta^T f(x^{(i)}, y') + \ell(y^{(i)}, y')\}} \, h_k(x^{(i)}, y) \right) \tag{4}$$

Figure 1: Conditional log-likelihood (Eq. 2), softmax-margin objective (Eq. 3) and gradient (Eq. 4).
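For intuition, both objectives and the gradient can be evaluated by brute force when the candidate set Y(x) is small enough to enumerate; a parser computes the same quantities with dynamic programming. The sketch below is illustrative only: `f`, `loss`, and the example format are assumptions, not the paper's implementation.

```python
import math

def smm_loss_and_grad(theta, f, examples, loss):
    """Softmax-margin objective (Eq. 3) and gradient (Eq. 4) by enumeration.
    f(x, y) returns a feature vector; loss(y_gold, y) is the task loss;
    examples is a list of (x, y_gold, Y) with Y a small explicit output set."""
    K = len(theta)
    obj, grad = 0.0, [0.0] * K
    for x, y_gold, Y in examples:
        # Loss-augmented scores theta^T f(x, y) + loss(y_gold, y) for each y.
        scores = [sum(t * fk for t, fk in zip(theta, f(x, y))) + loss(y_gold, y)
                  for y in Y]
        logZ = math.log(sum(math.exp(s) for s in scores))
        gold_score = sum(t * fk for t, fk in zip(theta, f(x, y_gold)))
        obj += -gold_score + logZ                      # Eq. 3
        probs = [math.exp(s - logZ) for s in scores]   # loss-augmented distribution
        fg = f(x, y_gold)
        for k in range(K):                             # Eq. 4: -h_k(gold) + E[h_k]
            grad[k] += -fg[k] + sum(p * f(x, y)[k] for p, y in zip(probs, Y))
    return obj, grad
```

With `loss` identically zero this reduces exactly to conditional log-likelihood (Eq. 2); the loss term shifts probability mass toward high-loss outputs during training, so the model learns a larger margin against them.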
$$A_{i,i+1,\,n^+(a_i:A),\,d^+(a_i:A)} \mathrel{\oplus}= w(a_i : A)$$
$$A_{i,j,\,n+n'+n^+(BC \Rightarrow A),\,d+d'+d^+(BC \Rightarrow A)} \mathrel{\oplus}= B_{i,k,n,d} \otimes C_{k,j,n',d'} \otimes w(BC \Rightarrow A)$$
$$\mathrm{GOAL} \mathrel{\oplus}= S_{0,N,n,d} \otimes \left( 1 - \frac{2n}{d + |y|} \right)$$

Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
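A direct, illustrative rendering of the state-split program: the same CKY sketch as before, but with chart items keyed by (A, n, d) and the F-measure loss 1 - 2n/(d + |y|) folded in at the goal. `lex`, `rules`, and the count functions `n_plus`/`d_plus` are hypothetical stand-ins, not the paper's CCG machinery.

```python
from collections import defaultdict

def state_split_inside(words, lex, rules, n_plus, d_plus, num_gold, goal_cat="S"):
    """State-split inside algorithm (sketch): items A[i,j,n,d] carry the counts
    of correct (n) and total (d) dependencies; the goal transition multiplies
    in the loss 1 - 2n / (d + |y|), where num_gold = |y| is the number of
    gold dependencies. n_plus/d_plus score each lexical or binary action."""
    N = len(words)
    chart = [[defaultdict(float) for _ in range(N + 1)] for _ in range(N + 1)]
    for i, w in enumerate(words):
        for A, wt in lex[w].items():
            chart[i][i + 1][(A, n_plus(w, A), d_plus(w, A))] += wt
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (B, n, d), bs in list(chart[i][k].items()):
                    for (C, n2, d2), cs in list(chart[k][j].items()):
                        for A, wt in rules.get((B, C), {}).items():
                            key = (A, n + n2 + n_plus((B, C), A),
                                      d + d2 + d_plus((B, C), A))
                            chart[i][j][key] += bs * cs * wt
    goal = 0.0
    for (A, n, d), s in chart[0][N].items():   # GOAL += S[0,N,n,d] * (1 - 2n/(d+|y|))
        if A == goal_cat:
            goal += s * (1.0 - 2.0 * n / (d + num_gold))
    return goal
```

Each ordinary item is split into at most one item per attested (n, d) pair, which is what keeps the blow-up roughly linear in sentence length rather than exponential.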
counts, $n^+$ and $d^+$:

$$\mathrm{DecP} = d^+ - n^+ \tag{6}$$

Recall requires the number of gold-standard dependencies, $y^+$, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

$$\mathrm{DecR} = y^+ - n^+ \tag{7}$$

However, there is one issue with our formulation of $y^+$ for CCG and its way of dealing with dependencies that makes our formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, when both the head and dependent are in the same span (Figure 5). This makes using the proposed decomposed recall difficult, as our gold-dependency count $y^+$ may under- or over-state the number of correct dependencies $n^+$. Given that this loss function is an approximation, we deal with this inconsistency by setting $y^+ = n^+$ whenever $y^+ < n^+$, to account for gold dependencies which have not been correctly classified by our method.
likes        apples    and     pears
(S\NP)/NP    NP        conj    NP
                       ------------- <Φ>
                       NP\NP
             ----------------------- <
             NP
         --------------------------- >
S\NP

Dependencies: and - pears; and - apples; likes - pears; likes - apples

Figure 5: Example illustrating the handling of conjunctions in CCG.
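The derivation in Figure 5 uses only three rules: the coordination rule (Φ), backward application (<), and forward application (>). The following is a minimal sketch of those combinators over plain category strings, with naive bracket handling and none of the C&C parser's machinery; it is only meant to make the rule applications above concrete.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X  (forward application, '>'). Categories are plain
    strings with the outermost argument rightmost; illustrative only."""
    if left.endswith("/" + right):
        return left[: -(len(right) + 1)].strip("()")
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X  (backward application, '<')."""
    if right.endswith("\\" + left):
        return right[: -(len(left) + 1)].strip("()")
    return None

def coordinate(conj, right):
    """conj  X  =>  X\\X  (the coordination rule, '<Phi>')."""
    return "%s\\%s" % (right, right) if conj == "conj" else None
```

Running the three rules bottom-up reproduces the derivation: "and pears" becomes NP\NP, "apples and pears" becomes NP, and "likes" applied to that NP yields S\NP.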
Finally, the decomposed F-measure is simply the sum of the two decomposed losses:

$$\mathrm{DecF_1} = (d^+ - n^+) + (y^+ - n^+) \tag{8}$$
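Because both terms decompose over parsing actions, the loss can be accumulated from per-action counts rather than computed once per parse. An illustrative sketch, where each action contributes a hypothetical (n+, d+, y+) triple and the paper's clipping fix is applied per state:

```python
def dec_f1(actions):
    """Decomposed F-measure loss (Eq. 8): sum over actions of the precision
    term (d+ - n+) and recall term (y+ - n+). Each action is an illustrative
    (n_correct, n_total, n_gold_due) triple; y+ is raised to n+ whenever
    y+ < n+, the fix for dependencies realised late by unification."""
    loss = 0.0
    for n_corr, n_tot, y_due in actions:
        y_due = max(y_due, n_corr)                   # y+ = n+ whenever y+ < n+
        loss += (n_tot - n_corr) + (y_due - n_corr)  # DecP + DecR increments
    return loss
```

A perfect parse (every introduced dependency correct, every due dependency introduced) contributes zero at every action, so the total loss is zero; each spurious or missed dependency adds one.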
5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: first, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy, adaptive supertagging and its reverse, as described above.
SMM:
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
A_{i,i+1,n⁺(a_i:A),d⁺(a_i:A)} ⊕= w(a_i : A)
A_{i,j,n+n′+n⁺(BC⇒A),d+d′+d⁺(BC⇒A)} ⊕= B_{i,k,n,d} ⊗ C_{k,j,n′,d′} ⊗ w(BC ⇒ A)
GOAL ⊕= S_{0,N,n,d} ⊗ (1 − 2n/(d + |y|))

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
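The state-split program only adds the pair (n, d) to each item and a loss weight at the goal. A hedged sketch, with toy per-rule counts n⁺ and d⁺ supplied by hand rather than read off CCG dependency structures:

```python
from collections import defaultdict

def state_split_inside(words, lex, rules, gold_total, goal="S"):
    """State-split inside pass: items are (A, n, d), where n and d count
    correct and total dependencies in the sub-derivation.  At the goal,
    each (n, d) state is weighted by the F1 loss 1 - 2n/(d + |y|).
    lex:   {(word, A): weight}                 -- lexical actions (no deps here)
    rules: {(B, C, A): (weight, n_plus, d_plus)}
    """
    N = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for (w, A), wt in lex.items():
            if w == word:
                chart[(i, i + 1)][(A, 0, 0)] += wt
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (B, n1, d1), wB in chart[(i, k)].items():
                    for (C, n2, d2), wC in chart[(k, j)].items():
                        for (B2, C2, A), (wt, npl, dpl) in rules.items():
                            if (B2, C2) == (B, C):
                                item = (A, n1 + n2 + npl, d1 + d2 + dpl)
                                chart[(i, j)][item] += wB * wC * wt
    # GOAL += S_{0,N,n,d} * (1 - 2n / (d + |y|))
    return sum(wt * (1.0 - 2.0 * n / (d + gold_total))
               for (A, n, d), wt in chart[(0, N)].items() if A == goal)

lex = {("Marcel", "NP"): 1.0, ("proved", "V"): 1.0, ("completeness", "NP"): 1.0}
# Each combination introduces one dependency; both marked correct:
rules = {("V", "NP", "VP"): (1.0, 1, 1), ("NP", "VP", "S"): (1.0, 1, 1)}
perfect = state_split_inside(["Marcel", "proved", "completeness"],
                             lex, rules, gold_total=2)
# Same derivation with one wrong dependency (n+ = 0 on the first rule):
rules_bad = {("V", "NP", "VP"): (1.0, 0, 1), ("NP", "VP", "S"): (1.0, 1, 1)}
partial = state_split_inside(["Marcel", "proved", "completeness"],
                             lex, rules_bad, gold_total=2)
```

A fully correct derivation (n = d = |y| = 2) receives loss weight 0; with one wrong dependency the goal weight is 1 − 2/4 = 0.5.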
counts, n⁺ and d⁺:

DecP = d⁺ − n⁺    (6)

Recall requires the number of gold-standard dependencies, y⁺, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

DecR = y⁺ − n⁺    (7)

However, there is one issue with our formulation of y⁺ arising from the way CCG deals with dependencies, which makes our formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, when both the head and the dependent are already in the same span (Figure 5). This makes the proposed decomposed recall difficult to use, as our gold-dependency count y⁺ may under- or over-state the number of correct dependencies n⁺. Given that this loss function is an approximation, we deal with this inconsistency by setting y⁺ = n⁺ whenever y⁺ < n⁺, to account for gold dependencies which have not been correctly classified by our method.
likes apples and pears
(S\NP)/NP  NP  CONJ  NP
Dependencies: and - pears, and - apples, likes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctions in CCG: the parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts are found, treating "and" as the syntactic head of the coordination. The coordination rule (Φ) only notes the dependency "and - pears" (dotted line); the backward application (<) in the larger span, "apples and pears", realises it, together with "and - apples". A unification mechanism creates "likes - apples" and "likes - pears" in the forward application (>).
Finally, decomposed F-measure is simply the sum of the two decomposed losses:

DecF1 = (d⁺ − n⁺) + (y⁺ − n⁺)    (8)
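Per parsing action, the decomposed losses accumulate as simple count differences. A sketch of Eqs. 6-8 including the y⁺ = n⁺ clamp described above (the function and its input format are illustrative, not the paper's implementation):

```python
def decomposed_losses(actions):
    """Accumulate the decomposed losses over a derivation.
    actions: one (n_plus, d_plus, y_plus) triple per parsing action:
    correct, total, and due-to-be-recovered gold dependencies.
    Returns (DecP, DecR, DecF1) as in Eqs. 6-8."""
    dec_p = dec_r = 0
    for n_plus, d_plus, y_plus in actions:
        # Clamp y+ = n+ whenever y+ < n+: late unification in CCG can
        # realise a correct dependency the gold count did not anticipate.
        y_plus = max(y_plus, n_plus)
        dec_p += d_plus - n_plus   # Eq. 6: precision loss
        dec_r += y_plus - n_plus   # Eq. 7: recall loss
    return dec_p, dec_r, dec_p + dec_r  # Eq. 8: DecF1

# Three actions: fully correct, wrong, and correct-but-unanticipated.
losses = decomposed_losses([(1, 1, 1), (0, 1, 1), (1, 1, 0)])
```

Here the wrong action contributes one unit to both precision and recall loss, while the clamped third action contributes nothing.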
5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: first, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger.

Pruning the categories in advance this way has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration, or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word. However, the technique is inherently approximate: it will return a lower-probability parse under the parsing model if a higher-probability parse can only be constructed from a supertag sequence returned by a subsequent iteration. In this way it prioritizes speed over exactness, although the tradeoff can be modified by adjusting the beam step function. Regardless, the effect of the approximation is unbounded.
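Both strategies amount to a loop over a schedule of supertagger beam widths. The sketch below is schematic: `parse` and `chart_size` stand in for the C&C parser's internals, and the beam values and chart limit are illustrative assumptions:

```python
def adaptive_supertagging(sentence, parse,
                          beams=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """AST: start with aggressive pruning; relax the supertagger beam
    only when the parser finds no spanning analysis.  Fast, but may miss
    a higher-probability parse available only at a wider beam."""
    for beta in beams:
        result = parse(sentence, beta)
        if result is not None:
            return result
    return None  # give up after the last iteration

def reverse_ast(sentence, parse, chart_size,
                beams=(0.001, 0.005, 0.01, 0.03, 0.075),
                max_chart=300_000):
    """Reverse AST: start with a wide beam and narrow it only when the
    chart would exceed a practical size limit."""
    for beta in beams:
        if chart_size(sentence, beta) <= max_chart:
            return parse(sentence, beta)
    return parse(sentence, beams[-1])

# Stubs: this sentence only parses once the beam reaches 0.01.
parse = lambda s, beta: ("tree", beta) if beta <= 0.01 else None
chart_size = lambda s, beta: int(1 / beta)
ast_result = adaptive_supertagging("a sentence", parse)
rev_result = reverse_ast("a sentence", parse, chart_size)
```

Note the two schedules traverse the same beam values in opposite directions, which is exactly the speed-versus-search-space tradeoff described above.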
Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be, due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. The beam settings used for both strategies during testing are in Table 1.

Adaptive supertagging aims for speed via pruning, while the reverse strategy aims for accuracy by exposing the parser to a larger search space. Although Clark and Curran (2007) found no actual improvements from the latter strategy, we will show that with some models it can have a substantial effect.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran
The key idea will be to treat F1 as a non-local feature of the parse, dependent on values n and d.² To compute expectations, we split each span in an otherwise usual CKY program by all pairs ⟨n, d⟩ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable. Formally, our goal is to compute expectations over the sentence a_1 ... a_N, using the state-split inside algorithm given above.

²This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
Softmax-Margin Training
• Penalise high-loss outputs.
• Re-weight outcomes by the loss function.
• The loss function acts as an unweighted feature -- if it decomposes.
Decomposability

• CKY assumes weights factor over substructures (node + children = substructure).

Marcel proved completeness
Chart items: NP_{0,1}, (S\NP)/NP_{1,2}, NP_{2,3}, S\NP_{1,3}, S_{0,3}

• A decomposable loss function must factor identically.
Decomposability

Marcel proved completeness
Chart items NP_{0,1}, (S\NP)/NP_{1,2}, NP_{2,3}, S\NP_{1,3}, S_{0,3}, annotated with correct-dependency counts n1, n2, ..., n

correct dependencies returned: |y ∩ y′| = n
all dependencies returned: |y′| = d

Correct dependency counts add across sub-derivations: n = n1 + n2
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
![Page 108: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/108.jpg)
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
f = f1 f2⌦
![Page 109: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/109.jpg)
Decomposability
Marcel proved completeness
Chart items: NP0,1 · (S\NP)/NP1,2 · NP2,3 · S\NP1,3 · S0,3, with the two subderivations scoring f1 and f2 and the root scoring f
F-measure: correct dependencies returned |y ∩ y′| = n; all dependencies returned |y′| = d
f = f1 ⊗ f2 ?
Approximations!
![Page 111: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/111.jpg)
Approximate Loss Functions
Marcel proved completeness
Chart items: NP0,1 :0,0 · (S\NP)/NP1,2 :0,0 · NP2,3 :0,0 · S\NP1,3 :1,1 · S0,3 :1,1
for each substructure: n+ correct dependencies, d+ all dependencies, c+ gold dependencies
![Page 112: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/112.jpg)
$$A_{i,i+1,\,n^+(a_i:A),\,d^+(a_i:A)} \mathrel{\oplus}= w(a_i : A)$$

$$A_{i,j,\,n+n'+n^+(B\,C \Rightarrow A),\,d+d'+d^+(B\,C \Rightarrow A)} \mathrel{\oplus}= B_{i,k,n,d} \otimes C_{k,j,n',d'} \otimes w(B\,C \Rightarrow A)$$

$$\mathrm{GOAL} \mathrel{\oplus}= S_{0,N,n,d} \otimes \left(1 - \frac{2n}{d + |y|}\right)$$

Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
Note that while this algorithm computes exact sentence-level expectations, it is approximate at the corpus level, since F-measure does not decompose over sentences. We give the extension to exact corpus-level expectations in Appendix A.
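A minimal sketch of the Figure 3 recursion, assuming a sum-product semiring and toy `lex`/`rules` encodings invented for illustration (this is not the paper's implementation):

```python
from collections import defaultdict

def state_split_inside(words, lex, rules, gold_size):
    """Toy state-split inside pass: chart items are (category, n, d), where
    n/d are running counts of correct/all dependencies.
    lex: {(word, cat): (weight, n_plus, d_plus)}
    rules: {(B, C, A): (weight, n_plus, d_plus)}; gold_size is |y|."""
    L = len(words)
    chart = [[defaultdict(float) for _ in range(L + 1)] for _ in range(L + 1)]
    for i, word in enumerate(words):
        for (w, cat), (wt, n1, d1) in lex.items():
            if w == word:
                chart[i][i + 1][cat, n1, d1] += wt
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (b, c, a), (wt, n_new, d_new) in rules.items():
                    for (bc, n, d), wb in chart[i][k].items():
                        if bc != b:
                            continue
                        for (cc, n2, d2), wc in chart[k][j].items():
                            if cc == c:
                                # A[i,j, n+n'+n+(BC=>A), d+d'+d+(BC=>A)] += B (x) C (x) w
                                chart[i][j][a, n + n2 + n_new, d + d2 + d_new] += wb * wc * wt
    # GOAL += S[0,N,n,d] (x) (1 - 2n/(d + |y|))
    return sum(wt * (1 - 2 * n / (d + gold_size))
               for (cat, n, d), wt in chart[0][L].items() if cat == "S")

lex = {("Marcel", "NP"): (1.0, 0, 0),
       ("proved", "(S\\NP)/NP"): (1.0, 0, 0),
       ("completeness", "NP"): (1.0, 0, 0)}
rules = {("(S\\NP)/NP", "NP", "S\\NP"): (1.0, 1, 1),   # introduces the object dep
         ("NP", "S\\NP", "S"): (1.0, 1, 1)}            # introduces the subject dep
goal = state_split_inside("Marcel proved completeness".split(), lex, rules, gold_size=2)
```

With a single, fully correct derivation (n = d = |y| = 2), the F-measure loss at the goal is zero.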
3.2 Approximate Loss Functions

We will also consider approximate but more efficient alternatives to our exact algorithms. The idea is to use cost functions which only utilise statistics available within the current local structure, similar to those used by Taskar et al. (2004) for tracking constituent errors in a context-free parser. We design three simple losses to approximate precision, recall and F-measure on CCG dependency structures.
Let T(y) be the set of parsing actions required to build parse y. Our decomposable approximation to precision simply counts the number of incorrect dependencies using the local dependency counts, n+(·) and d+(·).
$$\mathrm{DecP}(y) = \sum_{t \in T(y)} d^+(t) - n^+(t) \qquad (8)$$
To compute our approximation to recall we require the number of gold dependencies, c+(·), which should have been introduced by a particular parsing action. A gold dependency is due to be recovered by a parsing action if its head lies within one child span and its dependent within the other. This yields a decomposed approximation to recall that counts the number of missed dependencies.
$$\mathrm{DecR}(y) = \sum_{t \in T(y)} c^+(t) - n^+(t) \qquad (9)$$
Unfortunately, the flexible handling of dependencies in CCG complicates our formulation of c+, rendering it slightly more approximate: the unification mechanism of CCG sometimes causes dependencies to be realised later in the derivation, at a point when both the head and the dependent are in the same span, violating the assumption used to compute c+ (see again Figure 2). Exceptions like this can cause mismatches between n+ and c+. We set c+ = n+ whenever c+ < n+ to account for these occasional discrepancies.
Finally, we obtain a decomposable approximation to F-measure.

$$\mathrm{DecF1}(y) = \mathrm{DecP}(y) + \mathrm{DecR}(y) \qquad (10)$$
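The three decomposed losses can be sketched directly from Equations 8 to 10; the per-action counts below are invented for illustration:

```python
# Sketch of Eqs. 8-10: each parsing action t contributes local counts
# (n_plus correct, d_plus all, c_plus gold dependencies).
def dec_losses(actions):
    """actions: iterable of (n_plus, d_plus, c_plus) per parsing action."""
    dec_p = sum(d - n for n, d, c in actions)            # Eq. 8: wrong deps built
    dec_r = sum(max(c, n) - n for n, d, c in actions)    # Eq. 9, with c+ := n+ when c+ < n+
    return dec_p, dec_r, dec_p + dec_r                   # Eq. 10: DecF1

# one correct dep, one wrong dep, one missed gold dep across three actions
print(dec_losses([(1, 1, 1), (0, 1, 0), (0, 0, 1)]))  # → (1, 1, 2)
```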
4 Experiments
Parsing Strategy. CCG parsers use a pipeline strategy: we first multitag each word of the sentence with a small subset of its possible lexical categories using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Then we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.
First is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with very tight beams.
Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parsable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. Table 1 shows beam settings for both strategies.
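The two strategies can be sketched as follows; `supertag`, `parse`, `chart_size`, the beam values and the chart-size limit are illustrative stand-ins, not the C&C interfaces or the settings of Table 1:

```python
def adaptive_supertagging(sentence, supertag, parse,
                          betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """AST sketch: start with a tight supertagger beam (large beta) and
    relax it only when the parser fails to find an analysis."""
    for beta in betas:
        tags = supertag(sentence, beta)   # keep categories within beta of the best
        analysis = parse(sentence, tags)
        if analysis is not None:
            return analysis
    return None                           # give up after the last iteration

def reverse_ast(sentence, supertag, parse, chart_size,
                betas=(0.001, 0.005, 0.01, 0.03, 0.075), max_cells=300_000):
    """Reverse AST sketch: start with a wide beam (small beta) and narrow
    it only when the chart would exceed a maximum size."""
    for beta in betas:
        tags = supertag(sentence, beta)
        if chart_size(sentence, tags) > max_cells and beta != betas[-1]:
            continue                      # search space too large: tighten the beam
        return parse(sentence, tags)

# toy stand-ins for demonstration
def toy_supertag(sentence, beta):
    return ("tags", beta)

def toy_parse(sentence, tags):
    return "ok" if tags[1] <= 0.01 else None   # succeeds only with a loose beam

def toy_chart_size(sentence, tags):
    return 10**6 if tags[1] < 0.01 else 1000   # very wide beams blow up the chart

result = adaptive_supertagging("Marcel proved completeness", toy_supertag, toy_parse)
reverse_result = reverse_ast("Marcel proved completeness", toy_supertag,
                             toy_parse, toy_chart_size)
```

Both toy runs end up parsing at beta = 0.01: AST relaxes down to it after two failures, while reverse AST tightens up to it after two oversized charts.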
![Page 119: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/119.jpg)
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0
Target analysis, items Ai,j (subscripts: span, correct dependencies, all dependencies):
NP3,5,1,1: DecF1(1,1)
(S\NP)\(S\NP)2,5,2,2: DecF1(1,1)
S\NP1,5,3,3: DecF1(1,1)
S0,5: DecF1(1,1), GOAL
![Page 126: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/126.jpg)
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
Lexical items: NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0
Another analysis, items Ai,j (subscripts: span, correct dependencies, all dependencies):
NP3,5,1,1: DecF1(1,1)
S\NP2,5,1,2: DecF1(0,1)
NP0,2,0,1: DecF1(0,1)
S0,5: DecF1(0,1), GOAL
![Page 127: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/127.jpg)
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0 · NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0
Both analyses, items Ai,j (subscripts: span, correct dependencies, all dependencies):
NP3,5,1,1: DecF1(1,1)
(S\NP)\(S\NP)2,5,2,2: DecF1(1,1)
S\NP1,5,3,3: DecF1(1,1)
NP0,2,0,1: DecF1(0,1)
S\NP2,5,1,2: DecF1(0,1)
S0,5: DecF1(1,1) / DecF1(0,1), GOAL
![Page 131: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/131.jpg)
Decomposability Revisited
Marcel proved completeness
Chart items: NP0,1 · (S\NP)/NP1,2 · NP2,3 · S\NP1,3 · S0,3, with the two subderivations carrying counts (n1, d1) and (n2, d2)
F-measure: $F_1(y, y') = \frac{2n}{d + |y|}$, with |y ∩ y′| = n correct dependencies returned and |y′| = d all dependencies returned

$$f = \frac{2n_1}{d_1 + |y|} \otimes \frac{2n_2}{d_2 + |y|} = \frac{2(n_1 + n_2)}{d_1 + d_2 + 2|y|}$$
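The slide's point, that per-subderivation F-scores do not combine into the sentence-level score, can be checked with toy counts (all values invented):

```python
# Combining the two parts doubles |y| in the denominator, so the
# combination disagrees with sentence-level F1.
def f1(n, d, gold):
    return 2 * n / (d + gold)   # F1(y, y') = 2n / (d + |y|)

n1, d1, n2, d2, gold = 1, 1, 1, 2, 4
exact = f1(n1 + n2, d1 + d2, gold)               # 2(n1+n2) / (d1+d2+|y|) = 4/7
combined = 2 * (n1 + n2) / (d1 + d2 + 2 * gold)  # 2(n1+n2) / (d1+d2+2|y|) = 4/11
assert exact != combined                         # F1 is not decomposable
```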
![Page 134: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/134.jpg)
Exact Loss Functions
• Treat sentence-level F1 as non-local feature dependent on n, d.
• Result: new dynamic program over items Ai,j,n,d
Marcel proved completeness
State-split items: NP0,1,0,0 · (S\NP)/NP1,2,0,0 · NP2,3,0,0 · S\NP1,3,1,1 · S0,3,1,1
![Page 138: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/138.jpg)
Exact Losses with State-Split CKY (items Ai,j,n,d)
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0
Derived items: NP3,5,1,1 · (S\NP)\(S\NP)2,5,2,2 · S\NP1,5,3,3 · S0,5,4,4 (subscripts: span, correct dependencies, all dependencies)
GOAL: F1(4,4)
![Page 141: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/141.jpg)
Exact Losses with State-Split CKY (items Ai,j,n,d)
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0 · NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0
Target analysis: NP3,5,1,1 · (S\NP)\(S\NP)2,5,2,2 · S\NP1,5,3,3 · S0,5,4,4, GOAL: F1(4,4)
Other analysis: NP0,2,0,1 · S\NP2,5,1,2 · S0,5,1,4, GOAL: F1(1,4)
Speed O(L^7), Space O(L^4)
![Page 142: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/142.jpg)
Exact Losses with State-Split CKY, in practice:
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0 · NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0
Target analysis: NP3,5,1,1 · (S\NP)\(S\NP)2,5,2,2 · S\NP1,5,3,3 · S0,5,4,4, GOAL: F1(4,4)
Other analysis: NP0,2,0,1 · S\NP2,5,1,2 · S0,5,1,4, GOAL: F1(1,4)
48× larger DP, 30× slower
![Page 143: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/143.jpg)
Experiments
• Standard parsing task:
• C&C Parser and supertagger (Clark & Curran 2007).
• CCGBank standard train/dev/test splits.
• Piecewise optimisation (Sutton and McCallum 2005).
![Page 144: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/144.jpg)
Exact versus Approximate
[Bar chart: labelled precision, recall and F-measure for approximate vs. exact loss functions; y-axis 86.9 to 88.0.]
![Page 148: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/148.jpg)
Exact versus Approximate
Approximate loss functions work, and much faster!
![Page 149: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/149.jpg)
Softmax-Margin beats CLL: test set results

[Bar chart, labelled F-measure: C&C '07 vs. DecF1 under tight and loose beams. Tight beam: 87.7 → 88.1; loose beam: 87.7 → 88.6 (+0.9).]
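For reference, the softmax-margin objective these results compare against conditional log-likelihood (CLL) can be written as follows (notation mine):

```latex
\min_{\theta}\;\sum_{i}\Big[\log \sum_{y\in\mathcal{Y}(x_i)}
  \exp\big(\theta^{\top} f(x_i,y) + \ell(y,y_i)\big)
  \;-\; \theta^{\top} f(x_i,y_i)\Big]
```

Setting the loss $\ell \equiv 0$ recovers plain CLL; DecF1 denotes training with an (approximate, decomposed) F-measure loss as $\ell$.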
Does task-specific optimisation degrade accuracy on other metrics?

[Bar chart, labelled exact match (y-axis 37.0-40.0): C&C '07 vs. DecF1 under tight and loose beams; scores 37.7, 38.0, 38.0 and 39.1.]

Softmax-Margin beats CLL
Integrated Model + SMM

Marcel proved completeness

- Supertagging factor: Hamming-augmented expectations via forward-backward.
- Parsing factor: F-measure-augmented expectations via inside-outside.
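For the supertagging factor, Hamming-augmented expectations only require running the usual forward recursion over cost-augmented node scores. A minimal numpy sketch of that idea for a toy linear-chain tagger (function names and the scoring interface are mine, not from the talk):

```python
import numpy as np

def lse(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)), axis)

def log_partition(node, edge):
    """Forward algorithm for a linear chain: log of the sum over all
    tag sequences of exp(total score). node: (T, L), edge: (L, L)."""
    alpha = node[0]
    for t in range(1, len(node)):
        alpha = node[t] + lse(alpha[:, None] + edge, axis=0)
    return lse(alpha, axis=0)

def softmax_margin_loss(node, edge, gold, cost=1.0):
    """Softmax-margin objective with Hamming loss: every non-gold tag
    at every position pays `cost`, folded straight into the node scores,
    so the ordinary forward recursion computes the augmented partition."""
    T = len(gold)
    aug = node + cost
    aug[np.arange(T), gold] -= cost   # gold tags pay nothing
    gold_score = node[np.arange(T), gold].sum() + edge[gold[:-1], gold[1:]].sum()
    return log_partition(aug, edge) - gold_score
```

Setting `cost=0` recovers plain CLL; the gradient of the augmented log-partition yields exactly the Hamming-augmented expectations that forward-backward computes.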
Results: Integrated Model

- F-measure loss for parsing sub-model (+DecF1).
- Hamming loss for supertagging sub-model (+Tagger).
- Belief propagation for inference.

[Bar chart, labelled F-measure: C&C '07 87.7, Integrated 88.9, +DecF1 89.2, +Tagger 89.3 (+1.5 over C&C '07).]
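The belief propagation used for inference passes messages between the supertagging chain and the parsing chart over the shared supertag variables. A toy numpy sketch of one direction of that exchange, with a random score table standing in for the parser's inside-outside message (all names and the stand-in are mine):

```python
import numpy as np

def lse(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)), axis)

def chain_marginals(node, edge):
    """Per-position tag marginals (log space) of a linear chain,
    computed with forward-backward. node: (T, L), edge: (L, L)."""
    T, L = node.shape
    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0] = node[0]
    for t in range(1, T):
        alpha[t] = node[t] + lse(alpha[t - 1][:, None] + edge, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = lse(edge + (node[t + 1] + beta[t + 1])[None, :], axis=1)
    logb = alpha + beta                 # unnormalised log marginals
    return logb - lse(logb, axis=1)[:, None]

# One BP step: fold the parser's current message into the tagger's node
# scores, then read off the tagger's beliefs with forward-backward.
rng = np.random.default_rng(1)
T, L = 4, 3
node, edge = rng.normal(size=(T, L)), rng.normal(size=(L, L))
parser_msg = rng.normal(size=(T, L))   # stand-in: would come from inside-outside
belief = chain_marginals(node + parser_msg, edge)
```

In the full model the parser factor replies with a message computed by inside-outside over the CCG chart, and the two factors iterate until the beliefs stop changing (loopy BP).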
Results: Automatic POS

- F-measure loss for parsing sub-model (+DecF1).
- Hamming loss for supertagging sub-model (+Tagger).
- Belief propagation for inference.

[Bar chart, labelled F-measure: C&C '07 85.7, Petrov-I5 (Fowler & Penn, 2010) 86.0, Integrated 86.8, +DecF1 87.1, +Tagger 87.2 (+1.5 over C&C '07).]
Results: Efficiency vs. Accuracy

[Scatter plot: sentences/second (0-110) against accuracy (87-90); faster is up, better is right. Points: C&C, Integrated Model, Softmax-Margin Training.]
Summary

- Softmax-Margin training is easy and improves our model.
- Approximate loss functions are fast, accurate and easy to use.
- Best ever CCG parsing results (87.7 → 89.3).
Future Directions

- What can we do with the presented methods?
- BP for other complex problems, e.g. SMT.
- Semantics for SMT.
- Simultaneous parsing of multiple sentences.
BP for other NLP pipelines

- Pipelines are necessary for practical NLP systems.
- More accurate integrated models are often too complex.
- This talk: approximate inference can make these models practical.
- Use it for other pipelines, e.g. POS & NER tagging and parsing.
- Hard: BP for syntactic MT, another weighted intersection problem between LM & TM.
Semantics for SMT

- Compositional & distributional meaning representations to compute vectors of sentence meaning (Grefenstette & Sadrzadeh, 2011; Clark, to appear).
- Syntax (e.g. CCG) drives the compositional process.
- Directions: model optimisation, evaluation, LM.

[Figure: example translation and reference.]
Parsing beyond sentence-level

- Many NLP tasks (e.g. IE) rely on uniform analysis of constituents.
- Skip-chain CRFs have been used successfully to predict consistent NER tags across sentences (Sutton & McCallum, 2004).
- Parse multiple sentences at once and enforce uniformity of parses.

[CCG derivation fragments for two sentences: 1. "The securities and exchange commission issued ..." and 2. "... responded to the statement of the securities and exchange commission", where "the securities and exchange commission" should receive the same analysis (NP) in both.]
Related Publications

- A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing. With Adam Lopez. In Proc. of ACL, June 2011.
- Efficient CCG Parsing: A* versus Adaptive Supertagging. With Adam Lopez. In Proc. of ACL, June 2011.
- Training a Log-Linear Parser with Softmax-Margin. With Adam Lopez. In Proc. of EMNLP, July 2011.
- A Systematic Comparison of Translation Model Search Spaces. With Adam Lopez, Hieu Hoang, Philipp Koehn. In Proc. of WMT, March 2009.
Thank you