Integrated Supertagging and Parsing (michaelauli.github.io/talks/parsing-tagging-talk.pdf)
TRANSCRIPT
Integrated Supertagging and Parsing
Michael Auli, University of Edinburgh
Marcel proved completeness

Parsing

NP   VBD   NP
      VP
   S
Marcel proved completeness

CCG Parsing

NP   (S\NP)/NP   NP
       S\NP
    S

Combinatory Categorial Grammar (CCG; Steedman 2000)

<proved, (S\NP)/NP, completeness>
<proved, (S\NP)/NP, Marcel>
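The derivation above uses CCG's two function-application combinators. A minimal Python sketch, with an assumed tuple encoding of categories (illustrative, not any parser's actual representation):

```python
# Categories: atoms are strings; complex categories are (result, slash, argument).
NP = "NP"
S = "S"
TV = ((S, "\\", NP), "/", NP)  # (S\NP)/NP, the supertag for "proved"

def forward_apply(left, right):
    """X/Y  Y  =>  X  (forward application, written '>')."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X  (backward application, written '<')."""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

# "Marcel proved completeness":
vp = forward_apply(TV, NP)   # proved + completeness  =>  S\NP
s = backward_apply(NP, vp)   # Marcel + [proved completeness]  =>  S
```

Each application consumes the argument next to the slash, so the transitive verb first combines rightward with its object and the result combines leftward with the subject.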
Why CCG Parsing?

• MT: can analyse nearly any span in a sentence (Auli '09; Mehay '10; Zhang & Clark 2011; Weese et al. '12), e.g. "conjectured and proved completeness" ⊢ S\NP
• Composition of regular and context-free languages mirrors the situation in syntactic MT (Auli & Lopez, ACL 2011)
• Transparent interface to semantics (Bos et al. 2004), e.g. proved ⊢ (S\NP)/NP : λx.λy.proved′ x y
Marcel proved completeness

CCG Parsing is hard!

Over 22 tags per word (Clark & Curran 2004)

NP   (S\NP)/NP   NP
       S\NP     >
    S           <
Marcel proved completeness

Supertagging

NP   (S\NP)/NP   NP
       S\NP
    S
Supertagging

time   flies   like        an      arrow
NP     S\NP    (S\NP)/NP   NP/NP   NP

But the supertagger can rule out the needed tag: ✗ (S\NP)/NP
The Problem

• The supertagger has no sense of overall grammaticality.
• But the parser is restricted by its decisions.
• Supertagger probabilities are not used in the parser.

[Diagram: the supertagger narrows the space of supertag sequences that the parser then searches.]
This talk

• Analysis of the state-of-the-art approach: a trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with loopy belief propagation and dual decomposition (ACL 2011b)
• Training the integrated model with softmax-margin towards task-specific metrics (EMNLP 2011)

These methods achieve the most accurate CCG parsing results.
Adaptive Supertagging
time flies like an arrowNP NPS\NP (S\NP)/NP NP/NP
![Page 23: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/23.jpg)
Adaptive Supertagging
time flies like an arrowNP NPS\NP (S\NP)/NP NP/NP
((S\NP)\(S\NP))/NP....
....
NPNP/NP
... ...
... ...
Adaptive Supertagging (Clark & Curran 2004)

• Algorithm:
  • Run the supertagger.
  • Keep tags whose posterior is higher than some threshold α.
  • Parse by combining tags (CKY).
  • If parsing succeeds, stop.
  • If parsing fails, lower α and repeat.

• Q: are parses returned in early rounds suboptimal?
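The loop above can be sketched directly. `tagger` and `parser` below are hypothetical stand-ins for the trained supertagger and the CKY parser; the beam values follow the thresholds reported in the later slides:

```python
def adaptive_supertag_parse(sentence, tagger, parser,
                            betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """tagger(sentence) -> list of {tag: posterior} dicts, one per word;
    parser(sentence, tag_lists) -> parse tree, or None on failure."""
    posteriors = tagger(sentence)
    for beta in betas:                       # most -> least aggressive pruning
        # Keep only tags whose posterior clears the current threshold.
        tag_lists = [[t for t, p in d.items() if p >= beta] for d in posteriors]
        parse = parser(sentence, tag_lists)
        if parse is not None:
            return parse                     # first successful round wins
    return None                              # unparsable even at the loosest beam
```

Note that the first round to produce any parse terminates the loop, which is exactly why the question about early-round suboptimality arises.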
Answer...

[Figure: labelled F-score under a tight vs. loose supertagger beam. Oracle parsing (Huang 2008) plotted on a 92-100 scale; standard parsing (Clark and Curran 2007) on an 85-90 scale.]
Parsing

[Figure: model score (85,600-87,400) and labelled F-score (88.2-89.8) as the supertagger beam varies from 0.075 (most aggressive) to 0.00001 (least aggressive).]

Note: only sentences parsable at all beam settings.
Oracle Parsing

[Figure: model score (82,500-85,000) and labelled F-score (93.5-98.5) as the supertagger beam varies from 0.075 (most aggressive) to 0.00001 (least aggressive).]

Note: only sentences parsable at all beam settings.
What's happening here?

• The supertagger keeps the parser from making serious errors.
• But it also occasionally prunes away useful parses.
• Why not combine supertagger and parser into one model?
Overview

• Analysis of the state-of-the-art approach: a trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with loopy belief propagation and dual decomposition (ACL 2011b)
• Training the integrated model with softmax-margin towards task-specific metrics (EMNLP 2011)
Integrated Model

• Supertagger & parser are log-linear models.
• Idea: combine their features into one model.
• Problem: exact computation of marginal or maximum quantities becomes very expensive, because the parsing and tagging submodels must agree on the tag sequence.

original parsing problem:  B C → A,  O(Gn³)
new parsing problem:  qBs sCr → qAr,  O(G³n³)
(nonterminals annotated with tag-lattice states q, s, r)

Intersection of a regular and a context-free language (Bar-Hillel et al. 1964)
Approximate Algorithms

• Loopy belief propagation: approximate calculation of marginals (Pearl 1988; Smith & Eisner 2008).
• Dual decomposition: exact (sometimes) calculation of the maximum (Dantzig & Wolfe 1960; Komodakis et al. 2007; Koo et al. 2010).
Belief Propagation

Forward-backward is belief propagation (Smyth et al. 1997)

Marcel proved completeness
start → ∘ → ∘ → ∘ → stop

emission message:  e_{i,j}
forward message:   f_{i,j} = Σ_{j′} f_{i−1,j′} e_{i−1,j′} t_{j′,j}
backward message:  b_{i,j} = Σ_{j′} b_{i+1,j′} e_{i+1,j′} t_{j,j′}
belief (probability) that tag j is at position i:  p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j}
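The message equations can be computed directly. The emission scores `emit[i][j]` (e_{i,j}) and transition scores `trans[j][j2]` (t_{j,j2}) below are illustrative toy parameters, not a trained supertagger:

```python
def forward_backward(emit, trans):
    """Tag beliefs p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} for an n-word
    sentence with K candidate tags per position."""
    n, K = len(emit), len(emit[0])
    f = [[0.0] * K for _ in range(n)]       # forward messages f_{i,j}
    b = [[0.0] * K for _ in range(n)]       # backward messages b_{i,j}
    f[0] = [1.0] * K                        # uniform start message
    b[-1] = [1.0] * K                       # uniform stop message
    for i in range(1, n):                   # f_{i,j} = sum_j' f e t
        for j in range(K):
            f[i][j] = sum(f[i-1][j2] * emit[i-1][j2] * trans[j2][j]
                          for j2 in range(K))
    for i in range(n - 2, -1, -1):          # b_{i,j} = sum_j' b e t
        for j in range(K):
            b[i][j] = sum(b[i+1][j2] * emit[i+1][j2] * trans[j][j2]
                          for j2 in range(K))
    beliefs = []
    for i in range(n):                      # normalise per position
        unnorm = [f[i][j] * emit[i][j] * b[i][j] for j in range(K)]
        Z = sum(unnorm)
        beliefs.append([u / Z for u in unnorm])
    return beliefs
```

With uniform transitions the beliefs reduce to the normalised emission scores, a quick sanity check on the message equations.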
Belief Propagation

Marcel proved completeness

Notational convenience: one factor describes the whole distribution over supertag sequences.

We can also do the same for the distribution over parse trees (case-factor diagrams: McAllester et al. 2008). Inside-outside is belief propagation (Sato 2007).

span variables:
0S3   0NP3
0NP2  1S\NP3
0NP1  1(S\NP)/NP2  2NP3
Loopy Belief Propagation

Marcel proved completeness

parsing factor (messages via inside-outside)
supertagging factor (messages via forward-backward)

The factor graph is not a tree!

belief that tag j is at position i:  p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j}
Loopy Belief Propagation

• Computes approximate marginals, with no guarantees.
• Complexity is additive: O(Gn³ + Gn)
• Marginals are used to compute the minimum-risk parse (Goodman 1996).
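The only change to the tag beliefs relative to plain forward-backward is the extra outside factor o_{i,j} from the parsing side. A minimal sketch (the message values are toy placeholders, not chart quantities):

```python
def combine_beliefs(f, e, b, o):
    """p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j}: tag beliefs multiply
    the forward-backward messages with the parsing factor's outside message."""
    beliefs = []
    for fi, ei, bi, oi in zip(f, e, b, o):
        unnorm = [x * y * z * w for x, y, z, w in zip(fi, ei, bi, oi)]
        Z = sum(unnorm)                      # per-position normaliser
        beliefs.append([u / Z for u in unnorm])
    return beliefs
```

This additive structure is where the O(Gn³ + Gn) complexity comes from: each factor runs its own dynamic program and only the per-tag messages are exchanged.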
Dual Decomposition

Marcel proved completeness

parsing factor: f(y)
supertagging factor: g(z)

arg max_{y,z} f(y) + g(z)   s.t. y(i,t) = z(i,t) for all i, t
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
![Page 65: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/65.jpg)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
![Page 66: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/66.jpg)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
relaxed!original!problem
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
![Page 67: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/67.jpg)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
modified!subproblem
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
![Page 68: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/68.jpg)
Dual Decomposition

arg max_{y,z} f(y) + g(z)   s.t. y(i, t) = z(i, t) for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i, t) · y(i, t) ] + max_z [ g(z) − Σ_{i,t} u(i, t) · z(i, t) ]

Dual objective: find the assignment of u(i, t) that minimises L(u).
![Page 69: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/69.jpg)
Dual Decomposition

u(i, t) ← u(i, t) − α · [y(i, t) − z(i, t)]   (Rush et al. 2010)

arg max_{y,z} f(y) + g(z)   s.t. y(i, t) = z(i, t) for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i, t) · y(i, t) ] + max_z [ g(z) − Σ_{i,t} u(i, t) · z(i, t) ]

Dual objective: find the assignment of u(i, t) that minimises L(u).

Solution provably solves the original problem.
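The subgradient update can be sketched on a toy problem. Everything below is invented for illustration (two positions, two tags); `F` and `G` stand in for the parser and supertagger factors, y maximises f(y) + Σ u(i,t)·y(i,t), z maximises g(z) − Σ u(i,t)·z(i,t), and the step u(i,t) ← u(i,t) − α·[y(i,t) − z(i,t)] minimises L(u) until the two Viterbi solutions agree, as in Rush et al. (2010):

```python
# Toy dual decomposition sketch (hypothetical scores, not the talk's models):
# two unary factors score tags per position; the subgradient loop drives the
# dual variables u until both factors prefer the same tag sequence.

TAGS = ["NP", "VP"]
F = [{"NP": 2.0, "VP": 1.0}, {"NP": 0.5, "VP": 1.0}]  # "parser" factor f
G = [{"NP": 1.0, "VP": 1.5}, {"NP": 1.0, "VP": 0.2}]  # "supertagger" factor g

def decode(scores, u, sign):
    # per-position argmax of score(t) + sign * u(i, t)
    return [max(TAGS, key=lambda t: s[t] + sign * u[i][t])
            for i, s in enumerate(scores)]

u = [{t: 0.0 for t in TAGS} for _ in F]
for it in range(100):
    y = decode(F, u, +1)            # argmax_y f(y) + sum u(i,t) y(i,t)
    z = decode(G, u, -1)            # argmax_z g(z) - sum u(i,t) z(i,t)
    if y == z:                      # agreement: provably optimal
        break
    alpha = 1.0 / (it + 1)          # decreasing step size
    for i in range(len(F)):
        for t in TAGS:
            u[i][t] -= alpha * ((y[i] == t) - (z[i] == t))
print(y, z)
```

On this instance the loop converges after a handful of iterations with both factors agreeing on the tag sequence that maximises the combined score f + g.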
![Page 70: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/70.jpg)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
![Page 71: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/71.jpg)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
Viterbi tags
Viterbi parse
![Page 72: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/72.jpg)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
Viterbi tags
Viterbi parse
“Message passing” (Komodakis et al. 2007)
![Page 73: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/73.jpg)
Dual Decomposition

• Computes the exact maximum, if it converges.
• Otherwise: returns the best parse seen (an approximation).
• Complexity is additive: O(Gn³ + Gn).
• Used to compute Viterbi solutions.
![Page 74: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/74.jpg)
Experiments

• Standard parsing task:
  • C&C parser and supertagger (Clark & Curran 2007).
  • CCGBank standard train/dev/test splits.
• Piecewise optimisation (Sutton and McCallum 2005).
• Approximate algorithms used to decode the test set.
![Page 75: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/75.jpg)
Experiments: Accuracy over time
![Page 76: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/76.jpg)
Experiments: Accuracy over time
tight search (AST)
loose search (Rev)
![Page 77: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/77.jpg)
Experiments: Convergence
![Page 78: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/78.jpg)
Experiments: Convergence
Dual decomposition is exact in 99.7% of cases. What about belief propagation?
![Page 79: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/79.jpg)
Experiments: BP Exactness
![Page 80: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/80.jpg)
Experiments: BP Exactness
[Chart: match rate (%) against iterations (1–1000, log scale), y-axis 90–100%, comparing DD (k=1000) and BP (k=1000).]
![Page 81: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/81.jpg)
Experiments: BP Exactness
[Chart: match rate (%) against iterations (1–1000, log scale), y-axis 90–100%, comparing DD (k=1000) and BP (k=1000).]
After a single iteration, 91% of BP solutions already match the final DD solutions; it takes DD 15 iterations to reach the same level.
![Page 82: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/82.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure (y-axis 87–89) under tight and loose beams for Baseline, Belief Propagation and Dual Decomposition.]
![Page 83: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/83.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure under the tight beam: Baseline 87.7, Dual Decomposition 88.1, Belief Propagation 88.3.]
![Page 84: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/84.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure. Tight beam: Baseline 87.7, Dual Decomposition 88.1, Belief Propagation 88.3. Loose beam: Baseline 87.7, Dual Decomposition 88.8, Belief Propagation 88.9 (+1.1 over the baseline).]
Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations.
![Page 85: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/85.jpg)
Experiments: Accuracy
[Bar chart: test set labelled F-measure. Tight beam: Baseline 87.7, Dual Decomposition 88.1, Belief Propagation 88.3. Loose beam: Baseline 87.7, Dual Decomposition 88.8, Belief Propagation 88.9 (+1.1 over the baseline).]
Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations.
Best published result.
![Page 86: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/86.jpg)
Oracle Results Again
[Two charts: model score and labelled F-score as the supertagger beam is widened from 0.075 to 0.00001.
Belief Propagation: labelled F-score 89.4–90.0%, model score 60,000–200,000.
Dual Decomposition: labelled F-score 89.2–89.9%, model score 85,200–86,400.]
![Page 87: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/87.jpg)
Summary so far
• Supertagging efficiency comes at the cost of accuracy.
• The interaction between parser and supertagger can be exploited in an integrated model.
• Practical inference for a complex integrated model.
• First empirical comparison between dual decomposition and belief propagation on an NLP task.
• Loopy belief propagation is fast, accurate and exact.
![Page 88: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/88.jpg)
Overview
• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
![Page 89: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/89.jpg)
Overview
• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
![Page 90: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/90.jpg)
Training the Integrated Model
• So far we optimised Conditional Log-Likelihood (CLL).
• Optimise towards a task-specific metric, e.g. F1, as in SMT (Och, 2003).
• Past work used approximations to precision (Taskar et al. 2004).
• Contribution: do it exactly and verify the approximations.
![Page 91: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/91.jpg)
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
![Page 92: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/92.jpg)
Evaluate this
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
![Page 93: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/93.jpg)
Not this!
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
![Page 94: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/94.jpg)
y = dependencies in ground truth
y' = dependencies in proposed output
correct dependencies returned: |y ∩ y'| = n
all dependencies returned: |y'| = d
Parsing Metrics
![Page 95: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/95.jpg)
Parsing Metrics

y = dependencies in ground truth; y' = dependencies in proposed output
correct dependencies returned: |y ∩ y'| = n
all dependencies returned: |y'| = d

Precision: P(y, y') = |y ∩ y'| / |y'| = n / d
Recall: R(y, y') = |y ∩ y'| / |y| = n / |y|
F-measure: F1(y, y') = 2PR / (P + R) = 2|y ∩ y'| / (|y| + |y'|) = 2n / (d + |y|)
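As a quick check of the formulas, here is a small sketch over two hypothetical dependency sets; the triples are illustrative only, not taken from CCGBank:

```python
# Hypothetical gold and predicted dependency sets, each a <head, category, dependent> triple.
gold = {("proved", "(S\\NP)/NP", "Marcel"),
        ("proved", "(S\\NP)/NP", "completeness")}
pred = {("proved", "(S\\NP)/NP", "completeness"),
        ("proved", "(S\\NP)/NP", "Vincent")}

n = len(gold & pred)              # |y ∩ y'|: correct dependencies returned
d = len(pred)                     # |y'|: all dependencies returned
precision = n / d                 # n / d
recall = n / len(gold)            # n / |y|
f1 = 2 * n / (d + len(gold))      # 2n / (d + |y|)
print(precision, recall, f1)      # 0.5 0.5 0.5
```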
![Page 96: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/96.jpg)
Softmax-Margin Training
• Discriminative.
• Probabilistic.
• Convex objective.
• Minimises a bound on expected risk for a given loss function.
• Requires little change to an existing CLL implementation.
(Sha & Saul, 2006; Povey & Woodland, 2008; Gimpel & Smith, 2010)
![Page 97: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/97.jpg)
Softmax-Margin Training
CLL:
min_θ Σ_{i=1}^{m} [ −θᵀ f(x⁽ⁱ⁾, y⁽ⁱ⁾) + log Σ_{y ∈ Y(x⁽ⁱ⁾)} exp{θᵀ f(x⁽ⁱ⁾, y)} ]   (2)

Softmax-margin:
min_θ Σ_{i=1}^{m} [ −θᵀ f(x⁽ⁱ⁾, y⁽ⁱ⁾) + log Σ_{y ∈ Y(x⁽ⁱ⁾)} exp{θᵀ f(x⁽ⁱ⁾, y) + ℓ(y⁽ⁱ⁾, y)} ]   (3)

Gradient:
∂/∂θ_k = Σ_{i=1}^{m} [ −h_k(x⁽ⁱ⁾, y⁽ⁱ⁾) + Σ_{y ∈ Y(x⁽ⁱ⁾)} ( exp{θᵀ f(x⁽ⁱ⁾, y) + ℓ(y⁽ⁱ⁾, y)} / Σ_{y′ ∈ Y(x⁽ⁱ⁾)} exp{θᵀ f(x⁽ⁱ⁾, y′) + ℓ(y⁽ⁱ⁾, y′)} ) h_k(x⁽ⁱ⁾, y) ]   (4)

Figure 1: Conditional log-likelihood (Eq. 2), softmax-margin objective (Eq. 3) and gradient (Eq. 4).
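A minimal numerical sketch of the difference between the CLL and softmax-margin objectives, for a single training example with three candidate parses; the scores θᵀf(x, y) and losses ℓ(y⁽ⁱ⁾, y) below are made up:

```python
import math

# Toy comparison of CLL (Eq. 2) and softmax-margin (Eq. 3) on one example.
scores = {"y_gold": 2.0, "y_a": 1.5, "y_b": 0.5}   # theta^T f(x, y), invented
loss   = {"y_gold": 0.0, "y_a": 1.0, "y_b": 2.0}   # loss ell(y_gold, y), invented

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

cll = -scores["y_gold"] + logsumexp(list(scores.values()))
smm = -scores["y_gold"] + logsumexp([scores[y] + loss[y] for y in scores])
print(cll, smm)
```

Because the loss inflates the scores of bad candidates inside the log-partition term, the softmax-margin objective is strictly larger than CLL here, pushing training to separate the gold parse from high-loss parses by a larger margin.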
A_{i,i+1, n+(a_i:A), d+(a_i:A)} ⊕= w(a_i : A)
A_{i,j, n+n′+n+(BC→A), d+d′+d+(BC→A)} ⊕= B_{i,k,n,d} ⊗ C_{k,j,n′,d′} ⊗ w(BC → A)
GOAL ⊕= S_{0,N,n,d} ⊗ (1 − 2n / (d + |y|))

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:

DecP = d+ − n+   (6)

Recall requires the number of gold-standard dependencies, y+, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

DecR = y+ − n+   (7)

However, there is one issue with our formulation of y+ with CCG and its way of dealing with dependencies that makes our formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, when both the head and dependent are in the same span (Figure 5). This makes using the proposed decomposed recall difficult, as our gold-dependency count y+ may under- or over-state the number of correct dependencies n+. Given that this loss function is an approximation, we deal with this inconsistency by setting y+ = n+ whenever y+ < n+, to account for gold dependencies which have not been correctly classified by our method.

likes apples and pears
(S\NP)/NP  NP  CONJ  NP
Dependencies: and - pears, and - apples, likes - pears, likes - apples

Figure 5: Example illustrating the handling of conjunctions in CCG.

Finally, the decomposed F-measure is simply the sum of the two decomposed losses:

DecF1 = (d+ − n+) + (y+ − n+)   (8)
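Under stated assumptions (counts for a single state, chosen arbitrarily for illustration), the decomposed losses reduce to simple arithmetic:

```python
# Sketch of the decomposed loss with hypothetical counts for one state.
n_plus = 3   # correct dependencies introduced
d_plus = 5   # total dependencies introduced
y_plus = 4   # gold dependencies due to be recovered in this state

y_plus = max(y_plus, n_plus)     # guard: set y+ = n+ whenever y+ < n+
dec_p  = d_plus - n_plus         # DecP  (Eq. 6): incorrect dependencies
dec_r  = y_plus - n_plus         # DecR  (Eq. 7): missed dependencies
dec_f1 = dec_p + dec_r           # DecF1 (Eq. 8)
print(dec_p, dec_r, dec_f1)      # 2 1 3
```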
5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: First, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger.

Pruning the categories in advance this way has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word. However, the technique is inherently approximate: it will return a lower-probability parse under the parsing model if a higher-probability parse can only be constructed from a supertag sequence returned by a subsequent iteration. In this way it prioritizes speed over exactness, although the tradeoff can be modified by adjusting the beam step function. Regardless, the effect of the approximation is unbounded.

Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. The beam settings used for both strategies during testing are in Table 1.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP  NP  CONJ  NP

Figure 3: Example of flexible dependency realisation in CCG: Our parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts were found. The first application of the coordination rule (Φ) only notes the dependency "and - pears" (dotted line); the second application in the larger span, "apples and pears", realises it, together with "and - apples".

same span, violating the assumption used to compute y+ (Figure 3). Exceptions like this can cause mismatches between n+ and y+. We set y+ = n+ whenever y+ < n+ to account for these occasional discrepancies.

Finally, we obtain a decomposed approximation to F-measure:

DecF1 = DecP + DecR   (10)
4 Experiments

Parsing Strategy. The most successful approach to CCG parsing is based on a pipeline strategy: First, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.

Pruning the categories in advance has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word.

Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. Our beam settings for both strategies during testing are in Table 1.

Adaptive supertagging aims for speed via pruning while the reverse strategy aims for accuracy by exposing the parser to a larger search space. Although Clark and Curran (2007) found no actual improvements from the latter strategy, we will show that with some models it can have a substantial effect.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran
Figure 2: Example of flexible dependency realisation in CCG: Our parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts are found, and treats "and" as the syntactic head of coordinations. The coordination rule (Φ) does not yet establish the dependency "and - pears" (dotted line); it is the backward application (<) in the larger span, "apples and pears", that establishes it, together with "and - apples". CCG also deals with unbounded dependencies which potentially lead to more dependencies than words (Steedman, 2000); in this example a unification mechanism creates the dependencies "likes - apples" and "likes - pears" in the forward application (>).
The key idea will be to treat F1 as a non-local feature of the parse, dependent on values n and d.² To compute expectations we split each span in an otherwise usual CKY program by all pairs ⟨n, d⟩ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable.

Formally, our goal will be to compute expectations over the sentence a₁...a_N. In order to abstract away from the particulars of CCG and present the algorithm in relatively familiar terms as a variant of CKY, we will use the notation a_i : A for lexical entries and BC → A to indicate that categories B and C combine to form category A via forward or backward composition or application.³ Item A_{i,j} accumulates the inside score associated with category A spanning ⟨i, j⟩, computed with the usual inside algorithm, written here as a series of recursive equations:

A_{i,i+1} ⊕= w(a_i : A)
A_{i,j} ⊕= B_{i,k} ⊗ C_{k,j} ⊗ w(BC → A)
GOAL ⊕= S_{0,N}

Our algorithm computes expectations on state-split items A_{i,j,n,d}.⁴ Let functions n+(·) and d+(·) respectively represent the number of correct and total dependencies introduced by a parsing action. We can now present the state-split variant of the inside algorithm in Fig. 3. The final recursion simply incorporates the loss function for all derivations having a particular F-score; by running the full inside-outside algorithm on this state-split program, we obtain the desired expectations.⁵ A simple modification of the weight on the goal transition enables us to optimise precision, recall or a weighted F-measure.

² This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
³ These correspond to unary rules A → a_i and binary rules A → BC in a context-free grammar in Chomsky normal form.
⁴ Here we use state-splitting to refer to splitting an item A_{i,j} into many items A_{i,j,n,d}, one for each ⟨n, d⟩ pair.
⁵ The outside equations can be easily derived from the inside algorithm, or mechanically using the reverse values of Goodman (1999).
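The three recursions above amount to a standard CKY inside algorithm in the log semiring, where ⊕ is log-sum-exp and ⊗ is addition of log-weights. Below is a sketch on a toy CNF grammar; the grammar, weights and sentence are invented for illustration and carry no state-splitting:

```python
import math
from collections import defaultdict

# Inside algorithm over a toy CNF grammar:
#   A[i,i+1] ⊕= w(a_i : A);  A[i,j] ⊕= B[i,k] ⊗ C[k,j] ⊗ w(BC → A)
def inside(words, lexical, binary):
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))  # (i, j, A) -> log inside score

    def logadd(a, b):                           # ⊕ in the log semiring
        if a == float("-inf"): return b
        if b == float("-inf"): return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    for i, w in enumerate(words):               # width-1 spans (lexical rules)
        for A, lw in lexical.get(w, []):
            chart[i, i + 1, A] = logadd(chart[i, i + 1, A], lw)
    for width in range(2, n + 1):               # larger spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (B, C, A), rw in binary.items():
                    s = chart[i, k, B] + chart[k, j, C] + rw  # ⊗ = +
                    if s > float("-inf"):
                        chart[i, j, A] = logadd(chart[i, j, A], s)
    return chart

# Hypothetical log-weight-0 grammar for "Marcel proved completeness".
lexical = {"Marcel": [("NP", 0.0)], "proved": [("V", 0.0)],
           "completeness": [("NP", 0.0)]}
binary = {("V", "NP", "VP"): 0.0, ("NP", "VP", "S"): 0.0}
chart = inside(["Marcel", "proved", "completeness"], lexical, binary)
print(chart[0, 3, "S"])  # log inside score of a sentence-spanning S: 0.0
```

The state-split variant would key the chart on (i, j, A, n, d) instead of (i, j, A), with n and d incremented by the dependency counts of each parsing action, exactly as in the recursions of Figure 2.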
![Page 98: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/98.jpg)
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local feature of the parse, dependent on values $n$ and $d$.² To compute expectations we split each span in an otherwise usual CKY program by all pairs $\langle n, d \rangle$ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable.
Formally, our goal will be to compute expectations over the sentence $a_1 \ldots a_N$. In order to abstract away from the particulars of CCG and present the algorithm in relatively familiar terms as a variant of CKY, we will use the notation $a_i : A$ for lexical entries and $BC \Rightarrow A$ to indicate that categories $B$ and $C$ combine to form category $A$ via forward or backward composition or application.³ Item $A_{i,j}$ accumulates the inside score associated with category $A$ spanning $\langle i, j \rangle$, computed with the usual inside algorithm, written here as a series of recursive equations:

²This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
$$A_{i,i+1} \mathrel{\oplus}= w(a_i : A)$$
$$A_{i,j} \mathrel{\oplus}= B_{i,k} \otimes C_{k,j} \otimes w(BC \Rightarrow A)$$
$$\mathrm{GOAL} \mathrel{\oplus}= S_{0,N}$$
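The three recursions above translate almost directly into a CKY-style dynamic program. The sketch below instantiates the semiring concretely as probabilities (sum for ⊕, product for ⊗); `lex` and `rules` are hypothetical stand-ins for the grammar's weighted lexical entries and binary rules.

```python
from collections import defaultdict

def inside(words, lex, rules, goal_cat="S"):
    """Inside algorithm over the recursions above, in the probability semiring.
    lex maps word -> {category: weight}; rules maps (B, C) -> {A: weight}.
    Both are illustrative stand-ins for a real weighted grammar."""
    N = len(words)
    chart = [[defaultdict(float) for _ in range(N + 1)] for _ in range(N + 1)]
    for i, w in enumerate(words):                    # A[i,i+1] += w(a_i : A)
        for A, wt in lex[w].items():
            chart[i][i + 1][A] += wt
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span
            for k in range(i + 1, j):                # A[i,j] += B[i,k] * C[k,j] * w(BC => A)
                for B, bs in chart[i][k].items():
                    for C, cs in chart[k][j].items():
                        for A, wt in rules.get((B, C), {}).items():
                            chart[i][j][A] += bs * cs * wt
    return chart[0][N].get(goal_cat, 0.0)            # GOAL += S[0,N]
```

Swapping the two accumulation operators (e.g. max/plus instead of plus/times) recovers Viterbi parsing from the same program.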
Our algorithm computes expectations on state-split items $A_{i,j,n,d}$.⁴ Let functions $n^+(\cdot)$ and $d^+(\cdot)$ respectively represent the number of correct and total dependencies introduced by a parsing action. We can now present the state-split variant of the inside algorithm in Fig. 3. The final recursion simply incorporates the loss function for all derivations having a particular F-score; by running the full inside-outside algorithm on this state-split program, we obtain the desired expectations.⁵ A simple modification of the weight on the goal transition enables us to optimise precision, recall or a weighted F-measure.
³These correspond to unary rules $A \rightarrow a_i$ and binary rules $A \rightarrow BC$ in a context-free grammar in Chomsky normal form.

⁴Here we use state-splitting to refer to splitting an item $A_{i,j}$ into many items $A_{i,j,n,d}$, one for each $\langle n, d \rangle$ pair.

⁵The outside equations can be easily derived from the inside algorithm, or mechanically using the reverse values of Goodman (1999).
Softmax-Margin Training

CLL:
$$\min_\theta \sum_{i=1}^{m} \left( -\theta^T f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\{\theta^T f(x^{(i)}, y)\} \right) \tag{2}$$

SMM:
$$\min_\theta \sum_{i=1}^{m} \left( -\theta^T f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\{\theta^T f(x^{(i)}, y) + \ell(y^{(i)}, y)\} \right) \tag{3}$$

$$\frac{\partial}{\partial \theta_k} = \sum_{i=1}^{m} \left( -h_k(x^{(i)}, y^{(i)}) + \sum_{y \in \mathcal{Y}(x^{(i)})} \frac{\exp\{\theta^T f(x^{(i)}, y) + \ell(y^{(i)}, y)\}}{\sum_{y' \in \mathcal{Y}(x^{(i)})} \exp\{\theta^T f(x^{(i)}, y') + \ell(y^{(i)}, y')\}} \, h_k(x^{(i)}, y) \right) \tag{4}$$

Figure 1: Conditional log-likelihood (Eq. 2), softmax-margin objective (Eq. 3) and gradient (Eq. 4).
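For intuition, both objectives and the gradient can be evaluated by brute force when the candidate set Y(x) is small enough to enumerate; a parser computes the same quantities with dynamic programming. The sketch below is illustrative only: `f`, `loss`, and the example format are assumptions, not the paper's implementation.

```python
import math

def smm_loss_and_grad(theta, f, examples, loss):
    """Softmax-margin objective (Eq. 3) and gradient (Eq. 4) by enumeration.
    f(x, y) returns a feature vector; loss(y_gold, y) is the task loss;
    examples is a list of (x, y_gold, Y) with Y a small explicit output set."""
    K = len(theta)
    obj, grad = 0.0, [0.0] * K
    for x, y_gold, Y in examples:
        # Loss-augmented scores theta^T f(x, y) + loss(y_gold, y) for each y.
        scores = [sum(t * fk for t, fk in zip(theta, f(x, y))) + loss(y_gold, y)
                  for y in Y]
        logZ = math.log(sum(math.exp(s) for s in scores))
        gold_score = sum(t * fk for t, fk in zip(theta, f(x, y_gold)))
        obj += -gold_score + logZ                      # Eq. 3
        probs = [math.exp(s - logZ) for s in scores]   # loss-augmented distribution
        fg = f(x, y_gold)
        for k in range(K):                             # Eq. 4: -h_k(gold) + E[h_k]
            grad[k] += -fg[k] + sum(p * f(x, y)[k] for p, y in zip(probs, Y))
    return obj, grad
```

With `loss` identically zero this reduces exactly to conditional log-likelihood (Eq. 2); the loss term shifts probability mass toward high-loss outputs during training, so the model learns a larger margin against them.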
$$A_{i,i+1,\,n^+(a_i:A),\,d^+(a_i:A)} \mathrel{\oplus}= w(a_i : A)$$
$$A_{i,j,\,n+n'+n^+(BC \Rightarrow A),\,d+d'+d^+(BC \Rightarrow A)} \mathrel{\oplus}= B_{i,k,n,d} \otimes C_{k,j,n',d'} \otimes w(BC \Rightarrow A)$$
$$\mathrm{GOAL} \mathrel{\oplus}= S_{0,N,n,d} \otimes \left( 1 - \frac{2n}{d + |y|} \right)$$

Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
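A direct, illustrative rendering of the state-split program: the same CKY sketch as before, but with chart items keyed by (A, n, d) and the F-measure loss 1 - 2n/(d + |y|) folded in at the goal. `lex`, `rules`, and the count functions `n_plus`/`d_plus` are hypothetical stand-ins, not the paper's CCG machinery.

```python
from collections import defaultdict

def state_split_inside(words, lex, rules, n_plus, d_plus, num_gold, goal_cat="S"):
    """State-split inside algorithm (sketch): items A[i,j,n,d] carry the counts
    of correct (n) and total (d) dependencies; the goal transition multiplies
    in the loss 1 - 2n / (d + |y|), where num_gold = |y| is the number of
    gold dependencies. n_plus/d_plus score each lexical or binary action."""
    N = len(words)
    chart = [[defaultdict(float) for _ in range(N + 1)] for _ in range(N + 1)]
    for i, w in enumerate(words):
        for A, wt in lex[w].items():
            chart[i][i + 1][(A, n_plus(w, A), d_plus(w, A))] += wt
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (B, n, d), bs in list(chart[i][k].items()):
                    for (C, n2, d2), cs in list(chart[k][j].items()):
                        for A, wt in rules.get((B, C), {}).items():
                            key = (A, n + n2 + n_plus((B, C), A),
                                      d + d2 + d_plus((B, C), A))
                            chart[i][j][key] += bs * cs * wt
    goal = 0.0
    for (A, n, d), s in chart[0][N].items():   # GOAL += S[0,N,n,d] * (1 - 2n/(d+|y|))
        if A == goal_cat:
            goal += s * (1.0 - 2.0 * n / (d + num_gold))
    return goal
```

Each ordinary item is split into at most one item per attested (n, d) pair, which is what keeps the blow-up roughly linear in sentence length rather than exponential.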
counts, $n^+$ and $d^+$:

$$\mathrm{DecP} = d^+ - n^+ \tag{6}$$

Recall requires the number of gold-standard dependencies, $y^+$, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

$$\mathrm{DecR} = y^+ - n^+ \tag{7}$$

However, there is one issue with our formulation of $y^+$ for CCG and its way of dealing with dependencies that makes our formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, when both the head and dependent are in the same span (Figure 5). This makes using the proposed decomposed recall difficult, as our gold-dependency count $y^+$ may under- or over-state the number of correct dependencies $n^+$. Given that this loss function is an approximation, we deal with this inconsistency by setting $y^+ = n^+$ whenever $y^+ < n^+$, to account for gold dependencies which have not been correctly classified by our method.
likes        apples    and     pears
(S\NP)/NP    NP        conj    NP
                       ------------- <Φ>
                       NP\NP
             ----------------------- <
             NP
         --------------------------- >
S\NP

Dependencies: and - pears; and - apples; likes - pears; likes - apples

Figure 5: Example illustrating the handling of conjunctions in CCG.
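The derivation in Figure 5 uses only three rules: the coordination rule (Φ), backward application (<), and forward application (>). The following is a minimal sketch of those combinators over plain category strings, with naive bracket handling and none of the C&C parser's machinery; it is only meant to make the rule applications above concrete.

```python
def forward_apply(left, right):
    """X/Y  Y  =>  X  (forward application, '>'). Categories are plain
    strings with the outermost argument rightmost; illustrative only."""
    if left.endswith("/" + right):
        return left[: -(len(right) + 1)].strip("()")
    return None

def backward_apply(left, right):
    """Y  X\\Y  =>  X  (backward application, '<')."""
    if right.endswith("\\" + left):
        return right[: -(len(left) + 1)].strip("()")
    return None

def coordinate(conj, right):
    """conj  X  =>  X\\X  (the coordination rule, '<Phi>')."""
    return "%s\\%s" % (right, right) if conj == "conj" else None
```

Running the three rules bottom-up reproduces the derivation: "and pears" becomes NP\NP, "apples and pears" becomes NP, and "likes" applied to that NP yields S\NP.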
Finally, the decomposed F-measure is simply the sum of the two decomposed losses:

$$\mathrm{DecF_1} = (d^+ - n^+) + (y^+ - n^+) \tag{8}$$
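Because both terms decompose over parsing actions, the loss can be accumulated from per-action counts rather than computed once per parse. An illustrative sketch, where each action contributes a hypothetical (n+, d+, y+) triple and the paper's clipping fix is applied per state:

```python
def dec_f1(actions):
    """Decomposed F-measure loss (Eq. 8): sum over actions of the precision
    term (d+ - n+) and recall term (y+ - n+). Each action is an illustrative
    (n_correct, n_total, n_gold_due) triple; y+ is raised to n+ whenever
    y+ < n+, the fix for dependencies realised late by unification."""
    loss = 0.0
    for n_corr, n_tot, y_due in actions:
        y_due = max(y_due, n_corr)                   # y+ = n+ whenever y+ < n+
        loss += (n_tot - n_corr) + (y_due - n_corr)  # DecP + DecR increments
    return loss
```

A perfect parse (every introduced dependency correct, every due dependency introduced) contributes zero at every action, so the total loss is zero; each spurious or missed dependency adds one.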
5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: first, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy, adaptive supertagging and its reverse, as described above.
SMM:
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
A_{i,i+1,n⁺(a_i:A),d⁺(a_i:A)} ⊕= w(a_i : A)
A_{i,j,n+n′+n⁺(BC⇒A),d+d′+d⁺(BC⇒A)} ⊕= B_{i,k,n,d} ⊗ C_{k,j,n′,d′} ⊗ w(BC ⇒ A)
GOAL ⊕= S_{0,N,n,d} ⊗ (1 − 2n/(d + |y|))

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
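The state-split program only adds the pair (n, d) to each item and a loss weight at the goal. A hedged sketch, with toy per-rule counts n⁺ and d⁺ supplied by hand rather than read off CCG dependency structures:

```python
from collections import defaultdict

def state_split_inside(words, lex, rules, gold_total, goal="S"):
    """State-split inside pass: items are (A, n, d), where n and d count
    correct and total dependencies in the sub-derivation.  At the goal,
    each (n, d) state is weighted by the F1 loss 1 - 2n/(d + |y|).
    lex:   {(word, A): weight}                 -- lexical actions (no deps here)
    rules: {(B, C, A): (weight, n_plus, d_plus)}
    """
    N = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for (w, A), wt in lex.items():
            if w == word:
                chart[(i, i + 1)][(A, 0, 0)] += wt
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (B, n1, d1), wB in chart[(i, k)].items():
                    for (C, n2, d2), wC in chart[(k, j)].items():
                        for (B2, C2, A), (wt, npl, dpl) in rules.items():
                            if (B2, C2) == (B, C):
                                item = (A, n1 + n2 + npl, d1 + d2 + dpl)
                                chart[(i, j)][item] += wB * wC * wt
    # GOAL += S_{0,N,n,d} * (1 - 2n / (d + |y|))
    return sum(wt * (1.0 - 2.0 * n / (d + gold_total))
               for (A, n, d), wt in chart[(0, N)].items() if A == goal)

lex = {("Marcel", "NP"): 1.0, ("proved", "V"): 1.0, ("completeness", "NP"): 1.0}
# Each combination introduces one dependency; both marked correct:
rules = {("V", "NP", "VP"): (1.0, 1, 1), ("NP", "VP", "S"): (1.0, 1, 1)}
perfect = state_split_inside(["Marcel", "proved", "completeness"],
                             lex, rules, gold_total=2)
# Same derivation with one wrong dependency (n+ = 0 on the first rule):
rules_bad = {("V", "NP", "VP"): (1.0, 0, 1), ("NP", "VP", "S"): (1.0, 1, 1)}
partial = state_split_inside(["Marcel", "proved", "completeness"],
                             lex, rules_bad, gold_total=2)
```

A fully correct derivation (n = d = |y| = 2) receives loss weight 0; with one wrong dependency the goal weight is 1 − 2/4 = 0.5.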
counts, n⁺ and d⁺:

DecP = d⁺ − n⁺    (6)

Recall requires the number of gold-standard dependencies, y⁺, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

DecR = y⁺ − n⁺    (7)

However, there is one issue with our formulation of y⁺ arising from the way CCG deals with dependencies, which makes our formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, when both the head and the dependent are already in the same span (Figure 5). This makes the proposed decomposed recall difficult to use, as our gold-dependency count y⁺ may under- or over-state the number of correct dependencies n⁺. Given that this loss function is an approximation, we deal with this inconsistency by setting y⁺ = n⁺ whenever y⁺ < n⁺, to account for gold dependencies which have not been correctly classified by our method.
likes apples and pears
(S\NP)/NP  NP  CONJ  NP
Dependencies: and - pears, and - apples, likes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctions in CCG: the parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts are found, treating "and" as the syntactic head of the coordination. The coordination rule (Φ) only notes the dependency "and - pears" (dotted line); the backward application (<) in the larger span, "apples and pears", realises it, together with "and - apples". A unification mechanism creates "likes - apples" and "likes - pears" in the forward application (>).
Finally, decomposed F-measure is simply the sum of the two decomposed losses:

DecF1 = (d⁺ − n⁺) + (y⁺ − n⁺)    (8)
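Per parsing action, the decomposed losses accumulate as simple count differences. A sketch of Eqs. 6-8 including the y⁺ = n⁺ clamp described above (the function and its input format are illustrative, not the paper's implementation):

```python
def decomposed_losses(actions):
    """Accumulate the decomposed losses over a derivation.
    actions: one (n_plus, d_plus, y_plus) triple per parsing action:
    correct, total, and due-to-be-recovered gold dependencies.
    Returns (DecP, DecR, DecF1) as in Eqs. 6-8."""
    dec_p = dec_r = 0
    for n_plus, d_plus, y_plus in actions:
        # Clamp y+ = n+ whenever y+ < n+: late unification in CCG can
        # realise a correct dependency the gold count did not anticipate.
        y_plus = max(y_plus, n_plus)
        dec_p += d_plus - n_plus   # Eq. 6: precision loss
        dec_r += y_plus - n_plus   # Eq. 7: recall loss
    return dec_p, dec_r, dec_p + dec_r  # Eq. 8: DecF1

# Three actions: fully correct, wrong, and correct-but-unanticipated.
losses = decomposed_losses([(1, 1, 1), (0, 1, 1), (1, 1, 0)])
```

Here the wrong action contributes one unit to both precision and recall loss, while the clamped third action contributes nothing.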
5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: first, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger.

Pruning the categories in advance this way has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration, or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word. However, the technique is inherently approximate: it will return a lower-probability parse under the parsing model if a higher-probability parse can only be constructed from a supertag sequence returned by a subsequent iteration. In this way it prioritizes speed over exactness, although the tradeoff can be modified by adjusting the beam step function. Regardless, the effect of the approximation is unbounded.
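Both strategies amount to a loop over a schedule of supertagger beam widths. The sketch below is schematic: `parse` and `chart_size` stand in for the C&C parser's internals, and the beam values and chart limit are illustrative assumptions:

```python
def adaptive_supertagging(sentence, parse,
                          beams=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """AST: start with aggressive pruning; relax the supertagger beam
    only when the parser finds no spanning analysis.  Fast, but may miss
    a higher-probability parse available only at a wider beam."""
    for beta in beams:
        result = parse(sentence, beta)
        if result is not None:
            return result
    return None  # give up after the last iteration

def reverse_ast(sentence, parse, chart_size,
                beams=(0.001, 0.005, 0.01, 0.03, 0.075),
                max_chart=300_000):
    """Reverse AST: start with a wide beam and narrow it only when the
    chart would exceed a practical size limit."""
    for beta in beams:
        if chart_size(sentence, beta) <= max_chart:
            return parse(sentence, beta)
    return parse(sentence, beams[-1])

# Stubs: this sentence only parses once the beam reaches 0.01.
parse = lambda s, beta: ("tree", beta) if beta <= 0.01 else None
chart_size = lambda s, beta: int(1 / beta)
ast_result = adaptive_supertagging("a sentence", parse)
rev_result = reverse_ast("a sentence", parse, chart_size)
```

Note the two schedules traverse the same beam values in opposite directions, which is exactly the speed-versus-search-space tradeoff described above.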
Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be, due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. The beam settings used for both strategies during testing are in Table 1.

Adaptive supertagging aims for speed via pruning, while the reverse strategy aims for accuracy by exposing the parser to a larger search space. Although Clark and Curran (2007) found no actual improvements from the latter strategy, we will show that with some models it can have a substantial effect.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran
The key idea will be to treat F1 as a non-local feature of the parse, dependent on values n and d.² To compute expectations, we split each span in an otherwise usual CKY program by all pairs ⟨n, d⟩ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable. Formally, our goal is to compute expectations over the sentence a_1 ... a_N, using the state-split inside algorithm given above.

²This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
Softmax-Margin Training
• Penalise high-loss outputs.
• Re-weight outcomes by the loss function.
• The loss function acts as an unweighted feature -- if it decomposes.
Decomposability

• CKY assumes weights factor over substructures (node + children = substructure).

Marcel proved completeness
Chart items: NP_{0,1}, (S\NP)/NP_{1,2}, NP_{2,3}, S\NP_{1,3}, S_{0,3}

• A decomposable loss function must factor identically.
Decomposability

Marcel proved completeness
Chart items NP_{0,1}, (S\NP)/NP_{1,2}, NP_{2,3}, S\NP_{1,3}, S_{0,3}, annotated with correct-dependency counts n1, n2, ..., n

correct dependencies returned: |y ∩ y′| = n
all dependencies returned: |y′| = d

Correct dependency counts add across sub-derivations: n = n1 + n2
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
![Page 108: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/108.jpg)
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
f = f1 f2⌦
![Page 109: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/109.jpg)
Decomposability
Marcel proved completeness
Chart items: NP0,1 · (S\NP)/NP1,2 · NP2,3 · S\NP1,3 · S0,3, with the two subderivations scoring f1 and f2 and the root scoring f
F-measure: correct dependencies returned |y ∩ y′| = n; all dependencies returned |y′| = d
f = f1 ⊗ f2 ?
Approximations!
![Page 111: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/111.jpg)
Approximate Loss Functions
Marcel proved completeness
Chart items: NP0,1 :0,0 · (S\NP)/NP1,2 :0,0 · NP2,3 :0,0 · S\NP1,3 :1,1 · S0,3 :1,1
for each substructure: n+ correct dependencies, d+ all dependencies, c+ gold dependencies
![Page 112: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/112.jpg)
$$A_{i,i+1,\,n^+(a_i:A),\,d^+(a_i:A)} \mathrel{\oplus}= w(a_i : A)$$

$$A_{i,j,\,n+n'+n^+(B\,C \Rightarrow A),\,d+d'+d^+(B\,C \Rightarrow A)} \mathrel{\oplus}= B_{i,k,n,d} \otimes C_{k,j,n',d'} \otimes w(B\,C \Rightarrow A)$$

$$\mathrm{GOAL} \mathrel{\oplus}= S_{0,N,n,d} \otimes \left(1 - \frac{2n}{d + |y|}\right)$$

Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
Note that while this algorithm computes exact sentence-level expectations, it is approximate at the corpus level, since F-measure does not decompose over sentences. We give the extension to exact corpus-level expectations in Appendix A.
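A minimal sketch of the Figure 3 recursion, assuming a sum-product semiring and toy `lex`/`rules` encodings invented for illustration (this is not the paper's implementation):

```python
from collections import defaultdict

def state_split_inside(words, lex, rules, gold_size):
    """Toy state-split inside pass: chart items are (category, n, d), where
    n/d are running counts of correct/all dependencies.
    lex: {(word, cat): (weight, n_plus, d_plus)}
    rules: {(B, C, A): (weight, n_plus, d_plus)}; gold_size is |y|."""
    L = len(words)
    chart = [[defaultdict(float) for _ in range(L + 1)] for _ in range(L + 1)]
    for i, word in enumerate(words):
        for (w, cat), (wt, n1, d1) in lex.items():
            if w == word:
                chart[i][i + 1][cat, n1, d1] += wt
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (b, c, a), (wt, n_new, d_new) in rules.items():
                    for (bc, n, d), wb in chart[i][k].items():
                        if bc != b:
                            continue
                        for (cc, n2, d2), wc in chart[k][j].items():
                            if cc == c:
                                # A[i,j, n+n'+n+(BC=>A), d+d'+d+(BC=>A)] += B (x) C (x) w
                                chart[i][j][a, n + n2 + n_new, d + d2 + d_new] += wb * wc * wt
    # GOAL += S[0,N,n,d] (x) (1 - 2n/(d + |y|))
    return sum(wt * (1 - 2 * n / (d + gold_size))
               for (cat, n, d), wt in chart[0][L].items() if cat == "S")

lex = {("Marcel", "NP"): (1.0, 0, 0),
       ("proved", "(S\\NP)/NP"): (1.0, 0, 0),
       ("completeness", "NP"): (1.0, 0, 0)}
rules = {("(S\\NP)/NP", "NP", "S\\NP"): (1.0, 1, 1),   # introduces the object dep
         ("NP", "S\\NP", "S"): (1.0, 1, 1)}            # introduces the subject dep
goal = state_split_inside("Marcel proved completeness".split(), lex, rules, gold_size=2)
```

With a single, fully correct derivation (n = d = |y| = 2), the F-measure loss at the goal is zero.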
3.2 Approximate Loss Functions

We will also consider approximate but more efficient alternatives to our exact algorithms. The idea is to use cost functions which only utilise statistics available within the current local structure, similar to those used by Taskar et al. (2004) for tracking constituent errors in a context-free parser. We design three simple losses to approximate precision, recall and F-measure on CCG dependency structures.
Let T(y) be the set of parsing actions required to build parse y. Our decomposable approximation to precision simply counts the number of incorrect dependencies using the local dependency counts, n+(·) and d+(·).
$$\mathrm{DecP}(y) = \sum_{t \in T(y)} d^+(t) - n^+(t) \qquad (8)$$
To compute our approximation to recall we require the number of gold dependencies, c+(·), which should have been introduced by a particular parsing action. A gold dependency is due to be recovered by a parsing action if its head lies within one child span and its dependent within the other. This yields a decomposed approximation to recall that counts the number of missed dependencies.
$$\mathrm{DecR}(y) = \sum_{t \in T(y)} c^+(t) - n^+(t) \qquad (9)$$
Unfortunately, the flexible handling of dependencies in CCG complicates our formulation of c+, rendering it slightly more approximate: the unification mechanism of CCG sometimes causes dependencies to be realised later in the derivation, at a point when both the head and the dependent are in the same span, violating the assumption used to compute c+ (see again Figure 2). Exceptions like this can cause mismatches between n+ and c+. We set c+ = n+ whenever c+ < n+ to account for these occasional discrepancies.
Finally, we obtain a decomposable approximation to F-measure.

$$\mathrm{DecF1}(y) = \mathrm{DecP}(y) + \mathrm{DecR}(y) \qquad (10)$$
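The three decomposed losses can be sketched directly from Equations 8 to 10; the per-action counts below are invented for illustration:

```python
# Sketch of Eqs. 8-10: each parsing action t contributes local counts
# (n_plus correct, d_plus all, c_plus gold dependencies).
def dec_losses(actions):
    """actions: iterable of (n_plus, d_plus, c_plus) per parsing action."""
    dec_p = sum(d - n for n, d, c in actions)            # Eq. 8: wrong deps built
    dec_r = sum(max(c, n) - n for n, d, c in actions)    # Eq. 9, with c+ := n+ when c+ < n+
    return dec_p, dec_r, dec_p + dec_r                   # Eq. 10: DecF1

# one correct dep, one wrong dep, one missed gold dep across three actions
print(dec_losses([(1, 1, 1), (0, 1, 0), (0, 0, 1)]))  # → (1, 1, 2)
```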
4 Experiments
Parsing Strategy. CCG parsers use a pipeline strategy: we first multitag each word of the sentence with a small subset of its possible lexical categories using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Then we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.
First is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with very tight beams.
Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parsable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. Table 1 shows beam settings for both strategies.
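The two strategies can be sketched as follows; `supertag`, `parse`, `chart_size`, the beam values and the chart-size limit are illustrative stand-ins, not the C&C interfaces or the settings of Table 1:

```python
def adaptive_supertagging(sentence, supertag, parse,
                          betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    """AST sketch: start with a tight supertagger beam (large beta) and
    relax it only when the parser fails to find an analysis."""
    for beta in betas:
        tags = supertag(sentence, beta)   # keep categories within beta of the best
        analysis = parse(sentence, tags)
        if analysis is not None:
            return analysis
    return None                           # give up after the last iteration

def reverse_ast(sentence, supertag, parse, chart_size,
                betas=(0.001, 0.005, 0.01, 0.03, 0.075), max_cells=300_000):
    """Reverse AST sketch: start with a wide beam (small beta) and narrow
    it only when the chart would exceed a maximum size."""
    for beta in betas:
        tags = supertag(sentence, beta)
        if chart_size(sentence, tags) > max_cells and beta != betas[-1]:
            continue                      # search space too large: tighten the beam
        return parse(sentence, tags)

# toy stand-ins for demonstration
def toy_supertag(sentence, beta):
    return ("tags", beta)

def toy_parse(sentence, tags):
    return "ok" if tags[1] <= 0.01 else None   # succeeds only with a loose beam

def toy_chart_size(sentence, tags):
    return 10**6 if tags[1] < 0.01 else 1000   # very wide beams blow up the chart

result = adaptive_supertagging("Marcel proved completeness", toy_supertag, toy_parse)
reverse_result = reverse_ast("Marcel proved completeness", toy_supertag,
                             toy_parse, toy_chart_size)
```

Both toy runs end up parsing at beta = 0.01: AST relaxes down to it after two failures, while reverse AST tightens up to it after two oversized charts.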
![Page 119: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/119.jpg)
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0
Target analysis, items Ai,j (subscripts: span, correct dependencies, all dependencies):
NP3,5,1,1: DecF1(1,1)
(S\NP)\(S\NP)2,5,2,2: DecF1(1,1)
S\NP1,5,3,3: DecF1(1,1)
S0,5: DecF1(1,1), GOAL
![Page 126: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/126.jpg)
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
Lexical items: NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0
Another analysis, items Ai,j (subscripts: span, correct dependencies, all dependencies):
NP3,5,1,1: DecF1(1,1)
S\NP2,5,1,2: DecF1(0,1)
NP0,2,0,1: DecF1(0,1)
S0,5: DecF1(0,1), GOAL
![Page 127: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/127.jpg)
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0 · NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0
Both analyses, items Ai,j (subscripts: span, correct dependencies, all dependencies):
NP3,5,1,1: DecF1(1,1)
(S\NP)\(S\NP)2,5,2,2: DecF1(1,1)
S\NP1,5,3,3: DecF1(1,1)
NP0,2,0,1: DecF1(0,1)
S\NP2,5,1,2: DecF1(0,1)
S0,5: DecF1(1,1) / DecF1(0,1), GOAL
![Page 131: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/131.jpg)
Decomposability Revisited
Marcel proved completeness
Chart items: NP0,1 · (S\NP)/NP1,2 · NP2,3 · S\NP1,3 · S0,3, with the two subderivations carrying counts (n1, d1) and (n2, d2)
F-measure: $F_1(y, y') = \frac{2n}{d + |y|}$, with |y ∩ y′| = n correct dependencies returned and |y′| = d all dependencies returned

$$f = \frac{2n_1}{d_1 + |y|} \otimes \frac{2n_2}{d_2 + |y|} = \frac{2(n_1 + n_2)}{d_1 + d_2 + 2|y|}$$
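The slide's point, that per-subderivation F-scores do not combine into the sentence-level score, can be checked with toy counts (all values invented):

```python
# Combining the two parts doubles |y| in the denominator, so the
# combination disagrees with sentence-level F1.
def f1(n, d, gold):
    return 2 * n / (d + gold)   # F1(y, y') = 2n / (d + |y|)

n1, d1, n2, d2, gold = 1, 1, 1, 2, 4
exact = f1(n1 + n2, d1 + d2, gold)               # 2(n1+n2) / (d1+d2+|y|) = 4/7
combined = 2 * (n1 + n2) / (d1 + d2 + 2 * gold)  # 2(n1+n2) / (d1+d2+2|y|) = 4/11
assert exact != combined                         # F1 is not decomposable
```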
![Page 134: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/134.jpg)
Exact Loss Functions
• Treat sentence-level F1 as non-local feature dependent on n, d.
• Result: new dynamic program over items Ai,j,n,d
Marcel proved completeness
State-split items: NP0,1,0,0 · (S\NP)/NP1,2,0,0 · NP2,3,0,0 · S\NP1,3,1,1 · S0,3,1,1
![Page 138: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/138.jpg)
Exact Losses with State-Split CKY (items Ai,j,n,d)
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0
Derived items: NP3,5,1,1 · (S\NP)\(S\NP)2,5,2,2 · S\NP1,5,3,3 · S0,5,4,4 (subscripts: span, correct dependencies, all dependencies)
GOAL: F1(4,4)
![Page 141: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/141.jpg)
Exact Losses with State-Split CKY (items Ai,j,n,d)
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0 · NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0
Target analysis: NP3,5,1,1 · (S\NP)\(S\NP)2,5,2,2 · S\NP1,5,3,3 · S0,5,4,4, GOAL: F1(4,4)
Other analysis: NP0,2,0,1 · S\NP2,5,1,2 · S0,5,1,4, GOAL: F1(1,4)
Speed O(L^7), Space O(L^4)
![Page 142: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/142.jpg)
Exact Losses with State-Split CKY, in practice:
time1 flies2 like3 an4 arrow5
Lexical items: NP0,1,0,0 · S\NP1,2,0,0 · ((S\NP)\(S\NP))/NP2,3,0,0 · NP/NP3,4,0,0 · NP4,5,0,0 · NP/NP0,1,0,0 · NP1,2,0,0 · (S\NP)/NP2,3,0,0
Target analysis: NP3,5,1,1 · (S\NP)\(S\NP)2,5,2,2 · S\NP1,5,3,3 · S0,5,4,4, GOAL: F1(4,4)
Other analysis: NP0,2,0,1 · S\NP2,5,1,2 · S0,5,1,4, GOAL: F1(1,4)
48× larger DP, 30× slower
![Page 143: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/143.jpg)
Experiments
• Standard parsing task:
• C&C Parser and supertagger (Clark & Curran 2007).
• CCGBank standard train/dev/test splits.
• Piecewise optimisation (Sutton and McCallum 2005).
![Page 144: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/144.jpg)
Exact versus Approximate
[Bar chart: labelled precision, recall and F-measure for approximate vs. exact loss functions; y-axis 86.9 to 88.0.]
![Page 148: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/148.jpg)
Exact versus Approximate
Approximate loss functions work, and much faster!
![Page 149: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:](https://reader033.vdocuments.mx/reader033/viewer/2022060723/60833560e4eeeb437463bead/html5/thumbnails/149.jpg)
Softmax-Margin beats CLL: test set results

[Bar chart, labelled F-measure: C&C '07 vs. DecF1 under tight and loose beams. Tight beam: 87.7 → 88.1; loose beam: 87.7 → 88.6 (+0.9).]
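For reference, the softmax-margin objective these results compare against conditional log-likelihood (CLL) can be written as follows (notation mine):

```latex
\min_{\theta}\;\sum_{i}\Big[\log \sum_{y\in\mathcal{Y}(x_i)}
  \exp\big(\theta^{\top} f(x_i,y) + \ell(y,y_i)\big)
  \;-\; \theta^{\top} f(x_i,y_i)\Big]
```

Setting the loss $\ell \equiv 0$ recovers plain CLL; DecF1 denotes training with an (approximate, decomposed) F-measure loss as $\ell$.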
Does task-specific optimisation degrade accuracy on other metrics?

[Bar chart, labelled exact match (y-axis 37.0-40.0): C&C '07 vs. DecF1 under tight and loose beams; scores 37.7, 38.0, 38.0 and 39.1.]

Softmax-Margin beats CLL
Integrated Model + SMM

Marcel proved completeness

- Supertagging factor: Hamming-augmented expectations via forward-backward.
- Parsing factor: F-measure-augmented expectations via inside-outside.
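For the supertagging factor, Hamming-augmented expectations only require running the usual forward recursion over cost-augmented node scores. A minimal numpy sketch of that idea for a toy linear-chain tagger (function names and the scoring interface are mine, not from the talk):

```python
import numpy as np

def lse(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)), axis)

def log_partition(node, edge):
    """Forward algorithm for a linear chain: log of the sum over all
    tag sequences of exp(total score). node: (T, L), edge: (L, L)."""
    alpha = node[0]
    for t in range(1, len(node)):
        alpha = node[t] + lse(alpha[:, None] + edge, axis=0)
    return lse(alpha, axis=0)

def softmax_margin_loss(node, edge, gold, cost=1.0):
    """Softmax-margin objective with Hamming loss: every non-gold tag
    at every position pays `cost`, folded straight into the node scores,
    so the ordinary forward recursion computes the augmented partition."""
    T = len(gold)
    aug = node + cost
    aug[np.arange(T), gold] -= cost   # gold tags pay nothing
    gold_score = node[np.arange(T), gold].sum() + edge[gold[:-1], gold[1:]].sum()
    return log_partition(aug, edge) - gold_score
```

Setting `cost=0` recovers plain CLL; the gradient of the augmented log-partition yields exactly the Hamming-augmented expectations that forward-backward computes.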
Results: Integrated Model

- F-measure loss for parsing sub-model (+DecF1).
- Hamming loss for supertagging sub-model (+Tagger).
- Belief propagation for inference.

[Bar chart, labelled F-measure: C&C '07 87.7, Integrated 88.9, +DecF1 89.2, +Tagger 89.3 (+1.5 over C&C '07).]
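The belief propagation used for inference passes messages between the supertagging chain and the parsing chart over the shared supertag variables. A toy numpy sketch of one direction of that exchange, with a random score table standing in for the parser's inside-outside message (all names and the stand-in are mine):

```python
import numpy as np

def lse(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)), axis)

def chain_marginals(node, edge):
    """Per-position tag marginals (log space) of a linear chain,
    computed with forward-backward. node: (T, L), edge: (L, L)."""
    T, L = node.shape
    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0] = node[0]
    for t in range(1, T):
        alpha[t] = node[t] + lse(alpha[t - 1][:, None] + edge, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = lse(edge + (node[t + 1] + beta[t + 1])[None, :], axis=1)
    logb = alpha + beta                 # unnormalised log marginals
    return logb - lse(logb, axis=1)[:, None]

# One BP step: fold the parser's current message into the tagger's node
# scores, then read off the tagger's beliefs with forward-backward.
rng = np.random.default_rng(1)
T, L = 4, 3
node, edge = rng.normal(size=(T, L)), rng.normal(size=(L, L))
parser_msg = rng.normal(size=(T, L))   # stand-in: would come from inside-outside
belief = chain_marginals(node + parser_msg, edge)
```

In the full model the parser factor replies with a message computed by inside-outside over the CCG chart, and the two factors iterate until the beliefs stop changing (loopy BP).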
Results: Automatic POS

- F-measure loss for parsing sub-model (+DecF1).
- Hamming loss for supertagging sub-model (+Tagger).
- Belief propagation for inference.

[Bar chart, labelled F-measure: C&C '07 85.7, Petrov-I5 (Fowler & Penn, 2010) 86.0, Integrated 86.8, +DecF1 87.1, +Tagger 87.2 (+1.5 over C&C '07).]
Results: Efficiency vs. Accuracy

[Scatter plot: sentences/second (0-110) against accuracy (87-90); faster is up, better is right. Points: C&C, Integrated Model, Softmax-Margin Training.]
Summary

- Softmax-Margin training is easy and improves our model.
- Approximate loss functions are fast, accurate and easy to use.
- Best ever CCG parsing results (87.7 → 89.3).
Future Directions

- What can we do with the presented methods?
- BP for other complex problems, e.g. SMT.
- Semantics for SMT.
- Simultaneous parsing of multiple sentences.
BP for other NLP pipelines

- Pipelines are necessary for practical NLP systems.
- More accurate integrated models are often too complex.
- This talk: approximate inference can make these models practical.
- Use it for other pipelines, e.g. POS & NER tagging and parsing.
- Hard: BP for syntactic MT, another weighted intersection problem between LM & TM.
Semantics for SMT

- Compositional & distributional meaning representations to compute vectors of sentence meaning (Grefenstette & Sadrzadeh, 2011; Clark, to appear).
- Syntax (e.g. CCG) drives the compositional process.
- Directions: model optimisation, evaluation, LM.

[Figure: example translation and reference.]
Parsing beyond sentence-level

- Many NLP tasks (e.g. IE) rely on uniform analysis of constituents.
- Skip-chain CRFs have been used successfully to predict consistent NER tags across sentences (Sutton & McCallum, 2004).
- Parse multiple sentences at once and enforce uniformity of parses.

[CCG derivation fragments for two sentences: 1. "The securities and exchange commission issued ..." and 2. "... responded to the statement of the securities and exchange commission", where "the securities and exchange commission" should receive the same analysis (NP) in both.]
Related Publications

- A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing. With Adam Lopez. In Proc. of ACL, June 2011.
- Efficient CCG Parsing: A* versus Adaptive Supertagging. With Adam Lopez. In Proc. of ACL, June 2011.
- Training a Log-Linear Parser with Softmax-Margin. With Adam Lopez. In Proc. of EMNLP, July 2011.
- A Systematic Comparison of Translation Model Search Spaces. With Adam Lopez, Hieu Hoang, Philipp Koehn. In Proc. of WMT, March 2009.
Thank you