partial prebracketing to improve parser performance john judge nclt seminar series 7 th december...
Post on 19-Dec-2015
220 views
TRANSCRIPT
Partial Prebracketing to Improve Parser Performance
John Judge
NCLT Seminar Series
7th December 2005
Overview
• Background and Motivation• Prebracketing• NE/MWE Markup• Labelled Bracketing Constituent Markup• NE/MWE + LBCM Combined• Grammars compared• Conclusions
Background and Motivation
• Parse annotated corpora are crucial for developing machine learning and statistics based parsing resources
• Large treebanks are available for major languages
• For other languages there is a lack of such resources for grammar induction
Background and Motivation
• Treebank construction is usually semi-automatic (Penn Treebank, NEGRA)– Raw text is parsed– Annotator corrects parser output
• Propose a new method– Text is pre-processed to help parser– Pre-processed text is parsed– Annotator corrects parser output
• Hopefully the output will be better quality and correction will be quicker and easier for the annotator
Relevance to my work
• Previous work on question analysis has identified a need for a suitable training corpus
• Plan to be able to use this technique for developing a question treebank
• Method is general so it can be applied to other areas/languages
Prebracketing
• Marking up the input text with information that will help the parser parse the sentence properly– Named Entities– Multi-Word Expressions– Constituents VP, PP
• Prebracketing can be done automatically (NE, MWE) or manually (constituents)
• Parser (LoPar, Schmidt (2000)) will respect this markup when parsing
Prebracketing
Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’
Prebracketing: Named Entities
Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’
Prebracketing: Multi-Word Expressions
Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’
Prebracketing: Constituents
Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’
Named Entities
• Names of people, places, companies, products, etc.
• Similar to Proper Nouns
• Internal structure isn’t really important
• Can be treated as a single lexical unit
Parsing NE’s
• Parser doesn’t normally output NEs
• Retrain on NE annotated treebank
• Add NE annotation to sections 2-21 of Penn-II Treebank to use as training
• Add NE annotations to section 23 as test set
Transforming the Corpus
• Use named entity recogniser, Lingpipe, to pick out the entities
(http://alias-i.com/lingpipe/)
• Use (slightly hacked) NE transformation routine in the LFG Annotation Algortithm to transform the trees
Example Tree Before Transformation
NP-SBJ
NP , NP ,
NNP NNP
Robert Erwin, ,NP
NP
PP
IN
NNP
NN
president of
Biosource
Example Tree After Transformation
NP-SBJ
NE , NP ,
Robert Erwin , ,NP
NE
PP
INNN
president ofBiosource
Example Prebracketed Input
• Parser input is generated by systematically stripping markup from the gold standard trees, leaving in only NE markup
((S (NP-SBJ (NE Robert Erwin)…(JJ competitive))))))(. .)('' '')))
(NE Robert Erwin), president of (NE Biosource), called (NE Plant Genetic)’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive . ’’
Results for NE Parsing Section 23
Baseline Marked up
Coverage 99.42 99.63
F-Score 62.95 66.09
Multi-Word Expressions
• Short expressions that are interpreted as a single unit eg. with respect to, vice versa, all of a sudden, according to, as well as …
• Can correspond to a number of syntactic categories
• Treat as a single lexical unit and ignore internal structure
Parsing MWE’s
• Parser doesn’t normally output MWE’s
• As with NE’s Penn-II is transformed to contain MWE’s
• Parser is retrained on MWE annotated treebank
• Tested on MWE annotated version of Section 23
Transforming the corpus
• Use a list of multi-word expressions from the Stanford University Multi-Word Expression Project (http://mwe.stanford.edu/)
• Transform the corpus using (an even more hacked) NE transformation routine in the LFG Annotation Algorithm
Example Tree Before Transformation
PP
RB IN ADJP
rather thanJJ
competitive
Example Tree After Transformation
PP
MWE ADJP
rather thanJJ
competitive
Example Prebracketed Input
• Parser input is generated by systematically stripping markup from the gold standard trees, leaving in only MWE markup
((S (NP-SBJ (NP (NNP Robert) (NNP Erwin)…(JJ competitive))))))(. .)('' '')))
Robert Erwin, president of Biosource, called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary (MWE rather than) competitive . ’’
Results for MWE Parsing Section 23
Baseline Marked up
Coverage 99.42 99.55
F-score 63.88 64.26
NE and MWE Markup Combined
• Marking up NEs and MWEs individually has yielded an improvement on the baseline
• Try doing both together
• Treebank is transformed as before but marking up both NEs and MWEs
• Prebracketed input can also contain both NEs and MWEs
Results for NE/MWE Combined Parsing
Baseline MWE NE NE&MWE
Coverage 99.67 99.67 99.67 99.67
F-Score 65.11 65.13 65.40 65.53
• MWE score is almost 1% better than when using MWEs alone
• NE score is over half a percent worse than when using NEs alone
• Coverage is up slightly on both runs
Labelled Bracketing Constituent Markup
• Instead of marking up lexicalised units, sentence constituents (VP, PP) are marked up
• No transformations are necessary as original treebank trees produce constituents in output
Generating LBCM input
• Systematically strip markup from the gold standard trees, leaving in only selected markup
((S (NP-SBJ (NP (NNP Robert) (NNP Erwin)…(JJ competitive))))))(. .)('' '')))
Robert Erwin, president of Biosource, (VP called Plant Genetic’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary rather than competitive) . ’’
• Simulating “ideal” human generated input
Results for LBCM
Baseline Top VP
Bottom VP
Top PP
Bottom PP
Coverage 99.87 99.11 98.98 99.55 98.55
F-Score 67.19 71.16 69.74 70.48 68.44
Combining NE, MWE and LBCM Prebracketing
• Taking the corpora from NE and MWE combination and preprocessing input as for LBCM
• (NE Robert Erwin), president of (NE Biosource), (VP called (NE Plant Genetic)’s approach `` Interesting ’’ and ``novel, ’’ and `` complementary (MWE rather than) competitive) . ’’
Results for all 3
Prebracketed Coverage F-Score
None (Baseline) 99.67 65.11
MWE 99.67 65.13
NE 99.67 65.40
MWE & NE 99.67 65.53
NE & Top VP 99.13 70.06
MWE & Top VP 99.21 69.63
NE, MWE & Top VP 99.21 69.91
Looking at the Grammar/Lexicon
• Expect to reduce grammar/lexicon size by conflating NEs and MWEs
• Compare 4 grammars and lexicons– Plain PCFG– NE PCFG– MWE PCFG– NE/MWE PCFG
Comparison
Vanilla NE MWE Combi
Rules 15427 16253 16791 17661
NP Rules 6456 6754 6544 6853
NP …NE…
- 1515 - 1520
NP …MWE…
- - 198 197
… NE - 2123 - 2549
… MWE - - 1821 1793
Lexical Entries
45357 51257 45622 51521
NEs in Lex - 14773 - 14773
MWEs in Lex - - 323 318
Why are the Grammars Growing?
• Expectation was that the grammar/lexicon would shrink
• Instead they’re growing in size• Caused by the lexicon
– Many MWEs are appearing as a single MWE entry and also their individual words
– Likewise for NEs– Introducing new categories– Correspondingly more rules in the
grammar
Multi-Word Expression Example
• all of a sudden– all of a sudden MWE 2– All of a sudden MWE 2– all DT 800 PDT 182 RB 36– of IN 22741 RP 2– a DT 19895 FW 6 LS 1 NNP 2 SYM 10
– sudden JJ 32
Named Entity Example
• Winston Churchill– Winston Churchill NE 1– Churchill NE 2
• George Bush– George Bush NE 26– George NE 5– Bush NE 344
Knock-on effects
• Grammar/lexicon size is growing– Parse time is increasing
• A likely cause of the gains in terms of precision, recall and f-score being small– NE/MWE analysis of a phrase is not the
most likely according to the grammar– Consequently a less likely parse is output
Summary
• Generating NE/MWE PCFG grammars in this way is possible
• Unexpectedly, these grammars are larger than plain PCFG grammars
• Results show something can be gained from prebracketing input
• However, even the best result 70.16 (prebracketing topmost VP) is considerabley less than history-based parsers (Charniak, Collins, Bikel)
Further Work
• Better NE recognition– More sophisticated transformation method
• Different/larger MWE lists
• Experiment with history-based parsers for better results*
Thanks
• Any questions or comments?