Open-Source Implementation of Document Structuring Algorithm for NLTK
Nicholas FitzGerald
Natural Language Generation
- Generate coherent text outputs to express information
  - Express the right information
  - Express information in the right order
NLG Tasks
1. Document Structuring: the most important and relevant information is selected from the knowledge base (Content Determination), then ordered and structured so as to maximize coherence and informativeness (Text Planning)
2. Micro-Planning: the specifics of word selection, referring expressions, and the final ordering are determined
3. Realization: the internal representations of the above decisions are realized as actual text output
Document Structuring
- Given a set of information to be expressed, determine the order and grouping of this information
- Texts cannot be simply a random bag of sentences
- Order of message presentation has a significant effect on meaning [Hovy 1993]

One way:
1. “Maria was diagnosed with cancer some months ago.”
2. “Maria and Zurab had a fight last night.”
3. “She was found dead this morning.”

Versus:
1. “Maria was diagnosed with cancer some months ago.”
2. “She was found dead this morning.”
3. “Maria and Zurab had a fight last night.”
Document Structuring
Ordering also affects coherence:
- “John was hungry. John went to the store. He bought some bread to make a sandwich.”
- “John bought some bread to make a sandwich. He went to the store. John was hungry.”
Discourse relations
- A relationship between messages or groups of messages
- Elaboration(m1, m2)
  - I love jazz music (m1). My favourite album is Oscar Peterson's “Night Train” (m2).
- Contrast(m1, m3)
  - I love jazz music (m1). However, my favourite album is The Beatles' “White Album” (m3).
  - Cue word: “However”
Rhetorical Structure Theory
- Mann and Thompson 1988
- A text is coherent by virtue of the relationships that hold between messages in the text
- A small number of relations (~25) can explain the relationships between messages in a wide range of texts
Project Proposal
- Implement these general algorithms for inclusion in NLTK
- Provide a sample data set and DR schema for testing and illustration
  - Based on the hypothetical WeatherExplainer from [Reiter and Dale 2000]
- Experiment with these new tools as part of the Abstractive Summarization System for Evaluative Statement Summarization (ASSESS)
Implementation 1: Schemas
- Top-down approach
- Output document structure is predictable and stereotyped
- Schemas are patterns of expansion, similar to a CFG, e.g.:

    CompareAndContrast → DescribeRelationship CompareProperties
    CompareProperties → CompareProperty CompareProperties
    CompareProperties → (empty)

- “John is much bigger than Kate (DR). He is five inches taller (CP) and weighs almost twice as much (CP).”
- Specify rules for choosing between expansions when more than one exists
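The schema expansion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SCHEMAS` table, the `expand` function and the `make_chooser` choice rule are hypothetical names, not part of the implementation described in these slides.

```python
# Hypothetical schema table: each non-terminal maps to its alternative expansions.
SCHEMAS = {
    "CompareAndContrast": [["DescribeRelationship", "CompareProperties"]],
    "CompareProperties": [["CompareProperty", "CompareProperties"], []],
}


def expand(symbol, choose):
    """Expand `symbol` top-down into a flat list of leaf schema names."""
    if symbol not in SCHEMAS:
        return [symbol]  # leaf: would be filled in by a message
    leaves = []
    for child in choose(symbol, SCHEMAS[symbol]):
        leaves.extend(expand(child, choose))
    return leaves


def make_chooser(num_properties):
    """Choice rule (assumed): recurse until `num_properties` properties are emitted."""
    remaining = [num_properties]

    def choose(symbol, alternatives):
        if symbol == "CompareProperties":
            if remaining[0] > 0:
                remaining[0] -= 1
                return alternatives[0]  # one more property, then recurse
            return alternatives[1]      # empty expansion ends the recursion
        return alternatives[0]

    return choose


plan = expand("CompareAndContrast", make_chooser(2))
print(plan)  # ['DescribeRelationship', 'CompareProperty', 'CompareProperty']
```

With two properties to compare, the result has the same DR, CP, CP shape as the John and Kate example.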
Top-Down Problems
- Hypothesis-driven
- Content selection is done “on-line”
- Not easily pipelined
- Therefore, a bottom-up approach is used
Implementation 2: Bottom-Up
- Output document structure is not predictable

    POOL = messages to be expressed
    while size(POOL) > 1:
        find all pairs of elements in POOL which can be joined by a DR
        assign a desirability score to each potential DR
        find the pair Ei and Ej with the highest score and combine them into Ek
        remove Ei and Ej from POOL, replace with Ek
    end while
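The loop above can be sketched as runnable Python. This is a minimal sketch, not the implementation presented in these slides: messages are plain dicts, a rule is an assumed `(name, condition, score)` triple, and the function name `bottom_up_sketch` and the toy message types are hypothetical.

```python
from itertools import permutations


def bottom_up_sketch(messages, rules):
    """Greedy bottom-up planning: repeatedly merge the best-scoring pair."""
    pool = list(messages)
    while len(pool) > 1:
        candidates = []
        for a, b in permutations(pool, 2):
            for name, condition, score in rules:
                if condition(a, b):
                    candidates.append((score, name, a, b))
        if not candidates:
            break  # no pair can be joined by any rule; stop early
        score, name, a, b = max(candidates, key=lambda c: c[0])
        pool = [e for e in pool if e is not a and e is not b]
        pool.append({"relType": name, "nucleus": a, "aux": b})
    return pool


def nucleus_type(element):
    """msgType of a message, or of the nucleus of a combined element."""
    return element.get("msgType") or element.get("nucleus", {}).get("msgType")


# Toy data (hypothetical message types and scores).
messages = [
    {"msgType": "Temperature"},
    {"msgType": "MonthlyRain", "direction": "+"},
    {"msgType": "TotalRain", "direction": "+"},
]
rules = [
    ("Elaboration",
     lambda a, b: a.get("msgType") == "MonthlyRain"
     and b.get("msgType") == "TotalRain"
     and a["direction"] == b["direction"],
     3),
    ("Sequence",
     lambda a, b: a.get("msgType") == "Temperature"
     and nucleus_type(b) == "MonthlyRain",
     1),
]

plan = bottom_up_sketch(messages, rules)
print(plan[0]["relType"], plan[0]["aux"]["relType"])  # Sequence Elaboration
```

The sketch stops early if no rule applies to any remaining pair, a case the pseudocode leaves unspecified.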
Implementation
- Used nltk.featstruct for Messages and DocPlans
- A feature structure is a mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure

Message:

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4
        direction '+'

As a feature structure:

    [ *msgType* = 'TotalRainfallMsg'                    ]
    [                                                   ]
    [             [ direction = '+'                   ] ]
    [             [                                   ] ]
    [ attribute = [ magnitude = [ number = 4        ] ] ]
    [             [             [ unit   = 'inches' ] ] ]
    [             [                                   ] ]
    [             [ type      = 'RelativeVariation'   ] ]
    [                                                   ]
    [ period    = [ month = 6    ]                      ]
    [             [ year  = 1996 ]                      ]
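To make the "mapping from features to values" idea concrete, here is the same message written as nested Python dicts. The dict representation and the `get_path` helper are assumptions for illustration only; the slides use nltk.featstruct for this.

```python
# Plain-dict stand-in for the TotalRainfallMsg feature structure (illustrative only).
total_rainfall = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
        "direction": "+",
    },
}


def get_path(fs, path):
    """Follow a dotted feature path such as 'attribute.magnitude.number'."""
    value = fs
    for feature in path.split("."):
        value = value[feature]  # raises KeyError if the feature is absent
    return value


print(get_path(total_rainfall, "attribute.magnitude.number"))  # 4
print(get_path(total_rainfall, "period.year"))                 # 1996
```

Dotted paths of this kind are what the rule conditions later in the slides refer to, for example M1.attribute.direction.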
Implementation
nltk.featstruct.FeatStruct.unify(other):

Unify fstruct1 with fstruct2, and return the resulting feature structure. This unified feature structure is the minimal feature structure that:
- contains all feature value assignments from both fstruct1 and fstruct2
- preserves all reentrance properties of fstruct1 and fstruct2

If no such feature structure exists (because fstruct1 and fstruct2 specify incompatible values for some feature), then unification fails, and unify returns None.
Unification
    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        direction '+'

+

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4

=

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4
        direction '+'
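The unification example above can be reproduced with a small dict-based sketch. This is not NLTK's implementation: the `unify` function below is a simplified stand-in that ignores reentrance and, because it signals failure with None, cannot handle features whose value is None.

```python
def unify(a, b):
    """Return the merged structure, or None if `a` and `b` conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:
                    return None  # incompatible values somewhere below
                result[feature] = merged
            else:
                result[feature] = b_value
        return result
    return a if a == b else None  # basic values must match exactly


left = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": "+"},
}
right = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
    },
}

unified = unify(left, right)
print(sorted(unified["attribute"]))              # ['direction', 'magnitude', 'type']
print(unify(left, {"period": {"year": 1997}}))   # None
```

The second call fails because the two structures disagree on period.year, which mirrors the "incompatible values" case in the description of unify.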
Implementation
nltk.featstruct.FeatStruct.subsumes(other):

Return True if self subsumes other, i.e. if unifying self with other would result in a feature structure equal to other.
Subsumes
    TotalRainfallMsg
      period
        year 1996
        month 06

subsumes

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4
        direction '+'

but

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4

does not subsume

    TotalRainfallMsg
      period
        year 1996
        month 06
Using Subsumes
“Select from messages all DocPlans with a relType of Contrast and a nucleus which is a message of msgType 'TotalRainfallMsg'”

    d = DocPlan(relType='Contrast', nucleus=Message('TotalRainfallMsg'))
    result = filter(lambda msg: d.subsumes(msg), messages)
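The subsumption test and the query-by-example filter above can be sketched with plain dicts. The `subsumes` function here is a simplified stand-in written for illustration, not NLTK's method, and the sample plans are hypothetical.

```python
def subsumes(general, specific):
    """Return True if every feature in `general` is matched in `specific`."""
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return False
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


period_only = {"msgType": "TotalRainfallMsg", "period": {"year": 1996, "month": 6}}
detailed = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": "+"},
}
print(subsumes(period_only, detailed))  # True
print(subsumes(detailed, period_only))  # False

# Query by example: keep only the plans that the pattern subsumes.
pattern = {"relType": "Contrast", "nucleus": {"msgType": "TotalRainfallMsg"}}
plans = [
    {"relType": "Contrast", "nucleus": detailed, "aux": period_only},
    {"relType": "Sequence", "nucleus": detailed, "aux": period_only},
]
matches = [plan for plan in plans if subsumes(pattern, plan)]
print(len(matches))  # 1
```

A partially specified structure therefore acts as a pattern: anything at least as specific as the pattern is selected.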
Implementation: Input Formats
Messages:

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4
        direction '+'
Input Formats
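Rules:

    Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
    (M1.attribute.direction == M2.attribute.direction) : ConstituentSet('Elaboration', M1, M2) : 3

A rule consists of four parts, in order:
- inputs: Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2
- conditions: M1.attribute.direction == M2.attribute.direction
- return: ConstituentSet('Elaboration', M1, M2)
- heuristic: 3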
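One way to picture the four parts of a rule is as a small record with an `apply` method. This is a sketch under assumptions: the `Rule` class, its field names, and the choice of M1 as nucleus and M2 as aux are illustrative and are not taken from the implementation's parser.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rule:
    name: str                                # relation name, e.g. 'Elaboration'
    input_types: Tuple[str, str]             # required msgType of M1 and M2
    condition: Callable[[dict, dict], bool]  # test over the two inputs
    heuristic: int                           # desirability score

    def apply(self, m1, m2):
        """Return (heuristic, constituent set) if the rule fires, else None."""
        if (m1.get("msgType"), m2.get("msgType")) != self.input_types:
            return None
        if not self.condition(m1, m2):
            return None
        return self.heuristic, {"relType": self.name, "nucleus": m1, "aux": m2}


elaboration = Rule(
    name="Elaboration",
    input_types=("MonthlyRainfallMsg", "TotalRainfallMsg"),
    condition=lambda m1, m2: m1["attribute"]["direction"] == m2["attribute"]["direction"],
    heuristic=3,
)

monthly = {"msgType": "MonthlyRainfallMsg", "attribute": {"direction": "+"}}
total = {"msgType": "TotalRainfallMsg", "attribute": {"direction": "+"}}

fired = elaboration.apply(monthly, total)
print(fired[0], fired[1]["relType"])      # 3 Elaboration
print(elaboration.apply(total, monthly))  # None
```

Swapping the arguments returns None because the input types no longer match the order the rule declares.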
Example Usage
    with open('msg_file', 'r') as f:
        msg_string = f.read()

    with open('rule_file', 'r') as f:
        rule_string = f.read()

    messages = read_messages(msg_string)
    rules = read_rules(rule_string)

    plan = bottom_up_plan(messages, rules)
Data Set - WeatherExplainer
Simple example provided in [Reiter and Dale 2000]
Created 3 messages and 3 rules in the input format
WeatherExplainer Messages

    TotalRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 4
        direction '+'

    MonthlyRainfallMsg
      period
        year 1996
        month 06
      attribute
        type 'RelativeVariation'
        magnitude
          unit 'inches'
          number 2
        direction '+'

    MonthlyTemperatureMsg
      period
        year 1996
        month 06
      temperature
        category 'hot'
WeatherExplainer Rules

    Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
    (M1.attribute.direction == M2.attribute.direction) : ConstituentSet('Elaboration', M1, M2) : 3

    Contrast(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
    (M1.attribute.direction != M2.attribute.direction) : ConstituentSet('Contrast', M1, M2) : 2

    Sequence(Message('MonthlyTemperatureMsg')|ConstituentSet(nucleus=Message('MonthlyTemperatureMsg')) M1, Message('MonthlyRainfallMsg')|ConstituentSet(nucleus=Message('MonthlyRainfallMsg')) M2)
    () : ConstituentSet(Sequence, M1, M2) : 1
WeatherExplainer Result

    *type* = 'DPDocument'
    children =
        *aux* =
            *aux* =
                *msgType* = 'TotalRainfallMsg'
                attribute =
                    direction = '+'
                    magnitude =
                        number = 4
                        unit = 'inches'
                    type = 'RelativeVariation'
                period =
                    month = 6
                    year = 1996
            *nucleus* =
                *msgType* = 'MonthlyRainfallMsg'
                attribute =
                    direction = '+'
                    magnitude =
                        number = 2
                        unit = 'inches'
                    type = 'RelativeVariation'
                period =
                    month = 6
                    year = 1996
            *relType* = "'Elaboration'"
        *nucleus* =
            *msgType* = 'MonthlyTemperatureMsg'
            period =
                month = 6
                year = 1996
            temperature =
                category = 'hot'
        *relType* = 'Sequence'
    title =
        text = None
        type = None
WeatherExplainer Result
Roughly:
“This has been a hot month. Average rainfall this month is greater than usual. So far, rainfall is four inches above average.”
ASSESS
Summarization of Evaluative Opinions
An Abstractive Summarization Pipeline
Input Reviews → [Extract all information from input corpus] → Data → [Determine most relevant information and generate summary] → Summary
ASSESS Testing
- Input:
  - Review sentences tagged with crude-feature evaluations
  - Crude-Feature to User-Defined-Feature (UDF) mapping
- Simple content selection
  - Group evaluations by UDF
  - Calculate average evaluation
  - Also include info on UDF-parent in hierarchy, number of evaluations
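The content selection steps listed above can be sketched as follows. Everything concrete here is an assumption made for illustration: the mapping tables, the sample scores, the field names (borrowed from the example message in these slides), and the reading of valence as the magnitude of the mean with polarity as its sign.

```python
from collections import defaultdict

# Hypothetical crude-feature to UDF mapping and UDF hierarchy.
CRUDE_TO_UDF = {
    "remote": "Universal Remote Control",
    "remote control": "Universal Remote Control",
    "picture": "Picture Quality",
}
UDF_PARENT = {
    "Universal Remote Control": "Extra Features",
    "Picture Quality": "Video",
}


def average_opinions(evaluations):
    """Group (crude_feature, score) pairs by UDF and build one message per UDF."""
    grouped = defaultdict(list)
    for crude_feature, score in evaluations:
        grouped[CRUDE_TO_UDF[crude_feature]].append(score)

    messages = []
    for udf, scores in grouped.items():
        mean = sum(scores) / len(scores)
        messages.append({
            "msgType": "AverageOpinionMessage",
            "udf": udf,
            "udf_parent": UDF_PARENT[udf],
            "numOpinions": len(scores),
            "polarity": "+" if mean >= 0 else "-",  # assumed sign convention
            "valence": abs(mean),                    # assumed: magnitude only
        })
    return messages


evaluations = [("remote", -2), ("remote control", -1), ("picture", 3)]
messages = average_opinions(evaluations)
for message in messages:
    print(message["udf"], message["polarity"], message["valence"], message["numOpinions"])
```

Each resulting dict has the same fields as the example message, so it could be fed to rules that compare udf_parent and polarity.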
Example Message
    [ *msgType*   = 'AverageOpinionMessage'    ]
    [ numOpinions = 17                         ]
    [ polarity    = '-'                        ]
    [ udf         = 'Universal Remote Control' ]
    [ udf_parent  = 'Extra Features'           ]
    [ valence     = 1.1764705882352942         ]

12 messages generated
Rules

    Conjunction(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
    (M1.udf_parent == M2.udf_parent and M1.polarity == M2.polarity) : ConstituentSet(Conjunction, M1, M2) : (2, M1.numOpinions+M2.numOpinions)

    Contrast(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
    (M1.udf_parent == M2.udf_parent and M1.polarity != M2.polarity) : ConstituentSet(Contrast, M1, M2) : (3, M1.numOpinions+M2.numOpinions)

    Explanation(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
    (M1.udf == M2.udf_parent and M1.polarity == M2.polarity) : ConstituentSet(Explanation, M1, M2) : (5, 0)

    Explanation(Message('AverageOpinionMessage') M1, ConstituentSet(relType = 'Conjunction', nucleus=Message('AverageOpinionMessage')) M2)
    (M1.udf == M2.nucleus.udf_parent and M1.polarity == M2.nucleus.polarity) : ConstituentSet(DExplanation, M1, M2) : (10, 0)

    Sequence(Message('AverageOpinionMessage')|ConstituentSet() M1, Message('AverageOpinionMessage')|ConstituentSet() M2)
    () : ConstituentSet(Sequence, M1, M2) : (1, 0)
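The heuristics in these rules are pairs rather than single numbers. The slides do not say how pairs are compared; a natural reading, assumed here, is lexicographic order, where the first number ranks the relation type and the second (the combined number of opinions) breaks ties. Python tuples compare this way by default, so the idea can be sketched directly. The candidate scores below are made up.

```python
# Hypothetical candidate joins: (heuristic pair, relation name).
candidates = [
    ((2, 29), "Conjunction"),
    ((3, 12), "Contrast"),
    ((2, 40), "Conjunction"),
    ((5, 0), "Explanation"),
]

# Tuples compare element by element, so the relation weight dominates and
# the opinion count only matters between candidates of equal weight.
ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
best = ranked[0]

print(best)                            # ((5, 0), 'Explanation')
print([score for score, _ in ranked])  # [(5, 0), (3, 12), (2, 40), (2, 29)]
```

Under this reading, a Conjunction backed by 40 opinions is preferred over one backed by 29, but neither outranks a Contrast or an Explanation.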
ASSESS Result
- It works!
- Evaluation of the resulting DocPlan would say more about the rules and content selection than about the document structuring algorithm
- Was able to handle a larger number of messages and rules
  - 4 of 5 rules used
  - Still, only one message type used
Future Improvements
- Investigate whether this simple framework can be used to develop more “intelligent” rules for more sophisticated domain models
  - [Carenini 2008] – SEA
  - May require changes to implementation
- Complete comprehensive documentation and user manual
- Submit to NLTK
References
Bird, S., Klein, E., and Loper, E. (2009) Natural Language Processing with Python. O'Reilly Media Inc. Print and online.
Carenini, G. and Moore, J.D. (2006) Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11): 925-952.
Carenini, G., Ng, R., and Pauls, A. (2006) Multi-Document Summarization of Evaluative Text. Proc. of the Conf. of the European Chapter of the Association for Computational Linguistics.
FitzGerald, N. (2009) A Complete Pipeline for Semantic Evaluation Summarization. Unpublished Project Report.
Lester, J. and Porter, B. (1997) Developing and empirically testing robust explanation generators: the KNIGHT experiments. Computational Linguistics, 23(1): 65-101.
Mann, W. and Thompson, S. (1988) Rhetorical structure theory: toward a functional theory of text organization. Text 8(3): 243-281.
Marcu, D. (1997) From local to global coherence: A bottom-up approach to text planning. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-1997), 629-635.
Pitler, E. et al. (2008) Easily Identifiable Discourse Relations. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-24.
Reiter, E. and Dale, R. (1997) Building applied natural language generation systems. Natural Language Engineering 3(1): 57-87.
Reiter, E. and Dale, R. (2000) Building Natural Language Generation Systems (Studies in Natural Language Processing). New York: Cambridge UP. Print.
Young, R.M. and Moore, J.D. (1994) DPOCL: A principled approach to discourse planning. In: Proceedings of the 7th International Workshop on Natural Language Generation, Kennebunkport, ME, June 17–21, 1994, pp. 13–20.