w-rst: towards a weighted rst-style discourse framework

11
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 3908–3918 August 1–6, 2021. ©2021 Association for Computational Linguistics 3908 W-RST: Towards a Weighted RST-style Discourse Framework Patrick Huber * , Wen Xiao * , Giuseppe Carenini Department of Computer Science University of British Columbia Vancouver, BC, Canada, V6T 1Z4 {huberpat, xiaowen3, carenini}@cs.ubc.ca Abstract Aiming for a better integration of data-driven and linguistically-inspired approaches, we ex- plore whether RST Nuclearity, assigning a bi- nary assessment of importance between text segments, can be replaced by automatically generated, real-valued scores, in what we call a Weighted-RST framework. In particular, we find that weighted discourse trees from aux- iliary tasks can benefit key NLP downstream applications compared to nuclearity-centered approaches. We further show that real-valued importance distributions partially and interest- ingly align with the assessment and uncer- tainty of human annotators. 1 Introduction Ideally, research in Natural Language Processing (NLP) should balance and integrate findings from machine learning approaches with insights and the- ories from linguistics. With the enormous success of data-driven approaches over the last decades, this balance has arguably and excessively shifted, with linguistic theories playing a less and less crit- ical role. Even more importantly, there are only little attempts made to improve such theories in light of recent empirical results. In the context of discourse, two main theories have emerged in the past: The Rhetorical Structure Theory (RST) (Carlson et al., 2002) and PDTB (Prasad et al., 2008). In this paper, we focus on RST, exploring whether the underlying theory can be refined in a data-driven manner. In general, RST postulates a complete discourse tree for a given document. To obtain this formal representation as a projective consituency tree, a given document is first separated into so called El- ementary Discourse Units (or short EDUs), repre- senting clause-like sentence fragments of the input * Equal contribution. document. Afterwards, the discourse tree is built by hierarchically aggregating EDUs into larger con- stituents annotated with an importance indicator (in RST called nuclearity) and a relation holding between siblings in the aggregation. The nuclear- ity attribute in RST thereby assigns each sub-tree either a nucleus-attribute, indicating central impor- tance of the sub-tree in the context of the document, or a satellite-attribute, categorizing the sub-tree as of peripheral importance. The relation attribute fur- ther characterizes the connection between sub-trees (e.g. Elaboration, Cause, Contradiction). One central requirement of the RST discourse theory, as for all linguistic theories, is that a trained human should be able to specify and interpret the discourse representations. While this is a clear advantage when trying to generate explainable outcomes, it also introduces problematic, human- centered simplifications; the most radical of which is arguably the nuclearity attribute, indicating the importance among siblings. Intuitively, such a coarse (binary) importance assessment does not allow to represent nuanced dif- ferences regarding sub-tree importance, which can potentially be critical for downstream tasks. For instance, the importance of two nuclei siblings is rather intuitive to interpret. However, having sib- lings annotated as “nucleus-satellite” or “satellite- nucleus” leaves the question on how much more important the nucleus sub-tree is compared to the satellite, as shown in Figure 1. In general, it is unclear (and unlikely) that the actual importance distributions between siblings with the same nucle- arity attribution are consistent. Based on this observation, we investigate the potential of replacing the binary nuclearity assess- ment postulated by RST with automatically gen- erated, real-valued importance scores in a new, Weighted-RST framework. In contrast with pre- vious work that has assumed RST and developed

Upload: others

Post on 06-Nov-2021

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: W-RST: Towards a Weighted RST-style Discourse Framework

Proceedings of the 59th Annual Meeting of the Association for Computational Linguisticsand the 11th International Joint Conference on Natural Language Processing, pages 3908–3918

August 1–6, 2021. ©2021 Association for Computational Linguistics

3908

W-RST: Towards a Weighted RST-style Discourse Framework

Patrick Huber∗, Wen Xiao∗, Giuseppe CareniniDepartment of Computer ScienceUniversity of British Columbia

Vancouver, BC, Canada, V6T 1Z4{huberpat, xiaowen3, carenini}@cs.ubc.ca

Abstract

Aiming for a better integration of data-drivenand linguistically-inspired approaches, we ex-plore whether RST Nuclearity, assigning a bi-nary assessment of importance between textsegments, can be replaced by automaticallygenerated, real-valued scores, in what we calla Weighted-RST framework. In particular, wefind that weighted discourse trees from aux-iliary tasks can benefit key NLP downstreamapplications compared to nuclearity-centeredapproaches. We further show that real-valuedimportance distributions partially and interest-ingly align with the assessment and uncer-tainty of human annotators.

1 Introduction

Ideally, research in Natural Language Processing(NLP) should balance and integrate findings frommachine learning approaches with insights and the-ories from linguistics. With the enormous successof data-driven approaches over the last decades,this balance has arguably and excessively shifted,with linguistic theories playing a less and less crit-ical role. Even more importantly, there are onlylittle attempts made to improve such theories inlight of recent empirical results.

In the context of discourse, two main theorieshave emerged in the past: The Rhetorical StructureTheory (RST) (Carlson et al., 2002) and PDTB(Prasad et al., 2008). In this paper, we focus onRST, exploring whether the underlying theory canbe refined in a data-driven manner.

In general, RST postulates a complete discoursetree for a given document. To obtain this formalrepresentation as a projective consituency tree, agiven document is first separated into so called El-ementary Discourse Units (or short EDUs), repre-senting clause-like sentence fragments of the input

∗Equal contribution.

document. Afterwards, the discourse tree is built byhierarchically aggregating EDUs into larger con-stituents annotated with an importance indicator(in RST called nuclearity) and a relation holdingbetween siblings in the aggregation. The nuclear-ity attribute in RST thereby assigns each sub-treeeither a nucleus-attribute, indicating central impor-tance of the sub-tree in the context of the document,or a satellite-attribute, categorizing the sub-tree asof peripheral importance. The relation attribute fur-ther characterizes the connection between sub-trees(e.g. Elaboration, Cause, Contradiction).

One central requirement of the RST discoursetheory, as for all linguistic theories, is that a trainedhuman should be able to specify and interpret thediscourse representations. While this is a clearadvantage when trying to generate explainableoutcomes, it also introduces problematic, human-centered simplifications; the most radical of whichis arguably the nuclearity attribute, indicating theimportance among siblings.

Intuitively, such a coarse (binary) importanceassessment does not allow to represent nuanced dif-ferences regarding sub-tree importance, which canpotentially be critical for downstream tasks. Forinstance, the importance of two nuclei siblings israther intuitive to interpret. However, having sib-lings annotated as “nucleus-satellite” or “satellite-nucleus” leaves the question on how much moreimportant the nucleus sub-tree is compared to thesatellite, as shown in Figure 1. In general, it isunclear (and unlikely) that the actual importancedistributions between siblings with the same nucle-arity attribution are consistent.

Based on this observation, we investigate thepotential of replacing the binary nuclearity assess-ment postulated by RST with automatically gen-erated, real-valued importance scores in a new,Weighted-RST framework. In contrast with pre-vious work that has assumed RST and developed

Page 2: W-RST: Towards a Weighted RST-style Discourse Framework

3909

Figure 1: Document wsj 0639 from the RST-DT cor-pus with inconsistent importance differences betweenN-S attributions. (The top-level satellite is clearly morecentral to the overall context than the lower-level satel-lite. However, both are similarly assigned the satelliteattribution by at least one annotator). Top relation: An-notator 1: N-S, Annotator 2: N-N.

computational models of discourse by simply ap-plying machine learning methods to RST annotatedtreebanks (Ji and Eisenstein, 2014; Feng and Hirst,2014; Joty et al., 2015; Li et al., 2016; Wang et al.,2017; Yu et al., 2018), we rely on very recent empir-ical studies showing that weighted “silver-standard”discourse trees can be inferred from auxiliary taskssuch as sentiment analysis (Huber and Carenini,2020b) and summarization (Xiao et al., 2021).

In our evaluation, we assess both, computationalbenefits and linguistic insights. In particular, wefind that automatically generated, weighted dis-course trees can benefit key NLP downstream tasks.We further show that real-valued importance scores(at least partially) align with human annotationsand can interestingly also capture uncertainty inhuman annotators, implying some alignment of theimportance distributions with linguistic ambiguity.

2 Related Work

First introduced by Mann and Thompson (1988),the Rhetorical Structure Theory (RST) has beenone of the primary guiding theories for discourseanalysis (Carlson et al., 2002; Subba and Di Eu-genio, 2009; Zeldes, 2017; Gessler et al., 2019;Liu and Zeldes, 2019), discourse parsing (Ji andEisenstein, 2014; Feng and Hirst, 2014; Joty et al.,2015; Li et al., 2016; Wang et al., 2017; Yu et al.,2018), and text planning (Torrance, 2015; Gatt andKrahmer, 2018; Guz and Carenini, 2020). The RSTframework thereby comprehensively describes theorganization of a document, guided by the author’scommunicative goals, encompassing three compo-nents: (1) A projective constituency tree structure,often referred to as the tree span. (2) A nuclearity

attribute, assigned to every internal node of the dis-course tree, encoding relative importance betweenthe nodes’ sub-trees, with the nucleus expressingprimary importance and a satellite signifying sup-plementary sub-trees. (3) A relation attribute forevery internal node describing the relationship be-tween the sub-trees of a node (e.g., Contrast, Evi-dence, Contradiction).

Arguably, the weakest aspect of an RST repre-sentation is the nuclearity assessment, which makesa too coarse differentiation between primary andsecondary importance of sub-trees. However, de-spite its binary assignment of importance and eventhough the nuclearity attribute is only one of threecomponents of an RST tree, it has major implica-tions for many downstream tasks, as already shownearly on by Marcu (1999), using the nuclearityattribute as the key signal in extractive summariza-tion. Further work in sentiment analysis (Bhatiaet al., 2015) also showed the importance of nu-clearity for the task by first converting the con-stituency tree into a dependency tree (more alignedwith the nuclearity attribute) and then using thattree to predict sentiment more accurately. Bothof these results indicate that nuclearity, even inthe coarse RST version, already contains valuableinformation. Hence, we believe that this coarse-grained classification is reasonable when manuallyannotating discourse, but see it as a major pointof improvement, if a more fine-grained assessmentcould be correctly assigned. We therefore explorethe potential of assigning a weighted nuclearityattribute in this paper.

While plenty of studies have highlighted theimportant role of discourse for real-world down-stream tasks, including summarization, (Geraniet al., 2014; Xu et al., 2020; Xiao et al., 2020), sen-timent analysis (Bhatia et al., 2015; Hogenboomet al., 2015; Nejat et al., 2017) and text classifi-cation (Ji and Smith, 2017), more critical to ourapproach is very recent work exploring such con-nection in the opposite direction. In Huber andCarenini (2020b), we exploit sentiment related in-formation to generate “silver-standard” nuclearityannotated discourse trees, showing their potentialon the domain-transfer discourse parsing task. Cru-cially for our purposes, this approach internallygenerates real-valued importance-weights for trees.

For the task of extractive summarization, we fol-low our intuition given in Xiao et al. (2020) andXiao et al. (2021), exploiting the connection be-

Page 3: W-RST: Towards a Weighted RST-style Discourse Framework

3910

Figure 2: Three phases of our approach to generate weighted RST-style discourse trees. Left and center steps aredescribed in section 3, right component is described in section 4. † = As in Huber and Carenini (2020b), ‡ = As inMarcu (1999), ∗ = Sentiment prediction component is a linear combination, mapping the aggregated embeddingto the sentiment output. The linear combination has been previously learned on the training portion of the dataset.

tween summarization and discourse. In particular,in Xiao et al. (2021), we demonstrate that the self-attention matrix learned during the training of atransformer-based summarizer captures valid as-pects of constituency and dependency discoursetrees.

To summarize, building on our previous work oncreating discourse trees through distant supervision,we take a first step towards generating weighteddiscourse trees from the sentiment analysis andsummarization tasks.

3 W-RST Treebank Generation

Given the intuition from above, we combine infor-mation from machine learning approaches withinsights from linguistics, replacing the human-centered nuclearity assignment with real-valuedweights obtained from the sentiment analysis andsummarization tasks1. An overview of the processto generate weighted RST-style discourse trees isshown in Figure 2, containing the training phase(left) and the W-RST discourse inference phase(center) described here. The W-RST discourse eval-uation (right), is covered in section 4.

3.1 Weighted Trees from SentimentTo generate weighted discourse trees from senti-ment, we slightly modify the publicly availablecode2 presented in Huber and Carenini (2020b) byremoving the nuclearity discretization component.

An overview of our method is shown in Fig-ure 2 (top), while a detailed view is presented inthe left and center parts of Figure 3. First (on theleft), we train the Multiple Instance Learning (MIL)

1Please note that both tasks use binarized discourse trees,as commonly used in computational models of RST.

2Code available at https://github.com/nlpat/MEGA-DT

model proposed by Angelidis and Lapata (2018)on a corpus with document-level sentiment gold-labels, internally annotating each input-unit (in ourcase EDUs) with a sentiment- and attention-score.After the MIL model is trained (center), a tuple(si, ai) containing a sentiment score si and an at-tention ai is extracted for each EDU i. Based onthese tuples representing leaf nodes, the CKY algo-rithm (Jurafsky and Martin, 2014) is applied to findthe tree structure to best align with the overall doc-ument sentiment, through a bottom-up aggregationapproach defined as3:

sp =sl ∗ al + sr ∗ ar

al + arap =

al + ar2

with nodes l and r as the left and right child-nodes of p respectively. The attention scores(al, ar) are here interpreted as the importanceweights for the respective sub-trees (wl = al/(al +ar) and wr = ar/(al + ar)), resulting in a com-plete, normalized and weighted discourse structureas required for W-RST. We call the discourse tree-bank generated with this approach W-RST-Sent.

3.2 Weighted Trees from SummarizationIn order to derive weighted discourse trees from asummarization model we follow Xiao et al. (2021)4,generating weighted discourse trees from the self-attention matrices of a transformer-based summa-rization model. An overview of our method isshown in Figure 2 (bottom), while a detailed viewis presented in the left and center parts of Figure 4.

We start by training a transformer-based extrac-tive summarization model (left), containing three

3Equations taken from Huber and Carenini (2020b)4Code available at https://github.com/

Wendy-Xiao/summ_guided_disco_parser

Page 4: W-RST: Towards a Weighted RST-style Discourse Framework

3911

Figure 3: Three phases of our approach. Left/Center: Detailed view into the generation of weighted RST-stylediscourse trees using the sentiment analysis downstream task. Right: Sentiment discourse application evaluation

Figure 4: Three phases of our approach. Left/Center: Detailed view into the generation of weighted RST-stylediscourse trees using the summarization downstream task. Right: Summarization discourse application evaluation

components: (1) A pre-trained BERT EDU En-coder generating EDU embeddings, (2) a standardtransformer architecture as proposed in Vaswaniet al. (2017) and (3) a final classifier, mapping theoutputs of the transformer to a probability score foreach EDU, indicating whether the EDU should bepart of the extractive summary.

With the trained transformer model, we thenextract the self-attention matrix A and build a dis-course tree in bottom-up fashion (as shown in thecenter of Figure 4). Specifically, the self-attentionmatrix A reflects the relationships between unitsin the document, where entry Aij measures howmuch the i-th EDU relies on the j-th EDU. Giventhis information, we generate an unlabeled con-stituency tree using the CKY algorithm (Jurafskyand Martin, 2014), optimizing the overall tree score,as previously done in Xiao et al. (2021).

In terms of weight-assignment, given a sub-treespanning EDUs i to j, split into child-constituentsat EDU k, then max(Ai:k,(k+1):j), representing themaximal attention value that any EDU in the leftconstituent is paying to an EDU in the right child-constituent, reflects how much the left sub-tree re-lies on the right sub-tree, while max(A(k+1):j,i:k)defines how much the right sub-tree depends on theleft. We define the importance-weights of the left(wl) and right (wr) sub-trees as:

wl = max(A(k+1):j,i:k)/(wl + wr)

wr = max(Ai:k,(k+1):j)/(wl + wr)

In this way, the importance scores of the two sub-trees represent a real-valued distribution. In com-bination with the unlabeled structure computation,we generate a weighted discourse tree for each doc-ument. We call the discourse treebank generatedwith the summarization downstream informationW-RST-Summ.

4 W-RST Discourse Evaluation

To assess the potential of W-RST, we consider twoevaluation scenarios (Figure 2, right): (1) Applyweighted discourse trees to the tasks of sentimentanalysis and summarization and (2) analyze theweight alignment with human annotations.

4.1 Weight-based Discourse Applications

In this evaluation scenario, we address the questionof whether W-RST trees can support downstreamtasks better than traditional RST trees with nucle-arity. Specifically, we leverage the discourse treeslearned from sentiment for the sentiment analysistask itself and, similarly, rely on the discourse treeslearned from summarization to benefit the summa-rization task.

Page 5: W-RST: Towards a Weighted RST-style Discourse Framework

3912

4.1.1 Sentiment AnalysisIn order to predict the sentiment of a documentin W-RST-Sent based on its weighted discoursetree, we need to introduce an additional sourceof information to be aggregated according to suchtree. Here, we choose word embeddings, as com-monly used as an initial transformation in manymodels tackling the sentiment prediction task (Kim,2014; Tai et al., 2015; Yang et al., 2016; Adhikariet al., 2019; Huber and Carenini, 2020a). Toavoid introducing additional confounding factorsthrough sophisticated tree aggregation approaches(e.g. TreeLSTMs (Tai et al., 2015)), we selecta simple method, aiming to directly compare theinferred tree-structures and allowing us to better as-sess the performance differences originating fromthe weight/nuclearity attribution (see right step inFigure 3). More specifically, we start by comput-ing the average word-embedding for each leaf nodeleaf i (here containing a single EDU) in the dis-course tree.

leaf i =

j<|leaf i|∑j=0

Emb(wordji )/|leaf i|

With |leaf i| as the number of words in leaf i,Emb(·) being the embedding lookup and wordjirepresenting word j within leaf i. Subsequently,we aggregate constituents, starting from the leafnodes (with leaf i as embedding constituent ci), ac-cording to the weights of the discourse tree. Forany two sibling constituents cl and cr of the parentsub-tree cp in the binary tree, we compute

cp = cl ∗ wl + cr ∗ wr

with wl and wr as the real-valued weight-distribution extracted from the inferred discoursetree and cp, cl and cr as dense encodings. We aggre-gate the complete document in bottom-up fashion,eventually reaching a root node embedding con-taining a tree-weighted average of the leaf-nodes.Given the root-node embedding representing a com-plete document, a simple Multilayer Perceptron(MLP) trained on the original training portion ofthe MIL model is used to predict the sentiment ofthe document.

4.1.2 SummarizationIn the evaluation step of the summarization model(right of Figure 4), we use the weighted discoursetree of a document in W-RST-Summ to predict itsextractive summary by applying an adaptation of

the unsupervised summarization method by Marcu(1999).

We choose this straightforward algorithm overmore elaborate and hyper-parameter heavy ap-proaches to avoid confounding factors, since ouraim is to evaluate solely the potential of theweighted discourse trees compared to standardRST-style annotations. In the original algorithm,a summary is computed based on the nuclearityattribute by recursively computing the importancescores for all units as:

Sn(u,N) =

dN , u ∈ Prom(N)

S(u,C(N)) s.t.

u ∈ C(N) otherwise

where C(N) represents the child of N , andProm(N) is the promotion set of nodeN , which isdefined in bottom-up fashion as follows: (1) Promof a leaf node is the leaf node itself. (2) Prom ofan internal node is the union of the promotion setsof its nucleus children. Furthermore, dN representsthe level of a node N , computed as the distancefrom the level of the lowest leaf-node. This way,units in the promotion set originating from nodesthat are higher up in the discourse tree are ampli-fied in their importance compared to those fromlower levels.

As for the W-RST-Summ discourse trees withreal-valued importance-weights, we adapt Marcu’salgorithm by replacing the promotion set with real-valued importance scores as shown here:

Sw(u,N) =

d+ wN , N is leaf

Sw(u,C(N)) + wN ,

u ∈ C(N) otherwise

Once Sn or Sw are computed, the top-k unitsof the highest promotion set or with the highestimportance scores respectively are selected into thefinal summary.

4.1.3 Nuclearity-attributed BaselinesTo test whether the W-RST trees are effectivelypredicting the downstream tasks, we need to gen-erate traditional RST trees with nuclearity to com-pare against. However, moving from weighted dis-course trees to coarse nuclearity requires the intro-duction of a threshold. More specifically, while“nucleus-satellite” and “satellite-nucleus” assign-ments can be naturally generated depending onthe distinct weights, in order to assign the third“nucleus-nucleus” class, frequently appearing in

Page 6: W-RST: Towards a Weighted RST-style Discourse Framework

3913

Figure 5: Three phases of our approach. Left: Genera-tion of W-RST-Sent/Summ discourse trees. Right: Lin-guistic evaluation

RST-style treebanks, we need to specify how closetwo weights have to be for such configuration toapply. Formally, we set a threshold t as follows:

If: |wl − wr| < t → nucleus-nucleusElse: If: wl > wr → nucleus-satelliteElse: If: wl ≤ wr → satellite-nucleus

This way, RST-style treebanks with nuclearityattributions can be generated from W-RST-Sent andW-RST-Summ and used for the sentiment analy-sis and summarization downstream tasks. For thenuclearity-attributed baseline of the sentiment task,we use a similar approach as for the W-RST eval-uation procedure, but assign two distinct weightswn and ws to the nucleus and satellite child re-spectively. Since it is not clear how much moreimportant a nucleus node is compared to a satelliteusing the traditional RST notation, we define thetwo weights based on the threshold t as:

wn = 1− (1− 2t)/4 ws = (1− 2t)/4

The intuition behind this formulation is thatfor a high threshold t (e.g. 0.8), the nuclearityneeds to be very prominent (the difference be-tween the normalized weights needs to exceed0.8), making the nucleus clearly more importantthan the satellite, while for a small threshold (e.g.0.1), even relatively balanced weights (for exam-ple wl = 0.56, wr = 0.44) will be assigned as“nucleus-satellite”, resulting in the potential dif-ference in importance of the siblings to be lesseminent.

For the nuclearity-attributed baseline for summa-rization, we directly apply the original algorithm byMarcu (1999) as described in section 4.1.2. How-ever, when using the promotion set to determinewhich EDUs are added to the summarization, po-tential ties can occur. Since the discourse tree doesnot provide any information on how to prioritizethose, we randomly select units from the candi-dates, whenever there is a tie. This avoids exploit-

ing any positional bias in the data (e.g. the leadbias), which would confound the results.

4.2 Weight Alignment with HumanAnnotation

As for our second W-RST discourse evaluationtask, we investigate if the real-valued importance-weights align with human annotations. To be ableto explore this scenario, we generate weightedtree annotations for an existing discourse treebank(RST-DT (Carlson et al., 2002)). In this eval-uation task we verify if: (1) The nucleus in agold-annotation generally receives more weightthan a satellite (i.e. if importance-weights gener-ally favour nuclei over satellites) and, similarly, ifnucleus-nucleus relations receive more balancedweights. (2) In accordance with Figure 1, we fur-ther explore how well the weights capture the ex-tend to which a relation is dominated by the nu-cleus. Here, our intuition is that for inconsistenthuman nuclearity annotations the spread shouldgenerally be lower than for consistent annotations,assuming that human misalignment in the discourseannotation indicates ambivalence on the impor-tance of sub-trees.

To test for these two properties, we use discoursedocuments individually annotated by two humanannotators and analyze each sub-tree within thedoubly-annotated documents with consistent inter-annotator structure assessment for their nuclear-ity assignment. For each of the 6 possible inter-annotator nuclearity assessments, consisting of 3consistent annotation classes (namely N-N/N-N, N-S/N-S and S-N/S-N) and 3 inconsistent annotationclasses (namely N-N/N-S, N-N/S-N and N-S/S-N)5, we explore the respective weight distributionof the document annotated with the two W-RSTtasks – sentiment analysis and summarization (seeFigure 5). We compute an average spread sc foreach of the 6 inter-annotator nuclearity assessmentsclasses c as:

sc = (

j<|c|∑j=0

wjl − w

jr)/|c|

With wjl and wj

r as the weights of the left and rightchild node of sub-tree j in class c, respectively.

5We don’t take the order of annotators into consideration,mapping N-N/N-S and N-S/N-N both onto N-N/N-S.

Page 7: W-RST: Towards a Weighted RST-style Discourse Framework

3914

5 Experiments

5.1 Experimental SetupSentiment Analysis: We follow our previousapproach in Huber and Carenini (2020b) for themodel training and W-RST discourse inferencesteps (left and center in Figure 3) using the adaptedMILNet model from Angelidis and Lapata (2018)trained with a batch-size of 200 and 100 neuronsin a single layer bi-directional GRU with 20%dropout for 25 epochs. Next, discourse trees aregenerated using the best-performing heuristic CKYmethod with the stochastic exploration-exploitationtrade-off from Huber and Carenini (2020b) (beamsize 10, linear decreasing τ ). As word-embeddingsin the W-RST discourse evaluation (right in Figure3), we use GloVe embeddings (Pennington et al.,2014), which previous work (Tai et al., 2015;Huber and Carenini, 2020a) indicates to be suitablefor aggregation in discourse processing. For train-ing and evaluation of the sentiment analysis task,we use the 5-class Yelp’13 review dataset (Tanget al., 2015). To compare our approach againstthe traditional RST approach with nuclearity, weexplore the impact of 11 distinct thresholds for thebaseline described in §4.1.3, ranging from 0 to 1in 0.1 intervals.Summarization: To be consistent with RST, oursummarizer extracts EDUs instead of sentencesfrom a given document. The model is trained onthe EDU-segmented CNNDM dataset containingEDU-level Oracle labels published by Xu et al.(2020). We further use a pre-trained BERT-base(“uncased”) model to generate the embeddings ofEDUs. The transformer used is the standard modelwith 6 layers and 8 heads in each layer (d = 512).We train the extractive summarizer on the trainingset of the CNNDM corpus (Nallapati et al., 2016)and pick the best attention head using the RST-DTdataset (Carlson et al., 2002) as the developmentset. We test the trees by running the summarizationalgorithm in Marcu (1999) on the test set of theCNNDM dataset, and select the top-6 EDUs basedon the importance score to form a summary innatural order. Regarding the baseline model usingthresholds, we apply the same 11 thresholds as forthe sentiment analysis task.

Weight Alignment with Human Annotation:As discussed in §4.2, this evaluation requires twoparallel human generated discourse trees for everydocument. Luckily, in the RST-DT corpus pub-

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0.6

0.8

1

Nuc

leus

Rat

io

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

53

54

55

56

Acc

urac

y

nuclearity

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0.6

0.8

1

threshold

Nuc

leus

Rat

io

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

20

21

22

23

Avg

RO

UG

E

Figure 6: Top: Sentiment Analysis accuracy of the W-RST model compared to the standard RST frameworkwith different thresholds. Bottom: Average ROUGEscore (ROUGE-1, -2 and -L) of the W-RST summa-rization model compared to different thresholds. Fullnumerical results are shown in Appendix A.

N-N N-S S-NN-N 273 99 41N-S - 694 75S-N - - 172

Table 1: Statistics on consistently and inconsistentlyannotated samples of the 1, 354 structure-aligned sub-trees generated by two distinct human annotators.

lished by Carlson et al. (2002), 53 of the 385 doc-uments annotated with full RST-style discoursetrees are doubly tagged by a second linguist. Weuse the 53 documents containing 1, 354 consistentstructure annotations between the two analysts toevaluate the linguistic alignment of our generatedW-RST documents with human discourse interpre-tations. Out of the 1, 354 structure-aligned sub-trees, in 1, 139 cases both annotators agreed on thenuclearity attribute, while 215 times a nuclearitymismatch appeared, as shown in detail in Table 1.

5.2 Results and Analysis

The results of the experiments on the discourseapplications for sentiment analysis and summa-rization are shown in Figure 6. The results for

Page 8: W-RST: Towards a Weighted RST-style Discourse Framework

3915

Sent N-N N-S S-N

N-N -0.228(106)

-0.238(33)

-0.240(19)

N-S - -0.038(325)

-0.044(22)

S-N - - -0.278(115)

Summ N-N N-S S-N

N-N 0.572(136)

0.604(42)

0.506(25)

N-S - 0.713(418)

0.518(36)

S-N - - 0.616(134)

Table 2: Confusion Matrices based on human annota-tion showing the absolute weight-spread using the Sen-timent (top) and Summarization (bottom) tasks on 620and 791 sub-trees aligned with the human structureprediction, respectively. Cell upper value: Absoluteweight spread for the respective combination of human-annotated nuclearities. Lower value (in brackets): Sup-port for this configuration.

sentiment analysis (top) and summarization (bot-tom) thereby show a similar trend: With an in-creasing threshold and therefore a larger number ofN-N relations (shown as grey bars in the Figure),the standard RST baseline (blue line) consistentlyimproves for the respective performance measureof both tasks. However, reaching the best perfor-mance at a threshold of 0.8 for sentiment analysisand 0.6 for summarization, the performance startsto deteriorate. This general trend seems reason-able, given that N-N relations represent a ratherfrequent nuclearity connection, however classify-ing every connection as N-N leads to a severe lossof information. Furthermore, the performance sug-gests that while the N-N class is important in bothcases, the optimal threshold varies depending onthe task and potentially also the corpus used, mak-ing further task-specific fine-tuning steps manda-tory. The weighted discourse trees following ourW-RST approach, on the other hand, do not requirethe definition of a threshold, resulting in a single,promising performance (red line) for both tasksin Figure 6. For comparison, we apply the gener-ated trees of a standard RST-style discourse parser(here the Two-Stage parser by Wang et al. (2017))trained on the RST-DT dataset (Carlson et al., 2002)on both downstream tasks. The fully-supervisedparser reaches an accuracy of 44.77% for sentimentanalysis and an average ROUGE score of 26.28 forsummarization. While the average ROUGE score

Sent N-N N-S S-N

N-N ∅-0.36 ∅-0.43 ∅-0.45

N-S - ∅+1.00 ∅+0.96

S-N - - ∅-0.72

Summ N-N N-S S-N

N-N ∅-0.13 ∅+0.13 ∅-0.66

N-S - ∅+1.00 ∅-0.56

S-N - - ∅+0.22

Table 3: Confusion Matrices based on human anno-tation showing the weight-spread relative to the task-average for Sentiment (top) and Summarization (bot-tom), aligned with the human structure prediction, re-spectively. Cell value: Relative weight spread as thedivergence from the average spread across all cells inTable 2. Color: Positive/Negative divergence, ∅ = Av-erage value of absolute scores.

of the fully-supervised parser is above the perfor-mance of our W-RST results for the summarizationtask, the accuracy on the sentiment analysis taskis well below our approach. We believe that theseresults are a direct indication of the problematicdomain adaptation of fully supervised discourseparsers, where the application on a similar domain(Wall Street Journal articles vs. CNN-Daily Mailarticles) leads to superior performances comparedto our distantly supervised method, however, withlarger domain shifts (Wall Street Journal articles vs.Yelp customer reviews), the performance drops sig-nificantly, allowing our distantly supervised modelto outperform the supervised discourse trees forthe downstream task. Arguably, this indicates thatalthough our weighted approach is still not com-petitive with fully-supervised models in the samedomain, it is the most promising solution availablefor cross-domain discourse parsing.

With respect to exploring the weight alignmentwith human annotations, we show a set of confu-sion matrices based on human annotation for eachW-RST discourse generation task on the absoluteand relative weight-spread in Tables 2 and 3 re-spectively. The results for the sentiment analysistask are shown on the top of both tables, while theperformance for the summarization task is shownat the bottom. For instance, the top right cell ofthe upper confusion matrix in Table 2 shows thatfor 19 sub-trees in the doubly annotated subset ofRST-DT one of the annotators labelled the sub-tree with a nucleus-nucleus nuclearity attribution,while the second annotator identified it as satellite-

Page 9: W-RST: Towards a Weighted RST-style Discourse Framework

3916

nucleus. The average weight spread (see §4.2) forthose 19 sub-trees is−0.24. Regarding Table 3, wesubtract the average spread across Table 2 definedas ∅ =

∑ci∈C (ci)/|C| (with C = {c1, c2, ...c6}

containing the cell values in the upper trianglematrix) from each cell value ci and normalize bymax = maxci∈C(|ci−∅|), with ∅ = −0.177 andmax = 0.1396 across the top table. Accordingly,we transform the −0.24 in the top right cell into(−0.24− avg)/max = −0.45.

Moving to the analysis of the results, we findthe following trends in this experiment: (1) As pre-sented in Table 2, the sentiment analysis task tendsto strongly over-predict S-N (i.e., wl << wr), lead-ing to negative spreads in all cells. In contrast,the summarization task is heavily skewed towardsN-S assignments (i.e., wl >> wr), leading to ex-clusively positive spreads. We believe both trendsare consistent with the intrinsic properties of thetasks, given that the general structure of reviewstends to become more important towards the endof a review (leading to increased S-N assignments),while for summarization, the lead bias potentiallyproduces the overall strong nucleus-satellite trend.(2) To investigate the relative weight spreads fordifferent human annotations (i.e., between cells)beyond the trends shown in Table 2, we normalizevalues within a table by subtracting the averageand scaling between [−1, 1]. As a result, Table 3shows the relative weight spread for different hu-man annotations. Apart from the general trendsdescribed in Table 2, the consistently annotatedsamples of the two linguists (along the diagonalof the confusion matrices) align reasonably. Themost positive weight spread is consistently foundin the agreed-upon nucleus-satellite case, while thenucleus-nucleus annotation has, as expected, thelowest divergence (i.e., closest to zero) along the di-agonal in Table 3. (3) Regarding the inconsistentlyannotated samples (shown in the triangle matrixabove the diagonal) it becomes clear that in the sen-timent analysis model the values for the N-N/N-Sand N-N/S-N annotated samples (top row in Ta-ble 3) are relatively close to the average value. Thisindicates that, similar to the nucleus-nucleus case,the weights are also ambivalent, with the N-N/N-S value (top center) slightly larger than the valuefor N-N/S-N (top right). The N-S/S-N case forthe sentiment analysis model is less aligned withour intuition, showing a strongly negative weight-spread (i.e. wl << wr) where we would have

expected a more ambivalent result with wl ≈ wr

(however, aligned with the overall trend shown inTable 2). For summarization, we see a very similartrend with the values for N-N/N-S and N-N/S-Nannotated samples. Again, both values are closeto the average, with the N-N/N-S cell showing amore positive spread than N-N/S-N. However forsummarization, the consistent satellite-nucleus an-notation (bottom right cell) seems misaligned withthe rest of the table, following instead the generaltrend for summarization described in Table 2. Allin all, the results suggest that the values in mostcells are well aligned with what we would expectregarding the relative spread. Interestingly, humanuncertainty appears to be reasonably captured in theweights, which seem to contain more fine grainedinformation about the relative importance of siblingsub-trees.

6 Conclusion and Future Work

We propose W-RST as a new discourse framework,where the binary nuclearity assessment postulatedby RST is replaced with more expressive weights,that can be automatically generated from auxiliarytasks. A series of experiments indicate that W-RSTis beneficial to the two key NLP downstream tasksof sentiment analysis and summarization. Further,we show that W-RST trees interestingly align withthe uncertainty of human annotations.

For the future, we plan to develop a neural dis-course parser that learns to predict importanceweights instead of nuclearity attributions whentrained on large W-RST treebanks. More longerterm, we want to explore other aspects of RSTthat can be refined in light of empirical results,plan to integrate our results into state-of-the-artsentiment analysis and summarization approaches(e.g. Xu et al. (2020)) and generate parallel W-RSTstructures in a multi-task manner to improve thegenerality of the discourse trees.

Acknowledgments

We thank the anonymous reviewers for their in-sightful comments. This research was supported bythe Language & Speech Innovation Lab of CloudBU, Huawei Technologies Co., Ltd and the Nat-ural Sciences and Engineering Research Councilof Canada (NSERC). Nous remercions le Conseilde recherches en sciences naturelles et en genie duCanada (CRSNG) de son soutien.

Page 10: W-RST: Towards a Weighted RST-style Discourse Framework

3917

ReferencesAshutosh Adhikari, Achyudh Ram, Raphael Tang, and

Jimmy Lin. 2019. Rethinking complex neural net-work architectures for document classification. InProceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,Volume 1 (Long and Short Papers), pages 4046–4051.

Stefanos Angelidis and Mirella Lapata. 2018. Multi-ple instance learning networks for fine-grained sen-timent analysis. Transactions of the Association forComputational Linguistics, 6:17–31.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein.2015. Better document-level sentiment analysisfrom RST discourse parsing. In Proceedings of the2015 Conference on Empirical Methods in NaturalLanguage Processing, pages 2212–2218.

Lynn Carlson, Mary Ellen Okurowski, and DanielMarcu. 2002. RST discourse treebank. LinguisticData Consortium, University of Pennsylvania.

Vanessa Wei Feng and Graeme Hirst. 2014. A linear-time bottom-up discourse parser with constraintsand post-editing. In Proceedings of the 52nd AnnualMeeting of the Association for Computational Lin-guistics (Volume 1: Long Papers), pages 511–521.

Albert Gatt and Emiel Krahmer. 2018. Survey of thestate of the art in natural language generation: Coretasks, applications and evaluation. Journal of Artifi-cial Intelligence Research, 61:65–170.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini,Raymond T Ng, and Bita Nejat. 2014. Abstractivesummarization of product reviews using discoursestructure. In Proceedings of the 2014 conference onempirical methods in natural language processing(EMNLP), pages 1602–1613.

Luke Gessler, Yang Janet Liu, and Amir Zeldes. 2019.A discourse signal annotation system for rst trees. InProceedings of the Workshop on Discourse RelationParsing and Treebanking 2019, pages 56–61.

Grigorii Guz and Giuseppe Carenini. 2020. Towardsdomain-independent text structuring trainable onlarge discourse treebanks. In Proceedings of the2020 Conference on Empirical Methods in NaturalLanguage Processing: Findings, pages 3141–3152.

Alexander Hogenboom, Flavius Frasincar, FranciskaDe Jong, and Uzay Kaymak. 2015. Using rhetori-cal structure in sentiment analysis. Commun. ACM,58(7):69–77.

Patrick Huber and Giuseppe Carenini. 2020a. Fromsentiment annotations to sentiment predictionthrough discourse augmentation. In Proceedings ofthe 28th International Conference on ComputationalLinguistics, pages 185–197.

Patrick Huber and Giuseppe Carenini. 2020b. MEGARST discourse treebanks with structure and nuclear-ity from scalable distant sentiment supervision. InProceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP),pages 7442–7457.

Yangfeng Ji and Jacob Eisenstein. 2014. Representa-tion learning for text-level discourse parsing. In Pro-ceedings of the 52nd Annual Meeting of the Associa-tion for Computational Linguistics (Volume 1: LongPapers), volume 1, pages 13–24.

Yangfeng Ji and Noah A Smith. 2017. Neural dis-course structure for text categorization. In Proceed-ings of the 55th Annual Meeting of the Associationfor Computational Linguistics (Volume 1: Long Pa-pers), pages 996–1005.

Shafiq Joty, Giuseppe Carenini, and Raymond T Ng.2015. CODRA: A novel discriminative frameworkfor rhetorical analysis. Computational Linguistics,41(3).

Dan Jurafsky and James H Martin. 2014. Speech andlanguage processing, volume 3. Pearson London.

Yoon Kim. 2014. Convolutional neural networks forsentence classification. In Proceedings of the 2014Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 1746–1751.

Qi Li, Tianshi Li, and Baobao Chang. 2016. Discourseparsing with attention-based hierarchical neural net-works. In Proceedings of the 2016 Conference onEmpirical Methods in Natural Language Processing,pages 362–371.

Yang Liu and Amir Zeldes. 2019. Discourse relationsand signaling information: Anchoring discourse sig-nals in rst-dt. Proceedings of the Society for Compu-tation in Linguistics, 2(1):314–317.

William C Mann and Sandra A Thompson. 1988.Rhetorical structure theory: Toward a functional the-ory of text organization. Text, 8(3):243–281.

Daniel Marcu. 1999. Discourse trees are good indica-tors of importance in text. Advances in automatictext summarization, 293:123–136.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos,Caglar GuI‡lcehre, and Bing Xiang. 2016. Abstrac-tive text summarization using sequence-to-sequenceRNNs and beyond. In Proceedings of The 20thSIGNLL Conference on Computational Natural Lan-guage Learning, pages 280–290. Association forComputational Linguistics.

Bita Nejat, Giuseppe Carenini, and Raymond Ng. 2017.Exploring joint neural model for sentence level dis-course parsing and sentiment analysis. In Proceed-ings of the 18th Annual SIGdial Meeting on Dis-course and Dialogue, pages 289–298.

Page 11: W-RST: Towards a Weighted RST-style Discourse Framework

3918

Jeffrey Pennington, Richard Socher, and Christopher D.Manning. 2014. Glove: Global vectors for word rep-resentation. In Empirical Methods in Natural Lan-guage Processing (EMNLP), pages 1532–1543.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Milt-sakaki, Livio Robaldo, Aravind Joshi, and BonnieWebber. 2008. The penn discourse treebank 2.0.LREC.

Rajen Subba and Barbara Di Eugenio. 2009. An effec-tive discourse parser that uses rich linguistic infor-mation. In Proceedings of Human Language Tech-nologies: The 2009 Annual Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics, pages 566–574. Association forComputational Linguistics.

Kai Sheng Tai, Richard Socher, and Christopher DManning. 2015. Improved semantic representationsfrom tree-structured long short-term memory net-works. In Proceedings of the 53rd Annual Meet-ing of the Association for Computational Linguisticsand the 7th International Joint Conference on Natu-ral Language Processing (Volume 1: Long Papers),pages 1556–1566.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Docu-ment modeling with gated recurrent neural networkfor sentiment classification. In Proceedings of the2015 conference on empirical methods in naturallanguage processing, pages 1422–1432.

Mark Torrance. 2015. Understanding planning in textproduction. Handbook of writing research, pages1682–1690.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Proceedings of the 31st InternationalConference on Neural Information Processing Sys-tems, pages 6000–6010.

Yizhong Wang, Sujian Li, and Houfeng Wang. 2017.A two-stage parsing method for text-level discourseanalysis. In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 2: Short Papers), pages 184–188.

Wen Xiao, Patrick Huber, and Giuseppe Carenini. 2020.Do we really need that many parameters in trans-former for extractive summarization? discourse canhelp! In Proceedings of the First Workshop on Com-putational Approaches to Discourse, pages 124–134.

Wen Xiao, Patrick Huber, and Giuseppe Carenini. 2021.Predicting discourse trees from transformer-basedneural summarizers. In Proceedings of the 2021Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, pages 4139–4152, Online.Association for Computational Linguistics.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu.2020. Discourse-aware neural extractive text sum-marization. In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguis-tics, pages 5021–5031. Association for Computa-tional Linguistics.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He,Alex Smola, and Eduard Hovy. 2016. Hierarchi-cal attention networks for document classification.In Proceedings of the 2016 conference of the NorthAmerican chapter of the association for computa-tional linguistics: human language technologies,pages 1480–1489.

Nan Yu, Meishan Zhang, and Guohong Fu. 2018.Transition-based neural rst parsing with implicit syn-tax features. In Proceedings of the 27th Inter-national Conference on Computational Linguistics,pages 559–570.

Amir Zeldes. 2017. The GUM corpus: Creating mul-tilayer resources in the classroom. Language Re-sources and Evaluation, 51(3):581–612.

A Numeric Results

The numeric results of our W-RST approach for thesentiment analysis and summarization downstreamtasks presented in Figure 6 are shown in Table 4below, along with the threshold-based approach, aswell as the supervised parser.

ApproachSentiment SummarizationAccuracy R-1 R-2 R-L

Nuclearity with Thresholdt = 0.0 53.76 28.22 8.58 26.45t = 0.1 53.93 28.41 8.69 26.61t = 0.2 54.13 28.64 8.85 26.83t = 0.3 54.33 28.96 9.08 27.14t = 0.4 54.44 29.36 9.34 27.51t = 0.5 54.79 29.55 9.50 27.68t = 0.6 54.99 29.78 9.65 27.90t = 0.7 55.07 29.57 9.45 27.74t = 0.8 55.32 29.18 9.08 27.32t = 0.9 54.90 28.11 8.29 26.35t = 1.0 54.15 26.94 7.60 25.27

Our Weighted RST Frameworkweighted 54.76 29.70 9.58 27.85

Supervised Training on RST-DTsupervised 44.77 34.20 12.77 32.09

Table 4: Results of the W-RST approach compared tothreshold-based nuclearity assignments and supervisedtraining on RST-DT.