


Modeling the Relationship between User Comments and Edits in Document Revision

Xuchao Zhang†∗, Dheeraj Rajagopal‡∗, Michael Gamon§, Sujay Kumar Jauhar§ and Chang-Tien Lu†

†Discovery Analytics Center, Virginia Tech, Falls Church, VA, USA
‡Carnegie Mellon University, Pittsburgh, PA, USA

§Microsoft Research, Redmond, WA, USA
†{xuczhang, ctlu}@vt.edu, ‡[email protected], §{mgamon, sjauhar}@microsoft.com

Abstract

Management of collaborative documents can be difficult, given the profusion of edits and comments that multiple authors make during a document's evolution. Reliably modeling the relationship between edits and comments is a crucial step towards helping the user keep track of a document in flux. A number of authoring tasks, such as categorizing and summarizing edits, detecting completed to-dos, and visually rearranging comments, could benefit from such a contribution. Thus, in this paper we explore the relationship between comments and edits by defining two novel, related tasks: Comment Ranking and Edit Anchoring. We begin by collecting a dataset with more than half a million comment-edit pairs based on Wikipedia revision histories. We then propose a hierarchical multi-layer deep neural network to model the relationship between edits and comments. Our architecture tackles both the Comment Ranking and Edit Anchoring tasks by encoding specific edit actions such as additions and deletions, while also accounting for document context. In a number of evaluation settings, our experimental results show that our approach significantly outperforms several strong baselines. We achieve a precision@1 of 71.0% and a precision@3 of 94.4% for Comment Ranking, and 74.4% accuracy on Edit Anchoring.

1 Introduction

Comments are widely used in collaborative document writing as a natural way to suggest and track changes, annotate content and explain the intent of edits. Table 1 shows an example of a comment left by an editor in a Wikipedia revision. The comment explains the intent of adding a missing researcher's name (Sutskever) to a citation.

∗ The work was done while these authors were interns at Microsoft Research.

This example demonstrates that user comments closely relate to the intent of an edit, the actual edit operation, and the content and location in the document that underwent change. However, during collaborative document authoring, the tracking and maintenance of comments becomes increasingly challenging due to the large number of edits and comments that authors make. For example, structurally refactoring a document can significantly change the order of paragraphs and sentences, stranding comments in confusing and contextually inappropriate locations. Or, comments may have already been addressed but continue to linger in the document without having been marked as completed. These issues, among others, are exacerbated when multiple authors collaborate on a document, often asynchronously. It becomes difficult to know which tasks have been completed and which haven't, especially if authors are not proactive about marking comments as addressed. This situation stands to benefit from an intelligent system capable of marking changes as completed, by understanding the relationship between edits and comments.

Many tasks in document management, such as summarizing and categorizing edits, detecting to-do item completion, prioritizing writing tasks and visually re-arranging document structure to reflect state in multi-author scenarios, require users to understand edit-comment interactions. Consider the scenario where multiple authors make edits to a document using the track-changes feature available in popular document authoring apps. The result can be extremely confusing, and it is difficult to disentangle who edited what, over multiple versions. A feature that could summarize these edits, so that the visual burden of tracking changes is not on the UI, would certainly alleviate this problem. Such a system would necessarily first need to learn mappings between edits and natural language comments, before it could learn to generate them. Therefore, automatic solutions to these challenges stand to benefit from an ability to fundamentally model the relationship between edits and comments.

Yet, most existing studies on document revisions focus on modeling the document edits (Bronner and Monz, 2012) or the comments (Shaver, 2016) separately, or use comments as a supplementary source of information to study the edits (Yatskar et al., 2010). To the best of our knowledge, no prior work focuses on jointly modeling the relationship between comments and edits.

Thus, in this paper we tackle two novel tasks. Comment Ranking considers a document edit operation and seeks to rank a set of candidate comments in order of their relevance to the edit. Edit Anchoring seeks to identify the locations in a document that are most likely to have undergone change as the result of a specific comment. Crucially, both tasks require jointly modeling the comment-edit relationship.

We start by collecting a dataset of 780K Wikipedia revisions, each with their associated comment and edits. We then build a hierarchical multi-layer deep neural network, a model we call CmntEdit, which is capable of learning the relationship between comments and edits from this data. Our approach addresses both Comment Ranking and Edit Anchoring by sharing many of the model's components and parameters across tasks. Since edits can apply to discontiguous sequences of text, which pose a challenge for sequence modeling approaches, we explore novel ways to represent a document both before and after an edit, while also accounting for contextual information surrounding an edit. To differentiate the context from edit words, we also explore a novel mechanism to encode edit operations such as additions and deletions explicitly in the model.¹

Finally, we evaluate our model on both tasks and in a number of experimental settings, demonstrating that our solution is significantly better than several strong baselines at jointly capturing the relationship between edits and comments. Our model outperforms the best baseline by 34.6% on NDCG for Comment Ranking and achieves a best score of 0.687 F1 for Edit Anchoring.

¹ We are making the code for the CmntEdit model and the generation of our dataset publicly available to the research community at https://github.com/microsoft/WikiCommentEdit.

Comment:           # Sutskever missing
Pre-edit Version:  In October 2012, a similar system by Krizhevsky and Hinton won the large-scale ImageNet competition by a significant margin over shallow...
Post-edit Version: In October 2012, a similar system by Krizhevsky and Sutskever and Hinton won the large-scale ImageNet competition by a significant margin over shallow...

Table 1: Example of an edit and its associated comment. The added words in the post-edit version are "Sutskever and".

Additionally, in an ablation study we demonstrate that our various modeling choices, which tackle the inherent challenges of comment-edit understanding, each contribute positively to the empirical results.

2 Related Work

Document revisions have been the subject of several studies in recent years (Nunes et al., 2011; Fischer, 2013). Most prior work focuses on modeling document edits only. For instance, Bronner and Monz (2012) build a classifier to distinguish fluency edits from factual edits. Zhu et al. (2017) study the semantic distance between the content in different versions of documents to detect document revisions. Grossman et al. (2013) propose a hierarchical navigation method to display document revision histories.

Some work utilizes the comments associated with document edits as supplementary information to study document revisions. For example, Yatskar et al. (2010) consider both the comment and the document revision for lexical simplification. However, they use comments as metadata to identify trusted revisions, rather than directly modeling the relationship between comments and edits. Yang et al. (2017) featurize both comments and revisions to classify edit intent, but without explicitly modeling the edit-comment relationship.

Wikipedia revision history data (Nunes et al., 2008) has been used in many NLP tasks (Zesch, 2012; Max and Wisniewski, 2010; Ganter and Strube, 2009). For instance, Yamangil and Nelken (2008) model Wikipedia revision histories to improve sentence compression, Aji et al. (2010) propose a new term weighting model leveraging Wikipedia revision histories, and Zanzotto and Pennacchiotti (2010) expand textual entailment corpora from Wikipedia revision histories using co-training. Again, however, none of these methods directly consider or model the relationship between comments and edits.

At a basic level, modeling the connection between comments and edits can be seen as a text matching problem, with superficial similarity to other common NLP tasks such as question answering (Seo et al., 2016; Yu et al., 2018), document search (Burges et al., 2005; Nalisnick et al., 2016), and textual entailment (Androutsopoulos and Malakasiotis, 2010), among others. Note, however, that edits are a (possibly discontinuous and non-trivial) delta between two versions of a text, making their representation and understanding more challenging than that of a simple string. We demonstrate this in our evaluation in Section 5.2, where we compare against several competing models that were designed for other text matching challenges.

3 Dataset

Our dataset – which we call WikiCmnt – is generated from Wikipedia revision histories. In the absence of publicly available document data, Wikipedia is a particularly rich resource: (i) it maintains full revision histories of every Wikipedia page, along with associated editor comments as metadata; (ii) it is a large-scale instance of multi-author collaboration, with many editors contributing to and maintaining pages.

The specific historical dump we use is from May 1, 2018. It contains approximately 52.7 million pages and 755.5 million unique revisions made by 300.8 million users. WikiCmnt is a sub-sample of 786,866 Wikipedia page revisions along with their associated metadata. Revisions are filtered out before sampling if they violate any one of the following criteria: (i) the comment is longer than 8 words; (ii) the edits made to the Wikipedia page are confined to a single section²; (iii) the Wikipedia page has an edit history containing at least 10 unique revisions, a requirement for the negative sampling described below.

We extract and store a number of data fields from the Wikipedia revision history, as summarized in Table 2. For each specific revision of a page, we not only retrieve the text of the comment and edit but also sample 10 non-related comments (Neg-Cmnts) and 10 non-related edits (Neg-Edits) from the same page's history. Finally, we also encode the individual edit operations in both the pre-edit and post-edit versions of a text.

² https://en.wikipedia.org/wiki/Help:Section

Before-editing version:  I  cooked  chicken  wing  yesterday
Actions:                 0  0       -1       -1    0
After-editing version:   I  cooked  salmon  yesterday
Actions:                 0  0       1       0

Figure 1: An example of the action encoding associated with changing the phrase "chicken wing" to "salmon". Specifically, the values -1, 1 and 0 are used to represent deletions, additions and unchanged tokens, respectively.

Field        Description
Revision ID  Wiki revision ID
Parent ID    Parent revision ID
Timestamp    Timestamp of revision
Page Title   Title of Wikipedia page
Comment      Revision comment
Neg-Cmnts    Negative sampled comments
Src-Tokens   Tokens of pre-editing document
Src-Actions  Action encoding of pre-editing document
Tgt-Tokens   Tokens of post-editing document
Tgt-Actions  Action encoding of post-editing document
Pos-Edits    Edited sentences in post-editing document
Neg-Edits    Negative sampled sentences in post-editing document

Table 2: Data fields in the WikiCmnt dataset.

For example, consider Figure 1, which shows the edit action encoding associated with a change from "chicken wing" to "salmon".
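To make the action encoding concrete, the following minimal sketch derives the -1/0/+1 action vectors for a token-level edit using Python's standard difflib; this is our own illustration, not the dataset-generation code released with the paper.

    import difflib

    def action_encoding(src_tokens, tgt_tokens):
        # -1 marks deleted tokens in the pre-edit sequence, 1 marks added
        # tokens in the post-edit sequence, and 0 marks unchanged tokens.
        src_actions = [0] * len(src_tokens)
        tgt_actions = [0] * len(tgt_tokens)
        matcher = difflib.SequenceMatcher(a=src_tokens, b=tgt_tokens)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op in ("delete", "replace"):
                src_actions[i1:i2] = [-1] * (i2 - i1)
            if op in ("insert", "replace"):
                tgt_actions[j1:j2] = [1] * (j2 - j1)
        return src_actions, tgt_actions

    src = "I cooked chicken wing yesterday".split()
    tgt = "I cooked salmon yesterday".split()
    print(action_encoding(src, tgt))
    # ([0, 0, -1, -1, 0], [0, 0, 1, 0]), matching Figure 1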

4 Proposed Model

To model the relationship between edits and comments, we first formulate the problem with respect to the two tasks of Comment Ranking and Edit Anchoring; we then provide details of the neural architecture used to solve these problems; finally, we provide some implementation details. We begin with preliminary notation.

Let us define c as a comment consisting of word tokens {w_1, ..., w_q}. Minimally, let us also define a pre-edit e_s as the contiguous sequence of words spanning the first to the last edited token (with possibly intervening tokens that are not edited) in a document's pre-edit version. The edit e_s may optionally also contain some surrounding context. Formally, this can be defined as:

    \{\, \underbrace{w_{i-k}, \ldots, w_{i-1}}_{\text{context before edit}},\ \underbrace{w_{i}, \ldots, w_{i+p}}_{\text{edit words}},\ \underbrace{w_{i+p+1}, \ldots, w_{i+p+k}}_{\text{context after edit}} \,\}

where i and i + p are the indices of the first and the last edited tokens in a revision, and k is the context window size. If there is more than one contiguous block of edited tokens, these blocks are concatenated to form the set of edit words with their context words.

We define a document edit as the pair e = {e_s, e_t}, where e_t is similarly defined over edited tokens and their context words in a document's post-edit version.

4.1 Problem Formulation

Comment Ranking is the task of finding the most relevant comment among a list of potential candidates, given a document edit. The inputs of the comment ranking task are a set of user comments C = {c_1, c_2, ..., c_m} pertaining to a document, and an edit e = {e_s, e_t} in the same document. The goal of the task is to produce a ranking on C, such that the true comment c_i with respect to the edit e is ranked above all the other comments (i.e., distractors). We use standard ranking metrics to evaluate model performance: Precision@K (P@K), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

Edit Anchoring is the task of finding the sentences in a document that most likely underwent change, given a specific user comment. The inputs to the task are a user comment c and a list of candidate sentences S = {s_1, s_2, ..., s_n} in the post-edit document e_t. Unlike with Comment Ranking, we operate under the assumption that an edit has already been completed, and therefore discard the information from the pre-edit version e_s. In the ground truth, at least one (but possibly more) of the sentences is an edit location for the comment c. The expected output is a list of binary classifications R = {r_i}_{i=1}^{n}, where r_i = 1 indicates that the sentence s_i is a likely edit location, given comment c. We use Accuracy and F1 score to evaluate performance on this task.

4.2 Model Overview

For both tasks, our models are hierarchical deep neural networks with four layers: an input embedding layer, a contextual embedding layer, a comment-edit attention layer, and an output layer. The overall architecture is shown in Figure 2. We describe each of the four layers in what follows. Since both models share many components, we describe the more general case that covers all inputs for Comment Ranking; for Edit Anchoring, those components that correspond to the pre-edit input are suitably omitted.

[Figure 2: Overall architecture of the proposed model. The comment and the pre- and post-edit versions of the edits pass through an input embedding layer and a bi-GRU contextual embedding layer; a comment-edit attention layer combines action encodings, Hadamard products and similarity matrices with edit-based attention (a comment-wise maximum) to produce a relevance vector; task-specific output layers compute the Comment Ranking and Edit Anchoring losses.]

Input Embedding Layer. The input embedding layer maps each word in the user comment c and the edits e = {e_s, e_t} to a high-dimensional vector space. The outputs of the input embedding layer are matrices: U ∈ R^{d×M} representing the pre-edit document, V ∈ R^{d×M} representing the post-edit document, and Q ∈ R^{d×J} representing the comment. Here M is the length of the edits and J is the length of the comment, while d is the fixed dimensionality of the word vectors.

Contextual Embedding Layer. We use a bi-directional Gated Recurrent Unit (GRU) (Chung et al., 2014) to model the sequential interaction between words. Operating over the output of the previous layer, we obtain the contextual embedding matrices U^c ∈ R^{2d×M} and V^c ∈ R^{2d×M} for the pre- and post-edit versions, respectively. Likewise, we obtain Q^c ∈ R^{2d×J} from the comment word vectors. Note that the row dimension of the contextual matrices U^c, V^c and Q^c is 2d because the GRU's outputs in the forward and backward directions are concatenated.
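As a sketch of these first two layers in PyTorch (dimensions follow Section 4.3; the vocabulary size and variable names are our own assumptions, and the paper's d×M matrices transpose to PyTorch's batch × length × dimension convention):

    import torch
    import torch.nn as nn

    d, vocab_size = 100, 50000            # GloVe dimensionality; assumed vocabulary size
    M, J = 300, 30                        # maximum edit / comment lengths (Section 4.3)

    embed = nn.Embedding(vocab_size, d)   # in practice initialized from GloVe vectors
    bigru = nn.GRU(d, d, bidirectional=True, batch_first=True)

    comment_ids = torch.randint(0, vocab_size, (1, J))  # toy comment token ids
    Q = embed(comment_ids)                # (1, J, d): input embedding layer
    Qc, _ = bigru(Q)                      # (1, J, 2d): contextual embedding layer
    print(Qc.shape)                       # torch.Size([1, 30, 200])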

Comment-Edit Attention Layer. Inspired by the attention mechanisms utilized in machine comprehension (Seo et al., 2016; Yu et al., 2018), the comment-edit attention layer is designed to capture relationships between document edit and comment words. The attention layer maintains and processes the pre- and post-edit documents separately. This is to reduce the information loss that would occur if their representations were fused before the attention layer. Additionally, this layer incorporates an action encoding vector, which is designed to reflect the three kinds of edit operations: adding a word, deleting a word, or leaving it unchanged.

The inputs to the layer are the contextual matrices U^c and V^c of the pre- and post-edit documents respectively, the matrix Q^c representing the comment, and the supplemental action encoding vectors a†, a‡ ∈ Z^M, which encode the edit operation each token undergoes in the pre- and post-edit documents, respectively. The output is the comment-aware concatenated vector representation of the edit words in both pre- and post-edit documents, h ∈ R^{2J}.

Internally, we first calculate the shared similarity matrix S† ∈ R^{M×J} between the comment Q^c and the contextual matrix U^c of the pre-edit document, while also accounting for the action encoding vector a†. The elements of this shared similarity matrix are defined as follows:

    S^{\dagger}_{ij} = G(U^{c}_{:i}, Q^{c}_{:j}, a^{\dagger}_{i})    (1)

where G is a trainable function that generates the similarity between the word-level representations of comments and edits with respect to an edit operation.

Here U^c_{:i} ∈ R^{2d×1} is the vector representation of the i-th word in the pre-edit document and Q^c_{:j} ∈ R^{2d×1} is the vector representation of the j-th word in the comment. a†_i ∈ {-1, 0, 1} is the action encoding for the edit operation performed on the i-th word in the pre-edit document. We choose the trainable function G(u, q, a) = w^T [u ⊗ q; a], where w ∈ R^{(2d+1)×1} is a trainable weight vector, ⊗ is the element-wise multiplication operator, and [;] represents vector concatenation across the row dimension. Similarly, we calculate the shared similarity matrix S‡ ∈ R^{M×J} between the comment and the contextual matrix of the post-edit document as S‡_{ij} = G(V^c_{:i}, Q^c_{:j}, a‡_i). Note that the weight vectors in the function G are different for the pre- and post-edit versions, and both are trained simultaneously in the model.
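A sketch of the function G and the resulting similarity matrix, continuing the conventions above (the class and variable names are ours):

    import torch
    import torch.nn as nn

    class EditCommentSimilarity(nn.Module):
        # Computes S_ij = w^T [U_:i ⊗ Q_:j ; a_i] for all (i, j) pairs.
        def __init__(self, d):
            super().__init__()
            self.w = nn.Linear(2 * d + 1, 1, bias=False)  # trainable weight vector w

        def forward(self, U, Q, a):
            # U: (M, 2d) edit word vectors; Q: (J, 2d) comment word vectors;
            # a: (M,) action encoding with values in {-1, 0, 1}.
            M, J = U.size(0), Q.size(0)
            had = U.unsqueeze(1) * Q.unsqueeze(0)          # (M, J, 2d) Hadamard products
            act = a.float().view(M, 1, 1).expand(M, J, 1)  # broadcast action encoding
            return self.w(torch.cat([had, act], dim=-1)).squeeze(-1)  # (M, J)

    d, M, J = 100, 12, 7
    sim = EditCommentSimilarity(d)
    S = sim(torch.randn(M, 2 * d), torch.randn(J, 2 * d), torch.randint(-1, 2, (M,)))
    print(S.shape)  # torch.Size([12, 7])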

After the similarity matrices are computed, we use them to generate the Comment-to-Edit (C2E) attention weights. C2E attention represents the relevance of words in the edit to those that appear in the comment, which is critical for modeling the relationship between comments and edits. We obtain the C2E attention weights c† ∈ R^M for the edit words in the pre-edit document by taking c† = softmax(max_col(S†)), where the max_col(·) operator finds the column-wise maximum values of a matrix. Similarly, for the post-edit document, we have c‡ = softmax(max_col(S‡)).

Finally, we multiply the attention vectors with the previously computed similarity matrices for both pre- and post-edit documents, and concatenate the results to obtain the relevance vector h ∈ R^{2J}:

    h = \left[ (c^{\dagger})^{\top} S^{\dagger} \,;\, (c^{\ddagger})^{\top} S^{\ddagger} \right]^{\top}    (2)

The vector h intuitively captures the weighted sum of the relevance of the comment with respect to the edits in both the pre- and post-edit documents.
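The attention step then reduces each similarity matrix to a relevance vector; a sketch under the same conventions:

    import torch

    def relevance_vector(S_pre, S_post):
        # S_pre, S_post: (M, J) similarity matrices for the pre-/post-edit documents.
        c_pre = torch.softmax(S_pre.max(dim=1).values, dim=0)    # (M,) C2E weights
        c_post = torch.softmax(S_post.max(dim=1).values, dim=0)  # (M,)
        h_pre = c_pre @ S_pre     # (J,): (c†)^T S†
        h_post = c_post @ S_post  # (J,): (c‡)^T S‡
        return torch.cat([h_pre, h_post])  # h in R^{2J}, as in Equation (2)

    print(relevance_vector(torch.randn(12, 7), torch.randn(12, 7)).shape)
    # torch.Size([14])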

Output Layer and Loss Function. The output layer and loss function of the network are task-specific. Comment Ranking requires ranking the relevance of candidate comments given a document edit. Broadly, we obtain a ranked list by computing the relevance score between comments and edits as the output r = β^T h, where β is a trainable weight vector.

A data sample i in Comment Ranking consists of one true comment-edit pair and n_i negative comment-edit distractors. We denote by r^+_i the relevance score of the true pair, and by r^-_{ij} the relevance score of the j-th distractor pair (with 1 ≤ j ≤ n_i). The goal of our task is to make r^+_i > r^-_{ij} for all j. We therefore set the loss function to be a pairwise hinge loss between the true and distractor relevance scores:

    L_c(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{n_i} \max\left(0,\, 1 - r^{+}_{i} + r^{-}_{ij}\right)    (3)

where Θ is the set of all trainable weights in the model and N is the number of training samples in the dataset.
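For one training sample, the hinge loss of Equation (3) can be sketched as follows (function and variable names are ours):

    import torch

    def comment_ranking_loss(r_pos, r_neg):
        # r_pos: scalar score of the true comment-edit pair;
        # r_neg: (n_i,) scores of the distractor pairs.
        return torch.clamp(1.0 - r_pos + r_neg, min=0.0).sum()

    loss = comment_ranking_loss(torch.tensor(2.5), torch.tensor([1.0, 2.8, 0.3]))
    print(loss)  # tensor(1.3000): only the distractor within the margin contributes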

For Edit Anchoring, the goal is to predict whether a sentence in the document is likely to be the location of an edit, given a comment. This is a binary classification problem, and we can suitably set the output layer as p = softmax(γ^T h) – the probability of predicting the positive class. Here γ is a trainable weight vector.

Given the binary nature of the classification problem, we use the cross-entropy loss:

    L_e(\Phi) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m_i} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \right]    (4)

where Φ is the set of all trainable weights in the model, N is the number of data samples, m_i is the number of sentences in the i-th data sample, p_{ij} is the predicted probability for the j-th sentence in the i-th data sample, and y_{ij} is the corresponding ground-truth label.
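In practice, Equation (4) is the standard binary cross-entropy and can be delegated to PyTorch's built-in loss; a minimal sketch with toy values:

    import torch
    import torch.nn.functional as F

    p = torch.tensor([0.9, 0.2, 0.7])    # predicted probabilities for 3 candidate sentences
    y = torch.tensor([1.0, 0.0, 1.0])    # ground-truth edit-location labels
    loss = F.binary_cross_entropy(p, y)  # mean of the bracketed terms in Equation (4)
    print(loss)                          # approximately tensor(0.2285)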

4.3 Implementation Details

The CmntEdit model described in this section is implemented using the PyTorch³ framework and trained on a single Tesla P100 GPU with 16GB of memory. The word vectors are initialized with pre-trained GloVe embeddings (Pennington et al., 2014) using the default dimensionality of 100. We set the number of training epochs to 5, the maximum length of a comment to 30 tokens and the maximum length of an edit to 300 tokens. For the Comment Ranking task, we set the batch size to 10 and consider 5 candidate comments in each data sample: one true comment and 4 distractors. We thus have 5 comment-edit pairs for each data sample and 50 pairwise samples for each training batch. For the Edit Anchoring task, we also set the batch size to 10 and consider 5 candidate sentences, which yields an identical 50 training instances per batch.

It should be noted that while we train our model with only 5 candidate comments or sentences (for Comment Ranking and Edit Anchoring, respectively), the models – once trained – can be applied to any number of candidates at test time for either task.

5 Experiment

In this section, we show the evaluation results of our model on the previously collected Wikipedia dataset (Section 3). We begin by introducing the experimental settings in Section 5.1. We then compare the performance achieved by the proposed method against several baseline models in Section 5.2. We also conduct an ablation study to evaluate the various components of our model, as well as provide some qualitative results to demonstrate its effectiveness in practice.

³ https://pytorch.org/

5.1 Experimental Setup

We begin by introducing the evaluation setting,metrics and baselines we use in our experiments.

5.1.1 Datasets and Labels

We use the WikiCmnt dataset described in Section 3 for training and evaluation. The dataset contains 786,866 data samples in total. We reserve 70% of the data for training, 10% for validation and 20% for testing.

For the Comment Ranking task, we always have a single true comment, but in our tests we experiment with both 4 and 9 distractors, yielding a total of 5 and 10 candidate comments, respectively. Similarly, for the Edit Anchoring task, we test with both 5 and 10 candidate sentences.

5.1.2 Evaluation Metrics

We use common ranking evaluation metrics from the literature to evaluate models on the task of Comment Ranking. These include: (i) Precision@K (P@K): the proportion of predicted instances where the true comment appears in the top-K ranked results, with K = 1, 3, 5. (ii) Mean Reciprocal Rank (MRR): the average multiplicative inverse of the rank of the correct answer, represented mathematically as

    \mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}

where N is the number of samples and rank_i is the rank assigned to the true comment by a model. (iii) Normalized Discounted Cumulative Gain (NDCG): the normalized gain of each comment based on its position in the ranked results. We set the relevance score of the true comment to one and those of the distractors to zero.
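Because each test sample has exactly one true comment, all three metrics reduce to simple functions of the true comment's rank; a small sketch (the function name is ours):

    import math

    def ranking_metrics(ranks, k=3):
        # ranks: 1-based rank of the true comment in each test sample.
        n = len(ranks)
        p_at_k = sum(r <= k for r in ranks) / n
        mrr = sum(1.0 / r for r in ranks) / n
        # With a single relevant item, NDCG reduces to 1 / log2(rank + 1).
        ndcg = sum(1.0 / math.log2(r + 1) for r in ranks) / n
        return p_at_k, mrr, ndcg

    print(ranking_metrics([1, 3, 2, 1, 5]))
    # (0.8, ~0.607, ~0.704)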

For the Edit Anchoring task, we use Accuracy and F1 score as evaluation metrics. These are computed based on a model's classification of the candidate sentences in the post-edit version of the document.

5.1.3 Baseline Methods

For the Comment Ranking task, we compare our approach to the following baselines: (i) Cosine-TfIdf uses the cosine similarity between the TfIdf-weighted vectors of the comment and edit as a measure of relevance. (ii) Cosine-InferSent computes the cosine similarity between comment and edit vectors generated by the state-of-the-art sentence embedding method InferSent (Conneau et al., 2017).


                   Candidates=5                  Candidates=10
                   P@1    P@3    MRR    NDCG     P@1    P@3    P@5    MRR    NDCG
Cosine-TfIdf       0.266  0.519  0.470  0.597    0.137  0.291  0.444  0.296  0.453
Cosine-InferSent   0.228  0.597  0.471  0.600    0.115  0.305  0.497  0.302  0.461
RankNet            0.257  0.658  0.503  0.620    0.034  0.055  0.061  0.139  0.320
LambdaMART         0.310  0.726  0.549  0.661    0.152  0.384  0.593  0.352  0.502
Gated RNN          0.283  0.716  0.531  0.647    0.158  0.411  0.628  0.363  0.511
CmntEdit-MT        0.674  0.930  0.804  0.853    0.529  0.780  0.896  0.680  0.758
CmntEdit-CR        0.710  0.944  0.828  0.871    0.593  0.830  0.924  0.730  0.796

Table 3: Performance on Comment Ranking.

(iii) RankNet (Burges et al., 2005) minimizes the number of inversions in ranking by optimizing a pairwise loss function; we use an existing neural network implementation⁴ with default settings. (iv) LambdaMART (Burges, 2010) leverages gradient-boosted decision trees with a cost function derived from LambdaRank (Burges et al., 2007) for solving a ranking task; we use an existing Python implementation⁵ with a learning rate of 0.02 and 10 max leaf nodes. (v) Gated Recurrent Neural Network (Chung et al., 2014) models the sequence of words in comments and edits using pre-trained GloVe vectors as embedding units; three fully-connected layers compute a final relevance score between comments and edits.

For the Edit Anchoring task, we chose the following popular classifiers as baselines: (i) Random Forest (Liaw et al., 2002), (ii) AdaBoost (Rätsch et al., 2001), (iii) Passive-Aggressive classifiers (Crammer et al., 2006), and (iv) the Gated Recurrent Neural Network. The features used in these models are based on both TfIdf scores and sentence embedding features generated by InferSent. The Gated RNN model is trained with a task-specific fully-connected layer for Edit Anchoring to compute the classification probability.

5.1.4 Model Variants

We train and evaluate several different variants of our neural architecture. They include: (i) CmntEdit-CR, a variant of our model trained only on data samples for the Comment Ranking task; (ii) CmntEdit-EA, a variant of our model trained only on data samples for the Edit Anchoring task;

⁴ https://github.com/shiba24/learning2rank
⁵ https://github.com/jma127/pyltr

               Candidates=5        Candidates=10
               Acc    F1           Acc    F1
Passive-Aggr   0.581  0.533        0.716  0.262
RandForest     0.639  0.290        0.743  0.112
Adaboost       0.657  0.398        0.751  0.207
Gated RNN      0.696  0.651        0.665  0.539
CmntEdit-MT    0.635  0.587        0.619  0.468
CmntEdit-EA    0.744  0.687        0.726  0.583

Table 4: Performance on Edit Anchoring.

(iii) CmntEdit-MT, a variant of our model jointly trained on data samples of both the Comment Ranking and Edit Anchoring tasks. Unless otherwise specified, all models use a standard context window size of 5 tokens⁶.

5.2 Experimental Results

We now present and discuss the empirical results of our evaluation on both Comment Ranking and Edit Anchoring.

5.2.1 Comment Ranking

Table 3 summarizes the results of the Comment Ranking task. Our models significantly outperform all the baselines in every setting and on all metrics. The results are statistically significant at p < 0.01 using the Wilcoxon signed-rank test (Smucker et al., 2007). Since the Comment Ranking task has only one true comment, the other comments being distractors, the P@1 result is especially important for practical applications. Our approach achieves 71% precision, which is significantly better than the 31% precision of the best baseline method (LambdaMART) with 5 candidate comments. Our model similarly outperforms the best baseline with 10 candidate comments, obtaining a P@1 score of 59.3%.

⁶ We performed an extensive evaluation of context size with window sizes ranging from 0 to 50. From the results, we found that our model tends to perform best when the context window size is close to 5 tokens.


Comment Ranking   Candidates=5        Candidates=10
                  P@1    MRR          P@1    MRR
w/o Action        0.630  0.778        0.495  0.659
w/o Attention     0.625  0.775        0.488  0.655
w/o Hadamard      0.624  0.774        0.483  0.652
CmntEdit-CR       0.710  0.828        0.593  0.730

Edit Anchoring    Candidates=5        Candidates=10
                  Acc    F1           Acc    F1
w/o Action        0.713  0.668        0.684  0.556
w/o Attention     0.723  0.670        0.701  0.562
w/o Hadamard      0.722  0.660        0.706  0.558
CmntEdit-EA       0.744  0.687        0.726  0.583

Table 5: Ablation study.

Additionally, the higher scores on MRR and NDCG indicate that our approach consistently ranks the true comment higher than the baseline methods do.

The performance of CmntEdit-MT is 2% and 5.3% worse than that of CmntEdit-CR on NDCG and P@1, respectively. Surprisingly, this suggests that training our model in a multi-task fashion leads to slightly lower scores, and that a model specifically trained for the individual task of Comment Ranking is to be preferred.

5.2.2 Edit Anchoring

Table 4 shows the results for Edit Anchoring. Our method, CmntEdit-EA, outperforms the best baseline method, the Gated RNN, by 5.5% on F1 and 6.9% on accuracy. The improvements over all the baselines are statistically significant at a p-value of 0.01. The baseline classifiers, including Passive-Aggressive, Random Forest and AdaBoost, have high accuracies but low F1 scores. This is because of the imbalance between positive and negative samples in our data. Specifically, the number of negative samples is 4 times greater than the number of positive samples when the size of the candidate set is 5 – and even greater when it is 10. Therefore, the baseline classifiers tend to naively predict a negative label, which artificially boosts precision to the detriment of recall. In fact, AdaBoost actually outperforms our models on accuracy when the candidate set size is 10, but yields a much lower F1 score.

As with Comment Ranking, the performance of CmntEdit-MT is slightly worse than that of CmntEdit-EA. Within the scope of our problem space, this reinforces the finding that targeted training seems to work better than joint training.

5.2.3 Ablation Study

To verify the effectiveness of our modeling choices, we evaluate performance in the absence of each of the following model components: 1. w/o Action: we remove the action encoding from the trainable function G and instead simply use G(u, q) = w^T [u ⊗ q]. 2. w/o Attention: we remove the edit-based attention from Equation (2); instead, we generate the relevance vector h as h = [mean_col(S†); mean_col(S‡)]^T. 3. w/o Hadamard: we use the inner product instead of the Hadamard product in the trainable function G, as follows: G(u, q, a) = w^T [u^T q; a].
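The three ablations amount to small changes in the similarity function G; an illustrative sketch of the variants (our own code, with w sized per variant):

    import torch

    def g_full(w, u, q, a):         # full model: w in R^{2d+1}
        return w @ torch.cat([u * q, a.view(1)])

    def g_no_action(w, u, q):       # w/o Action: w in R^{2d}, action encoding dropped
        return w @ (u * q)

    def g_no_hadamard(w, u, q, a):  # w/o Hadamard: w in R^2, inner product u^T q
        return w @ torch.cat([(u @ q).view(1), a.view(1)])

    d = 100
    u, q, a = torch.randn(2 * d), torch.randn(2 * d), torch.tensor(1.0)
    print(g_full(torch.randn(2 * d + 1), u, q, a))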

The results in Table 5 show that each component improves the overall performance on both the Comment Ranking and Edit Anchoring tasks, across our evaluation metrics. This indicates that our modeling choices are particularly suited to tackling the inherent challenges involved in modeling the comment-edit relationship.

5.2.4 Qualitative Evaluation

Table 6 provides a few output examples from our model on the Comment Ranking task, demonstrating its ability to learn abstract connections between comments and edits. Due to space constraints, only one illustrative distractor is shown per case.

The first example shows an edit summarized by the high-level comment "date and capitalization corrections". This comment is correctly assigned the highest relevance score by our model, despite the fact that no words are shared between the comment and the edit. Meanwhile, one of the distractors receives a lower score even though it shares the lexical item "Walgreens" with the context of the edit.

In the second example, an entire sentence is removed by the editor. Again, although no words are shared between the comment and the edit, our model is able to correctly identify the delete operation, possibly by learning the common Wikipedia shorthand for deletions, "rm". Meanwhile, one of the distractors contains the phrase "St Helens", which also appears in the edit, but is still assigned a lower score.

6 Conclusion and Future Work

In this paper, we have explored the relationship between comments and edits by defining two novel tasks: Comment Ranking and Edit Anchoring. In order to model the problem, we collected a dataset with over 780K comment-edit pairs.


Case 1: Grammar fixing

Pre-Edit:   In December of 2012, A judge ordered Walgreens to pay $16.57 million to settle a lawsuit claiming...
Post-Edit:  In December 2012, a judge ordered Walgreens to pay $16.57 million to settle a lawsuit claiming...
Comment:    Date and capitalization corrections [Relevance Score: 4.050]
Distractor: Dane Cook's comments about Walgreens do not belong in "Facts" [Relevance Score: 3.028]

Case 2: Sentence removal

Pre-Edit:   [http://www.pilkington.co.uk/energikare Pilkington energiKare] - Pilkington product from St Helens helping the environment...
Post-Edit:  (whole sentence removed)
Comment:    rm advertising, other link to major employer noted in text seems ok at first glance. [Relevance Score: 5.192]
Distractor: Fixing link to Kevin Brown - You know I never knew he was from St Helens [Relevance Score: 3.479]

Table 6: Examples of edits and comments matched by the proposed model, with one deceptive distractor for each case. The relevance score is shown after each comment/distractor.

Further, we proposed a hierarchical multi-layer neural network capable of tackling both of our proposed tasks by encoding specific edit actions, such as additions and deletions, as well as document context. In our experiments, we show that our approach outperforms several baselines by significant margins on both tasks, yielding a best score of 71% precision@1 for Comment Ranking and 74.4% accuracy for Edit Anchoring.

In future work, we plan to explore sequences of revisions through the lifecycle of a document from creation to completion, with the ultimate goal of modeling document evolution. We also hope to apply our modeling approach to practical downstream applications, including: (i) detecting completed to-dos based on related edits; (ii) localizing the paragraphs that could be edited to address a given comment; (iii) summarizing document revisions.

Acknowledgement

We thank the anonymous reviewers and the members of the Knowledge Technologies and Intelligent Experiences group at Microsoft Research for helpful feedback. We also thank Ryen W. White for thoughtful feedback on the statistical hypothesis testing.

References

Ablimit Aji, Yu Wang, Eugene Agichtein, and Evgeniy Gabrilovich. 2010. Using the past to score the present: Extending term weighting models through revision history analysis. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 629–638. ACM.

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135–187.

Amit Bronner and Christof Monz. 2012. User edits classification using document revision histories. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 356–366, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM.

Christopher J. Burges, Robert Ragno, and Quoc V. Le. 2007. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems, pages 193–200.

Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11(23-581):81.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551–585.


Stephen E. Fischer. 2013. Automated document revision markup and change control. US Patent 8,381,095.

Viola Ganter and Michael Strube. 2009. Finding hedges by chasing weasels: Hedge detection using Wikipedia tags and shallow linguistic features. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 173–176. Association for Computational Linguistics.

Tovi Grossman, Justin Frank Matejka, and George Fitzmaurice. 2013. Hierarchical display and navigation of document revision histories. US Patent 8,533,593.

Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomForest. R News, 2(3):18–22.

Aurélien Max and Guillaume Wisniewski. 2010. Mining naturally-occurring corrections and paraphrases from Wikipedia's revision history. In LREC.

Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 83–84. International World Wide Web Conferences Steering Committee.

Sérgio Nunes, Cristina Ribeiro, and Gabriel David. 2008. WikiChanges: Exposing Wikipedia revision activity. In Proceedings of the 4th International Symposium on Wikis, page 25. ACM.

Sérgio Nunes, Cristina Ribeiro, and Gabriel David. 2011. Term weighting based on document revision history. Journal of the American Society for Information Science and Technology, 62(12):2471–2478.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Gunnar Rätsch, Takashi Onoda, and K.-R. Müller. 2001. Soft margins for AdaBoost. Machine Learning, 42(3):287–320.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.

Robert Shaver. 2016. Document comment management. US Patent 9,418,054.

Mark D. Smucker, James Allan, and Ben Carterette. 2007. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 623–632, New York, NY, USA. ACM.

Elif Yamangil and Rani Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 137–140. Association for Computational Linguistics.

Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy. 2017. Identifying semantic edit intentions from revisions in Wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2000–2010.

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 365–368. Association for Computational Linguistics.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.

Fabio Massimo Zanzotto and Marco Pennacchiotti. 2010. Expanding textual entailment corpora from Wikipedia using co-training. In Proceedings of the 2nd Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pages 28–36.

Torsten Zesch. 2012. Measuring contextual fitness using error contexts extracted from the Wikipedia revision history. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 529–538. Association for Computational Linguistics.

Xiaofeng Zhu, Diego Klabjan, and Patrick N. Bless. 2017. Semantic document distance measures and unsupervised document revision detection. In IJCNLP.