What You Say and How You Say it: Joint Modeling of Topics and Discourse in Microblog Conversations

Jichuan Zeng1∗, Jing Li2∗, Yulan He3, Cuiyun Gao1, Michael R. Lyu1, Irwin King1

1Department of Computer Science and Engineering, The Chinese University of Hong Kong, HKSAR, China

2Tencent AI Lab, Shenzhen, China
3Department of Computer Science, University of Warwick, UK

1{jczeng, cygao, lyu, king}@cse.cuhk.edu.hk, 2[email protected], 3[email protected]

Abstract

This paper presents an unsupervised framework for jointly modeling topic content and discourse behavior in microblog conversations. Concretely, we propose a neural model to discover word clusters indicating what a conversation concerns (i.e., topics) and those reflecting how participants voice their opinions (i.e., discourse).1 Extensive experiments show that our model can yield both coherent topics and meaningful discourse behavior. Further study shows that our topic and discourse representations can benefit the classification of microblog messages, especially when they are jointly trained with the classifier.

1 Introduction

The last decade has witnessed the revolution of communication, where the “kitchen table conversations” have been expanded to public discussions on online platforms. As a consequence, in our daily life, the exposure to new information and the exchange of personal opinions have been mediated through microblogs, one popular online platform genre (Bakshy et al., 2015). The flourish of microblogs has also led to the sheer quantity of user-created conversations emerging every day, exposing individuals to superfluous information. Facing such an unprecedented number of conversations relative to the limited attention of individuals, how shall we automatically extract the critical points and make sense of these microblog conversations?

Towards key focus understanding of a conversation, previous work has shown the benefits of discourse structure (Li et al., 2016b; Qin et al., 2017; Li et al., 2018), which shapes how messages interact with each other forming the discussion flow

∗ This work was partially conducted during Jichuan Zeng’s internship at Tencent AI Lab. Corresponding author: Jing Li.

1Our datasets and code are available at: http://github.com/zengjichuan/Topic_Disc

...
M1 [Statement]: Just watched HRC openly endorse a gun-control measure which will fail in front of the Supreme Court. This is a train wreck.
M2 [Comment]: People said the same thing about Obama, and nothing took place. Gun laws just aren’t being enforced like they should be. :/
M3 [Question]: Okay, hold up. What do you think I’m referencing here? It’s not what you’re talking about.
M4 [Agreement]: Thought it was about gun control. I’m in agreement that gun rights shouldn’t be stripped.
...

Figure 1: A Twitter conversation snippet about the gun control issue in the U.S. Topic words reflecting the conversation focus are in boldface. The italic words in [] are our interpretations of the messages’ discourse roles.

and can usefully reflect salient topics raised in the discussion process. After all, the topical content of a message naturally occurs in the context of the conversation discourse and hence should not be modeled in isolation. The other way around, the extracted topics can reveal the purposes of participants and further facilitate the understanding of their discourse behavior (Qin et al., 2017). Further, the joint effects of topics and discourse have proven useful for better understanding microblog conversations, for example in a downstream task to predict user engagements (Zeng et al., 2018b).

To illustrate how topics and discourse interplay in a conversation, Figure 1 displays a snippet of a Twitter conversation. As can be seen, the content words reflecting the discussion topics (such as “supreme court” and “gun rights”) appear in the context of the discourse flow, where participants carry the conversation forward via making a statement, giving a comment, asking a question, and so forth. Motivated by such an observation, we assume that a microblog conversation can be decomposed into two crucially different components: one for topical content and the other for discourse behavior. Here, the topic components indicate what a conversation is centered around and reflect the

important discussion points put forward in the conversation process. The discourse components signal the discourse roles of messages, such as making a statement, asking a question, and other dialogue acts (Ritter et al., 2010; Joty et al., 2011), which further shape the discourse structure of a conversation.2 To distinguish the above two components, we examine the conversation contexts and identify two types of words: topic words, indicating what a conversation focuses on, and discourse words, reflecting how the opinion is voiced in each message. For example, in Figure 1, the topic words “gun” and “control” indicate the conversation topic, while the discourse words “what” and “?” signal the question in M3.

Concretely, we propose a neural framework built upon topic models, enabling the joint exploration of word clusters to represent topics and discourse in microblog conversations. Different from the prior models trained on annotated data (Li et al., 2016b; Qin et al., 2017), our model is fully unsupervised, not dependent on annotations for either topics or discourse, which ensures its immediate applicability in any domain or language. Moreover, taking advantage of the recent advances in neural topic models (Srivastava and Sutton, 2017; Miao et al., 2017), we are able to approximate Bayesian variational inference without requiring model-specific derivations, while most existing work (Ritter et al., 2010; Joty et al., 2011; Alvarez-Melis and Saveski, 2016; Zeng et al., 2018b; Li et al., 2018) requires expertise to customize model inference algorithms. In addition, the neural nature of our model enables end-to-end training of topic and discourse representation learning with other neural models for diverse tasks.

For model evaluation, we conduct an extensive empirical study on two large-scale Twitter datasets. The intrinsic results show that our model can produce latent topics and discourse roles with better interpretability than state-of-the-art models from previous studies. The extrinsic evaluations on a tweet classification task exhibit the model’s ability to capture useful representations for microblog messages. Particularly, our model enables an easy combination with existing neural models, such as CNN, for end-to-end training, which is shown to perform better in

2In this paper, the discourse role refers to a certain type of dialogue act (e.g., statement or question) for each message, and the discourse structure refers to some combination of discourse roles in a conversation.

classification than the pipeline approach without joint training.

2 Related Work

Our work is in line with previous studies that employ non-neural models to leverage discourse structure for extracting topical content from conversations (Li et al., 2016b; Qin et al., 2017; Li et al., 2018). Zeng et al. (2018b) explore how discourse and topics jointly affect user engagements in microblog discussions. Different from them, we build our model in a neural network framework, where the joint effects of topic and discourse representations can be exploited for various downstream deep learning tasks in an end-to-end manner. In addition, we are inspired by prior research that only models topics or conversation discourse. In the following, we discuss them in turn.

Topic Modeling. Our work is closely related to topic model studies. In this field, despite the huge success achieved by the springboard topic models (e.g., pLSA (Hofmann, 1999) and LDA (Blei et al., 2001)) and their extensions (Blei et al., 2003; Rosen-Zvi et al., 2004), the applications of these models have been limited to formal and well-edited documents, such as news reports (Blei et al., 2003) and scientific articles (Rosen-Zvi et al., 2004), attributed to their reliance on document-level word collocations. When processing short texts, such as messages on microblogs, the performance of these models is inevitably compromised due to the severe data sparsity issue.

To deal with such an issue, many previous efforts incorporate external representations, such as word embeddings (Nguyen et al., 2015; Li et al., 2016a; Shi et al., 2017) and knowledge (Song et al., 2011; Yang et al., 2015; Hu et al., 2016), pre-trained on large-scale high-quality resources. Different from them, our model learns topic and discourse representations only with the internal data and thus can be widely applied in scenarios where specific external resources are unavailable.

In another line of research, most prior work focuses on how to enrich the context of short messages. To this end, the biterm topic model (BTM) (Yan et al., 2013) extends a message into a biterm set with all combinations of any two distinct words appearing in the message. In contrast, our model allows the richer context of a conversation to be exploited, where word collocation

patterns can be captured beyond a short message. In addition, there are many methods employing

some heuristic rules to aggregate short messages into long pseudo-documents, such as those based on authorship (Hong and Davison, 2010; Zhao et al., 2011) and hashtags (Ramage et al., 2010; Mehrotra et al., 2013). Compared with these methods, we model messages in the context of their conversations, which has been demonstrated to be a more natural and effective text aggregation strategy for topic modeling (Alvarez-Melis and Saveski, 2016).

Conversation Discourse. Our work is also in the area of discourse analysis for conversations, ranging from the prediction of shallow discourse roles at the utterance level (Stolcke et al., 2000; Ji et al., 2016; Zhao et al., 2018) to discourse parsing for more complex conversation structures (Elsner and Charniak, 2008, 2010; Afantenos et al., 2015). In this area, most existing models heavily rely on data annotated with discourse labels for learning (Zhao et al., 2017). Different from them, our model, in a fully unsupervised way, identifies distributional word clusters to represent latent discourse factors in conversations. Although such latent discourse variables have been studied in previous work (Ritter et al., 2010; Joty et al., 2011; Ji et al., 2016; Zhao et al., 2018), none of them explores the effects of latent discourse on the identification of conversation topics, a gap our work fills.

3 Our Neural Model for Topics and Discourse in Conversations

This section introduces our neural model that jointly explores latent representations for topics and discourse in conversations. We first present an overview of our model in Section 3.1, followed by the model's generative process and inference procedure in Sections 3.2 and 3.3, respectively.

3.1 Model Overview

In general, our model aims to learn coherent word clusters that reflect the latent topics and discourse roles embedded in microblog conversations. To this end, we distinguish two latent components in the given collection: topics and discourse, each represented by a certain type of word distribution (distributional word cluster). Specifically, at the corpus level, we assume there are $K$ topics, represented by $\phi^T_k$ ($k = 1, 2, \ldots, K$), and $D$ discourse


Figure 2: The architecture of our neural framework that jointly models latent topics and latent discourse.

roles, captured with $\phi^D_d$ ($d = 1, 2, \ldots, D$). $\phi^T$ and $\phi^D$ are multinomial word distributions over the vocabulary of size $V$. Inspired by the neural topic models in Miao et al. (2017), our model encodes the topic and discourse distributions ($\phi^T$ and $\phi^D$) as latent variables in a neural network and learns the parameters via back propagation.

Before touching the details of our model, we first describe how we formulate the input. On microblogs, as a message might have multiple replies, messages in an entire conversation can be organized as a tree with replying relations (Li et al., 2016b, 2018). Though recent progress in recursive models allows representation learning from tree-structured data, previous studies have pointed out that, in practice, sequence models serve as a simpler yet robust alternative (Li et al., 2015). In this work, we follow the common practice in most conversation modeling research (Ritter et al., 2010; Joty et al., 2011; Zhao et al., 2018) and take a conversation as a sequence of turns. To this end, each conversation tree is flattened into root-to-leaf paths. Each such path is considered as a conversation instance, and a message on the path corresponds to a conversation turn (Zarisheva and Scheffler, 2015; Cerisara et al., 2018; Jiao et al., 2018).
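The flattening step can be made concrete with a minimal sketch. The dict-based tree format (a message ID mapped to the IDs of its replies) is an illustrative assumption, not the paper's data format:

```python
# A minimal sketch of flattening a reply tree into root-to-leaf
# conversation instances, as described above. The tree format is an
# assumption for illustration.
def flatten_conversation(tree, root):
    """Return every root-to-leaf path as one conversation instance."""
    paths = []

    def walk(node, prefix):
        children = tree.get(node, [])
        if not children:            # leaf: one complete conversation
            paths.append(prefix + [node])
            return
        for child in children:
            walk(child, prefix + [node])

    walk(root, [])
    return paths

# Example: M2 replies to M1; M3 replies to M2; M4 replies to M3.
tree = {"M1": ["M2"], "M2": ["M3"], "M3": ["M4"]}
print(flatten_conversation(tree, "M1"))  # [['M1', 'M2', 'M3', 'M4']]
```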

The overall architecture of our model is shown in Figure 2. Formally, we formulate a conversation $c$ as a sequence of messages $(x_1, x_2, \ldots, x_{M_c})$, where $M_c$ denotes the number of messages in $c$. Each message $x$ in the conversation, called the target message, is fed into our model sequentially.

Here we process the target message $x$ as the bag-of-words (BoW) term vector $x_{BoW} \in \mathbb{R}^V$, following the bag-of-words assumption in most topic models (Blei et al., 2003; Miao et al., 2017). The conversation $c$ in which the target message $x$ is involved is considered as the context of $x$. It is also encoded in BoW form (denoted as $c_{BoW} \in \mathbb{R}^V$) and fed into our model. In doing so, we ensure the context of the target message is incorporated while learning its latent representations.

Following previous practice in neural topic models (Miao et al., 2017; Srivastava and Sutton, 2017), we employ the variational auto-encoder (VAE) (Kingma and Welling, 2013) to model the data generative process in two steps. First, given the target message $x$ and its conversation $c$, our model converts them into two latent variables: the topic variable $z$ and the discourse variable $d$. Then, using the intermediate representations captured by $z$ and $d$, we reconstruct the target message, $x'$.

3.2 Generative Process

In this section, we first describe the two latent variables in our model: the topic variable $z$ and the discourse variable $d$. Then, we present our data generative process from the latent variables.

Latent Topics. For latent topic learning, we examine the main discussion points in the context of a conversation. Our assumption is that messages in the same conversation tend to focus on similar topics (Li et al., 2018; Zeng et al., 2018b). Concretely, we define the latent topic variable $z \in \mathbb{R}^K$ at the conversation level and generate the topic mixture of $c$, denoted as a $K$-dimensional distribution $\theta$, via a softmax construction conditioned on $z$ (Miao et al., 2017).

Latent Discourse. For modeling the discourse structure of conversations, we capture the message-level discourse roles reflecting the dialogue act of each message, as is done in Ritter et al. (2010). Concretely, given the target message $x$, we employ a $D$-dimensional one-hot vector to represent the latent discourse variable $d$, where the high bit indicates the index of the discourse word distribution that best expresses $x$'s discourse role. In the generative process, the latent discourse $d$ is drawn from a multinomial distribution with parameters estimated from the input data.

Data Generative Process. As mentioned previously, our entire framework is based on the VAE,

which consists of an encoder and a decoder. The encoder maps a given input into latent topic and discourse representations, and the decoder reconstructs the original input from the latent representations. In the following, we first describe the decoder, followed by the encoder.

In general, our decoder is learned to reconstruct the words in the target message $x$ (in BoW form) from the latent topic $z$ and latent discourse $d$. We show the generative story that reflects the reconstruction process below:
• Draw the latent topic $z \sim \mathcal{N}(\mu, \sigma^2)$
• $c$'s topic mixture $\theta = \mathrm{softmax}(f_\theta(z))$
• Draw the latent discourse $d \sim \mathrm{Multi}(\pi)$
• For the $n$-th word in $x$:
  – $\beta_n = \mathrm{softmax}(f_{\phi^T}(\theta) + f_{\phi^D}(d))$
  – Draw the word $w_n \sim \mathrm{Multi}(\beta_n)$

where $f_*(\cdot)$ is a neural perceptron, with a linear transformation of inputs activated by a non-linear transformation. Here we use rectified linear units (ReLUs) (Nair and Hinton, 2010) as the activation functions. In particular, the weight matrix of $f_{\phi^T}(\cdot)$ (after softmax normalization) is considered as the topic-word distributions $\phi^T$. The discourse-word distributions $\phi^D$ are similarly obtained from $f_{\phi^D}(\cdot)$.

For the encoder, we learn the parameters $\mu$, $\sigma$, and $\pi$ from the inputs $x_{BoW}$ and $c_{BoW}$ (the BoW form of the target message and its conversation), following the formulas below:

$$\mu = f_\mu(f_e(c_{BoW})), \quad \log\sigma = f_\sigma(f_e(c_{BoW})), \quad \pi = \mathrm{softmax}(f_\pi(x_{BoW})) \tag{1}$$
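To make the encoder and decoder concrete, here is a minimal PyTorch-style sketch. The layer sizes, module names, and single-hidden-layer design are illustrative assumptions rather than the authors' released code (available at the GitHub link above):

```python
# A minimal sketch of the encoder (Eq. 1) and decoder (Section 3.2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicDiscourseVAE(nn.Module):
    def __init__(self, vocab_size, n_topics, n_discourse, hidden=256):
        super().__init__()
        self.f_e = nn.Linear(vocab_size, hidden)           # shared encoder f_e
        self.f_mu = nn.Linear(hidden, n_topics)            # mu = f_mu(f_e(c_BoW))
        self.f_sigma = nn.Linear(hidden, n_topics)         # log sigma = f_sigma(f_e(c_BoW))
        self.f_pi = nn.Linear(vocab_size, n_discourse)     # pi = softmax(f_pi(x_BoW))
        self.f_theta = nn.Linear(n_topics, n_topics)       # theta = softmax(f_theta(z))
        self.f_phi_T = nn.Linear(n_topics, vocab_size)     # topic-word weights phi^T
        self.f_phi_D = nn.Linear(n_discourse, vocab_size)  # discourse-word weights phi^D

    def encode(self, x_bow, c_bow):
        h = F.relu(self.f_e(c_bow))
        mu, log_sigma = self.f_mu(h), self.f_sigma(h)
        pi = F.softmax(self.f_pi(x_bow), dim=-1)
        return mu, log_sigma, pi

    def decode(self, z, d):
        theta = F.softmax(self.f_theta(z), dim=-1)
        # beta mixes topic-word and discourse-word logits (Section 3.2)
        beta = F.softmax(self.f_phi_T(theta) + self.f_phi_D(d), dim=-1)
        return beta
```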

3.3 Model Inference

For the objective function of our entire framework, we take three aspects into account: the learning of latent topics and discourse, the reconstruction of the target messages, and the separation of topic-associated words and discourse-related words.

Learning Latent Topics and Discourse. For learning the latent topics and discourse in our model, we employ variational inference (Blei et al., 2016) to approximate the posterior distributions over the latent topic $z$ and the latent discourse $d$ given all the training data. To this end, we maximize the variational lower bound $\mathcal{L}_z$ for $z$ and $\mathcal{L}_d$ for $d$, each defined as follows:

$$\mathcal{L}_z = \mathbb{E}_{q(z \mid c)}[p(c \mid z)] - D_{KL}(q(z \mid c) \,\|\, p(z))$$
$$\mathcal{L}_d = \mathbb{E}_{q(d \mid x)}[p(x \mid d)] - D_{KL}(q(d \mid x) \,\|\, p(d)) \tag{2}$$

$q(z \mid c)$ and $q(d \mid x)$ are the approximate posteriors describing how the latent topic $z$ and the latent discourse $d$ are generated from the data. $p(c \mid z)$ and $p(x \mid d)$ represent the corpus likelihoods conditioned on the latent variables. Here, to facilitate coherent topic production, in $p(c \mid z)$ we penalize stop words' likelihood of being generated from latent topics, following Li et al. (2018). $p(z)$ follows the standard normal prior $\mathcal{N}(0, I)$ and $p(d)$ is the uniform distribution $\mathrm{Unif}(0, 1)$. $D_{KL}$ refers to the Kullback-Leibler divergence (KLD), which ensures the approximated posteriors stay close to the true ones. For more derivation details, we refer readers to Miao et al. (2017).
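Under the priors just stated, both KL terms in Eq. 2 have closed forms. The following sketch uses the standard expressions; it is not taken from the authors' code:

```python
# KL terms of Eq. 2: a standard normal prior on z, a uniform prior on d.
import math
import torch

def kl_gaussian_standard(mu, log_sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over topic dims."""
    return 0.5 * torch.sum(mu ** 2 + torch.exp(2 * log_sigma)
                           - 2 * log_sigma - 1, dim=-1)

def kl_categorical_uniform(pi, eps=1e-10):
    """D_KL( Multi(pi) || Uniform ) for a D-way categorical posterior."""
    d = pi.size(-1)
    # sum_i pi_i * (log pi_i - log(1/D)) = sum_i pi_i * (log pi_i + log D)
    return torch.sum(pi * (torch.log(pi + eps) + math.log(d)), dim=-1)
```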

Reconstructing Target Messages. From the latent variables $z$ and $d$, the goal of our model is to reconstruct the target message $x$. The corresponding learning objective is to maximize $\mathcal{L}_x$, defined as:

$$\mathcal{L}_x = \mathbb{E}_{q(z \mid c)\,q(d \mid x)}[\log p(x \mid z, d)] \tag{3}$$

Here we design $\mathcal{L}_x$ to ensure that the learned latent topics and discourse can reconstruct $x$.

Distinguishing Topics and Discourse. Our model aims to distinguish the word distributions for topics ($\phi^T$) and discourse ($\phi^D$), which enables topics and discourse to capture different information in conversations. Concretely, we employ the mutual information, given below, to measure the mutual dependency between the latent topic $z$ and the latent discourse $d$:3

$$\mathbb{E}_{q(z)q(d)}\left[\log \frac{p(z, d)}{p(z)\,p(d)}\right] \tag{4}$$

Eq. 4 can be further derived as the Kullback-Leibler divergence between the conditional distribution $p(d \mid z)$ and the marginal distribution $p(d)$. The derived formula, defined as the mutual information loss ($\mathcal{L}_{MI}$) and shown in Eq. 5, is used to map $z$ and $d$ into separated semantic spaces.

$$\mathcal{L}_{MI} = \mathbb{E}_{q(z)}\left[D_{KL}(p(d \mid z) \,\|\, p(d))\right] \tag{5}$$

We can hence minimize $\mathcal{L}_{MI}$ to guide our model to separate the word distributions that represent topics and discourse.
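A sketch of Eq. 5 follows. Estimating the outer expectation by averaging over a batch of sampled $z$, and the exact parameterization of $p(d \mid z)$, are assumptions for illustration:

```python
# Mutual information loss (Eq. 5), Monte-Carlo averaged over z samples.
import torch

def mi_loss(p_d_given_z, p_d, eps=1e-10):
    """E_q(z)[ D_KL( p(d|z) || p(d) ) ].

    p_d_given_z: (batch, D) discourse distribution conditioned on sampled z
    p_d:         (D,)      marginal discourse distribution
    """
    kld = torch.sum(p_d_given_z * (torch.log(p_d_given_z + eps)
                                   - torch.log(p_d + eps)), dim=-1)
    return kld.mean()
```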

3The distributions in Eq. 4 are all conditional probability distributions given the target message $x$ and its conversation $c$. We omit the conditions for simplicity.

Datasets   # of convs   Avg msgs per conv   Avg words per msg   |Vocab|
TREC       116,612      3.95                11.38               9,463
TWT16      29,502       8.67                14.70               7,544

Table 1: Statistics of the two datasets containing Twitter conversations.

The Final Objective. To capture the joint effects of the learning objectives described above ($\mathcal{L}_z$, $\mathcal{L}_d$, $\mathcal{L}_x$, and $\mathcal{L}_{MI}$), we design the final objective function of our entire framework as follows:

$$\mathcal{L} = \mathcal{L}_z + \mathcal{L}_d + \mathcal{L}_x - \lambda \mathcal{L}_{MI} \tag{6}$$

where the hyperparameter $\lambda$ is the trade-off parameter balancing the MI loss ($\mathcal{L}_{MI}$) against the other learning objectives. By maximizing the final objective $\mathcal{L}$ via back propagation, the word distributions of topics and discourse can be jointly learned from microblog conversations.4
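One training step can then be sketched as below. The `objectives` helper, which would wire together the loss terms sketched earlier, is hypothetical:

```python
# One training step wiring Eqs. 2-6 together. `model.objectives` is a
# hypothetical helper returning the four loss terms of Eq. 6.
def training_step(model, optimizer, x_bow, c_bow, lam=0.01):
    L_z, L_d, L_x, L_MI = model.objectives(x_bow, c_bow)
    loss = -(L_z + L_d + L_x - lam * L_MI)  # maximizing L = minimizing -L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```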

4 Experimental Setup

Data Collection. For our experiments, we collected two microblog conversation datasets from Twitter. One is released by the TREC 2011 microblog track (henceforth TREC), containing conversations concerning a wide range of topics.5 The other was crawled from January to June 2016 with the Twitter streaming API6 (henceforth TWT16, short for Twitter 2016), following the way the TREC dataset was built. During this period, there was a large volume of discussion centered around the U.S. presidential election. In addition, for both datasets, we applied the Twitter search API7 to retrieve missing tweets in the conversation history, as the Twitter streaming API (used to collect both datasets) only returns sampled tweets from the entire pool.

The statistics of the two experimental datasets are shown in Table 1. For model training and evaluation, we randomly sampled 80%, 10%, and 10%

4To smooth the gradients in implementation, for $z \sim \mathcal{N}(\mu, \sigma)$ we apply the reparameterization trick on $z$ (Kingma and Welling, 2013; Rezende et al., 2014), and for $d \sim \mathrm{Multi}(\pi)$ we adopt the Gumbel-Softmax trick (Maddison et al., 2016; Jang et al., 2016).
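The two sampling tricks named in footnote 4 can be sketched with their standard formulations; the temperature value is an assumption:

```python
# Standard formulations of the two differentiable sampling tricks.
import torch
import torch.nn.functional as F

def reparameterize(mu, log_sigma):
    """z = mu + sigma * eps, eps ~ N(0, I): keeps sampling differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(log_sigma) * eps

def gumbel_softmax(logits, tau=0.5):
    """Differentiable approximation of sampling d ~ Multi(softmax(logits))."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-10) + 1e-10)
    return F.softmax((logits + gumbel) / tau, dim=-1)
```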

5http://trec.nist.gov/data/tweets/
6https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter.html
7https://developer.twitter.com/en/docs/tweets/search/api-reference/get-savedsearches-show-id

of the data to form the training, development, and test sets, respectively.

Data Preprocessing. We preprocessed the data with the following steps. First, non-English tweets were filtered out. Then, hashtags, mentions (@username), and links were replaced with the generic tags “HASH”, “MENT”, and “URL”, respectively. Next, the natural language toolkit (NLTK) was applied for tweet tokenization.8 After that, all letters were normalized to lower case. Finally, words occurring fewer than 20 times were filtered out from the data.
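A minimal sketch of these steps (excluding language filtering) follows; the exact regular expressions are illustrative assumptions, while the 20-count threshold mirrors the text:

```python
# Preprocessing sketch: tag replacement, NLTK tokenization, lowercasing,
# and rare-word filtering, as described above.
import re
from collections import Counter
from nltk.tokenize import TweetTokenizer

def preprocess(tweets, min_count=20):
    tok = TweetTokenizer()
    docs = []
    for t in tweets:
        t = re.sub(r"https?://\S+", "URL", t)   # links -> generic tag
        t = re.sub(r"@\w+", "MENT", t)          # mentions -> generic tag
        t = re.sub(r"#\w+", "HASH", t)          # hashtags -> generic tag
        docs.append([w.lower() for w in tok.tokenize(t)])
    counts = Counter(w for d in docs for w in d)
    return [[w for w in d if counts[w] >= min_count] for d in docs]
```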

Parameter Setting. To ensure results comparable with Li et al. (2018) (the prior work focusing on the same task as ours), in the topic coherence evaluation we follow their setup and report results under two settings of $K$ (the number of topics), $K = 50$ and $K = 100$, with the number of discourse roles ($D$) set to 10. The analysis of the effects of $K$ and $D$ is further presented in Section 5.5. All the other hyper-parameters were tuned on the development set by grid search. The trade-off parameter $\lambda$ (defined in Eq. 6), balancing the MI loss and the other objective functions, is set to 0.01. In model training, we use the Adam optimizer (Kingma and Ba, 2014) and run 100 epochs with an early stop strategy adopted.

Baselines. In the topic modeling experiments, we consider five topic model baselines treating each tweet as a document: LDA (Blei et al., 2003), BTM (Yan et al., 2013), LF-LDA, LF-DMM (Nguyen et al., 2015), and NTM (Miao et al., 2017). In particular, BTM and LF-DMM are state-of-the-art topic models for short texts. BTM explores the topics of all word pairs (biterms) in each message to alleviate data sparsity in short texts. LF-DMM incorporates word embeddings pre-trained on external data to expand the semantic meanings of words, as does LF-LDA. In Nguyen et al. (2015), LF-DMM, based on the one-topic-per-document Dirichlet Multinomial Mixture (DMM) (Nigam et al., 2000), was reported to perform better than LF-LDA, which is based on LDA. For LF-LDA and LF-DMM, we use GloVe Twitter embeddings (Pennington et al., 2014) as the pre-trained word embeddings.9

8https://www.nltk.org/
9https://nlp.stanford.edu/projects/glove/

For the discourse modeling experiments, we compare our results with LAED (Zhao et al., 2018), a VAE-based representation learning model for conversation discourse. In addition, for both the topic and discourse evaluations, we compare with Li et al. (2018), a recently proposed model for microblog conversations, where topics and discourse are jointly explored within a non-neural framework. Besides the existing models from previous studies, we also compare with the variants of our model that only model topics (henceforth TOPIC ONLY) or discourse (henceforth DISC ONLY).10 Our joint model of topics and discourse is referred to as TOPIC+DISC.

In preprocessing for the baselines, we removed stop words and punctuation for the topic models unable to learn discourse representations, following common practice in previous work (Yan et al., 2013; Miao et al., 2017). For the other models, stop words and punctuation were retained in the vocabulary considering their usefulness as discourse indicators (Li et al., 2018).

5 Experimental Results

In this section, we first report the topic coherence results in Section 5.1, followed by a discussion in Section 5.2 comparing the latent discourse roles discovered by our model with manually annotated dialogue acts. Then, we study whether we can capture useful representations for microblog messages in a tweet classification task (Section 5.3). A qualitative analysis, showing some example topics and discourse roles, is further provided in Section 5.4. Finally, in Section 5.5, we provide more discussion of our model.

5.1 Topic Coherence

For topic coherence, we adopt the $C_v$ scores measured via the open-source Palmetto toolkit as our evaluation metric.11 $C_v$ scores assume that the top $N$ words in a coherent topic (ranked by likelihood) tend to co-occur in the same document, and they have shown comparable evaluation results to human judgments (Röder et al., 2015). Table 2 shows the average $C_v$ scores over the produced topics given $N = 5$ and $N = 10$. The values

10In our ablation without the mutual information loss ($\mathcal{L}_{MI}$ defined in Eq. 5), topics and discourse are learned independently. Thus, its topic representation can be used as the output of TOPIC ONLY, and likewise its discourse representation for DISC ONLY.

11https://github.com/dice-group/Palmetto

Models             K = 50            K = 100
                   TREC    TWT16     TREC    TWT16
Baselines
LDA                0.467   0.454     0.467   0.454
BTM                0.460   0.461     0.466   0.463
LF-DMM             0.456   0.448     0.463   0.466
LF-LDA             0.470   0.456     0.467   0.453
NTM                0.478   0.479     0.482   0.443
Li et al. (2018)   0.463   0.433     0.464   0.435
Our models
TOPIC ONLY         0.478   0.482     0.481   0.471
TOPIC+DISC         0.485   0.487     0.496   0.480

Table 2: $C_v$ coherence scores for latent topics produced by different models. The best result in each column is highlighted in bold. Our joint model TOPIC+DISC achieves significantly better coherence scores than all the baselines (p < 0.01, paired t-test).

range from 0.0 to 1.0, and higher scores indicate better topic coherence. We can observe that:

• Models assuming a single topic for each message do not work well. It has long been pointed out that the one-topic-per-message assumption (each message contains only one topic) helps topic models alleviate the data sparsity issue in short texts on microblogs (Zhao et al., 2011; Quan et al., 2015; Nguyen et al., 2015; Li et al., 2018). However, we observe contradictory results, since both LF-DMM and Li et al. (2018), following this assumption, achieve generally worse performance than the other models. This might be attributed to the large-scale data used in our experiments (each dataset has over 250K messages, as shown in Table 1), which potentially provide richer word co-occurrence patterns and thus partially alleviate the data sparsity issue.
• Pre-trained word embeddings do not bring benefits. Comparing LF-LDA with LDA, we found that they give similar coherence scores. This shows that with sufficiently large training data, using pre-trained word embeddings or not makes little difference in the topic coherence results.
• Neural models perform better than non-neural baselines. When comparing the results of the neural models (NTM and our models) with the other baselines, we find the former yield topics with better coherence scores in most cases.
• Modeling topics in conversations is effective. Among the neural models, we found our models outperform NTM (which does not exploit conversation contexts). This shows that conversations provide useful context and enable more coherent

Models             Purity   Homogeneity   VI
Baselines
LAED               0.505    0.022         6.418
Li et al. (2018)   0.511    0.096         5.540
Our models
DISC ONLY          0.510    0.112         5.532
TOPIC+DISC         0.521    0.142         5.097

Table 3: The purity, homogeneity, and variation of information (VI) scores for the latent discourse roles measured against the human-annotated dialogue acts. For purity and homogeneity, higher scores indicate better performance, while for VI scores, lower is better. In each column, the best results are in boldface. Our joint model TOPIC+DISC significantly outperforms all the baselines (p < 0.01, paired t-test).

topics to be extracted from the entire conversation thread instead of a single short message.
• Modeling topics together with discourse helps produce more coherent topics. We can observe better results with the joint model TOPIC+DISC

in comparison with the variant considering topics only. This shows that TOPIC+DISC, via the joint modeling of topic- and discourse-word distributions (the latter reflecting non-topic information), can better separate topical words from non-topical ones, hence resulting in more coherent topics.

5.2 Discourse Interpretability

In this section, we evaluate whether our model can discover meaningful discourse representations. To this end, we train the comparison models for discourse modeling on the TREC dataset and test the learned latent discourse on a benchmark dataset released by Cerisara et al. (2018). The benchmark dataset consists of 2,217 microblog messages forming 505 conversations collected from Mastodon12, a microblog platform exhibiting Twitter-like user behavior (Cerisara et al., 2018). For each message, there is a human-assigned discourse label, selected from one of 15 dialogue acts, such as question, answer, disagreement, etc.

For discourse evaluation, we measure whether the model-produced discourse assignments are consistent with the human-annotated dialogue acts. Following Zhao et al. (2018), we assume that an interpretable latent discourse role should cluster messages labeled with the same dialogue act. Therefore, we adopt purity

12https://mastodon.social

Figure 3: A heatmap showing the alignment of the latent discourse roles (indexed from 1 to 15 on the horizontal axis) and the human-annotated dialogue act labels. Each row visualizes the distribution of messages with the corresponding dialogue act label over the latent discourse roles, where darker colors indicate higher values. Dialogue act labels: I: statement, D: disagreement, S: suggest, A: agreement, Q: yes/no question, O: wh*/open question, W: open+choice answer, H: initial greetings, T: thanking, R: request, M: sympathy, V: explicit performance, J: exclamation, F: acknowledge, and E: offer.

(Manning et al., 2008), homogeneity (Rosenberg and Hirschberg, 2007), and variation of information (VI) (Meila, 2003; Goldwater and Griffiths, 2007) as our automatic evaluation metrics. Here, we set $D = 15$ so that the number of latent discourse roles equals the number of manually labeled dialogue acts. Table 3 shows the comparison results as average scores over the 15 latent discourse roles. Higher values indicate better performance for purity and homogeneity, while for VI, lower is better.
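The three metrics can be sketched as follows, assuming integer arrays of gold dialogue-act labels and model-assigned discourse roles; sklearn's homogeneity_score covers homogeneity directly, while purity and VI are computed from the contingency matrix using their standard definitions:

```python
# Clustering metrics sketch: purity, homogeneity, variation of information.
import numpy as np
from sklearn.metrics import homogeneity_score
from sklearn.metrics.cluster import contingency_matrix

def purity(gold, assigned):
    m = contingency_matrix(gold, assigned)
    return m.max(axis=0).sum() / m.sum()   # dominant gold label per cluster

def variation_of_information(gold, assigned):
    m = contingency_matrix(gold, assigned).astype(float)
    p_xy = m / m.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    h_y = -np.sum(p_y[p_y > 0] * np.log(p_y[p_y > 0]))
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))
    return h_x + h_y - 2 * mi               # VI = H(X) + H(Y) - 2 I(X;Y)

# usage: purity(gold, roles), homogeneity_score(gold, roles),
#        variation_of_information(gold, roles)
```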

It can be observed that our models exhibit generally better performance, showing the effectiveness of our framework in inducing interpretable discourse roles. In particular, we observe the best results from our joint model TOPIC+DISC, which learns to distinguish topic and discourse words, an ability important for recognizing the indicative words that reflect latent discourse.

To further analyze the consistency of the varying latent discourse roles (produced by our TOPIC+DISC model) with the human-labeled dialogue acts, Figure 3 displays a heatmap, where each row visualizes how the messages with a given dialogue act distribute over the discourse roles.

Models      TREC              TWT16
            Acc     Avg F1    Acc     Avg F1
Baselines
BoW         0.120   0.026     0.132   0.030
TF-IDF      0.116   0.024     0.153   0.041
LDA         0.128   0.041     0.146   0.046
BTM         0.123   0.035     0.167   0.054
LF-DMM      0.158   0.072     0.162   0.052
NTM         0.138   0.042     0.186   0.068
Our model   0.259   0.180     0.341   0.269

Table 4: Evaluation of tweet classification results in accuracy (Acc) and average F1 (Avg F1). Representations learned by various models serve as the classification features. For our model, both the topic and discourse representations are fed into the classifier.

It is seen that among all dialogue acts, our model discovers more interpretable latent discourse for “greetings”, “thanking”, “exclamation”, and “offer”, where most messages are clustered into one or two dominant discourse roles. This may be because these dialogue acts are relatively easy to detect based on their associated indicative words, such as the word “thanks” for “thanking” and the word “wow” for “exclamation”.

5.3 Message Representations

To further evaluate our model's ability to capture effective representations for microblog messages, we take tweet classification as an example and test the classification performance with the topic and discourse representations as features. Here the user-generated hashtags capturing the topics of online messages are used as proxy class labels (Li et al., 2016b; Zeng et al., 2018a). We construct the classification dataset from TREC and TWT16 with the following steps. First, we removed the tweets without hashtags. Second, we ranked hashtags by their frequencies. Third, we manually removed the hashtags that are not topic-related (e.g., “#fb” for indicating the source of tweets from Facebook) and combined the hashtags referring to the same topic (e.g., “#DonaldTrump” and “#Trump”). Finally, we selected the 50 most frequent hashtags and all tweets containing these hashtags as our classification dataset. Here, we simply use support vector machines (SVMs) as the classifier, since our focus is to compare the representations learned by various models. Li et al. (2018) is unable to produce vector representations at the tweet level and is hence not considered here.
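A sketch of this classification setup follows. The variable names and the held-out split are illustrative assumptions; `features` would hold each tweet's concatenated topic and discourse representations and `labels` its hashtag class:

```python
# SVM classification over learned representations, as described above.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import LinearSVC

def classify(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.1, random_state=42)
    clf = LinearSVC().fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return accuracy_score(y_te, pred), f1_score(y_te, pred, average="macro")
```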

Table 4 shows the classification results of

LDA: *people, trump, police, violence, gun, death, protest, guns, †flag, shot
BTM: gun, guns, *people, police, wrong, right, *think, law, agree, black
LF-DMM: gun, police, black, *said, *people, guns, killing, ppl, amendment, laws
Li et al. (2018): wrong, don, trump, gun, *understand, laws, agree, guns, *doesn, *make
NTM: gun, *understand, *yes, guns, world, dead, *real, discrimination, trump, silence
TOPIC ONLY: shootings, gun, guns, cops, charges, control, *mass, commit, *know, agreed
TOPIC+DISC: guns, gun, shootings, chicago, shooting, cops, firearm, criminals, commit, laws

Table 5: Top 10 representative words of example latent topics discovered from the TWT16 dataset. We interpret the topics as “gun control” from the displayed words. Non-topic words are marked with * (wave-underlined and in blue in the original), while off-topic words are marked with † (underlined and in red).

accuracy and average F1 on the two datasets, with the representations learned by various models serving as the classification features. We observe that our model outperforms the other models by a large margin. The possible reasons are twofold. First, our model derives topics from conversation threads and thus potentially yields better message representations. Second, the discourse representations (only produced by our model) are indicative features for hashtags, because users exhibit various discourse behaviors when discussing diverse topics (hashtags). For instance, we observe prominent “argument” discourse in tweets with “#Trump” and “#Hillary”, attributed to the controversial opinions on the two candidates in the 2016 U.S. presidential election.

5.4 Example Topics and Discourse Roles

We have shown that joint modeling of topics and discourse presents superior performance on quantitative measures. In this section, we qualitatively analyze the interpretability of our outputs by examining the word distributions of some example topics and discourse roles.

Example Topics. Table 5 lists the top 10 words of some example latent topics discovered by various models from the TWT16 dataset. According to the words shown, we can interpret the extracted topics as “gun control” — discussion about gun law and the failure of gun control in Chicago. We observe that LDA wrongly includes the off-topic word “flag”. In the outputs of BTM, LF-DMM, Li et al. (2018), and our TOPIC ONLY variant,

Table 6: Top 10 representative words of example discourse roles learned from TREC and TWT16. The discourse roles of the word clusters are manually assigned according to their associated words.

though we do not find off-topic words, there are some non-topic words, such as “said” and “understand”.13 The output of our TOPIC+DISC model appears to be the most coherent, with words such as “firearm” and “criminals” included, which are clearly relevant to “gun control”. Such results indicate the benefit of examining the conversation contexts and jointly exploring topics and discourse in them.

Example Discourse Roles. To qualitatively analyze whether our TOPIC+DISC model can discover interpretable discourse roles, we select the top 10 words from the distributions of some example discourse roles and list them in Table 6. It can be observed that some meaningful word clusters reflecting varying discourse roles are found without any supervision. Interestingly, the latent discourse roles from TREC and TWT16, though learned separately, exhibit notable overlap in their associated top 10 words, particularly for “question” and “statement”. We also note that “argument” is represented by very different words. The reason is that TWT16 contains a large volume of arguments centered around candidates Clinton and Trump, resulting in the frequent appearance of words like “he” and “she”.

5.5 Further Discussions

In this section, we present further discussion of our joint model, TOPIC+DISC.

13Non-topic words do not clearly indicate the corresponding topic, while off-topic words are more likely to appear in other topics.

Figure 4: (a) The impact of the number of topics. Horizontal axis: the number of topics; vertical axis: the $C_v$ topic coherence. (b) The impact of the number of discourse roles. Horizontal axis: the number of discourse roles; vertical axis: the homogeneity measure.

Parameter Analysis. Here we study the two important hyper-parameters in our model: the number of topics ($K$) and the number of discourse roles ($D$). In Figure 4, we show the $C_v$ topic coherence given varying $K$ in (a) and the homogeneity measure given varying $D$ in (b). As can be seen, the curves for the performance on topics and discourse are not monotonic. In particular, better topic coherence scores are achieved with relatively larger topic numbers for TREC, with the best result observed at $K = 80$. On the contrary, the optimal topic number for TWT16 is $K = 20$, and increasing the number of topics results in worse $C_v$ scores in general. This may be attributed to the relatively centralized topic concerning the U.S. election in the TWT16 corpus. For discourse homogeneity, the best result is achieved at $D = 15$, the same as the number of manually annotated dialogue acts in the benchmark.

Case Study. To further understand why our model learns meaningful representations for topics and discourse, we present a case study based on the example conversation shown in Figure 1. Specifically, we visualize the topic words (with $p(w \mid z) > p(w \mid d)$) in red and the rest of the words, indicating discourse, in blue. Darker red indicates a higher topic likelihood ($p(w \mid z)$), while darker blue shows a higher discourse likelihood ($p(w \mid d)$). The results are shown in Figure 5. We can observe that topic and discourse words are well separated by our model, which explains why it can generate high-quality representations for both topics and discourse.
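The word assignment behind this visualization can be sketched directly from the comparison rule above; reading the two word likelihoods off a trained decoder is an assumption for illustration:

```python
# Topic/discourse word assignment for the case study: a word counts as a
# topic word when p(w|z) > p(w|d); the shading strength is a heuristic.
def assign_words(words, p_w_given_z, p_w_given_d):
    """Label each word 'topic' or 'discourse' with a confidence score."""
    out = []
    for w, pz, pd in zip(words, p_w_given_z, p_w_given_d):
        role = "topic" if pz > pd else "discourse"
        conf = abs(pz - pd) / max(pz + pd, 1e-10)  # darker = more confident
        out.append((w, role, conf))
    return out
```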

Model Extensibility. Recall that in the Introduction, we mentioned that our neural-based model has the advantage of being easily combined with other neural network architectures, allowing for the joint training of both models. Here we take

Figure 5: Visualization of the topic-discourse assignment of a Twitter conversation from TWT16. The annotated blue words are prone to be discourse words, and the red ones topic words. The shading indicates the confidence of the assignment.

message classification (with the setup in Section 5.3) as an example, and study whether jointly training our model with a convolutional neural network (CNN) (Kim, 2014), the widely used model for short text classification, can benefit the classification performance. We set the embedding dimension to 200, with random initialization. The results are shown in Table 7, where we observe that jointly training our model and the classifier successfully boosts the classification performance.
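One way to wire the two models together is sketched below, reusing the `encode` method from the earlier VAE sketch; concatenating the CNN features with the topic and discourse representations before the output layer is an illustrative design choice, not necessarily the authors' exact architecture:

```python
# Joint-training sketch: a CNN classifier combined with the topic/discourse
# model of Section 3, trained end-to-end.
import torch
import torch.nn as nn

class JointClassifier(nn.Module):
    def __init__(self, cnn, topic_disc_vae, feat_dim, n_classes):
        super().__init__()
        self.cnn = cnn                 # e.g., a Kim (2014)-style CNN
        self.vae = topic_disc_vae      # the model sketched in Section 3
        self.out = nn.Linear(feat_dim, n_classes)

    def forward(self, tokens, x_bow, c_bow):
        mu, log_sigma, pi = self.vae.encode(x_bow, c_bow)
        feats = torch.cat([self.cnn(tokens), mu, pi], dim=-1)
        return self.out(feats)  # trained jointly with the VAE objective
```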

Error Analysis. We further analyze the errors in our outputs. For topics, taking a closer look at their word distributions, we found that our model sometimes mixes sentiment words with topic words. For example, among the top 10 words of the topic “win people illegal americans hate lt racism social tax wrong”, the words “hate” and “wrong” express sentiment rather than conveying topic-related information. This is due to the prominent co-occurrence of topic words and sentiment words in our data, which results in similar distributions for topics and sentiment. Future work could focus on the further separation of sentiment and topic words.

For discourse, we found that our model can induce some discourse roles beyond the 15 manually defined dialogue acts in the Mastodon dataset (Cerisara et al., 2018). For example, as shown in Table 6, our model discovers the “quotation” discourse role in both TREC and TWT16, which is however not defined in the Mastodon dataset. This perhaps should not be considered an error. We argue that it is not sensible to pre-define a fixed set of dialogue acts for diverse microblog

Models           TREC              TWT16
                 Acc     Avg F1    Acc     Avg F1
CNN only         0.199   0.167     0.334   0.311
Separate-Train   0.284   0.270     0.391   0.390
Joint-Train      0.297   0.286     0.428   0.413

Table 7: Accuracy (Acc) and average F1 (Avg F1) on tweet classification (hashtags as labels). CNN only: CNN without using our representations. Separate-Train: CNN fed with our pre-trained representations. Joint-Train: joint training of the CNN and our model.

conversations, due to the rapid change and wide variety of user behaviors on social media. Therefore, future work should involve a better alternative for evaluating the latent discourse without relying on manually defined dialogue acts. We also notice that our model sometimes fails to identify discourse behaviors requiring more in-depth semantic understanding, such as sarcasm, irony, and humor. This is because our model detects latent discourse purely based on the observed words, while the detection of sarcasm, irony, or humor requires deeper language understanding, which is beyond the capacity of our model.

6 Conclusion

We have presented a neural framework that jointly explores topics and discourse in microblog conversations. Our model, in an unsupervised manner, examines the conversation contexts and discovers word distributions that reflect latent topics and discourse roles. Results from extensive experiments show that our model can generate coherent topics and meaningful discourse roles. In addition, our model can be easily combined with other neural network architectures (such as CNN) and allows for joint training, which has yielded better message classification results compared to the pipeline approach without joint training.

Acknowledgements

This work is partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14208815 and No. CUHK 14210717 of the General Research Fund), Innovate UK (grant No. 103652), and Microsoft Research Asia (2018 Microsoft Research Asia Collaborative Research Award). We thank Shuming Shi, Dong Yu, and the TACL reviewers and editors for their insightful suggestions on various aspects of this work.

References

Stergos D. Afantenos, Eric Kow, Nicholas Asher, and Jérémy Perret. 2015. Discourse parsing for multi-party chat dialogues. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, September 17-21, 2015, pages 928–937, Lisbon, Portugal.

David Alvarez-Melis and Martin Saveski. 2016. Topic modeling in Twitter: Aggregating tweets by conversations. In Proceedings of the Tenth International Conference on Web and Social Media, May 17-20, 2016, pages 519–522, Cologne, Germany.

Eytan Bakshy, Solomon Messing, and Lada A. Adamic. 2015. Exposure to ideologically diverse news and opinion on Facebook. Science, 348(6239):1130–1132.

David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2003. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, pages 17–24, Vancouver and Whistler, British Columbia, Canada.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2016. Variational inference: A review for statisticians. CoRR, abs/1601.00670.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2001. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001], pages 601–608, Vancouver, British Columbia, Canada.

Christophe Cerisara, Somayeh Jafaritazehjani, Adedayo Oluokun, and Hoa T. Le. 2018. Multi-task dialog act and sentiment recognition on Mastodon. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, August 20-26, 2018, pages 745–754, Santa Fe, New Mexico, USA.

Micha Elsner and Eugene Charniak. 2008. You talking to me? A corpus and algorithm for conversation disentanglement. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, June 15-20, pages 834–842, Columbus, Ohio, USA.

Micha Elsner and Eugene Charniak. 2010. Disentangling chat. Computational Linguistics, 36(3):389–409.

Sharon Goldwater and Thomas L. Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, pages 50–57, Berkeley, CA, USA.

Liangjie Hong and Brian D. Davison. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the 3rd Workshop on Social Network Mining and Analysis, SNAKDD 2009, June 28, pages 80–88, Paris, France.

Zhiting Hu, Gang Luo, Mrinmaya Sachan, Eric P. Xing, and Zaiqing Nie. 2016. Grounding topic models with knowledge bases. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, 9-15 July 2016, pages 1578–1584, New York, NY, USA.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-Softmax. CoRR, abs/1611.01144.

Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. 2016. A latent variable recurrent neural network for discourse-driven language models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016, June 12-17, pages 332–342, San Diego, California, USA.

Yunhao Jiao, Cheng Li, Fei Wu, and Qiaozhu Mei. 2018. Find the conversation killers: A predictive study of thread-ending posts. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, April 23-27, pages 1145–1154, Lyon, France.

Shafiq R. Joty, Giuseppe Carenini, and Chin-Yew Lin. 2011. Unsupervised modeling of dialog acts in asynchronous conversations. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI 2011, July 16-22, pages 1807–1813, Barcelona, Catalonia, Spain.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751, Doha, Qatar.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. CoRR, abs/1312.6114.

Chenliang Li, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2016a. Topic modeling for short texts with auxiliary word embeddings. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, July 17-21, 2016, pages 165–174, Pisa, Italy.

Jing Li, Ming Liao, Wei Gao, Yulan He, and Kam-Fai Wong. 2016b. Topic extraction from microblog posts using conversation structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Volume 1: Long Papers, Berlin, Germany.

Jing Li, Yan Song, Zhongyu Wei, and Kam-Fai Wong. 2018. A joint model of conversational discourse and latent topics on microblogs. Computational Linguistics, 44(4):719–754.

Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard H. Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, September 17-21, 2015, pages 2304–2314, Lisbon, Portugal.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Rishabh Mehrotra, Scott Sanner, Wray L. Buntine, and Lexing Xie. 2013. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In The 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, July 28 - August 1, pages 889–892, Dublin, Ireland.

Marina Meila. 2003. Comparing clusterings by the variation of information. In Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel, August 24-27, 2003, Proceedings, pages 173–187, Washington, DC, USA.

Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 6-11 August 2017, pages 2410–2419, Sydney, NSW, Australia.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML 2010, pages 807–814, Haifa, Israel.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, TACL, 3:299–313.

Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543, Doha, Qatar.

Kechen Qin, Lu Wang, and Joseph Kim. 2017. Joint modeling of content and discourse relations in dialogues. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, July 30 - August 4, Volume 1: Long Papers, pages 974–984, Vancouver, Canada.

Xiaojun Quan, Chunyu Kit, Yong Ge, and Sinno Jialin Pan. 2015. Short and sparse text topic modeling via self-aggregation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, July 25-31, pages 2270–2276, Buenos Aires, Argentina.

Daniel Ramage, Susan T. Dumais, and Daniel J. Liebling. 2010. Characterizing microblogs with topic models. In Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM 2010, May 23-26, Washington, DC, USA.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 21-26 June 2014, pages 1278–1286, Beijing, China.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, pages 172–180, Los Angeles, California, USA.

Michael Röder, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, pages 399–408, Shanghai, China.

Michal Rosen-Zvi, Thomas L. Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In UAI '04, Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence, July 7-11, 2004, pages 487–494, Banff, Canada.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007, June 28-30, pages 410–420, Prague, Czech Republic.

Bei Shi, Wai Lam, Shoaib Jameel, Steven Schockaert, and Kwun Ping Lai. 2017. Jointly learning word embeddings and latent topics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, August 7-11, 2017, pages 375–384, Shinjuku, Tokyo, Japan.

Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI 2011, July 16-22, pages 2330–2336, Barcelona, Catalonia, Spain.

Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In Proceedings of the Fifth International Conference on Learning Representations, ICLR 2017, 24-26 April 2017, Toulon, France.

Andreas Stolcke, Noah Coccaro, Rebecca Bates, Paul Taylor, Carol Van Ess-Dykema, Klaus Ries, Elizabeth Shriberg, Daniel Jurafsky, Rachel Martin, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3).

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. In 22nd International World Wide Web Conference, WWW '13, May 13-17, 2013, pages 1445–1456, Rio de Janeiro, Brazil.

Yi Yang, Doug Downey, and Jordan L. Boyd-Graber. 2015. Efficient methods for incorporating knowledge into topic models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, September 17-21, 2015, pages 308–317, Lisbon, Portugal.

Elina Zarisheva and Tatjana Scheffler. 2015. Dialog act annotation for Twitter conversations. In Proceedings of the SIGDIAL 2015 Conference, The 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2-4 September 2015, pages 114–123, Prague, Czech Republic.

Jichuan Zeng, Jing Li, Yan Song, Cuiyun Gao, Michael R. Lyu, and Irwin King. 2018a. Topic memory networks for short text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, October 31 - November 4, 2018, Brussels, Belgium.

Xingshan Zeng, Jing Li, Lu Wang, Nicholas Beauchamp, Sarah Shugars, and Kam-Fai Wong. 2018b. Microblog conversation recommendation via joint modeling of topics and discourse. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, June 1-6, 2018, Volume 1 (Long Papers), pages 375–385, New Orleans, Louisiana, USA.

Tiancheng Zhao, Kyusong Lee, and Maxine Eskénazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, July 15-20, 2018, Volume 1: Long Papers, pages 1098–1107, Melbourne, Australia.

Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, July 30 - August 4, Volume 1: Long Papers, pages 654–664, Vancouver, Canada.

Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing Twitter and traditional media using topic models. In Advances in Information Retrieval - 33rd European Conference on IR Research, ECIR 2011, April 18-21, 2011, Proceedings, pages 338–349, Dublin, Ireland.