
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2190–2196, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization

Manling Li 1, Lingyu Zhang 2, Heng Ji 1, Richard J. Radke 2
1 Department of Computer Science, 2 Department of Electrical, Computer, and Systems Engineering
Rensselaer Polytechnic Institute
1 {lim22,jih}@rpi.edu, 2 {[email protected], [email protected]}

Abstract

Transcripts of natural, multi-person meetings differ significantly from documents like news articles, which can make Natural Language Generation models produce unfocused summaries. We develop an abstractive meeting summarizer that uses both the video and audio of meeting recordings. Specifically, we propose a multi-modal hierarchical attention mechanism across three levels: topic segment, utterance, and word. To narrow the focus to topically relevant segments, we jointly model topic segmentation and summarization. In addition to traditional textual features, we introduce new multi-modal features derived from visual focus of attention, based on the assumption that an utterance is more important if its speaker receives more attention. Experiments show that our model significantly outperforms the state of the art on both BLEU and ROUGE measures.

1 Introduction

Automatic meeting summarization is valuable, especially if it takes advantage of multi-modal sensing of the meeting environment, such as microphones to capture speech and cameras to capture each participant's head pose and eye gaze. Traditional extractive summarization methods, which select and reorder salient words, tend to produce summaries that are unnatural and incoherent. Although the state-of-the-art approach (Shang et al., 2018) employs WordNet (Miller, 1995) to make summaries more abstractive, the quality is still far from that of human-written summaries, as shown in Table 1. Moreover, by selecting salient words, these methods tend to have limited content coverage.

On the other hand, recent years have witnessed the success of Natural Language Generation (NLG) models in generating abstractive summaries. Since human-written summaries tend to mention the given keywords exactly, without paraphrasing, the copy mechanism of the Pointer-Generator Network (PGN) (See et al., 2017) naturally fits this task: apart from generating words from a fixed vocabulary, it also copies words from the input. However, transcripts of multi-person meetings differ widely from traditional documents. Instead of grammatical, well-segmented sentences, the input is often composed of ill-formed utterances, so NLG models can easily lose focus. For example, in Table 1, PGN fails to capture the keywords remote control, trendy, and user-friendly.

Therefore, we propose a multi-modal hierarchical attention mechanism across topic segments, utterances, and words. We learn topic segmentation as an auxiliary task and limit the attention within each segment. Our approach mimics how humans summarize: segment first, then summarize each segment. To locate key utterances, we propose that the rich multi-modal data from recording the meeting environment, especially cameras facing each participant, can capture speaker interaction and participant feedback that reveal salient utterances. One typical interaction is Visual Focus Of Attention (VFOA), i.e., the target that each participant looks at at every timestamp. Possible VFOA targets include the other participants, the table, etc. We estimate VFOA from each participant's head orientation and eye gaze. The longer a speaker holds the attention of the other participants, the more likely that speaker's utterance is important. For example, in Table 1, the high VFOA received by the speaker during the last two sentences helps our model retain the bold keywords.

2 Method

As shown in Figure 1, our meeting data consists of synchronized videos of each participant in a group meeting, as well as a time-stamped transcript of the utterances generated by Automatic Speech Recognition (ASR) tools (e.g., IBM Watson's Speech to Text system, https://www.ibm.com/watson/services/speech-to-text/).


Transcript:
"Um I'm Sarah, the Project Manager, and this is our first meeting, surprisingly enough. Okay, this is our agenda, um, we will do some stuff, get to know each other a bit better to feel more comfortable with each other. Um then we'll go do tool training, talk about the project plan, discuss our own ideas and everything, um, and we've got twenty five minutes to do that, as far as I can understand. Now, we're developing a remote control, which you probably already know. Um, we want it to be original, something that's uh people haven't thought of, that's not out in the shops, um, trendy, appealing to a wide market, but you know, not a hunk of metal, and user-friendly, grannies to kids, maybe even pooches should be able to use it."

Manual summary:
The project manager gave an introduction to the goal of the project, to create a trendy yet user-friendly remote.

Extractive summary (Shang et al., 2018):
hunk of metal and user-friendly granny's to kids.

Abstractive summary (See et al., 2017):
The project manager opened the meeting and introduced the upcoming project to the team members.

Our Approach:
The project manager opens the meeting. The project manager states the goal of the project, which is to develop a remote control. It should be original, trendy, and user-friendly.

Table 1: Comparison of Human and System Generated Summaries. The color indicates the received VFOA, i.e., the attention received by the speaker PM (Project Manager) from the other participants: ME (Marketing Expert), ID (Industrial Designer), UI (User Interface).

Figure 1: Multi-modal Meeting Summarization Framework

We formulate a meeting transcript as a list of triples $X = \{(p_i, f_i, u_i)\}$, where $p_i \in P$ is the speaker of utterance $u_i$ and $P$ denotes the set of participants, $f_i$ contains the VFOA target sequence of each participant over the course of utterance $u_i$, and each utterance $u_i$ is a sequence of words $u_i = \{w_{i0}, w_{i1}, \ldots\}$. The output of our model is a summary $Y$ and the segment ending boundaries $S$. The training instances for the generator are provided in the form $T_{train} = \{(X, Y, S)\}$, while the testing instances contain only the transcripts, $T_{test} = \{X\}$.
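To make this formulation concrete, the following is a minimal Python sketch of the data structures; the class and field names are illustrative assumptions, not taken from any released code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UtteranceTriple:
    """One element (p_i, f_i, u_i) of a meeting transcript X."""
    speaker: str                # p_i, a member of the participant set P
    vfoa: List[List[str]]       # f_i: per-participant VFOA target sequence over the utterance
    words: List[str]            # u_i = {w_i0, w_i1, ...}

@dataclass
class MeetingInstance:
    """A training instance (X, Y, S); at test time only the transcript X is given."""
    transcript: List[UtteranceTriple]          # X
    summary: Optional[List[str]] = None        # Y: reference summary tokens (training only)
    segment_ends: Optional[List[int]] = None   # S: indices of segment-ending utterances (training only)
```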

2.1 Visual Focus of Attention Estimation

Given the recorded video of each individual, we estimate VFOA for every frame based on each participant's head orientation and eye gaze. The VFOA targets are $F = \{p_0, \ldots, p_{|P|}, \textit{table}, \textit{whiteboard}, \textit{projection screen}, \textit{unknown}\}$.

Figure 2: VFOA Detector Framework (OpenFace feature extraction of eye gaze direction and head pose angle, followed by the VFOA target detector)

As illustrated in Figure 2, we feed each input color image into the OpenFace tool (Baltrusaitis et al., 2018) to estimate the head pose angle (roll, pitch, and yaw) and the eye gaze direction vector (azimuth and elevation), and concatenate them into a 5-dimensional feature vector.


To obtain the actual visual target from the head pose and eye gaze estimates, we build a seven-layer network that outputs a one-hot vector indicating the most likely visual target at the current frame, where each dimension corresponds to a VFOA target. The network is trained on the VFOA annotations, which give the VFOA target of each participant for each frame.

The outputs of all participants are then concatenated. For utterance $u_i$, the VFOA vector $\boldsymbol{f}_i \in \mathbb{R}^{|P| \cdot |F|}$ is the sum of the per-frame VFOA outputs over the course of $u_i$, so each dimension gives the total duration of attention paid to the corresponding VFOA target.
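The following is a rough PyTorch sketch of this component. The paper specifies only the 5-dimensional input, the seven-layer network, the per-frame one-hot output, and the per-utterance summation; the hidden size, the exact layer composition, and the target inventory size below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TARGETS = 8  # assumed: 4 participants + table, whiteboard, projection screen, unknown

class VFOADetector(nn.Module):
    """Maps a 5-d frame feature (roll, pitch, yaw, gaze azimuth, gaze elevation)
    to a score over VFOA targets. Depth follows the seven-layer description;
    the hidden size is an assumption."""
    def __init__(self, hidden=64, num_targets=NUM_TARGETS):
        super().__init__()
        layers, dim = [], 5
        for _ in range(6):                       # six hidden layers + output layer = seven
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers.append(nn.Linear(dim, num_targets))
        self.net = nn.Sequential(*layers)

    def forward(self, frame_feats):              # (num_frames, 5)
        return self.net(frame_feats)             # (num_frames, num_targets) logits

def utterance_vfoa_vector(per_participant_frames, detector):
    """Builds f_i in R^{|P|*|F|}: for each participant, sum the per-frame one-hot
    VFOA predictions over the utterance, then concatenate across participants."""
    chunks = []
    for frames in per_participant_frames:         # list of (num_frames, 5) tensors
        one_hot = F.one_hot(detector(frames).argmax(dim=-1),
                            num_classes=NUM_TARGETS).float()
        chunks.append(one_hot.sum(dim=0))         # total frames spent on each target
    return torch.cat(chunks)                      # shape: |P| * |F|
```

The resulting vector plays the role of $\boldsymbol{f}_i$ in both the segmentation and summarization attention described below.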

2.2 Meeting Transcript Encoder

For an utterance $u_i = \{w_{i0}, w_{i1}, \ldots\}$, we embed each word $w_{ij}$ using pretrained GloVe vectors (Pennington et al., 2014) and apply a bidirectional gated recurrent unit (GRU) (Cho et al., 2014) to obtain the encoded word representation $\boldsymbol{h}_{ij}$. The utterance representation is the average of its word representations. Additionally, the speaker $p_i$ is encoded as a one-hot vector $\boldsymbol{p}_i \in \mathbb{R}^{|P|}$.
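A minimal PyTorch sketch of this encoder is given below, assuming frozen 300-dimensional GloVe embeddings and an arbitrary hidden size; padding handling is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranscriptEncoder(nn.Module):
    """Encodes each utterance with a bidirectional GRU over frozen GloVe embeddings
    and averages the word states; dimensions are assumptions."""
    def __init__(self, glove_weights, hidden=256):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.gru = nn.GRU(glove_weights.size(1), hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (num_utterances, max_len) padded word indices for one meeting
        emb = self.embed(word_ids)             # (U, L, 300)
        word_states, _ = self.gru(emb)         # h_ij: (U, L, 2*hidden)
        utt_reprs = word_states.mean(dim=1)    # utterance = average of word states
        return word_states, utt_reprs

def speaker_one_hot(speaker_index, num_participants):
    """p_i in R^{|P|}: one-hot encoding of the utterance's speaker."""
    return F.one_hot(torch.tensor(speaker_index),
                     num_classes=num_participants).float()
```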

2.3 Topic Segmentation Decoder

We divide the input sequence into contiguous segments based on SegBot (Li et al., 2018). At each decoding step, the decoder takes the starting utterance of a segment as input and outputs the ending utterance of that segment. Taking Figure 3 as an example, there are 5 utterances in the transcript. The initial starting utterance is $u_0$, with possible ending positions from $u_0$ to $u_4$; if $u_2$ is detected as the ending utterance, then $u_3$ is the next starting utterance and is input to the decoder, with possible ending positions from $u_3$ to $u_4$.

We extend SegBot to obtain the distribution over possible positions $j \in \{i, i+1, \ldots\}$ using a multi-modal segmentation attention:

$$\alpha^{seg}_{ij} = \boldsymbol{v}_s^\top \tanh(\boldsymbol{W}_u \boldsymbol{d}_i + \boldsymbol{W}_h \boldsymbol{h}_j + \boldsymbol{W}_p \boldsymbol{p}_j + \boldsymbol{W}_f \boldsymbol{f}_j),$$

where $\boldsymbol{d}_i$ is the decoder hidden state for the starting utterance $u_i$. Let $s_i$ denote the ending utterance of the segment that starts with utterance $u_i$; the probability that $u_j$ is the ending utterance $s_i$ is:

$$P(s_i = u_j \mid (p_i, \boldsymbol{f}_i, \boldsymbol{u}_i)) = \frac{\exp \alpha^{seg}_{ij}}{\sum_{k \in \{i, i+1, \ldots\}} \exp \alpha^{seg}_{ik}}.$$

Figure 3: Topic Segmentation Decoder
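The segmentation attention just defined can be sketched as follows; the decoder recurrence and the rest of the SegBot machinery are omitted, and the projection sizes are assumptions.

```python
import torch
import torch.nn as nn

class SegmentationAttention(nn.Module):
    """Scores candidate ending positions j >= i for a segment starting at utterance i:
    alpha_ij = v_s^T tanh(W_u d_i + W_h h_j + W_p p_j + W_f f_j)."""
    def __init__(self, dec_dim, utt_dim, num_participants, vfoa_dim, attn_dim=128):
        super().__init__()
        self.W_u = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(utt_dim, attn_dim, bias=False)
        self.W_p = nn.Linear(num_participants, attn_dim, bias=False)
        self.W_f = nn.Linear(vfoa_dim, attn_dim, bias=False)
        self.v_s = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, d_i, cand_utts, cand_speakers, cand_vfoa):
        # d_i: (dec_dim,) decoder state for the starting utterance u_i
        # cand_*: (num_candidates, *) features of the candidate utterances u_i, u_{i+1}, ...
        scores = self.v_s(torch.tanh(self.W_u(d_i) + self.W_h(cand_utts)
                                     + self.W_p(cand_speakers)
                                     + self.W_f(cand_vfoa))).squeeze(-1)
        return torch.softmax(scores, dim=-1)   # P(s_i = u_j | ...) over the candidates
```

At inference, the utterance after the predicted ending position becomes the next starting utterance, as described above.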

2.4 Meeting Summarization Decoder

We build our decoder on the Pointer-Generator Network (PGN) (See et al., 2017), which copies words from the input transcript according to the attention distribution. Different from PGN, we introduce a hierarchical attention mechanism based on the topic segmentation results, as shown in Figure 4.

Figure 4: Hierarchical Attention in Summary Decoder

As VFOA has close ties to salient utterances, we use the VFOA received by the speaker, $\boldsymbol{f}_k^\top \boldsymbol{p}'_k$, to capture the importance of utterance $u_k$, where $\boldsymbol{p}'_k$ is an indicator vector marking which dimensions of the VFOA vector correspond to the speaker $p_k$. Formally, we use a GRU to obtain the decoder hidden state $\boldsymbol{d}_i$ at the $i$-th decoding step. The Utterance2Word attention on word $w_j$ of utterance $u_k$ is:

$$e_{ij} = \boldsymbol{v}_1^\top \tanh(\boldsymbol{W}_{d1} \boldsymbol{d}_i + \boldsymbol{W}_w \boldsymbol{w}_j + \boldsymbol{W}_p \boldsymbol{p}_j + \boldsymbol{W}_f \boldsymbol{f}_j).$$

The context representation for utterance $u_k$ is $\boldsymbol{u}_{ik} = \sum_{w_j \in u_k} \mathrm{Softmax}(e_{ij})\, \boldsymbol{w}_j$. The Segment2Utterance attention on utterance $u_k$ of the input transcript is:

$$e'_{ik} = \boldsymbol{f}_k^\top \boldsymbol{p}'_k \left( \boldsymbol{v}_2^\top \tanh(\boldsymbol{W}_{d2} \boldsymbol{d}_i + \boldsymbol{W}_u \boldsymbol{u}_{ik}) \right).$$


Model                             ROUGE 1   ROUGE 2   ROUGE L   BLEU 1   BLEU 2   BLEU 3   BLEU 4
CoreRank (Shang et al., 2018)       37.86      7.84     13.72    17.17     6.78     1.77     0.00
PGN (See et al., 2017)              36.75     10.48     23.81    37.89    23.41    12.84     6.92
Our Approach (TopicSeg+VFOA)        53.29     13.51     26.90    40.98    26.19    13.76     8.03
Our Approach (TopicSeg)             51.53     12.23     25.47    39.67    24.91    12.37     7.86

Table 2: Comparison on the AMI dataset.

The context representation for segment $s_q$ is $\boldsymbol{c}_{iq} = \sum_{u_k \in s_q} \mathrm{Softmax}(e'_{ik})\, \boldsymbol{u}_k$. The Meeting2Segment attention is:

$$e''_{iq} = \boldsymbol{v}_3^\top \tanh(\boldsymbol{W}_{d3} \boldsymbol{d}_i + \boldsymbol{W}_s \boldsymbol{c}_{iq}).$$

The hierarchical attention on word $w_j$ is computed within its utterance $u_k$ and then its segment $s_q$:

$$\alpha^{sum}_{ij} = \frac{\exp(e_{ij}\, e'_{ik}\, e''_{iq})}{\sum_{j \in s_q} \exp(e_{ij}\, e'_{ik}\, e''_{iq})}.$$

The probability of generating $y_i$ follows the decoder of PGN (See et al., 2017), with $\alpha^{sum}_{ij}$ serving as the attention used to copy words from the input sequence.
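A sketch of one decoding step of this three-level attention is shown below. The batching scheme (a flat list of utterances per segment), the projection sizes, and the precomputed scalar f_k^T p'_k are assumptions made for illustration; the PGN generation and copy heads are omitted.

```python
import torch
import torch.nn as nn

def score(v, *terms):
    """Shared additive-attention form: v^T tanh(sum of projected terms)."""
    return v(torch.tanh(sum(terms))).squeeze(-1)

class HierarchicalAttention(nn.Module):
    """One decoding step of the three-level attention (a sketch, not the released model)."""
    def __init__(self, dec_dim, word_dim, spk_dim, vfoa_dim, attn_dim=128):
        super().__init__()
        # Utterance2Word: e_ij = v1^T tanh(W_d1 d_i + W_w w_j + W_p p_j + W_f f_j)
        self.W_d1, self.W_w = nn.Linear(dec_dim, attn_dim), nn.Linear(word_dim, attn_dim)
        self.W_p, self.W_f = nn.Linear(spk_dim, attn_dim), nn.Linear(vfoa_dim, attn_dim)
        self.v1 = nn.Linear(attn_dim, 1, bias=False)
        # Segment2Utterance: e'_ik = (f_k^T p'_k) * v2^T tanh(W_d2 d_i + W_u u_ik)
        self.W_d2, self.W_u = nn.Linear(dec_dim, attn_dim), nn.Linear(word_dim, attn_dim)
        self.v2 = nn.Linear(attn_dim, 1, bias=False)
        # Meeting2Segment: e''_iq = v3^T tanh(W_d3 d_i + W_s c_iq)
        self.W_d3, self.W_s = nn.Linear(dec_dim, attn_dim), nn.Linear(word_dim, attn_dim)
        self.v3 = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, d_i, segment):
        """segment: list of utterances, each a dict with 'words' (L_k, word_dim),
        'speaker' (spk_dim,), 'vfoa' (vfoa_dim,), and 'vfoa_received' (scalar f_k^T p'_k)."""
        word_scores, utt_scores, contexts = [], [], []
        for utt in segment:
            e_ij = score(self.v1, self.W_d1(d_i), self.W_w(utt['words']),
                         self.W_p(utt['speaker']), self.W_f(utt['vfoa']))
            u_ik = torch.softmax(e_ij, dim=-1) @ utt['words']    # context of utterance u_k
            e_ik = utt['vfoa_received'] * score(self.v2, self.W_d2(d_i), self.W_u(u_ik))
            word_scores.append(e_ij); utt_scores.append(e_ik); contexts.append(u_ik)
        # segment context: weighted sum of utterance-level contexts (one reasonable reading of c_iq)
        c_iq = torch.softmax(torch.stack(utt_scores), dim=-1) @ torch.stack(contexts)
        e_iq = score(self.v3, self.W_d3(d_i), self.W_s(c_iq))    # segment score e''_iq
        # alpha^sum_ij: product of the three scores, normalized within the segment
        combined = torch.cat([e_ij * e_ik * e_iq
                              for e_ij, e_ik in zip(word_scores, utt_scores)])
        return torch.softmax(combined, dim=-1)   # copy distribution over the segment's words
```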

2.5 Joint End-to-End Training

The summarization task and the topic segmentation task are trained jointly with the loss function:

$$L = -\log P(Y, S \mid X) = \sum_{y_i \in Y} -\log P(y_i \mid X) + \sum_{s_j \in S} -\log P(s_j \mid (p_j, \boldsymbol{f}_j, \boldsymbol{u}_j)),$$

where $P(Y, S \mid X)$ is the conditional probability of the summary $Y$ and the segments $S$ given the input meeting transcript $X = \{(p_i, f_i, u_i)\}$. Here, $y_i$ is a token in the ground-truth summary, and $s_j$ denotes the ending boundary of the segment that starts with $u_j$.
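A minimal sketch of this objective, assuming the two probability distributions have already been produced by the decoders (the flat interface below is illustrative):

```python
import torch

def joint_loss(summary_token_probs, gold_summary_ids, segment_end_probs, gold_end_positions):
    """L = sum_i -log P(y_i | X) + sum_j -log P(s_j | (p_j, f_j, u_j)).
    summary_token_probs: (T, V) final generation/copy distributions per decoding step
    gold_summary_ids:    (T,) indices of the reference summary tokens in that distribution
    segment_end_probs:   list of distributions over candidate ending utterances
    gold_end_positions:  list of gold ending positions (relative to each segment start)"""
    eps = 1e-12
    summ_nll = -torch.log(summary_token_probs.gather(
        1, gold_summary_ids.unsqueeze(1)).squeeze(1) + eps).sum()
    seg_nll = sum(-torch.log(p[g] + eps)
                  for p, g in zip(segment_end_probs, gold_end_positions))
    return summ_nll + seg_nll
```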

3 Experiments

Our experiments are conducted on the widely used AMI Meeting Corpus (Carletta et al., 2005). The corpus is about a remote control design project from kick-off to completion. Each meeting lasts 30 minutes and has four participants: a project manager, a marketing expert, an industrial designer, and a user interface designer. Following the conventional approach in the meeting analysis literature (Shang et al., 2018), we preprocess the corpus and divide it into training (97 meetings), development (20 meetings), and test (20 meetings) sets. One meeting in the test set does not provide videos and is therefore ignored. The ASR transcripts are provided in the dataset (Garner et al., 2009) and are manually revised based on the automatically generated ASR output. Each meeting has a summary containing about 300 words and 10 sentences, and is divided into multiple segments focusing on various topics. The ASR transcripts and the videos recorded for all participants are the input to the model. We use the manually annotated summaries and topic segments for training, while both are generated automatically during testing. The VFOA estimation model is trained separately on the VFOA annotations of 14 meetings in the dataset and achieves 64.5% prediction accuracy.

The baselines include: (1) the state-of-the-art extractive summarization method CoreRank (Shang et al., 2018), and (2) the neural generation model PGN (See et al., 2017). We adopt two standard metrics, ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), for evaluation. Additionally, to show the impact of VFOA, we remove the VFOA features as an additional baseline and conduct significance testing. By t-test, the differences on ROUGE and BLEU are considered statistically significant (p-value ≤ 0.09), except for BLEU 4 (p-value = 0.27).
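For reference, per-meeting ROUGE, BLEU, and paired t-test scores could be computed as below; the rouge-score, NLTK, and SciPy packages are illustrative choices, since the paper does not state which scoring implementations were used.

```python
from rouge_score import rouge_scorer                     # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import ttest_rel

def evaluate(hypotheses, references):
    """Per-meeting ROUGE-1/2/L F1 and BLEU-1..4 for parallel lists of summary strings."""
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    smooth = SmoothingFunction().method1
    results = []
    for hyp, ref in zip(hypotheses, references):
        r = rouge.score(ref, hyp)                         # score(target, prediction)
        bleu = [sentence_bleu([ref.split()], hyp.split(),
                              weights=tuple([1.0 / n] * n), smoothing_function=smooth)
                for n in (1, 2, 3, 4)]
        results.append({'rouge1': r['rouge1'].fmeasure, 'rouge2': r['rouge2'].fmeasure,
                        'rougeL': r['rougeL'].fmeasure,
                        **{f'bleu{n}': b for n, b in zip((1, 2, 3, 4), bleu)}})
    return results

def paired_ttest(metric, ours, ablation):
    """Paired t-test on per-meeting scores of two systems for one metric."""
    return ttest_rel([m[metric] for m in ours], [m[metric] for m in ablation])
```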

Compared to the abstractive method PGN in Table 2, the multi-modal summarizer achieves a larger improvement on ROUGE than on BLEU, which demonstrates our approach's ability to focus on topically related words. For example, our model generates 'The marketing expert discussed his findings from trend watching reports, stressing the need for a product that has a fancy look and feel, is technologically innovative...', while PGN generates only 'the marketing expert discussed his findings from trend watching reports'. The speaker receives higher VFOA from the other participants while uttering the sentences containing these keywords. To demonstrate the effectiveness of the VFOA attention, we rank the utterances by received VFOA and achieve 45.8% accuracy in selecting salient utterances based on the annotations of Shang et al. (2018) (https://bitbucket.org/dascim/acl2018_abssumm/src/master/data/meeting/ami).


Therefore, the model learns that when a speaker receives higher VFOA, the utterances of that speaker are more important.

Moreover, topic segmentation also contributes to better coverage of salient words, as demonstrated by the ROUGE improvement of the model without VFOA features. Each meeting is divided into six to ten segments, focusing on topics such as 'openings', 'trend watching', 'project budget', and 'user target group'. With the topic segmentation results, the utterances within the same segment are more correlated, and topically related words tend to be mentioned frequently. For example, 'fancy look' is more important within the 'trend watching' segment than in the whole transcript.

The VFOA distribution is also highly correlated with topic segmentation. For example, the project manager pays more attention to the user interface designer in the 'trend watching' segment, while focusing more on the marketing expert in another segment about the 'project budget'. Therefore, the VFOA features not only benefit the summarization decoder but also improve topic segmentation: the topic segmentation accuracy is 57.74% without VFOA features and 60.11% with VFOA features in the segmentation attention.

Compared to the extractive method CoreRank in Table 2, our BLEU scores are doubled, which demonstrates that the abstractive summaries are more coherent and natural. For example, the extractive summaries are often incomplete sentences, such as 'prefer a design where the remote control and the docking station', while the abstractive summaries are well-formed sentences, such as 'The remote will use a conventional battery and a docking station which recharges the battery'. Also, the improvements on ROUGE 2 and ROUGE L are larger than on ROUGE 1, which shows the superiority of abstractive methods at preserving longer phrases, such as 'corporate website'.

4 Related Work

Extractive summarization methods rank and select words by constructing word co-occurrence graphs (Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Lin and Bilmes, 2010; Tixier et al., 2016b), and they have been applied to meeting summarization (Liu et al., 2009, 2011; Tixier et al., 2016a; Shang et al., 2018). However, extractive summaries are often neither natural nor coherent and have limited content coverage. Recently, neural natural language generation models have boosted the performance of abstractive summarization (Luong et al., 2015; Rush et al., 2015; See et al., 2017), but they are often unable to focus on topic words. Inspired by utterance clustering in extractive methods (Shang et al., 2018), we propose a hierarchical attention based on topic segmentation (Li et al., 2018). Moreover, our hierarchical attention is multi-modal, narrowing the focus by capturing participant interactions. Multi-modal features from human annotations, such as dialogue acts, have proven effective at improving summarization (Goo and Chen, 2018). Instead of relying on human annotations, our approach uses VFOA, a multi-modal feature that can be detected automatically.

5 Conclusions and Future Work

We develop a multi-modal summarizer to generate natural language summaries for multi-person meetings. We present a multi-modal hierarchical attention mechanism based on VFOA estimation and topic segmentation, and the experiments demonstrate its effectiveness. In the future, we plan to further integrate higher-level participant interactions, such as gestures and facial expressions. We also plan to construct a larger multimedia meeting summarization corpus to cover more diverse scenarios, building on our previous work (Bhattacharya et al., 2019).

Acknowledgments

This material is based upon work supported by the U.S. National Science Foundation under Grant No. IIP-1631674, DARPA AIDA Program No. FA8750-18-2-0014, ARL NS-CTA No. W911NF-09-2-0053, and the Tencent AI Lab Rhino-Bird Gift Fund. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


References

Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018), pages 59–66.

Indrani Bhattacharya, Michael Foley, Christine Ku, Ni Zhang, Tongtao Zhang, Cameron Mine, Manling Li, Heng Ji, Christoph Riedl, Brooke Foucault Welles, and Richard J. Radke. 2019. The unobtrusive group interaction (UGI) corpus. In 10th ACM Multimedia Systems Conference (MMSys 2019).

Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI meeting corpus: A pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Philip N. Garner, John Dines, Thomas Hain, Asmaa El Hannani, Martin Karafiat, Danil Korchagin, Mike Lincoln, Vincent Wan, and Le Zhang. 2009. Real-time ASR from meetings. In Tenth Annual Conference of the International Speech Communication Association.

Chih-Wen Goo and Yun-Nung Chen. 2018. Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts. arXiv preprint arXiv:1809.05715.

Jing Li, Aixin Sun, and Shafiq Joty. 2018. SegBot: A generic neural text segmentation model with pointer network. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI).

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920.

Fei Liu, Feifan Liu, and Yang Liu. 2011. A supervised framework for keyword extraction from meeting transcripts. IEEE Transactions on Audio, Speech, and Language Processing, 19(3):538–548.

Feifan Liu, Deana Pennell, Fei Liu, and Yang Liu. 2009. Unsupervised approaches for automatic keyword extraction using meeting transcripts. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 620–628. Association for Computational Linguistics.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Guokan Shang, Wensi Ding, Zekun Zhang, Antoine Jean-Pierre Tixier, Polykarpos Meladianos, Michalis Vazirgiannis, and Jean-Pierre Lorre. 2018. Unsupervised abstractive meeting summarization with multi-sentence compression and budgeted submodular maximization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).


Antoine Tixier, Fragkiskos Malliaros, and Michalis Vazirgiannis. 2016a. A graph degeneracy-based approach to keyword extraction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1860–1870.

Antoine Tixier, Konstantinos Skianis, and Michalis Vazirgiannis. 2016b. GoWvis: A web application for graph-of-words-based text visualization and summarization. In Proceedings of ACL-2016 System Demonstrations, pages 151–156.