

ACL 2018

The 56th Annual Meeting of the Association for Computational Linguistics

Proceedings of the Student Research Workshop

July 15 - July 20
Melbourne, Australia


© 2018 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-948087-36-0 (Student Research Workshop)


Introduction

Welcome to the ACL 2018 Student Research Workshop!

The ACL 2018 Student Research Workshop (SRW) is a forum for student researchers in computational linguistics and natural language processing. The workshop provides a unique opportunity for student participants to present their work and receive valuable feedback from the international research community as well as from selected mentors.

Following the tradition of the previous years' student research workshops, we have two tracks: research papers and research proposals. The research paper track is a venue for Ph.D. students, Masters students, and advanced undergraduates to describe completed work or work-in-progress along with preliminary results. The research proposal track is offered for advanced Masters and Ph.D. students who have decided on a thesis topic and are interested in feedback on their proposal and ideas about future directions for their work.

We received 66 submissions in total: 12 research proposals and 54 research papers. We accepted 26 papers, for an acceptance rate of 39%. After withdrawals, 22 papers are appearing in these proceedings, including 6 research proposals and 16 research papers. All of the accepted papers will be presented as posters in lunchtime sessions as a part of the main conference, split across two days (July 17th and 18th).

Mentoring is at the heart of the SRW. In keeping with previous years, students had the opportunity to participate in a pre-submission mentoring program prior to the submission deadline. This program offered students a chance to receive comments from an experienced researcher, in order to improve the quality of the writing and presentation before making their submission. Eleven authors participated in the pre-submission mentoring. In addition, authors of accepted SRW papers are matched with mentors who will meet with the students in person during the workshop. Each mentor prepares in-depth comments and questions prior to the student's presentation, and provides discussion and feedback during the workshop.

We are deeply grateful to our sponsors whose support will enable a number of students to attend the conference, including the Don and Betty Walker Scholarship Fund, Roam Analytics, Google, and the National Science Foundation under award No. 1827830. We would also like to thank our program committee members for their careful reviews of each paper, and all of our mentors for donating their time to provide feedback to our student authors.

Thank you to our faculty advisors Marie-Catherine de Marneffe, Wanxiang Che, and Malvina Nissim for their essential advice and guidance, and to the members of the ACL 2018 organizing committee, in particular Claire Cardie, Yusuke Miyao, and Priscilla Rasmussen, for their helpful support. Finally, thank you to our student participants!


Organizers:

Vered Shwartz, Bar-Ilan University
Jeniya Tabassum, Ohio State University
Rob Voigt, Stanford University

Faculty Advisors:

Wanxiang Che, Harbin Institute of Technology
Marie-Catherine de Marneffe, Ohio State University
Malvina Nissim, University of Groningen

Program Committee:

Roee Aharoni, Bar Ilan University
Fernando Alva-Manchego, University of Sheffield
Bharat Ram Ambati, Apple Inc.
Jacob Andreas, University of California, Berkeley
Isabelle Augenstein, Department of Computer Science, University of Copenhagen
Petr Babkin, Rensselaer Polytechnic Institute
Fan Bai, The Ohio State University
Kalika Bali, Microsoft Research Labs
Miguel Ballesteros, IBM Research AI
David Bamman, University of California, Berkeley
Yonatan Belinkov, MIT CSAIL
Yonatan Bisk, University of Washington
Facundo Carrillo, Buenos Aires University
Arun Chaganty, Stanford University
Kuan-Yu Chen, NTUST
Tanmay Chinchore, M S Ramaiah Institute of Technology
Grzegorz Chrupała, Tilburg University
Daria Dzendzik, ADAPT Centre Dublin City University
Kurt Junshean Espinosa, University of Manchester
Tina Fang, University of Waterloo
Silvia García Méndez, AtlantTIC Center - University of Vigo
Matt Gardner, Allen Institute for Artificial Intelligence
Marjan Ghazvininejad, University of Southern California
Hila Gonen, Bar-Ilan University
Alvin Grissom II, Ursinus College
Vivek Gupta, Microsoft Research
He He, Stanford University
Mohit Iyyer, Allen Institute for Artificial Intelligence
Ganesh Jawahar, IIIT Hyderabad
Vasu Jindal, University of Texas at Dallas
Sudipta Kar, University of Houston
Alina Karakanta, Saarland University
Eliyahu Kiperwasser, Bar-Ilan University
Thomas Kober, University of Edinburgh
Alexandar Konovalov, The Ohio State University


Alice Lai, University of Illinois at Urbana-Champaign
Wuwei Lan, The Ohio State University
Eric Laporte, Université Paris-Est Marne-la-Vallée
Haixia Liu, The University of Nottingham
Haijing Liu, Institute of Software, Chinese Academy of Sciences & University of Chinese Academy of Sciences
Edison Marrese-Taylor, The University of Tokyo
Prashant Mathur, eBay
Brian McMahan, Rutgers University
Paul Michel, Carnegie Mellon University
Saif Mohammad, National Research Council Canada
Taesun Moon, IBM Research
Maria Moritz, University of Goettingen
Nihal V. Nayak, Stride.AI
Denis Newman-Griffis, The Ohio State University
Dat Quoc Nguyen, The University of Melbourne
Vlad Niculae, Cornell University
Suraj Pandey, The Open University
Mohammad Taher Pilehvar, University of Cambridge
Daniel Preotiuc-Pietro, University of Pennsylvania
Emily Prud'hommeaux, Boston College
Avinesh PVS, UKP Lab, Technische Universität Darmstadt
Sudha Rao, University of Maryland, College Park
Will Roberts, Humboldt-Universität zu Berlin
Stephen Roller, Facebook
Julia Romberg, Heinrich Heine University Düsseldorf
Farig Sadeque, University of Arizona
Enrico Santus, MIT
Shoetsu Sato, Graduate School of Information Science and Technology, The University of Tokyo
Sabine Schulte im Walde, University of Stuttgart
Sebastian Schuster, Stanford University
Vasu Sharma, Carnegie Mellon University
Sunayana Sitaram, Microsoft Research India
Kevin Small, Amazon
Katira Soleymanzadeh, Ege University
R. Sudhesh Solomon, Indian Institute of Information Technology
Richard Sproat, Google
Gabriel Stanovsky, University of Washington, Allen Institute for Artificial Intelligence
Shang-Yu Su, National Taiwan University
Sandesh Swamy, The Ohio State University
Ivan Vulic, University of Cambridge
Olivia Winn, Columbia University
Zhou Yu, University of California, Davis
Omnia Zayed, Insight Centre for Data Analytics, National University of Ireland Galway
Meishan Zhang, Heilongjiang University, China
Yuan Zhang, Google


Table of Contents

Towards Opinion Summarization of Customer Reviews
    Samuel Pecar . . . . . . . . 1

Sampling Informative Training Data for RNN Language Models
    Jared Fernandez and Doug Downey . . . . . . . . 9

Learning-based Composite Metrics for Improved Caption Evaluation
    Naeha Sharif, Lyndon White, Mohammed Bennamoun and Syed Afaq Ali Shah . . . . . . . . 14

Recursive Neural Network Based Preordering for English-to-Japanese Machine Translation
    Yuki Kawara, Chenhui Chu and Yuki Arase . . . . . . . . 21

Pushing the Limits of Radiology with Joint Modeling of Visual and Textual Information
    Sonit Singh . . . . . . . . 28

Recognizing Complex Entity Mentions: A Review and Future Directions
    Xiang Dai . . . . . . . . 37

Automatic Detection of Cross-Disciplinary Knowledge Associations
    Menasha Thilakaratne, Katrina Falkner and Thushari Atapattu . . . . . . . . 45

Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets
    Kushagra Singh, Indira Sen and Ponnurangam Kumaraguru . . . . . . . . 52

German and French Neural Supertagging Experiments for LTAG Parsing
    Tatiana Bladier, Andreas van Cranenburgh, Younes Samih and Laura Kallmeyer . . . . . . . . 59

SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags
    Eva Vanmassenhove and Andy Way . . . . . . . . 67

Unsupervised Semantic Abstractive Summarization
    Shibhansh Dohare, Vivek Gupta and Harish Karnick . . . . . . . . 74

Biomedical Document Retrieval for Clinical Decision Support System
    Jainisha Sankhavara . . . . . . . . 84

A Computational Approach to Feature Extraction for Identification of Suicidal Ideation in Tweets
    Ramit Sawhney, Prachi Manchanda, Raj Singh and Swati Aggarwal . . . . . . . . 91

BCSAT: A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations
    Sreekavitha Parupalli, Vijjini Anvesh Rao and Radhika Mamidi . . . . . . . . 99

Reinforced Extractive Summarization with Question-Focused Rewards
    Kristjan Arumae and Fei Liu . . . . . . . . 105

Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models
    Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi and Mamoru Komachi . . . . . . . . 112

Exploring Chunk Based Templates for Generating a subset of English Text
    Nikhilesh Bhatnagar, Manish Shrivastava and Radhika Mamidi . . . . . . . . 120

Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions
    Eric Wallace and Jordan Boyd-Graber . . . . . . . . 127


Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection
    Pranav A . . . . . . . . 134

Mixed Feelings: Natural Text Generation with Variable, Coexistent Affective Categories
    Lee Kezar . . . . . . . . 141

Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning
    Pravallika Etoori, Manoj Chinnakotla and Radhika Mamidi . . . . . . . . 146

Automatic Question Generation using Relative Pronouns and Adverbs
    Payal Khullar, Konigari Rachna, Mukul Hase and Manish Shrivastava . . . . . . . . 153


Conference Program

Tuesday July 17, 2018

Towards Opinion Summarization of Customer Reviews
    Samuel Pecar

Sampling Informative Training Data for RNN Language Models
    Jared Fernandez and Doug Downey

Learning-based Composite Metrics for Improved Caption Evaluation
    Naeha Sharif, Lyndon White, Mohammed Bennamoun and Syed Afaq Ali Shah

Recursive Neural Network Based Preordering for English-to-Japanese Machine Translation
    Yuki Kawara, Chenhui Chu and Yuki Arase

Pushing the Limits of Radiology with Joint Modeling of Visual and Textual Information
    Sonit Singh

Recognizing Complex Entity Mentions: A Review and Future Directions
    Xiang Dai

Automatic Detection of Cross-Disciplinary Knowledge Associations
    Menasha Thilakaratne, Katrina Falkner and Thushari Atapattu

Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets
    Kushagra Singh, Indira Sen and Ponnurangam Kumaraguru

German and French Neural Supertagging Experiments for LTAG Parsing
    Tatiana Bladier, Andreas van Cranenburgh, Younes Samih and Laura Kallmeyer

SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags
    Eva Vanmassenhove and Andy Way

Unsupervised Semantic Abstractive Summarization
    Shibhansh Dohare, Vivek Gupta and Harish Karnick


Wednesday July 18, 2018

Biomedical Document Retrieval for Clinical Decision Support System
    Jainisha Sankhavara

A Computational Approach to Feature Extraction for Identification of Suicidal Ideation in Tweets
    Ramit Sawhney, Prachi Manchanda, Raj Singh and Swati Aggarwal

BCSAT: A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations
    Sreekavitha Parupalli, Vijjini Anvesh Rao and Radhika Mamidi

Reinforced Extractive Summarization with Question-Focused Rewards
    Kristjan Arumae and Fei Liu

Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models
    Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi and Mamoru Komachi

Exploring Chunk Based Templates for Generating a subset of English Text
    Nikhilesh Bhatnagar, Manish Shrivastava and Radhika Mamidi

Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions
    Eric Wallace and Jordan Boyd-Graber

Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection
    Pranav A

Mixed Feelings: Natural Text Generation with Variable, Coexistent Affective Categories
    Lee Kezar

Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning
    Pravallika Etoori, Manoj Chinnakotla and Radhika Mamidi

Automatic Question Generation using Relative Pronouns and Adverbs
    Payal Khullar, Konigari Rachna, Mukul Hase and Manish Shrivastava


Proceedings of ACL 2018, Student Research Workshop, pages 1–8, Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics

Towards Opinion Summarization of Customer Reviews

Samuel Pecar
Slovak University of Technology in Bratislava

Faculty of Informatics and Information Technologies
Ilkovičova 2, 842 16 Bratislava, Slovakia

[email protected]

Abstract

In recent years, the number of texts has grown rapidly. For example, most review-based portals, like Yelp or Amazon, contain thousands of user-generated reviews. It is impossible for any human reader to process even the most relevant of these documents. The most promising tool for this task is text summarization. Most existing approaches, however, work on small, homogeneous, English datasets, and do not account for multilinguality, opinion shift, and domain effects. In this paper, we introduce our research plan to use neural networks on user-generated travel reviews to generate summaries that take into account shifting opinions over time. We outline future directions in summarization to address all of these issues. By resolving the existing problems, we will make it easier for users of review sites to make more informed decisions.

1 Introduction

In recent years, the amount of available text corpora has been growing rapidly with the increasing popularity of the web. Users produce a huge amount of text every day. With a larger amount of text and the information included within it, it becomes impossible for people to read all the texts, which leads to information overload. Even a reader who restricts attention to only the most relevant documents cannot process them all. The task of text summarization has been known for a very long time: in the late 1950s, Luhn (1958) tried to create abstracts of documents automatically, and over the decades there have been many summarization systems dealing with different forms of summarization. This task is among the most challenging in natural language processing, and it can be particularly important for decision making or relevance judgments (Nenkova and McKeown, 2011).

Automatic text summarization has become a very useful and important tool to help users obtain as much information as possible without the necessity of reading all the original documents. Many definitions of text summarization exist. A text summary can be defined as a text produced from one or more texts that conveys the same information as the original text and is no longer than half of the original text (Hovy and Lin, 1998). Mani (2001) defined the goal of summarization as a process of finding the source of information, extracting content from it, and presenting the most important content to the user in a concise form and in a manner sensitive to the needs of the user's application.

We can divide summarization techniques into two categories: abstractive and extractive summarization (Gambhir and Gupta, 2017). Extractive summarization aims to choose parts of the original document, such as sentence fragments, whole sentences, or paragraphs. Abstractive summarization aims to paraphrase the content of the original document while keeping the output summary cohesive and concise. Selecting text sections, as extractive summarization does, leads to a partial loss of output cohesion, which abstractive summarization tries to preserve.

In the last few years, approaches based on neural networks have become very popular for the summarization task (Rush et al., 2015). A specific branch of text summarization is the summarization of opinions from human-generated text. We can summarize opinions from customer reviews or comments on social networks. This problem differs from a standard summarization task due to the amount of repetitive and redundant information. There can also be a problem with the polarity of opinions between different users.


This type of summary can be very useful for both customers of products and product owners. Opinion summarization can be particularly important for decision making (Yuan et al., 2015). This summarization type can also show trends in opinions collected from comments on social networks, especially when the number of text entries grows very fast.

In this work, we also discuss the analysis of a specific aspect of opinion summarization: sentiment analysis of customer reviews. In the summarization task, sentiment information can be viewed as one of the inputs along with the text corpora itself. A difference between the sentiment of text fragments and the sentiment of the whole summary is a very interesting aspect to consider.

The expected contributions of our research are: (1) an overview of recent developments in opinion summarization, (2) the assembly of a reasonably big dataset for opinion summarization (from travel-based portals), and (3) a novel method for opinion summarization based on state-of-the-art neural network architectures.

The rest of this paper is structured as follows. Section 2 explains recent advances in text summarization. In Section 3, we focus on possibilities in the area of opinion mining and summarization of opinions. Future directions in summarization are drawn in Section 4. The research proposal, including opinion summarization along with a possible dataset and a planned experiment, is described in Section 5. Final observations and conclusions are mentioned in Section 6.

2 Text Summarization

In recent years, text summarization has been focusing on abstractive summarization with the use of neural models. In (Rush et al., 2015), the authors showed a way to use a neural network based on the encoder-decoder architecture for creating abstractive summarization on the sentence level. The use of this type of model originates from the task of machine translation, where these models were used before. The approach presented in (Chopra et al., 2016) can be considered a follow-up of the previous work. Instead of a feed-forward neural network, a recurrent neural network (RNN) was used; the RNN emphasizes the order of input words. The authors presented a conditional RNN with a convolutional attention-based encoder.

Ferreira et al. (2014) presented a sentence clustering algorithm to deal with the redundancy and information diversity problems. The algorithm uses the text representation to convert input text into a graph model along with four types of relations between sentences.

A specific yet not very widely used technique is Abstract Meaning Representation (AMR). For text summarization, a framework for abstractive summarization based on the recent development of a treebank for AMR (Liu et al., 2015) can be employed. This framework parses source text into a set of AMR graphs; the graphs are then transformed into a summary graph from which the output summary is generated.

The use of neural architectures from machine translation became widely popular, and many authors have done research in this area. Nallapati et al. (2016) presented a neural model for abstractive summarization along with the introduction of a whole new dataset for the evaluation of summarization.

The use of the attention mechanism in neural networks became widespread and very popular, and many works showed the usefulness of this mechanism in other tasks. In the summarization task, a work by See et al. (2017) introduced a method based on the encoder-decoder principle along with an attention distribution over the input text. They used a hybrid pointer-generator architecture with the use of coverage. The pointer mechanism addresses the problem of choosing whether to copy an original word or generate a new one. The coverage part minimizes repetition during text generation in the later parts of the output. An interesting modification was introduced by Paulus et al. (2017). Their mechanism modifies the standard attention mechanism and also the objective function with a combination of maximum likelihood and cross-entropy loss; this mechanism is used in a reinforcement learning phase. Tan et al. (2017) proposed another modification of the attention mechanism, and their graph-based attention mechanism was used in a sequence-to-sequence framework. The goal of the encoder is to map the input documents to a vector representation; the decoder is then used to generate the output sentences. The novelty of their method lies in using a graph-based attention mechanism in a hierarchical encoder-decoder framework.



Neural models are widely used for both abstractive and extractive summarization. Nallapati et al. (2017) presented a neural sequential model for the extractive summarization of documents. Another contribution of this paper is the visualization of the impact of particular parts of the input text on the output summary.

Similarly to other tasks of natural language processing, convolutional neural networks can be used for summarization. Yasunaga et al. (2017) incorporate sentence relations using a Graph Convolutional Network on relation graphs along with sentence embeddings obtained from an RNN, which are taken as input node features. This system tries to exploit the representational power of neural networks and the sentence relation information which can be encoded in the graph representation of document clusters.

Much research has been conducted in this field in recent years. Another interesting modification employs latent structure modeling, presented in a framework based on a sequence-to-sequence-oriented encoder-decoder model which incorporates a latent structure modeling component (Li et al., 2017). This model generates abstractive summaries from the latent variables but also from the discriminative deterministic states.

All aforementioned summarization works were primarily aimed at the summarization of news articles. There are also other summarization types, like summarization of emails (Carenini et al., 2008; Yousefi-Azar and Hamey, 2017), event-based summarization (Glavaš and Šnajder, 2014; Kedzie et al., 2015), personalized summarization (Díaz and Gervás, 2007; Moro and Bielikova, 2012), and also sentiment-based or opinion summarization, described in the next section.

3 Opinion Mining and Summarization

Summarization of opinions is a special type of summarization. Products and services, along with comments on social networks, can accumulate hundreds of entries and lead to information overload. Repetition of opinions is one of the major differences that contrasts with the summarization of news. User-generated text also often differs markedly from news text, which is commonly heavily revised.

3.1 Summarization of Customer Opinions

Summarization of opinions from product reviews is the most common example of opinion summarization. These reviews often come from electronics stores like Amazon. Yuan et al. (2015) presented a user study of how opinion summarization can help in decision making before a consumer purchase.

One of the first works in opinion summarization is that of Hu and Liu (2004). They proposed a set of techniques for mining and summarizing product reviews. The main goal of their opinion summarization system is to provide a feature-based summary.

Tadano et al. (2010) proposed a method based on evaluative sentence extraction, where aspects are judged by their ratings, tf-idf value, and the number of mentions with a similar topic.

A summarization approach based on topical structure was introduced by Zhan et al. (2009). They presented a topical structure as a list of significant topics derived from a document set. To reduce the redundancy of sentences, they implemented a method of maximal marginal relevance.

The Opinosis project presented a graph-based summarization framework (Ganesan et al., 2010). This framework tries to generate abstractive summaries of highly redundant opinions. The authors showed that their summaries have better agreement with human summaries compared to baseline extractive methods.

Dalal and Zaveri (2013) presented the application of a multi-step approach for automatic opinion mining consisting of various phases. The authors showed that this multi-step, feature-based, semi-supervised opinion mining approach can be successful in identifying opinionated sentences from user reviews.

Aspect-based sentiment analysis can help to produce a structured summary based on positive and negative opinions about features of the product (Kansal and Toshniwal, 2014). The system takes into consideration not only sentence information, but also pieces of information from other sentences or reviews, called contextual information. The authors also showed that the polarity of words can differ even within one domain.

Kurian and Asokan (2015) presented a method with cross-domain sentiment classification along with the distributional similarity of opinion words. This method helps to classify and summarize product reviews and, in contrast with other methods, it does not require labeled data from the target domain or other lexical resources.



Unlike other opinion summarization systems dealing with sentiment polarity, another study formulated opinion summarization as a community leader detection problem (Zhu et al., 2015). The authors proposed a graph-based method to identify informative sentences and evaluated the method on product reviews. The study proposed algorithms for leader detection in the sentence graph.

A system named Gist (Lovinger et al., 2017) intends to deal with a large amount of text and automatically summarize it into informative and actionable key sentences. Gist tries to summarize the original reviews into a short text consisting of a few key sentences that capture the overall sentiment about the product.

A kind of opinion summarization is the summarization of travel reviews to give feedback on the quality of hotels, restaurants, or other services. A clustering-based method for the summarization of hotel reviews was proposed by Hu et al. (2017). They also showed that additional information, such as the author's reputation or the creation date, can have a huge impact on creating a relevant summary. Raut and Londhe (2014) presented a machine learning and SentiWordNet method for mining opinions from hotel reviews, along with a method for scoring sentence relevance.

3.2 Summarization of Community Answers

Along with the summarization of customer reviews, a very important summarization type considers comments on social networks or answers in question answering (QA) systems as input entries. Investigating these forms of text entries can lead to easier decision making.

A submodular function-based framework was presented by Wang et al. (2014). This framework can be used for query-focused opinion summarization. The authors evaluated this framework on QA and blog datasets. Statistically learned sentence relevance, along with information coverage with respect to diverse topics, are encoded as submodular functions.

The work of Lloret et al. (2015) deals both with the summarization of opinions in social networks and opinions in product reviews. Their method can be characterized by the integration of sentence simplification, sentence regeneration, and internal concept representation into the summarization task. The method aims to be able to generate abstractive summaries.

There are many topics in this area which can lead to very interesting observations. Guo et al. (2015) proposed a model for opinion summarization of highly contrastive opinions, particularly for controversial issues. They integrated expert opinions with ordinary opinions to create an output of contrastive sentence pairs. The study also presented this method as a unified way for users to better summarize opinions concerning controversial issues.

Another study explores opinion summarization of spontaneous conversation (Wang and Liu, 2015). A phone conversation corpus was annotated in this study, and the authors investigated two methods of extractive summarization: a graph-based method incorporating topic and sentiment information, and a supervised method which casts the problem as a classification problem.

A study by Li et al. (2016) also deals with opinion summarization in blogging. They proposed a convolutional neural network for opinion summarization based on recent deep learning research. Maximal marginal relevance is used for the extraction of representative opinion sentences.

A very important problem, the volume and volatility of opinionated data published in social media, was presented by Tsirakis et al. (2016). They discussed that most methods deal only with a small volume of data, where they are quite effective, but usually do not scale up.

4 Future Directions

As presented in the previous sections, many challenges are still present in summarization. Standard text summarization, usually applied to news articles, still deals with the problem of abstractive summarization. Text summarization techniques can be at different levels of abstraction and usually are not fully abstractive (See et al., 2017).

Another big challenge lies in the summarization of text in languages other than English. Most methods were evaluated only on English, and an evaluation on other languages, along with their specifics, would be interesting. Another aspect could be summarization over multiple languages, where the input text does not need to be in only one language.

Summarization of user-generated content has to deal with the problem of noisy and ungrammatical documents, but also with very diverse and conflicting opinions included within these documents (Murray et al., 2017).


This problem is even more pronounced in under-resourced languages where opinion mining and sentiment analysis are not well developed (Krchnavy and Simko, 2017).

In opinion summarization with thousands of input entries, a researcher should deal with the change of opinions over time. When summarizing customer reviews for services like hotels or restaurants, changes in quality should be considered. This can lead to a specific time-based summary which considers the progress of opinions over time. This problem is also relevant in the summarization of text in social networks, especially with controversial topics that reflect a specific mood in society or can be affected by community leaders. The lack of available large datasets for this task is another crucial subject of research in the coming years, since most research has been evaluated only on small ones.

A significant problem is present in the evaluation phase. Automatic evaluation can be quite controversial, as there is not only one correct summarization. Automatic evaluation measures like ROUGE and its modifications (Lin, 2004), which rely on n-grams, can partially deal with these problems, but still do not handle the use of synonyms. The same problem of multiple valid formulations of the ground truth can cause difficulties with human evaluation as well. Experiments evaluated with more human participants have to deal with inter-annotator agreement, which can be quite low.
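For concreteness, the n-gram overlap that underlies ROUGE-N recall can be computed in a few lines. The sketch below is a simplified illustration in plain Python (whitespace tokenization, no stemming), not the official ROUGE toolkit, and it makes the synonym limitation mentioned above easy to see: a paraphrase with different words scores zero overlap.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Simplified ROUGE-N recall: overlapping n-grams / reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Example: a generated summary scored against one human reference.
print(rouge_n_recall("the hotel staff was very friendly",
                     "staff at the hotel was friendly and helpful", n=1))
```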

In the next few years, we expect opinion summarization to deal with domain specifics and also with user satisfaction. The growth of user-generated content in the future can lead to a focus on the reduction of information overload and also on text summarization itself.

5 Research Proposal

As we described earlier, we would like to focus on a specific type of summarization: the creation of opinion summaries. Nowadays, travel sites include thousands of reviews from users who visited one of the reviewed places. These reviews are very important in the decision making of future customers, but also for the owners of services. With tens of new reviews every day, it is impossible to read all the reviews, and it is often very difficult to choose only the relevant ones. For owners, it is not possible to manually read all the reviews that could be very helpful in service improvement.

Recent advances in neural networks and in text summarization showed that employing the encoder-decoder architecture can be very useful. The problem of summarization of customer reviews differs from standard single-document summarization, where these models were applied before. In this task, we should consider a multi-modular framework.

The main idea of this proposal lies in achieving better user satisfaction with the review summary and also in examining the time aspect of opinion summarization. Opinion summarization should proceed in several phases:

1. aspect detection,

2. clustering opinionated features,

3. sentence generation.

The first step of our proposal lies in the detection of aspects. Before creating any summary, we need to identify the aspects discussed in the reviews. Another mechanism is needed to determine the similarity of aspects. A taxonomy of aspects could be a very useful tool to avoid separating similar aspects, but other approaches utilizing distributional and vector space models should also be examined. We plan to use a bidirectional LSTM neural network with a convolutional attention mechanism to identify aspects within the text and also to determine the polarity of each aspect.
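A minimal sketch of the kind of model this step could use is given below. The proposal does not name a framework, so PyTorch is assumed, the convolutional attention component is reduced to a single 1-D convolution over the BiLSTM states for brevity, and layer sizes and label sets are placeholders rather than the planned configuration.

```python
import torch
import torch.nn as nn

class AspectPolarityTagger(nn.Module):
    """Sketch: BiLSTM token tagger for aspects plus a sentence-level polarity head."""
    def __init__(self, vocab_size, n_aspects, n_polarities=3,
                 emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Stand-in for the "convolutional attention": a 1-D convolution that
        # scores each position from its local context.
        self.attn_conv = nn.Conv1d(2 * hidden_dim, 1, kernel_size=3, padding=1)
        self.aspect_head = nn.Linear(2 * hidden_dim, n_aspects)
        self.polarity_head = nn.Linear(2 * hidden_dim, n_polarities)

    def forward(self, token_ids):                              # (batch, seq_len)
        states, _ = self.bilstm(self.embed(token_ids))         # (batch, seq_len, 2h)
        scores = self.attn_conv(states.transpose(1, 2))        # (batch, 1, seq_len)
        weights = torch.softmax(scores, dim=-1).transpose(1, 2)  # (batch, seq_len, 1)
        sentence_vec = (weights * states).sum(dim=1)            # (batch, 2h)
        aspect_logits = self.aspect_head(states)       # per-token aspect labels
        polarity_logits = self.polarity_head(sentence_vec)  # sentence polarity
        return aspect_logits, polarity_logits

model = AspectPolarityTagger(vocab_size=10_000, n_aspects=20)
aspects, polarity = model(torch.randint(1, 10_000, (2, 12)))
```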

In the second step, we have to cluster opinionated sentences by the aspects they talk about. In each cluster, the sentiment and polarity of opinions need to be determined. Whereas in the task of sentiment analysis only the overall or average sentiment is typically provided, in summarization all polarities of opinion should be included in the output summary. To reach this goal, all the opinionated sentences and opinions should be identified.
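As an illustration of this step, the sketch below groups opinionated sentences by a predicted aspect label and aggregates the per-sentence polarities within each group. The input format is hypothetical and assumes the aspect and polarity predictions come from the previous step.

```python
from collections import defaultdict
from statistics import mean

def cluster_by_aspect(sentences):
    """Group opinionated sentences by aspect and summarize polarity per cluster.

    `sentences` is assumed to be a list of dicts produced by aspect detection,
    e.g. {"text": ..., "aspect": ..., "polarity": -1 | 0 | +1}.
    """
    clusters = defaultdict(list)
    for sent in sentences:
        clusters[sent["aspect"]].append(sent)
    summary = {}
    for aspect, members in clusters.items():
        polarities = [m["polarity"] for m in members]
        summary[aspect] = {
            "sentences": [m["text"] for m in members],
            "mean_polarity": mean(polarities),
            # Keep minority opinions visible rather than averaging them away.
            "positive": sum(p > 0 for p in polarities),
            "negative": sum(p < 0 for p in polarities),
        }
    return summary

reviews = [
    {"text": "Rooms were spotless.", "aspect": "cleanliness", "polarity": 1},
    {"text": "The room smelled of smoke.", "aspect": "cleanliness", "polarity": -1},
    {"text": "Staff were very helpful.", "aspect": "staff", "polarity": 1},
]
print(cluster_by_aspect(reviews))
```

Keeping separate positive and negative counts, rather than only an average, reflects the requirement above that all polarities of opinion remain visible in the output summary.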

In the final phase, we would like to employ a neural network architecture to generate output sentences from the collected aspect-oriented information. In this phase, we need to generate sentences for the output summary based on the clustered aspects and their polarity. We plan to use an LSTM network for this stage and generate sentences from the extracted aspects and their polarity.

There is also another important point of interest in the task of summarization that has not been discussed yet. The time horizon is often a neglected feature which should be considered in opinion summarization of customer reviews, as the opinions of customers can develop over time in a positive but also a negative way.


We plan to include information about creation time in the process of clustering opinions, along with other information about the reviewer, which can lead to better accuracy of summarization as well as higher user satisfaction with the output summary.

Another significant point to discuss is employing end-to-end deep learning in the task of opinion summarization. The major problem is the lack of a large dataset, which is necessary for such learning. The creation of this kind of dataset is expensive and could take hundreds of hours if performed manually.

5.1 Dataset

To create an appropriate dataset, we plan to gather customer reviews from large travel portals (e.g., TripAdvisor, Booking.com). All reviews come along with other useful information, such as score ranking, which should be included too. Any public information about reviewers could also be very useful, as it indicates reviewer relevance and importance.

5.2 Experiments

To evaluate the quality of generated summaries, a few experiments are required. We will have to create our own ground truth or reference summaries to automatically evaluate summary quality. As mentioned before, this alone is not sufficient, and other experiments including human evaluation will be needed. As this type of evaluation is very time consuming and resource intensive, a posteriori evaluation is a more feasible way to assess the quality of the generated summaries. A very interesting view for opinion summaries is the comparison of the sentiment of generated summaries with the sentiment of the original input reviews. In the human evaluation, we would like to provide users with a list of original reviews and the generated summary, and ask about their satisfaction. We also plan to use other automatic measures like ROUGE (Lin, 2004) and compare the generated summary with summaries created by humans. Another important measure is aspect coverage: the ratio of aspects from the original reviews that are included in the generated summaries.
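To make the aspect coverage measure concrete, a simple version could be computed as below; this is an illustrative sketch, and the aspect sets are assumed to come from the same aspect detection step described above.

```python
def aspect_coverage(summary_aspects, review_aspects):
    """Fraction of aspects mentioned in the original reviews that also
    appear in the generated summary (both given as sets of aspect labels)."""
    if not review_aspects:
        return 1.0
    return len(summary_aspects & review_aspects) / len(review_aspects)

# Hypothetical example: reviews discuss four aspects, the summary covers three.
print(aspect_coverage({"staff", "cleanliness", "location"},
                      {"staff", "cleanliness", "location", "breakfast"}))  # 0.75
```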

6 Conclusion

In this paper, we described the background of the summarization task. More importantly, we described recent contributions and developments in this area along with many problems the research deals with. We emphasized the main problems and future research directions in the process of summarization, and particularly for opinion summarization. We also introduced our future research intentions along with a design of the first experiments and a possible model and dataset. We demonstrated that the summarization task, and especially opinion summarization, still has big open issues that will be researched in the next few years.

Acknowledgments

I would like to thank my supervisors Marian Simko and Maria Bielikova. This work has been partially supported by the STU Grant scheme for Support of Young Researchers and grants No. VG 1/0646/15 and No. KEGA 028STU-4/2017.

References

Giuseppe Carenini, Raymond T. Ng, and Xiaodong Zhou. 2008. Summarizing emails with conversational cohesion and subjectivity. In Proceedings of ACL-08: HLT. Association for Computational Linguistics, pages 353–361. http://www.aclweb.org/anthology/P08-1041.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 93–98. https://doi.org/10.18653/v1/N16-1012.

Mita K. Dalal and Mukesh A. Zaveri. 2013. Semisupervised Learning Based Opinion Summarization and Classification for Online Product Reviews. Applied Computational Intelligence and Soft Computing 2013:1–8. https://doi.org/10.1155/2013/910706.

Alberto Díaz and Pablo Gervás. 2007. User-model based personalized summarization. Information Processing & Management 43(6):1715–1734. https://doi.org/10.1016/j.ipm.2007.01.009.

Rafael Ferreira, Luciano de Souza Cabral, Frederico Freitas, Rafael Dueire Lins, Gabriel de França Silva, Steven J. Simske, and Luciano Favaro. 2014. A multi-document summarization system based on statistics and linguistic treatment. Expert Systems with Applications 41(13):5780–5787. https://doi.org/10.1016/j.eswa.2014.03.023.


Mahak Gambhir and Vishal Gupta. 2017. Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47(1):1–66. https://doi.org/10.1007/s10462-016-9475-9.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Association for Computational Linguistics, pages 340–348.

Goran Glavaš and Jan Šnajder. 2014. Event graphs for information retrieval and multi-document summarization. Expert Systems with Applications 41(15):6904–6916. https://doi.org/10.1016/j.eswa.2014.04.004.

Jinlong Guo, Yujie Lu, Tatsunori Mori, and Catherine Blake. 2015. Expert-Guided Contrastive Opinion Summarization for Controversial Issues. In Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion. ACM Press, pages 1105–1110. https://doi.org/10.1145/2740908.2743038.

Eduard Hovy and Chin-Yew Lin. 1998. Automated text summarization and the summarist system. In Proceedings of a Workshop Held at Baltimore, Maryland: October 13-15, 1998. Association for Computational Linguistics, TIPSTER '98, pages 197–214. https://doi.org/10.3115/1119089.1119121.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, KDD '04, pages 168–177. https://doi.org/10.1145/1014052.1014073.

Ya-Han Hu, Yen-Liang Chen, and Hui-Ling Chou. 2017. Opinion mining from online hotel reviews – A text summarization approach. Information Processing & Management 53(2):436–449. https://doi.org/10.1016/j.ipm.2016.12.002.

Hitesh Kansal and Durga Toshniwal. 2014. Aspect based Summarization of Context Dependent Opinion Words. Procedia Computer Science 35:166–175. https://doi.org/10.1016/J.PROCS.2014.08.096.

Chris Kedzie, Kathleen McKeown, and Fernando Diaz. 2015. Predicting salient updates for disaster summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, pages 1608–1617. https://doi.org/10.3115/v1/P15-1155.

R. Krchnavy and M. Simko. 2017. Sentiment analysis of social network posts in Slovak language. In 2017 12th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP). pages 20–25. https://doi.org/10.1109/SMAP.2017.8022661.

Neethu Kurian and Shimmi Asokan. 2015. Summarizing User Opinions: A Method for Labeled-data Scarce Product Domains. Procedia Computer Science 46:93–100. https://doi.org/10.1016/j.procs.2015.01.062.

Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017. Deep recurrent generative decoder for abstractive text summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2091–2100. http://aclweb.org/anthology/D17-1222.

Qiudan Li, Zhipeng Jin, Can Wang, and Daniel Dajun Zeng. 2016. Mining opinion summarizations using convolutional neural networks in Chinese microblogging systems. Knowledge-Based Systems 107:289–300. https://doi.org/10.1016/j.knosys.2016.06.017.

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. http://anthology.aclweb.org/W/W04/W04-1013.pdf.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward Abstractive Summarization Using Semantic Representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 1077–1086. https://doi.org/10.3115/v1/N15-1114.

Elena Lloret, Ester Boldrini, Tatiana Vodolazova, Patricio Martínez-Barco, Rafael Muñoz, and Manuel Palomar. 2015. A novel concept-level approach for ultra-concise opinion summarization. Expert Systems with Applications 42(20):7148–7156. https://doi.org/10.1016/j.eswa.2015.05.026.

Justin Lovinger, Iren Valova, and Chad Clough. 2017. Gist: general integrated summarization of text and reviews. Soft Computing. https://doi.org/10.1007/s00500-017-2882-2.

H. P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2(2):159–165. https://doi.org/10.1147/rd.22.0159.

Inderjeet Mani. 2001. Automatic Summarization, volume 3 of Natural Language Processing. John Benjamins Publishing Company. https://doi.org/10.1075/nlp.3.

Robert Moro and Maria Bielikova. 2012. Personalized Text Summarization Based on Important Terms Identification. In 2012 23rd International Workshop on Database and Expert Systems Applications. IEEE, pages 131–135. https://doi.org/10.1109/DEXA.2012.47.


Gabriel Murray, Enamul Hoque, and Giuseppe Carenini. 2017. Chapter 11 - Opinion summarization and visualization. In Federico Alberto Pozzi, Elisabetta Fersini, Enza Messina, and Bing Liu, editors, Sentiment Analysis in Social Networks, Morgan Kaufmann, pages 171–187. https://doi.org/10.1016/B978-0-12-804412-4.00011-5.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). pages 3075–3081.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Association for Computational Linguistics, pages 280–290. https://doi.org/10.18653/v1/K16-1028.

Ani Nenkova and Kathleen McKeown. 2011. Automatic Summarization. Foundations and Trends in Information Retrieval 5(2):103–233. https://doi.org/10.1561/1500000015.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A Deep Reinforced Model for Abstractive Summarization. CoRR abs/1705.04304. http://arxiv.org/abs/1705.04304.

Vijay B. Raut and D. D. Londhe. 2014. Opinion mining and summarization of hotel reviews. In Computational Intelligence and Communication Networks (CICN), 2014 International Conference on. IEEE, pages 556–559.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 379–389. https://doi.org/10.18653/v1/D15-1044.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1073–1083. https://doi.org/10.18653/v1/P17-1099.

Ryosuke Tadano, Kazutaka Shimada, and Tsutomu Endo. 2010. Multi-aspects review summarization based on identification of important opinions and their similarity. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. http://www.aclweb.org/anthology/Y10-1079.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive Document Summarization with a Graph-Based Attentional Neural Model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1171–1181. https://doi.org/10.18653/v1/P17-1108.

Nikos Tsirakis, Vasilis Poulopoulos, Panagiotis Tsantilas, and Iraklis Varlamis. 2016. Large scale opinion mining for social, news and blog data. The Journal of Systems and Software 127:237–248. https://doi.org/10.1016/j.jss.2016.06.012.

Dong Wang and Yang Liu. 2015. Opinion summarization on spontaneous conversations. Computer Speech & Language 34:61–82. https://doi.org/10.1016/j.csl.2015.04.004.

Lu Wang, Hema Raghavan, Claire Cardie, and Vittorio Castelli. 2014. Query-focused opinion summarization for user-generated content. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pages 1660–1669.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based Neural Multi-Document Summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, pages 452–462. https://doi.org/10.18653/v1/K17-1045.

Mahmood Yousefi-Azar and Len Hamey. 2017. Text summarization using unsupervised deep learning. Expert Systems with Applications 68:93–105. https://doi.org/10.1016/j.eswa.2016.10.017.

Xiaojun Yuan, Ning Sa, Grace Begany, and Huahai Yang. 2015. What users prefer and why: A user study on effective presentation styles of opinion summarization. In Human-Computer Interaction – INTERACT 2015. Springer International Publishing, pages 249–264.

Jiaming Zhan, Han Tong Loh, and Ying Liu. 2009. Gather customer concerns from online product reviews – A text summarization approach. Expert Systems with Applications 36(2, Part 1):2107–2115. https://doi.org/10.1016/j.eswa.2007.12.039.

Linhong Zhu, Sheng Gao, Sinno Jialin Pan, Haizhou Li, Dingxiong Deng, and Cyrus Shahabi. 2015. The Pareto Principle Is Everywhere: Finding Informative Sentences for Opinion Summarization Through Leader Detection. In Lecture Notes in Social Networks, pages 165–187. https://doi.org/10.1007/978-3-319-14379-89.


Proceedings of ACL 2018, Student Research Workshop, pages 9–13, Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics

Sampling Informative Training Data for RNN Language Models

Jared Fernandez and Doug Downey
Department of Electrical Engineering and Computer Science

Northwestern University
Evanston, IL 60208

[email protected], [email protected]

Abstract

We propose an unsupervised importance sampling approach to selecting training data for recurrent neural network (RNN) language models. To increase the information content of the training set, our approach preferentially samples high perplexity sentences, as determined by an easily queryable n-gram language model. We experimentally evaluate the heldout perplexity of models trained with our various importance sampling distributions. We show that language models trained on data sampled using our proposed approach outperform models trained over randomly sampled subsets of both the Billion Word (Chelba et al., 2014) and Wikitext-103 benchmark corpora (Merity et al., 2016).

1 Introduction

The task of statistical language modeling seeks to learn a joint probability distribution over sequences of natural language words. In recent work, recurrent neural network (RNN) language models (Mikolov et al., 2010) have produced state-of-the-art perplexities in sentence-level language modeling, far below those of traditional n-gram models (Melis et al., 2017). Models trained on large, diverse benchmark corpora such as the Billion Word Corpus and Wikitext-103 have seen reported perplexities as low as 23.7 and 37.2, respectively (Kuchaiev and Ginsburg, 2017; Dauphin et al., 2017).

However, building models on large corpora is limited by prohibitive computational costs, as the number of training steps scales linearly with the number of tokens in the training corpus. Sentence-level language models for these large corpora can be learned by training on a set of sentences subsampled from the original corpus. We seek to determine whether it is possible to select a set of training sentences that is significantly more informative than a randomly drawn training set. We hypothesize that by training on higher information and more difficult training sentences, RNN language models can learn the language distribution more accurately and produce lower perplexities than models trained on similar-sized randomly sampled training sets.

We propose an unsupervised importance sampling technique for selecting training data for sentence-level RNN language models. We leverage n-gram language models' rapid training and query time, which often requires just a single pass over the training data. We determine a preliminary heuristic for each sentence's importance and information content by calculating its average per-word perplexity. Our technique uses an offline n-gram model to score sentences and then samples higher perplexity sentences with increased probability. Selected sentences are then used for training with corrective weights to remove the sampling bias. As entropy and perplexity have a monotonic relationship, selecting sentences with higher average n-gram perplexity also increases the average entropy and information content.
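A minimal sketch of this sampling step is shown below. It is not the authors' implementation: a small add-one-smoothed bigram model stands in for the offline n-gram model (no toolkit is named in the text), each sentence is scored by its average per-word perplexity, and a subset is sampled (with replacement) with probability proportional to that score, carrying 1/Pr(i) corrective weights.

```python
import math
import random
from collections import Counter

def train_bigram(corpus):
    """Add-one-smoothed bigram counts; corpus is a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(unigrams)

def perplexity(sent, unigrams, bigrams, vocab):
    """Average per-word perplexity of a sentence under the bigram model."""
    tokens = ["<s>"] + sent
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(sent), 1))

def importance_sample(corpus, k):
    """Sample k sentences with probability proportional to perplexity,
    returning (sentence, corrective_weight) pairs with weight = 1 / Pr(i)."""
    model = train_bigram(corpus)
    scores = [perplexity(sent, *model) for sent in corpus]
    total = sum(scores)
    probs = [s / total for s in scores]
    chosen = random.choices(range(len(corpus)), weights=probs, k=k)
    return [(corpus[i], 1.0 / probs[i]) for i in chosen]

corpus = [s.split() for s in ["the cat sat on the mat",
                              "colorless green ideas sleep furiously",
                              "the dog sat on the mat"]]
for sent, weight in importance_sample(corpus, k=2):
    print(round(weight, 2), " ".join(sent))
```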

We experimentally evaluate the effectiveness of multiple importance sampling distributions at selecting training data for RNN language models. We compare the heldout perplexities of models trained with randomly sampled and importance sampled training data on both the One Billion Word and Wikitext-103 corpora. We show that our importance sampling techniques yield lower perplexities than models trained on similarly sized random samples. By using an n-gram model to determine the sampling distribution, we limit the added computational costs of our importance sampling approach. We also find that applying perplexity-based importance sampling requires maintaining a relatively high weight on low perplexity sentences. We hypothesize that this is because low perplexity sentences frequently contain common subsequences that are useful in modeling other sentences.

2 Related Work

Standard stochastic gradient descent (SGD) iteratively selects random examples from the training set to perform gradient updates. In contrast, weighted SGD has been proven to accelerate the convergence rates of SGD by leveraging importance sampling as a means of variance reduction (Needell et al., 2014; Zhao and Zhang, 2015). Weighted SGD selects examples from an importance sampling distribution and then trains on the selected examples with corrective weights. The weight of each training example i is set to 1/Pr(i), where Pr(i) is the probability of selecting example i. The weighting provides an unbiased estimator of the overall loss by removing the bias of the importance sampling distribution. In expectation, each example's contribution to the total loss function is the same as if the example had been drawn uniformly at random.

Alain et al. (2015) developed an importance sampling technique for training deep neural networks by sampling examples directly according to their gradient norm. To avoid the high computational costs of gradient computations, Katharopoulos and Fleuret (2018) sample according to losses as approximated by a lightweight RNN model trained alongside their larger primary RNN model. Both techniques observed increased convergence rates and reduced errors in image classification tasks. In comparison, we use a fixed offline n-gram model to compute our sampling distribution, which can be trained and queried much more efficiently than a neural network model.

In natural language processing, subsampling of large corpora has been used to speed up training for both language modeling and machine translation. For domain specific language modeling, Moore and Lewis (2010) used an n-gram model trained on in-domain data to score sentences and then selected the sentences with low perplexities for training. Both Cho et al. (2014) and Koehn and Haddow (2012) used similar perplexity-based sampling to select training data for domain specific machine translation systems. Importance sampling has also been used to increase the rate of convergence for a class of neural network language models which use a set of binary classifiers to determine sequence likelihood, rather than calculating the probabilities jointly (Xu et al., 2011).

Because these subsampling techniques are used to learn domain specific distributions different from the distribution of the original corpus, they target lower perplexity sentences and have no need for corrective weighting. In contrast, we study how training sets generated using weighted importance sampling can be selected to maximize knowledge of the entire corpus for the standard language modeling task.

3 Methodology

First, we train an offline n-gram model over sentences randomly sampled from the training corpus. Using the n-gram model, we score perplexities for the remaining sentences in the training corpus.

We propose multiple importance sampling and likelihood weighting schemes for selecting training sequences for an RNN language model. Our proposed sampling distributions (discussed in detail below) bias the training set to select higher perplexity sentences in order to increase the training set's information content. We then train an RNN language model on the sampled sentences with weights set to the reciprocal of the sentence's selection probability.
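As a rough illustration of this pipeline, the sketch below samples a training set and attaches corrective weights. It is a minimal sketch with our own naming, not the paper's implementation: it assumes a perplexity scorer ngram_ppl and a selection distribution p_keep (such as those in Sections 3.1–3.3, normalized to return a probability) are supplied by the caller.

```python
import random

def importance_sample(sentences, ngram_ppl, p_keep, max_tokens):
    """Sample sentences with probability p_keep(ppl) and attach corrective
    weights 1 / p_keep(ppl) so the expected training loss stays unbiased."""
    sampled, total_tokens = [], 0
    random.shuffle(sentences)
    for sent in sentences:
        prob = min(1.0, p_keep(ngram_ppl(sent)))   # selection probability in (0, 1]
        if random.random() < prob:
            sampled.append((sent, 1.0 / prob))      # (sentence, corrective weight)
            total_tokens += len(sent.split())
            if total_tokens >= max_tokens:
                break
    return sampled
```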

3.1 Z-Score Sampling (Z_full)

This sampling distribution naively selects sentences according to their z-score, as calculated in terms of their n-gram perplexities. The selection probability of sequence s is set as:

P_{Keep}(s) = k_{pr} \left( \frac{\mathrm{ppl}(s) - \mu_{ppl}}{\sigma_{ppl}} + 1 \right),

where ppl(s) is the n-gram perplexity of sentence s, \mu_{ppl} is the average n-gram perplexity, \sigma_{ppl} is the standard deviation of n-gram perplexities, and k_{pr} is a normalizing constant to ensure a proper probability distribution.

For sentences with z-scores less than −1.00, or sequences where ppl(s) was in the 99th percentile of n-gram perplexities, sequences are assigned P_{Keep}(s) = k_{pr}. This ensured all sequences had positive selection probability and limited bias towards the selection of high perplexity sequences in the tail of the distribution. Upon examination, sequences with perplexities in the 99th percentile were generally esoteric or nonsensical. Selection of these high perplexity sentences provided minimal accuracy gain in exchange for their boosted selection probability.

3.2 Limited Z-Score Sampling (Z_α)

Training on low perplexity sentences can be helpful in learning to model higher perplexity sentences which share common sub-sequences. However, naive z-score sampling results in the selection of few low perplexity sentences. Additionally, the low perplexity sentences that are selected tend to dominate the training weight space due to their low selection probability.

To smooth the distribution in the weight space, selection probability is only determined using z-scores for sentences whose perplexities are greater than the mean. Thus, the selection probability of sentence s is:

P_{Keep}(s) =
\begin{cases}
k_{pr} \left( \alpha \frac{\mathrm{ppl}(s) - \mu_{ppl}}{\sigma_{ppl}} + 1 \right), & \text{if } \mathrm{ppl}(s) > \mu_{ppl} \\
k_{pr}, & \text{otherwise}
\end{cases}

where α is a constant parameter that determines the weight of the z-score in calculating the sequence's selection probability.

3.3 Squared Z-Score Sampling (Z²)

To investigate the effects of sampling from more complex distributions, we also evaluate an importance sampling scheme where sentences are sampled according to their squared z-score:

P_{Keep}(s) =
\begin{cases}
k_{pr} \left( \alpha \left( \frac{\mathrm{ppl}(s) - \mu_{ppl}}{\sigma_{ppl}} \right)^{2} + 1 \right), & \text{if } \mathrm{ppl}(s) > \mu_{ppl} \\
k_{pr}, & \text{otherwise}
\end{cases}
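Read literally, the three selection distributions differ only in how the z-score enters the unnormalized score. Below is a minimal sketch with our own function names; normalization into a proper probability distribution and the Z_full clipping rules described in Section 3.1 (z-scores below −1 and the 99th percentile) are left to the caller.

```python
def z_full(ppl, mu, sigma, k_pr=1.0):
    # Z_full: the selection score grows linearly with the sentence's z-score (+1 shift).
    return k_pr * ((ppl - mu) / sigma + 1.0)

def z_alpha(ppl, mu, sigma, alpha, k_pr=1.0):
    # Z_alpha: only sentences above the mean perplexity receive a z-score boost.
    if ppl > mu:
        return k_pr * (alpha * (ppl - mu) / sigma + 1.0)
    return k_pr

def z_squared(ppl, mu, sigma, alpha, k_pr=1.0):
    # Z^2: the boost grows with the squared z-score for above-mean sentences.
    if ppl > mu:
        return k_pr * (alpha * ((ppl - mu) / sigma) ** 2 + 1.0)
    return k_pr
```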

4 Experiments

We experimentally evaluate the effectiveness of the Z_full and Z² sampling methods, as well as the Z_α method for various values of parameter α.

4.1 Dataset Details

Sentence-level models were trained and evaluated on samples from Wikitext-103 and the One Billion Word Benchmark corpus. To create a dataset of independent sentences, the Wikitext-103 corpus was parsed to remove headers and to create individual sentences. The training and heldout sets were combined, shuffled, and then split to create new 250k token test and validation sets. The remaining sequences were set as a new training set of approximately 99 million tokens. In Billion Word experiments, training sequences were sampled from a 500 million token subset of the released training split. Billion Word models were evaluated on 250k token test and validation sets randomly sampled from the released heldout split.

Models were trained on 500 thousand, 1 million, and 2 million token training sets sampled from each training split. Rare words were replaced with <unk> tokens, resulting in vocabularies of 267K and 250K for the Wikitext and Billion Word corpora, respectively.
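One common way to implement this <unk> replacement step is sketched below; the paper does not specify its exact cutoff, so the most-frequent-types cap here is our own assumption.

```python
from collections import Counter

def build_vocab(sentences, max_size):
    # Keep the most frequent word types; everything else maps to <unk>.
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(max_size)}

def unkify(sentence, vocab):
    return " ".join(tok if tok in vocab else "<unk>" for tok in sentence.split())
```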

4.2 Model Details

To calculate the sampling distribution, an n-gram model was trained on a heldout set with the same number of tokens used to train each RNN model (Hochreiter and Schmidhuber, 1997). For example, the sampling distribution used to build a 1 million token RNN training set was determined using perplexities calculated by an n-gram model also trained on 1 million tokens. N-gram models were trained as 5-gram models with Kneser-Ney discounting (Kneser and Ney, 1995) using SRILM (Stolcke, 2002). For efficient calculation of sentence perplexities, we query our models using KenLM (Heafield, 2011).
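For sentence scoring, the KenLM Python bindings expose a per-sentence perplexity query directly; a small sketch follows, where the model path is purely illustrative.

```python
import kenlm

# Query a 5-gram Kneser-Ney model (trained with SRILM, loaded here from an ARPA file).
model = kenlm.Model("heldout.5gram.arpa")

def sentence_ppl(sentence):
    # Average per-word perplexity of the whitespace-tokenized sentence.
    return model.perplexity(sentence)

scores = {s: sentence_ppl(s) for s in ["the results are reported below .",
                                       "colorless green ideas sleep furiously ."]}
```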

RNN models were built using a two-layer long short-term memory (LSTM) neural network, with 200-dimensional hidden and embedding layers. Each model was trained for 10 epochs on its training set using the Adam gradient optimizer (Kingma and Ba, 2014) with a mini-batch size of 12.

5 Results

In Tables 1 and 2, we summarize the performances of models trained on samples from Wikitext-103 and the Billion Word Corpus, respectively. We report the Random and n-gram baseline perplexities for RNN and n-gram language models trained on randomly sampled data. We also report µ_ngram and σ_ngram for each training set, which are the mean and standard deviation in sentence perplexity as evaluated by the offline n-gram model.

In all experiments, RNN language models trained using our sampling approaches yield a decrease in model perplexity as compared to RNN models trained on similar sized randomly sampled sets. As the size of the training set increases, the RNNs trained on importance sampled datasets also yield significantly lower perplexities than the n-gram models trained on randomly sampled training sets. As expected, µ_ngram and σ_ngram increase substantially for training sets generated using our proposed sampling methods.

Model     Tokens   µ_ngram   σ_ngram   ppl
n-gram    500k     —         —         492.3
Random    500k     449.0     346.4     749.1
Z_0.5     500k     497.1     398.8     643.9
Z_1.0     500k     544.1     440.1     645.2
Z_2.0     500k     615.7     481.3     593.2
Z_4.0     500k     729.0     523.6     571.4
Z²        500k     576.5     499.7     720.0
Z_full    500k     627.1     451.9     663.7
n-gram    1M       —         —         502.7
Random    1M       448.9     380.2     550.6
Z_0.5     1M       495.7     431.8     545.73
Z_1.0     1M       540.4     475.4     435.4
Z_2.0     1M       615.6     528.4     426.9
Z_4.0     1M       732.9     584.4     420.1
Z²        1M       571.5     535.7     435.7
Z_full    1M       608.6     489.9     416.3
n-gram    2M       —         —         502.6
Random    2M       430.45    392.1     341.3
Z_0.5     2M       471.8     445.2     292.7
Z_1.0     2M       514.6     493.9     289.8
Z_2.0     2M       582.8     544.6     346.9
Z_4.0     2M       684.6     604.7     294.6
Z²        2M       518.4     522.9     287.9
Z_full    2M       568.4     506.5     312.5

Table 1: Perplexities for Wikitext models. All proposed models outperform the random and n-gram baselines as the number of training tokens increases.

Overall, the Z_4.0 sampling method produced the most consistent reductions in average perplexity of 102.9 and 54.2 compared to the Random and n-gram baselines, respectively. Z_full and Z² exhibit higher variance in their heldout perplexity as compared to the Z_α and baseline methods. We expect that this is because these methods select higher perplexity sequences with significantly higher probability than the Z_α methods. As a result, low perplexity sentences, which may contain common subsequences helpful in modeling other sentences, are ignored in training.

Model     Tokens   µ_ngram   σ_ngram   ppl
n-gram    1M       —         —         432.5
Random    1M       433.2     515.4     484.0
Z_0.5     1M       476.8     410.9     436.6
Z_1.0     1M       543.8     529.0     421.5
Z_4.0     1M       726.4     517.3     427.3
Z_full    1M       635.19    458.69    495.75
Z²        1M       639.2     593.7     435.3

Table 2: Perplexities for Billion Word models. Z_α and Z² both outperform the random baseline and are comparable to the n-gram baseline.

6 Conclusions and Future Work

We introduce a weighted importance sampling scheme for selecting RNN language model training data from large corpora. We demonstrate that models trained with data generated using this approach yield perplexity reductions of up to 24% when compared to models trained over randomly sampled training sets of similar size. This technique leverages higher perplexity training sentences to learn more accurate language models, while limiting the added computational cost of importance calculations.

In future work, we will examine the performance of our proposed selection techniques in additional parameter settings, with various values of α and thresholds in the limited z-score methods Z_α. We will evaluate the performance of sampling distributions based on perplexities calculated using small, lightweight RNN language models rather than n-gram language models. Additionally, we will also be evaluating the performance of sampling distributions calculated based on a sentence's subsequences and unique n-gram content. Furthermore, we plan on adapting this importance sampling approach to use online n-gram models trained alongside the RNN language models for determining the importance sampling distribution.

Acknowledgements

This work was supported in part by NSF Grant IIS-1351029. Support for travel was provided in part by the ACL Student Travel Grant (NSF Grant IIS-1827830) and the Conference Travel Grant from Northwestern University's Office of the Provost. We thank Dave Demeter, Thanapon Noraset, Yiben Yang, and Zheng Yuan for helpful comments.

References

Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. 2015. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Angelos Katharopoulos and Francois Fleuret. 2018. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.

Philipp Koehn and Barry Haddow. 2012. Towards effective use of training data in statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 317–321. Association for Computational Linguistics.

Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722.

Gabor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224. Association for Computational Linguistics.

Deanna Needell, Rachel Ward, and Nati Srebro. 2014. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Seventh International Conference on Spoken Language Processing.

Puyang Xu, Asela Gunawardana, and Sanjeev Khudanpur. 2011. Efficient subsampling for training complex language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1128–1136. Association for Computational Linguistics.

Peilin Zhao and Tong Zhang. 2015. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1–9.


Proceedings of ACL 2018, Student Research Workshop, pages 14–20, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Learning-based Composite Metrics for Improved Caption Evaluation

Naeha Sharif, Lyndon White, Mohammed Bennamoun and Syed Afaq Ali Shah,

{naeha.sharif, lyndon.white}@research.uwa.edu.au, and {mohammed.bennamoun, afaq.shah}@uwa.edu.au

The University of Western Australia, 35 Stirling Highway, Crawley, Western Australia

Abstract

The evaluation of image caption quality is a challenging task, which requires the assessment of two main aspects in a caption: adequacy and fluency. These quality aspects can be judged using a combination of several linguistic features. However, most of the current image captioning metrics focus only on specific linguistic facets, such as the lexical or semantic, and fail to meet a satisfactory level of correlation with human judgements at the sentence level. We propose a learning-based framework to incorporate the scores of a set of lexical and semantic metrics as features, to capture the adequacy and fluency of captions at different linguistic levels. Our experimental results demonstrate that composite metrics draw upon the strengths of stand-alone measures to yield improved correlation and accuracy.

1 Introduction

Automatic image captioning requires the understanding of the visual aspects of images to generate human-like descriptions (Bernardi et al., 2016). The evaluation of the generated captions is crucial for the development and fine-grained analysis of image captioning systems (Vedantam et al., 2015). Automatic evaluation metrics aim at providing efficient, cost-effective and objective assessments of the caption quality. Since these automatic measures serve as an alternative to the manual evaluation, the major concern is that such measures should correlate well with human assessments. In other words, automatic metrics are expected to mimic the human judgement process by taking into account various aspects that humans consider when they assess the captions.

The evaluation of image captions can be characterized as having two major aspects: adequacy and fluency. Adequacy is how well the caption reflects the source image, and fluency is how well the caption conforms to the norms and conventions of human language (Toury, 2012). In the case of manual evaluation, both adequacy and fluency tend to shape the human perception of the overall caption quality. Most of the automatic evaluation metrics tend to capture these aspects of quality based on the idea that "the closer the candidate description to the professional human caption, the better it is in quality" (Papineni et al., 2002). The output in such a case is a score (the higher the better) reflecting the similarity.

The majority of the commonly used metrics for image captioning, such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), are based on lexical similarity. Lexical measures (n-gram based) work by rewarding the n-gram overlaps between the candidate and the reference captions, thus measuring adequacy by counting the n-gram matches and assessing fluency by implicitly using the reference n-grams as a language model (Mutton et al., 2007). However, a high number of n-gram matches cannot always be indicative of a high caption quality, nor can a low number of n-gram matches always be reflective of a low caption quality (Gimenez and Marquez, 2010). A recently proposed semantic metric, SPICE (Anderson et al., 2016), overcomes this deficiency of lexical measures by measuring the semantic similarity of candidate and reference captions using Scene Graphs. However, the major drawback of SPICE is that it ignores the fluency of the output caption.

Integrating assessment scores of different measures is an intuitive and reasonable way to improve the current image captioning evaluation methods. Through this methodology, each metric plays the role of a judge, assessing the quality of captions in terms of lexical, grammatical or semantic accuracy. For this research, we use the scores conferred by a set of measures that are commonly used for captioning and combine them through a learning-based framework. In this work:

1. We evaluate various combinations of a chosen set of metrics and show that the proposed composite metrics correlate better with human judgements.

2. We analyse the accuracy of composite metrics in terms of differentiating between pairs of captions in reference to the ground truth captions.

2 Literature Review

The success of any captioning system depends on how well it transforms the visual information to natural language. Therefore, the significance of reliable automatic evaluation metrics is undeniable for the fine-grained analysis and advancement of image captioning systems. While image captioning has drawn inspiration from the Machine Translation (MT) domain for encoder-decoder based captioning networks (Vinyals et al., 2015), (Xu et al., 2015), (Yao et al., 2016), (You et al., 2016), (Lu et al., 2017), it has also benefited from automatic metrics which were initially proposed to evaluate machine translations/text summaries, such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE (Lin, 2004).

In the past few years, two metrics, CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016), were developed specifically for image captioning. Compared to the previously used metrics, these two show a better correlation with human judgements. The authors in (Liu et al., 2016) proposed a linear combination of SPICE and CIDEr called SPIDEr and showed that optimizing image captioning models for SPIDEr's score can lead to better quality captions. However, SPIDEr was not evaluated for its correlation with human judgements. Recently, (Kusner et al., 2015) proposed the use of a document similarity metric, Word Mover's Distance (WMD), which uses the word2vec (Mikolov et al., 2013) embedding space to determine the distance between two texts.

The metrics used for caption evaluation can be broadly categorized as lexical and semantic measures. Lexical metrics reward the n-gram matches between candidate captions and human generated reference texts (Gimenez and Marquez, 2010), and can be further categorized as unigram and n-gram based measures. Unigram based methods such as BLEU-1 (Papineni et al., 2002) assess only the lexical correctness of the candidate. However, in the case of METEOR or WMD, where some sort of synonym-matching/stemming is also involved, unigram overlaps help to evaluate both the lexical and, to some degree, the semantic aptness of the output caption. N-gram based metrics such as ROUGE and CIDEr primarily assess the lexical correctness of the caption, but also measure some amount of syntactic accuracy by capturing the word order.

The lexical measures have received criticism based on the argument that the n-gram overlap is neither an adequate nor a necessary indicative measure of the caption quality (Anderson et al., 2016). To overcome this limitation, semantic metrics such as SPICE capture the sentence meaning to evaluate the candidate captions. Their performance, however, is highly dependent on a successful semantic parsing. Purely syntactic measures, which capture the grammatical correctness, exist and have been used in MT (Mutton et al., 2007), but not in the captioning domain.

While fluency (well-formedness) of a candidate caption can be attributed to the syntactic and lexical correctness (Fomicheva et al., 2016), adequacy (informativeness) depends on the lexical and semantic correctness (Rios et al., 2011). We hypothesize that by combining scores from different metrics, which have different strengths in measuring adequacy and fluency, a composite metric that is of overall higher quality is created (Sec. 5).

Machine learning offers a systematic approach to integrate the scores of stand-alone metrics. In MT evaluation, various successful learning paradigms have been proposed (Bojar et al., 2016), (Bojar et al., 2017), and the existing learning-based metrics can be categorized as binary functions, "which classify the candidate translation as good or bad" (Kulesza and Shieber, 2004), (Guzman et al., 2015), or continuous functions, "which score the quality of translation on an absolute scale" (Song and Cohn, 2011), (Albrecht and Hwa, 2008). Our research is conceptually similar to the work in (Kulesza and Shieber, 2004), which induces a "human-likeness" criterion. However, our approach differs in terms of the learning algorithm as well as the features used. Moreover, the focus of this work is to assess various combinations of metrics (that capture the caption quality at different linguistic levels) in terms of their correlation with human judgements at the sentence level.

Figure 1: Overall framework of the proposed Composite Metrics

3 Proposed Approach

In our approach, we use scores conferred by a set of existing metrics as an input to a multi-layer feed-forward neural network. We adopt a training criterion based on a simple question: is the caption machine or human generated? Our trained classifier sets a boundary between good and bad quality captions, thus classifying them as human or machine produced. Furthermore, we obtain a continuous output score by using the class probability, which can be considered as some "measure of believability" that the candidate caption is human generated. Framing our learning problem as a classification task allows us to create binary training data using the human generated captions and machine generated captions as positive and negative training examples, respectively.

Our proposed framework, shown in Figure 1, first extracts a set of numeric features using the candidate "C" and the reference sentences "S". The extracted feature vector is then fed as an input to our multi-layer neural network. Each entry of the feature vector corresponds to the score generated by one of the four measures: METEOR, CIDEr, WMD1 and SPICE, respectively. We chose these measures because they show a relatively better correlation with human judgements compared to the other commonly used ones for captioning (Kilickaya et al., 2016). Our composite metrics are named EvalMS, EvalCS, EvalMCS, EvalWCS, EvalMWS and EvalMWCS. The subscript letters in each name correspond to the first letter of each individual metric. For example, EvalMS corresponds to the combination of METEOR and SPICE. Figure 2 shows the linguistic aspects captured by the stand-alone2 and the composite metrics. SPICE is based on sentence meanings, thus it evaluates the semantics. CIDEr covers the syntactic and lexical aspects, whereas METEOR and WMD assess the lexical and semantic components. The learning-based metrics mostly fall in the region formed by the overlap of all three major linguistic facets, leading to a better evaluation.

1 We convert the WMD distance score to similarity by using a negative exponential, to use it as a feature.
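As a rough sketch of this setup (not the authors' implementation: the hidden-layer sizes, the example scores, and the use of scikit-learn are our own illustrative choices), the classifier takes the four metric scores as features, with the WMD distance converted to a similarity, and its class probability for "human" serves as the caption score.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def features(meteor, cider, wmd_distance, spice):
    # One feature per metric; WMD is a distance, so convert it to a
    # similarity with a negative exponential (footnote 1).
    return [meteor, cider, np.exp(-wmd_distance), spice]

# X: metric scores for candidate captions (made-up values), y: 1 = human, 0 = machine.
X = np.array([features(0.31, 0.92, 1.4, 0.21),
              features(0.12, 0.35, 2.8, 0.08)])
y = np.array([1, 0])

# A small multi-layer feed-forward classifier; layer sizes are illustrative only.
clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000).fit(X, y)

def composite_score(meteor, cider, wmd_distance, spice):
    # The class probability of "human" is used as the continuous caption score.
    return clf.predict_proba([features(meteor, cider, wmd_distance, spice)])[0, 1]
```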

We train our metrics to maximise the classification accuracy on the training dataset. Since we are primarily interested in maximizing the correlation with human judgements, we perform early stopping based on Kendall's τ (rank correlation) with the validation set.
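The validation criterion itself is straightforward to compute; a minimal sketch using SciPy follows (the variable names are ours).

```python
from scipy.stats import kendalltau

def validation_tau(metric_scores, human_scores):
    # Rank correlation between the composite metric's scores and the expert
    # judgements on the validation captions; training stops once it stops improving.
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau

print(validation_tau([0.9, 0.4, 0.7, 0.2], [4, 2, 3, 1]))  # perfectly concordant -> 1.0
```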

4 Experimental Setup

To train our composite metrics, we source data from the Flickr30k dataset (Plummer et al., 2015) and three image captioning models, namely: (1) show and tell (Vinyals et al., 2015), (2) show, attend and tell (soft-attention) (Xu et al., 2015), and (3) adaptive attention (Lu et al., 2017). The Flickr30k dataset consists of 31,783 photos acquired from Flickr3, each paired with 5 captions obtained through the Amazon Mechanical Turk (AMT) (Turk, 2012). For each image in Flickr30k, we randomly select three of the human generated captions as positive training examples, and three machine generated captions (one from each image captioning model) as negative training examples. We combined the Microsoft COCO (Chen et al., 2015) training and validation sets (containing 123,287 images in total, each paired with 5 or more captions) to train the image captioning models using their official codes. These image captioning models achieved state-of-the-art performance when they were published.

Figure 2: Various automatic measures (stand-alone and combined) and their respective linguistic levels. See Sec. 3 for more details.

2 The stand-alone metrics marked with an * in Figure 2 are used as features for this work.

In order to obtain reference captions for each training example, we again use the human written descriptions of Flickr30k. For each negative training example (machine-generated caption), we randomly choose 4 out of the 5 human written captions originally associated with each image. For each positive training example (human-generated caption), in contrast, we use the 5 human written captions associated with the image, selecting one of these as the human candidate caption (positive example) and the remaining 4 as references. In Figure 3, a possible pairing scenario is shown for further clarification.
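A per-image sketch of this pairing is given below, with our own function name and the simplifying assumptions that exactly five distinct human captions and three machine captions are available for the image.

```python
import random

def pair_examples(human_caps, machine_caps):
    """Build (candidate, references, label) triples for one image:
    label 1 = human-written candidate, label 0 = machine-generated candidate."""
    examples = []
    for cand in machine_caps:                       # negatives: 4 random human references
        examples.append((cand, random.sample(human_caps, 4), 0))
    for cand in random.sample(human_caps, 3):       # positives: the other 4 humans as references
        refs = [c for c in human_caps if c != cand]
        examples.append((cand, refs, 1))
    return examples
```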

For our validation set, we source data from Flickr8k (Young et al., 2014). This dataset contains 5,822 captions assessed by three expert judges on a scale of 1 (the caption is unrelated to the image) to 4 (the caption describes the image without any errors). From our training set, we remove the captions of images which overlap with the captions in the validation and test sets (discussed in Sec. 5), leaving us with a total of 132,984 non-overlapping captions for the training of the composite metrics.

3 https://www.flickr.com/

Table 1: Kendall's correlation coefficient of automatic evaluation metrics and proposed composite metrics against human quality judgements. All correlations are significant at p < 0.001.

Individual Metrics   Kendall τ   Composite Metrics   Kendall τ
BLEU                 0.202       EvalMS              0.386
ROUGE-L              0.216       EvalCS              0.384
METEOR               0.352       EvalMCS             0.386
CIDEr                0.356       EvalWCS             0.379
SPICE                0.366       EvalMWS             0.367
WMD                  0.336       EvalMWCS            0.378

5 Results and Discussion

5.1 Correlation

The most desirable characteristic of an automatic evaluation metric is its strong correlation with human scores (Zhang and Vogel, 2010). A stronger correlation with human judgements indicates that a metric captures the information that humans use to assess a candidate caption. To evaluate the sentence-level correlation of our composite metrics with human judgements, we source data from a dataset collected by the authors in (Aditya et al., 2017). We use 6993 manually evaluated human and machine generated captions from this set, which were scored by AMT workers for correctness on a scale of 1 (low relevance to image) to 5 (high relevance to image). Each caption in the dataset is accompanied by a single judgement. In Table 1, we report the Kendall's τ correlation coefficient for the proposed composite metrics and other commonly used caption evaluation metrics.

It can be observed from Table 1 that composite metrics outperform stand-alone metrics in terms of sentence-level correlation. The combination of METEOR and SPICE (EvalMS) and of METEOR, CIDEr and SPICE (EvalMCS) showed the most promising results. The success of these composite metrics can be attributed to the individual strengths of METEOR, CIDEr and SPICE. METEOR is a strong lexical measure based on unigram matching, which uses additional linguistic knowledge for word matching, such as the morphological variation in words via stemming and dictionary based look-up for synonyms and paraphrases (Banerjee and Lavie, 2005). CIDEr uses higher order n-grams to account for fluency and down-weighs the commonly occurring (less informative) n-grams by performing Term Frequency Inverse Document Frequency (TF-IDF) weighting for each n-gram in the dataset (Vedantam et al., 2015). SPICE, on the other hand, is a strong indicator of the semantic correctness of a caption. Together these metrics assess the lexical, semantic and syntactic information. The composite metrics which included WMD in the combination achieved a lower performance compared to the ones in which WMD was not included. One possible reason is that WMD heavily penalizes shorter candidate captions when the numbers of words in the output and the reference captions are not equal (Kusner et al., 2015). This penalty might not be consistently useful, as it is possible for a shorter candidate caption to be both fluent and adequate. Therefore, WMD is a metric better suited to measuring document distance.

Figure 3: An example of a candidate and reference pairing that is used in the training set: (a) image, (b) human and machine generated captions for the image, and (c) candidate and reference pairings for the image.

5.2 Accuracy

We follow the framework introduced in (Vedantam et al., 2015) to analyse the ability of a metric to discriminate between pairs of captions with reference to the ground truth caption. A metric is considered accurate if it assigns a higher score to the caption preferred by humans. For this experiment, we use PASCAL-50s (Vedantam et al., 2015), which contains human judgements for 4000 triplets of descriptions (one reference caption with two candidate captions). Based on the pairing, the triplets are grouped into four categories (comprising 1000 triplets each), i.e., Human-Human Correct (HC), Human-Human Incorrect (HI), Human-Machine (HM), and Machine-Machine (MM). We follow the original approach of (Vedantam et al., 2015) and use 5 reference captions per candidate to assess the accuracy of the metrics, and report them in Table 2.

Table 2: Comparative accuracy results (in percentage) on four kinds of pairs tested on PASCAL-50s.

Metrics     HC    HI    HM    MM    AVG
BLEU        53.7  93.2  85.6  61.0  73.4
ROUGE-L     56.5  95.3  93.4  58.5  75.9
METEOR      61.1  97.6  94.6  62.0  78.8
CIDEr       57.8  98.0  88.8  68.2  78.2
SPICE       58.0  96.7  88.4  71.6  78.7
WMD         56.2  98.4  91.7  71.5  79.5
EvalMS      62.8  97.9  93.5  69.6  80.9
EvalCS      59.5  98.3  90.7  71.3  79.9
EvalMCS     60.2  98.3  91.8  71.8  80.5
EvalWCS     58.2  98.7  91.7  70.6  79.8
EvalMWS     56.9  98.4  91.3  71.2  79.4
EvalMWCS    59.0  98.5  90.7  70.2  79.6

Table 2 shows that on average composite measures produce better accuracy compared to the individual metrics. Amongst the four categories, HC is the hardest, in which all metrics show the worst performance. Differentiating between two good quality (human generated) correct captions is challenging as it involves a fine-grained analysis of the two candidates. EvalMS achieves the highest accuracy in the HC category, which shows that as captioning systems continue to improve, this combination of lexical and semantic metrics will continue to perform well. Moreover, human generated captions are usually fluent. Therefore, a combination of strong indicators of adequacy such as SPICE and METEOR is the most suitable for this task. EvalMCS shows the highest accuracy in differentiating between machine captions, which is another important category as one of the main goals of automatic evaluation is to distinguish between two machine algorithms. Amongst the composite metrics, EvalMS is again the best at distinguishing human captions (good quality) from machine captions (bad quality), which was our basic training criterion.

6 Conclusion and Future Works

In this paper we propose a learning-based approach to combine various metrics to improve caption evaluation. Our experimental results show that metrics operating along different linguistic dimensions can be successfully combined through a learning-based framework, and that they outperform the existing metrics for caption evaluation in terms of correlation and accuracy, with EvalMS and EvalMCS giving the best overall performance.

Our study reveals that the proposed approach is promising and has a lot of potential to be used for evaluation in the captioning domain. In the future, we plan to integrate features (components) of metrics instead of their scores for a better performance. We also intend to use syntactic measures, which to the best of our knowledge have not yet been used for caption evaluation (except in an indirect way by the n-gram measures which capture the word order), and to study how they can improve the correlation at the sentence level. The majority of the metrics for captioning focus more on adequacy as compared to fluency. This aspect also needs further attention, and a combination of metrics/features that can specifically assess the fluency of captions needs to be devised.

Acknowledgements

The authors are grateful to Nvidia for providing a Titan-Xp GPU, which was used for the experiments. This research is supported by the Australian Research Council, ARC DP150100294.

References

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, and Cornelia Fermuller. 2017. Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding.

Joshua S. Albrecht and Rebecca Hwa. 2008. Regression for machine translation evaluation at the sentence level. Machine Translation, 22(1-2):1.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, Barbara Plank, et al. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. (JAIR), 55:409–442.

Ondrej Bojar, Yvette Graham, Amir Kamran, and Milos Stanojevic. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, volume 2, pages 199–231.

Ondrej Bojar, Jindrich Helcl, Tom Kocmi, Jindrich Libovicky, and Tomas Musil. 2017. Results of the WMT17 neural MT training task. In Proceedings of the Second Conference on Machine Translation, pages 525–533.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380.

Marina Fomicheva, Nuria Bel, Lucia Specia, Iria da Cunha, and Anton Malinovskiy. 2016. CobaltF: A fluent metric for MT evaluation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, volume 2, pages 483–490.

Jesus Gimenez and Lluis Marquez. 2010. Linguistic measures for automatic machine translation evaluation. Machine Translation, 24(3-4):209–240.

Francisco Guzman, Shafiq Joty, Lluis Marquez, and Preslav Nakov. 2015. Pairwise neural machine translation evaluation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 805–814.

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2016. Re-evaluating automatic metrics for image captioning. arXiv preprint arXiv:1612.07600.

Alex Kulesza and Stuart M. Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 75–84.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2016. Improved image captioning via policy gradient optimization of SPIDEr. arXiv preprint arXiv:1612.00370.

Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Andrew Mutton, Mark Dras, Stephen Wan, and Robert Dale. 2007. GLEU: Automatic evaluation of sentence-level fluency. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 344–351.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2641–2649. IEEE.

Miguel Rios, Wilker Aziz, and Lucia Specia. 2011. TINE: A metric to assess MT adequacy. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 116–122. Association for Computational Linguistics.

Xingyi Song and Trevor Cohn. 2011. Regression and ranking based optimisation for sentence level machine translation evaluation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 123–129. Association for Computational Linguistics.

Gideon Toury. 2012. Descriptive Translation Studies and Beyond: Revised Edition, volume 100. John Benjamins Publishing.

Amazon Mechanical Turk. 2012. Amazon Mechanical Turk. Retrieved August, 17:2012.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2016. Boosting image captioning with attributes. OpenReview, 2(5):8.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Ying Zhang and Stephan Vogel. 2010. Significance tests of automatic machine translation evaluation metrics. Machine Translation, 24(1):51–65.


Proceedings of ACL 2018, Student Research Workshop, pages 21–27, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Recursive Neural Network Based Preordering for English-to-Japanese Machine Translation

Yuki Kawara†  Chenhui Chu‡  Yuki Arase†
†Graduate School of Information Science and Technology, Osaka University
‡Institute for Datability Science, Osaka University
{kawara.yuki, arase}@ist.osaka-u.ac.jp, [email protected]

Abstract

The word order between source and target languages significantly influences the translation quality in machine translation. Preordering can effectively address this problem. Previous preordering methods require a manual feature design, making language dependent design costly. In this paper, we propose a preordering method with a recursive neural network that learns features from raw inputs. Experiments show that the proposed method achieves comparable gain in translation quality to the state-of-the-art method but without a manual feature design.

1 Introduction

The word order between source and target languages significantly influences the translation quality in statistical machine translation (SMT) (Tillmann, 2004; Hayashi et al., 2013; Nakagawa, 2015). Models that adjust orders of translated phrases in decoding have been proposed to solve this problem (Tillmann, 2004; Koehn et al., 2005; Nagata et al., 2006). However, such reordering models do not perform well for long-distance reordering. In addition, their computational costs are expensive. To address these problems, preordering (Xia and McCord, 2004; Wang et al., 2007; Xu et al., 2009; Isozaki et al., 2010b; Gojun and Fraser, 2012; Nakagawa, 2015) and post-ordering (Goto et al., 2012, 2013; Hayashi et al., 2013) models have been proposed. Preordering reorders source sentences before translation, while post-ordering reorders sentences translated without considering the word order after translation. In particular, preordering effectively improves the translation quality because it solves long-distance reordering and computational complexity issues (Jehl et al., 2014; Nakagawa, 2015).

Rule-based preordering methods either manually create reordering rules (Wang et al., 2007; Xu et al., 2009; Isozaki et al., 2010b; Gojun and Fraser, 2012) or extract reordering rules from a corpus (Xia and McCord, 2004; Genzel, 2010). On the other hand, studies in (Neubig et al., 2012; Lerner and Petrov, 2013; Hoshino et al., 2015; Nakagawa, 2015) apply machine learning to the preordering problem. Hoshino et al. (2015) proposed a method that learns whether child nodes should be swapped at each node of a syntax tree. Neubig et al. (2012) and Nakagawa (2015) proposed methods that construct a binary tree and reordering simultaneously from a source sentence. These methods require a manual feature design for every language pair, which makes language dependent design costly. To overcome this challenge, methods based on feed forward neural networks that do not require a manual feature design have been proposed (de Gispert et al., 2015; Botha et al., 2017). However, these methods decide whether to reorder child nodes without considering the sub-trees, which contain important information for reordering.

As a preordering method that is free of manual feature design and makes use of the information in sub-trees, we propose a preordering method with a recursive neural network (RvNN). RvNN calculates reordering in a bottom-up manner (from the leaf nodes to the root) on a source syntax tree. Thus, preordering is performed considering the entire sub-trees. Specifically, RvNN learns whether to reorder nodes of a syntax tree1 with a vector representation of sub-trees and syntactic categories. We evaluate the proposed method for English-to-Japanese translation using both phrase-based SMT (PBSMT) and neural MT (NMT). The results confirm that the proposed method achieves comparable translation quality to the state-of-the-art preordering method (Nakagawa, 2015) that requires a manual feature design.

1 In this paper, we used binary syntax trees.

2 Preordering with a Recursive Neural Network

We explain our design of the RvNN to conduct preordering after describing how to obtain gold-standard labels for preordering.

2.1 Gold-Standard Labels for Preordering

We created training data for preordering by labeling whether each node of the source-side syntax tree has reordered child nodes against a target-side sentence. The label is determined based on Kendall's τ (Kendall, 1938) as in (Nakagawa, 2015), which is calculated by Equation (1):

\tau = \frac{4 \sum_{i=1}^{|y|-1} \sum_{j=i+1}^{|y|} \delta(y_i < y_j)}{|y|(|y|-1)} - 1,  (1)

\delta(x) = \begin{cases} 1 & (x \text{ is true}) \\ 0 & (\text{otherwise}) \end{cases}

where y is a vector of the target word indexes that are aligned with the source words. The value of Kendall's τ is in [−1, 1]. When it is 1, the sequence y is in a completely ascending order, i.e., the target sentence has the same word order as the source in terms of word alignment. At each node, if Kendall's τ increases by reordering the child nodes, an "Inverted" label is assigned; otherwise, a "Straight" label, which means the child nodes do not need to be reordered, is assigned. When a source word of a child node does not have an alignment, a "Straight" label is assigned.
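A minimal sketch of this labeling rule as we read it follows; the function names and the exact handling of unaligned words are our own simplifications, not the paper's code.

```python
def kendall_tau(y):
    """Normalized Kendall's tau (Eq. 1) over the target-side indices aligned to
    the source words under a node, read in source order."""
    n = len(y)
    if n < 2:
        return 1.0
    concordant = sum(1 for i in range(n - 1) for j in range(i + 1, n) if y[i] < y[j])
    return 4.0 * concordant / (n * (n - 1)) - 1.0

def node_label(left_indices, right_indices):
    # Gold label for one binary node: "Inverted" if swapping the children
    # increases Kendall's tau of the aligned target indices, else "Straight".
    if not left_indices or not right_indices:
        return "Straight"                      # unaligned child -> keep the order
    straight = kendall_tau(left_indices + right_indices)
    inverted = kendall_tau(right_indices + left_indices)
    return "Inverted" if inverted > straight else "Straight"
```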

2.2 Preordering Model

RvNN is constructed given a binary syntax tree. It predicts the label determined in Section 2.1 at each node. RvNN decides whether to reorder the child nodes by considering the sub-tree. The vector of the sub-tree is calculated in a bottom-up manner from the leaf nodes. Figure 1 shows an example of preordering of the English sentence "My parents live in London." At the VP node corresponding to "live in London," the vector of the node is calculated by Equation (2), considering its child nodes corresponding to "live" and "in London."

Figure 1: Preordering an English sentence "My parents live in London" with RvNN (nodes with a horizontal line mean "Inverted").

p = f([p_l; p_r] W + b),  (2)
s = p W_s + b_s,  (3)

where f is a rectifier, W ∈ R^{2λ×λ} is a weight matrix, and p_l and p_r are vector representations of the left and right child nodes, respectively. [·;·] denotes the concatenation of two vectors. W_s ∈ R^{λ×2} is a weight matrix for the output layer, and b ∈ R^λ and b_s ∈ R^2 are the biases. s ∈ R^2, calculated by Equation (3), is a weight vector for each label, which is fed into a softmax function to calculate the probabilities of the "Straight" and "Inverted" labels.

At a leaf node, a word embedding calculated by Equations (4) and (5) is fed into Equation (2).

e = x W_E,  (4)
p_e = f(e W_l + b_l),  (5)

where x ∈ R^N is a one-hot vector of an input word with a vocabulary size of N, W_E ∈ R^{N×λ} is an embedding matrix, and b_l ∈ R^λ is the bias. The loss function is the cross entropy defined by Equation (6):

L(\theta) = -\frac{1}{K} \sum_{k=1}^{K} \sum_{n \in T} l_n^k \log p(l_n^k; \theta),  (6)

where θ is the parameters of the model, n is a node of a syntax tree T, K is the mini-batch size, and l_n^k is the label of the n-th node in the k-th syntax tree in the mini batch.

With the model using POS tags and syntactic categories, we use Equation (7) instead of Equation (2):

p = f([p_l; p_r; e_t] W_t + b_t),  (7)

where e_t represents a vector of POS tags or syntactic categories, W_t ∈ R^{3λ×λ} is a weight matrix, and b_t ∈ R^λ is the bias. e_t is calculated in the same manner as Equations (4) and (5), but the input is a one-hot vector of the POS tags or syntactic categories at each node. λ is tuned on a development set, whose effects are investigated in Section 3.2.
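For concreteness, a NumPy sketch of the forward computation at one node (Equations (3) and (7)) is shown below. It is not the paper's Chainer implementation: the random weights are stand-ins for trained parameters, and the cross-entropy training over all nodes (Equation (6)) is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def node_forward(p_left, p_right, e_tag, W_t, b_t, W_s, b_s):
    """One RvNN step: combine the child vectors and the POS/category embedding,
    then score "Straight" vs. "Inverted" for this node."""
    p = relu(np.concatenate([p_left, p_right, e_tag]) @ W_t + b_t)  # node vector, shape (lam,)
    probs = softmax(p @ W_s + b_s)                                  # [P(Straight), P(Inverted)]
    return p, probs

# Toy dimensions and random stand-in weights for lam = 4.
lam = 4
rng = np.random.default_rng(0)
W_t, b_t = rng.normal(size=(3 * lam, lam)), np.zeros(lam)
W_s, b_s = rng.normal(size=(lam, 2)), np.zeros(2)
p, probs = node_forward(rng.normal(size=lam), rng.normal(size=lam),
                        rng.normal(size=lam), W_t, b_t, W_s, b_s)
```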

3 Experiments

3.1 Settings

We conducted English-to-Japanese translation experiments using the ASPEC corpus (Nakazawa et al., 2016). This corpus provides 3M sentence pairs as training data, 1,790 sentence pairs as development data, and 1,812 sentence pairs as test data. We used Stanford CoreNLP2 for tokenization and POS tagging, Enju3 for parsing of English, and MeCab4 for tokenization of Japanese. For word alignment, we used MGIZA.5 Source-to-target and target-to-source word alignments were calculated using IBM model 1 and hidden Markov model, and they were combined with the intersection heuristic following (Nakagawa, 2015).

We implemented our RvNN preordering model with Chainer.6 The ASPEC corpus was created using the sentence alignment method proposed in (Utiyama and Isahara, 2007) and was sorted based on the alignment confidence scores. In this paper, we used 100k sentences sampled from the top 500k sentences as training data for preordering. The vocabulary size N was set to 50k. We used Adam (Kingma and Ba, 2015) with weight decay and gradient clipping for optimization. The mini batch size K was set to 500.

We compared our model with the state-of-the-art preordering method proposed in (Nakagawa, 2015), which is hereafter referred to as BTG. We used its publicly available implementation,7 and trained it on the same 100k sentences as our model.

We used the 1.8M source and target sentences as training data for MT.

2 http://stanfordnlp.github.io/CoreNLP/
3 http://www.nactem.ac.uk/enju/
4 http://taku910.github.io/mecab/
5 http://github.com/moses-smt/giza-pp
6 http://chainer.org/
7 http://github.com/google/topdown-btg-preordering

Figure 2: Learning curve of our preordering model (training and development loss by epoch).

Node dimensions           100     200     500
w/o preordering                 22.73
w/o tags and categories   24.63   24.95   25.02
w/ tags and categories    25.22   25.41   25.38

Table 1: BLEU scores with preordering by our model and without preordering under different λ settings (trained on a 500k subset of the training data).

We excluded the sentence pairs whose lengths were longer than 50 words or whose source-to-target length ratio exceeded 9. For SMT, we used Moses.8 We trained the 5-gram language model on the target side of the training corpus with KenLM.9 Tuning was performed by minimum error rate training (Och, 2003). We repeated the tuning and testing of each model 3 times and report the average of the scores. For NMT, we used the attention-based encoder-decoder model of (Luong et al., 2015) with a 2-layer LSTM implemented in OpenNMT.10 The sizes of the vocabulary, word embedding, and hidden layer were set to 50k, 500, and 500, respectively. The batch size was set to 64, and the number of epochs was set to 13. The translation quality was evaluated using BLEU (Papineni et al., 2002) and RIBES (Isozaki et al., 2010a), using the bootstrap resampling method (Koehn, 2004) for the significance test.

3.2 Results

Figure 2 shows the learning curve of our preordering model with λ = 200 (the learning curve behaves similarly for different λ values).



Figure 4: Example of a syntax tree with a parse error for the sentence "Avogadro 's hypothesis ( 1811 ) contributed to the development in since then": the phrase "(1811)" was divided into two phrases by mistake, and our preordering result was affected by such parse errors. (Nodes with a horizontal line mean "Inverted".)

                     PBSMT               NMT
                     BLEU    RIBES       BLEU    RIBES
w/o preordering      22.88   64.07       32.68   81.68
w/ BTG               29.51   77.20       28.91   79.58
w/ RvNN              29.16   76.39       29.01   79.63

Table 2: BLEU and RIBES scores on the test set. (All models are trained on the entire training corpus of 1.8M sentence pairs.) Numbers in bold indicate the best systems and the systems that are not statistically different from the best systems at p < 0.05.

Figure 3: Distribution of Kendall's τ in the training data without preordering, with preordering by BTG, and with preordering by our RvNN (proportion of sentences by Kendall's τ, from −1 to 1).

Both the training and the development losses decreased until epoch 2. However, the development loss started to increase after epoch 3. Therefore, the number of epochs was set to 5, and we chose the model with the lowest development loss. The source sentences in the translation evaluation were preordered with this model.

Next, we investigated the effect of λ. Table 1 shows the BLEU scores with different λ values, as well as the BLEU score without preordering. In this experiment, PBSMT was trained with a 500k subset of the training data, and the distortion limit was set to 6. Our RvNNs consistently outperformed the plain PBSMT without preordering. The BLEU score improved as λ increased when only word embeddings were considered. In addition, RvNNs involving POS tags and syntactic categories achieved even higher BLEU scores. This result shows the effectiveness of POS tags and syntactic categories in reordering. For these models, setting λ larger than 200 did not contribute to the translation quality. Based on these results, we further evaluated the RvNN with POS tags and syntactic categories where λ = 200.

Table 2 shows BLEU and RIBES scores on the test set for PBSMT and NMT trained on the entire training data of 1.8M sentence pairs. The distortion limit of the SMT systems trained on sentences preordered by RvNN and BTG was set to 0, while that without preordering was set to 6. Compared to the plain PBSMT without preordering, both BLEU and RIBES increased significantly with preordering by RvNN and BTG. These scores were comparable (statistically insignificant at p < 0.05) between RvNN and BTG, indicating that the proposed method achieves a translation quality comparable to BTG. In contrast to the case of PBSMT, NMT without preordering achieved a significantly higher BLEU score than the NMT models with preordering by RvNN and BTG. The same phenomenon was reported for the Chinese-to-Japanese translation experiment in (Sudoh and Nagata, 2016). We assume that one reason is the isolation between the preordering and NMT models, where both models are trained using independent optimization functions. In the future, we will investigate this problem and consider unifying preordering and translation in a single model.

(The p-values for BLEU and RIBES were 0.068 and 0.226, respectively.)


Preordered examples
Source sentence: because of the embedding heterostructure, current leakage around the threshold was minimal.
BTG:             of the embedding heterostructure because, the threshold around current leakage minimal was.
RvNN:            embedding heterostructure the of because, around threshold the current leakage minimal was.

Translation examples by PBSMT
Reference:        埋込みヘテロ構造のため、しきい値近くでの漏れ電流は非常に小さかった。
                  (embedding heterostructure of because, threshold around leakage very minimal.)
w/o preordering:  埋込みヘテロ構造のため、漏れ電流のしきい値付近では最低であった。
                  (embedding heterostructure of because, leakage threshold around minimal.)
BTG:              の埋込みヘテロ構造のため、このしきい値付近での漏れ電流の最小であった。
                  (of embedding heterostructure of because, the threshold around leakage minimal.)
RvNN:             埋込みヘテロ構造のため、周辺のしきい値の電流漏れは認められなかった。
                  (embedding heterostructure of because, around threshold leakage recognized not.)

Table 3: Example where preordering improves translation. (Literal translations are given in parentheses under the Japanese sentences.)

Preordered examples
Source sentence: avogadro's hypothesis (1811) contributed to the development in since then.
BTG:             avogadro's hypothesis (1811) the then since in development to contributed .
RvNN:            avogadro's hypothesis (1811 then since in to development the contributed).

Translation examples by PBSMT
Reference:        Avogadroの仮説 (1811)は,以後の発展に貢献した。
                  (Avogadro's hypothesis (1811), since then development to contributed.)
w/o preordering:  Avogadroの仮説 (1811)の開発に貢献し以後である。
                  (Avogadro's hypothesis (1811) development to contributed since then.)
BTG:              Avogadroの仮説 (1811)以後の発展に貢献した。
                  (Avogadro's hypothesis (1811) since then development to contributed.)
RvNN:             Avogadroの仮説 (1811以降のこれらの開発に貢献した。
                  (Avogadro's hypothesis (1811 since then these development to contributed.)

Table 4: Example of a parse error that disturbed preordering in our method. (Literal translations are given in parentheses under the Japanese sentences.)


Figure 3 shows the distribution of Kendall's τ in the original training data as well as the distributions after preordering by RvNN and by BTG. The ratio of high Kendall's τ increased greatly in the case of RvNN, suggesting that the proposed method learns preordering properly. Furthermore, the ratio of high Kendall's τ with RvNN is higher than that with BTG, implying that preordering by RvNN is better than that by BTG.
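For reference, Kendall's τ (Kendall, 1938) over a single sentence's reordering can be computed from the pairwise order of the target-side positions of the source words; the sketch below uses the standard pairwise definition, and how word alignments are mapped to such a position sequence is an assumption here.

```python
def kendalls_tau(order):
    """Kendall's tau of a sequence of target-side positions of the source words.

    order[i] is the target position to which source word i is aligned;
    a perfectly monotone (already target-ordered) sentence gives tau = 1.0.
    """
    n = len(order)
    if n < 2:
        return 1.0
    concordant = sum(1 for i in range(n) for j in range(i + 1, n) if order[i] < order[j])
    discordant = sum(1 for i in range(n) for j in range(i + 1, n) if order[i] > order[j])
    return (concordant - discordant) / (n * (n - 1) / 2)

# e.g. kendalls_tau([0, 1, 2, 3]) == 1.0 and kendalls_tau([3, 2, 1, 0]) == -1.0
```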

We also manually investigated the preordering and translation results. We found that our model improved both. Table 3 shows a successful preordering and translation example with PBSMT. The word order is notably different between the source and reference sentences. After preordering, the word order of the source became the same as that of the reference. Because RvNN depends on parsing, sentences with a parse error tended to fail in preordering. For example, the phrase "(1811)" in Figure 4 was divided into two phrases by mistake. Consequently, preordering failed. Table 4 shows preordering and translation examples for the sentence in Figure 4. Compared to the translation without preordering, the translation quality after preordering was improved to deliver the correct meaning.

4 Conclusion

In this paper, we proposed a preordering method without manual feature design for MT. The experiments confirmed that the proposed method achieved a translation quality comparable to the state-of-the-art preordering method that requires manual feature design. As future work, we plan to develop a model that jointly parses and preorders a source sentence. In addition, we plan to integrate preordering into the NMT model.

Acknowledgement

This work was supported by NTT Communication Science Laboratories and a Grant-in-Aid for Research Activity Start-up #17H06822, JSPS.

References


Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, and Slav Petrov. 2017. Natural language processing with small feed-forward networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2879–2885, Copenhagen, Denmark.

Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 376–384, Beijing, China.

Adria de Gispert, Gonzalo Iglesias, and Bill Byrne. 2015. Fast and accurate preordering for SMT using neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1012–1017, Denver, Colorado.

Anita Gojun and Alexander Fraser. 2012. Determining the placement of German verbs in English-to-German SMT. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 726–735, Avignon, France.

Isao Goto, Masao Utiyama, and Eiichiro Sumita. 2012. Post-ordering by parsing for Japanese-English statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–316, Jeju Island, Korea.

Isao Goto, Masao Utiyama, and Eiichiro Sumita. 2013. Post-ordering by parsing with ITG for Japanese-English statistical machine translation. ACM Transactions on Asian Language Information Processing, 12(4):17:1–17:22.

Katsuhiko Hayashi, Katsuhito Sudoh, Hajime Tsukada, Jun Suzuki, and Masaaki Nagata. 2013. Shift-reduce word reordering for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1382–1386, Seattle, Washington, USA.

Sho Hoshino, Yusuke Miyao, Katsuhito Sudoh, Katsuhiko Hayashi, and Masaaki Nagata. 2015. Discriminative preordering meets Kendall's τ maximization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 139–144, Beijing, China.

Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010a. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 944–952, Cambridge, USA.

Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010b. Head finalization: A simple reordering rule for SOV languages. In Proceedings of the Workshop on Statistical Machine Translation and MetricsMATR, pages 244–251, Uppsala, Sweden.

Laura Jehl, Adria de Gispert, Mark Hopkins, and Bill Byrne. 2014. Source-side preordering for translation using logistic regression and depth-first branch-and-bound search. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 239–248, Gothenburg, Sweden.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference for Learning Representations (ICLR), San Diego, USA.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 388–395, Barcelona, Spain.

Philipp Koehn, Amittai Axelrod, Alexandra Birch-Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pages 68–75, Pittsburgh, USA.

Uri Lerner and Slav Petrov. 2013. Source-side classifier preordering for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 513–523, Seattle, USA.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal.

Masaaki Nagata, Kuniko Saito, Kazuhide Yamamoto, and Kazuteru Ohashi. 2006. A clustered global phrase reordering model for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics and Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages 713–720, Sydney, Australia.

Tetsuji Nakagawa. 2015. Efficient top-down BTG parsing for machine translation preordering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 208–218, Beijing, China.


Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 2204–2208, Portoroz, Slovenia.

Graham Neubig, Taro Watanabe, and Shinsuke Mori. 2012. Inducing a discriminative parser to optimize machine translation reordering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 843–853, Jeju Island, Korea.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, USA.

Katsuhito Sudoh and Masaaki Nagata. 2016. Chinese-to-Japanese patent machine translation based on syntactic pre-ordering for WAT 2016. In Proceedings of the Workshop on Asian Translation (WAT), pages 211–215, Osaka, Japan.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 101–104, Boston, Massachusetts, USA.

Masao Utiyama and Hitoshi Isahara. 2007. A Japanese-English patent parallel corpus. In Proceedings of the Machine Translation Summit XI, pages 475–482, Copenhagen, Denmark.

Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745, Prague, Czech Republic.

Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 508–514, Geneva, Switzerland.

Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In Proceedings of the Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 245–253, Boulder, USA.


Pushing the Limits of Radiology with Joint Modeling of Visual and Textual Information

Sonit Singh
Department of Computing, Macquarie University
DATA61, CSIRO
Sydney, Australia
[email protected]

Abstract

Recently, there has been increasing interest in the intersection of computer vision and natural language processing. Researchers have studied several interesting tasks, including generating text descriptions from images and videos and language embedding of images. More recent work has further extended the scope of this area to combine videos and language, learning to solve non-visual tasks using visual cues, visual question answering, and visual dialog. Despite a large body of research on the intersection of vision-language technology, its adaptation to the medical domain is not fully explored. To address this research gap, we aim to develop machine learning models that can reason jointly on medical images and clinical text for advanced search, retrieval, annotation and description of medical images.

1 Introduction

Integrating information from various modalities is deeply rooted in human lives. Humans combine vision, language, speech and touch to acquire knowledge about the world and comprehend the world (Hall and McKevitt, 1995). Vision and language are the most common ways of expressing our knowledge about the world. Both Computer Vision (CV) and Natural Language Processing (NLP) have demonstrated successful results on various general-purpose tasks such as image classification, object detection, semantic segmentation, and machine translation. Although research at the intersection of CV and NLP is gaining pace, its applications to healthcare are still under-explored. The success of Artificial Intelligence (AI) technologies in general-purpose tasks is mainly attributed to publicly available large-scale datasets, enhanced compute power due to the rise of Graphics Processing Units (GPUs), and advancements in Machine Learning (ML) algorithms and architectures. One of the biggest hurdles in deploying ML (especially deep learning) models in healthcare is the lack of annotated data. Although it is easy to obtain annotated data for general-purpose tasks by crowdsourcing, it is almost impossible for medical data because of limited expertise, privacy and ethical issues. On the positive side, a lot of medical data in the form of medical images and accompanying text reports is stored in hospitals' Picture Archival and Communication Systems (PACS). For instance, Beth Israel Deaconess Medical Center (Harvard) generates approximately 20 terabytes of image data and one terabyte of text data per year (Mastanduno, 2017). Also, the drive toward structured reporting in radiology will definitely enhance NLP accuracy (Cai et al., 2016). Interpreting medical images and summarising them in natural text is a challenging, complex and tedious task. Various research studies show that the general rate of missed radiological findings can be as much as 30% (Berlin, 2001; Berlin and Hendrix, 1998). These errors are mainly due to limited expertise, increasing patient volumes, the subjectivity of human perception, fatigue, and inability to locate critical and subtle findings (Sohani, 2013). Based on a recent estimate, one billion radiology examinations are performed worldwide annually. This equates to about 40 million radiologist errors per annum (Brady, 2017). In order to reduce these errors, there is a need to develop automated clinical decision support systems (CDSS) (Eickhoff et al., 2017) that can interpret medical images and generate written reports to augment radiologists' work.



Our research aims to develop machine learning models that reason jointly on medical images and clinical text for advanced search, retrieval, annotation and description of medical images. Specifically, we aim to automatically generate descriptions of medical images, to develop a medical visual question answering system, and to develop medical dialog agents that interact with patients to answer their queries based on their medical data.

2 Background

Deep Neural Networks (DNNs) are a special class of machine learning algorithms that learn in multiple levels, corresponding to different levels of abstraction. In this section, we provide an overview of two of the most common DNNs, namely Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We also profile their architectures and various successful applications in CV and NLP.

2.1 Convolutional Neural Networks

In the past, problems such as image classification and object detection were approached using a traditional CV pipeline where hand-crafted features were first extracted, followed by learning algorithms (Srinivas et al., 2016). The performance of these systems highly depends upon the quality of the extracted features and the ability of the learning algorithms (Fu and Rui, 2017). As CV progressed, extracting these complex features became a tedious task, giving rise to algorithms that can learn directly from the raw data without the need for hand-crafted feature engineering. The major breakthrough happened in 2012, when object classification on ImageNet (Russakovsky et al., 2015) improved vastly from a top-5 error of 25% in 2011 to 16% in 2012. This was the result of the shift from hand-engineered features to learned deep features (Felsberg, 2017). AlexNet was the first deep learning model that won the ILSVRC championship in 2012, drastically reducing the top-5 error rate on the ImageNet Challenge compared to the previous shallow networks. Since AlexNet, a series of CNN models have been proposed that advanced the state of the art, such as VGG-16 (Simonyan and Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and Residual Networks (ResNet) (He et al., 2016). All these models differ in terms of various structural decompositions, which give them better learning ability and high predictive performance.

2.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are special networks that process sequential or temporal data, including language, speech, and handwriting. Learning sequential data requires memory of previous states and a feedback mechanism. RNNs form an internal state of the network where connections between units form a cycle, which allows them to exhibit dynamic temporal behavior (Lee et al., 2017). Simple RNNs suffer from the vanishing or exploding gradients problem when trained with gradient-based techniques. To overcome these challenges, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Cho et al., 2014) were introduced, which make it possible to learn very deep RNNs and can successfully remember sequences of varying lengths.

2.3 Joint Image and Language Modeling

Due to the success of deep learning techniques in individual domains of AI, including vision, speech and language, researchers are aiming at problems at the intersection of vision, language, knowledge representation and common-sense reasoning. An ultimate goal of CV is to have comprehensive visual understanding that involves not only naming the classes of objects present in a scene, but also describing their attributes and recognizing relationships between objects (Krishna et al., 2017). Much progress has been made towards this goal, including object classification (Krizhevsky et al., 2012), object detection and localisation (Girshick et al., 2014), and object and instance segmentation (He et al., 2017a). On the other hand, the overall goal of NLP is to understand, draw inferences from, summarise, translate and generate accurate natural text and language. State-of-the-art results on various NLP tasks, including Part-of-Speech tagging (Collobert et al., 2011), Parsing (Dyer et al., 2015; Vinyals et al., 2015), Named Entity Recognition (Collobert et al., 2011), Semantic Role Labeling (Zhou and Xu, 2015; He et al., 2017b), and machine translation (Sutskever et al., 2014; Wu et al., 2016), are pushing towards that goal. Early work combining vision with language includes image annotation, where the task is to assign labels to an image. However, image annotation only associates isolated words with the image content and ignores the relationships between objects and their relation to the world.


To generate a coherent interpretation of a scene and describe it in a natural way, the task of image captioning emerged within the language-vision community, together with large-scale captioning datasets including Flickr30k (http://shannon.cs.illinois.edu/DenotationGraph/) and MSCOCO (http://cocodataset.org/). Captioning involves generating a textual description that verbalizes the most salient aspects (objects, attributes, scene properties) of the image by analyzing it. In order to tackle more complex tasks that combine vision and language and to develop high-level reasoning, Visual Question Answering (VQA) (Stanislaw et al., 2015) was proposed, which is equivalent to a Visual Turing Test. In VQA, the goal is to predict the answer correctly after reasoning over the image and a question in natural text (Teney et al., 2017). To further extend this task, Visual Dialog (Das et al., 2017a,b) was proposed, which requires an AI agent to hold meaningful dialog with humans in natural language about visual content. Apart from this, research is moving towards linking language to actions in the real world, also known as language grounding (Chen and Mooney, 2011), which finds applications in human-robot interaction, robotic navigation and manipulation. Although there has been language-vision research for these general-purpose tasks, its progress has been underutilised in healthcare.

3 Related Work

Developing clinical decision support has long been a major research focus in medical image processing. In recent years, deep learning models have outperformed conventional machine learning approaches in tasks such as dermatologist-level classification of skin lesions (Esteva et al., 2017), detection of liver lesions (Ben-Cohen et al., 2016), detection of pathological-image findings (Zhang et al., 2017a), automated detection of pneumonia from chest X-rays (Rajpurkar et al., 2017), and segmentation of brain MRI (Milletari et al., 2016). Although there are many publicly available datasets for general-purpose tasks (Ferraro et al., 2015), there are few publicly available datasets in the medical domain. Recently, (Wang et al., 2017) introduced a large-scale chest X-ray dataset named ChestX-ray8 that is publicly available. The dataset consists of 112,120 frontal-view chest X-ray images of 30,805 patients.


The labels of the images are automatically assigned by applying NLP techniques to the paired radiology reports. (Zhang et al., 2017b) proposed MDNet, which can read pathology bladder cancer images, generate diagnostic reports, retrieve images by symptom descriptions, and provide justification of the decision process by highlighting image regions using an attention mechanism. Moreover, (Shin et al., 2016) proposed a CNN-RNN model that can efficiently detect a disease in a medical image, find the context (e.g. location and severity of the affected organ) and also correlate the salient regions of the image with Medical Subject Headings (MeSH) terms. They work on Open-i (U.S. NLM), a publicly available dataset that consists of 3,955 radiology reports from the Indiana Network for Patient Care and 7,470 associated chest X-rays from the hospitals' PACS. Evaluation in terms of BLEU score (Papineni et al., 2002) demonstrates that the model is able to locate diseases and generate Medical Subject Headings (MeSH) terms with high precision.

In addition, the ImageCLEF challenges (http://www.imageclef.org/) have been leading advances in the medical field by promoting evaluation of technologies for annotation, indexing and retrieval of textual data and medical images (Ionescu et al., 2017). Motivated by the need for automated image understanding methods in the healthcare domain, ImageCLEF organized its first concept detection and caption prediction tasks in 2017 (Eickhoff et al., 2017). The ImageCLEFcaption challenge consists of two subtasks: concept detection and caption prediction. The concept detection task consists of identifying the UMLS Concept Unique Identifiers (CUIs). The majority of the submissions consider concept detection as a multi-label classification task. As both of these tasks are inter-related, there has been work where concepts in the medical images are first identified and captions are then generated based on the predicted concepts. (Abacha et al., 2017) consider CUIs in the training set as the labels to be assigned; two methods, namely a CNN-based approach and Binary Relevance via Decision Trees (BR-DT), were used. In (Hasan et al., 2017), an encoder-decoder based framework is used where image features are extracted using a CNN and an RNN-based architecture with an attention mechanism is used to translate the image features to relevant captions.



In (Rahman et al., 2017), a Content Based Image Retrieval (CBIR) based approach is used where images in both the training and validation sets are first indexed by extracting several low-level color, texture and edge-related visual features. A similarity search is then used to find the closest matching image in the training (or validation) set for each query (test) image for caption prediction. For similarity matching, the features are concatenated to form a combined feature vector and Euclidean distance is used for k-Nearest Neighbor image similarity.
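A minimal sketch of such a k-nearest-neighbour retrieval step with Euclidean distance is shown below, assuming the concatenated feature vectors have already been extracted; the feature extraction itself (colour, texture, edge features) is not reproduced, and the function names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_index(train_features):
    """train_features: (n_images, n_dims) array of concatenated visual features."""
    index = NearestNeighbors(n_neighbors=1, metric="euclidean")
    index.fit(train_features)
    return index

def retrieve_captions(index, train_captions, query_features):
    """Return the caption of the closest training image for each query image."""
    _, nn_idx = index.kneighbors(query_features)
    return [train_captions[i] for i in nn_idx[:, 0]]

# Illustrative usage with random stand-in features:
index = build_index(np.random.rand(100, 64))
print(retrieve_captions(index, [f"caption {i}" for i in range(100)], np.random.rand(2, 64)))
```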

In the ImageCLEF challenge, submissions varied in their usage of external resources. For instance, (Hasan et al., 2017) perform semantic pre-processing of captions using MetaMap and the UMLS meta-thesaurus. Pre-training CNN models on PubMed Central images helped boost effectiveness compared to training on the general-purpose ImageNet dataset. Although these challenges provide labeled medical images for modality classification and concept prediction, the datasets are still much smaller (thousands of images) than the ImageNet dataset (Russakovsky et al., 2015), which contains 1.2 million natural images. Moreover, there are issues with the ImageCLEFcaption dataset, as the UMLS concepts are extracted using a probabilistic process which introduces errors. Analysis of the dataset showed that some of the images had no concepts attached.

Learning image context from the corresponding clinical text and generating textual reports very similar to those of radiologists has not yet been achieved. With recent advancements in machine learning (especially deep learning), it is not hard to imagine an opportunity to aid radiologists by developing multimodal clinical decision support systems.

4 Proposed Research

We identify research gaps at the intersection of medical imaging, computer vision and natural language processing, as listed in the following research questions. Our work will address some of these gaps.

How to automatically generate a radiology report for a given medical image?

In medical imaging, the accurate diagnosis or assessment of a disease depends on both image acquisition and image interpretation. While image acquisition has improved substantially due to faster rates and increased resolution of the acquisition devices, image interpretation is still performed by a radiologist, who has only a few minutes with an imaging study to describe the findings in the form of a radiology report. Such reporting is a time-consuming task and often represents a bottleneck in the clinical diagnosis pipeline (Ionescu et al., 2017). We will develop machine learning models that automatically generate radiology reports by interpreting medical images in order to augment radiology practice.

How to develop a question answering system that can reason over medical images?

There has been growing interest in AI to support clinical decision making and to improve clinical work-flow through better patient engagement. Automated systems that can interpret complex medical images and provide findings in natural language text can significantly enhance the productivity of hospitals, reduce the burden on radiologists and provide a "second opinion", leading to reduced errors in radiology practice. VQA (Stanislaw et al., 2015) has been successful on generic images, but it has not been explored in the medical domain. We will develop machine learning models that combine NLP and CV techniques to answer clinically relevant questions based on medical images. Subsequently, clinical visual dialog systems could be developed based on the models for medical VQA. The dialog agent will respond to a patient's queries in an interactive manner based on medical images, clinical text reports and the past history of the patient.

How to annotate medical images from the accompanying radiology reports in a weakly supervised manner?

A large volume of medical imaging data and text is accumulated in hospitals' PACS. Harnessing this data for advancing healthcare is challenging. Manual annotation of medical data is almost impossible due to the complex nature of medical images, the required domain expertise, privacy, ethics and healthcare data regulations. The processing of clinical text is challenging due to combinations of ad-hoc formatting, eliding of words which can be inferred from context, and liberal use of parenthetical expressions, jargon and acronyms to increase the information density. We will explore NLP techniques to annotate medical images from the accompanying radiology reports.


How to highlight the relevant area in a medical image based on the features extracted from radiology reports?

Although machine learning, especially deep learning, models have been successful in various domains, they are often treated as black boxes. While this might not be a problem in other, more deterministic domains such as image annotation (where the end user can objectively validate the tags assigned to the images), in health care not only the quantitative algorithmic performance is important, but also the reason why the algorithm works is relevant. In fact, model interpretability is crucial for convincing medical professionals of the validity of actions recommended by predictive systems. We will develop models using CV, NLP and attention mechanisms which highlight the relevant area in a medical image based on the features extracted from the radiology reports.

How to train machine learning models when data is small or classes are imbalanced?

Obtaining datasets in the medical imaging domain that are as comprehensively annotated as ImageNet remains a challenge. When sufficient data is not available, transfer learning or fine-tuning are the ways to proceed. In transfer learning, CNN models pre-trained on a natural image dataset or on a different medical domain are used for the new medical task at hand. In fine-tuning, when a medium-sized dataset does exist for the task at hand, one suggested scheme is to use a pre-trained CNN as the initialisation of the network, after which further supervised training of selected network layers is conducted using the new data for the task at hand. In this task, we will explore the effectiveness of transfer learning and fine-tuning in the medical domain.
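As a minimal sketch of these two schemes (not the proposal's actual setup), one can start from an ImageNet-pretrained CNN, optionally freeze the pretrained layers for pure transfer learning, and train a new task-specific head; the backbone choice and the number of target classes are assumptions.

```python
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes=14, freeze_backbone=True):
    """ImageNet-pretrained ResNet-50 adapted to a new medical imaging task."""
    model = models.resnet50(pretrained=True)
    if freeze_backbone:
        # Transfer learning: keep the pretrained feature extractor fixed.
        for param in model.parameters():
            param.requires_grad = False
    # New task-specific classification head (always trainable).
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

With `freeze_backbone=False`, the same function corresponds to fine-tuning the whole network from the pretrained initialisation rather than training only the new head.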

How to incorporate the temporal nature of diseases in machine learning models?

Diseases evolve and change over time in a non-deterministic manner. Existing deep learning models assume static vector-based inputs, which do not take the time factor into consideration. In order to understand the temporal nature of healthcare data, we need to develop deep learning models whose parameters get incrementally updated over time. Considering that the time factor is important in all kinds of healthcare problems, training a time-sensitive machine learning model is critical for a better understanding of the patient's condition and for providing timely clinical decision support. We will explore ways to incorporate temporal information into machine learning models to enable temporal reasoning. This will help in understanding the progressive nature of diseases and in alerting medical staff about the changing conditions of patients at the right time.

How to increase the number of features to improve the performance and robustness of CDSS?

Due to the rise of Electronic Health Records (EHR), hospitals store data in various forms, including the patient's medical history, demographics, progress notes, medications, vital signs, immunizations, laboratory data, genetics and genomics data, and radiology reports. Combining two or more modalities allows integration of the strengths of the individual modalities. We will work towards combining various data sources in healthcare so that better decisions can be made, in turn contributing to the overall goal of precision medicine.

How to develop bi-directional models for medical indexing and retrieval?

With the widespread use of EHR and PACS technology in hospitals, the size of medical data is growing rapidly, which in turn demands effective and efficient retrieval systems. Clinical and radiology practices heavily rely on processing stored medical data, providing aid in decision making and increasing productivity. Existing medical retrieval systems have limitations in terms of the semantic gap between the low-level visual information captured by imaging devices and the high-level semantics perceived by humans (Qayyum et al., 2017). We will develop bi-directional multimodal machine learning models that perform retrieval based on both textual and visual content. The proposed approach can retrieve medical images based either on a textual query or on sample query images as input. In addition, the developed model can also align images and text in large medical data collections.

5 Experimental Framework

5.1 Datasets

The proposed research work has approval from the Macquarie University Human Research Ethics Committee to use medical data from Macquarie University Hospital.


We will also use publicly available datasets such as ChestX-Ray8, Open-i (https://openi.nlm.nih.gov/), and the ImageCLEF (http://www.imageclef.org/) challenge datasets. These datasets comprise medical images and their accompanying text in the form of disease labels or captions, mined from open-source biomedical literature and image collections.

5.2 Evaluation Metrics

For the medical captioning task, we will use standard image captioning metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). For VQA in the medical domain, we will use accuracy for multiple-choice questions; to measure how much a predicted answer differs from the ground truth based on differences in their semantic meaning, Wu-Palmer Similarity (Wu and Palmer, 1994) will be used. For the visual-dialog task in the medical domain, an algorithm has to return candidate answers given a medical image, dialog history, question, and a list of candidate answers. We will use two standard retrieval metrics, namely recall@k and mean reciprocal rank (MRR) (Das et al., 2017a). For the medical retrieval system, the evaluation task is to measure how effectively an algorithm is able to produce search results that satisfy the user's query, given in the form of a sample image or a complex textual query. For this task, standard information retrieval metrics such as Precision, Recall, and F-score will be used.
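For concreteness, recall@k and MRR can be computed directly from the rank of the ground-truth answer within the returned candidate list; a small self-contained sketch follows (the example ranks are made up).

```python
def recall_at_k(gt_ranks, k):
    """Fraction of examples whose ground-truth answer is ranked within the top k."""
    return sum(1 for r in gt_ranks if r <= k) / len(gt_ranks)

def mean_reciprocal_rank(gt_ranks):
    """Average of 1/rank of the ground-truth answer (ranks are 1-based)."""
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

# e.g. ranks of the correct answer among 100 candidates for three dialog turns:
ranks = [1, 4, 20]
print(recall_at_k(ranks, 5), mean_reciprocal_rank(ranks))   # 0.666..., ~0.433
```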

5.3 Baseline Methods

There are three main approaches to generating image captions: (1) using templates that rely on detectors and map the output to linguistic structures; (2) using language models that yield more expressive captions, overcoming the limitations of the template-based approach; and (3) caption retrieval and recombination, which involves retrieving captions based on the training data instead of generating new captions. We will work on the CNN-RNN framework and caption retrieval approaches. The model proposed by Hasan et al. (2017) was ranked first in the caption prediction task of the ImageCLEF challenge and was based on a deep learning approach using language models.


Apart from this, deep learning methods have demonstrated successful results in general-purpose image captioning; therefore, the first baseline method is to incorporate an encoder-decoder based architecture. Specifically, initial image features will be extracted using a CNN model, namely VGG-19, which is pre-trained on the ImageNet dataset and fine-tuned on the given ImageCLEF training dataset, extracting the image features from a lower convolution layer so that the decoder can focus on the salient aspects of the image via an attention mechanism. Second, text features will be extracted and pre-processed. Two reserved words, namely start and end, are appended to indicate the start and end of the captions. During training, the output of the last hidden layer of the CNN model (encoder) is given to the first time step of the LSTM (decoder). We set x1 = start and the desired label y1 = the first word of the caption. Similarly, we set all the remaining words, and finally the last target label yT = the end token. The model will be trained with an adaptive learning rate optimization algorithm, with dropout as a regularization mechanism. The model hyper-parameters are tuned based on the BLEU score on the validation set. Once the model is trained, captions are generated for the test images by predicting one word at every time step based on the context vector, the previous hidden state, and the previously generated words.
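The following is a minimal PyTorch sketch of such an encoder-decoder baseline, with a pre-trained VGG-19 encoder and an LSTM decoder trained with teacher forcing; the attention mechanism is omitted for brevity, and the embedding and hidden sizes are assumed values rather than tuned hyper-parameters.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionBaseline(nn.Module):
    """Minimal CNN-encoder + LSTM-decoder sketch (attention omitted for brevity)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        vgg = models.vgg19(pretrained=True)                    # ImageNet-pretrained encoder
        self.encoder = nn.Sequential(*list(vgg.features.children()))
        self.img_proj = nn.Linear(512, embed_dim)              # 512 = VGG-19 feature channels
        self.embed = nn.Embedding(vocab_size, embed_dim)       # vocabulary includes start/end tokens
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        """Teacher forcing: feed the image feature, then the gold caption shifted right."""
        feats = self.encoder(images).mean(dim=[2, 3])          # global pool over the conv map
        img_token = self.img_proj(feats).unsqueeze(1)          # image as the first decoder input step
        word_tokens = self.embed(captions[:, :-1])             # all but the final end token
        inputs = torch.cat([img_token, word_tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                # per-step scores over the vocabulary
```

Training would then minimize a cross-entropy loss between these per-step vocabulary scores and the gold caption tokens (including the end token).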

6 Conclusion

We argue for the need for language and vision research in the medical domain by showing its successful applications on general-purpose tasks. We identify various research directions in medical imaging applications that have not been fully explored and can be addressed by combining vision and language processing. This research aims to develop machine learning models that jointly reason over medical images and accompanying clinical text in radiology. The proposed research will be fruitful in advancing healthcare by building various clinical decision support systems to augment radiologists' work.

Acknowledgements I would like to thank my supervisors, Dr. Sarvnaz Karimi, Dr. Kevin Ho-Shon and Dr. Len Hamey, for their feedback that greatly improved the paper. This research is supported by a Macquarie University Research Training Program scholarship. I am also thankful to Google for providing a travel grant to attend the conference.


References

Asma Ben Abacha, Alba G. Seco de Herrera, Soumya Gayen, Dina Demner-Fushman, and Sameer Antani. 2017. NLM at ImageCLEF 2017 Caption Task. In CLEF2017 Working Notes, CEUR Workshop Proceedings, Dublin, Ireland.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic Propositional Image Caption Evaluation. In European Conference on Computer Vision, Amsterdam, The Netherlands.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, United States.

Avi Ben-Cohen, Idit Diamant, Eyal Klang, Michal Amitai, and Hayit Greenspan. 2016. Fully convolutional network for liver segmentation and lesions detection. In Deep Learning and Data Labeling for Medical Applications, pages 77–85, Cham. Springer International Publishing.

Leonard Berlin. 2001. Defending the "missed" radiographic diagnosis. American Journal of Roentgenology, 176(2):863–867.

Leonard Berlin and Ronald W. Hendrix. 1998. Perceptual errors and negligence. American Journal of Roentgenology, 170(4):863–867.

Adrian P. Brady. 2017. Error and discrepancy in radiology: inevitable or avoidable? Insights into Imaging, 8(1):171–182.

Tianrun Cai, Andreas A. Giannopoulos, Sheng Yu, Tatiana Kelil, Beth Ripley, Kanako K. Kumamaru, Frank J. Rybicki, and Dimitrios Mitsouras. 2016. Natural language processing technologies in radiology research and clinical applications. RadioGraphics, 36(1):176–191. PMID: 26761536.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, pages 859–865, San Francisco, California.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M.F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1080–1089, Hawaii, United States.

Abhishek Das, Satwik Kottur, Jose M.F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2970–2979, Venice, Italy.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack Long Short-Term Memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China.

Carsten Eickhoff, Immanuel Schwall, Alba Garcia Seco de Herrera, and Henning Muller. 2017. Overview of ImageCLEFcaption 2017 - Image Caption Prediction and Concept Detection for Biomedical Images. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017.

Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542:115–118.

Michael Felsberg. 2017. Five years after the deep learning revolution in computer vision: State of the art methods for online image and video analysis. Linkoping: Linkoping University Press, pages 1–13.

Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, and Margaret Mitchell. 2015. A survey of current datasets for vision and language research. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 207–213, Lisbon, Portugal.

Jianlong Fu and Yong Rui. 2017. Advances in deep learning approaches for image tagging. APSIPA Transactions on Signal and Information Processing, 6:e11.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, Washington, DC, United States.


Peter Hall and Paul McKevitt. 1995. Integrating vision processing and natural language processing with a clinical application. In Proceedings 1995 2nd New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, pages 373–376, Dunedin, New Zealand.

Sadid A. Hasan, Yuan Ling, Joey Liu, Rithesh Sreenivasan, Shreya Anand, Tilak Raj Arora, Vivek Datla, Kathy Lee, Ashequl Qadir, Christine Swisher, and Oladimeji Farri. 2017. PRNA at ImageCLEF 2017 Caption Prediction and Concept Detection tasks. In CLEF2017 Working Notes, CEUR Workshop Proceedings, Dublin, Ireland.

Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. 2017a. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision, pages 2980–2988, Venice, Italy.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, Nevada, United States.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017b. Deep semantic role labeling: What works and what's next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 473–483, Vancouver, Canada.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Bogdan Ionescu, Henning Muller, Mauricio Villegas, Helbert Arenas, Giulia Boato, Duc-Tien Dang-Nguyen, Yashin Dicente Cid, Carsten Eickhoff, Alba G. Seco de Herrera, Cathal Gurrin, Bayzidul Islam, Vassili Kovalev, Vitali Liauchuk, Josiane Mothe, Luca Piras, Michael Riegler, and Immanuel Schwall. 2017. Overview of ImageCLEF 2017: Information extraction from images. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 315–337, Cham. Springer International Publishing.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA.

JG Lee, S Jun, YW Cho, H Lee, GB Kim, JB Seo, and N Kim. 2017. Deep learning in medical imaging: General overview. Korean Journal of Radiology, 18(4):570–584.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In 42nd Annual Meeting of the Association for Computational Linguistics, volume Text Summarization Branches Out, pages 1–8, Barcelona, Spain.

Mike Mastanduno. 2017. Survey of deep learning in radiology. [Online; posted 19-January-2017].

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. CoRR, abs/1606.04797.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, United States.

Adnan Qayyum, Syed Muhammad Anwar, Muhammad Awais, and Muhammad Majid. 2017. Medical image retrieval using deep convolutional neural network. Neurocomputing, 266:8–20.

Mahmudur Rahman, Terrance Lagree, and Martina Taylor. 2017. A cross-modal concept detection and caption prediction approach in ImageCLEFcaption track of ImageCLEF 2017. In CLEF2017 Working Notes, CEUR Workshop Proceedings, Dublin, Ireland.

Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. 2017. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. CoRR, abs/1711.05225.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252.

Hoo-Chang Shin, Kirk Roberts, Le Lu, Dina Demner-Fushman, Jianhua Yao, and Ronald M. Summers. 2016. Learning to read chest X-rays: Recurrent neural cascade model for automated image annotation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 2497–2506.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

C. A Sohani. 2013. A difficult challenge for radiology. The Indian Journal of Radiology and Imaging, 23(1):110–112.


Suraj Srinivas, Ravi Kiran Sarvadevabhatla, Konda Reddy Mopuri, Nikita Prabhu, Srinivas S. S. Kruthiventi, and R. Venkatesh Babu. 2016. A taxonomy of deep convolutional neural nets for computer vision. Frontiers in Robotics and AI, 2:36.

Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C. Lawrence, and Parikh Devi. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision, pages 2425–2433, Santiago, Chile.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, United States.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, Boston, Massachusetts, United States.

Damien Teney, Qi Wu, and Anton van den Hengel. 2017. Visual Question Answering: A tutorial. IEEE Signal Processing Magazine, 34(6):63–75.

NIH U.S. NLM. Open-i: An open access biomedical search engine. https://openi.nlm.nih.gov/.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, Boston, Massachusetts, United States.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2773–2781, Cambridge, MA, United States.

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. 2017. ChestX-Ray8: Hospital-scale chest X-Ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3462–3471, Hawaii, United States.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, New Mexico.

Zizhao Zhang, Pingjun Chen, Manish Sapkota, and Lin Yang. 2017a. TandemNet: Distilling knowledge from medical images using diagnostic reports as optional semantic references. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2017, pages 320–328, Quebec City, Quebec, Canada.

Zizhao Zhang, Yuanpu Xie, Fuyong Xing, Mason McGough, and Lin Yang. 2017b. MDNet: A semantically and visually interpretable medical image diagnosis network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3549–3557, Hawaii, United States.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1127–1137, Beijing, China.


Proceedings of ACL 2018, Student Research Workshop, pages 37–44, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Recognizing Complex Entity Mentions: A Review and Future Directions

Xiang Dai
CSIRO Data61 and School of IT, University of Sydney
Sydney, Australia

[email protected]

Abstract

Standard named entity recognizers can effectively recognize entity mentions that consist of contiguous tokens and do not overlap with each other. However, in practice, there are many domains, such as the biomedical domain, in which there are nested, overlapping, and discontinuous entity mentions. These complex mentions cannot be directly recognized by conventional sequence tagging models because they may break the assumptions on which sequence tagging techniques are built. We review the existing methods that have been revised to tackle complex entity mentions and categorize them as token-level and sentence-level approaches. We then identify the research gap and discuss some directions that we are exploring.

1 Introduction

Named entity recognition (NER), the task of identifying and classifying named entities (NE) within text, has received substantial attention. This is largely due to its crucial role in several downstream tasks, such as entity linking (Limsopatham and Collier, 2016; Pan et al., 2017), relation extraction (Zeng et al., 2014), question answering (Molla et al., 2007) and knowledge base construction (Zhang, 2015).

Traditionally, the NER problem can be defined as: given a sequence of tokens, output a list of tuples ⟨I_s, I_e, t⟩, each of which is a NE mention in the text. Here, I_s and I_e are the starting and ending indices of the NE mention, respectively, and t is the type of the entity from a pre-defined category scheme. There are two assumptions associated with this perspective:

1. An NE mention consists of contiguous tokens, where all the tokens indexed between I_s and I_e are part of the mention; and,

2. These linear spans do not overlap with each other. In other words, no token in the text can belong to more than one NE mention.

Based on these two assumptions, the most common approach to NER is to use sequence tagging techniques with a BIO or BIOLU label set. Each token is assigned a tag which is usually composed of a position indicator and an entity type. The position indicator represents the token's role in a NE mention. In the BIOLU schema, B stands for the beginning of a mention, I for the intermediate of a mention, O for outside a mention, L for the last token of a mention, and U for a mention having only one token (Ratinov and Roth, 2009).
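To make the scheme concrete, the following minimal sketch (hypothetical tokens and labels, not an example from any of the cited corpora) shows a BIOLU-tagged sentence and how a tag sequence is decoded back into ⟨I_s, I_e, t⟩ tuples:

```python
def decode_biolu(tokens, tags):
    """Decode a BIOLU tag sequence into (start, end, type) mention tuples."""
    mentions, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        indicator, ent_type = tag.split("-")
        if indicator == "U":                             # single-token mention
            mentions.append((i, i, ent_type))
        elif indicator == "B":                           # mention starts here
            start = i
        elif indicator == "L" and start is not None:     # mention ends here
            mentions.append((start, i, ent_type))
            start = None
    return mentions

tokens = ["John", "visited", "New", "York", "City", "."]
tags   = ["U-PER", "O", "B-LOC", "I-LOC", "L-LOC", "O"]
print(decode_biolu(tokens, tags))   # [(0, 0, 'PER'), (2, 4, 'LOC')]
```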

Sequence tagging models, such as linear-chain CRFs and BiLSTM-CRF, have achieved state-of-the-art effectiveness on many NER data sets (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016), since most training data sets are also annotated based on these two assumptions.

However, in practice, there are many domains, such as the biomedical domain, which involve nested, overlapping, and discontinuous NE mentions that break the two assumptions mentioned above. We categorize these mentions as complex entity mentions, and note that standard tagging techniques cannot be applied directly to recognize them (Muis and Lu, 2016; Dai et al., 2017). In the following paragraphs, we explain these complex entity mentions in detail.

Nested NE mentions   One NE mention is completely contained by the other. We call both of the mentions involved nested entity mentions. Figure 1a is an example taken from the GENIA corpus (Kim et al., 2003).


Figure 1: Examples involving overlapping, discontinuous and nested NE mentions. In (a), 'HIV-1 enhancer' and 'HIV-1' are nested NE mentions. In (b), 'intense pelvic pain' and 'back pain' overlap; meanwhile, 'intense pelvic pain' is a discontinuous mention.

Here, 'HIV-1 enhancer' is a DNA mention, and it contains another mention 'HIV-1', which is a virus.

Multi-type NE mentions   An extreme case of nested NE mentions is one in which an NE mention has multiple entity types. For example, in the EPPI corpus (Alex et al., 2007), proteins can also be annotated as drug/compound, indicating that the protein is used as a drug to affect the function of a cell. Such a mention should be classified as both protein and drug/compound. In this case, we consider this mention as two mentions of different types, and these two mentions contain each other.

Overlapping NE mentions   Two NE mentions overlap, but neither is completely contained by the other. Figure 1b is an example taken from the CADEC corpus (Karimi et al., 2015), which is annotated for adverse drug events (ADEs) and relevant concepts. In this example, two ADEs, 'intense pelvic pain' and 'back pain', share a common token 'pain', and neither is contained by the other.

Discontinuous NE mentions   The mention consists of discontiguous tokens. In other words, the mention contains at least one gap. In Figure 1b, 'intense pelvic pain' is a discontinuous NE mention since it is interrupted by 'and back'.

These complex NE mentions can hold very useful information for downstream tasks. Sometimes, the nested and overlapping structure itself is already a good indicator of the relationship between the entities involved. For example, an ORG mention 'University of Sydney' contains a LOC mention 'Sydney'. This structure implies the location of the organization, and recognition of these mentions can potentially speed up the construction of a knowledge base. In addition, such entities often have fixed representations in different languages. Therefore, recognizing NE mentions, especially discontinuous NE mentions, can improve the performance of a machine translation system (Klementiev and Roth, 2006). Furthermore, we notice that similar complex structures also exist in other NLP tasks, such as multiword expression recognition (Baldwin and Kim, 2010). The ideas proposed for NER tasks can thus be applied to tackle similar difficulties in other tasks.

Below, we briefly review existing methods to recognize complex mentions and discuss their strengths and limitations. We also discuss the research directions we are exploring to address the research gaps.

2 Token-level Approach

Sequence tagging techniques take the representation of each token as input and output a label for each token. These local decisions are chained together to perform joint inference. Figure 2 is an illustration of a linear-chain CRF model where the tag of one token depends on both the features of that token in context and the tag of the previous token. The tag sequence predicted by the tagger is finally decoded into NE mentions using explicit rules. Here, the intermediate outputs for each token are usually BIO tags in standard NER tasks. However, since the BIO tags cannot effectively represent complex NE mentions, a natural direction is to expand the BIO tag set so that different kinds of complex entity mentions can be captured. We categorize the methods based on conventional sequence tagging as the token-level approach.

Metke-Jimenez and Karimi (2015) introduced a BIO variant schema to represent discontinuous and overlapping NE mentions. Concretely, in addition to the BIO tags, four new position indicators, BD, ID, BH, and IH, are proposed to denote Beginning of Discontinuous body, Inside of Discontinuous body, Beginning of Head, and Inside of Head.


Figure 2: In a linear-chain CRF model, the output for each token depends on the features of that token in context and the output for the previous token.

Figure 3: An encoding example of two NE mentions: 'intense pelvic pain' and 'back pain'. Here, we keep only the position indicator and remove the entity type, since this schema can only represent overlapping mentions of the same entity type.

Here, the word sequences which are shared by multiple mentions are called the head, and the remaining parts of the discontinuous mention are called the body. Figure 3 is an encoding example using this schema. 'pain' is the beginning of the head that is shared by two mentions, and therefore tagged as BH. 'intense pelvic' is the body of a discontinuous mention, while 'back' is the beginning of a continuous mention. We note that, even in this simple example, it is still impossible to represent several discontinuous mentions unambiguously. For example, this encoding can also be decoded as containing three mentions: 'intense pelvic pain', 'back pain' and 'pain'. Muis and Lu (2016) introduced the notion of model ambiguity and theoretically demonstrated that models based on BIO variants usually have a high ambiguity level, and therefore low precision in practice. Another limitation of this schema is that it supports only overlapping mentions of the same entity type.
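A small sketch of the ambiguity just described (our own reading of the Figure 3 encoding, shown only to illustrate why decoding is ambiguous):

```python
# Sentence from Figure 1b tagged with the BIO-variant scheme of
# Metke-Jimenez and Karimi (2015); the tag assignment below is illustrative.
tokens = ["intense", "pelvic", "and", "back", "pain"]
tags   = ["BD",      "ID",     "O",   "B",    "BH"]

# Both mention sets below are consistent with the same tag sequence:
decoding_a = [("intense", "pelvic", "pain"), ("back", "pain")]
decoding_b = [("intense", "pelvic", "pain"), ("back", "pain"), ("pain",)]
```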

Schneider et al. (2014) also proposed several BIO schema variants to encode multiword expressions with gaps and nested structure. They include two strict restrictions which are motivated linguistically in their work:

1. An expression can be completely contained within another expression, but no overlapping is allowed; and,

2. A contained expression cannot contain other expressions. In other words, the nested structure has at most two levels.

We note that these strict restrictions cannot be applied directly to our NER tasks.

Alex et al. (2007) proposed three approaches based on a maximum entropy model (Curran and Clark, 2003) to deal with nested NE mentions:

Layering   The tagger first identifies the innermost (or outermost) mentions; the following taggers are then used to identify mentions at successive levels of nesting. Finally, the output of the taggers on different layers is combined by taking the union.

Joined labeling   Each word is assigned a tag by concatenating the tags of all levels of nesting. Then a tagger is trained on the data containing the joined labels. During inference, the joined labels are decoded into their original BIO format for each entity type.

Cascade   Separate models are trained for each entity type or by grouping several entity types without nested structures. Similar to the layering approach, the latter models can utilize the outputs from previous models as input features. Despite the difficulty of ordering and grouping entity models and the fact that this approach cannot deal with nested mentions of the same entity type, the cascade approach still achieves the best results among these three approaches.

Byrne (2007) and Xu et al. (2017) used a similar approach to deal with nested NE mentions. They concatenated adjacent tokens (up to a certain length) into potential mention spans. Then these spans, together with their left and right contexts, are fed into a classifier (a maximum entropy tagger in Byrne (2007) and a feedforward neural network in Xu et al. (2017)). The classifier is trained to first predict whether the span is a valid NE mention, and then its entity type if it is a NE mention.
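A minimal sketch of this span-enumeration idea (the classifier interface is a placeholder; the actual features and models differ in Byrne (2007) and Xu et al. (2017)):

```python
def enumerate_spans(tokens, max_len=6):
    """Enumerate all contiguous candidate spans up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, tokens[start:end]

def classify_spans(tokens, classifier, max_len=6):
    """Feed each span plus its left/right context to a span classifier."""
    mentions = []
    for start, end, span in enumerate_spans(tokens, max_len):
        left, right = tokens[:start], tokens[end:]
        label = classifier(span, left, right)   # e.g. 'DNA', 'protein', or None
        if label is not None:
            mentions.append((start, end - 1, label))
    return mentions
```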

3 Sentence-level Approach

Instead of predicting whether a specific token or several tokens belong to a NE mention and its role in the mention, some methods directly predict a combination of NE mentions within a sentence. We categorize these methods as the sentence-level approach.


Figure 4: An example of a sentence with three NE mentions. P(ER) and L(OC) refer to the entity types.

McDonald et al. (2005) proposed a new perspective of NER as structured multi-label classification. Instead of a starting index and ending index, they represent each NE mention using the set of token positions that belong to the mention. Figure 4 is an example of this representation, with each token tagged using an I/O schema. This representation is very flexible as it allows NE mentions consisting of discontiguous tokens and does not require mentions to exclude each other. Using this representation, the NER problem is converted into the multi-label classification problem of finding up to k correct labels among all possible labels, where k is a hyper-parameter of the model. Labels can be decoded to all possible NE mentions in the sentence. They do not come from a pre-defined category but depend on the sentence being processed. McDonald et al. (2005) used large-margin online learning algorithms to train the model, so that the scores of correct labels (NE mentions) are higher than those of all other possible incorrect mentions. Another advantage of this method is that the outputs of the model are unambiguous for all kinds of complex entity mentions and easy to decode, although the method suffers from an O(n³T) inference algorithm, where n is the length of the sentence and T is the number of entity types.
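As an illustration of this representation (a sketch with made-up indices, not McDonald et al.'s implementation), each mention is simply the set of token positions it covers, so discontinuous and overlapping mentions pose no representational problem:

```python
tokens = ["intense", "pelvic", "and", "back", "pain"]

# Each mention is a (frozenset of token positions, entity type) pair.
mentions = [
    (frozenset({0, 1, 4}), "ADE"),   # 'intense pelvic pain' (discontinuous)
    (frozenset({3, 4}),    "ADE"),   # 'back pain' (overlaps on token 4)
]

def surface_form(tokens, positions):
    """Recover the textual form of a mention from its position set."""
    return " ".join(tokens[i] for i in sorted(positions))

for positions, ent_type in mentions:
    print(ent_type, "->", surface_form(tokens, positions))
```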

Finkel and Manning (2009) used a discriminative constituency parser to recognize nested NE mentions. They represent each sentence as a constituency tree, where each mention corresponds to a phrase in the tree. In addition, each node needs to be annotated with its parent and grandparent labels, so that the CRF-CFG parser can learn how NE mentions nest. Ringland (2016) also explored a joint model using the Berkeley parser (Petrov et al., 2006), and showed that it performed well even without specialized NER features. However, one disadvantage of their models, as in McDonald et al. (2005), is that their time complexity is cubic in the number of tokens in the sentence. Furthermore, high quality parse training data, which is not always available, plays a crucial role in the success of the joint model (Li et al., 2017).

Figure 5: An example sub-hypergraph with two nested NE mentions: 'University of Iowa' (ORG) and 'Iowa' (LOC). Here, one mention corresponds to a path consisting of (AETI+X) nodes. Note that this hypergraph cannot be used to represent discontinuous mentions, but Muis and Lu (2016) expand the hypergraph representation to capture discontinuous mentions through two new node types: B for within the mention, and O for part of the gap.

Lu and Roth (2015), extended by Muis and Lu (2016), proposed a novel hypergraph to compactly represent exponentially many possible nested mentions in one sentence, and one sub-hypergraph of the complete hypergraph can therefore be used to represent a combination of mentions in the sentence. Figure 5 is an example of such a sub-hypergraph, which represents two nested NE mentions.

The training objective of these models is to maximize the log-likelihood of training instances consisting of the sentence and the mention-encoded hypergraph. During inference, the model first predicts a sub-hypergraph among all possible sub-hypergraphs of the complete hypergraph, and the predicted mentions can be decoded from the output sub-hypergraph.

This hypergraph representation still suffers from some degree of ambiguity during the decoding stage.


For example, when one mention is contained by another mention with the same entity type and their boundaries are all different, the hypergraph can be decoded in different ways. This ambiguity comes from the fact that, if one node has multiple parent nodes and multiple child nodes, there is no mechanism to decide which parent node is paired with which child node.

4 Research Plan

We note that the drawbacks of existing methods can be broadly categorized into: (1) lack of expressivity; and (2) computational complexity. Most token-level approaches were proposed for specific scenarios or data sets, and therefore usually come with strict restrictions. For example, the BIO variant schema in Schneider et al. (2014) was designed for nested structures with a maximum depth of two. It is therefore difficult to apply to the GENIA corpus (Kim et al., 2003), which contains nested entities up to four layers of embedding. In addition, these token-level methods are usually devised to deal with either nested or discontinuous mentions only, and can seldom be used to tackle all kinds of complex entity mentions simultaneously.

In contrast, sentence-level approaches are overall more flexible and less ambiguous, but come with higher computational cost. For example, the methods of both Finkel and Manning (2009) and McDonald et al. (2005) suffer from a time complexity that is cubic in the number of tokens in the sentence. Our aim is to propose a model that recognizes all kinds of complex entity mentions, with a low ambiguity level and low computational complexity. Some specific directions include:

• The representation in McDonald et al. (2005), introduced in Section 3, is the most flexible and straightforward among all schemes designed for representing complex entity mentions. It can be used to represent nested, overlapping and discontinuous entity mentions with unbounded length and depth. We are exploring recent advances in multi-label classification methods (Xu et al., 2016; Shi et al., 2017) to reduce the computational complexity of this approach.

• Sequence-to-sequence models (Sutskever et al., 2014; Cho et al., 2014) have achieved great success in machine translation and text generation tasks, especially when enhanced by attention mechanisms (Luong et al., 2015). We are exploring extending the encoder-decoder architecture to recognize complex entity mentions. During the inference stage, instead of one tag sequence capturing all mentions in the sentence, the decoder can produce multiple sequences, each of which corresponds to one possible mention combination, analogous to several possible target sentences in machine translation tasks.

• Supervised NER methods are affected by the quantity and quality of the available annotated corpora. However, since annotating mentions with complex structure requires more human effort than annotating only the outermost or longest continuous spans, training data for complex entity mention recognition is rare. Furthermore, the medical domain is where complex mentions widely exist, such as disorder mentions and adverse drug events. The cost of producing a gold standard corpus in such a domain is very high, due to the expertise required and the limited access to some medical text, such as electronic health records.

Active learning aims to reduce the cost of constructing a labeled dataset by keeping a human in the loop (Settles and Craven, 2008; Stanovsky et al., 2017). The model selects one or several of the most informative instances and presents them to the annotators. Since only these most informative instances need to be manually annotated by human experts, this reduces the need for human effort and therefore the cost of constructing a large labeled dataset. We are exploring this method to mitigate the lack of training data, as sketched after this list.

• Finally, we are going to utilize recent advances in the NER domain, such as character-level embeddings (Kuru et al., 2016) and joint models (Luo et al., 2015), to improve the effectiveness of complex entity mention recognizers.
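A minimal sketch of the active learning loop mentioned above (least-confidence uncertainty sampling; the model interface, scoring, and batch size are our own assumptions, not a method from the cited work):

```python
def least_confidence(model, unlabeled_pool, batch_size=10):
    """Select the sentences the model is least confident about for annotation."""
    scored = []
    for sentence in unlabeled_pool:
        # Assume model.predict returns (best_label_sequence, its_probability).
        _, confidence = model.predict(sentence)
        scored.append((confidence, sentence))
    scored.sort(key=lambda pair: pair[0])          # lowest confidence first
    return [sentence for _, sentence in scored[:batch_size]]

def active_learning_loop(model, labeled, unlabeled, annotate, rounds=5):
    """Iteratively query human annotators for the most informative sentences."""
    for _ in range(rounds):
        model.fit(labeled)
        for sentence in least_confidence(model, unlabeled):
            labeled.append((sentence, annotate(sentence)))   # human-in-the-loop
            unlabeled.remove(sentence)
    return model
```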

Besides employing active learning to create a specific data set with nested, overlapping, and discontinuous entity mentions, we notice that there are some off-the-shelf corpora in the biomedical domain that we can use to evaluate our proposed methods, although, to our knowledge, none of these data sets contains all three kinds of complex mentions: GENIA (Kim et al., 2003) only contains nested entity mentions, while CADEC (Karimi et al., 2015) and SemEval-2014 (Pradhan et al., 2014) contain overlapping and discontinuous mentions.


In addition, ACE¹ and NNE (Ringland, 2016) are newswire corpora with nested entity mentions.

¹ https://www.ldc.upenn.edu/collaborations/past-projects/ace

On these data sets, we will use standard evaluation metrics for NER tasks, namely micro-averaged precision, recall and F1-score, to evaluate the effectiveness of the proposed methods in recognizing both complex and simple mentions. However, due to the complexity of complex NE mentions, we will also include different boundary matching relaxations, such as partial match and approximate match (Tsai et al., 2006), to measure the proposed methods' ability to identify these complex mentions.
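As a rough illustration of exact versus relaxed matching (a sketch under our own simplifying assumptions, treating each mention as a set of token positions plus a type):

```python
def micro_prf(gold, predicted, match):
    """Micro-averaged precision/recall/F1 given a mention-matching predicate."""
    tp = sum(1 for p in predicted if any(match(p, g) for g in gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = (sum(1 for g in gold if any(match(p, g) for p in predicted)) / len(gold)
              if gold else 0.0)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def exact_match(pred, gold):
    return pred == gold                                  # same positions and same type

def partial_match(pred, gold):
    (p_pos, p_type), (g_pos, g_type) = pred, gold
    return p_type == g_type and bool(p_pos & g_pos)      # any token overlap

gold = [(frozenset({0, 1, 4}), "ADE"), (frozenset({3, 4}), "ADE")]
pred = [(frozenset({0, 1}), "ADE"), (frozenset({3, 4}), "ADE")]
print(micro_prf(gold, pred, exact_match))     # stricter boundary matching
print(micro_prf(gold, pred, partial_match))   # relaxed boundary matching
```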

5 Summary

We reviewed the existing methods for recognizing nested, overlapping and discontinuous entity mentions, categorizing them as token-level and sentence-level approaches, and discussed their strengths and limitations. We also identified the research gap and introduced some directions we are exploring.

Acknowledgments   The author thanks Sarvnaz Karimi, Ben Hachey, Cecile Paris and Data61's Language and Social Computing team for their support and helpful discussions. The author also thanks three anonymous reviewers for their insightful comments.

References

Beatrice Alex, Barry Haddow, and Claire Grover. 2007. Recognising nested named entities in biomedical text. In ACL Workshop on Biological, Translational, and Clinical Language Processing, pages 65–72, Prague, Czech Republic.

Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural Language Processing, pages 267–292.

Kate Byrne. 2007. Nested named entity recognition in historical archive text. In International Conference on Semantic Computing, pages 589–596, Irvine, California.

Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar.

James Curran and Stephen Clark. 2003. Language independent NER using a maximum entropy tagger. In Conference on Natural Language Learning, pages 164–167.

Xiang Dai, Sarvnaz Karimi, and Cecile Paris. 2017. Medication and adverse event extraction from noisy text. In Australasian Language Technology Workshop, pages 79–87, Brisbane, Australia.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Conference on Empirical Methods in Natural Language Processing, pages 141–150, Suntec, Singapore.

Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. CADEC: A corpus of adverse drug event annotations. Journal of Biomedical Informatics, 55:73–81.

J-D Kim, Tomoko Ohta, Yuka Tateisi, and Junichi Tsujii. 2003. GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl 1):i180–i182.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In International Conference on Computational Linguistics and Annual Meeting of the Association for Computational Linguistics, pages 817–824, Sydney, Australia.

Onur Kuru, Ozan Arkan Can, and Deniz Yuret. 2016. CharNER: Character-level named entity recognition. In International Conference on Computational Linguistics, pages 911–921, Osaka, Japan.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California.

Peng-Hsuan Li, Ruo-Ping Dong, Yu-Siang Wang, Ju-Chieh Chou, and Wei-Yun Ma. 2017. Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks. In Conference on Empirical Methods in Natural Language Processing, pages 2654–2659, Copenhagen, Denmark.


Nut Limsopatham and Nigel Collier. 2016. Normalising medical concepts in social media texts by learning semantic representation. In Annual Meeting of the Association for Computational Linguistics, pages 1014–1023, Berlin, Germany.

Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Conference on Empirical Methods in Natural Language Processing, pages 857–867, Lisbon, Portugal.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Conference on Empirical Methods in Natural Language Processing, pages 879–888, Lisbon, Portugal.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Annual Meeting of the Association for Computational Linguistics, pages 1064–1074, Berlin, Germany.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Flexible text segmentation with structured multilabel classification. In Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 987–994, Vancouver, Canada.

Alejandro Metke-Jimenez and Sarvnaz Karimi. 2015. Concept extraction to identify adverse drug reactions in medical forums: A comparison of algorithms. CoRR, abs/1504.06936.

Diego Molla, Menno Zaanen, and Steve Cassidy. 2007. Named entity recognition in question answering of speech data. In Australasian Language Technology Workshop, pages 57–65, Melbourne, Australia.

Aldrian Obaja Muis and Wei Lu. 2016. Learning to recognize discontiguous entities. In Conference on Empirical Methods in Natural Language Processing, pages 75–84, Austin, Texas.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Annual Meeting of the Association for Computational Linguistics, pages 1946–1958, Vancouver, Canada.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In International Conference on Computational Linguistics and Annual Meeting of the Association for Computational Linguistics, pages 433–440, Sydney, Australia.

Sameer Pradhan, Noemie Elhadad, Wendy Chapman, Suresh Manandhar, and Guergana Savova. 2014. SemEval-2014 task 7: Analysis of clinical text. In International Workshop on Semantic Evaluation, pages 54–62, Dublin, Ireland.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Conference on Natural Language Learning, pages 147–155, Boulder, Colorado.

Nicky Ringland. 2016. Structured Named Entities. Ph.D. thesis, University of Sydney.

Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. Transactions of the Association for Computational Linguistics, 2:193–206.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, Honolulu, Hawaii.

Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P. Xing. 2017. Towards automated ICD coding using deep learning. CoRR, abs/1711.04075.

Gabriel Stanovsky, Daniel Gruhl, and Pablo N. Mendes. 2017. Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models. In Conference of the European Chapter of the Association for Computational Linguistics, pages 142–151, Valencia, Spain.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Conference on Neural Information Processing Systems, pages 3104–3112, Montreal, Canada.

Richard Tzong-Han Tsai, Shih-Hung Wu, Wen-Chi Chou, Yu-Chun Lin, Ding He, Jieh Hsiang, Ting-Yi Sung, and Wen-Lian Hsu. 2006. Various criteria in the evaluation of biomedical named entity recognition. BMC Bioinformatics, 7(1):92.

Chang Xu, Dacheng Tao, and Chao Xu. 2016. Robust extreme multi-label learning. In Conference on Knowledge Discovery and Data Mining, pages 1275–1284, San Francisco, California.

Mingbin Xu, Hui Jiang, and Sedtawut Watcharawittayakul. 2017. A local detection approach for named entity recognition and mention detection. In Annual Meeting of the Association for Computational Linguistics, pages 1237–1247, Vancouver, Canada.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In International Conference on Computational Linguistics, pages 2335–2344, Dublin, Ireland.


Ce Zhang. 2015. DeepDive: A Data Management System for Automatic Knowledge Base Construction. Ph.D. thesis, University of Wisconsin-Madison.


Proceedings of ACL 2018, Student Research Workshop, pages 45–51, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Automatic Detection of Cross-Disciplinary Knowledge Associations

Menasha Thilakaratne, Katrina Falkner, Thushari Atapattu
School of Computer Science
University of Adelaide
Adelaide, Australia

{firstname.lastname}@adelaide.edu.au

Abstract

Detecting interesting, cross-disciplinary knowledge associations hidden in scientific publications can greatly assist scientists in formulating and validating scientifically sensible novel research hypotheses. It can also introduce new areas of research that can be successfully linked with their research discipline. Currently, this process is mostly performed manually by exploring the scientific publications, requiring a substantial amount of time and effort. Due to the exponential growth of scientific literature, it has become almost impossible for a scientist to keep track of all research advances. As a result, scientists tend to deal with fragments of the literature according to their specialisation. Consequently, important hidden associations among these knowledge fragments, which could be linked to produce significant scientific discoveries, remain unnoticed. This doctoral work aims to develop a novel knowledge discovery approach that suggests the most promising research pathways by analysing the existing scientific literature.

1 Problem Statement

Formulation of scientifically sensible novel research hypotheses requires a comprehensive analysis of the existing literature. However, the voluminous nature of the literature (Cheadle et al., 2016) makes the hypothesis generation process extremely difficult and time-consuming, even in the narrow specialisation of a scientist. In this regard, Literature-Based Discovery (LBD) research is highly beneficial as it aims to detect non-trivial implicit associations, which have the potential to generate novel research hypotheses, by analysing a massive number of documents (Ganiz et al., 2005). Moreover, LBD outcomes encourage the progress of cross-disciplinary research by suggesting promising cross-domain research pathways, which are typically unnoticed during manual analysis (Sebastian et al., 2017b).

Independent of the domain, LBD is highly valuable for accelerating the knowledge acquisition and research development process. However, existing LBD approaches are mostly limited to the medical domain, where they attempt to find associations among genes, proteins, drugs, and diseases. The main reason for this may be the highly specific and descriptive nature of medical literature, which is suitable for LBD research (Ittipanuvat et al., 2014). Applying the LBD process in domains such as Computer Science (CS) is challenging due to the rapidly evolving nature of terms in the content of research publications. The medical LBD approaches are strongly coupled with medical domain knowledge, utilising resources such as the Unified Medical Language System (UMLS), MetaMap, and Medical Subject Headings (MeSH) descriptors (Sebastian et al., 2017a). This makes the applicability of these approaches to other domains challenging.

LBD research outside of the medical domain is still in an immature state. There are only a few LBD studies performed outside of the medical domain (e.g., water purification (Kostoff et al., 2008), technology and social issues (Ittipanuvat et al., 2014), humanities (Cory, 1997)). To date, the work by Gordon and Lindsay (2002) is the only available CS-related LBD research. Hence, in this doctoral research, we attempt to contribute to the LBD discipline outside of the medical domain by automating the cross-disciplinary knowledge discovery process. As a proof of concept, the proposed solution will be applied to different CS-related concepts.


2 LBD Discovery Models

Most of the LBD literature is based on the fundamental premise introduced by Swanson, namely the ABC model (Swanson, 1986). It employs a simple syllogism to identify potential knowledge associations: given two concepts A and C in two disjoint scientific literatures, if A is associated with concept B, and the same concept B is associated with C, the model deduces that A is associated with C. Swanson demonstrated how these combined knowledge pairs contribute to solutions by manually making several medical discoveries (e.g., Raynaud's disease ↔ fish oil (Swanson, 1986) and migraine ↔ magnesium (Swanson, 1988)). These medical discoveries are the basis of the LBD discipline.

The ABC model has two variants, named the open and closed discovery models (Figure 1) (Henry and McInnes, 2017). Open discovery starts with an initial user-defined concept (e.g., Learning Analytics (LA)), and the LBD process automatically analyses the literature related to the initial concept to detect potentially interesting and implicit associations. This model is generally used when there is a single problem (A-concept) with limited knowledge of what concepts can be involved (B- and C-concepts). It greatly assists the hypothesis generation process. On the contrary, the closed discovery process requires two user-defined concepts (e.g., LA and Deep Learning) as input and outputs potential hidden associations (B-concepts) between the two specified concepts A and C. This model is generally used for hypothesis testing and validation. However, the derived associations of the model can also be considered to generate more granular hypotheses. The granularity of the user-defined concepts of the two discovery models can vary depending on the researcher's interest (Ganiz et al., 2005).
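A rough sketch of the ABC syllogism underlying both variants (co-occurrence-based linking over toy data; term extraction and weighting in real LBD systems are far more involved than this):

```python
from collections import defaultdict

# Toy data: for the A-concept, the set of B-terms co-occurring with it, and
# for each B-term, the C-terms co-occurring with it in the B-literature.
a_literature = {"raynaud's disease": {"blood viscosity", "platelet aggregation"}}
b_to_c = {
    "blood viscosity": {"fish oil"},
    "platelet aggregation": {"fish oil", "aspirin"},
}

def open_discovery(a_concept, a_literature, b_to_c):
    """ABC open discovery: A -> B -> C, ranking C-concepts by number of B bridges."""
    c_bridges = defaultdict(set)
    for b_concept in a_literature[a_concept]:
        for c_concept in b_to_c.get(b_concept, set()):
            c_bridges[c_concept].add(b_concept)
    return sorted(c_bridges.items(), key=lambda kv: len(kv[1]), reverse=True)

print(open_discovery("raynaud's disease", a_literature, b_to_c))
# e.g. 'fish oil' ranks first because two B-concepts bridge to it.
```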

3 Related Work

Even though the early work in LBD was performed manually, over time different computational techniques have been adopted to automate the LBD process. Most of the existing LBD research is semi-automated and requires a human expert to make decisions during the LBD process (Sebastian et al., 2017a).

Many of the early computational approaches utilise lexical statistics (Swanson and Smalheiser, 1997), such as term frequency–inverse document frequency and token frequencies, which can be considered the most primitive LBD approach.

Figure 1: Open and closed discovery models. (Both panels show the start concept/A-concept, intermediate concepts/B-concepts and target concepts/C-concepts; open discovery analyses the A- and B-literatures, while closed discovery analyses the A- and C-literatures.)

Later, Distributional Semantics approaches (Gordon and Dumais, 1998) such as Latent Semantic Indexing and Reflective Random Indexing were introduced. Subsequently, knowledge-based approaches (Weeber et al., 2001), which rely heavily on external structured knowledge resources to acquire domain-specific knowledge, were adopted in the LBD process.

Relations-based approaches (Hristovski et al., 2006) make use of user-defined explicit predicates to convey the meaning of the associations. However, these approaches are restricted to problems where semantic types and predicates are known in advance. Another category is graph-based approaches (Cameron et al., 2015), which generate a number of bridging terms to define the associations. Bibliometrics-based approaches (Kostoff, 2014) utilise bibliographic link structures in the LBD process to identify potential knowledge associations. Several attempts have also been made in the LBD literature to employ link prediction techniques (Sebastian et al., 2015), i.e., attributes of the concepts and observed links are used to predict the existence of new links between the concepts.

As discussed earlier, the majority of LBD research is in the medical domain and depends on medical domain knowledge. As a result, it is not feasible to apply these approaches to other domains. To date, there are only a handful of LBD research studies performed outside of the medical domain. This points out the importance of contributing to non-medical LBD research, which is still at an early stage.

4 Goals and Research Questions

Development of an automatic LBD system can significantly improve the typical research process followed by scientists.


Figure 2: High-level overview of the proposed open discovery model. (Steps: obtain the A-literature from Web of Science for the A-concept; analyse author keyword frequencies to obtain the B-concepts; obtain the B-literature; preprocess titles and abstracts to construct a word2vec vector space; obtain the main keywords by analysing the titles and abstracts; measure keyword informativeness by analysing the word2vec nearest neighbours; derive a weight for each keyword from its informativeness and frequency; order by weight and extend with meaningful nearest neighbours in the vector space; evaluate the derived target C-concept list using intersection evaluation.)

With such a system, scientists can generate scientifically sensible research hypotheses in a shorter time by considering the suggestions provided by the system. Moreover, the cross-domain knowledge discovery process of LBD facilitates the development of cross-disciplinary research. Thus, in this study we aim to develop a novel knowledge discovery approach by utilising both variants of the LBD process, namely open and closed discovery.

Our main intention is to uplift the LBD process in non-medical domains. The ultimate goals of this doctoral research are: 1) Fully automate the LBD process: most of the existing LBD approaches are semi-automatic and require human decisions to direct the knowledge discovery process at various stages. Thus, automating the entire LBD process will be highly beneficial for the users of the LBD model. 2) Provide a generic LBD solution that is independent of domain-specific knowledge: most of the existing LBD approaches rely on domain-specific knowledge to identify the knowledge associations. As a result, the applicability of these approaches to other domains is limited. Therefore, it is important to build an LBD model that is suitable for any domain, without incorporating any domain-specific knowledge.

While focusing on technical literature, in particular the CS domain, the research questions of this study are: 1) How can NLP and Machine Learning techniques be leveraged to enhance the understanding of content in the literature, in order to accurately detect research areas at different levels of granularity? 2) How can bibliometrics analysis be integrated to enhance the identification of implicit knowledge associations? 3) What scoring schemes can be used to rank the identified associations? 4) How can the existing evaluation approaches be improved to accurately validate the LBD outcomes? This doctoral work plans to answer these research questions to fulfill the aforementioned goals.

5 Current Work and Future Directions

We have conducted a comprehensive literature analysis and have defined our research questions based on the gaps identified in the literature. Currently, we are carrying out preliminary studies to identify potential techniques that would be useful to enhance the predictability of the open discovery LBD model. Figure 2 depicts a high-level overview of our methodology, which addresses research questions 1 and 3. The proposed LBD approach is based on Swanson's ABC discovery model. In this approach, we are attempting to identify the importance of neural word embeddings (Mikolov et al., 2013b) for accurately capturing the context of the main keywords of the abstracts. As for the literature database, we are using the Web of Science Core Collection (WoS)¹ to obtain the metadata of the literature, such as title, abstract, and author keywords.

As shown in Figure 2, initially the frequencies of the author keywords were analysed to obtain the B-literature. The retrieved titles and abstracts in the B-literature were cleaned using several carefully picked preprocessing techniques. In summary, potential abbreviations in the text were identified using multiple regex patterns.

¹ https://clarivate.com/products/web-of-science/


Afterwards, variable-length n-grams in the text were identified using a formula based on the collocation patterns described in Mikolov et al. (2013a). We also removed numbers, punctuation marks, and terms consisting of single letters before analysing the texts.
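One way to approximate this collocation-based n-gram step with an off-the-shelf implementation is gensim's Phrases, which scores adjacent word pairs with a formula of this kind; whether the authors used this library is not stated, so the sketch below is only illustrative:

```python
from gensim.models.phrases import Phrases, Phraser

corpus = [
    "learning analytics dashboards support teachers",
    "deep learning methods for learning analytics",
    "latent semantic indexing for literature based discovery",
]
sentences = [doc.lower().split() for doc in corpus]

# Score adjacent word pairs and merge frequent collocations into single tokens.
bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
trigram = Phraser(Phrases([bigram[s] for s in sentences], min_count=1, threshold=0.1))

ngrammed = [trigram[bigram[s]] for s in sentences]
print(ngrammed[0])   # e.g. ['learning_analytics', 'dashboards', ...] on a real corpus
```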

After the preprocessing phase, we identified the sentence in the abstract that describes the intention/purpose of the study by using multiple intention-based word patterns. We further processed the identified purpose sentence, along with the title, by removing stop words. The intention of using the purpose sentence and title is that they typically include the most important concepts that best describe the study.
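A small sketch of how such intention-based patterns might look (the cue phrases below are illustrative guesses, not the authors' actual pattern list):

```python
import re

# Illustrative intention/purpose cue patterns (assumed, not from the paper).
PURPOSE_PATTERNS = [
    re.compile(r"\bthis (paper|study|work) (aims|proposes|presents|investigates)\b", re.I),
    re.compile(r"\bwe (propose|present|introduce|investigate|aim to)\b", re.I),
    re.compile(r"\bthe (aim|goal|purpose|objective) of this (paper|study|work)\b", re.I),
]

def purpose_sentence(abstract):
    """Return the first sentence that matches an intention-based cue pattern."""
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        if any(p.search(sentence) for p in PURPOSE_PATTERNS):
            return sentence
    return None
```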

For each post-processed n-gram (w_i) of the purpose sentence and title, we calculated a semantic importance (informativeness) score based on the word2vec (Mikolov et al., 2013b) word embedding method. In other words, we measured the informativeness of w_i based on the validity and semantic richness of its N neighbouring terms derived using cosine similarity. To measure the validity and semantic richness of the N neighbouring terms, we imposed the following three criteria for the three categories of neighbouring terms (unigrams, abbreviations and n-grams, respectively): 1) valid technical unigram; 2) valid detected abbreviation; 3) valid n-gram, eliminating partial n-grams with different Part of Speech (POS) tag patterns. If a neighbouring term (n_j) fulfils the relevant criteria for its category, it is considered a valid, quality neighbouring term (n_j ∈ Ñ, where Ñ ⊆ N is the set of valid neighbours). We excluded w_i if its informativeness was less than or equal to 50%, i.e., the excluded terms have a majority of neighbours that do not fulfil the valid, quality neighbouring criteria.

\mathrm{informativeness}(w_i) = \frac{1}{N} \sum_{j=1}^{N} \big[\, n_j \in \tilde{N} \,\big]

where N is the number of nearest neighbours considered and Ñ is the set of valid, quality neighbouring terms.

The frequency of w_i denotes the importance of the term within the B-literature. Therefore, we multiplied informativeness(w_i) by the number of times w_i appeared in the titles and purpose sentences to obtain the final weighted score. This score was used to rank the derived w_i terms, which represent the important concepts in the B-literature (i.e., seed concepts). Each seed concept was then extended by linking valid, quality neighbouring terms (using the same criteria used to measure the validity and semantic richness of the neighbouring terms) in the word2vec vector space.
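A condensed sketch of the informativeness scoring and weighting described above (assuming a trained gensim Word2Vec model and a user-supplied validity check; function and parameter names are ours):

```python
def informativeness(model, term, is_valid_neighbour, topn=20):
    """Fraction of a term's nearest neighbours that pass the validity criteria.

    `model` is assumed to be a trained gensim Word2Vec model; `is_valid_neighbour`
    implements the unigram/abbreviation/n-gram criteria from the text.
    """
    if term not in model.wv:
        return 0.0
    neighbours = model.wv.most_similar(term, topn=topn)
    valid = sum(1 for neighbour, _sim in neighbours if is_valid_neighbour(neighbour))
    return valid / topn

def weighted_score(model, term, frequency, is_valid_neighbour, topn=20):
    """Informativeness multiplied by the term's frequency in titles/purpose sentences."""
    info = informativeness(model, term, is_valid_neighbour, topn)
    return info * frequency if info > 0.5 else 0.0   # terms at or below 50% are excluded
```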

We performed the same experimental steps with the fastText (Bojanowski et al., 2016) word embedding method. An important observation is that with word2vec we obtained broad topic coverage in the same field, showing which areas are connected with the seed concept, whereas with fastText we obtained topics in a narrower range.

We used Learning Analytics (LA) as the A-concept to evaluate our approach. The reason for choosing LA is that it is a relatively novel but rapidly growing area connected to many disciplines, such as education, psychology and machine learning. We utilised the intersection evaluation of Gordon and Lindsay's work (2002) to evaluate the C-concepts obtained for LA. As for the evaluation literature database, we used WoS and Scopus² to obtain the intersection frequencies. We categorised the derived C-concepts as existing, emerging and novel based on the intersection frequencies. In total, we obtained 564 knowledge associations³.

Unlike Gordon and Lindsay’s work (2002), theknowledge discovery process of our approach isfully automated and does not require any humanintervention to make decisions during the process.Therefore, verifying the validity of the obtainedconcepts is important to ensure that the associ-ated intersections are meaningful. We performedan expert-based concept validation by utilising le-gitimate concept criteria discussed in Hurtado etal. (2016). From the evaluation performed bytwo LA researchers with CS background, we ob-tained an average accuracy of 97.87%, 98.21%,95.68% for existing, emerging, and novel associ-ations respectively. The experts’ agreement forthe valid terms is 93.6%. Through our experi-ment, we could successfully identify many inter-esting C-concepts that have the potential to gener-ate scientifically sensible novel research hypoth-esis. For example, our existing C-concepts in-cluded well-established research areas in LA suchas machine learning, data mining, and e-learningwhereas the emerging C-concepts included infre-quently used potential research areas in LA suchas computer vision, linked open data, and cogni-tive science. We could also obtain many inter-esting novel C-concepts such as word embeddingtechniques, deep learning architectures such asLSTM, BLSTM, CNN, that can be utilised in future

² https://www.scopus.com/
³ https://bit.ly/2rApl7C


Table 1: Methodology comparison.

             Gordon and Lindsay (2002)     Our Approach
Process      Requires human intervention   Fully automatic
Concepts     Bi-grams only                 Uni-grams, abbreviations and variable-length n-grams
Techniques   Lexical statistics            Lexical statistics & distributional semantics
Output       List of bi-grams              Detailed contextualised semantic groupings

Therefore, the suggestions provided through our approach can greatly help to uplift the process of research in LA. In comparison with the only existing CS-related LBD approach (Gordon and Lindsay, 2002), our approach utilises an improved methodology (Table 1).

To the best of our knowledge, this is the first non-medical LBD study that utilises neural word embeddings to detect the target C-concepts. Our initial results demonstrate the importance of exploiting neural word embeddings to effectively identify potential cross-disciplinary knowledge associations buried in the literature. We would like to further enhance our existing approach by considering the below-mentioned future directions, which are categorised based on our four research questions (RQ) described in Section 4.

RQ 1 (Content Analysis): A subtle analysis of the literature is needed to accurately capture the hidden knowledge associations. In our current experiment, we consider concepts at the keyword level by identifying seed concepts. As an improvement, we would like to have an organised topic structure with different levels of granularity. In order to achieve that, we intend to utilise semantic web technologies and pre-existing topical categories (e.g., Dewey Decimal Classification) to enhance the understanding of the content. Moreover, the identified topic structure will also be useful for providing a clearly structured, logical output to the user rather than merely listing the identified associations.

Figure 3: Entity and relation types (Shakibian and Charkari, 2017). (Entities: Author, Paper, Term, Venue; relations: writes/is written, publishes/is published, includes/is included, cites.)

Due to the lack of LBD research that analyses the effects of topic modeling (Sebastian et al., 2017b), it is also important to study how topical information propagates among research publications to detect interesting, implicit knowledge associations. Another interesting future direction would be to utilise deep language understanding techniques to automatically infer ontologies from the scientific literature, which can be utilised to identify more granular knowledge associations.

RQ 2 (Bibliometrics Analysis): In our current experiment, we utilise the popular ABC model to discover the knowledge associations. However, the inference steps introduced by the ABC model are simple and not foolproof. Therefore, in our future research studies, we intend to analyse more complex inference steps to identify complex knowledge associations that cannot be identified through the ABC model. To achieve that, we aim to integrate a graph-based approach by analysing the relationships among the four entity types (i.e., author, term, paper, venue) illustrated in Figure 3. In other words, we intend to utilise different bibliographic link structures, such as co-author relationships, direct citation links, co-word analysis, bibliographic coupling, and co-citation links, to uncover complex knowledge associations. For example, when authors from disjoint research fields collaborate on a piece of research, it implies a potential association between the two knowledge areas. This simple co-author relationship can be further expanded to more complex associations by analysing shared authors in the citations of the source and target literature, analysing authors in the source literature that are cited by the target literature, etc. As for the author entity, this procedure can be followed for the remaining entities (i.e., paper, term, venue) of the network schema in Figure 3 to derive more complex and implicit associations. With regard to the term entity, the identified associations can be further expanded by leveraging topic modeling and topical categories.


We intend to automatically generate all the aforementioned associations (up to four-degree meta-path associations) for the four entity types by traversing the network schema in Figure 3. From the derived associations, we would like to identify the most effective association links by comparing the output results. The identified effective association links will provide a better understanding of an implicit association than the simple ABC model.

RQ 3 (Associations Ranking): The generated target concept list should be ordered so that the most significant knowledge associations are listed at the top. Therefore, it is important to identify the factors that can be utilised to rank the derived associations. In our current approach, we incorporate semantic importance and frequency to rank the associations. In frequently evolving domains like CS, it is also important to consider temporal factors to accurately identify new research advancements. Therefore, we would like to propose different temporal weighting mechanisms to rank the target C-concepts. To accomplish this, we need to analyse the concepts in chronologically ordered time slices. For example, when deriving the temporal weight of a concept, factors such as the significance of the concept in its corresponding time interval and changes in the concept's trend over time, using a sliding-window mechanism, need to be considered (one possible scheme is sketched below). Moreover, we can also utilise pre-existing algorithms that analyse the evolution of topics (e.g., the Dynamic Topic Model (Blei and Lafferty, 2006) and the Topics over Time model (Wang and McCallum, 2006)) in this regard.
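One possible sliding-window temporal weighting, shown only to illustrate the idea above (the slice length, window size and trend formula are all our own assumptions):

```python
def temporal_weight(yearly_counts, window=3):
    """Weight a concept by its recent mention rate scaled by its growth trend.

    yearly_counts: mapping year -> number of papers mentioning the concept,
    e.g. {2014: 2, 2015: 5, 2016: 9, 2017: 14}.
    """
    years = sorted(yearly_counts)
    recent = years[-window:]
    older = years[:-window] or recent
    recent_rate = sum(yearly_counts[y] for y in recent) / len(recent)
    older_rate = sum(yearly_counts[y] for y in older) / len(older)
    trend = recent_rate / older_rate if older_rate else 1.0   # growth factor
    return recent_rate * trend

print(temporal_weight({2014: 2, 2015: 5, 2016: 9, 2017: 14}))
```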

RQ 4 (Evaluation): Evaluating the validity of the identified knowledge associations outside of the medical domain is very challenging due to the unavailability of gold standard datasets. The medical LBD literature has mostly attempted to replicate Swanson's manually detected medical discoveries (e.g., Raynaud's disease ↔ fish oil) to evaluate results (Sebastian et al., 2017a). However, when dealing with other domains, the possible evaluation approaches that can be utilised are intersection evaluation (Gordon and Lindsay, 2002), time-sliced evaluation (Yetisgen-Yildiz and Pratt, 2009) and expert-based evaluation (Gordon and Lindsay, 2002), which have a number of inherent limitations in accurately validating the results. Therefore, we would like to improve the existing LBD evaluation approaches to accurately evaluate our results. In our current evaluation, we use intersection evaluation along with expert-based concept validation. In our future experiments, we would like to quantitatively evaluate the knowledge associations by utilising information retrieval metrics such as precision and recall (to evaluate the complete set of target C-concepts) and 11-point average interpolated precision curves, precision at k, and mean average precision (to evaluate the rankings of the target C-concepts). To quantitatively evaluate the overall quality of the results, a ground truth is required. For this purpose, we intend to create a time-sliced dataset as described in the work of Yetisgen-Yildiz and Pratt (2009); see the sketch below. In other words, the literature is divided into two sets, namely a pre-cut-off set (literature before a cut-off date) and a post-cut-off set (literature after the cut-off date). Afterwards, the LBD methodology is applied to the pre-cut-off set to obtain the implicit knowledge associations. Later, the existence of these identified associations (which do not exist explicitly in the pre-cut-off set) is checked in the post-cut-off set. A major limitation of this approach is that a knowledge association considered a false positive can become a true positive once new research is published. This limitation can be overcome to some extent by incorporating human experts to further evaluate the validity of the false positives. Another interesting avenue for evaluation would be user performance evaluation, incorporating users with a diversified range of expertise, such as users with limited prior knowledge and experts in the field (Qi and Ohsawa, 2016). Through this approach, we can evaluate the extent to which the proposed LBD approach assists different levels of users in generating hypotheses by utilising the suggested knowledge associations.
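A compact sketch of the time-sliced evaluation just described (set-based; a real evaluation would normalise concept strings and may give partial credit, which we omit here):

```python
def time_sliced_eval(papers, cutoff_year, run_lbd, extract_pairs):
    """Split the literature at a cut-off date, run LBD on the earlier slice,
    and check the predicted associations against the later slice."""
    pre = [p for p in papers if p["year"] <= cutoff_year]
    post = [p for p in papers if p["year"] > cutoff_year]

    predicted = set(run_lbd(pre))              # implicit associations from the pre-cut-off set
    explicit_pre = set(extract_pairs(pre))     # associations already explicit before the cut-off
    gold = set(extract_pairs(post)) - explicit_pre

    predicted -= explicit_pre                  # only genuinely new predictions count
    true_positives = predicted & gold
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall
```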

Thus, this doctoral work can be expanded in numerous ways, since LBD outside of the medical domain is still at an early stage. Our next phase is to address the four focus points discussed above. Moreover, in our future work, we also aim to test our proposed LBD methodology on the better studied medical domain as well as on other domains such as humanities and social sciences.

References

David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning - ICML '06, pages 113–120.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Delroy Cameron, Ramakanth Kavuluru, Thomas C. Rindflesch, Amit P. Sheth, Krishnaprasad Thirunarayan, and Olivier Bodenreider. 2015. Context-driven automatic subgraph creation for literature-based discovery. Journal of Biomedical Informatics, 54:141–157.

Chris Cheadle, Hongbao Cao, Andrey Kalinin, and Jaqui Hodgkinson. 2016. Advanced literature analysis in a Big Data world. Annals of the New York Academy of Sciences, 1387(1):25–33.

Kenneth A. Cory. 1997. Discovering hidden analogies in an online humanities database. Computers and the Humanities, 31:1–12.

Mc Ganiz, Wm Pottenger, and Cd Janneck. 2005. Recent advances in literature based discovery. Technical report.

M. Gordon and R. K. Lindsay. 2002. Literature-based discovery on the world wide web. ACM Transactions on Internet Technology, 2:261–275.

Michael D. Gordon and Susan Dumais. 1998. Using latent semantic indexing for literature based discovery. J. Am. Soc. Inf. Sci. Technol., 49(8):674–685.

Sam Henry and Bridget T. McInnes. 2017. Literature Based Discovery: Models, methods, and trends.

Dimitar Hristovski, Carol Friedman, Thomas C. Rindflesch, and Borut Peterlin. 2006. Exploiting semantic relations for literature-based discovery. AMIA Annual Symposium Proceedings, pages 349–353.

Jose L. Hurtado, Ankur Agarwal, and Xingquan Zhu. 2016. Topic discovery and future trend forecasting for texts. Journal of Big Data, 3(1).

V. Ittipanuvat, K. Fujita, I. Sakata, and Y. Kajikawa. 2014. Finding linkage between technology and social issue: a literature based discovery approach. Journal of Engineering and Technology Management, pages 160–184.

Ronald N. Kostoff. 2014. Literature-related discovery: Common factors for Parkinson's Disease and Crohn's Disease. Scientometrics, 100(3):623–657.

Ronald N. Kostoff, Jeffrey L. Solka, Robert L. Rushenberg, and Jeffrey A. Wyatt. 2008. Literature-related discovery (LRD): Water purification. Technological Forecasting and Social Change, 75(2):256–275.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Distributed representations of words and phrases and their compositionality. In NIPS, pages 1–9.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, pages 1–12.

Ji Qi and Yukio Ohsawa. 2016. Matrix-like visualization based on topic modeling for discovering connections between disjoint disciplines. Intelligent Decision Technologies, 10(3):273–283.

Yakub Sebastian, Eu Gene Siew, and Sylvester O. Orimaye. 2017a. Emerging approaches in literature-based discovery: techniques and performance review.

Yakub Sebastian, Eu Gene Siew, and Sylvester Olubolu Orimaye. 2015. Predicting future links between disjoint research areas using heterogeneous bibliographic information network. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9078, pages 610–621.

Yakub Sebastian, Eu Gene Siew, and Sylvester Olubolu Orimaye. 2017b. Learning the heterogeneous bibliographic information network for literature-based discovery. Knowledge-Based Systems, 115:66–79.

Hadi Shakibian and Nasrollah Moghadam Charkari. 2017. Mutual information model for link prediction in heterogeneous complex networks. Scientific Reports, 7.

D. R. Swanson. 1988. Migraine and magnesium: eleven neglected connections. Perspectives in Biology and Medicine, 31:526–557.

Don R. Swanson. 1986. Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge. Perspectives in Biology and Medicine, 30(1):1–18.

Don R. Swanson and Neil R. Smalheiser. 1997. An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91(2):183–203.

Xuerui Wang and Andrew McCallum. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433.

Marc Weeber, Henny Klein, Lolkje T.W. De Jong-Van Den Berg, and Rein Vos. 2001. Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. Journal of the American Society for Information Science and Technology, 52(7):548–557.

Meliha Yetisgen-Yildiz and Wanda Pratt. 2009. A new evaluation methodology for literature-based discovery systems. Journal of Biomedical Informatics, 42(4):633–643.


Proceedings of ACL 2018, Student Research Workshop, pages 52–58, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets

Kushagra Singh, Indira Sen, Ponnurangam Kumaraguru
IIIT Delhi
{kushagra14056,indira15021,pk}@iiitd.ac.in

Abstract

While growing code-mixed content on Online Social Networks (OSNs) provides a fertile ground for studying various aspects of code-mixing, the lack of automated text analysis tools renders such studies challenging. To meet this challenge, a family of tools for analyzing code-mixed data, such as language identifiers, parts-of-speech (POS) taggers and chunkers, have been developed. Named Entity Recognition (NER) is an important text analysis task which is not only informative by itself, but is also needed for downstream NLP tasks such as semantic role labeling. In this work, we present an exploration of automatic NER of code-mixed data. We compare our method with existing off-the-shelf NER tools for social media content, and find that our system outperforms the best baseline by 33.18 % (F1 score).

1 Introduction

Code-switching or code-mixing occurs when “lexical items and grammatical features from two languages appear in one sentence” (Muysken, 2000).1 It is frequently seen in multilingual communities and is of interest to linguists due to its complex relationship with societal factors (Sridhar and Sridhar, 1980).

With the rise of Web 2.0, the volume of text on online social networks (OSN) such as Twitter, Facebook and Reddit has grown. It is estimated that around 240 million Indian users alone are active on Twitter.2

1 Many researchers use code-mixing and code-switching interchangeably, which we follow in this work.

2 https://www.statista.com/statistics/381832/twitter-users-india/

A significant fraction of these users are bilingual, or even trilingual, and their tweets can be monolingual in English or their vernacular, or code-mixed. Past research has looked at multiple dimensions of this behaviour, such as its relationship to emotion expression (Rudra et al., 2016) and identity. Code-mixing or multilingualism of tweets poses a significant problem to both the OSNs' underlying text mining algorithms and researchers trying to study online discourse, since most existing tools for analyzing OSN text content cater to monolingual data. For example, Twitter's abuse detection systems fail to flag code-mixed tweets as offensive.3 Recent efforts to build tools for code-mixed content include language identifiers (Solorio and Liu, 2008), parts-of-speech (POS) taggers (Vyas et al., 2014), and chunking (Sharma et al., 2016). A natural extension of this set of automated natural language processing (NLP) tools is a Named Entity Recognizer (NER) for code-mixed social media data, which we present in this paper. Additionally, as language tags are an essential feature for NLP tasks, including NER, we also present a neural network based language identifier.

Our main contributions are:

1. Building a token-level language identification system for Hindi-English (Hi-En) code-mixed tweets, described in detail in Section 3.

2. Building an NER for En-Hi code-mixed tweets, which we explain in Section 4. We also show, in Section 5, that our NER performs better than existing baselines.

2 Related Work

In this section we briefly describe the approaches for automatic language identification and extraction of named entities.

3 http://timesofindia.indiatimes.com/india/to-avoid-social-media-police-indian-trolls-go-vernacular/articleshow/60139671.cms


Language identification for code-mixed content has been previously explored in Barman et al. (2014). Particularly close to our work is the use of deep-learning approaches for detecting token-level language tags for code-mixed content (Jaech et al., 2016).

We particularly focus on efforts to build NERs for social media content, and NERs for Indian languages and code-mixed corpora. Social media text, including and especially tweets, has subtle variations from written and spoken text. These include slacker grammatical structure, spelling variations, ad-hoc abbreviations and more. See Ritter et al. (2011) for detailed differences between tweets and traditional textual sources. Monolingual NERs for tweets include (Ritter et al., 2011; Li et al., 2012). We build on these approaches to account for code-mixing and use the former as a baseline to test our method against.

3 Language Identification using Transliteration

We build a token-level language identification model (LIDF) for code-mixed tweets using monolingual corpora available for both languages, supplemented by a small set of annotated code-mixed tweets. We hypothesize that words or character sequences of different languages encode different structures. Therefore, we aim to capture this character structure for both languages. Subsequently, given a token, we try to see which language fits better. Our LIDF algorithm comprises multiple steps, described in Figure 1. Each of the steps is explained in detail below.

3.1 Roman-Devanagari transliteration

We restrict ourselves to instances of En-Hi code-mixing where the Hindi component is written in Roman script.4 Therefore, any model trained on Hindi corpora (which will be in Devanagari) is not directly usable. To make use of such a corpus, we first transliterate words written in Roman to Devanagari script.

We train a model T that takes a token written in Roman characters and generates its Devanagari equivalent (given the token school, T generates स्कूल).

4 This is followed in previous studies since the quantity of code-mixed content with non-Roman Hindi is negligible.

Figure 1: Different steps of our language identification pipeline. We extract features using language models trained on monolingual corpora, and train a classifier based on these features.

We follow an approach used by Rosca and Breuel (2016) for English-Arabic transliteration and train an Attention Sequence to Sequence (Seq2Seq) model (Bahdanau et al., 2014). Given an input sequence of Roman characters, T generates a sequence of Devanagari characters.

We train T using Roman-Devanagari transliteration pairs mined by Gupta et al. (2012) and Khapra et al. (2014). After combining the two sets and removing duplicates, we were left with 41,383 unique pairs.

We explore multiple Seq2Seq models, experimenting with different RNN cells, encoder and decoder depths, output activation functions and the number of RNN cells in each layer. We evaluate their performance using the character error rate metric, comparing with LITCM (Bhat et al., 2015) as a baseline. We report the performance of our top five models in Table 1, all of which perform better than the baseline.
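For clarity, character error rate (CER) can be computed as the character-level edit distance between the predicted and reference Devanagari strings, normalised by the reference length; the following is a hedged sketch rather than the authors' exact evaluation script:

```python
def character_error_rate(predicted, reference):
    """Levenshtein distance between two strings divided by the reference length."""
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (p != r)))   # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)
```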

Our best model comprises a 2-layer bidirectional RNN encoder and a 2-layer RNN decoder, each layer comprising 128 GRU units with ReLU activation. We use the Adagrad optimizer to train T, adding a dropout of 0.5 after each layer, and use the early stopping technique to prevent overfitting. Larger and deeper networks perform relatively worse since they start overfitting quickly.

We observe that Hindi words which have multiple spellings when written in Roman script are all transliterated to the same Hindi token.


Model     Depth   # Units   Cell   CER
LITCM     –       –         –      18.88
Seq2Seq   3       256       LSTM   17.23
Seq2Seq   2       64        GRU    17.09
Seq2Seq   3       128       GRU    16.99
Seq2Seq   2       128       LSTM   16.67
Seq2Seq   2       128       GRU    16.54

Table 1: Performance of different transliteration models. Reported CER is in percentage and was calculated after 5-fold cross validation. Depth was kept the same for both encoding and decoding layers.

Therefore, T could also be used for normalization, and we hope to investigate this in more detail in future work.

3.2 Extracting features using monolingual corpora

For both languages, we learn the structure of the language at a character level. We do this by training a model which, for a token of length n, learns to predict the last character given the first n − 1 characters as input. More formally, for each token {c1, c2, ..., cn}, the model learns to predict cn given the input sequence {c1, c2, ..., cn−1}. We model this as a sequence classification problem using LSTM RNNs (Hochreiter and Schmidhuber, 1997). We use ELM to refer to the English language model, and HLM to refer to the Hindi language model.

The same architecture is used for both ELM and HLM, as shown in Figure 2. Both comprise two RNN layers with 128 LSTM cells each, using ReLU activation. The output of the second RNN layer at the last time step is connected to a fully connected (FC) layer with softmax activation. The size of this FC layer is equal to the character vocabulary V of the language. We take a softmax over the FC layer to predict cn. The normalized outputs from the FC layer can be thought of as a probability distribution over V, the ith normalized output being equal to P(Vi | c1, c2, ..., cn−1), where Vi is the ith character in the vocabulary. We refer to the normalized outputs from the FC layer of ELM and HLM as PELM and PHLM respectively, and use them as features for our language detection classifier.

For training ELM, we use the News Crawl dataset provided as part of the WMT 2014 translation task.5 As a preprocessing step, we remove all non-alphabetic characters (such as numbers and punctuation), and convert all uppercase letters to lowercase. After preprocessing, we were left with a total of 98,565,179 tokens, 189,267 of which were unique. For HLM, we use the IIT Bombay Hindi corpus (Kunchukuttan et al., 2017). We follow the same preprocessing steps, except for converting to lowercase (since there is no concept of case in Hindi). This yielded 59,494,325 tokens, of which 161,020 were unique.

The input sequence is encoded into a sequence of one-hot vectors before feeding it to the network. We use categorical cross-entropy as the loss function, optimizing the model using gradient descent (Adagrad). 20% of the unique tokens are held out and used to validate the performance of our model. Once again, we use the early stopping technique and add a dropout of 0.5 after each layer to avoid over-fitting.
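A minimal sketch of such a character language model, assuming tf.keras; VOCAB_SIZE and MAX_LEN are placeholder values, and the layer sizes follow the description above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

VOCAB_SIZE = 60   # size of the character vocabulary V (assumed)
MAX_LEN = 19      # padded length of the input prefix c_1 .. c_{n-1} (assumed)

char_lm = Sequential([
    # two recurrent layers with 128 LSTM cells each, ReLU activation
    LSTM(128, activation='relu', return_sequences=True,
         input_shape=(MAX_LEN, VOCAB_SIZE)),
    Dropout(0.5),
    LSTM(128, activation='relu'),
    Dropout(0.5),
    # fully connected layer of size |V| with softmax: P(V_i | c_1 .. c_{n-1})
    Dense(VOCAB_SIZE, activation='softmax'),
])
char_lm.compile(loss='categorical_crossentropy', optimizer='adagrad',
                metrics=['accuracy'])
```

The same architecture would be instantiated twice, once for ELM and once for HLM, with their respective character vocabularies.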

Figure 2: Architecture for ELM and HLM. The figure shows one cell in each LSTM layer unrolled over time.

3.3 Predicting language tag using PELM and PHLM

Given a word {c1, c2, ..., cn}, we first transliterate it to Devanagari using T, generating {c′1, c′2, ..., c′k}. Then, by passing {c1, c2, ..., cn−1} and {c′1, c′2, ..., c′k−1} through ELM and HLM respectively, we obtain PELM and PHLM.

5 http://statmt.org/wmt14/translation-task.html


Our hypothesis is that we can differentiate the PELM of a Hindi word from the PELM of an English word, since the character sequence structure of a Hindi word is different from that of English words (which are used to train ELM). Similarly, we can differentiate the PHLM of an English word from the PHLM of a Hindi word.

We use a set of tweets curated by Sakshi Gupta and Radhika (2016) which are annotated for language at a token level (each token is either English, Hindi or Rest) to train a three-class classifier using the following features: (i) PELM, (ii) PHLM, (iii) the ratio of non-alphabetic characters in W, (iv) the ratio of capitalization in W, and (v) a binary feature indicating whether W is title-case. The last three features help identify the Rest tokens. Our 3-class classifier is a fully connected neural network with 2 hidden layers using ReLU activation. On 5-fold cross validation, our model achieves an average F1 score of 0.934 and an average accuracy of 0.961 across the three classes. This is a slight improvement over the model proposed by Sharma et al. (2016), which had an accuracy of 0.938 on the same dataset (as reported by Sakshi Gupta and Radhika (2016)).
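A hedged sketch of such a classifier (tf.keras assumed): the two character-LM output distributions and the three surface features are concatenated into one input vector; the vocabulary sizes and hidden layer widths below are assumptions, not reported values.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

EN_VOCAB, HI_VOCAB = 27, 65                  # |V| for ELM and HLM (assumed)
INPUT_DIM = EN_VOCAB + HI_VOCAB + 3          # P_ELM, P_HLM, plus three surface features

lang_clf = Sequential([
    Dense(128, activation='relu', input_shape=(INPUT_DIM,)),  # hidden sizes assumed
    Dense(64, activation='relu'),
    Dense(3, activation='softmax'),          # En / Hi / Rest
])
lang_clf.compile(loss='categorical_crossentropy', optimizer='adam',
                 metrics=['accuracy'])
```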

4 Named Entity Recognition

Named entity recognition typically comprises two components: (i) entity segmentation and (ii) entity classification. Both these components can either be modeled separately, as done by Ritter et al. (2011), or they can be combined and tackled together, as in the model proposed by Finkel et al. (2005). We adopt the latter approach, modeling both components together as a sequence labeling task. Our hypothesis is that named entities can be identified using features extracted from the words surrounding them. We explore models using Conditional Random Fields and LSTM RNNs with the handcrafted features described below.

4.1 Features

Our hand-crafted features are described below; a small illustrative feature-extraction sketch follows the list.

• Token based features: The current token T, T after stripping all characters which are not in the Roman alphabet (Tclean), and the result of converting all characters in Tclean to lowercase (Tnorm) give three different features. We create Twordshape by replacing all uppercase letters in T with X, all lowercase letters with x, all numerals with o, and leaving all other characters as they are. For example, Adam123 becomes Xxxxooo. We also use the token length TL as a feature.

• Affixes: Prefixes and suffixes of length 1 to 5 extracted from T, padded with whitespace if needed. These help in identifying phrases that are not entities. For example, an English token ending in ing is highly unlikely to be a named entity.

• Character based features: Binary features indicating (i) whether T is title case, (ii) whether T has an uppercase letter, (iii) whether T has all uppercase characters, (iv) whether T has a non-alphanumeric character and (v) whether T is a hashtag. We also calculate the fraction of characters in T which are ASCII.

• Language based features: The language which T belongs to, as predicted by the model proposed in Section 3. For the LSTM model, we also use PELM and PHLM generated for T.

• Syntactic features: The POS tag for T as predicted by the model trained by Owoputi et al. (2013), and the POS tag and chunk tag for T as predicted by the shallow parser trained by Sharma et al. (2016).

• Tweet capitalization features: From the tweet that T belongs to, we extract (i) the fraction of characters that are uppercase, (ii) the fraction of characters that are lowercase, (iii) the fraction of tokens that are title case.6
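The token-level features above can be illustrated with the following hedged sketch (plain Python; the dictionary keys are hypothetical names, not the authors' feature identifiers):

```python
import re

def word_shape(token):
    # Adam123 -> Xxxxooo
    return ''.join('X' if c.isupper() else 'x' if c.islower()
                   else 'o' if c.isdigit() else c for c in token)

def token_features(token):
    clean = re.sub(r'[^A-Za-z]', '', token)                # T_clean
    feats = {
        'token': token,                                    # T
        'clean': clean,
        'norm': clean.lower(),                             # T_norm
        'shape': word_shape(token),                        # T_wordshape
        'length': len(token),                              # T_L
        'is_title': token.istitle(),
        'has_upper': any(c.isupper() for c in token),
        'all_upper': token.isupper(),
        'has_non_alnum': any(not c.isalnum() for c in token),
        'is_hashtag': token.startswith('#'),
        'ascii_frac': sum(ord(c) < 128 for c in token) / max(len(token), 1),
    }
    for k in range(1, 6):                                  # affixes of length 1..5
        feats[f'prefix_{k}'] = token[:k].ljust(k)          # padded with whitespace
        feats[f'suffix_{k}'] = token[-k:].rjust(k)
    return feats
```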

4.2 Proposed Models

Our LSTM model (Figure 3) comprises two bidirectional RNN layers using LSTM cells and ReLU activation. The input at time step t is Ft, i.e. the feature vector for the token at position t in the tweet. Ft is generated by concatenating the extracted features for the token at position t. T, Tclean, Tnorm, Twordshape and the affixes are passed through embedding layers which are randomly initialized and learnt during the training phase. All real-valued features are encoded into one-hot vectors of length 10, using percentile binning.
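The percentile binning of real-valued features can be sketched as follows (a hedged illustration using NumPy; the function name is hypothetical):

```python
import numpy as np

def percentile_bin_onehot(values, n_bins=10):
    """Encode each real-valued feature as a one-hot vector of length n_bins."""
    values = np.asarray(values, dtype=float)
    edges = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)                   # bin index in [0, n_bins - 1]
    onehot = np.zeros((len(values), n_bins))
    onehot[np.arange(len(values)), bins] = 1.0
    return onehot
```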

The output (ht) of the second RNN layer at each time step is passed to a separate fully connected layer FCt.

6 Using the output of the T cap classifier trained by Ritter et al. reduces accuracy.


Type   Lample   Ritter   LSTM    CRF     N
PER    38.73    38.45    65.47   72.23   1644
LOC    38.35    46.84    67.53   72.50   744
ORG    16.99    13.54    50.94   70.35   375
All    33.77    36.88    64.64   72.06   2763

Table 2: Performance (F1 scores) of different models on segmentation and classification. N is the total number of entities in the entire dataset. Reported numbers are in percentages. Results of CRF and LSTM are on 5-fold cross validation.

We take a softmax over the output of FCt to predict the label (Lt) for the token at position t. While training, we use the Adagrad optimizer and add a dropout of 0.5 after each layer.

Figure 3: LSTM NER architecture. The figure shows one cell in each LSTM layer unrolled over time.

Our CRF model is the standard generalized CRF proposed by Lafferty et al. (2001), which allows a flow of information across the sequence in both directions. We add L1 and L2 regularization to prevent over-fitting, and do an extensive grid search to come up with optimum values for these constants.
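A hedged sketch of a comparable CRF setup using sklearn-crfsuite (not necessarily the authors' implementation); X_train is a list of sentences, each a list of per-token feature dicts as sketched above, and y_train holds the corresponding IOB label sequences. The c1/c2 values stand in for the grid-searched constants:

```python
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,                        # L1 regularization strength (placeholder)
    c2=0.1,                        # L2 regularization strength (placeholder)
    max_iterations=200,
    all_possible_transitions=True,
)
# crf.fit(X_train, y_train)
# y_pred = crf.predict(X_test)
```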

5 Experiments and Results

5.1 Data collection and annotation

We randomly sampled 50,000 tweets from the code-mixed tweet dataset collected by Patro et al. (2017). On this set, we ran our language detection algorithm and filtered tweets which had at least five Hi tokens. The filtered data also had tweets containing Roman tokens belonging to languages other than English and Hindi (like transliterated Telugu); such tweets were removed during the annotation process.

The final dataset comprised 2,079 tweets (35,374 tokens). 13,860 (39.18 %) of the tokens were En, 11,391 (32.2 %) Hi and 10,123 (28.61 %) Rest. Each tweet was annotated at a token level for three classical named entity types, Person, Location and Organisation, using the IOB format. The annotation process was carried out by three linguists proficient in both English and Hindi. The final label was decided based on majority vote, and any instance (around 2%) where all three annotators disagreed was resolved through discussion. In all, the annotators identified 2,763 entity phrases (3,751 tokens), which included 1,644 Person entities, 744 Location entities and 375 Organisation entities.

5.2 Baselines and Results

We compare our proposed models (LSTM and CRF) with two baseline systems: (i) a state-of-the-art English NER model proposed by Lample et al. (2016) and (ii) a state-of-the-art NER model for tweets proposed by Ritter et al. (2011). Named entities do not have to belong to one particular language, though the same NE might have different forms in different languages; for example, Cologne in English is Köln in German. For the purpose of this study, we do not assign NEs any language tags and leave the detection and mapping of multiple NE forms as future work.

The results are summarized in Table 2 (segmentation) and Table 3 (segmentation and classification). All metrics are calculated on a phrase level; no partial credit is awarded. An incorrectly identified boundary is penalized as both a false positive and a false negative. For computing Table 2, we generalize entity tags (B-PER, B-LOC, B-ORG become B-ENT; I-PER, I-LOC, I-ORG become I-ENT). As expected, both these systems fare poorly on our data at both entity segmentation and entity classification. We believe this is due to the high number of out-of-vocabulary tokens (belonging to Hindi) in the data.
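The phrase-level scoring described here (exact span match, no partial credit, a wrong boundary counting as both a false positive and a false negative) can be sketched as follows; this is an illustrative reimplementation, not the authors' evaluation script:

```python
def iob_spans(labels):
    """Collect (start, end, type) spans from an IOB sequence; end is exclusive."""
    spans, start, etype = set(), None, None
    for i, lab in enumerate(list(labels) + ['O']):
        inside = lab.startswith('I-') and lab[2:] == etype
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if lab.startswith('B-'):
            start, etype = i, lab[2:]
    return spans

def phrase_prf(gold_labels, pred_labels):
    gold, pred = iob_spans(gold_labels), iob_spans(pred_labels)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```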

6 Discussion

In this paper, we present a Named Entity Recognition tool specifically targeting Hindi-English code-mixed content.


Model    Precision   Recall   F1 score
Lample   36.55       46.59    40.97
Ritter   64.64       34.24    44.77
LSTM     74.45       64.87    69.33
CRF      84.95       69.91    76.70

Table 3: Performance of different models at entity segmentation. All numbers are in percentages.

To build our NER model, we also present a unique semi-supervised language identifier which exploits the character-level differences between languages. We validate the performance of our NER against off-the-shelf NERs for Twitter and observe that our model outperforms them.

In future, we plan to explore building other downstream NLP tools, such as Semantic Role Labeling or entity-specific Sentiment Analyzers, which make use of NER for code-mixed data.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 13–23.

Irshad Ahmad Bhat, Vandan Mujadia, Aniruddha Tammewar, Riyaz Ahmad Bhat, and Manish Shrivastava. 2015. IIIT-H system submission for FIRE 2014 shared task on transliterated search. In Proceedings of the Forum for Information Retrieval Evaluation, FIRE '14, pages 48–53, New York, NY, USA. ACM. https://doi.org/10.1145/2824864.2824872.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics. https://doi.org/10.3115/1219840.1219885.

Kanika Gupta, Monojit Choudhury, and Kalika Bali. 2012. Mining Hindi-English transliteration pairs from online Hindi lyrics. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2012), pages 2459–2465.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Aaron Jaech, George Mulcaire, Mari Ostendorf, and Noah A. Smith. 2016. A neural model for language identification in code-switched tweets. In Proceedings of The Second Workshop on Computational Approaches to Code Switching, pages 60–64.

Mitesh M. Khapra, Ananthakrishnan Ramanathan, Anoop Kunchukuttan, Karthik Visweswariah, and Pushpak Bhattacharyya. 2014. When transliteration met crowdsourcing: An empirical study of transliteration via crowdsourcing using efficient, non-redundant and fair quality control. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2017. The IIT Bombay English-Hindi parallel corpus. CoRR abs/1710.02855. http://arxiv.org/abs/1710.02855.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=645530.655813.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. CoRR abs/1603.01360. http://arxiv.org/abs/1603.01360.

Chenliang Li, Jianshu Weng, Qi He, Yuxia Yao, Anwitaman Datta, Aixin Sun, and Bu-Sung Lee. 2012. TwiNER: named entity recognition in targeted Twitter stream. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 721–730. ACM.

Pieter Muysken. 2000. Bilingual speech: A typology of code-mixing, volume 11. Cambridge University Press.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.

Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, and Animesh Mukherjee. 2017. All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media. CoRR abs/1707.08446. http://arxiv.org/abs/1707.08446.


Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics.

Mihaela Rosca and Thomas Breuel. 2016. Sequence-to-sequence neural network models for transliteration. CoRR abs/1610.09565. http://arxiv.org/abs/1610.09565.

Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. 2016. Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter? In EMNLP, pages 1131–1141.

Piyush Bansal Sakshi Gupta and Mamidi Radhika. 2016. Resource creation for Hindi-English code mixed social media text.

Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush Bansal, Manish Shrivastava, Radhika Mamidi, and Dipti M. Sharma. 2016. Shallow parsing pipeline for Hindi-English code-mixed social media text. In Proceedings of NAACL-HLT, pages 1340–1345.

Thamar Solorio and Yang Liu. 2008. Learning to predict code-switching points. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 973–981. Association for Computational Linguistics.

Shikaripur N. Sridhar and Kamal K. Sridhar. 1980. The syntax and psycholinguistics of bilingual code mixing. Canadian Journal of Psychology/Revue canadienne de psychologie, 34(4):407.

Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In EMNLP, volume 14, pages 974–979.


Proceedings of ACL 2018, Student Research Workshop, pages 59–66, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

German and French Neural Supertagging Experiments for LTAG Parsing

Tatiana Bladier, Andreas van Cranenburgh, Younes Samih, Laura Kallmeyer
Heinrich Heine University of Düsseldorf
Universitätsstraße 1, 40225 Düsseldorf, Germany
{bladier,cranenburgh,samih,kallmeyer}@phil.hhu.de

Abstract

We present ongoing work on data-driven parsing of German and French with Lexicalized Tree Adjoining Grammars. We use a supertagging approach combined with deep learning. We show the challenges of extracting LTAG supertags from the French Treebank, introduce the use of left- and right-sister-adjunction, present a neural architecture for the supertagger, and report experiments on n-best supertagging for French and German.

1 Introduction

Lexicalized Tree Adjoining Grammar (LTAG; Joshi and Schabes, 1997) is a linguistically motivated grammar formalism. Productions in an LTAG support an extended domain of locality (EDL). This allows them to express linguistic generalizations that are not captured by typical statistical parsers based on context-free grammars or dependency parsing. Each derivation step is triggered by a lexical element, and a principled distinction is made between its arguments and modifiers, which is reflected in richer derivations. This has applications in the context of other tasks which can make use of linguistically rich analyses, such as frame semantic parsing or semantic role labeling (Sarkar, 2007). On the other hand, the increased expressiveness of LTAG makes efficient parsing and statistical estimation more challenging.

Previous work (Bangalore and Joshi, 1999; Sarkar, 2007) has shown that the task of parsing with LTAGs can be facilitated through the intermediate step of supertagging, the task of assigning possible supertags (i.e. elementary trees) to each word in a given sentence (Chen, 2010). Supertagging has been referred to as “almost parsing” (Bangalore and Joshi, 1999), since supertagging performs a large part of the task of syntactic disambiguation and increases parsing efficiency by lexicalizing syntactic decisions before moving on to the more expensive polynomial parsing algorithm (Sarkar, 2007).

Recently, several papers proposed neural architectures for supertagging with Combinatory Categorial Grammar (CCG; Lewis et al., 2016; Vaswani et al., 2016) and LTAG (Kasai et al., 2017). Supertagging with LTAG is more challenging than with CCG due to a higher number of supertags (counting on average 4000 distinct supertags for LTAGs). Also, almost half of the LTAG supertags occur only once. Nevertheless, the reported neural supertagging approach for LTAG (Kasai et al., 2017) reaches an accuracy of 88-90 % for English (compared to over 95 % for CCG). In this paper we apply a similar recurrent neural architecture, based on Samih (2017) and Kasai et al. (2017), to supertagging with LTAGs for German and French data and compare against previously reported results. For the German data, we compare our results to the LTAG supertaggers reported in Bäcker and Harbusch (2002) and Westburg (2016). To our knowledge, no results for French supertagging based on LTAG or CCG have been reported so far.

2 Neural Supertagging with LTAGs

2.1 Lexicalized Tree Adjoining Grammar

A Tree Adjoining Grammar (TAG; Joshi and Schabes, 1997) is a linguistically and psychologically motivated tree rewriting formalism (Sarkar, 2007). A TAG consists of a finite set of elementary trees, which can be combined to form larger trees via the operations of substitution (replacing a leaf node marked with ↓ with an initial tree) or adjunction (replacing an internal node with an auxiliary tree).


Figure 1: Supertagging with French LTAG for L'activité ne suffit pas (“The activity does not suffice”).

An auxiliary tree has a foot node (marked with ∗) with the same label as the root node. When adjoining an auxiliary tree to some node n, the daughter nodes of n become daughters of the foot node. A sample TAG derivation is shown in Figure 2, in which the elementary trees for Mary and pizza are substituted into the subject and object slots of the likes tree and the auxiliary tree for absolutely is adjoined at the VP node.

Figure 2: Elementary trees and a derived tree in LTAG.

In a lexicalized version of TAG (LTAG), every tree is associated with a lexical item and represents the span over which this item can specify its syntactic or semantic constraints (for example, subject-verb number agreement or semantic roles), also capturing long-distance dependencies between the sentence tokens (Kipper et al., 2000).

2.2 RNN-based TAG supertagging

A supertagger is a partial parsing model which is used to assign a sequence of LTAG elementary trees to the sequence of words in a sentence (Sarkar, 2007). Supertagging can thus be seen as preparation for further syntactic parsing which improves the efficiency of the TAG parser by reducing syntactic lexical ambiguity and sentence complexity. Figure 1 provides an example of supertagging with an LTAG for French.

Several techniques have been proposed for supertagging over the years, among which are HMM-based (Bäcker and Harbusch, 2002) and n-gram-based (Chen et al., 2002) approaches, and Lightweight Dependency Analysis models (Srinivas, 2000). Recent advances show the applicability of recurrent neural networks (RNNs) for supertagging (Lewis et al., 2016; Vaswani et al., 2016; Kasai et al., 2017).

RNN-based supertagging with LTAGs can be seen as a standard sequence labeling task, albeit with a large set of labels (i.e., several thousand classes as supertags). Our deep learning pipeline is shown in Figure 3. A similar architecture showed good results for POS tagging across many languages (Plank et al., 2016).

Figure 3: Supertagging architecture based on Samih (2017); dimensions shown in parentheses.

We use two kinds of embeddings: pre-trained word embeddings from the Sketch Engine collection of language models (Jakubícek et al., 2013; Bojanowski et al., 2016), and character embeddings based on the training set data. The pre-trained word embeddings encode distributional information from large corpora. The advantage of the character embeddings is that they can additionally encode subtoken information such as morphological features and help in dealing with unseen words, without doing any feature engineering on morphological features.


Parameters              French        German, reduced set   German, full set      English
                        (this work)   (Kaeshammer, 2012)    (Kaeshammer, 2012)    (Kasai et al., 2017)
Supertags               5145          2516                  3426                  4727
Supertags occur. once   2693          1123                  1562                  2165
POS tags                13            53                    53                    36
Sentences               21550         28879                 50000                 44168
Avg. sentence length    31.34         17.51                 17.71                 appr. 20
Accuracy                78.54         85.91                 88.51                 89.32

Table 1: Supertagging experiments.

The embeddings go through a recurrent layer to capture the influence of tokens in the preceding and subsequent context of each token. For the recurrent layer we use either bidirectional Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU). We use a Convolutional Neural Network (CNN) layer for character embeddings, since it has proved to be one of the best options for extracting morphological information from word tokens (Ma and Hovy, 2016). The results of the word and character models are concatenated and fed through a softmax layer that gives a probability distribution over possible supertags. Dropout layers are added to counter overfitting. We replaced words without an entry in the word embeddings with a randomly instantiated vector of the same dimension (100). Table 2 provides an overview of the hyper-parameters we used for the supertagger architecture.

Layer             Hyper-parameters    Value
Characters CNN    numb. of filters    40
                  state size          400
Bi-GRU            state size          400
                  initial state       0.0
Words embedding   vector dim.         100
                  window size         5
Char. embedding   dimension           50
                  batch size          128
Dropout           dropout rate        0.5

Table 2: Hyper-parameters of the supertagger.

3 LTAG induction from the French Treebank

Inducing a grammar from a treebank entails identifying a set of productions that could have produced its parse trees. In the case of LTAG this means decomposing the trees into a sequence of elementary trees, one for each word in the sentence.

In order to extract a TAG from the French Treebank (FTB; Abeillé et al., 2003), we applied the heuristic procedure described by Xia (1999). The main idea of this approach is to consider the trees in the treebank as derived trees from an LTAG. Elementary trees are extracted in top-down fashion using percolation tables to identify grammatically obligatory elements (i.e., complements), grammatically optional elements (i.e., modifiers), as well as a head child for each constituent. All sub-trees corresponding to modifiers and complements are extracted in a further step, forming auxiliary trees and initial trees respectively, while the head child and its lexical anchor are kept in the tree. When extracted in this way, elementary trees contain the corresponding lexical anchor, and the branches represent a particular syntactic context of a construction with slots for its complements.

3.1 LTAG induction: pre-processing steps

Before inducing different LTAGs for French, we carried out the pre-processing steps described in Candito et al. (2010) and Crabbé and Candito (2008), including extension of the original POS tag set in FTB from 13 to 26 POS tags and undoing multi-word expressions (MWEs) with regular syntactic patterns (e.g. (MWN (A ancien) (N élève)) → (NP (AP (A ancien)) (N élève))). About 14 % of the word tokens (79,466 out of the total of 557,095 tokens) in FTB belong to flat MWEs. After rewriting compounds with regular syntactic patterns, the number of MWEs is reduced to approximately 5 %.
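A hedged illustration (not the authors' code) of undoing a flat MWE of this kind with nltk, mirroring the example above:

```python
from nltk import Tree

# flat multi-word expression as found in the FTB
mwe = Tree.fromstring("(MWN (A ancien) (N élève))")

# rewrite it with a regular syntactic pattern: the adjective gets its own AP node
rewritten = Tree('NP', [Tree('AP', [mwe[0]]), mwe[1]])
print(rewritten)   # (NP (AP (A ancien)) (N élève))
```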

We also restructured some trees in order to bring the complements to a higher level in the tree. In particular, we shifted the initial prepositional phrase of the VPinf constituents to a higher level and raised the subordinating conjunction (C-S) of the final clause constituents (Ssub) (see Figure 4).

After the preprocessing, we extracted the following LTAGs from FTB for our supertagging experiments: including 13 or 26 POS tags, with and without compounds, and including or excluding punctuation marks.


Gold supertag                                Predicted supertag              Example
(PP (APPR < >) (NP↓ ))                       (PP* (APPR < >) (NP↓ ))         zu einem Eigenheim zu verhelfen
(NP (DP↓ ) (NN < >))                         (NP (NN < >))                   das heutige und künftige Kreditvolumen
(S (S* ) ($, < >) (S↓ ))                     ($,* < >)                       S $, S
(S (NP↓ ) (VVFIN < >) (NP↓ ) (PTKVZ↓ ))      (S (NP↓ ) (VVFIN < >) (NP↓ ))   der Umsatzminus geht auf 125 Millionen [...] zurück

Table 3: Most common error classes for German TAG supertagging with the TiGer treebank.

Figure 4: FTB preprocessing: complement raising.

Table 1 provides some statistics on the extracted LTAG which led to the most accurate supertagging results (13 POS tags, without compounds, including punctuation marks).

3.2 Left- and right-sister-adjunction

Extraction of an LTAG from FTB is challenging due to the flat structure of the trees, which allows any combination of arguments and modifiers. In order to preserve the original flat structures in the FTB as far as possible and to facilitate the extraction of the elementary trees, we decided against the traditional notion of adjunction in TAG, which relies on nested structures, and apply sister-adjunction; i.e., the root of a sister-adjoining tree can be attached as a daughter of any node of another tree with the same node label.

Figure 5: Left-sister-adjunction.

Since a modifier can appear on the right or on the left side relative to the position of the constituent head, we distinguish between right- and left-sister-adjoining trees (marked with * on the left or the right side of the root label, as shown in Figure 5).

A left-sister-adjoining tree γ can only be adjoined to a node η in the tree τ if the root label of γ is the same as the label of η and the anchor of the elementary tree τ comes in the sentence before the anchor of γ. The children of γ are inserted on the right side of the children in η and become the children of η. A right-sister-adjunction is defined in a similar way.

The resulting LTAGs with sister-adjunction are basically LTIGs (Lexicalized Tree Insertion Grammar; Schabes and Waters, 1995), in the sense that the auxiliary trees do not allow wrapping adjunction or adjunction on the root node but permit multiple simultaneous adjunctions on a single node of initial trees. However, since LTIG is a special variant of LTAG, we refer to the extracted grammar as LTAG in the remainder of the paper.

4 Experiments and error analysis

4.1 Experimental setups for German and French

In order to compare the performance of our supertagger with the previous work of Kasai et al. (2017) and LTAG-based supertaggers for German (Bäcker and Harbusch, 2002; Westburg, 2016), we experimented with the supertags extracted by Kaeshammer (2012) from the German TiGer treebank (Brants et al., 2004). The set of supertags for German has the following train, test, and dev. split: 39,925, 5035, and 5040 sentences. We ran a supertagging experiment with this number of sentences, since it is compatible with the experimental setup described in Kasai et al. (2017). Since the number of sentences in FTB is smaller than in TiGer, we created a sample of the train set of the TiGer treebank with a comparable number of sentences in the train set (18,809).


For the supertagging experiments with the French LTAG, we divided FTB into the standard train, development and test sets (19,080, 1235, and 1235 sentences), making our test and dev. sets comparable to the dev. and test sets reported in Candito et al. (2009).

Tables 3 and 5 show the most frequent erroneous supertags for German and French. The symbol < > in the supertags signifies the slot for the lexical anchor, while * marks the foot node of auxiliary trees and ↓ represents a substitution site.

4.2 German TAG supertagging with TiGer

Generally, results for supertagging with German LTAGs appear to be slightly lower than for English. Westburg (2016) reports an accuracy of 82.92 % for German TAG with a supertagger based on a perceptron training algorithm, while Bäcker and Harbusch (2002) reached 78.3 % with an HMM-based TAG supertagger.

Supertagging for German is more challenging than for English due to a higher number of word order variations and the resulting sparseness of the data (Bäcker and Harbusch, 2002). However, our experiments show that the proposed neural supertagging architecture reaches the best performance among the previously described supertaggers for German (88.51 %) and gets comparable results to the supertagging model for English described in Kasai et al. (2017) (see Tables 1 and 4).

System                                       Accuracy
Bäcker and Harbusch (2002) (HMM-based)       78.3
Westburg (2016)                              82.92
This work, full training set (Bi-LSTM)       87.67
This work, full training set (GRU)           88.51
This work, reduced training set (Bi-LSTM)    85.26
This work, reduced training set (GRU)        85.91

Table 4: Supertagging experiments with the German TiGer treebank.

The biggest class of errors for German supertagging contains wrong predictions concerning the type of the elementary tree (e.g. the supertagger predicts an auxiliary tree instead of an initial tree or vice versa). The main reason for this kind of error is the particularity of German which allows dependent elements in a sentence to be separated by a large number of other tokens. For example, a determiner and the determined word, or the separable verb prefix and the verb stem, can be separated by a dozen other tokens. In the sentence Der Umsatzminus geht auf 125 Millionen [..] zurück (Engl. “The sales drop goes down to 125 millions”), the verb geht and its prefix zurück are separated by 11 tokens (see Table 3).

Since the window size of tokens presented to the supertagger is limited, the connection between the tokens can be overlooked by the supertagger. However, increasing the window size leads to greater noise in the data. We experimented with window sizes of 5, 9, and 13 for German and got the best results with a window size of 5 (two words before and after the token).

Another source of mistakes for German is the intersentential punctuation in large complex sentences containing several subordinated clauses. This error can also be explained by the window size of tokens presented to the supertagger: the supertagger does not capture the complex structure of the sentence and classifies the punctuation mark as a one-child auxiliary tree (see Table 3).

Another big class of errors comes from PPs, which can be either optional (modifiers) or obligatory elements. For example, the supertagger did not recognize that the verb verhelfen (Engl. “to help”) requires a prepositional phrase as an argument (e.g. zu einem Eigenheim zu verhelfen; Engl. “to help someone to buy a property”) and erroneously classified this complement as a modifier PP.

4.3 French TAG supertagging with FTB

Supertagging with French LTAGs appears to be more challenging compared to German or English. There are several general reasons for the performance drop of the supertagger, one of which is a higher average sentence length in FTB (31.34 tokens per sentence, compared to 17.51 in TiGer). Sentences in FTB more frequently have a complex syntactic structure, including explicative elements separated by brackets or commas.

The large number of supertags leads to higher data sparsity and makes the sequence labeling problem more difficult for the supertagger. One explanation for the larger number of supertags, besides the longer and more complex sentence structures in FTB, is the large number of flat multi-word expressions in FTB. Our experiments show that rewriting MWEs with regular compounds improves the supertagging performance.

A large number of supertagging errors for French occur due to different attachment sites of the intersentential punctuation marks in FTB. The punctuation marks in FTB are attached to the corresponding constituents and not consistently to the root node of the whole sentence.


Gold supertag                          Predicted supertag      Example
(NP* (PP (P <>) (NP↓ )))               (PP (P <>) (NP↓ ))      32 % par an
(NP* (PONCT < >))                      (SENT* (PONCT <>))      -LRB- 66,7 % -RRB-
(NP* (N < >))                          (N < >)                 Mme Dominique Alduy
(ROOT (SENT (NP↓ ) (VN (V < >))))      (VN (V < >))            le droit est officiellement transgressé

Table 5: Most common error classes for LTAG supertagging with the French Treebank.

System                                         Accuracy
This work (GRU), 13 POS, undone comp.          78.54
This work (GRU), 13 POS, no punct. marks       74.44
This work (GRU), 13 POS, with compounds        76.78
This work (GRU), 26 POS, with compounds        74.84
This work (Bi-LSTM), 13 POS, undone comp.      77.67

Table 6: Supertagging experiments with the French Treebank (FTB).

However, since punctuation marks also help to identify possible constituents, omitting them does not improve supertagging.

Similar to supertagging with German LTAGs, PP attachments are also a major source of errors with French LTAGs. In addition to difficulties with classifying PPs as modifiers or complements (as with the German data), the supertagger for French more frequently encounters problems with identifying the correct site for attaching the PPs to a node in the syntactic tree. The reason for these errors could be that FTB, in comparison to TiGer, does not offer additional function marks to distinguish PPs as modifiers from prepositional complements of support verbs.

4.4 N-best supertagging experiments

The softmax layer of the supertagging model we described in Section 2.2 provides a distribution of probabilities over the supertags when classifying words in a sentence, and we used this distribution to enable our supertagger to predict n-best supertags.

n-best    Accuracy German (full set)    Accuracy German (red. set)    Accuracy French
1-best    88.51                         85.91                         78.54
2-best    94.37                         93.04                         87.34
3-best    96.08                         95.00                         90.85
5-best    97.45                         96.66                         94.38
7-best    98.03                         97.40                         96.00
10-best   98.52                         97.97                         97.08

Table 7: N-best supertagging experiments.

We experimented with different numbers of n-best supertags for every word, counting a prediction as accurate each time at least one of the n-best supertags was predicted correctly. The experiments show a quick growth in prediction accuracy up to 5-best supertags, while for ranks n > 5 the improvement in accuracy is smaller (see Table 7).
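The n-best accuracy used here (a token counts as correct if the gold supertag appears among the n highest-probability supertags) can be sketched as follows; this is an illustrative reimplementation, not the authors' evaluation code:

```python
import numpy as np

def n_best_accuracy(probs, gold, n):
    """probs: (num_tokens, num_supertags) softmax outputs; gold: (num_tokens,) gold indices."""
    top_n = np.argsort(-probs, axis=1)[:, :n]
    return float(np.mean([g in row for g, row in zip(gold, top_n)]))
```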

5 Conclusion and Future Work

We proposed a neural architecture for supertagging with TAG for German and French and carried out experiments to measure the performance of the supertagging model for these languages. We induced several different LTAGs from FTB in order to compare supertagging performance. The results with the German LTAG show that the neural supertagging model achieves results comparable to the state-of-the-art TAG supertagging model for English described in Kasai et al. (2017), even though German is more difficult for supertagging due to its free word order and the resulting data sparseness. Supertagging for French appears to be more difficult due to the larger average sentence length and the big number of multiword expressions.

In future work we plan to increase the performance of the supertagger for French by dividing the supertagging algorithm into two steps: factorization of the extracted supertags into tree families and deciding afterwards on the correct supertag within the predicted tree family. We plan to use the improved supertagger for graph-based parsing. In particular, we aim at adapting the A*-based PARTAGE parser for LTAGs developed by Waszczuk (2017) for parsing with extracted supertags. We also intend to add deep syntactic features and information on semantic roles to the supertags in order to test whether the proposed supertagging architecture can be used for semantic role labeling.

Acknowledgements

This work was carried out as a part of the research project TREEGRASP (http://treegrasp.phil.hhu.de) funded by a Consolidator Grant of the European Research Council (ERC). We thank three anonymous reviewers for their careful reading, valuable suggestions and constructive comments.


References

Anne Abeillé, Lionel Clément, and François Toussenel. 2003. Building a treebank for French. In Treebanks, pages 165–187. Springer.

Jens Bäcker and Karin Harbusch. 2002. Hid-den markov model-based supertagging in a user-initiative dialogue system. In Proceedings of TAG+6, pages 269–278.

Srinivas Bangalore and Aravind K Joshi. 1999. Su-pertagging: An approach to almost parsing. Com-putational linguistics, 25(2):237–265.

Piotr Bojanowski, Edouard Grave, Armand Joulin,and Tomas Mikolov. 2016. Enriching word vec-tors with subword information. arXiv preprintarXiv:1607.04606.

Sabine Brants, Stefanie Dipper, Peter Eisenberg, Sil-via Hansen-Schirra, Esther König, Wolfgang Lezius,Christian Rohrer, George Smith, and Hans Uszkor-eit. 2004. Tiger: Linguistic interpretation of a ger-man corpus. Research on language and computa-tion, 2(4):597–620.

Marie Candito, Benoît Crabbé, and Pascal Denis. 2010.Statistical french dependency parsing: treebank con-version and first results. In Seventh InternationalConference on Language Resources and Evaluation-LREC 2010, pages 1840–1847. European LanguageResources Association (ELRA).

Marie Candito, Benoît Crabbé, and Djamé Seddah.2009. On statistical parsing of french with su-pervised and semi-supervised strategies. In Pro-ceedings of the EACL 2009 Workshop on Computa-tional Linguistic Aspects of Grammatical Inference,CLAGI ’09, pages 49–57, Stroudsburg, PA, USA.Association for Computational Linguistics.

John Chen. 2010. Semantic labeling and parsing via tree-adjoining grammars. Supertagging: Using Complex Lexical Descriptions in Natural Language Processing, pages 431–448.

John Chen, Srinivas Bangalore, Michael Collins, and Owen Rambow. 2002. Reranking an n-gram supertagger. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6), pages 259–268.

Benoit Crabbé and Marie Candito. 2008. Expériences d'analyse syntaxique statistique du français. In 15ème conférence sur le Traitement Automatique des Langues Naturelles (TALN'08).

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. 2013. The TenTen corpus family. In 7th International Corpus Linguistics Conference CL, pages 125–127.

Aravind K. Joshi and Yves Schabes. 1997. Tree-adjoining grammars. In Handbook of Formal Languages, pages 69–123. Springer.

Miriam Kaeshammer. 2012. A German treebank and lexicon for tree-adjoining grammars. Master's thesis, Universität des Saarlandes, Germany.

Jungo Kasai, Bob Frank, Tom McCoy, Owen Rambow, and Alexis Nasr. 2017. TAG parsing with neural networks and vector representations of supertags. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1712–1722.

Karin Kipper, Hoa Trang Dang, William Schuler, and Martha Palmer. 2000. Building a class-based verb lexicon using TAGs. In Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5), pages 147–154.

Mike Lewis, Kenton Lee, and Luke Zettlemoyer. 2016. LSTM CCG parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221–231.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418, Berlin, Germany. Association for Computational Linguistics.

Younes Samih. 2017. Dialectal Arabic Processing Using Deep Learning. Ph.D. thesis, Düsseldorf, Germany.

Anoop Sarkar. 2007. Combining supertagging and lexicalized tree-adjoining grammar parsing. Complexity of Lexical Descriptions and its Relevance to Natural Language Processing: A Supertagging Approach, page 113.

Yves Schabes and Richard C. Waters. 1995. Tree insertion grammar: a cubic-time, parsable formalism that lexicalizes context-free grammar without changing the trees produced. Computational Linguistics, 21(4).

Bangalore Srinivas. 2000. A lightweight dependency analyzer for partial parsing. Natural Language Engineering, 6(2):113–138.

Ashish Vaswani, Yonatan Bisk, Kenji Sagae, and Ryan Musa. 2016. Supertagging with LSTMs. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 232–237.

Jakub Waszczuk. 2017. Leveraging MWEs in practical TAG parsing: towards the best of the two worlds. Ph.D. thesis.


Anika Westburg. 2016. Supertagging for German - an implementation based on tree adjoining grammars. Master's thesis, Heinrich Heine University of Düsseldorf.

Fei Xia. 1999. Extracting tree adjoining grammars from bracketed corpora. In Proceedings of the 5th Natural Language Processing Pacific Rim Symposium (NLPRS-99), pages 398–403.


Proceedings of ACL 2018, Student Research Workshop, pages 67–73, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags

Eva Vanmassenhove
Adapt Centre, Dublin City University, Ireland
[email protected]

Andy Way
Adapt Centre, Dublin City University, Ireland
[email protected]

Abstract

In this paper we incorporate semantic supersense tags and syntactic supertag features into EN–FR and EN–DE factored NMT systems. In experiments on various test sets, we observe that such features (particularly when combined) help the NMT model to converge faster during training and improve translation quality as measured by BLEU.

1 Introduction

Neural Machine Translation (NMT) models have recently become the state-of-the-art in the field of Machine Translation (Bahdanau et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2014; Sutskever et al., 2014). Compared to Statistical Machine Translation (SMT), the previous state-of-the-art, NMT performs particularly well when it comes to word reordering and translations involving morphologically rich languages (Bentivogli et al., 2016). Although NMT seems to partially 'learn' or generalize some patterns related to syntax from the raw, sentence-aligned parallel data, more complex phenomena (e.g. prepositional-phrase attachment) remain problematic (Bentivogli et al., 2016). More recent work showed that explicitly (Sennrich and Haddow, 2016; Nadejde et al., 2017; Bastings et al., 2017; Aharoni and Goldberg, 2017) or implicitly (Eriguchi et al., 2017) modeling extra syntactic information in an NMT system on the source (and/or target) side can lead to improvements in translation quality.

When integrating linguistic information into an MT system, following the central role assigned to syntax by many linguists, the focus has mainly been on the integration of syntactic features. Although there has been some work on semantic features for SMT (Banchs and Costa-Jussa, 2011), so far no work has been done on enriching NMT systems with more general semantic features at the word level. This might be explained by the fact that NMT models already have means of learning semantic similarities through word embeddings, where words are represented in a common vector space (Mikolov et al., 2013). However, making some level of semantics more explicitly available at the word level can provide the translation system with a higher level of abstraction that is beneficial for learning more complex constructions. Furthermore, a combination of both syntactic and semantic features would provide the NMT system with a way of learning semantico-syntactic patterns.

To apply semantic abstractions at the word level that enable a characterisation beyond what can be superficially derived, coarse-grained semantic classes can be used. Inspired by Named Entity Recognition, which provides such abstractions for a limited set of words, supersense tagging uses an inventory of more general semantic classes for domain-independent settings (Schneider and Smith, 2015). We investigate the effect of integrating supersense features (26 for nouns, 15 for verbs) into an NMT system. To obtain these features, we used the AMALGrAM 2.0 tool (Schneider et al., 2014; Schneider and Smith, 2015), which analyses the input sentence for Multi-Word Expressions as well as noun and verb supersenses. The features are integrated using the framework of Sennrich et al. (2016), replicating the tags for every subword unit obtained by byte-pair encoding (BPE). We further experiment with a combination of semantic supersenses and syntactic supertag features (CCG syntactic categories (Steedman, 2000) obtained with EasySRL (Lewis et al., 2015)) and less complex features such as POS tags, assuming that supersense tags have the potential to be useful especially in combination with syntactic information.


The remainder of this paper is structured as follows: first, in Section 2, related work is discussed. Next, Section 3 presents the semantic and syntactic features used. The experimental set-up is described in Section 4, followed by the results in Section 5. Finally, we conclude and present ideas for future work in Section 6.

2 Related Work

In SMT, various linguistic features such as stems (Toutanova et al., 2008), lemmas (Marecek et al., 2011; Fraser et al., 2012), POS tags (Avramidis and Koehn, 2008), dependency labels (Avramidis and Koehn, 2008) and supertags (Hassan et al., 2007; Haque et al., 2009) are integrated using pre- or post-processing techniques, often involving factored phrase-based models (Koehn and Hoang, 2007). Compared to factored NMT models, factored SMT models have some disadvantages: (a) adding factors increases the sparsity of the models, (b) the n-grams limit the size of the context that is taken into account, and (c) features are assumed to be independent of each other. However, adding syntactic features to SMT systems led to improvements with respect to word order and morphological agreement (Williams and Koehn, 2012; Sennrich, 2015).

One of the main strengths of NMT is its strong ability to generalize. The integration of linguistic features can be handled in a flexible way without creating sparsity issues or limiting context information (within the same sentence). Furthermore, the encoder and attention layers can be shared between features. By representing the encoder input as a combination of features (Alexandrescu and Kirchhoff, 2006), Sennrich and Haddow (2016) generalized the embedding layer in such a way that an arbitrary number of linguistic features can be explicitly integrated. They then investigated whether features such as lemmas, subword tags, morphological features, POS tags and dependency labels could be useful for NMT systems or whether their inclusion is redundant.

Similarly, on the syntax level, Shi et al. (2016) show that although NMT systems are able to partially learn syntactic information, more complex patterns remain problematic. Furthermore, sometimes information is present in the encoding vectors but is lost during the decoding phase (Vanmassenhove et al., 2017). Sennrich and Haddow (2016) show that the inclusion of linguistic features leads to improvements over the NMT baseline for EN–DE (0.6 BLEU), DE–EN (1.5 BLEU) and EN–RO (1.0 BLEU). When evaluating the gains from the features individually, it appears that the gain from different features is not fully cumulative. Nadejde et al. (2017) extend their work by including CCG supertags as explicit features in factored NMT systems. Moreover, they experiment with serializing and multitasking and show that tightly coupling the words with their syntactic features leads to improvements in translation quality (measured by BLEU), while a multitask approach (where the NMT system predicts CCG supertags and words independently) does not perform better than the baseline system. A similar observation was made by Li et al. (2017), who incorporate the linearized parse trees of the source sentences into ZH–EN NMT systems. They propose three different sorts of encoders: (a) a parallel RNN, (b) a hierarchical RNN, and (c) a mixed RNN. Like Nadejde et al. (2017), Li et al. (2017) observe that the mixed RNN (the simplest RNN encoder), where words and label annotation vectors are simply stitched together in the input sequences, yields the best performance, with a significant improvement (1.4 BLEU). Similarly, Eriguchi et al. (2016) integrated syntactic information in the form of linearized parse trees by using an encoder that computes vector representations for each phrase in the source tree. They focus on source-side syntactic information based on Head-Driven Phrase Structure Grammar (Sag et al., 1999), where target words are aligned not only with the corresponding source words but with the entire source phrase. Wu et al. (2017) focus on incorporating source-side long-distance dependencies by enriching each source state with global dependency structure.

To the best of our knowledge, there has not been any work on explicitly integrating semantic information into NMT. Similarly to syntactic features, we hypothesize that semantic features in the form of semantic 'classes' can be beneficial for NMT, providing it with an extra ability to generalize and thus better learn more complex semantico-syntactic patterns.

3 Semantics and Syntax in NMT

3.1 Supersense Tags

The novelty of our work is the integration of explicit semantic features, supersenses, into an NMT system. Supersenses are the top-level hypernyms in the WordNet (Miller, 1995) taxonomy, sometimes also referred to as semantic fields (Schneider and Smith, 2015). The supersenses cover all nouns and verbs with a total of 41 supersense categories, 26 for nouns and 15 for verbs. To obtain the supersense tags we used the AMALGrAM (A Machine Analyzer of Lexical Groupings and Meanings) 2.0 tool [1], which in addition to the noun and verb supersenses analyses English input sentences for MWEs. An example of a sentence annotated with the AMALGrAM tool is given in (1): [2]

(1) (a) "He seemed to have little faith in our democratic structures, suggesting that various articles could be misused by governments."

(b) "He seemed|cognition to have|stative little faith|COGNITION in our democratic structures|ARTIFACT , suggesting|communication that various articles|COMMUNICATION could be|`a misused|social by governments|GROUP ."

As can be noted in (1), some supersenses, such as cognition, exist for both nouns and verbs. However, the supersense tags for verbs are always lowercased while the ones for nouns are capitalized. This way, the supersenses also provide syntactic information useful for disambiguation, as in (2), where the word work is correctly tagged as a noun (with its capitalized supersense tag ACT) in the first part of the sentence and as a verb (with the lowercased supersense tag social). Furthermore, there is a separate tag to distinguish auxiliary verbs from main verbs.

(2) (a) "In the course of my work on the opinion, I in fact became aware of quite a number of problems and difficulties for EU citizens who live and work in Switzerland."

(b) "In the course|EVENT of my work|ACT on the opinion|COGNITION , I in fact became|stative aware of quite a number of problems|COGNITION and difficulties|COGNITION for EU citizens|GROUP who live|social and work|social in Switzerland|LOCATION ."

Since MWEs and supersenses naturally complement each other, Schneider and Smith (2015) integrated the MWE identification task (Schneider et al., 2014) with the supersense tagging task of Ciaramita and Altun (2006). In Example (2), the MWEs in fact, a number of and EU citizens are retrieved by the tagger.

[1] https://github.com/nschneid/pysupersensetagger

[2] All the examples are extracted from our data, used later on to train the NMT systems.

We add this semantico-syntactic information on the source side as an extra feature in the embedding layer, following the approach of Sennrich and Haddow (2016), who extended the model of Bahdanau et al. (2014). A separate embedding is learned for every source-side feature provided (the word itself, POS tag, supersense tag etc.). These embedding vectors are then concatenated into one embedding vector and used in the model instead of the simple word embedding (Sennrich and Haddow, 2016).
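To make the factored-input idea concrete, the sketch below shows how per-feature embeddings can be concatenated into a single token representation. This is an illustrative PyTorch-style sketch, not the Nematus implementation; the class name, vocabulary sizes and the 500/100/100 split of the 700-dimensional embedding are our own assumptions.

    import torch
    import torch.nn as nn

    class FactoredEmbedding(nn.Module):
        """One lookup table per source-side factor (word, supersense tag, CCG tag, ...);
        the per-factor vectors are concatenated into a single input vector per token."""
        def __init__(self, vocab_sizes, factor_dims):
            super().__init__()
            self.tables = nn.ModuleList(
                nn.Embedding(v, d) for v, d in zip(vocab_sizes, factor_dims))

        def forward(self, factor_ids):
            # factor_ids: one LongTensor of shape (batch, seq_len) per factor
            vectors = [table(ids) for table, ids in zip(self.tables, factor_ids)]
            return torch.cat(vectors, dim=-1)   # combined embedding fed to the encoder

    # hypothetical sizes: word vocabulary (500 dims), supersense tags (100), CCG tags (100)
    embedder = FactoredEmbedding([35000, 50, 500], [500, 100, 100])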

To reduce the number of out-of-vocabulary (OOV) words, we follow the approach of Sennrich et al. (2016), using a variant of BPE for word segmentation capable of encoding open vocabularies with a compact symbol vocabulary of variable-length subword units. For each word that is split into subword units, we copy the features of the word in question to its subword units. In (3), we give an example with the word 'stormtroopers', which is tagged with the supersense tag 'GROUP'. It is split into 5 subword units, so the supersense tag feature is copied to all five subword units. Furthermore, we add a none tag to all words that did not receive a supersense tag.

(3) Input:  "the stormtroopers"
    SST:    "the stormtroopers|GROUP"
    BPE:    "the stor@@ m@@ tro@@ op@@ ers"
    Output: "the|none stor@@|GROUP ... op@@|GROUP ers|GROUP"

For the MWEs we decided to copy the supersense tag to all the words of the MWE (if provided by the tagger), as in (4). If the MWE did not receive a particular tag, we added the tag mwe to all its components, as in example (5).

(4) Input:  "EU citizens"
    SST:    "EU citizens|GROUP"
    Output: "EU|GROUP citizens|GROUP"

(5) Input:  "a number of"
    SST:    "a number of"
    Output: "a|mwe number|mwe of|mwe"
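The feature-copying step illustrated in (3)–(5) amounts to replicating each word's tag on every subword unit produced by BPE. A minimal sketch is given below; the function and variable names are ours, and the segmentation function stands in for whatever BPE model is in use.

    def copy_tags_to_subwords(words, tags, bpe_segment):
        """words: list of source tokens; tags: per-word tags ('none' or 'mwe' when no
        supersense was assigned); bpe_segment: maps a word to its subword units."""
        out_units, out_tags = [], []
        for word, tag in zip(words, tags):
            for unit in bpe_segment(word):
                out_units.append(unit)
                out_tags.append(tag)      # the same tag is replicated on every subword
        return out_units, out_tags

    # toy segmentation mirroring example (3); the actual splits come from the BPE model
    segment = lambda w: ["stor@@", "m@@", "tro@@", "op@@", "ers"] if w == "stormtroopers" else [w]
    units, unit_tags = copy_tags_to_subwords(["the", "stormtroopers"], ["none", "GROUP"], segment)
    # units:     ['the', 'stor@@', 'm@@', 'tro@@', 'op@@', 'ers']
    # unit_tags: ['none', 'GROUP', 'GROUP', 'GROUP', 'GROUP', 'GROUP']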

3.2 Supertags and POS-tags

We hypothesize that more general semantic information can be particularly useful for NMT in combination with more detailed syntactic information. To support our hypothesis we also experimented with syntactic features (separately and in combination with the semantic ones): POS tags and CCG supertags.

The POS tags are generated by the Stanford POS tagger (Toutanova et al., 2003); for the supertags we used the EasySRL tool (Lewis et al., 2015), which annotates words with CCG tags. CCG tags provide global syntactic information on the lexical level. This kind of information can help resolve ambiguity in terms of prepositional attachment, among others. An example of a CCG-tagged sentence is given in (6):

(6) It|NP is|(S[dcl]\NP)/NP a|NP/N modern|N/N form|N/PP of|PP/NP colonialism|N .|.

4 Experimental Set-Up

4.1 Data sets

Our NMT systems are trained on 1M parallel sentences of the Europarl corpus for EN–FR and EN–DE (Koehn, 2005). We test the systems on 5K sentences (different from the training data) extracted from Europarl and on newstest2013. Two different test sets are used in order to show how additional semantic and syntactic features can help the NMT system translate different types of test sets, and thus to evaluate the general effect of our improvement.

4.2 Description of the NMT system

We used the Nematus toolkit (Sennrich et al., 2017) to train encoder-decoder NMT models with the following parameters: vocabulary size: 35,000, maximum sentence length: 60, vector dimension: 1024, word embedding layer: 700, learning optimizer: adadelta. We keep the embedding layer fixed to 700 for all models in order to ensure that the improvements are not simply due to an increase of the parameters in the embedding layer. In order to by-pass the OOV problem and reduce the number of dictionary entries we use word segmentation with BPE (Sennrich, 2015). We ran the BPE algorithm with 89,500 operations. We trained all our BPE-ed NMT systems with CCG tag features, supersense tags (SST), POS tags and the combination of syntactic features (POS or CCG) with the semantic ones (SST). All systems are trained for 150,000 iterations and evaluated after every 10,000 iterations.
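For reference, the training set-up described above can be summarised as follows. This is only an illustrative summary of the reported hyperparameters; the key names are ours and do not correspond to actual Nematus command-line options.

    config = {
        "toolkit": "nematus",
        "vocabulary_size": 35000,
        "max_sentence_length": 60,
        "hidden_dimension": 1024,
        "embedding_dimension": 700,          # kept fixed across all systems
        "optimizer": "adadelta",
        "bpe_merge_operations": 89500,
        "training_iterations": 150000,
        "evaluation_interval": 10000,
        "source_factors": ["word", "POS", "CCG", "SST"],  # the subset varies per system
    }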

5 Results

5.1 English–French

For both test sets, the NMT system with supersenses (SST) converges faster than the baseline (BPE) NMT system. As we hypothesized, the benefit of the added features is clearer on newstest2013 than on the Europarl test set. Figure 1 compares the BPE-ed baseline system (BPE) with the supertag-supersensetag system (CCG–SST), automatically evaluated on newstest2013 (in terms of BLEU (Papineni et al., 2002)) over all 150,000 iterations. From the graph, it can also be observed that the system has a more robust, consistent learning curve.

Figure 1: Baseline (BPE) vs. Combined (SST–CCG) NMT systems for EN–FR, evaluated on newstest2013 (BLEU over training iterations).

To see in more detail how our semantically enriched SST system compares to an NMT system with syntactic CCG supertags, and how a system that integrates both semantic and syntactic features (SST–CCG) performs, a more detailed graph is provided in Figure 2, where we zoom in on later stages of the learning process. Although Sennrich and Haddow (2016) observe that features are not necessarily cumulative (possibly since the information from the syntactic features partially overlapped), the system enriched with both semantic and syntactic features outperforms the two separate systems as well as the baseline system. The best SST–CCG model (23.21 BLEU) outperforms the best BPE-ed baseline model (22.54 BLEU) by 0.67 BLEU (see Table 1). Moreover, the benefit of syntactic and semantic features seems to be more than cumulative at some points, confirming the idea that providing both information sources can help the NMT system learn semantico-syntactic patterns. This supports our hypothesis that semantic and syntactic features are particularly useful when combined.

Figure 2: Baseline (BPE) vs. Syntactic (CCG) vs. Semantic (SST) vs. Combined (SST–CCG) NMT systems for EN–FR, evaluated on newstest2013 (BLEU over the later training iterations).

Table 1: Best BLEU scores for the Baseline (BPE), Syntactic (CCG), Semantic (SST) and Combined (SST–CCG) NMT systems for EN–FR, evaluated on newstest2013.

                BPE     CCG     SST     SST–CCG
    Best Model  22.54   23.03   22.86   23.21

5.2 English–German

The results for the EN–DE system are very similar to the EN–FR system: the model converges faster and we observe the same trends with respect to the BLEU scores of the different systems. Figure 3 compares the BPE-ed baseline system (BPE) with the NMT system enriched with SST and CCG tags (SST–CCG). In the last iterations (see Figure 4) we see how the two systems enriched with supersense tags and CCG tags lead to small improvements over the baseline. However, their combination (SST–CCG) leads to a more robust NMT system with a higher BLEU score (see Table 2).

Figure 3: Baseline (BPE) vs. Combined (CCG–SST) NMT systems for English–German, evaluated on the Europarl test set (BLEU over training iterations).

Figure 4: Baseline (BPE) vs. Syntactic (CCG) vs. Semantic (SST) vs. Combined (CCG–SST) NMT systems for EN–DE, evaluated on the Europarl test set (BLEU over the later training iterations).

Table 2: Best BLEU scores for the Baseline (BPE), Syntactic (CCG), Semantic (SST) and Combined (SST–CCG) NMT systems for EN–DE, evaluated on the Europarl test set.

                BPE     CCG     SST     SST–CCG
    Best Model  22.32   22.47   22.51   22.85

6 Conclusions and Future Work

In this work we experimented with EN–FR and EN–DE data augmented with semantic and syntactic features. For both language pairs we observe that adding extra semantic and/or syntactic features leads to faster convergence. Furthermore, the benefit of the additional features is clearer on a dissimilar test set, which is in accordance with our original hypothesis that semantic and syntactic features (and their combination) can be beneficial for generalization. In the future, we would like to perform manual evaluations of the outputs of our systems to see whether they correlate with the BLEU scores. In the next step, we will let the models converge, create ensemble models for the different systems and compute whether the increase in BLEU score is significant. Furthermore, we would like to experiment with larger datasets to verify whether the positive effect of the linguistic features remains.

7 Acknowledgements

I would like to thank Natalie Schluter for her insightful comments and feedback and Roam Analytics for sponsoring the ACL SRW travel grant. This work has been supported by the Dublin City University Faculty of Engineering & Computing under the Daniel O'Hare Research Scholarship scheme and by the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Proceedings of the Association for Computational Linguistics, ACL, Vancouver, Canada.

Andrei Alexandrescu and Katrin Kirchhoff. 2006. Factored Neural Language Models. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 1–4, New York, USA.

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching Morphologically Poor Languages for Statistical Machine Translation. In Proceedings of The Association for Computer Linguistics (ACL-08), pages 763–770, Ohio, USA.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. Banff, Canada.

Rafael E. Banchs and Marta R. Costa-Jussà. 2011. A Semantic Feature for Statistical Machine Translation. In Proceedings of the Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 126–134, Portland, Oregon, USA.

Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima'an. 2017. Graph Convolutional Encoders for Syntax-Aware Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 257–267, Austin, Texas, USA.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of EMNLP 2014, pages 1724–1734, Doha, Qatar.

Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 594–602, Sydney, Australia.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 823–833, Berlin, Germany.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the Association for Computational Linguistics, ACL, Vancouver, Canada.

Alexander Fraser, Marion Weller, Aoife Cahill, and Fabienne Cap. 2012. Modeling Inflection and Word-Formation in SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 664–674.

Rejwanul Haque, Sudip Kumar Naskar, Yanjun Ma, and Andy Way. 2009. Using supertags as source language context in SMT. In European Association for Machine Translation, Barcelona, Spain.

Hany Hassan, Khalil Sima'an, and Andy Way. 2007. Supertagged phrase-based statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 288–295, Prague, Czech Republic.

Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling Source Syntax for Neural Machine Translation. In Proceedings of the Association for Computational Linguistics, ACL, pages 688–697, Vancouver, Canada.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, Volume 1: Long Papers, pages 655–665, Baltimore, MD, USA.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86, Phuket, Thailand.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Conference on Computational Natural Language Learning Joint Meeting following ACL (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic.

Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG Parsing and Semantic Role Labelling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 1444–1454, Lisbon, Portugal.

David Marecek, Rudolf Rosa, Petra Galuscakova, and Ondrej Bojar. 2011. Two-step Translation with Grammatical Post-Processing. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 426–432, Edinburgh, UK.


Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL, volume 13, pages 746–751, Atlanta, USA.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Predicting Target Language CCG Supertags Improves Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 68–79, Copenhagen, Denmark.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Ivan A. Sag, Thomas Wasow, Emily M. Bender, and Ivan A. Sag. 1999. Syntactic Theory: A Formal Introduction, volume 92. Center for the Study of Language and Information, Stanford, CA.

Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014. Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut. Volume 2, pages 193–206, Baltimore, Maryland, USA.

Nathan Schneider and Noah A. Smith. 2015. A Corpus and Model Integrating Multiword Expressions and Supersenses. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1537–1547, Denver, Colorado, USA.

Rico Sennrich. 2015. Modelling and Optimizing on Syntactic N-grams for Statistical Machine Translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.

Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, WMT, ACL, pages 83–91, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL, Volume 1: Long Papers, Berlin, Germany.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does String-Based Neural MT Learn Source Syntax? In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA.

Mark Steedman. 2000. The Syntactic Process, volume 24. Cambridge: MIT Press.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, pages 3104–3112, Montreal, Quebec, Canada.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 173–180.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying Morphology Generation Models to Machine Translation. In ACL, pages 514–522.

Eva Vanmassenhove, Jinhua Du, and Andy Way. 2017. Investigating 'aspect' in NMT and SMT: Translating the English simple past and present perfect. Computational Linguistics in the Netherlands Journal, 7:109–128.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 388–394.

Shuangzhi Wu, Ming Zhou, and Dongdong Zhang. 2017. Improved Neural Machine Translation with Source Syntax. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pages 4179–4185, Melbourne, Australia.


Proceedings of ACL 2018, Student Research Workshop, pages 74–83, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Unsupervised Semantic Abstractive Summarization

Shibhansh Dohare (corresponding author)
CSE Department, IIT Kanpur
sdohare@cse.iitk.ac.in

Vivek Gupta
Microsoft Research, Bangalore
t-vigu@microsoft.com

Harish Karnick
CSE Department, IIT Kanpur
hk@cse.iitk.ac.in

Abstract

Automatic abstractive summary generation remains a significant open problem for natural language processing. In this work, we develop a novel pipeline for Semantic Abstractive Summarization (SAS). SAS, as introduced by Liu et al. (2015), first generates an AMR graph of an input story, extracts a summary graph from it, and finally creates summary sentences from this summary graph. Compared to earlier approaches, we develop a more comprehensive method to generate the story AMR graph using state-of-the-art co-reference resolution and Meta Nodes, which we then use in a novel unsupervised algorithm, based on how humans summarize a piece of text, to extract the summary sub-graph. Our algorithm outperforms the state-of-the-art SAS method by a 1.7% F1 score in node prediction.

1 Introduction

Summarization of large texts is still an open problem in natural language processing. Automatic summarization is often used to summarize large texts like stories, journal papers and news articles, and even larger texts like books and court judgments.

Existing methods for summarization can be broadly categorized into two categories: Extractive and Abstractive. Most of the work done on summarization in the past has been Extractive (Dang and Owczarzak, 2008). Extractive methods directly pick up words and sentences from the text to generate a summary. Vanderwende et al. (2004) transformed the input to nodes, then used the PageRank algorithm to score nodes, and finally grew the nodes from high-value to low-value using some heuristics. Some of the approaches combine this with sentence compression so that more sentences can be packed into the summary. McDonald (2007), Martins and Smith (2009), Almeida and Martins (2013), and Gillick and Favre (2009), among others, used ILPs and approximations for encoding compression and extraction. However, human-level summary generation requires rephrasing sentences and combining information from different parts of the text. Thus, these methods are inherently limited in the sense that they can never generate human-level summaries for large and complicated documents.

On the other hand, most Abstractive methods take advantage of recent developments in deep learning. Specifically, the recent success of sequence-to-sequence learning models (Sutskever et al., 2014), where recurrent networks read the text, encode it and then generate the target text, has produced promising results. Rush et al. (2015), Chopra et al. (2016), Nallapati et al. (2016) and See et al. (2017) used standard encoder-decoder models along with their variants to generate summaries. Takase et al. (2016) incorporated AMR information into the standard encoder-decoder models to improve results. These approaches have produced promising results and have recently been shown to be competitive with the extractive methods, but they are still far from reaching human-level quality in summary generation. One of the significant problems with these methods is that there is no guarantee that they can handle subtleties of language like the presence of a word that negates the meaning of the full text, hard-to-capture co-references, etc.

Banarescu et al. (2013) introduced AMR as a base for work on statistical natural language understanding and generation. AMR tries to capture "who is doing what to whom" in a sentence. An AMR represents the meaning of a sentence using rooted, acyclic, labeled, directed graphs. Figure 2 shows the AMR graph of the sentence "I looked carefully all around me" generated by the JAMR parser (Flanigan et al., 2014). The nodes in the AMR are labeled with concepts; in Figure 2, 'around' represents one such concept. Edges carry the information regarding the semantic relation between the concepts. In Figure 2, direction is the relation between the concepts look-01 and around. AMR relies on PropBank for semantic relations (edge labels). Concepts can also be of the form run-01, where the index 01 represents the first sense of the word run. Further details about AMR can be found in the AMR guidelines (Banarescu et al., 2015). Liu et al. (2015) started the work on summarization using AMR, which we call Semantic Abstractive Summarization (SAS).

Liu et al. (2015) introduced the fundamental idea behind SAS. In SAS the final summary is produced by extracting a summary subgraph from the story graph and generating the summary from this extracted graph (see Figure 1). But the work was limited to obtaining the summary graph due to the absence of AMR-to-text generators at that time. They used various graphical features like distance from the root, the number of outgoing edges, etc. and the sentence number as features for nodes. The procedure then learned weights over these features with the constraint that the nodes must form a connected graph.

In this work, we propose an alternative method to use AMRs for abstractive summarization. Our approach is inspired by the way humans summarize a piece of text. User studies (Chin et al., 2009; Kang et al., 2011) have shown that humans summarize by first writing down the key phrases, then trying to figure out the relationships among them, and then organizing the data accordingly. Falke and Gurevych (2017) used similar ideas to propose the task of concept-map-based summarization. We design our algorithm along the same lines. The first step is to find the most important entities/events in the text. The second step is to identify the key relations among the most important entities/events, and finally, in the last step, we capture information around the selected relation. AMRs provide a natural way to achieve this process, as all the events/entities can be represented by a node (Rao et al., 2017) or a group of nodes, while any relation can be captured by a path in the AMR graph. We also develop a more comprehensive method to generate the story AMR from the sentence AMRs based on event/entity co-reference resolution and Meta Nodes. Our algorithm outperforms the previous state-of-the-art methods for SAS by a 1.7% F1 score on node prediction.

Our major contributions in this work are:

• We propose a novel unsupervised algorithm for the key step of summary graph extraction, which provides a stronger baseline for future work on SAS.

• We propose a novel method to generate the story AMR based on a more comprehensive co-reference resolution and Meta Nodes.

The rest of the paper is organized as follows. Sections 2 and 3 describe the datasets and the algorithm used for summary generation, respectively. Section 4 contains the results of experiments using our approach.

2 Datasets

We use the proxy report section of the AMR Bank (Knight et al., 2014), as it is the only section that is relevant for the task: it contains the gold-standard (human-generated) AMR graphs for news articles and their summaries. In the training set, the stories and summaries contain 17.5 sentences and 1.5 sentences on average, respectively. The training and test sets include 298 and 33 summary-document pairs, respectively.

3 Pipeline for Summary Generation

The pipeline consists of three major steps. The first step is to convert the document into an AMR (Step 1). The next step is to extract a summary AMR from the document AMR constructed in the previous step (Step 2). The final step generates text from the extracted sub-graph (Step 3). In the following subsections, we expand on each step.

3.1 Step 1: Story to AMR: Document graph generation

Figure 1: The pipeline proposed by Liu et al. (2015) has the following steps: AMR parsing, naive node merging, subgraph selection and text generation.

Figure 2: The graphical representation of the AMR graph of the sentence "I looked carefully all around me", visualized using AMRICA (Saphra and Lopez, 2015).

Document AMR refers to the AMR representing the meaning of the whole document. (Code for the complete pipeline for end-to-end summarization is available at https://github.com/shibhansh/Unsupervised-SAS.) The AMR formalism guarantees that no two nodes refer to the same event/entity. Liu et al. (2015) extend this principle to multiple sentences by merging nodes referring to the same named entity (or date) across sentences. However, they adopted a naive approach to co-reference resolution using simple name and date matching. Node co-reference resolution can be greatly improved if we take advantage of the huge literature on text co-reference resolution. We solve node co-reference resolution using text co-reference resolution followed by mapping the text to a node using Alignments.

Node co-reference resolution is a crucial step, as a wrongly generated document AMR can produce a factually wrong summary. To mitigate this, we implement multiple sanity checks to avoid wrong mergers. Text co-reference resolution techniques can be broadly categorized into three major categories: neural, statistical and rule-based. We used the state-of-the-art end-to-end neural co-reference resolution system of Lee et al. (2017). Future work can use an ensemble of co-reference resolvers to improve robustness. The major sanity checks that we employ are:

• Don't merge nodes which have an outgoing edge with label :name if the values of the :name arguments are different and neither of the names is the initials of the other.

• Don't merge if a cycle emerges in the graph after the merger.

• Don't merge if the nodes to be merged have common outgoing edge labels and the nodes connected by these edges are different.

For mapping text to the node, we use alignments. Alignments provide a mapping from a word in the text to the corresponding node in the AMR. Most co-reference systems provide co-references between noun phrases rather than individual words, but for node co-reference resolution we are required to merge individual nodes rather than groups of nodes. However, the system of Lee et al. (2017) also outputs an attention weight for every word of a noun phrase, which signifies the importance of each word in the noun phrase. We merge the nodes corresponding to the word that has the maximum attention weight among the words in the noun phrase.
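A minimal sketch of this merging step is given below, assuming the document AMR is held in a networkx DiGraph whose edges carry a 'label' attribute and whose nodes carry a 'value' attribute; the helper names and the exact graph interface are our assumptions, not the released pipeline code.

    import networkx as nx

    def name_of(g, n):
        # value of the node reached via an outgoing ":name" edge, if any
        for _, v, d in g.out_edges(n, data=True):
            if d.get("label") == ":name":
                return g.nodes[v].get("value")
        return None

    def initials_match(a, b):
        short, long_ = sorted([a, b], key=len)
        return short.replace(".", "") == "".join(w[0] for w in long_.split())

    def can_merge(g, a, b):
        # check 1: conflicting :name arguments (and neither is the other's initials)
        na, nb = name_of(g, a), name_of(g, b)
        if na and nb and na != nb and not initials_match(na, nb):
            return False
        # check 3: a shared outgoing edge label that leads to different nodes
        out_a = {d["label"]: v for _, v, d in g.out_edges(a, data=True)}
        out_b = {d["label"]: v for _, v, d in g.out_edges(b, data=True)}
        if any(out_a[l] != out_b[l] for l in out_a.keys() & out_b.keys()):
            return False
        # check 2: the merger must not create a cycle
        merged = nx.contracted_nodes(g, a, b, self_loops=False)
        return nx.is_directed_acyclic_graph(merged)

    def node_for_mention(alignment, mention_words, attention):
        # pick the word with the highest attention weight in the noun phrase,
        # then map it to its AMR node via the word-to-node alignment
        word = max(mention_words, key=lambda w: attention[w])
        return alignment.get(word)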

Merging nodes that refer to the same event/entity suggests that the merged node is more important in the graph than the original nodes, as it now has more incoming and outgoing edges. Co-reference resolution captures explicit references to an event/entity, which implies that the nodes should be merged as they are the same, and thus it helps increase the importance of the node. But there are many cases where words do not refer to the same entity or event but refer to the same abstract concept, or where words talk about the same event without explicitly referring to it. In such cases, these words should reinforce each other's importance, but simple co-reference resolution does not capture this, and hence co-reference resolution alone is not enough. We need something new in the graph that captures when two nodes reinforce each other's importance without actually merging the two nodes. In Table 1 we present two examples where words reinforce each other's importance without referring to exactly the same thing.

These examples inspire us to introduce a new set of nodes which we call Meta nodes. In this work, we use Meta nodes to increase the importance of common nouns only. Common nouns like drugs, opium, etc. can occur many times in the text, which suggests that they are relevant for the text, but they are not identified by co-reference systems as their different occurrences do not refer to exactly the same thing. To capture the important common nouns which are otherwise not captured by co-reference resolution, we add a new Meta node to the graph for each such set of common nouns. In Example 2 of Table 1, we introduce a Meta node for the common noun opium, which is present twice in the story. Each Meta node is connected to all the occurrences of the corresponding common noun. The nodes connected with a Meta node might, at some level, refer to the same thing. Meta nodes are used as representatives for their group during ranking, but they are not extracted into the final summary graph, and hence they are not used during the final step of summary generation.
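A sketch of the Meta-node construction for repeated common nouns follows, using the same networkx-style graph as above. The part-of-speech filter, attribute names and occurrence threshold are illustrative assumptions rather than the pipeline's actual code.

    from collections import defaultdict

    def add_meta_nodes(g, node_word, node_pos, min_count=2):
        """node_word / node_pos: dicts mapping an AMR node to its aligned word and
        that word's POS tag (only common nouns are grouped)."""
        groups = defaultdict(list)
        for n in list(g.nodes):
            if node_pos.get(n) in ("NN", "NNS"):
                groups[node_word[n].lower()].append(n)
        for word, nodes in groups.items():
            if len(nodes) >= min_count:                     # the noun occurs repeatedly
                meta = "META/" + word
                g.add_node(meta, meta=True, tf=len(nodes))  # reused as term frequency in Step-A
                for n in nodes:
                    g.add_edge(meta, n, label=":meta")      # connect the Meta node to every occurrence
        return g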

The cases that we examined in Table 1 are cases where the words don't have a perfect identity but rather a near identity. This points out that co-reference resolution is not a simple yes/no question but rather a complicated one. This problem of the complexity of co-reference resolution has been explored theoretically in the literature (Recasens et al., 2011; Versley, 2008), and our work will benefit directly from more work on the complexity of co-reference resolution. In our current work, we don't implement any procedure to detect reinforcements of the sort given in Example 2 of Table 1. Future work may include event co-reference resolution and word similarity using word embeddings to identify such reinforcements.

3.2 Step 2: Summary Graph Extraction

Summary graph extraction is a key step in SAS. In this step, we extract the summary sentence AMR graphs from the document AMR produced in Step 1. We take our cue from the way humans summarize a text: by first identifying the most important entities/events in the text, then finding the most important relationships among these events/entities, and finally including information surrounding the selected relationship(s).

Figure 3: An example of node merging in a very basic AMR. Mergers 1 and 2 were also present in the method proposed by Liu et al. (2015), but not merger 3. Here, dashed lines represent nodes to be merged.

Step-A: Finding Important Nodes - For finding important events/entities, we use term frequency-inverse document frequency (Tf-Idf) to determine the importance of any node. We first find the top n nodes in the graph using term frequency. This n depends upon the size of the summary required. Similar to earlier approaches, we use Alignments to find the text corresponding to the nodes. Finally, we use the Tf-Idf values of the text corresponding to the nodes to rank the selected n nodes. The proxy report section of the AMR Bank is quite small, with only 298 training stories. We use the CNN-DailyMail corpus (Hermann et al., 2015), containing around 300,000 news articles, to compute the Document Frequencies (DF). We calculate Tf-Idf as

Tf-Idf = Tf × log10(300,000 / (DF + 1))

As explained in Section 3.1, Meta nodes are used as representatives for a set of nodes during importance evaluation. Hence, during importance evaluation we do not consider nodes that are connected to any Meta node. To evaluate the importance of a Meta node, we take the number of nodes connected to it as the term frequency for the Meta node.
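The node scoring just described can be written down compactly. The sketch below uses our own variable names and assumes document frequencies precomputed from the CNN-DailyMail articles.

    import math

    N_DOCS = 300_000   # number of documents used to estimate document frequencies

    def tf_idf(tf, df):
        return tf * math.log10(N_DOCS / (df + 1))

    def rank_nodes(node_tf, node_text, doc_freq, top_n):
        """node_tf: term frequency per node (for a Meta node, the number of nodes it
        groups); node_text: aligned word per node; doc_freq: document frequency per word."""
        # first restrict to the top_n nodes by raw term frequency ...
        candidates = sorted(node_tf, key=node_tf.get, reverse=True)[:top_n]
        # ... then order them by the Tf-Idf of their aligned text
        return sorted(candidates,
                      key=lambda n: tf_idf(node_tf[n], doc_freq.get(node_text[n], 0)),
                      reverse=True)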

Step-B: Finding the Key Relation - The next step is to find the important relationship between a pair of selected nodes. We use a heuristic in this step. The idea is that the key relationship between the nodes will generally be present in the sentence where they occur together for the first time. If there is no such sentence, then there is probably no important direct relationship between the two nodes, and we ignore the pair. AMRs contain semantic information at the top of the AMR graph.


Table 1: In Example 1, the words illegal and ban reinforce each other's importance but they are not captured by co-reference resolution. We add a Meta node connected to the nodes corresponding to the words illegal and ban; during importance evaluation, the occurrences of this Meta node are these occurrences of illegal and ban, and its term frequency is 2. Similarly, in the second example both occurrences of the word opium are connected to a new Meta node.

1. On 011006 The Citizen newspaper stated that it is illegal for South Africans to be involved in mercenary activity or to render foreign military assistance inside or outside of South Africa. The Citizen newspaper stated that the South African Foreign Ministry announced on 011005 that the South African government imposed the mercenary activity ban following reports that 1000 Muslims with military training have enlisted to leave South Africa for Afghanistan to fight for the Taliban against the United States.

2. Head of the U.N. drug office Antonio Maria Costa said that Afghanistan has produced so much opium in recent years that the Taliban are cutting back poppy cultivation and stockpiling raw opium in an effort to support prices and preserve a major source of financing for the insurgency. Costa said this to reporters last week as the U.N. Drug Office prepared to release its latest survey of Afghanistan's opium crop.

Table 2: Results on the proxy report section of the AMR Bank. The first half contains the Recall, Precision, and F1 for the nodes in the generated summary AMR (subgraph extraction). The second half contains the scores for the final summary generated using the state-of-the-art text generator, evaluated with the ROUGE metric (full pipeline).

                                                        Subgraph extraction       Full pipeline
    Method                                              Recall  Precision  F1     R-1   R-2   R-L
    Liu et al. (2015)                                   63.5    54.7       58.7   -     -     -
    Unsupervised SAS (naive co-reference)               67.9    57.2       60.3   -     -     -
    Unsupervised SAS (Lee et al. (2017) co-reference)   67.0    58.0       60.4   39.5  16.5  29.0
    Unsupervised SAS (human co-reference)               67.7    60.4       62.4   40.9  16.7  29.5

Thus, in the selected sentence, we find the path between the two nodes that is closest to the root. If one of the selected nodes happens to be a Meta node, the occurrences of the Meta node include all the occurrences of all the nodes that the Meta node represents (Table 1).
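One way to read this heuristic as code is sketched below: take the first sentence in which both nodes occur and, among the simple paths connecting them in that sentence's AMR, keep the one whose topmost node is closest to the root. The graph interface is assumed (networkx, as before), and enumerating all simple paths is a simplification that ignores efficiency.

    import networkx as nx

    def first_cooccurrence(sentence_nodes, a, b):
        """sentence_nodes: for each sentence (in story order), the set of document-graph
        nodes aligned to it. Returns the index of the first co-occurrence, or None."""
        for i, nodes in enumerate(sentence_nodes):
            if a in nodes and b in nodes:
                return i
        return None   # no direct relation: the node pair is ignored

    def key_path(sentence_amr, root, a, b):
        """sentence_amr: DiGraph of the selected sentence's AMR, rooted at `root`."""
        depth = nx.shortest_path_length(sentence_amr, source=root)   # node -> depth
        undirected = sentence_amr.to_undirected()
        best_path, best_depth = None, float("inf")
        for path in nx.all_simple_paths(undirected, a, b):
            top = min(depth.get(n, float("inf")) for n in path)      # closeness to the root
            if top < best_depth:
                best_depth, best_path = top, path
        return best_path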

Step-C: Capture Surrounding Information - The final step in subgraph extraction is to expand around the selected path to capture the surrounding information. We use OpenIE (Banko, 2009) in this step. The output of the OpenIE system is a set of tuples of the form (arg; relation; arg). The relevant tuples for us are those that contain the selected path. As these tuples contain all the auxiliary information about the relationship that they describe, selecting a tuple solves the problem of graph expansion. To capture the maximum amount of auxiliary information, we choose the largest tuple among the set of relevant tuples. This ends the process of summary graph extraction. Algorithm 1 provides an overview of the entire algorithm.

3.3 Step 3: Summary Generation

To generate sentences from the extracted AMR graphs, we use the state-of-the-art AMR-to-text generator of Konstas et al. (2017).

4 Experiments

In Table 2 we report results on the test set of the proxy report section of the AMR Bank. The table contains results using the human-annotated AMRs. We outperform the state-of-the-art in SAS by a 1.7% F1 score in node prediction. Similar to previous methods, we use the target summary size to control the length of the output summary.

To evaluate the effectiveness of the method up to the summary graph extraction step, we compare the generated summary graph with the gold-standard target summary graph. We report Recall, Precision, and F1 for graph nodes. Finally, to evaluate the effectiveness of the full pipeline, we evaluate the performance using ROUGE (Lin, 2004), and we report ROUGE-1, ROUGE-2, and ROUGE-L.


Figure 4: The updated pipeline proposed by us has the following steps: AMR parsing, co-reference resolution, open information extraction, TF-IDF calculation, subgraph selection and text generation.

Algorithm 1 Overview of the algorithm
  Input: original text data
  Node co-reference resolution and Meta node formation
  Rank nodes using Tf-Idf
  Rank node pairs using node weights
  for sorted node pairs do
      Select the first co-occurrence sentence
      if no such sentence then
          continue
      end if
      Find the path closest to the root in the sentence
      Find relevant tuples using OpenIE
      Choose the largest tuple containing the path
      if no such tuple then
          Output the selected sentence AMR
      end if
      if summary size > required size then
          break
      end if
  end for
  AMR-to-text conversion

As is clear from Table 2, there is not much difference between the scores when we use naive node resolution and date merging and when we use state-of-the-art co-reference resolution. To check the impact of co-reference resolution, we also performed manual co-reference resolution on the test set, which resulted in a further 2% increase in the scores, to 62.4%. We suspect that a significant reason for the lower performance with state-of-the-art co-reference resolution might be the inability of the system to handle cataphoric references. These references are particularly crucial in news articles, where the first occurrence of an entity/event is generally essential.

5 Conclusion and Future Work

In this work, we present a new method for Semantic Abstractive Summarization (Figure 4). We outperform the previous state-of-the-art methods for SAS by 1.7%, and by 3.7% using human co-reference resolution. In the process, we complete the SAS pipeline for the first time, showing that SAS can be used to construct high-quality summaries. We also extend the method to construct a document AMR graph from the sentence AMRs using Meta nodes, which can further be used in some future formalism for Document Meaning Representation.

The work will benefit directly from improvements in each step of the pipeline. Specifically, advances in co-reference resolution for the near-identity cases might significantly improve the summary quality. We are currently experimenting with bigger text summarization datasets like DUC 2004 and DUC 2006. The hypothesis we used to find the key relations is the main hurdle in extending the work to multi-document summarization, as all other steps can be directly applied in multi-document summarization. Using different methods, possibly based on supervision, to find the key relation is an interesting direction for future work.

Acknowledgments

We would like to thank the anonymous ACL reviewers for their insightful comments. We also thank Susmit Wagle and Prerna Bharti of IIT Kanpur for helpful discussions. Vivek Gupta acknowledges support from a Microsoft Research Travel Award. We would also like to thank Roam Analytics for sponsoring Shibhansh's ACL SRW travel grant. The opinions stated in this paper are of the authors alone.



References

Miguel Almeida and Andre Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 196–206. Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2015. AMR guidelines.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Martha Palmer, Philipp Koehn, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the Linguistic Annotation Workshop.

Michele Banko. 2009. Open Information Extraction for the Web. Ph.D. thesis, Seattle, WA, USA. AAI3370456.

George Chin, Olga A. Kuchar, and Katherine E. Wolf. 2009. Exploring the analytical processes of intelligence analysts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, page 1120.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.

Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proceedings of the Text Analysis Conference (TAC).

Tobias Falke and Iryna Gurevych. 2017. Bringing structure into summaries: Crowdsourcing a benchmark corpus of concept maps. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2951–2961. Association for Computational Linguistics.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the abstract meaning representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436. Association for Computational Linguistics.

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the NAACL Workshop on Integer Linear Programming for Natural Language Processing.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend.

Youn-ah Kang, Carsten Gorg, and John Stasko. 2011. How can visual analytics assist investigative analysis? Design implications from an evaluation. IEEE Transactions on Visualization and Computer Graphics, 17(5):570–583.

Kevin Knight, Laura Baranescu, Claire Bonial, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Daniel Marcu, Martha Palmer, and Nathan Schneider. 2014. DEFT Phase 2 AMR Annotation R1 LDC2015E86. Abstract Meaning Representation (AMR) Annotation Release 1.0 LDC2014T12. Web download. Philadelphia: Linguistic Data Consortium.

Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. CoRR, abs/1707.07045.

C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Post-Conference Workshop of ACL 2004, Barcelona, Spain.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1077–1086, Denver, Colorado. Association for Computational Linguistics.

Andre F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the ACL Workshop on Integer Linear Programming for Natural Language Processing.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of the 29th European Conference on IR Research, ECIR'07, pages 557–564, Berlin, Heidelberg. Springer-Verlag.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.



Sudha Rao, Daniel Marcu, Kevin Knight, and Hal Daume III. 2017. Biomedical event extraction using abstract meaning representation. In BioNLP 2017, pages 126–135, Vancouver, Canada. Association for Computational Linguistics.

Marta Recasens, Eduard Hovy, and M. Antonia Marti. 2011. Identity, non-identity, and near-identity: Addressing the complexity of coreference. Lingua, 121(6):1138–1152.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization.

Naomi Saphra and Adam Lopez. 2015. AMRICA: an AMR inspector for cross-language alignments. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 36–40, Denver, Colorado. Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1054–1059, Austin, Texas. Association for Computational Linguistics.

Lucy Vanderwende, Michele Banko, and Arul Menezes. 2004. Event-centric summary generation. In Proceedings of DUC.

Yannick Versley. 2008. Vagueness and referential ambiguity in a large-scale annotated corpus. Research on Language and Computation, 6(3):333–353.

A Example to generate document AMR from sentence AMR

In this appendix, we give an example showing how to generate a document AMR from the sentence AMRs. Consider a short multi-sentence story:

A Kathmandu police officer reports: 1 soldier of the Royal Nepal Army was seriously injured on 29 August 2002 when a bomb disposal team attempted to defuse the bomb left at an electricity pole in Okubahal near Sundhara in Lalitpur district in Kathmandu. Anti-government insurgents are believed to have planted the bomb. The injured soldier has been admitted to the army hospital in Kathmandu.

Figure 5 shows the sentence AMRs of the four sentences of the short story. The nodes that refer to the same entity have to be merged; the dashed lines connect the nodes to be merged. Figure 6 shows the document AMR generated from the merger. Figure 6 also shows how large the AMRs of even short stories can become after merging.

If the summarization process were to follow, we would start by finding the key nodes in the document graph based on TF-IDF. The term frequency is the number of incoming edges in the AMR. It is clear from the document AMR that the important nodes based on term frequency are Soldier, Bomb, and Kathmandu. When we then use TF-IDF to rank among these key nodes, the final ranking in decreasing order of importance is Kathmandu, Soldier, and Bomb. The next step is to find the key relation, which according to our hypothesis lies in the sentence where they first co-occur, i.e., the second sentence. The exact relation is the highest path. As is clear from Figure 5, the path will include the nodes corresponding to the words Kathmandu, Soldier, and Injured. Finally, in the last step we use the OpenIE system to capture important information surrounding this path.
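As a rough illustration of this ranking step, the sketch below scores document-AMR nodes with term frequency taken as in-degree, combined with an IDF computed over a background collection of document graphs; the edges, background collection, and numbers are invented for the example and this is not the implementation used in the paper.

import math
from collections import Counter

# Sketch: rank document-AMR nodes by TF-IDF, where TF is the node's
# in-degree in the merged document graph and IDF is computed over a
# (hypothetical) background collection of document graphs.

def rank_nodes(edges, background_graphs):
    tf = Counter(dst for _, dst in edges)          # incoming-edge counts
    n_docs = len(background_graphs)
    df = Counter()
    for graph_nodes in background_graphs:
        for node in set(graph_nodes):
            df[node] += 1
    scores = {node: count * math.log(n_docs / (1 + df[node]))
              for node, count in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

edges = [("injure-01", "soldier"), ("defuse-01", "bomb"),
         ("leave-02", "bomb"), ("plant-01", "bomb"),
         ("report-01", "kathmandu"), ("hospital", "kathmandu"),
         ("admit-01", "soldier"), ("city", "kathmandu")]
background = [["soldier", "war"], ["bomb", "market"], ["kathmandu"],
              ["election", "vote"], ["flood", "rescue"]]
print(rank_nodes(edges, background))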



Figure 5: The AMRs of the 4 sentences of the short story. Dashed lines represent the nodes to be merged.



Figure 6: The document AMR generated by merging the four sentence AMRs of the story.



Biomedical Document Retrieval for Clinical Decision Support System

Jainisha Sankhavara
IRLP lab,
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat, India

[email protected]

Abstract

The availability of a huge amount of biomedical literature has opened up new possibilities to apply Information Retrieval and NLP for mining documents from it. In this work, we focus on biomedical document retrieval from the literature for clinical decision support systems. We compare statistical and NLP based approaches of query reformulation for biomedical document retrieval. We also model biomedical document retrieval as a learning to rank problem. We report initial results for the statistical and NLP based query reformulation approaches and the learning to rank approach, together with future directions of research.

1 Introduction and Motivation

Medical and healthcare related searches are a major focus of internet search nowadays. Recent statistics show that 61% of adults look online for health information (Jones, 2009). This demands proper search and retrieval systems for health related biomedical queries. Biomedical Information Retrieval (BIR) needs special attention due to the characteristics of biomedical terminologies. Major challenges in the biomedical domain lie in handling complex, ambiguous, inconsistent medical terms and their ad-hoc abbreviations. Many medical terms are very complex. The average length of biomedical entities is much higher than that of general entities, which makes the entity identification task difficult for the biomedical domain. Entity identification and normalization help to better solve the problems of retrieval and ranking of documents for medical search systems, biomedical text summarization, biomedical text data visualization, etc.

As we are focusing here on biomedical document retrieval and ranking systems, the biomedical literature should be in consideration. Biomedical literature is an important source of study in medical science. Thousands of articles are added to the biomedical literature each year. This large set of biomedical text articles can be used as a collection for a Clinical Decision Support System, where related biomedical articles are extracted and suggested to medical practitioners so they can best care for their patients. For this purpose, the dataset from the Clinical Decision Support (CDS) track is used, which contains millions of full text biomedical articles from PMC (PubMed Central). The statistics of the CDS 2014, 2015 and 2016 datasets are given in Table 1. The CDS track focuses on retrieval of biomedical articles which are related to patients' medical case reports. These medical case reports, which are used as queries, are case narratives of a patient's medical condition. They describe the patient's medical condition, i.e. medical history, symptoms, tests performed, treatments, etc. For a given query/case report, the main problem is to find relevant documents from the available collection and rank them.

2 Background

'Information Retrieval: A Health and Biomedical Perspective' (Hersh, 2008) provides basic theory, implementation and evaluation of IR systems in health and biomedicine. The tasks of named entity recognition, relation and event extraction, summarization, question answering, and literature based discovery are outlined in Biomedical text mining: a survey of recent progress (Simpson and Demner-Fushman, 2012).


1 http://www.ncbi.nlm.nih.gov/pmc/
2 http://www.trec-cds.org/



Dataset                                  CDS 2014               CDS 2015               CDS 2016
#Documents                               733,138                733,138                1,255,259
Collection size                          47.2 GB                47.2 GB                87.8 GB
#Total terms                             1,600,536,286          1,600,536,286          2,954,366,841
#Uniq. terms                             3,689,317              3,689,317              4,564,612
#Topics                                  30                     30                     30
#Rel. docs/Topic                         112                    150                    182
Query forms                              Description, Summary   Description, Summary   Note, Description, Summary
Avg. length of Description (in words)    75.8                   80.4                   119.9
Avg. length of Summary (in words)        24.6                   20.4                   33.3
Avg. length of Note (in words)           -                      -                      239.4
Avg. Doc length (in words)               2183                   2183                   2353

Table 1: CDS data statistics

Automatic processing of biomedical text also suffers from lexical ambiguity (homonymy and polysemy) and synonymy. Automatic query expansion (AQE) (Maron and Kuhns, 1960; Carpineto and Romano, 2012), which has a long history in information retrieval, can be useful for dealing with such problems. For instance, medical queries were expanded with other related terms from RxNorm, a drug dictionary, to improve the representation of a query for relevance estimation (Demner-Fushman et al., 2011). The emergence of medical domain specific knowledge like UMLS can help the retrieval system gain more understanding of the biomedical documents and queries. The Unified Medical Language System (UMLS) (Bodenreider, 2004) is a metathesaurus for the medical domain. It is maintained by the National Library of Medicine (NLM) and is the most comprehensive resource, unifying over 100 dictionaries, terminologies, and ontologies. Various approaches to information retrieval with the UMLS Metathesaurus have been reported: some with a decline in results (Hersh et al., 2000) and some with a gain in results (Aronson and Rindflesch, 1997). The next section of this paper covers statistical approaches as well as NLP based approaches.

3 Query Reformulation for Biomedical Document Retrieval

Here, we present statistical and NLP based query reformulation approaches for biomedical document retrieval. Statistical approaches include feedback based query expansion and feedback document discovery based query expansion. An NLP based approach, namely UMLS concept based query reformulation, is also discussed here.

3.1 Automatic Query Expansion with Pseudo Relevance Feedback & Relevance Feedback

Query Expansion (QE) is the process of reformulating a query to improve the retrieval performance and efficiency of IR systems. QE has proved to be effective for document retrieval (Carpineto and Romano, 2012). It helps to overcome vocabulary mismatch issues by expanding the user query with additional relevant terms and by re-weighting all terms. Query expansion which uses the top retrieved relevant documents is known as Relevance Feedback; it requires human judgment to identify relevant documents from the top retrieved documents. The Pseudo Relevance Feedback technique, in contrast, assumes the top retrieved documents to be relevant and uses them as feedback documents, requiring no human input at all. Query expansion based approaches for the biomedical domain give better results compared to retrieval without query expansion (Sankhavara et al., 2014).
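As a rough sketch of pseudo relevance feedback (not the Terrier term-weighting models used in the experiments below), the snippet expands a query with the highest-TF-IDF terms from the top-k documents of an initial ranking; the document texts and parameters are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of pseudo relevance feedback: take the top-k documents from an
# initial ranking, pick the strongest TF-IDF terms in them, and append
# those terms to the query. Documents and parameters are hypothetical.

def expand_query(query, ranked_docs, k=10, n_terms=5):
    feedback_docs = ranked_docs[:k]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(feedback_docs)
    mean_scores = tfidf.mean(axis=0).A1          # average weight per term
    vocab = vectorizer.get_feature_names_out()
    top_terms = [vocab[i] for i in mean_scores.argsort()[::-1][:n_terms]]
    return query + " " + " ".join(t for t in top_terms if t not in query)

ranked = ["chest pain and hypertension in an obese patient",
          "exercise related angina and coronary artery disease",
          "management of hypertension and obesity"]
print(expand_query("chest pain hypertension", ranked, k=3, n_terms=4))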

Tables 2 and 3 show the results of standard retrieval (without expansion), Pseudo Relevance Feedback (PRF) based query expansion and Relevance Feedback (RF) based query expansion with the BM25 and In_expC2 retrieval models (Amati et al., 2003) on the CDS 2014, 2015 and 2016 datasets. The retrieval model BM25 is a ranking function based on the probabilistic retrieval framework, while In_expC2 is also probabilistic but based on Divergence From Randomness (DFR). These models are available in the Terrier IR Platform (Ounis et al., 2005), developed at the School of Computing Science, University of Glasgow.

3 http://terrier.org



MAP                  CDS 2014             CDS 2015             CDS 2016
BM25                 0.1071               0.1147               0.062
BM25+PRF10           0.1542 (+44%)        0.1805 (+57.4%)      0.0769 (+24%)
BM25+RF10            0.205 (+91.4%)       0.1941 (+69.2%)      0.0984 (+58.7%)
BM25+RF50            0.2768 (+158.5%)     0.2283 (+99%)        0.1456 (+134.8%)
In_expC2             0.1096               0.1201               0.0632
In_expC2+PRF10       0.1623 (+48.1%)      0.1725 (+43.6%)      0.0754 (+19.3%)
In_expC2+RF10        0.2117 (+93.2%)      0.1895 (+57.8%)      0.0992 (+57%)
In_expC2+RF50        0.2587 (+136%)       0.2191 (+82.4%)      0.1275 (+101.7%)

Table 2: Results (MAP) of Query Expansion with PRF and RF

infNDCG              CDS 2014             CDS 2015             CDS 2016
BM25                 0.1836               0.2115               0.171
BM25+PRF10           0.2522 (+37.4%)      0.283 (+33.8%)       0.2047 (+19.7%)
BM25+RF10            0.3355 (+82.7%)      0.3028 (+43.2%)      0.2428 (+42%)
BM25+RF50            0.4186 (+128%)       0.3478 (+64.4%)      0.3094 (+80.9%)
In_expC2             0.2002               0.2132               0.1785
In_expC2+PRF10       0.2724 (+36.1%)      0.2734 (+28.2%)      0.2018 (+13.1%)
In_expC2+RF10        0.3426 (+71.1%)      0.3015 (+41.4%)      0.245 (+37.3%)
In_expC2+RF50        0.4019 (+100.7%)     0.339 (+59%)         0.3219 (+80.3%)

Table 3: Results (infNDCG) of Query Expansion with PRF and RF

Here, we have used the Terrier platform for the experiments. The summary part of the query is used for retrieval, with the top 10 and top 50 retrieved documents used as feedback for expansion. MAP and infNDCG are used as evaluation metrics (Manning et al., 2008); the higher the value of the evaluation measure, the better the retrieval result of the system. The results improve with PRF and RF based query expansion, giving statistically significant gains (p < 0.05) compared to no expansion. Here RF gives 50-60% more improvement than PRF over no expansion. We argue that biomedical retrieval should be done keeping a human in the loop: a small human intervention can increase the retrieval accuracy by up to 60% more.

3.2 Feedback Document Discovery for Query Reformulation

Feedback Document Discovery based query expansion, as described in (Sankhavara and Majumder, 2017), learns to identify relevant documents for query expansion from the top retrieved documents. The main aim is to use a small amount of human judgement and learn pseudo judgements for other documents to reformulate the queries. One approach is based on classification. If we have human judgements available for some of the feedback documents, then they serve as training data for classification. The documents are represented as bags of words, the TF-IDF scores of the words serve as features, and the human relevance scores provide the classes. The relevance is then predicted for the other top retrieved feedback documents. The second approach is based on classification+clustering. It first applies classification in the same way as the first approach and then applies clustering on the relevance class predicted by the classification method, thus filtering out more non-relevant documents from relevant ones. Since the convergence of K-means clustering depends on the initial choice of cluster centroids, the initial cluster centroids are chosen as the average of the relevant document vectors and the average of the non-relevant document vectors from the training data.
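A minimal sketch of the classification+clustering variant follows, assuming TF-IDF document vectors, a handful of judged feedback documents as training data, and k-means (k=2) initialized from the class centroids; the feature dimensions and data are random placeholders and this is not the authors' code.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Sketch of feedback document discovery (classification + clustering).
# X_judged / y_judged: TF-IDF vectors and human relevance labels for a few
# judged feedback documents; X_unjudged: the remaining top-retrieved docs.
# All arrays below are random placeholders.

rng = np.random.default_rng(0)
X_judged = rng.random((50, 20))
y_judged = rng.integers(0, 2, 50)
X_unjudged = rng.random((200, 20))

# Step 1: classify the unjudged feedback documents.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_judged, y_judged)
pred = clf.predict(X_unjudged)

# Step 2: re-cluster the documents predicted relevant, with centroids
# initialized from the judged non-relevant / relevant averages.
centroids = np.vstack([X_judged[y_judged == 0].mean(axis=0),
                       X_judged[y_judged == 1].mean(axis=0)])
km = KMeans(n_clusters=2, init=centroids, n_init=1).fit(X_unjudged[pred == 1])
relevant = X_unjudged[pred == 1][km.labels_ == 1]  # documents kept for expansion
print(len(relevant), "pseudo-relevant documents retained")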

Here we have used that approach with different features. The TF-IDF features are weighted based on the type of word. The CliNER tool (Boag et al., 2015) has been used to identify medical entities of type 'problem', 'test' and 'treatment' from documents; it was trained on the i2b2 2010 dataset (Uzuner et al., 2011). The i2b2 2010 dataset includes discharge summaries from Partners HealthCare, from Beth Israel Deaconess Medical Center and from the University of Pittsburgh Medical Center. These discharge summaries are fully de-identified and manually annotated for concept, assertion, and relation information.



CDS 2014                                 MAP                                        infNDCG
                                         No feature    Feature weighting            No feature    Feature weighting
                                         weighting     using CliNER                 weighting     using CliNER
Original Queries                         0.1071        -                            0.1836        -
Queries+RF50                             0.2768        -                            0.4186        -
{Nearest neighbors}50 200                0.2761        0.2747                       0.4177        0.4140
{Nearest neighbors + k-means}50 200      0.2794        0.2777                       0.4220        0.4195
{Neural net}50 200                       0.2790        0.2787                       0.4235        0.4240
{Neural net + k-means}50 200             0.2790        0.2807                       0.4218        0.4269

Table 4: Results of Feedback Document Discovery

Figure 1: Query-wise difference graph of infNDCG for feedback document discovery and relevance feedback

Here, we have used these discharge summaries along with their concept annotations to train CliNER. This trained model is applied to the CDS documents to identify 'problem', 'test' and 'treatment' concept entities. The features related to these entities in the CDS documents are weighted thrice, thus giving importance to these entities while learning to identify feedback documents. For feedback document discovery with weighted entities, we have used the top 50 documents and their corresponding relevance for training; the relevance was then predicted for the next top 200 documents and used for the expansion of queries. For classification, two methods, a nearest neighbour classifier and a neural net classifier, have been used, and k-means is used for clustering with k=2 for relevant and non-relevant documents. In all cases, only the documents identified as relevant are used for expanding queries. The comparison of results for original queries without expansion, expansion with relevance feedback, and expansion with the two approaches of feedback document discovery (classification and classification+clustering) on the CDS 2014 dataset is given in Table 4.

The results clearly indicate improvement over original queries and relevance feedback. Figure 1 shows the query-wise difference in infNDCG between {Neural net + k-means}50 200 and Queries+RF50. Out of the 30 queries of CDS 2014, 2 queries degrade performance while 7 queries improve.

3.3 UMLS Concepts Based Query Reformulation

Medical domain-specific knowledge can be incorporated into the process of query reformulation in a biomedical IR system. Knowledge based approaches have been proposed in the literature (Aronson and Rindflesch, 1997; Demner-Fushman et al., 2011; Hersh, 2008). In biomedical text retrieval, medical concepts and entities are more informative than other common terms. Moreover, medical ontologies, thesauri and biomedical entity identifiers are available to identify medical related concepts.

Here we have used the UMLS resource. The following three query reformulation experiments are done using it. First: the UMLS concepts are identified from the query text and used with the queries. Second: along with the UMLS concepts, MeSH (Medical Subject Heading) terms are also identified and used in the queries; MeSH is a hierarchically organized vocabulary of UMLS. Third: medical entities are identified manually and used with the queries. One example query with all of these reformulations is presented in Appendix A.

Table 5 shows the results of these reformulated queries on CDS 2014. PRF and RF based query expansion is also carried out on each form of the queries. The results show clear improvement when using UMLS concepts in queries as compared to the original queries.



CDS 2014                                          BM25                      In_expC2
                                                  MAP        infNDCG        MAP        infNDCG
Original Queries                                  0.1071     0.1836         0.1096     0.2002
Original Queries + PRF10                          0.1542     0.2522         0.1623     0.2724
Original Queries + RF10                           0.2050     0.3355         0.2117     0.3426
Original Queries + RF50                           0.2768     0.4186         0.2587     0.4019
Queries + UMLS concepts                           0.1660     0.1830         0.1597     0.1781
Queries + UMLS concepts + PRF10                   0.1607     0.2607         0.1486     0.2431
Queries + UMLS concepts + RF10                    0.2164     0.3423         0.2138     0.3459
Queries + UMLS concepts + RF50                    0.2776     0.4232         0.2569     0.4021
Queries + UMLS concepts + MeSH terms              0.1039     0.1749         0.1086     0.1792
Queries + UMLS concepts + MeSH terms + PRF10      0.1460     0.2409         0.1411     0.2376
Queries + UMLS concepts + MeSH terms + RF10       0.2052     0.3321         0.1992     0.3291
Queries + Manual Entities                         0.1112     0.1860         0.1140     0.2114
Queries + Manual Entities + PRF10                 0.1601     0.2634         0.1584     0.2650
Queries + Manual Entities + RF10                  0.2112     0.3394         0.2120     0.3414

Table 5: Results of UMLS based query processing

One more important observation to make here is that, for no expansion and PRF, the manual entities fail to improve MAP when compared to the UMLS entities, but they certainly give better results in terms of infNDCG.

4 Learning To Rank

Learning to rank (LTR) (Liu et al., 2009) is an application of machine learning to the construction of ranking models for information retrieval systems, where the retrieval problem is modeled as a ranking problem. The LTR framework requires training data of queries and their matching documents, together with the relevance degree of each match. The training data is used by a learning algorithm to produce a ranking model which computes the relevance of documents for actual queries.

The LTR framework is applied to the CDS 2014 dataset, where the features for query-document pairs are computed similarly to the features used for the OHSUMED LETOR dataset (Qin et al., 2010). These features are mainly based on TF, IDF and their normalized versions. Since the whole document pool is too large, document pooling has been done and the top K documents (by BM25) for each query are used for feature extraction. SVMRank has been used as the machine learning framework.
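To make the setup concrete, the sketch below computes a few simple OHSUMED-style statistics (sums of TF, IDF and normalized TF over query terms) for query-document pairs and writes them in the SVMRank input format; the feature set, documents and labels are illustrative assumptions, not the full LETOR feature list.

import math
from collections import Counter

# Sketch: a few OHSUMED/LETOR-style features for query-document pairs,
# written in SVMRank's "label qid:Q 1:v1 2:v2 ..." format.
# Documents, queries and relevance labels are hypothetical.

docs = {"d1": "chest pain radiating to the back in a hypertensive woman",
        "d2": "guidelines for treatment of obesity and hypertension"}
queries = {"q1": "chest pain hypertension"}
labels = {("q1", "d1"): 1, ("q1", "d2"): 0}

def features(query, doc_text, all_docs):
    terms = query.lower().split()
    tokens = doc_text.lower().split()
    tf = Counter(tokens)
    n = len(all_docs)
    df = {t: sum(t in d.lower().split() for d in all_docs.values()) for t in terms}
    sum_tf = sum(tf[t] for t in terms)
    sum_idf = sum(math.log(n / (df[t] + 1)) for t in terms)
    sum_norm_tf = sum(tf[t] / len(tokens) for t in terms)
    return [sum_tf, sum_idf, sum_norm_tf]

with open("train.dat", "w") as out:
    for (qid, did), rel in labels.items():
        vec = features(queries[qid], docs[did], docs)
        feats = " ".join(f"{i+1}:{v:.4f}" for i, v in enumerate(vec))
        out.write(f"{rel} qid:{qid[1:]} {feats}\n")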

Table 6 shows the results of LTR when the features are computed on the Title+Abstract part of the documents and on the Title+Abstract+Content of the documents (i.e. the full documents). With these variations of features, the experiments are carried out on the original queries, queries with UMLS concepts, and queries with manually identified medical concepts.

                       infNDCG
                       OHSUMED features         OHSUMED features
                       on T, A and T+A          on T, A and C
Original Queries       0.097                    0.1769
Queries + UMLS         0.0833                   0.1556
Queries + Manual       0.1049                   0.1785

Table 6: Results of Learning to Rank with different features

                                infNDCG
Retrieval (BM25)                0.1836
LTR using human judgements      0.1769
Pseudo LTR K=1000               0.1849
Pseudo LTR K=1500               0.1872
Pseudo LTR K=2000               0.1859
Pseudo LTR K=2500               0.1865
Pseudo LTR K=3000               0.1865

Table 7: Results of Learning to Rank with pseudo judgements


All these LTR experiments require human judgement for training. To overcome the need for manual judgement, pseudo judgements were considered, where out of the k training documents the top k/2 documents are considered relevant and the other k/2 documents non-relevant.
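A tiny sketch of this pseudo-labelling rule, under the assumption that the documents arrive already ranked by BM25:

# Sketch: build pseudo relevance labels from a BM25 ranking by marking the
# top k/2 documents relevant and the next k/2 non-relevant. The ranked list
# of document ids is a placeholder.

def pseudo_qrels(ranked_doc_ids, k):
    top = ranked_doc_ids[:k]
    half = k // 2
    return {doc_id: (1 if i < half else 0) for i, doc_id in enumerate(top)}

ranking = [f"doc{i}" for i in range(2000)]   # hypothetical BM25 ranking
labels = pseudo_qrels(ranking, k=1000)
print(sum(labels.values()), "documents labelled pseudo-relevant")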

As shown in Table 7, the results of LTR trained using pseudo qrels are better than those trained with actual human-judged qrels, but the difference is not statistically significant.



The results of LTR are comparable to retrieval using BM25.

5 Future Research Directions

Biomedical text processing and information retrieval, being a relatively new field of research, opens up many research directions. In this article, we have presented a preliminary study of statistical and NLP based biomedical document retrieval techniques for clinical decision support systems. It included a query reformulation based information retrieval framework with pseudo relevance feedback, relevance feedback, feedback document discovery and UMLS concept based reformulation for the biomedical domain. The standard IR frameworks PRF and RF work well enough for a Clinical Decision Support System. Feedback document discovery based query reformulation, which is a statistical approach, can be improved in the future for significant gains. Another statistical model, Learning to Rank, also has scope for further improvement. The initial framework for the NLP based approach, UMLS concept based retrieval, also shows improvement in the results. Therefore, we plan to combine statistical and NLP based approaches and come up with a new and better model for biomedical document retrieval for decision support systems. We are also planning to do feature weighting using NLP at the entity level in the feedback document discovery approach.

References

Gianni Amati and Cornelis Joost Van Rijsbergen. 2003. Probabilistic models for information retrieval based on divergence from randomness.

Alan R. Aronson and Thomas C. Rindflesch. 1997. Query expansion using the UMLS Metathesaurus. In Proceedings of the AMIA Annual Fall Symposium, page 485. American Medical Informatics Association.

William Boag, Kevin Wacome, Tristan Naumann, and Anna Rumshisky. 2015. CliNER: A lightweight tool for clinical named entity recognition. AMIA Joint Summits on Clinical Research Informatics (poster).

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR), 44(1):1.

Dina Demner-Fushman, Swapna Abhyankar, Antonio Jimeno-Yepes, Russell F. Loane, Bastien Rance, Francois-Michel Lang, Nicholas C. Ide, Emilia Apostolova, and Alan R. Aronson. 2011. A knowledge-based approach to medical records retrieval. In TREC.

William Hersh. 2008. Information retrieval: a health and biomedical perspective. Springer Science & Business Media.

William Hersh, Susan Price, and Larry Donohoe. 2000. Assessing thesaurus-based query expansion using the UMLS Metathesaurus. In Proceedings of the AMIA Symposium, page 344. American Medical Informatics Association.

S. Jones. 2009. The social life of health information. Pew Research Center, Washington, DC, Pew Internet & American Life Project.

Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze, et al. 2008. Introduction to information retrieval, volume 1. Cambridge University Press.

Melvin Earl Maron and John L. Kuhns. 1960. On relevance, probabilistic indexing and information retrieval. Journal of the ACM (JACM), 7(3):216–244.

Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. 2005. Terrier information retrieval platform. In European Conference on Information Retrieval, pages 517–519.

Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374.

Jainisha Sankhavara and Prasenjit Majumder. 2017. Biomedical information retrieval. In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, pages 154–157.

Jainisha Sankhavara, Fenny Thakrar, Prasenjit Majumder, and Shamayeeta Sarkar. 2014. Fusing manual and machine feedback in biomedical domain. In Proceedings of The Twenty-Third Text REtrieval Conference, TREC 2014, Gaithersburg, Maryland, USA, November 19-21, 2014.

Matthew S. Simpson and Dina Demner-Fushman. 2012. Biomedical text mining: a survey of recent progress. In Mining text data, pages 465–517. Springer.

Ozlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.



A Example Query

<topic number="1" type="diagnosis">
  <summary>58-year-old woman with hypertension and obesity presents with exercise-related episodic chest pain radiating to the back.</summary>
  <UMLS entities>hypertension obesity exercise related chest pain radiating back nos</UMLS entities>
  <MeSH entities>Vascular Diseases Overnutrition Overweight Motor Activity Human Activities Torso Bone and Bones Neurologic Manifestations Sensation</MeSH entities>
  <manual entities>woman hypertension obesity exercise-related episodic chest pain radiating back</manual entities>
</topic>

B Example Document

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3990010/pdf/1745-6215-15-124.pdf

C OHSUMED features




A Computational Approach to Feature Extraction for Identification of Suicidal Ideation in Tweets

Ramit Sawhney, Prachi Manchanda, Raj Singh, Swati Aggarwal
Netaji Subhas Institute of Technology, New Delhi 110078, India

{ramits, prachim, rajs}[email protected] [email protected]

Abstract

Technological advancements in the World Wide Web and social networks in particular, coupled with an increase in social media usage, have led to a positive correlation between the exhibition of suicidal ideation on websites such as Twitter and cases of suicide. This paper proposes a novel supervised approach for detecting suicidal ideation in content on Twitter. A set of features is proposed for training both linear and ensemble classifiers over a dataset of manually annotated tweets. The performance of the proposed methodology is compared against four baselines that utilize varying approaches to validate its utility. The results are finally summarized by reflecting on the effect of including the proposed features one by one for suicidal ideation detection.

1 Introduction

According to the World Health Organization, suicide is the second leading cause of death among 15-29-year-olds across the world. In fact, close to 800,000 people die due to suicide each year; the number of people who attempt suicide is much higher. While an individual suicide is often a solitary act, it can have a devastating impact on families (Cerel et al., 2008). Many suicide deaths are preventable, and it is important to understand the ways in which individuals communicate their depression and thoughts in order to prevent such deaths (Sher, 2004). Suicide prevention is mainly hinged on surveillance and monitoring of suicide attempts and self-harm tendencies.

The younger generation has started to turn to the Internet (Chan and Fang, 2007) to seek help, discuss depression and suicide-related information, and offer support. The availability of suicide-related material on the Internet plays an important role in the process of suicide ideation. Due to this increasing availability of content on social media websites (such as Twitter, Facebook and Reddit) and blogs (Yates et al., 2017), there is an urgent need to identify affected individuals and offer help. Suicidal ideation refers to thoughts of killing oneself or planning suicide, while suicidal behavior is often defined to include all possible acts of self-harm with the intention of causing death (Costello et al., 2002). Although Twitter provides a unique opportunity to identify at-risk individuals (Jashinsky et al., 2014) and a possible avenue for intervention at both the individual and social level, there exist no best practices for suicide prevention using social media.

While there is a developing body of literature on identifying patterns in the language used on social media that expresses suicidal ideation (De Choudhury et al., 2016), very few attempts have been made to employ feature extraction methods for binary classifiers that separate text related to suicide from text that clearly indicates the author exhibiting suicidal intent. A number of successful models (Yates et al., 2017) have been used for sentence level classification; however, models that can learn to separate suicidal ideation from depression, as well as from less worrying content such as reporting of a suicide, memorials, campaigning, and support, require a greater analysis to select more specific features and methods in order to build an accurate and robust model. The drastic impact that suicide has on the surrounding community, coupled with the lack of specific feature extraction and classification models for the identification of suicidal ideation on social media so that action can be taken, is the driving motivation for the work presented in this paper.

Suicide prevention by suicide detection (Zung, 1979) is one of the most effective ways to drastically reduce suicide rates.



The major practical application of this work lies in its easy adaptability to any social media forum (Robinson et al., 2016), wherein it can be used directly for analyzing text-based content posted by users and flagging it if the content is concerning.

The main contributions of this paper can be summarized as follows:

1. The creation of a labeled dataset, by manual annotation, for learning the patterns in tweets exhibiting suicidal ideation.

2. Proposing a set of features to be fed into classifiers to improve performance.

3. Employing four binary classifiers with the proposed set of features and comparing them against baselines utilizing varied approaches to validate the proposed methodology.

2 Related Work

Media communication can have both positive and negative influences on suicidal ideation. A systematic review of all articles in PsycINFO, MEDLINE, EMBASE, Scopus, and CINAHL from 1991 to 2011 for language constructs relating to self-harm or suicide by Daine et al. (2013) concluded that the internet may be used as an intervention tool for vulnerable individuals under the age of 25. However, not all language constructs containing the word suicide indicate suicidal intent; specific semantic constructs may be used for predicting whether a sentence implies self-harm tendencies or not.

A suicide note analysis method for automating the identification of suicidal ideation was built using binary support vector machine classifiers by Desmet and Hoste (2013), using fine-grained emotion detection with lexico-semantic features for classifier optimization. In 2014, Huang et al. (2014) used rule-based methods with hand-crafted unsupervised classification to develop a real-time suicidal ideation detection system deployed over Weibo, a microblogging platform. That approach differs from the proposed approach in terms of both features and the reach of the social media platforms. Topic modeling in Chinese microblogs (Huang et al., 2015) for suicide ideation detection has also proven to be efficient, although for a limited subset and with a fairly different set of features.

1 http://www.scmp.com/topics/weibo


Studies of rises in suicidal ideation associated with specific temporal events (Kumar et al., 2015) have also been performed, but they do not specifically focus on building a robust system that simply analyzes content without other factors. Related literature also focuses on building systems that analyze tweets of users who have committed suicide (Coppersmith et al., 2016), which may not specifically hint at suicidal ideation, as opposed to the proposed problem.

3 Data

3.1 Data Collection

Traditionally, it has been difficult to extract data related to suicidal ideation or mental illnesses due to social stigma, but now an increasing number of people are turning to the Internet to vent their frustration, seek help and discuss mental health issues (Milne et al., 2016; Sueki et al., 2014). To maintain the privacy of the individuals in the dataset, we do not present direct quotes from any data, nor any identifying information.

Anonymised data was collected from the microblogging website Twitter, specifically content containing self-classified suicidal ideation (i.e. text posts tagged with the word 'suicide') over the period from December 3, 2017 to January 31, 2018. The Twitter REST API was used for the collection of tweets containing any of the following English words or phrases that are consistent with the vernacular of suicidal ideation (O'Dea et al., 2015):

suicidal; suicide; kill myself; my suicide note; my suicide letter; end my life; never wake up; can't go on; not worth living; ready to jump; sleep forever; want to die; be dead; better off without me; better off dead; suicide plan; suicide pact; tired of living; don't want to be here; die alone; go to sleep forever; wanna die; wanna suicide; commit suicide; die now; slit my wrist; cut my wrist; slash my wrist; do not want to be here; want it to be over; want to be dead; nothing to live for; ready to die; not worth living; why should I continue living; take my own life; thoughts of suicide; to take my own life; suicide ideation; depressed; I wish I were dead; kill me now

The texts were collected without knowing the sentiment.

2 https://dev.twitter.com/rest/public/search



For example, when collecting tweets on the hashtag #suicide, it is not known initially whether:

• the tweet is posted for suicide awareness and prevention;

• the person is talking about suicidal ideation and/or ways to kill himself;

• the tweet reports a third person's suicide, e.g. a news report;

• the tweet uses suicide as a figure of speech, e.g. career suicide.

3.2 Data Annotation

Then, 5213 text posts in all were collected, which were subsequently human annotated. The human annotators consisted of both university students fairly active on social media and aware of aspects of cognitive psychology, as well as university faculty in the domains of Psychology and Machine Learning. Human annotators were asked to indicate whether the text implied suicidal ideation using binary criteria, by answering the question "Does this text imply self-harm tendencies or suicidal intent?". Each post was scrutinized and analyzed by three independent annotators (H1, H2 and H3), due to the subjectivity of text annotation, and ambiguous posts were set to the default level, Suicidal intent absent. Posts were examined individually and annotated according to the following classification system:

1. Suicidal intent present:

   • Text conveys a serious display of suicidal ideation; e.g., I want to die, I want to kill myself, or I wish my last suicide attempt was successful;

   • Care was taken to classify only those posts as suicidal where the suicide risk is not conditional, unless some event is a clear risk factor, e.g. depression, bullying, substance use;

   • Posts where a suicide plan and/or previous attempts are discussed; e.g., "The fact that I tried to kill myself and it didn't work makes me more depressed.";

   • The tone of the text is sombre and not flippant; e.g., "This makes me want to kill myself, lol" and "This day is horrible, I want to kill myself hahaha" are not included in this category.

      H1      H2      H3
H1    -       0.61    0.48
H2    0.61    -       0.51
H3    0.48    0.51    -

Table 1: Cohen's Kappa for the three annotators H1, H2 and H3

2. Suicidal intent absent:

   • The default category for all posts.

   • Posts emphasizing suicide related news or information; e.g., Two female suicide bombers hit crowded market in Maiduguri.

   • Posts such as Suicide squad sounds like a good option, with no reasonable evidence to suggest that the risk of suicide is present; posts containing song lyrics, etc., were also marked within this category.

   • Posts pertaining to condolence and suicide awareness; e.g., "5 suicide prevention helplines in India you need to know about", Politician accused of driving his wife to suicide.

Annotators were instructed to select only one of the above categories and to select the default level in case of ambiguity. In all, 15.76% (822) of all tweets were annotated as suicidal; these were then used to train and validate the classifiers presented in the following sections. A satisfactory agreement between the annotators (e.g., 0.61 for H1 and H2) can be inferred from Table 1.

4 Proposed Methodology

The overall methodology is divided into three phases. The initial phase consists of preprocessing the text within a tweet, the second phase involves feature extraction from preprocessed tweets for the training and testing of binary classifiers for suicidal ideation identification, and the final phase classifies and identifies tweets exhibiting suicidal ideation. The details of these individual phases are presented below.

4.1 Preprocessing

Preprocessing is achieved by applying a series of filters, based on Xiang et al. (2012), in the order given below to process the raw tweets (a small sketch of these filters follows the list).

93

Page 104: Proceedings of ACL 2018, Student Research …Shibhansh Dohare, Vivek Gupta and Harish Karnick ix Wednesday July 18, 2018 Biomedical Document Retrieval for Clinical Decision Support

1. Removal of non-English tweets using LingPipe (Baldwin and Carpenter, 2003) with Hadoop.

2. Identification and elimination of user mentions in tweet bodies having the format @username, URLs, as well as retweets in the format RT.

3. Removal of all hashtags with length > 10, due to a great volume of hashtags being concatenated words, which tends to amplify the vocabulary size inadvertently.

4. Stopword removal.
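The sketch below approximates filters 2-4 with regular expressions and a small stopword list (language filtering with LingPipe is omitted); the patterns and the stopword list are simplifications, not the exact pipeline used in the paper.

import re

# Sketch of the preprocessing filters: strip user mentions, URLs and the
# retweet marker, drop hashtags longer than 10 characters, and remove
# stopwords. The stopword list here is a small placeholder.

STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "i", "my"}

def preprocess(tweet):
    tweet = re.sub(r"@\w+|https?://\S+|\bRT\b", " ", tweet)   # mentions, URLs, RT
    tweet = re.sub(r"#(\w{11,})", " ", tweet)                 # hashtags longer than 10 chars
    tweet = re.sub(r"#", "", tweet)                           # keep short hashtag words
    tokens = [t for t in re.findall(r"[a-zA-Z']+", tweet.lower())
              if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("RT @friend I want to end my life... #sad #mentalhealthmatters http://t.co/x"))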

4.2 Feature Extraction

Tweets exhibiting suicidal ideation lack a semi-rigid, pre-defined lexico-syntactic pattern. Hence, they warrant hand engineering and analysis of a set of features (Wang et al., 2016), in contrast to sentence and word embeddings in a supervised setting using deep learning models such as Convolutional Neural Networks (CNN) (Kim, 2014). The proposed methodology utilizes the following set of features for classification.

• Statistical Features. These encompass the number of tokens and their length.

• LIWC Features. Features extracted using the Linguistic Inquiry and Word Count program (LIWC) (Pennebaker et al., 2001) capture people's social and psychological states by analyzing the text to generate labels. Owing to the immense similarity between the nature of the suicidal ideation detection problem and the background of LIWC in social, clinical, and cognitive psychology, LIWC features are an ideal candidate for inclusion as a subset of features for our overall classification problem. As an example, the following tweet is associated with negative emotions and cognitive processes with a high authenticity and emotional tone: I'm holding a gun and deciding if I want to go through with suicide or not. I want to commit suicide really badly... Help?

• Part of Speech counts. POS counts for each label assigned by the Stanford Part-Of-Speech Tagger (Manning et al., 2014) are used as a feature. POS tags include nouns, adjectives, adverbs, verbs, etc.

• TF-IDF. The Term Frequency-Inverse Document Frequency (TF-IDF) is used as a feature to reflect the importance of a particular word within the corpus and is given by:

  tfidf(t) = freq(t) × ln( N / |{d ∈ D : t ∈ d}| )

  where t is the word feature, N is the number of tweets, and d is a document in the document set D.

• Topics Probability. The probability distribution of each topic over its terms is used as a feature, based on the approach that tweets are represented as random mixtures over latent topics. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a generative probabilistic model that describes each such topic as a generative model which generates words of the vocabulary with certain probabilities, and it forms the basis of computing the Topics Probability feature (a small sketch combining the TF-IDF and topic features follows this list).
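A minimal sketch of how the TF-IDF and topic-probability features could be assembled with scikit-learn is shown below; the corpus, number of topics, and the concatenation scheme are assumptions made for illustration, not the exact configuration used in the paper.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sketch: build TF-IDF features and LDA topic-distribution features for a
# tiny, hypothetical set of preprocessed tweets, then concatenate them.

tweets = ["want end life tonight", "suicide prevention helpline numbers",
          "cant go on anymore", "career suicide worst decision ever"]

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(tweets)                 # TF-IDF features

counts = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_probs = lda.fit_transform(counts)                 # per-tweet topic mixture

features = np.hstack([tfidf.toarray(), topic_probs])    # combined feature matrix
print(features.shape)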

4.3 Classification

Suicidal ideation detection is formulated as a supervised binary classification problem. For every tweet ti ∈ D, the document set, a binary valued variable yi ∈ {0, 1} is introduced, where yi = 1 denotes that the tweet ti exhibits suicidal ideation. To learn this, the classifiers must determine whether any sentence in ti possesses a certain structure or keywords that mark the existence of possible suicidal thoughts. The features presented above are then used to train classification models to identify tweets exhibiting suicidal ideation. Linear classifiers such as Logistic Regression, as well as ensemble classifiers including Random Forest (Liaw et al., 2002), Gradient Boosting Decision Tree (Friedman, 2002) and XGBoost (Chen and Guestrin, 2016), are employed for classification.

Both XGBoost and Gradient Boosting Decision Trees aim to boost the performance of a classifier in a stage-wise fashion by iteratively adding a new classifier to the ensemble, allowing the optimization of a differentiable loss function. The Random Forest classifier is one of the most popular ensemble machine learning algorithms, based on Bootstrap Aggregation (Quinlan et al., 1996), or bagging.



It modifies the bagging procedure so that the learning algorithm is limited to a random sample of features over which to search, which has shown promise in text classification problems.
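Illustratively, the four classifiers could be trained over the assembled feature matrix as below; xgboost is an external package, the data is a random placeholder, and the hyperparameters are defaults, so this is a sketch rather than the configuration used in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier   # external dependency

# Sketch: train the linear and ensemble classifiers on the feature matrix
# built earlier. X and y are random placeholders standing in for the
# extracted features and the manual annotations.

rng = np.random.default_rng(0)
X = rng.random((500, 40))
y = rng.integers(0, 2, 500)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.3f}")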

5 Baselines

Validation of the proposed methodology is done by comparison against baseline models that act as a useful point of reference. Comparisons in terms of the evaluation metrics presented below are also made with other recent models for suicidal ideation classification, as follows.

Long Short Term Memory (LSTM) models are more robust to noise in comparison to Recurrent Neural Networks (RNN) (Liu et al., 2016) and better able to capture long-term dependencies in a sequence, due to their ability to learn how to forget past observations. The LSTM model uses h = 128 memory units, with a dropout probability of 0.2, and ReLU (Nair and Hinton, 2010) was used for activation. For training, the Adam optimizer was used to minimize log loss. A batch size of 64 was chosen, and the model was trained for a total of 100 epochs. Pre-trained word2vec word embeddings, trained on 100 billion words from Google News, are employed as features for classification. Support Vector Machines (SVM) (Desmet and Hoste, 2013) have been shown to work well with short informal text (Pak and Paroubek, 2010) and have shown other promising results in the cognitive behavior domain (De Choudhury et al., 2013); the features described in Desmet and Hoste (2013) are used by the SVM for classification. Rule-based approaches focusing on maximizing information gain aim to reduce the uncertainty of the class a particular tweet belongs to; a J48 decision tree (C4.5) (Quinlan et al., 1996) was used with the features above for classification. Lastly, a relatively recent Negation Resolution (Gkotsis et al., 2016) based approach, which employs parse trees to build a set of basic rules that rely on minimum domain knowledge, is used.
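A minimal Keras rendering of an LSTM baseline with these settings (128 units, dropout 0.2, Adam, log loss, batch size 64) might look like the following; the vocabulary size, sequence length, the placement of the ReLU layer, and the substitution of a randomly initialized embedding layer for the pre-trained word2vec vectors are all assumptions made for the sketch.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Sketch of the LSTM baseline: 128 memory units, dropout 0.2, Adam optimizer
# minimizing log loss, batch size 64. Vocabulary size, sequence length and
# the (randomly initialized) embedding layer are placeholders; the paper
# uses pre-trained Google News word2vec embeddings instead.

VOCAB, MAXLEN, EMB_DIM = 10000, 50, 300
model = Sequential([
    Embedding(VOCAB, EMB_DIM),
    LSTM(128, dropout=0.2),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data standing in for padded, integer-encoded tweets and labels.
X = np.random.randint(0, VOCAB, size=(256, MAXLEN))
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, batch_size=64, epochs=2, verbose=0)   # the paper trains 100 epochs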

6 Results and Analysis

6.1 Analysis: Comparison with Baselines

Table 2 presents the results for both the baselines and the classifiers with the proposed methodology in terms of four evaluation metrics: Accuracy, Precision, Recall and F1 score. The first four rows represent the results of the proposed features with both linear and ensemble classifiers as described in the Classification section above. The final four rows represent the baseline results.

The proposed features, used in conjunction with the first four models described in the Classification section, supersede the baseline models in performance along most metrics. The LSTM model has the highest recall, owing to its ability to capture long-term dependencies; however, its overall performance in terms of accuracy and F1 score is relatively lower. Both SVM and rule-based classification do not perform as well as the proposed methodology, owing to their reliance on features that are not well suited to learning how to classify tweets with suicidal ideation. Both of these methods are more suitable in a general domain, whereas the features in the proposed methodology are more specific to the particular problem domain of suicidal ideation detection, particularly the LIWC features and Topics Probability. Lastly, the Negation Resolution method performs poorly on the dataset, due to its inability to adapt to a vast and highly diverse range of suicidal ideation communication and its implicit rigidity. Compared to the proposed methodology, it is unable to effectively learn and extract the essential features from the input text, and thus does not perform as well. In conclusion, the proposed methodology, consisting of feature extraction coupled with ensemble and linear classifiers, supersedes the baselines presented from various domains in terms of performance.

6.2 Classifiers with proposed features

The first four rows of Table 2 represent the results in terms of the evaluation metrics for different classifiers, both linear and ensemble, using the proposed set of features. While the performance of the four classifiers is comparable, Random Forest classifiers perform the best. This is attributed to the ability of Random Forest classifiers to tackle error reduction by reducing variance rather than reducing bias. As has been seen with various text classification problems, Logistic Regression performs fairly well despite its simplicity, and has a greater accuracy and F1 score in comparison with both boosting algorithms.

Table 3 shows the variation in performance of the Random Forest classifier with the inclusion of the various features.



Table 2: Classification results in terms of evaluation metrics.
Model                              Accuracy    Precision    Recall    F1 score
Logistic Regression                0.830       0.819        0.850     0.832
Random Forest                      0.858       0.842        0.846     0.844
Gradient Boosting Decision Tree    0.805       0.802        0.820     0.807
XGBoost                            0.817       0.831        0.800     0.812
LSTM                               0.789       0.745        0.874     0.796
Support Vector Machine             0.792       0.821        0.692     0.754
Rule-based Classification          0.801       0.824        0.743     0.781
Negation Resolution                0.527       0.542        0.752     0.635

Table 3: Variation in performance with the inclusion of features.
Features used                               Accuracy    Precision    Recall    F1 score
Statistical Features (SF) only              0.596       0.547        0.600     0.569
SF + TF-IDF                                 0.669       0.663        0.753     0.702
SF + TF-IDF + POS counts                    0.789       0.821        0.705     0.721
SF + TF-IDF + POS + Topics Probability      0.807       0.814        0.820     0.817
All Features                                0.858       0.842        0.846     0.844

The precision reduces by a small amount with the inclusion of the Topics Probability feature, implying that a greater subset of tweets is classified as suicidal due to the LDA unigrams included via the Topics Probability features, but it is finally boosted by the inclusion of the LIWC features. The POS counts also lead to a reduction in recall, which is compensated for by the subsequent inclusion of the Topics Probability and LIWC features. The drastic improvements in most evaluation metrics are attributed to the TF-IDF, POS counts and LIWC features. It is observed that the proposed set of features performs best in conjunction with Random Forest classifiers, and the improvement in performance with the inclusion of each feature validates the need for extracting that feature.

6.3 Error Analysis

Some categories of errors that occur are:

1. Seemingly suicidal tweets: Human annotators as well as our classifier could not identify whether "I want to kill myself, lol. :(" was representative of suicidal ideation or a frivolous reference to suicide.

2. Pragmatic difficulty: The tweet "I lost my baby. Signing off.." was correctly identified by our human annotators as a tweet with suicidal intent present. This tweet contains an element of topic change with no explicit mention of suicidal ideation, but our classifier could not capture it.

3. Ambiguity: The tweet "Is it odd to know I'll commit suicide?" is one that both the human annotators and the proposed methodology could not classify due to its ambiguity.

7 Conclusion and Future Work

This paper proposes a model to analyze tweets by developing a set of features to be fed into classifiers for the identification of Suicidal Ideation using machine learning. When annotated by humans, 15.76% of the total dataset of 5,213 tweets was found to be suicidal. Both linear and ensemble classifiers were employed to validate the selection of features proposed for Suicidal Ideation detection. Comparisons were also performed with baseline models employing various strategies such as Negation Resolution, LSTMs and rule-based methods. The major contribution of this work is the improved performance of the Random Forest classifier over the other classifiers as well as the baselines, which indicates the promise of the proposed set of features when combined with a bagging-based approach that keeps the correlation between trees minimal. In the future, there is scope to scrape larger amounts of data from more social media websites and to investigate the performance of deep learning models such as CNNs, LSTM-CNNs, etc.



BCSAT: A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations

Sreekavitha Parupalli, Vijjini Anvesh Rao and Radhika Mamidi
Language Technologies Research Center (LTRC)
International Institute of Information Technology, Hyderabad
{sreekavitha.parupalli, vijjinianvesh.rao}@research.iiit.ac.in

[email protected]

Abstract

The presented work aims at generating a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu using word-level sentiment annotations. From OntoSenseNet, we extracted 11,000 adjectives, 253 adverbs and 8,483 verbs, and sentiment annotation is being done by language experts. We discuss the methodology followed for the polarity annotations and validate the developed resource. This work aims at developing a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy for a model where lexeme annotations are applied for sentiment predictions. The fundamental aim of this paper is to validate and study the possibility of utilizing machine learning algorithms and word-level sentiment annotations in the task of automated sentiment identification. Furthermore, accuracy is improved by annotating the bi-grams extracted from the target corpus.

1 Introduction

Sentiment analysis deals with the task of determining the polarity of text. To distinguish positive and negative opinions in simple texts such as reviews, blogs, and news articles, sentiment analysis (or opinion mining) is used. Over time, it evolved from focusing on explicit opinion expressions to addressing a type of opinion inference which is a result of opinions expressed towards events having positive or negative effects on entities.

There are three ways in which one can perform sentiment analysis: document-level, sentence-level, and entity or word-level. These determine the polarity value considering the whole document, sentence-wise polarity, or word-wise polarity in some given text, respectively (Naidu et al., 2017). Despite extensive research, the existing solutions and systems have a lot of scope for improvement to meet the standards of the end users. The main problem arises while cataloging the possibly infinite set of conceptual rules that operate behind analyzing the hidden polarity of the text (Das and Bandyopadhyay, 2011). In this paper, we perform a word-level sentiment annotation to validate the usage of such techniques for improving the sentiment analysis task. Furthermore, we use word embeddings of the word-level sentiment annotated lexicon to predict the sentiment label of a document. We experiment with various machine learning algorithms to analyze the effect of word-level sentiment annotations on (document-level) sentiment analysis.

The paper is organized as follows. In Section 2 we discuss previous work in the field of sentiment analysis, existing resources for Telugu and specific advances made in Telugu. Section 3 describes our corpus and annotation scheme. Section 4 describes several experiments that are carried out and the accuracies obtained; we also explain the results in detail in Section 4.3. Section 5 showcases our conclusions and Section 6 shows the scope for future work.

2 Related Work

• Sentiment Analysis: Several approaches have been proposed to capture the sentiment in text, where each approach addresses the issue at a different level of granularity. Some researchers have proposed methods for document-level sentiment classification (Pang et al., 2002; Turney and Littman, 2003). At this top level of granularity, it is often impossible to infer the sentiment expressed about any particular entity, because a document may convey different opinions for different entities. Hence, when we consider opinion mining tasks whose sole aim is to capture the sentiment polarities about entities, such as products in product reviews, it has been shown that sentence-level and phrase-level analysis lead to a performance gain (Wilson et al., 2005; Choi and Wiebe, 2014). In the context of Indian languages, (Das et al., 2012) proposes an alternate way to build resources for multilingual affect analysis where translations into Telugu are done using WordNet.

• SentiWordNet: (Das and Bandyopadhyay, 2010) proposes multiple computational techniques, such as WordNet-based, dictionary-based, corpus-based and generative approaches, to generate the Telugu SentiWordNet. (Das and Bandyopadhyay, 2011) proposes a tool, Dr Sentiment, which automatically creates the PsychoSentiWordNet, an extension of SentiWordNet that presently holds human psychological knowledge on a few aspects along with sentiment knowledge.

• Advances in Telugu: (Naidu et al., 2017) utilizes the Telugu SentiWordNet on a news corpus to perform the task of sentiment analysis. (Mukku and Mamidi, 2017) developed a polarity annotated corpus in which positive, negative and neutral polarities are assigned to 5,410 sentences collected from several sources; they developed a gold-standard annotated corpus of Telugu sentences aimed at improving sentiment analysis in Telugu. To minimize the dependence of machine learning (ML) approaches for sentiment analysis on an abundance of corpora, (Choudhary et al., 2018a) proposes a novel method to learn representations of resource-poor languages by training them jointly with resource-rich languages using a siamese network. A novel approach to classify sentences into their corresponding sentiment using contrastive learning is proposed by (Choudhary et al., 2018b), which utilizes the shared parameters of siamese networks.

(Gangula and Mamidi, 2018) and (Mukku and Mamidi, 2017) are the only reported works for Telugu sentiment analysis using sentence-level annotations that developed annotated corpora. Ours is the first-of-its-kind NLP research which uses sentiment annotation of bi-grams for sentiment analysis (opinion mining).

3 Building the Benchmark Corpus

Lexicons play an important role in sentiment analysis. Having an annotated lexicon is key to carrying out sentiment analysis efficiently. The primary task in sentiment analysis is to identify the polarity of the text in any given document; the polarity may be either positive, negative or neutral (Naidu et al., 2017). Sentiment is a property of human intelligence and is not entirely based on the features of a language. Thus, people's involvement is required to capture the sentiment (Das and Bandyopadhyay, 2011). Having said this, we establish that annotated lexicons are of immense importance in any language for sentiment analysis (a.k.a. opinion mining).

For our experiments, we utilize the reviews dataset from the Sentiraama corpus.1 It contains 668 reviews in total for 267 movies, 201 products and 200 books. The product reviews have 101 positive and 100 negative entries; the movie reviews have 136 positive and 132 negative entries; the book reviews have 100 positive and 100 negative entries. Since the obtained corpus is only annotated with document-level sentiment labels, we perform the word-level sentiment annotation manually.

3.1 Annotation Procedure

In this paper, sentiment polarities are classified into 4 labels: positive, negative, neutral and ambiguous. Positive and negative labels are given in the case of positive and negative sentiment in the word, respectively. The ambiguous label is given to words which acquire sentiment based on the words they are used along with or their position in a sentence. The neutral label is given when the word carries no sentiment. However, the neutral and ambiguous sentiment labels are of no significant use for the task of sentiment analysis; henceforth, those labels are ignored in our experiments.

Sentiment annotations are performed on two different kinds of data. Table 1 showcases the distribution of sentiment labels at the word level.

1 https://ltrc.iiit.ac.in/showfile.php?filename=downloads/sentiraama/


• Unigrams: We obtain 7,663 words from the Telugu SentiWordNet2 resource to calculate the baseline accuracy of any word-level sentiment annotated model. These words are already annotated for sentiment/polarity; however, the resource does not provide extensive coverage of Telugu. Later on, we discovered a newly developed large resource of Telugu words, OntoSenseNet (Parupalli and Singh, 2018), which has a collection of 21,000 words (adjectives + verbs + adverbs). We perform the task of word-level sentiment annotation on the words obtained from this resource, and we refer to these annotated words as unigrams throughout this paper. The language experts who performed the annotations were given guidelines to follow: they were asked to look at the word and its gloss, and then decide which of the four sentiment labels is most apt for the given word. The aforementioned word-level sentiment annotation is an attempt to improve the coverage of SentiWordNet.

• Bigrams: Furthermore, sentiment cannot always be captured in a single word. This paper aims to check whether bigram annotation is a suitable approach for improving the efficiency of sentiment analysis. To validate the hypothesis, we extract bigrams that occurred more than once, only from the target corpus, the Sentiraama dataset developed by (Gangula and Mamidi, 2018). For example, consider the bigram ('DhokA', 'ledu'). The words individually mean 'hurdle' (DhokA) and 'no' (ledu); thus, in the word-level annotation task each would be given a negative label. However, the bigram means there is 'nothing that can stop', which invokes a positive sentiment. Such occurrences are quite common in text, especially reviews, which leads us to believe that bigram polarity has the potential to enhance sentiment analysis (opinion mining). The usage of this developed resource in the experiments performed is explained in Section 4.

3.2 Validation

2 http://amitavadas.com/sentiwordnet.php

Annotations are done by 2 native speakers of Telugu. If the annotators are not able to decide which label to assign, they are advised to tag the word as uncertain. In case of a disagreement, the label given by the annotator with more experience is given priority. Validation of the developed resource is done using Cohen's Kappa (Cohen, 1968). Considering the uncertain cases as borderline cases (where at least one annotator tagged the word as uncertain), the Kappa value is 0.91. This shows almost perfect agreement and proves the consistency of the annotation task. The agreement is especially high because, whenever both annotators were uncertain, we re-iterated to finalize the tag; such re-iteration was done for about 2,400 words during the development of our resource.
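For reference, the reported agreement can be computed with scikit-learn's implementation of Cohen's Kappa; the two label lists below are hypothetical stand-ins for the annotators' tags over the same words.

# Sketch: inter-annotator agreement via Cohen's Kappa (Cohen, 1968).
# annotator_a and annotator_b are assumed to hold one label per word,
# aligned over the same word list; the example values are illustrative.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "negative", "neutral", "ambiguous", "negative"]
annotator_b = ["positive", "negative", "neutral", "negative", "negative"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # the paper reports 0.91 on the full resource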

4 Experiments and Results

In this section we will analyze and observe how word-level polarity affects the overall sentiment of the text through a majority polling approach and machine learning based classification approaches.

4.1 Majority Polling Approach

A simple, intuitive approach to identify the sentiment label of a text is to compute the sum of its positive (+1) and negative (-1) polarity values. If the sum is positive, the positive words have outnumbered the negative words, resulting in a positive sentiment on the whole; otherwise, the polarity of the text is negative. Cases where the sum equals 0 are ignored. The following are the word-level polarities we consider for the positive and negative labels (a sketch of this polling scheme follows the list below):

• Unigram: We use the annotated unigram data discussed in Section 3. For each review, we consider the unigram labels to carry out the majority polling approach.

• Bigram: The extracted bigrams are annotated for positive and negative polarity. Initially, we divide our data into training and testing sets in a 7:3 ratio. We consider only the annotated bigrams from the training corpus to predict the sentiment polarity of the reviews in the test data.

• Unigram+Bigram: In this trial, we combine the unigram and bigram data to perform majority polling. We consider the whole unigram data, whereas only the bigrams extracted from the training set are considered for predictions.
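A minimal sketch of this polling scheme, assuming a polarity lexicon that maps annotated unigrams and bigram tuples to +1/-1 (the lexicon below is an illustrative placeholder, not the released resource):

# Sketch: majority polling over word-level polarities.
# `lexicon` maps a unigram or a bigram tuple to +1 (positive) or -1 (negative);
# tokens with no entry contribute nothing to the score.
def predict_review(tokens, lexicon, use_bigrams=True):
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if use_bigrams:
        score += sum(lexicon.get((a, b), 0) for a, b in zip(tokens, tokens[1:]))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None  # score == 0: the review is left unclassified, as in the paper

# Example with the bigram from Section 3.1: ('DhokA', 'ledu') is annotated positive,
# while the individual words carry negative labels.
lexicon = {"DhokA": -1, "ledu": -1, ("DhokA", "ledu"): +1}
print(predict_review(["DhokA", "ledu"], lexicon, use_bigrams=False))  # -> "negative"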


Resource                                  Positive  Negative  Neutral  Ambiguous  Total
SentiWordNet                              2135      4076      359      1093       7663
Dictionary (Parupalli and Singh, 2018)    3080      4232      3391     10199      20896
Bigrams                                   1978      1762      8990     1996       14826

Table 1: Distribution of Sentiment Labels in Several Resources

Furthermore, as Telugu is agglutinative in nature (Pingali and Varma, 2006), we also experiment with the above-mentioned approaches after performing morphological segmentation, as provided by the Indic NLP library.4 Morphological segmentation is performed on the original reviews data and on the n-grams (positive and negative labels) to see whether we can obtain more accurate sentiment predictions for the reviews due to the increase in coverage.

4.2 Machine Learning Based Classification Approach

In this section, we perform the document-level sentiment analysis task with word embedding models, specifically Word2Vec. We utilize a Word2Vec model trained on a corpus of data scraped from Telugu websites, with 270 million non-unique tokens in total. Furthermore, to obtain a vector for each review, we take the word vector of every word in the review and average them to get a single document vector.

Figure 1: Comparative analysis of percentage accuracies produced by various classifiers

4 http://anoopkunchukuttan.github.io/indic_nlp_library/

Though traditional vector-based word representations help us accomplish various natural language processing tasks, they often lack information relevant to sentiment analysis. Thus, we aim to enrich the Word2Vec vectors obtained from the corpus by incorporating word-level polarity features. We do this by adding the features proposed in Section 4.1 to the original averaged Word2Vec vector, which is expected to increase the accuracy of polarity prediction. The additional features we added are: positive unigrams (the number of positive-polarity unigrams in the review), negative unigrams (the number of negative-polarity unigrams in the review), positive bigrams (the number of positive-polarity bigrams in the review) and negative bigrams (the number of negative-polarity bigrams in the review). We partition these document vectors into training and testing sets to develop various classifier models. In this paper, we have implemented 5 classifiers, namely Linear SVM, Gaussian SVM, Random Forest, Neural Network and K Nearest Neighbor (KNN). Percentage accuracies, along with the improvement in accuracies after the addition of our proposed features, are illustrated in Figure 1, and the results are discussed in Section 4.3.
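A sketch of this feature construction and classification step, assuming the word-vector lookup w2v behaves like a trained gensim KeyedVectors object (supporting membership tests, indexing and a vector_size attribute); variable names and hyperparameters are illustrative rather than the paper's settings.

# Sketch: averaged Word2Vec document vectors enriched with polarity-count features.
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def review_vector(tokens, w2v, pos_uni, neg_uni, pos_bi, neg_bi):
    vecs = [w2v[t] for t in tokens if t in w2v]           # per-word embeddings
    doc = np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
    bigrams = list(zip(tokens, tokens[1:]))
    counts = [sum(t in pos_uni for t in tokens),           # positive unigrams
              sum(t in neg_uni for t in tokens),           # negative unigrams
              sum(b in pos_bi for b in bigrams),            # positive bigrams
              sum(b in neg_bi for b in bigrams)]            # negative bigrams
    return np.concatenate([doc, counts])

def run_classifiers(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    models = {"Linear SVM": LinearSVC(), "Gaussian SVM": SVC(kernel="rbf"),
              "Random Forest": RandomForestClassifier(n_estimators=300),
              "Neural Network": MLPClassifier(max_iter=500),
              "KNN": KNeighborsClassifier()}
    for name, clf in models.items():
        clf.fit(X_tr, y_tr)
        print(name, round(100 * accuracy_score(y_te, clf.predict(X_te)), 2))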

4.3 Results

In this section, we showcase and analyze the results of the two experiments described in Section 4.

4.3.1 Majority Polling Approach:

Results illustrated in Table 2 show that certain word-level features do capture information relevant to document-level sentiment analysis. Our hypothesis in Section 3.1 is that bigram polarity annotations have the potential to enhance sentiment analysis, and the high accuracy obtained by using only bigrams for majority polling supports this hypothesis. However, there is a trade-off between coverage and accuracy, which can be seen from the large increase in the count of unclassified reviews in the case of bigram majority polling. We also observe that the effect of morphological segmentation on accuracy is hardly positive. This indicates that, in the case of Telugu, morphological information is relevant to the sentiment expressed, and morphological segmentation results in the loss of such valuable information for sentiment analysis tasks.


                        SentiWordNet  Our resource  Bigram   Uni+Bigrams
Before Segmentation     61.86         62.84         78.97    55.44
Unclassified reviews    23/201        14/201        108/201  10/201
After Segmentation      60.23         58.29         49.46    57.89
Unclassified reviews    20/201        18/201        36/201   8/201

Table 2: Comparison of accuracies obtained through majority polling on different resources.

4.3.2 Machine Learning Based Classification Approach:

This approach shows that, across all the classifiers, the addition of word-level polarity features improves classification, so the classifiers can predict document-level sentiment polarity with better accuracy; hence, our hypothesis is validated once again. The accuracies do not improve significantly over the baseline value but always show a small increment. The KNN classifier shows a large drop in accuracy after the inclusion of the newly proposed features. This is because KNN assumes all features to hold equal importance for classification; hence, KNN fails to ignore the noisy features, which explains the drop. The Random Forest and neural network classifiers do not show significant learning from the proposed features. Finally, we observe that the linear SVM classifier works best to identify the polarity of a text for our features, indicating the linear separability of the data; this also explains the poor performance of the Gaussian SVM. The linear SVM produces an accuracy of 84.08% when the SentiWordNet words alone are used as a feature, which can be considered the baseline accuracy. It gives accuracies of 83.44%, 84.34% and 86.57% for unigrams, bigrams and unigrams+bigrams, respectively, as features of the linear SVM classifier.

5 Conclusions

In this paper, efforts are made to develop an annotated corpus of 21,000 words to enrich the Telugu SentiWordNet; this is a work in progress. We perform annotations of 14,000 bigrams extracted from the target corpus to validate their importance. This is a first-of-its-kind approach in Telugu to enhance sentiment analysis. The manual annotations show almost perfect agreement, which validates the developed resource. Furthermore, we provide a justification for why word-level sentiment annotation of bigrams enhances sentiment analysis, through an intuitive majority polling approach and by using several ML classifiers. The results are analyzed for further insights.

6 Future Work

We extract bigrams only from the target corpus because we wanted mainly to validate the importance of bigrams in sentiment analysis. However, attempts should be made to enhance the SentiWordNet with, at least, some of the most frequently occurring bigrams in Telugu. We hope this corpus can serve as a basis for more work in the area of sentiment analysis for Telugu. A continuation of this paper could be handling the enrichment of the adjectives and adverbs available in OntoSenseNet for Telugu.

6.1 Crowd sourcing

We can develop a crowd-sourcing platform where the annotations can be done by several language experts instead of a few; this helps in the annotation of large corpora. We aim to develop a crowd-sourcing model for this in the near future. It would be of immense help in the annotation of the 21,000 unigrams extracted from the dictionary developed by (Parupalli and Singh, 2018).

7 Acknowledgements

This work is part of an ongoing MS thesis in Exact Humanities under the guidance of Prof. Radhika Mamidi. I am immensely grateful to Vijaya Lakshmi for helping me with data collection. I would like to thank Nurendra Choudhary for reviewing the paper and for his part in the ideation. I would like to extend my gratitude to Abhilash Reddy for annotating the dataset, reviewing the work carefully and constantly pushing us to do better. I want to thank Rama Rohit Reddy for his support and for validating the novelty of this research at several points. I acknowledge the support of Google in the form of an International Travel Grant, which enabled me to attend this conference.


References

Yoonjung Choi and Janyce Wiebe. 2014. +/-EffectWordNet: Sense-level lexicon acquisition for opinion inference. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1181–1191.

Nurendra Choudhary, Rajat Singh, Ishita Bindlish, and Manish Shrivastava. 2018a. Emotions are universal: Learning sentiment based representations of resource-poor languages using siamese networks. arXiv preprint arXiv:1804.00805.

Nurendra Choudhary, Rajat Singh, Ishita Bindlish, and Manish Shrivastava. 2018b. Sentiment analysis of code-mixed languages leveraging resource rich languages. arXiv preprint arXiv:1804.00806.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213.

Amitava Das and Sivaji Bandyopadhyay. 2010. SentiWordNet for Indian languages. In Proceedings of the Eighth Workshop on Asian Language Resources, pages 56–63.

Amitava Das and Sivaji Bandyopadhyay. 2011. Dr Sentiment knows everything! In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, pages 50–55. Association for Computational Linguistics.

Dipankar Das, Soujanya Poria, Chandra Mohan Dasari, and Sivaji Bandyopadhyay. 2012. Building resources for multilingual affect analysis – a case study on Hindi, Bengali and Telugu. In Workshop Programme, page 54.

Rama Rohit Reddy Gangula and Radhika Mamidi. 2018. Resource creation towards automated sentiment analysis in Telugu (a low resource language) and integrating multiple domain sources to enhance sentiment prediction. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Sandeep Sricharan Mukku and Radhika Mamidi. 2017. ACTSA: Annotated corpus for Telugu sentiment analysis. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 54–58.

Reddy Naidu, Santosh Kumar Bharti, Korra Sathya Babu, and Ramesh Kumar Mohapatra. 2017. Sentiment analysis using Telugu SentiWordNet.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79–86. Association for Computational Linguistics.

Sreekavitha Parupalli and Navjyoti Singh. 2018. Enrichment of OntoSenseNet: Adding a sense-annotated Telugu lexicon. arXiv preprint arXiv:1804.02186.

Prasad Pingali and Vasudeva Varma. 2006. Hindi and Telugu to English cross language information retrieval at CLEF 2006. In CLEF (Working Notes).

Peter D Turney and Michael L Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS), 21(4):315–346.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics.


Reinforced Extractive Summarization with Question-Focused Rewards

Kristjan Arumae and Fei Liu
Department of Computer Science
University of Central Florida, Orlando, FL 32816
[email protected] [email protected]

Abstract

We investigate a new training paradigm for extractive summarization. Traditionally, human abstracts are used to derive gold-standard labels for extraction units. However, the labels are often inaccurate, because human abstracts and source documents cannot be easily aligned at the word level. In this paper we convert human abstracts to a set of Cloze-style comprehension questions. System summaries are encouraged to preserve salient source content useful for answering questions and to share common words with the abstracts. We use reinforcement learning to explore the space of possible extractive summaries and introduce a question-focused reward function to promote concise, fluent, and informative summaries. Our experiments show that the proposed method is effective. It surpasses state-of-the-art systems on the standard summarization dataset.

1 Introduction

We study extractive summarization in this work, where salient word sequences are extracted from the source document and concatenated to form a summary (Nenkova and McKeown, 2011). Existing supervised approaches to extractive summarization frequently use human abstracts to create annotations for extraction units (Gillick and Favre, 2009; Li et al., 2013; Cheng and Lapata, 2016). E.g., a source word is labelled 1 if it appears in the abstract, 0 otherwise. Despite the usefulness, there are two issues with this scheme. First, a vast majority of the source words are tagged 0s and only a small portion are 1s. This is due to the fact that human abstracts are short and concise; they often contain words not present in the source. Second, not all labels are accurate. Source words that are labelled 0 may be paraphrases, generalizations, or otherwise related to words in the abstracts. These source words are often mislabelled. Consequently, leveraging human abstracts to provide supervision for extractive summarization remains a challenge.

Source Document
The first doses of the Ebola vaccine were on a commercial flight to West Africa and were expected to arrive on Friday, according to a spokesperson from GlaxoSmithKline (GSK), one of the companies that has created the vaccine with the National Institutes of Health.
Another vaccine from Merck and NewLink will also be tested.
"Shipping the vaccine today is a major achievement and shows that we remain on track with the accelerated development of our candidate Ebola vaccine," Dr. Moncef Slaoui, chairman of global vaccines at GSK said in a company release. (Rest omitted.)

Abstract
The first vials of an Ebola vaccine should land in Liberia Friday

Questions
Q: The first vials of an ______ vaccine should land in Liberia Friday
Q: The first vials of an Ebola vaccine should ______ in Liberia Friday
Q: The first vials of an Ebola vaccine should land in ______ Friday

Table 1: Example source document, the top sentence of the abstract, and system-generated Cloze-style questions. Source content related to the abstract is italicized.

Neural abstractive summarization can alleviate this issue by allowing the system to either copy words from the source texts or generate new words from a vocabulary (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017). While the techniques are promising, they face other challenges, such as ensuring the summaries remain faithful to the original. Failing to reproduce factual details has been revealed as one of the main obstacles for neural abstractive summarization (Cao et al., 2018; Song and Liu, 2018). This study thus chooses to focus on neural extractive summarization.

We explore a new training paradigm for extractive summarization. We convert human abstracts to a set of Cloze-style comprehension questions, where the question body is a sentence of the abstract with a blank, and the answer is an entity or a keyword. Table 1 shows an example. Because the questions cannot be answered by applying general world knowledge, system summaries are encouraged to preserve salient source content that is relevant to the questions (≈ the human abstract) such that the summaries can work as a document surrogate to predict correct answers. We use an attention mechanism to locate segments of a summary that are relevant to a given question so that the summary can be used to answer multiple questions.

This study extends the work of (Lei et al., 2016) to use reinforcement learning to explore the space of extractive summaries. While the original work focuses on generating rationales to support supervised classification, the goal of our study is to produce fluent, generic document summaries. The question-answering (QA) task is designed to fulfill this goal and the QA performance is only secondary. Our research contributions can be summarized as follows:

• we investigate an alternative training scheme for extractive summarization where the summaries are encouraged to be semantically close to human abstracts in addition to sharing common words;

• we compare two methods to convert human abstracts to Cloze-style questions and investigate their impact on QA and summarization performance. Our results surpass those of previous systems on a standard summarization dataset.

2 Related Work

This study focuses on generic summarization. It is different from query-based summarization (Daume III and Marcu, 2006; Dang and Owczarzak, 2008), where systems are trained to select text pieces related to predefined queries. In this work we have no predefined queries; instead the system carefully generates questions from human abstracts and learns to produce generic summaries that are capable of answering all questions.

Cloze questions have been used in reading comprehension (Richardson et al., 2013; Weston et al., 2016; Mostafazadeh et al., 2016; Rajpurkar et al., 2016) to test the system's ability to perform reasoning and language understanding. Hermann et al. (2015) describe an approach to extract (context, question, answer) triples from news articles. Our work draws on this approach to automatically create questions from human abstracts.

Reinforcement learning (RL) has been recently applied to a number of NLP applications, including dialog generation (Li et al., 2017), machine translation (MT) (Ranzato et al., 2016; Gu et al., 2018), question answering (Choi et al., 2017), and summarization and sentence simplification (Zhang and Lapata, 2017; Paulus et al., 2017; Chen and Bansal, 2018; Narayan et al., 2018). This study leverages RL to explore the space of possible extractive summaries. The summaries are encouraged to preserve salient source content useful for answering questions as well as to share common words with the abstracts.

3 Our Approach

Given a source document X, our system generates a summary Y = (y_1, y_2, \cdots, y_{|Y|}) by identifying consecutive sequences of words: y_t is 1 if the t-th source word is included in the summary, and 0 otherwise. In this section we investigate a question-oriented reward R(Y) that encourages summaries to contain sufficient content useful for answering key questions about the document (§3.1); we then use reinforcement learning to explore the space of possible extractive summaries (§3.2).

3.1 Question-Focused Reward

We reward a summary if it can be used as a document surrogate to answer important questions. Let \{(Q_k, e_k^*)\}_{k=1}^{K} be a set of question-answer pairs for a source document, where e_k^* is the ground-truth answer corresponding to an entity or a keyword. We encode the question Q_k into a vector \mathbf{q}_k = \text{Bi-LSTM}(Q_k) \in \mathbb{R}^d using a bidirectional LSTM, where the last outputs of the forward and backward passes are concatenated to form a question vector. We use the same Bi-LSTM to encode the summary Y into a sequence of vectors (\mathbf{h}_1^S, \mathbf{h}_2^S, \cdots, \mathbf{h}_{|S|}^S) = \text{Bi-LSTM}(Y), where |S| is the number of words in the summary; \mathbf{h}_t^S \in \mathbb{R}^d is the concatenation of the forward and backward hidden states at time step t. Figure 1 provides an illustration of the system framework.

An attention mechanism is used to locate parts of the summary that are relevant to Q_k. We define \alpha_{k,i} \propto \exp(\mathbf{q}_k \mathbf{W}^a \mathbf{h}_i^S) to represent the importance of the i-th summary word (\mathbf{h}_i^S) for answering the k-th question (\mathbf{q}_k), characterized by a bilinear term (Chen et al., 2016a). A context vector \mathbf{c}_k is constructed as a weighted sum of all summary words relevant to the k-th question, and it is used to predict the answer. We define the QA reward R_a(Y) as the log-likelihood of correctly predicting all answers; \{\mathbf{W}^a, \mathbf{W}^c\} are learnable model parameters.


Figure 1: System framework. The model uses an extractive summary as a document surrogate to answer important questions about the document. The questions are automatically derived from the human abstract.

\alpha_{k,i} = \frac{\exp(\mathbf{q}_k \mathbf{W}^a \mathbf{h}_i^S)}{\sum_{i=1}^{|S|} \exp(\mathbf{q}_k \mathbf{W}^a \mathbf{h}_i^S)}    (1)

\mathbf{c}_k = \sum_{i=1}^{|S|} \alpha_{k,i} \mathbf{h}_i^S    (2)

P(e_k \mid Y, Q_k) = \mathrm{softmax}(\mathbf{W}^c \mathbf{c}_k)    (3)

R_a(Y) = \frac{1}{K} \sum_{k=1}^{K} \log P(e_k^* \mid Y, Q_k)    (4)
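A minimal numpy sketch of Eqs. (1)-(4), assuming the Bi-LSTM encodings are already computed; the random matrices stand in for learned parameters, and the encoders themselves are omitted.

# Sketch of the question-focused QA reward (Eqs. 1-4).
# h_S: (|S|, d) summary word encodings; q: (K, d) question vectors;
# answer_ids: (K,) indices of the gold answers in a candidate answer set of size V.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qa_reward(h_S, q, answer_ids, W_a, W_c):
    scores = q @ W_a @ h_S.T                 # (K, |S|) bilinear attention scores
    alpha = softmax(scores, axis=1)          # Eq. (1)
    c = alpha @ h_S                          # Eq. (2): one context vector per question
    p = softmax(c @ W_c.T, axis=1)           # Eq. (3): distribution over candidate answers
    K = len(answer_ids)
    return np.mean(np.log(p[np.arange(K), answer_ids] + 1e-12))  # Eq. (4)

# Toy shapes: d = 8 encoding dims, V = 50 candidate answers, 3 questions, 12 summary words.
rng = np.random.default_rng(0)
d, V = 8, 50
h_S, q = rng.normal(size=(12, d)), rng.normal(size=(3, d))
W_a, W_c = rng.normal(size=(d, d)), rng.normal(size=(V, d))
print(qa_reward(h_S, q, np.array([4, 17, 30]), W_a, W_c))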

In the following we describe approaches to obtain a set of question-answer pairs \{(Q_k, e_k^*)\}_{k=1}^{K} from a human abstract. In fact, this formulation has the potential to make use of multiple human abstracts (subject to availability) in a unified framework; in that case, the QA pairs would be extracted from all abstracts. According to Eq. (4), the system is optimized to generate summaries that preserve salient source content sufficient to answer all questions (≈ the human abstract).

We expect to harvest one question-answer pair from each sentence of the abstract. More are possible, but the QA pairs would contain duplicate content. There are a few other noteworthy issues. If we do not collect any QA pairs from a sentence of the abstract, its content will be left out of the system summary. It is thus crucial for the system to extract at least one QA pair from every sentence in an automatic manner. Further, the questions must not be answerable by simply applying general world knowledge. We expect the adequacy of the summary to have a direct influence on whether or not the questions will be correctly answered. Motivated by these considerations, we perform the following steps. We split a human abstract into a set of sentences, identify an answer token from each sentence, then convert the sentence into a question by replacing the token with a placeholder, yielding a Cloze question. We explore two approaches to extract answer tokens:

• Entities. We extract four types of named entities {PER, LOC, ORG, MISC} from sentences and treat them as possible answer tokens.

• Keywords. This approach identifies the ROOT word of a sentence's dependency parse tree and treats it as a keyword-based answer token. Not all sentences contain entities, but every sentence has a root word; it is often the main verb of the sentence.

We obtain K question-answer pairs from each human abstract, one pair per sentence. If there are fewer than K sentences in the abstract, the QA pairs of the top sentences are duplicated, under the assumption that the top sentences are more important than the others. If multiple entities reside in a sentence, we randomly pick one as the answer token; if there are no entities, we use the root word instead.
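The conversion step could look roughly like the sketch below, which uses spaCy for NER and dependency parsing; spaCy and its entity types are our own illustrative substitutions, since the paper does not name the toolkit used, and the @placeholder convention follows the example in Figure 1.

# Sketch: turning an abstract sentence into a Cloze question (answer token -> @placeholder).
# spaCy is an assumed toolkit choice; any NER + dependency parser would do.
import random
import spacy

nlp = spacy.load("en_core_web_sm")                 # requires the small English model
ENTITY_TYPES = {"PERSON", "GPE", "LOC", "ORG"}     # stand-ins for PER/LOC/ORG/MISC

def make_cloze(sentence):
    doc = nlp(sentence)
    entities = [e for e in doc.ents if e.label_ in ENTITY_TYPES]
    if entities:                                   # entity-based answer token
        ans = random.choice(entities)
        question = sentence[:ans.start_char] + "@placeholder" + sentence[ans.end_char:]
        return question, ans.text
    root = next(t for t in doc if t.dep_ == "ROOT")  # keyword-based answer token
    question = sentence[:root.idx] + "@placeholder" + sentence[root.idx + len(root):]
    return question, root.text

print(make_cloze("The first vials of an Ebola vaccine should land in Liberia Friday"))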

To ensure that the extractive summaries are concise, fluent, and close to the original wording, we add additional components to the reward function: (i) we define R_s(Y) = \left|\frac{1}{|Y|}\sum_{t=1}^{|Y|} y_t - \delta\right| to restrict the summary size. We require the percentage of selected source words to be close to a predefined threshold \delta. This constraint works well at restricting length, with the average summary size adhering to this percentage; (ii) we further introduce R_f(Y) = \sum_{t=2}^{|Y|} |y_t - y_{t-1}| to encourage the summaries to be fluent. This component is adopted from (Lei et al., 2016): few 0/1 switches between y_{t-1} and y_t indicate that the system is selecting consecutive word sequences; (iii) we encourage system and reference summaries to share common bigrams. This practice has shown success in earlier studies (Gillick and Favre, 2009). R_b(Y) is defined as the percentage of reference bigrams successfully covered by the system summary. These three components together ensure the well-formedness of the extractive summaries. The final reward function R(Y) is a linear interpolation of all the components; \gamma, \alpha, \beta are coefficients and we describe their parameter tuning in §4.

3.2 Reinforcement LearningIn the following we seek to optimize a policyP (Y |X) for generating extractive summaries sothat the expected reward EP (Y |X)[R(Y )] is max-imized. Taking derivatives of this objective withrespect to model parameters θ involves repeat-edly sampling summaries Y = (y1, y2, · · · , y|Y |)(illustrated in Eq. (6)). In this way reinforce-ment learning exploits the space of extractive sum-maries of a source document.

∇θEP (Y |X)[R(Y )]

= EP (Y |X)[R(Y )∇θ logP (Y |X)]

≈ 1N

∑Nn=1R(Y (n))∇θ logP (Y (n)|X) (6)

To calculate P (Y |X) and then sample Yfrom it, we use a bidirectional LSTM to en-code a source document to a sequence of vectors:(hD1 ,h

D2 , · · · ,hD|X|) = Bi-LSTM(X). Whether

to include the t-th source word in the summary(yt) thus can be decided based on hDt . However,we also want to accommodate the previous t-1sampling decisions (y1:t−1) to improve the fluencyof the extractive summary. Following (Lei et al.,2016), we introduce a single-direction LSTM en-coder whose hidden state st tracks the samplingdecisions up to time step t (Eq. 8). It representsthe semantic meaning encoded in the current sum-mary. To sample the t-th word, we concatenate thetwo vectors [hDt ||st−1] and use it as input to a feed-forward layer with sigmoid activation to estimateyt ∼ P (yt|y1:t−1, X) (Eq. 7).

P (yt|y1:t−1, X) = σ(Wh[hDt ||st−1] + bh) (7)

st = LSTM([hDt ||yt], st−1) (8)

P (Y |X) =∏|Y |t=1 P (yt|y1:t−1, X) (9)

Note that Eq. (7) can be pretrained using goldstan-dard summary sequence Y ∗ = (y∗1, y

∗2, · · · , y∗|Y |)

to minimize the word-level cross-entropy loss,

System R-1 R-2 R-LLSA (Steinberger and Jezek, 2004) 21.2 6.2 14.0LexRank (Erkan and Radev, 2004) 26.1 9.6 17.7TextRank (Mihalcea and Tarau, 2004) 23.3 7.7 15.8SumBasic (Vanderwende et al., 2007) 22.9 5.5 14.8KL-Sum (Haghighi and Vanderwende, 2009) 20.7 5.9 13.7Distraction-M3 (Chen et al., 2016b) 27.1 8.2 18.7Seq2Seq w/ Attn (See et al., 2017) 25.0 7.7 18.8Pointer-Gen w/ Cov (See et al., 2017) 29.9 10.9 21.1Graph-based Attn (Tan et al., 2017) 30.3 9.8 20.0

Extr+EntityQ (this paper) 31.4 11.5 21.7Extr+KeywordQ (this paper) 31.7 11.6 21.5

Table 2: Results on the CNN test set (full-length F1 scores).

where we set y∗t as 1 if (xt, xt+1) is a bigram inthe human abstract. For reinforcement learning,our goal is to optimize the policy P (Y |X) usingthe reward function R(Y ) (§3.1) during the train-ing process. Once the policy P (Y |X) is learned,we do not need the reward function (or any QApairs) at test time to generate generic summaries.Instead we choose yt that yields the highest prob-ability yt = argmax P (yt|y1:t−1, X).

4 Experiments

All training, validation, and testing was performed using the CNN dataset (Hermann et al., 2015; Nallapati et al., 2016) containing news articles paired with human-written highlights (i.e., abstracts). We observe that a source article contains 29.8 sentences and an abstract contains 3.54 sentences on average. The train/valid/test splits contain 90,266, 1,220, and 1,093 articles, respectively.

4.1 Hyperparameters

The hyperparameters, tuned on the validation set, include the following: the hidden state size of the Bi-LSTM is 256; the hidden state size of the single-direction LSTM encoder is 30. The dropout rate (Srivastava, 2013), used twice in the sampling component, is set to 0.2. The minibatch size is set to 256. We apply early stopping on the validation set, where the maximum number of epochs is set to 50. Our source vocabulary contains 150K words; words not in the vocabulary are replaced by the 〈unk〉 token. We use 100-dimensional word embeddings, initialized by GloVe (Pennington et al., 2014) and kept trainable. We set β = 2α and select the best α ∈ {10, 20, 50} and γ ∈ {5, 6, 7, 8} using the validation set. The maximum length of the input is set to 100 words; δ is set to 0.4 (≈40 words). We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1e-4, and we halve the learning rate if the objective worsens beyond a threshold (> 10%). As mentioned, we used a bigram-based pretraining method; we found that this stabilized the training of the full model.

4.2 Results

We compare our methods with state-of-the-art published systems, including both extractive and abstractive approaches (their details are summarized below). We experiment with two variants of our approach: "EntityQ" uses QA pairs whose answers are named entities, and "KeywordQ" uses pairs whose answers are sentence root words. According to the R-1, R-2, and R-L scores (Lin, 2004) presented in Table 2, both methods are superior to the baseline systems on the benchmark dataset, yielding 11.5 and 11.6 R-2 F-scores, respectively.

• LSA (Steinberger and Jezek, 2004) uses the latent semantic analysis technique to identify semantically important sentences.

• LexRank (Erkan and Radev, 2004) is a graph-based approach that computes sentence importance based on the concept of eigenvector centrality in a graph representation of the source sentences.

• TextRank (Mihalcea and Tarau, 2004) is an unsupervised graph-based ranking algorithm inspired by the PageRank and HITS algorithms.

• SumBasic (Vanderwende et al., 2007) is an extractive approach that assumes words occurring frequently in a document cluster have a higher chance of being included in the summary.

• KL-Sum (Haghighi and Vanderwende, 2009) describes a method that greedily adds sentences to the summary so long as it decreases the KL divergence.

• Distraction-M3 (Chen et al., 2016b) trains the summarization model not only to attend to specific regions of the input documents, but also to distract the attention to traverse different content of the source document.

• Pointer-Generator (See et al., 2017) allows the system to not only copy words from the source text via pointing but also generate novel words through the generator.

• Graph-based Attention (Tan et al., 2017) introduces a graph-based attention mechanism to enhance the encoder-decoder framework.

                    K=1    K=2    K=3    K=4    K=5
# Uniq Entities     23.7K  37.0K  46.1K  50.3K  50.3K
Train Acc (%)       46.1   37.2   34.2   33.6   34.8
Valid Acc (%)       12.8   14.0   14.7   15.7   15.4
Valid R-2 F (%)     11.2   11.1   11.2   11.1   10.8

# Uniq Keywords     7.3K   10.4K  12.5K  13.7K  13.7K
Train Acc (%)       30.5   28.2   27.6   27.5   27.5
Valid Acc (%)       19.3   22.5   22.2   23.0   21.9
Valid R-2 F (%)     11.2   11.1   10.8   11.0   10.8

Table 3: Train/valid accuracy and R-2 F-scores when using varying numbers of QA pairs (K=1 to 5) in the reward function.

In Table 3, we vary the number of QA pairs used per article in the reward function (K=1 to 5). The summaries are encouraged to contain comprehensive content useful for answering all questions. When more QA pairs are used (K1→K5), we observe that the number of answer tokens increases and almost doubles: 23.7K (K1)→50.3K (K5) for entities as answers, and 7.3K→13.7K for keywords. The enlarged answer space has an impact on QA accuracies. When using entities as answers, the training accuracy is 34.8% (K5) and the validation accuracy is 15.4% (K5), and there appears to be a considerable gap between the two. In contrast, the gap is quite small when using keywords as answers (27.5% and 21.9% for K5), suggesting that using sentence root words as answers is a more viable strategy for creating QA pairs.

Compared to QA studies (Chen et al., 2016a), we remove the constraint that requires answer entities (or keywords) to reside in the source documents. Adding this constraint improves the QA accuracy for a standard QA system. However, because our system does not perform QA during testing (the question-answer pairs are not available for the test set) but only generates generic summaries, we do not enforce this requirement and report no test accuracies. We observe that the R-2 scores only present minor changes from K1 to K5. We conjecture that more question-answer pairs do not make the summaries contain more comprehensive content because the input and the summary are relatively short; K=1 yields the best results.

In Table 4, we present example system and reference summaries. Our extractive summaries can be overlaid with the source documents to assist people with browsing through the documents. In this way the summaries stay true to the original and do not contain information that was not in the source documents.

Source Document
It was all set for a fairytale ending for record breaking jockey AP McCoy. In the end it was a different but familiar name who won the Grand National on Saturday.
25-1 outsider Many Clouds, who had shown little form going into the race, won by a length and a half, ridden by jockey Leighton Aspell. Aspell won last year's Grand National too, making him the first jockey since the 1950s to ride back-to-back winners on different horses.
"It feels wonderful, I asked big questions," Aspell said...

Abstract
25-1 shot Many Clouds wins Grand National
Second win in a row for jockey Leighton Aspell
First jockey to win two in a row on different horses since 1950s

Table 4: Example system summary and human abstract. The summary words are shown in bold in the source document.

Future work. We are interested in investigating approaches that automatically group selected summary segments into clusters. Each cluster can capture a unique aspect of the document, and clusters of text segments can be color-highlighted. Inspired by the recent work of Narayan et al. (2018), we are also interested in conducting a usability study to test how well the summary highlights can help users quickly answer key questions about the documents. This will provide an alternative strategy for evaluating our proposed method against both extractive and abstractive baselines.

5 Conclusion

In this paper we explore a new training paradigm for extractive summarization. Our system converts human abstracts to a set of question-answer pairs. We use reinforcement learning to explore the space of extractive summaries and promote summaries that are concise, fluent, and adequate for answering questions. Results show that our approach is effective, surpassing state-of-the-art systems.

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions. This work is in part supported by an unrestricted gift from Bosch Research. Kristjan Arumae gratefully acknowledges a travel grant provided by the National Science Foundation.

References

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016a. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016b. Distraction-based neural networks for document summarization. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI).
Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL.
Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proceedings of the Text Analysis Conference.
Hal Daume III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL).
Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.
Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the NAACL Workshop on Integer Linear Programming for Natural Language Processing.
Jiatao Gu, Daniel Jiwoong Im, and Victor O.K. Li. 2018. Neural machine translation with gumbel-greedy decoding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI).
Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Neural Information Processing Systems (NIPS).
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).



Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of EMNLP.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval.
Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Kaiqiang Song and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the International Conference on Computational Linguistics (COLING).
Nitish Srivastava. 2013. Improving Neural Networks with Dropout. Master's thesis, University of Toronto, Toronto, Canada.
Josef Steinberger and Karel Jezek. 2004. Using latent semantic analysis in text summarization and summary evaluation. In Proceedings of ISIM.
Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management 43(6):1606–1618.
Jason Weston, Antoine Bordes, Sumit Chopra, Sasha Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. 2016. Towards AI-complete question answering: A set of prerequisite toy tasks. In Proceedings of the International Conference on Learning Representations (ICLR).
Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).



Proceedings of ACL 2018, Student Research Workshop, pages 112–119, Melbourne, Australia, July 15–20, 2018. ©2018 Association for Computational Linguistics

Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models

Satoru Katsumata, Yukio Matsumura∗, Hayahide Yamagishi∗ and Mamoru Komachi
Tokyo Metropolitan University

{katsumata-satoru, matsumura-yukio, yamagishi-hayahide}@ed.tmu.ac.jp,[email protected]

Abstract

Encoder-decoder models typically only employ words that are frequently used in the training corpus to reduce the computational costs and exclude noise. However, this vocabulary set may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting more suitable words for learning encoders by utilizing not only frequency but also co-occurrence information, which we capture using the HITS algorithm. We apply our proposed method to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, this method achieves a BLEU score that is 0.56 points more than that of a baseline. Furthermore, it outperforms the baseline method for English grammatical error correction, with an F0.5-measure that is 1.48 points higher.

1 Introduction

Encoder-decoder models (Sutskever et al., 2014) are effective in tasks such as machine translation (Cho et al., 2014; Bahdanau et al., 2015) and grammatical error correction (Yuan and Briscoe, 2016). Vocabulary in encoder-decoder models is generally selected from the training corpus in descending order of frequency, and low-frequency words are replaced with an unknown word token <unk>. These so-called out-of-vocabulary (OOV) words are replaced with <unk> so as not to increase the decoder's complexity and to reduce noise. However, naive frequency-based OOV replacement may lead to loss of information that is necessary for modeling context in the encoder.

∗Both authors equally contributed to the paper.

This study hypothesizes that vocabulary constructed using unigram frequency includes words that interfere with learning in encoder-decoder models. That is, we presume that vocabulary selection that considers co-occurrence information selects fewer noisy words, allowing robust encoders to be learned in encoder-decoder models. We apply the hyperlink-induced topic search (HITS) algorithm to extract the co-occurrence relations between words. Intuitively, removing words that rarely co-occur with others yields better encoder models than including noisy low-frequency words.

This study examines two tasks, machine translation (MT) and grammatical error correction (GEC), to confirm the effect of decreasing noisy words, focusing on the vocabulary of the encoder side because the vocabulary on the decoder side is relatively limited. In a Japanese-to-English MT experiment, our method achieves a BLEU score that is 0.56 points higher than that of the frequency-based method. Further, it outperforms the frequency-based method for English GEC, with an F0.5-measure that is 1.48 points higher.

The main contributions of this study are as follows:

1. The simple but effective preprocessing method we propose for vocabulary selection improves encoder-decoder model performance.

2. This study is the first to address noise reduction in the source text of encoder-decoder models.

2 Related Work

There is currently a growing interest in applying neural models to MT (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Wu et al., 2016) and GEC (Yuan and Briscoe, 2016; Xie et al., 2016; Ji et al., 2017); hence, this study focuses on improving the simple attentional encoder-decoder models that are applied to these tasks.

In the investigation of vocabulary restriction in neural models, Sennrich et al. (2016) applied byte pair encoding to words and created a partial character string set that could express all the words in the training data. They increased the number of words included in the vocabulary to enable the encoder-decoder model to robustly learn contextual information. In contrast, we aim to improve neural models by using vocabulary that is appropriate for a training corpus, not by increasing their vocabulary.

Jean et al. (2015) proposed a method of replacing and copying an unknown word token using a bilingual dictionary in neural MT. They automatically constructed a translation dictionary from a training corpus using a word-alignment model (GIZA++), which finds a corresponding source word for each unknown target word token. They then replaced the unknown word token with the translation of that source word given by the bilingual dictionary. Yuan and Briscoe (2016) used a similar method for neural GEC. Because our proposed method is performed as preprocessing, it can be used simultaneously with this replace-and-copy method.

Algorithms that rank words using co-occurrence are employed in many natural language processing tasks. For example, TextRank (Mihalcea and Tarau, 2004) uses PageRank (Brin and Page, 1998) for keyword extraction. TextRank constructs a word graph in which nodes represent words and edges represent co-occurrences between words within a fixed window; TextRank then executes the PageRank algorithm to extract keywords. Although this is an unsupervised method, it achieves nearly the same precision as a state-of-the-art supervised method (Hulth, 2003). Kiso et al. (2011) used HITS (Kleinberg, 1999) to select seeds and create a stop list for bootstrapping in natural language processing. They reported significant improvements over a baseline method using unigram frequency. Their graph-based algorithm was effective at extracting the relevance between words, which cannot be grasped with a simple unigram frequency.

Algorithm 1 HITS
Require: hubness vector i0, adjacency matrix A, iteration number τ
Ensure: hubness vector i, authority vector p
1: function HITS(i0, A, τ)
2:   i ← i0
3:   for t = 1, 2, ..., τ do
4:     p ← A^T i
5:     i ← A p
6:     normalize i and p
7:   return i and p
8: end function

In this study, we use HITS to retrieve co-occurring words from a training corpus in order to reduce noise in the source text.

3 Graph-based Filtering of OOV Words

3.1 Hubness and authority scores from HITS

HITS, a web page ranking algorithm proposed by Kleinberg (1999), computes hubness and authority scores for a web page (node) using the adjacency matrix that represents the web pages' link (edge) transitions. A web page with a high authority score is linked from pages with high hubness scores, and a web page with a high hubness score links to pages with high authority scores. Algorithm 1 shows pseudocode for the HITS algorithm. Hubness and authority scores converge by setting the iteration number τ to a sufficiently large value.
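The power iteration in Algorithm 1 is straightforward to implement. Below is a minimal sketch in Python/NumPy assuming a dense adjacency matrix; the function name and the dense-matrix representation are illustrative choices, not the authors' actual implementation.

```python
import numpy as np

def hits(A, num_iters=300):
    """Power iteration for HITS (Algorithm 1).

    A[x, y] is the (possibly weighted) edge from node x to node y.
    Returns the hubness vector i and the authority vector p.
    """
    i = np.ones(A.shape[0])        # initial hubness vector i0
    p = np.zeros(A.shape[0])
    for _ in range(num_iters):
        p = A.T @ i                # authorities are pointed to by good hubs
        i = A @ p                  # hubs point to good authorities
        i /= np.linalg.norm(i)     # normalize i and p
        p /= np.linalg.norm(p)
    return i, p
```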

3.2 Vocabulary selection using HITS

In this study, we create an adjacency matrix from a training corpus by considering a word as a node and the co-occurrence between words as an edge. Unlike web pages, co-occurrence between words is nonbinary; therefore, several co-occurrence measures can be used as edge weights. Section 3.3 describes the co-occurrence measures and the context in which co-occurrence is defined.

The HITS algorithm is executed using the adjacency matrix created as described above. As a result, we obtain a score indicating the importance of each word while taking contextual information in the training corpus into account.

Figure 1 shows a word graph example. A word that obtains a high score in the HITS algorithm is considered to co-occur with a variety of words. Figure 1 demonstrates that second-order co-occurrence scores (the scores of words co-occurring with words that co-occur with various words (Schutze, 1998)) are also high.



Figure 1: An example word graph created for ten sentences in the training corpus used for GEC.¹

In this study, words with high hubness scores are considered to co-occur with important words, and low-scoring words are excluded from the vocabulary. This method appears to generate a vocabulary that includes words that are more suitable for representing a context vector for encoder models.
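As an illustration of this selection step, the sketch below keeps the top-V words by hubness score and maps everything else to <unk> on the source side. The hubness vector is assumed to have been computed as in the HITS sketch above; the function and argument names are hypothetical.

```python
import numpy as np

def filter_by_hubness(words, hub_scores, vocab_size, sentences, unk="<unk>"):
    """Keep the vocab_size words with the highest hubness scores and
    replace every other word with the unknown token.

    words[k] labels row/column k of the co-occurrence adjacency matrix
    (symmetric here, so hubness equals authority).
    """
    order = np.argsort(-np.asarray(hub_scores))
    keep = {words[k] for k in order[:vocab_size]}
    return [[w if w in keep else unk for w in sent] for sent in sentences]
```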

3.3 Word graph construction

To acquire co-occurrence relations, we use combinations of each word with its peripheral words. Specifically, we combine the target word with surrounding words within window width N and count the occurrences. When the context is defined in this way, the adjacency matrix becomes symmetric, so the same hubness and authority scores are obtained. Figure 2 shows an example of co-occurrence in which N is set to two.
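A small sketch of this counting step is given below, assuming that a "combination" is a pair of words within distance N of each other and that counts are kept symmetric; this is one plausible reading of the context definition, not the authors' exact code.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count symmetric co-occurrences of words within the given window."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(i + 1, min(i + window + 1, len(sent))):
                counts[(w, sent[j])] += 1
                counts[(sent[j], w)] += 1  # keep the matrix symmetric
    return counts
```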

We use raw co-occurrence frequency (Freq) and positive pointwise mutual information (PPMI) between words as the (x, y) element A_xy of the adjacency matrix. However, naive PPMI reacts sensitively to low-frequency words in a training corpus. To give more weight to high-frequency pairs, we weight the PMI by the logarithm of the number of co-occurrences and use PPMI based on this weighted PMI (Equation 2).

¹In this study, singleton words and their co-occurrences are excluded from the graph.

Figure 2: An example of co-occurrence context.

\[ A^{\mathrm{freq}}_{xy} = |x, y| \tag{1} \]

\[ A^{\mathrm{ppmi}}_{xy} = \max\bigl(0,\ \mathrm{pmi}(x, y) + \log_2 |x, y|\bigr) \tag{2} \]

Equation 3 defines the PMI of target word x and co-occurrence word y, where M is the total number of token combinations, and |x, ∗| and |∗, y| are the numbers of token combinations when fixing target word x and co-occurrence word y, respectively.

\[ \mathrm{pmi}(x, y) = \log_2 \frac{M \cdot |x, y|}{|x, \ast|\,|\ast, y|} \tag{3} \]
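The sketch below builds the weighted-PPMI adjacency matrix of Equation 2 from a dense matrix of co-occurrence counts; it is a minimal NumPy rendering of Equations 1–3 under the assumption that C[x, y] = |x, y|, and the function name is illustrative.

```python
import numpy as np

def ppmi_adjacency(C):
    """Weighted PPMI adjacency matrix (Equation 2) from co-occurrence counts C."""
    M = C.sum()                                  # total number of combinations
    row = C.sum(axis=1, keepdims=True)           # |x, *|
    col = C.sum(axis=0, keepdims=True)           # |*, y|
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(M * C / (row * col))       # Equation 3
        weighted = pmi + np.log2(C)              # PMI weighted by log co-occurrence
    # Entries with zero co-occurrence get weight zero (the max(0, .) clamp).
    return np.where(C > 0, np.maximum(0.0, weighted), 0.0)
```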

4 Machine Translation

4.1 Experimental setting

In the first experiment, we conduct Japanese-to-English translation using the Asian Scientific Paper Excerpt Corpus (ASPEC; Nakazawa et al., 2016). We follow the official split of the train, development, and test sets. As training data, we use only the first 1.5 million sentences sorted by sentence alignment confidence to obtain a Japanese–English parallel corpus (sentences of more than 60 words are excluded). Our training set consists of 1,456,278 sentences, the development set of 1,790 sentences, and the test set of 1,812 sentences. The training set has 247,281 Japanese word types and 476,608 English word types.

The co-occurrence window width N is set to two. For combinations that co-occurred only once within the training corpus, we set the value of element A_xy of the adjacency matrix to zero. The iteration number τ of the HITS algorithm is set to 300. As mentioned in Section 1, we only use the proposed method on the encoder side.

For this study’s neural MT model2, we imple-ment global dot attention (Luong et al., 2015). Wetrain a baseline model that uses vocabulary that isdetermined by its frequency in the training corpus.Vocabulary size is set to 100K on the encoder sideand 50K on the decoder side. Additionally, we

2https://github.com/yukio326/nmt-chainer



              baseline   HITS (Freq)   HITS (PPMI)
BLEU (50K)    22.24      -             22.40
BLEU (100K)   22.21      22.25         22.77
p-value       -          0.35          0.01

Table 1: BLEU scores for Japanese-to-English translation.³ The parentheses indicate the vocabulary size of the encoder.

         COMMON outputs         DIFF outputs
         baseline    PPMI       baseline    PPMI
BLEU     22.33       22.98      21.44       21.98

Table 2: BLEU scores of the COMMON and DIFF outputs.

Additionally, we conduct an experiment varying the encoder vocabulary size to 50K for the baseline and PPMI to investigate the effect of vocabulary size. Unless otherwise noted, we analyze the model using the 100K vocabulary size. The number of dimensions for each of the hidden and embedding layers is 512. The mini-batch size is 150. AdaGrad is used as the optimization method with an initial learning rate of 0.01. Dropout is applied with a probability of 0.2.

For this experiment, a bilingual dictionary is prepared for postprocessing unknown words (Jean et al., 2015). When the model outputs an unknown word token, the word with the highest attention score is used as a query to replace the unknown token with the corresponding word from the dictionary. If the word is not in the dictionary, we replace the unknown word token with the source word (unk rep). This dictionary is created based on word alignments obtained using fast align (Dyer et al., 2013) on the training corpus.
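The replace-and-copy postprocessing can be sketched as follows; attention is assumed to hold one weight vector per output token (e.g. a NumPy array) and the dictionary maps source words to target words. Names are illustrative only.

```python
def replace_unknowns(output_tokens, attention, src_tokens, dictionary, unk="<unk>"):
    """For each unknown output token, look up the source word with the
    highest attention weight and emit its dictionary translation,
    falling back to copying the source word itself."""
    result = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            src_word = src_tokens[int(attention[t].argmax())]
            result.append(dictionary.get(src_word, src_word))
        else:
            result.append(token)
    return result
```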

We evaluate translation results using BLEU scores (Papineni et al., 2002).

4.2 Results

Table 1 shows the translation accuracy (BLEU scores) and the p-value of a significance test (p < 0.05) by bootstrap resampling (Koehn, 2004). The PPMI model improves translation accuracy by 0.56 points in Japanese-to-English translation, which is a significant improvement.

Next, we examine differences in vocabulary by comparing each model with the baseline. Compared to the baseline vocabulary in the 100K setting, Freq and PPMI replace 16,107 and 17,166 types, respectively; compared to the baseline vocabulary in the 50K setting, PPMI replaces 4,791 types.

³The BLEU score for postprocessing (unk rep) improves by 0.46, 0.44, and 0.46 points for the baseline, Freq, and PPMI, respectively.

4.3 Analysis

According to Table 1, the performance of Freq is almost the same as that of the baseline. When examining the differences in the selected vocabulary between PPMI and Freq, we find that PPMI selects more low-frequency words from the training corpus than Freq, because PPMI considers not only frequency but also co-occurrence.

The effect of unk rep is almost the same in the baseline as in the proposed method, which indicates that the proposed method can be combined with other schemes as a preprocessing step.

Comparing vocabulary sizes of 50K and 100K, the BLEU score at 100K is higher than at 50K for PPMI, whereas the baseline scores are almost the same. We suppose that the larger the encoder vocabulary, the more noisy words the baseline includes, while PPMI filters these words out. This is why the proposed method works well when the vocabulary size is large.

To examine the effect of changing the vocabulary on the source side, the test set is divided into two subsets: COMMON and DIFF. The former (1,484 sentences) consists of only the common vocabulary between the baseline and PPMI, whereas the latter (328 sentences) includes at least one word excluded from the common vocabulary.

Table 2 shows the translation accuracy of the COMMON and DIFF outputs. Translation performance improves on both subsets.

In order to observe how PPMI improves the COMMON outputs, we measure the similarity of the baseline and PPMI output sentences by counting exactly matching sentences. In the COMMON outputs, 72 sentence pairs (4.85%) are the same, whereas 9 sentence pairs are the same in the DIFF outputs (2.74%). Surprisingly, even though it uses the same vocabulary, PPMI often outputs different but fluent sentences.

Table 3 shows an example of Japanese-to-English translation. The outputs of the proposed method (especially PPMI) are improved, despite the source sentence being expressed with common vocabulary; this is because the proposed method yields a better encoder model than the baseline.



src       有用物質の分離・抽出 , 反応性向上 , 新材料創製 , 廃棄物処理 , 分析等の分野がある。
baseline  there are fields such as separation , extraction , extraction , improvement of new material creation , waste treatment , analysis , etc .
Freq      there are separation and extraction of useful substances , the improvement of reactivity , new material creation , waste treatment and analysis .
PPMI      there are the fields such as separation and extraction of useful materials , the reaction improvement , new material creation , waste treatment , analysis , etc ...
ref       the application fields are separation and extraction of useful substances , reactivity improvement , creation of new products , waste treatment , and chemical analysis .

Table 3: An example of Japanese-to-English translation on a source sentence from COMMON.

            baseline              PPMI
            50K        150K       50K        150K
Precision   48.09      46.53      49.45      49.23
Recall      8.30       8.50       8.61       9.02
F0.5        24.55      24.55      25.37      26.03

Table 4: F0.5 results on the CoNLL-14 test set.⁴

         COMMON outputs         DIFF outputs
         baseline    PPMI       baseline    PPMI
P        48.26       60.07      9.40        17.32
R        0.01        0.01       0.01        0.02
F0.5     0.04        0.04       0.04        0.08

Table 5: F0.5 of the COMMON and DIFF outputs.

5 Grammatical Error Correction

5.1 Experimental setting

The second experiment addresses GEC. We combine the FCE public dataset (Yannakoudakis et al., 2011), the NUCLE corpus (Dahlmeier et al., 2013), and the English learner corpus from the Lang-8 learner corpus (Mizumoto et al., 2011), and remove sentences longer than 100 words to create a training corpus. From the Lang-8 learner corpus, we use only the pairs of erroneous and corrected sentences. We use 1,452,584 sentences as a training set (502,908 types on the encoder side and 639,574 types on the decoder side). We evaluate the models' performance on the standard sets from the CoNLL-14 shared task (Ng et al., 2014), using the CoNLL-13 data as a development set (1,381 sentences) and the CoNLL-14 data as a test set (1,312 sentences)⁴. We employ F0.5 as the evaluation measure for the CoNLL-14 shared task.

We use the same model as in Section 4.1 as the neural model for GEC. The model's parameter settings are similar to the MT experiment, except for the vocabulary and batch sizes. In this experiment, we set the vocabulary size on the encoder and decoder sides to 150K and 50K, respectively.

⁴We do not consider alternative answers suggested by the participating teams.

src       Genetic refers the chance of inheriting a disorder or disease .
baseline  Genetic refers the chance of inheriting a disorder or disease .
PPMI      Genetic refers to the chance of inheriting a disorder or disease .
gold      Genetic risk refers to the chance of inheriting a disorder or disease .

Table 6: An example of GEC using a source sentence from COMMON.

Additionally, we conduct an experiment changing the encoder vocabulary size to 50K to investigate the effect of vocabulary size. Unless otherwise noted, we analyze the model with a vocabulary size of 150K. The mini-batch size is 100.

5.2 Results

Table 4 shows the performance of the baseline and the proposed method. The PPMI model improves precision and recall; it achieves an F0.5-measure 1.48 points higher than the baseline method.

With the encoder vocabulary size set to 150K, PPMI replaces 37,185 types from the baseline; in the 50K setting, PPMI replaces 10,203 types.

5.3 Analysis

The F0.5 of the baseline stays almost the same when the vocabulary size increases, while the PPMI model improves its score. As in MT, we suppose that PPMI filters out noisy words.

As in Section 4.3, we perform a follow-up experiment using two data subsets, COMMON and DIFF, which contain 1,072 and 240 sentences, respectively.

Table 5 shows the error correction accuracy of the COMMON and DIFF outputs. Precision increases by 11.81 points, whereas recall remains the same for the COMMON outputs.

In GEC, approximately 20% of COMMON's output pairs differ, which is caused by differences in the training environment.



              MT                     GEC
              baseline    PPMI       baseline    PPMI
tokens        52,700      27,364     126,884     70,003
Ave. tokens   3.07        1.59       3.36        1.85

Table 7: Number of words included only in either the baseline or PPMI vocabulary.

Unlike MT, we can copy an OOV word in the target sentence from the source sentence without loss of fluency; therefore, our model has little effect on recall, whereas its precision improves because of noise reduction.

Table 6 shows an example of GEC. The proposed method's output improves when the source sentence is expressed using common vocabulary.

6 Discussion

We have described how the proposed method has a positive effect on learning the encoder. However, a question remains: what affects the performance? We analyze this question in this section.

First, we count the occurrences in the training corpus of the words included only in the baseline or only in the PPMI vocabulary. We also show the number of tokens per type ("Ave. tokens") for words included only in either vocabulary.

The result is shown in Table 7. We find that the proposed method uses low-frequency words instead of high-frequency words from the training corpus. This suggests that the proposed method works well despite the fact that its encoder encounters more <unk> tokens than the baseline, because it excludes words that may interfere with the learning of encoder-decoder models.

Second, we analyze the POS of the words in GEC to find out why increasing OOV improves the learning of encoder-decoder models. Specifically, we apply POS tagging to the training corpus and calculate the occurrences of the POS of the words only included in the baseline or PPMI vocabulary. We use NLTK as the POS tagger.

Table 8 shows the result. NOUN is the POS most affected by the proposed method, and such words often become represented by <unk>. NOUN words in the baseline vocabulary contain some non-English words, such as Japanese or Korean words. These words should be treated as OOV, but the baseline fails to exclude them using only frequency.

POS      baseline   PPMI      ALL
NOUN     92,693     44,472    4,644,478
VERB     11,066     10,099    3,597,895
PRON     127        107       1,869,422
ADP      626        685       1,836,193
DET      128        202       1,473,391
ADJ      13,855     12,270    1,429,056
ADV      2,032      1,688     931,763
PRT      319        75        615,817
CONJ     62         28        537,346
PUNCT    110        11        223,573
NUM      5,585      299       207,487
OTHER    281        67        5,209
Total    126,884    70,003    17,371,630

Table 8: Number of words of each POS included only in the baseline or PPMI vocabulary.

According to Table 8, NUM is also affected by the proposed method. NUM words in the baseline include simple numerals such as "119" as well as incorrectly segmented numerals such as "514&objID"; the latter appears 25 times in the training corpus owing to the noisy nature of Lang-8. We suppose that the proposed method excludes such noisy words and thus has a positive effect on training.

7 Conclusion

In this paper, we proposed an OOV filtering method that considers word co-occurrence information for encoder-decoder models. Unlike conventional OOV handling, this graph-based method selects words that are more suitable for learning encoder models by considering contextual information. The method is effective not only for machine translation but also for grammatical error correction.

This study employed a symmetric matrix (similar to skip-gram with negative sampling) to express relationships between words. In future research, we will develop this method using vocabulary obtained by designing an asymmetric matrix that incorporates syntactic relations.

Acknowledgments

We thank Yangyang Xi of Lang-8, Inc. for allowing us to use the Lang-8 learner corpus. We also thank Masahiro Kaneko and the anonymous reviewers for their insightful comments. We thank Roam Analytics for a travel grant to support the presentation of this paper.



References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1-7):107–117. https://doi.org/10.1016/S0169-7552(98)00110-X.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of EMNLP, pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.
Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proc. of BEA, pages 22–31. http://www.aclweb.org/anthology/W13-1703.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proc. of NAACL-HLT, pages 644–648. http://www.aclweb.org/anthology/N13-1073.
Anette Hulth. 2003. Improved automatic keyword extraction given more linguistic knowledge. In Proc. of EMNLP, pages 216–223. http://www.aclweb.org/anthology/W03-1028.
Sebastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT15. In Proc. of WMT, pages 134–140. http://aclweb.org/anthology/W15-3014.
Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and Jianfeng Gao. 2017. A nested attention neural hybrid model for grammatical error correction. In Proc. of ACL, pages 753–762. http://aclweb.org/anthology/P17-1070.
Tetsuo Kiso, Masashi Shimbo, Mamoru Komachi, and Yuji Matsumoto. 2011. HITS-based seed selection and stop list construction for bootstrapping. In Proc. of ACL, pages 30–36. http://www.aclweb.org/anthology/P11-2006.
Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46(5):604–632. https://doi.org/10.1145/324133.324140.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. of EMNLP, pages 388–395. http://www.aclweb.org/anthology/W04-3250.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP, pages 1412–1421. http://aclweb.org/anthology/D15-1166.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proc. of EMNLP, pages 404–411. http://www.aclweb.org/anthology/W04-3252.
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proc. of IJCNLP, pages 147–155. http://www.aclweb.org/anthology/I11-1017.
Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian scientific paper excerpt corpus. In Proc. of LREC, pages 2204–2208.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proc. of CoNLL Shared Task, pages 1–14.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318. http://www.aclweb.org/anthology/P02-1040.
Hinrich Schutze. 1998. Automatic word sense discrimination. Computational Linguistics 24(1):97–123. http://dl.acm.org/citation.cfm?id=972719.972724.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL, pages 1715–1725. http://www.aclweb.org/anthology/P16-1162.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proc. of NIPS, pages 3104–3112.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y. Ng. 2016. Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727.
Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proc. of ACL-HLT, pages 180–189. http://www.aclweb.org/anthology/P11-1019.



Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proc. of NAACL-HLT, pages 380–386. http://www.aclweb.org/anthology/N16-1042.



Proceedings of ACL 2018, Student Research Workshop, pages 120–126, Melbourne, Australia, July 15–20, 2018. ©2018 Association for Computational Linguistics

Exploring Chunk Based Templates for Generating a subset of English Text

Nikhilesh Bhatnagar, LTRC, IIIT Hyderabad

Manish Shrivastava, LTRC, IIIT Hyderabad

Radhika Mamidi, LTRC, IIIT Hyderabad

Abstract

Natural Language Generation (NLG) is a research task which addresses the automatic generation of natural language text representative of an input non-linguistic collection of knowledge. In this paper, we address the task of generating grammatical sentences in an isolated context given a partial bag-of-words which the generated sentence must contain. We view the task as a search problem (a problem of choice) involving combinations of smaller chunk-based templates extracted from a training corpus to construct a complete sentence. To achieve that, we propose a fitness function which we use in conjunction with an evolutionary algorithm as the search procedure to arrive at a potentially grammatical sentence (modeled by the fitness score) which satisfies the input constraints.

1 Introduction

One of the reasons why NLG is a challenging problem is that there are many ways in which a given content can be represented. These are represented by stylistic constraints which address syntactic and pragmatic choices (largely) independent of the information conveyed.

Classically, there are two major subtasks recognized in NLG: Strategic Generation and Tactical Generation (Sentence Planning and Surface Realization)¹ (Reiter and Dale, 2000). Strategic Generation ("what to say") deals with identifying the relevant information to present to the audience, and Tactical Generation ("how to say") addresses the problems of linguistic representation of the input concepts.

¹Because we follow a template-based approach, there is some overlap between the Content Determination and Aggregation steps.

In this work, we address the problem of tactical generation, with a focus on the grammaticality of the generated sentences. We formulate our task as follows: to generate syntactically correct sentences given a set of constraints such as a bag-of-words, partial ordering, etc.

So, for example, given a bag of words such as "man", "plays", "football" and length constraints, a sentence like "The man plays football in October." would be acceptable.

Our approach involves a corpus-derived formulation of template-based generation. Templates are instances of canned text with a slot-filler structure ("gaps") which can be filled with the appropriate information, thus realizing the sentence. Since they are a manual resource, they are rather expensive to build and hard to generalize over different types or domains of text.

Thus, it is desirable to be able to automatically extract templates from a corpus. Also, to increase the syntactic coverage, we use sub-sentence level (smaller) templates to generate a sentence.

2 Background and Related Work

Traditionally, template-based systems are used in scenarios where the output text is structurally very well defined and/or requires very high quality output text with little variance. This work is inspired by Van Deemter et al. (2005), who point out that template-based systems and "real" NLG systems are "Turing equivalent", meaning that at least in terms of expressiveness there is no theoretical disparity between the two. Rudnicky and Oh (2002) use language models to generate text. In recent years, Kondadadi et al. (2013) presented a hybrid NLG system which generates text by ranking tagged clusters of templates. NaturalOWL (Galanis and Androutsopoulos, 2007) uses templates for its sentence plans and rules to manipulate them.



Figure 1: System architecture


3 Core Assumptions and Motivation

As an extension of our previous work (Bhatnagar and Mamidi, 2016), in this work we adopt a simplistic model of language: a sequence of linguistic units. Given a set of such units, each possible sentence is contained in the search space of all possible permutations. Thus, given a grammaticality fitness function, pruning and vocabulary reduction are essential to be able to tractably search this space.

Since templates are based on canned text, templates are locally grammatical. The core idea is to effectively use the local grammaticality guarantee of a corpus-extracted template to combine, rather than construct, the component templates to generate a sentence. These templates themselves contain individual tokens, effectively making templates, instead of tokens, the unit of sentence construction. The obvious trade-off is that there are many more templates than tokens, which results in a much larger search space. However, if appropriate abstractions are used, the vocabulary problem can perhaps be mitigated somewhat.

The task is then twofold: determining which templates to combine (pruning) and constraining that choice with a measure of grammaticality of the generated sentence (fitness).

4 Templates and Sentence Representation

We use chunks as the basis for the linear templates because chunks are linguistically sound and hence the vocabulary increase is smaller compared to unconstrained spans of text. In general, an extended feature has syntactic identifiers added to the feature space. This better informs the syntactic behavior of the template even though it increases the feature space a little. Abstracted features have multiple features clustered together, which reduces the feature space. We describe the extension process below.

4.1 Abstraction

The chunks should be abstracted in such a way as to have a minimal impact on their syntactic combination behavior.

Punctuation and stopwords are not abstracted because they are highly relevant syntactically. The following mappings are applied to the chunks:

1. Named Entity Abstraction (NE): Each NE is mapped to a unique symbol corresponding to its category.

2. POS Abstraction (POS): POS categories such as "CD", "FW" and "SYM" and continuous "NNP" and "NNPS" sequences are mapped to their corresponding POS tags.

3. Cluster Abstraction (WC): Each token (except punctuation and stopwords) is mapped to the cluster ID of its syntactico-semantic cluster. Since a token can have multiple POS categories, which in turn affect its syntactic behavior, we treat a token with different POS categories as distinct tokens while clustering. The clusters are obtained by computing KMeans clusters for token-POS pairs with the Euclidean distance of L2-normed vector embeddings (a sketch of this step follows the list).
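A minimal sketch of this clustering step, using scikit-learn's KMeans and illustrative names rather than the authors' exact pipeline, is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans

def token_pos_clusters(embeddings, k=7500):
    """Cluster L2-normed embeddings of (token, POS) pairs into k clusters.

    `embeddings` maps each (token, pos) pair to its vector; the returned
    dict maps the pair to its cluster ID (the WC abstraction).
    """
    keys = list(embeddings)
    X = np.array([embeddings[key] for key in keys], dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # L2-norm each vector
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    return {key: int(label) for key, label in zip(keys, labels)}
```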

4.2 Extended Categories

We extend the POS and chunk tag categories to better inform template combinations:

1. Extended POS (EPOS): Each punctuation mark, stopword and NE is assigned its own unique POS category.

2. Extended Chunk Tags (ECTag): Since the "O" (outside) chunk tag is a "default" category, it contributes a lot of syntactic confusion; we therefore assign a separate chunk tag to all chunks tagged "O" which contain sentence endings (".", ",", or "!") or brackets ("(" or ")"). In addition, chunk tags for chunks containing wh-words (POS tags "WDT", "WP", "WP$" or "WRB") are marked (e.g. "NP" becomes "WNP").



Table 1: Template Example

Chunk Feature   Chunk                         Template                   Template Feature
Chunk Tag       "NP"                          "NP"                       Extended Chunk Tag (ECTag)
Tokens          "a", "popular", "wrestler"    "a", "JJ213", "NN5266"     Tok-POS cluster (WC)
POS             "DT", "JJ", "NN"              "DT", "JJ", "NN"           Extended POS (EPOS)
Head token      "wrestler"                    "NN5266"                   Head Tok-POS cluster (HWC)
Head POS        "NN"                          "NN"                       Head Extended POS (HEPOS)
                                              "a" and "NN5266"           Junction Tok-POS clusters (WCJ)
                                              "DT", "NN"                 Junction Extended POS (EPOSJ)
                                              "a BLANK BLANK"            "Blank" Construction Feature (BlankCo)
                                              "a JJ NN"                  Extended POS Construction Feature (EPOSCo)

4.3 Template features

A template is created after applying abstractions to a chunk and extending its syntactic categories. The template, however, has the following additional feature types:

1. Junction Features (WCJ, EPOSJ): Junction features comprise the leftmost and rightmost features of the template. These features are used to predict whether two templates "glue" together well at the point of contact (the junction). Both factors (tok-pos clusters and extended POS categories) are applicable.

2. Head Features (HWC, HEPOS): Head features represent the "external" features of a chunk, used to compute a global grammaticality component. Both factors, extended POS and tok-pos clusters, are applicable.

3. Construction Features (BlankCo, EPOSCo): Construction features represent the chunk as a syntactic construction or layout, encoding the positional combination of the components comprising it. They are constructed by creating a feature for the ordered tuple of tokens comprising the chunk, where all tokens except punctuation and stopwords are mapped to a single "blank" symbol or to their extended POS. Two factors (single "blank" and extended POS) are applicable.

Table 1 shows an example chunk and its corresponding derived template. A sentence is represented as a sequence of the templates described above.

5 Scoring the template combinations

It should be noted that this score is not equivalent to a syntactic correctness score, but rather a subset of it. This is because we are dealing with configurations of untampered templates whose local syntactic correctness still holds, so any syntactic incorrectness is a matter of their combinatorial configuration, whereas a full grammaticality score would need to deal with "broken" chunks as well.

The fitness score F is a linear combination of length-normalized total log-probabilities TP of different sequences derived from the sentence, computed using an n-gram language model. The total probability under an n-gram model is defined as:

\[ TP(s) = P(w_1 \mid \mathrm{bos})\, P(w_2 \mid w_1\, \mathrm{bos}) \cdots P(\mathrm{eos} \mid w_k w_{k-1} \ldots w_{k-n+1}) \tag{1} \]

where n is the order of the language model, s is a sequence and w_i is the i-th element of s. F is a weighted sum of the length-normalized total log-probabilities of five different sequences:

\[ nTP(s) = \log_{10}(TP(s)) / (l_s + 1) \tag{2} \]

\[ F(s) = \alpha_c\, nTP(ec) + \alpha_{co}\, nTP(co)/2 + \alpha_j\, nTP(j)/2 + \alpha_h\, nTP(h)/2 + \alpha_l\, nTP(l)/2 \tag{3} \]

where nTP(s) is the length-normalized total log-probability for the sequence s and l_s is the number of elements in that sequence. ec, co, j, h and l denote the sequences used in the chunk score, construction score, junction score, head score and lexical score, respectively. The α_i are the weights for each component score, discussed below:

1. Extended Chunk Score nTP(ec): This score is calculated using the extended chunk tags (ECTags) of the sentence. It is useful for determining the global syntactic structure.



Table 2: A sentence and the sequences used to compute the fitness

Sentence:  NP[Sand] VP[blows] PP[in] NP[the strong wind] .[.]
           NP[NN969/NN] VP[VBZ29/VBZ] PP[in/in] NP[the/the JJ347/JJ NN628/NN] .[./.]

Template Feature                      Sequence
Extended Chunk Tags (ECTags)          NP VP PP NP .
Blank Construction (BlankCo)          BLANK BLANK in the BLANK BLANK .
Extended POS Construction (EPOSCo)    NN VBZ in the JJ NN .
Tok-POS Junction (WCJ)                (bos, NN969), (NN969, VBZ29), (VBZ29, in), (in, the), (NN628, .), (., eos)
Extended POS Junction (EPJ)           (bos, NN), (NN, VBZ), (VBZ, in), (in, the), (NN, .), (., eos)
Head Tok-POS (HWC)                    NN969 VBZ29 in NN628 .
Head Extended POS (HEPOS)             NN VBZ in NN .
Tok-POS Sequence (WCs)                NN969 VBZ29 in the JJ347 NN628 .
Extended POS Sequence (EPOSs)         NN VBZ in the JJ NN .

2. Construction Score nTP(co): This score is calculated using the blank construction layout (BlankCo) and the extended POS construction layout (EPOSCo). It is useful for augmenting the overall grammaticality of the sentence as it encodes the syntactic layout of the chunk as a whole.

3. Junction Score JP(j): This score is calculated as the sum of the bigram probabilities of the left junction of the right template conditioned on the right junction of the left template, for both tok-pos clusters (WC) and extended POS (EPOS), over all junctions in the sentence. This score represents a local view of inter-template cohesiveness: it represents how well a sentence "glues" together.

4. Head Score nTP(h): This score is calculated using the head tok-pos clusters (HWC) and head extended POS (HEPOS). In contrast to the junction score, it represents a more semantic view of the chunk interactions, because the chunk head is often the primary content word in that chunk.

5. Lexical Score nTP(l): This score is computed using the tok-pos clusters (WC) and extended POS (EPOS). It represents a baseline grammaticality score on the lexical level.

A sentence and all the sequences fed to the fitness function are shown in Table 2.
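As a concrete reading of Equations 2 and 3, the sketch below combines the five component scores; each argument is assumed to be the list of per-element log10 n-gram probabilities for one sequence view of the sentence, and the default weights are the ones chosen empirically in Section 8. The function names are illustrative, not the authors' code.

```python
def norm_tp(logprobs):
    """Length-normalized total log10 probability of one sequence (Equation 2)."""
    return sum(logprobs) / (len(logprobs) + 1)

def fitness(ec, co, j, h, l, weights=(75, 10, 10, 5, 2)):
    """Fitness F(s) of Equation 3 as a weighted sum of component scores."""
    a_c, a_co, a_j, a_h, a_l = weights
    return (a_c * norm_tp(ec)
            + a_co * norm_tp(co) / 2
            + a_j * norm_tp(j) / 2
            + a_h * norm_tp(h) / 2
            + a_l * norm_tp(l) / 2)
```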

6 Parameters for pruning the search space

The search space is the space of all permutations of the templates that form the sentence. Such a search space is huge. We observe that, depending on the particular constraints described in Section 7, we can prune the permutation search space. This is done by finding a subset of the template bank that can occur at each position in the template sequence. The subsets are represented by a set of constraints which we call a search configuration. Thus, if a sentence is determined to have 5 templates, a search configuration is computed for each template slot, in accordance with the global constraints for the sentence. The specification of a search configuration is described below:

1. Length: The templates must contain exactly the specified number of tokens (tok-pos clusters).

2. Extended chunk tag (ECtag): The templates must have the specified extended chunk tag.

3. Tok-POS clusters and extended POS tags (WC-EPOS): All the specified tok-pos clusters and their corresponding extended POS tags must be present in the templates.

Note that none of the above constraints needs to be specified for a search configuration, in which case all templates can be considered to fill that position. We use memoization to make the pruning computationally feasible.

E.g., if a search configuration has a length of 5, no preference for the extended chunk tag, and (cluster("run"/NN), "NN") in the tok-pos (WC-EPOS) list, then all templates containing 5 tokens which also contain the cluster for "run" used as an "NN" are valid for that configuration.
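A sketch of this filtering step is shown below; templates are assumed to expose tokens (tok-pos clusters), epos tags and an extended chunk tag, and any constraint left as None or empty is ignored. The attribute and function names are illustrative, not the authors' data structures.

```python
def prune_templates(template_bank, length=None, ectag=None, wc_epos=()):
    """Return the subset of the template bank satisfying one search configuration."""
    valid = []
    for t in template_bank:
        if length is not None and len(t.tokens) != length:
            continue                          # wrong number of tokens
        if ectag is not None and t.ectag != ectag:
            continue                          # wrong extended chunk tag
        pairs = set(zip(t.tokens, t.epos))
        if any(p not in pairs for p in wc_epos):
            continue                          # required tok-pos/EPOS pair missing
        valid.append(t)
    return valid
```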

7 Search

To search the very large permutation space, we use a population-based search method which uses only mutation as the genetic operator for generating new solutions. Following are the components of the evolutionary search:



7.1 Population Selection

The search can be parametrized by specifying a collection of sentence level constraints which have their own individual sub-constraints. These constraints are different from the search configurations defined in Section 6: these constraints operate on the sentence level, while search configurations, which are derived from them, operate on the template level. The sentence level constraints are listed below:

1. Number of tokens: The number of tokens in the generated sentence must be in a specified range. A maximum value is required.

2. Number of templates: The number of templates in the generated sentence must be in the specified range.

3. Chunk specifications (inclusion): For each chunk specification, at least one template must be present in the sentence which follows it, and it must satisfy all the sub-constraints such as the extended chunk tag and constituent tok-pos clusters.

4. Position Chunk specifications: This is a chunk constraint like the one described above, with the additional constraint of a fixed position.

5. Ordered Chunk specifications: These are lists of chunk constraints with the additional constraint that they appear in order in the generated sentence.

6. Tok-pos specifications (inclusion): The sentence must contain the specified tok-pos clusters.

7. Ordered Tok-pos specifications: The specified tok-pos clusters must occur in the same order in the sentence.

Note that every constraint and sub-constraint can be left empty, which means, for example, that there could be no specification for the extended chunk tag of a chunk constraint. Based on these sentence level constraints, search configurations for each template position are derived.

7.1.1 Sentence level constraints to template level SearchConfigs

Based on the number of templates selected within the given range, the system distributes the specified total length between all the templates randomly. Then, it arranges the Chunks, OrderedChunks and PositionFixedChunks and distributes the token level constraints (OrderedTokPOSs and TokPOSConstraints) to the templates. At this point, all three parameters of the SearchConfigs have been minimally inferred and the search space can be pruned.

7.1.2 Sampling from the pruned space

Since we are dealing with naturally occurring text, the distribution of the templates grouped by extended chunk tags follows a Zipf curve: almost 40% of the time an "NP" is selected, while a "." almost never gets a chance, even though both are prominent in the set of templates having the same chunk tag. This drastically hampers the chances of obtaining structural variance in the constructed templates. To remedy this, we assign weights to the set of templates with a dampening exponent of 0.4, which makes the distribution more uniform yet preserves the selectional biases.

1. Dampen the Zipf distribution for the selection of the extended chunk tag.

2. Dampen the Zipf distribution for the selection of templates given an extended chunk tag.

3. Obtain the distribution for the selection of templates by multiplying the probabilities taken from the above two distributions.

This results in a dampened Zipf curve for the selection of templates, which allows for much more variance in the extended chunk tags of the generated sentences.
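One plausible rendering of this dampening step is sketched below: counts are raised to the 0.4 power and renormalized, first over extended chunk tags and then over templates within a tag. The functions are illustrative rather than the authors' exact procedure.

```python
import numpy as np

def dampened_probs(counts, exponent=0.4):
    """Dampen a Zipf-like count distribution while preserving its ordering."""
    keys = list(counts)
    weights = np.array([counts[k] for k in keys], dtype=float) ** exponent
    return dict(zip(keys, weights / weights.sum()))

def template_probs(tag_counts, template_counts_by_tag):
    """Multiply the dampened tag distribution by the dampened within-tag
    template distribution to obtain the selection distribution."""
    tag_p = dampened_probs(tag_counts)
    probs = {}
    for tag, tmpl_counts in template_counts_by_tag.items():
        for tmpl, p in dampened_probs(tmpl_counts).items():
            probs[tmpl] = tag_p[tag] * p
    return probs
```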

7.2 Mutation

To perform mutation, we randomly select a template to mutate and, using the dampened distribution obtained above, sample a template from the subspace corresponding to the search configuration of that template position.

7.3 Selection and Evolution Strategy

We use elitist selection (pick the top k organisms) with enforced variability. We enforce that at least 8% and 15% of the population have a unique blank construction layout and extended POS construction layout, respectively. Following are the steps for evolving the population:

1. Initialize the population given a set of constraints and sample a population of size 1000.

2. Mutate the first half of the population and assign it to the second half.

3. Compute fitnesses for all organisms and sort the population based on fitness score.



4. Retain organisms such that the variability constraints are met.

5. Re-initialize 25% of the population so that different search configurations are searched.

6. Repeat steps 1 to 5 for 100 generations.

This evolution run gives a population consisting of grammatical configurations which adhere to the constraints given as input. There may still be word-cluster slots remaining to be filled, since the generated sentences are composed of templates which contain clusters, not word forms. We fill these cluster slots with the tokens given in the initial bag-of-words constraints. To do that, we run a random search on the best generated sentence and fill the cluster slots so as to maximize the token and head-token perplexity.
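The evolution loop of steps 1–6 can be sketched as follows; init_population, mutate, fitness and reinit are assumed callables supplied by the surrounding system, and the layout-variability constraints of step 4 are omitted for brevity. This is a skeleton under those assumptions, not the authors' implementation.

```python
def evolve(init_population, mutate, fitness, reinit,
           generations=100, reinit_frac=0.25):
    """Elitist, mutation-only evolutionary search (Section 7.3)."""
    population = init_population()                  # step 1: e.g. 1000 organisms
    for _ in range(generations):
        half = len(population) // 2
        # step 2: mutate the first half of the population into the second half
        population[half:] = [mutate(org) for org in population[:half]]
        # step 3: sort by fitness (step 4's variability constraints omitted)
        population.sort(key=fitness, reverse=True)
        # step 5: re-initialize part of the population to explore other
        # search configurations
        k = int(len(population) * reinit_frac)
        population[-k:] = [reinit() for _ in range(k)]
    return max(population, key=fitness)
```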

8 Experiment Setup

Following are the steps we took to conduct our experiments:

1. We tokenized, POS tagged, NER tagged, chunked and abstracted the English Wikipedia corpus using Stanford CoreNLP (Manning et al., 2014), spaCy (Honnibal and Montani, 2017) and LM-LSTM-CRF (Liu et al., 2018) to extract the templates.

2. The number of clusters, k, was chosen to be 7500.

3. We used lexvec (Salle et al., 2016) pretrained vectors for clustering.

4. The weights for the fitness function were empirically chosen to be 75, 10, 10, 5 and 2.

9 Results

Following are the top sentences generated with the following constraints:

1. “cat” should be present:the heated water plant is likewise formed en-tirely of cat .

2. "what" as the first token and the sentence contains a "?": what does the right thing do ?

3. "the" in position=0 and "." in sentence: the mid product can readily be used in polynomial practices .

4. "and", "ate" and "ran" in sentence: the division ate a plant of cape , ran a principal prying need and stockpiled superconductivity .

5. "on" and "in" in sentence: PERSON bumped ORG in DATE on spirits of error .

6. "influential", "large" and "red" in sentence: large influential player vocals defined red to the convenience of the detector .

Figure 2: Evolution

We show the generation improvements for the first sentence (Figure 2). As we can see, the algorithm is able to generate a question with minimal supervision. Also, in sentence 5, multiple PPs can be seen.

Some level of supervision is necessary to drive the required syntax and content words to generate predictable outputs. We observe that there is a bias in the model, e.g. usage of "the" in the starting noun phrase, and chaining verb phrases and prepositional phrases. Hence, to generate a question, one has to specify the position of the wh word; otherwise the sentences often start with a noun phrase.

Computationally, the search time increases with increasing sentence lengths. On a reasonably modern machine, our implementation generated the above sentences in about 150 seconds while using 2.2 GB of memory.2

10 Future Work

As future work, a detailed qualitative analysis and a study of the minimum constraints needed to generate specific linguistic structures can be done. Automatic extraction of content words and other relevant constraints can also be explored for generation.

11 Acknowledgements

The authors would like to thank Prof. Owen Rambow for helping gain insight into the problem.

2 The code used for this research will be made available on https://github.com/tingc9/LinearTemplatesSRW


References

Nikhilesh Bhatnagar and Radhika Mamidi. 2016. Experiments in linear template combination using genetic algorithms. 22nd Himalayan Languages Symposium.

Dimitrios Galanis and Ion Androutsopoulos. 2007. Generating multilingual descriptions from linguistically annotated OWL ontologies: the NaturalOWL system. In Proceedings of the Eleventh European Workshop on Natural Language Generation, pages 143–146. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Ravi Kondadadi, Blake Howald, and Frank Schilder. 2013. A statistical NLG framework for aggregated planning and realization. In ACL (1), pages 1406–1415.

L. Liu, J. Shang, F. Xu, X. Ren, H. Gui, J. Peng, and J. Han. 2018. Empower sequence labeling with task-aware neural language model. In AAAI.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Alexander I. Rudnicky and Alice H. Oh. 2002. Dialog annotation for stochastic generation.

Alexandre Salle, Marco Idiart, and Aline Villavicencio. 2016. Matrix factorization using window sampling and negative sampling for improved word representations. CoRR, abs/1606.00819.

Kees Van Deemter, Emiel Krahmer, and Mariet Theune. 2005. Real versus template-based natural language generation: A false opposition? Computational Linguistics, 31(1):15–24.


Proceedings of ACL 2018, Student Research Workshop, pages 127–133, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions

Eric Wallace
University of Maryland, College Park

[email protected]

Jordan Boyd-Graber
University of Maryland, College Park

[email protected]

Abstract

Modern question answering systems have been touted as approaching human performance. However, existing question answering datasets are imperfect tests. Questions are written with humans in mind, not computers, and often do not properly expose model limitations. To address this, we develop an adversarial writing setting, where humans interact with trained models and try to break them. This annotation process yields a challenge set, which despite being easy for trivia players to answer, systematically stumps automated question answering systems. Diagnosing model errors on the evaluation data provides actionable insights to explore in developing robust and generalizable question answering systems.

1 Introduction

Proponents of modern machine learning systems have claimed human parity on difficult tasks such as question answering.1 Datasets such as SQuAD and TriviaQA (Rajpurkar et al., 2016; Joshi et al., 2017) have certainly advanced the state of the art, but are they providing the right examples to measure how well machines can answer questions?

Many of the existing question answering datasets are written and evaluated with humans in mind, not computers. However, the way computers solve NLP tasks is fundamentally different from the way humans do: they train on hundreds of thousands of questions, rather than looking at small groups of them in isolation. This allows models to pick up on superficial patterns that may occur in data crawled from the internet (Chen et al., 2016) or

1 https://rajpurkar.github.io/SQuAD-explorer/

from biases in the crowd-sourced annotation process (Gururangan et al., 2018). Additionally, because existing test sets do not provide specific diagnostic information for improving models, it can be difficult to get proper insight into a system's capabilities or its limitations. Unfortunately, when rigorous evaluations are not performed, strikingly simple model limitations can be overlooked (Belinkov and Bisk, 2018; Ettinger et al., 2017; Jia and Liang, 2017).

To address this lacuna, we ask trivia enthusiasts, who write new questions for scholastic and open circuit tournaments, to create examples that specifically challenge Question Answering (QA) systems. We develop a user interface (Section 2) that allows question writers to adversarially craft these questions. This interface provides a model's predictions and its evidence from the training data to facilitate a model-driven annotation process.

Humans find the resulting challenge questions easier than regular questions (Section 3), but strong QA models struggle (Section 4). Unlike many existing QA test sets, our questions highlight specific phenomena that humans can capture but machines cannot (Section 5). We release our QA challenge set to better evaluate models and systematically improve them.2

2 A Model-Driven Annotation Process

This section introduces our framework for tailoring questions to challenge computers, the surrounding community of trivia enthusiasts that create thousands of questions annually, and how we expose QA algorithms to this community to help them craft questions that challenge computers.

2 www.qanta.org


The protagonist of this opera describes the future day when her lover will arrive on a boat in the aria "Un Bel Di" or "One Beautiful Day." The only baritone role in this opera is the consul Sharpless who reads letters for the protagonist, who has a maid named Suzuki. That protagonist blindfolds her child Sorrow before stabbing herself when her lover B.F. Pinkerton returns with a wife. For 10 points, name this Giacomo Puccini opera about an American lieutenant's affair with the Japanese woman Cio-Cio San.
ANSWER: Madama Butterfly

Figure 1: An example Quiz Bowl question. The question becomes progressively easier to answer later on; thus, more knowledgeable players can answer after hearing fewer clues.

2.1 The Quiz Bowl Community: Writers of Questions

The "gold standard" of academic competitions between universities and high schools is Quiz Bowl. Unlike other question answering formats such as Jeopardy! or TriviaQA (Joshi et al., 2017), Quiz Bowl questions are designed to be interrupted. This allows more knowledgeable players to "buzz in" before their opponent knows the answer. This style of play requires questions to be structured "pyramidally": questions start with difficult clues and get progressively easier (Figure 1).

However, like most existing QA datasets, Quiz Bowl questions are written with humans in mind. Unfortunately, the heuristics that question writers use to select clues do not always apply to computers. For example, humans are unlikely to memorize every song in every opera by a particular composer. This, however, is trivial for a computer. In particular, a simple baseline QA system easily solves the example in Figure 1 from seeing the reference to "Un Bel Di". Other questions contain uniquely identifying "trigger words". For example, "Martensite" only appears in questions on Steel. For these types of examples, a QA system needs to understand no additional information other than an if–then rule. Surprisingly, this is true for many different answers: in these cases, QA devolves into trivial pattern matching. Consequently, information retrieval systems are strong baselines for this task, even capable of defeating top high school and collegiate players. Well-tuned neural based QA systems (Yamada et al., 2018) can give small improvements over the baselines and have even defeated teams of expert humans in live Quiz Bowl events.

However, other types of Quiz Bowl questions are fiendishly difficult for computers. Many questions have complicated coreference patterns (Guha et al., 2015), require reasoning across multiple types of knowledge, or involve wordplay. Given that these difficult types of question truly challenge models, how can we generate and analyze more of them?

2.2 Adversarial Question Writing

One approach to evaluating models beyond a typical test set is through adversarial examples (Szegedy et al., 2013) and other types of intentionally difficult inputs. However, language data is hard to modify (e.g., replacing word tokens) without changing the meaning of the input. Past work side-steps this difficulty by modifying examples in a simple enough manner to preserve meaning (Jia and Liang, 2017; Belinkov and Bisk, 2018). However, it is hard to generate complex examples that expose richer phenomena through automatic means. Instead, we propose to use human adversaries in a process we call adversarial writing.

In this setting, question writers are tasked with generating challenge questions that break existing QA systems but are still answerable by humans. To facilitate this breaking process, we expose model predictions and interpretation methods to question writers through a user interface. This allows writers to see what changes should be made to confuse the system and visualize the resulting effects. For example, our system highlights the revealing "Un Bel Di" clue in bright red.

This results in a small, model-driven challenge set that is explicitly designed to expose a model's limitations. While the regular held-out test set for Quiz Bowl provides questions that are likely to be asked in an actual tournament, these challenge questions highlight rare and difficult QA phenomena that models cannot handle.

2.3 User Interface

The interface (Figure 2) provides the top five predictions (Guesses) from a simple non-neural model, the baseline system from a NIPS 2017 competition that used Quiz Bowl as a shared task (Boyd-Graber et al., 2018). This model uses


Figure 2: The writer inputs a question (top right), the system provides guesses (left), and explains why it's making those guesses (bottom right). The writer can then adapt their question to "trick" the model.

an inverted index built using the shared task training data (which consists of Quiz Bowl questions and Wikipedia pages).

We select an information retrieval model as it enables us to extract meaningful reasoning behind the model's predictions. In particular, the Elasticsearch Highlight API (Gormley and Tong, 2015) visually highlights words in the Evidence section of the interface. This helps users understand which words and phrases are present in the training data and may be revealing the answer. Though the users never see outputs from a neural system, the questions they write are quite challenging for them (Section 4).
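
A minimal sketch of this retrieval-plus-highlighting step with the Python Elasticsearch client is shown below; the index name, field name, and stored answer metadata are assumptions, since the paper does not describe the interface's schema.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def evidence_for(question_text, index="qb_training", field="text", k=5):
    # Match the partial question against the training passages and ask
    # Elasticsearch to highlight the overlapping words, roughly as the
    # interface's Evidence panel does.
    body = {
        "query": {"match": {field: question_text}},
        "highlight": {"fields": {field: {}}},
        "size": k,
    }
    response = es.search(index=index, body=body)
    return [
        # "answer" is a hypothetical stored field naming the answer page.
        (hit["_source"].get("answer"), hit.get("highlight", {}).get(field, []))
        for hit in response["hits"]["hits"]
    ]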

2.4 Farmed from the Maddeningly Smart Crowd

In contrast to crowd-sourced datasets, our data comes from the fecund pool of thousands of questions written annually for Quiz Bowl tournaments (Jennings, 2006). We connect with the question writers of these tournaments, who find the adversarial writing process useful to help write high quality, original questions.

The current dataset (we intend to have twice-yearly competitions to continually collect data) consists of 651 questions made of 3219 sentences. A few of the writers have indicated that they plan to use their question submissions in an upcoming tournament. We will release these questions after those respective tournaments have completed.

3 Validating Written Questions

We next verify the validity of the challenge questions. We do not want to collect questions that are a jumble of random characters or contain insufficient information to discern the answer. Thus, we first automatically filter out invalid questions based on length, the presence of vulgar statements, or repeated submissions (including re-submissions from the Quiz Bowl training or evaluation data). Next, we manually verify that all of the resulting questions appear legitimate and that no obviously invalid questions made it into the challenge set.

We further wish to investigate not only the validity, but also the difficulty of the challenge questions according to human Quiz Bowl players. To do so, we play a portion of the submitted questions in a live Quiz Bowl event, using intermediate and expert players (current and former collegiate Quiz Bowl players) as the human baseline. We sample 60 challenge questions from categories that match typical tournament distributions. As a baseline, we additionally select 60 unreleased high school tournament questions (to ensure no player has seen them before).

When answering in Quiz Bowl, a player must interrupt the question with a buzz. The earlier that a player buzzes, the less of a chance their opponent has to answer the question before them. To capture this, we consider two metrics to evaluate performance, the average buzz position (as a percentage of the question seen) and the corresponding answer accuracy. We randomly shuffle the


baseline and challenge questions, play them, and record these two metrics. On average for the challenge set, humans buzz with 41.6% of the question remaining and an accuracy of 89.7%. On the baseline questions, humans buzz with 28.3% of the question remaining and an accuracy of 84.2%. The difference in accuracy between the two types of questions is not significantly different (p = 0.16 using Fisher's exact test), but the buzzing position is significantly earlier for the challenge questions (a two-sided t-test yields p = 0.0047). Humans find the challenge questions easier on average than the regular test examples (they buzz much earlier). We expect human performance to be comparable on the questions not played, as all questions went through the same submission and post-processing stages.
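
The two significance tests reported above can be reproduced as in the sketch below, assuming hypothetical per-question records of answer correctness (0/1) and buzz position (fraction of the question remaining); this is an illustration, not the authors' evaluation code.

from scipy.stats import fisher_exact, ttest_ind

def compare_question_sets(challenge_correct, baseline_correct,
                          challenge_buzz, baseline_buzz):
    # Accuracy difference: Fisher's exact test on the 2x2 table of
    # correct/incorrect counts for the two question sets.
    table = [
        [sum(challenge_correct), len(challenge_correct) - sum(challenge_correct)],
        [sum(baseline_correct), len(baseline_correct) - sum(baseline_correct)],
    ]
    _, p_accuracy = fisher_exact(table)
    # Buzz position difference: two-sided independent t-test.
    _, p_buzz = ttest_ind(challenge_buzz, baseline_buzz)
    return p_accuracy, p_buzz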

4 Models and Experiments

In this section, we evaluate numerous QA systems on the challenge questions. We consider a diverse set of models: ones based on recurrent networks, feed-forward networks, and IR systems to properly explore the difficulty of the examples.

We consider two neural models: a recurrent neural network (RNN) and a Deep Averaging Network (Iyyer et al., 2015, DAN). The two models treat the problem as text classification and predict which of the answer entities the question is about. The RNN is a bidirectional GRU (Cho et al., 2014) and the DAN uses fully connected layers with a word vector average as input.
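
As a rough illustration of the DAN classifier (a sketch, not the authors' implementation), the PyTorch model below averages word embeddings and feeds the result through fully connected layers that score every candidate answer; the layer sizes are hypothetical.

import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, n_answers, emb_dim=300, hidden_dim=500):
        super().__init__()
        # EmbeddingBag with mode="mean" returns the word vector average.
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_answers),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of word indices.
        averaged = self.embed(token_ids)   # (batch, emb_dim)
        return self.classifier(averaged)   # (batch, n_answers) logits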

To train the systems, we collect the data used at the 2017 NIPS Human-Computer Question Answering competition (Boyd-Graber et al., 2018). The dataset consists of about 70,000 questions with 13,500 answer options. We split the data into validation and test sets to provide baseline evaluations for the models. We also report results on the baseline system (IR) shown to users during the writing process. For evaluation, we report the accuracy as a function of the question position (to capture the incremental nature of the game). The accuracy varies as the words are fed in (mostly improving, but occasionally degrading).

The buzz position of all models significantly degrades on the challenge set. We compare the accuracy on the original test set (Test Questions) to the challenge questions in Figure 3.

For both the challenge and original test data, the questions begin with abstract clues that are difficult to answer (accuracy at or below 10%). However, during the crucial middle portions of the questions (after revealing 25% to 75%), where buzzes in Quiz Bowl matches most frequently occur, the accuracy on original test questions rises significantly quicker than on the challenge ones. For both question types, the accuracy rises towards the end as the "give-away" clues arrive. Despite users never observing the output of a neural system, the two neural models decreased more in absolute accuracy than the IR system. The DAN model had the largest absolute accuracy decrease (from 54.1% to 32.4% on the full question), likely because a vector average is not capable of capturing the difficult wording of the challenge questions.

The human results are displayed on the left of Figure 3 and show a different trend. For both question types, human accuracy rises very quickly after about 50% of the question has been seen. We suspect this occurs because the "give-aways", which often contain common sense or simple knowledge clues, are easy for humans but quite difficult for computers. The reverse is true for the early clues. They contain quotes and entities that models can retrieve but humans struggle to remember.

5 Challenge Set Reveals Model Limitations

In this section, we conduct an analysis of the challenge questions to better understand the source of their difficulty. We harvest recurring patterns using the user edit logs and corresponding model predictions, grouping the questions into linguistically-motivated clusters (Table 1).

These groups provide an informative analysis of model errors on a diverse set of phenomena. In our dataset, we additionally provide the most similar training questions for each challenge question.

A portion of the examples contain clues that are unseen during training time. Many of these clues are quite interesting, for example, the common knowledge clue "this man is on the One Dollar Bill". However, because we experiment with systems that are not able to capture open-domain information, we do not investigate these examples further as they trivially break systems.

5.1 Understanding Changes in Language

The first categories of challenge questions contain previously seen clues that have been written in a misleading manner. Table 1 shows snippets of


Figure 3: Both types of questions (challenge questions and original test set questions) begin with abstract clues the models are unable to capture, but the challenge questions are significantly harder during the crucial middle portions (0.25 to 0.75) of the question. The human results (displayed on the left of the figure) show their performance on a sample of the challenge questions and a separate test set.

exemplar challenge questions for each category.

Paraphrases A common adversarial writing strategy is to paraphrase clues to remove exact n-gram matches from the training data. This renders an IR system useless but also hurts neural models.

Entity Type Distractors One key component for QA is determining the answer type that is desired from the question. Writers often take advantage of this by providing clues that lead the system into selecting a wrong answer type. For example, in the second question of Table 1, the "lead in" implies the answer may be an actor. This triggers the model to answer Don Cheadle despite previously seeing the Bill Clinton "Saxophone" clue.

5.2 Composing Existing Knowledge

The other categories of challenge questions require composing knowledge from multiple existing clues. Table 2 shows snippets of exemplar challenge questions for each category.

Triangulation In these questions, entities that have a first order relationship to the correct answer are given. The system must then triangulate the correct answer by "filling in the blank". For example, in the first question of Table 2, the place of death and the brother of the entity are given. The training data contains a clue about the place of death (The Battle of Thames) reading "though stiff fighting came from their Native American allies under Tecumseh, who died at this battle". The system must connect these two clues to answer.

Operator One extremely difficult question type requires applying a mathematical or logical operator to the text. For example, the training data contains a clue about the Battle of Thermopylae reading "King Leonidas and 300 Spartans died at the hands of the Persians" and the second question in Table 2 requires one to add 150 to the number of Spartans.

Multi-Step Reasoning The final type of question requires a model to make multiple reasoning steps between entities. For example, in the last question of Table 2, a model needs to make a reasoning step first from the "I Have A Dream" speech to the Lincoln Memorial and an additional step to reach president Abraham Lincoln.

6 Related Work

Creating evaluation datasets to get fine-grained analysis of particular linguistic features or model attributes has been explored in past work. The LAMBADA dataset tests a model's ability to understand the broad contexts present in book passages (Paperno et al., 2016). Linzen et al. (2016) create a dataset to evaluate if language models can learn subject-verb number agreement. The most closely


Set: Training | Question: Name this sociological phenomenon, the taking of one's own life. | Answer: Suicide | Rationale: Paraphrase
Set: Challenge | Question: Name this self-inflicted method of death. | Answer: Arthur Miller | Rationale: Paraphrase
Set: Training | Question: Clinton played the saxophone on The Arsenio Hall Show | Answer: Bill Clinton | Rationale: Entity Type Distractor
Set: Challenge | Question: He was edited to appear in the film "Contact" . . . For ten points, name this American president who played the saxophone on an appearance on the Arsenio Hall Show. | Answer: Don Cheadle | Rationale: Entity Type Distractor

Table 1: Snippets from challenge questions show the difficulty in retrieving previously seen evidence. Training questions indicate relevant snippets from the training data. Answer displays the RNN Reader's answer prediction (always correct on Training, always incorrect on Challenge).

Question: This man, who died at the Battle of the Thames, experienced a setback when his brother Tenskwatawa's influence over their tribe began to fade | Prediction: Battle of Tippecanoe | Answer: Tecumseh | Rationale: Triangulation
Question: This number is one hundred and fifty more than the number of Spartans at Thermopylae. | Prediction: Battle of Thermopylae | Answer: 450 | Rationale: Operator
Question: A building dedicated to this man was the site of the "I Have A Dream" speech | Prediction: Martin Luther King Jr. | Answer: Abraham Lincoln | Rationale: Multi-Step Reasoning

Table 2: Snippets from challenge questions show examples of composing existing evidence. Answer displays the RNN Reader's answer prediction. For these examples, connecting the training and challenge clues is quite simple for humans but very difficult for models.

related work to ours is Ettinger et al. (2017) who also consider using humans as adversaries. Our work differs in that we use model interpretation methods to facilitate breaking a specific system.

Other methods have found very simple input modifications can break neural models. For example, adding character level noise drastically reduces machine translation quality (Belinkov and Bisk, 2018), while paraphrases can fool natural language inference systems (Iyyer et al., 2018). Jia and Liang (2017) placed distracting sentences at the end of paragraphs and caused QA systems to incorrectly pick up on the misleading information. These types of input modifications can evaluate one specific type of phenomenon and are complementary to our approach.

7 Conclusion

It is difficult to automatically expose the limitations of a machine learning system, especially when that system reaches very high performance metrics on a held-out evaluation set. To address this, we have introduced a human driven evaluation setting, where users try to break a trained system. By facilitating this process with interpretation methods, users can understand what a model is doing and how to create challenging examples for it. An analysis of the resulting data can reveal unknown model limitations and provide insight into improving a system.

Acknowledgments

We thank all of the Quiz Bowl players and writers who helped make this work possible. We also thank the anonymous reviewers and members of the UMD "Feet Thinking" group for helpful comments. JBG is supported by NSF Grant IIS-1652666. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.


References

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations.

Jordan Boyd-Graber, Shi Feng, and Pedro Rodriguez. 2018. Human-Computer Question Answering: The Case for Quizbowl. Springer.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. Proceedings of the Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation.

Allyson Ettinger, Sudha Rao, Hal Daume III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task.

Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide. O'Reilly Media, Inc.

Anupam Guha, Mohit Iyyer, Danny Bouman, and Jordan Boyd-Graber. 2015. Removing the training wheels: A coreference dataset that entertains humans and challenges computers. In North American Association for Computational Linguistics.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Conference of the North American Chapter of the Association for Computational Linguistics.

Ken Jennings. 2006. Brainiac: Adventures in the Curious, Competitive, Compulsive World of Trivia Buffs. Villard. http://books.google.com/books?id=tXi_rfgUWCcC.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of Empirical Methods in Natural Language Processing.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. 4:521–535.

Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks.

Ikuya Yamada, Ryuji Tamaki, Hiroyuki Shindo, and Yoshiyasu Takefuji. 2018. Studio Ousia's quiz bowl question answering system. arXiv preprint arXiv:1803.08652.


Proceedings of ACL 2018, Student Research Workshop, pages 134–140, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection

Pranav A
Big Data Institute
Hong Kong University of Science and Technology
Hong Kong

[email protected]

Abstract

Ranking functions in information retrieval are often used in search engines to recommend the relevant answers to a query. This paper makes use of this notion of information retrieval and applies it to the problem domain of cognate detection. The main contributions of this paper are: (1) positional segmentation, which incorporates the sequential notion; (2) graphical error modelling, which deduces the transformations. Current research work focuses on the classification problem, which is distinguishing whether a pair of words are cognates. This paper focuses on a harder problem: whether we can predict a possible cognate from the given input. Our study shows that applying language modelling smoothing methods as the retrieval functions, in conjunction with positional segmentation and error modelling, gives better results than competing baselines, in both classification and prediction of cognates.

1 Introduction

Cognates are a collection of words in different languages deriving from the same origin. The study of cognates plays a crucial role in applying comparative approaches for historical linguistics, in particular, solving language relatedness and tracking the interaction and evolvement of multiple languages over time. A cognate instance in Indo-European languages is given as the word group: night (English), nuit (French), noche (Spanish) and nacht (German).

The existing studies on cognate detection involve experiments which distinguish whether a pair of words are cognates or non-cognates (Ciobanu and Dinu, 2014; List, 2012a). These studies do not approach the problem of predicting the possible cognate of the target language, if the cognate of the source language is given. For example, given the word nuit, could the algorithm predict the appropriate German cognate within the huge German wordlist? This paper tackles this problem by incorporating heuristics of the probabilistic ranking functions from information retrieval. Information retrieval addresses the problem of scoring a document with a given query, which is used in every search engine. One can view the above problem as the construction of a suitable search engine, through which we want to find the cognate counterpart of a word (query) in a lexicon of another language (documents).

This paper deals with the intersection between the areas of information retrieval and approximate string similarity (like the cognate detection problem), which is largely under-explored in the literature. Retrieval methods also provide a variety of alternative heuristics which can be chosen for the desired application areas (Fang et al., 2011). Taking such advantage of the flexibility of these models, the combination of approximate string similarity operations with an information retrieval system could be beneficial in many cases. We demonstrate how the notion of information retrieval can be incorporated into the approximate string similarity problem by breaking a word into smaller units. Regarding this, Nguyen et al. (2016) have argued that segmented words are a more practical way to query large databases of sequences, in comparison with conventional query methods. This further encourages the heuristic attempt at imposing an information retrieval model on the cognate detection problem in this paper.

Our main contribution is to design an information retrieval based scoring function (see section 4) which can capture the complex morphological


shifts between the cognates. We tackled this by proposing a shingling (chunking) scheme which incorporates positional information (see section 2) and a graph-based error modelling scheme to understand the transformations (see section 3). Our test harness focuses not only on distinguishing between a pair of cognates, but also on the ability to predict the cognate for a target language (see section 5).

2 Positional Character-based Shingling

This section examines converting a string into a shingle set which includes the encodings of the positional information. In this paper, we denote S as the shingle set of the cognate from the source language and T as the shingle set of the cognate from the target language. The similarity between these split sets is denoted by S ∩ T. An example cognate from the source language S (Romanian) could be the shingle set of the word rosmarin and T (Italian) could be romarin.

K-gram shingling: Usually, set based string similarity measures are based on comparing the overlap between the shingles of two strings. Shingling is a way of viewing a string as a document by considering k characters at a time. For example, the shingle of the word rosmarin is created with k = 2 as: S = {〈s〉r, ro, os, sm, ma, ar, ri, in, n〈/s〉}. Here, 〈s〉 is the start sentinel token and 〈/s〉 is the stop sentinel token. For the sake of simplicity, we have ignored sentinel tokens, which transforms the set into: S = {r, ro, os, sm, ma, ar, ri, in, n}. This method splits the strings into smaller k-grams without any positional information.
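
A small sketch of this splitting step is given below (an illustration, not the code released with the paper); sentinel tokens are added and then stripped, reproducing the simplified set above.

def kgram_shingles(word, k=2):
    # Treat the start/stop sentinels as single symbols, shingle, then
    # drop the sentinels to obtain the simplified variant.
    symbols = ["<s>"] + list(word) + ["</s>"]
    shingles = ["".join(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    return [s.replace("<s>", "").replace("</s>", "") for s in shingles]

print(kgram_shingles("rosmarin"))
# ['r', 'ro', 'os', 'sm', 'ma', 'ar', 'ri', 'in', 'n']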

2.1 Positional Shingling from 1 End

We argue that the unordered k-gram splitting could lead to an inefficient matching of strings since a shingle set is visualized as the bag-of-words method. Given this, we propose a positional k-gram shingling technique, which introduces a position number in the splits to incorporate the notion of the sequence of the tokens. For example, the word rosmarin could be position-wise split with k = 2 as: S = {1r, 2ro, 3os, 4sm, 5ma, 6ar, 7ri, 8in, 9n}.

Thus, the member 4sm means that it is the fourth member of the set. The motivation behind this modification is that it retains the positional information which is useful in probabilistic retrieval ranking functions.

2.2 Positional Shingling from 2 Ends

The main disadvantage of the positional shingling from a single end is that any mismatch can completely disturb the order of the rest, leading to low similarity. For example, if the query is rosmarin with cognate romarin, the corresponding split sets would be {1r, 2ro, 3os, 4sm, 5ma, 6ar, 7ri, 8in, 9n} and {1r, 2ro, 3om, 4ma, 5ar, 6ri, 7in, 8n}. The order of the members after 2ro is misplaced, thus this will lead to low similarity between the two cognates. Only {1r, 2ro} is common between the cognates. Considering this, we propose positional shingling from both ends, which is robust against such displacements.

We attach a position number to the left if the numbering begins from the start, and to the right if the numbering begins from the end. Then the smallest position number is selected between the two position numbers. If the position numbers are equal, then we select the left position number as a convention. Figure 1 gives an exemplification of this algorithm, illustrated with splits of romarin and rosmarin.

Figure 1: The process of positional tokenisation from both ends. On the left, the algorithm segments the Romanian word romarin into the split set {1r, 2ro, 3om, 4ma, ar4, ri3, in2, n1}. On the right, the algorithm segments rosmarin into {1r, 2ro, 3os, 4sm, 5ma, ar4, ri3, in2, n1}.

In Figure 1, the split sets of rosmarin and romarin are shown. After taking their intersection, we get {1r, 2ro, ar4, ri3, in2, n1}, indicating a higher similarity.
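
The two-ended scheme can be sketched as follows, reusing the kgram_shingles helper from the previous sketch; the position number is prefixed when counting from the start and suffixed when counting from the end, keeping the smaller of the two (the left one on ties).

def two_ended_positional_shingles(word, k=2):
    shingles = kgram_shingles(word, k)
    n = len(shingles)
    out = set()
    for i, s in enumerate(shingles, start=1):
        left, right = i, n - i + 1
        out.add(f"{left}{s}" if left <= right else f"{s}{right}")
    return out

S = two_ended_positional_shingles("rosmarin")   # Romanian
T = two_ended_positional_shingles("romarin")    # Italian
print(S & T)   # {'1r', '2ro', 'ar4', 'ri3', 'in2', 'n1'}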


3 Graphical Error Modelling

Once shingle sets are created, common overlap set measures like set intersection, Jaccard (Jarvelin et al., 2007), XDice (Brew et al., 1996) or TF-IDF (Wu et al., 2008) could be used to measure similarities between two sets. However, these methods only focus on the similarity of the two strings. For cognate detection, it is crucial to understand how substrings are transformed from the source language to the target language. This section discusses how to view this "dissimilarity" by creating a graphical error model.

Algorithm 1 explicates the process of graphical error modelling. For illustration purposes, we visualize the procedure via a Romanian-Italian cognate pair (mesia, messia). If the source language is Romanian, then S = {1m, 2me, 3es, si3, ia2, a1}, which is the split set of mesia. Let the target language be Italian. Then the split set of the Italian word messia, denoted as T, will be {1m, 2me, 3es, 4ss, si3, ia2, a1}. Thus |S ∩ T| is the number of common terms. The term matches are S ∩ T = {1m, 2me, 3es, si3, ia2, a1}. We are interested in examining the "dissimilarity", which is the leftover terms in the sets. That means we need to infer a certain pattern from the leftover sets, which are S − {S ∩ T} and T − {S ∩ T}. Thus we can draw mappings to gather information about the corrections. Let top and bottom be the ordered sets referring to S − {S ∩ T} and T − {S ∩ T} respectively. Referring to the example, T − S ∩ T = {4ss}, a bottom set. Similarly, S − S ∩ T = {}, a top set. Then we follow the instructions given in Algorithm 1.

Algorithm 1: Graphical Error Model

The Graphical Error Model takes two split sets generated by the shingling variants, namely top and bottom. The objective is to output a graphical structure showing connections between members of the top and the bottom sets.

1. Initialization of the top and bottom: If the given sets top and bottom are empty, we initialize them by inserting an empty token (φ) into those sets. Running example: This step transforms the top set into {φ} and the bottom into {4ss}.

2. Equalization of the set cardinalities: The cardinalities of the sets top and bottom are made equal by inserting empty tokens (φ) into the middle of the sets. Running example: The set cardinalities of top and bottom were already equal. Thus the output of this step is the top set as {φ} and the bottom as {4ss}.

3. Inserting the mappings of the set members into the graph: The empty graph is initialized as graph = {}. The directed edges are generated, originating from every set member of top to every set member of bottom. This results in a complete directed bipartite graph between the top and bottom sets. Each edge is assigned a probability P(e), which is discussed in a later section. Running example: The output of this step would be the complete directed bipartite graph between the top and bottom sets, which is {φ → 4ss}. One more example is provided in Figure 2.

Intuition: The edges created as the result of this graph could be used for probabilistic calculations which are detailed more in section 4.2. Intuitively, φ → 4ss means that if the letter s is added at position 4 of the source word mesia, then one could get the target word messia.

Figure 2: The bipartite graph output of the algorithm when the source cognate is stupor and the target cognate is stupeur: the top set {po3, φ, or2} maps to the bottom set {pe4, eu3, ur2}.
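
A sketch of Algorithm 1 built on the shingling helpers above is given below; for simplicity it pads empty tokens at the end of the smaller set rather than into its middle, and it returns the edge list of the complete directed bipartite graph.

def graphical_error_model(S, T, empty="phi"):
    common = S & T
    top = sorted(S - common) or [empty]
    bottom = sorted(T - common) or [empty]
    # Equalize the cardinalities by inserting empty tokens.
    size = max(len(top), len(bottom))
    top += [empty] * (size - len(top))
    bottom += [empty] * (size - len(bottom))
    # Complete directed bipartite graph from top to bottom.
    return [(t, b) for t in top for b in bottom]

S = two_ended_positional_shingles("mesia")    # Romanian
T = two_ended_positional_shingles("messia")   # Italian
print(graphical_error_model(S, T))            # [('phi', '4ss')]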

4 Evaluation Function

The design of our evaluation function focuses on two main properties: set based similarity (see section 4.1) and probabilistic calculation through the graphical model (see section 4.2).

4.1 Similarity Function

Usually, the computation of similarity between two sets is done by metrics like Jaccard, Dice and XDice (Brew et al., 1996). Dynamic programming based methods like edit distance and LCSR (Longest Common Subsequence Ratio,


(Melamed, 1999)) are also often used to calculate similarity between two strings. Ranking functions incorporate more complex but necessary features which are needed to distinguish between the documents.

In this paper, we use BM25 and a Dirichlet smoothing based ranking function to compute the similarity. BM25 considers term-frequency, inverse document frequency and length normalization based penalization features for similarity calculations. The Dirichlet smoothing function (Robertson and Zaragoza, 2009) makes use of language modelling features and a tunable parameter which aids in Bayesian smoothing of unseen shingles in the split sets (Blei et al., 2003).

4.2 Error Modelling Function

The information of the common morphological transformations for cognates between two different languages helps in determining if a pair of words could be cognates. Based on the graphs of cognate pairs between Italian and Romanian (section 3), which model the morphological shifts between the cognates in the two languages, we define an error modelling function on any pair of words from the two languages. The split set from the source language is denoted by S and that of the target language by T; the probabilistic function is then:

π(S, T) = (1 / |G(S, T)|) · Σ_{e ∈ G(S, T)} P(e)^q

where G(S, T) is the constructed graph of S and T, q is the strength parameter with range (0, ∞), and P(e) is the probability of edge e occurring between two cognates, which is estimated by its frequency of being observed in the graphs of cognate pairs in the training set.

Figure 3 illustrates the aggregation of edges in the graph and Figure 4 shows the final output of the error modelling function after normalizing. π(S, T) is called the error modelling function defined for the word pair, which is an intuitive calculation of probability between a pair of cognates through estimating their transformations. q is a tunable parameter that controls the effect of the probabilistic frequencies P(e) observed in the training set, often useful in avoiding overfitting.

1/|G(S, T)| is the normalization factor that allows us to compare the quantity across different word pairs.
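
Under these definitions, the error modelling function can be sketched as follows; the edge probabilities P(e) are estimated from the graphs of known cognate pairs in the training data, and the flat unseen value is a stand-in for the paper's smoothing of unobserved edges.

from collections import Counter

def edge_probabilities(training_pairs):
    counts = Counter()
    for src, tgt in training_pairs:
        S = two_ended_positional_shingles(src)
        T = two_ended_positional_shingles(tgt)
        counts.update(graphical_error_model(S, T))
    total = sum(counts.values())
    return {edge: c / total for edge, c in counts.items()}

def pi(S, T, P, q=1.0, unseen=1e-6):
    # pi(S, T) = (1 / |G(S, T)|) * sum over edges e of P(e)^q
    edges = graphical_error_model(S, T)
    return sum(P.get(e, unseen) ** q for e in edges) / len(edges)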

Figure 3: From the graph created in Figure 2, we calculate the probabilities of each edge (by computing frequencies and smoothing) and then aggregate all the probabilities of the edges in the graph.

Figure 4: After aggregating, we normalize the sum and the graph conversion score is output.

4.3 Combining Error Modelling and Similarity Function Metrics

In this subsection, we merge the notions of similarity and dissimilarity together. We combine a set-based similarity function (discussed in section 4.1) and the error modelling function (discussed in section 4.2) into a score function by a weighted sum of them, which is,

score(S, T) = λ · sim(S, T) + (1 − λ) · π(S, T)

where λ ∈ [0, 1] is a weight based hyperparameter, sim(S, T) is a set-based similarity between S and


T, and π(S, T) is the graphical error modelling function defined above.
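
A sketch of the combined score is shown below; Jaccard similarity stands in for sim(S, T) here, whereas the reported results use the BM25 and Dirichlet-smoothed ranking functions.

def score(S, T, P, lam=0.6, q=1.0):
    jaccard = len(S & T) / len(S | T)
    return lam * jaccard + (1 - lam) * pi(S, T, P, q)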

5 Test Harness

Table 1 summarizes the results of the experimental setup. The elements of the test harness are as follows:

5.1 Setup Description

Dataset: The experiments in this paper are performed on the dataset used by Ciobanu et al. (2014). The dataset consists of 400 pairs of cognates and non-cognates for Romanian-French (Ro - Fr), Romanian-Italian (Ro - It), Romanian-Spanish (Ro - Es) and Romanian-Portuguese (Ro - Pt). The dataset is divided into a 3:1 ratio for training and testing purposes. Using cross-validation, hyperparameters and thresholds for all the algorithms and baselines were tuned accordingly in a fair manner.

Experiments: Two experiments are included in the test harness.

1. We provide a pair of words and the algorithms aim to detect whether they are cognates. Accuracy on the test set is used as a metric for evaluation.

2. We provide a source cognate as the input and the algorithm returns a ranked list as the output. The efficiency of the algorithm depends on the rank of the desired target cognate. This is measured by MRR (Mean Reciprocal Rank), defined as MRR = (1 / |dataset|) · Σ_{i=1}^{|dataset|} 1 / rank_i, where rank_i is the rank of the true cognate in the ranked list returned for the i-th query.
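
For reference, MRR can be computed directly from the ranks of the true cognates in the returned lists (a small helper, not part of the released code):

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)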

5.2 Baselines

String Similarity Baselines: It is intuitive to compare our methods with the prevalent string similarity baselines since the notions behind cognate detection and string similarity are almost similar. Edit Distance is often used as a baseline in cognate detection papers (Melamed, 1999). This computes the number of operations required to transform the source into the target cognate. We have also incorporated XDice (Brew et al., 1996), which is a set based similarity measure that operates on the shingle sets of two strings. Hidden alignment conditional random fields (CRF) are often used in transliteration and serve as a generative sequential model to compute the probabilities between the cognates, which is analogous to learnable edit distance (McCallum et al., 2012). Among these baselines, the CRF performs the best in accuracy and MRR.

Orthographic Cognate Detection: Papers related to this notion usually take alignments of substrings which are fed into classifiers like support vector machines (Ciobanu and Dinu, 2015, 2014) or hidden Markov models (Bhargava and Kondrak, 2009). We included Ciobanu and Dinu (2014) as a baseline, which employs dynamic programming based methods for sequence alignment, following which features are extracted from the mismatches in the word alignments. These features are plugged into a classifier like an SVM, which results in decent performance on accuracy with an average of 84%, but only 16% on MRR. This result is due to the fact that a large number of features leads to overfitting and the scoring function is not able to distinguish the appropriate cognate.

Phonetic Cognate Detection: Research in automatic cognate identification using phonetic aspects involves computation of similarity by decomposing phonetically transcribed words (Kondrak, 2000), acoustic models (Mielke, 2012), phonetic encodings (Rama, 2015), and aligned segments of transcribed phonemes (List, 2012b). We implemented Rama's (2016) approach, which employs a Siamese convolutional neural network to learn the phonetic features jointly with language relatedness for cognate identification, achieved through phoneme encodings. Although it performs well on accuracy, it shows poor results on MRR, possibly for the same reason as the SVM performance.

5.3 Ablation Experiments

We experiment with variables like the length of substrings, ranking functions, shingling techniques, and the graphical error model, which are detailed in Table 1. Amongst the shingling techniques, we found that character bigrams with 2-ended positioning give better results. Adding trigrams to the database does not have a major effect on the results. The results clearly indicate that adding graphical error model features greatly improves the test results. Amongst the ranking functions, Dirichlet smoothing tends to give better results, possibly due to the fact that it requires fewer parameters to tune and is able to capture the sequential data (like substrings) better than other ranking functions (Fang et al., 2011).


Algorithms | Ro - It (Acc / MRR) | Ro - Fr (Acc / MRR) | Ro - Es (Acc / MRR) | Ro - Pt (Acc / MRR)
Edit Distance (Melamed, 1999) | 0.53 / 0.11 | 0.52 / 0.13 | 0.58 / 0.15 | 0.54 / 0.13
XDice (Brew et al., 1996) | 0.54 / 0.19 | 0.53 / 0.14 | 0.59 / 0.16 | 0.53 / 0.14
SVM with Orthographic Alignment (Ciobanu and Dinu, 2014) | 0.81 / 0.18 | 0.87 / 0.15 | 0.86 / 0.16 | 0.73 / 0.14
Phonetic Encodings in CNN (Rama, 2016) | 0.69 / 0.21 | 0.78 / 0.17 | 0.77 / 0.19 | 0.66 / 0.15
Hidden alignment CRF (McCallum et al., 2012) | 0.84 / 0.51 | 0.85 / 0.48 | 0.84 / 0.50 | 0.71 / 0.45

Shingling Technique, Ranking Function:
Bigram, 0-ended, TF-IDF | 0.54 / 0.18 | 0.52 / 0.14 | 0.59 / 0.15 | 0.55 / 0.11
Bigram, 1-ended, TF-IDF | 0.59 / 0.19 | 0.54 / 0.18 | 0.60 / 0.18 | 0.57 / 0.12
Bigram, 2-ended, TF-IDF | 0.64 / 0.25 | 0.63 / 0.21 | 0.68 / 0.22 | 0.65 / 0.17
(Bi + Tri)-gram, 2-ended, TF-IDF | 0.64 / 0.25 | 0.64 / 0.21 | 0.57 / 0.22 | 0.65 / 0.18
Bigram, 2-ended, BM25 | 0.84 / 0.40 | 0.87 / 0.37 | 0.86 / 0.34 | 0.73 / 0.35
Bigram, 2-ended, Dirichlet | 0.84 / 0.41 | 0.86 / 0.38 | 0.86 / 0.39 | 0.74 / 0.36
Bigram, 2-ended, BM25 + Graphical Error Model | 0.87 / 0.64 | 0.89 / 0.51 | 0.86 / 0.54 | 0.78 / 0.55
Bigram, 2-ended, Dirichlet + Graphical Error Model | 0.88 / 0.67 | 0.89 / 0.59 | 0.87 / 0.60 | 0.80 / 0.58

Table 1: Results on the test dataset. The upper half denotes the baselines used and the lower half describes our ablation experiments. For experiment 1, we evaluate using the accuracy (Acc) metric. We use MRR (Mean Reciprocal Rank) for the second experiment. Higher scores signify better performance. The maximum value possible is 1.0. It is worth noting that for the classification problem (experiment 1), our algorithm has a slight improvement. However, for the recommendation problem (experiment 2), our algorithm shows a massive improvement. The thresholds, hyper-parameters, source code and sample Python notebooks are available at our github repository: https://github.com/pranav-ust/cognates

The hyperparameter λ mentioned in section 4.3 was tuned to around 0.6, which shows a 60% contribution by the similarity function and a 40% contribution by the dissimilarity. Overall, the combination of bigrams with 2-ended positional shingling and graphical error modelling with the Dirichlet ranking function gives the best performance, with an average of 86% on the accuracy metric and 60% on MRR.

6 Conclusions

We approach the harder problem where the algorithm aims to find a target cognate when a source cognate is given. Positional shingling outperformed non-positional shingling based methods, which demonstrates that the inclusion of positional information of substrings is rather important. The addition of the graphical error model boosted the test results, which shows that it is crucial to add dissimilarity information in order to capture the transformations of the substrings. Methods whose scoring functions rely only on complex machine learning algorithms like CNNs or SVMs tend to give worse results when searching for a cognate.

Acknowledgements

This work would not be possible without the support from my parents. I would like to thank the NLP community for providing me open-sourced resources to help an underprivileged and naive student like me. Finally, I would like to thank the reviewers, mentors, and organizers of the ACL SRW for supporting student research. Special thanks to my classmate Chun Sik Chan and SRW mentor Sam Bowman for providing excellent critiques for this paper, and Alina Ciobanu for providing the dataset.


References

Aditya Bhargava and Grzegorz Kondrak. 2009. Multiple word alignment with profile hidden Markov models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages 43–48. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

Chris Brew, David Mckelvie, and Buccleuch Place. 1996. Word-pair extraction for lexicography.

Alina Maria Ciobanu and Liviu P. Dinu. 2014. Automatic detection of cognates using orthographic alignment. In ACL (2), pages 99–105.

Alina Maria Ciobanu and Liviu P. Dinu. 2015. Automatic discrimination between cognates and borrowings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 431–437.

Hui Fang, Tao Tao, and Chengxiang Zhai. 2011. Diagnostic evaluation of information retrieval models. ACM Trans. Inf. Syst., 29(2):7:1–7:42.

Anni Jarvelin, Antti Jarvelin, and Kalervo Jarvelin. 2007. s-grams: Defining generalized n-grams for information retrieval. Information Processing & Management, 43(4):1005–1019.

Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 288–295. Association for Computational Linguistics.

Johann-Mattis List. 2012a. LexStat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, EACL 2012, pages 117–125, Stroudsburg, PA, USA. Association for Computational Linguistics.

Johann-Mattis List. 2012b. LexStat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 117–125. Association for Computational Linguistics.

Andrew McCallum, Kedar Bellare, and Fernando Pereira. 2012. A conditional random field for discriminatively-trained finite-state string edit distance. arXiv preprint arXiv:1207.1406.

I. Dan Melamed. 1999. Bitext maps and alignment via pattern recognition. Comput. Linguist., 25(1):107–130.

Jeff Mielke. 2012. A phonetically based metric of sound similarity. Lingua, 122(2):145–163.

Ken Nguyen, Xuan Guo, and Y. Pan. 2016. Multiple Biological Sequence Alignment: Scoring Functions, Algorithms and Applications.

Taraka Rama. 2015. Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1227–1231.

Taraka Rama. 2016. Siamese convolutional networks based on phonetic features for cognate identification. CoRR, abs/1605.05172.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.

Ho Chung Wu, Robert Wing Pong Luk, Kam Fai Wong, and Kui Lam Kwok. 2008. Interpreting TF-IDF term weights as making relevance decisions. ACM Trans. Inf. Syst., 26(3):13:1–13:37.


Proceedings of ACL 2018, Student Research Workshop, pages 141–145, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Mixed Feelings: Natural Text Generation with Variable, Coexistent Affective Categories

Lee Kezar
Mathematics and Computer Science Department, Rhodes College
Memphis, Tennessee, USA
[email protected]

Abstract

Conversational agents, having the goal of natural language generation, must rely on language models which can integrate emotion into their responses. Recent projects outline models which can produce emotional sentences, but unlike human language, they tend to be restricted to one affective category out of a few (e.g. Zhao et al. (2018)). To my knowledge, none allow for the intentional coexistence of multiple emotions on the word or sentence level. Building on prior research which allows for variation in the intensity of a singular emotion (Ghosh et al., 2017), this research proposal outlines an LSTM (Long Short-Term Memory) language model which allows for variation in multiple emotions simultaneously.

1 Introduction

In closing her landmark paper on affective computing, Rosalind Picard charges researchers of artificial intelligence with a task. She writes, "Computers that will interact naturally and intelligently with humans need the ability to at least recognize and express affect" (Picard, 1995). While Picard herself has since spent her time primarily studying the physiology behind emotion and health, there is also a strong relationship between linguistics and emotion. The task for computer scientists who study this relationship is to symbolize and express emotion through verbal language alone.

Although this goal is easy to articulate, accomplishing it has proven to be quite challenging. However, in the same way AI researchers acquired valuable insight from reviewing models of cellular neuroscience to produce artificial neural networks, perhaps we can demystify affect generation by reviewing psychological models which build on neuro-biological findings regarding human emotion.

1.1 Need for Affective Mixing

I will now summarize two key ideas from these findings: the emotion/language dynamic and the classification of emotion.

First, emotion usually precedes language. In describing phenomenal consciousness as it pertains to emotion, Carroll Izard wrote that "an emotion feeling remains functional and motivational without being symbolized and made accessible in reflective consciousness via language" (Izard, 2009). One might support this claim by noting how complex emotional qualia feel and how limited language can be. For example, how accurate is it to say that "Alice is happy"? To what extent is she happy? Is it the same happiness that she feels when being in good company, or in favorable weather? Do they only differ in magnitude, or also along some other dimension? Or as a second example, can you recall a moment where you couldn't describe how you were feeling, but felt it nonetheless? Clearly, emotion is rather difficult to express in simple words. Yet, recent affective generation techniques tend to presume that emotions fall neatly into one of five or six discrete categories (Zhao et al., 2018). Recently, researchers added nuance to affect generation via variation in intensity (Ghosh et al., 2017), but to my knowledge no model adds nuance along extra dimensions, such as other emotions.

Second, emotion is usually the conflation of two distinct phenomena: "basic emotions" and "dynamic emotion-cognition interactions" (Izard, 2009). The basic emotions are linked to old evolutionary stimuli and are more automatic (e.g. fear

141

Page 152: Proceedings of ACL 2018, Student Research …Shibhansh Dohare, Vivek Gupta and Harish Karnick ix Wednesday July 18, 2018 Biomedical Document Retrieval for Clinical Decision Support

as a response mechanism to avoid danger). That is to say, basic emotions have little involvement with cognition. Emotion schemas, on the other hand, result directly from interactions between cognition and emotion. This type of emotion is underscored by research which casts emotion and cognition as interdependent processes (Storbeck and Clore, 2007), closely mirroring Picard's insistence that intelligence is comprised of emotion.

The categorizations usually found in lexicons (e.g. Linguistic Inquiry and Word Count, or LIWC) treat words as stimuli which humans respond to with emotions. This model aligns closely with basic emotions, and supposes that language precedes emotion. However, in actuality humans tend to incorporate cognitive processes such as memory and perspective-taking, allowing us to have multiple emotions simultaneously. If we are to use this as inspiration for generating affective text algorithmically, we might then permit the AI to intentionally express multiple emotions.

1.2 Project Overview

Having established this need for emotional intelligence in AI (including natural language processing), and reviewed psychological research in this area, we might conclude that emotions should not be modeled as singular, discrete categories but as continuous and constantly mixing.

This project aims to create an algorithm which takes some priming text and a desired affective state, which can vary along different emotional dimensions simultaneously, and produces a corresponding utterance. This algorithm is primarily meant for conversational agents, but could be adapted for other purposes.

1.3 Use Cases

Mixed emotions are not only more faithful to human emotion, since they allow for significantly more flexibility in expression, but also more helpful for specific applications. I will introduce three such applications here.

First, the ability to mix emotions in a continuous fashion allows a conversational agent to gradually and imperceptibly shift the tone of a conversation toward a new one. The human tendency to mirror the emotions of others through empathy would make this an effective strategy to improve attitude and emotional outlook.

Second, mixing emotions in this way also allows for an extra layer of nuance in conversational practice, as we are affected by emotional language as often as we produce it. For example, in simulating a realistic conversation, the delivery of emotionally-charged statements should cause the computer to respond appropriately. However, models which treat emotion as discrete categories would "overreact" and abruptly switch from one emotion to another.

Third, a realistic personality can be introduced as a tendency to hover near or avoid specific points. Naturally, the conversational agent will vary in emotion, but ultimately return to some default state. For example, if an agent were to express an optimistic personality, it might impose some minimum on the joy vector, and a maximum on the sadness and anger vectors. This is not possible with models that suppose emotions are discrete categories.

1.4 Algorithm Overview

We can encapsulate this goal with a broad formula

g(p, e) = w (1)

where p is the priming text, e is a quadruple of values such that

e_x ∈ ℝ, x ∈ {j, s, f, a}, 0 ≤ e_x ≤ 1    (2)

which correspond to the intensity of joy, sadness, fear, and anger, respectively; and w is an array of words and punctuation which represents the algorithm's response to p with emotional state e. The emotional categories are selected to correspond with the DepecheMood database, which I will describe in Section 3.2. The priming text can be a sentence fragment which the user seeks to complete or a natural sentence which the algorithm is meant to respond to.

The purpose of g is to be embedded into a conversational setting with improvements on parsing and production. I will not detail what this embedding looks like, for the sake of brevity and coherency.

142

Page 153: Proceedings of ACL 2018, Student Research …Shibhansh Dohare, Vivek Gupta and Harish Karnick ix Wednesday July 18, 2018 Biomedical Document Retrieval for Clinical Decision Support

An example set of sentences generated across a gradient beginning at high happiness (e_j ≈ 1), low fear (e_f ≈ 0) and ending at moderate happiness (e_j ≈ 0.5), high fear (e_f ≈ 1) might look like this:

• I am happy to go to work today

• I am content to go . . .

• I am hesitant to go . . .

• I am worried to go . . .

• I am nervous to go . . .

• I am anxious to go . . .

since anxiety is, in a sense, the combination of fear and partial excitement or happiness.
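For illustration only, the gradient in this example could be produced by linearly interpolating between two emotion quadruples and passing each point to g; the helper below is a hypothetical sketch, not part of the proposed model.

import numpy as np

def emotion_gradient(start, end, steps=6):
    """Linear interpolation between two emotion quadruples [joy, sadness,
    fear, anger]; each point would be passed as e to the generator g(p, e)."""
    return [np.round(start + (end - start) * t, 2) for t in np.linspace(0, 1, steps)]

# from high joy / no fear to moderate joy / high fear, as in the example list
for e in emotion_gradient(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.5, 0.0, 1.0, 0.0])):
    print(e)        # g(p, e) would be called with each of these targets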

2 Related Work

In the last two years, different models have been used to create conversational agents which can express affect, namely the Emotional Chatting Machine (ECM) (Zhao et al., 2018) and Affect-LM (Ghosh et al., 2017). These models are motivated by the psychological finding that agents with subtle expressivity can improve the affective state of the user (Prendinger et al., 2005). These projects utilize an array of emotional categories including liking, happiness, sadness, disgust, anger, and anxiety.

To accomplish this, different language models which utilize machine learning algorithms have been crafted and tested for accuracy and grammaticality. The most popular include feed-forward neural networks, recurrent neural networks (RNNs), and long short-term memory (LSTM) neural networks (Sundermeyer et al., 2015). Among these, the LSTM model is notably superior for establishing long-term dependencies in text and for reducing the vanishing gradient problem found in RNNs. This is the method used by Ghosh et al. (2017), with an additional network for including emotionally-charged words of the desired strength. Importantly, they accommodated and tested for loss in grammaticality, which was predicted for expressions of intense emotion but not for low emotions, where standard LSTMs typically suffice.

3 Model

3.1 LSTM Language Model

The LSTM language model allows for the use of all prior words as evidence in predicting the next best word, w_t. We can write this prediction as a probability like so:

p(w_1^M) = ∏_{i=1}^{M} p(w_i | w_1^{i−1})    (3)

for a sequence of M words. We can utilize the LSTM as a function l of the prior words w_1^{i−1} to calculate p(w_i). An additional bias term b_i corresponding to the unigram occurrence of w_i may be included to favor more common words, as in Ghosh et al. (2017). The output layer of l summed with this bias term then passes through a softmax activation function to normalize the outputs, producing

p(w_i | w_1^{i−1}) = softmax(l(w_1^{i−1}) + b_i)    (4)

The algorithm would simply repeat this calculation, each step incrementing i and selecting the most probable word, until a period is produced, indicating the end of the sentence and the completion of the algorithm.
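A minimal sketch of this generation loop, assuming a trained model wrapped as a hypothetical function lstm_step(history) that returns unnormalized scores, a unigram bias vector, and an id-to-token mapping (all illustrative names, not the proposal's actual code), might look like this:

import numpy as np

def generate(lstm_step, bias, id2token, prime_ids, max_len=50):
    """Greedy decoding loop for Section 3.1: at each step the model scores
    the next token given the history, the unigram bias b_i is added, softmax
    normalises the scores, and the argmax is emitted until a period appears."""
    history = list(prime_ids)
    out = []
    for _ in range(max_len):
        scores = lstm_step(history) + bias        # l(w_1^{i-1}) + b_i
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax
        nxt = int(np.argmax(probs))               # most probable next token
        out.append(id2token[nxt])
        history.append(nxt)
        if id2token[nxt] == ".":                  # end of sentence
            break
    return " ".join(out)

# toy demo with a random scorer over a 5-token vocabulary
rng = np.random.default_rng(0)
print(generate(lambda h: rng.normal(size=5), np.zeros(5),
               ["i", "am", "happy", "today", "."], [0]))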

3.2 Incorporating Affective Data

The DepecheMood lexicon contains affective data for over 13,500 words, each rated along a continuous interval [0, 1] for eight affective dimensions: {Fear, Amusement, Anger, Annoyance, Indifference, Happiness, Inspiration, Sadness} (Staiano and Guerini, 2014). As a comical American example, DepecheMood rates the word "president" as {0.2, 0.346, 0.626, 1.0, 0.528, 0.341, 0.0, 0.115} respectively; that is, moderately infuriating, never inspiring, and completely annoying.

For this project, I would use the more typical categories of joy, sadness, fear, and anger. This lexicon contrasts with other popular lexicons with affective data, e.g. LIWC, in that these ratings are continuous along [0, 1]. This property allows for more precise matching to the affective state, and for movement along a gradient between the four dimensions.

In addition to the bias term from the previous section, we would account for affect by introducing a third term d(w_i, e) which favors the words most affectively similar to the desired output.

143

Page 154: Proceedings of ACL 2018, Student Research …Shibhansh Dohare, Vivek Gupta and Harish Karnick ix Wednesday July 18, 2018 Biomedical Document Retrieval for Clinical Decision Support

Graphically, this equation maps the vocabulary V into four-dimensional space corresponding to each word's emotional valences, and calculates the distance between e, a point in this space, and each word w_i ∈ V. This inverse distance function can be written as such:

d(w_i, e) = softmax( 1 / √( Σ_{j=1}^{|e|} (w_{ij} − e_j)² ) )    (5)

where j enumerates each of the four affective dimensions, w_{ij} is the intensity along that dimension, and e_j is the target intensity.
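As an illustration of equation (5), and assuming a toy stand-in for the DepecheMood ratings, the following sketch turns inverse distances to the target point e into a softmax weighting over candidate words (the small epsilon is only there to avoid division by zero and is not part of the formula):

import numpy as np

# Hypothetical mini-lexicon: word -> [joy, sadness, fear, anger] in [0, 1]
lexicon = {
    "happy":   np.array([0.9, 0.05, 0.05, 0.0]),
    "worried": np.array([0.1, 0.3,  0.8,  0.1]),
    "furious": np.array([0.0, 0.2,  0.2,  0.95]),
}

def affect_weights(e, lexicon):
    """Equation (5): softmax over the inverse Euclidean distance between the
    target emotion point e and each word's affect vector."""
    words = list(lexicon)
    inv_dist = np.array([1.0 / (np.linalg.norm(lexicon[w] - e) + 1e-8) for w in words])
    exp = np.exp(inv_dist - inv_dist.max())
    return dict(zip(words, exp / exp.sum()))

print(affect_weights(np.array([0.2, 0.1, 0.9, 0.1]), lexicon))  # favours "worried"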

3.3 Optimizing d(w_i, e)

In its current form, this function requires a calculation for each w_i ∈ V, which has a suboptimal time complexity of O(n). We can reduce this to O(log n) by approximating e with its nearest neighbor in V (an O(log n) operation, as I will soon explain), whose distances to every other word can be precomputed and accessed via a hash table in O(1). This replacement can be done without loss of accuracy due to the density of the data set.

To find e's nearest neighbor in V, we can utilize a modified QuadTree algorithm to reduce the search space to some arbitrary n much smaller than 13,500. To briefly review, this algorithm organizes a set of data points into a hierarchical tree whose leaf nodes contain lists of fewer than n points (Finkel and Bentley, 1974). Querying this tree to find nearby data points has been empirically shown to take O(log n) time on average, a modest improvement over O(n).
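The proposal calls for a modified QuadTree; purely as an illustrative stand-in with the same roughly O(log n) query behavior, the sketch below builds a k-d tree (via scipy, an assumed dependency) over a toy affect matrix and snaps e to its nearest vocabulary word:

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical affect matrix: one 4-dimensional row per vocabulary word
vocab = ["happy", "worried", "furious", "calm"]
affect = np.array([[0.9, 0.05, 0.05, 0.0],
                   [0.1, 0.3,  0.8,  0.1],
                   [0.0, 0.2,  0.2,  0.95],
                   [0.6, 0.1,  0.05, 0.0]])

tree = cKDTree(affect)                 # built once, offline

def nearest_word(e):
    """Approximate the target point e by its nearest vocabulary word; the
    proposal suggests a QuadTree for this step, and a k-d tree is used here
    only as a comparable structure with O(log n) queries."""
    _, idx = tree.query(e)
    return vocab[idx]

print(nearest_word(np.array([0.15, 0.25, 0.85, 0.1])))   # -> "worried"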

3.4 Avoiding Over-Emphasis

As Zhao et al. (2018) noted in their ECM project, "emotional responses are relatively short lived and involve changes." These changes, which the authors call "emotion dynamics", involve modeling emotions as quantities which decay at each step. This avoids the problem of over-emphasizing the input e by repeatedly expressing the same state and unintentionally compounding its strength. Their implementation of this decay involves updating e_j in (5) by subtracting w_{ij} for each dimension j. Therefore, upon completion, e would be close to [0, 0, 0, 0].
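A minimal sketch of this decay update, assuming the same four-dimensional affect vectors as above and clipping at zero (the clipping is my addition, to keep e in the valid range), would subtract the emitted word's affect from e after each step:

import numpy as np

def decay_emotion(e, word_affect):
    """Emotion-dynamics sketch (Section 3.4): after emitting a word, subtract
    its affect vector from the target state so intensity is not compounded,
    clipping at zero so e drifts toward [0, 0, 0, 0]."""
    return np.clip(e - word_affect, 0.0, 1.0)

e = np.array([0.2, 0.1, 0.9, 0.1])
e = decay_emotion(e, np.array([0.1, 0.3, 0.8, 0.1]))   # after generating "worried"
print(e)   # [0.1, 0.0, 0.1, 0.0]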

Preliminary experimentation would certainly need to reveal the appropriate weights for g, d, and b_i such that precision of emotion does not sacrifice grammaticality.

4 Implementation, Training, and Review

For the LSTM, we would follow the suggestion of Sundermeyer et al. (2012) and implement a network using TensorFlow (http://www.tensorflow.org) with two hidden layers of 200 nodes: the first being a projection layer of standard neural network units and the second being a hidden layer of LSTM units. The output layer would also have 200 nodes.

The same authors later suggest training the network using the cross-entropy error criterion, using the function

F(A) = − Σ_{i=1}^{M} log p_A(w_i | w_{i−n+1}^{i−1})    (6)

where M is the size of the training corpus (Sundermeyer et al., 2015). For a stochastic gradient descent algorithm, we can obtain a gradient using epochwise backpropagation through time (BPTT) on the first pass, and update the weights on the second pass as specified in Sundermeyer et al. (2014).
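For concreteness, equation (6) reduces to summing the negative log-probabilities that the model assigns to the gold next words; the toy computation below uses made-up probabilities purely as an example:

import numpy as np

def cross_entropy(probs_of_targets):
    """Equation (6): F(A) = -sum_i log p_A(w_i | history), where
    probs_of_targets holds the model probability of each gold next word."""
    return -np.sum(np.log(probs_of_targets))

print(cross_entropy(np.array([0.4, 0.9, 0.05])))   # ≈ 4.0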

The training corpus for this algorithm need only be some collection of natural dialogue. This can be catered to the environment the model will be used in, but for our purposes the Ubuntu Dialogue Corpus will be used for its generality, accessibility, dyadic nature, and size (Lowe et al., 2016).

Additional funds have been secured to utilize Amazon's Mechanical Turk to analyze loss in grammaticality and to verify the manipulation of e, by simply having raters assess sentences along a Likert scale and comparing this data to the intended affect. I predict that certain categories and combinations will be harder to express in text than others, resulting in higher variability and weaker correlations. For example, anger and sadness (an approximation of remorse) may be rather easy to express, but joy and sadness may be difficult. Visualizing these strengths and weaknesses will be an interesting reflection of the English language.

5 Conclusion

In this research proposal, I have given a brief overview of a natural language generation algorithm which is capable of producing utterances expressive of multiple emotional dimensions simultaneously. The DepecheMood lexicon enables the


mapping of words into n-dimensional space, allowing us to prefer words which minimize the distance between them and the target affective state along the four emotional dimensions.

After clarifying implementation details such as LSTM node construction, neural architecture, and use of the training corpus, this algorithm has the potential to add further nuance to our current models of generating affect, in a way that is consistent with psychological and neuro-biological findings.

References

R. A. Finkel and J. L. Bentley. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Informatica, 4(1):1–9.

Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. 2017. Affect-LM: A neural language model for customizable affective text generation. ACL, pages 634–642.

Carroll E. Izard. 2009. Emotion theory and research: Highlights, unanswered questions, and emerging issues. Annual Review of Psychology, 60:1–25.

Ryan Lowe, Nissan Pow, Iulian V. Serban, and Joelle Pineau. 2016. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems.

Rosalind W. Picard. 1995. Affective computing. Technical report, Massachusetts Institute of Technology, Perceptual Computing Section.

Helmut Prendinger, Junichiro Mori, and Mitsuru Ishizuka. 2005. Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. Human-Computer Studies, 2:231–245.

Jacopo Staiano and Marco Guerini. 2014. DepecheMood: a lexicon for emotion analysis from crowd-annotated news.

Justin Storbeck and Gerald L. Clore. 2007. On the interdependence of cognition and emotion. Cognition and Emotion, 21(6):1212–1237.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2012. LSTM neural networks for language modeling. In Interspeech.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2014. rwthlm - the RWTH Aachen University neural network language modeling toolkit. In Interspeech, pages 2093–2097.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3).

Hao Zhao, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory.


Proceedings of ACL 2018, Student Research Workshop, pages 146–152, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning

Pravallika Etoori
LTRC, KCIS, IIIT Hyderabad
[email protected]

Manoj Chinnakotla
Microsoft, Bellevue, USA
[email protected]

Radhika Mamidi
LTRC, KCIS, IIIT Hyderabad
[email protected]

Abstract

Spelling correction is a well-known task in Natural Language Processing (NLP). Automatic spelling correction is important for many NLP applications like web search engines, text summarization, sentiment analysis etc. Most approaches use parallel data of noisy and correct word mappings from different sources as training data for automatic spelling correction. Indic languages are resource-scarce and do not have such parallel data due to the low volume of queries and the non-existence of such prior implementations. In this paper, we show how to build an automatic spelling corrector for resource-scarce languages. We propose a sequence-to-sequence deep learning model which trains end-to-end. We perform experiments on synthetic datasets created for the Indic languages Hindi and Telugu by incorporating the spelling mistakes committed at the character level. A comparative evaluation shows that our model is competitive with the existing spell checking and correction techniques for Indic languages.

1 Introduction

Spelling correction is important for many potential NLP applications such as text summarization, sentiment analysis, and machine translation (Belinkov and Bisk, 2017). Automatic spelling correction is crucial in search engines, as spelling mistakes are very common in user-generated text. Many websites have a feature of automatically giving correct suggestions to misspelled user queries in the form of Did you mean? suggestions or automatic corrections. Providing suggestions makes it convenient for users to accept a proposed correction without retyping or correcting the query manually. This task is approached by collecting similar-intent queries from user logs (Hasan et al., 2015; Wilbur et al., 2006; Ahmad and Kondrak, 2005). The training data is automatically extracted from event logs where users re-issue their search queries with potentially corrected spelling within the same session. Example query pairs are (house lone, house loan), (ello world, hello world), (mobilephone, mobile phone). Thus, large amounts of data are collected and models are trained using techniques like Machine Learning, Statistical Machine Translation, etc.

The task of spelling correction is challenging for resource-scarce languages. In this paper, we consider the Indic languages Hindi and Telugu because of their resource scarcity. Due to their lower query share, we do not find the same level of parallel alteration data from logs. We also do not have many language resources such as Part-of-Speech (POS) taggers, parsers, etc. to linguistically analyze and understand these queries. Due to the lack of relevant data, we create a synthetic dataset using highly probable spelling errors and real-world errors in Hindi and Telugu given by language experts. Similarly, a synthetic dataset can be created for any resource-scarce language by incorporating its real-world errors. Deep Learning techniques have shown enormous success in sequence-to-sequence mapping tasks (Sutskever et al., 2014). Most of the existing spell-checkers for Indic languages are implemented using rule-based techniques (Kumar et al., 2018). In this paper, we approach the spelling correction problem for Indic languages with Deep Learning. This model can be employed for any resource-scarce language. We propose a character-based Sequence-to-sequence text Correction Model for Indic Languages (SCMIL) which trains end-to-end.

Our main contributions in this paper are summarized as follows:

• We propose a character-based recurrent


sequence-to-sequence architecture with a Long Short Term Memory (LSTM) encoder and an LSTM decoder for spelling correction of Indic languages.

• We create synthetic datasets (https://github.com/PravallikaRao/SpellChecker) of noisy and correct word mappings for Hindi and Telugu by collecting highly probable spelling errors and inducing noise in a clean corpus.

• We evaluate the performance of SCMIL for this task by comparing it with various approaches such as Statistical Machine Translation (SMT), rule-based methods, and various deep learning models.

2 Related Work

Significant work has been done in the field of spell checking for Indian languages. There are spell-checkers available for Indian languages like Hindi, Marathi, Bengali, Telugu, Tamil, Oriya, Malayalam, and Punjabi.

Dixit et al. (2005) designed a rule-based spell-checker for Marathi, a major Indian language. This was the first initiative for morphology-based spell checking for Marathi. The spell-checker is based on the rules of morphology and the rules of orthography.

A spell-checker was designed for Telugu (Rao, 2011), an agglutinating Indian language which has a very complex morphology. This spell-checker is based on morphological analysis and sandhi-splitting rules. It consists of two parts: a set of routines for scanning the text (a morphological analyzer and sandhi-splitting rules) and identifying valid words, and an algorithm for comparing the unrecognized words and word parts against a known list of variantly spelled words and word parts.

Another Hindi spell-checker (Sharma and Jain, 2013) uses a dictionary with word-frequency pairs as its language model. Error detection is done by dictionary look-up. Error correction is performed using Damerau-Levenshtein edit distance and an n-gram technique. The candidates are ranked by sorting in increasing order of edit distance. Words at the same edit distance are sorted in order of their frequencies.

HINSPELL (Singh et al., 2015) is a spell-checker designed for Hindi which is implemented


using a hybrid approach. Errors are detected by dictionary look-up. Error correction is done using a Minimum Edit Distance technique, where the closest words in the dictionary to the error word are obtained. These obtained words are prioritized using a weightage algorithm and Statistical Machine Translation (SMT).

Ambili et al. (2016) designed a Malayalam spell-checker that detects errors by a dictionary look-up approach; error correction is done through an N-gram based technique. If a word is not present in the dictionary, it is identified as an error, and the N-gram based technique corrects it by finding the similarity between words and computing a similarity coefficient.

Recently, Ghosh and Kristensson (2017) proposed a Deep Learning model for text correction and completion in keyboard decoding for English. This was a first attempt at text correction using Deep Neural Networks, and it gave promising results.

Sakaguchi et al. (2017) approached the problem of spelling correction using semi-character Recurrent Neural Networks on English data.

Studies of spell checking techniques for Indian languages (Kumar et al., 2018; Gupta and Mathur, 2012) show that the existing spell-checkers have two major steps: error detection and error correction. Error detection is done by dictionary look-up. Error correction consists of two steps: the generation of candidate corrections and the ranking of candidate corrections. The most studied spelling correction algorithms are: edit distance, similarity keys, rule-based techniques, n-gram-based techniques, probabilistic techniques, neural networks, and the noisy channel model. All of these methods can be thought of as calculating a distance between the misspelled word and each word in the dictionary or index. The shorter the distance, the higher the dictionary word is ranked.

While there have been a few attempts to design spell-checkers for English and a few other languages using Machine Learning, to the best of our knowledge no such prior work has been attempted for Indian languages.

3 Model Description

We address the spelling correction problem for Indic languages by having a separate corrector network as an encoder and an implicit language model as a decoder in a sequence-to-sequence


attention model that trains end-to-end.

3.1 Sequence-to-sequence Model

Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014; Cho et al., 2014) have enjoyed great success in a variety of tasks such as machine translation, speech recognition, image captioning, and text summarization. A basic sequence-to-sequence model consists of two neural networks: an encoder that processes the input and a decoder that generates the output. This model has shown great potential in input-output sequence mapping tasks like machine translation. An input-side encoder captures the representations in the data, while the decoder gets the representation from the encoder along with the input and outputs a corresponding mapping to the target language. Intuitively, this architectural set-up seems to naturally fit the regime of mapping noisy input to de-noised output, where the corrected prediction can be treated as a different language and the task can be treated as Machine Translation.

3.2 System Architecture

The Recurrent Neural Network (RNN) (Rumelhart et al., 1986; Werbos, 1990) is a natural generalization of feed-forward neural networks to sequences. Given a sequence of inputs (x_1, . . . , x_T), a standard RNN computes a sequence of outputs (y_1, . . . , y_T) by iterating the following equations:

h_t = sigm(W^{hx} x_t + W^{hh} h_{t−1})    (1)

y_t = W^{yh} h_t    (2)
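A minimal NumPy rendering of equations (1)-(2), with toy dimensions chosen only for illustration, makes the recurrence explicit:

import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_yh):
    """One step of the vanilla RNN in equations (1)-(2):
    h_t = sigmoid(W_hx x_t + W_hh h_{t-1}),  y_t = W_yh h_t."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hx @ x_t + W_hh @ h_prev)))
    y_t = W_yh @ h_t
    return h_t, y_t

# toy dimensions: 4-dim input, 3-dim hidden state, 2-dim output
rng = np.random.default_rng(0)
W_hx, W_hh, W_yh = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
h, y = rnn_step(rng.normal(size=4), np.zeros(3), W_hx, W_hh, W_yh)
print(h, y)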

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. In fact, recurrent neural networks, long short-term memory networks (Hochreiter and Schmidhuber, 1997), and gated recurrent neural networks (Chung et al., 2014) have become standard approaches in sequence modelling and transduction problems such as language modelling and machine translation.

RNNs struggle to cope with long-term dependencies in the data due to the vanishing gradient problem (Hochreiter, 1998). This problem is addressed by using Long Short Term Memory (LSTM) recurrent neural networks.

SCMIL has a similar underlying architecture to sequence-to-sequence models. The encoder and decoder in SCMIL operate at the character level.

Encoder: In SCMIL, the encoder is a character-based LSTM. With an LSTM as the encoder, the input sequence is modeled as a list of vectors, where each vector represents the meaning of all characters in the sequence read so far.

Decoder: The decoder in SCMIL is a character-level LSTM recurrent network with attention. The output from the encoder is the final hidden state of the character-based LSTM encoder. This becomes the input to the LSTM decoder.

By letting the decoder have an attention mechanism (Bahdanau et al., 2014), the encoder is relieved from the burden of having to encode all the information in the source sequence into a fixed-length vector. With attention, the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly. The attention mechanism computes a fixed-size vector that encodes the whole input sequence based on the sequence of all the outputs generated by the encoder, as opposed to the plain encoder-decoder model, which looks only at the last state generated by the encoder for all the slices of the decoder.

Thus, SCMIL is a sequence-to-sequence attention model with a bidirectional RNN encoder and an attention decoder, trained end-to-end with a character-based representation on both the encoder and decoder sides.
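As a rough illustration of this architecture (not the actual SCMIL code, which is built with tf-seq2seq on TensorFlow 1.0.1 and uses a bidirectional encoder), the following sketch wires up a character-level encoder, an additive attention layer, and a decoder in modern tf.keras; the vocabulary size, unit sizes, and layer choices here are placeholders:

import tensorflow as tf

# Placeholder sizes; 256 units mirrors the cell size reported in Section 3.3
VOCAB, EMB, UNITS = 128, 256, 256

# Character-level encoder: embedding + LSTM, keeping all time steps for attention
enc_in = tf.keras.Input(shape=(None,), dtype="int32", name="noisy_chars")
enc_emb = tf.keras.layers.Embedding(VOCAB, EMB)(enc_in)
enc_seq, enc_h, enc_c = tf.keras.layers.LSTM(
    UNITS, return_sequences=True, return_state=True)(enc_emb)

# Character-level decoder, initialised with the encoder's final state
# (during training the decoder input is the gold output shifted by one step)
dec_in = tf.keras.Input(shape=(None,), dtype="int32", name="shifted_correct_chars")
dec_emb = tf.keras.layers.Embedding(VOCAB, EMB)(dec_in)
dec_seq = tf.keras.layers.LSTM(UNITS, return_sequences=True)(
    dec_emb, initial_state=[enc_h, enc_c])

# Bahdanau-style additive attention over the encoder outputs
context = tf.keras.layers.AdditiveAttention()([dec_seq, enc_seq])
merged = tf.keras.layers.Concatenate()([dec_seq, context])
out = tf.keras.layers.Dense(VOCAB, activation="softmax")(merged)

model = tf.keras.Model([enc_in, dec_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
model.summary()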

3.3 Training details

All the code is written in Python 2.7 using tf-seq2seq (Britz et al., 2017), a general-purpose encoder-decoder framework for the TensorFlow deep learning library (Abadi et al., 2016), version 1.0.1. Both the encoder and the decoder are jointly trained end-to-end on the synthetic datasets we created. SCMIL uses a learning rate of 0.001, a batch size of 100, a sequence length of 50 characters, and 10,000 training steps. The size of the encoder LSTM cell is 256 with one layer. The size of the decoder LSTM cell is 256 with two layers. We use Adam optimization (Kingma and Ba, 2014) for training SCMIL. The character embedding dimension is fixed to 256 and the dropout rate to 0.8.


4 Experiments and Results

We performed experiments with SCMIL and other models using the synthetic datasets which we created for the Indic languages Hindi and Telugu. Hindi is the most prominent Indian language and the third most spoken language in the world. Telugu is the most widely spoken Dravidian language in the world and the third most spoken native language in India.

4.1 Dataset details

Due to the lack of data with error patterns in Indic languages, we have built a synthetic dataset that SCMIL is trained on. Initially, we create data lists for Hindi and Telugu. For this, we have extracted a corpus of the most frequent Hindi words (https://ltrc.iiit.ac.in/download.php) and the most frequent Telugu words (https://en.wiktionary.org/Frequency lists/Telugu). We have also extracted Hindi and Telugu movie names for the movies released between the years 1930 and 2018 from Wikipedia, which constitute the phrases in the data lists. Thus, the Hindi and Telugu data lists consist of words and phrases of at most five words.

For each data instance in the data list, multiple noisy words are generated by introducing errors. The types of errors include insertion, deletion, and substitution of one character, and word fusing. Spaces between words are randomly dropped in phrases to simulate the word-fusing problem. The lists of errors for Hindi and Telugu were created by collecting the spelling errors most commonly committed by users in each of these languages. We created these error lists from linguistic resources and with help from language experts. The language experts analyzed Hindi and Telugu usage and listed the most probable errors. These errors are based on observations of real data and the lexicons of Hindi and Telugu. Thus, the synthetic datasets are made as close as possible to real-world user-generated data.

Table 2 shows an example of the generation of noisy words corresponding to a correct Hindi word. The pairs of noisy word and original word constitute the parallel data for training. Table 1 gives the details about the size of the synthetic datasets for Hindi and Telugu.
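A sketch of this noise-injection procedure, with a plain Latin alphabet standing in for the expert-curated Hindi/Telugu error lists and with function and constant names of my own choosing, could look as follows:

import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # stand-in for the real character inventory

def make_noisy(text, n_variants=3):
    """Generate noisy variants of a correct word or phrase by one-character
    insertion, deletion or substitution, plus word fusing (dropping a space),
    mirroring the error types described in Section 4.1. Real errors would be
    drawn from the expert-curated lists rather than random characters."""
    variants = set()
    while len(variants) < n_variants:
        chars = list(text)
        op = random.choice(["insert", "delete", "substitute", "fuse"])
        pos = random.randrange(len(chars))
        if op == "insert":
            chars.insert(pos, random.choice(ALPHABET))
        elif op == "delete":
            del chars[pos]
        elif op == "substitute":
            chars[pos] = random.choice(ALPHABET)
        elif op == "fuse" and " " in text:
            spaces = [i for i, c in enumerate(chars) if c == " "]
            del chars[random.choice(spaces)]
        noisy = "".join(chars)
        if noisy != text:
            variants.add(noisy)
    return [(v, text) for v in variants]   # (noisy, correct) training pairs

print(make_noisy("prem kahaanii"))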

4.2 Baseline Methods

We perform experiments on various models. The datasets are divided into train, dev, and test partitions


randomly in the ratio 80:10:10, respectively. In all our results, the models learn over the train partition, get tuned over the dev partition, and are evaluated over the test partition.

We train an SMT model using Moses (Koehn et al., 2007) on the Hindi and Telugu synthetic datasets. This is our main baseline model. The standard set of features is used, including a phrase model, a length penalty, a jump penalty, and a language model. This SMT model is trained at the character level. Hence, the system learns mappings between character-level phrases. The Moses framework allows us to easily conduct experiments with several settings and compare with SCMIL.

The other baselines are character-based sequence-to-sequence attention models: CNN-GRU and GRU-GRU. All the models compared in this set of experiments use batch sizes of 100 inputs and a maximum sequence length of 50 characters, with a learning rate of 0.001, and run for 10,000 steps. Throughout, the size of the GRU cell is 256 with one layer. The CNN consists of 5 filters with sizes varying in the range [2, 3, 4]. These multiple filters of particular widths produce the feature maps, which are then concatenated and flattened for further processing.

4.3 Results and Analysis

Table 3 shows the accuracies reported by SCMIL and the SMT method. To check whether Moses performs better on larger training data, we increased the size of the Hindi synthetic dataset to 20567 and performed SMT; the accuracy was 62%, which is almost equivalent to the accuracy on the original synthetic dataset. SCMIL outperforms Moses, an SMT technique, by a huge margin. In Table 3, we also find that SCMIL performs better than other sequence-to-sequence models with different convolutional and recurrent encoder-decoder combinations. These accuracies are on the test set over the entire sequence. Thus, the results show that SCMIL performs better than all other baseline models. Our results support the conclusion by Britz et al. (2017) that LSTMs outperform GRUs.

The rule-based spell-checker for Hindi, HINSPELL (Singh et al., 2015), reported an accuracy of 77.9% on a dataset of 870 misspelled words randomly collected from books, newspapers, people, etc. The data used in Singh et al. (2015) and the HINSPELL system are not available. Hence, we


Language   High Frequency Words   Movie names   Size of parallel corpus
Hindi      15000                  3021          108587
Telugu     10000                  3689          92716

Table 1: Details of the synthetic datasets for Hindi and Telugu.

Correct word (transliterated)   Noisy words (transliterated)
prem kahaanii                   prema kahaanii; prem kahanii; premkahaanii; prem kahaani; praim kahaanii

Table 2: Example of noisy-word generation for a Hindi word, shown with the corresponding transliterations.

implemented HINSPELL using the Shabdanjali dictionary (https://ltrc.iiit.ac.in/download.php), which consists of 32952 Hindi words. This system, when tested on our Hindi synthetic dataset, gave an accuracy of 72.3%. This accuracy being lower than the original HINSPELL accuracy can be attributed to the larger size of the testing data and the inclusion of out-of-vocabulary words. Thus, SCMIL outperforms HINSPELL by reporting an accuracy of 85.4%.

In Table 4, we show predictions given by SCMIL on a few Hindi inputs. The results show that errors are contextually learned by the model. It can also be seen that the model has learned the word-fusing problem. Also, the error detection step is not required separately, as SCMIL trains end-to-end and learns the presence of noise in the input based on context. The advantage of SCMIL over rule-based techniques is that it can handle out-of-vocabulary words, as it is trained at the character level. For example, the last entry in Table 4 is the English word bomb spelled in Hindi. The model has learned it and corrects it when misspelled in Hindi as bombh. Dictionary-based techniques fail in such cases. The model fails in a few cases when it corrects one character of the misspelled word instead of another, like the example in the fourth row of Table 4.

The slightly higher accuracy of SCMIL on Telugu data than on Hindi data might be due to the fact that the highly probable spelling errors used in data creation are slightly fewer in number for Telugu when compared to Hindi. This can be handled for any language with more errors by increasing the size


Model         Hindi (%)   Telugu (%)
Moses (SMT)   62.8        64.7
CNN-GRU       74.4        79.7
GRU-GRU       77.6        84.3
SCMIL         85.4        89.3

Table 3: Accuracy of spelling correction on the Hindi and Telugu synthetic datasets for Moses (SMT), the character-based deep learning models (CNN-GRU and GRU-GRU), and SCMIL.

of the dataset so that it includes enough data instances capturing each kind of error.

5 Future Work

This paper is an initial approach to automatic spelling correction for Indic languages using Deep Learning, and we have obtained results that are competitive with the existing techniques. SCMIL can be improved and extended in many ways.

SCMIL presently deals with spelling corrections at the word level. It can be further extended to automatically make not only spelling corrections but also grammar corrections at the phrase or sentence level.

The synthetic dataset can be improved by collecting noisy words from different platforms like social media, blogs, etc. and introducing these real-world errors into a clean corpus. This will improve the performance of the model on user-generated data. Further, for proper evaluation, the model should be tested on real-world user-generated parallel data.

One more potential improvement would be to change the training data from words and phrases to sentences. This will help in achieving context-based spelling correction.

The spelling correction model can be extended to a text correction and completion model. Changing the decoder from the character level to the word level will add the functionality of auto-completion. This will improve the scope of the model in various applications.


Input          Prediction     Correct output
raajasi        raajasii       raajasii
dhoaankhei     dho aankhei    dho aankhei
ans            ansh           ansh
dhashaavatar   dhashavatar    dhashaavataar
baambh         baamb          baamb

Table 4: Qualitative evaluation of predictions by SCMIL on a few Hindi inputs along with the expected outputs (shown in transliteration).

6 Summary and Conclusion

In this paper, we proposed SCMIL for automatic spelling correction in Indic languages, which employs a recurrent sequence-to-sequence attention model to learn spelling corrections from noisy and correct word pairs. We created a parallel corpus of noisy and correct spellings for training by introducing spelling errors into correct words. We validated SCMIL on these synthetic datasets created for Hindi and Telugu. We implemented spelling correction using Moses, an SMT system, as a baseline model. We evaluated our system against existing techniques for Indic languages and showed favorable results. Finally, we discussed possible extensions to improve the scope of SCMIL and to perform better evaluation.

SCMIL can be used in applications like search engines, as we have shown that it automatically corrects input text in Indic languages. Most deep learning models train on billions of data instances. On the contrary, SCMIL trains on a dataset of less than a million parallel instances and gives competitive results. This shows that our approach can be used for automatic spelling correction of any resource-scarce language.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

Farooq Ahmad and Grzegorz Kondrak. 2005. Learning a spelling error model from search query logs. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 955–962. Association for Computational Linguistics.

Ambili, Panchami KS, and Neethu Subash. 2016. Automatic error detection and correction in Malayalam. International Journal of Science Technology and Engineering (IJSTE), 3(2).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Veena Dixit, Satish Dethe, and Rushikesh K Joshi. 2005. Design and implementation of a morphology-based spellchecker for Marathi, an Indian language. Archives of Control Science, 15(3):301.

Shaona Ghosh and Per Ola Kristensson. 2017. Neural networks for text correction and completion in keyboard decoding. arXiv preprint arXiv:1709.06429.

Neha Gupta and Pratistha Mathur. 2012. Spell checking techniques in NLP: a survey. International Journal of Advanced Research in Computer Science and Software Engineering, 2(12).

Sasa Hasan, Carmen Heger, and Saab Mansour. 2015. Spelling correction of user search queries through statistical machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 451–460.

Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Rakesh Kumar, Minu Bala, and Kumar Sourabh. 2018. A study of spell checking techniques for Indian languages. JK Research Journal in Mathematics and Computer Sciences, 1(1).

Uma Maheshwar Rao. 2011. Telugu spell-checker. International Telugu Internet Conference Proceedings.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature, 323(6088):533.

Keisuke Sakaguchi, Kevin Duh, Matt Post, and Benjamin Van Durme. 2017. Robsut wrod reocginiton via semi-character recurrent neural network. In AAAI, pages 3281–3287.

Amit Sharma and Pulkit Jain. 2013. Hindi spell checker. Indian Institute of Technology Kanpur.

Harsharndeep Singh et al. 2015. Design and implementation of HINSPELL - Hindi spell checker using hybrid approach. International Journal of Scientific Research and Management, 3(2).

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Paul J Werbos. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.

W John Wilbur, Won Kim, and Natalie Xie. 2006. Spelling correction in the PubMed search engine. Information Retrieval, 9(5):543–564.


Proceedings of ACL 2018, Student Research Workshop, pages 153–158, Melbourne, Australia, July 15 - 20, 2018. ©2018 Association for Computational Linguistics

Automatic Question Generation using Relative Pronouns and Adverbs

Payal Khullar, Konigari Rachna, Mukul Hase, Manish Shrivastava
Language Technologies Research Centre
International Institute of Information Technology, Hyderabad
Gachibowli, Hyderabad, Telangana-500032
{payal.khullar@research., [email protected]@student., m.shrivastava@}iiit.ac.in

Abstract

This paper presents a system that automatically generates multiple, natural language questions using relative pronouns and relative adverbs from complex English sentences. Our system is syntax-based, runs on dependency parse information of a single-sentence input, and achieves high accuracy in terms of syntactic correctness, semantic adequacy, fluency and uniqueness. One of the key advantages of our system, in comparison with other rule-based approaches, is that we nearly eliminate the chances of getting a wrong wh-word in the generated question by fetching the requisite wh-word from the input sentence itself. Depending upon the input, we generate both factoid and descriptive type questions. To the best of our knowledge, the exploitation of wh-pronouns and wh-adverbs to generate questions is novel in the Automatic Question Generation task.

1 Introduction

Asking questions from learners is said to facilitate interest and learning (Chi, 1994), to recognize problem learning areas (Tenenberg and Murphy, 2005), to assess vocabulary (Brown et al., 2005) and reading comprehension (Mitkov, 2006; Kunichika et al., 2004), to provide writing support (Liu et al., 2012), to support inquiry needs (Ali et al., 2010), etc. Manual generation of questions from a text for creating practice exercises, tests, quizzes, etc. has consumed the labor and time of academicians and instructors since forever, and with the advent of a large body of educational material available online, there is a growing need to make this task scalable. Along with that, in recent times there has been an increased demand to create

Intelligent Tutoring Systems that use computer-assisted instructional material or self-help practice exercises to aid learning as well as to objectively check a learner's aptitude and accomplishments. Inevitably, the task of Automatic Question Generation (QG) caught the attention of NLP researchers from across the globe. Automatic QG has been defined as "the task of automatically generating questions from various inputs like raw text, database or semantic representation" (Rus et al., 2008). Apart from its direct application in the educational domain, core NLP areas like Question Answering, Dialogue Generation, Information Retrieval, Summarization, etc. also benefit from large-scale automatic Question Generation.

2 Related Work

Previous work on Automatic QG has focused on generating questions using question templates (Liu et al., 2012; Mostow and Chen, 2009; Sneiders, 2002), transformation rules based largely on case grammars (Finn, 1975), general-purpose transformation rules (Heilman and Smith, 2009), tree manipulation rules (Heilman, 2011; Ali et al., 2010; Gates, 2008), discourse cues (Agarwal et al., 2011), queries (Lin, 2008), various scopes (Mannem et al., 2010), dependency parse information (Mazidi and Nielsen, 2015), topic information (Chali and Hasan, 2015), ontologies (Alsubait et al., 2015), etc. More recent approaches also apply neural methods (Subramanian et al., 2017; Zhou et al., 2017; Yuan et al., 2017; Du et al.) to generate questions.

In the current paper, we fetch relative pronouns and relative adverbs from complex English sentences and use dependency-based rules, grounded in the linguistic theory of relative clause syntactic structure, to generate multiple relevant questions. The work follows in the tradition of the question writing algorithm (Finn, 1975) and the transformation


rules based approach (Heilman and Smith, 2009). However, while Finn's work was based largely around case grammars (Fillmore, 1968), our system exploits dependency parse information using the spaCy parser (Honnibal and Johnson, 2015), which provides us with a better internal structure of complex sentences to work with. The general-purpose transformation rules in Heilman's system do not work well on sentences with a highly complex structure, as we show in the section on comparison and evaluation. Although no other state-of-the-art system focuses specifically on QG from relative pronouns and relative adverbs, a more recent Minimal Recursion Semantics-based QG system (Yao et al., 2012) has a sub-part that deals with sentences with a relative clause, but less comprehensively. We differ from their system in that, for one, we do not decompose the complex sentence into simpler parts to generate questions; the rules are defined for the dependencies between relative pronouns and relative adverbs and the rest of the sentence as a whole. Secondly, our system generates a different set and a larger number of questions per sentence than their system.

3 Why Relative Clauses?

In complex sentences, relative pronouns or relative adverbs perform the function of connecting or introducing the relative clause that is embedded inside the matrix clause. Examples of these in English include who, whom, which, where, when, how, why, etc. An interesting thing about both relative pronouns and relative adverbs is that they carry unique information on the syntactic relationship between specific parts of the sentence. For example, consider the following sentence in English:

I am giving fur balls to John who likes cats.

In this sentence, the relative pronoun who modifies the object of the root verb give of the matrix sentence. At the same time, it acts as the subject of the relative clause likes cats, which it links the matrix clause with. In this paper, we aim to exploit exactly this structural relationship to generate questions, thereby adding to the pool of questions that can be generated from a given sentence. One of the key benefits of using the information from relative pronouns and relative adverbs is that we are not likely to go wrong with the wh-word, as we fetch it from the relative clause itself to generate the question. This gives our system an edge

over other QG systems.

4 System Description

We split the complete QG task into the following sub-parts: the input natural language sentence is first fed into the spaCy parser. Using the parse information, the system checks for the presence of one or more relative pronouns or adverbs in the sentence. After that, it further checks for well-defined linguistic features in the sentence, such as the tense and aspect type of the root and relative clause verbs, the head-modifier relationships between different parts of the sentence, etc., and accordingly sends the information to the rule sets. Depending upon which rule in the rule sets the information is sent to, questions are generated. We define our rule sets in the next section.

5 Rule Sets

For each of the nine relative pronouns and relative adverbs in English (who, whom, whose, which, that, where, when, why, how) that we took into consideration, we defined three sets of rules. Each of the three rule sets further contains a total of ten rules, backed by linguistic principles. Each relative pronoun or relative adverb in the sentence is first checked against a set of requirements before getting fed into the rules. Depending upon the relative pronoun or relative adverb and the type of relative clause (restrictive or unrestrictive), questions are generated. We present an example of one out of the ten rules from each of the three rule sets in the next section.

5.1 Rule Set 1.

We know that the relative pronoun (RP) modifies the object of the root of the matrix sentence. Before feeding the sentence to the rule, we first check the sentence for the tense and aspect of the root verb and also for the presence of modals and auxiliary verbs (aux). Based on this information, we then accordingly perform do-insertion before the Noun Phrase (NP) or aux/modal inversion. For a sentence of the following form, with an optional Preposition Phrase (PP) and RP who that precedes a relative clause (RC),

NP (aux) Root NP (PP)+ {RC}

the rule looks like this:

RP aux NP root NP Preposition?

Hence, for the example sentence introduced in the previous section, we get the following question


using the first rule:

Who/Whom am I giving fur balls to?

In the representation of the rules, we follow the general linguistic convention, which is to put round brackets around optional elements and a '+' symbol for multiple possible occurrences of a word or phrase.

5.2 Rule Set 2.

The next understanding about the relative pronoun or adverb comes from the relative clause it introduces or links the matrix sentence with. The relative pronoun can sometimes act as the subject of the verb of the relative clause. This forms the basis for the rules in the second set.

After checking the dependency of the RP, which should be the noun subject (n-subj) of the verb in the relative clause, we then check the tense and aspect of the relative clause verb and the presence of modals and auxiliary verbs. Based on this information, we then accordingly perform do-insertion or modal/auxiliary inversion. For a relative clause of the following form, with the relative pronoun who,

{matrix} RP (aux/modals)+ RC verb (NP) (PP)+

the rule looks like this:

RP do-insertion/aux/modal RC verb (NP) (Preposition)?

Taking the same example sentence, we get the following question using the second rule:

Who likes cats?
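To make the shape of such a rule concrete, the sketch below is only an illustration and not the system's actual rule implementation: it uses a spaCy dependency parse to detect a relative pronoun acting as the subject (nsubj) of a relative-clause verb and reuses it as the wh-word. The function name rule_set_2 and the simplifications (no tense/aspect checks, no do-insertion or aux inversion) are mine, and the en_core_web_sm model is assumed to be installed.

import spacy

# assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def rule_set_2(sentence):
    """Illustrative reduction of Rule Set 2: if a relative pronoun is the
    subject (nsubj) of a relative-clause verb, reuse it as the wh-word and
    copy the remainder of the relative clause."""
    doc = nlp(sentence)
    questions = []
    for tok in doc:
        if tok.dep_ == "relcl":                       # relative-clause verb
            wh = [c for c in tok.children
                  if c.dep_ == "nsubj" and c.tag_ in ("WP", "WDT")]
            if wh:
                rest = [t.text for t in tok.subtree if t.i > wh[0].i]
                questions.append(wh[0].text.capitalize() + " " + " ".join(rest) + "?")
    return questions

print(rule_set_2("I am giving fur balls to John who likes cats."))
# expected, as in the running example: ['Who likes cats?']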

5.3 Rule Set 3.

The relative pronoun modifies or gives more information on the head of the noun phrase of the preceding sentence. This forms the basis for the rules in the third set. Before feeding the sentence to this rule, we first check the tense of the relative clause verb along with its number agreement. We do this because English auxiliaries and copulas carry tense and number features, and we need this information to insert their correct form. For a sentence of the following form:

NP (aux/modals) Root NP RP (aux)+ RC verb (NP) (PP)+

the rule looks like this:

RP aux Head of NP?

Taking the first example sentence from the previous sections, we get the following question using the fourth rule:

Who is John?

In a similar fashion, we define rules for all other

relative pronouns and adverbs that we listed in theprevious section.
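The tense and number check above exists so that the inserted copula agrees with the head noun. The toy sketch below illustrates that idea only; the helper name, its arguments, and the simplistic morphology are assumptions and do not reflect the paper's actual rule.

# Toy illustration of copula insertion with tense/number agreement; not the paper's code.
def rule_set_3_question(wh_word, head_np, tense="present", plural=False):
    if tense == "present":
        copula = "are" if plural else "is"
    else:
        copula = "were" if plural else "was"
    return f"{wh_word} {copula} {head_np}?"

print(rule_set_3_question("Who", "John"))                          # -> Who is John?
print(rule_set_3_question("Which", "the players", "past", True))   # -> Which were the players?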

6 Evaluation Criteria

There is no standard way to evaluate the output of a QG system. In the current paper, we go with manual evaluation, where 4 independent human evaluators, all non-native English speakers but proficient in English, give scores to questions generated by the system. The scoring schema is similar to the one used by Agarwal et al. (2011), albeit with some modifications. To judge syntactic correctness, the evaluators give a score of 3 when the questions are syntactically well-formed and natural, 2 when they have a few syntactic errors, and 1 when they are syntactically unacceptable. Similarly, for semantic correctness, the raters give a score of 3 when the questions are semantically correct, 2 when they have an odd meaning, and 1 when they are semantically unacceptable. Unlike Agarwal et al. (2011), we test fluency and semantic relatedness separately. The former tells us how natural the question reads. A question with many embedded clauses and adjuncts can be syntactically acceptable, but it disturbs the intended purpose of the question and should therefore be avoided. For example, a question like Who is that girl who works at Google which has its main office in America which is a big country? is syntactically and semantically fine, but is not as fluent as the question Who is that girl who works at Google?, which is essentially the same question stated more fluently. The evaluators give a score of 1 to questions that are not fluent and 2 to those that are. Lastly, the evaluators rate the questions for how unique they are. Adding this criterion is important because questions generated for academic purposes need to cover different aspects of the sentence. Therefore, if the generated questions are more or less alike, the evaluators give them a low score on distribution or variety. For a well-distributed output the score is 2, and for a less distributed one it is 1. The evaluators give a score of 0 when there is no output for a given sentence. The scores obtained separately for syntactic correctness, semantic adequacy, fluency and distribution are used to compare the performance of the two systems.

7 Evaluation

We take sentences from the Wikipedia corpus. Out of a total of 25,773,127 sentences, 3,889,289 sentences contain one or more relative pronouns or relative adverbs. This means that sentences with relative clauses form roughly 15% of the corpus. To conduct the manual evaluation, we take 300 sentences from the set of sentences with relative clauses and run our system and Heilman's system on them. We give the questions generated per sentence by both systems to 4 independent human evaluators, who rate the questions on syntactic correctness, semantic adequacy, fluency and distribution.
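A rough sketch of this filtering-and-sampling step is shown below. The file name is hypothetical, and a surface match on the nine relative words over-generates (e.g. that as a determiner), so a parse-based check like the one sketched earlier would be needed to confirm an actual relative clause.

# Hypothetical corpus file; surface matching is a rough pre-filter only.
import random
import re

RELATIVE_WORDS = ["who", "whom", "whose", "which", "that",
                  "where", "when", "why", "how"]
pattern = re.compile(r"\b(" + "|".join(RELATIVE_WORDS) + r")\b", re.IGNORECASE)

with open("wikipedia_sentences.txt", encoding="utf-8") as f:
    candidates = [line.strip() for line in f if pattern.search(line)]

# sample 300 sentences for the manual evaluation
evaluation_set = random.sample(candidates, 300)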

8 Results

The results of our system and the comparison with Heilman's are given in Figure 1. The ratings presented are averaged over all the evaluators. Our system gets 2.89/3.0 on syntactic correctness, 2.9/3.0 on semantic adequacy, 1.85/2.0 on fluency and 1.8/2.0 on distribution. On the same metrics, Heilman's system gets 2.56, 2.58, 1.3 and 1.1. The Cohen's kappa coefficient, i.e., the inter-evaluator agreement, is 0.6, 0.7, 0.7 and 0.7 on syntactic correctness, semantic adequacy, fluency and distribution respectively, which indicates acceptable reliability. The overall rating of our system is 9.44 out of 10, compared with 7.54 for Heilman's system.

Figure 1: Evaluation scores. Our system performs better than Heilman's system on all of the given criteria: syntactic correctness, semantic adequacy, fluency and uniqueness.
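Averages and agreement figures of this kind could be reproduced with a short script along the following lines. The rating arrays are placeholders rather than the collected scores, and pairwise Cohen's kappa averaged over evaluator pairs is only one plausible way of summarising agreement among four raters; the paper does not specify its exact aggregation.

# Placeholder ratings: ratings[e][i] = score from evaluator e for question i
# on a single criterion. Not the paper's data.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

ratings = np.array([
    [3, 3, 2, 3, 3],
    [3, 2, 2, 3, 3],
    [3, 3, 2, 3, 2],
    [3, 3, 2, 3, 3],
])

mean_score = ratings.mean()                      # average over evaluators and items
kappas = [cohen_kappa_score(a, b) for a, b in combinations(ratings, 2)]

print(f"average score: {mean_score:.2f}")
print(f"mean pairwise Cohen's kappa: {np.mean(kappas):.2f}")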

9 Discussion

On all four evaluation criteria that we used for comparison, our system performs better than Heilman's state-of-the-art rule-based system when generating questions from complex English sentences. Let us look at some specific input examples to analyze the results. First of all, by fetching and modifying the wh-word from the sentence itself, we nearly eliminate the possibility of generating a question with a wrong wh-word. From the example comparison in Figure 2, we can see that the output of both systems is otherwise the same; however, our system produces the correct wh-word for the generated question.

Figure 2: Wh-word. Our system performs better than Heilman's system at fetching the correct wh-word for the given input.

By exploiting the unique structural relationships between relative pronouns and relative adverbs and the rest of the sentence, we are able to cover different aspects of the same sentence. Also, by eliminating unwanted dependencies, we ensure that the system generates fluent questions. See Figure 3 for a reference example.

Figure 3: Fluency. Compared to Heilman's system, our system generates more fluent questions for the given input.

Since Heilman's system does not look deeply into the internal structural dependencies between different parts of the sentence, it fails to generate reasonable questions for most complex sentences. Our system, on the other hand, exploits such dependencies and is therefore able to handle complex sentences better. See Figure 4 for a reference example of this case. Lastly, Heilman's system places a restriction on the length of the input sentence. Because of this, it produces little or no output for complex sentences that are very long. Our system, however, works well on such sentences too and gives reasonable output.

Figure 4: Complex sentences. Our system is able to handle the given highly complex sentence better than Heilman's system.

10 Conclusion

This paper presented a syntax-based, rule-based system that runs on dependency parse information from the Spacy parser and exploits the dependency relationships between relative pronouns and relative adverbs and the rest of the sentence in a novel way to automatically generate multiple questions from complex English sentences. The system is simple in design and can handle highly complex sentences. The evaluation was done by 4 independent human evaluators who rated questions generated by our system and Heilman's system on how syntactically correct, semantically adequate, fluent and well distributed or unique the questions were. Our system performed better than Heilman's system on all of these criteria. A predictable limitation of our system is that it is only meant to generate questions for sentences that contain at least one relative clause; such sentences form about 15% of the tested corpus.

11 Acknowledgments

We acknowledge the support of Google LLC for this research, in the form of an International Travel Grant.

References

Manish Agarwal, Rakshit Shah, and Prashanth Mannem. 2011. Automatic question generation using discourse cues. Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–9.

Husam Ali, Yllias Chali, and Sadid A. Hasan. 2010. Automation of question generation from sentences. Proceedings of QG2010: The Third Workshop on Question Generation, pages 58–67.

Tahani Alsubait, Bijan Parsia, and Ulrike Sattler. 2015. Ontology-based multiple choice question generation. KI - Künstliche Intelligenz, 30(2):183–188.

Jonathan C. Brown, Gwen A. Frishkoff, and Maxine Eskenazi. 2005. Automatic question generation for vocabulary assessment. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing - HLT '05.

Yllias Chali and Sadid A. Hasan. 2015. Towards topic-to-question generation. Computational Linguistics, 41.

M. Chi. 1994. Eliciting self-explanations improves understanding. Cognitive Science, 18(3):439–477.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1.

Charles J. Fillmore. 1968. The case for case. In Emmon W. Bach and Robert T. Harms, editors, Universals in Linguistic Theory. Holt, Rinehart & Winston, pages 1–88.

Patrick J. Finn. 1975. A question writing algorithm. Journal of Reading Behavior, 7(4).

Melinda D. Gates. 2008. Generating reading comprehension look-back strategy questions from expository texts. Master's thesis, Carnegie Mellon University.

Michael Heilman. 2011. Automatic factual question generation from text.

Michael Heilman and Noah A. Smith. 2009. Question generation via overgenerating transformations and ranking.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.

Hidenobu Kunichika, Tomoki Katayama, Tsukasa Hirashima, and Akira Takeuchi. 2004. Automated question generation methods for intelligent English learning systems and its evaluation. Proc. of ICCE2001, pages 1117–1124.

Chin-Yew Lin. 2008. Automatic question generation from queries.

Ming Liu, Rafael Calvo, and Vasile Rus. 2012. G-Asks: An intelligent automatic question generation system for academic writing support. Dialogue & Discourse, 3(2):101–124.

Prashanth Mannem, Rashmi Prasad, and Aravind Joshi. 2010. Question generation from paragraphs at UPenn: QGSTEC system description. Proceedings of QG2010: The Third Workshop on Question Generation.

Karen Mazidi and Rodney D. Nielsen. 2015. Leveraging multiple views of text for automatic question generation. Lecture Notes in Computer Science: Artificial Intelligence in Education, pages 257–266.

R. Mitkov. 2006. Computer-aided generation of multiple-choice tests. Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering.

J. Mostow and W. Chen. 2009. Generating instruction automatically for the reading strategy of self-questioning. Proceedings of the Conference on Artificial Intelligence in Education, pages 465–472.

Vasile Rus, Zhiqiang Cai, and Arthur Graesser. 2008. Question generation: Example of a multi-year evaluation campaign. Online Proceedings of the 1st Question Generation Workshop.

Eriks Sneiders. 2002. Automated question answering using question templates that cover the conceptual model of the database. Natural Language Processing and Information Systems, Lecture Notes in Computer Science, pages 235–239.

Sandeep Subramanian, Tong Wang, Xingdi Yuan, Saizheng Zhang, Adam Trischler, and Yoshua Bengio. 2017. Neural models for key phrase detection and question generation.

Josh Tenenberg and Laurie Murphy. 2005. Knowing what I know: An investigation of undergraduate knowledge and self-knowledge of data structures. Computer Science Education, 15(4):297–315.

Xuchen Yao, Gosse Bouma, and Yi Zhang. 2012. Semantics-based question generation and implementation. Dialogue & Discourse, 3(2).

Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Saizheng Zhang, Sandeep Subramanian, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. Proceedings of the 2nd Workshop on Representation Learning for NLP.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. Lecture Notes in Computer Science, pages 662–671.


Author Index

A, Pranav, 134
Aggarwal, Swati, 91
Ali Shah, Syed Afaq, 14
Anvesh Rao, Vijjini, 99
Arase, Yuki, 21
Arumae, Kristjan, 105
Atapattu, Thushari, 45

Bennamoun, Mohammed, 14
Bhatnagar, Nikhilesh, 120
Bladier, Tatiana, 59
Boyd-Graber, Jordan, 127

Chinnakotla, Manoj, 146
Chu, Chenhui, 21

Dai, Xiang, 37
Dohare, Shibhansh, 74
Downey, Doug, 9

Etoori, Pravallika, 146

Falkner, Katrina, 45
Fernandez, Jared, 9

Gupta, Vivek, 74

Hase, Mukul, 153

Kallmeyer, Laura, 59
Karnick, Harish, 74
Katsumata, Satoru, 112
Kawara, Yuki, 21
Kezar, Lee, 141
Khullar, Payal, 153
Komachi, Mamoru, 112
Kumaraguru, Ponnurangam, 52

Liu, Fei, 105

Mamidi, Radhika, 99, 120, 146
Manchanda, Prachi, 91
Matsumura, Yukio, 112

Parupalli, Sreekavitha, 99
Pecar, Samuel, 1

Rachna, Konigari, 153

Samih, Younes, 59
Sankhavara, Jainisha, 84
Sawhney, Ramit, 91
Sen, Indira, 52
Sharif, Naeha, 14
Shrivastava, Manish, 120, 153
Singh, Kushagra, 52
Singh, Raj, 91
Singh, Sonit, 28

Thilakaratne, Menasha, 45

van Cranenburgh, Andreas, 59
Vanmassenhove, Eva, 67

Wallace, Eric, 127
Way, Andy, 67
White, Lyndon, 14

Yamagishi, Hayahide, 112
