[Lecture Notes in Computer Science] Advances in Artificial Intelligence, Volume 6657


Lecture Notes in Artificial Intelligence 6657
Edited by R. Goebel, J. Siekmann, and W. Wahlster

Subseries of Lecture Notes in Computer Science


Cory Butz Pawan Lingras (Eds.)

Advances in Artificial Intelligence

24th Canadian Conference on Artificial Intelligence, Canadian AI 2011
St. John’s, Canada, May 25–27, 2011
Proceedings


Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors

Cory Butz
University of Regina, Department of Computer Science
3737 Wascana Parkway, Regina, Saskatchewan, Canada S4S 0A2
E-mail: [email protected]

Pawan Lingras
Saint Mary’s University, Department of Mathematics and Computing Science
Halifax, Nova Scotia, Canada B3H 3C3
E-mail: [email protected]

ISSN 0302-9743          e-ISSN 1611-3349
ISBN 978-3-642-21042-6   e-ISBN 978-3-642-21043-3
DOI 10.1007/978-3-642-21043-3
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2011926783

CR Subject Classification (1998): I.3, H.3, I.2.7, H.4, F.1, H.5

LNCS Sublibrary: SL 7 – Artificial Intelligence

© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

This volume contains the papers presented at the 24th Canadian Conference on Artificial Intelligence (AI 2011). The conference was held in St. John’s, Newfoundland and Labrador, during May 25–27, 2011, and was collocated with the 37th Graphics Interface Conference (GI 2011) and the 8th Canadian Conference on Computer and Robot Vision (CRV 2011).

The Program Committee received 81 submissions for the main conference, AI 2011, from across Canada and around the world. Each submission was reviewed by a minimum of four and up to five reviewers. For the final conference program and for inclusion in these proceedings, 23 regular papers, with an allocation of 12 pages each, were selected. Additionally, 22 short papers, with an allocation of 6 pages each, were accepted. Finally, 5 papers from the Graduate Student Symposium appear in the proceedings, each of which was allocated 4 pages.

The conference program featured three keynote presentations, by Corinna Cortes, Head of Google Research, New York; David Poole, University of British Columbia; and Regina Barzilay, Massachusetts Institute of Technology.

One pre-conference workshop on text summarization, with its own proceedings, was held on May 24, 2011. This workshop was organized by Anna Kazantseva, Alistair Kennedy, Guy Lapalme, and Stan Szpakowicz.

We would like to thank all Program Committee members and external reviewers for their effort in providing high-quality reviews in a timely manner. We thank all the authors of submitted papers and the authors of selected papers for their collaboration in preparation of the final copy. The conference benefited from the practical perspective brought by the participants in the Industry Track session. Many thanks to Svetlana Kiritchenko, Maria Fernanda Caropreso, and Cristina Manfredotti for organizing the Graduate Student Symposium and chairing the Program Committee of the symposium. The coordinating efforts of General Workshop Chair Sheela Ramanna are much appreciated. We express our gratitude to Jiye Li for her efforts in compiling these proceedings as the Proceedings Chair. We thank Wen Yan (Website Chair), Atefeh Farzindar (Industry Chair) and Dan Wu (Publicity Chair) for their time and effort.

We are in debt to Andrei Voronkov for developing the EasyChair conference management system and making it freely available to the academic world. EasyChair is a remarkable system with functionality that saved us a significant amount of time.

The conference was sponsored by the Canadian Artificial Intelligence Association (CAIAC), and we thank the CAIAC Executive Committee for the constant support. We would like to express our gratitude to John Barron, the AI/GI/CRV General Chair, Andrew Vardy, the AI/GI/CRV Local Arrangements Chair, and Orland Hoeber, the AI Local Organizing Chair, for making AI/GI/CRV 2011 an enjoyable experience.

March 2011

Cory Butz
Pawan Lingras


Organization

AI/GI/CRV 2011 General Chair

John Barron University of Western Ontario

AI Program Committee Chairs

Cory Butz University of Regina
Pawan Lingras Saint Mary’s University

AI/GI/CRV Local Arrangements Chair

Andrew Vardy Memorial University of Newfoundland

AI Local Organizing Chair

Orland Hoeber Memorial University of Newfoundland

Graduate Student Symposium Chairs

Svetlana Kiritchenko National Research Council
Maria Fernanda Caropreso Defence R&D
Cristina Manfredotti University of Regina

AI 2011 Program Committee

Esma Aimeur, Massih Amini, Aijun An, Xiangdong An, Dirk Arnold, Salem Benferhat, Petra Berenbrink, Sabine Bergler, Virendra Bhavsar, Cory Butz, Maria Fernanda Caropreso, Colin Cherry, David Chiu, Lyne Da Sylva, Joerg Denzinger, Chris Drummond, Marek Druzdzel, Zied Elouedi, Larbi Esmahi, Atefeh Farzindar, Paola Flocchini, Michel Gagnon, Qigang Gao, Yong Gao, Dragan Gasevic, Ali Ghorbani, Cyril Goutte, Kevin Grant, Howard Hamilton, Robert Hilderman,


Orland Hoeber, Jimmy Huang, Frank Hutter, Diana Inkpen, Christian Jacob, Nathalie Japkowicz, Richard Jensen, Maneesh Joshi, Igor Jurisica, Vlado Keselj, Svetlana Kiritchenko, Ziad Kobti, Grzegorz Kondrak, Leila Kosseim, Adam Krzyzak, Philippe Langlais, Guy Lapalme, Oscar Lin, Pawan Lingras, Hongyu Liu, Jiming Liu, Alejandro Lopez-Ortiz, Simone Ludwig, Alan Mackworth, Anders L. Madsen, Cristina Manfredotti, Yannick Marchand, Robert Mercer, Evangelos Milios, David Mitchell, Sushmita Mitra, Malek Mouhoub, David Nadeau, Eric Neufeld, Roger Nkambou, Sageev Oore, Jian Pei, Gerald Penn, Laurent Perrussel, Fred Popowich, Bhanu Prasad, Doina Precup, Sheela Ramanna, Robert Reynolds, Denis Riordan, Samira Sadaoui, Eugene Santos, Anoop Sarkar, Jonathan Schaeffer, Oliver Schulte, Mahdi Shafiei, Mohak Shah, Weiming Shen, Mike Shepherd, Daniel L. Silver, Shyamala Sivakumar, Dominik Slezak, Marina Sokolova, Luis Enrique Sucar, Marcin Szczuka, Stan Szpakowicz, Ahmed Tawfik, Choh Man Teng, Eugenia Ternovska, Thomas Tran, Thomas Trappenberg, Andre Trudel, Peter van Beek, Paolo Viappiani, Hai Wang, Harris Wang, Xin Wang, Dunwei Wen, Rene Witte, Dan Wu, Yang Xiang, Jingtao Yao, Yiyu Yao, Jia-Huai You, Haiyi Zhang, Harry Zhang, Xiaokun Zhang, Sandra Zilles, Nur Zincir-Heywood


External Reviewers

Connie Adsett, Ameeta Agrawal, Aditya Bhargava, Solimul Chowdhury, Elnaz Delpisheh, Alban Grastien, Franklin Hanshar, Hua He, Michael Horsch, Yeming Hu, Ilya Ioshikhes, Michael Janzen, Hassan Khosravi, Marek Lipczak, Stephen Makonin, Yannick Marchand, Marie-Jean Meurs, Felicitas Mokom, Majid Razmara, Fatemeh Riahi, Maxim Roy, Shahab Tasharrofi, Milan Tofiloski, Baijie Wang, Jacek Wolkowicz, Xiongnan Wu, Safa Yahi, Qian Yang, Martin Zinkevich

Graduate Student Symposium Program Committee

Ebrahim Bagheri, Julien Bourdaillet, Scott Buffet, Maria Fernanda Caropreso, Kevin Cohen, Diana Inkpen, Nathalie Japkowicz, Svetlana Kiritchenko, Guy Lapalme, Bradley Malin, Cristina Manfredotti, Stan Matwin, Fred Popowich, Mohak Shah, Marina Sokolova, Bruce Spencer, Stan Szpakowicz, Jo-Anne Ting, Paolo Viappiani

Sponsoring Institutions and Companies

Canadian Artificial Intelligence Association / Association pour l’intelligence artificielle au Canada (CAIAC)
http://www.caiac.ca

Memorial University
http://www.mun.ca/

Compusult (Gold sponsor)
http://www.compusult.net/

Palomino System Innovations Inc.
http://www.palominosys.com

University of Regina
http://www.uregina.ca/

Saint Mary’s University
http://www.smu.ca/

NLP Technologies Inc.
http://nlptechnologies.ca

Springer
http://www.springer.com/


Table of Contents

Dynamic Obstacle Representations for Robot and Virtual Agent Navigation ..... 1
    Eric Aaron and Juan Pablo Mendoza

Grounding Formulas with Complex Terms ..... 13
    Amir Aavani, Xiongnan (Newman) Wu, Eugenia Ternovska, and David Mitchell

Moving Object Modelling Approach for Lowering Uncertainty in Location Tracking Systems ..... 26
    Wegdan Abdelsalam, David Chiu, Siu-Cheung Chau, Yasser Ebrahim, and Maher Ahmed

Unsupervised Relation Extraction Using Dependency Trees for Automatic Generation of Multiple-Choice Questions ..... 32
    Naveed Afzal, Ruslan Mitkov, and Atefeh Farzindar

An Improved Satisfiable SAT Generator Based on Random Subgraph Isomorphism ..... 44
    Calin Anton

Utility Estimation in Large Preference Graphs Using A* Search ..... 50
    Henry Bediako-Asare, Scott Buffett, and Michael W. Fleming

A Learning Method for Developing PROAFTN Classifiers and a Comparative Study with Decision Trees ..... 56
    Nabil Belacel and Feras Al-Obeidat

Using a Heterogeneous Dataset for Emotion Analysis in Text ..... 62
    Soumaya Chaffar and Diana Inkpen

Using Semantic Information to Answer Complex Questions ..... 68
    Yllias Chali, Sadid A. Hasan, and Kaisar Imam

Automatic Semantic Web Annotation of Named Entities ..... 74
    Eric Charton, Michel Gagnon, and Benoit Ozell

Learning Dialogue POMDP Models from Data ..... 86
    Hamid R. Chinaei and Brahim Chaib-draa

Characterizing a Brain-Based Value-Function Approximator ..... 92
    Patrick Connor and Thomas Trappenberg

Answer Set Programming for Stream Reasoning ..... 104
    Thang M. Do, Seng W. Loke, and Fei Liu

A Markov Decision Process Model for Strategic Decision Making in Sailboat Racing ..... 110
    Daniel S. Ferguson and Pantelis Elinas

Exploiting Conversational Features to Detect High-Quality Blog Comments ..... 122
    Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq Joty

Consolidation Using Context-Sensitive Multiple Task Learning ..... 128
    Ben Fowler and Daniel L. Silver

Extracting Relations between Diseases, Treatments, and Tests from Clinical Data ..... 140
    Oana Frunza and Diana Inkpen

Compact Features for Sentiment Analysis ..... 146
    Lisa Gaudette and Nathalie Japkowicz

Instance Selection in Semi-supervised Learning ..... 158
    Yuanyuan Guo, Harry Zhang, and Xiaobo Liu

Determining an Optimal Seismic Network Configuration Using Self-Organizing Maps ..... 170
    Machel Higgins, Christopher Ward, and Silvio De Angelis

Comparison of Learned versus Engineered Features for Classification of Mine Like Objects from Raw Sonar Images ..... 174
    Paul Hollesen, Warren A. Connors, and Thomas Trappenberg

Learning Probability Distributions over Permutations by Means of Fourier Coefficients ..... 186
    Ekhine Irurozki, Borja Calvo, and Jose A. Lozano

Correcting Different Types of Errors in Texts ..... 192
    Aminul Islam and Diana Inkpen

Simulating the Effect of Emotional Stress on Task Performance Using OCC ..... 204
    Dreama Jain and Ziad Kobti

Base Station Controlled Intelligent Clustering Routing in Wireless Sensor Networks ..... 210
    Yifei Jiang and Haiyi Zhang

Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures ..... 216
    Colette Joubarne and Diana Inkpen

A Supervised Method of Feature Weighting for Measuring Semantic Relatedness ..... 222
    Alistair Kennedy and Stan Szpakowicz

Anomaly-Based Network Intrusion Detection Using Outlier Subspace Analysis: A Case Study ..... 234
    David Kershaw, Qigang Gao, and Hai Wang

Evaluation and Application of Scenario Based Design on Thunderbird ..... 240
    Bushra Khawaja and Lisa Fan

Improving Phenotype Name Recognition ..... 246
    Maryam Khordad, Robert E. Mercer, and Peter Rogan

Classifying Severely Imbalanced Data ..... 258
    William Klement, Szymon Wilk, Wojtek Michalowski, and Stan Matwin

Simulating Cognitive Phenomena with a Symbolic Dynamical System ..... 265
    Othalia Larue

Finding Small Backdoors in SAT Instances ..... 269
    Zijie Li and Peter van Beek

Normal Distribution Re-Weighting for Personalized Web Search ..... 281
    Hanze Liu and Orland Hoeber

Granular State Space Search ..... 285
    Jigang Luo and Yiyu Yao

Comparing Humans and Automatic Speech Recognition Systems in Recognizing Dysarthric Speech ..... 291
    Kinfe Tadesse Mengistu and Frank Rudzicz

A Context-Aware Reputation-Based Model of Trust for Open Multi-agent Environments ..... 301
    Ehsan Mokhtari, Zeinab Noorian, Behrouz Tork Ladani, and Mohammad Ali Nematbakhsh

Pazesh: A Graph-Based Approach to Increase Readability of Automatic Text Summaries ..... 313
    Nasrin Mostafazadeh, Seyed Abolghassem Mirroshandel, Gholamreza Ghassem-Sani, and Omid Bakhshandeh Babarsad

Textual and Graphical Presentation of Environmental Information ..... 319
    Mohamed Mouine

Comparing Distributional and Mirror Translation Similarities for Extracting Synonyms ..... 323
    Philippe Muller and Philippe Langlais

Generic Solution Construction in Valuation-Based Systems ..... 335
    Marc Pouly

Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources ..... 347
    Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, and Aijun An

COSINE: A Vertical Group Difference Approach to Contrast Set Mining ..... 359
    Mondelle Simeon and Robert Hilderman

Hybrid Reasoning for Ontology Classification ..... 372
    Weihong Song, Bruce Spencer, and Weichang Du

Subspace Mapping of Noisy Text Documents ..... 377
    Axel J. Soto, Marc Strickert, Gustavo E. Vazquez, and Evangelos Milios

Extending AdaBoost to Iteratively Vary Its Base Classifiers ..... 384
    Erico N. de Souza and Stan Matwin

Parallelizing a Convergent Approximate Inference Method ..... 390
    Ming Su and Elizabeth Thompson

Reducing Position-Sensitive Subset Ranking to Classification ..... 396
    Zhengya Sun, Wei Jin, and Jue Wang

Intelligent Software Development Environments: Integrating Natural Language Processing with the Eclipse Platform ..... 408
    Rene Witte, Bahar Sateli, Ninus Khamis, and Juergen Rilling

Partial Evaluation for Planning in Multiagent Expedition ..... 420
    Y. Xiang and F. Hanshar

Author Index ..... 433


Dynamic Obstacle Representations for Robot and Virtual Agent Navigation

Eric Aaron and Juan Pablo Mendoza

Department of Mathematics and Computer Science
Wesleyan University

Middletown, CT 06459

Abstract. This paper describes a reactive navigation method for autonomous agents such as robots or actors in virtual worlds, based on novel dynamic tangent obstacle representations, resulting in exceptionally successful, geometrically sensitive navigation. The method employs three levels of abstraction, treating each obstacle entity as an obstacle-valued function; this treatment enables extraordinary flexibility without pre-computation or deliberation, applying to all obstacles regardless of shape, including non-convex, polygonal, or arc-shaped obstacles in dynamic environments. The unconventional levels of abstraction and the geometric details of dynamic tangent representations are the primary contributions of this work, supporting smooth navigation even in scenarios with curved shapes, such as circular and figure-eight shaped tracks, or in environments requiring complex, winding paths.

1 Introduction

For autonomous agents such as robots or actors in virtual worlds, navigation based on potential fields or other reactive methods (e.g., [3,4,6,9,10]) can be conceptually elegant, robust, and adaptive in dynamic or incompletely known environments. In some methods, however, straightforward geometric representations can result in ineffective obstacle avoidance or other navigation difficulties. In this paper, we introduce reactive navigation intelligence based on dynamic tangent obstacle representations and repellers, which are locally sensitive to relevant obstacle geometry, enabling effective navigation in a wide range of environments.

In general, reactive navigation is fast and responsive in dynamic environments, but it can be undesirably insensitive to some geometric information in complicated navigation spaces. In some potential-based or force-based approaches, for instance, a circular obstacle would be straightforwardly treated as exerting a repulsive force on agents around it, deterring collisions; as an example, Figure 1 illustrates an angular repeller form employed in [5,7,8], in which a circle-shaped obstacle obsi repels circle-shaped agent A by steering A’s heading angle away from all colliding paths. (See Section 2 for additional information on this kind of angular repeller.) Straightforwardly, the repeller representation of obstacle obsi is based on the entire shape of obsi. Such a straightforward connection between the entire shape of an obstacle entity and the repeller representation of that entity, however, is not always so successful. Some common obstacle entities, for example, have shapes inconsistent with otherwise-effective navigation methods;


Fig. 1. Obstacle avoidance, with agent A, obstacle obsi, and other elements as labeled. When heading angle φ is inside the angular range delimited by the dotted lines—i.e., when some point of A is not headed outside of obsi—A is steered outside of that angular range, avoiding collision.

for example, navigation methods requiring circle-based obstacle representations [2,3,5,7] can be ineffective with obstacles that have long, thin shapes, such as walls.

Indeed, our work is motivated by difficulties in applications requiring navigation near or along walls in dynamic environments, such as boundary inspection [1] or navigation in hallways. For this paper, we distinguish between boundary-proximate and boundary-distant navigation: Boundary-proximate behaviors require navigation along obstacle boundaries, whereas boundary-distant behaviors require only collision avoidance, which tends to deter proximity to obstacle boundaries. Boundary-distant reactive behaviors can often be straightforwardly achieved by, e.g., potential-based navigation that employs forceful repellers and ignores details such as concavities in obstacle shapes. Boundary-proximate reactive behaviors, however, are more challenging. This paper is thus focused on boundary-proximate behaviors (although as noted in Section 5, our method supports both kinds of behavior), presenting efficient, geometrically sensitive, dynamic obstacle representations that support boundary-proximate navigation.

In particular, this paper introduces dynamic tangent (DT, for short) obstacle representations as intermediaries between obstacle entities and repeller representations. Dynamic tangent-based DT navigation treats each obstacle entity as an obstacle-valued function, which returns an obstacle representation: For each agent A, at each timestep in computing navigation, each perceived obstacle entity (e.g., a wall, a polygon, another navigating agent) is represented by a dynamic tangent obstacle representation; each obstacle representation is mathematically modeled as an angular range of repulsion—or, alternatively, as the part of the obstacle entity within that angular range from which A is repelled. Hence, unlike other approaches in which only two levels of information are reflected in obstacle modeling, DT navigation employs a three-tiered structure:

1. the obstacle entity—the full geometric shape of the entity in the environment;
2. the obstacle representation—the DT representation abstracted from the obstacle entity, i.e., the locally usable geometry upon which a repeller form is based;
3. and the repeller representation—the mathematical function encoding the angular repulsion ascribed to the obstacle entity.
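The three levels might be organized in code roughly as follows. This is only a minimal Python sketch on our reading of the structure; the names (ObstacleEntity, DTRepresentation, Repeller) and field choices are ours, not the authors’.

    import math
    from dataclasses import dataclass
    from typing import Callable, Protocol, Tuple

    # Level 3: a repeller is just a function of the agent's heading angle phi,
    # returning its contribution to dphi/dt (the repeller form of Section 2).
    Repeller = Callable[[float], float]

    # Level 2: the DT obstacle representation -- only the quantities the
    # repeller form needs: nearest point pm, minimum distance dm, the angle
    # psi from the agent to pm, and the angular half-range delta_psi.
    @dataclass
    class DTRepresentation:
        pm: Tuple[float, float]
        dm: float
        psi: float
        delta_psi: float

    # Level 1: an obstacle entity is treated as an obstacle-valued function:
    # given the agent's current pose it returns a fresh DT representation,
    # recomputed every timestep, so moving obstacles are handled for free.
    class ObstacleEntity(Protocol):
        def dt_representation(self, agent_center: Tuple[float, float],
                              agent_radius: float, D: float) -> DTRepresentation:
            ...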


The additional level of abstraction in this three-level structure and the geometric details of our DT representations are the primary contributions of this paper. The mathematical functions for the repeller representations in our DT navigation are similar to those of a standard mathematical form described in [2,5,8], and based on only three arguments: the minimum distance from agent A to a nearest point pm on an obstacle entity; the difference φ − ψ between heading angle φ of A and angle ψ from A to pm; and an angular range of repulsion Δψ, which controls the spectrum of heading angles from which the obstacle will repel the agent. The resulting DT representations are effective and efficient, and they satisfy desirable properties of obstacle representations—such as proximity, symmetry, and necessary and sufficient repulsion—that underlie successful navigation in other methods (see Section 4).

Indeed, as detailed in Section 3, DT representations enable exceptionally effective, geometrically sensitive navigation without pre-computation or deliberation. Even in simulations of non-holonomic agents (e.g., robots) moving at constant speed, DT representations result in smooth, successful navigation in scenarios with complicated paths, moving walls, or curved shapes such as circular or figure-eight shaped tracks.

2 Repellers

The repellers underlying our DT representations are based on the same mathematics as in [5,7], although our repellers are capable of expressing a wider range of configurations. In this section, we describe these repellers, summarizing previous presentations of the underlying mathematics and emphasizing particulars of our design.

In broad terms, DT navigation is an application of dynamical systems-based behavior modeling [8]. For this paper, and consistent with related papers [5,7], agents are circular and agent velocity is constant throughout navigation—obstacle avoidance and target seeking arise only from the dynamical systems governing heading angle φ. As noted in [2] and elsewhere, velocity could be autonomously controlled, but our present method shows the effectiveness of DT representations even without velocity control.

The repellers themselves in DT navigation are angular repeller functions, dynamically associated with obstacles based on local perception, without pre-computation. Most of the mathematical ideas underlying these functions remain unchanged from [5], thus retaining strengths such as competitive behavioral dynamics and some resistance to problems such as local minima. (See [5,7,8] for details about strengths of this particular repeller design.) Conventionally, with these repellers, every obstacle entity is represented using only circles, and circles’ radii determine the associated repeller representations; in the underlying mathematics [5], however, repellers do not actually depend on circles’ radii, but only on angular ranges of repulsion. Previous presentations do not emphasize that these angular ranges need not be derived from radii, nor that requiring a circle (thus a radius r ≥ 0) for the repellers can restrict the expressiveness of obstacle representations: A point obstacle at the intersection of the dotted lines in Figure 1, for instance, would have the same angular range of repulsion as obstacle obsi; because no entity could be smaller than a point, no smaller range could occur from an obstacle at that location. Smaller ranges, however, could be productively employed by variants of these previous techniques, for more flexible and sensitive navigation dynamics. In this section, we describe our repellers, which are based on angular ranges, and in Section 3, we describe the particular angular ranges calculated for DT navigation.

To briefly summarize work from [7] and other papers, the evolution of agent heading angle φ during navigation is determined by angular repellers and angular attractors in a dynamical system of the form

φ̇ = |wtar| ftar + |wobs| fobs + noise,   (1)

where φ̇ is the time derivative of φ (by convention, dotted variables are time derivatives), ftar and fobs are functions representing targets and obstacles, respectively—the contributions of attractors and repellers to agent steering—and wtar and wobs are weight functions for each term. (The noise term prevents undesired fixed points in the dynamics.) Thus, at each timestep, φ̇ is calculated for steering at that moment, naturally and reactively accommodating new percepts or changes in the environment. A target is represented by a simple sine function:

ftar = −a sin(φ − ψtar).   (2)

This function induces a clockwise change in heading direction when φ is counter-clockwise with respect to the target (and symmetrically, counter-clockwise attraction when φ is clockwise), thus attracting an agent.

Obstacle functions are more complicated, encoding windowed repulsion scaled by distance, so that more distant repellers have weaker effects than closer ones, and repellers do not affect agents already on collision-free paths. (Full details are in [7], which we only concisely summarize here.) For an obstacle obsi, the angular repeller in DT navigation is the product of three functions:

Ri = ((φ − ψi)/Δψi) · e^(1 − |(φ − ψi)/Δψi|)   (3)

Wi = (tanh(h1(cos(φ − ψi) − cos(Δψi + σ))) + 1) / 2   (4)

Di = e^(−dm/d0).   (5)

Function Ri is an angular repeller with angular width Δψi, centered around heading-angle value ψi (see Figure 1); windowing function Wi limits repulsion to have significant effects only within Δψi (plus a safety margin σ) from ψi; and scaling function Di limits the strength of the overall repulsion based on dm, the minimum distance between the agent and the obstacle. (Designer-chosen constant d0 is a scaling parameter for Di.) Each repeller, then, is a product fobsi = Ri · Wi · Di, and for navigation, contributions of individual repellers are summed to fobs = Σi fobsi and then combined with ftar in the weighted sum of Equation 1 to control steering. The weights themselves are determined by a system of competitive dynamics for reactive behavior selection; full details are available in [5].
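A direct transcription of Equations (1)–(5) might look as follows. This is a hedged sketch: the steepness constant h1 is not specified in this excerpt (an arbitrary value is used), and the weights wtar and wobs are fixed here rather than produced by the competitive dynamics of [5].

    import math

    def f_target(phi, psi_tar, a=1.0):
        # Eq. (2): sine-shaped attractor toward the target direction psi_tar.
        return -a * math.sin(phi - psi_tar)

    def f_obstacle(phi, rep, d0=2.0, sigma=0.4, h1=4.0):
        # Product Ri * Wi * Di of Eqs. (3)-(5) for one DT representation `rep`
        # (an object with fields psi, delta_psi, and dm, as in the earlier sketch).
        u = (phi - rep.psi) / rep.delta_psi
        R = u * math.exp(1.0 - abs(u))                                       # Eq. (3)
        W = (math.tanh(h1 * (math.cos(phi - rep.psi)
                             - math.cos(rep.delta_psi + sigma))) + 1.0) / 2  # Eq. (4)
        D = math.exp(-rep.dm / d0)                                           # Eq. (5)
        return R * W * D

    def phi_dot(phi, psi_tar, reps, w_tar=1.0, w_obs=1.0, noise=0.0):
        # Eq. (1): weighted sum of the attractor and the summed repellers.
        f_obs = sum(f_obstacle(phi, r) for r in reps)
        return abs(w_tar) * f_target(phi, psi_tar) + abs(w_obs) * f_obs + noise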

Fig. 2. Default, non-boundary case DT representations for various obstacle entity shapes. In all cases, 2Δψ is chosen to be the angular range subtended by a segment of length 2D constructed around point pm, perpendicular to vector vm.

3 Dynamic Tangent Obstacle Representations

Because of the additional level of abstraction for DT representations, the dynamic sensitivity of the repellers in Section 2 is enhanced by increased flexibility and local sensitivity to geometry, without deliberation or pre-computation. DT representations are constructed from locally relevant portions of obstacle entities’ shapes, which for this paper are presumed to be always either straight lines or arcs of circles. Based on this, we consider three possible cases for any relevant component shape of an obstacle: straight line, convex non-line, or concave (i.e., a boundary of a linear, convex, or concave portion of the obstacle). Processes for DT construction are similar in each case, so we here present details of geometry applicable to all cases, with case-specific details in the following subsections.

In any of the cases, given agent A and obstacle entity E, the DT representation of E can be seen as the portion of E within a reactively calculated angular range of repulsion for E; in Figures (e.g., Figure 2), we conventionally indicate such a portion by thicker, lighter colored boundary lines on E. For this paper, the angular range (from A) defining that portion of E is the subtended range of a line segment, as shown in Figure 2; in general, as Figure 2 also suggests, this DT segment is locally tangent to E at a projection point pm of minimal distance between A and E. More specifically, the segment is oriented perpendicular to the vector vm that joins the center of A to pm, and it is centered at pm, extending a distance D in each direction, where D is an agent- or application-specific parameter. Parameter D thus determines angular range Δψ—the DT segment represents an angular repeller of range 2Δψ, with range Δψ in each direction around vm (see Figure 2). In examples in this paper, D is constant over a navigation (rather than, e.g., Δψ being constant over a navigation), so Δψ relates to |vm| in desirable ways: For example, as A gets closer to E, the subtended range widens, resulting in greater repulsion.
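The DT segment itself is easy to construct from pm and the agent center; a small illustrative sketch follows, with function and variable names of our own choosing.

    import math

    def dt_segment(c_A, p_m, D):
        # The DT segment: centred at p_m, perpendicular to v_m = p_m - c_A,
        # extending a distance D to each side (default, non-boundary case).
        vx, vy = p_m[0] - c_A[0], p_m[1] - c_A[1]
        n = math.hypot(vx, vy)
        tx, ty = -vy / n, vx / n  # unit vector perpendicular to v_m
        return ((p_m[0] - D * tx, p_m[1] - D * ty),
                (p_m[0] + D * tx, p_m[1] + D * ty))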

In the default (i.e., non-boundary) conditions in each of the three cases, the angular range 2Δψ subtended by the segment of length 2D around pm can be found from elements labeled in Figure 2:

Δψ = β1 + β2 = sin⁻¹(rA/s) + sin⁻¹(D/s)   (6)
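In code, Equation (6) is a one-liner. Note an assumption on our part: s here denotes the distance labeled s in Figure 2, which we read as the distance from the agent center cA to the relevant end of the DT segment; the figure itself is not reproduced in this transcript.

    import math

    def default_delta_psi(r_A, D, s):
        # Eq. (6): Delta-psi = beta_1 + beta_2, clamped to keep asin defined.
        return (math.asin(min(1.0, r_A / s)) +
                math.asin(min(1.0, D / s)))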

Fig. 3. Boundary case DT construction for a straight line obstacle entity. In the direction agent A is heading with respect to vector vm, the entity’s subtended angular range is smaller than a standard DT representation’s subtended angular range, so the range of the DT representation is modified to prevent repulsion from non-colliding paths.

As non-default, boundary cases, we consider instances where the angular range of the resulting repeller is wider than the angular range subtended by the original obstacle entity E, as shown in Figures 3–5. In these cases, to prevent needless repulsion from collision-free headings, DT representations have angular range exactly equal to that of E in the direction the agent is headed with respect to vm; the computations underlying these representations depend on the shape of E, as described below.

3.1 Straight Lines

When the obstacle entity E is a wall or some other straight line shape, finding pm for a DT representation is straightforward: If there is a perpendicular from the center of agent A to E, then pm is the intersection of E and the perpendicular; otherwise, pm is the closest endpoint of E to A. In default cases, construction of the DT segment—a portion of E—and the resulting repeller is also straightforward.

In boundary cases, it is necessary to find the subtended range of E with respect to A in the direction that A is heading around E. For this, the process is the same whether vm ⊥ E (Figure 3a) or not (Figure 3b). First, find endpoint eA of E in the direction that A is headed. The appropriate angular range for repulsion is then given by:

Δψ = sin⁻¹(rA/|vA|) + cos⁻¹((|vm|² + |vA|² − |eA − pm|²) / (2|vm||vA|))   (7)
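A sketch of Equation (7): the cos⁻¹ term is the angle at the agent center in the triangle (cA, pm, eA), obtained by the law of cosines, and the sin⁻¹ term widens it by the agent’s own radius. Function and argument names here are ours, chosen for illustration.

    import math

    def boundary_delta_psi(c_A, r_A, p_m, e_A):
        # Eq. (7): angular range when the entity ends (at e_A) before the
        # standard DT segment of half-length D would.
        v_m = math.dist(c_A, p_m)   # |v_m|
        v_A = math.dist(c_A, e_A)   # |v_A|
        d = math.dist(p_m, e_A)     # |e_A - p_m|
        cos_spread = (v_m**2 + v_A**2 - d**2) / (2 * v_m * v_A)
        return (math.asin(min(1.0, r_A / v_A)) +
                math.acos(max(-1.0, min(1.0, cos_spread))))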

3.2 Convex Shapes

For a convex chain of straight lines, finding pm is again straightforward for agent A, and desired range Δψ can be computed similarly to the process in Subsection 3.1, using vertices for boundary cases (see Figure 4a). For a convex circular arc (Figure 4b), however, the subtended angular range is not necessarily defined by its endpoints. Consider such an arc to be defined by the radius ro and center-point co of its defining circle Co, as perceived by A, and by the visible angular range of Co included in the arc, defined by angles θi and θf with respect to co and the positive x axis. Then, it is again straightforward to find closest point pm to A, and in default cases, the DT segment and associated angular range follow immediately.

Fig. 4. Boundary case DT construction for convex shapes. For convex circular arcs (b), it must be determined if the subtended angle Δα between agent A and the entire circle Co (of which the arc is a portion) is the angle Δψ from which the DT representation is derived.

In boundary cases, it remains to find the angle subtended by the arc, to compare to the angle subtended by the DT segment. To do this, we find endpoint eA similarly to the straight line case, and we observe that the desired subtended angular range is limited by either eA or by the point called ρ in Figure 4b, which defines (one side of) the subtended angular range 2Δα between A and the entire circle Co. The remainder of the DT construction then follows as before, with angular range of repulsion determined by either parameter D or the appropriate boundary condition described here.

3.3 Concave Shapes

Unlike the convex case, concave shapes bring safety concerns: Given non-holonomic agents with constant forward velocity, some environments cannot be navigated safely, such as a corner or box requiring sharper turns than motion constraints allow agents to make. In DT navigation, we prevent such difficulties by automatically approximating each concave corner by an arc with a radius large enough to be safely navigable: first, using properties of agent velocity and geometry, we calculate the agent’s minimum radius for safe turns, rmin; then, when computing navigation, an unsafe corner is effectively modeled by a navigable arc, as in Figure 5a. (We also presume that all arcs in the environment have radius at least rmin, although tighter arcs could similarly be modeled by larger, navigable ones.) To approximate only the minimum amount necessary, DT representations treat the corner as if an arc of radius rmin were placed tangent to the lines forming the corner, as in Figure 5a; after finding half of the angle formed by the corner, θc, the distance dc from the corner at which the arc should begin is dc = rmin / tan(θc).

It is straightforward to find point pm and manage boundary cases with endpoints of the chain, thus completing the definition of the representation.
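A sketch of this corner-smoothing step, under the assumption that the minimum safe turning radius comes from the agent’s constant speed v and a maximum turning rate ωmax (neither of these quantities is specified in the paper, so the first line is our assumption):

    import math

    def corner_arc_parameters(v, omega_max, theta_c):
        r_min = v / omega_max   # assumed model of the minimum safe turn radius
        # Distance from the corner at which an arc of radius r_min, tangent to
        # both lines forming the corner, begins; theta_c is half the corner angle.
        d_c = r_min / math.tan(theta_c)
        return r_min, d_c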

For concave arcs, procedures are similar but complementary to those for convex arcs. If A is not located between co (the center of the circle from which the arc is derived) and the arc itself, closest point pm to agent A is an endpoint of the arc; otherwise, pm is the point on the arc in the same direction from co as cA. Endpoint eA is found by following the arc clockwise from pm if agent heading direction is clockwise from pm, and counter-clockwise otherwise (see Figure 5b). For boundary cases, the subtended angular range of a concave arc is always determined by endpoint eA, and can thus be found similarly to previous cases.

Fig. 5. Concave corner and arc shapes, with labeled features pertinent to DT computation: (a) approximating a corner by an arc; (b) finding a subtended angle.

4 Properties of Obstacle Representations

As part of designing DT representations, we identified several desirable properties that reactive obstacle representations could possess, properties that clearly motivated previous work such as [5,7]. As part of evaluating DT representations, we present our list of these properties and very briefly describe how DT representations satisfy them.

Proximity and Symmetry. The focal point for computing repulsion is a point pm on obstacle entity E of minimal distance from agent A. Thus, the nearest point on E to A—with which A might in principle collide soonest—is also the nearest point of the obstacle representation of E to A, enabling appropriate distance-based effects of repulsion. Furthermore, repulsion is centered around pm and associated vector vm (see Figures 1 and 2), so the reactive, local obstacle representation aptly does not determine in which general direction A heads around E—the representation steers A around E in the direction A was already heading, symmetrically around vm, regardless of the heading of A.

Necessary and sufficient repulsion. The repulsive range of the obstacle representation corresponds to exactly the heading angles along which the agent would collide with the obstacle entity.

Reactivity. Obstacle avoidance dynamically applies to both stationary and moving obstacles, without pre-computation or non-local knowledge.

Mathematical parsimony. Each obstacle entity is represented by a single repeller, neither overloading agent computations nor requiring needless mathematical machinations. This enables straightforward utility in a range of scenarios.

Previous obstacle representations that satisfy these properties were effective only in substantially restricted environments, i.e., consisting of only circular obstacles (see, e.g., Figure 1). Our DT representations are far more flexible, also applying to non-circular obstacles, and they directly satisfy properties of reactivity, mathematical parsimony, proximity, and symmetry, as well as a relaxed sense of necessary and sufficient repulsion: Because DT-based repellers are bounded by the maximum angular range of the obstacle entity remaining in the direction A is heading (toward endpoint eA), no collision-free paths in that direction are repelled. Furthermore, due to proximity and symmetry properties, each time repulsion is calculated, any possible collision point would be at or very near pm. Thus, for large enough D with respect to agent size and navigation calculations, any colliding path would be within the range of repulsion of a DT-based repeller, and no collision-free path would ever be in that range.

5 Demonstrations

To test DT navigation, we created a simple, OpenGL-based simulator and simulated navigation in several scenarios. For each agent A, obstacle boundaries were locally perceived, and perception was straightforwardly implemented so that portions of obstacles occluded by other obstacles were not perceived, but other entities were perceived with unlimited range in all directions. Default parameter values for the repellers of Section 2 and [5,7] were d0 = 2.0, σ = 0.4, and D = 4rA unless otherwise noted, and navigation was in a 12 by 12 unit world, with velocity held constant at 0.3, isolating heading angle φ as the only behavioral variable governing navigation, as in [5,7]. Tests were of two general kinds: basic testing to calibrate the value of D and establish general DT effectiveness in common scenarios; and testing in complex environments.

5.1 Basic Testing and Calibration

The default value D = 4rA in our demonstrations was chosen after experimentally determining the effect of D on navigation. In general, greater values of D lead to repellers with greater angular ranges of repulsion, but for an obstacle entity that subtends a large angular range, it is not always desirable for a repeller to subtend that entire range. For example, such large repellers can preclude the boundary-proximate navigation (Figure 6) on which this paper is focused, as discussed in Section 1. Boundary-distant navigation, in contrast, can be supported by such large repellers, but boundary-distant navigation can also be readily supported by appropriate DT representations (Figure 7). The local sensitivity enabled by DT representations, however, is not fully exploited in boundary-distant applications.

For finer-tuned, boundary-proximate navigation, we first calibrated D for appropriate sensitivity in DT representations. To do this, we ran experiments with a single agent A navigating along a wall, which indicated that distance dm of agent A from the wall systematically varied with the value of D. We also considered a thought experiment—i.e., among many differences between an elephant and an ant, they maintain different safety margins when walking along a wall—and thus selected an agent size-dependent value of D = 4rA, where rA is the radius of agent A; this results in a dm of between one and two radii for agents, which seems safe but not excessive. We then tested DT navigation in the basic scenarios shown in Figure 6, each with a convex or concave obstacle. In each scenario, agents started at 100 randomly selected positions spanning the left sides of their worlds, and DT navigation achieved perfect performance: Every agent reached its target without collision.

5.2 Complicated Environments

We also tested agents in more complicated environments, as shown in Figure 9. In the Hallways scenario, approximating an indoor layout with 3 × 2-sized office-obstacles

Fig. 6. Basic scenarios for demonstrations of purely reactive boundary-proximate navigation: (a) Octagon; (b) Convex Arc; (c) Concave Corner; (d) Concave Arc. Each image contains an obstacle entity, a target (green circle), and a sample trajectory.

Fig. 7. Demonstrations of DT-based boundary-distant navigation: (a) Convex Arc; (b) Concave Corner. Agents started out facing the target but turned quickly, taking a smooth, efficient path to the target.

Fig. 8. Two different-sized agents, sizes rA = 0.1 and 0.3, reaching parallel paths along a wall, each from a setting of D = 4rA. Figures show the target locations (green circles), trajectories, and DT representations of the wall for each.

(the inner rectangles) and hallway width roughly 10-to-20 times rA, agents navigated from 100 starting positions in the left of their world to a sequence of five target locations (Figure 9a), requiring extensive navigation and turning. Purely reactive DT navigation performance was perfect in all tested variants of this hallway scenario, including versions with additional circular obstacles, stationary or moving, in hallways.

The Polygons scenario (Figure 9b) incorporates navigation around a moving wall, which rotates in the center of the space, and a variety of convex polygons. In these experiments, agents navigated to five target locations (similar to those in the Hallways scenario), requiring a full traversal of the horizontal space; because of the additional difficulty posed by this scenario, the values of d0 and σ were raised to 2.25 and 0.6, for repulsion at greater distances. Tests of DT navigation showed very good performance: Of 100 agents tested, starting from positions spanning the left side and top of this environment, 99 reached all targets without colliding. (Avoiding the moving wall proved difficult, perhaps due to the restriction to constant velocity.)

Fig. 9. Different scenarios in which DT navigation was tested, showing target locations and an example agent trajectory in each: (a) Hallways; (b) Polygons; (c) Circle Track; (d) Figure-Eight; (e) Winding Path.

Fig. 10. A race-like run in the Circle Track scenario, including target locations and a trajectory of a fast agent that steered around slower agents.

The remaining tests in complex environments were performed in scenarios with curved shapes (Figure 9c–e): a Circle Track; a Figure-Eight; and a Winding Path. The Winding Path scenario illustrates how even purely reactive DT navigation can succeed along a very complicated path, and the Figure-Eight scenario shows successful navigation in a closely bounded, curved environment. In the Figure-Eight and Circle Track scenarios, targets were alternatingly placed on the top and bottom of the tracks (indicated in Figure 9), to keep agents looping around the tracks. Race-like demonstrations were also run on the Circle Track (Figure 10), with up to four agents at different speeds, all successfully avoiding each other and the boundaries of the track while running.

6 Conclusion

This paper presents a new, dynamic tangent-based navigation method, which treats obstacle entities as obstacle-valued functions: Each agent represents each obstacle as an angular repeller, dynamically adjusted during navigation to support successful performance. The obstacle representation level of abstraction enables enhanced geometric sensitivity while retaining desired properties of obstacle representations. Simulations demonstrate that DT navigation is successful even in applications where agents must navigate closely around obstacle shapes and scenarios with a moving wall or complicated environments requiring circular or winding paths. DT representations might also be effective in a wider range of environments if based on context-dependent variations in the value of D or with learning-based adaptations; the fact that DT representations require so few parameters may facilitate developmental or learning-based approaches.

Acknowledgments. The authors thank Clare Bates Congdon and anonymous referees for comments on previous versions of this paper.

References

1. Easton, K., Burdick, J.: A coverage algorithm for multi-robot boundary inspection. In: Int. Conf. Robotics and Automation, pp. 727–734 (2005)

2. Goldenstein, S., Karavelas, M., Metaxas, D., Guibas, L., Aaron, E., Goswami, A.: Scalable nonlinear dynamical systems for agent steering and crowd simulation. Computers and Graphics 25(6), 983–998 (2001)

3. Huang, W., Fajen, B., Fink, J., Warren, W.: Visual navigation and obstacle avoidance using a steering potential function. Robotics and Autonomous Systems 54(4), 288–299 (2006)

4. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. Int. Journal of Robotics Research 5(1), 90–98 (1986)

5. Large, E., Christensen, H., Bajcsy, R.: Scaling the dynamic approach to path planning and control: Competition among behavioral constraints. International Journal of Robotics Research 18(1), 37–58 (1999)

6. Paris, S., Pettre, J., Donikian, S.: Pedestrian reactive navigation for crowd simulation: A predictive approach. Computer Graphics Forum 26(3) (2007)

7. Schoner, G., Dose, M.: A dynamical systems approach to task-level system integration used to plan and control autonomous vehicle motion. Robotics and Autonomous Systems 10(4), 253–267 (1992)

8. Schoner, G., Dose, M., Engels, C.: Dynamics of behavior: Theory and applications for autonomous robot architectures. Robotics and Autonomous Systems 16(2-4), 213–245 (1995)

9. Shao, W., Terzopoulos, D.: Autonomous pedestrians. Graphical Models 69(5-6), 246–274 (2007)

10. Treuille, A., Cooper, S., Popovic, Z.: Continuum crowds. ACM Trans. on Graphics 25(3), 1160–1168 (2006)


Grounding Formulas with Complex Terms

Amir Aavani, Xiongnan (Newman) Wu, Eugenia Ternovska, and David Mitchell

Simon Fraser University
{aaa78,xwa33,ter,mitchell}@sfu.ca

Abstract. Given a finite domain, grounding is the process of creating a variable-free first-order formula equivalent to a first-order sentence. Since first-order sentences can be used to describe combinatorial search problems, efficient grounding algorithms would help in solving such problems effectively and make advanced solver technology (such as SAT) accessible to a wider variety of users. One promising method for grounding is based on the relational algebra from the field of database research. In this paper, we describe the extension of this method to ground formulas of first-order logic extended with arithmetic, expansion functions and aggregate operators. Our method allows easy choice of particular CNF representations for complex constraints.

1 Introduction

An important direction of work in constraint-based methods is the development of declarative languages for specifying or modelling combinatorial search problems. These languages provide users with a notation in which to give a high-level specification of a problem (see, e.g., ESSENCE [1]). By reducing the need for specialized constraint programming knowledge, these languages make the technology accessible to a wider variety of users. In our group, a logic-based framework for specification/modelling languages was proposed [2]. We undertake a research program of both theoretical development and demonstration of practical feasibility through system development.

Our tools are based on grounding, which is the task of taking a problem specification, together with an instance, and producing a variable-free first-order formula representing the solutions to the instance¹. Here, we consider grounding to propositional logic, with the aim of using propositional satisfiability (SAT) solvers as the problem solving engine. Note that SAT is just one possibility. A similar process can be used for grounding from a high-level language to, e.g., CPLEX, various Satisfiability Modulo Theory (SMT) solvers, and ground constraint solvers such as MINION [3]. An important advantage in solving through grounding is that the speed of ground solvers improves all the time, and we can always use the best and the latest solver available.

Grounding a first-order formula over a given finite domain A may be done simply by replacing ∀x φ(x) with ∧a∈A φ(x)[x/ã], and ∃x φ(x) with ∨a∈A φ(x)[x/ã], where ã is a new constant symbol denoting domain element a and φ(x)[x/ã] denotes substituting ã for every occurrence of x in φ. In practice, though, effective grounding is not easy. Naive methods are too slow, and produce groundings that are too large and contain many redundant clauses.
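As an illustration of this naive scheme (and not of the authors’ grounder), a toy recursive grounder for a small quantifier/connective fragment might look like this; the tuple-based formula syntax is ours, invented for the example.

    def ground(formula, domain, env=None):
        # Naive grounding: forall -> conjunction over the domain,
        # exists -> disjunction over the domain; atoms get the new
        # domain constants substituted for their bound variables.
        env = env or {}
        op = formula[0]
        if op == 'forall':
            _, var, body = formula
            return ('and',) + tuple(ground(body, domain, {**env, var: a}) for a in domain)
        if op == 'exists':
            _, var, body = formula
            return ('or',) + tuple(ground(body, domain, {**env, var: a}) for a in domain)
        if op in ('and', 'or', 'not'):
            return (op,) + tuple(ground(sub, domain, env) for sub in formula[1:])
        pred, *args = formula          # an atom such as ('E', 'x', 'y')
        return (pred,) + tuple(env.get(arg, arg) for arg in args)

    # "Every vertex has an outgoing edge", grounded over the domain {1, 2}:
    print(ground(('forall', 'x', ('exists', 'y', ('E', 'x', 'y'))), [1, 2]))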

Patterson et al. defined a basic grounding method for function-free first-order logic (FO) in [4,5], and a prototype implementation is described in [5].

¹ By instance we always understand an instance of a search problem, e.g., a graph is an instance of 3-colourability.



Expressing most interesting real-world problems, e.g., the Traveling Salesman problem or the Knapsack problem, with function-free FO formulas without access to arithmetical operators is not an easy task. So, enriching the syntax with functions and arithmetical operators is a necessity. We describe how we have extended the existing grounding algorithm so that it can handle these constructs.

It is important to notice that the model expansion problem [5] is very different from the query evaluation process. In the model expansion context, there are formulas and sub-formulas which cannot be evaluated, while in the query processing context, every formula can be evaluated as either true or false. First-order model expansion over a finite domain allows one to describe NP-complete problems, while the query processing problem for FO over a finite domain is polynomial time. In this paper, we are interested in solving the model expansion problem.

An important element in the practice of SAT solving is the choice, when designing reductions, of “good” encodings into propositional logic of complex constraints. We describe our method for grounding of formulas containing aggregate operations in terms of “gadgets” which determine the actual encoding. The choice of the particular gadget can be under user control, or even made automatically at run-time based on formula and instance properties.

Even within one specification, different occurrences of the same aggregate may be grounded differently, and this may vary from instance to instance. With well-designed heuristics for such choices (possibly obtained by machine learning methods), we may be able to produce groundings that are more effective in practice than those a human could design by hand, except through an exceedingly labour-intensive process.

Our main contributions are:

1. We present an algorithm which can be used to ground specifications having different kinds of terms, e.g., aggregates, expansion/instance functions, and arithmetic.

2. We enrich our language with aggregates, functions and arithmetical expressions, and design and develop an engine which can convert these constructs to pure SAT instances, as well as to instances for SAT solvers that can handle more complex constraints such as cardinality or Pseudo-Boolean constraints.

3. We define the notion of an answer to a term and modify the previous grounding algorithm to work with this new concept.

2 Background

We formalize combinatorial search problems in terms of the logical problem of model expansion (MX), defined here for an arbitrary logic L.

Definition 1 (MX). Given an L-sentence φ, over the union of disjoint vocabularies σ and ε, and a finite structure A for vocabulary σ, find a structure B that is an expansion of A to σ ∪ ε such that B |= φ.

In this paper, φ is a problem specification formula, A always denotes a finite σ-structure, called the instance structure, σ is the instance vocabulary, ε the expansion vocabulary, and L is FO logic extended with arithmetic and aggregate operators.

Example 1. Consider the following variation of the knapsack problem. We are given a set of items (loads), L = {l1, ..., ln}, and the weight of each item is specified by an instance function W which maps items to integers (wi = W(li)). We want to check whether there is a way to put these n items into m knapsacks, K = {k1, ..., km}, while satisfying the following constraints:


Certain items should be placed into certain knapsacks; these pairs are specified using the instance predicate P. h of the m knapsacks have high capacity, and each of them can carry a total load of HCap, while the capacity of the rest of the knapsacks is LCap. We also do not want to put two items whose weights are very different in the same bag, i.e., the difference between the weights of items in the same bag should be less than Wl. Each of HCap, LCap and Wl is an instance function with arity zero, i.e., a given constant.

The following first-order formula φ is a specification for this problem:

{A1 : ∀l ∃k : Q(l, k)} ∧
{A2 : ∀l ∀k1 ∀k2 : (Q(l, k1) ∧ Q(l, k2)) ⊃ k1 = k2} ∧
{A3 : ∀l, k : P(l, k) ⊃ Q(l, k)} ∧
{A4 : ∀k : Σ_{l : Q(l,k)} W(l) ≤ HCap} ∧
{A5 : COUNT_k{ Σ_{l : Q(l,k)} W(l) ≥ LCap } ≤ h} ∧
{A6 : ∀k, l1, l2 : (Q(l1, k) ∧ Q(l2, k)) ⊃ (W(l1) − W(l2) ≤ Wl)}

An instance is a structure for vocabulary σ = {P, W, Wl, HCap, LCap}, i.e., a list of pairs, a function which maps items to integers, and three constant integers. The task is to find an expansion B of A that satisfies φ:

(L ∪ K; P^A, W^A, Wl^A, HCap^A, LCap^A, Q^B) |= φ,

where (L ∪ K; P^A, W^A, Wl^A, HCap^A, LCap^A) is the instance structure A and the whole expanded structure is B. The interpretation of the expansion vocabulary ε = {Q}, for structures B that satisfy φ, is a mapping from items to knapsacks that satisfies the problem properties.

The grounding task is to produce a ground formula ψ = Gnd(φ, A), such that models of ψ correspond to solutions for instance A. Formally, to ground we bring domain elements into the syntax by expanding the vocabulary with a new constant symbol for each element of the domain. For domain A, the domain of structure A, we denote the set of such constants by Ã. In practice, the ground formula should contain no occurrences of the instance vocabulary, in which case we call it reduced.

Definition 2 (Reduced Grounding for MX). Formula ψ is a reduced grounding of formula φ over σ-structure A = (A; σ^A) if
1. ψ is a ground formula over ε ∪ Ã, and
2. for every expansion structure B = (A; σ^A, ε^B) over σ ∪ ε, B |= φ iff (B, Ã^B) |= ψ,
where Ã^B is the standard interpretation of the new constants Ã.

Proposition 1. Let ψ be a reduced grounding of φ over σ-structure A. Then A can be expanded to a model of φ iff ψ is satisfiable.

A reduced grounding with respect to a given structure A can be obtained by an algorithm that, for each fixed FO formula, runs in time polynomial in the size of A. Such a grounding algorithm implements a polytime reduction to SAT for each NP search problem. Simple grounding algorithms, however, do not reliably produce groundings for large instances of interesting problems fast enough in practice.

Grounding for MX is a generalization of query answering. Given a structure (database) A, a Boolean query is a formula φ over the vocabulary of A, and query answering is equivalent to evaluating whether φ is true, i.e., A |= φ. For model expansion, φ has some additional vocabulary beyond that of A, and producing a reduced grounding involves evaluating out the instance vocabulary, producing a ground formula representing the possible expansions of A for which φ is true.

The grounding algorithms in this paper construct a grounding by a bottom-up process that parallels database query evaluation, based on an extension of the relational algebra. For each sub-formula φ(x) with free variables x, we call the set of reduced groundings for φ under all possible ground instantiations of x an answer to φ(x). We represent answers with tables on which an extended algebra operates.

An X-relation is a k-ary relation associated with a k-tuple of variables X, representing a set of instantiations of the variables of X. It is a central notion in databases. In extended X-relations, introduced in [4], each tuple γ is associated with a formula ψ. For convenience, we use ⊤ and ⊥ as propositional formulas which are always true and false, respectively.

Definition 3 (extended X-relation; function δR). Let A be a domain, and X a tuple of variables with |X| = k. An extended X-relation R over A is a set of pairs (γ, ψ) s.t.
1. γ : X → A,
2. ψ is a formula, and
3. if (γ, ψ) ∈ R and (γ, ψ′) ∈ R then ψ = ψ′.
The function δR represented by R is a mapping from k-tuples γ of elements of the domain A to formulas, defined by δR(γ) = ψ if (γ, ψ) ∈ R, and δR(γ) = ⊥ if there is no pair (γ, ψ) ∈ R.

For brevity, we sometimes write γ ∈ R to mean that there exists ψ such that (γ, ψ) ∈ R. We also sometimes call extended X-relations simply tables. (We use the term X-relation both generically and for a concrete tuple X of variables; which is meant will be clear from context.)

Definition 4 (answer to φ wrt A). Let φ be a formula in σ ∪ ε with free variables X, A a σ-structure with domain A, and R an extended X-relation over A. We say R is an answer to φ wrt A if for any γ : X → A, δR(γ) is a reduced grounding of φ[γ] over A. Here, φ[γ] denotes the result of instantiating the free variables in φ according to γ.

Since a sentence has no free variables, the answer to a sentence φ is a zero-ary extended X-relation, containing a single pair (⟨⟩, ψ), associating the empty tuple with formula ψ, which is a reduced grounding of φ.

Example 2. Let σ = {P} and ε = {E}, and let A be a σ-structure with P^A = {(1, 2, 3), (3, 4, 5)}. Answers to φ1 ≡ P(x, y, z) ∧ E(x, y) ∧ E(y, z), φ2 ≡ ∃z φ1 and φ3 ≡ ∃x ∃y φ2 are shown in Table 1. Observe that δR(1, 2, 3) = E(1, 2) ∧ E(2, 3) is a reduced grounding of φ1[(1, 2, 3)] = P(1, 2, 3) ∧ E(1, 2) ∧ E(2, 3), and δR(1, 1, 1) = ⊥ is a reduced grounding of φ1[(1, 1, 1)]. E(1, 2) ∧ E(2, 3) is a reduced grounding of φ2[(1, 2)]. Notice that, as φ3 does not have any free variables, its corresponding answer has just a single row.

Table 1. Answers to φ1, φ2 and φ3

Answer to φ1:
x  y  z   ψ
1  2  3   E(1, 2) ∧ E(2, 3)
3  4  5   E(3, 4) ∧ E(4, 5)

Answer to φ2:
x  y      ψ
1  2      E(1, 2) ∧ E(2, 3)
3  4      E(3, 4) ∧ E(4, 5)

Answer to φ3:
ψ
[E(1, 2) ∧ E(2, 3)] ∨ [E(3, 4) ∧ E(4, 5)]


The relational algebra has operations corresponding to each connective and quantifier in FO, as follows: complement (negation); join (conjunction); union (disjunction); projection (existential quantification); division or quotient (universal quantification). Following [4,5], we generalize each to extended X-relations as follows.

Definition 5 (Extended Relational Algebra). Let R be an extended X-relation and S an extended Y-relation, both over domain A.
1. ¬R is the extended X-relation ¬R = {(γ, ψ) | γ : X → A, δR(γ) ≠ ⊤, and ψ = ¬δR(γ)};
2. R ⋈ S is the extended X ∪ Y-relation R ⋈ S = {(γ, ψ) | γ : X ∪ Y → A, γ|X ∈ R, γ|Y ∈ S, and ψ = δR(γ|X) ∧ δS(γ|Y)};
3. R ∪ S is the extended X ∪ Y-relation R ∪ S = {(γ, ψ) | γ|X ∈ R or γ|Y ∈ S, and ψ = δR(γ|X) ∨ δS(γ|Y)};
4. for Z ⊆ X, the Z-projection of R, denoted by πZ(R), is the extended Z-relation {(γ′, ψ) | γ′ = γ|Z for some γ ∈ R and ψ = ∨_{γ ∈ R, γ′ = γ|Z} δR(γ)};
5. for Z ⊆ X, the Z-quotient of R, denoted by dZ(R), is the extended Z-relation {(γ′, ψ) | for all γ with γ : X → A and γ|Z = γ′ we have γ ∈ R, and ψ = ∧_{γ ∈ R, γ′ = γ|Z} δR(γ)}.

To ground using this algebra, we apply the algebra inductively on the structure of the formula, just as the standard relational algebra may be applied for query evaluation. We define the answer to an atomic formula P(x) as follows. If P is an instance predicate, the answer to P is the set of tuples (a, ⊤), for a ∈ P^A. If P is an expansion predicate, the answer is the set of all pairs (a, P(a)), where a is a tuple of elements from the domain A. Correctness of the method then follows, by induction on the structure of the formula, from the following proposition.
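A minimal way to picture these data structures is to keep an extended X-relation as a dictionary from instantiations to formulas, with absent tuples implicitly mapping to ⊥. The sketch below is our own illustration of the base cases and of simplified versions of the algebra operations (join and union are shown for relations over the same variable tuple, for brevity); it is not the authors' implementation.

from itertools import product

def answer_instance_pred(interp):
    """Base case: instance predicate; tuples in its interpretation map to 'true'."""
    return {tup: "true" for tup in interp}

def answer_expansion_pred(name, domain, arity):
    """Base case: expansion predicate; every tuple maps to its own ground atom."""
    return {tup: f"{name}{tup}" for tup in product(domain, repeat=arity)}

def join(r, s):
    """Conjunction (same variable tuple): keep tuples present in both tables."""
    return {g: f"({r[g]} & {s[g]})" for g in r if g in s}

def union(r, s):
    """Disjunction (same variable tuple): keep tuples present in either table."""
    out = {}
    for g in set(r) | set(s):
        out[g] = f"({r[g]} | {s[g]})" if (g in r and g in s) else r.get(g, s.get(g))
    return out

def project(r, keep_idx):
    """Existential quantification: OR together formulas agreeing on the kept positions."""
    out = {}
    for g, psi in r.items():
        key = tuple(g[i] for i in keep_idx)
        out[key] = psi if key not in out else f"({out[key]} | {psi})"
    return out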

Proposition 2. Suppose that R is an answer to φ1 and S is an answer to φ2, both with respect to (wrt) structure A. Then:
1. ¬R is an answer to ¬φ1 wrt A;
2. R ⋈ S is an answer to φ1 ∧ φ2 wrt A;
3. R ∪ S is an answer to φ1 ∨ φ2 wrt A;
4. if Y is the set of free variables of ∃z φ1, then πY(R) is an answer to ∃z φ1 wrt A;
5. if Y is the set of free variables of ∀z φ1, then dY(R) is an answer to ∀z φ1 wrt A.

The proof for cases 1, 2 and 4 is given in [4]; the other cases follow easily.

The answer to an atomic formula P(x), where P is from the expansion vocabulary, is formally a universal table; in practice we may represent this table implicitly and avoid explicitly enumerating the tuples. As operations are applied, some subsets of columns remain universal, while others do not. Again, those columns which are universal may be represented implicitly. This could be treated as an implementation detail, but the use of such implicit representations dramatically affects the cost of operations, and so it is useful to further generalize our extended X-relations. We call the variables which are implicitly universal "hidden" variables, as they are not represented explicitly in the tuples, and the other variables "explicit" variables. We do not define this concept here; interested readers are encouraged to refer to [5].

This basic grounding approach can ground only the axioms A1, A2 and A3 in Example 1.

2.1 FO MX with Arithmetic

In this paper, we are concerned with specifications written in FO extended with functions, arithmetic and aggregate operators. Informally, we assume that the domain of any instance structure is a subset of N (the set of natural numbers), and that arithmetic operators have their standard meanings. Details of the aggregate operators need to be specified, but these also behave according to our normal intuitions. Quantified variables and the range of instance functions must be restricted to finite subsets of the integers, and the possible interpretations of expansion predicates and expansion functions must be restricted to a finite domain of N as well. This can be done by employing a multi-sorted logic in which all sorts are required to be finite subsets of N, or by requiring specification formulas to be written in a certain "guarded" form.

In the rest of this paper, we assume that all variables range over the finite domain² T ⊂ N, and that φ(t1(x), ..., tk(x)) is shorthand for ∃y1, ..., yk : y1 = t1(x) ∧ ... ∧ yk = tk(x) ∧ φ(y1, ..., yk). Under these assumptions, we do not need to worry about the interpretation of predicates and functions outside T.

Syntax and Semantics of Aggregate Operators. We may use evaluation for formulas with expansion predicates: by evaluating a formula which has expansion predicates as true, we mean that there is a solution for the whole specification which also satisfies the given formula. Also, for the sake of presentation, we may use φ[a, z2] as shorthand for φ(z1, z2)[z1/a], which denotes substituting a for every occurrence of z1 in φ. Although our system supports grounding specifications containing Max, Min, Sum and Count aggregates, for the sake of space we focus only on the Sum and Count aggregates in this paper:

– t(y) = Max_x{t(x, y) : φ(x, y); dM(y)}, for any instantiation b for y, denotes the maximum value obtained by t[a, b] over all instantiations a for x for which φ[a, b] is true, or dM if there is none. dM is the default value of the Max aggregate, which is returned whenever all conditions evaluate to false.

– t(y) = Min_x{t(x, y) : φ(x, y); dm(y)} is defined dually to Max.

– t(y) = Sum_x{t(x, y) : φ(x, y)}, for any instantiation b of y, denotes 0 plus the sum of all values t[a, b] over all instantiations a for x for which φ[a, b] is true.

– t(y) = Count_x{φ(x, y)}, for any instantiation b for y, denotes the number of tuples a for which φ[a, b] is true. As Count_x{φ(x, y)} = Sum_x{1 : φ(x, y)}, in the rest of this paper we assume that all terms containing the Count aggregate are replaced with appropriate terms in which Count is replaced with Sum, and so we do not discuss the Count aggregate any further.

3 Evaluating Out Arithmetic and Instance Functions

The relational algebra-based grounding algorithm described in Section 2 is designed for the relational (function-free) case. Below, we extend it to the case where arguments to atomic formulas may be complex terms. In this section, we present a simple method for special cases where terms do not contain expansion predicates/functions, and so they can be evaluated purely on the instance structure.

Recall that an answer to a sub-formula φ(X) of a specification is an extended X-relation R. If |X| = k, then the tuples of R have arity k. Now, consider an atomic formula whose arguments are terms containing instance functions and arithmetic operations, e.g., φ = P(x + y). As discussed previously, φ ⇔ ∃z (z = x + y ∧ P(z)). Although we have not discussed the handling of the sub-formula z = x + y, it is apparent that the answer to φ, with free variables {x, y}, is an extended {x, y}-relation R.

The extended relation R can be defined as the set of all tuples (⟨a, b⟩, ψ) such that a + b is in the interpretation of P. To modify the grounding algorithm of the previous section, we revise the base cases of the definition as follows:

² A more general version, where each variable may have its own domain, is implemented, but is more complex to explain.


Definition 6 (Base Cases for Atoms with Evaluable Terms). For an atomic formula φ = P(t1, ..., tn) with terms t1, ..., tn and free variables X, use the following extended X-relation (which is an answer to φ wrt A):
1. P is an instance predicate: {(γ, ⊤) | A |= P(t1, ..., tn)[γ]};
2. P is t1(x) ◦ t2(x), where ◦ ∈ {=, <}: {(γ, ⊤) | A |= (t1 ◦ t2)[γ]};
3. P is an expansion predicate: {(γ, P(a1, ..., an)) | A |= (t1 = a1 ∧ ... ∧ tn = an)[γ]}.

Terms involving aggregate operators, provided the formula argument to that operator contains only instance predicates and functions with a given interpretation, can also be evaluated out in this way. In Example 1, this extension enables us to ground A6.
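For atoms whose arguments can be evaluated on the instance alone, case 1 of Definition 6 amounts to evaluating the terms under every instantiation and keeping the instantiations the instance satisfies. The sketch below, using the dictionary representation from the earlier sketch, is our own illustration; the `terms` argument stands in for evaluable terms such as instance functions or x + y.

from itertools import product

def answer_evaluable_atom(pred_interp, terms, domain, num_vars):
    """
    Sketch of Definition 6, case 1: P is an instance predicate and every argument
    term is evaluable; `terms` are Python functions from an instantiation tuple
    to an integer.
    """
    answer = {}
    for gamma in product(domain, repeat=num_vars):
        values = tuple(t(gamma) for t in terms)
        if values in pred_interp:          # A |= P(t1, ..., tn)[gamma]
            answer[gamma] = "true"
    return answer

# Example: answer to P(x + y) over domain {1, 2, 3} with P^A = {(3,), (5,)}
# ans = answer_evaluable_atom({(3,), (5,)}, [lambda g: g[0] + g[1]], {1, 2, 3}, 2)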

4 Answers to Terms

Terms involving expansion functions or predicates, including aggregate terms involving expansion predicates, can only be evaluated with respect to a particular interpretation of those expansion predicates/functions. Thus, they cannot be evaluated out during grounding as in Section 3, and they must somehow be represented in the ground formula. We call a term which cannot be evaluated based on the instance structure a complex term.

In this section, we further extend the base cases of our relational algebra based grounding method to handle atomic formulas with complex terms. The key idea is to introduce the notion of an answer to a term. The new base cases then construct an answer to the atom from the answers to the terms which are its arguments. The terms we allow here include arithmetic expressions, instance functions, expansion functions, and aggregate operators involving these as well. The axioms A4 and A5 in Example 1 contain these kinds of terms.

Let t be a term with free variables X, and A a σ-structure. Let R be a pair (αR, βR) such that αR is a finite subset of N, and βR is a function mapping each element a ∈ αR to an extended X-relation βR(a).

Intuitively, αR is the set of all possible values that a given term t(X) may take, and βR(a) is a table representing all instantiations of X under which t might evaluate to a. We sometimes use Ra as a shorthand for βR(a). We define βR(a) = ∅ for a ∉ αR. Recall that we defined δR(γ) to be ψ iff (γ, ψ) ∈ R. We may also use δR(γ, n) and δ_{βR(n)}(γ) interchangeably.

Definition 7 (Answer to term t wrt A). We say that R = (αR, βR) is an answer to term t wrt A if, for every a ∈ αR, the extended X-relation βR(a) is an answer to the formula (t = a) wrt A, and for every a ∉ αR, the formula (t = a) is not satisfiable wrt A. Note that with this definition, αR can be either the range of t or a superset of it.
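In the same dictionary-based spirit as the earlier sketches, an answer to a term can be pictured as the pair (αR, βR): a set of possible values together with, for each value, an extended relation. The sketch below (ours, purely illustrative) builds this pair for a simple, evaluable term, i.e., case (1) of Proposition 3 below.

from itertools import product

def answer_simple_term(term, domain, num_vars):
    """
    Sketch: answer (alpha, beta) to a simple term, where `term` maps an
    instantiation tuple to an integer.  beta[n] is the extended relation whose
    tuples are the instantiations under which the term evaluates to n.
    """
    alpha, beta = set(), {}
    for gamma in product(domain, repeat=num_vars):
        n = term(gamma)
        alpha.add(n)
        beta.setdefault(n, {})[gamma] = "true"
    return alpha, beta

# e.g. the answer to the term x + y over domain {1, 2, 3}:
# alpha, beta = answer_simple_term(lambda g: g[0] + g[1], {1, 2, 3}, 2)
# alpha == {2, 3, 4, 5, 6}; beta[4] == {(1, 3): "true", (2, 2): "true", (3, 1): "true"}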

Example 3 (continuation of Example 1). Let ψ(l1, l2) = W(l1) − W(l2) ≤ Wl, where the domains of both l1 and l2 are L = {0, 1, 2}. Let A be a σ-structure with W^A = {(0 ↦ 7), (1 ↦ 3), (2 ↦ 5)} and Wl^A = 2. Let t = Wl, ti = li, t′i = W(li) (i ∈ {1, 2}), and t′′ = t′1 − t′2 be the terms in ψ, and let R, Ri, R′i (i ∈ {1, 2}), R′′ be answers to these terms, respectively. Then αR = {2}, αRi = {0, 1, 2}, αR′i = {3, 5, 7} and αR′′ = {0..4}.

We now give properties that are sufficient for particular extended X-relations to constitute answers to particular terms. For a tuple X of variables of arity k, define DX to be the set of all k-tuples of domain elements, i.e., DX = A^k.


Proposition 3 (Answers to Terms). Let R be the pair (αR, βR), and t a term. Assume that t1, ..., tm are terms, and R1, ..., Rm (respectively) are answers to those terms wrt A. Also, let S be an answer to φ. Then R is an answer to t wrt A if:

(1) t is a simple term (i.e., involves only variables, instance functions, and arithmetic operators): αR = {n ∈ N | ∃ a ∈ DX : t[a] = n}, and for all n ∈ αR, βR(n) is an answer to t = n computed as described in Definition 6.

(2) t is a term of the form t1 + t2: αR = {x + y | x ∈ αR1 and y ∈ αR2}, and

βR(n) = ∪_{j ∈ αR1, k ∈ αR2, n = j + k} βR1(j) ⋈ βR2(k).

(3) t is a term of the form t1 {−, ×} t2: similar to case (2).

(4) t is a term of the form f(t1, ..., tm), where f is an instance function: αR = {y | for some x1 ∈ αR1, ..., xm ∈ αRm, f(x1, ..., xm) = y}, and

βR(n) = ∪_{a1 ∈ αR1, ..., am ∈ αRm s.t. f(a1, ..., am) = n} βR1(a1) ⋈ ... ⋈ βRm(am).

Intuitively, βR(n) is the combination of all possible ways in which f can evaluate to n.

(5) t is a term of the form f(t1, ..., tm), where f is an expansion function. We introduce an expansion predicate Ef(x, y) for each expansion function f(x), where the type of y is the same as the range of f. Then αR is equal to the range of f, and

βR(n) = ∪_{a1 ∈ αR1, ..., am ∈ αRm} βR1(a1) ⋈ ... ⋈ βRm(am) ⋈ T_{a1,...,am,n},

where T_{a1,...,am,n} is an answer to ∃x (∧_i xi = ai ∧ y = n ∧ Ef(x1, ..., xm, y)). βR(n) expresses that f(t1, ..., tm) is equal to n under assignment γ iff ti[γ] = ai and f(a1, ..., am) = n.

(6) t is Sum_x{t1(x, y) : φ(x, y)}: αR = {Σ_{a ∈ Dx} f(a) : f : Dx → {0} ∪ αR1}. Let

δ′R1(γ, n) = δR1(γ, n) if n ≠ 0, and δ′R1(γ, 0) = δR1(γ, 0) ∨ ¬δS(γ).

Then for each assignment b : y → D(y):

δR(b, n) = ∨_{f : Dx → {0} ∪ αR1 s.t. Σ_{a ∈ Dx} f(a) = n} ∧_{a ∈ Dx} δ′R1(a, b, f(a)).

For a fixed instantiation b of y, each instantiation a of x might or might not contribute to the output value of the aggregate when y is equal to b; a contributes to the output of the aggregate iff B |= φ(a, b). δ′R1(a, b, f(a)) describes the condition under which t1(a, b) contributes f(a) to the output. So, for a given mapping f from Dx to αR1, we conjoin the conditions obtained from δ′R1 to find the necessary and sufficient condition for one of the cases in which the output sum is exactly n; the outer disjunction then gives the complete condition.
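Case (2), for instance, just combines every pair of values from the two sub-answers and joins the corresponding tables. The sketch below is our own illustration over the dictionary representation used in the earlier sketches, reusing those join and union helpers (and therefore the same same-variable simplification); it is not the Enfragmo implementation.

def answer_plus(ans1, ans2):
    """Sketch of Proposition 3, case (2): answer to t1 + t2 from answers to t1 and t2."""
    (alpha1, beta1), (alpha2, beta2) = ans1, ans2
    alpha, beta = set(), {}
    for j in alpha1:
        for k in alpha2:
            n = j + k
            alpha.add(n)
            joined = join(beta1[j], beta2[k])        # table for (t1 = j) and (t2 = k)
            beta[n] = union(beta[n], joined) if n in beta else joined
    return alpha, beta

A direct reading of case (6) would enumerate every mapping f from Dx to the possible values, which is exactly the blow-up the SUM placeholder described next is designed to avoid.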

Although what is described in case (6) can be used directly to find an answer for Sum aggregates, in practice many of the entries in R would be eliminated during grounding, as they are joined with a false formula or unioned with a true formula. So, to reduce the grounding time, we use a placeholder, the SUM placeholder, of the form SUM(R1, S, n, γ), as the formula corresponding to δR(γ, n). The sum gadget is stored and propagated during grounding as a formula. After the end of the grounding phase, the engine enters the CNF generation phase, in which a SAT instance is created from the obtained reduced grounding. In the CNF generation phase, the ground formula is traversed and, using the standard Tseitin transformation [6], a corresponding CNF is generated³.

Table 2. Tables for Example 4: (a) answer to Wl = 2, i.e. βR(2); (b) answer to l1 = 0, i.e. βR1(0); (c) answer to l1 = 1, i.e. βR1(1); (d) answer to l1 = 2, i.e. βR1(2); (e) answer to W(l1) = 5, i.e. βR′1(5); (f) answer to W(l2) = 7, i.e. βR′2(7); (g) answer to W(l1) − W(l2) = 2, i.e. βR′′(2)

(a) βR(2):    ψ = True
(b) βR1(0):   l1 = 0, ψ = True
(c) βR1(1):   l1 = 1, ψ = True
(d) βR1(2):   l1 = 2, ψ = True
(e) βR′1(5):  l1 = 2, ψ = True
(f) βR′2(7):  l2 = 0, ψ = True
(g) βR′′(2):  l1 = 0, l2 = 1, ψ = False
              l1 = 2, l2 = 1, ψ = True
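For reference, the Tseitin step mentioned above introduces one fresh variable per sub-formula and emits clauses tying it to its children. The sketch below uses a nested-tuple formula convention of our own (not the Enfragmo data structures) and covers only ∧, ∨ and ¬.

from itertools import count

_fresh = count(1000)   # hypothetical numbering space for Tseitin variables

def tseitin(formula, clauses):
    """Return a literal equivalent to `formula`, appending its defining clauses.
    A formula is an int literal, ('not', f), ('and', f, g) or ('or', f, g)."""
    if isinstance(formula, int):
        return formula
    op = formula[0]
    if op == 'not':
        return -tseitin(formula[1], clauses)
    a = tseitin(formula[1], clauses)
    b = tseitin(formula[2], clauses)
    v = next(_fresh)
    if op == 'and':    # clauses encoding v <-> (a & b)
        clauses += [[-v, a], [-v, b], [v, -a, -b]]
    else:              # clauses encoding v <-> (a | b)
        clauses += [[-v, a, b], [v, -a], [v, -b]]
    return v

# Usage: clauses = []; root = tseitin(('and', 1, ('or', 2, -3)), clauses)
# clauses.append([root])   # assert the whole formula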

While the engine traverses the formula tree produced in the grounding phase, it might encounter a SUM placeholder. If this happens, the engine passes the SUM placeholder to the SUM gadget, which in turn converts it to CNF. This design has another benefit: if we decide to use a SAT solver which is capable of handling Pseudo-Boolean constraints natively, the SUM gadget can easily be changed to generate Pseudo-Boolean constraints from the SUM placeholder. One can find an implementation of the Sum gadget in Appendix A.

Example 4 (continuation of Example 3). βRi corresponds to the answer to variable li (i ∈ {1, 2}), so R1 (resp. R2) has one free variable, namely l1 (resp. l2). Having an answer to ti, an answer (αR′i, βR′i) to t′i can be computed. By Proposition 3, we have βR′′(2) = βR′1(7) ⋈ βR′2(5) ∪ βR′1(5) ⋈ βR′2(3). In other words, the answer to t′′ is 2 if either t′1 = 7 ∧ t′2 = 5 or t′1 = 5 ∧ t′2 = 3.

4.1 Base Case for Complex Terms

To extend our grounding algorithm to handle terms which cannot be evaluated out, we add the following base cases to the algorithm.

Definition 8 (Base Case for Atoms with Complex Terms). Let t1, ..., tm be terms, and assume that R1, ..., Rm (respectively) are answers to those terms wrt structure A. Then R is an answer to P(t1, ..., tm) wrt A if:
1. P(...) is t1 = t2: R = ∪_{i ∈ αR1 ∩ αR2} βR1(i) ⋈ βR2(i);
2. P(...) is t1 ≤ t2: R = ∪_{i ∈ αR1, j ∈ αR2, i ≤ j} βR1(i) ⋈ βR2(j);
3. P is an instance predicate: R = ∪_{(a1,...,am) ∈ P^A, ai ∈ αRi} βR1(a1) ⋈ ... ⋈ βRm(am);
4. P is an expansion predicate and R is an answer to ∃x1 ... xm (x1 = t1 ∧ ... ∧ xm = tm ∧ P(x1, ..., xm)).

Example 5 (continuation of Example 3). Although ψ does not have any complex term, to demonstrate how the base cases can be handled, the process of computing an answer for ψ is described here. We have computed answers to t′′ and t. To compute an answer to ψ(l1, l2) = t′′(l1, l2) ≤ Wl, one needs to find the union of βR′′(n) ⋈ βR(m) for m ∈ αR = {2} and n ≤ m, n ∈ {0..2}. In this example, {((0, 2), ⊤), ((2, 1), ⊤)} is an answer to ψ.

3 It is not the purpose of this paper to discuss the techniques we have used in this phase.


5 Experimental Evaluation

In this section we report empirical observations on the performance of an implementation of the methods we have described.

Thus far, we have presented our approach to grounding aggregates and arithmetic. As a motivating example, we show how the haplotype inference problem [7] can be axiomatized in our grounder. To argue that the CNF generated by our grounder is efficient, we use a well-known, optimized encoding for the haplotype inference problem and show that the same CNF can be obtained without much hardship.

In the haplotype inference problem, we are given an integer r and a set G consisting of n strings in {0, 1, 2}^m, for a fixed m. We are asked whether there exists a set H of r strings in {0, 1}^m such that for every g ∈ G there are two strings in H which explain g. We say two strings h1 and h2 explain a string g iff, for every position 1 ≤ i ≤ m, either g[i] = h1[i] = h2[i], or g[i] = 2 and h1[i] ≠ h2[i].
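Stated as code, the "explain" relation is a per-position check; the following is a direct transcription over strings of the characters '0', '1' and '2'.

def explains(h1: str, h2: str, g: str) -> bool:
    """True iff haplotypes h1, h2 explain genotype g (all strings of equal length)."""
    for a, b, c in zip(h1, h2, g):
        if c in '01':
            if not (a == b == c):   # homozygous site: both haplotypes must match g
                return False
        else:                       # c == '2': heterozygous site
            if a == b:
                return False
    return True

# explains('010', '011', '012') -> True ; explains('010', '010', '012') -> False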

The following axiomatization is intentionally written so as to generate the same CNF encoding as presented in [7], under the assumption that the gadget used for Count is a simplified adder circuit [7].

1. ∀i ∀j (g(i, j) = 0 ⊃ ∃k ((¬h(k, j) ∨ ¬Sa(k, i)) ∧ (¬h(k, j) ∨ ¬Sb(k, i))))
2. ∀i ∀j (g(i, j) = 1 ⊃ ∃k ((h(k, j) ∨ ¬Sa(k, i)) ∧ (h(k, j) ∨ ¬Sb(k, i))))
3. ∀i ∀j (ga(i, j) ≠ gb(i, j))
4. ∀i ∀j (g(i, j) = 2 ⊃ ∃k ((h(k, j) ∨ ¬ga(i, j) ∨ ¬Sa(k, i)) ∧ (¬h(k, j) ∨ ga(i, j) ∨ ¬Sa(k, i)) ∧ (h(k, j) ∨ ¬gb(i, j) ∨ ¬Sb(k, i)) ∧ (¬h(k, j) ∨ gb(i, j) ∨ ¬Sb(k, i))))
5,6. ∀i (Count_k(Sa(k, i)) = 1) ∧ ∀i (Count_k(Sb(k, i)) = 1)

In the above axiomatization, g(i, j) is an instance function which gives the character at position j of the i-th string in G. The expansion predicate h(k, i) is true iff the i-th position of the k-th string in H is one. The expansion predicate Sa(k, i) is true iff the k-th string in H is one of the explanations for the i-th string in G; Sb has a similar meaning. ga(i, j) and gb(i, j) are peripheral variables which are used in axiom (4).

Table 3 shows detailed information about the running time on haplotype inference instances produced by the ms program [7]. The axiomatization given above corresponds to the row labelled "Optimized Encoding". The other row, labelled "Basic Encoding", also comes from the same paper [7] but, as noted there and shown here, produces CNFs that take more time to solve.

Table 3. Haplotyping Problem Statistics

                      Grounding   SAT Solving   CNF Size
Basic Encoding        2.2 s       12.3 s        18.9 MB
Optimized Encoding    1.9 s       0.95 s        13.3 MB

Thus, using our system, Enfragmo, as the grounder, we have been able to describe the problem in a high-level language and yet reproduce the same CNF files that are obtained through direct reductions. Enfragmo thus enables us to try different reductions faster. Of course, once a good reduction is found, one can always use direct reductions to achieve higher grounding speed, although, as Table 3 shows, Enfragmo also has a moderate grounding time compared to the solving time.


Another noteworthy point is that different gadgets show different performance under different combinations of problems and instances, so supporting different gadgets also enables a knowledgeable user to choose the gadget that serves them best. The process of choosing a gadget can also be automated through heuristics in the grounder.

6 Related Work

The ultimate goal of systems like ours is to provide a high-level syntax which eases the task of problem description for both naive and expert users. To achieve this goal, these systems should be extended to handle complex terms. As different systems use different grounding approaches, each of them needs its own specific way of handling complex terms.

Essence [8] is a declarative modelling language for specifying search problems. In Essence, there are no expansion predicates; users describe their problems with expansion functions (variables, arrays of variables, matrices of variables and so on), instance predicates and mathematical operators. The problem description is then transformed into a Constraint Satisfaction Problem (CSP) instance by an engine called Tailor. As there is no standard input format for CSP solvers, Tailor has to be developed separately for each CSP solver. Unlike SAT solvers, which are only capable of handling Boolean variables, CSP solvers can work with instances in which the variables' domains are arbitrary. In [8], a method called flattening is described which resembles the Tseitin transformation: it handles a complex term by introducing auxiliary variables and decomposing the complex term into simpler terms. The flattening method is also used in the Zinc system [9].

IDP [10] is a system for model expansion whose input is an extension of first-order logic with inductive definitions. The syntax of IDP is essentially very similar to that of our system, but its approach to grounding a given specification is different. A ground formula is created using a top-down procedure. The formula is written in Term Normal Form (TNF), in which all arguments to predicates are variables and complex terms can only appear in atomic formulas of the form x {≤, <, =, >, ≥} t(y). The atomic formulas which have complex terms are then rewritten as disjunctions or conjunctions of atomic formulas of the form x < t(y) and x > t(y) [11]. The ground solver used by the IDP system is an extension of regular SAT solvers which is capable of handling aggregates internally. This enables them to translate specifications and instances into their ground solver input.

7 Conclusion

In model-based problem solving, users need to answer questions of the form "What is the problem?" or "How can the problem be described?". In this approach, systems with a high-level language help users considerably and reduce the amount of expertise a user needs to have, and thus open a way of solving computationally hard AI problems to a wider variety of users.

In this paper, we described how we extended our engine to handle complex terms. Having access to aggregates and arithmetic operators eases the task of describing problems for our system and enables more users to work with our system for solving their theoretical and real-world problems. We also extended our grounder so that it is able to convert the new constructs to CNF, and further showed that our grounder can reproduce from the high-level language the same CNF files as one obtains through direct reductions.


Acknowledgements

The authors are grateful to the Natural Sciences and Engineering Research Council of Canada (NSERC), MITACS and D-Wave Systems for their financial support. In addition, the anonymous reviewers' comments helped us in improving this paper and clarifying the presentation.

References

1. Frisch, A.M., Grum, M., Jefferson, C., Hernandez, B.M., Miguel, I.: The design of ESSENCE: a constraint language for specifying combinatorial problems. In: Proc. IJCAI 2007 (2007)

2. Mitchell, D., Ternovska, E.: A framework for representing and solving NP search problems. In: Proc. AAAI 2005 (2005)

3. Gent, I., Jefferson, C., Miguel, I.: Minion: A fast, scalable, constraint solver. In: ECAI 2006: 17th European Conference on Artificial Intelligence, Proceedings Including Prestigious Applications of Intelligent Systems (PAIS 2006), Riva del Garda, Italy, August 29–September 1, vol. 98, p. 98. IOS Press, Amsterdam (2006)

4. Patterson, M., Liu, Y., Ternovska, E., Gupta, A.: Grounding for model expansion in k-guarded formulas with inductive definitions. In: Proc. IJCAI 2007, pp. 161–166 (2007)

5. Mohebali, R.: A method for solving NP search based on model expansion and grounding. Master's thesis, Simon Fraser University (2006)

6. Tseitin, G.: On the complexity of derivation in propositional calculus. Studies in Constructive Mathematics and Mathematical Logic, Part 2, pp. 115–125 (1968)

7. Lynce, I., Marques-Silva, J.: Efficient haplotype inference with Boolean satisfiability. In: AAAI. AAAI Press, Menlo Park (2006)

8. Rendl, A.: Effective compilation of constraint models (2010)

9. Nethercote, N., Stuckey, P., Becket, R., Brand, S., Duck, G., Tack, G.: MiniZinc: Towards a standard CP modelling language. In: Bessiere, C. (ed.) CP 2007. LNCS, vol. 4741, pp. 529–543. Springer, Heidelberg (2007)

10. Wittocx, J., Marien, M., De Pooter, S.: The IDP system (2008), obtainable via www.cs.kuleuven.be/dtai/krr/software.html

11. Wittocx, J.: Finite domain and symbolic inference methods for extensions of first-order logic. AI Communications (2010) (accepted)

12. Asín, R., Nieuwenhuis, R., Oliveras, A., Rodríguez-Carbonell, E.: Cardinality Networks and Their Applications. In: Kullmann, O. (ed.) SAT 2009. LNCS, vol. 5584, pp. 167–180. Springer, Heidelberg (2009)

13. Eén, N.: SAT Based Model Checking. PhD thesis (2005)

A SUM Gadget

We could have an implementation which constructs answers to complex terms by taking literally the conditions described in Proposition 3. However, we would expect such an implementation to result in a system with poor performance. In the grounding algorithm, the function which generates ψ for a tuple (γ, ψ) may produce any formula logically equivalent to φ. We may think of such functions as "gadgets", in the sense this term is used in reductions that prove NP-completeness. The choice of gadget is important for some constraints; for example, choosing CNF representations of aggregates is an active area of study in the SAT community (e.g., see [12]). Our method allows these choices to be made at run time, either by the user or automatically.

As described in previous sections, to compute an answer to an aggregate one needs to find a set αR ⊂ N and a function βR which maps every integer to a ground formula.


In Section 4, we showed what the set αR is for each term and also described the properties of the output of the βR function. Here, we present one method for constructing a SUM gadget which can be used during the CNF generation phase.

A gadget for the Sum aggregate, denoted by S(R1, S, n), takes an answer R1 to a term, an answer S to a formula, and an integer n, and returns a CNF formula f.

Let us assume the original Sum aggregate is t(y) = Sum_x{t1(x, y) : φ(x, y)}, where R1 is an answer to t1 and S is an answer to φ. Let Ti = βR1(i) ⋈ S for all i ∈ αR1. Then ψ = δTi(γ) is the necessary and sufficient condition for t1(γ) to be equal to i.

Remember that the SUM gadget is called during CNF generation and returns a Tseitin variable which is true iff t(...) = n. The gadget generates/retrieves a Tseitin variable vψ for ψ(γ) if ψ = δTi(γ) ≠ ⊥ and stores the pair (i, vψ). After fetching all these pairs, (n1, v1), ..., (nk, vk), the SUM gadget starts generating a CNF for t(γ) = n. In fact, our gadget can be any valid encoding of Pseudo-Boolean constraints, such as a Binary Decision Diagram (BDD) based encoding, a sorting-network based encoding, etc. [13]. Here, we describe the BDD-based encoding:

Let T = {(n1, f1), ..., (nk, fk)}. Define the output of the gadget to be F_k^n, where the F_r^s are inductively constructed according to the following definitions:

F_{r+1}^s = (F_r^s ∧ ¬f_{r+1}) ∨ (F_r^{s − n_{r+1}} ∧ f_{r+1})

F_0^s = ⊤ if s = 0, and F_0^s = ⊥ if s ≠ 0.
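A direct reading of this recursion, with memoization on (r, s) and formulas built as nested tuples, is sketched below; this is our own illustrative rendering (a real gadget would emit Tseitin variables and share sub-results the way a BDD does), not the Enfragmo code.

from functools import lru_cache

def sum_gadget(pairs, n):
    """
    Sketch of the BDD-style encoding: pairs = [(n1, f1), ..., (nk, fk)], where
    each fi is a formula (treated here as an opaque object) that is true iff the
    weight ni is contributed.  Returns a formula true iff the sum is exactly n.
    """
    TRUE, FALSE = ('true',), ('false',)

    @lru_cache(maxsize=None)
    def F(r, s):
        if r == 0:
            return TRUE if s == 0 else FALSE
        ni, fi = pairs[r - 1]
        without = ('and', F(r - 1, s), ('not', fi))   # f_r is false: weight not added
        with_it = ('and', F(r - 1, s - ni), fi)       # f_r is true: weight n_r added
        return ('or', without, with_it)

    return F(len(pairs), n)

The memoization mirrors the sharing of the F_r^s nodes in the recursion; feeding the resulting formula to a Tseitin-style conversion yields the CNF the gadget is meant to produce.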


Moving Object Modelling Approach

for Lowering Uncertainty in Location Tracking Systems

Wegdan Abdelsalam¹, David Chiu¹, Siu-Cheung Chau², Yasser Ebrahim², and Maher Ahmed²

¹ School of Computer Science, University of Guelph
² Physics & Computer Science Department, Wilfrid Laurier University

Abstract. This paper introduces the concept of Moving Object (MO) modelling as a means of managing the uncertainty in the location tracking of human moving objects travelling on a network. For previous movements of the MOs, the uncertainty stems from the discrete nature of location tracking systems, where gaps are created among the location reports. Future locations of MOs are, by definition, uncertain. The objective is to maximize the estimation accuracy while minimizing the operating costs.

Keywords: Moving object modelling, Managing uncertainty, Location tracking systems.

1 Introduction

Location Tracking Systems (LTSs) are built to answer queries about the whereabouts of moving objects in the past, present, or future. To answer such queries, each moving object monitored by the system must report its location periodically using a sensing device such as the Global Positioning System (GPS). The location reports are then saved to a database, where they are indexed to facilitate answering user queries.

In spite of the continuous nature of the MO's movement data, location data can only be acquired at discrete times. This leaves the location of the MO unknown for the periods of time between the location reports. It is economically infeasible to capture and store a continuous stream of the location data for each MO. Recording location reports discretely introduces uncertainty about the location of MOs between reports.

Lowering uncertainty has been addressed by a number of researchers over the past few years [1–4]. These approaches try to find a link between the amount of uncertainty and the frequency of location reporting. By increasing the reporting frequency, the uncertainty can be kept within acceptable bounds. We believe that there is a need for a new approach for lowering the uncertainty without increasing the reporting frequency. This new approach must be integrated with the system database in a way that facilitates the efficient execution of user queries.


We propose an MO modelling approach for lowering the uncertainty about MO locations in an LTS. MO modelling includes collecting information about the environment (i.e., the context) in which MOs operate. The MO model is used to reach a more accurate estimate of the MO location. This is done by estimating the location calculation parameters (e.g., speed and route) from the MO's historical data, collected for these parameters under similar circumstances.

2 Moving Object Modelling

The goal is to equip location-tracking systems with an MO model for each of the MOs being tracked. The model encompasses the MO's characteristics, preferences, and habits. The location-tracking application utilizes the information about the MO to more accurately estimate his/her position.

Because the location of an MO is determined primarily by the chosen route and the travelling speed, the MO is modelled with respect to these two variables.

2.1 MO Speed Model

Our proposed approach adopts a Bayesian Network (BN) to build the MO speed model. Figure 1 depicts a BN for the suggested MO speed model. As shown in the figure, the BN structure is a singly connected DAG, most often referred to as a polytree [8].

The child node Speed is influenced by three parent nodes: the driving condition (DC), the level of service (LoS), and the road speed limit (SL). In turn, the driving condition is affected by two parent nodes: weather condition (WC) and road type (RT). The LoS is affected by three parent nodes: day-of-week (DW), time-of-day (TD), and area of the city (A).

To build the model, the first step is to determine the possible states for each variable (i.e., node) in the Bayesian Network. It is possible to either intuitively choose the values or elicit them from the domain expert. For example, the speed variable can have the finite discrete values 0, 1, 2, 3, ..., 199, 200 km/hr, representing all the possible MO speeds (assuming no decimal values). For the road type, a Geographic Information System (GIS) is consulted for the possible road types in the city.

Fig. 1. Example BN for the MO speed model (nodes: Weather Condition, Road Type, Driving Condition, Day of Week, Time of Day, Area, Level of Service, Speed Limit, Speed)

The next step is to initialize the CPTs with the probability of each state of a node, given the possible states of its parents. In a polytree, the size of the CPT of each variable is determined by the possible states of its parent(s). Each entry (i.e., probability value) in the CPT corresponds to a combination of the parent nodes' states, combined with one state of the node itself.

The state of the Speed variable is inferred according to the evidence on the root variables. The evidence is propagated (using Pearl's belief propagation algorithm) down the network. The resulting probability table is then queried for the most probable speed (i.e., the one with the highest probability).
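With all root variables observed, querying this particular polytree reduces to summing out the two intermediate nodes (DC and LoS). The sketch below is our own illustration with hypothetical CPT dictionaries; in the described system the CPTs would be learned from the MO's history and the propagation done by belief propagation.

def most_probable_speed(evidence, cpt_dc, cpt_los, cpt_speed,
                        dc_states, los_states, speed_states):
    """
    evidence:  dict with observed values for 'WC', 'RT', 'DW', 'TD', 'A', 'SL'.
    cpt_dc:    P(DC | WC, RT)         as {(wc, rt):       {dc: prob}}
    cpt_los:   P(LoS | DW, TD, A)     as {(dw, td, a):    {los: prob}}
    cpt_speed: P(Speed | DC, LoS, SL) as {(dc, los, sl):  {speed: prob}}
    Returns the speed value with the highest posterior probability.
    """
    p_dc = cpt_dc[(evidence['WC'], evidence['RT'])]
    p_los = cpt_los[(evidence['DW'], evidence['TD'], evidence['A'])]
    posterior = {}
    for s in speed_states:
        total = 0.0
        for dc in dc_states:            # sum out the two unobserved intermediate nodes
            for los in los_states:
                total += p_dc[dc] * p_los[los] * cpt_speed[(dc, los, evidence['SL'])][s]
        posterior[s] = total
    return max(posterior, key=posterior.get)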

2.2 MO Trip Route Model

In principle, a trip route is determined by the trip source and destination. Different MOs can take different routes, based on their preferences. For example, some MOs prefer to take the shortest route, while others may prefer the fastest route.

For each trip (i.e., source/destination pair) of each MO, a directed graph is built to represent the route, such that a node represents a road segment and an edge represents a connection between two road segments. Each edge is given a weight representing the probability that the MO makes the transition from the parent node to the child node. The edge weight is based on the frequency with which the transition is made, relative to the total number of transitions from the parent node. Each edge is associated with a counter that is incremented each time the transition is made.

The graph is built based on the received location reports. If the reported road segment is already in the graph, the transition frequency counter is incremented. If not, a new node is added to the graph, and its frequency counter is initialized to 1. The graph nodes represent all the road segments ever visited on any instance of this trip. The most probable route is the shortest maximal-weight path from the source node to the destination node. From the most probable route and the most probable speed on each road segment along this route, the system creates the estimated MO trajectory for the trip. Figure 2 shows the model for the trip from source 1 to destination 13.
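A sketch of this bookkeeping, with our own class and method names, is shown below: transition counters on edges are normalized to probabilities, and the most probable route is found here simply as the maximum-probability simple path (a plain depth-first search, standing in for the shortest maximal-weight path described above).

from collections import defaultdict
from math import log

class TripRouteModel:
    """Directed graph of road segments with transition frequency counters."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe_transition(self, seg_from, seg_to):
        self.counts[seg_from][seg_to] += 1

    def transition_prob(self, seg_from, seg_to):
        total = sum(self.counts[seg_from].values())
        return self.counts[seg_from][seg_to] / total if total else 0.0

    def most_probable_route(self, source, dest, route=None, seen=None):
        """Return (route, log-probability) of the maximum-probability simple path."""
        route = (route or []) + [source]
        seen = (seen or set()) | {source}
        if source == dest:
            return route, 0.0
        best, best_score = None, float('-inf')
        for nxt in self.counts[source]:
            if nxt in seen:
                continue
            sub, score = self.most_probable_route(nxt, dest, route, seen)
            if sub is not None:
                score += log(self.transition_prob(source, nxt))
                if score > best_score:
                    best, best_score = sub, score
        return best, best_score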

2.3 Estimated Trajectory Updating

Sometimes the estimated trajectory of the MO needs to be updated, based on the actual location reports received. A certain degree of deviation may be detected between the estimated trajectory (based on the MO model) and the actual location reports. This deviation can occur either because the MO is following the estimated route but at a different speed than expected (called a schedule deviation), or because the MO takes a different route from the estimated one (called a route deviation). With either type of deviation, if the estimated trajectory continues to be used for answering future queries, it might produce incorrect results.

When an MO is detected to be off schedule, the remainder of the estimated trajectory can be adjusted in one of two ways. If the MO is behind schedule, the remaining trajectory is shifted forward one reporting interval to reflect that the trip may take longer than expected. If the MO is ahead of schedule, the remaining trajectory is shifted backward one or more reporting interval(s) to reflect that the trip may finish sooner than expected.

When a route deviation is detected, the trip route model is checked to see if there are alternate routes that have been taken by the MO in the past. By comparing the road segments travelled so far (as suggested by the actual location reports received) to the road segment sequences in the trip model, it may be possible to find a match that suggests the route the MO is actually taking.

Fig. 2. Trip model

3 Experimental Results

To experimentally validate the effectiveness of the MO modelling approach for location estimation, the query results of three different techniques for estimating the MO's speed are compared. The three speed estimation methods examined are the last-reported-speed, the average-reported-speed, and the MO-model-based most-probable-speed. The estimated speed is applied in the following formula to estimate the location of the MO (assuming the MO is moving in a straight line):

Location = last reported location + (estimated speed × time elapsed since last report).
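The three estimators differ only in how the estimated speed is produced; a compact illustration follows, with assumed field names and a scalar position along the route for simplicity (not the paper's implementation).

def estimate_location(last_report, estimated_speed, now):
    """Dead-reckon from the last report; position here is a scalar distance along the route."""
    return last_report['position'] + estimated_speed * (now - last_report['time'])

def last_reported_speed(reports):
    return reports[-1]['speed']

def average_reported_speed(reports, k=5):
    window = reports[-k:]
    return sum(r['speed'] for r in window) / len(window)

# The model-based estimator would instead call most_probable_speed(...) from the MO speed model.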

Three route estimation methods are selected. The straight-line method assumes that the MO continues to move along the line defined by the last two location reports. The trip-route-model method estimates the trip route at the beginning of the trip, and the trajectory is created according to the estimated speeds along the route. The route-model-with-shifting method additionally employs off-schedule trajectory updating, as described in Section 2.3. Each MO is randomly assigned a preference for either taking the shortest or the fastest route.

Reporting intervals between 0.25 and 3 minutes, in 0.25-minute steps, are tested. Each experiment is performed five times and the average deviation per location report (in metres) is obtained.

[Figure 3: two panels, (a) and (b), plotting average deviation in metres against reporting interval in minutes for the Model, Straight Line, and Model with Shift methods.]

Fig. 3. (a) Average error at different reporting intervals using straight-line-with-last-reported-speed, model speed, and model speed with shifting, with a 0% probability of a route deviation. (b) Average error at different reporting intervals using straight-line-with-last-reported-speed, model speed, and model speed with shifting, with a 10% probability of a 10% route deviation.

From Figure 3, a number of observations can be made. The two model-based location estimation methods are considerably better than the straight-line-with-last-reported-speed method for reporting intervals of more than 30 seconds. The straight-line-with-last-reported-speed method's deviation grows linearly as the reporting interval grows. On the other hand, the deviation of both model-based methods actually improves as the reporting interval grows. This is due to the fact that longer reporting intervals allow deviations between the reported and estimated speeds (i.e., the estimated speed being above/below the reported speed) to cancel each other out. The model-with-shifting method tends to perform better than the model alone for shorter reporting intervals. The two converge at a reporting interval of about 1.5 to 1.75 minutes. This reveals that the proposed shifting approach does improve accuracy, as the estimated trajectories are adjusted to reflect received location reports. The effect diminishes as the reporting interval grows, which means that fewer such shifts are performed.

4 Conclusion

The use of user movement modelling in location-tracking applications is presented, and the idea of MO modelling is introduced for reducing the uncertainty about MO locations. The building of MO speed models, employing Bayesian Networks, is explained, with a discussion of some variables affecting the design of a typical MO model-based system.

A trip route modelling approach is developed to capture the most commonly taken route between two locations. The estimated trajectory is adjusted to reflect the actual locations that are reported.

Experimental evidence is produced to confirm that both speed and trip route modelling do help in increasing the accuracy of the location estimation compared with the traditional approach (last reported speed and straight-line direction estimation). The same is also shown to be true when the speed is estimated by averaging the last k reported speeds (rather than considering only the last reported speed).

References

1. Wolfson, O., Sistla, A.P., Xu, B., Zhou, J., Chamberlain, S., Yesha, Y., Rishe, N.: Tracking moving objects using database technology in DOMINO. In: Tsur, S. (ed.) NGITS 1999. LNCS, vol. 1649, pp. 112–119. Springer, Heidelberg (1999)

2. Ding, H., Trajcevski, G., Scheuermann, P.: Efficient maintenance of continuous queries for trajectories. Geoinformatica 12(3), 255–288 (2008)

3. Moreira, J., Ribeiro, C., Abdessalem, T.: Query operations for moving objects database systems. In: Proceedings of the 8th International Symposium on Advances in Geographic Information Systems (ACMGIS 2000), Washington, D.C., USA, November 6-11, pp. 108–114. ACM Press, New York (2000)

4. Sistla, A.P., Wolfson, O., Chamberlain, S., Dao, S.: Querying the uncertain position of moving objects. In: Etzion, O., Jajodia, S., Sripada, S.M. (eds.) Proceedings of the Dagstuhl Seminar on Temporal Databases: Research and Practice, Dagstuhl Castle, Germany, June 23-27, pp. 310–337. Springer, Heidelberg (1997)

5. Patterson, D.J., Fox, D., Kautz, H., Fishkin, K., Perkowitz, M., Philipose, M.: Contextual computer support for human activity. In: Spring Symposium on Interaction Between Humans and Autonomous Systems over Extended Operation (AAAI 2004), Stanford, CA, USA (2004), http://www.aaai.org/Library/Symposia/Spring/2004/ss04-03-013.php

6. Darwiche, A.: Modeling and Reasoning with Bayesian Networks, 1st edn. Cambridge University Press, New York (2009)

7. Cooper, G.F.: Probabilistic inference using belief networks is NP-hard. Technical report SMI-87-0195, Knowledge Systems Laboratory, Stanford University, Stanford, CA, USA (1987)

8. Pearl, J.: Fusion, propagation, and structuring in belief networks. Artificial Intelligence 29(3), 241–288 (1986)

9. Heckerman, D.: A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Microsoft Corp., Seattle, Washington (1995)

10. Neapolitan, R.E.: Learning Bayesian Networks. Prentice Hall, Englewood Cliffs (2003)


Unsupervised Relation Extraction Using Dependency Trees for Automatic Generation of Multiple-Choice Questions

Naveed Afzal1, Ruslan Mitkov1, and Atefeh Farzindar2

1 Research Institute for Information and Language Processing (RIILP), University of Wolverhampton, Wolverhampton, UK
{N.Afzal,R.Mitkov}@wlv.ac.uk
2 NLP Technologies Inc., 1255 University Street, Suite 1212, Montreal (QC), Canada, H3B 3W9
[email protected]

Abstract. In this paper, we investigate an unsupervised approach to Relation Extraction to be applied in the context of automatic generation of multiple-choice questions (MCQs). MCQs are a popular large-scale assessment tool making it much easier for test-takers to take tests and for examiners to interpret their results. Our approach to the problem aims to identify the most important semantic relations in a document without assigning explicit labels to them in order to ensure broad coverage, unrestricted to predefined types of relations. In this paper, we present an approach to learn semantic relations between named entities by employing a dependency tree model. Our findings indicate that the presented approach is capable of achieving high precision rates, which are much more important than recall in automatic generation of MCQs, and its enhancement with linguistic knowledge helps to produce significantly better patterns. The intended application for the method is an e-Learning system for automatic assessment of students’ comprehension of training texts; however it can also be applied to other NLP scenarios, where it is necessary to recognise the most important semantic relations without any prior knowledge as to their types.

Keywords: E-Learning, Information Extraction, Relation Extraction, Biomedical domain, Dependency Tree, MCQ generation.

1 Introduction

Multiple-choice questions (MCQs), also known as multiple-choice tests, are a form of objective assessment in which a user selects one answer from a set of alternative choices for a given question. MCQs are straightforward to conduct and instantaneously provide an effective measure of test-takers' performance and feedback of test results to the learner. In many disciplines instructors use MCQs as a preferred assessment tool, and it is estimated that 45%–67% of student assessments utilise MCQs [2]. The fast development of e-Learning technologies has in turn stimulated methods for automatic generation of MCQs, and today they have become an actively developing topic in application-oriented NLP research. The work done in the area of automatic generation of MCQs does not have a long history [e.g., 18, 19, 28, 3 and 10]. Most of the aforementioned approaches rely on the syntactic structure of a sentence.

We present a new approach to MCQ generation: in order to automatically generate MCQs, we first identify important concepts and the relationships between them in the input texts. To achieve this, we study unsupervised Information Extraction methods with the purpose of discovering the most significant concepts and relations in the domain texts, without any prior knowledge of their types or their exemplar instances (seeds). Information Extraction (IE) is an important problem in many information access applications. The goal is to identify instances of specific semantic relations between named entities of interest in the text. Named Entities (NE's) are generally noun phrases in the unstructured text, e.g., names of persons, posts, locations and organisations, while relationships between two or more NE's are described in a pre-defined way, e.g., "interact with" is a relationship between two biological objects (proteins).

Dependency trees are regarded as a suitable basis for semantic pattern acquisition as they abstract away from the surface structure to represent relations between elements (entities) of a sentence. Semantic patterns represent semantic relations between elements of sentences. In a dependency tree, a pattern is defined as a path passing through zero or more intermediate nodes within the tree [27]. An insight into the usefulness of dependency patterns was provided by [26], who revealed that dependency parsers have the advantage of generating analyses which abstract away from the surface realisation of text to a greater extent than phrase structure grammars tend to, resulting in semantic information being more accessible in the representation of the text, which can be useful for IE.

The main advantage of our approach is that it can cover a potentially unrestricted range of semantic relations, while most supervised and semi-supervised approaches can learn to extract only those relations that have been exemplified in annotated text or seed patterns. Our assumption for Relation Extraction (RE) is that relations hold between NE's stated in the same sentence and that the presence or absence of a relation is independent of the text prior to or succeeding the sentence. Moreover, our approach is suitable in situations where a lot of unannotated text is available, as it does not require manually annotated text or seeds. These properties of the method can be useful, specifically, in such applications as MCQ generation [18, 19] or a pre-emptive approach in which viable IE patterns are created in advance without human intervention [23, 24].

2 Related Work

There is a large body of research dedicated to the problem of extracting relations from texts of various domains. Most previous work focused on supervised methods and tried to both extract relations and assign labels describing their semantic types. As a rule, these approaches required a manually annotated corpus, which is very laborious and time-consuming to produce.

Semi-supervised and unsupervised approaches relied on seed patterns and/or examples of specific types of relations [1, 25]. An unsupervised approach based on


clustering of candidate patterns for the discovery of the most important relation types among NE’s from a newspaper domain was presented by [9]. In the biomedical domain, most approaches were supervised and relied on regular expressions to learn patterns [5], while semi-supervised approaches exploited pre-defined seed patterns and cue words [11, 17].

Several approaches in IE have relied on dependency trees in order to extract patterns for the automatic acquisition of IE systems [27, 25 and 7]. Apart from IE, [15] used dependency trees in order to infer rules for question answering, while [29] made use of dependency trees for paraphrase identification. Moreover, dependency parsers have recently been used in systems which identify protein interactions in biomedical texts [13, 6].

In dependency parsing, the main objective is to describe the syntactic analysis of a sentence using dependency links which show the head-modifier relations between words. All the IE approaches that relied on dependency trees have used different pattern models based on particular parts of the dependency analysis. The motive behind all of these models is to extract the necessary information from text without being overly complex. All of the pattern models make use of semantic patterns based on dependency trees for the identification of items of interest in text. These models vary in terms of their complexity, expressivity and performance in an extraction scenario.

3 Our Approach

Our approach is based on the Linked Chain Pattern Model presented by [7]. The Linked Chain Pattern Model combines pairs of chains in a dependency tree which share a common verb root but no direct descendants.

In our approach, we treat every NE as a chain in the dependency tree if it is fewer than 5 dependencies away from the verb root and the words linking the NE to the verb root are content words (verbs, nouns, adverbs and adjectives) or prepositions. We consider only those chains in the dependency tree of a sentence which contain NE's, which is much more efficient than the subtree model of [27], where all subtrees containing verbs are taken into account. This allows us to extract more meaningful patterns from the dependency tree of a sentence. We extract all NE chains which follow the aforementioned rule from a sentence and combine them together. Figure 1 shows the whole system architecture.
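To make the chain-selection rule concrete, the following minimal Python sketch walks from each NE node up to the verb root, keeps only chains satisfying the two conditions above, and pairs chains sharing the same verb root in the spirit of the linked chain model. The node representation, field names and part-of-speech labels are assumptions made for illustration; this is not the authors' implementation.

from itertools import combinations

# Hypothetical node representation: node id -> (head_id, lemma, pos, ne_class)
# ne_class is a semantic class such as "PROTEIN", or None for ordinary words.
CONTENT_POS = {"V", "N", "ADV", "A", "PREP"}   # content words plus prepositions

def chain_to_root(tree, node, max_deps=5):
    """Follow head links from an NE node towards the verb root; return the chain
    of node ids, or None if it is too long or uses a disallowed linking word."""
    chain = [node]
    cur = node
    while tree[cur][0] is not None:                 # climb until the root
        cur = tree[cur][0]
        _, _, pos, ne_class = tree[cur]
        if ne_class is None and pos not in CONTENT_POS:
            return None                             # linking word not allowed
        chain.append(cur)
        if len(chain) - 1 >= max_deps:              # must stay fewer than 5 dependencies away
            return None
    return chain if tree[cur][2] == "V" else None   # chain must end at a verb root

def linked_ne_chains(tree):
    """Combine pairs of NE chains that share the same verb root into one pattern."""
    ne_nodes = [n for n, (_, _, _, ne) in tree.items() if ne is not None]
    chains = [c for c in (chain_to_root(tree, n) for n in ne_nodes) if c]
    return [sorted(set(c1) | set(c2))
            for c1, c2 in combinations(chains, 2) if c1[-1] == c2[-1]]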

Following the system architecture, Section 4 elaborates the NER process; Section 5 explains the process of candidate pattern extraction, for which we use the GENIA corpus; Section 6 describes various information-theoretic measures and statistical tests for ranking patterns according to their association with the domain corpus; Section 7 discusses the evaluation procedures (rank-thresholding and score-thresholding), for which the GENIA EVENT Annotation corpus is used; and Section 8 presents the experimental results obtained with the various pattern ranking methods.


Fig. 1. System Architecture (unannotated corpus → Named Entity Recognition → Extraction of Candidate Patterns → Patterns Ranking → Evaluation → semantic relations)

4 Named Entity Recognition (NER)

NER is an integral part of any IE system as it identifies the NE's present in a text. Many NER tools have now been developed for various domains, as a great deal of NER research spans various languages, domains and textual genres. In our work we used biomedical data, as biomedical NER is generally considered to be more difficult than in other domains such as newswire text. There are huge numbers of NE's in the biomedical domain and new ones are constantly added [32], which means that neither dictionaries nor training-data approaches will be sufficiently comprehensive for the NER task. The volume of published biomedical research has been expanding at a rapid rate. Due to the syntactic and semantic complexity of the biomedical domain, many IE systems have utilised tools (e.g., part-of-speech taggers, NER, parsers) specifically designed and developed for the biomedical domain [21]. Moreover, [8] presented a report investigating the suitability of current NLP resources for syntactic and semantic analysis in the biomedical domain.

The GENIA NER1 [31, 32] is a tool specifically designed for biomedical texts; the NE tagger is designed to recognise mainly the following NE's: protein, DNA, RNA, cell_type and cell_line. Table 1 shows the performance of the GENIA NER.

Table 1. GENIA NER Performance

Entity Type   Precision   Recall   F-score
Protein       65.82       81.41    72.79
DNA           65.64       66.76    66.20
RNA           60.45       68.64    64.29
Cell Type     56.12       59.60    57.81
Cell Line     78.51       70.54    74.31
Overall       67.45       75.78    71.37

1 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/



5 Extraction of Candidate Patterns

Our general approach to learning dependency tree-based patterns consists of two main stages: (i) the construction of potential patterns from an unannotated domain corpus and (ii) their relevance ranking.

After NER, the next step is the construction of candidate patterns. We explain the whole process of candidate pattern extraction from dependency trees with the help of the example shown below:

Fibrinogen activates NF-kappaB transcription factors in mononuclear phagocytes.

After NER, the aforementioned sentence is transformed into the following:

<protein> Fibrinogen </protein> activates <protein> NF-kappaB </protein> <protein> transcription factors </protein> in <cell_type> mononuclear phagocytes </cell_type>.

Once the NE's are recognised in the domain corpus by the GENIA tagger, we replace all the NE's with their semantic class respectively, so the aforementioned sentence is transformed into the following sentence:

PROTEIN activates PROTEIN PROTEIN in CELL.

The transformed sentences are then parsed using the Machinese Syntax2 parser [30]. The Machinese Syntax parser uses a functional dependency grammar. The analyses produced by the parser are encoded to make the most of the information they contain and to ensure consistent structures from which patterns can be extracted. Figure 2 shows the dependency tree for the aforementioned adapted sentence:

Fig. 2. Example of a dependency tree

After the encoding process, the patterns are extracted from the dependency trees using the methodology described in Section 3. From Figure 2, the following patterns are extracted:

<NE ID="0" func="SUBJ" Dep="1"> "PROTEIN" </NE> <W ID="1" func="+FMAINV" Dep="none">"activate"</W> <NE ID="2" func="A" Dep="3"> "PROTEIN" </NE> <NE ID="3" func="OBJ" Dep="1"> "PROTEIN" </NE>

<W ID="0" func="+FMAINV" Dep="none">"activate"</W> <NE ID="1" func="A" Dep="2"> "PROTEIN" </NE> <NE ID="2" func="OBJ" Dep="0"> "PROTEIN" </NE>

2 http://www.connexor.com/software/syntax/


<W ID="0" func="+FMAINV" Dep="none">"activate"</W> <NE ID="1" func="OBJ" Dep="0"> "PROTEIN" </NE> <W ID="2" func="PREP" Dep="0">"in"</W> <NE ID="3" func="P" Dep="2"> "CELL_TYPE" </NE>

Here the <NE> tag represents a named entity (semantic class) while the <W> tag represents a lexical word; ID represents the word id, func the function of the word, and Dep the id of the word on which this word depends in the dependency tree. The extracted patterns along with their frequencies are then stored in a database. We filtered out the patterns containing only stop-words using a stop-words corpus. Table 2 shows examples of dependency-based patterns along with their frequencies.

Table 2. Example of dependency-based patterns along with frequencies

Patterns                                                                                                              Frequency

<NE ID="0" func="SUBJ" Dep="1"> "DNA" </NE> <W ID="1" func="+FMAINV" Dep="none">"contain"</W> <NE ID="2" func="OBJ" Dep="1"> "DNA" </NE>               34

<NE ID="0" func="SUBJ" Dep="1"> "PROTEIN" </NE> <W ID="1" func="+FMAINV" Dep="none">"activate"</W> <NE ID="2" func="OBJ" Dep="1"> "PROTEIN" </NE>      32

<NE ID="0" func="SUBJ" Dep="1"> "PROTEIN" </NE> <W ID="1" func="+FMAINV" Dep="none">"contain"</W> <NE ID="2" func="OBJ" Dep="1"> "PROTEIN" </NE>       19

<NE ID="0" func="SUBJ" Dep="2"> "PROTEIN" </NE> <NE ID="1" func="APP" Dep="0">"PROTEIN" </NE> <W ID="2" func="+FMAINV" Dep="none">"induce"</W>         19

6 Pattern Ranking

After candidate patterns have been constructed, the next step is to rank the patterns based on their significance in the domain corpus. The ranking methods we use require a general corpus that serves as a source of examples of pattern use in domain-independent texts. To extract candidates from the general corpus, we treated every noun as a potential NE holder and the candidate construction procedure described above was applied to find potential patterns in the general corpus. In order to score candidate patterns for domain-relevance, we measure the strength of association of a pattern with the domain corpus as opposed to the general corpus. The patterns are scored using the following methods for measuring the association between a pattern and the domain corpus: Information Gain (IG), Information Gain Ratio (IGR), Mutual Information (MI), Normalised Mutual Information (NMI)3, Log-likelihood (LL) and Chi-Square (CHI). These association measures were included in the study as they have different theoretical principles behind them: IG, IGR, MI and NMI are information-theoretic concepts while LL and CHI are statistical tests of association.

3 Mutual Information has a well-known problem of being biased towards infrequent events. To tackle this problem, we normalised the MI score by a discounting factor, following the formula proposed in Lin and Pantel (2001).

Information Gain measures the amount of information obtained about domain specialisation of corpus c, given that pattern p is found in it.

IG(p, c) = \sum_{d \in \{c, c'\}} \sum_{g \in \{p, p'\}} P(g, d) \log \frac{P(g, d)}{P(g)\,P(d)}

where p is a candidate pattern, c – the domain corpus, p' – a pattern other than p, c' – the general corpus, P(c) – the probability of c in “overall” corpus {c,c'}, and P(p) – the probability of p in the overall corpus.

Information Gain Ratio aims to overcome one disadvantage of IG consisting of the fact that IG grows not only with the increase of dependence between p and c, but also with the increase of the entropy of p. IGR removes this factor by normalizing IG by the entropy of the patterns in the corpora:

IGR(p, c) = \frac{IG(p, c)}{-\sum_{g \in \{p, p'\}} P(g) \log P(g)}

Pointwise Mutual Information between corpus c and pattern p measures how much information the presence of p contains about c, and vice versa:

MI(p, c) = \log \frac{P(p, c)}{P(p)\,P(c)}
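As a concrete illustration, the information-theoretic scores above can be computed from simple pattern counts. The sketch below estimates the probabilities by maximum likelihood over the 2x2 contingency table {p, p'} x {c, c'}; the counting scheme and variable names are assumptions made for illustration, and the Lin-and-Pantel discounting used for NMI is omitted.

import math

def info_theoretic_scores(f_dom, f_gen, n_dom, n_gen):
    """IG, IGR and pointwise MI of a pattern with the domain corpus.
    f_dom/f_gen: frequency of the pattern in the domain/general corpus;
    n_dom/n_gen: total number of pattern occurrences in each corpus."""
    n = n_dom + n_gen
    joint = {                                   # P(g, d) for g in {p, p'}, d in {c, c'}
        ("p", "c"): f_dom / n,
        ("p", "c'"): f_gen / n,
        ("p'", "c"): (n_dom - f_dom) / n,
        ("p'", "c'"): (n_gen - f_gen) / n,
    }
    p_g = {"p": (f_dom + f_gen) / n, "p'": (n - f_dom - f_gen) / n}
    p_d = {"c": n_dom / n, "c'": n_gen / n}

    ig = sum(pgd * math.log(pgd / (p_g[g] * p_d[d]))
             for (g, d), pgd in joint.items() if pgd > 0)
    entropy = -sum(pg * math.log(pg) for pg in p_g.values() if pg > 0)
    igr = ig / entropy if entropy > 0 else 0.0
    mi = math.log(joint[("p", "c")] / (p_g["p"] * p_d["c"])) if f_dom > 0 else float("-inf")
    return {"IG": ig, "IGR": igr, "MI": mi}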

Chi-Square and Log-likelihood are statistical tests which work with frequencies and rank-order scales, both calculated from a contingency table with observed and expected frequency of occurrence of a pattern in the domain corpus. Chi-Square is calculated as follows:

\chi^2(p, c) = \sum_{d \in \{c, c'\}} \frac{(O_d - E_d)^2}{E_d}

where O is the observed frequency of p in domain and general corpus respectively and E is the expected frequency of p in two corpora.

Log-likelihood is calculated according to the following formula:

LL(p, c) = 2 \left( O_1 \log \frac{O_1}{E_1} + O_2 \log \frac{O_2}{E_2} \right)

where O1 and O2 are observed frequencies of p in the domain and general corpus respectively, while E1 and E2 are its expected frequency values in the two corpora.
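The two statistical tests can likewise be computed from observed and expected frequencies. In this sketch the expected frequencies are obtained by distributing the pattern's total count in proportion to corpus size, which is one common way to fill the contingency table; this is an assumption, not necessarily the authors' exact setup.

import math

def chi_square_and_ll(o_dom, o_gen, n_dom, n_gen):
    """Chi-square and log-likelihood association of a pattern with the domain corpus.
    o_dom/o_gen: observed pattern frequencies; n_dom/n_gen: corpus sizes."""
    total = o_dom + o_gen
    e_dom = total * n_dom / (n_dom + n_gen)     # expected frequency in the domain corpus
    e_gen = total * n_gen / (n_dom + n_gen)     # expected frequency in the general corpus
    chi2 = (o_dom - e_dom) ** 2 / e_dom + (o_gen - e_gen) ** 2 / e_gen
    ll = 2 * ((o_dom * math.log(o_dom / e_dom) if o_dom > 0 else 0.0)
              + (o_gen * math.log(o_gen / e_gen) if o_gen > 0 else 0.0))
    return chi2, ll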

In addition to these six measures, we introduce a meta-ranking method that combines the scores produced by several individual association measures, in order to leverage agreement between different association measures and downplay idiosyncrasies of individual ones. Because the association functions range over different values (for example, IGR ranges between 0 and 1, and MI between +∞ and -∞), we first normalise the scores assigned by each method4:

s_{norm}(p) = \frac{s(p)}{\max_{q \in P} s(q)}

where s(p) is the non-normalised score for pattern p, from the candidate pattern set P. The normalised scores are then averaged across different methods and used to produce a meta-ranking of the candidate patterns.
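A minimal sketch of the normalisation and meta-ranking step follows; it assumes each method's scores are non-negative (patterns with negative MI having already been discarded) and simply averages the max-normalised scores.

def meta_rank(scores_by_method):
    """scores_by_method: {method_name: {pattern: score}}.
    Returns patterns sorted by the average of their max-normalised scores."""
    normalised = {}
    for method, scores in scores_by_method.items():
        top = max(scores.values())
        normalised[method] = {p: (s / top if top else 0.0) for p, s in scores.items()}
    meta = {}
    for scores in normalised.values():
        for p, s in scores.items():
            meta[p] = meta.get(p, 0.0) + s / len(normalised)
    return sorted(meta.items(), key=lambda item: item[1], reverse=True)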

Given the ranking of candidate patterns produced by a scoring method, a certain number of highest-ranking patterns can be selected for evaluation. We studied two different ways of selecting these patterns: (i) one based on setting a threshold on the association score below which the candidate patterns are discarded (henceforth, score-thresholding method) and (ii) one that selects a fixed number of top-ranking patterns (henceforth, rank-thresholding method). During the evaluation, we experimented with different rank- and score-thresholding values.
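The two selection strategies reduce to a one-line filter each; the function below is an illustrative sketch over a list of (pattern, score) pairs already sorted by decreasing score.

def select_patterns(ranked, rank_k=None, score_min=None):
    """Rank-thresholding: keep the top rank_k patterns.
    Score-thresholding: keep patterns whose score exceeds score_min."""
    if rank_k is not None:
        return [p for p, _ in ranked[:rank_k]]
    return [p for p, s in ranked if s > score_min]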

7 Evaluation

Biomedical NE's are expressed in various linguistic forms such as abbreviations, plurals, compounds, coordination, cascades, acronyms and apposition. Sentences in such texts are syntactically complex, and the subsequent Relation Extraction phase depends upon the correct identification of the named entities and correct analysis of the linguistic constructions expressing relations between them [34].

We used the GENIA corpus as the domain corpus, while the British National Corpus (BNC) was used as the general corpus. The GENIA corpus consists of 2,000 abstracts extracted from MEDLINE, containing 18,477 sentences. In the evaluation phase, the GENIA EVENT Annotation corpus5 is used [14]; it consists of 9,372 sentences. The numbers of dependency patterns extracted from each corpus are: GENIA 5066, BNC 419274 and GENIA EVENT 3031 respectively.

In order to evaluate the quality of the extracted patterns, we examined their ability to capture pairs of related NE’s in the manually annotated evaluation corpus, without recognising the type of semantic relation. Selecting a certain number of best-ranking patterns, we measure precision, recall and F-score. To test the statistical significance of differences in the results of different methods and configurations, we used a paired t-test, having randomly divided the evaluation corpus into 20 subsets of equal size; each subset containing 461 sentences on average.
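The evaluation loop can be summarised in a few lines. The per-subset scores and the use of scipy.stats.ttest_rel for the paired t-test are illustrative assumptions; the paper does not specify its implementation.

from scipy import stats

def precision_recall_f(tp, fp, fn):
    """Scores for NE pairs captured by the selected patterns in the evaluation corpus."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def paired_significance(scores_a, scores_b):
    """Paired t-test over per-subset scores of two methods (e.g. 20 subsets)."""
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return t_stat, p_value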

8 Results

Table 3 shows the precision scores for the rank-thresholding method.

4 Patterns with negative MI scores are discarded.
5 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=Event+Annotation


Table 3. Precision scores of rank-thresholding method (dependency tree patterns)

Ranking Methods   Top 100 Ranked Patterns   Top 200 Ranked Patterns   Top 300 Ranked Patterns
IG                0.770                     0.800                     0.780
IGR               0.770                     0.800                     0.787
MI                0.560                     0.560                     0.540
NMI               0.940                     0.815                     0.707
LL                0.770                     0.800                     0.790
CHI               0.960                     0.815                     0.710
Meta              0.900                     0.830                     0.740

Table 4 shows the results of the score-thresholding method. The left side of Table 4 shows the precision (P), recall (R) and F-score values for score-threshold values at which we are able to achieve high F-scores, while the right side of Table 4 shows the high precision scores.

Table 4. Results of score-thresholding method (dependency tree patterns)

                 Threshold score > 0.01        Threshold score > 0.09
Ranking Methods  P      R      F-score         P      R      F-score
IG               0.748  0.107  0.187           0.733  0.007  0.014
IGR              0.748  0.107  0.187           0.733  0.007  0.014
MI               0.567  0.816  0.669           0.563  0.593  0.578
NMI              0.566  0.767  0.651           0.572  0.507  0.538
LL               0.748  0.107  0.187           0.733  0.007  0.014
CHI              0.577  0.529  0.552           0.900  0.036  0.069
Meta             0.571  0.643  0.605           0.860  0.048  0.092

                 Threshold score > 0.02        Threshold score > 0.1
IG               0.796  0.051  0.097           0.704  0.006  0.012
IGR              0.796  0.051  0.097           0.704  0.006  0.012
MI               0.566  0.744  0.643           0.564  0.588  0.576
NMI              0.570  0.706  0.631           0.569  0.483  0.523
LL               0.796  0.051  0.097           0.704  0.006  0.012
CHI              0.591  0.243  0.344           0.898  0.035  0.067
Meta             0.569  0.547  0.558           0.856  0.047  0.089

                 Threshold score > 0.03        Threshold score > 0.2
IG               0.785  0.035  0.067           0.571  0.003  0.005
IGR              0.785  0.035  0.067           0.571  0.003  0.005
MI               0.566  0.711  0.631           0.566  0.473  0.515
NMI              0.568  0.663  0.612           0.600  0.133  0.218
LL               0.785  0.035  0.067           0.571  0.003  0.005
CHI              0.613  0.146  0.236           1.000  0.015  0.029
Meta             0.577  0.355  0.439           1.000  0.013  0.025


In both tables (3 and 4), the results of the best performing ranking method in terms of precision are shown in bold font. Although our main focus is on achieving higher precision scores, it is evident from Table 4 that our method achieved low recall. One reason for the low recall is the small size of the GENIA corpus; this could be countered by using a larger corpus, which would produce a much greater number of patterns and increase the recall.

CHI and NMI are the best performing ranking methods in terms of precision for both the rank-thresholding and score-thresholding methods, while IG, IGR and LL achieve quite similar results. Moreover, in Table 4 we are able to achieve 100% precision. Figure 3 shows the precision scores for the best performing ranking methods (CHI and NMI) with the score-thresholding method.

Fig. 3. Precision scores of the best performing ranking methods (CHI and NMI) under score-thresholding, for score thresholds ranging from >0.08 to >0.5

The literature suggests that IGR performs better than IG [22, 16]; we found that in general there is no statistically significant difference between IG and IGR, or between IGR and LL. In both sets of experiments, evidently due to the aforementioned problem, MI performs quite poorly; the normalised version of MI helps to alleviate this problem. Moreover, there exists a statistically significant difference (p < 0.01) between NMI and the other ranking methods. The meta-ranking method did not improve on the best individual ranking method as expected.

We also found that the score-thresholding method produces better results than rank-thresholding, as we are able to achieve up to 100% precision with the former technique. High precision is quite important in applications such as MCQ generation. In score-thresholding, it is possible to optimise for high precision (up to 100%), though recall and F-score are then generally quite low. MCQ applications rely on the production of good questions rather than the production of all possible questions, so high precision plays a vital role in such applications.

9 Future Work

In the future, we plan to employ the RE method for automatic MCQ generation, where it will be used to find relations and NE's in educational texts that are important for testing students' familiarity with key facts contained in the texts. To achieve this, we need an IE method that has high precision and at the same time works with unrestricted semantic types of relations (i.e. without reliance on seeds), while recall is of secondary importance to precision. The distractors will be produced using distributional similarity measures.


10 Conclusion

In this paper, we have presented an unsupervised approach for RE from dependency trees, intended to be deployed in an e-Learning system for the automatic generation of MCQs by employing semantic patterns. We explored different ranking methods and found that the CHI and NMI ranking methods obtained higher precision than the other ranking methods. We employed two techniques, rank-thresholding and score-thresholding, and found that score-thresholding performs better.

References

1. Agichtein, E., Gravano, L.: Snowball: Extracting Relations from Large Plaintext Collections. In: Proc. of the 5th ACM International Conference on Digital Libraries (2000)

2. Becker, W.E., Watts, M.: Teaching methods in U.S. and undergraduate economics courses. Journal of Economics Education 32(3), 269–279 (2001)

3. Brown, J., Frishkoff, G., Eskenazi, M.: Automatic question generation for vocabulary assessment. In: Proc. of HLT/EMNLP, Vancouver, B.C. (2005)

4. Cohen, A.M., Hersh, W.R.: A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics, 57–71 (2005)

5. Corney, D.P., Jones, D., Buxton, B., Langdon, W.: BioRAT: Extracting Biological Information from Full-length Papers. Bioinformatics, 3206–3213 (2004)

6. Erkan, G., Ozgur, A., Radev, D.R.: Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In: Proc. of CoNLL-EMNLP (2007)

7. Greenwood, M., Stevenson, M., Guo, Y., Harkema, H., Roberts, A.: Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System. In: Proc. of the 4th Learning Language in Logic Workshop, Bonn, Germany (2005)

8. Grover, C., Lascarides, A., Lapata, M.: A Comparison of Parsing Technologies for the Biomedical Domain. Natural Language Engineering 11(1), 27–65 (2005)

9. Hasegawa, T., Sekine, S., Grishman, R.: Discovering relations among named entities from large corpora. In: Proc. of ACL 2004 (2004)

10. Hoshino, A., Nakagawa, H.: A Real-time Multiple-choice Question Generation for Language Testing – A Preliminary Study. In: Proc. of the 43rd ACL 2005 2nd Workshop on Building Educational Applications Using Natural Language Processing, Ann Arbor, U.S., pp. 17–20 (2005)

11. Huang, M., Zhu, X., Payan, G.D., Qu, K., Li, M.: Discovering patterns to extract protein-protein interactions from full biomedical texts. Bioinformatics, 3604–3612 (2004)

12. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice Hall, Englewood Cliffs (2008)

13. Katrenko, S., Adriaans, P.: Learning relations from biomedical corpora using dependency trees. In: Tuyls, K., Westra, R.L., Saeys, Y., Nowé, A. (eds.) KDECB 2006. LNCS (LNBI), vol. 4366, pp. 61–80. Springer, Heidelberg (2007)

14. Kim, J.-D., Ohta, T., Tsujii, J.: Corpus Annotation for Mining Biomedical Events from Literature, BMC Bioinformatics (2008)

15. Lin, D., Pantel, P.: Concept Discovery from Text. In: Proc. of Conference on CL 2002, Taipei, Taiwan, pp. 577–583 (2002)

16. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)


17. Martin, E.P., Bremer, E., Guerin, G., DeSesa, M.-C., Jouve, O.: Analysis of Protein/Protein Interactions through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Text Articles, pp. 96–108. Springer, Berlin (2004)

18. Mitkov, R., An, L.A.: Computer-aided generation of multiple-choice tests. In: Proc. of the HLT/NAACL 2003 Workshop on Building educational applications using Natural Language Processing, Edmonton, Canada, pp. 17–22 (2003)

19. Mitkov, R., Ha, L.A., Karamanis, N.: A computer-aided environment for generating multiple-choice test items. Natural Language Engineering 12(2), 177–194 (2006)

20. Ono, T., Hishigaki, H., Tanigami, A., Takagi, T.: Automated Extraction of Information on Protein–Protein Interactions from the Biological Literature. Bioinformatics, 155–161 (2001)

21. Pustejovsky, J., Casta, J., Cochran, B., Kotecki, M.: Robust relational parsing over biomedical literature: Extracting inhibit relations. In: Proc. of the 7th Annual Pacific Symposium on Bio-computing (2002)

22. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)

23. Sekine, S.: On-Demand Information Extraction. In: Proc. of the COLING/ACL (2006)

24. Shinyama, Y., Sekine, S.: Preemptive Information Extraction using Unrestricted Relation Discovery. In: Proc. of the HLT Conference of the North American Chapter of the ACL, New York, pp. 304–311 (2006)

25. Stevenson, M., Greenwood, M.: A Semantic Approach to IE Pattern Induction. In: Proc. of ACL 2005, pp. 379–386 (2005)

26. Stevenson, M., Greenwood, M.: Dependency Pattern Models for Information Extraction. Research on Language and Computation (2009)

27. Sudo, K., Sekine, S., Grishman, R.: An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition. In: Proc. of the 41st Annual Meeting of ACL 2003, Sapporo, Japan, pp. 224–231 (2003)

28. Sumita, E., Sugaya, F., Yamamoto, S.: Measuring non-native speakers’ proficiency of English using a test with automatically-generated fill-in-the-blank questions. In: Proc. of the 2nd Workshop on Building Educational Applications using NLP, pp. 61–68 (2005)

29. Szpektor, I., Tanev, H., Dagan, I., Coppola, B.: Scaling Web-based acquisition of Entailment Relations. In: Proc. of EMNLP 2004, Barcelona, Spain (2004)

30. Tapanainen, P., Järvinen, T.: A Non-Projective Dependency Parser. In: Proc. of the 5th Conference on Applied Natural Language Processing, Washington, pp. 64–74 (1997)

31. Tsuruoka, Y., Tateishi, Y., Kim, J.-D., Ohta, T., McNaught, J., Ananiadou, S., Tsujii, J.: Developing a Robust Part-of-Speech Tagger for Biomedical Text. In: Bozanis, P., Houstis, E.N. (eds.) PCI 2005. LNCS, vol. 3746, pp. 382–392. Springer, Heidelberg (2005)

32. Tsuruoka, Y., Tsujii, J.: Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. In: Proc. of HLT/EMNLP, pp. 467–474 (2005)

33. Wilbur, J., Smith, L., Tanabe, T.: BioCreative 2. Gene Mention Task. In: Proc. of the 2nd Bio-Creative Challenge Workshop, pp. 7–16 (2007)

34. Zhou, G., Su, J., Shen, D., Tan, C.: Recognizing Name in Biomedical Texts: A Machine Learning Approach. Bioinformatics, 1178–1190 (2004)


An Improved Satisfiable SAT Generator Based on Random Subgraph Isomorphism

Calin Anton

Grant MacEwan University, Edmonton, Alberta, Canada

Abstract. We introduce the Satisfiable Random High Degree Subgraph Isomorphism Generator (SRHD-SGI), a variation of the Satisfiable Random Subgraph Isomorphism Generator (SR-SGI). We use the direct encoding to translate the SRHD-SGI instances into satisfiable SAT instances. We present empirical evidence that the new model preserves the main characteristics of SAT encoded SR-SGI: an easy-hard-easy pattern of evolution and exponential growth of empirical hardness. Our experiments indicate that SAT encoded SRHD-SGI instances are empirically harder than their SR-SGI counterparts. Therefore we conclude that SRHD-SGI is an improved generator of satisfiable SAT instances.

1 Introduction

Satisfiability (SAT) - checking if a Boolean formula is satisfiable - is one of the most important problems in Artificial Intelligence. It has important practical applications such as formal verification of software and hardware. Incomplete SAT solvers cannot prove unsatisfiability, but they may find a solution if one exists. New, challenging testbeds are needed for improving the performance of these solvers, and to differentiate between their performances. Several models have been proposed for generating hard SAT instances, including some which are based on generating graphs [1,2], but there are only a few such models for generating satisfiable SAT instances [3,4].

Given a pair of graphs, the Subgraph Isomorphism Problem (SGI) asks if one graph is isomorphic to a subgraph of the other graph. It is an NP-complete problem with many applications in areas like pattern recognition, computer aided design and bioinformatics. SAT encoded random Subgraph Isomorphism instances are known to be empirically hard, in terms of running time, for state of the art SAT solvers [5,6].

SR-SGI [5] is a model for generating satisfiable SAT instances by converting randomly generated satisfiable instances of SGI. The model has the following features: a) it generates relatively hard satisfiable SAT instances; b) the empirical hardness of the instances exhibits an easy-hard-easy pattern when plotted against one of the model's parameters; c) the hardness of the instances at the hardness peak increases exponentially with the instance size.

In this paper we introduce SRHD-SGI, a variation of SR-SGI which aims to produce harder instances by reducing the number of possible solutions and


eliminating the tell-tales - which are indicators of an easy to find solution. The above goals are achieved by generating a "flatter" subgraph. The main difference between SRHD-SGI and SR-SGI resides in the way the subgraph is generated: SR-SGI randomly selects the subgraph while SRHD-SGI selects the subgraph induced by the highest degree vertices.

2 A Random Model for Generating Satisfiable SGI Instances

For integers n, m, and q such that 0 ≤ n ≤ m, 0 ≤ q ≤ $\binom{m}{2}$, and p ∈ [0, 1], an (m, q, n, p) satisfiable random high degree SGI (SRHD-SGI) instance consists of two graphs G and H and asks if G is isomorphic to a subgraph of H. H is a random graph with m vertices and q edges. G is obtained by the following steps (a sketch of the construction follows the list):

1. Select the n highest degree vertices of H, breaking ties at random.
2. Make G′ the induced subgraph of H with the vertices selected in step 1.
3. Remove edges from G′ in decreasing order of the sum of the degrees of their adjacent vertices, breaking ties at random.
4. Make G a random isomorphic copy of G′, by randomly shuffling the vertices of G′.
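A compact sketch of these four steps is given below, using networkx for the graph operations. It interprets p as the fraction of edges removed from G′ (consistent with p ∈ [0, 1] and the later discussion of p = 0 and p = 1) and uses the initial degrees when ordering edges for removal; both are assumptions of the sketch, and it is not the authors' generator.

import random
import networkx as nx

def srhd_sgi(m, q, n, p, seed=None):
    """Generate an (m, q, n, p) SRHD-SGI instance (G, H)."""
    rng = random.Random(seed)
    H = nx.gnm_random_graph(m, q, seed=seed)          # random graph with m vertices, q edges
    # 1. the n highest degree vertices of H, ties broken at random
    top = sorted(H.nodes(), key=lambda v: (H.degree(v), rng.random()), reverse=True)[:n]
    # 2. G' is the subgraph of H induced by those vertices
    g_prime = H.subgraph(top).copy()
    # 3. remove a fraction p of edges, in decreasing order of the sum of the
    #    degrees of their endpoints, ties broken at random
    order = sorted(g_prime.edges(),
                   key=lambda e: (g_prime.degree(e[0]) + g_prime.degree(e[1]), rng.random()),
                   reverse=True)
    for u, v in order[:int(p * g_prime.number_of_edges())]:
        g_prime.remove_edge(u, v)
    # 4. G is a random isomorphic copy of G' (shuffle the vertex labels)
    labels = list(g_prime.nodes())
    rng.shuffle(labels)
    G = nx.relabel_nodes(g_prime, dict(zip(g_prime.nodes(), labels)))
    return G, H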

The SGI instance is simplified using PP, a preprocessing procedure introduced in [7]. SRHD-SGI is a variation of the SR-SGI model, which removes edges from the induced graph at random. The main difference between SRHD-SGI and SR-SGI resides in the way G is generated. Reducing the number of possible solutions1 is the reason for making G′ the subgraph induced by the highest degree vertices of H. If G′ is a dense graph, then it is conceivable that there will not be many subgraphs of H which are isomorphic to G and thus there will not be many solutions.

The high degree vertices of G may become tell-tales - "signposts that may help, at least statistically, to determine the hidden solution" [8] - which negatively affects the instance hardness. Hiding these tell-tales is the main reason for our choice of edge removal method. It also reduces the variance in the degree sequence of G, and as such, it reduces the local variation. Reducing variation in local structure has been applied to SAT [9], resulting in hard random instances.

If p = 0, no edge is removed from G′ and thus G is the subgraph of H induced by the highest degree vertices. In this case it is very likely that the SGI instance has a single solution. Furthermore, the presence of the tell-tales should quickly guide the search toward the unique solution. As such, the instances generated for p = 0 are not expected to be very difficult. If p = 1, G has no edges and therefore it is isomorphic to any subgraph of H with n vertices and no edges. Based on these assumptions it is expected that the hardest instances of SRHD-SGI are obtained for values of p strictly between 0 and 1.

1 It has been implied [4] that the number of solutions is negatively correlated with the difficulty of randomly generated satisfiable instances.


Fig. 1. Evolution of empirical hardness of SAT encoded SRHD-SGI with p (m=19, q=145, n=17). Both panels plot median time (ms) against p; the left panel shows SATzilla2009_C, SATzilla2009_I and SApperloTbase, the right panel clasp, novelty, picosat and precosat. Notice the different time scales.

3 Empirical Investigation of SAT Encoded SRHD-SGI

In this section we provide experimental evidence indicating that SRHD-SGI preserves the main characteristics of SR-SGI and produces empirically harder SAT encoded instances than SR-SGI.

We used the following experimental framework: m was set to 16, 17, 18, 19, and 20; n varied from m−5 to m; q varied between 0.60·$\binom{m}{2}$ and 0.90·$\binom{m}{2}$ in increments of 0.05·$\binom{m}{2}$; p varied between 0% and 50% in increments of 5%. We used a cutoff time limit of 900 seconds. For each combination of values for m, q, n and p, 100 test samples were generated. The running time (in milliseconds) was used to estimate the empirical hardness of the instances. For performing the experiments we chose solvers2 which won the gold and silver medals at the last SAT competition [10] in the SAT and SAT+UNSAT categories of the Application and Crafted tracks: clasp, precosat, SAperloT, SATzilla2009_I and SATzilla2009_C. To simplify the comparison with SR-SGI we added to the solver pool picosat and gnovelty+, which were used in the SR-SGI experiments [5].

3.1 SRHD-SGI Preserves the Main Characteristics of SR-SGI

In this subsection we present empirical evidence that SRHD-SGI preserves the most important characteristics of SR-SGI: the easy-hard-easy pattern of hardness evolution and the exponential growth of the empirical hardness.

Easy-hard-easy pattern. The empirical hardness of the SAT encoded SR-SGI exhibits an easy-hard-easy pattern when plotted against p. This is an important feature of SR-SGI, as it offers a large selection of "predictable-hard" instances. Given the similarities between SR-SGI and SRHD-SGI we expected that SRHD-SGI also exhibits an easy-hard-easy pattern, and the experiments confirmed our intuition. In this experiment we fixed m, q and n and let p vary.

2 Brief descriptions of the solvers are available on the SAT competition website [10].


Fig. 2. Exponential growth of the hardest SAT encoded SRHD-SGI instances (n=19): median time (ms, log scale) plotted against the median number of variables for SATzilla2009_C, SATzilla2009_I, SAperloT, clasp, gnovelty+, picosat and precosat.

Fig. 3. Comparison of SAT encoded SRHD-SGI and SR-SGI (x-axis: p in %; y-axis: median time in ms). Panels: A - clasp (m=19, q=145, n=18); B - gnovelty+ (m=19, q=136, n=19); C - picosat and precosat (m=19, q=136, n=16); D - SAperloT and SATzilla_I (m=19, q=145, n=17).

For all solvers, we noticed an easy-hard-easy pattern in the variation of the empirical hardness - see Figure 1. The same pattern occurred when the number of visited nodes (for complete solvers) or steps (for incomplete solvers) was plotted against p. For fixed m and n, and for all solvers, we noticed that the value of p at which the hardness peaks decreases as q increases. A possible reason for this correlation is that as H becomes denser, fewer edges need to be removed from G′ for H to contain many copies of G, which makes the instances easy.

When m and q are fixed, the value of p which produces the hardest instances increases as n increases. The presence of the tell-tales is a possible explanation for this behavior. For fixed H, which is the case when m and q are fixed, the number of tell-tales increases as the number of vertices of G increases - more high degree vertices of H are used. More edges need to be removed from G to hide a larger set of tell-tales, and this explains why the value of p at which the hardness peaks increases with n.

Exponential growth rate. The empirical hardness of the SAT encoded SR-SGI instances - generated at the hardness peak - increases exponentially with the number of variables. This is a desirable characteristic of any generator, as it implies that the generated instances will be hard even asymptotically. We expected that it may also be the case for SAT encoded SRHD-SGI. To check this hypothesis we fixed m and plotted the hardness of the SRHD-SGI instances from the hardness peak against their number of variables - see Figure 2. The shape of the curves is essentially exponential. Similar curves were obtained when the hardness of the instances was plotted against their size - number of literals.

3.2 SRHD-SGI Generates Harder SAT Instances than SR-SGI

In this subsection we compare SRHD-SGI with SR-SGI. The main purpose of this comparison is to assess the hardness of SRHD-SGI instances. We expect that the vertex selection procedure of SRHD-SGI combined with its edge removal method produces harder SAT instances. To check this hypothesis we compared instances of the two models, generated for the same values of the parameters.

For all solvers, for the same values of the parameters, the hardest SRHD-SGI instances are at least as difficult as the hardest SR-SGI ones; in most of the cases the former are two to three times more difficult than the latter, and in some cases the difference is more than an order of magnitude. The difference between the hardness of the peak instances of the two models is larger for smaller values of n. We believe that this is a consequence of the vertex selection method of SRHD-SGI. For small values of n, the vertex selection method of SRHD-SGI and the random vertex selection may produce significantly different sets of vertices, while for large values of n, the sets of vertices produced by the two selection methods are only slightly different.

We noticed an interesting pattern: for small values of p, SR-SGI instances are harder than their SRHD-SGI counterparts. For large values of p, SRHD-SGI instances are harder. When running times on the instances generated by the two models are plotted against p, the two plots cross over at values of p smaller than the ones that produce the hardest instances - see Figure 3 - and this is consistent among all solvers. We think that this behavior is a consequence of the vertex and edge selection methods of SRHD-SGI, which highlights the importance of hiding the tell-tales. When p is small, only a few edges are removed from G′ and therefore the high-degree vertices are preserved in G′, making the (unique) hidden solution easy to find - easier than for the random selection of vertices. As p increases, more edges connecting high degree vertices are removed, and thus the highest degree vertices are suppressed, which conceals the tell-tales. Furthermore, the edge removal procedure makes G more uniform and therefore increases the likelihood that the variable and value heuristics are misled by the numerous almost-solutions. We believe that this is the region where the hardest instances are generated and this is the reason for the superior hardness of SRHD-SGI. When p becomes large, it is expected that H will contain many copies of G. However, the selection methods of SRHD-SGI make G denser than the corresponding counterpart of SR-SGI and therefore H contains fewer copies of it, which makes the instances harder than the SR-SGI ones.

4 Conclusion

We introduced and empirically analyzed a generator of satisfiable SAT instances based on subgraph isomorphism. We showed that it preserves the main characteristics of SR-SGI: the easy-hard-easy pattern of the evolution of the empirical hardness of the instances and the exponential growth of the empirical hardness. This is consistent for both complete and incomplete solvers. We presented empirical evidence that this model produces harder satisfiable SAT instances than SR-SGI. All these features indicate that this is a better model for generating satisfiable SAT instances and Pseudo Boolean instances.

References

1. Audemard, G., Jabbour, S., Sais, L.: SAT graph-based representation: A new perspective. J. Algorithms 63(1-3), 17–33 (2008)

2. Ansotegui, C., Bejar, R., Fernandez, C., Mateu, C.: Generating hard SAT/CSP instances using expander graphs. In: Proc. AAAI 2008, pp. 1442–1443 (2008)

3. Xu, K., Boussemart, F., Hemery, F., Lecoutre, C.: A simple model to generate hard satisfiable instances. In: Proceedings of IJCAI 2005, pp. 337–342 (2005)

4. Achlioptas, D., Gomes, C., Kautz, H., Selman, B.: Generating satisfiable problem instances. In: Proceedings of AAAI 2000, pp. 256–261 (2000)

5. Anton, C., Olson, L.: Generating satisfiable SAT instances using random subgraph isomorphism. In: Gao, Y., Japkowicz, N. (eds.) AI 2009. LNCS, vol. 5549, pp. 16–26. Springer, Heidelberg (2009)

6. Culberson, J., Gao, Y., Anton, C.: Phase transitions of dominating clique problem and their implications to heuristics in satisfiability search. In: Proc. IJCAI 2005, pp. 78–83 (2005)

7. Anton, C., Neal, C.: Notes on generating satisfiable SAT instances using random subgraph isomorphism. In: Farzindar, A., Keselj, V. (eds.) Canadian AI 2010. LNCS, vol. 6085, pp. 315–318. Springer, Heidelberg (2010)

8. Culberson, J.: Hidden solutions, tell-tales, heuristics and anti-heuristics. In: IJCAI 2001 Workshop on Empirical Methods in AI, pp. 9–14 (2001)

9. Bayardo, R., Schrag, R.: Using CSP look-back techniques to solve exceptionally hard SAT instances. In: Freuder, E.C. (ed.) CP 1996. LNCS, vol. 1118, pp. 46–60. Springer, Heidelberg (1996)

10. The international SAT competitions web page, http://www.satcompetition.org


Utility Estimation in Large Preference Graphs Using A* Search

Henry Bediako-Asare1, Scott Buffett2, and Michael W. Fleming1

1 University of New Brunswick, Fredericton, NB, E3B 5A3{o3igd,mwf}@unb.ca

2 National Research Council Canada, Fredericton, NB, E3B [email protected]

Abstract. Existing preference prediction techniques can require that an entire preference structure be constructed for a user. These structures, such as Conditional Outcome Preference Networks (COP-nets), can grow exponentially in the number of attributes describing the outcomes. In this paper, a new approach for constructing COP-nets, using A* search, is introduced. Using this approach, partial COP-nets can be constructed on demand instead of generating the entire structure. Experimental results show that the new method yields enormous savings in time and memory requirements, with only a modest reduction in prediction accuracy.

1 Introduction

In recent years, the idea of autonomous agents representing users in some form of automated negotiation has gained a significant amount of interest [6–8]. This has inspired research in finding effective techniques for modeling user preferences and also eliciting preferences from a user [2, 3, 10].

Preference networks have been developed to graphically represent models of user preferences over a set of outcomes. Two such networks are Boutilier et al.'s Conditional Preference Network (CP-net) [1] and Chen et al.'s Conditional Outcome Preference Network (COP-net) [5]. Using preference elicitation techniques, such as standard gamble questions or binary comparisons [9], an agent can obtain information on a user's preferences in order to create a network. As the number of attributes grows, the size of such a network becomes unmanageable and it is typically infeasible to learn all preferences over a large number of outcomes. Therefore, given only a small number of preferences, the agent will have to predict as many others as possible over the set of outcomes.

Often, preferences over a small number of outcomes are all that is needed, such as when determining whether one particular outcome is preferred over another. Here, it is valuable to be able to build a partial COP-net, containing only outcomes that are relevant in determining the relationship between the two outcomes of interest, without compromising preference prediction accuracy.

In this paper, such an approach for constructing COP-nets is introduced, using A* search. Using this new methodology, smaller COP-nets can be constructed on demand, eliminating the need to generate a network for an entire set of outcomes.



2 Preference Networks

A Conditional Outcome Preference Network (COP-net) [4] is a directed graph that represents preferences over the set of outcomes. Every outcome is represented by a vertex, and for vertices v and v′ representing outcomes o and o′, respectively, if v is a proper ancestor of v′ then o is preferred over o′. The graph is transitively reduced by the removal of redundant edges.

In addition to modeling the user's preferences during the elicitation stage, the COP-net can also be used to estimate a utility function over the set of outcomes. Given an initial partial utility assignment, including at least the most preferred outcome (utility 1) and the least preferred (utility 0), a utility function u over the entire set of outcomes is produced. This is done in such a way as to preserve the preference ordering specified by the COP-net. Specifically, if v and v′ represent outcomes o and o′ and v is a proper ancestor of v′, then u(o) > u(o′). Estimating a utility for every outcome allows one to compare two outcomes that might otherwise have no direct relationship in the graph.
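A minimal sketch of such an order-preserving utility assignment is shown below. It simply interpolates on longest-path depth in the (acyclic) COP-net, so that every proper ancestor gets a strictly higher utility than its descendants; it is only an illustration of the constraint stated above and not necessarily the Longest Path estimation procedure of [5].

import networkx as nx

def estimate_utilities(cop_net):
    """cop_net: a networkx DiGraph whose edges point from preferred outcomes
    to less preferred ones. Returns a utility in [0, 1] per outcome."""
    depth = {v: 0 for v in cop_net.nodes()}
    for v in nx.topological_sort(cop_net):          # COP-nets are acyclic
        for child in cop_net.successors(v):
            depth[child] = max(depth[child], depth[v] + 1)
    deepest = max(depth.values()) or 1
    # most preferred outcomes get utility 1, the deepest outcome gets 0
    return {v: 1.0 - depth[v] / deepest for v in cop_net.nodes()}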

3 Generating Partial COP-Nets

3.1 Motivation for Partial COP-Net Construction

The current method for constructing COP-nets provides a reasonably accurate model for representing a user's preferences. An agent will, with high frequency, be able to correctly predict a user's preference given any two outcomes, provided a sufficient amount of preference information has been elicited from the user [5]. However, with the current solution, in order to estimate the utility of a small number of outcomes, or even a single outcome, the entire structure must be constructed. Since the number of outcomes grows exponentially in the number of attributes, using such graphs to represent a preference profile becomes infeasible for problems with large numbers of attributes/values.

It would therefore be valuable to be able to construct only a partial COP-net when predicting preferences. For example, consider the partial COP-net in Figure 1 (left). This graph contains some valuable information regarding the likely preference over oi and oj. In particular, two chains (p1 and p2) through the space of arcs in the graph are highlighted that indicate that oi is likely "higher" in the actual COP-net than oj. This indicates that there is evidence to support the prediction that oi has higher utility than oj. The goal then of this paper is to generate partial COP-nets by attempting to construct such connections between the outcomes in question in the COP-net. Once the partial COP-net is constructed, the preference information represented in the connections in the partial COP-net is then exploited to determine the likely preference.

3.2 Partial COP-Net Composition

The partial COP-net is composed by finding chains of arcs through the implicit COP-net that connect the outcomes in question. These chains would represent


Fig. 1. (Left) A partial COP-net for deciding the likely preference over oi and oj, with two highlighted chains p1 and p2, and (right) a chain of nodes o1, o2, o3, o4 found using preferences o1 ≻ o2, o1 ≻ o3 and o3 ≻ o4

paths through the COP-net if direction were removed from the arcs, but do not necessarily (and are in fact very unlikely to) represent directed paths. For example, p1 and p2 in Figure 1 (left) represent two such valid chains we seek to find. We choose to generate the partial COP-net by constructing four chains as follows. Given any pair of outcomes, oi and oj, initially oi is designated as the start node and oj as the goal node. Two chains are then generated from oi to oj, one that passes through a parent node of oi and one that passes through a child node of oi. A second pair of chains is then generated similarly, but with oj as the start node and oi as the goal node, with one chain passing through a parent node of oj and one passing through a child node of oj. The idea here is to find a diverse sample of chains that reach both above and below each outcome in question. The two pairs of chains are then merged to obtain a partial COP-net from which utilities are estimated and preferences predicted.

3.3 Search Space

The set of known preferences can be seen as offering an implicit representation of the true COP-net. For example, if the preference o1 ≻ o2 is specified, then this implies that the node representing o1 is an ancestor of the node representing o2 in the true COP-net. A chain through the implicit COP-net can thus be constructed by jumping from one outcome to the next by finding preferences that dictate that one outcome is more preferred than another (and is thus an ancestor) or less preferred (and is thus a descendant). For example, if preferences are known that dictate that o1 ≻ o2, o1 ≻ o3 and o3 ≻ o4, then a chain can be constructed as depicted in Figure 1 (right). The goal is then to find reasonably small chains through the COP-net space, which will in turn result in reasonably small partial COP-nets.

Since we employ the ceteris paribus assumption (i.e. "all else equal"), preferences can be quite general and therefore a single preference may dictate a large number of relationships. For example, if there are two attributes A and B with values {a1, a2} and {b1, b2} respectively, then the preference a1 ≻ a2 under ceteris paribus implies both that a1b1 is an ancestor of a2b1 and that a1b2 is an


ancestor of a2b2. We also allow conditional preferences of the form c : x ≻ y, meaning that given the condition that the attribute values specified by c hold, the values specified by x are preferred over the values specified by y, all else equal.

To illustrate how the search is performed, assume the current node in the search is a1b1c1 and the elicited preferences from the user include a1 ≻ a2 and b1 : c2 ≻ c1. Applying a1 ≻ a2 will allow the search to go "down" from a1b1c1 to a2b1c1 (i.e. since a2b1c1 must be a descendant of a1b1c1 in the COP-net), and applying b1 : c2 ≻ c1 will allow us to go "up" from a1b1c1 to a1b1c2.
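The step just described can be sketched as a successor generator over outcomes. Outcomes are represented as attribute-value dictionaries and each elicited preference as a (condition, attribute, preferred value, less preferred value) tuple; this representation is an assumption made for illustration.

def successors(outcome, preferences):
    """Yield outcomes reachable in one step by applying a preference ceteris paribus.
    preferences: iterable of (condition_dict, attribute, better_value, worse_value)."""
    for condition, attr, better, worse in preferences:
        if all(outcome.get(a) == v for a, v in condition.items()):
            if outcome.get(attr) == better:     # go "down" to a less preferred outcome
                yield dict(outcome, **{attr: worse})
            elif outcome.get(attr) == worse:    # go "up" to a more preferred outcome
                yield dict(outcome, **{attr: better})

For the example above, applying ({}, "A", "a1", "a2") to {"A": "a1", "B": "b1", "C": "c1"} yields the "down" step to a2b1c1, and applying ({"B": "b1"}, "C", "c2", "c1") yields the "up" step to a1b1c2.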

3.4 The Heuristic Function

A consequence of the generality realized by employing ceteris paribus in the preference representation is that there may be several possible preferences that can be applied to an outcome, and thus several possible steps from which one needs to choose during the search. To increase the likelihood of identifying short chains between outcomes, we employ a heuristic to help choose which of the elicited preferences to apply and thus which outcome to select as a successor to a current outcome during the search.

The heuristic used to guide the search is defined as follows. Let oi be the outcome representing the initial state in the search, let oj represent the goal state and let on be a possible choice for the next state in the search for oj. The heuristic h(on) for on is then set to be equal to the number of attributes whose values differ in on and oj. Such a heuristic should then guide the search toward the goal node. If g(on) represents the number of steps in the chain found from oi to on, then choices are made that minimize f(on) = g(on) + h(on).
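The heuristic and evaluation function translate directly into code; outcomes are again attribute-value dictionaries, an assumption of this sketch.

def h(outcome, goal):
    """Number of attributes whose values differ between the outcome and the goal."""
    return sum(1 for attr in goal if outcome.get(attr) != goal[attr])

def f(steps_so_far, outcome, goal):
    """A* evaluation function: g (chain length so far) plus the heuristic h."""
    return steps_so_far + h(outcome, goal)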

The heuristic is both admissible and consistent under the assumption that only single-attribute preferences are specified (which can be conditional on values for any number of attributes), which is typically the case in practice. Admissibility holds because, since only one attribute value can be changed at each step, h(on) cannot be an overestimate of the number of steps from on to the goal. Consistency holds because, if h(on) = k is the number of attribute values on which on and the goal outcome oj differ, then for any outcome om that can be reached from on in one step, h(om) is at least k − 1. Therefore, h(on) is no larger than h(om) plus the actual cost of moving from on to om (which is 1), which satisfies the definition of consistency.

3.5 Analyzing the Partial COP-Net to Predict Preferences

Once the partial COP-net is constructed by merging the four chains into a single graph, the next step is to exploit the inherent structure to estimate a utility for each node in the graph. Utilities of outcomes in a partial COP-net are estimated using a version of a method referred to as the Longest Path technique [5]. Once utilities are estimated, it is a simple matter of comparing estimates for the nodes in question and selecting the highest as the most preferred.


4 Results

Figure 2 (left) shows the accuracy of the partial COP-net method for preference prediction compared to using full COP-nets. In order to give a full comparison, a third baseline approach was also evaluated. While 50% might be considered the worst prediction accuracy one could achieve (i.e. by guessing), one could easily achieve better success than this by using preference information available and making more educated guesses. The baseline is thus an estimate of the best one could do using simplistic methods. We aim to ensure that, while we do not expect our partial COP-net method to achieve the same accuracy rate as the full COP-net approach, it should do reasonably well compared to the baseline.

With the baseline approach, predictions were made simply by choosing the outcome with the higher number of individual attributes with preferred values. For example, consider two outcomes a1b1c1d1 and a2b2c2d2. If the elicited preferences were a1 ≻ a2, b2 ≻ b1 and c1 ≻ c2, then the baseline method would choose a1b1c1d1 as the more preferred, since it contains the preferred value for two of the attributes, while a2b2c2d2 only contains one, with one being unknown. Figure 2 demonstrates that our partial COP-net method performs reasonably well when compared with the full COP-net method and the baseline approach. A paired t-test shows that the difference in means between results from the partial COP-net method and the baseline approach is statistically significant at the p < 0.05 level for all problem sizes.
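The baseline reduces to counting individually preferred attribute values; the following sketch assumes a mapping from each attribute to its (unconditionally) preferred value, where known.

def baseline_prediction(o1, o2, preferred_value):
    """Return the outcome with more individually preferred attribute values,
    or None for a tie. preferred_value: {attribute: preferred value}."""
    score1 = sum(1 for a, v in preferred_value.items() if o1.get(a) == v)
    score2 = sum(1 for a, v in preferred_value.items() if o2.get(a) == v)
    if score1 == score2:
        return None
    return o1 if score1 > score2 else o2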

Fig. 2. (Left) Accuracy of the full COP-net approach (best), partial COP-net approach (2nd-best) and baseline approach (worst), by problem size (from 3 to 2304 possible outcomes), and (right) reduction in the number of outcomes considered by using the partial COP-net approach.

The main objective of this paper is to show that we can still obtain a reasonably high prediction accuracy, while exploring only a tiny fraction of the space of outcomes. Figure 2 (right) demonstrates this, showing that we can ignore up to 98% of all outcomes for problems of only about 2000 outcomes. The trend indicates that this number will continue to increase with the size of the problems.

We also examined what this reduction means in terms of computation speed. We found that only the tiniest fraction of computation time is now required.


For example, problems with 2304 outcomes required an average of over three hours to solve with the full COP-net approach, while the partial COP-net approach took just two seconds. This means that a vast space of situations that previously had too many outcomes to allow for any reasonable preference prediction technique is now manageable using our new technique.

5 Conclusions

The test results clearly demonstrate the benefits of the proposed methodology for constructing partial COP-nets. Although it sacrifices some prediction accuracy, it provides enormous savings in time and memory requirements. For example, in cases where it would have taken over three hours for the current methodology to build a COP-net and estimate utilities of outcomes, the proposed methodology takes just a few seconds, with only a modest reduction in prediction accuracy (80-90% instead of 90-95% for problems with more than 500 outcomes). Perhaps most importantly, the reduction in time and space requirements allows for fast predictions in cases where it would have been completely infeasible before.

References

1. Boutilier, C., Brafman, R.I., Domshlak, C., Hoos, H.H., Poole, D.: CP-nets: A tool for representing and reasoning with conditional ceteris paribus preference statements. Journal of Artificial Intelligence Research 21, 135–191 (2004)

2. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Regret-based utility elicitation in constraint-based decision problems. In: Proceedings of IJCAI 2005, Edinburgh, Scotland, pp. 929–934 (2005)

3. Chajewska, U., Koller, D., Parr, R.: Making rational decisions using adaptive utility elicitation. In: AAAI 2000, Austin, Texas, USA, pp. 363–369 (2000)

4. Chen, S.: Reasoning with conditional preferences across attributes. Master's thesis, University of New Brunswick (2006)

5. Chen, S., Buffett, S., Fleming, M.W.: Reasoning with conditional preferences across attributes. In: Proc. of AI 2007, Montreal, Canada, pp. 369–380 (2007)

6. Faratin, P., Sierra, C., Jennings, N.R.: Using similarity criteria to make issue trade-offs in automated negotiations. Artificial Intelligence 142, 205–237 (2002)

7. Fatima, S.S., Wooldridge, M., Jennings, N.R.: Optimal negotiation of multiple issues in incomplete information settings. In: Proc. 3rd Int. Conf. on Autonomous Agents and Multi-Agent Systems, New York, NY, pp. 1080–1087 (2004)

8. Jennings, N.R., Faratin, P., Lomuscio, A., Parsons, S., Sierra, C., Wooldridge, M.: Automated negotiation: prospects, methods and challenges. Int. J. of Group Decision and Negotiation 10(2), 199–215 (2001)

9. Keeney, R.L., Raiffa, H.: Decisions with Multiple Objectives: Preferences and Value Tradeoffs. John Wiley and Sons, Inc., Chichester (1976)

10. Sandholm, T., Boutilier, C.: Preference elicitation in combinatorial auctions. In: Cramton, P., Shoham, Y., Steinberg, R. (eds.) Combinatorial Auctions (2006)


A Learning Method for Developing PROAFTN Classifiers and a Comparative Study with Decision Trees

Nabil Belacel and Feras Al-Obeidat

Institute for Information Technology, National Research Council of Canada

Abstract. PROAFTN belongs to the Multiple-Criteria Decision Aid (MCDA) paradigm and requires several parameters for the purpose of classification. This study proposes a new inductive approach for obtaining these parameters from data. To evaluate the performance of the developed learning approach, a comparative study between PROAFTN and decision trees in terms of their learning methodology, classification accuracy, and interpretability is investigated in this paper. The major distinguishing property of a decision tree is its ability to generate classification models that can be easily explained. The PROAFTN method also has this capability, therefore avoiding a black box situation. Furthermore, according to the proposed learning approach in this study, the experimental results show that PROAFTN strongly competes with ID3 and C4.5 in terms of classification accuracy.

Keywords: Classification, PROAFTN, Decision Tree, MCDA, Knowledge Discovery.

1 Introduction

Decision tree learning is a widely used method in data mining and machine learning. The strengths of decision trees (DT) can be summarized as follows: (1) Simple to understand and interpret; people are able to understand decision tree models after a brief explanation. (2) Not a black box model; the classification model can be easily explained by boolean logic. (3) The methodology used to construct a classification model is not hard to understand. (4) The classification results are usually reasonable. These advantages make DT a common and highly used classification method in research and applications [4].

This paper introduces a new learning technique for the classification method PROAFTN, which requires several parameters (e.g., intervals, discrimination thresholds and weights) that need to be determined to perform the classification. This study investigates a new automatic approach for the elicitation of PROAFTN parameters and prototypes from data during the training process. The major characteristics of PROAFTN can be summarized as follows:

– PROAFTN is not a black box and the results are automatically explained; that is, it provides the possibility of access to more detailed information concerning the classification decision.


– PROAFTN can perform two learning paradigms: deductive and inductive learning. In the deductive approach, the decision maker has the role of establishing the required parameters for the studied problem, whereas in an inductive approach, the parameters and the classification models are obtained automatically from the datasets.

Based on what has been presented above, one can see that DT and PROAFTN can generate classification models which can be easily explained and interpreted. However, when evaluating any classification method there is another important factor to be considered: classification accuracy. Based on the experimental study presented in Section 4, PROAFTN can generate a higher classification accuracy than the decision tree learning algorithms ID3 and C4.5 [9].

The paper is organized as follows. Section 2 presents the PROAFTN method. Section 3 proposes automatic learning methods based on machine learning techniques to infer PROAFTN parameters and prototypes. In Section 4, a comparative study based on computational results generated by PROAFTN and DT (ID3 and C4.5) on some well-known datasets is presented and analyzed. Finally, conclusions and future work are presented in Section 5.

2 PROAFTN Method

The PROAFTN procedure belongs to the class of supervised learning methods for solving classification problems. PROAFTN has been applied to the resolution of many real-world practical problems [6] [7] [10]. The following subsections describe the required parameters, the classification methodology, and the procedure used by PROAFTN.

2.1 Initialization

From a set of n objects known as a training set, consider an object a that requires to be classified; assume this object a is described by a set of m attributes {g_1, g_2, ..., g_m} and z classes {C_1, C_2, ..., C_z}. Given an object a described by the scores of the m attributes, for each class C_h we determine a set of L_h prototypes. For each prototype b_i^h and each attribute g_j, an interval [S_j^1(b_i^h), S_j^2(b_i^h)] is defined, where S_j^2(b_i^h) ≥ S_j^1(b_i^h).

To apply PROAFTN, the pessimistic intervals [S_j^1(b_i^h), S_j^2(b_i^h)] and the optimistic intervals [S_j^1(b_i^h) − d_j^1(b_i^h), S_j^2(b_i^h) + d_j^2(b_i^h)] should be determined prior to classification for each attribute. As mentioned above, the indirect technique approach will be adapted to infer these intervals. The following subsections explain the stages required to classify the object a to the class C_h using PROAFTN.

2.2 Computing the Fuzzy Indifference Relation

To use the classification method PROAFTN, we need first to calculate the fuzzy indifference relation I(a, b_i^h). The calculation of the fuzzy indifference relation is based on the concordance and non-discordance principle, which is identified by:

I(a, b_i^h) = Σ_{j=1}^{m} w_j^h C_j(a, b_i^h)    (1)


[Figure 1 plots the partial indifference index C_j(a, b_i^h) against g_j(a): it equals 1 (strong indifference) on the interval [S_j^1, S_j^2], decreases to 0 across the margins d_j^1 and d_j^2 (weak indifference), and is 0 (no indifference) outside [S_j^1 − d_j^1, S_j^2 + d_j^2].]

Fig. 1. Graphical representation of the partial indifference concordance index between the object a and the prototype b_i^h represented by intervals

where w_j^h is the weight that measures the importance of a relevant attribute g_j for a specific class C_h:

w_j^h ∈ [0,1], and Σ_{j=1}^{m} w_j^h = 1,   j = 1, ..., m; h = 1, ..., z

C_j(a, b_i^h) is the degree that measures the closeness of the object a to the prototype b_i^h according to the attribute g_j. To calculate C_j(a, b_i^h), two positive thresholds d_j^1(b_i^h) and d_j^2(b_i^h) need to be obtained. The computation of C_j(a, b_i^h) is graphically presented in Fig. 1.

2.3 Evaluation of the Membership Degree

The membership degree between the object a and the class C_h is calculated based on the indifference degree between a and its nearest neighbor in B_h. The following formula identifies the nearest neighbor:

d(a, C_h) = max{ I(a, b_1^h), I(a, b_2^h), ..., I(a, b_{L_h}^h) }    (2)

2.4 Assignment of an Object to the Class

The last step is to assign the object a to the right class C_h; the calculation required to find the right class is straightforward:

a ∈ C_h ⇔ d(a, C_h) = max{ d(a, C_i) | i ∈ {1, ..., z} }    (3)
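To make the three steps above concrete, here is a minimal sketch of Eqs. (1)-(3). The prototypes, thresholds and equal weights are illustrative, and the piecewise-linear shape of C_j is an assumption read off Fig. 1 rather than a formula given in this excerpt; this is not the authors' implementation.

```python
# Minimal sketch of Eqs. (1)-(3) with a piecewise-linear partial indifference
# index suggested by Fig. 1.  Prototypes, thresholds and weights are illustrative.
def c_j(x, s1, s2, d1, d2):
    """Partial indifference C_j: 1 inside [s1, s2], 0 outside the fuzzy margins."""
    if s1 <= x <= s2:
        return 1.0
    if x <= s1 - d1 or x >= s2 + d2:
        return 0.0
    return (x - (s1 - d1)) / d1 if x < s1 else ((s2 + d2) - x) / d2

def indifference(a, prototype, weights):          # Eq. (1)
    return sum(w * c_j(a[j], *prototype[j]) for j, w in enumerate(weights))

def membership(a, prototypes, weights):           # Eq. (2): nearest prototype in B_h
    return max(indifference(a, b, weights) for b in prototypes)

def classify(a, classes, weights):                # Eq. (3): class with highest d(a, C_h)
    return max(classes, key=lambda h: membership(a, classes[h], weights))

# Two classes, one prototype each; prototype[j] = (S1_j, S2_j, d1_j, d2_j).
classes = {"C1": [[(0.0, 2.0, 0.5, 0.5), (10, 20, 2, 2)]],
           "C2": [[(5.0, 8.0, 1.0, 1.0), (30, 40, 5, 5)]]}
print(classify([1.5, 18], classes, weights=[0.5, 0.5]))   # -> "C1"
```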

3 Proposed Techniques to Learn PROAFTN

As discussed earlier, PROAFTN requires the elicitation of its parameters for the purpose of classification. Several approaches have been used to learn PROAFTN in [1] [2] [3].


Algorithm 1. Building the classification model for PROAFTN
1: Determine a threshold β as reference for interval selection
2: z ← Number of classes, i ← Prototype's index
3: m ← Number of attributes, k ← Number of intervals for each attribute
4: I_jh^r ← Apply discretization to get {S_jh^1r, S_jh^2r} for each attribute g_j in each class C_h
5: ℜ ← Percentage of values within the interval I_jh^r per class
6: Generate PROAFTN intervals using discretization
7: for h ← 1, z do
8:   i ← 0
9:   for g ← 1, m do
10:    for r ← 1, k do
11:      if ℜ of I_jh^r ≥ β then
12:        Choose this interval to be part of the prototype b_i^h
13:        Go to next attribute g_{m+1}
14:      else
15:        Discard this interval and find another one (i.e., I_jh^{r+1})
16:      end if
17:    end for
18:  end for
19:  if (b_i^h ≠ ∅ ∀ g_jh) then i ← i + 1
20:  end if
21:  (Prototypes' composition):
22:  The selected branches from attribute g_1 to attribute g_m represent the induced prototypes for the class C_h
23: end for

In this study, however, a different technique is proposed to get these parameters from data. During the learning process, the necessary preferential information (a.k.a. prototypes) required to construct the classification model is extracted first; then this information is used for assigning the new cases (testing data) to the closest class. The PROAFTN parameters that are required to be elicited automatically from the training dataset are: {S_j^1(b_i^h), S_j^2(b_i^h), d_j^1(b_i^h), d_j^2(b_i^h)}. This study proposes discretization techniques to infer these parameters. Once these parameters are determined, the next stage is to build the classification model, which consists of a set of prototypes that represents each category. The obtained prototypes can then be used to classify the new instances.

Discretization techniques are utilized to obtain the intervals [S_j^1(b_i^h), S_j^2(b_i^h)] automatically for each attribute in the training dataset. The obtained intervals will then be adjusted to get the other fuzzy intervals [S_j^1(b_i^h) − d_j^1(b_i^h), S_j^2(b_i^h) + d_j^2(b_i^h)], which will be used subsequently for building the classification model.

Following the discretization phase is the model development stage. The proposed model uses an induction approach given in Algorithm 1. The tree is constructed in a top-down recursive manner, where each branch represents the generated intervals for each attribute. The prototypes can then be extracted from the decision tree to compose decision rules to be used for classifying the testing data.


Table 1. Dataset Description

Dataset               Instances  Attributes  Classes
Breast Cancer         699        11          2
Heart Disease         303        14          2
Haberman's Survival   306        3           2
Iris                  150        4           3
Mammographic Mass     961        4           2
Pima Diabetes         768        8           2
Vehicle               846        18          4
Vowel Context         990        11          10
Wine                  178        13          3
Yeast                 1484       8           10

4 Application of the Developed Algorithms

The proposed method was implemented in Java and applied to the 10 popular datasets described in Table 1. These datasets are available in the public domain of the University of California at Irvine (UCI) [5]. To compare our proposed approaches with the ID3 and C4.5 algorithms, we have used the open source platform Weka [11]. The comparisons are made on all datasets using stratified 10-fold cross validation.

The results generated by PROAFTN, ID3 and C4.5 (pruned and unpruned) on these datasets are shown in Table 2. The Friedman test [8] is used to compare the performance of PROAFTN against the other DT classifiers.

Table 2. ID3 and C4.5 versus PROAFTN in terms of classification accuracy

Dataset                  ID3     C4.5 (unpruned)  C4.5 (pruned)  PROAFTN
1  Breast Cancer         89.80   94.56            94.56          97.18
2  Heart Disease         74.10   74.81            76.70          79.04
3  Haberman's Survival   59.80   70.92            71.90          70.84
4  Iris                  90.00   96.00            96.00          96.57
5  Mammographic Mass     75.35   81.27            82.10          84.30
6  Pima Diabetes         58.33   71.22            71.48          72.19
7  Vehicle               60.77   72.93            72.58          76.36
8  Vowel Context         72.42   82.63            82.53          81.86
9  Wine                  80.50   91.55            91.55          97.33
10 Yeast                 41.71   54.78            56.00          57.00
   Avg                   70.28   79.07            79.54          81.27
   Rank                  4       3                2              1
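A hedged sketch of how such a comparison can be run on the accuracy columns of Table 2 using SciPy's Friedman test; this is an illustration, not necessarily the authors' exact statistical setup.

```python
# Illustration of comparing the four classifiers over the ten datasets of Table 2
# with the Friedman test [8]; not necessarily the authors' exact setup.
from scipy.stats import friedmanchisquare, rankdata
import numpy as np

id3   = [89.80, 74.10, 59.80, 90.00, 75.35, 58.33, 60.77, 72.42, 80.50, 41.71]
c45u  = [94.56, 74.81, 70.92, 96.00, 81.27, 71.22, 72.93, 82.63, 91.55, 54.78]
c45p  = [94.56, 76.70, 71.90, 96.00, 82.10, 71.48, 72.58, 82.53, 91.55, 56.00]
proaf = [97.18, 79.04, 70.84, 96.57, 84.30, 72.19, 76.36, 81.86, 97.33, 57.00]

stat, p = friedmanchisquare(id3, c45u, c45p, proaf)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Average rank per classifier (rank 1 = best accuracy on a dataset), in the spirit
# of the Rank row of Table 2.
ranks = np.mean([rankdata([-a, -b, -c, -d])
                 for a, b, c, d in zip(id3, c45u, c45p, proaf)], axis=0)
print(dict(zip(["ID3", "C4.5 unpruned", "C4.5 pruned", "PROAFTN"], ranks.round(2))))
```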

5 Discussion and Conclusions

The common advantages of the PROAFTN method and the DT could be summarized as: (i) reasoning about the results, therefore avoiding black box situations, and (ii)


simple to understand and to interpret. Furthermore, in this study PROAFTN was able to outperform ID3 and C4.5 in terms of classification accuracy.

To apply PROAFTN, some parameters should be determined before performing classification procedures. This study proposed the indirect technique by using discretization to establish these parameters from data.

It has been shown in this study that PROAFTN is a promising classification method to be applied in a decision-making paradigm and knowledge discovery process. Hence, we have a classification method that relatively outperforms DT and is also interpretable. More improvements could be made to enhance PROAFTN; these include (i) involving the weight factors in the learning process (the weights in this paper are assumed to be equal); and (ii) extending the comparative study to include various classification methods from different paradigms.

References

1. Al-Obeidat, F., Belacel, N., Carretero, J.A., Mahanti, P.: A Hybrid Metaheuristic Framework for Evolving the PROAFTN Classifier. Special Journal Issues of World Academy of Science, Engineering and Technology 64, 217–225 (2010)

2. Al-Obeidat, F., Belacel, N., Carretero, J.A., Mahanti, P.: Automatic Parameter Settings for the PROAFTN Classifier Using Hybrid Particle Swarm Optimization. In: Li, J. (ed.) AI 2010. LNCS, vol. 6464, pp. 184–195. Springer, Heidelberg (2010)

3. Al-Obeidat, F., Belacel, N., Carretero, J.A., Mahanti, P.: Differential Evolution for learning the classification method PROAFTN. Knowledge-Based Systems 23(5), 418–426 (2010)

4. Apte, C., Weiss, S.: Data mining with decision trees and decision rules. Future Generation Computer Systems (13) (1997)

5. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)

6. Belacel, N., Boulassel, M.: Multicriteria fuzzy assignment method: A useful tool to assist medical diagnosis. Artificial Intelligence in Medicine 21(1-3), 201–207 (2001)

7. Belacel, N., Vincke, P., Scheiff, M., Boulassel, M.: Acute leukemia diagnosis aid using multicriteria fuzzy assignment methodology. Computer Methods and Programs in Biomedicine 64(2), 145–151 (2001)

8. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

9. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)

10. Sobrado, F.J., Pikatza, J.M., Larburu, I.U., Garcia, J.J., de Ipina, D.: Towards a clinical practice guideline implementation for asthma treatment. In: Conejo, R., Urretavizcaya, M., Perez-de-la-Cruz, J.-L. (eds.) CAEPIA/TTIA 2003. LNCS (LNAI), vol. 3040, pp. 587–596. Springer, Heidelberg (2004)

11. Witten, H.: Data Mining: Practical Machine Learning Tools and Techniques. Kaufmann Series in Data Management Systems (2005)


Using a Heterogeneous Dataset for Emotion Analysis in Text

Soumaya Chaffar and Diana Inkpen

School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada

{schaffar,diana}@site.uottawa.ca

Abstract. In this paper, we adopt a supervised machine learning approach to recognize six basic emotions (anger, disgust, fear, happiness, sadness and surprise) using a heterogeneous emotion-annotated dataset which combines news headlines, fairy tales and blogs. For this purpose, different feature sets, such as bag-of-words and N-grams, were used. The Support Vector Machine (SVM) classifier performed significantly better than the other classifiers, and it generalized well on unseen examples.

Keywords: Affective Computing, Emotion Analysis in Text, Natural Language Processing, Text Mining.

1 Introduction

Nowadays, emotional aspects attract the attention of many research areas, not only in computer science, but also in psychology, healthcare, communication, etc. For instance, in healthcare some researchers are interested in how acquired diseases of the brain (e.g., Parkinson's) affect the ability to communicate emotions [10]. Moreover, with the emergence of Affective Computing in the late nineties [11], several researchers in different computer science areas, e.g., Natural Language Processing (NLP), Human Computer Interaction (HCI), etc., are more and more interested in emotions. Their aim is to develop machines that can detect users' emotions and express different kinds of emotion. The most natural way for a computer to automatically recognize a user's emotion is to detect his or her emotional state from the text entered in a blog, an online chat site, or another form of text.

Generally, two approaches (knowledge-based approaches and machine learning approaches) have been adopted for the automatic analysis of emotions in text, aiming to detect the writer's emotional state. The first approach consists of using linguistic models or prior knowledge to classify emotional text. The second one uses supervised learning algorithms to build models from annotated corpora. For sentiment analysis, machine learning techniques tend to obtain better results than lexical-based techniques, because they can adapt well to different domains [7]. In this paper, we adopted a machine learning approach for automatic emotion recognition from text. For this purpose, we used a heterogeneous dataset collected from blogs, fairy tales and news headlines.

The rest of the paper is organized as follows: Section 2 describes the datasets that we used for emotion detection in text. In Section 3, we describe the


methodology that we adopted for this purpose. Section 4 presents and discusses the results by comparing different machine learning techniques for detecting emotion in texts. Finally, Section 5 concludes the paper and outlines the future direction of our research.

2 Datasets

Five datasets have been used in the experiments reported in this paper. We describe each one in detail below.

2.1 Text Affect

This dataset consists of news headlines drawn from the most important newspapers, as well as from the Google News search engine [12], and it has two parts. The first one is developed for training and is composed of 250 annotated sentences. The second one is designed for testing and consists of 1,000 annotated sentences. Six emotions (anger, disgust, fear, joy, sadness and surprise) were used to annotate the sentences according to the degree of emotional load. For our experiments, we further use the most dominant emotion as the sentence label, instead of a vector of scores representing each emotion.

2.2 Neviarouskaya et al.’s Dataset

Two datasets produced by these authors were used in our experiments [8, 9]. In these datasets, ten labels were employed to annotate sentences by three annotators. These labels consist of the nine emotional categories defined by Izard [8] (anger, disgust, fear, guilt, interest, joy, sadness, shame, and surprise) and a neutral category. In our experiments, we considered only sentences on which two annotators or more completely agreed on the emotion category. We briefly describe in the following the two datasets.

• Dataset 1 This dataset includes 1000 sentences extracted from various stories in 13 diverse categories such as education, health, and wellness [8].

• Dataset 2 This dataset includes 700 sentences from a collection of diary-like blog posts [9].

2.3 Alm’s Dataset

This dataset includes annotated sentences from fairy tales [1]. For our experiments, we used only sentences with high annotation agreement, in other words sentences with four identical emotion labels. Five emotions (happy, fearful, sad, surprised and angry-disgusted) from Ekman's list of basic emotions were used for sentence annotation. Because of data sparsity and the related semantics of anger and disgust, these two emotions were merged together by the author of the dataset to represent one class.


2.4 Aman’s Dataset

This dataset consists of emotion-rich sentences collected from blogs [3]. These sentences were labelled with emotions by four annotators. We considered only sentences for which the annotators agreed on the emotion category. Ekman's basic emotions (happiness, sadness, anger, disgust, surprise, and fear), and also a neutral category, were used for sentence annotation.

3 Emotion Detection in Text

To find the best classification algorithm for emotion analysis in text, we compared three classification algorithms from the Weka software [14] with the BOW representation: J48 for decision trees, Naïve Bayes for the Bayesian classifier and the SMO implementation of SVM.

To ensure proper emotional classification of text, it is essential to choose the relevant feature sets to be considered. We describe in the following the ones that we employed in our experiments:

• Bag-Of-Words (BOW). Each sentence in the dataset was represented by a feature vector composed of Boolean attributes for each word that occurs in the sentence. If a word occurs in a given sentence, its corresponding attribute is set to 1; otherwise it is set to 0. BOW considers words as independent entities and does not take into consideration any semantic information from the text. However, it generally performs very well in text classification (a small sketch follows this list).

• N-grams. These are defined as sequences of words of length n. N-grams can be used for catching syntactic patterns in text and may include important text features such as negations, e.g., “not happy”. Negation is an important feature for the analysis of emotion in text because it can totally change the expressed emotion of a sentence. For instance, the sentence “I'm not happy” should be classified into the sadness category and not into happiness. For these reasons, some research studies in sentiment analysis claimed that N-gram features improve performance beyond the BOW approach [4].

• Lexical emotion features. This kind of feature represents the set of emotional words extracted from affective lexical repositories such as WordNetAffect [13]. We used in our experiments all the emotional words from WordNetAffect (WNA) associated with the six basic emotions.
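A small sketch of the Boolean bag-of-words representation described in the first item above; the vocabulary and sentences are illustrative, and stemming/stopword removal is omitted.

```python
# Small sketch of the Boolean bag-of-words representation described above
# (vocabulary and sentences are illustrative; stemming/stopword removal omitted).
def bow_vector(sentence, vocabulary):
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["happy", "not", "sad", "thank", "you"]
print(bow_vector("I'm not happy", vocabulary))        # [1, 1, 0, 0, 0]
print(bow_vector("Thank you so much", vocabulary))    # [0, 0, 0, 1, 1]
```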

4 Results and Discussion

For an exploratory purpose, we conducted several experiments using the labelled datasets for classifying emotional sentences.

4.1 Cross-Validation

First of all, it is important to prepare the data for proper emotional sentence classification. For classifying text into emotion categories, some words such as “I”


and “the” are clearly useless and should be removed. Moreover, in order to reduce the number of words in the BOW representation we used the LovinsStemmer stemming technique from the Weka tool [14], which replaces a word by its stem.

Another important way of reducing the number of words in the BOW representation is to replace negative short forms by negative long forms, e.g., “don't” is replaced by “do not”, “shouldn't” is replaced by “should not”, and so on. Applying this method of standardizing negative forms gave us better results for the BOW representation and allows negative expressions to be handled effectively in the N-grams. In the latter, the features include words, bigrams and trigrams.
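A minimal sketch of this normalization step; the contraction list below is illustrative, not the authors' exhaustive mapping.

```python
# Sketch of expanding negative short forms so that "not" survives as a token
# for the BOW and N-gram features; the mapping below is illustrative only.
import re

CONTRACTIONS = {"don't": "do not", "doesn't": "does not", "didn't": "did not",
                "shouldn't": "should not", "isn't": "is not", "can't": "can not"}

def expand_negations(text):
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_negations("He didn't come and I don't care"))
# -> "He did not come and I do not care"
```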

In the spirit of exploration, we trained supervised machine learning algorithms on Text Affect, Alm's dataset, Aman's dataset and the Global dataset (see Table 1). We also used the ZeroR classifier from Weka as a baseline; it classifies data into the most frequent class in the training set.

Table 1. Results for the training datasets using the accuracy rate (%)

                 Baseline  Naive Bayes  J48    SMO
Text Affect      31.6      39.6         32.8   39.6
Alm's Dataset    36.86     54.92        47.47  61.88
Aman's Dataset   68.47     73.02        71.43  81.16
Global Dataset   50.47     59.72        64.70  71.69

The results presented in Table 1 show that, in general, the SMO algorithm has the highest accuracy rate for each dataset. The use of the global dataset for training is much better because, on one hand, it contains heterogeneous data collected from blogs, fairy tales and news headlines, and on the other hand, the difference between the accuracy rates for the SMO algorithm and the baseline is higher compared to Aman's dataset. With the global dataset, SMO is statistically better than the next-best classifier (J48) with a confidence level of 95% based on the accuracy rate (according to a paired t-test).

Specifically, for Aman's dataset, we achieved an accuracy rate of 81.16%, which is better than the highest accuracy rate (73.89%) reported in [2]. Compared to their work, we used not only emotional words, but also non-emotional ones, as we believe that some sentences can express emotions through underlying meaning and depending on the context, e.g., “Thank you so much for everyone who came”. From the context, we can understand that this sentence expresses happiness, but it does not include any emotional word.

4.2 Supplied Test Set

Given the performance on the training datasets, one important issue that we need to consider in emotion analysis in text is the ability to generalize on unseen examples, since it depends on sentences’ context and the vocabulary used. Thus, we tested our model (trained on the global dataset) on the three testing datasets using three kinds of feature sets (BOW, N-grams, emotion words from WordNetAffect). The results are presented in Table 2 below.


Table 2. SMO results using different feature sets

Accuracy rate (%)
Test sets                         Feature sets   Baseline   SMO
Text Affect                       WNA            36.20      36.55
                                  BOW                       38.90
                                  BOW + WNA                 36.55
                                  N-grams                   40.30
Neviarouskaya et al.'s dataset 1  WNA            24.73      44.76
                                  BOW                       57.81
                                  BOW + WNA                 56.28
                                  N-grams                   49.47
Neviarouskaya et al.'s dataset 2  WNA            35.89      48.91
                                  BOW                       53.45
                                  BOW + WNA                 52.56
                                  N-grams                   50.69

As shown in Table 2, using the N-grams representation for Text Affect gives better results than the BOW representation, but the difference is not statistically significant. However, the use of the N-grams representation for Neviarouskaya et al.'s datasets decreased the accuracy rate compared to the BOW representation. As we notice from the table, using feature sets from WordNetAffect did not help in improving the accuracy rates of the SMO classifier.

5 Conclusion

In this paper, we presented a machine learning approach for automatic emotion recognition from text. For this purpose, we used a heterogeneous dataset collected from blogs, fairy tales and news headlines, and we compared it to using each homogeneous dataset separately as training data. Moreover, we showed that the SMO algorithm made a statistically significant improvement over other classification algorithms, and that it generalized well on unseen examples.

Acknowledgments

We address our thanks to the Natural Sciences and Engineering Research Council (NSERC) of Canada for supporting this research work.

References

1. Alm, C.O.: Affect in Text and Speech. PhD Dissertation. University of Illinois at Urbana-Champaign (2008)

2. Aman, S., Szpakowicz, S.: Identifying expressions of emotion in text. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 196–205. Springer, Heidelberg (2007)

3. Aman, S.: Identifying Expressions of Emotion in Text. Master's thesis, University of Ottawa, Ottawa, Canada (2007)

4. Arora, S., Mayfield, E., Penstein-Ros, C., Nyberg, E.: Sentiment Classification using Automatically Extracted Subgraph Features. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (2010)

5. Ekman, P., Friesen, W.V.: Facial action coding system: Investigator's guide. Consulting Psychologists Press, Palo Alto (1978)

6. Izard, C.E.: The Face of Emotion. Appleton-Century-Crofts, New York (1971)

7. Melville, P., Gryc, W., Lawrence, R.: Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification. In: Proc. of KDD, pp. 1275–1284 (2009)

8. Neviarouskaya, A., Prendinger, H., Ishizuka, M.: @AM: Textual Attitude Analysis Model. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, USA (2010)

9. Neviarouskaya, A., Prendinger, H., Ishizuka, M.: Compositionality Principle in Recognition of Fine-Grained Emotions from Text. In: Proceedings of the International Conference on Weblogs and Social Media. AAAI, San Jose (2009)

10. Paulmann, S., Pell, M.D.: Dynamic emotion processing in Parkinson's disease as a function of channel availability. Journal of Clinical and Experimental Neuropsychology 32(8), 822–835 (2010)

11. Picard, R.W.: Affective Computing. MIT Press, Cambridge (1997)

12. Strapparava, C., Mihalcea, R.: Semeval-2007 task 14: Affective text. In: Proceedings of the 4th International Workshop on SemEval 2007, Prague (2007)

13. Strapparava, C., Valitutti, A., Stock, O.: The affective weight of lexicon. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy (2006)

14. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)


Using Semantic Information to Answer Complex Questions

Yllias Chali, Sadid A. Hasan, and Kaisar Imam

University of Lethbridge, Lethbridge, AB, Canada

{chali,hasan,imam}@cs.uleth.ca

Abstract. In this paper, we propose the use of semantic information for the task of answering complex questions. We use the Extended String Subsequence Kernel (ESSK) to perform similarity measures between sentences in a graph-based random walk framework where semantic information is incorporated by exploiting the word senses. Experimental results on the DUC benchmark datasets prove the effectiveness of our approach.

Keywords: Complex Question Answering, Graph-based Random Walk Method, Extended String Subsequence Kernel.

1 Introduction

Resolving complex information needs is not possible by simply extracting named entities (persons, organizations, locations, dates, etc.) from documents. Complex questions often seek multiple different types of information simultaneously and do not presuppose that one single answer can meet all of their information needs. For example, with a factoid question like “What is the magnitude of the earthquake in Haiti?”, it can be safely assumed that the submitter of the question is looking for a number. However, the wider focus of a complex question like “How is Haiti affected by the earthquake?” suggests that the user may not have a single or well-defined information need and therefore may be amenable to receiving additional supporting information relevant to some (as yet) undefined informational goal [6]. This type of question requires inferencing and synthesizing information from multiple documents. This information synthesis in Natural Language Processing (NLP) can be seen as a kind of topic-oriented, informative multi-document summarization, where the goal is to produce a single text as a compressed version of a set of documents with a minimum loss of relevant information [1]. So, in this paper, given a complex question and a set of related data, we generate a summary in order to use it as an answer to the complex question. Graph-based methods (such as LexRank [4] and TextRank [10]) have been applied successfully to generic multi-document summarization. In topic-sensitive LexRank [11], a sentence is mapped to a vector in which each element represents the occurrence frequency (TF–IDF1) of a word. However, for a task like answering complex questions, which requires the use of more complex semantic analysis, approaches based only on TF–IDF are often inadequate to perform fine-level textual analysis. In this paper, we extensively study the impact of using semantic information in the random walk framework for answering complex questions. We apply the Extended String Subsequence Kernel (ESSK) [8] to include semantic information by incorporating disambiguated word senses. We run all experiments on the DUC2 2007 data. Evaluation results show the effectiveness of our approach.

1 The TF–IDF (term frequency-inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
2 Document Understanding Conference: http://duc.nist.gov/

2 Background and Proposed Framework

2.1 Graph-Based Random Walk

In [4], the concept of graph-based centrality is used to rank a set of sentences in producing generic multi-document summaries. A similarity graph is produced for the sentences in the document collection. In the graph, each node represents a sentence. The edges between nodes measure the cosine similarity between the respective pair of sentences, where each sentence is represented as a vector of term-specific weights. The term-specific weights in the sentence vectors are products of local and global parameters. The model is known as the term frequency-inverse document frequency (TF–IDF) model. To apply LexRank in a query-focused context, a topic-sensitive version of LexRank is proposed in [11]. The score of a sentence is determined by a mixture model of the relevance of the sentence to the query and the similarity of the sentence to other high-scoring sentences. The relevance of a sentence s to the question q is computed by:

rel(s|q) = Σ_{w∈q} log(tf_{w,s} + 1) × log(tf_{w,q} + 1) × idf_w

where tf_{w,s} and tf_{w,q} are the number of times w appears in s and q, respectively. A sentence that is similar to the high-scoring sentences in the cluster should also have a high score. For instance, if a sentence that gets a high score based on the question relevance model is likely to contain an answer to the question, then a related sentence, which may not be similar to the question itself, is also likely to contain an answer. This idea is captured by the following mixture model [11]:

p(s|q) = d × rel(s|q) / Σ_{z∈C} rel(z|q) + (1 − d) × Σ_{v∈C} [ sim(s,v) / Σ_{z∈C} sim(z,v) ] × p(v|q)    (1)
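A minimal sketch of computing the scores of Eq. (1) by power iteration. The mixture weight d and the rel/sim values below are illustrative toy numbers, not taken from the paper; in the actual model they come from the TF–IDF formulas above.

```python
# Sketch of iterating the mixture model of Eq. (1): p(s|q) mixes normalized query
# relevance with similarity-weighted scores of the other sentences.
# rel[s] and sim[s][v] would come from the TF-IDF formulas above; values are toy.
def lexrank_scores(rel, sim, d=0.7, iters=100):   # d is illustrative
    n = len(rel)
    rel_sum = sum(rel)
    col_sum = [sum(sim[z][v] for z in range(n)) for v in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [d * rel[s] / rel_sum
             + (1 - d) * sum(sim[s][v] / col_sum[v] * p[v] for v in range(n))
             for s in range(n)]
    return p

rel = [0.9, 0.2, 0.5]                       # rel(s|q) for three sentences
sim = [[1.0, 0.1, 0.4], [0.1, 1.0, 0.3], [0.4, 0.3, 1.0]]
print(lexrank_scores(rel, sim))             # highest score -> most salient sentence
```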

2.2 Our Approach

We claim that for a complex task like answering complex questions, where the relatedness between the query sentences and the document sentences is an important factor, the graph-based method of ranking sentences would perform better


if we could encode the semantic information instead of just the TF–IDF information in calculating the similarity between sentences. Thus, our mixture model for answering complex questions is:

p(s|q) = d × SEMSIM(s,q) + (1 − d) × Σ_{v∈C} SEMSIM(s,v) × p(v|q)    (2)

where SEMSIM(s,q) is the normalized semantic similarity between the query (q) and the document sentence (s), and C is the set of all sentences in the collection. In this paper, we encode semantic information using ESSK [7] and calculate the similarity between sentences. We reimplemented ESSK considering each word in a sentence as an “alphabet”, and the alternative as its disambiguated sense [3] that we find using our Word Sense Disambiguation (WSD) System [2]. We use a dictionary-based disambiguation approach assuming one sense per discourse. We use WordNet [5] to find the semantic relations among the words in a text. We assign weights to the semantic relations. Our WSD technique can be decomposed into two steps: (1) building a representation of all possible senses of the words and (2) disambiguating the words based on the highest score. We use an intermediate representation (disambiguation graph) to perform the WSD. We sum the weights of all edges leaving the nodes under their different senses. The one sense with the highest score is considered the most probable sense. In case of a tie between two or more senses, we select the sense that comes first in WordNet, since WordNet orders the senses of a word by decreasing order of their frequency.
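A minimal sketch of the sense-scoring step just described, operating over an already-built disambiguation graph; the graph, the edge weights and the assumption that the candidate list follows WordNet's sense order are illustrative placeholders.

```python
# Sketch of step (2) above: for each candidate sense, sum the weights of edges
# leaving its node in the disambiguation graph and keep the highest-scoring sense.
# Graph and weights are illustrative; ties fall back to the first sense, which is
# assumed to be WordNet's most frequent sense.
def pick_sense(candidate_senses, edges):
    """edges: dict sense -> list of (other_node, weight) built in step (1)."""
    def score(sense):
        return sum(w for _, w in edges.get(sense, []))
    return max(candidate_senses, key=lambda s: (score(s), -candidate_senses.index(s)))

edges = {"bank#1": [("money#1", 0.8), ("loan#1", 0.6)],   # financial institution
         "bank#2": [("river#1", 0.4)]}                    # river bank
print(pick_sense(["bank#1", "bank#2"], edges))            # -> "bank#1"
```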

3 Evaluation and Analysis

3.1 Task Definition

In this research, we consider the main task of DUC 2007 to run our experiments. The task was: “Given a complex question (topic description) and a collection of relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic”. We choose 35 topics randomly from the given dataset and generate summaries for each of them according to the task guidelines.

3.2 Automatic Evaluation

We carried out the automatic evaluation of our summaries using the ROUGE [9] toolkit (i.e. ROUGE-1.5.5 in this study). The comparison between the TF–IDF system and the ESSK system is presented in Table 1. To compare our systems' performance with the state-of-the-art systems, we also list the ROUGE scores of the NIST baseline system (defined in DUC-2007) and the best system in DUC-2007. The NIST baseline system generated the summaries by returning all the leading sentences (up to 250 words) in the 〈TEXT〉 field of the most recent document(s). Analysis of the results shows that the ESSK system improves the ROUGE-1 and ROUGE-SU scores over the TF–IDF system by 0.26% and 1.48%, respectively, whereas the ESSK system performs close to the best system besides beating the baseline system by a considerable margin.


Table 1. ROUGE F-scores for all systems

Systems        ROUGE-1  ROUGE-SU
TF–IDF         0.379    0.135
ESSK           0.380    0.137
NIST Baseline  0.334    0.112
Best System    0.438    0.174

3.3 Manual Evaluation

Even if the ROUGE scores show significant improvement, it is possible to make bad summaries that get state-of-the-art ROUGE scores [12]. So, we conduct an extensive manual evaluation in order to analyze the effectiveness of our approach. Each summary is manually evaluated for a Pyramid-based evaluation of contents, and a user evaluation is also conducted to get an assessment of readability (i.e. fluency) and overall responsiveness according to the TAC 2010 summary evaluation guidelines3.

Content Evaluation. In the DUC 2007 main task, 23 topics were selected for the optional community-based pyramid evaluation. Volunteers from 16 different sites created pyramids and annotated the peer summaries for the DUC main task using the given guidelines4; 8 of these sites created the pyramids. We used these pyramids to annotate 5 randomly chosen peer summaries for each of our systems to compute the modified pyramid scores. Table 2 shows the modified pyramid scores of all the systems, including the NIST baseline system and the best system of DUC-2007. From these results we see that all the systems perform better than the baseline system, and that ESSK performs the best.

Table 2. Modified pyramid scores for all systems

Systems        Modified Pyramid Scores
NIST Baseline  0.139
Best System    0.349
TF–IDF         0.512
ESSK           0.547

User Evaluation. Some university graduate students judged all the systemgenerated summaries (70 summaries in total) for readability (fluency) and over-all responsiveness. The readability score reflects the fluency and readability ofthe summary (independently of whether it contains any relevant information)and is based on factors such as the summary’s grammaticality, non-redundancy,referential clarity, focus, and structure and coherence. The overall responsiveness3 http://www.nist.gov/tac/2010/Summarization/Guided-Summ.2010.guidelines.

html4 http://www1.cs.columbia.edu/~becky/DUC2006/2006-pyramid-guidelines.html


score is based on both content (coverage of all required aspects) and readabil-ity. The readability and overall responsiveness is each judged on a 5-point scalebetween 1 (very poor) and 5 (very good). Table 3 presents the average readabil-ity and overall responsive scores of all the systems. Again, the NIST–generatedbaseline system’s scores and the best DUC-2007 system’s scores are given formeaningful comparison. The results show that the ESSK system improves thereadability and overall responsiveness scores over the TF–IDF system by 30.61%,and 42.17%, respectively while it performs closely to the best system’s scores be-sides beating the baseline system’s overall responsiveness score by a significantmargin.

Table 3. Readability and overall responsiveness scores for all systems

Systems Readability Overall Responsiveness

NIST Baseline 4.24 1.80

Best System 4.11 3.40

TF–IDF 2.45 2.30

ESSK 3.20 3.27

4 Conclusion

In this paper, we used semantic information and showed its impact in measuring the similarity between the sentences in the random walk framework for answering complex questions. We used the Extended String Subsequence Kernel (ESSK) to include semantic information by applying disambiguated word senses. We evaluated the systems automatically using ROUGE and reported an extensive manual evaluation to further analyze the performance of the systems. Comparisons with the state-of-the-art systems showed the effectiveness of our proposed approach.

Acknowledgments

The research reported in this paper was supported by a Natural Sciences and Engineering Research Council (NSERC) of Canada discovery grant and by the University of Lethbridge.

References

1. Amigo, E., Gonzalo, J., Peinado, V., Peinado, A., Verdejo, F.: An Empirical Study of Information Synthesis Tasks. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp. 207–214 (2004)

2. Chali, Y., Joty, S.R.: Word Sense Disambiguation Using Lexical Cohesion. In: Proceedings of the 4th International Conference on Semantic Evaluations, pp. 476–479. ACL, Prague (2007)

3. Chali, Y., Hasan, S.A., Joty, S.R.: Improving Graph-based Random Walks for Complex Question Answering Using Syntactic, Shallow Semantic and Extended String Subsequence Kernels. Information Processing and Management (2010) (in Press, Corrected Proof), http://www.sciencedirect.com/science/article/B6VC8-51H5SB4-1/2/4f5355410ba21d61d3ad9f0ec881e740

4. Erkan, G., Radev, D.R.: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479 (2004)

5. Fellbaum, C.: WordNet - An Electronic Lexical Database. MIT Press, Cambridge (1998)

6. Harabagiu, S., Lacatusu, F., Hickl, A.: Answering complex questions with random walk models. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 220–227. ACM, New York (2006)

7. Hirao, T., Suzuki, J., Isozaki, H., Maeda, E.: Dependency-based Sentence Alignment for Multiple Document Summarization. In: Proceedings of COLING 2004, pp. 446–452. COLING, Geneva (2004)

8. Hirao, T., Suzuki, J., Isozaki, H., Maeda, E.: NTT's Multiple Document Summarization System for DUC2003. In: Proceedings of the Document Understanding Conference (2003)

9. Lin, C.Y.: ROUGE: A Package for Automatic Evaluation of Summaries. In: Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of Association for Computational Linguistics, Barcelona, Spain, pp. 74–81 (2004)

10. Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Texts. In: Proceedings of the Conference of Empirical Methods in Natural Language Processing, Barcelona, Spain (2004)

11. Otterbacher, J., Erkan, G., Radev, D.R.: Using Random Walks for Question-focused Sentence Retrieval. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, Canada, pp. 915–922 (2005)

12. Sjobergh, J.: Older Versions of the ROUGEeval Summarization Evaluation System Were Easier to Fool. Information Processing and Management 43, 1500–1505 (2007)


Automatic Semantic Web Annotation of Named Entities

Eric Charton, Michel Gagnon, and Benoit Ozell

Ecole Polytechnique de Montreal, Montreal, H3T 1J4, Quebec, Canada

{eric.charton,michel.gagnon,benoit.ozell}@polymtl.ca

Abstract. This paper describes a method to perform automated semantic annotation of named entities contained in large corpora. The semantic annotation is made in the context of the Semantic Web. The method is based on an algorithm that compares the set of words that appear before and after the named entity with the content of Wikipedia articles, and identifies the most relevant one by means of a similarity measure. It then uses the link that exists between the selected Wikipedia entry and the corresponding RDF description in the Linked Data project to establish a connection between the named entity and some URI in the Semantic Web. We present our system, discuss its architecture, and describe an algorithm dedicated to ontological disambiguation of named entities contained in large-scale corpora. We evaluate the algorithm, and present our results.

1 Introduction

The Semantic Web is a web of data. This web of data is constructed with documents that are, unlike HTML files, RDF1 assertions establishing links between facts and things. RDF documents, like HTML documents, are accessible through URIs2. A set of best practices for publishing and connecting RDF semantic data on the Web is referred to by the term Linked Data. An increasing number of data providers have delivered Linked Data documents over the last three years, leading to the creation of a global data space containing billions of RDF assertions. For the usability of the Semantic Web, a new breed of smarter applications must become available. To encourage the emergence of such innovative software, we need NLP solutions that can effectively establish a link between documents and Semantic Web data. In this paper, we propose a general schema of automatic annotation, using disambiguation resources and algorithms, to establish relations between named entities in a text and the ontological standardized semantic content of the Linked Data network.

1 Resource Description Framework (RDF) is an official W3C Semantic Web specification for metadata models.

2 Uniform Resource Identifier (URI) is the name of the string of characters used to identify a resource on the Internet.


This article is structured as follows: section 2 investigates the annotation task problem from a broad perspective and describes the features of the semantic annotation task in the context of the Semantic Web; section 3 describes the proposed system architecture and its implementation. In section 4 we present the experiment and corpora on which the evaluation has been done. Finally, section 5 comments on the results obtained by our system. We conclude and give some perspectives in section 6.

2 Problem Description

The basic principle of annotation is to add information to a source text. From a computing perspective, annotations can take various forms, but their function is always the same: to introduce complementary information and knowledge into a document. Two main kinds of information can be attributed to a word or a group of words by an annotation process: a fixed class label defined by a taxonomy standard, or a link to some external knowledge.

A class description can be assigned to a word or a group of words called a Named Entity (NE). By class, we mean a label describing the nature of the object expressed by the words. This object can be, for example, a person, an organization, a product, or a location. Attributing such a class is the Named Entity Recognition (NER) task, widely investigated in [2,1,11]. The granularity of classes contained in a NE taxonomy can be highly variable [13], but strictly speaking, the NER task is a classification task whose purpose is to assign a unique class label to a sequence of words. The label will be, for example, PERS to describe a person, or ORG for an organization, and so on. This means that the NER task is unable to introduce any more complementary information into the text. It is possible to introduce an upper level of granularity in the NE taxonomy model (for example, we can distinguish two kinds of places, LOC.ADMI for a city and LOC.GEO for a national park), but with strong limitations. Thus, there is no way to introduce data like the birth date of a person or the ground surface of a city.

To achieve this task of associating properties with NEs, an upper level of annotation is needed, expressed by a relation between the NE and external knowledge. It consists in assigning to an identified NE a link to a structured external knowledge base, like the one delivered on the Semantic Web. This is the Semantic Annotation (SA) task, previously investigated in [10,7].

2.1 Entity Labeling versus Semantic Labeling

The example in Figure 2 and Table 2 illustrates the difference between SA and NER and its implication for knowledge management. Let us consider a sample text to annotate, as presented in Table 1.

The first level of ambiguity encountered by the NER task is related to word polysemy. To illustrate this, we show in Figure 1 the numerous possible concept-class values available for the word Paris. The main objective of the NER task is to manage this first level of disambiguation, generally through statistical


Table 1. A sample document to label, with various named entities contained in it

Paris is a town in Oneida County, New York, USA. The town is in the southeast part of the county and is south of Utica. The population was 4,609 at the 2000 census. The town was named after an early benefactor, Colonel Isaac Paris.

Fig. 1. Ambiguity of a class label for a named entity like Paris. It can be a city, an asteroid, a movie, a music album or a boat.

methods ([2], [8]). The NER task results in a text where NEs are labeled by classes, as presented in Table 2. But despite the NE labeling process, we can show that a level of ambiguity is still present. Paris is correctly annotated with the LOC (locality) class label, but this class is not sufficient to determine precisely which locality it is, given the numerous existing cities that are also named Paris (Figure 2).

Table 2. Sample of words with standard NE labels in the document

Paris{LOC} is a town in Oneida County{LOC}, New York{LOC}, USA{LOC}. The town is in the southeast part of the county and is south of Utica{LOC}. The population was 4,609{AMOUNT} at the 2000 census{DATE}. The town was named after an early benefactor, Colonel Isaac Paris{PERS}.

Fig. 2. Ambiguity of entity for a same NE class label: the word Paris, even with its Location class, is still ambiguous (the figure shows Paris as a LOC linked to France, Kentucky, Idaho, Ontario, Maine and Tennessee)

2.2 Previous Semantic Labeling Propositions

The task of SA has received increasing attention in the last few years. A general survey of semantic annotation techniques has been proposed by


([15]). None of the described systems has been integrated in the general schema of the Semantic Web. They are all related to specific and proprietary or non-standard ontological representations. The KIM platform ([10]) provides a two-step labeling process including a NER step to attribute NE labels to words before establishing the semantic link. The semantic descriptions of entities and the relations between them are kept in a knowledge base encoded in the KIM ontology and reside in the same “semantic repository”. SemTag ([5]) is another example of a tool that focuses only on automatic mark-up. It is based on IBM's text analysis platform Seeker and uses similarity functions to recognize entities that occur in contexts similar to marked-up examples. The key problem with large-scale automatic mark-up is ambiguity. A Taxonomy Based Disambiguation (TBD) algorithm is proposed to tackle this problem. SemTag can be viewed as a bootstrapping solution to get a semantically tagged collection off the ground. Recently, ([9]) presented Moat, a proposition to bridge the gap between tagging and Linked Data. Its goal is to provide a simple and collaborative way to annotate content thanks to existing URIs with as little effort as possible and by keeping free-tagging habits. However, Moat does not provide an automatic generic solution to establish a link between text and an entry point in the Linked Data Network.

2.3 The Word Sense Disambiguation Problem

The problem with those previous propositions is related to Word Sense Disambiguation (WSD). WSD consists in determining which sense of a word is used when it appears in a particular context. KIM and SemTag, when they establish a link between a labeled NE and an ontology instance, need a complementary knowledge resource to deal with homonymic NEs of the same class. For the NER task, this resource can be generic and generative: a labeled corpus used to train a statistical labeling tool (CRF, SVM, HMM). This statistical NER tool will be able to infer a class proposition through its training from a limited set of contexts. But this generative approach is not applicable to the SA task, as each NE to link to a semantic description has a specific word context, a marker of its exact identity. Many propositions have been made to solve this problem. Recently, ([16]) suggested using LSA3 techniques mixed with a cosine similarity measure to disambiguate terms with a view to establishing a semantic link. The KIM system ([10]) re-uses the Gate platform and its NLP components and applies rules to establish a disambiguated link. SemTag uses two kinds of similarity functions: Bayesian and cosine. But the remaining problem for all those propositions is the lack of access to an exhaustive and wide knowledge of contextual information related to the identity of the NE. For our previous Paris example, those systems could establish a disambiguated link between any Paris NE and its exact Linked Data representation only if they have access to

3 Latent Semantic Analysis is a technique for analyzing relationships between a set of documents and terms, using a term-document matrix built from Singular Value Decomposition.


[Figure 3 content: example metadata containers E]

Surface forms (E.r)           Words: TF.IDF (E.c)                    Linked Data (E.rdf)
Paris, Paris New York         York:69, Cassvile:58, Oneida:52 ...    http://dbpedia.org/data/Paris,_New_York.rdf
Paris, Paname, Lutece         France:342, Seine:210, Eiffel:53 ...   http://dbpedia.org/data/Paris.rdf
Paris                         Kentucky:140, Varden:53, Bourbon:37    http://dbpedia.org/data/Paris,_Kentucky.rdf

The metadata containers (E) feed the Semantic Disambiguation Algorithm (SDA), which applies a cosine similarity measure (Words.TF.IDF, {town, Oneida, County, New York, ...}); the best cosine score determines the semantic link into Linked Data through the Linked Data Interface (LDI).

Fig. 3. Architecture of the system with metadata used as Linked Data Interface (LDI) and Semantic Disambiguation Algorithm (SDA)

Unfortunately, such knowledge is present neither in the RDF triples of the Linked Data network nor in standard exhaustive ontologies like DBpedia.

3 Our Proposition: A Linked Data Interface

To solve this problem, we propose a SA system that uses an intermediate structure to determine the exact semantic relation between a NE and its ontological representation on the Linked Data network. In this structure, called the Linked Data Interface (LDI), there is an abstract representation for every Wikipedia article. Each one of these abstract representations contains a pointer to the Linked Data document that provides an RDF description of the entity. The disambiguation task is achieved by identifying the item in the LDI that is most similar to the context of the named entity (the context is represented by the set of words that appear before and after the NE). This algorithm is called the Semantic Disambiguation Algorithm (SDA). The architecture of this semantic labeling system is presented in Figure 3.

3.1 The Linked Data Interface (LDI)

To each entity that is described by an entry in Wikipedia, we associate some metadata, composed of three elements: (i) a set of surface forms, (ii) the set of words that are contained in the entity description, where each word is accompanied by its tf.idf weight ([12]), and (iii) a URI that points to some entity in the Linked Data network.


[Figure 4 content: graph structure extracted from Wikipedia metadata; the surface forms collected into the set E.r include Mirage, Mirage Jet, Mirage aircraft, Dassault Mirage F1C and Dassault Mirage.]

Fig. 4. All possible surface forms are collected from multiple linguistic editions of Wikipedia and transferred into a set E.r. Here two complementary surface forms for a plane name are collected from the German edition.

The tf.idf value associated with a word is its frequency in the Wikipedia document, multiplied by a factor that is inversely proportional to the number of Wikipedia documents in which the word occurs (the exact formula is given below).

The set of surface forms for an entity is obtained by taking every Wikipedia entry that points to it by a redirection link, every entry that corresponds to its description in another language and, finally, in every disambiguation page that points to this entity, the term in the page that is associated to this pointer. As an example, the surface form set for the NE Paris (France) contains 39 elements (eg. Ville Lumiere, Ville de Paris, Paname, Capitale de la France, Departement de Paris).

In our application, the surface forms are collected from five linguistic editions of Wikipedia (English, German, Italian, Spanish and French). We use such a cross-linguistic resource because, in some cases, a surface form may appear only in a language edition of Wikipedia that is not the one of the source text. A good example of this is given in Figure 4: the surface form Dassault Mirage is not available in the English Wikipedia but can be collected from the German edition of Wikipedia.

The structure of Wikipedia and the sequential process used to build metadata like ours have been described previously ([3,4]).

We will now define the LDI more formally.

Let C be the Wikipedia corpus. C is partitioned into subsets Cl representing the linguistic editions of Wikipedia (i.e. fr.wikipedia.org or en.wikipedia.org, which are independent language sub-corpora of the whole Wikipedia). Let D be a Wikipedia article. Each D ∈ Cl is represented by a triple (D.t, D.c, D.l), where D.t is the title of the article, made of a unique word sequence, D.c is a collection of terms w contained in the article, and D.l is a set of links between D and other Wikipedia pages of C. Any link in D.l can be an internal redirection inside Cl (a link from a redirection page or a disambiguation page) or a link to another document in C (in this case, a link to the same article in another language).


The LDI may now be described in the following way. Let E ∈ LDI be a metadata container that corresponds to some D ∈ C. E is a tuple (E.t, E.c, E.r, E.rdf). We consider that E and D are in relation if and only if E.t = D.t. We say that E represents D, which will be noted E → D. E.c contains pairs built with all words w of D.c associated with their tf.idf value calculated from Cl.
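As an illustration only (not part of the paper), the container tuple and its relation to an article D could be represented as follows; all class and field names are ours:

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class WikipediaArticle:
    """A Wikipedia article D = (D.t, D.c, D.l)."""
    title: str             # D.t, a unique word sequence
    terms: Dict[str, int]  # D.c, term -> raw count in the article
    links: Set[str]        # D.l, titles of linked pages (redirections,
                           # disambiguation pages, interlanguage links)

@dataclass
class MetadataContainer:
    """A metadata container E = (E.t, E.c, E.r, E.rdf) of the LDI."""
    title: str                                                       # E.t, equal to D.t
    weighted_words: Dict[str, float] = field(default_factory=dict)   # E.c, word -> tf.idf
    surface_forms: Set[str] = field(default_factory=set)             # E.r
    rdf_uris: Set[str] = field(default_factory=set)                  # E.rdf, Linked Data entry points

    def represents(self, article: WikipediaArticle) -> bool:
        # E represents D (noted E -> D) iff E.t = D.t
        return self.title == article.title
```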

The tf.idf weight for a term wi that appears in document dj is the product of the two values tf and idf, which are calculated as shown in Equations 1 and 2. In the definition of idf, the denominator |{d : d ∈ Cl, wi ∈ d}| is the number of documents in which the term wi appears. tf is expressed by Equation 2, where wi,j is the number of occurrences of the term wi in document dj, and the denominator is the sum of the numbers of occurrences of all terms in document dj.

idf_i = \log \frac{|C_l|}{|\{d : d \in C_l, w_i \in d\}|}   (1)

tf_{i,j} = \frac{w_{i,j}}{\sum_k w_{k,j}}   (2)
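A minimal sketch of how Equations 1 and 2 could be computed over one language edition Cl, assuming each document is available as a list of tokens:

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each document d_j, a map term -> tf_{i,j} * idf_i (Eqs. 1 and 2)."""
    n_docs = len(documents)
    # document frequency: |{d : d in Cl, w_i in d}|
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weighted = []
    for doc in documents:
        counts = Counter(doc)            # w_{i,j}
        total = sum(counts.values())     # sum_k w_{k,j}
        weights = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            weights[term] = tf * idf
        weighted.append(weights)
    return weighted

# Example: weights = tf_idf([["paris", "france", "seine"], ["paris", "kentucky"]])
```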

The E.c part of a metadata container must be trained for each language. In our LDI, the following three languages have been considered: English, French and Spanish. The collected representations can potentially provide semantic links for 745k different persons or 305k organizations in English, and for 232k persons and 183k products in French.

The set of all surface forms related to a document D is built by taking all the titles of special documents (i.e. redirection or disambiguation pages) targeted by the links contained in D.l; this set is stored in E.r.
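A hedged sketch of how E.r could be assembled, assuming the redirect, disambiguation and interlanguage link structures have already been extracted from the Wikipedia dumps (the three input maps are hypothetical):

```python
from typing import Dict, Iterable, Set

def build_surface_forms(title: str,
                        redirects_to: Dict[str, str],
                        disambiguation_links: Dict[str, Iterable[str]],
                        language_links: Dict[str, Iterable[str]]) -> Set[str]:
    """Collect the surface form set E.r for the entity described by `title`.

    Assumed pre-extracted structures:
      redirects_to:          redirect page title -> target article title
      disambiguation_links:  disambiguation page term -> article titles it points to
      language_links:        article title -> titles of the same article in other editions
    """
    forms = {title}
    # every redirection page that points to the entity
    forms.update(src for src, dst in redirects_to.items() if dst == title)
    # every disambiguation-page term associated with a pointer to the entity
    forms.update(term for term, targets in disambiguation_links.items() if title in targets)
    # the titles of the corresponding articles in other linguistic editions
    forms.update(language_links.get(title, ()))
    return forms
```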

The E.rdf part of the metadata container must contain a link to one or more entry points of the Linked Data network. An entry point is a URI pointing to an RDF document that describes the entity represented by E. As an example, http://dbpedia.org/data/Spain.rdf is the entry point of the DBpedia instance related to Spain inside the Linked Data network. The special interest of DBpedia for our application is that the ontology is a mirror of Wikipedia: any English article of Wikipedia (and most French and Spanish ones) is supposed to have an entry in DBpedia. DBpedia also delivers correspondence files between other entry points in the Linked Data network and Wikipedia records4: for example, another entry point for Spain in the Linked Data network is in the CIA Factbook RDF collection5. We use those table files to create E.rdf. For our experiments, we included in E.rdf only the link to the DBpedia entry point in the Linked Data network.

4 See the files named Links to Wikipedia articles on http://wiki.DBpedia.org/Downloads34.
5 http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain

3.2 Semantic Disambiguation Algorithm (SDA)

To identify a named entity, we compare it with every metadata container Ei ∈ LDI. Each Ei that contains at least one surface form corresponding to the named entity surface form in the text is added to the candidate set. Now, for each candidate, its set of words Ei.c is used to compute a similarity measure with the set of words that forms the context of the named entity in the text. In our application, the context consists of the n words that come immediately before and after the NE. The tf.idf weights are used in this similarity measure. The Ei that gets the highest similarity score is selected, and its URI pointer Ei.rdf is used to identify the entity in Linked Data that corresponds to the NE in the text.

Regarding the candidate set CS that has been found for the NE to be disambiguated, three situations can occur:

1. CS = ∅: there is no metadata container for the NE.
2. |CS| = 1: there is only one metadata container available to establish a semantic link between the NE and an entity in the Linked Data network.
3. |CS| > 1: there is more than one possibly relevant metadata container, among which at most one must be selected.

Case 1 is trivial (no semantic link available). For cases 2 and 3, a cosine similarity measure (see Equation 3) is applied to the NE context S.w and to E.c_{tf.idf} for every metadata container E ∈ CS. As usual, the vectors are formed by considering each word as a dimension. If a word appears in the NE context, we put the value 1 at its position in the vector space, and 0 otherwise. For E.c, we put the tf.idf values in the vector. The similarity values are used to rank every E ∈ CS.

cosinus(S, E) = \frac{S.w \cdot E.c_{tf.idf}}{\|S.w\| \, \|E.c_{tf.idf}\|}   (3)

Finally, the best candidate EΩ according to the similarity ranking is chosen if its similarity value is higher than the threshold value α, as described in Equation 4. The algorithm derived from this method is presented in Table 3.

E_\omega = \arg\max_{E_i \in CS} \; cosinus(S, E_i)

E_\Omega = \begin{cases} \emptyset & \text{if } score(E_\omega) \le \alpha \\ E_\omega & \text{otherwise} \end{cases}   (4)
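The disambiguation step of Equations 3 and 4 (and of the SDA pseudocode in Table 3 below) can be sketched as follows, reusing the MetadataContainer structure sketched in Section 3.1; alpha is the empirically chosen threshold:

```python
import math
from typing import Dict, Iterable, List, Optional

def cosine(context_words: Iterable[str], weighted_words: Dict[str, float]) -> float:
    """cosinus(S, E) of Eq. 3: binary context vector against the tf.idf vector E.c."""
    context = set(context_words)
    dot = sum(w for word, w in weighted_words.items() if word in context)
    norm_s = math.sqrt(len(context))
    norm_e = math.sqrt(sum(w * w for w in weighted_words.values()))
    if norm_s == 0.0 or norm_e == 0.0:
        return 0.0
    return dot / (norm_s * norm_e)

def sda(surface_form: str, context_words: List[str],
        ldi: List["MetadataContainer"], alpha: float) -> Optional[str]:
    """Return a Linked Data URI for the NE, or None if no confident link exists (Eq. 4)."""
    candidates = [e for e in ldi if surface_form in e.surface_forms]
    if not candidates:
        return None                                  # case 1: empty candidate set
    scored = [(cosine(context_words, e.weighted_words), e) for e in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score <= alpha:
        return None                                  # below the confidence threshold alpha
    return next(iter(best.rdf_uris), None)           # the E.rdf entry point of the winner
```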

4 Experiments

There is no standard evaluation schema for applications like the one described in this paper. There are many metrics (precision, recall, word error rates) and annotated corpora for the NER task, but none of them includes a gold standard for Semantic Web annotation. We evaluated our system with an improved standard NER test corpus: we associate with each NE of such a corpus a standard Linked Data URI coming from DBpedia. This proposal has the following advantage.


Table 3. Pseudo code of the Semantic Disambiguation Algorithm (SDA)

SDA Function: rdf = SDA(sf, S[])

Input:
  sf   = surface form of the detected NE to link
  S[]  = contextual words of the NE
Output:
  rdf  = URI link between the NE and a Linked Data entry point
Local variables:
  E[]  = metadata containers
  CS[] = candidate set of metadata containers
  α    = threshold value

Algorithm:
(1)   CS[] = search all E[] where E[].r matches sf
(2)   if (CS[] == null) return null
(3)   for x = all CS[]
(3.1)   CS[x].score = cosinus(CS[x].c_tf.idf[], S[])
(4)   order CS[] by descending CS[].score
(5)   if (CS[0].score > α) return CS[0].rdf
(5.1) else return null

DBpedia is now one of the best-known and most accurate RDF resources. Because of this, DBpedia has evolved into a reference interlinking resource6 for the Linked Data semantic network7. The NER corpora used to build the semantically annotated corpora are described below.

Test Corpora

The base corpus for the French semantic annotation evaluation is derived from the French ESTER 2 corpus ([6]). The named entity (NE) detection task on French in ESTER 2 was proposed as a standard one. The original NE tag set consists of 7 main categories (persons, locations, organizations, human products, amounts, time and functions) and 38 sub-categories. We only use the PERS, ORG, LOC, and PROD tags for our experiments8. The English evaluation corpus is the Wall Street Journal (WSJ) version from the CoNLL Shared Task 2008 ([14]). NE categories of the WSJ corpus include: Person, Organization, Location, Geo-Political Entities, Facility, Money, Percent, Time and Date, based on the definitions of these categories in the MUC and ACE tasks.

4.1 Gold Standard Annotation Method

To build the test corpora, we used a semi-automatic method. We first applied our semantic annotator and then manually removed or corrected the wrong semantic links.

6 See http://wiki.dbpedia.org/Interlinking
7 DBpedia is now an RDF interlinking resource for the CIA World Fact Book, US Census, Wikicompany, RDF WordNet and more.
8 This selection is made because cardinal (amounts) and temporal values (time) are specific entities involving different semantic content than named entities.


Table 4. Not every NE contained in a text document necessarily has a corresponding representation in the LDI. This table shows the coverage of the metadata contained in the LDI with respect to the NEs contained in the French ESTER 2 test corpus and in the English WSJ CoNLL 2008 test corpus.

            ESTER 2 2009 (French)                          WSJ CoNLL 2008 (English)
Labels      Entities in    Equivalent      Coverage        Entities in    Equivalent      Coverage
            test corpus    entities in LDI (%)             test corpus    entities in LDI (%)
PERS        1096           483             44%             612            380             62%
ORG         1204           764             63%             1698           1129            66%
LOC         1218           1017            83%             739            709             96%
PROD/GPE    59             23              39%             61             60              98%
Total       3577           2287            64%             3110           2278            73%

For some NEs, the Linked Data Interface does not provide semantic links. This is the problem of coverage, managed by the use of the α threshold value. The level of coverage for the two test corpora in French and English is given in Table 4.

5 Results

To evaluate the performance of SA, we applied it to the evaluation corpora with only Word, POS and NE annotations. Two experiments have been done. First, we verify the annotation process from the point of view of disambiguation quality: we apply SA only to NEs which have corresponding entries in the LDI. This means we do not consider uncovered NEs (as presented in Table 4) in the labeling experiment; we only try to label the 2287 French and 2278 English covered NEs. Those results are given in the [no α] section of Table 5. Then, we verify the capacity of SA to annotate a text with potentially no LDI entry for a given NE. This means we try to label the full set of NEs (3577 in French and 3110 in English) and to assign the NORDF label when no entry is available in the LDI. We use the threshold value9 as a confidence score to assign as annotation either a URI link or the NORDF label. Those results are given in the [α] section of Table 5. We used the recall measure (as in Equation 5) to evaluate the proportion of correctly annotated NEs according to the Gold Standard.

Recall = \frac{\text{total of correctly annotated NEs}}{\text{total number of NEs}}   (5)
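As a small illustration (not from the paper), the recall of Equation 5 could be computed as follows, assuming gold and predicted annotations keyed by hypothetical NE occurrence identifiers:

```python
from typing import Dict, Optional

def recall(gold: Dict[int, str], predicted: Dict[int, Optional[str]]) -> float:
    """Fraction of NEs whose predicted annotation (a URI or the NORDF label)
    matches the gold standard annotation (Eq. 5)."""
    correct = sum(1 for ne_id, uri in gold.items() if predicted.get(ne_id) == uri)
    return correct / len(gold)
```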

Our results indicate a good level of performance for our system in both languages, with a recall over 0.90 in French and 0.86 in English. The lower performance on the English task can be explained by the structural difference of the metadata in the two languages: nearly 0.7 million metadata containers are available in French and more than 3 million in English (according to each local Wikipedia's size).

9 The α value is a cosine threshold selected empirically; for this experiment it is set to 0.10 in French and 0.25 in English.


Table 5. Results of the semantic labeler applied to the ESTER 2 and WSJ CoNLL 2008 test corpora

            French tests                               English tests
Label       NE [no α]  Recall   NE [α]  Recall         NE [no α]  Recall   NE [α]  Recall
PERS        483        0.96     1096    0.91           380        0.93     612     0.94
ORG         764        0.91     1204    0.90           1129       0.85     1608    0.86
LOC         1017       0.94     1218    0.92           709        0.84     739     0.82
PROD        23         0.60     59      0.50           60         0.85     61      0.85
Total       2287       0.93     3577    0.90           2278       0.86     3020    0.86

A larger number of metadata containers also means more synonymic word propositions for a specific NE, and hence a higher risk of bad disambiguation by the cosine algorithm. A way to solve this specific problem could be to weight the tf.idf according to the number of available metadata containers. The slight improvement of recall in the English [α] experiment is attributed to the better detection of NORDF NEs, due to the difference in NE class representation between the French and English corpora.

6 Conclusions and Perspectives

In this paper, we presented a system that semantically annotates any named entity contained in a text with a URI link. The URI resource used is a standard one, compatible with the Semantic Web Linked Data network. We have introduced the concept of the Linked Data Interface, an exhaustive statistical resource containing contextual and nature descriptions of the potential semantic objects to label. The Linked Data Interface gives a possible answer to the problem of ambiguity resolution for an exhaustive semantic annotation process. This system is a functional proposition, available now, to automatically establish a relation between the vast number of entry points available on the Linked Data network and the named entities contained in an open text. We have shown that a large and expandable Linked Data Interface of high quality, containing millions of contextual descriptions of potential semantic entities in various languages, can be derived from Wikipedia and DBpedia. We proposed an evaluation schema for semantic annotators, using standard corpora improved with DBpedia URI annotations. As our evaluation shows, our system can establish semantic relations automatically, and can be introduced into a complete annotation pipeline behind a NER tool.

References

1. Bikel, D., Schwartz, R., Weischedel, R.: An algorithm that learns what's in a name. Machine Learning 7 (1999)

2. Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In: Proc. of the Sixth Workshop on Very Large Corpora, pp. 152–160 (1998)


3. Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6 (2006)

4. Charton, E., Torres-Moreno, J.: NLGbAse: a free linguistic resource for Natural Language Processing systems. In: LREC 2010: Proceedings of LREC 2010, Malta, vol. (1) (2010)

5. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J., et al.: SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. In: Proceedings of the 12th International Conference on World Wide Web, p. 186. ACM, New York (2003)

6. Galliano, S., Gravier, G., Chaubard, L.: The ESTER 2 Evaluation Campaign for the Rich Transcription of French Radio Broadcasts. In: International Speech Communication Association Conference 2009, pp. 2583–2586 (2009); Interspeech 2010

7. Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic annotation, indexing, and retrieval. Web Semantics: Science, Services and Agents on the World Wide Web 2(1), 49–79 (2004)

8. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. Citeseer (2001)

9. Passant, A., Laublet, P.: Meaning of a tag: A collaborative approach to bridge the gap between tagging and linked data. In: WWW 2008 Workshop Linked Data on the Web (2008)

10. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM – semantic annotation platform. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 834–849. Springer, Heidelberg (2003)

11. Ratinov, L., Roth, D.: Design Challenges and Misconceptions in Named Entity Recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning. International Conference On Computational Linguistics (2009)

12. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management (1988)

13. Sekine, S., Sudo, K., Nobata, C.: Extended named entity hierarchy. In: Proceedings of the LREC-2002 Conference, pp. 1818–1824. Citeseer (2002)

14. Surdeanu, M., Johansson, R., Meyers, A.L.: The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In: Proceedings of CoNLL, p. 159 (2008)

15. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., Ciravegna, F.: Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Web Semantics: Science, Services and Agents on the World Wide Web 4(1), 14–28 (2006)

16. Zelaia, A., Arregi, O., Sierra, B.: A multiclassifier based approach for word sense disambiguation using Singular Value Decomposition. In: Proceedings of the Eighth International Conference on Computational Semantics – IWCS-8 2009, p. 248 (January 2009)


Learning Dialogue POMDP Models from Data

Hamid R. Chinaei and Brahim Chaib-draa

Computer Science and Software Engineering Department, Laval University, Quebec, Canada
[email protected], [email protected]

Abstract. In this paper, we learn the components of dialogue POMDP models from data. In particular, we learn the states and observations, as well as the transition and observation functions, based on a Bayesian latent topic model using unannotated human-human dialogues. As a matter of fact, we use the Bayesian latent topic model in order to learn the intentions behind the user's utterances. Similar to recent dialogue POMDPs, we use the discovered user intentions as the states of the dialogue POMDP. However, as opposed to previous works, instead of using some keywords as POMDP observations, we use meta observations based on the learned user intentions. As the number of meta observations is much smaller than the number of actual observations, i.e. the number of words in the dialogue set, POMDP learning and planning becomes tractable. The experimental results on real dialogues show that the quality of the learned models increases with the number of dialogues used as training data. Moreover, the experiments based on simulation show that the introduced method is robust to the ASR noise level.

1 Introduction

Consider the following example taken from the dialogue set SACTI-2 [6], where SACTI stands for Simulated ASR-Channel Tourist Information:

U1  Is there a good restaurant we can go to tonight
U'1 [Is there a good restaurant week an hour tonight]
M1  Would you like an expensive restaurant
U2  No I think we'd like a medium priced restaurant
U'2 [No I think late like uh museum price restaurant]
M2  Cheapest restaurant is eight pounds per person

The first line shows the first user utterance, U1. Because of Automatic Speech Recognition (ASR), this utterance is corrupted and is received by the system as U′1 in the following line, shown in brackets. M1 in the next line shows the system's response to the user. For each dialogue utterance, the system's goal is first to capture the user's intention and then to perform the best action that satisfies this intention. For instance, in the second received user utterance, U′2 [No I think late like uh museum price restaurant], the system has difficulty in finding the user's intention. In fact, in U′2, the system is required to understand that the user is looking for a restaurant, though this utterance is highly corrupted.


Fig. 1. Intentions learned by HTMM for SACTI-1, with their top-20 words and their probabilities

Specifically, it contains misleading words such as museum that can be strong observations for another user intention, i.e. the user's intention about museums.

Recently, there has been great interest in modelling the dialogue manager (DM) of spoken dialogue systems (SDS) using Partially Observable Markov Decision Processes (POMDPs) [8]. However, in POMDPs, as in many other machine learning frameworks, estimating the environment dynamics is a significant issue, as has been argued previously, for instance in [4]. In other words, the POMDP models highly impact the planned strategies. Nevertheless, a well-learned model can result in the desired strategies. Moreover, it can be used as a prior model in Bayesian approaches so that the model can be further updated and enhanced. As such, in this work we are interested in learning proper POMDP models for dialogue POMDPs based on human-human dialogues.

In this paper, we present a method for learning the components of dialogue POMDP models using the unannotated data available in SDSs. In fact, using an unsupervised method based on the Dirichlet distribution, one can learn states and observations as well as the transition and observation POMDP functions. In addition, we develop a simple idea for reducing the number of observations while learning the model, and define a small practical set of observations for the designed dialogue POMDP.

2 Capturing Dialogue POMDP Model for SACTI-1

This section describes the method for learning the POMDP transition and observation functions. For background about POMDPs, the reader is referred to [5]. We used the Hidden Topic Markov Model (HTMM) [3] to design a dialogue POMDP for the SACTI-1 dialogues [7], publicly available at: http://mi.eng.cam.ac.uk/projects/sacti/corpora/. There are about 144 dialogues between 36 users and 12 experts who play the role of a DM, for 24 total tasks in this data set. Similar to SACTI-2, the utterances here are also first corrupted using a speech recognition error simulator, and then are sent to the human experts. For an application of HTMM on dialogues, in particular for learning the states of the domain, the reader is referred to [1].

Figure 1 shows 3 captured user intentions and their top 20 words with the probabilities learned by HTMM. For each intention, we have highlighted the keywords which best distinguish the intention. These intentions correspond to the user's requests for information about visiting places, transportation, and food places, respectively.


Without loss of generality, we can consider the user's intention as the system's state [2]. Based on the captured intentions above, we defined 3 primary states for the SACTI-1 DM: visits (v), transports (t), and foods (f). Moreover, we defined two absorbing states, Success (S) and Failure (F), for dialogues which end successfully and unsuccessfully, respectively. The notion of a successful or unsuccessful dialogue is defined by the user: after finishing each dialogue, the user assigns a level of precision and recall. This is the only explicit feedback that we require from the user in order to define the absorbing states of the dialogue POMDP. A dialogue is successful if its precision and recall are above a predefined threshold.

The set of actions comes directly from the SACTI-1 dialogue set and includes: GreetingFarewell, Inform, StateInterp, IncompleteUnknown, Request, ReqRepeat, RespondAffirm, RespondNegate, ExplAck, ReqAck, etc. For instance, GreetingFarewell is used for initiating or ending a dialogue, Inform is for giving information about a user's intention, ReqAck is for the DM's request for the user's acknowledgement, and StateInterp is for interpreting the user's intentions and can be considered as implicit confirmation.

The transition function is calculated using maximum likelihood with add-one smoothing to make a more robust transition model:

T(s_1, a_1, s_2) = \frac{Count(s_1, a_1, s_2) + 1}{Count(s_1, a_1) + K}

where K = |S|^2 |A|, S is the state set, and |S| equals the number of intentions N, which is 5 in our example. For each utterance U, its corresponding state is the intention with the highest probability.
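A minimal sketch (assuming dialogues already reduced to (state, action) sequences, with the state of each utterance being its most probable intention) of the smoothed maximum-likelihood estimate above:

```python
from collections import Counter
from typing import Dict, List, Tuple

def transition_model(trajectories: List[List[Tuple[str, str]]],
                     states: List[str],
                     actions: List[str]) -> Dict[Tuple[str, str, str], float]:
    """Maximum-likelihood transition function with add-one smoothing.

    Each trajectory is a sequence of (state, action) pairs extracted from one dialogue.
    """
    K = len(states) ** 2 * len(actions)   # smoothing constant as defined in the text
    triple_counts = Counter()             # Count(s1, a1, s2)
    pair_counts = Counter()               # Count(s1, a1)
    for traj in trajectories:
        for (s1, a1), (s2, _) in zip(traj, traj[1:]):
            triple_counts[(s1, a1, s2)] += 1
            pair_counts[(s1, a1)] += 1

    T = {}
    for s1 in states:
        for a1 in actions:
            for s2 in states:
                T[(s1, a1, s2)] = (triple_counts[(s1, a1, s2)] + 1) / (pair_counts[(s1, a1)] + K)
    return T
```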

For the choice of observation function, we assumed 5 observations, each one specific to one state, i.e. a user's hidden intention. We use the notation O = {VO, TO, FO, SuccessO, FailureO} for the meta observations for visits, transports, foods, Success, and Failure, respectively. For each user intention, one can capture POMDP observations given each utterance W = {w1, . . . , w|W|} using the vector β, where βwiz is the learned probability of each word wi given each user intention z [3]. Then, during dialogue POMDP interaction, given any arbitrary user utterance, the POMDP observation o is captured as:

o = \arg\max_z \prod_i \beta_{w_i z}   (1)

Then, the observation function is estimated by averaging over the belief of states given each action and state.
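As an illustration of the observation capture just described, here is a hedged sketch; beta and meta_observation are assumed data structures (word-given-intention probabilities from HTMM and an intention-to-meta-observation map), and log probabilities are used purely to avoid numerical underflow:

```python
from math import log
from typing import Dict, List

def capture_observation(utterance_words: List[str],
                        beta: Dict[str, Dict[str, float]],
                        meta_observation: Dict[str, str]) -> str:
    """Map an utterance to a meta observation: o = argmax_z prod_i beta[w_i][z] (Eq. 1).

    beta[w][z] is the learned probability of word w under intention z (from HTMM);
    meta_observation maps an intention to its meta observation (e.g. 'visits' -> 'VO').
    """
    intentions = list(meta_observation)
    scores = {z: 0.0 for z in intentions}
    for w in utterance_words:
        if w not in beta:
            continue                       # ignore words unseen during HTMM training
        for z in intentions:
            scores[z] += log(max(beta[w].get(z, 0.0), 1e-12))
    best_intention = max(scores, key=scores.get)
    return meta_observation[best_intention]
```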

For the choice of reward model, similar to previous works, we penalized each action in the primary states by −1, i.e. a −1 reward for each dialogue turn [8]. Moreover, actions in the Success state get a +50 reward, and those which lead to the Failure state get a −50 reward.

3 Experiments

We generated dialogue POMDP models as described in the previous section for SACTI-1. The automatically generated dialogue POMDP models consist of 5 states, 14 actions and

5 meta observations (each of which corresponds to one state), which are drawn by HTMM from 817 primitive observations (words).

Fig. 2. (a): Comparison of performance of dialogue POMDPs vs. experts with respect to the number of expert dialogues. (b): Comparison of performance of dialogue POMDPs vs. experts with respect to the noise level.

We solved our POMDP models using the ZMDP software available online at http://www.cs.cmu.edu/~trey/zmdp/. We set a uniform distribution over the 3 primary states (visits, transports, and foods), and set the discount factor to 90%. Based on simulation, we evaluated the performance of the dialogue POMDP, in terms of gathered rewards, as the number of expert dialogues was increased.

Figure 2 (a) shows that by increasing the number of expert dialogues the dialogue POMDP models perform better. In other words, with more data the introduced method learns better dialogue POMDP models. The only exception is when we use 48 dialogues, where the dialogue POMDP performance decreases compared to when 24 dialogues were used, and its average performance is worse than the performance of the experts in the corresponding 48 dialogues. The reason could be the use of EM for learning the model, which depends on the priors α and η [3]. Moreover, EM is prone to local optima. In this work, we set the priors based on the heuristic given in [3] and on our trial-and-error experiments, which is indeed a drawback of using parametric models in real applications.

Furthermore, based on our simulations, we evaluated the robustness of the generated POMDP models to ASR noise. There are four levels of ASR noise: no noise, low noise, medium noise, and high noise. For each noise level, we randomly took 24 expert dialogues and built a dialogue POMDP model. Then, for each POMDP we performed 24 simulations, gathered their expected rewards, and compared them to the corresponding expert dialogues. Figure 2 (b) shows the results of these experiments. As the figure shows, the dialogue POMDP models are more robust to ASR noise levels than the expert dialogues. The only exception is in the presence of no noise, where the experts perform better. This also might be because of the use of EM for learning the model, where the model can converge to a local optimum. Nevertheless, our preliminary results based on simulation show that dialogue POMDP models are much more robust to higher levels of noise compared to the expert performance.


Table 1. Left: Sample results of applying HTMM on SACTI-1. Right: Sample results of simulation for the SACTI-1 dialogue POMDP

Left (HTMM applied to a SACTI-1 dialogue):
. . .
U1  yeah hello this is johan schmulka uh and i'm uh searching for a bar in this town can you may be tell me where the cafe blu is
U'1 [hello this is now seven four bus and do you tell me where to cafe blu is]
o1  FO
B1  t:0.000000 v:0.000000 f:1.000000
a1: Inform(foods)
M1  cafe blu is on alexander street
U2  oh um yeah how can i get to alexander street and where exactly is it i know there a shopping area on alexander street um
U'2 [i am yeah i am at the alexander street and where is it was on a the center of alexander street]
o2  TO
B2  t:0.999992 v:0.000008 f:0.000000
a2: Inform(transports)
. . .

Right (simulation of the SACTI-1 dialogue POMDP):
. . .
a1: GreetingFarewell
M1: How can I help you?
o2: VO
B1: t:0.048145 v:0.912760 f:0.039093
a2: Inform(visits)
M2: Here is information about visiting areas
o2: TO
B2: t:0.967322 v:0.008186 f:0.024490
a3: Inform(transports)
M3: Here is information about transportation
o3: TO
B3: t:0.993852 v:0.000314 f:0.005833
a4: ReqAck(transports)
M4: Are you looking for transportation
o4: TO
B4: t:0.945658 v:0.048333 f:0.006008
a5: Inform(transports)
M5: Here is information about transportation
. . .

Moreover, the left side of Table 1 shows a sample dialogue from the SACTI-1 dialogue set after applying HTMM to the dialogues. In fact, this is a sample of the data used for learning the dialogue POMDP model. The first line of the table shows the first user utterance (U1). Because of ASR this utterance is corrupted, and the corrupted version appears in the following line in brackets, U′1. The next line, o1, is the observation behind U′1, which is used at the time of dialogue POMDP interaction. Note that it is assumed that each user utterance corresponds to one user intention. So, for each system observation, the values in the following line show the system's belief over the possible hidden intentions (B1). The next line, a1, shows the DM's action in the form of a dialogue act. For instance, Inform(foods) is the dialogue act for the actual DM utterance in the following line, i.e. M1: cafe blu is on alexander street.

Furthermore, the right side of Table 1 shows a sample of our simulation of the dialogue POMDP. At simulation time, for instance, action a1, GreetingFarewell, is generated by the dialogue POMDP manager; the description of this action is shown in M1, How can I help you?. Then, the observation o2, VO, is generated by the environment. For instance, the received user utterance could have been something like U′1 = I would like a hour there museum first, for which the underlying intention can easily be calculated using the learned β vectors and Equation 1. However, notice that these results are only based on dialogue POMDP simulation, where there is no actual user utterance, but only simulated meta observations oi. As the table shows, the dialogue POMDP performance seems intuitive. For instance, in a4 the dialogue POMDP requests an acknowledgement that the user actually looks for transports, since the dialogue POMDP already informed the user about transports in a3.


4 Conclusion and Future Work

A common problem in dialogue POMDP frameworks is calculating the dialogue POMDP policy. If we can estimate the POMDP model, in particular the transition, observation, and reward functions, then we are able to use common dynamic programming approaches for calculating POMDP policies. In this context, [8] used POMDPs for modelling a DM and defined the observation function based on confidence scores, which are in turn based on some recognition features. However, the work here is tackled differently: we consider all the words in an utterance and take the most probable intention underlying the utterance as the meta observation for the POMDP. This makes the work presented here particularly different from [2], where the authors simply used some state keywords together with a few other words for modelling the SDS POMDP observations and observation function.

However, the evaluation done here is in a rather small domain for real dialogue systems. The number of states needs to be increased and the learned model should be evaluated accordingly. Moreover, the definition of states here is a simple intention state, whereas in real dialogue domains the information or dialogue states are more complex. Then, the challenge would be to compare, in particular, the learned observation function presented here with confidence-score-based ones such as in [8], as well as with keyword-based ones as presented in [2].

References

1. Chinaei, H.R., Chaib-draa, B., Lamontagne, L.: Learning user intentions in spoken dialogue systems. In: Filipe, J., Fred, A., Sharp, B. (eds.) ICAART 2009. CCIS, vol. 67, pp. 107–114. Springer, Heidelberg (2010)

2. Doshi, F., Roy, N.: Spoken language interaction with model uncertainty: an adaptive human-robot interaction system. Connection Science 20(4), 299–318 (2008)

3. Gruber, A., Rosen-Zvi, M., Weiss, Y.: Hidden topic Markov models. In: Artificial Intelligence and Statistics (AISTATS), San Juan, Puerto Rico (2007)

4. Liu, Y., Ji, G., Yang, Z.: Using Learned PSR Model for Planning under Uncertainty. Advances in Artificial Intelligence, 309–314 (2010)

5. Pineau, J., Gordon, G., Thrun, S.: Point-based value iteration: An anytime algorithm for POMDPs. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 1025–1032 (August 2003)

6. Weilhammer, K., Williams, J.D., Young, S.: The SACTI-2 Corpus: Guide for Research Users. Cambridge University. Technical report (2004)

7. Williams, J.D., Young, S.: The SACTI-1 Corpus: Guide for Research Users. Cambridge University Department of Engineering. Technical report (2005)

8. Williams, J.D., Young, S.: Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language 21, 393–422 (2007)


Characterizing a Brain-Based Value-Function Approximator

Patrick Connor and Thomas Trappenberg

Department of Computer Science, Dalhousie University
[email protected], [email protected]

Abstract. The field of Reinforcement Learning (RL) in machine learning relates significantly to the domains of classical and instrumental conditioning in psychology, which give an understanding of biology's approach to RL. In recent years, there has been a thrust to correlate some machine learning RL algorithms with brain structure and function, a benefit to both fields. Our focus has been on one such structure, the striatum, from which we have built a general model. In machine learning terms, this model is equivalent to a value-function approximator (VFA) that learns according to Temporal Difference error. In keeping with a biological approach to RL, the present work1 seeks to evaluate the robustness of this striatum-based VFA using biological criteria. We selected five classical conditioning tests to expose the learning accuracy and efficiency of the VFA for simple state-value associations. Manually setting the VFA's many parameters to reasonable values, we characterize it by varying each parameter independently and repeatedly running the tests. The results show that this VFA is both capable of performing the selected tests and is quite robust to changes in parameters. Test results also reveal aspects of how this VFA encodes reward value.

Keywords: Reinforcement learning, value-function approximation, classical conditioning, striatum.

1 Introduction

Over the last several decades, our understanding of RL has been advanced by psychology and neuroscience through classical/instrumental conditioning experiments and brain signal recording studies (fMRI, electrophysiological recording, etc.). Over the same period, the machine learning field has been investigating potential RL algorithms. There has been some convergence of these fields, notably the discovery that the activity of a group of dopamine neurons in the brain resembles the Temporal Difference (TD) error in TD learning [1]. One research focus in machine learning RL is the mapping of expected future reward value to states (state-value mapping) from as little experience (state-value sampling) as possible. Living things clearly grapple with this problem, continually updating their beliefs about expected rewards from their limited experience. Indeed, the field of classical conditioning, which relies heavily on animal behavioural experiments, has explored a variety of reward-learning scenarios. The obvious need to acquire value for a rewarding state and the need to generalize this to similar circumstances is well recognized by both psychology and machine learning. What is interesting, however, is that there appear to be other useful reward-learning strategies expressed in classical conditioning phenomena that have not yet been translated into machine learning RL. Just as generalization improves learning efficiency by spreading learned value to nearby states, the classical conditioning phenomena of "latent inhibition" and "unovershadowing" appear to improve learning efficiency in their own right.

1 Funding for this work was supported in part by the Walter C. Sumner Foundation, CIHR, and NSERC.

At the heart of classical conditioning experiments is the presentation of a stimulus (eg. a light, tone, etc.) or a combination of stimuli followed by a reward outcome (reward, punishment, or none). When a stimulus is repeatedly presented and there is no change in the reward outcome, latent inhibition [2] sets in, reducing the associability of the stimulus when a change in reward outcome eventually occurs. This promotes association to novel stimuli, which seems appropriate since novel stimuli are more likely to predict a new outcome than familiar stimuli. Latent inhibition saves the additional experience otherwise needed to make this distinction clear. Recovery from overshadowing, or "unovershadowing" [3], is one of a family of similar strategies. First, overshadowing is the process of presenting a compound stimulus followed by, say, reward (SAB → R). Although the compound will learn the full reward value, its constituent stimuli (SA and SB) tested separately will also increase in value, where the most salient stimulus (say SB) gains the most value. In unovershadowing, the most salient stimulus is presented but not rewarded (SB → 0) and will naturally lose some of its value. What is surprising, however, is that the absent stimulus (SA) concurrently increases in value. This allows the animal not only to learn that SB is less rewarding than it predicted but, by process of elimination, to learn that SA is more rewarding than it predicted. Unovershadowing saves the need to present and reward SA explicitly to increase its value, taking advantage of implicit logic. Whether it is generalization, latent inhibition, or unovershadowing, learning the value-function from fewer experiences will assist the animal in making rewarding choices sooner.

These and other RL strategies are found in classical conditioning experiments, where subjects maintain an internal value-function, indicating reward-value based on the rate of their response (eg. lever presses). Since these biological strategies appear beneficial, a machine learning RL system based on RL structures in the brain may prove effective. After a brief outline of our brain-based model [4] that does value-function approximation, the present work characterizes this VFA to determine its robustness and effectiveness in several classical conditioning tests that are especially relevant to VFAs, whether artificial or biological.


2 Striatal Model

The striatum, the input stage of the basal ganglia (BG) brain structure, is a key candidate region on which to base a VFA. The striatum is a convergence point for inputs from all over the brain (specifically, the neocortex [5]), spanning signals from sensation to abstract thought. The majority of striatal neurons project to one another (via axon collaterals) and to other BG nuclei. The synaptic strengths (i.e. weights) of these projection neurons are modulated by dopamine signals [6] (or the lack thereof), where dopamine neuron activity has been linked to the teaching signal of TD learning [1] mentioned earlier. In addition, several neural recording studies suggest that reward-value is encoded in the striatum [7][8][9], although it is not the only area of the brain that has been implicated in the representation of reward-value [10][11][12].

Our striatal model [4] is shown in Figure 1. The excitatory external input represents a real-world feature (eg. colour wavelength, tonal pitch, etc.) by providing a Gaussian activation profile surrounding a specific feature value (eg. Green, 530 nm). This emulates the "tuning curve" input to the striatum from the neocortex. The model is composed as a one-layer, one-dimensional neural network of striatal projection neurons, each excited by a subset of the external inputs and inhibited by a subset of the other projection neurons, as is the case in the striatum (see [5] and [13]). Each neuron is part of either the direct or indirect pathway, the main information processing routes through the BG, where D1 and D2 are their dominant dopamine receptor subtypes respectively. These pathways tend to behave in an opposite sense, where one increases BG output activity while the other decreases it. The output of the model, V (S), becomes the expected value of an external input (state/stimuli), computed as the sum of the direct pathway neuron activity minus the sum of the indirect pathway neuron activity. Finally, the teaching signal can be formulated in the same way as TD error, but, for the simple one-step prediction tasks used in this work, it is only necessary to use the reward prediction error (RPE), the actual reward minus the expected reward (RPE = R − V (S)). A more formal description of the model is provided in Appendix A.
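To make the computation of V(S) and the RPE-driven update concrete, here is a deliberately simplified sketch; it omits the lateral inhibition, activation thresholds and other parameters of the full model (Appendix A of the paper), and the opposite-sign modulation of the two pathways is our assumption rather than a detail stated in this excerpt:

```python
import numpy as np

class StriatalVFA:
    """Minimal illustrative sketch of the striatum-based VFA (not the full model)."""

    def __init__(self, n_input=100, n_per_pathway=50, lr=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.w_d1 = rng.uniform(0.0, 0.1, (n_per_pathway, n_input))  # direct pathway weights
        self.w_d2 = rng.uniform(0.0, 0.1, (n_per_pathway, n_input))  # indirect pathway weights
        self.lr = lr

    def value(self, state: np.ndarray) -> float:
        """V(S): summed direct-pathway activity minus summed indirect-pathway activity."""
        d1 = np.maximum(self.w_d1 @ state, 0.0)   # rectified activations
        d2 = np.maximum(self.w_d2 @ state, 0.0)
        return float(d1.sum() - d2.sum())

    def update(self, state: np.ndarray, reward: float) -> float:
        """One learning step driven by the reward prediction error RPE = R - V(S).

        Assumption: the RPE modulates the two pathways with opposite signs, and
        only currently active neurons update their weights.
        """
        rpe = reward - self.value(state)
        active_d1 = np.sign(np.maximum(self.w_d1 @ state, 0.0))
        active_d2 = np.sign(np.maximum(self.w_d2 @ state, 0.0))
        self.w_d1 += self.lr * rpe * np.outer(active_d1, state)
        self.w_d2 -= self.lr * rpe * np.outer(active_d2, state)
        np.clip(self.w_d1, 0.0, None, out=self.w_d1)
        np.clip(self.w_d2, 0.0, None, out=self.w_d2)
        return rpe

def gaussian_input(center: int, n_input: int = 100, width: float = 5.0) -> np.ndarray:
    """Tuning-curve-like external input around a feature value (e.g. 'Green', 530 nm)."""
    x = np.arange(n_input)
    return np.exp(-0.5 * ((x - center) / width) ** 2)
```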

An important novel element in our model is the inclusion of modifiable lateral inhibitory connections. Because of these, the neurons compete, partially suppressing one another. Given an arbitrary combination of external inputs, an associated subset of neurons will become more active than the others because their external input weights correlate most with the external input. Many neurons will also be inactive, suppressed below their base activation threshold.

3 Tests, Measures, and Variables

Conventionally, to evaluate a VFA, one might seek to prove that the VFA's state-values converge for arbitrary state-value maps or seek to test performance on a particular RL task (eg. random walk). Instead, we seek to know how effectively this striatum-based VFA employs certain RL strategies found in classical conditioning to update the value-function.


Fig. 1. Diagram of the striatal model. External input is shaped as a Gaussian activity profile surrounding a feature value. Probabilistic inputs (finely dashed lines) from external and lateral sources are excitatory (green) and inhibitory (red) respectively, while the modulatory RPE signal can be either (blue). The direct and indirect pathways are expressed in the two populations of neurons, D1 and D2 respectively, whose activities are accumulated to compute the expected value of the input state/stimulus, V (S).

This approach puts value-function update strategy first, after which agent actions can be included and convergence proofs and specific RL task comparisons pursued. Also, using classical conditioning tests helps to ascertain whether or not the striatum is responsible for this behaviour. There is a great variety of classical conditioning tests, but to be practical, we limit this to five: two to evaluate state-value mapping accuracy and three to evaluate state-value learning efficiency. The striatum-based VFA was integrated into simulations of these tests, providing results in terms of measures that are defined for each, as described below. During a test, many trials are run, where one trial consists of presenting a state/stimulus (external input) together with its expected reward-value.

The entry-level test for a VFA is the acquisition of a state-value. What is also important, however, is that other state-values outside of a reasonable generalization window (eg. Yellow in Fig. 1) are relatively unaffected. The acquisition test, then, pairs a state with a reward value, and compares the state-value to a sample of other state-values. We define the acquisition effectiveness measure as

E_A(S) = \frac{V(S) - \frac{1}{M}\sum_{i=1}^{M} V(S_i)}{V(S)}   (1)

and consider that acquisition is observed when the state-value, V (S), is twice that of the other sampled state-values, V (Si), or EA(S) > 0.5. Twenty trials are run for each acquisition test. Six V (Si) samples are used for the comparison.


Secondly, it is important that a VFA be able to represent a variety of state-value mappings. Negative patterning is the classical conditioning equivalent of the non-trivial "exclusive-OR" problem, where the subject learns to increase the value of two stimuli, SA and SB, while learning zero value for the compound stimulus SAB. Here, we define the negative patterning effectiveness as the difference between the average constituent value and the compound value, normalized by the average constituent value, which can be expressed as

E_{NP}(S_A, S_B, S_{AB}) = \frac{V(S_A) + V(S_B) - 2V(S_{AB})}{V(S_A) + V(S_B)}   (2)

Negative patterning is observed while ENP (SA, SB, SAB) > 0, that is, while the constituents have a higher value than the compound. One hundred trials of interleaved presentation of the stimuli and their associated rewards are run for each test.

In practical situations, no two experiences are identical, making it critical to generalize state-value learning. Generalization also contributes significantly to learning efficiency, spreading learned value to nearby states under the assumption that similar states are likely to have similar expected reward value. This strategy reduces the amount of state-value sampling necessary to achieve reasonable accuracy. For this test, acquisition is performed for a single feature value and the reward value is computed for 500 equally spaced feature values. Generalization effectiveness describes the spread of the value as a weighted standard deviation, where feature values are weighted by their associated reward values,

E_G(S) = \sqrt{\frac{\sum_{i=1}^{N} V(S_i)\left(i - \frac{\sum_{k=1}^{N} k\,V(S_k)}{\sum_{j=1}^{N} V(S_j)}\right)^{2}}{\sum_{j=1}^{N} V(S_j)}}   (3)

Generalization will be considered observed when the spread of value is at least 10% of the width of the tuning curve input.

To further enhance learning efficiency, we consider the phenomenon of latent inhibition described earlier. Latent inhibition's reduction of associability can be achieved by lowering the input salience of the familiar stimulus. This is done manually here, for our testing purposes, but represents a process that lowers salience as a stimulus is repeatedly presented without a change in reward outcome. Then, when the reduced-salience stimulus is combined with a fully salient, novel stimulus and followed by reward, overshadowing will result. Thus, our test of latent inhibition becomes a test of overshadowing, where the novel stimulus (SA) overshadows the reduced (half) salience stimulus (SB). We define the latent inhibition effectiveness measure as

E_{LI}(S_A, S_B) = \frac{V(S_A) - V(S_B)}{V(S_A) + V(S_B)}   (4)

where the effect is observed when ELI (SA, SB) > 0. Thirty trials are run for each test.


Finally, unovershadowing appears to improve learning efficiency by process of elimination, as described previously. There are other similar phenomena (eg. backward blocking) that raise or lower the value of the absent stimulus, depending on the scenario. The unovershadowing effectiveness is defined as

E_{UO}(S_A, S_B) = -\frac{\Delta V(S_A)}{\Delta V(S_B)}   (5)

where ΔV (SX) is the change of value of stimulus SX from one trial to the next, and observability occurs when EUO (SA, SB) > 0. Here, unovershadowing is simulated by first performing the process of overshadowing (see above) with equally salient stimuli, followed by 100 trials of SB presentation without reward.
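For reference, a compact sketch consolidating the five effectiveness measures (Equations 1–5); the inputs are assumed to be already-simulated state-values, and the squared deviation in E_G follows from its description as a weighted standard deviation:

```python
import numpy as np

def acquisition_effectiveness(v_s: float, v_others: np.ndarray) -> float:
    """E_A(S), Eq. 1: acquisition is observed when the result exceeds 0.5."""
    return (v_s - v_others.mean()) / v_s

def negative_patterning_effectiveness(v_a: float, v_b: float, v_ab: float) -> float:
    """E_NP, Eq. 2: observed while the result is positive."""
    return (v_a + v_b - 2.0 * v_ab) / (v_a + v_b)

def generalization_effectiveness(values: np.ndarray) -> float:
    """E_G, Eq. 3: value-weighted standard deviation over feature positions."""
    idx = np.arange(len(values))
    total = values.sum()
    mean = (idx * values).sum() / total
    return float(np.sqrt((values * (idx - mean) ** 2).sum() / total))

def latent_inhibition_effectiveness(v_novel: float, v_familiar: float) -> float:
    """E_LI, Eq. 4: observed when the novel stimulus overshadows the familiar one."""
    return (v_novel - v_familiar) / (v_novel + v_familiar)

def unovershadowing_effectiveness(delta_v_absent: float, delta_v_presented: float) -> float:
    """E_UO, Eq. 5: observed when the absent stimulus gains value as the presented one loses it."""
    return -delta_v_absent / delta_v_presented
```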

Ultimately, we seek to determine the robustness of the simulation of these five tests to changes in the VFA's parameters. Because the parameter space is very large and a full search unnecessary for our purposes, we found initial values where all tests were observed and varied the parameters independently through their valid ranges. This process characterizes the VFA, showing the conditions under which the tests break down.

Besides parameters associated directly with the VFA, there are others acknowledged here that are better associated with the particular RL task to be solved. To simulate input noise, Gaussian noise is added to the external input and rectified, where its standard deviation is the parameter varied in the tests. Since the intensity of stimuli and rewards may vary, the salience of inputs and rewards are multiplied by parameters varied between 0.01 and 1.

4 Results

Figures 2 and 3 represent the results for all five tests over 17 parameters. Since the VFA connectivity and initial weights are randomly initialized, each test and parameter combination was run 20 times to provide uncertainty estimates. The observability curve (upper panel) for each parameter is a summary of the more detailed effectiveness curves (lower panel). For each parameter, the observability curves from the five tests are multiplied together, giving an "intersection" of observability. So, wherever observability is zero, it means that at least one test is not observed for that parameter setting, and when observability is one, all tests are observed. For example, once the lateral learning rate, β, becomes negative, unovershadowing effectiveness disappears (goes negative) and unovershadowing is no longer observed. So, the summary observability curve is zero for β < 0 because not all of the tests were observed in this range. In contrast, when β > 0, all tests are observed. The effectiveness curves, whose vertical bars denote standard deviation, are colour coded: acquisition (blue), negative patterning (green), generalization (red), latent inhibition (cyan), and unovershadowing (violet). Note that only the effectiveness of observed cases is given. Also, when part of an effectiveness curve is missing in the graph, this indicates that there were no cases of the associated parameter values where the effect was observed. In the effectiveness graphs, a black dotted vertical line indicates that parameter's setting while the other parameters were independently varied.

Like any other parameter, the external input learning rate, α, was meant to remain static while other parameters were varied. However, the true effects of the tuning curve width, input noise, activation exponent (θ2 in Appendix A), and the number of inputs per feature were not readily observable through their valid feature range unless α was adjusted so that the system activity was neither too small nor too great. For each of these parameters, a function, α = f(param), was chosen such that the acquisition test would acquire the full reward value at around the 20-trial mark.

5 Discussion

The results show that this VFA is generally robust to changes in feature values. There are, however, regions where observability disappears within its valid parameter range. From the causes of low observability and key trends in the effectiveness curves given in the results, the structural and functional details necessary to successfully reproduce these five effects are described.

Acquisition was prevented in only three cases2: high activation threshold (θ0 in Appendix A), low input salience, and low mean input weight. As the activation threshold increases, fewer neurons are active because fewer have internal activations that exceed it. Likewise, internal activations are weak when either the input salience or the mean input weight is too low. Since learning only occurs in neurons that are active (see equations 8 and 9, Appendix A), neither acquisition nor any other test will learn when neurons are silent. Acquisition is otherwise robust to varying the VFA parameters. This is not surprising since the Rescorla-Wagner model [14] of classical conditioning acquires reward value in much the same way, the key ingredient being that both learn in proportion to RPE and input salience.

As different inputs are presented to the system it becomes clear that the subsetof active neurons is input specific, enabling inputs to be represented by separatepopulations of neurons. A lateral inhibitory network put forth by Rabinovichet al. [15] similarly showed that asymmetric lateral connectivity (implementedin the striatum-based VFA by low lateral connection probability) led to similarinput-specific patterns of activation as well. This form of activity also resemblesthat of sparse coarse coding [16], another value-function approximation tech-nique that uses state-specific subsets of elements to represent state-value. Thisvalue-encoding strategy is critical for negative patterning because it allows acompound stimulus (SAB) and its constituent stimuli (SA and SB) to be rep-resented in different (although overlapping) populations. Then SA and SB canhave a strong positive value while SAB holds zero value. In the results we see2 Note that for input noise, tuning curve width, and the activation exponent, testing

was terminated when parameter values led to instability in the model despite thecustom tuning of the learning rate, α, to avoid this.


Fig. 2. Intersection of observability curves (top) with effectiveness curves (bottom), where error bars represent the standard deviation of effectiveness. Effectiveness curves are coloured according to test: acquisition (blue), negative patterning (green), generalization (red), latent inhibition (cyan), and unovershadowing (violet). See the electronic version (www.springerlink.com) for coloured plots.


Fig. 3. Intersection of observability curves (top) with effectiveness curves (bottom), where error bars represent the standard deviation of effectiveness. Effectiveness curves are coloured according to test: acquisition (blue), negative patterning (green), generalization (red), latent inhibition (cyan), and unovershadowing (violet). See the electronic version (www.springerlink.com) for coloured plots.


In the results, we see negative patterning sometimes failing for high lateral connection probabilities. In this scenario, we find that it becomes difficult to separately represent the constituent and compound stimuli because there is too much overlap between their active subsets.

Generalization, like acquisition, is robust, not being eliminated except when all neurons are silent. In the effectiveness curves, the generalization is always greater than or equal to the tuning curve width. As the tuning curve width is increased, a proportional increase in generalization effectiveness can be seen as well. When the generalization effectiveness is greater than the tuning curve width, closer examination reveals it to be either noise or an average increase/decrease in the state-values outside a reasonable generalization window. So, the actual generalization present in the VFA is due to the activity profile of the input rather than anything in the VFA per se. The VFA does support this means of generalization, however, in that the amount of subset overlap between two feature values is proportional to the overlap between their activity profiles. This, too, accords with the approach taken by sparse coarse coding.

Again, the practical benefit of latent inhibition is its ability to reduce association of familiar, ineffectual stimuli with reward outcome. We implemented this as a test of overshadowing, where the familiar stimulus was given half the salience of a novel stimulus. If reward associations were simply made in proportion to a stimulus' input salience, as is the case for the Rescorla-Wagner model (not shown), our tests should return latent inhibition effectiveness values of ∼0.6. However, we see effectiveness values typically between 0.85 and 0.95, which seems to suggest that the novel stimulus really dominates the association and the familiar stimulus receives disproportionately little association. As mentioned earlier, however, this lateral inhibitory model of the striatum has competitive properties. It appears that this makes up the difference in the effectiveness measure, where the familiar (less salient) stimulus is not very competitive and is overwhelmed by background activity when presented alone.

Unovershadowing is especially affected by the lateral learning rate. A sharp increase in unovershadowing observability occurs as the lateral learning rate becomes positive. In agreement with equations 8 and 9 (Appendix A), this suggests that for unovershadowing to be observed, a neuron's lateral weights must be strengthened when its input weights are strengthened, and be weakened when they are weakened. This is unusual since, if gradient descent had been used to derive the lateral weight update equation, as was done for the input weight update equation, the lateral weights would have learned in the opposite sense (i.e., they would have been strengthened when input weights were weakened, etc.).

6 Conclusions

Systematically varying the VFA parameters led to both assessing the model's degree of robustness and helping to determine how the VFA is capable of successfully performing the tests. This striatum-based VFA has been shown to effectively express the chosen classical conditioning tests over a breadth of parameter space, supporting the notion that the striatum may be the seat of general purpose reward-value encoding in the brain. The VFA's ability to effectively demonstrate unovershadowing and support latent inhibition is especially worthy of note, as these are emergent properties of the VFA's competitive nature and lateral learning.

7 Future Work: Application to RL Tasks

We have characterized a brain-based VFA in terms of classical conditioning tests that represent RL strategies for accurate and efficient value-function updates. This approach is not limited to brain-based VFAs, but may be applied to others under the assumption that these tests represent RL strategies worth emulating.

How might this striatal model be applied to RL tasks? We propose that the striatal model be employed within the actor-critic [17] framework. Unchanged, the model would implement the critic, receiving all sensory inputs (i.e., features such as X and Y position in a Grid World task). The actor, taught by the critic, would be composed of a number of striatal models, one per action (e.g., North, South, East, West). Given that biological reinforcement systems are effective beyond simple (e.g., Grid World) reinforcement tasks, our approach may support the completion of complex tasks, warranting further investigation. A minimal sketch of this arrangement is given below.
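As a rough illustration of the proposed arrangement only, the following sketch runs a generic tabular actor-critic on a small Grid World. It is a sketch under stated assumptions: simple value and preference tables stand in for the striatal VFA, and the grid size, rewards and learning rates are invented for the example.

import numpy as np

# Generic tabular actor-critic on a 5x5 Grid World (illustration only; the
# proposal is to plug the striatal VFA in as the critic and as per-action actors).
rng = np.random.default_rng(0)
N = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # North, South, West, East
GOAL = (4, 4)
V = np.zeros((N, N))                             # critic: state-value table
prefs = np.zeros((N, N, len(ACTIONS)))           # actor: one preference per action
alpha_c, alpha_a, gamma = 0.1, 0.05, 0.95

def step(state, action):
    r = min(max(state[0] + ACTIONS[action][0], 0), N - 1)
    c = min(max(state[1] + ACTIONS[action][1], 0), N - 1)
    nxt = (r, c)
    return nxt, (1.0 if nxt == GOAL else -0.01)

for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        p = np.exp(prefs[s])
        p /= p.sum()
        a = rng.choice(len(ACTIONS), p=p)        # softmax action selection
        s2, reward = step(s, a)
        # Reward-prediction error, the same signal that drives the VFA.
        rpe = reward + gamma * V[s2] * (s2 != GOAL) - V[s]
        V[s] += alpha_c * rpe                    # critic update
        prefs[s][a] += alpha_a * rpe             # actor update for the chosen action
        s = s2

print("Learned value of the start state:", round(float(V[0, 0]), 3))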

References

1. Schultz, W.: Predictive reward signal of dopamine neurons. J. Neurophysiol. 80(1), 1–27 (1998)
2. Lubow, R.E.: Latent inhibition. Psychological Bulletin 79, 398–407 (1973)
3. Matzel, L.D., Schachtman, T.R., Miller, R.R.: Recovery of an overshadowed association achieved by extinction of the overshadowing stimulus. Learning and Motivation 16(4), 398–412 (1985)
4. Connor, P.C., Trappenberg, T.: Classical conditioning through a lateral inhibitory model of the striatum (2011) (in preparation)
5. Wilson, C.J.: Basal Ganglia, 5th edn., pp. 361–413. Oxford University Press, Oxford (2004)
6. Wickens, J.R., Begg, A.J., Arbuthnott, G.W.: Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience 70, 1–5 (1996)
7. Hori, Y., Minamimoto, T., Kimura, M.: Neuronal encoding of reward value and direction of actions in the primate putamen. Journal of Neurophysiology 102(6), 3530–3543 (2009)
8. Lau, B., Glimcher, P.W.: Value representations in the primate striatum during matching behavior. Neuron 58(3), 451–463 (2008)
9. Samejima, K.: Representation of action-specific reward values in the striatum. Science 310(5752), 1337–1340 (2005)
10. Bromberg-Martin, E.S., Hikosaka, O., Nakamura, K.: Coding of task reward value in the dorsal raphe nucleus. Journal of Neuroscience 30(18), 6262–6272 (2010)
11. Gottfried, J.A.: Encoding predictive reward value in human amygdala and orbitofrontal cortex. Science 301(5636), 1104–1107 (2003)
12. Roesch, M.R.: Neuronal activity related to reward value and motivation in primate frontal cortex. Science 304(5668), 307–310 (2004)


13. Wickens, J.R., Arbuthnott, G.W., Shindou, T.: Simulation of GABA function in the basal ganglia: computational models of GABAergic mechanisms in basal ganglia function. In: Progress in Brain Research, vol. 160, pp. 313–329. Elsevier, Amsterdam (2007)
14. Rescorla, R.A., Wagner, A.R.: A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In: Black, A.H., Prokasy, W.F. (eds.) Classical Conditioning II. Appleton-Century-Crofts, New York (1972)
15. Rabinovich, M.I., Huerta, R., Volkovskii, A., Abarbanel, H.D.I., Stopfer, M., Laurent, G.: Dynamical coding of sensory information with competitive networks. J. Physiol. (Paris) 94, 465–471 (2000)
16. Sutton, R.S.: Generalization in reinforcement learning: Successful examples using sparse coarse coding. In: Advances in Neural Information Processing Systems, vol. 8, pp. 1038–1044. MIT Press, Cambridge (1996)
17. Houk, J., Adams, J., Barto, A.: A model of how the basal ganglia generate and use neural signals that predict reinforcement, pp. 249–270. The MIT Press, Cambridge (1995)

Appendix A

Formally, the striatum-based neural network can be represented as:

τ du(x, t)/dt = −u(x, t) + ∫y wI(x, y) I(y, t) dy − ∫z wL(x, z) r(u(z, t)) dz    (6)

r(u) = θ1 (u − θ0)^θ2  if u > θ0,  and  r(u) = 0  otherwise    (7)

where wI and wL are the synaptic weights connecting external input (I(y, t)) and lateral inputs from other neurons, respectively. The activation function, r(u), transforms the internal state (average membrane potential) to an instantaneous population firing rate. Parameter θ0 is the x-intercept, θ1 is the slope multiplier, and θ2 is the exponent (r(u) is a threshold-linear activation function when θ2 = 1). Neurons only activate if their internal state is greater than the threshold.

Learning in the model happens in two ways. Weights receiving external inputs learn according to gradient descent, minimizing the squared RPE (J = ½ RPE²), resulting in

wI(x, y) = wI(x, y) + α D(x) RPE [θ2 θ1 (u(x, t) − θ0)^(θ2−1) I(y, t)]    (8)

where α is the learning rate and D(x) = 1 for direct pathway neurons and −1 for indirect pathway neurons. The weights receiving lateral inputs learn in a way that opposes the gradient,

wL(x, z) = wL(x, z) + α β D(x) RPE [θ2 θ1 (u(x, t) − θ0)^(θ2−1) Q(u(z, t))]    (9)

where β is the relative learning rate for the lateral input connections, and Q(u) = 1 for u > θ0 and 0 otherwise. Just as for r(u), there is no weight change for either of these learning equations when u(x, t) < θ0.
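A minimal numerical sketch of the two learning rules, equations (8) and (9), is given below for a small discretized network. All sizes and parameter values are illustrative assumptions, not the settings used in the experiments.

import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_inputs = 50, 20
theta0, theta1, theta2 = 0.1, 1.0, 1.0           # threshold, slope, exponent
alpha, beta = 0.05, 0.1                          # input rate and relative lateral rate
D = rng.choice([1.0, -1.0], n_neurons)           # direct (+1) / indirect (-1) pathway
wI = rng.normal(0.10, 0.02, (n_neurons, n_inputs))
wL = rng.normal(0.05, 0.01, (n_neurons, n_neurons))

def rate(u):
    # Threshold-power activation function, equation (7).
    return np.where(u > theta0, theta1 * np.maximum(u - theta0, 0.0) ** theta2, 0.0)

def learn(u, I, rpe):
    # One application of equations (8) and (9) given internal states u,
    # external input I and the reward-prediction error rpe.
    active = u > theta0                          # silent neurons do not learn
    gain = theta2 * theta1 * np.maximum(u - theta0, 0.0) ** (theta2 - 1)
    Q = (u > theta0).astype(float)               # lateral activity indicator Q(u)
    wI[active] += alpha * (D * rpe * gain)[active, None] * I[None, :]
    wL[active] += alpha * beta * (D * rpe * gain)[active, None] * Q[None, :]

u = rng.normal(0.2, 0.1, n_neurons)              # stand-in internal states
I = rng.random(n_inputs)                         # stand-in external input
learn(u, I, rpe=0.5)
print("mean firing rate:", round(float(rate(u).mean()), 3))
print("mean input weight after one update:", round(float(wI.mean()), 4))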


Answer Set Programming for Stream Reasoning

Thang M. Do, Seng W. Loke, and Fei Liu

Dept. of CSCE, La Trobe University, [email protected], {S.Loke,F.Liu}@latrobe.edu.au

Abstract. This paper explores Answer Set Programming (ASP) for stream reasoning with data retrieved continuously from sensors. We describe a proof-of-concept with an example of using declarative models to recognize car on-road situations.

Keywords: ASP, stream reasoning, semantic sensor based applications.

1 Introduction

A new concept of “stream reasoning” has been proposed in [8]. Recently, dlvhex, an extension of ASP, has been introduced as one candidate for rule-based reasoning for the Semantic Web [6]. dlvhex uses the semantic reasoning approach, which makes it fully declarative and always terminating. dlvhex can deal with uncertain data (by using disjunctive rules to generate multiple answer sets), interoperate with arbitrary knowledge bases (to query data), and work with different reasoning frameworks (e.g., higher-order logic, for more reasoning power). However, to our knowledge, using dlvhex and ASP for stream reasoning is new.

Our research has three aims: i) to introduce a prototype of dlvhex stream reasoning, ii) to formalize ASP for building stream reasoning systems, and iii) to further apply Semantic Web techniques (OWL) to sensor-based applications. The contribution is to propose a framework and theoretical formulation for building ASP-based stream reasoning systems with a focus on sensor stream applications.

There has been research applying ASP in wireless sensor network applications, such as home-based health care services for aged people [7] or dealing with ambiguous situations [4], but the concept of stream reasoning was not introduced. To implement declarative stream processing, the logic framework of LarKC [5] is based on aggregate functions (rather than logic programming semantics) of the C-SPARQL [3] language. The logic framework in [1] reasons with a stream of pairs of RDF triples and a timestamp, but the ability to deal with unstable data was not mentioned. Therefore, we see the necessity of a foundation for ASP-based stream reasoning and of investigating its feasibility.

2 ASP-Based Stream Reasoning: A Conceptual Model

In this section, we describe a conceptual model that formalizes ASP-based stream reasoning that processes streams of data into answer sets. We describe: i) a general abstract architecture of a stream reasoning system, ii) a formal model of data streams, and iii) a formalization of an ASP-based stream reasoner.

Abstract Architecture. A stream reasoning system has three main components: the sensor system, the data stream management system (DSMS) [2], and the stream reasoner, as illustrated in Figure 1 (SSN stands for Semantic Sensor Network).

Fig. 1. Simple stream reasoning system

Notation. We introduce the notation which is used in the next two sections.

- dr: the time period between the starting time and the finishing time of a reasoning process, which always terminates.
- ds: the time period between the starting time and the finishing time of a sensor taking a data sample (usually very small).
- Δs: the time period between the two start times of taking two consecutive data samples of a sensor. The sample rate fs is fs = 1/Δs.
- Δr: the time period between the two start times of two consecutive reasoning processes of the reasoner. The reasoning rate fr is fr = 1/Δr.

There are two communication strategies between the DSMS and the stream reasoner: push and pull. In the pull method, when the reasoner needs sensor data sample(s), it sends a query to the DSMS, which will perform the query and return the data sample(s) to the reasoner. In the push method, the reasoner registers with the DSMS the sensor name from which it wants to have the data sample. The DSMS returns to the reasoner the data sample whenever it is available.

We use the pull method in our prototype to discover the maximum reasoning speed of the reasoner when continuously running as fast as possible.

Data Stream Formalization. This section introduces the formalization of the data stream provided to the stream reasoner. The time when a sample is taken is assumed to be very close to the time when that sample is available for reasoning; otherwise the reasoner will give its result with a consistent delay.

Definition 1 (Data Stream). A data stream DS is a sequence of sensor data samples di ordered by timestamps: DS = {(t1, dt1), (t2, dt2), . . . , (ti, dti), . . .}, where dti is the sensor data sample taken at time ti, and t1 < t2 < . . . < ti < . . ..

Definition 2 (Data Window). A data window available at time t, Wt, is a finite subsequence of a data stream DS that has the latest data sample taken at time t. The number of data samples, |Wt|, of this subsequence is the size of the window. For Wt ⊆ DS and ts = t: Wt = {(t1, dt1), (t2, dt2), . . . , (ts, dts)}, where Wt is the data window at time t, s = |Wt| is the size of the window, t1 < t2 < . . . < ts, ts is the time when the latest sample of the data window is taken, and dti (1 ≤ i ≤ s) is the sensor data sample taken at time ti.


The data window can also be defined by a time period, for example, a data window that includes all data samples taken in the last 10 seconds.

Definition 3 (Window Slide Samples). The window slide samples l is the number of samples counted from the latest sample (inclusive) of one data window to the latest sample (exclusive) of the next data window.

Definition 4 (Window Slide Time). Given two consecutive data windows Wt1 at time t1 and Wt2 at time t2 (t2 ≥ t1), the time period between t1 and t2 is called the window slide time Δw, i.e., Δw = t2 − t1.

From Definition 3 we can calculate the window slide time with the formula Δw = l ∗ Δs. When we use the term “window slide”, it means window slide samples or window slide time depending on context.

Definition 5 (Data Window Stream). Given a data stream DS, a data window stream WS is a sequence of data windows W in time order: WS = {(t1, Wt1), (t2, Wt2), . . . , (ti, Wti), . . .}, where Wti is a data window at time ti, t1 < t2 < . . . < ti < . . ., and Wti ⊆ DS.

In dlvhex, we use &qW to represent a predicate which queries a data window from a DSMS. This predicate is extended from the external atoms of dlvhex-dlplugin (http://www.kr.tuwien.ac.at/research/systems/dlvhex/download.html): &qW[|W|, URI, sn](X, V), where &qW is an external predicate that queries a data window from the DSMS, |W| (input) is the window size, URI (input) is a Uniform Resource Identifier or the file path of the OWL ontology data source, sn (input) is the name of the sensor providing the data sample, X (output) is the name of the returned instance of the ontology class that describes the sensor, and V (output) is the data sample value returned. A small sketch of these notions is given below.
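To make Definitions 1–5 and the role of &qW concrete, the short Python sketch below turns a timestamped sample stream into a stream of data windows of size |W| with a window slide of l samples. It is only an illustration; the names and toy values are assumptions, and in the prototype the window is obtained through the &qW external atom instead.

from collections import deque
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[float, float]                     # (timestamp t_i, data sample d_ti)

def window_stream(stream: Iterable[Sample], size: int, slide: int) -> Iterator[List[Sample]]:
    # Yield data windows of |W| = size samples, advancing by `slide` samples
    # (the window slide samples l of Definition 3).
    buf: deque = deque(maxlen=size)
    since_last = 0
    for sample in stream:                        # samples arrive ordered by timestamp
        buf.append(sample)
        since_last += 1
        if len(buf) == size and since_last >= slide:
            yield list(buf)                      # W_t, ending at the latest sample
            since_last = 0

# Toy data stream sampled every 0.3 s; windows of size 5 with slide l = 2.
toy_stream = [(0.3 * i, 50 + (i % 7)) for i in range(20)]
for w in window_stream(toy_stream, size=5, slide=2):
    print("window ending at t = %.1f ->" % w[-1][0], [d for _, d in w])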

ASP-Based Stream Reasoner. This section introduces a formalization of the stream reasoner of a system model that has one data stream and one reasoner. This is easily extendible to models that have: one data stream providing data for multiple reasoners, one reasoner using data from multiple data streams, and many reasoners using data from multiple data streams.

Definition 6 (Data Window Reasoner). An ASP-based data window reasoner AWR is a function that maps every data window W ⊆ DS to a set SA of answer sets: AWR : WS → 2Σ, where AWR denotes an ASP-based data window reasoner, WS is the set of all data windows from data stream DS, Σ denotes the set of all possible answer sets S for any input, and 2Σ is the power set of Σ.

The reasoner AWR has input data window W and gives a set SA of answer sets: AWR(W) = SA, where SA = {S1, S2, . . . , Sn}, and Si ∈ Σ (1 ≤ i ≤ n).

When using pull communication, the reasoner AWR runs continuously with an interval of Δr, queries input data (not waiting for the data to arrive as in the push method) from a data window stream WS, and gives a stream of sets of answer sets SA: &aSR(AWR, WS, Δr) = {SAt1, SAt2, . . . , SAti, . . .}, where &aSR is the meta operator (or external predicate in dlvhex) that triggers the reasoner AWR to run continuously, Δr is the interval at which the meta operator &aSR repeatedly executes AWR, and SAti is the set of answer sets output at time ti.

From Definition 6, we have AWR(Wti′) = SAti, where ti is the time when the reasoner gives the output, and ti′ is the time when the input data becomes available. The input data was available before the reasoning process, so ti′ < ti − dr. The reasoner uses the latest data window, so ti′ = max(tj), t0 ≤ tj < (ti − dr), where tl (l ≥ 0) is the time when a data window Wtl becomes available.
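A rough sketch of the pull strategy and of the &aSR meta operator is shown below: every Δr seconds the latest data window is fetched and an external ASP solver is invoked on it. The solver command line, file names and the encoding of the window as facts are assumptions for illustration; they are not the exact shell script used in the prototype.

import subprocess
import tempfile
import time

DELTA_R = 0.3            # reasoning interval (assumed value for Δr), in seconds
WINDOW_SIZE = 5          # |W|

def latest_window(size):
    # Stand-in for querying the DSMS; in the prototype this is the &qW
    # external atom evaluated against the OWL data.
    return [("sensorOnto:AcceX%d" % (i + 1), 50 + i) for i in range(size)]

def run_reasoner(window, program_path="situations.dlv"):
    # One execution of the data window reasoner AWR via an external solver.
    facts = "".join('acceX%d("%s",%d).\n' % (i + 1, name, value)
                    for i, (name, value) in enumerate(window))
    with tempfile.NamedTemporaryFile("w", suffix=".lp", delete=False) as f:
        f.write(facts)
        facts_path = f.name
    # Hypothetical solver invocation; replace with the dlvhex command line in use.
    result = subprocess.run(["dlvhex2", program_path, facts_path],
                            capture_output=True, text=True)
    return result.stdout

for _ in range(10):      # &aSR: in practice this loop runs continuously
    started = time.time()
    print(run_reasoner(latest_window(WINDOW_SIZE)))
    time.sleep(max(0.0, DELTA_R - (time.time() - started)))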

3 Prototype Implementation and Experimentation

Prototype Implementation. As a proof-of-concept of the model introduced in Section 2, we built a prototype to detect driving situations of a car travelling in public traffic conditions. The system has the main components illustrated in Figure 2 and uses models of car situations (e.g., turning left or right) as constraints on sensor data values, defined declaratively as dlvhex programs.

Fig. 2. Prototype design

The sensor system is built using the SunSPOT tool kit version 4.0 (http://www.sunspotworld.com). We build a simple ontology called Sensor Ontology in the OWL language. Sensor data is placed in a queue in an OWL file which is fed to RacerPro version 1.9.0 (http://www.racer-systems.com). This setup simulates a DSMS as mentioned in Section 2. The reasoner AWR is built from a dlvhex program (http://www.kr.tuwien.ac.at/research/systems/dlvhex/download.html) and we use a Unix shell script to realize the meta predicate &aSR. The prototype is installed in Ubuntu 9.10, which runs on a Sun VirtualBox version 3.1.6 on our Windows Vista Fujitsu Lifebook T 4220.

Experimental Setup. We attach the sensor kit in the middle of a car, with the sampling rate of the accelerometer set to 0.3 s/sample, matching the maximum speed of the reasoner. AcceX and AcceZ are the horizontal and vertical directions respectively; AcceY is along the forward direction. The AcceX and AcceY values are mapped to the scale [0, 100]. An AcceX data sample of a data window is created as below:


<AcceX rdf:ID="AcceX1">
  <acceValue rdf:datatype="&xsd;positiveInteger">55</acceValue>
</AcceX>

To query for a data window, we use the dlvhex atom &dlDR[URI, a, b, c, d, Q](X, Y ):

url("../ontology/driving.owl").

acceX1(X,Y) :- &dlDR[U,a,b,c,d,"acceValue"](X,Y), X="sensorOnto:AcceX1", url(U).

We have to query every single data sample in the data window. With our proposed formula (&qW[|W|, URI, sn](X, V)) we only use one rule to obtain the most recent data window (of size five):

acceX(X, Y) :- &qW[5, U, "sensorOnto:AcceX"](X, Y).

The code below, a declarative model of a right-turn situation, reasons to recognize "right turn" situations. It implements an ASP-based window reasoner as defined conceptually earlier.

% right turn:

doingRightTurn :- acceX1(X1,Y1), acceX2(X2,Y2), acceX3(X3,Y3), acceX4(X4,Y4),

acceX5(X5,Y5), #int(S1), #int(S2), #int(S3), #int(S4), #int(Y1), #int(Y2),

#int(Y3), #int(Y4), #int(Y5), S1=Y1+Y2, S2=S1+Y3, S3=S2+Y4, S4=S3+Y5, S4>277.

The bounds (e.g., 277) were obtained via experiments. Similar rules model other car on-road situations. Because we used a UNIX shell script to trigger the reasoner continuously, the operating system has to repeatedly load dlvhex, run it, and then unload it. This is resource consuming and can reduce reasoning speed, but it provided a fast, though crude, implementation of the meta-predicate, adequate for reasoning about car on-road behaviours.

In dlvhex, using rules with disjunctive heads can give several possible answer sets representing several possible situations given the same sensor data readings.

doingRightTurn v doingLeftTurn :- acceX1(X1,Y1), acceX2(X2,Y2), ....

Results and Evaluation. The maximum reasoning speed of the system is nine (three) times/s without (with) querying ontology data. We tested the system in two running states (normal and delayed) with our car's speed in the range 25-50 km/h for straight driving and 25-40 km/h for turning, turning angles approximately greater than 30°, and three data window sizes (one, two and five). The system recognizes turning situations with higher accuracy at higher speed.

With window size five, the system detected 15 left turns and 15 right turns with no error. With window size two, the system detected 10 left turns and 10 right turns with no error. With window size one, the system is very sensitive and often mis-recognizes because the accelerometer sensor's data fluctuates. When the reasoner ran with a small deliberate delay, with window size five, the system detected 10 left turns and 10 right turns with six errors. With window size two, the system detected eight left turns and eight right turns with seven errors.

This result means that using smaller data window sizes makes the system more sensitive and reduces accuracy, but allows it to more quickly recognize fine-grained situations and different fine-grained car manoeuvres such as starting, executing, and finishing a turn. When using larger data window sizes, the system returns more accurate results and deals better with unstable sensor data.

Overall, our system detects turning, stopping, going straight and going over a ramp with high accuracy. The bound values in the rules can be adjusted to change the sensitivity of the system. We could use machine learning to process such sensor data, but our aim is to illustrate a simple proof-of-concept of ASP-based stream reasoning where the stream comprises a sequence of time-stamped OWL objects. This prototype suggests the potential for applications that require up to three reasoning processes per second, such as driving assistance.

4 Conclusion and Future Work

This paper has provided a conceptual model of ASP-based stream reasoning, and showed the feasibility of stream reasoning with dlvhex for semantic sensor based applications. This project successfully used OWL objects to represent sensor data (which is more general than even time-stamped RDF triples) and utilized dlvhex to reason with this data. Our future work will: (i) implement repeated reasoning within dlvhex programs themselves to improve the system's performance, and (ii) research hybrid ASP-machine learning for stream reasoning.

References

1. Barbieri, D., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Incremental reasoning on streams and rich background knowledge. In: Aroyo, L., Antoniou, G., Hyvonen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6088, pp. 1–15. Springer, Heidelberg (2010)
2. Barbieri, D., Braga, D., Ceri, S., Valle, E.D., Huang, Y., Tresp, V., Rettinger, A., Wermser, H.: Deductive and inductive stream reasoning for semantic social media analytics. IEEE Intelligent Systems 25(6), 32–41 (2010)
3. Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: C-SPARQL: SPARQL for continuous querying. In: Proceedings of the 18th International Conference on World Wide Web, pp. 1061–1062. ACM, New York (2009)
4. Buccafurri, F., Caminiti, G., Rosaci, D.: Perception-dependent reasoning and answer sets (2005), http://www.ing.unife.it/eventi/rcra05/articoli/BuccafurriEtAl.pdf
5. Della Valle, E., Ceri, S., Barbieri, D.F., Braga, D., Campi, A.: A first step towards stream reasoning. In: Domingue, J., Fensel, D., Traverso, P. (eds.) FIS 2008. LNCS, vol. 5468, pp. 72–81. Springer, Heidelberg (2009)
6. Eiter, T., Ianni, G., Schindlauer, R., Tompits, H.: dlvhex: A prover for semantic-web reasoning under the answer-set semantics. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 1073–1074 (2006)
7. Mileo, A., Merico, D., Pinardi, S., Bisiani, R.: A logical approach to home healthcare with intelligent sensor-network support. Comput. J. 53, 1257–1276 (2010)
8. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It's a streaming world! Reasoning upon rapidly changing information. IEEE Intelligent Systems 24(6), 83–89 (2009)


A Markov Decision Process Model for Strategic Decision Making in Sailboat Racing

Daniel S. Ferguson and Pantelis Elinas (corresponding author)

Australian Centre for Field Robotics, The University of Sydney, Sydney
[email protected]

Abstract. We consider the problem of strategic decision-making for inshore sailboat racing. This sequential decision-making problem is complicated by the yacht's dynamics, which prevent it from sailing directly into the wind but allow it to sail close to the wind, following a zigzag trajectory towards an upwind race marker. A skipper is faced with the problem of sailing the most direct route to this marker whilst minimizing the number of steering manoeuvres that slow down the boat. In this paper, we present a decision theoretic model for this decision-making process assuming a fully observable environment and uncertain boat dynamics. We develop a numerical Velocity Prediction Program (VPP) which allows us to predict the yacht's speed and direction of sail given the wind's strength and direction as well as the yacht's angle of attack with respect to the wind. We specify and solve a Markov Decision Process (MDP) using our VPP to estimate the rewards and transition probabilities. We also present a method for modelling the wind flow around landmasses, allowing for the computation of strategies in realistic situations. Finally, we evaluate our approach in simulation, showing that we can estimate optimal routes for different kinds of yachts and crew performance.

1 Introduction

Sailing is both a recreational activity enjoyed by many and a competitive team sport. In this paper, we focus on the latter, considering the problem of making strategic decisions for inshore yacht racing.

Figure 1 shows an example of the course sailed by a yacht (a J24 one-design keelboat) in Sydney harbor during a competitive race consisting of upwind and downwind legs. The data was collected using an off-the-shelf GPS device. We can see that for the downwind leg, the yacht can sail on a straight line between the markers. For the upwind leg, the yacht follows a zigzag course because it is constrained in its ability to sail directly into the wind. The closest a yacht can sail into the wind varies depending on its design and can be as close as 30 degrees. In order to reach the windward mark, the yacht's skipper must perform a manoeuvre known as tacking, which turns the bow of the boat through the wind, slowing it down as a result. The goal of a racing boat is to traverse the length of the course as quickly as possible (more accurately, faster than all the other boats), so a skipper must sail where the wind is strongest while minimizing the number of tacking manoeuvres that slow down the boat. On average, the wind's strength and direction are expected to remain constant during the race, with temporary fluctuations around a given mean. Wind gusts, i.e., local and temporary wind fluctuations, are common and either help the boat advance forward faster or slow it down; similarly, landmasses affect the wind by changing its direction and strength. In such an environment, skippers are faced with a difficult sequential decision-making problem, for which we provide a decision theoretic solution as described in this paper.

Fig. 1. Example trajectory sailed for a race in Sydney harbor. Data collected using a consumer-level GPS sensor.

The rest of this paper is structured as follows. In Section 2, we review previous work on weather routing. In Section 3, we introduce the basics of the physics of sailing and our implementation of a Velocity Prediction Programme (VPP) essential for our MDP model described in Section 4. In Section 5, we introduce a method for modelling the effects of landforms on the wind flow. We evaluate our method by simulating different scenarios in Section 6. We conclude and discuss future work in Section 7.

2 Previous Work

In the past, researchers have placed much emphasis on understanding the physics of sailing and on the creation of numerical Velocity Prediction Programs along with weather routing algorithms for offshore sailing.

The physics of sailing today are well understood using the theory of fluid mechanics [1]. Based on these physics, over the years several researchers have developed methods for real-time, numerical yacht performance prediction, differing only in the initial assumptions and modelling of different yacht parameters [7,9].


These numerical VPPs have been used to develop race modelling programs (RMPs) for predicting the results of races. For example, Fernandez et al. [4] present the development of one such RMP that determines routes using graph search. Philpott et al. [6] develop a model for predicting the outcome of a yacht match race between two yachts. They assume that a fixed strategy is given and focus on the stochastic modelling of wind phenomena and the accurate simulation of yacht dynamics. Another similar approach for match prediction is presented by Roncin et al. [8]. Stelzer et al. [10] present a method for short term weather routing for a small robotic boat, assuming fixed wind conditions and focusing on reaching a given target position with no guarantees of trajectory optimality with respect to minimizing the time to reach the goal, as desired in competitive sailing.

The work closest to our own is that of Philpott et al. [5], who also model the stochastic, sequential decision-making problem as a Markov Decision Process, but they assume that the boat dynamics are deterministic and focus entirely on modelling wind phenomena without, however, taking into account the effect of landmasses in and around the course.

3 Numerical Velocity Prediction Program (VPP)

A sailboat moves under the influence of wind striking its sails and the water current striking its keel. The physics of sailing are understood well enough to allow the prediction of a yacht's velocity given the wind conditions, i.e., strength and direction, and a considerably large set of yacht parameters including sail area, boat length and shape, and keel size and shape. The main principles of lift and drag that explain how a plane flies apply to a sailboat. In part (a) of Figure 2, we show a simplified diagram of the aerodynamic and hydrodynamic forces acting upon a yacht. In this document, we will not provide a detailed derivation of these forces but the reader can find the details in [3],[11]. Briefly, the main aerodynamic forces are given by the following equations.

The aerodynamic lift is given by

SFA = ½ ρ v²AW Cl    (1)

where ρ is the air density, vAW is the apparent wind speed, and Cl is the lift coefficient, which is a function of the angle of attack, heel angle, sail trim and area. The aerodynamic drag is given by the same equation but with Cl replaced by Cd, the drag coefficient.

The induced aerodynamic resistance due to lift is given by

Ri = (SFA / cos φ)² / (½ ρ v²AW A π Re)    (2)

where ρ is the air density, SFA is the aerodynamic side force, φ is the heel angle, vAW is the apparent wind speed, A is the area of the sail, and Re is the effective aspect ratio.


Fig. 2. (a) The different forces acting on the yacht, including SFH and SFA, the hydrodynamic and aerodynamic sideways forces; β, the leeway angle; βAW, the apparent wind angle; βTW, the true wind angle; vAW, the boat velocity in the direction of the apparent wind; vTW, the boat velocity in the direction of the true wind; and (b) the flow chart for the numerical VPP calculations.

There are similar equations for the hydrodynamic forces acting on the yacht's keel and these can be found in [3],[11]. We note that for the work presented in this paper, we ignore a number of other forces such as wave resistance as well as the effect of other boats sailing in close proximity.

Part (b) of Figure 2 shows the basic structure of our numerical VPP. The VPP takes as input the yacht parameters and the wind data necessary for evaluating the aerodynamic and hydrodynamic forces given by the equations listed above. Using an iterative procedure, the VPP searches for those values of the yacht speed and leeway angle that perfectly balance all the forces. These estimated values are the VPP output. We emphasize that the output of this program is the maximum speed of the boat for the given conditions; we do not take into account the fact that the boat needs to slowly accelerate to this speed.
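The equilibrium search in part (b) of Figure 2 can be sketched as a one-dimensional fixed-point problem: guess a boat speed, evaluate the driving and resistive forces, and adjust the speed until they balance. The force models below are drastically simplified stand-ins with invented constant coefficients, chosen only to show the structure of the iteration, not the actual VPP used in this paper.

import math

RHO_AIR = 1.2            # air density, kg/m^3
SAIL_AREA = 20.0         # m^2 (assumed)
CD_DRIVE = 0.9           # crude aggregate driving-force coefficient (assumed)
K_HULL = 60.0            # crude quadratic hull-resistance coefficient (assumed)

def driving_force(v_apparent_wind):
    # Simplified stand-in for the aerodynamic force of equation (1).
    return 0.5 * RHO_AIR * v_apparent_wind ** 2 * SAIL_AREA * CD_DRIVE

def hull_resistance(v_boat):
    # Simplified stand-in for the hydrodynamic resistance terms.
    return K_HULL * v_boat ** 2

def vpp_speed(true_wind, attack_angle_deg, iters=60):
    # Bisection search for the boat speed at which the forces balance.
    angle = math.radians(attack_angle_deg)
    lo, hi = 0.0, 30.0                           # m/s bracket for the equilibrium speed
    for _ in range(iters):
        v = 0.5 * (lo + hi)
        # Apparent wind from the true wind plus the boat's own motion.
        v_aw = math.hypot(true_wind * math.sin(angle),
                          true_wind * math.cos(angle) + v)
        if driving_force(v_aw) * math.sin(angle) > hull_resistance(v):
            lo = v                               # net forward force: speed can grow
        else:
            hi = v
    return 0.5 * (lo + hi)

print("Predicted speed at 45 degrees to an 8 m/s wind: %.2f m/s" % vpp_speed(8.0, 45.0))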

4 Markov Decision Process Model

We tackle our sequential decision making problem using a Markov Decision Process (MDP) [2] with discrete states and actions. An MDP is a tuple {S, A, Pr, R}, where S is a finite set of states and A is a finite set of actions. Actions induce stochastic state transitions, with Pr(s, a, s′) denoting the probability with which state s′ is reached when action a is executed at state s. R(s) is a real-valued reward function, associating with each state s its immediate utility R(s).

Solving an MDP means finding a mapping from states to actions, π(s). Solutions are evaluated based on an optimality criterion such as the expected total reward.


An optimal solution π′(s) is one that achieves the maximum over the optimality measure, while an approximate solution comes to within some bound of the maximum. We use the value iteration algorithm to compute an optimal, infinite-horizon policy, with expected total discounted reward as the optimality criterion.
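For reference, a generic value iteration loop over the tuple {S, A, Pr, R} is sketched below. The tiny transition matrix and rewards are placeholders, not the sailing model, which fills Pr from the Gamma model of Section 4.3 and R from the VPP of Section 3.

import numpy as np

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    # P has shape (A, S, S) with P[a, s, s'] = Pr(s, a, s');
    # R has shape (S,) with the immediate reward R(s).
    V = np.zeros(P.shape[1])
    while True:
        Q = R[None, :] + gamma * P @ V           # Q[a, s] = R(s) + gamma * sum_s' Pr * V(s')
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=0)       # value function and greedy policy
        V = V_new

# Tiny placeholder MDP with 3 states and 2 actions (do-nothing / tack).
P = np.array([[[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],
              [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]]])
R = np.array([-0.5, -0.2, 10.0])                 # the last state plays the role of the goal
V, policy = value_iteration(P, R)
print("V =", np.round(V, 2), "policy =", policy)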

In the following sections, we give the details of our sailing MDP model.

4.1 States

For our problem, we have discretized the state space considering 3 features, which are the boat's x, y position and tack (port or starboard). We can vary the resolution of the grid to compute policies at different levels of detail, trading off between accuracy and computation time.

4.2 Actions

There is only one action available to the skipper and that is changing the boat's tack from port to starboard and vice versa. So, for our MDP model there are 2 actions available at any state, A = {donothing, tack}. Executing a tack action is associated with a penalty because it forces the boat to turn through the wind and slow down.

4.3 Transition Function

During the race, the yacht is always making progress towards the goal (it would make no sense to do otherwise) but its dynamics are uncertain. The MDP transition function, capturing the stochastic nature of the yacht's motion in time, is the hardest component of our model to specify. There are 3 main reasons why the location of the yacht in the future cannot be predicted exactly:

1. The yacht does not move in a straight line in the direction its bow is pointing because of the net sideways force induced by the wind and water current. An approximate value of the leeway angle is one of the two outputs of our VPP described in Section 3.
2. The wind is only constant on average. Over the course, wind gusts affect the boat's ultimate location.
3. We assume that the boat is always sailing at an angle closest to the wind. Realistically, even the most experienced skippers and crews cannot maintain such an accurate heading.

So, we need a way to compute the probability of the boat landing in any one of the squares in the next column in the grid satisfying the above requirements. One might be tempted to use a Gaussian distribution centered around the predicted boat location, taking the leeway angle into account. However, this will be incorrect considering that, unless a tacking action is performed, the yacht is less likely (because of basic physics) to reach any of the cells on the windward side compared to those on the leeward side of the boat. As a result, we find that the Gamma distribution for estimating transition probabilities is a better choice,


g(τ | α, δ) = 1 / (δ^α Γ(α)) · τ^(α−1) e^(−τ/δ)    (3)

where Γ(α) = (α − 1)! and the two parameters α (shape) and δ (scale) have values greater than 0. We can estimate the probability of landing within each grid cell via the cumulative distribution function.

Fig. 3. (a) The Gamma distribution (probability against displacement in meters) for leeway angles of 1, 5, 8 and 12 degrees, and (b) an example of the transition probabilities calculated for a leeway angle of 0° and the wind coming from the Southeast at a 45° angle to the boat.

However, in order for us to use this distribution for specifying the transition probabilities of our MDP, we must give functions for determining its two parameters α and δ. The skew of the PDF is determined by the leeway angle of the yacht, β. The shape parameter α relates to the skewness γ as follows:

γ = 1/√α ⇔ α = 1/γ²    (4)

For small leeway angles we wish to have small skew, and to achieve this a large shape parameter is required. We would also like skewness to increase roughly linearly with the leeway angle. A formulation that achieves both of these requirements is α = 3 + 1/β². The scale parameter δ is important as it determines the scaling of the PDF on the x and y axes. We require that the distribution has an increasingly large tail for increasingly large leeway angles. The scale parameter δ is a function of the distribution variance σ and the shape parameter α, given by

σ = α δ² ⇔ δ = √(σ/α)    (5)

For our work, we decided on the value of the variance σ empirically, given by

σ = σset + 10 β σset    (6)


where σset = 5 is the variance of the distribution for a zero leeway angle. Examples of the Gamma distribution used to compute the transition model for our MDP for different values of the leeway angle are shown in part (a) of Figure 3. Part (b) of Figure 3 shows an example of the estimated transition probabilities for a 0° leeway angle and the wind blowing from the Southeast at a 45° angle to the boat.
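Putting these relations together, a sketch of how the transition probabilities can be filled in is shown below: α, σ and δ follow the formulas above, and the probability mass assigned to each grid cell is obtained by differencing the Gamma cumulative distribution function at the cell boundaries. The cell size and σset follow the text; the units assumed for the leeway angle and the remaining details are illustrative assumptions.

import numpy as np
from scipy.stats import gamma as gamma_dist

def transition_probs(beta, cell_size=25.0, n_cells=20, sigma_set=5.0):
    # Probability of landing in each of n_cells cells of lateral displacement,
    # using alpha = 3 + 1/beta^2, sigma = sigma_set + 10*beta*sigma_set and
    # delta = sqrt(sigma/alpha). beta is the leeway angle; its units are an
    # assumption here.
    beta = max(abs(beta), 1e-3)                      # avoid division by zero
    alpha = 3.0 + 1.0 / beta ** 2                    # shape, from the skewness relation
    sigma = sigma_set + 10.0 * beta * sigma_set      # empirical variance
    delta = np.sqrt(sigma / alpha)                   # scale
    edges = np.arange(n_cells + 1) * cell_size       # cell boundaries in metres
    cdf = gamma_dist.cdf(edges, a=alpha, scale=delta)
    probs = np.diff(cdf)                             # integrate the density over each cell
    return probs / probs.sum()                       # renormalize over the truncated grid

probs = transition_probs(beta=5.0)                   # e.g. a 5 degree leeway angle
print("most likely displacement cell:", int(np.argmax(probs)))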

4.4 Rewards

We set the reward for the goal state to a large positive value. In addition, we specify the reward for each state as a function of the boat's velocity, determined using the VPP developed in Section 3 for the given wind conditions. The immediate reward for the boat at location (x, y) and a given tack is given by

R(s) = (vb(s) − vb^max) / vb^max    (7)

where vb(s) gives the boat's predicted speed for state s, and vb^max is an upper limit on the boat speed. This upper limit exists for all displacement boats due to the physics of sailing and is known as hull speed [1]. We note that the reward function is scaled to give values in the range [−1, 0] such that maximum reward is assigned to those states where the yacht sails the fastest, i.e., closest to vb^max.

5 Modelling the Effects of Landmass on the Wind

The phenomena involved in wind patterns and in modelling these are extremely complex and well beyond the scope of this paper. In general, it is difficult to find wind models for the small scales we are dealing with in inshore sailing, as the majority of weather modelling research goes into large offshore wind patterns. Having said that, we will examine the results of a wind pattern survey around a bay in California [12] and extrapolate some effects from them.

Carefully observing the wind data shown in [12], we see that a landmass can be adequately modelled as an ellipse. In addition, there are four important aspects of note in terms of how wind flows around it. These are:

– The wind flows in a continuous manner around the landform, and somewhat mimics its shape.
– The effects of the landmass decrease the further away from it.
– In the region behind the landmass the wind is greatly reduced.
– In the region in front of the landmass the wind velocity increases (this effect is not modelled in this paper).

Based on the above observations, we represent landmasses using ellipses, allowing us to model landmasses of different sizes with ease while keeping the calculations of wind flow around them relatively simple. So, a landmass is defined as L = [lx, ly, sx, sy], where (lx, ly) denotes the center of the ellipse and (sx, sy) the lengths along the x and y axes respectively.


Fig. 4. Elliptical model of landmass and wind flow around it

First, the region over which the effect of the landmass is seen is set at twice the size of the landmass, that is, sx^eff = 2sx and sy^eff = 2sy. Outside this region there is no effect on the wind. Now that the region that is affected by the land is known, we must consider the features of the wind pattern that have been noted, i.e., continuous flow that mimics the shape of the landmass, and a decreasing effect of the landmass on the wind the further away from the land. In order to mimic the landform (i.e., flow around it) we first arbitrarily set the wind in the x direction to xwind = 1. Now the wind component in the y direction is required, and this is dependent on the shape of the land, as we wish the curvature of the wind field to follow that of the land. As can be seen in Figure 4, the larger the size of the landmass in the y direction (for the same size along x), the larger the y component of wind velocity needed. Mathematically, ywind ∝ sy/sx.

The next feature required is the dependence on distance from the landmass. The further away from the landform, the less the wind will be affected by it, so the wind in the y direction is related to the distance as ywind ∝ 1/d, where d is the radial difference between the current location and the edge of the landform. In order to calculate this difference, we require an equation for the radius of the landform, which is the standard equation for the radius of an ellipse:

rland(ω) = sx sy / √((sx sin ω)² + (sy cos ω)²)    (8)

where ω is the angle between the radius and the x axis. We can now define d(x, y, ω) = √((x − lx)² + (y − ly)²) − rland(ω). Combining all the above relationships, we can derive an equation for estimating the wind strength in the y direction for the affected region, i.e., lx − sx^eff < x < lx + sx^eff, as

ywind(x, y, sx, sy, ω) = (sy/sx) · sin(π (x − sx^eff) / sx^eff) / d(x, y, ω)    (9)

and the wind direction is given by

αwind = tan⁻¹(ywind / xwind)    (10)

Figure 6 shows an example of the wind flow around four landmasses estimated using our elliptical model.
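The sketch below evaluates equations (8)-(10) on a few grid points around a single elliptical landmass. The handling of points outside the region of influence and on the landmass itself follows the qualitative description above and is otherwise an assumption.

import numpy as np

def wind_at(x, y, lx, ly, sx, sy):
    # Wind direction (x_wind, y_wind) near one elliptical landmass, equations (8)-(10).
    x_wind = 1.0                                   # base flow along x
    sx_eff, sy_eff = 2.0 * sx, 2.0 * sy            # region of influence
    if not (lx - sx_eff < x < lx + sx_eff and ly - sy_eff < y < ly + sy_eff):
        return x_wind, 0.0                         # outside: undisturbed flow
    omega = np.arctan2(y - ly, x - lx)             # angle of the radius to the x axis
    r_land = sx * sy / np.sqrt((sx * np.sin(omega)) ** 2 + (sy * np.cos(omega)) ** 2)
    d = np.hypot(x - lx, y - ly) - r_land          # radial distance from the ellipse edge
    if d <= 0:
        return 0.0, 0.0                            # on the landmass itself (assumed)
    y_wind = (sy / sx) * np.sin(np.pi * (x - sx_eff) / sx_eff) / d    # equation (9)
    alpha_wind = np.arctan2(y_wind, x_wind)        # equation (10)
    return np.cos(alpha_wind), np.sin(alpha_wind)  # unit wind direction

# Sample the field on a coarse grid around a landmass centred at (10, 10).
for gy in range(6, 15, 2):
    row = [wind_at(gx, gy, lx=10, ly=10, sx=3, sy=2) for gx in range(4, 17, 3)]
    print("  ".join("(%+.2f,%+.2f)" % (u, v) for u, v in row))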


6 Experimental Evaluation

We present simulation results supporting our approach for strategic decision making in yacht racing in Figures 5 and 6 as well as Table 1. Given our model, we know that the policies computed using the value iteration algorithm will be optimal (we used the Matlab MDP toolbox, downloaded from http://www.inra.fr/internet/Departements/MIA/T/MDPtoolbox/, for computing policies), so in this section we qualitatively and quantitatively evaluate the computed policies in simulation and discuss the performance of different boats.

Fig. 5. Example trajectories sailed in simulation for three different tack penalties (0.0, 0.2 and 0.5) for the same boat. The colour of each square indicates the reward for the boat sailing in that area, such that 0 is the highest reward and −1 the lowest, as described in Section 4.4. For these experiments the wind is blowing from the East at 18 ± 4 meters per second.

Figure 5 shows the trajectories of a single simulation for the same boat but 3 different penalties for executing the tacking manoeuvre. The results shown were computed for a course of size 3000 × 3000 meters at a resolution of 100 × 100 meters per grid cell; for all experiments, policy computation requires less than 10 seconds of compute time on a laptop computer. We generated a random wind pattern with a wind strength of 18 ± 4 meters per second from the East. We can see from the figure that when the tacking penalty is large, the optimal policy avoids executing the tacking manoeuvre unless absolutely necessary. For a smaller tacking penalty, the policy tries to take the boat through as many high wind locations as possible in order to minimize the time travelled towards the goal; we note that the boat will sail the fastest where the breeze is the strongest, considering that we assume the boat is always pointing at the optimum angle to the wind. We see that when using a reasonable value for penalizing the tacking manoeuvre, the trajectory sailed resembles that shown earlier in Figure 1, which is based on real data. Table 1 shows the average and standard deviation of the time and number of tacks it takes in each case to traverse the distance to the goal, computed for 1000 simulations over the same course but for a higher grid resolution of 25 × 25 meters. We also consider 2 different boats that can sail at different angles closest to the wind, one being 40° and the other 30°. We can easily compute the time to goal since, using our VPP, we have an estimate of the boat's top speed as it passes through each of the grid cells.

Table 1. Timing results over 1000 simulations for the same boat and wind data but using 5 different tack penalty values and 2 different angles of attack

                       α = 40°                          α = 30°
Tack penalty    Mean Time (s)   Mean Tacks       Mean Time (s)   Mean Tacks
0               858 ± 8         51 ± 5           806 ± 12        50 ± 6
0.2             886 ± 10        21 ± 4           831 ± 13        14 ± 4
0.5             933 ± 11        8 ± 3            864 ± 14        5 ± 4
0.7             959 ± 14        6 ± 3            868 ± 15        3 ± 2
1.0             995 ± 18        2 ± 1            878 ± 57        2 ± 1

Fig. 6. Simulation result with landforms showing an advantageous wind formation due to the landform presence, allowing for a more direct route towards the goal at (20, 10). The course is 2000 × 2000 meters with a grid resolution of 100 × 100 meters.

As we would expect, the closer a boat can sail to the wind and the more efficiently it can change tacks, the faster it can traverse the length of the course. However, we can also see from the second and fourth rows of Table 1 that a boat that is not capable of sailing close to the wind but can perform steering manoeuvres efficiently can be as competitive as a boat that can sail much closer to the wind but is inefficient in its steering.

Finally, in Figure 6 we show a trajectory sailed by a boat for an experiment involving landmasses. The size of this course is 2000 × 2000 meters and the resolution of the grid is 100 × 100 meters per grid cell. The yacht can sail at 40° closest to the wind. We want to point out that the landmass warps the wind such that the yacht can sail a more direct route to the goal, as can be clearly seen. Most experienced sailors know how land affects the direction and strength of the prevailing wind and try to take advantage of this during the race; we see that our automated system, with proper modelling of the effect of land on the wind, can do the same, at least at a basic level.

7 Conclusions and Future Work

In this paper, we presented our method for strategic decision making for inshore yacht racing. We started with an introduction to the physics of sailing and the development of a numerical Velocity Prediction Program (VPP). We also presented a way of modelling wind flow around landforms. Finally, we modelled the sequential decision-making problem using a Markov Decision Process, which we solved using value iteration and evaluated in simulation for different yacht parameters.

Our model allowed us to compare boats that can sail at different angles closest to the wind as well as different crew performances in the execution of the most important steering manoeuvre, known as tacking. We have also shown that when we have knowledge of how the wind flows around a landmass, our model can correctly decide on routes that take advantage of the situation.

In future work we would like to develop a more accurate method for velocity prediction, perhaps not based on numerical estimation but derived from real data collected by sailing a yacht in different weather conditions. Our strategic decision making approach would be useful to crews for making plans before the race, but since the weather conditions can vary during the race, it would be desirable to augment our system with the ability to revise plans accordingly using data gathered during the competition. Lastly, we would like to test our approach in a real yacht race and also extend it to offshore sailing, in which case we would have to put much more emphasis on accurately modelling large wind patterns over time.

Acknowledgements

The authors would like to thank the reviewers for their many valuable comments and suggestions. This work is supported by the ARC Centre of Excellence programme funded by the Australian Research Council (ARC) and the New South Wales Government.


References

1. Anderson, B.D.: The physics of sailing. Physics Today 61(2), 38–43 (2008)
2. Boutilier, C., Dean, T., Hanks, S.: Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11, 1–94 (1999)
3. Ferguson, D.S.: Strategic decision making in yacht racing. Master's thesis, The University of Sydney, Sydney, Australia (2010)
4. Fernandez, A., Valls, A., Garcia-Espinosa, J.: Stochastic optimization of IACC yacht performance. In: International Symposium on Yacht Design and Production, pp. 69–78 (2004)
5. Philpott, A., Mason, A.: Optimising yacht routes under uncertainty. In: Proc. of the 15th Chesapeake Sailing Yacht Symposium, Annapolis, MD (2001)
6. Philpott, A.B., Henderson, S.G., Teirney, D.: A simulation model for predicting yacht match race outcomes. Oper. Res. 52, 1–16 (2004)
7. Philpott, A.B., Sullivan, R.M., Jackson, P.S.: Yacht velocity prediction using mathematical programming. European Journal of Operational Research 67(1), 13–24 (1993)
8. Roncin, K., Kobus, J.: Dynamic simulation of two sailing boats in match racing. Sports Engineering 7, 139–152 (2004), doi:10.1007/BF02844052
9. Roux, Y., Huberson, S., Hauville, F., Boin, J., Guilbaud, M., Ba, M.: Yacht performance prediction: Towards a numerical VPP. In: High Performance Yacht Design Conference, Auckland, New Zealand (December 2002)
10. Stelzer, R., Proll, T.: Autonomous sailboat navigation for short course racing. Robot. Auton. Syst. 56, 604–614 (2008)
11. van Oossanen, P.: Predicting the speed of sailing yachts. SNAME Transactions 101, 337–397 (1993)
12. Vesecky, J.F., Drake, J., Laws, K., Ludwig, F.L., Teague, C.C., Paduan, J.D., Sinton, D.: Measurements of eddies in the ocean surface wind field by a mix of single and multiple-frequency HF radars on Monterey Bay, California. In: IEEE Int. Geoscience and Remote Sensing Symposium, pp. 3269–3272 (July 2007)


Exploiting Conversational Features to Detect High-Quality Blog Comments

Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray, and Shafiq Joty

University of British Columbia
{nfitz,carenini,gabrielm,rjoty}@cs.ubc.ca

Abstract. In this work, we present a method for classifying the quality of blog comments using Linear-Chain Conditional Random Fields (CRFs). This approach is found to yield high accuracy on binary classification of high-quality comments, with conversational features contributing strongly to the accuracy. We also present a new corpus of blog data in conversational form, complete with user-generated quality moderation labels, from the science and technology news blog Slashdot.

1 Introduction and Background

As the amount of content available on the Internet continues to increase exponentially, the need for tools which can analyze and summarize large amounts of text has become increasingly pronounced. Traditionally, most work on automatic summarization has focused on extractive methods, where representative sentences are chosen from the input corpus ([5]). In contrast, recent work (e.g., [10], [2]) has taken an abstractive approach, where information is first extracted from the input corpus and then expressed through novel sentences created with Natural Language Generation techniques. This approach, though more difficult, has been shown to produce superior summaries in terms of readability and coherence.

Several recent works have focused on summarization of multi-participant conversations ([9], [10]). [10] describes an abstractive summarization system for face-to-face meeting transcripts. The approach is to use a series of classifiers to identify different types of messages in the transcripts; for example, utterances expressing a decision being made, or a positive opinion being expressed. The summarizer then selects a set of messages which maximize a function encompassing information about the sentences in which messages appear, and passes these messages to the NLG system.

In this paper, we present our work on detecting high-quality comments in blogs using CRFs. In future work, this will be combined with classification on other axes, for instance the message's rhetorical role (i.e., Question, Response, Criticism, etc.), to provide the messages for an abstractive summarization system.

CRFs ([7]) are a discriminative probabilistic model which have gained much popularity in Natural Language Processing and Bioinformatics applications.


One benefit of using linear-chain CRFs over more traditional linear classification algorithms is that the sequence of labels is considered. Several works have shown the effectiveness of CRFs on similar Natural Language Processing tasks which involve sequential dependencies ([1], [4]). [11] uses Linear-Chain CRFs to classify summary sentences to create extractive summaries of news articles, showing their effectiveness on this task. [6] test CRFs against two other classifiers (Support Vector Machines and Naive Bayes) on the task of classifying dialogue acts in live-chat conversations. They also show the usefulness of structural features, which are similar to our conversational features (see Sect. 2.3).

2 Automatic Comment Rating System

2.1 The Slashdot Corpus

We compiled a new corpus comprised of articles and their subsequent user comments from the science and technology news aggregation website Slashdot1. This site was chosen for several reasons. Comments on Slashdot are moderated by users of the site, meaning that each comment has a score from -1 to +5 indicating the total score of the moderations assigned, with each moderator able to modify the score of a given comment by +1 or -1. Furthermore, each moderation assigns a classification to the comment: for good comments, the classes are Interesting, Insightful, Informative and Funny; for bad comments, the classes are Flamebait, Troll, Off-Topic and Redundant. Since the goal of this work was to identify high-quality comments, most of our experiments were conducted with comments grouped into GOOD and BAD.
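For illustration, grouping the Slashdot moderation classes into the two coarse labels used in our experiments amounts to a simple lookup; the function name and the handling of unknown classes below are hypothetical, not part of the released corpus format.

```python
# Hypothetical sketch: collapse Slashdot moderation classes into GOOD/BAD.
GOOD_CLASSES = {"Interesting", "Insightful", "Informative", "Funny"}
BAD_CLASSES = {"Flamebait", "Troll", "Off-Topic", "Redundant"}

def binary_label(moderation_class: str) -> str:
    """Map a fine-grained moderation class to the coarse GOOD/BAD label."""
    if moderation_class in GOOD_CLASSES:
        return "GOOD"
    if moderation_class in BAD_CLASSES:
        return "BAD"
    raise ValueError(f"Unknown moderation class: {moderation_class}")
```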

Slashdot comments are displayed in a threaded, conversation-tree type layout. Users can directly reply to a given comment, and their reply will be placed underneath that comment in a nested structure. This conversational structure allows us to use conversational features in our classification approach (see Sect. 2.3).

Some comments were not successfully crawled, which meant that some comments in the corpus referred to parent comments which had not been collected. To address this, comments whose parents were missing were excluded from the corpus. After this cleanup, the collection totalled 425,853 comments on 4,320 articles.

2.2 Transformation into Sequences

As mentioned above, Slashdot commenters can reply directly to other comments, forming several tree-like conversations for each article. This creates a problem for our use of Linear-Chain CRFs, which require linear sequences.

In order to solve this problem, each conversation tree is transformed into multiple Threads, one for each leaf comment in the tree. A Thread is the sequence of comments from the root comment to the leaf comment.

1 http://slashdot.org


Each Thread is then treated as a separate sequence by the classifier. One consequence of this is that any comment with more than one reply will occur multiple times in the training or testing set. This makes some intuitive sense for training, as comments higher in the conversation tree are likely more important to the conversation as a whole: the earlier a comment appears in the thread, the greater its effect on the course of the conversation down-thread. We describe the process of re-merging these comment threads, and investigate the effect this has on accuracy, in Sect. 3.3.
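To make the decomposition concrete, the following sketch enumerates root-to-leaf Threads from a nested comment tree; the Comment class and its fields are hypothetical stand-ins for the actual corpus data structures.

```python
# Hypothetical sketch of the tree-to-thread decomposition described above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    comment_id: str
    label: str                      # "GOOD" or "BAD" moderation label
    text: str = ""
    replies: List["Comment"] = field(default_factory=list)

def threads(root: Comment) -> List[List[Comment]]:
    """Return one Thread (root-to-leaf path) per leaf comment."""
    if not root.replies:            # leaf comment: the path ends here
        return [[root]]
    result = []
    for child in root.replies:
        for path in threads(child):
            result.append([root] + path)
    return result
```

A comment with several replies therefore appears in several Threads, which is exactly the duplication discussed above.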

2.3 Features

Each comment in a given sequence was represented as a series of features. In addition to simple unigram (bag-of-words) features, we experimented with two other classes of features: lexical similarity and conversational features. These are described below.

Similarity Features. Three features were used which capture the lexical similarity between two comments: TF-IDF, LSA ([5]) and Lexical Cohesion ([3]). For each comment, each of these three scores was calculated for both the preceding and following comment (0 if there was no comment before or after), giving a total of six similarity features. These features were previously shown in [12] to be useful in the task of topic-modelling in email conversations. However, in contrast to [12], where similarity was calculated between sentences, these metrics were adapted to calculate similarity between entire comments.
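As a rough illustration of the TF-IDF component of these similarity features, the sketch below scores each comment against its neighbours in a Thread using scikit-learn; this is one reasonable implementation under our assumptions, not necessarily the exact tooling used in the experiments.

```python
# Hedged sketch: TF-IDF cosine similarity with the preceding/following comment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_neighbour_similarity(comment_texts):
    """Return (prev_sim, next_sim) pairs for each comment in a Thread."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(comment_texts)
    sims = cosine_similarity(matrix)
    features = []
    for i in range(len(comment_texts)):
        prev_sim = sims[i, i - 1] if i > 0 else 0.0
        next_sim = sims[i, i + 1] if i < len(comment_texts) - 1 else 0.0
        features.append((prev_sim, next_sim))
    return features
```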

Conversational Features. The conversational features capture information about how the comment is situated in the conversation as a whole. The list is as follows (a computational sketch is given after the list):

ThreadIndex: the index of the comment in the current thread (starting at 0).
NumReplies: the number of child comments replying to this comment.
WordLength and SentenceLength: the length of this comment in words and sentences, respectively.
AvgReplyWordLength and AvgReplySentLength: the average length of replies to this comment, in words and sentences respectively.
TotalReplyWordLength and TotalReplySentLength: the total length of all replies to this comment, in words and sentences respectively.
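The sketch below computes these conversational features, reusing the hypothetical Comment objects (including their text field) from the earlier sketch; the word and sentence counts are simplified approximations of whatever tokenization was actually used.

```python
# Hedged sketch of the conversational features, reusing the Comment class above.
def conversational_features(thread, index):
    """Compute conversational features for thread[index]."""
    comment = thread[index]
    words = lambda c: len(c.text.split())
    sents = lambda c: max(1, c.text.count(".") + c.text.count("?") + c.text.count("!"))
    replies = comment.replies
    n = len(replies)
    return {
        "ThreadIndex": index,
        "NumReplies": n,
        "WordLength": words(comment),
        "SentenceLength": sents(comment),
        "AvgReplyWordLength": sum(words(r) for r in replies) / n if n else 0.0,
        "AvgReplySentLength": sum(sents(r) for r in replies) / n if n else 0.0,
        "TotalReplyWordLength": sum(words(r) for r in replies),
        "TotalReplySentLength": sum(sents(r) for r in replies),
    }
```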

2.4 Training

The popular Natural Language Machine Learning toolkit MALLET2 was used to train the CRF model. A 1000-article subset of the entire Slashdot corpus was divided 90%-10% between the training and testing set. The training set consisted of 93,841 Threads from 900 articles, while the testing set consisted of 10,053 Threads from 100 articles.

2 http://mallet.cs.umass.edu/index.php
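A common way to hand such sequences to a CRF toolkit is a whitespace-separated text format with one comment per line and a blank line between Threads; the writer below is a hedged sketch of that preparation step under our assumed feature dictionaries, not necessarily MALLET's exact expected layout.

```python
# Hedged sketch: serialize Threads as "feat=value ... LABEL" lines,
# one comment per line, blank line between Threads (sequences).
def write_sequence_file(threads_with_feats, path):
    with open(path, "w", encoding="utf-8") as out:
        for thread in threads_with_feats:          # one CRF sequence per Thread
            for feats, label in thread:            # (feature dict, GOOD/BAD) per comment
                tokens = [f"{name}={value}" for name, value in sorted(feats.items())]
                out.write(" ".join(tokens) + " " + label + "\n")
            out.write("\n")                         # sequence separator
```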


Table 1. (a) Confusion matrix for binary classification of comment threads. (b) Results of feature analysis on the 3 feature classes. (c) Confusion matrix for re-merged comment threads.

(a)

           BAD    GOOD
  BAD     5991    1965
  GOOD    1426    8814

  P: 0.818   R: 0.861   F: 0.839

(b)

  Features          P          R       F
  all good          0.563      1.000   0.720
  uni               0.708      0.699   0.703
  sim               0.802      0.900   0.848
  conv              0.818 (3)  0.855   0.836
  uni sim           0.780      0.847   0.812
  uni conv          0.818 (3)  0.855   0.836
  sim conv          0.818 (3)  0.855   0.836
  uni sim conv      0.818 (3)  0.855   0.836

  3 These results were not identical, though close enough that precision, recall, and F-score were identical to the third decimal place.

(c)

           BAD    GOOD
  BAD     4160     467
  GOOD     862    1090

  P: 0.700   R: 0.558   F: 0.621

3 Experimental Results

3.1 Classification

Experiment 1 was to train the CRF using data where the full set of moderation labels had been grouped into GOOD comments and BAD. The Conditional Random Field classifier was trained on the full set of features presented in Sect. 2.3. The confusion matrix of this experiment is presented in Table 1a. We can see that the CRF performs well on this formulation of the task, with a precision of 0.818 and a recall of 0.861 (F-score of 0.839). This compares very favourably to a baseline of assigning GOOD to all comments, which yields a precision score of 0.563. The CRF result also performs favourably against a non-sequential Support Vector Machine classifier (P = 0.799, R = 0.773), which confirms the existence of sequential dependencies in this problem.

3.2 Feature Analysis

To investigate the relative importance of the three types of features (unigrams, similarity, and conversational), we experimented with training the classifier with different groupings of features. The results of this feature analysis are presented in Table 1b. All three sets of features can provide relatively good results by themselves, but the similarity and conversational features greatly outperform the unigram features. Similarity features have a slight edge in terms of recall and F-score, while the conversational features provide the edge in precision, seeming to dominate the similarity features when both are used. In fact, the results of this analysis seem to show that whenever the conversational features are used, they dominate the effect of the other features, since all sets of features which include conversational features have the same results as using the conversational features alone. This would seem to indicate that the most relevant factors in deciding the quality of a given comment are conversational in nature, including the number of replies it receives and the nature of those replies. This effect could be reinforced by the fact that comments which have previously been moderated as GOOD are more likely to be read by future readers, which will naturally increase the number of comments they receive in reply. However, since the unigram and, more notably, similarity features can still perform quite well without use of the conversational features, our method is not overly dependent on this effect.

3.3 Re-merging Conversation Trees

As described in Sect. 2.2, conversation trees were decomposed into multiple threads in order to cast the problem in the form of sequence labelling. The result of this is that, after classification, each non-leaf comment has been classified multiple times, equal to the number of sub-comments of that comment. These different classifications need not be the same, i.e., a given comment might well have been classified as GOOD in one sequence and BAD in another. We next recombined these sequences, such that there is only one classification per comment. Comments which appeared in multiple sequences, and thus received multiple classifications, were marked GOOD if they were classified as GOOD at least once (GOOD if |{ci ∈ C : ci = good}| ≥ 1, where C is the set of classifications of comment i)4.

4 This was compared to similar metrics such as a majority-vote metric (GOOD if |{ci ∈ C : ci = good}| ≥ |{ci ∈ C : ci = bad}|), and performed the best (though the difference was negligible).
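A minimal sketch of this at-least-one-GOOD merging rule is given below, assuming per-Thread predictions keyed by comment id; the variable names are hypothetical.

```python
# Hedged sketch: merge per-thread predictions into one label per comment.
from collections import defaultdict

def merge_thread_predictions(thread_predictions):
    """thread_predictions: iterable of lists of (comment_id, label) pairs,
    one list per Thread. Returns a dict comment_id -> merged label."""
    votes = defaultdict(list)
    for thread in thread_predictions:
        for comment_id, label in thread:
            votes[comment_id].append(label)
    # A comment is GOOD if it was classified GOOD in at least one Thread.
    return {cid: ("GOOD" if "GOOD" in labels else "BAD")
            for cid, labels in votes.items()}
```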

There are two ways to evaluate the merged classifications. The first way is to reassign the newly-merged classifications back onto the thread sequences. This preserves the proportions of observations in the original experiments, which allows us to determine whether merging has affected the accuracy of classification. Doing so showed that there was no significant effect on the performance of the classifier; precision and recall remained 0.818 and 0.861, respectively.

The other method is to look at the comment-level accuracy. This removes duplicates from the data, and gives the overall accuracy for determining the classification of a given comment. The results of this are given in Table 1c. The precision and recall in this measure are significantly lower than in the thread-based measure, which indicates that the classification of "leaf" comments tended to be less accurate than that of non-leaf comments which subsequently appeared in more than one thread. The precision of 0.700 is still much greater than the baseline of assigning GOOD to all comments, which would yield a precision of 0.297. This indicates that our approach can successfully identify good comments.

4 Conclusion and Future Work

In this work, we have presented an approach to identifying high-quality comments in blog comment conversations. By casting the problem as one of binary classification, and applying sequence tagging by way of a Linear-Chain Conditional Random Field, we were able to achieve high accuracy. Also presented was a new corpus of blog comments, which will be useful for future research.

Future work will focus on refining our ability to classify comments, and incorporating this into an abstractive summarization system. In order to be useful for this task, it would be preferable to have finer-grained classification than just GOOD and BAD. Applying our current method to the full range of Slashdot moderation classes yielded low accuracy5. Future work will attempt to address these issues.

5 A longer version of this paper with the report of this experiment is available from the first author's website.

References

1. Chung, G.: Sentence retrieval for abstracts of randomized controlled trials. BMC Medical Informatics and Decision Making 9, 10 (2009)

2. FitzGerald, N., Carenini, G., Ng, R.: ASSESS: Abstractive Summarization System for Evaluative Statement Summarization (extended abstract). The Pacific Northwest Regional NLP Workshop (NW-NLP), Redmond (2010)

3. Galley, M., McKeown, K., Fosler-Lussier, E., Jing, H.: Discourse segmentation of multi-party conversation. In: 41st Annual Meeting of the Association for Computational Linguistics, Stroudsburg, vol. 1 (2003)

4. Hirohata, K., Okazaki, N., Ananiadou, S., Ishizuka, M.: Identifying Sections in Scientific Abstracts using Conditional Random Fields. In: Third International Joint Conference on Natural Language Processing, Hyderabad, pp. 381–388 (2008)

5. Jurafsky, D., Martin, J.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River (2009)

6. Kim, S., Cavedon, L., Baldwin, T.: Classifying dialogue acts in one-on-one live chats. In: 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge (2010)

7. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. 18th International Conf. on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001)

8. McCallum, A.: MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu

9. Murray, G., Carenini, G.: Summarizing Spoken and Written Conversations. In: 2008 Conference on Empirical Methods in Natural Language Processing, Waikiki (2008)

10. Murray, G., Carenini, G., Ng, R.: Generating Abstracts of Meeting Conversations: A User Study. In: International Conference on Natural Language Generation (2010)

11. Shen, D., Sun, J., Li, H., Yang, Q., Chen, Z.: Document Summarization using Conditional Random Fields. In: International Joint Conference on Artificial Intelligence (2007)

12. Joty, S., Carenini, G., Murray, G., Ng, R.: Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails. In: Conference on Empirical Methods in Natural Language Processing, Cambridge (2010)



Consolidation Using Context-Sensitive Multiple Task Learning

Ben Fowler and Daniel L. Silver

Jodrey School of Computer Science
Acadia University
Wolfville, NS, Canada B4P
[email protected]

Abstract. Machine lifelong learning (ML3) is concerned with machines capable of learning and retaining knowledge over time, and exploiting this knowledge to assist new learning. An ML3 system must accurately retain knowledge of prior tasks while consolidating knowledge of new tasks, overcoming the stability-plasticity problem. A system is presented using a context-sensitive multiple task learning (csMTL) neural network. csMTL uses a single output and additional context inputs for associating examples with tasks. A csMTL-based ML3 system is analyzed empirically using synthetic and real domains. The experiments focus on the effective retention and consolidation of task knowledge using both functional and representational transfer. The results indicate that combining the two methods of transfer serves best to retain prior knowledge, but at the cost of less effective new task consolidation.

1 Introduction

Machine lifelong learning, or ML3, a relatively new area of machine learning research, is concerned with the persistent and cumulative nature of learning [12]. Lifelong learning considers situations in which a learner faces a series of different tasks and develops methods of retaining and using prior knowledge to improve the effectiveness (more accurate hypotheses) and efficiency (shorter training times) of learning. We focus on the learning of concept tasks, where the target value for each example is either zero or one.

An ML3 system requires a method of using prior knowledge to learn models for new tasks as efficiently and effectively as possible, and a method of consolidating new task knowledge after it has been learned. Consolidation is the act of saving knowledge, in one data store, in an integrated form such that it can be indexed efficiently and effectively. The challenge for an ML3 system is consolidating the knowledge of a new task while retaining and possibly improving knowledge of prior tasks. This challenge is generally known in machine learning, cognitive science and psychology as the stability-plasticity problem [5].

This paper addresses the problem of knowledge consolidation and, therefore, the stability-plasticity problem, within a machine lifelong learning system using a modified multiple-task learning (MTL) neural network. The system uses a context-sensitive multiple task learning (csMTL) network as a consolidated domain knowledge (CDK) store. This research extends the work presented in [11], in which a csMTL network is demonstrated as an effective method of using prior knowledge to learn models for new tasks.

The goal is to demonstrate that a csMTL network can retain previous task knowledge when consolidating a sequence of tasks from a domain. This will establish the usefulness of csMTL as part of an ML3 system. This requires showing that an appropriate network structure and set of learning parameters can be found to allow the method to scale with increasing numbers of tasks.

The paper has four remaining sections. Section 2 provides the necessary background information on MTL, csMTL and ML3. Section 3 presents an ML3 system based on a csMTL network and discusses its benefits and limitations. Section 4 provides the results of empirical studies that test the proposed system. Finally, Section 5 concludes and summarizes the paper.

2 Background

2.1 Inductive Transfer

Typically in machine learning, when a new task is learned, we ignore any previously acquired and related knowledge, and instead start learning from a fresh set of examples. Using previously acquired knowledge is called inductive transfer [2]. Humans generally make use of previously learned tasks to help them learn new ones. Similarly, previously acquired knowledge should be used to bias learning, in order to make machine learning more efficient or more effective.

There are two basic forms of inductive transfer: functional transfer and representational transfer [8]. Functional transfer is when information from previously learned tasks is transferred through implicit pressures from training examples. Representational transfer is when information from previously learned tasks is transferred directly through explicit assignment of task representation (such as neural network weights).

2.2 Limitations of MTL for Machine Lifelong Learning

Multiple task learning (MTL) neural networks are one of the better documented methods of inductive transfer of task knowledge [2,8]. An MTL network is a feed-forward multi-layer network with an output for each task that is to be learned. The standard back-propagation of error learning algorithm is used to train all tasks in parallel. Consequently, MTL training examples are composed of a set of input attributes and a target output for each task. The sharing of internal representation is the method by which inductive bias occurs within an MTL network [1]. The more that tasks are related, the more they will share representation and create positive inductive bias.

Let X be a set on ℝ^n (the reals), Y the set {0, 1}, and error a function that measures the difference between the expected target output and the actual output of the network for an example. MTL can be defined as learning a set of target concepts f = {f1, f2, . . . , fk} such that each fi : X → Y with a probability distribution Pi over X × Y. We assume that the environment delivers each fi based on a probability distribution Q over all Pi. Q is meant to capture regularity in the environment that constrains the number of tasks that the learning algorithm will encounter. Q therefore characterizes the domain of tasks to be learned. An example for MTL is of the form (x, f(x)), where x is the same as defined for STL and f(x) = {fi(x)}, a set of target outputs. A training set S_MTL consists of all available examples, S_MTL = {(x, f(x))}. The objective of the MTL algorithm is to find a set of hypotheses h = {h1, h2, . . . , hk} within its hypothesis space H_MTL that minimizes the objective function Σ_{x ∈ S_MTL} Σ_{i=1..k} error[fi(x), hi(x)]. The assumption is that H_MTL contains a sufficiently accurate hi for each fi being learned. Typically |H_MTL| > |H_STL| in order to represent the multiple hypotheses.

Previously, we have investigated the use of MTL networks as a basis for an ML3 system and have found them to have several limitations related to the multiple outputs of the network [9,10]. First, MTL requires that training examples contain corresponding target values for each task; this is impractical for lifelong learning systems, as examples of each task are acquired at different times and with unique combinations of input values. Second, with MTL, shared representation and therefore transfer is limited to the hidden node layers and not the output nodes. Third, there is the practical problem of how an MTL-based ML3 system would know to associate an example with a particular task. Clearly, the learning environment should provide the contextual cues; however, this suggests additional inputs, not outputs. Finally, a lifelong learning system should be capable of practising two or more tasks and improving its models for each as new examples become available. It is unclear how redundant task outputs for the same task could be avoided using an ML3 system based on MTL.

In response to these problems, we developed context-sensitive MTL, or csMTL [11]. csMTL is based on MTL with two major differences: only one output is used for all tasks, and additional inputs are used to indicate the example context, such as the task to which it is associated.

2.3 csMTL

Figure 1 presents the csMTL network. It is a feed-forward network architecture of input, hidden and output nodes that uses the back-propagation of error training algorithm. The csMTL network requires only one output node for learning multiple concept tasks. Similar to standard MTL neural networks, there are one or more layers of hidden nodes that act as feature detectors. The input layer can be divided into two parts: a set of primary input variables for the tasks and a set of inputs that provide the network with the context of each training example. The context inputs can simply be a set of task identifiers that associate each training example with a particular task. Related work on context-sensitive machine learning can be found in [13].

Fig. 1. Proposed system: csMTL

Formally, let C be a set on ℝ^n representing the context of the primary inputs from X as described for MTL. Let c be a particular example of this set, where c is a vector containing the values c1, c2, . . . , ck, and ci = 1 indicates that the example is associated with function fi. csMTL can be defined as learning a target concept f′ : C × X → Y with a probability distribution P′ on C × X × Y, where P′ is constrained by the probability distributions P and Q discussed in the previous section for MTL. An example for csMTL takes the form (c, x, f′(c, x)), where f′(c, x) = fi(x) when ci = 1 and fi(x) is the target output for task fi. A training set S_csMTL consists of all available examples for all tasks, S_csMTL = {(c, x, f′(c, x))}. The objective of the csMTL algorithm is to find a hypothesis h′ within its hypothesis space H_csMTL that minimizes the objective function Σ_{x ∈ S_csMTL} error[f′(c, x), h′(c, x)]. The assumption is that H_csMTL ⊂ {f | f : C × X → Y} contains a sufficiently accurate h′. Typically, |H_csMTL| = |H_MTL| for the same set of tasks, because the number of additional context inputs under csMTL matches the number of additional task outputs under MTL.
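To make the encoding concrete, the sketch below builds a csMTL-style example by concatenating a one-hot task-identifier context vector with the primary inputs and runs it through a single-output feed-forward network; this is a minimal NumPy illustration under assumed layer sizes, not the RASL3 implementation used in the experiments.

```python
# Hedged sketch: csMTL-style input encoding and forward pass (single output).
import numpy as np

def encode_example(x, task_index, num_tasks):
    """Concatenate a one-hot context vector (task id) with the primary inputs x."""
    c = np.zeros(num_tasks)
    c[task_index] = 1.0
    return np.concatenate([c, x])

def forward(example, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer of sigmoid feature detectors and a single sigmoid output."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hidden = sigmoid(example @ w_hidden + b_hidden)
    return sigmoid(hidden @ w_out + b_out)

# Example shapes for a domain with 11 primary inputs, 8 tasks, 30 hidden nodes.
rng = np.random.default_rng(0)
w_hidden = rng.normal(scale=0.1, size=(8 + 11, 30))
b_hidden = np.zeros(30)
w_out = rng.normal(scale=0.1, size=(30, 1))
b_out = np.zeros(1)
y_hat = forward(encode_example(rng.random(11), task_index=2, num_tasks=8),
                w_hidden, b_hidden, w_out, b_out)
```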

With csMTL, the entire representation of the network is used to develop hypotheses for all tasks of the domain. The focus shifts from learning a subset of shared representation for multiple tasks to learning a completely shared representation for the same tasks. This presents a more continuous sense of domain knowledge, and the objective becomes that of learning internal representations that are helpful for predicting the output of similar combinations of the primary and context input values. Once f′ is learned, if x is held constant, c indexes over the hypothesis base H_csMTL. If c is a vector of real-valued inputs from the environment, it provides a grounded sense of task relatedness. If c is a set of task identifiers, it differentiates between otherwise conflicting examples and selects internal representation used by related tasks.

In the following section we propose how csMTL can be used to overcome the limitations of MTL for construction of an ML3 system. The proposed ML3 system is described so as to motivate csMTL and highlight its useful characteristics.

3 Machine Lifelong Learning with csMTL Networks

Figure 2 shows the proposed csMTL ML3 system. It has two components: a temporary short-term learning network and a permanent long-term consolidation csMTL network. The long-term csMTL network is the location in which domain knowledge is retained over the lifetime of the learning system. The weights of this network are updated only after a new task has been trained to an acceptable level of accuracy in the short-term learning network. The short-term network can be considered a temporary extension of the long-term network that adds representation (several hidden nodes and an output node, fully feed-forward connected) that may be needed to learn the new task. At the start of short-term learning, the weights associated with these temporary nodes are initialized to small random weights while the weights of the long-term network are frozen. This allows representational knowledge to be rapidly transferred from related tasks existing in the long-term network without fear of losing prior task accuracies.

Once the new task has been learned, the temporary short-term network is used to consolidate knowledge of the task into the permanent long-term csMTL network. This is accomplished by using a form of functional transfer called task rehearsal [9]. The method uses the short-term network to generate virtual examples for the new task so as to slowly integrate (via back-propagation) the task's knowledge into the long-term network. Additionally, virtual examples for the prior tasks are used during consolidation to maintain the existing knowledge of the long-term network. Note that it is the functional knowledge of the prior tasks that must be retained and not their representation; the internal representation of the long-term network will necessarily change to accommodate the consolidation of the new task. The focus of this paper and the experiments in Section 4 is on the long-term network and the challenge of consolidation. The following discusses the benefits and limitations of the csMTL method as a long-term consolidation network for an ML3 system.
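The consolidation step can be summarized as in the following sketch, which assumes hypothetical predict/train helpers on the short-term and long-term networks; it illustrates the flow of virtual examples during task rehearsal rather than the exact training procedure.

```python
# Hedged sketch of consolidation by task rehearsal (helper names are hypothetical).
import numpy as np

def consolidate(long_term_net, short_term_net, prior_task_ids, new_task_id,
                num_virtual, input_dim, epochs, learning_rate):
    """Integrate a newly learned task into the long-term csMTL network."""
    rng = np.random.default_rng(0)
    examples = []
    # Virtual examples for the new task, labelled by the short-term network.
    for _ in range(num_virtual):
        x = rng.random(input_dim)
        examples.append((new_task_id, x, short_term_net.predict(new_task_id, x)))
    # Virtual examples for each prior task, labelled by the long-term network itself,
    # so that existing functional knowledge is rehearsed during consolidation.
    for task in prior_task_ids:
        for _ in range(num_virtual):
            x = rng.random(input_dim)
            examples.append((task, x, long_term_net.predict(task, x)))
    # Slow back-propagation over the mixed set of virtual examples.
    long_term_net.train(examples, epochs=epochs, learning_rate=learning_rate)
```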

3.1 Long-Term Retention of Learned Knowledge

Knowledge retention in an MTL network is the result of consolidation of new and prior task knowledge using task rehearsal [9]. Task rehearsal overcomes the stability-plasticity problem originally posed by [5], taken to the level of learning sets of tasks as opposed to learning sets of examples [6,4]. A plastic network is one that can accommodate new knowledge. A stable network is one that can accurately retain old knowledge. The secret to maintaining a stable and yet plastic network is to use functional transfer from prior tasks to maintain stable function while allowing the underlying representation to slowly change to accommodate the learning of the new task.

Prior work has shown that consolidation of new task knowledge within an MTL network, without loss of existing task knowledge, is possible given: a sufficient number of training examples, sufficient internal representation for all tasks, slow training using a small learning rate, and a method of early stopping to prevent over-fitting and therefore the growth of high-magnitude weights [10]. The same is expected with a csMTL network.

Fig. 2. Proposed system: csMTL

In the long-term csMTL network there will be an effective and efficient sharing of internal representation between related tasks, without the MTL disadvantage of having redundant outputs for near-identical tasks. Over time, practice sessions for the same task will contribute to the development of a more accurate long-term hypothesis. In fact, the long-term csMTL network can represent a fluid domain of tasks where subtle differences between tasks can be represented by small changes in the context inputs.

The csMTL ML3 approach does have its limitations. It suffers from the scaling problems of similar neural network systems. The computational complexity of the standard back-propagation algorithm is O(W³), where W is the number of weights in the network. Long-term consolidation will be computationally more expensive than standard MTL because the additional contextual inputs will increase the number of weights in the network at the same rate as MTL, and it may be necessary to add an additional layer of hidden nodes for certain task domains. The rehearsal of each of the existing domain knowledge tasks requires the creation and training of m · k virtual examples, where m is the number of virtual training examples per task and k is the number of tasks. An important benefit from consolidation is an increase in the accuracy of related hypotheses existing in the csMTL network as a new task is integrated.

4 Experiments

We empirically investigate the conditions required to fulfil the long-term domain knowledge requirements of an ML3 system using a csMTL-based CDK network. Our analysis will focus on the effectiveness of prior task retention and new task consolidation into a csMTL network. More specifically, the experiments examine (1) the retention of prior task knowledge as the number of virtual examples for each task varies; (2) the benefit of combining functional and representational transfer; and (3) the scalability of the method as up to 15 tasks are sequentially consolidated within a csMTL-based CDK.


4.1 Task Domains and General Conventions

The Logic 1 task domain is synthetic, consisting of eight tasks. It has 11 real-valued inputs in the range [0, 1]. A positive example for a task is determined by a logical conjunction of two disjunctions, each involving two of the real inputs. For example, the first task, C1, is defined as (a > 0.5 ∨ b > 0.5) ∧ (c > 0.5 ∨ d > 0.5). For each new task, the set of inputs shifts one letter to the right in the alphabet. The Logic 2 domain is an extension of the Logic 1 domain, consisting of 15 tasks with 18 real-valued inputs in the range [0, 1]. The Covertype and Dermatology real-world domains were also examined with similar results, but for brevity will not be discussed in this paper [3].
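The following sketch generates labelled examples for tasks of this form; the shifting-window indexing is our reading of the description above and should be treated as illustrative.

```python
# Hedged sketch: synthetic Logic-domain examples; task i uses inputs i..i+3.
import numpy as np

def logic_task_label(x, task_index):
    """(x_i > 0.5 or x_{i+1} > 0.5) and (x_{i+2} > 0.5 or x_{i+3} > 0.5)."""
    a, b, c, d = x[task_index:task_index + 4]
    return int((a > 0.5 or b > 0.5) and (c > 0.5 or d > 0.5))

def generate_examples(num_examples, num_inputs, task_index, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((num_examples, num_inputs))
    y = np.array([logic_task_label(x, task_index) for x in X])
    return X, y

# e.g. Logic 1: 11 inputs, task C1 uses inputs a..d (indices 0..3).
X_train, y_train = generate_examples(100, 11, task_index=0)
```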

All experiments were performed using the RASL3 ML3 system developed at Acadia University and available at ml3.acadaiu.ca. Preliminary experiments investigated the network structure and learning parameters required to learn up to 15 tasks of the Logic domain to a sufficient level of accuracy (greater than 0.75 for all tasks) using real training examples. The following was determined: one layer of hidden nodes is sufficient provided there are at least 17 hidden nodes, so 30 nodes are used in all of the following experiments; the learning rate must be small (less than 0.001); and the momentum term can remain at 0.9 for all experiments to speed up learning when possible.

Multiple runs of experiments are necessary to determine confidence in the results. For each repetition of an experiment, different training and validation sets are used, as well as random initial weights. In all experiments, the primary performance statistic is the accuracy on an independent test set. A hypothesis test of statistical significance between test set accuracies is done for each experiment, specifically the two-tailed Student's t-test. Each sequential learning run (training of up to 15 tasks one after the other) required a lot of time to complete. This limited the number of repetitions to three.

It is important to note that examples for the new task being consolidated are duplicated to match the number of virtual examples for each prior task for a run. This is to ensure that all tasks get an equal opportunity to affect the weight updates in the csMTL network. For brevity, only results for certain tasks and the mean of all tasks (including the new task being consolidated) are provided in this paper. A complete set of results from our research can be found in [3].

4.2 Experiment 1: Impact of Varying Number of Virtual Examples for Prior Tasks

This experiment examines the change in generalization accuracy of previously learned tasks as new tasks are consolidated within a csMTL network. The number of virtual examples for prior tasks varies between runs, providing increasingly more task rehearsal. We are interested in how this variation in virtual examples affects the accuracy of the retained models.

Methodology: The learning rate is set to 0.001 for all configurations except the 1000-virtual-examples-per-task configuration, where it is lowered to 0.0005 to compensate for the much larger training set. 100 new task examples are used for each training set (and duplicated as needed to match the number of virtual examples for each prior task), and 100 new task examples are used for each validation set. A combination of functional and representational transfer is used. Three repetitions are made for each configuration, which consists of learning each of the seven tasks in sequence, using six different numbers of virtual examples.

Results and Discussion: Figure 3 shows the accuracy for the second task, C2, over six consolidation steps under varying amounts of virtual examples. The labels follow a format of <task>-<number of virtual examples used per task> and then a letter code. The code indicates the type of transfer, either 'F' for functional, or 'FR' for functional plus representational. The figure shows that the consolidated network developed with 300 or more virtual examples retains the model for the C2 task better than when only 100 virtual examples are used. A difference-of-means two-tailed t-test between the 1000 and 100 virtual-example models in the last consolidation step confirms this at the 99.9% confidence level. However, this result is not consistent for all tasks.

More generally, the results indicate that increasing the number of virtual examples for prior tasks slows the loss of knowledge, but not enough to stop it for all tasks consistently. Some tasks, such as C2, showed signs of plateauing or even improving as more tasks are consolidated when there are sufficient virtual examples. However, later tasks often begin at a smaller base accuracy, causing the mean task accuracy to decline as more tasks are consolidated. This is the problem of stability-plasticity. With the current network configuration, using representational transfer and more virtual examples results in a more stable network for prior tasks, but a less plastic one for consolidating new tasks.

4.3 Experiment 2: Impact of Transfer Type

Consolidation using functional transfer and consolidation using representational transfer both have their merits. Without functional transfer, it is difficult to maintain prior task generalization accuracies while new task knowledge is being consolidated. Without representational transfer, the models of prior tasks must essentially be rebuilt, and in rebuilding the models, prior knowledge not being transferred from functional examples may be lost. The transfer type is either functional or functional plus representational. We observe the change in generalization accuracy of previously learned tasks, as well as the change in generalization accuracy of new tasks, as a function of the method of transfer and the validation set examples.

Methodology: The learning rate is set to 0.001 for all configurations. 100 unique new task examples (duplicated to 300) and 300 virtual examples for each prior task are included in each training set. The validation set consists of 100 examples for each included task. Three repetitions are made for each configuration, which consists of learning each of the seven tasks in sequence, using two different types of transfer.


Results and Discussion: The results for the second task, C2, and the average for all tasks are shown in Figure 4. The graphs demonstrate the effects of using different types of transfer as sequential consolidation occurs. A hypothesis test for task C2 confirms that the functional plus representational transfer approach is superior to the model using only functional transfer, with 97.6% confidence. However, the mean task test set accuracy for the last consolidation step does not differ significantly between the methods.

The new task model accuracies for functional transfer are greater than or equal to those of the models developed using functional plus representational transfer for all tasks except C4. Once again, the results indicate that combining representational and functional transfer provides more effective retention; however, new tasks have an increasingly more difficult time developing accurate consolidated models. This is what causes the mean task accuracy to decline over the sequence. Conversely, the functional transfer method demonstrates better new task consolidation but poor prior task retention. The gains and losses balance out the mean task accuracy as the graph approaches the last consolidation step.

4.4 Experiment 3: Scalability

This experiment examines the performance of the csMTL-based CDK as a large number of Logic 2 domain tasks are learned in sequence. The objective is to test, under optimal conditions, the scalability of the system to retain accurate prior knowledge while consolidating up to 15 tasks, one after the other.

Methodology: The combined functional and representational transfer method is used and compared to single task learning (STL) for each of the tasks. A network of one hidden layer and 30 hidden nodes with a learning rate of 0.0001 is used for all runs. The training sets consist of 300 real examples for each task; no virtual examples are used in this experiment, so as to ensure the accurate rehearsal of prior tasks and accurate STL models. The validation set consists of 100 examples for the newest task. Three repetitions are made for each configuration, which consists of learning each of the 15 tasks in sequence.

Results and Discussion: Results for the second task, C2, shown in Figure 5, demonstrate that the method performs quite well in terms of retention over the 15 tasks, with no significant difference between the STL and the retained models. A two-tailed t-test shows no significant difference in mean accuracy (p-value = 0.77) on the final task. However, the graph does show a small but steady decline in prior task knowledge as more tasks are consolidated, even when using real examples for rehearsing prior tasks. Other tasks behave similarly.

As seen in the prior experiments, the accuracy of the new consolidated tasks eventually starts to decline. The new task accuracy remains high and steady until task C7, where there is a significant drop below that of its STL counterpart. The base accuracy for new tasks then continues to fall as more tasks are consolidated into the csMTL network.


Fig. 3. Graph of the C2 test set accuracy over which tasks are consolidated, for varying amounts of virtual examples

Fig. 4. Graph of the C2 and mean test set accuracy for different types of transfer

Fig. 5. Graph of the C2 and mean test set accuracy as many tasks are developed


5 Conclusion

This paper has presented a machine lifelong learning system using a context-sensitive multiple task learning-based consolidated domain knowledge network. The long-term motivation in the design of this system is to create a machine that can effectively retain knowledge of previously learned tasks and use this knowledge to more effectively and efficiently learn new tasks. Short-term real-world applications include medical classification problems.

An MTL-based consolidated domain knowledge network had been explored in previous work, but was found to have limitations preventing its fulfilment of the requirements of an ML3 system. To address these limitations, we have proposed a new ML3 system based on a csMTL-based CDK network that is capable of sequential knowledge retention and inductive transfer. A csMTL-based system can focus on learning shared representation more than an MTL-based system, because all weight values in the csMTL network are shared between all tasks. The system is meant to satisfy a number of ML3 requirements, including the effective consolidation of task knowledge into a long-term network using task rehearsal, the accumulation of task knowledge from practice sessions, and effective and efficient inductive transfer during new learning.

The experiments have demonstrated that consolidation of new task knowledge within a csMTL network without loss of prior task knowledge is possible, but not consistent for all tasks of the test domain. The approach requires a sufficient number of training examples for the new task and an abundance of virtual training examples for rehearsal of the prior tasks. Also required are sufficient internal representation to support all tasks, slow training using a small learning rate, and a method of early stopping to prevent over-fitting and therefore the growth of high-magnitude weights. Our empirical findings indicate that representational transfer of prior knowledge, in addition to functional transfer through task rehearsal, improves retention of prior task knowledge, but at the cost of less accurate models for newly consolidated tasks. We conclude that the stability-plasticity problem is not resolved by our current csMTL-based ML3 system.

The ultimate goal of this avenue of research is the development of a true machine lifelong learning system, a learning system capable of integrating new tasks and retaining old knowledge effectively and efficiently. Possible future research directions include:

1. Explore the decay of model accuracy for new task consolidation by examining greater numbers of hidden nodes, or the injection of random noise to mitigate the build-up of high-magnitude network weights. Recent work on transfer learning using deep hierarchies of features suggests that multiple layers of hidden nodes are worth exploring [7].

2. Exploit task domain properties to guide the generation of virtual examples. Results on a real-world task domain indicated that successful learning required exploitation of meta-data to construct virtual examples, to more closely match the input distribution of the real data [3].


Acknowledgments. This research has been funded in part by the Government of Canada through NSERC.

References

1. Baxter, J.: Learning model bias. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neural Information Processing Systems, vol. 8, pp. 169–175. The MIT Press, Cambridge (1996)

2. Caruana, R.A.: Multitask learning. Machine Learning 28, 41–75 (1997)

3. Fowler, B.: Context-Sensitive Multiple Task Learning with Consolidated Domain Knowledge. Master's Thesis, Jodrey School of Computer Science, Acadia University (2011)

4. French, R.M.: Pseudo-recurrent connectionist networks: An approach to the sensitivity-stability dilemma. Connection Science 9(4), 353–379 (1997)

5. Grossberg, S.: Competitive learning: From interactive activation to adaptive resonance. Cognitive Science 11, 23–64 (1987)

6. Robins, A.V.: Catastrophic forgetting, rehearsal, and pseudorehearsal. Connection Science 7, 123–146 (1995)

7. Salakhutdinov, R., Adams, R., Tenenbaum, J., Ghahramani, Z., Griffiths, T.: Workshop: Transfer Learning via Rich Generative Models. Neural Information Processing Systems (NIPS) (2010), http://www.mit.edu/~rsalakhu/workshop_nips2010/index.html

8. Silver, D.L., Mercer, R.E.: The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Learning to Learn, 213–233 (1997)

9. Silver, D.L., Mercer, R.E.: The task rehearsal method of life-long learning: Overcoming impoverished data. In: Advances in Artificial Intelligence, 15th Conference of the Canadian Society for Computational Studies of Intelligence (AI 2002), pp. 90–101 (2002)

10. Silver, D.L., Poirier, R.: Sequential consolidation of learned task knowledge. In: 17th Conference of the Canadian Society for Computational Studies of Intelligence (AI 2004). LNAI, pp. 217–232 (2004)

11. Silver, D.L., Poirier, R., Currie, D.: Inductive transfer with context-sensitive neural networks. Machine Learning 73(3), 313–336 (2008)

12. Thrun, S.: Is learning the nth thing any easier than learning the first? In: Advances in Neural Information Processing Systems 8, vol. 8, pp. 640–646 (1996)

13. Turney, P.D.: The identification of context-sensitive features: A formal definition of context for concept learning. In: 13th International Conference on Machine Learning (ICML 1996), Workshop on Learning in Context-Sensitive Domains, Bari, Italy, vol. NRC 39222, pp. 53–59 (1996)


Extracting Relations between Diseases, Treatments, and Tests from Clinical Data

Oana Frunza and Diana Inkpen

School of Information Technology and Engineering
University of Ottawa, Ottawa, ON, Canada, K1N 6N5

{ofrunza,diana}@site.uottawa.ca

Abstract. This paper describes research methodologies and experimental settings for the task of relation identification and classification between pairs of medical entities, using clinical data. The models that we use represent a combination of lexical and syntactic features, medical semantic information, terms extracted from a vector-space model created using a random projection algorithm, and additional contextual information extracted at the sentence level. The best results are obtained using an SVM classification algorithm with a combination of the above-mentioned features, plus a set of additional features that capture the distributional semantic correlation between the concepts and each relation of interest.

Keywords: clinical data-mining, relation classification.

1 Introduction

Identifying semantic relations between medical entities can help in the development of medical ontologies, in question-answering systems on medical problems, in the creation of clinical trials (based on patient data, new trials for already known treatments can be created to test their therapeutic potential on other diseases), and in identifying better treatments for a particular medical case by looking at other cases that followed a similar clinical path. Moreover, identifying relations between medical entities in clinical data can help in stratifying patients by disease susceptibility and response to therapy, reducing the size, duration, and cost of clinical trials, and leading to the development of new treatments, diagnostics, and prevention therapies.

While some research has been done on technical data, i.e., text extracted from published medical articles, little work has been done on clinical data, mostly because of a lack of resources. The data set that we used is the data released in the fourth i2b2-10 shared-task challenge in natural language processing for clinical data1, specifically the relation identification track in which we participated.

1 https://www.i2b2.org/NLP/Relations/

2 Related Work

The relation classification task represents a major focus for the computational linguistics research community. The domains on which this task has been deployed vary widely, but the major approaches used to identify the semantic relation between two entities are the following: rule-based methods and templates to match linguistic patterns, co-occurrence analysis, and statistical or machine-learning-based approaches.

Due to space limitations and the fact that our research is focused on the bioscience domain, we describe only relevant previous work done in this domain using statistical methods.

Machine learning (ML) methods are the ones that are most used in the community. They do not require human effort to build rules. The rules are automatically extracted by the learning algorithm when using statistical approaches to solve various tasks [1], [2]. Other researchers combined the bag-of-words features extracted from sentences with other sources of information like part-of-speech information [3]. [4] used two sources of information, sentences in which the relation appears and the local context of the entities, and showed that simple representation techniques bring good results.

In our previous work presented in [5], we showed that domain-specific knowledge improves the results. Probabilistic models are stable and reliable for tasks performed on short texts in the medical domain. The representation techniques influence the results of the ML algorithms, but more informative representations are the ones that consistently obtain the best results.

In the i2b2 shared-task competition [6], the system that performed the best obtained a micro-averaged F-measure value of 73.65%. The mean of the F-measure scores of all the teams that participated in the competition was 59.58%.

3 Data Set

The data set, annotated with the relations existing between two concepts in a sentence (if any), focused on 8 possible relations. These relations can exist only between medical problems and treatments, medical problems and tests, and medical problems and other medical problems.

These annotations are made at the sentence level. Sentences that contain these concepts, but without any relation between them, were not annotated. The training data set consisted of 349 records, divided by their type and provenance, while the test set consisted of 477 records. Table 1 presents the class distribution for the relation annotations in the training and the test data. Besides the annotated data, 827 unannotated records were also released.

In order to create training data for the Negative class, a class in which a pair of concepts is not annotated with any relation, we considered sentences that had only one pair of concepts in no relation. This choice yielded a data set of 1,823 sentences. In the test data set, 50,336 pairs of concepts were not annotated with a relation. These pairs represent the Negative-class test set. In the entire training data, 6,381 sentences contained more than two concepts. In the test data this number rose to 10,437.


Table 1. The number of sentences of each relation in the training and test data sets

Relation                                                            Training    Test

PIP (medical problem indicates medical problem)                        1,239   1,989
TeCP (test conducted to investigate medical problem)                     303     588
TeRP (test reveals medical problem)                                    1,734   3,033
TrAP (treatment is administered for medical problem)                   1,423   2,487
TrCP (treatment causes medical problem)                                  296     444
TrIP (treatment improves medical problem)                                107     198
TrNAP (treatment is not administered because of medical problem)         106     191
TrWP (treatment worsens medical problem)                                  56     143

4 Method Description

Our method uses a supervised machine learning setting with various types of feature representation techniques.

4.1 Data Representation

The features that we extracted for representing the pair of entities and the sentence context use lexical information, information about the type of concept of each medical entity, and additional contextual information about the pair of medical concepts.

Bag-of-words (BOW). The BOW feature representation uses single-token features with a frequency-based representation.

ConceptType. The second type of features represents semantic information about the type of medical concept of each entity: problem, treatment, and test.

ConText. The third type of feature represents information extracted with the ConText tool [7]. The system is capable of providing three types of contextual information for a medical condition: Negation, Temporality, and Experiencer.

Verb phrases. In order to identify verb phrases, we used the Genia tagger2 tool. The verb phrases identified by the tagger are considered as features. We removed the following punctuation marks: [ . , ' ( ) # $ % & + * / = < > [ ] - ], and considered as valid features only the lemma-based forms of the identified verb phrases.

Concepts. In order to make use of the fact that we know which token or sequence of tokens represents the medical concept, we extracted from all the training data a list of all the annotated concepts and considered this list as the possible nominal values for the Concept feature.

2 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/


Semantic vectors. Semantic vector models are models in which concepts are represented by vectors in some high-dimensional space. Similarity between concepts is computed using the analogy of similarity or distance between points in this vector space. The main idea behind semantic vector models is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space.

In order to create these semantic vectors and use them in our experiments, we used the Semantic Vectors Package3 [8]. The package uses indexes created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene4.

We used the semantic vectors to extract the top 300 terms correlated with each relation and to determine the semantic distribution of a pair of concepts in the training corpus across all 9 relations.
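The Semantic Vectors Package is Java-based and built on Lucene; purely as an illustration of the underlying idea (random projection of a term-document matrix plus cosine similarity), not of that package's API, a minimal sketch in Python with scikit-learn might look as follows. The function name, the min_df cutoff, and the projection dimension are assumptions; only the choice of 300 top terms per relation comes from the description above.

```python
# Sketch only: approximates the semantic-vector idea with scikit-learn
# (random projection of a term-document matrix + cosine similarity),
# not the Java Semantic Vectors Package used in the paper.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.metrics.pairwise import cosine_similarity

def top_terms_for_relation(all_sentences, relation_sentences, k=300, dim=200):
    """Return the k terms whose vectors lie closest to the relation centroid."""
    vec = CountVectorizer(min_df=2)
    X = vec.fit_transform(all_sentences)                    # documents x terms
    proj = SparseRandomProjection(n_components=dim, dense_output=True, random_state=0)
    term_vecs = proj.fit_transform(X.T)                     # one dense vector per term
    terms = np.array(vec.get_feature_names_out())

    # Frequency-weighted centroid of the terms occurring in this relation's sentences.
    counts = vec.transform(relation_sentences).sum(axis=0).A1
    present = counts > 0
    centroid = (counts[present] @ term_vecs[present]) / counts[present].sum()

    sims = cosine_similarity(term_vecs, centroid.reshape(1, -1)).ravel()
    return terms[np.argsort(sims)[::-1][:k]]
```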

5 Classification Technique

As the classification algorithm, we used the SVM implementation with a polynomial kernel from the Weka5 tool.

To solve the task, we use a 9-class classification model (the 8 relations of interest plus the Negative class), as well as a model that uses a voting ensemble of 8 binary classifiers. The ensemble consists of 8 binary classifiers, each focused on one of the relations and the Negative class. When we use the voting ensemble, we identify the negative test instances as the data points that are classified as Negative by all 8 binary classifiers. Once these negative instances are eliminated, we deploy an 8-class classifier to identify the relations that exist between the remaining instances.
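A minimal sketch of this two-stage scheme is given below, with scikit-learn's SVC standing in for WEKA's SMO with a polynomial kernel (an assumption, as are the kernel settings, variable names, and the use of dense feature matrices). Each binary classifier separates one relation from the Negative class; an instance rejected by all eight is labeled Negative, and the rest are passed to the 8-class model.

```python
# Sketch of the voting-ensemble filtering step; names and data are placeholders.
import numpy as np
from sklearn.svm import SVC

RELATIONS = ["PIP", "TeCP", "TeRP", "TrAP", "TrCP", "TrIP", "TrNAP", "TrWP"]

def train_ensemble(X, y):
    """One binary classifier per relation (relation vs. Negative) plus an 8-class model."""
    binary = {}
    for rel in RELATIONS:
        mask = np.isin(y, [rel, "Negative"])
        clf = SVC(kernel="poly")
        clf.fit(X[mask], (y[mask] == rel).astype(int))
        binary[rel] = clf
    multi = SVC(kernel="poly")
    multi.fit(X[y != "Negative"], y[y != "Negative"])
    return binary, multi

def predict(binary, multi, X_test):
    votes = np.column_stack([binary[r].predict(X_test) for r in RELATIONS])
    is_negative = votes.sum(axis=1) == 0          # rejected by all 8 binary classifiers
    labels = np.full(X_test.shape[0], "Negative", dtype=object)
    if (~is_negative).any():
        labels[~is_negative] = multi.predict(X_test[~is_negative])
    return labels
```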

6 Results

In this section, we present the results obtained in the competition and the post-competition experimental results. The evaluation metric is micro-averaged F-measure.
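Micro-averaged F-measure pools the true-positive, false-positive, and false-negative counts over all relation classes before computing precision and recall. As a quick illustration with toy labels (using scikit-learn, which is an assumption on my part, not the evaluation script of the shared task):

```python
from sklearn.metrics import f1_score

# Toy example: gold and predicted relation labels for five concept pairs.
gold      = ["TrAP", "TeRP", "PIP", "TrCP", "TeRP"]
predicted = ["TrAP", "TeRP", "PIP", "TeRP", "TeRP"]

# Micro-averaging pools TP/FP/FN counts over all classes before computing F.
print(f1_score(gold, predicted, average="micro"))  # 0.8 for this toy example
```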

Table 2 presents our results on the test data, both the competition results and the post-competition ones. More details on the competition experiments can be found in [9].

The post-competition experiments were mostly focused on capturing the semantic correlation between the terms of the pair of concepts and the instances that are contained in each relation. We also tried to capture the verb-phrase overlap between the training and test instances, because these relations revolve around the verbs that are attached to the concept pair.

3 http://code.google.com/p/semanticvectors/
4 http://lucene.apache.org/java/docs/index.html
5 http://www.cs.waikato.ac.nz/ml/weka/


Table 2. F-measure results in the competition

Competition
BOW + Concept + ConceptType + ConText             40.88%
BOW + ConceptType                                 40.98%
BinaryClassifiers                                 39.34%

Post-competition
SemVect 300                                       40.49%
SemVect + VPs + ConceptType                       44.44%
BOW + SemVect + VPs + ConceptType                 47.05%
BOW + SemVect + VPs + ConceptType + DistSem       47.53%
BOW(context) + ConceptType + VPs + DistSem + VBs  86.15%

As Table 2 shows, the post-competition results improved on the competition results, and the best representation technique is the one that uses a combination of BOW, semantic vector information, the type of the concepts, and verb phrases.

7 Discussion and Conclusions

The results obtained in the competition showed that a richer representation better identifies the existing relations. The ensemble of classifiers showed more balance between all the measures. Since the ensemble of classifiers showed promising results in weeding out the negative examples, we ran more experiments using only the 8 relations of interest. With this setting, we obtained the best result of 86.15%. In this experiment, we used additional nominal features for each relation containing verbs that are synonyms of the verbs that describe each relation. The value of these features is the number of verbs overlapping with the context of each pair, where the context consists of all the words between the pair. The features that we used are presented in Table 2.

We believe that the results can be further improved by using classifiers that are trained on the relations that exist between a certain type of concepts, e.g., one classifier trained only on the relations that exist between medical problems and treatments, etc. Our post-competition results exceed the mean results in the competition.

As future work, we plan to focus more on adding features that are specific for each concept, reduce the context from the sentence level to shorter contexts, look into more verb information, and better understand and incorporate additional information for each relation.

References

1. Donaldson, I., Martin, J., de Bruijn, B., Wolting, C., Lay, V., Tuekam, B.: PreBIND and Textomy: Mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4(11), 11–24 (2003)

2. Mitsumori, T., Murata, M., Fukuda, Y., Doi, K., Doi, H.: Extracting protein-protein interaction information from biomedical text with SVM. IEICE Transactions on Information and Systems 89(8), 2464–2466 (2006)


3. Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pp. 724–731 (2005)

4. Giuliano, C., Lavelli, A., Romano, L.: Exploiting shallow linguistic information for relation extraction from biomedical literature. In: 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 401–409 (2006)

5. Frunza, O., Inkpen, D., Tran, T.: A machine learning approach for identifying disease-treatment relations in short texts. IEEE Transactions on Knowledge and Data Engineering (2010) (in press)

6. Roberts, K., Rink, B., Harabagiu, S.: Extraction of medical concepts, assertions, and relations from discharge summaries for the fourth i2b2/VA shared task (2010)

7. Chapman, W., Chu, D., Dowling, J.N.: ConText: an algorithm for identifying contextual features from clinical text. In: ACL 2007 Workshop on Biological, Translational, and Clinical Language Processing (BioNLP 2007), pp. 81–88 (2007)

8. Widdows, D., Ferraro, K.: Semantic Vectors: a scalable open source package and online technology management application. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), Marrakech (2008), http://www.lrec-conf.org/proceedings/lrec2008/

9. Frunza, O., Inkpen, D.: Identifying and classifying semantic relations between medical concepts in clinical data. i2b2 Challenge (2010)


Compact Features for Sentiment Analysis

Lisa Gaudette and Nathalie Japkowicz

School of Information Technology & Engineering
University of Ottawa
Ottawa, Ontario, Canada
{lgaud082,njapkow}@uottawa.ca

Abstract. This work examines a novel method of developing features to use for machine learning of sentiment analysis and related tasks. This task is frequently approached using a “Bag of Words” representation – one feature for each word encountered in the training data – which can easily involve thousands of features. This paper describes a set of compact features developed by learning scores for words, dividing the range of possible scores into a number of bins, and then generating features based on the distribution of scored words in the document over the bins. This allows for effective learning of sentiment and related tasks with 25 features; in fact, performance was very often slightly better with these features than with a simple bag of words baseline. This vast reduction in the number of features reduces training time considerably on large datasets, and allows for using much larger datasets than previously attempted with bag of words approaches, improving performance.

1 Introduction

Sentiment analysis is the problem of learning opinions from text. On the surface, sentiment analysis appears similar to text categorization by topic, but it is a harder problem for many reasons, as discussed in Pang & Lee’s 2008 survey [11]. First and foremost, with text categorization, it is usually much easier to extract relevant key words, while sentiment can be expressed in many ways without using any words that individually convey sentiment. In topic classification there are undoubtedly red herrings, such as the use of analogies and metaphor, but if a word associated with a given domain is mentioned frequently, it is usually related (although not necessarily the most relevant). However, in sentiment analysis there are many examples of “thwarted expectations” (e.g. “I was expecting this movie to be great, but it was terrible”) and comparison to an entity with opposing sentiment (e.g. “I loved the first movie, but this sequel is terrible”) such that a positive review can easily have many negative words and vice versa [11]. In addition, words that convey positive sentiment in one domain may be irrelevant or negative in another domain, such as the word unpredictable, which is generally positive when referring to the plot of a book or a movie but negative when referring to an electronic device.

The Bag of Words (BOW) representation is commonly used for machine learning approaches to text classification problems. This representation involves creating a feature vector consisting of every word seen (perhaps some minimum number of times) in the training data and learning based on the words that are present in each document, sentence, or other unit of interest. However, this approach leads to a very large, sparse feature space, as words are distributed such that there is a small set of very frequent, not very informative words, and a great deal of individually rarer words that carry most of the information in a sentence. Thousands of features are required to adequately represent the documents for most text classification tasks.

Another approach is to learn scores for words, and then use these scores to classify documents based on the sum or average of the scores in a document and some threshold. While this type of approach is useful, bag of words based approaches generally perform better, although the two approaches can be combined through meta-classifiers or other methods to improve on either approach individually.

This research proposes a novel method of combining machine learning with word scoring by condensing the sparse features of BOW into a very compact “numeric” representation based on the distribution of the word scores in the document. This approach allows for combining word scores with machine learning in a way that provides much more information than a simple global word score for a document, with substantially fewer features than a BOW approach. Other combination approaches add complexity to the BOW idea, by adding an extra meta-classifier or making the features even more complicated; ours instead uses the results of a word scoring method to make the machine learning step simpler by reducing a feature vector in the thousands to one of about twenty-five.

2 Related Work

This paper combines ideas from two different basic approaches to sentiment analysis. The first is to use a bag of words feature set and train a machine learning classifier such as a Support Vector Machine (SVM) using the BOW features, such as in [10]. The second approach is to learn scores for words and score documents based on those scores, such as in [3]. Some previous attempts to combine these two approaches include combining results from both systems with a meta classifier, such as in [8], [6], and [1], and weighting bag of words features by a score, such as the use of TF/IDF in [7]. While there are many techniques to improve on the basic idea of BOW through refining the features or combining it with other approaches, it remains a good basic approach to the problem.

3 Approach to Generating Compact Features

The approach used here involves 3 steps. The first step is to calculate scores for the words, while the second is to represent the documents in terms of the distribution of those word scores. Finally, we run a machine learning algorithm on the features representing the distribution of the word scores. We refer to our features as “Numeric” features.


3.1 Learning Word Scores

The first step of this approach involves learning word scores from the text. We initially considered three different supervised methods of scoring words, and found that Precision performed best.

This method was inspired by its use in [12] for extracting potential subjective words. It represents the proportion of occurrences of the word which were in positive documents, but does not account for differences in the number of words in the sets of positive and negative documents. This produces a value between 0 and 1.

precision = wP / (wP + wN)     (1)

where wP and wN are the number of occurrences of word w in positive and negative documents, respectively.

In order to calculate the precision, we first go through the training data and count the number of positive and negative instances of each word. We then compute the scores for each word. As a word which appears very few times could easily appear strongly in one category by chance, we chose to only use words appearing at least 5 times. This produces a list of word scores customized to the domain of interest.
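A minimal sketch of this scoring step is shown below, assuming documents arrive as pre-tokenized lists of words with a binary document label; the 5-occurrence cutoff follows the description above, while the function and variable names are placeholders.

```python
from collections import Counter

def precision_scores(documents, labels, min_count=5):
    """Score each word by the fraction of its occurrences in positive documents (Eq. 1)."""
    pos, total = Counter(), Counter()
    for tokens, label in zip(documents, labels):
        for w in tokens:
            total[w] += 1
            if label == "positive":
                pos[w] += 1
    # Keep only words seen at least min_count times, as in the paper.
    return {w: pos[w] / n for w, n in total.items() if n >= min_count}
```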

3.2 Generating Features from Scored Words

In order to generate features from these scored words, we first divide the range of possible scores into a number of “bins”, each representing a range of word scores. We then go through each document, look up the score for each word, and increment the count of its corresponding bin. After we have counted the number of words in each bin, we normalize the counts by the number of scored words in the document, such that each bin represents the percentage of the words in the document in its range of scores.

Example of Generating Features. This section shows an example of scoring a document after we have scored the words, assuming 10 bins and precision word scores ranging from 0 to 1. Figure 1a shows the preprocessed text of the review with word scores, while Figure 1b shows the results of counting the number of words in each bin, and then normalizing those counts based on the number of scored words in the document to generate the features we use for machine learning.

After going through this process for a set of documents, we have a set of numeric features based on the distribution of the word scores that is much more compact than the bag of words representation and can be used as input to a machine learning algorithm.
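Continuing the sketch, the binning step could be written as follows, reusing the hypothetical precision_scores dictionary from the previous snippet and 10 bins as in the example of Figure 1 (the experiments reported below use 25 bins).

```python
def numeric_features(tokens, word_scores, n_bins=10):
    """Histogram of word scores over [0, 1], normalized by the number of scored words."""
    counts = [0] * n_bins
    for w in tokens:
        if w in word_scores:
            # Scores of exactly 1.0 fall into the top bin.
            b = min(int(word_scores[w] * n_bins), n_bins - 1)
            counts[b] += 1
    scored = sum(counts)
    return [c / scored for c in counts] if scored else counts

# Example usage (placeholders):
# scores = precision_scores(train_docs, train_labels)
# features = numeric_features(review_tokens, scores)
```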

4 Selecting Parameters

There are three main options in this approach – the method for scoring the words, the number of bins to use, and the machine learning algorithm to use.


(a) Review Preprocessed and Annotated with Word Scores:

i (0.460)  have (0.503)  always (0.545)  been (0.555)  very (0.576)
pleased (0.850)  with (0.526)  the (0.497)  sandisk (0.449)  products (0.351)
i (0.460)  would (0.403)  highly (0.898)  recommend (0.568)  them (0.465)

(b) Numeric Features Generated from Review:

Range      Count  Feature
0.00-0.10    0    0.000
0.10-0.20    0    0.000
0.20-0.30    0    0.000
0.30-0.40    1    0.067
0.40-0.50    6    0.400
0.50-0.60    6    0.400
0.60-0.70    0    0.000
0.70-0.80    0    0.000
0.80-0.90    2    0.133
0.90-1.00    0    0.000

Fig. 1. Generating Features From a Sample Review

We used two basic datasets to select these options – the reviews of Steve Rhodes from [10] and the 2000-review, balanced, Electronics dataset from [2]. We used these datasets in terms of both ordinal and binary problems, for a total of 4 distinct problems. We examined the effect of three different scoring methods, varying the number of bins, and using a variety of classifiers as implemented in the WEKA machine learning system [13].

While we do not have space to present all of the details here, we found that the precision scoring method performed best on most datasets by a small margin, but that all scoring methods were close. The number of bins only affected performance by a very small amount given enough bins – some datasets performed very well with as few as 10 bins, while others needed 25, and using more bins had no consistent effect on performance beyond that point. The SMO algorithm, WEKA’s implementation of Support Vector Machines (SVM), performed well, while we also found that BayesNet performed nearly as well and was much faster, particularly in the ordinal case.

The experiments using the word score based features all use precision scoring with 25 bins. In the binary case, they use SMO, with default settings except for the option of fitting logistic models to the outputs for binary problems. For ordinal problems, the BayesNet classifier is used instead. The BOW baseline classifiers are all constructed using SMO with default settings, as SVM has been shown to perform well in previous work. BayesNet did not perform well using the BOW representation.

5 Experiments

In order to evaluate the feasibility of this method, we test it across a range of datasets. In all cases, we compare the results to an SVM classifier using BOW features. We chose to compare against BOW because it is a widely used basic method and we view our approach as effectively compressing the features given to a BOW algorithm with a pre-processing step. Where available, we also provide results obtained by the authors who introduced the datasets in order to provide some comparison to a wider range of methods. We evaluate both classifiers using only unigrams that appear at least 5 times in the training data.

Many authors working in this domain have simply reported accuracy as an evaluation metric, which has problems even in the binary case. For the ordinal problem, we will use Mean Squared Error (MSE), as it was shown in [4] to be a good measure for ordinal problems of this type, while for the binary problem we include AUC. We include accuracy in places to compare with previous work. Where multiple runs were feasible we use 10x10 fold cross-validation.

Times reported include all time taken to read in and process the documents into the respective representations, as well as the time to train and test the classifier, averaged over all folds where multiple runs were performed.

We have selected a range of datasets on which to evaluate this approach. We have both “document” level datasets, representing units that are (at least usually) several sentences or more long, and “sentence” level datasets, representing units of about one sentence (although sometimes a phrase, or two or three sentences). We also have a contrast between sometimes poorly written online user reviews of products and more professionally written movie reviews. We have one set of datasets which contain an order of magnitude more documents than the others, on which to examine the effects of adding more documents. Finally, we have one dataset that is for a slightly different problem than the others – subjectivity detection rather than sentiment analysis.

5.1 Small Amazon Reviews

This dataset consists of online user reviews across 4 categories and was used in [2]. As shown in Table 1, the numeric features are always slightly more accurate, with substantially higher AUC, while also being considerably faster than the BOW method, but neither performs as well as the linear predictor method used in [2]. The Electronics portion of this dataset was used for parameter tuning. These datasets were manually balanced such that each class represents 50% of the documents, which is not a very natural distribution.

In this case, we also chose to look at how a classifier trained and tested on all datasets together performed in comparison to the average of all classifiers. We found that both methods can train a classifier using all of the data that is slightly better than the average performance of the individual classifiers; however, training this classifier is very, very slow for the BOW method – so much slower that we only performed 2x10 fold cross-validation and used a different, faster computer, rather than 10x10 cross-validation as in the other cases, and it still took many times longer to train. On the other hand, the numeric method trains a combined classifier slightly faster than the sum of the individual numeric classifiers; these two methods clearly scale up very differently as we add more documents.


Table 1. Amazon reviews, BOW vs. Numeric, Accuracy, AUC, and Time

Dataset               Type     Accuracy        AUC             Time (mm:ss)
Electronics (0.844)a  Numeric  0.801 ± 0.005   0.874 ± 0.004   0:01.0
                      BOW      0.791 ± 0.005   0.791 ± 0.005   0:22.2
DVD (0.824)           Numeric  0.797 ± 0.005   0.865 ± 0.005   0:01.4
                      BOW      0.775 ± 0.006   0.776 ± 0.006   0:37.8
Book (0.804)          Numeric  0.768 ± 0.005   0.839 ± 0.005   0:01.4
                      BOW      0.754 ± 0.006   0.754 ± 0.006   0:42.9
Kitchen (0.877)       Numeric  0.814 ± 0.006   0.896 ± 0.004   0:01.0
                      BOW      0.809 ± 0.005   0.809 ± 0.005   0:23.5
All Datab             Numeric  0.796 ± 0.003   0.874 ± 0.002   0:04.6
                      BOW      0.791 ± 0.002   0.791 ± 0.002   10:00.1
Averagec              Numeric  0.795 ± 0.005   0.869 ± 0.005   0:04.9
                      BOW      0.782 ± 0.006   0.782 ± 0.006   2:06.5
Majorityd                      0.474           0.500

a Accuracy in [2]
b One classifier trained on all datasets. BOW classifier is 2x10 CV, all other classifiers 10x10 CV
c Average performance of the individual classifiers over all datasets and total time to train the individual classifiers
d The results for the majority classifier are the same for all datasets given the same seed for the split of the data into folds

5.2 Very Large Datasets

This collection of data is a larger set of Amazon.com reviews from which the previous datasets were created. This collection included three domains with over 100,000 reviews – books, DVDs, and music – which allows us to explore how this approach scales to very large datasets. We used these larger datasets to examine how the numeric features scale in terms of both performance and time. In all cases, the results are reported on a single run using a 10,000-review test set (which is larger than most complete datasets used in previous research). These datasets are all highly imbalanced, with the majority class (5-star reviews) containing from 61–71% of the documents, and the minority class (2-star reviews) containing 4–6% of the reviews.

As shown in Figure 2, the time required to train with the numeric features scales much more gently than the time required to train with the BOW features. Note that the graph features a logarithmic scale for time. For time reasons, we only trained BOW based classifiers on up to 10–15,000 reviews, while we trained the classifiers using numeric features on over 100,000 reviews for each dataset, with 300,000 reviews for the Books dataset. For the books dataset, it took 8 hours and 23 minutes to train on 15,000 documents with BOW features, while with the numeric features we were able to train on 300,000 documents in 1 hour and 15 minutes. In the case of the numeric features, the vast majority of the time is spent scoring words and generating the features, while for BOW most of the time is spent training the machine learning algorithm.


Fig. 2. Time required to train classifiers based on Numeric and Bag of Words features using varying amounts of data, 3 large Ordinal datasets

Figure 3a shows performance of both approaches with up to 15,000 reviews. In this range, the numeric features generally perform better by MSE, although at 15,000 Book reviews BOW is very slightly better. Figure 3b extends these performance results to show the space where we only tested the numeric features; the straight dotted lines represent the performance of the largest BOW classifier, trained on 15,000 or 10,000 reviews depending on the dataset. This shows that if large numbers of documents are available, the numeric method continues to improve.

Similar results are obtained when looking at these datasets in terms of binary classification, with one- and two-star reviews as the negative class and four- and five-star reviews as the positive class.

5.3 Movie Review Datasets

We use a number of datasets created by Bo Pang & Lillian Lee in the domain of movie reviews. The movie review polarity dataset (version 2) and a dataset for sentence level subjectivity detection are introduced in [9], while a dataset for ordinal movie reviews by four different authors and a dataset for the sentiment of movie review “snippets” (extracts selected by RottenTomatoes.com) are introduced in [10].

Table 2 presents the results on the three binary datasets, as well as the results reported by Pang & Lee on the datasets, where available. While in the case of the binary movie reviews the numeric features fall well short of their reported results, on the Subjective Sentences dataset they are very close. Note that this dataset is for the related problem of subjectivity detection and not sentiment analysis. Pang & Lee report results on 10 fold cross validation, while we report results on 10 runs of 10 fold cross validation in order to be less sensitive to the random split of the data.


Fig. 3. Mean Squared Error, Numeric vs. BOW, 3 large Ordinal datasets. Panel (a): up to 15,000 reviews; panel (b): over 10,000 reviews (in thousands). Both panels plot Mean Squared Error against the number of reviews for the Books, DVD, and Music datasets, comparing the Numeric and BOW features.

Table 2. Binary Datasets, Average Performance and Time, with 95% confidence intervals

                    Accuracy        AUC             Time (m:ss)
Movie Review Polarity (Pang & Lee Accuracy: 0.872)
  Numeric           0.824 ± 0.005   0.896 ± 0.005   0:03
  BOW               0.850 ± 0.005   0.850 ± 0.005   1:18
Movie Review Snippets
  Numeric           0.760 ± 0.002   0.841 ± 0.002   0:05
  BOW               0.739 ± 0.002   0.739 ± 0.002   19:56
Subjective Sentences (Pang & Lee Accuracy: 0.92)
  Numeric           0.910 ± 0.002   0.967 ± 0.001   0:05
  BOW               0.880 ± 0.002   0.880 ± 0.002   9:15

Table 3. Ordinal Movie Reviews, BOW vs. Numeric, Accuracy, MSE, and Time, with 95% confidence intervals

Author                Type     MSE             Accuracy        Time (s)
Schwartz (0.51)a      Numeric  0.580 ± 0.013   0.518 ± 0.009   0.83
                      BOW      0.691 ± 0.020   0.510 ± 0.009   13.20
Berardinelli (0.63)   Numeric  0.478 ± 0.010   0.557 ± 0.008   1.48
                      BOW      0.443 ± 0.012   0.644 ± 0.007   35.64
Renshaw (0.50)        Numeric  0.634 ± 0.016   0.468 ± 0.011   1.05
                      BOW      0.696 ± 0.020   0.496 ± 0.009   13.39
Rhodes (0.57)         Numeric  0.490 ± 0.010   0.566 ± 0.007   1.65
                      BOW      0.478 ± 0.011   0.609 ± 0.006   43.79

a Accuracy obtained by Pang & Lee

Table 3 reports the results on the ordinal movie reviews, by author. Again, we compare results to Pang & Lee, noting that we are approximating the values reported in a graph. Comparing to Pang & Lee based on accuracy, we find that in one case the numeric features appear to be slightly better than their best result, and in one other case Pang & Lee’s result is within the confidence range of our numeric features. In the two other cases, Pang & Lee’s result is better than our result for the numeric feature set. However, we also note the comparison of their result to our simple BOW; in two cases our simple BOW classifier appears to be better, while in the other two the results are virtually the same. This confirms our assessment that this simple BOW is a good baseline to compare against.

6 Comparison with Feature Selection Methods

Another approach one might take to speeding up BOW is feature selection – selecting the most relevant features. In this section, we briefly compare the numeric features, plain BOW features, and BOW features reduced through two fast feature selection methods, Chi Squared and Information Gain. We use 5-fold cross-validation on the Subjective Sentences, binary Electronics (2000-review balanced version), and Movie Review Snippets datasets. These feature selection methods both evaluate individual attributes; methods which evaluate subsets of attributes together exist but are much slower [5]. We show the results of these experiments in Table 4.
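For orientation, both criteria score each word independently and keep only the top-ranked ones. A hedged sketch of the chi-squared variant using scikit-learn is shown below; the experiments themselves used WEKA's implementations, so the pipeline, parameter values, and names here are assumptions.

```python
# Sketch of univariate feature selection on BOW features; scikit-learn's chi2
# stands in for WEKA's Chi Squared attribute evaluator.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def bow_with_feature_selection(k=250):
    return make_pipeline(
        CountVectorizer(min_df=5),      # unigrams seen at least 5 times
        SelectKBest(chi2, k=k),         # keep the k highest-scoring words
        LinearSVC(),
    )

# Usage (train_texts and train_labels are placeholders):
# model = bow_with_feature_selection(k=250).fit(train_texts, train_labels)
```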

Table 4. Feature Selection

                          Subjective               Electronics           Snippets
Method        Features    Accuracy  Time (m:ss)    Accuracy  Time (s)    Accuracy  Time (m:ss)
BOW           –           0.874     11:27.0        0.786     19.3        0.732     37:21.3
Numeric       25          0.911     0:04.2         0.790     1.4         0.755     0:04.4
Chi Squared   100         0.828     0:54.4         0.789     5.7         0.655     0:58.5
Chi Squared   250         0.860     1:12.8         0.807     8.0         0.697     2:15.7
Chi Squared   500         0.877     2:18.9         –         –           –         –
Chi Squared   1000        0.883     3:54.1         0.775     14.1        0.743     3:58.5
Chi Squared   1500        –         –              –         –           0.748     5:26.7
Info Gain     100         0.830     0:55.8         0.788     5.2         0.654     0:58.4
Info Gain     250         0.862     1:22.9         0.807     7.8         –         –
Info Gain     500         0.878     1:56.1         0.780     13.5        –         –
Info Gain     1000        0.883     3:03.5         –         –           0.743     4:24.9
Info Gain     1500        –         –              –         –           0.748     8:09.4

For the subjective sentences dataset, feature selection by both methods performed slightly better than plain BOW with 500 selected features and even better with 1000, with substantial time savings over plain BOW. However, the numeric features are still much faster than any of the feature selection methods, and achieve the highest accuracy by a substantial margin. On the electronics dataset all classifiers complete in seconds, but the numeric features are still the fastest. However, in this instance, the feature selection methods which select 250 features both achieve slightly higher accuracy than the numeric features, and, while slower, this difference in time may not be meaningful on a dataset of this size, as both complete in under 10 seconds. Finally, on the Movie Review Snippets dataset we again see that feature selection can save considerable time over plain BOW, and improve performance slightly with enough features; however, the numeric method is again both much faster and more accurate.

7 Discussion

This work decomposes the problem of learning the sentiment of documents into two simpler parts: scoring the strength of different words based on their distribution in positive and negative documents, and then learning document sentiment based on the distribution of those scores. This produces a compact representation, of around 25 features, compared with thousands for an effective BOW based approach.

While this decomposition results in the loss of some information – for instance, if two words appearing together in a document is significant – it appears as if the BOW representation may be too sparse for such relationships to be learned meaningfully. It seems as though the machine learning algorithms for the BOW representation are mainly learning which words are significant indicators of sentiment, but are much slower at this than simple word scoring methods.

The numeric features performed better than BOW in all respects on the two sentence level datasets. In addition, they also performed well on the online user reviews. While the gap in performance narrowed on some datasets when comparing the largest trained BOW classifiers and the numeric classifiers trained with the same number of documents, the numeric features make training on very large datasets much more feasible, and we saw that performance continued to improve when using numeric features on larger and larger datasets. In addition, the most time consuming part of our approach is generating the features. In a system where new documents are frequently being added, such as an online review website, the words in each document only need to be counted once. This would reduce the time needed to update the system with new information, while the BOW/SVM approach would need to be completely retrained to account for new documents.

On the document level movie review datasets, the results are mixed. With the ordinal datasets, the numeric features perform slightly better on accuracy for one of the authors, and we see the worst relative performance overall on two of the authors. However, we note that the MSE differences are relatively large in favor of the numeric features on two of the datasets, and relatively small in favor of BOW on the other two. In the binary movie reviews, BOW has higher accuracy, and while the numeric features retain their advantage in terms of AUC, it is the smallest gap we see on that measure. These two datasets contain both relatively long and relatively well written material. While not conclusive, it may be that the numeric features are particularly adept at dealing with the shorter, less well written material found in all manner of less formal online discourse, including online user reviews – which is a more interesting domain in many respects than professionally written reviews.


8 Conclusions

We have shown that it is possible to greatly condense the features used for machine learning of sentiment analysis and other related tasks for large speed improvements. Second, we have shown that these features often improve performance over a simple BOW representation, and are competitive with other published results. These speed improvements make it possible to process datasets orders of magnitude larger than previously attempted for sentiment analysis, which in turn generally leads to further performance improvements. This method is effective on both longer and shorter documents, as well as on small and large datasets, and may be more resilient to poorly written documents such as those found in online user reviews.

In addition, we have briefly compared this approach with a feature selection based approach. While feature selection can improve speed over plain BOW considerably, and can also increase performance, the numeric features remain considerably faster, particularly on larger datasets, and exceeded the performance of the best feature selection methods on two of the three datasets we examined, and were close on the other.

References

1. Andreevskaia, A., Bergler, S.: When specialists and generalists work together: Overcoming domain dependence in sentiment tagging. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies (2008)

2. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL 2007 (2007), http://acl.ldc.upenn.edu/P/P07/P07-1056.pdf

3. Dave, K., Lawrence, S., Pennock, D.M.: Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the 12th International Conference on World Wide Web (2003)

4. Gaudette, L., Japkowicz, N.: Evaluation methods for ordinal classification. In: Proceedings of the Twenty-second Canadian Conference on Artificial Intelligence, AI 2009 (2009)

5. Hall, M.A., Holmes, G.: Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering 15(6), 1437–1447 (2003)

6. Kennedy, A., Inkpen, D.: Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence 32(2), 223–262 (2006), http://www.site.uottawa.ca/~diana/publications.html

7. Martineau, J., Finin, T.: Delta TFIDF: An improved feature space for sentiment analysis. In: Third AAAI International Conference on Weblogs and Social Media (2009)

8. Mullen, T., Collier, N.: Sentiment analysis using support vector machines with diverse information sources. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004), http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mullen.pdf


9. Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: ACL 2004: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p. 271. Association for Computational Linguistics, Morristown (2004)

10. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: ACL 2005: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, Morristown (2005)

11. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, vol. 2. Now Publishers (2008)

12. Wiebe, J., Wilson, T., Bell, M.: Identifying collocations for recognizing opinions. In: Proceedings of the ACL 2001 Workshop on Collocation (2001)

13. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)


Instance Selection in Semi-supervised Learning

Yuanyuan Guo1, Harry Zhang1, and Xiaobo Liu2

1 Faculty of Computer Science, University of New Brunswick
P.O. Box 4400, Fredericton, NB, Canada E3B 5A3
{yuanyuan.guo,hzhang}@unb.ca
2 School of Computer Science, China University of Geosciences
Wuhan, Hubei, China
[email protected]

Abstract. Semi-supervised learning methods utilize abundant unlabeled data to help to learn a better classifier when the number of labeled instances is very small. A common method is to select and label the unlabeled instances on which the current classifier has high classification confidence, to enlarge the labeled training set, and then to update the classifier; this is widely used in two paradigms of semi-supervised learning: self-training and co-training. However, the original labeled instances are more reliable than the self-labeled instances that are labeled by the classifier. If unlabeled instances are assigned wrong labels and then used to update the classifier, classification accuracy will be jeopardized. In this paper, we present a new instance selection method based on the original labeled data (ISBOLD). ISBOLD considers not only the prediction confidence of the current classifier on unlabeled data but also its performance on the original labeled data only. In each iteration, ISBOLD uses the change in accuracy of the newly learned classifier on the original labeled data as a criterion to decide whether the selected most confident unlabeled instances will be accepted for the next iteration or not. We conducted experiments in self-training and co-training scenarios using Naive Bayes as the base classifier. Experimental results on 26 UCI datasets show that ISBOLD can significantly improve the accuracy and AUC of self-training and co-training.

Keywords: self-training, co-training, instance selection.

1 Introduction

In many real-world machine learning applications, it may be expensive or time-consuming to obtain a large amount of labeled data. On the other hand, it is relatively easy to collect lots of unlabeled data. Learning classifiers from a small number of labeled training instances may not produce good performance. Therefore, various algorithms have been proposed to exploit and utilize the unlabeled data to help to learn better classifiers. Semi-supervised learning is one kind of such algorithms that use both labeled data and unlabeled data.


Many semi-supervised learning algorithms have been proposed in the past decades, including self-training, co-training, semi-supervised support vector machines, graph-based methods, and so on [2,13]. The general idea of self-training [12] and co-training [1] is to iteratively pick some unlabeled instances according to a given selection criterion and move them (together with the labels assigned by the classifier) to the training set to build a new classifier. These selected instances are called “self-labeled” instances in [5]. The main difference between self-training and co-training is that, in co-training, the attributes are split into two separate sub-views and every operation is conducted on the two sub-views, respectively.

A commonly used instance selection criterion is “confidence selection”, which selects unlabeled instances that are predicted by the current classifier with high confidence [1,2,6,8,12], that is, the instances with high class membership probabilities. Other selection methods have also been proposed by researchers. Wang et al. presented an adapted Value Difference Metric as the selection metric in self-training, which does not depend on class membership probabilities [10]. In [5], a method named SETRED is presented that utilizes the information of the neighbors of each self-labeled instance to identify and remove the mislabeled examples from the self-labeled data.

Ideally, the selected unlabeled instances (together with the predicted labels) can finally help to learn a better classifier. In [3], however, it is concluded that unlabeled data may degrade classification performance in some extreme conditions and under common assumptions when the model assumptions are incorrect. In our previous work [4], an extensive empirical study was conducted on some common semi-supervised learning algorithms (including self-training and co-training) using different base Bayesian classifiers. Results on 26 UCI datasets show that the performance of using “confidence selection” is not necessarily superior to that of randomly selecting unlabeled instances. If the current classifier has poor performance and wrongly assigns labels to some self-labeled instances, the final performance will be jeopardized due to the accumulation of mislabeled data. This is a general problem for methods based on the classifier performance on the expanded data, which includes both the original labeled data and the self-labeled data. Since the originally labeled instances are generally more reliable than self-labeled instances, the performance on the former instances alone is more critical. Thus, we conjecture that the classifier should have good performance on the original labeled data if it is to have good prediction performance on future data. More precisely, when the accuracy of the classifier evaluated on the original labeled data decreases, the accuracy on the future testing set generally degrades as well. Hence, utilizing the accuracy on the original labeled data to select more reliable unlabeled instances seems crucial to the final performance of semi-supervised learning.

In this paper, we present an effective instance selection method based on the original labeled data (ISBOLD) to improve the performance of self-training and co-training when using Naive Bayes (NB) as the base classifier. ISBOLD considers both the prediction confidence of the current classifier on the self-labeled data and the accuracy on the original labeled data only. In each iteration, after the selection of the most confident unlabeled instances, the accuracy of the current classifier on the original labeled data is computed and then used to decide whether to add the selected instances to the training set in the next iteration. Experiments on 26 UCI datasets demonstrate that ISBOLD significantly improves the accuracy of self-training and co-training on 6 to 7 datasets and prevents the performance from being degraded on the other datasets, compared to our experimental results in [4]. Besides, ISBOLD significantly improves AUC on 8 to 9 datasets.

The rest of the paper is organized as follows. Section 2 briefly describes the self-training and co-training algorithms and reviews related research work. The new instance selection method based on the original labeled data (ISBOLD) is presented in Section 3. Section 4 shows experimental results on 26 UCI datasets, as well as detailed performance analysis. Finally, we conclude in Section 5.

2 Related Work

Semi-supervised learning methods utilize unlabeled data to help to learn better classifiers when the amount of labeled training data is small. A set L of labeled training instances and a set U of unlabeled instances are given in the semi-supervised learning scenario. In [13], a good survey of research work on several well-known semi-supervised learning methods is given. These algorithms and their variants are also analyzed and compared in [2]. Self-training and co-training are two common algorithms among them.

2.1 Self-training and Co-training Algorithms

Self-training works as follows [12]. A classifier is built from L and used to predict the labels of the instances in U. Then the m instances in U on which the current classifier has high classification confidence are labeled and moved to enlarge L. The whole process iterates until stopped.

Co-training works in a similar way except that it is a two-view learning method [1]. Initially, the attribute set (view) is partitioned into two conditionally independent sub-sets (sub-views). A data pool U′ is created by randomly choosing some instances from U for each sub-view, respectively. On each sub-view, a classifier is built from the labeled data and then used to predict labels for the unlabeled data in its data pool. A certain number of unlabeled instances on which one classifier has high classification confidence are labeled and moved to expand the labeled data of the other classifier, and the same number of unlabeled instances is randomly moved from U to replenish U′. Then the two classifiers are rebuilt from their corresponding updated labeled data, respectively. The process iterates until stopped. In other words, co-training iteratively and alternately uses one classifier to help to “train” the other classifier.

The stopping criterion in self-training and co-training is that either there is no unlabeled instance left or the maximum number of iterations has been reached.


There are two assumptions in co-training to ensure good performance [1]: each sub-view is sufficient to build a good classifier, and the two sub-views are conditionally independent of each other given the class. The two assumptions may be violated in real-world applications. In [8], it is stated that co-training still works when the attribute set is randomly divided into two separate subsets, although the performance may not be as good as when the attributes are split sufficiently and independently.

2.2 Variants of Self-training and Co-training Algorithms

Researchers have presented different variants of the self-training and co-training algorithms.

One kind of method is to use all the unlabeled instances in each iteration so that no selection criterion is needed. A self-training style method, semi-supervised EM, is presented in [9]. During each iteration, all the unlabeled instances are given predicted labels and then used to enlarge the training set and update the classifier. In [8], co-training is combined with EM to generate a new algorithm, co-EM, which in each iteration uses all the unlabeled instances instead of a number of instances picked from the data pool.

Another kind of method is to use active learning to select unlabeled instances and then ask human experts to label them. Hence, no mislabeled examples will occur, in principle. In [7], an active learning method is used to select unlabeled instances for the multi-view semi-supervised Co-EM algorithm, and labels are assigned to the selected unlabeled instances by experts. However, active learning methods are not applicable if we do not have human experts available.

Some researchers have also used different selection techniques to decide which unlabeled instances should be used in each iteration. In [10], the authors presented an adapted Value Difference Metric as the selection metric in self-training. In [5], a data editing method is applied to identify and remove the mislabeled examples from the self-labeled data.

In our previous work [4], an empirical study on 26 UCI datasets showed that, in self-training and co-training, using “confidence selection” cannot always outperform randomly selecting unlabeled instances. If the classification performance of the current classifier is poor, wrong labels may be assigned to most unlabeled instances and the final performance of semi-supervised learning will be affected accordingly. Generally speaking, the original labeled instances are more reliable than the instances with labels predicted by the current classifier. Hence, the performance on the original labeled data is an important factor reflecting the final performance of semi-supervised learning.

3 Instance Selection Based on the Original Labeled Data

Motivated by the existing work, in this paper we present a new method, Instance Selection Based on the Original Labeled Data (ISBOLD), to improve the performance of self-training and co-training when using NB as the base classifier. The main idea of ISBOLD is to use the accuracy on the original labeled data only to prevent adding unlabeled instances that will possibly degrade the performance. How to use ISBOLD in the self-training and co-training scenarios is described in the following two subsections, respectively.

3.1 ISBOLD for Self-training

In order to describe our method, some notation is used here. In iteration t, we use L_t to denote the new labeled training set, C_t to represent the classifier built on L_t, and Acc_t the accuracy of C_t on the original labeled data L_0. The detailed algorithm is shown in Figure 1.

1. Set t, the iteration counter, to 0.
2. Build a classifier C_t on the original labeled data L_0.
3. Compute Acc_t, which is the accuracy of C_t on L_0.
4. While the stopping criteria are not satisfied:
   (a) Use C_t to predict a label for each instance in U.
   (b) Generate L^s_{t+1}: select the m unlabeled instances on which C_t has high classification confidence, and assign a predicted label to each selected instance. Delete the selected instances from U.
   (c) L_{t+1} = L_t ∪ L^s_{t+1}.
   (d) Build a classifier C_{t+1} on L_{t+1}.
   (e) Compute Acc_{t+1}, which is the accuracy of C_{t+1} on L_0.
   (f) If Acc_{t+1} < Acc_t, then L_{t+1} = L_t, and rebuild C_{t+1} on L_{t+1}.
   (g) Increase t by 1.
5. Return the final classifier.

Fig. 1. Algorithm of ISBOLD for self-training

The difference between ISBOLD and the common confidence selection method in self-training is displayed in steps 4(e) and 4(f). In iteration t+1, after selecting the most confident unlabeled instances and assigning labels to them (for simplicity, the set of those selected instances is denoted as L^s_{t+1}), the training set L_{t+1} = L_t ∪ L^s_{t+1}. Now we build a classifier C_{t+1} on L_{t+1} and compute Acc_{t+1}. If Acc_{t+1} < Acc_t, L_{t+1} is reset to be equal to L_t, and C_{t+1} is updated on L_{t+1} accordingly. The whole process iterates until there is no unlabeled instance left or the maximum number of iterations is reached.

The reason that we remove L^s_{t+1} from L_{t+1} once the accuracy on L_0 decreases is that, if adding L^s_{t+1} to the training set degrades the classifier's performance on L_0, it is very possible that the performance of the current classifier on the test set degrades as well. Hence, we use this method to roughly prevent possible performance degradation. Furthermore, notice that in step 4(b), all the selected instances are removed from U, which means that each selected instance is either added to the labeled data or discarded.
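A compact sketch of Figure 1 in Python is given below, with scikit-learn's MultinomialNB standing in for WEKA's Naive Bayes (an assumption, as are the variable names, the dense feature matrices, and the batch size m); L0_X/L0_y are the original labeled data and U_X the unlabeled pool.

```python
# Sketch of ISBOLD self-training (Figure 1); names and parameters are placeholders.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def isbold_self_training(L0_X, L0_y, U_X, m=10, max_iter=80):
    clf = MultinomialNB().fit(L0_X, L0_y)
    acc = clf.score(L0_X, L0_y)                 # accuracy on the *original* labeled data
    L_X, L_y = L0_X.copy(), L0_y.copy()
    for _ in range(max_iter):
        if U_X.shape[0] == 0:
            break
        proba = clf.predict_proba(U_X)
        pick = np.argsort(proba.max(axis=1))[::-1][:m]   # m most confident instances
        pseudo_y = clf.classes_[proba[pick].argmax(axis=1)]

        cand_X = np.vstack([L_X, U_X[pick]])
        cand_y = np.concatenate([L_y, pseudo_y])
        cand_clf = MultinomialNB().fit(cand_X, cand_y)
        cand_acc = cand_clf.score(L0_X, L0_y)

        if cand_acc >= acc:                     # accept only if accuracy on L0 does not drop
            L_X, L_y, clf, acc = cand_X, cand_y, cand_clf, cand_acc
        # Either way, the selected instances leave the unlabeled pool (step 4(b)).
        U_X = np.delete(U_X, pick, axis=0)
    return clf
```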


3.2 ISBOLD for Co-training

A similar selection method is used in co-training. We denote the classifiers on the two sub-views in iteration t as C^a_t and C^b_t. The algorithm is shown in Figure 2.

1. Set t, the iteration counter, to 0.
2. Randomly partition the attribute set Att into two separate sets Att_a and Att_b. Generate L^a_0 and L^b_0 from L. Generate U_a and U_b from U.
3. Generate data pools U'_a and U'_b by randomly choosing u instances from U_a and U_b, respectively.
4. Use L^a_0 to train a classifier C^a_t.
5. Use L^b_0 to train a classifier C^b_t.
6. Compute Acc^a_t, which is the accuracy of C^a_t on L^a_0.
7. Compute Acc^b_t, which is the accuracy of C^b_t on L^b_0.
8. While the stopping criteria are not satisfied:
   (a) Use C^a_t to predict a label for each instance in U'_a. Use C^b_t to predict a label for each instance in U'_b.
   (b) Generate L^as_{t+1}: select the m unlabeled instances on which C^b_t has high classification confidence, together with their predicted labels. Delete the selected instances from U'_b.
   (c) Generate L^bs_{t+1}: select the m unlabeled instances on which C^a_t has high classification confidence, together with their predicted labels. Delete the selected instances from U'_a.
   (d) L^a_{t+1} = L^a_t ∪ L^as_{t+1}. L^b_{t+1} = L^b_t ∪ L^bs_{t+1}.
   (e) Use L^a_{t+1} to train a classifier C^a_{t+1}.
   (f) Compute Acc^a_{t+1}, which is the accuracy of C^a_{t+1} on L^a_0.
   (g) If Acc^a_{t+1} < Acc^a_t, then L^a_{t+1} = L^a_t, and rebuild C^a_{t+1} on L^a_{t+1}.
   (h) Use L^b_{t+1} to train a classifier C^b_{t+1}.
   (i) Compute Acc^b_{t+1}, which is the accuracy of C^b_{t+1} on L^b_0.
   (j) If Acc^b_{t+1} < Acc^b_t, then L^b_{t+1} = L^b_t, and rebuild C^b_{t+1} on L^b_{t+1}.
   (k) Randomly move m instances from U_a to replenish U'_a. Randomly move m instances from U_b to replenish U'_b.
   (l) Increase t by 1.

Fig. 2. Algorithm of ISBOLD for co-training

The difference between ISBOLD and the common confidence selection method in co-training appears in steps 8(f), 8(g), 8(i), and 8(j). In iteration t+1, on sub-view a, after selecting a certain number of unlabeled instances on which C^b_t has high classification confidence, a label is assigned to each selected instance (for simplicity, the set of those selected instances is denoted as L^as_{t+1}). Then L^a_{t+1} = L^a_t ∪ L^as_{t+1} and C^a_{t+1} is built on L^a_{t+1}. Now we compute Acc^a_{t+1}, which represents the accuracy of C^a_{t+1} on L^a_0. If Acc^a_{t+1} < Acc^a_t, then L^a_{t+1} = L^a_t and C^a_{t+1} is updated accordingly. The same steps are repeated on sub-view b to generate L^b_{t+1} and C^b_{t+1}. New unlabeled instances are then replenished from the remaining unlabeled data into the data pool of each sub-view. The whole process iterates until there is no unlabeled instance left or the maximum number of iterations is reached.
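The per-view bookkeeping in Figure 2 follows the same accept/reject pattern; a compressed sketch of a single iteration is given below, again with scikit-learn's MultinomialNB standing in for Naive Bayes. The two views are assumed to be column-index arrays over one dense feature matrix, the replenishment of the data pools from U (step 8(k)) is omitted for brevity, and all names are placeholders.

```python
# Compressed sketch of one ISBOLD co-training iteration (Figure 2).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def isbold_cotrain_step(views, labeled, pools, L0_X, L0_y, prev_acc, m=10):
    """views/labeled/pools/prev_acc are per-view dicts keyed by 'a' and 'b'."""
    for me, other in (("a", "b"), ("b", "a")):
        cols, t_cols = views[me], views[other]
        pool_X = pools[other]
        # The classifier on the *other* view labels instances from its own pool for this view.
        teacher = MultinomialNB().fit(labeled[other][0][:, t_cols], labeled[other][1])
        proba = teacher.predict_proba(pool_X[:, t_cols])
        pick = np.argsort(proba.max(axis=1))[::-1][:m]
        pseudo_y = teacher.classes_[proba[pick].argmax(axis=1)]

        X_me, y_me = labeled[me]
        cand_X = np.vstack([X_me, pool_X[pick]])
        cand_y = np.concatenate([y_me, pseudo_y])
        acc = MultinomialNB().fit(cand_X[:, cols], cand_y).score(L0_X[:, cols], L0_y)
        if acc >= prev_acc[me]:          # accept only if accuracy on the original L does not drop
            labeled[me], prev_acc[me] = (cand_X, cand_y), acc
        pools[other] = np.delete(pool_X, pick, axis=0)   # selected instances leave the pool
    return labeled, pools, prev_acc
```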

4 Experimental Results and Analysis

4.1 Experimental Settings

In order to examine the performance of ISBOLD, we conducted experiments on 26 UCI datasets, including 18 binary-class datasets and 8 multi-class datasets. These datasets are downloaded from a package of 37 classification problems, “datasets-UCI.jar”1. Each dataset is then preprocessed in the Weka software [11] by replacing missing values, discretizing numeric attributes, and removing any attribute whose number of attribute values is almost equal to the number of instances in the dataset [4]. We only use 26 datasets out of the package because the other 11 datasets have extremely skewed class distributions. For example, in the hypothyroid dataset, the frequency of each class value is 3481, 194, 95 and 2, respectively. When randomly sampling the labeled data set in semi-supervised learning, the classes that have very small frequencies may not appear in some generated datasets if we want to keep the same class distributions. Usually researchers merge the minor classes into a major class or simply delete instances with minor classes. However, to minimize any possible influence, we ignored those datasets with extremely skewed class distributions. The 26 datasets are the same as those used in our previous work [4].

On each dataset, 10 runs of 4-fold stratified cross-validation are conducted. That is, 25% of the original data is put aside as the testing set to evaluate the performance of the learning algorithms. The remaining 75% of the data is divided into labeled data (L) and unlabeled data (U) according to a pre-defined percentage of labeled data (lp). The data splitting setting follows those in [1,4,5,6]. In our experiments, lp is set to 5%. Therefore, 25% of the data is kept as the testing set, 5% of the remaining 75% is randomly sampled as L, while the other 95% of that 75% is kept as U. When generating L, we made sure that L and the original training data had the same class distributions.

Naive Bayes is used as the base classifier in self-training and co-training. The maximum number of iterations in both is set to 80. The size of the data pool in co-training is set to 50% of the size of U. Accuracy and AUC are used as performance measurements. In our experiments on co-training, the attributes are randomly split into two subsets.

4.2 Results Analysis

Performance comparison results of using ISBOLD and using the common “confidence selection” method in self-training and co-training are shown in Table 1 and Table 2. For simplicity, the methods are denoted as ISBOLD and CF

1 They are available from http://www.cs.waikato.ac.nz/ml/weka/


Table 1. Accuracy of CF vs ISBOLD in self-training and co-training

(a) self-training

Dataset CF ISBOLD

balance-scale    59.52  66.21
breast-cancer    65.09  65.61
breast-w         96.67  96.34
colic            74.54  75.38
colic.ORIG       55.05  60.57
credit-a         80.68  80.78
credit-g         60.62  66.03 v
diabetes         70.55  70.53
heart-c          81.55  81.15
heart-h          83.06  82.41
heart-statlog    81.37  80.74
hepatitis        79.70  78.34
ionosphere       80.97  79.86
iris             90.31  90.05
kr-vs-kp         67.26  80.07 v
labor            88.26  87.92
letter           40.38  57.39 v
mushroom         91.90  92.57 v
segment          63.49  72.88 v
sick             91.54  94.15
sonar            55.72  57.93
splice           82.05  85.48 v
vehicle          41.79  48.35
vote             87.89  88.53
vowel            18.75  21.78
waveform-5000    77.98  78.87

mean 71.80 74.61

w/t/l 6/20/0

(b) co-training

Dataset CF ISBOLD

balance-scale    59.10  67.17
breast-cancer    70.41  71.00
breast-w         96.85  96.47
colic            76.60  75.76
colic.ORIG       55.19  62.04
credit-a         81.36  79.67
credit-g         63.04  67.72 v
diabetes         67.51  69.58
heart-c          82.77  80.13
heart-h          81.46  78.60
heart-statlog    82.03  80.30
hepatitis        81.04  80.21
ionosphere       81.50  83.08
iris             80.79  78.98
kr-vs-kp         59.22  77.36 v
labor            77.21  78.43
letter           36.67  56.05 v
mushroom         91.74  92.38 v
segment          61.49  71.64 v
sick             93.40  93.56
sonar            55.43  58.08
splice           73.91  82.63 v
vehicle          41.57  47.86
vote             88.21  88.60
vowel            18.83  23.36
waveform-5000    71.61  75.91 v

mean 70.34 73.71

w/t/l 7/19/0

in the tables. In each table, the figures in each row are the average accuracy or AUC over 10 runs of 4-fold cross-validation on the corresponding dataset. The row “w/t/l” indicates that using ISBOLD in the corresponding column wins on w datasets (marked by ‘v’), ties on t datasets, and loses on l datasets (marked by ‘*’) against using “confidence selection” in self-training or co-training, under a two-tailed pairwise t-test at the 95% significance level. Values in the row “mean” are the average accuracy or AUC over the 26 datasets.
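For illustration, the w/t/l decision for a single dataset can be computed from the paired per-run accuracies as sketched below; the function and variable names are hypothetical, and SciPy's paired t-test is used as a stand-in for whatever implementation the authors used.

```python
from scipy import stats

def win_tie_loss(acc_isbold, acc_cf, alpha=0.05):
    """Compare paired per-fold accuracies of ISBOLD and CF on one dataset.
    Returns 'w', 't', or 'l' from ISBOLD's point of view."""
    t_stat, p_value = stats.ttest_rel(acc_isbold, acc_cf)  # two-tailed paired t-test
    if p_value >= alpha:
        return 't'                      # no significant difference: tie
    return 'w' if t_stat > 0 else 'l'   # significant: the sign of t decides win or loss
```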

Table 1(a) shows the average accuracy of using ISBOLD and CF in self-training. The “w/t/l” t-test results show that ISBOLD significantly improves classification accuracy on 6 datasets. Values in the row “mean” also demonstrate that ISBOLD improves the average performance. Table 1(b) shows the average accuracies in co-training. The “w/t/l” t-test results show that ISBOLD significantly improves the performance of co-training on 7 datasets, and the mean value increases from 70.34 to 73.71.


Table 2. AUC of CF vs ISBOLD in self-training and co-training

(a) self-training

Dataset CF ISBOLD

balance-scale    61.37  66.68
breast-cancer    63.98  63.48
breast-w         99.07  99.08
colic            79.24  78.43
colic.ORIG       51.62  58.49
credit-a         86.81  86.79
credit-g         56.56  65.24 v
diabetes         78.03  76.36
heart-c          83.97  83.92
heart-h          83.74  83.74
heart-statlog    88.93  88.64
hepatitis        83.02  80.99
ionosphere       86.86  86.68
iris             98.33  98.29
kr-vs-kp         74.65  89.03 v
labor            96.59  96.72
letter           86.09  93.08 v
mushroom         98.04  98.81 v
segment          90.86  95.24 v
sick             91.51  93.96
sonar            58.64  62.21
splice           94.40  96.23 v
vehicle          59.63  66.95 v
vote             96.31  96.52
vowel            57.65  64.49 v
waveform-5000    88.85  90.96 v

mean 80.57 83.12

w/t/l 9/17/0

(b) co-training

Dataset CF ISBOLD

balance-scale    60.44  65.34
breast-cancer    63.51  64.37
breast-w         99.22  99.19
colic            78.99  79.08
colic.ORIG       49.62  55.82
credit-a         88.05  86.35
credit-g         55.33  61.62
diabetes         72.61  74.95
heart-c          84.02  83.80
heart-h          83.77  83.50
heart-statlog    90.03  88.03
hepatitis        78.38  73.19
ionosphere       87.89  88.92
iris             93.21  92.27
kr-vs-kp         66.86  86.39 v
labor            87.76  85.18
letter           82.98  92.57 v
mushroom         97.89  98.75 v
segment          87.93  94.82 v
sick             87.74  93.83
sonar            59.59  62.93
splice           88.65  94.87 v
vehicle          59.56  67.09 v
vote             96.31  96.46
vowel            57.97  66.44 v
waveform-5000    84.22  89.54 v

mean 78.56 81.74

w/t/l 8/18/0

Comparison results on AUC in self-training and co-training are displayed in Table 2. It can be observed that, using ISBOLD, the AUC of self-training is significantly improved on 9 datasets, and the mean value increases from 80.57 to 83.12. Similarly, the AUC of co-training is sharply improved on 8 datasets, and the mean value rises from 78.56 to 81.74.

4.3 Learning Curves Analysis

Based on our previous work [4], we conjecture that the classifier should have good prediction performance on the testing set if its accuracy on the original labeled data does not degrade. To verify this conjecture and to further examine the performance of ISBOLD during each iteration, learning curves of one random run of the two self-training methods on the datasets vehicle and kr-vs-kp are displayed in


Figure 3 and Figure 4, respectively. The data splitting setting is the same as that in Subsection 4.1. Curves for co-training are omitted here due to space limitations.

On each graph, at each iteration t, the accuracy values of classifier Ct on the original labeled data L0 and on the testing set are displayed for ISBOLD and for CF in self-training, respectively. Curves “ISBOLD-L0” and “ISBOLD-test” show accuracy values on the original labeled data L0 and on the testing set, respectively, when using ISBOLD in self-training on the dataset. Curves “CF-L0” and “CF-test” display accuracy values on L0 and on the testing set, respectively, when using “confidence selection” in self-training on the dataset.

[Figure: accuracy (y-axis) vs. the number of iterations (x-axis) for curves ISBOLD-L0, ISBOLD-test, CF-L0, CF-test on the vehicle dataset]

Fig. 3. Learning curves on the vehicle dataset

[Figure: accuracy (y-axis) vs. the number of iterations (x-axis) for curves ISBOLD-L0, ISBOLD-test, CF-L0, CF-test on the kr-vs-kp dataset]

Fig. 4. Learning curves on the kr-vs-kp dataset

According to our conjecture, when the accuracy on the original labeled data L0 decreases, the accuracy on the corresponding testing set generally decreases as well. This is indeed observed in the trends of curves “CF-L0” and “CF-test” in Figure 3 and Figure 4: curve “CF-test” generally goes down when curve “CF-L0” goes down.


ISBOLD is built on our conjecture that the classifier will have good prediction performance on the testing set if its accuracy on the original labeled data does not degrade during each iteration. As shown in Figure 3 and Figure 4, comparing the curves of the “confidence selection” method to those of ISBOLD, ISBOLD can sharply improve the accuracy on the testing set while improving it on L0. When the accuracy on L0 does not degrade, the final accuracy on the testing set does not significantly decrease. These observations confirm that using the accuracy on the original labeled data to decide whether to accept the selected unlabeled instances into the next iteration is an effective way to improve the performance of semi-supervised learning.

5 Conclusions and Future Work

In this paper, we presented a new instance selection method, ISBOLD, to improve the performance of self-training and co-training when using naive Bayes as the base classifier. During each iteration, after selecting a number of unlabeled instances for which the current classifier has high classification confidence, we use the accuracy of the current classifier on the original labeled data to decide whether to accept the selected unlabeled instances into the labeled training set of the next iteration. Experiments on 26 UCI datasets show that ISBOLD can significantly improve the performance of self-training and co-training on many datasets. The learning curve analysis gives a vivid demonstration and experimentally supports the feasibility of our method.

In future work, we will try different base classifiers, such as non-naive Bayesian classifiers and decision trees, and extend the method to more semi-supervised learning approaches. Besides, theoretical analysis will also be carried out to help understand the behaviour of the method. Based on this work, we will propose new methods to improve the performance of semi-supervised learning.

References

1. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the 1998 Conference on Computational Learning Theory (1998)

2. Chapelle, O., Scholkopf, B., Zien, A. (eds.): Semi-supervised learning. MIT Press, Cambridge (2006)

3. Cozman, F.G., Cohen, I.: Unlabeled data can degrade classification performance of generative classifiers. In: Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference (2002)

4. Guo, Y., Niu, X., Zhang, H.: An extensive empirical study on semi-supervised learning. In: The 10th IEEE International Conference on Data Mining (2010)

5. Li, M., Zhou, Z.H.: SETRED: self-training with editing. In: Proceedings of the Advances in Knowledge Discovery and Data Mining (2005)

6. Ling, C.X., Du, J., Zhou, Z.H.: When does co-training work in real data? In: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2009)


7. Muslea, I., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multi-view learning. In: Proceedings of the Nineteenth International Conference on Machine Learning (2002)

8. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Proceedings of the 9th International Conference on Information and Knowledge Management (2000)

9. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39, 103–134 (2000)

10. Wang, B., Spencer, B., Ling, C.X., Zhang, H.: Semi-supervised self-training for sentence subjectivity classification. In: The 21st Canadian Conference on Artificial Intelligence, pp. 344–355 (2008)

11. Witten, I.H., Frank, E. (eds.): Data mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

12. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196 (1995)

13. Zhu, X.J.: Semi-supervised learning literature survey (2008)


Determining an Optimal Seismic Network Configuration Using Self-Organizing Maps

Machel Higgins1, Christopher Ward1, and Silvio De Angelis2

1 The University of the West Indies
2 University of Washington

Abstract. The Seismic Research Centre, University of the West Indies, operates a seismic network that performs suboptimally in detecting, locating, and correctly determining the magnitude of earthquakes, due to a diverse constitution of seismometers and a site selection process that approximates an educated guess. My work seeks to apply Self-Organizing Maps (SOM) to arrive at the optimal network configuration and to aid in site selection.

1 Introduction

The University of the West Indies, Seismic Research Centre (SRC), currently operates a network of seismometers, the Eastern Caribbean Seismic Network (ECSN), in all English-speaking countries in the Eastern Caribbean, spanning the length of the island arc from Anguilla to Trinidad. The ECSN has been upgraded in stages since 1956 and currently comprises seismometers of differing capabilities with regard to monitoring earthquakes and volcanoes. When upgrading the seismic network, the process of selecting new sites aimed at creating a denser and evenly spaced network. This heterogeneous mix of seismometers, along with the site selection process that was used, has resulted in a seismic network configuration that may not be optimized in its ability to detect, locate, and correctly determine the magnitude of earthquakes.

This work's intention is to determine a seismic network configuration with the best magnitude detection capability. That is, the seismic network should be able to detect most or all earthquakes of appreciable size that the physical characteristics of the region will allow, and record the largest-magnitude earthquakes with the least number of sensors being saturated (clipping).

Several bodies of work exist that examine the problem of optimizing seismic network configurations. Most of these previous works have approached the site selection process by employing ideas from optimal experimental design to address the problem proposed by Kijko [3]. Steinberg et al. [1,2] extended this idea to incorporate a statistical approach, minimizing the error of hypocenter locations using the D-criterion, which takes into account multiple seismic event sources. Hardt and Scherbaum [4] used simulated annealing to determine an optimal seismic network for one event source and aftershock investigations. Bartal et al. presented an approach, based on a genetic algorithm,


that produces similar results to Steinberg's and allows for more flexibility in the space in which a station may be placed and in the number of event sources. Unfortunately, the parameters of the genetic algorithm become unwieldy when many more event sources are added.

Most of the methods previously mentioned attempt to optimize the seismic network by minimizing the location error. This is only appropriate for very small networks with few seismic sources. With regard to the ECSN, the seismic array spans a large region with heterogeneous layer-velocity models and several seismic sources, so attempting to model phase arrivals, their errors, and the network's efficacy becomes intractable. These methods can be altered to minimize the detected magnitude instead of the error in phase arrivals, but they remain inapplicable to a reconfiguration of the ECSN. With the exception of Bartal [5] and Hardt [4], the methods are strictly tied to the creation of a rectangular grid where stations are allowed to move between grid points around one or more epicentres. This is inappropriate for the region that the ECSN monitors: stations are placed on a slivered island archipelago surrounded by seismicity.

To determine the minimum magnitude detection threshold of the ECSN, Brune source modelling [6,7] was carried out. A region of 890 x 890 km centred on the Eastern Caribbean island arc was divided into a grid where, for each grid cell, the rupture dimensions [6,7,9] and the expected shear wave amplitudes and corner frequencies for earthquake magnitudes from Mw = 2 to Mw = 8 were modelled. Each amplitude was compared to all sites to determine, with attenuation from grid cell to site applied [8], the minimum amplitude that exceeds the ambient noise level in the same frequency band. Shear wave amplitudes with corner frequencies below the cut-off frequencies of the sites' seismometers were disregarded. The noise in the band of 0.01 Hz to 50 Hz for all sites was analyzed using waveform data over a period of three months. Not all sites are equipped with seismometers whose frequency response ranges allow the noise in the band of interest to be investigated, so cubic spline interpolation was carried out wherever interpolated points were on the order of one surface wave wavelength from real data points. Fortunately, with a nominal surface wave velocity of ~3.5 km/s [10] for the Eastern Caribbean, all sites satisfied this requirement.

An earthquake magnitude probability function was created with the intention of establishing the efficacy of the ECSN's minimum magnitude detection threshold. A complete earthquake catalogue, drawing on the various agencies monitoring seismicity within the region, was compiled and aggregated to cluster main and aftershock seismic events. From this compilation, probability density functions per magnitude range were derived for each cell of the grid previously used in determining the minimum magnitude threshold.

2 Optimizing Seismic Network with SOM

This project intends to minimize the earthquake magnitude detection threshold by optimizing the seismic network via the application of a Self-Organizing Map (SOM). A SOM was chosen because of its ability to sort, order


and classify data [11]. For these tasks, the SOM has been used extensively in data mining, signal processing, and pattern recognition [11]; its prevalence is due to its simple algorithm and its ability to transform input that is non-linear and arbitrary in dimension into a low-dimensional output [11]. For these reasons, and because of the direct relationship the solution has to real-world space, it is advantageous to implement a SOM to solve the problem of optimizing a seismic network configuration.

In minimizing the magnitude detection threshold, the following inputs are considered: the earthquake magnitude probability distribution; a pool of seismometers with known dynamic ranges, frequency responses, and sensitivities; sites and their ambient noise characteristics; and volcano locations. A two-dimensional weight map is created that is equivalent in scale to, but denser than, the grid used to determine the minimum magnitude detection threshold. In this map, each unit consists of a weight vector whose elements represent a seismometer, chosen randomly from a defined pool, and an associated site with its characteristics. The association of seismometers to units is dynamic, while a unit is fixed to the closest site. The weight vector of a unit is ultimately combined into an overall weight called the sensor-site suitability (SSS), derived from the seismometer's ability to detect earthquake magnitudes: Brune source modelling is performed for a surrounding region whose radius is where the magnitude's shear wave amplitude, scaled by the seismometer's benchmarked sensitivity, is attenuated by 5 decibels. The resulting SSS is an index between 0 and 1 and serves as the overall weight used to compare units during the SOM operations. SSS values are also determined for inputs by combining the input's vector through a comparison of the seismometer's characteristics to an idealized site located in a region where the probability of an earthquake occurrence at any magnitude is the average over the entire region.

Fig. 1. The SOM Network

A typical SOM generates its feature map by updating units' weights through competitive learning: selecting an input value and determining the unit, the Best Matching Unit (BMU), whose weight vector most closely matches that of the input value. The BMU's weight and the weights of surrounding units are then scaled to be more similar to the input value. The training in the SOM applied to the problem at hand takes a different tactic, in that the units in the BMU's neighbourhood are not trained to be more similar to the input but to the


complement of the BMU. This tactic serves to shape the decision borders so that two or more seismometers of equivalent capabilities are not sited close to each other. Another tactic is that a unit may quickly become resistant to training once it is deemed to have perfect suitability for monitoring a volcano in its vicinity. These tactics have been employed in other works [12], where it has been shown that map formation and a solution are still possible.
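The following Python sketch is one loose interpretation of the modified update rule described above, assuming each unit carries a single scalar SSS-like weight in [0, 1]; it is not the author's implementation, and all names and the Gaussian neighbourhood function are assumptions.

```python
import numpy as np

def som_step(weights, x, lr, sigma):
    """One training step of the modified SOM sketched above.
    `weights` is an (H, W) grid of scalar SSS-like weights in [0, 1] and `x` is
    the input's SSS value. The BMU moves toward the input, while its
    neighbourhood is pushed toward the BMU's complement so that similar
    sensors are not sited next to each other. All names are illustrative."""
    dist = np.abs(weights - x)                           # similarity of each unit to the input
    bmu = np.unravel_index(np.argmin(dist), weights.shape)

    rows, cols = np.indices(weights.shape)
    grid_dist2 = (rows - bmu[0])**2 + (cols - bmu[1])**2
    h = np.exp(-grid_dist2 / (2.0 * sigma**2))           # Gaussian neighbourhood function

    target = np.full_like(weights, 1.0 - weights[bmu])   # complement of the BMU's weight
    target[bmu] = x                                      # the BMU itself still tracks the input
    weights += lr * h * (target - weights)
    return weights, bmu
```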

3 Discussion

To validate the results, the conventional measure of self-organization [13], the topographic error of the feature map, will be ascertained by finding the average similarity of each unit. If self-organization has been achieved, the units' seismometers are then assigned to sites. The minimum magnitude detection threshold is then calculated for the resulting reconfigured seismic network. To date, there has been moderate self-organization, and further refinement of the neighbourhood update function is necessary. It is hoped that, with a successful SOM, not only will the best magnitude detection threshold be achieved but new sites can also be chosen through this scheme.

References

1. Rabinowitz, N., Steinberg, D.M.: Optimal configuration of a seismographic network: A statistical approach. Bull. Seism. Soc. Am. 80(1), 187–196 (1990)

2. Steinberg, D.M., et al.: Configuring a seismograph network for optimal monitoring of fault lines and multiple sources. Bull. Seism. Soc. Am. 85(6), 1847–1857 (1995)

3. Kijko, A.: An algorithm for the optimum distribution of a regional seismic network - I. Pageoph. 115, 999–1009 (1977)

4. Hardt, M., Scherbaum, F.: The design of optimum networks for aftershock recordings. Geophys. J. Int. 117, 716–726 (1994)

5. Bartal, Y., et al.: Optimal Seismic Networks in Israel in the Context of the Comprehensive Test Ban Treaty. Bull. Seism. Soc. Am. 90(1), 151–165 (2000)

6. Brune, J.N.: Tectonic stress and the spectra of seismic shear waves from earthquakes. J. Geophys. Res. 75, 4997–5009 (1970)

7. Brune, J.N.: Correction. J. Geophys. Res. 76, 5002 (1971)

8. Brune, J.N.: Attenuation of dispersed wave trains. Bull. Seism. Soc. Am. 53, 109–112 (1962)

9. Kanamori, H.: The energy release in great earthquakes. J. Geophys. Res. 82, 2981–2986 (1977)

10. Beckles, D., Shepherd, J.B.: A program for estimating the hypocentral coordinates of regional earthquakes. Presented at the Primera Reunion de la Asociacion Ibero-Latinoamericana de Geofisica (1977)

11. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78, 1464–1480 (1990)

12. Neme, A., Hernández, S., Neme, O., Hernández, L.: Self-Organizing Maps with Non-cooperative Strategies. In: Príncipe, J.C., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 200–208. Springer, Heidelberg (2009)

13. Bauer, H., Herrmann, M., Villmann, T.: Neural Maps and Topographic Vector Quantization. Neural Networks 12(4-5), 659–676 (1999)


Comparison of Learned versus Engineered Features for Classification of Mine Like Objects from Raw Sonar

Images

Paul Hollesen1, Warren A. Connors2, and Thomas Trappenberg1

1 Department of Computer Science, Dalhousie University
{hollense,tt}@cs.dal.ca

2 Defence Research and Development Canada
[email protected]

Abstract. Advances in high-frequency sonar have provided increasing resolution of sea bottom objects, providing higher-fidelity sonar data for automated target recognition tools. Here we investigate whether advanced techniques from the fields of visual object recognition and machine learning can be applied to classify mine-like objects from such sonar data. In particular, we investigate whether the recently popular Scale-Invariant Feature Transform (SIFT) can be applied to such high-resolution sonar data. We also follow up our previous approach of applying the unsupervised learning of deep belief networks, and advance our methods by applying a convolutional Restricted Boltzmann Machine (cRBM). Finally, we now use Support Vector Machine (SVM) classifiers on these learned features for final classification. We find that the cRBM-SVM combination slightly outperformed the SIFT features and yielded encouraging performance in comparison to state-of-the-art, highly engineered template matching methods.

1 Introduction

Naval mine detection and classification is a difficult, resource-intensive task. Mine detection and classification depends on the training and skill level of the human operator, the resolution and design of the sonar, and the environmental conditions in which the mines are detected. Research has occurred over the last 25 years into both sensor development and the processing of sonar data. Although the sensors and capability of mine countermeasures platforms have improved in this time, operator overload and fatigue have kept the duty cycles of mine detection and classification short, diminishing the effectiveness of Mine Counter Measures (MCM) platforms.

Recent research focuses on the development of computer-aided tools for the detection and classification of bottom objects [1,2,3]. This typically takes the form of a detection phase, where mine-like objects are selected from the seabed image, and a classification phase, where the objects are fitted to a multi-class set of potential mines. This detection and classification process has typically been implemented using a set of image processing tools (Z-test, matched filter), feature extraction, and template-based classification [1,2,3]. These techniques are effective at finding mines, but are sensitive to the tuning of the parameters of the processing method and to the sea bottom environment under test [2,3].


Learning algorithms, such as artificial neural networks, have been examined for the mine problem; however, success has been limited, and these methods have required the training sets to closely reflect the sea bottom environment of the area where the system will be tested. Earlier work includes using a deep belief network (DBN) [4], which is a stack of multiple Restricted Boltzmann Machines (RBMs) [5], to learn to extract features from side scan sonar data. This technique was successful in detecting mines with performance comparable to the traditional methods [4].

The RBM learning method is effective; however, the Scale-Invariant Feature Transform (SIFT) [6] has been very influential recently in vision and image processing and has been applied successfully to numerous image processing and feature extraction tasks. At the same time, while the original work on RBM/DBN structures for feature learning and classification has shown the power of the DBN for feature extraction, no consideration was given to recent developments of the DBN model, including sparseness constraints and a convolutional variation [4]. Imposing sparsity on the RBM regularizes the learned model by decreasing the weights of nodes whose activity exceeds a prescribed sparsity level, thereby simplifying the model it learns and providing a more compact representation of the input. The convolutional approach allows the model to scale to high-resolution imagery and further regularizes the model by reducing the parameter space.

This paper compares feature extraction using SIFT versus a convolutional RBM (cRBM) for the mine classification problem. This also serves to examine how well SIFT generalizes to application domains analogous to visual-wavelength imagery. A central argument for using learned rather than carefully selected, contrived features is the ability to apply the model to diverse application domains. This is an interesting domain to explore in this context, as there is a natural 2D, grayscale representation for sonar data, and the mine classification task contains most of the same challenges as generic object recognition: invariance to translation, rotation, luminance, clutter, and noise.

Both techniques were applied to a series of sonar images to extract features, with the output fed to a Support Vector Machine (SVM) for training and classification. As the goal of this effort is to develop a classification system for sonar images of sea bottom objects with performance comparable to highly contrived methods, each technique is treated as a feature extraction method, with the features being passed to an SVM for training and classification. The results were compared to state-of-the-art template matching methods, with encouraging results for the correct classification of targets.

2 Synthetic Aperture Sonar Imagery

Traditional side scan sonar imagery (e.g. Figure 1a) depicts objects as a strong bright region (highlight) where the object is insonified by sound waves, followed by a dark region (shadow) cast behind it. The size, shape, and disposition of such features are important for both automated and manual methods of mine classification. This imagery may be littered with background noise coming from natural and artificial sources. In imaging sonars, range resolution is mostly determined by the bandwidth of the transmit pulse, while azimuthal resolution is determined by the length of the receiver array. While


the bandwidth of modern sonars is sufficient to achieve a high range resolution, azimuthal resolution is difficult to improve due to the engineering limitations of constructing long arrays.

Synthetic Aperture Sonar (SAS) is a recent side scan sonar technique being applied to detecting and classifying mine-like objects. This technology is inspired by synthetic aperture radar, which is commonly used on terrestrial and space-based radar sensors. SAS is a technique whereby a longer array length is synthesized by integrating a number of sonar pings in the direction of travel of the sonar, resulting in improved resolution that is also independent of range (e.g. Figure 1b). This provides a powerful tool for the mine detection/classification problem, as the higher-fidelity images offer a richer set of features for the detection and classification of a sea bottom object.

Fig. 1. Sonar images of mine-like objects, showing (a) a side scan sonar image and (b) an image from the MUSCLE Synthetic Aperture Sonar (SAS) [7] of the same type of object

The data used in this paper was collected by the NATO Undersea Research Center [7] with the MUSCLE Autonomous Underwater Vehicle (AUV). This vehicle is equipped with a 300 kHz SAS. The SAS gives a 2.5 cm × 2.5 cm resolution at up to 200 meters in range.

2.1 Data Preparation

The SAS dataset was collected in the summer of 2008 off Latvia in the Baltic Sea. The MUSCLE vehicle was used to survey multiple mine-like targets that were deployed as part of the trial, including multiple sonar passes over each target in the field from different angles. The targets were three mine-like shapes: a cylinder (2.0 m × 0.5 m), a truncated cone (1.0 m base, 0.5 m height), and a wedge shape (1.0 m × 0.6 m × 0.3 m). Clutter included numerous rocks and boulders, geographic features of the sea bottom, and a specific rock that was chosen due to its similarity in shape to the truncated cone. The dataset was composed of 65 cylinders, 69 truncated cones, 37 wedges, and 2218 non-mine clutter objects, including 47 rock images that are highly correlated to a target shape.


The raw SAS data can contain as much as ten times the data of a side scan sonar, and the maximum and minimum values of these samples describe a very large dynamic range for the sensor. The data is organized as complex values which describe the amplitude and phase information from the sonar. Although it is appealing to examine the phase component of the SAS data, it is beyond the scope of this work and is considered in the Outlook section as future work. The data was prepared by removing the phase component and then re-mapping the amplitude component to a decibel (dB) scale.
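A minimal sketch of this preparation step, assuming the raw samples are available as NumPy complex values, is given below; the dB reference level and the helper name are illustrative.

```python
import numpy as np

def sas_to_db(raw, eps=1e-12):
    """Drop the phase of complex SAS samples and map the amplitude to decibels."""
    amplitude = np.abs(raw)                  # discard phase, keep the magnitude
    return 20.0 * np.log10(amplitude + eps)  # dB relative to an arbitrary reference
```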

3 Feature Extraction

Feature extraction is a difficult and error-prone task that is typically performed manually. This process involves an analysis of the sonar data and careful selection of characteristics that help describe the class of the object. Modern methods have looked to reduce this complexity by automatically selecting features from the data through decomposition (e.g. PCA, wavelets) [1], in order to obtain a set of features that uniquely describes the object and can be used directly for training. Methods such as SIFT have been effective as automated feature extraction and have been applied to both visual and acoustic images to select a set of features for the object [8].

Learning methods are appealing for feature extraction, specifically generative models, as they learn the dominant features of the images they are trained on and build an internal representation of the elements the object should be composed of. This allows for an unsupervised approach where many images are shown to the learning method, and the learning method determines the features to be selected and modelled. Furthermore, generative models have the added advantage that it is possible to inspect the feature filters which have been learned, which gives the researcher a measure of the progress of the learning.

With either method, the goal is the same: we wish to select the most descriptive features for each class to provide the least ambiguous training set to the SVM, allowing it to find easily separable classes and perform effectively as a target classifier compared to existing manual methods.

3.1 Scale Invariant Feature Transform (SIFT)

SIFT [6] is a method for feature extraction that is invariant to scale, orientation, and distortions. Briefly, SIFT convolves a Difference of Gaussians (DoG) filter with the input at multiple scales to detect image gradients (edges). In the case of dense SIFT feature extraction, as employed here, a 128-dimensional feature vector is generated for each overlapping window of the input image by computing orientation histograms with 8 bins for each of 16 subregions (4x4) of the window. For a more detailed description, the interested reader is referred to [6].

We employ dense SIFT feature extraction and explore window sizes ranging from 12 to 24 pixels wide (i.e. spatial bins from 3 to 6 pixels), spaced from 4 to 12 pixels apart. Similar to our cRBM experiments described later, best results were obtained with the maximum possible window size (24x24) for this data (93x24), spanning the full width of the image. Using a window spacing of 6 pixels results in 13 windows spaced along the length of the image, with 128 features per window, for 1664 features per image (roughly equal in size to the cRBM representation).
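For reference, dense SIFT descriptors on a regular grid can be computed as sketched below with OpenCV; the 24-pixel windows and 6-pixel spacing mirror the settings reported here, but this is an assumed reimplementation rather than the pipeline actually used.

```python
import cv2
import numpy as np

def dense_sift(image, window=24, step=6):
    """Compute 128-d SIFT descriptors on a dense grid of keypoints.
    `image` is a 2-D uint8 array (e.g. a downsampled 93x24 sonar chip)."""
    sift = cv2.SIFT_create()
    h, w = image.shape
    half = window // 2
    keypoints = [cv2.KeyPoint(float(x), float(y), float(window))
                 for y in range(half, h - half + 1, step)
                 for x in range(half, w - half + 1, step)]
    _, descriptors = sift.compute(image, keypoints)
    # Concatenate all window descriptors into one feature vector per image
    return descriptors.reshape(-1) if descriptors is not None else np.zeros(0)
```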


3.2 Restricted Boltzmann Machines (RBMs)

The RBM [9] is an energy-based, generative model that can learn to represent the distribution of implicit features of the training data and generate examples thereof. An RBM consists of two layers of nodes, forming the visible and hidden layers. Each layer is fully connected to the other, but is restricted in that there are no connections between nodes within a layer. The energy of the joint configuration of visible and hidden units given the connections between them (ignoring biases for simplicity) is given by

E(v, h) = −∑_{i=1}^{V} ∑_{j=1}^{H} v_i h_j w_{ij}    (1)

where v and h are the states of the visible (input) and hidden units, respectively, and w is the matrix of connection strengths between each visible and each hidden unit.

Stacks of RBMs can be learned in a greedy, layer-wise fashion, with the output of the previous layer providing the input to the next, forming a Deep Belief Network (DBN) [9]. This enables higher-layer nodes to learn progressively more abstract regularities in the input.

RBM training is accomplished with the Contrastive Divergence (CD) algorithm [10], which lowers the energy (i.e., raises the probability) of the data observed on the visible units and raises the energy of reconstructions of the data produced by the model:

Δw_{ij} ∝ 〈v_i h_j〉_data − 〈v_i h_j〉_model    (2)

Using CD, the RBM learns a generative model of the input in a purely unsupervised fashion by measuring the discrepancy between the data and the model's reconstructions, then “correcting” the system by slightly altering the weights to minimize reconstruction errors.

We can also regularize the learned model toward a sparse representation by decreasing the weights of nodes whose activity exceeds a prescribed sparsity level s [11]:

Δw_j ∝ s − 〈h_j〉    (3)

where 〈h_j〉 is the expected probability of activation, computed as a decaying average of the activity of that unit over training examples. This has the added benefit of increasing the weights of nodes whose activity is below the target threshold, thus reintegrating nodes whose random initial conditions lead to them being suppressed by the network (“dead nodes”). While this regularization may lead to greater reconstruction error by forcing the network to represent the input with a smaller proportion of nodes, the resulting hidden representation is likely to be more interpretable by subsequent layers or classifiers.
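A minimal NumPy sketch of one CD-1 update with the sparsity term of Eq. (3) is given below; it assumes real-valued visible units, binary hidden units, omits biases as in Eq. (1), and uses illustrative hyperparameter names rather than those of the original implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01, sparsity=0.01, sparsity_lr=0.01, mean_h=None, decay=0.95):
    """One CD-1 step for an RBM with real-valued visible and binary hidden units.
    W has shape (visible, hidden); v0 is a (batch, visible) data matrix."""
    # Positive phase: hidden probabilities and a binary sample given the data
    h0_prob = sigmoid(v0 @ W)
    h0_samp = (np.random.rand(*h0_prob.shape) < h0_prob).astype(float)

    # Negative phase: reconstruct the visible units, then recompute hidden probabilities
    v1 = h0_samp @ W.T                       # linear (Gaussian-mean) reconstruction
    h1_prob = sigmoid(v1 @ W)

    # Contrastive divergence gradient, Eq. (2)
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / v0.shape[0]

    # Sparsity regularization, Eq. (3): push decaying average activations toward the target
    batch_mean = h0_prob.mean(axis=0)
    mean_h = batch_mean if mean_h is None else decay * mean_h + (1 - decay) * batch_mean
    W += sparsity_lr * (sparsity - mean_h)   # broadcast over each hidden unit's weights
    return W, mean_h
```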

3.3 Convolutional Restricted Boltzmann Machines (cRBMs)

In the cRBM model [12], each hidden node, rather than being fully connected to every input element as in a standard RBM, is connected to only a small, localized region of the image, defined by the researcher. Furthermore, these connections are shared


by a group of hidden nodes which are collectively connected to every input region. This architecture enables the computationally efficient convolution operation to be used to generate each group's activation.

If the region of the input image that each node of the cRBM is connected to is significantly smaller than the total input image, as we expect when the input is high-resolution imagery, then the cRBM requires orders of magnitude fewer parameters for a similar representation size, since weights are shared by all nodes in a group. This is especially useful when patterns recur in different regions of the input, since any knowledge learned about a pattern is automatically transferred to all input regions.

By pooling adjacent hidden activations within groups, either with the commonly used maximum pooling or with the probabilistic maximum pooling method [12], we can attain a degree of translational invariance while also keeping the size of the hidden representation within reasonable bounds. If maximum pooling is used, then we calculate the probability of activation of each node in a pooling window by applying the logistic function to the feedforward activation. In the probabilistic maximum pooling method, each pooling window is sampled multinomially, so that only one hidden node in a window can be on, and the pooling node is off only if all hidden nodes in its window are off, according to Eq. (4) and (5):

P(h^k_{i,j} = 1 | v) = exp(I(h^k_{i,j})) / (1 + ∑_{(i′,j′) ∈ B_α} exp(I(h^k_{i′,j′})))    (4)

P(p^k_α = 0 | v) = 1 / (1 + ∑_{(i′,j′) ∈ B_α} exp(I(h^k_{i′,j′})))    (5)

where h^k_{i,j} is a hidden node in pooling window B_α receiving feedforward input I(h^k_{i,j}) resulting from the convolution of the kth filter with the input, and p^k_α is the pooling node for that window.
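To make Eq. (4) and (5) concrete, the sketch below computes the hidden and pooling probabilities for one filter group with non-overlapping 2×2 windows (the block size used here); variable names are illustrative and no numerical stabilization is applied.

```python
import numpy as np

def prob_max_pool(I, block=2):
    """Probabilistic maximum pooling for one hidden group (Eq. 4 and 5).
    I holds the feedforward inputs I(h^k_{i,j}); its dimensions must be
    divisible by `block`. Returns hidden and pooling activation probabilities."""
    H, W = I.shape
    # Group hidden units into non-overlapping block x block pooling windows
    e = np.exp(I.reshape(H // block, block, W // block, block))
    denom = 1.0 + e.sum(axis=(1, 3), keepdims=True)       # 1 + sum over each window B_alpha
    h_prob = (e / denom).reshape(H, W)                     # Eq. (4): at most one unit per window on
    p_off = (1.0 / denom).reshape(H // block, W // block)  # Eq. (5): pooling unit "off" probability
    return h_prob, 1.0 - p_off                             # pooling node activation probability
```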

The representation of each group of hidden nodes is then convolved with its filter to obtain that group's reconstruction of the input. Summing over all groups' reconstructions yields the network's reconstruction of the input used for CD learning.

For the present experiments we restricted ourselves to a single-layer cRBM. The parameters with the largest impact on classification performance are the filter size and the number of filters. Through experimentation, the best representations were obtained using 50 filters with width one less than the image width (i.e. 23×23). After probabilistic maximum pooling with a 2×2 window size, this filter width results in a representation of 50 filters by 36 in height, with the width collapsed to 1 (1800-dimensional). That is, the width of the image is collapsed in the cRBM's representation by the convolution operation and pooling. This can be seen as a compromise between the conventional and convolutional approaches, with minimal transfer of knowledge horizontally. The dataset was amenable to this severe reduction in representation width because targets were centered in the image, with the pooling layer providing sufficient invariance to the small differences in position.

Based on research by Nair and Hinton using the NORB dataset [9], real-valued images can be used at the visible layer of the RBM if the training speed is decreased. Therefore, a low learning rate of 0.01 for weights and biases was found to be stable for the


real-valued images and was sufficiently large that learning peaked after 50 epochs through the training set of 228 images. The learning rate for sparsity regularization was initialized at the learning rate for the weights and then increased linearly over epochs to 10 times this rate. This enables the network to explore representations early in learning, when many nodes are active (and thus learning), and then gradually be driven to the desired sparsity level. The target sparsity giving the best results depended on the representation size and thus the number and size of filters employed. For the 50-filter, 23×23 network for which results are reported, a target sparsity of 0.01 in the hidden layer (0.04 in the final pooled representation) yielded the best results through experimentation.

4 Results

The original images were 466×119 pixels, though some images had missing rows toward the bottom of the image, which were detected and filled with the image mean intensity value. Each image was downsampled by a factor of five (to 93x24 pixels) to remove some noise, provide a more computationally tractable representation size, and decrease in-class variation. Image intensities were then normalized to have zero mean and unit standard deviation. Normalization was done per image because the dynamic range varied substantially from one image to the next. Normalizing per pixel across the training set, as is more common, rendered a significant proportion of images indistinguishable from the background due to their low dynamic range.
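The preprocessing just described can be summarized in a short sketch; the original downsampling scheme is not specified (simple block averaging is assumed here, which yields 93×23 rather than the reported 93×24), and missing rows are assumed to be marked as NaN.

```python
import numpy as np

def preprocess(image, factor=5):
    """Prepare one raw 466x119 sonar image: fill missing rows, downsample, normalize."""
    img = image.astype(float)
    missing = np.all(np.isnan(img), axis=1)          # rows flagged as missing (assumed NaN)
    img[missing] = np.nanmean(img)                   # fill with the image mean intensity

    # Downsample by block averaging over factor x factor cells
    h = (img.shape[0] // factor) * factor
    w = (img.shape[1] // factor) * factor
    img = img[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

    return (img - img.mean()) / img.std()            # per-image zero mean, unit std
```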

Classification performance was determined via ten-fold cross-validation. This method partitions the training data into 10 subsets, where one is retained for the validation of the classification model and nine subsets are used for training. This process is repeated ten times, with each of the subsets used once as the validation set. 2121 clutter, 10 mine-like rock and 10 of each type of mine were reserved for testing, and the model was trained on the remaining data (50 clutter, 37 mine-like rock, 55 cylinders, 59 truncated cones, 27 wedges). The small proportion of available clutter examples used for training was chosen so that the total clutter in the training set approximated the mean count of the mine-like object classes.

Classification was performed with an SVM using the libSVM [13] software library. Grid searches were performed for optimal parameters for both linear and radial basis function kernels. For both the SIFT and cRBM feature vectors, the linear kernel gave superior results and was robust over a wide range of the SVM kernel cost parameter.
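A sketch of this classification stage using scikit-learn's linear SVM is given below (the original experiments used libSVM directly); the grid of cost values and the use of standard 10-fold splits are illustrative simplifications of the partitioning described above.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_predict

def classify(features, labels):
    """Grid-search the linear-kernel cost parameter, then get 10-fold predictions."""
    grid = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10, 100]}, cv=10)
    grid.fit(features, labels)
    best_C = grid.best_params_['C']
    preds = cross_val_predict(LinearSVC(C=best_C), features, labels, cv=10)
    return preds, best_C
```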

4.1 Convolutional RBM

We examined the representation learned by RBMs through the reconstruction of the input and the learned filters. Figure 2 provides a sample of sonar images and their reconstructions by one of the cRBMs trained in the course of cross-validation. The reconstructions are significantly smoothed, but the smoothing generally respects object boundaries, as in nonlinear diffusion.

After learning filters from the training set, the activation probabilities of the convolutional RBM's hidden units were generated for both the training and test sets and passed to the SVM for training and classification. To show not only the correct classification


Fig. 2. Sample sonar images [7] (top) and reconstructions produced by the convolutional RBM model which resulted in the best classification performance (bottom). The reconstructions are significantly smoothed and the highlight of the object somewhat filled in.

performance but also the missed classifications, a confusion matrix (Table 1) was computed for comparison with SIFT and template matching [14].

Out of 300 target views (3 types of targets, 10 of each type of target, 10 cross-validations), there were 5 false negatives, and out of 21310 views of non-targets, 1035 false positives. This yields a sensitivity to mines of .983±.024, showing a high rate of correct target classification, and a specificity of .954±.012. While most categories enjoyed this high level of classification accuracy, it is interesting to note that a large proportion of wedges (23%) were mislabelled as truncated cones, due both to the similarity of their appearance in some of the sonar data and to the poor representation of wedges in the dataset (37 wedges vs. 69 truncated cones). This led to a poor sensitivity for wedges specifically but did not impact the sensitivity to mines in general.
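The sensitivity and specificity quoted here collapse the five classes into mine versus non-mine; a small helper for computing them from true and predicted labels is sketched below, with illustrative label names.

```python
import numpy as np

MINE_CLASSES = {'cylinder', 'trunc. cone', 'wedge'}   # clutter and mine-like rock are non-mine

def mine_sensitivity_specificity(y_true, y_pred):
    """Collapse the five-class labels into mine vs. non-mine and score them."""
    is_mine_true = np.array([c in MINE_CLASSES for c in y_true])
    is_mine_pred = np.array([c in MINE_CLASSES for c in y_pred])
    sensitivity = np.mean(is_mine_pred[is_mine_true])    # true positive rate over mine views
    specificity = np.mean(~is_mine_pred[~is_mine_true])  # true negative rate over non-mine views
    return sensitivity, specificity
```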

4.2 SIFT

As the SIFT method does not require training, the algorithm was applied to each training image, generating a 1664-dimensional feature vector (128 features for each of 13 24×24 windows spaced 6 pixels apart along the length of the image). This served as the


Table 1. Confusion matrix for SVM trained on cRBM features

CONFUSION clutter cylinder trunc. cone wedge mine-like rock

clutter          0.949±0.013  0.031±0.011  0.006±0.003  0.011±0.003  0.003±0.003
cylinder         0.030±0.048  0.970±0.048  0±0          0±0          0±0
trunc. cone      0.010±0.032  0±0          0.980±0.042  0.010±0.032  0±0
wedge            0±0          0.020±0.042  0.230±0.125  0.740±0.127  0.010±0.032
mine-like rock   0±0          0±0          0.080±0.140  0±0          0.920±0.140

Table 2. Confusion matrix for SVM trained on dense SIFT features

CONFUSION clutter cylinder trunc. cone wedge mine-like rock

clutter          0.932±0.010  0.011±0.004  0.019±0.006  0.031±0.009  0.008±0.002
cylinder         0.010±0.032  0.980±0.063  0±0          0.010±0.032  0±0
trunc. cone      0±0          0±0          0.950±0.071  0.030±0.068  0.020±0.042
wedge            0.020±0.042  0.030±0.048  0.460±0.158  0.450±0.135  0.040±0.052
mine-like rock   0±0          0.020±0.063  0.040±0.070  0.010±0.032  0.930±0.082

input for training and testing the SVM. The confusion matrix in Table 2 illustrates the correct and incorrect classifications using the SIFT features.

The SIFT features resulted in similar performance to those of the cRBM but with slightly more false positives. The sensitivity to mines was .970±.025, which shows strong classification performance, and the specificity was .944±.008. The biggest difference in performance with respect to the cRBM was that there was significantly more confusion between truncated cones and wedges. This is an interesting result, as it shows that the learned features outperform in particular on the cases that are difficult to classify.

5 Discussion

Overall, the results of this work are encouraging and merit further research into the application of learning methods to sonar imagery and to mine classification in particular. The SIFT and cRBM methods were comparable in performance, with the cRBM performing slightly better than the SIFT feature extraction method. As a basis of comparison, we include below in Table 3 the results from a normalized shadow and echo template-based cross-correlation method [14], which has proven highly effective at classifying targets. These templates are designed for a specific sensor, and specific templates are generated for different ranges; they therefore provide an excellent baseline against which learning methods can be compared.

As shown in earlier work [4], the RBM/DBN model can effectively extract features of mine-like targets and classify them using traditional side scan sonar data; however, this method showed poor performance on the higher-resolution data from the SAS sensor (results not shown). We believe that this is caused by the very large dynamic range of the sensor leading the DBN to learn features of the background (noise) distribution at the expense of modelling the object highlight. Although the increase in


Table 3. Confusion matrix for template-matching method [14]

NSEM non-mine cylinder trunc. cone wedge

non-mine     0.94  0.01  0.02  0.03
cylinder     0.03  0.97  0     0
trunc. cone  0.04  0     0.96  0
wedge        0.08  0     0     0.92

resolution of the sensor provides a richer set of detailed features of the object being learned, it also has the downside that the learning machines have a tendency to model and classify this noise rather than just the object. The cRBM model with enforced sparsity was beneficial in this regard, as the smaller parameter space and sparsity regularized the model and thereby limited the modelling of background features. To illustrate this, the reconstructions in Figure 2 show a form of smoothing in areas of the image where no target features were present.

The filter sizes providing the best results spanned the full width of the image (minus 1 in the case of the cRBM due to 2×2 pooling). This results in an architecture more similar to a conventional network in the horizontal direction but convolutional in the vertical direction. While there is significant error in the reconstructions (Figure 2), the hidden representation from which they are produced has a relatively small number of filters (50) given the large filter size, as well as sparse activation, which proves more interpretable for the classifier. Using smaller filters has the benefit of being able to model finer features and transfer this learning horizontally; however, it results in a significantly larger representation which the classifier had greater difficulty interpreting. In general, cRBM parameterizations that allow more accurate models of the data in the sense of reconstruction error, whether by having smaller filters, more filters, or less sparsity, decreased classification performance, since they naturally resulted in more complex representations which are more difficult to model with the SVM.

In comparison to the template matching method, the two methods examined in this paper showed comparable performance, with the exception of the wedge shapes, where both methods suffered in comparison to the template-based method. Note that the template method utilized a significantly larger set of templates for wedges to compensate for the complexity of its shape and its ambiguity with the truncated cone. Since the highlight of many wedges and truncated cones was little more than a strip of light in many instances, these two classes were confused, with the SVM opting to classify most as truncated cones due to their greater prevalence in the dataset. However, the cRBM distinguished them significantly better than SIFT. Examining the raw data, it was observed that a subset of wedges had some of the brightest highlights in the dataset. This feature, which may have been an artifact of this particular dataset, was likely captured by the RBM representation but removed by SIFT in its attempt to create an illumination-invariant representation. This would explain the cRBM's better performance in distinguishing the wedges and truncated cones, whose shape representations were very similar in the sonar image. This effect highlights the benefit of using learned filters rather than features engineered for neighbouring application domains, since features which may be uninformative in one domain (in this case, the illumination of a particular feature) may be informative in another.


6 Outlook

The results from both the cRBM and SIFT models are encouraging, but they also highlight the need for further research. Distinguishing wedges from truncated cones, in particular, proved challenging for our models and demands further attention. In general, the detection and classification of objects from sonar imagery could potentially benefit from additional pre-processing or extensions to the two models, as described below.

As noted in the data preparation section, the raw SAS data is organized as a set of complex numbers that describe both the amplitude and phase of the reflected sound intensities. For the purposes of this paper, the phase element was removed and only the raw amplitudes were considered. Although this phase element is stripped, it is likely that there are some coherent features in the phase information which could help distinguish non-image-related properties such as material. If such features could be extracted, they could be supplied as additional features for classification, or used to limit false alarms during the detection and classification phases.

In the Spatial Pyramid Matching (SPM) method [15], dense SIFT features are extracted as in the method we employed. The SIFT feature vectors are subsequently vector quantized, and histograms are then computed at multiple levels of resolution (whole image, quarter image, ...). A histogram intersection (χ2) kernel is then employed to classify the histogram representation. This has been very successful in object recognition tasks such as Caltech 101, used in [15]. Preliminary experiments with this method offered poor performance on this dataset, but more work needs to be done in exploring the many parameters of this model to determine whether it can be successfully applied to the classification of bottom objects in SAS imagery.

Our experiments with stacked layers of cRBMs yielded poor performance on the classification task. However, experimentation was hindered by the large computational burden imposed by convolving many filters with many layers of hidden representation. As we and other groups develop software to transfer these expensive operations to graphics processing units (GPUs), this architecture will become much easier to explore, and we expect that higher layers will achieve greater invariance to noise and small intra-class variations, as well as uncover more complex regularities in the training data.

Acknowledgements

The authors would like to acknowledge the NATO Undersea Research Center (NURC) for the use of the SAS data for this paper.

References

1. Chapple, P.: Automated detection and classification in high-resolution sonar imagery for autonomous underwater vehicle operations. Technical report, Defence Science and Technology Organization (2008)

2. Fawcett, J., Crawford, A., Hopkin, D., Myers, V., Zerr, B.: Computer-aided detection of targets from the CITADEL trial Klein sonar data. Defence Research and Development Canada Atlantic TM 2006-115 (November 2006), pubs.drdc.gc.ca


3. Fawcett, J., Crawford, A., Hopkin, D., Couillard, M., Myers, V., Zerr, B.: Computer-aided classification of the Citadel Trial sidescan sonar images. Defence Research and Development Canada Atlantic TM 2007-162 (2007), pubs.drdc.gc.ca

4. Connors, W., Connor, P., Trappenberg, T.: Detection of mine like objects using restricted Boltzmann machines. In: Proceedings of the 23rd Canadian Conference on AI (2010)

5. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)

6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 91–110 (2004)

7. Bellettini, A., Pinto, M.: Design and experimental results of a 300 kHz synthetic aperture sonar optimized for shallow-water operations. IEEE Journal of Oceanic Engineering 34, 285–293 (2008)

8. Myers, V., Fortin, A., Simard, P.: An automated method for change detection in areas of high clutter density using sonar imagery. In: Proceedings of the UAM 2009 Conference, Nafplio, Greece (2009)

9. Nair, V., Hinton, G.E.: Implicit mixtures of restricted Boltzmann machines. In: NIPS, pp. 1145–1152 (2008)

10. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)

11. Hinton, G.E.: A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, University of Toronto (2010)

12. Lee, H., Grosse, R., Ranganath, R., Ng, A.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009)

13. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines (2001), Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

14. Myers, V., Fawcett, J.: A template matching procedure for automatic target recognition in synthetic aperture sonar imagery. IEEE Signal Processing Letters 17(7), 683–686 (2010)

15. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2169–2178 (2006)


Learning Probability Distributions over Permutations by Means of Fourier Coefficients

Ekhine Irurozki, Borja Calvo, and Jose A. Lozano

Intelligent Systems Group, University of the Basque Country, Spain
{ekhine.irurozqui,borja.calvo,ja.lozano}@ehu.es

Abstract. An increasing number of data mining domains consider data that can be represented as permutations. Therefore, it is important to devise new methods to learn predictive models over datasets of permutations. However, maintaining probability distributions over the space of permutations is a hard task since there are n! permutations of n elements. The Fourier transform has been successfully generalized to functions over permutations. One of its main advantages in the context of probability distributions is that it compactly summarizes approximations to functions by discarding high order marginal information. In this paper, we present a method to learn a probability distribution that approximates the generating distribution of a given sample of permutations. In particular, this method learns the Fourier domain information representing this probability distribution.

Keywords: Probabilistic modeling, learning, permutation, ranking.

1 Introduction

Permutations and orders appear in a wide variety of real world combinatorial problems such as multi object tracking, structure learning of Bayesian networks, ranking, etc. Exact probability representation over the space of permutations of n elements is intractable except for very small n, since this space has size n!. However, different simplified models for representing or approximating probability distributions over a set of permutations can be found in the literature [2], [3], [5].

One way to represent probability distributions over permutations is the Fourier-based approach. This is based on a generalization for permutations of the well-known Fourier transform on the real line. Permutations form an algebraic group under the composition operation, also known as the symmetric group, so we will use both expressions, permutations and symmetric group, interchangeably throughout this paper. Although the use of the Fourier transform for representing functions over permutations is not new, this topic has once again come to the attention of researchers, partly due to a framework recently provided by [5] and [7] which allows one to carry out inference tasks entirely in the Fourier domain. Moreover, new concepts such as probability independence [4] have been introduced.


In this paper, we focus on the problem of learning the generating distribution of a given sample of permutations; in particular, we present a method for learning a limited number of Fourier coefficients that best approximates it (i.e., that maximizes the likelihood of the sample). The first attempt to learn a probability distribution by means of the Fourier coefficients was presented in [6]. The authors concentrated on getting a consensus ranking and a probability distribution under constrained sensing, when the available information is limited to the first order marginals. However, to the best of our knowledge, this work is the first attempt to do it in a general way.

The rest of the paper is organized as follows. The next section introduces the basis of the Fourier transform over permutations. In Section 3 we detail how we formulate the maximum likelihood method. Section 4 presents the experimental results of several tests. In Section 5, we conclude the paper.

2 The Fourier Transform on the Symmetric Group

Since it is beyond the scope of this paper to give a proper tutorial on either the Fourier transform (FT) on the symmetric group or on representation theory, we only give some intuition and refer the interested reader to [1] and [8] for further discussion.

Formally, a permutation is defined as a bijection of the set {1, ..., n} into itself, which can be written as σ = [σ(1), ..., σ(n)]. The FT on the symmetric group, which is a generalization of the FT over the real line, decomposes a function over permutations into n! real numbers which are known as Fourier coefficients. These coefficients are grouped in matrices which are in turn ordered, following the analogy with the FT over the real line, by frequency: {fρ1, ..., fρl}. The original function can be recovered from the Fourier coefficients by using the inversion theorem, which is stated as follows:

f(σ) = (1/|Sn|) Σλ dρλ Tr[(fρλ)ᵀ · ρλ(σ)]    (1)

where ρλ(σ) denotes the real-valued irreducible representation matrices and dρλ their dimension [8].

In the context of probability distributions, the FT has a very interesting property: each matrix of Fourier coefficients stores the information corresponding to a particular marginal probability.¹ Moreover, the (k − 1)-th marginal probabilities can be obtained by multiplying a matrix M, consisting of the direct sum of the k lowest frequency matrices of coefficients, by some precomputed matrices Ck. The size of M for large values of k (i.e., for high order statistics) makes the storage of this matrix prohibitive. However, maintaining such a matrix and computing the multiplication is computationally cheap for small values of k.

¹ While the first order marginal expresses the probability of item i being at position j, higher order marginals capture information such as the probability of items (i1, i2, ..., ik) being at positions (j1, j2, ..., jk).


Therefore, it is possible to approximate functions by discarding the coefficient matrices at high frequencies. Such approximations smooth the original probability distribution, bringing it closer to the uniform distribution.

3 Learning Probability Distributions over the Fourier Domain

In this section we describe our proposed formulation for learning the Fourier coefficients from a given sample of permutations. Our proposal consists of finding the Fourier coefficients that maximize the likelihood given a sample of permutations. Actually, we are interested in obtaining an approximation which considers only the (k − 1)-th lowest marginal probabilities. In order to learn such an approximation, the Fourier coefficients in the formulation are restricted to those in the k lowest frequency matrices of the FT, {fρ1, ..., fρk}.

Maximizing the likelihood of a sample {σ1, ..., σt} given the model in Equation (1) means solving the following nonlinear optimization problem:

(fρλ1^mle, ..., fρλk^mle) = argmax_{fρ1,...,fρk} L(σ1, ..., σt | fρ1, ..., fρk)
                          = argmax_{fρ1,...,fρk} ∏_{i=1}^{t} [ (1/|Sn|) Σ_{λ=1}^{k} dρλ Tr[(fρλ)ᵀ · ρλ(σi)] ]

Unfortunately, not every set of Fourier coefficients leads to a valid probability distribution. Compactly describing the coefficients of a valid distribution is still an open problem [5]. We restrict the search space by adding some constraints that forbid searching in regions of the space where no coefficient representing a valid distribution can be found. We have considered two kinds of constraints.

The first kind of constraint ensures a positive probability for each permutation in the sample. The second kind of constraint ensures that the Fourier coefficients take values between the maximum and the minimum values of the irreducible representations that multiply them. That is:

minσ([ρλ(σ)]ij) ≤ [fρλ ]ij ≤ maxσ([ρλ(σ)]ij)

The Fourier coefficients obtained by maximizing the likelihood subject to these constraints correspond to a distribution whose sum is guaranteed to be 1. However, this does not ensure a valid probability distribution, as it is possible to have negative 'probabilities'. If so, we perform a normalization process. Let m be the minimum probability value associated with a permutation. This process consists of adding the absolute value of m to every value of the probability distribution and normalizing it. Note that if we added a first-kind constraint for each σ ∈ Sn, the estimated distribution would be valid.
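A minimal sketch of this post-hoc normalization step, assuming the estimated distribution is held as a NumPy array indexed by permutation (the names are illustrative, not from the paper):

import numpy as np

def normalize_estimate(p):
    """Shift a possibly-negative 'probability' vector and renormalize it."""
    m = p.min()
    if m < 0:
        p = p + abs(m)   # add |m| so every entry becomes non-negative
    return p / p.sum()   # renormalize to sum to 1

# Example: an estimate that sums to 1 but contains a negative entry.
estimate = np.array([0.30, 0.45, -0.05, 0.30])
print(normalize_estimate(estimate))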


4 Experiments

In this section we show the performance of the proposed formulation. Our aim is to demonstrate that the accuracy of the estimated distributions increases as the sample size grows and higher order marginals are learned.

4.1 Experimental Setup

In order to evaluate our approach on the statements described above, we have designed the following experimental framework. First of all, a probability distribution is randomly generated and used to draw several permutation samples. From these samples, the proposed algorithm learns the Fourier coefficients, and the distributions corresponding to these coefficients are calculated. Finally, the Kullback-Leibler divergences between the reference and the resulting estimated distributions are calculated.

We also propose a comparison test based on Monte Carlo techniques. The test consists of sampling a large number of random distributions and measuring the Kullback-Leibler divergence between the reference and each of the random distributions.

The reference and the random distributions are generated by sampling a Dirichlet distribution. In this way, the generation of each distribution requires n! hyper-parameters α1, ..., αn!. We have set these hyper-parameters, for every distribution, as α1 = α2 = ... = αn! = α, where α is uniformly drawn from the interval [0.05, 0.25].
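A sketch of how such a reference distribution and a sample could be generated (our own illustrative code, not the authors'):

import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n = 6
perms = list(permutations(range(n)))          # the n! permutations of S_n

# One symmetric Dirichlet hyper-parameter per distribution, drawn from [0.05, 0.25].
alpha = rng.uniform(0.05, 0.25)
reference = rng.dirichlet(np.full(len(perms), alpha))

# Draw a sample of permutations (e.g., 5% of n!) from the reference distribution.
sample_size = int(0.05 * len(perms))
sample_idx = rng.choice(len(perms), size=sample_size, p=reference)
sample = [perms[i] for i in sample_idx]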

It seems reasonable to think that, by learning higher order marginals, it will be possible to more accurately approximate the reference distribution. In order to test this intuition, the Fourier coefficients corresponding to three different marginals have been learned for each test, that is, the coefficients at the lowest 2, 3 and 4 frequency matrices.

The tests have been carried out over the sets of permutations of 6 and 7 elements, S6 and S7 respectively. For each Sn, three sample sizes have been defined: 5%, 10% and 25% of n! for S6, and 1%, 5% and 10% of n! for S7. Also, for each n and sample size, ten different samples are randomly generated and the average results are computed. The number of random distributions for the comparison test is 100,000, and their divergences with respect to the reference distributions are used to draw a histogram.

The resulting constrained nonlinear optimization problems have been solved using the fmincon function of MATLAB.

4.2 Results

Figures 1a, 1b and 1c show the results of estimating from samples of 5%, 10% and 25% of n!, respectively. In particular, these figures show the Kullback-Leibler divergence between the reference and the estimated distributions, and between the reference and the random distributions. The first point to consider in each figure is that the divergences between the reference and the random distributions span a wide interval, with the higher concentration in the first half of the range.


[Figure 1 consists of six histogram panels (x-axis: KL divergence; y-axis: number of distributions), each annotated with the average divergences of the estimated distributions marked by vertical lines: (a) sample size of 5%, S6 (2.51719, 2.5738, 2.65318); (b) sample size of 10%, S6 (2.39995, 2.47655, 2.63829); (c) sample size of 25%, S6 (2.3352, 2.44395, 2.54405); (d) sample size of 1%, S7 (2.17366, 2.19262, 2.23045); (e) sample size of 5%, S7 (2.10149, 2.13219, 2.17647); (f) sample size of 10%, S7 (2.08967, 2.1317, 2.18512).]

Fig. 1. Kullback-Leibler divergence between the reference and estimated distributions, and the reference and the random distributions for S6 and S7

However, none of the random distributions is closer to the reference distribution than any of those obtained by learning the Fourier coefficients, which are plotted with vertical lines. Note that each line represents the average divergence of ten distributions obtained from ten different samples of the same size. The estimated distributions are significantly better than any random distribution. The three lines correspond to the distributions obtained by estimating the Fourier coefficients at the lowest 2, 3 and 4 frequency matrices. Since the differences cannot be clearly appreciated in the plots, a zoom over them is shown at the top of each figure. In every zoomed figure, the line on the right corresponds to the estimation of the lowest order marginals (k = 2) and the line on the left to the estimation of the highest order marginals considered (k = 4). This means that, as the number of learned Fourier coefficients grows, the resulting distribution gets closer to the reference distribution.

Figures 1d, 1e and 1f show a similar performance on the group S7. Moreover, one can see that, as the number of elements in the set of permutations grows, the divergences between the reference and the random distributions quickly increase, while the divergences of the learned distributions are quite stable.

5 Conclusions and Future Work

In this paper we propose a novel method for learning probability distributions from a set of permutations. The model for representing such distributions is a Fourier-based approach.


We have described a formulation that, by maximizing the likelihood function, learns the Fourier coefficients that best represent the probability distribution of a given sample, considering only the first k marginals.

Although our approach can only be used with low values of n, it can be useful when combined with other learning approaches based on independence [4]. With this in mind, in order to learn the generating probability distribution of a given sample it is possible to first find the items in each of the independent (or nearly independent) factors, and then learn the distribution of each of the subsets of items by using our proposed formulation. In this way, we deal with smaller sets of items, making it possible to work with distributions that are otherwise intractable.

Acknowledgments

This work has been partially supported by the Saiotek and Research Groups 2007-2012 (IT-242-07) programs (Basque Government), the TIN2008-06815-C02-01 and TIN2010-14931 MICINN projects, and the COMBIOMED network in computational biomedicine (Carlos III Health Institute). Ekhine Irurozki holds the grant BES-2009-029143 from the MICINN.

References

1. Diaconis, P.: Group representations in probability and statistics. Institute of Mathematical Statistics (1988)

2. Fligner, M.A., Verducci, J.S.: Distance based ranking models. Journal of the Royal Statistical Society 48(3), 359–369 (1986)

3. Helmbold, D.P., Warmuth, M.K.: Learning permutations with exponential weights. Journal of Machine Learning Research (JMLR) 10, 1705–1736 (2009)

4. Huang, J., Guestrin, C.: Learning hierarchical riffle independent groupings from rankings. In: International Conference on Machine Learning (ICML 2010), Haifa, Israel (June 2010)

5. Huang, J., Guestrin, C., Guibas, L.: Fourier theoretic probabilistic inference over permutations. Journal of Machine Learning Research (JMLR) 10, 997–1070 (2009)

6. Jagabathula, S., Shah, D.: Inferring rankings under constrained sensing. In: Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, pp. 753–760 (2008)

7. Kondor, R., Howard, A., Jebara, T.: Multi-object tracking with representations of the symmetric group. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico (March 2007)

8. Serre, J.P.: Linear Representations of Finite Groups. Springer, Heidelberg (1977)


Correcting Different Types of Errors in Texts

Aminul Islam and Diana Inkpen

University of Ottawa, Ottawa, Canada
{mdislam,diana}@site.uottawa.ca

Abstract. This paper proposes an unsupervised approach that automatically detects and corrects a text containing multiple errors of both syntactic and semantic nature. The number of errors that can be corrected is equal to the number of correct words in the text. Error types include, but are not limited to: spelling errors, real-word spelling errors, typographical errors, unwanted words, missing words, prepositional errors, punctuation errors, and many of the grammatical errors (e.g., errors in agreement and verb formation).

Keywords: Text Error Correction, Detection, Unsupervised, Google Web 1T 5-grams.

1 Introduction

Most approaches to text correction are for only one or at best a few types of errors. To the best of our knowledge, there is no fully-unsupervised approach that corrects a text having multiple errors of both syntactic and semantic nature. Syntactic errors refer to all kinds of grammatical errors. For example, in the sentence “Our method correct real-word spelling errors.”, there is an error of syntactic nature in subject-verb agreement, whereas in the sentence “She had a cup of powerful tea.”, the word ‘strong’ is more appropriate than the word ‘powerful’ in order to convey the proper intended meaning of the sentence, based on the context. The latter is an example of a semantic error.

In this paper, a more general unsupervised statistical method for automatic text error detection and correction, done at the same time, using the Google Web 1T 5-gram data set [1] is presented. The proposed approach uses the three basic text correction operations: insert, delete, and replace. We use the following three strict assumptions for the input text that needs to be corrected: (1) the first token is a word¹; (2) there should be at least three words in an input text; (3) there might be at most one error in between two words. We also assume that there might be at most one error after the last word.

We also use the following weak assumption: (4) we try to preserve the intended semantic meaning of the input text as much as possible.

¹ Whenever we use only the term ‘word’ without an adjective (e.g., correct or incorrect), we imply a correct word.


2 Related Work

Some approaches consider spelling correction as text correction. An initial approach to automatic acquisition for context-based spelling correction was a statistical language-modeling approach using word and part-of-speech (POS) n-grams [2–5]. Some approaches in this paradigm use Bayesian classifiers and decision lists [6–8]. Other approaches simply focus on detecting sentences that contain errors, or computing a score that reflects the quality of the text [9–14].

In other text correction approaches, the prediction is typically framed as a classification task for a specific linguistic class, e.g., prepositions, near-synonym choices, or a set of predefined classes [15, 16]. In some approaches, a full syntactic analysis of the sentence is done to detect errors and propose corrections. We categorize this paradigm into two groups: those that constrain the rules of the grammar [17, 18], and those that use error-production rules [19–22].

[23] presents the use of phrasal Statistical Machine Translation (SMT) techniques to identify and correct writing errors made by ESL (English as a Second Language) learners.

The work most closely related to ours is that of Lee [24], a supervised method built on the basic approach of template-matching on parse trees. To improve recall, the author uses the observed tree patterns for a set of verb form usages, and to improve precision, he utilizes n-grams as filters. [25] trains a maximum entropy model using lexical and POS features to recognize a variety of errors. Their evaluation data partially overlaps with that of [24] and our paper.

3 Proposed Method

Our proposed method determines some probable candidates and then sorts those candidates. We consider three similarity functions and one frequency value function in our method. One of the similarity functions, namely the string similarity function, is used to determine the candidate texts. The frequency value function and all the other similarity functions are used to sort the candidate texts.

3.1 Similarity and Frequency Value Functions

Similarity between Two Strings. We use the same string similarity measure used in [26], with the following different normalization from [27]:

v1 = 2 × len(LCS(s1, s2)) / (len(s1) + len(s2))        v2 = 2 × len(MCLCS1(s1, s2)) / (len(s1) + len(s2))

v3 = 2 × len(MCLCSn(s1, s2)) / (len(s1) + len(s2))     v4 = 2 × len(MCLCSz(s1, s2)) / (len(s1) + len(s2))

The similarity of the two strings, S1 ∈ [0, 1], is:

S1(s1, s2) = α1 v1 + α2 v2 + α3 v3 + α4 v4    (1)

Here, len calculates the length of a string; LCS, MCLCS1, MCLCSn, and MCLCSz calculate the Longest Common Subsequence, and the Maximal Consecutive LCS starting at character 1, starting at character n, and ending at the last character between the two strings, respectively. α1, α2, α3, α4 are weights and α1 + α2 + α3 + α4 = 1. We heuristically set equal weights for most of our experiments.²

² We use equal weights in several places in this paper in order to keep the system unsupervised. If development data were available, we could adjust the weights.
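A minimal Python sketch of this measure as we read [26, 27] (the helper functions and the equal weights are our own illustrative choices, not the authors' implementation):

def lcs_len(s1, s2):
    """Length of the Longest Common Subsequence (dynamic programming)."""
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, a in enumerate(s1, 1):
        for j, b in enumerate(s2, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
    return dp[len(s1)][len(s2)]

def mclcs1_len(s1, s2):
    """Maximal consecutive LCS starting at character 1 (longest common prefix)."""
    k = 0
    while k < min(len(s1), len(s2)) and s1[k] == s2[k]:
        k += 1
    return k

def mclcsn_len(s1, s2):
    """Maximal consecutive LCS starting at any character n (longest common substring)."""
    best = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i+k] == s2[j+k]:
                k += 1
            best = max(best, k)
    return best

def mclcsz_len(s1, s2):
    """Maximal consecutive LCS ending at the last character (longest common suffix)."""
    k = 0
    while k < min(len(s1), len(s2)) and s1[-1-k] == s2[-1-k]:
        k += 1
    return k

def string_similarity(s1, s2, weights=(0.25, 0.25, 0.25, 0.25)):
    """S1(s1, s2) as a weighted sum of the four normalized LCS-based lengths."""
    denom = len(s1) + len(s2)
    parts = (lcs_len(s1, s2), mclcs1_len(s1, s2), mclcsn_len(s1, s2), mclcsz_len(s1, s2))
    return sum(w * 2 * p / denom for w, p in zip(weights, parts))

print(string_similarity("spelling", "speling"))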

Common Word Similarity between Texts. If two texts have some words in common, we can measure their similarity based on the common words. We count the number of words in common between the text to correct and a candidate corrected text, normalizing the count by the size of both texts. Let us consider a pair of texts, T1 and T2, that have m and n tokens, with δ tokens in common. Thus, the common word similarity, S2 ∈ [0, 1], is:

S2(T1, T2) = 2δ/(m + n) (2)

Non-Common Word Similarity. If the two texts have some non-common words, we can measure how similar the two texts are based on their non-common words. If there are δ tokens in T1 that exactly match with T2, then there are m−δ and n−δ non-common words in texts T1 and T2, respectively, assuming that T1 and T2 have m and n tokens, respectively, and n ≥ m. We remove all the δ common tokens from both T1 and T2. We construct a (m−δ)×(n−δ) string similarity matrix using Equation 1 and find the maximum-valued matrix element. We add this matrix element to a list (say, ρ). We remove all the matrix elements which are in the row and column of the maximum-valued matrix element from the original matrix. We remove the row and column in order to remove the pair with maximum similarity. This makes the computation manageable: in the next steps, fewer words are left for matching. We repeat these steps until either the current maximum-valued matrix element is 0, or m−δ−|ρ| = 0, or both. We sum up all the elements in ρ and divide by n−δ to get the non-common word similarity, S3 ∈ [0, 1):

S3(T1, T2) = Σ_{i=1}^{|ρ|} ρi / (n − δ)    (3)
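The following sketch computes S2 and S3 for two token lists, reusing the string_similarity sketch above; it simplifies the common-token matching (set-style membership) and replaces the explicit similarity matrix with an equivalent greedy pairing, so it is illustrative rather than the authors' exact procedure:

def common_and_noncommon_similarity(t1, t2):
    """Return (S2, S3) for two token lists, following Equations 2 and 3."""
    if len(t1) > len(t2):            # ensure n >= m, as in the paper
        t1, t2 = t2, t1
    # Common-word similarity S2.
    common = [w for w in t1 if w in t2]
    delta = len(common)
    s2 = 2 * delta / (len(t1) + len(t2))
    # Non-common word similarity S3: repeatedly pick the most similar remaining pair.
    rest1 = [w for w in t1 if w not in t2]
    rest2 = [w for w in t2 if w not in t1]
    rho = []
    while rest1 and rest2:
        best = max((string_similarity(a, b), a, b) for a in rest1 for b in rest2)
        if best[0] == 0:
            break
        rho.append(best[0])
        rest1.remove(best[1])
        rest2.remove(best[2])
    s3 = sum(rho) / (len(t2) - delta) if len(t2) - delta > 0 else 0.0
    return s2, s3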

Normalized Frequency Value. We determine the normalized frequency value of a candidate text (how we determine candidate texts is discussed in detail in Section 3.2) with respect to all other candidate texts. A candidate text having a higher normalized frequency value is more likely a strong candidate for the correction, though not always. Let us consider that we have n candidate texts for the input text T: {T1, T2, ..., Ti, ..., Tn}

T1 = {w11, w12, ..., w1j, ..., w1,m1}
T2 = {w21, w22, ..., w2j, ..., w2,m2}
...
Ti = {wi1, wi2, ..., wij, ..., wi,mi}
...
Tn = {wn1, wn2, ..., wnj, ..., wn,mn}

Here, wij is the jth token of the candidate text Ti, and mi means that the candidate text Ti has mi tokens. It is important to note that the number of tokens each candidate text has may be different from the rest. The number of 5-grams in any candidate text Ti is mi − 4. Again, let us consider that Fi is the set of frequencies of all the 5-grams that Ti has; fij is the frequency of the jth 5-gram of the candidate text Ti. That is:

F1 = {f11, f12, ..., f1j, ..., f1,m1−4}
F2 = {f21, f22, ..., f2j, ..., f2,m2−4}
...
Fi = {fi1, fi2, ..., fij, ..., fi,mi−4}
...
Fn = {fn1, fn2, ..., fnj, ..., fn,mn−4}

Here, {f11, f21, ..., fi1, ..., fn1}, {f12, f22, ..., fi2, ..., fn2}, {f1j, f2j, ..., fij, ..., fnj} and {f1,mi−4, f2,mi−4, ..., fi,mi−4, ..., fn,mi−4} are the sets of 5-gram frequencies for all n candidate texts that are processed in the first step³, the second step, the jth step, and the (mi − 4)th step, respectively. We calculate the normalized frequency value of a candidate text as the summation of all the 5-gram frequencies of the candidate text over the summation of the maximum frequencies in each step that the candidate text may have. Thus the normalized frequency value of Ti, represented as S4 ∈ [0, 1], is:

S4(Ti) = Σ_{j=1}^{mi−4} fij / Σ_{l=1}^{mi−4} max_{k∈N} fkl    (4)
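A small sketch of this computation, assuming each candidate's 5-gram frequency list is aligned by step (the dictionary layout is our simplification; the paper allows candidates of different lengths):

def normalized_frequency_value(freqs_by_candidate):
    """S4 for each candidate: its summed 5-gram frequencies over the per-step maxima."""
    n_steps = max(len(f) for f in freqs_by_candidate.values())
    step_max = [max(f[j] for f in freqs_by_candidate.values() if j < len(f))
                for j in range(n_steps)]
    return {cand: sum(f) / sum(step_max[:len(f)])
            for cand, f in freqs_by_candidate.items()}

# Toy example with three candidate texts and two 5-gram steps each.
print(normalized_frequency_value({"T1": [120, 30], "T2": [80, 45], "T3": [200, 10]}))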

3.2 Determining Candidate Texts

Let us consider an input text that, after tokenization, has m tokens, i.e., T = {w1, w2, ..., wm}. Our approach consists in going from left to right according to a set of rules that are listed in Table 1 and Table 2. We use three basic operations, Insert, Replace and Delete, to list these 5-gram rules. We also use No Operation to mean that we do not use any operation; rather, we directly use the next token from T to list a 5-gram rule.

5-gram Rules Used in Step 1. Table 1 lists all possible 5-gram rules generated from the said operations and assumptions. We use each of these 5-gram rules to generate a set of 5-grams and their frequencies by trying to match the 5-gram rule with the Web 1T 5-grams. We decide how many candidate 5-grams generated from each 5-gram rule we keep for further processing (say, n). The 5-gram Rule #1 in Table 1 says that we take the first five tokens from T to generate a 5-gram and try to match it with the Web 1T 5-grams to generate the only candidate 5-gram and its frequency, if there is any match. In 5-gram Rule #2, we take the first four tokens from T and try to insert each word from a list of words (our goal here is to determine this list of words; it might be empty) in between w1 and w2 to generate a list of 5-grams, and try to match them with the Web 1T 5-grams to generate a set of 5-grams and their frequencies. We sort these 5-grams in descending order by their frequencies and only keep at most the top n 5-grams and their frequencies. All I's and R's in Table 1 and Table 2 function similarly to variables and all wi ∈ T function similarly to constants. The 5-gram Rule #9 can generate a list of 5-grams and their frequencies, based on all the possible values of R2, a set of all replaceable words of w2. We determine the string similarity between w2 and each member of R2 using (1), sort the list in descending order by string similarity values, and only keep at most n 5-grams.

³ By the first step, we mean the step in which we process the first possible 5-grams in the input text. Similarly, by the second step, we mean the step in which we process the next possible 5-grams (by removing the first token from the 5-grams used in the first step and adding an extra word from the input text or otherwise, as discussed in detail in Section 3.2), and so on.

Table 1. List of all possible 5-gram rules in step 1

Rule#  5-gram Rule       Generated from
1      w1 w2 w3 w4 w5    No Operation
2      w1 I1 w2 w3 w4    Single Insert
3      w1 w2 I1 w3 w4    Single Insert
4      w1 w2 w3 I1 w4    Single Insert
5      w1 w2 w3 w4 I1    Single Insert
6      w1 I1 w2 I2 w3    Double Insert
7      w1 w2 I1 w3 I2    Double Insert
8      w1 I1 w2 w3 I2    Double Insert
9      w1 R2 w3 w4 w5    Single Replace
10     w1 w2 R3 w4 w5    Single Replace
11     w1 w2 w3 R4 w5    Single Replace
12     w1 w2 w3 w4 R5    Single Replace
13     w1 R2 w3 R4 w5    Double Replace
14     w1 w2 R3 w4 R5    Double Replace
15     w1 R2 w3 w4 R5    Double Replace
16     w1 w3 w4 w5 w6    Single Delete
17     w1 w2 w4 w5 w6    Single Delete
18     w1 w2 w3 w5 w6    Single Delete
19     w1 w2 w3 w4 w6    Single Delete
20     w1 w3 w5 w6 w7    Double Delete
21     w1 w2 w4 w6 w7    Double Delete
22     w1 w3 w4 w6 w7    Double Delete
23     w1 w2 w4 w5 w7    Double Delete
24     w1 w2 w3 w5 w7    Double Delete
25     w1 w3 w4 w5 w7    Double Delete
26     w1 w3 I1 w4 w5    Single Delete + Single Insert
27     w1 w3 w4 I1 w5    Single Delete + Single Insert
28     w1 w2 w4 I1 w5    Single Delete + Single Insert
29     w1 w2 w4 w5 I1    Single Delete + Single Insert
30     w1 w3 w4 w5 I1    Single Delete + Single Insert
31     w1 w2 w3 w5 I1    Single Delete + Single Insert
32     w1 w3 w4 w5 R6    Single Delete + Single Replace
33     w1 w2 w3 w5 R6    Single Delete + Single Replace
34     w1 w3 R4 w5 w6    Single Delete + Single Replace
35     w1 w2 w4 R5 w6    Single Delete + Single Replace
36     w1 w3 w4 R5 w6    Single Delete + Single Replace
37     w1 w2 w4 w5 R6    Single Delete + Single Replace
38     w1 R2 w3 w5 w6    Single Replace + Single Delete
39     w1 w2 R3 w4 w6    Single Replace + Single Delete
40     w1 R2 w3 w4 w6    Single Replace + Single Delete
41     w1 I1 w2 R3 w4    Single Insert + Single Replace
42     w1 w2 I1 w3 R4    Single Insert + Single Replace
43     w1 I1 w2 w3 R4    Single Insert + Single Replace
44     w1 R2 w3 I1 w4    Single Replace + Single Insert
45     w1 w2 R3 w4 I1    Single Replace + Single Insert
46     w1 R2 w3 w4 I1    Single Replace + Single Insert
47     w1 I1 w2 w4 w5    Single Insert + Single Delete
48     w1 w2 I1 w3 w5    Single Insert + Single Delete
49     w1 I1 w2 w3 w5    Single Insert + Single Delete

Limits for the Number of Steps, ntp. We figure out what maximum and minimum number of steps we need for an input text. Taking the second assumption into consideration, it is obvious that if the value of m is 3 (the number of words would also be 3), then only rules #6, 7 and 8 can be used to generate 5-grams. 5-grams generated from rules #7 and 8 cannot be used in the next step since, after the last word (w3), we might have at most one error, and all the 5-grams, if any, generated using these rules already have this error (i.e., I2). 5-grams generated from rule #6 can be used in the next step (by rule #2 in Table 2) to test whether we can insert a word in the next step, provided that the previous step generates at least one 5-gram. Thus, if m = 3 we might need at most 2 steps. Now, if m = 4, then for the added word (i.e., w4) we need two extra steps to test rules #5 and 2, in order, on top of the previous two steps (for the first three words), provided that each previous step generates at least one 5-gram.

Page 210: [Lecture Notes in Computer Science] Advances in Artificial Intelligence Volume 6657 ||

Correcting Different Types of Errors in Texts 197

That is, each extra token in T needs at most two extra steps. We generalize the maximum number of steps needed for an input text having m tokens as:

Max ntp = 2+(m−3)×2 = 2m−4 (5)

Again, the minimum number of steps is ensured if rules #6 to 8 in step 1 do not generate any 5-gram. This means that, if m = 3, we might need at least 0 steps⁴. Now, if m = 4, then for the added word (i.e., w4) we need only one extra step to test rule #5 on top of the previous single step (for the first three tokens). That is, each extra token in T needs at least one extra step, provided that each previous step for each extra token generates at least one 5-gram.⁵ We generalize the minimum number of steps needed for an input text having m tokens as:

Min ntp = m−3 (6)

In (5), the maximum number of steps, 2m − 4, also means that the maximum number of tokens possible in a candidate text is 2m. Thus, an input text having 2m tokens can have at most m errors to be handled and m correct words, assuming m ≥ 3 (the second assumption in Section 1).
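As a quick check of equations (5) and (6), an input of m = 5 tokens needs at most 6 and at least 2 steps:

def step_limits(m):
    """Maximum and minimum number of steps (Equations 5 and 6) for m input tokens."""
    return 2 * m - 4, m - 3

print(step_limits(5))   # (6, 2)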

Table 2. List of All Possible 5-gram Rules in Step 2 to Step 2m− 4

Rule#  5-gram Rule        Generated from     Case Number
1      − − − wi wi+1      No Operation       1: if the last word in step 1 is in T
2      − − − wi Ij        Single Insert      1
3      − − − wi wi+2      Single Delete      1
4      − − − wi Ri+1      Single Replace     1
5      − − wi Ij wi+1     No Operation       2: if the second last word in step 1 is in T and the last word is either an inserted or a replaced word
6      − − wi Ri+1 wi+2   No Operation       2

5-gram Rules Used in Step 2 to 2m−4. Table 2 lists all possible 5-gram rules generated from the said operations and assumptions for step 2 to step 2m−4. We use step 2 (i.e., the next step) only if step 1 (i.e., the previous step) generates at least one 5-gram from the 5-gram rules listed in Table 1. Similarly, we use step 3 (i.e., the next step) only if step 2 (i.e., the previous step) generates at least one 5-gram from the 5-gram rules listed in Table 2, and so on. In Table 2, '−' means that it might be any word that is in T, or an inserted word (an instance of I's), or a replaced word (an instance of R's) in the previous step. To give a specific example of how we list the 5-gram rules in Table 2, consider that rule #2 (w1 I1 w2 w3 w4) in Table 1 generates at least one 5-gram in step 1. We take the last four words of this 5-gram (i.e., I1 w2 w3 w4) and add the next word from T (in this case w5) in order to form a new rule in step 2 (which is I1 w2 w3 w4 w5). The general form of this rule (− − − wi wi+1) is listed as rule #1 in Table 2. In step 1, I1 in rule #2 acts like a variable, but in step 2 we use only a single instance of I1, which acts like a constant. We categorize all the 5-grams generated in step 1 (i.e., the previous step) into two different cases. Case 1 groups each 5-gram in step 1 having its last word in T. Case 2 groups each 5-gram in step 1 having its second last word in T, and the last word not in T. We stop when we fail to generate any 5-gram in the next step from all the 5-gram rules of the previous step.

⁴ We call a step successful if it generates at least one 5-gram. Thus, if we try to generate some 5-grams in step 1 and fail to generate any, then the number of steps, ntp, is 0, though we do some processing for step 1.

⁵ If we omit the assumption that each previous step for each extra token generates at least one 5-gram, then determining Min ntp is straightforward: it is 0.

Determining the Limit of Candidate Texts. There might be a case when no 5-gram is generated in step 1; this means that the minimum n possible is 0. Table 1 shows that there are 11 5-gram rules (rules without any I's or R's) in step 1 that generate at most one 5-gram per 5-gram rule. It turns out that the remaining 5-gram rules can generate at most n 5-grams per 5-gram rule. Thus, the maximum number of candidate texts, n, that can be generated having only a single step (i.e., ntp = 1) is:

Max n = (no. of 5-gram rules in step 1 − no. of 5-gram rules in step 1 without any I's or R's) × n + no. of 5-gram rules in step 1 without any I's or R's    (7)
      = (49 − 11) × n + 11 = 38n + 11    (8)

At most 2n + 2 5-grams (rules #1 to 4 in Table 2) can be generated in step 2 from a single 5-gram generated in step 1 having the last word in T. There may be at most 33 such 5-grams in step 1. At most one 5-gram (rules #5 and 6 in Table 2) can be generated in step 2 from a single 5-gram generated in step 1 having the second last word in T and the last word being either an inserted or a replaced word. There may be at most 16 such 5-grams in step 1. The maximum number of candidate texts, n, that can be generated having two steps (i.e., ntp = 2) is:

Max n = 33(2n + 2) + 16 × 1    (9)

We generalize Max n for different values of ntp as:

Max n ≈  38n + 11                                  if step = 1
         33×2⁰(2n+2) + 16                          if step = 2
         33×2¹(2n+2) + 66×2⁰ n                     if step = 3
         ...
         33×2^(ntp−2)(2n+2) + 66×2^(ntp−3) n       if step = ntp        (10)

Simplifying (10):

Max n ≈  38n + 11                  if step = 1
         66n + 82                  if step = 2
         2^(ntp−3)(198n + 132)     if step ≥ 3                          (11)

Theoretically, Max n seems to be a large number, but in practice n is much smaller than Max n. This is because not all theoretically possible 5-grams are in the Web 1T 5-grams data set, and because fewer 5-grams generated in any step have an effect in all the subsequent steps.

Forming Candidate Texts. Algorithm 1 describes how a list of candidate texts can be formed from the list of 5-grams in each step. That is, the output of Algorithm 1 is {T1, T2, ..., Ti, ..., Tn}.


Algorithm 1. Forming candidate texts
input:  ntp, list of 5-grams in each step
output: candidate_list

candidate_list ← NULL
for each 5-gram of step 1 do
    k ← 1
    candidate_text[k] ← 5-gram of step 1
    for i ← 2 to ntp do
        j ← 1
        for each k do
            for each 5-gram of step i do
                next_5gram ← 5-gram of step i
                temp_candidate_text[j] ← candidate_text[k]
                str1 ← last four words of temp_candidate_text[j]
                str2 ← first four words of next_5gram
                if str1 = str2 then
                    temp_candidate_text[j] ← temp_candidate_text[j] . last word of next_5gram   /* '.' concatenates */
                end
                increment j
            end
        end
        decrement j
        for each j do
            candidate_text[j] ← temp_candidate_text[j]
        end
        k ← j
    end
    for each k do
        candidate_list ← candidate_list + candidate_text[k]
    end
end

The algorithm works as follows: taking the last four words of each 5-gram in step 1, it tries to match them with the first four words of each 5-gram in step 2. If they match, then concatenating the last word of the matched 5-gram in step 2 with the matched 5-gram in step 1 generates a temporary candidate text for further processing. If a 5-gram in step 1 does not match with at least a single 5-gram in step 2, then the 5-gram in step 1 is itself a candidate text. One 5-gram in step 1 can match with several 5-grams in step 2, thus generating several temporary candidate texts. We continue this process until we cover all the steps.

3.3 Sorting Candidate Texts

It turns out from Section 3.2 that, if the input text is T, then the total n candidate texts are {T1, T2, ..., Ti, ..., Tn}. We determine the correctness value S for each candidate text using (12), a weighted sum of (2), (3) and (4), and then we sort in descending order by the correctness values. In (12), it is obvious that β1 + β2 + β3 = 1 in order to have S ∈ (0, 1].

S(Ti) = β1 S2(Ti, T) + β2 S3(Ti, T) + β3 S4(Ti)    (12)


By trying to preserve the semantic meaning of the input text as much as possible, we intentionally keep the candidate texts and the input text as close (both semantically and syntactically) as possible. Thus, we set more weight on S2 and S3. Though we set a low weight on S4, it is one of the most crucial parts of the method, as it helps to identify and correct the error. If we only rely on the normalized frequency value of each candidate text, then we have to deal with an increasing number of false positives: the method detects an input text as incorrect while, in reality, it is not. On the contrary, if we only rely on the similarity of common words, non-common words, and so on, between the input text and each candidate text, then we have to deal with an increasing number of false negatives: the method detects an input text as correct while, in reality, it is not.
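A compact sketch of the scoring and sorting step, reusing the earlier sketches; the beta values are illustrative (the paper only requires β1 + β2 + β3 = 1, with more weight on S2 and S3 than on S4), and s4_values is assumed to map each candidate (as a tuple of tokens) to its S4 score, e.g. from the normalized_frequency_value sketch above:

def rank_candidates(input_tokens, candidates, s4_values, betas=(0.4, 0.4, 0.2)):
    """Score each candidate with S = b1*S2 + b2*S3 + b3*S4 and sort best-first."""
    b1, b2, b3 = betas
    scored = []
    for cand_tokens in candidates:
        s2, s3 = common_and_noncommon_similarity(cand_tokens, input_tokens)
        s4 = s4_values[tuple(cand_tokens)]
        scored.append((b1 * s2 + b2 * s3 + b3 * s4, cand_tokens))
    return sorted(scored, reverse=True)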

4 Evaluation and Experimental Results

4.1 Evaluation on WSJ Corpus

Because of the lack of a publicly available data set having multiple errors in short texts, we generate a new evaluation data set, utilizing the 1987-89 Wall Street Journal corpus. It is assumed that this data contains no errors. We select 34 short texts from this corpus and artificially introduce some errors, so that some combination of insert, delete, and/or replace operations is required to get back to the correct texts. To generate the incorrect texts, we artificially insert prepositions and articles; delete articles, prepositions, and auxiliary verbs; and replace prepositions with other prepositions, singular nouns with plural nouns (e.g., spokesman with spokesmen), articles with other articles, real words with real-word spelling errors (e.g., year with tear), and real words with spelling errors (e.g., there with ther). To generate real-word spelling errors (which are in fact semantic errors) and spelling errors, we use the same procedure as [28]. The average number of tokens in a correct text and an incorrect text are 7.44 and 6.32, respectively. The average number of corrections required per text is 1.76. We keep some texts without inserting any error, to test the robustness of the system (we got only a single false positive). This decreases the number of errors per text.

The performance is measured using Recall (R), Precision (P), F1 and Accuracy (Acc). We asked two human judges, both native speakers of English and graduate students in Natural Language Processing, to correct those 34 texts. The agreement between the two judges is low (the detection agreement is 53.85% and the correction agreement is 50.77%), which means the task is difficult even for human experts. Table 3 shows two examples of test texts. The results in Table 4 show that our method gives comparable recall values for both detection and correction, whereas human judges give better precision values for both detection and correction. Since a majority of the words in the evaluation data set are correct, the baseline is to propose no correction, achieving 76.28% accuracy. Taking this baseline accuracy as a lower limit and the accuracy achieved by the human judges as an upper limit, we conclude that the automatic method realizes about half of the possible improvement between the baseline and the human expert upper bound (76%-84%-92%, respectively).


Table 3. Some examples

             Example 1                               Example 2
Incorrect    All funding decisions is made the       What his believed to the next
Correct      All funding decisions are made by the   What is believed to be the next
Judge 1      All funding decisions is made by the    What is believed to be next
Judge 2      All funding decisions are made the      What he believed to be the next
Our Method   All funding decisions are made by the   What is believed to be the next

Table 4. Results on the WSJ corpus

             Detection               Correction
             R      P      F1        R      P      F1       Acc.
Our Method   90.0   75.00  81.82     78.33  65.28  71.21    84.98
Judge 1      65.0   88.64  75.00     58.33  79.54  67.31    86.56
Judge 2      90.0   93.10  91.53     83.33  86.21  84.75    92.89

4.2 Evaluation on JLE Corpus

We also evaluate the proposed method using the NICT JLE corpus [22], to directly compare with [24]. The JLE corpus has 15,637 sentences with annotated grammatical errors and their corrections. We generated a test set of 477 sentences for subject-verb (S-V) agreement errors, and another test set of 238 sentences for auxiliary agreement and complementation (AAC) errors, by retaining the verb form errors but correcting all other error types. [24] generated the same number of sentences of each category.

[24] used the majority baseline, which is to propose no correction, since the vast majority of verbs were in their correct forms. Thus, [24] achieved a majority baseline of 96.95% for S-V agreement and 98.47% for AAC. Based on these numbers, it can be determined that [24] had only 14 or 15 errors in the S-V agreement data set and 3 or 4 errors in the AAC data set. Our data set has a majority baseline of 80.5% for S-V agreement and 79.8% for AAC. This means that we have 93 errors in the S-V agreement data set and 48 errors in the AAC data set. The small number of errors in their data set is the reason why they get high accuracy even when they have moderate precision and recall. For example, if their method fails to correct 2 errors out of the 3 errors in the AAC data set (i.e., if true positives are 1 and false negatives are 2), then their recall would be 33.3%, yet their accuracy would still be 99.16%. Table 5 shows that our method generates consistent precision, recall, and accuracy.

Table 5. Results on the JLE corpus. '—' means that the result is not mentioned in [24].

             Detection               Correction
             R      P      F1        R      P      F1       Acc.
Lee (S-V)    —      83.93  —         80.92  81.61  —        98.93
Lee (AAC)    —      80.67  —         42.86  68.0   —        98.94
Our (S-V)    98.92  96.84  97.87     97.85  95.79  96.81    98.74
Our (AAC)    97.92  94.0   95.92     95.83  92.0   93.88    97.48


5 Conclusion

The proposed unsupervised text correction approach can correct one error, which might be syntactic or semantic, for every word in a text. This large magnitude of error coverage, in terms of number, can be applied to correct Optical Character Recognition (OCR) errors, to automatically mark (based on grammar and semantics) subjective examination papers, etc. A major drawback of our proposed approach is the dependence on the availability of enough 5-grams. The future challenge is how to tackle this problem, while keeping the approach unsupervised.

References

1. Brants, T., Franz, A.: Web 1T 5-gram corpus version 1.1. Technical report, Google Research (2006)

2. Atwell, E., Elliot, S.: Dealing with ill-formed English text. In: Garside, R., Sampson, G., Leech, G. (eds.): The computational analysis of English: a corpus-based approach. Longman, London (1987)

3. Gale, W.A., Church, K.W.: Estimation procedures for language context: Poor estimates are worse than none. In: Proceedings Computational Statistics, Physica-Verlag, Heidelberg (1990) 69–74

4. Mays, E., Damerau, F.J., Mercer, R.L.: Context based spelling correction. Information Processing and Management 27(5) (1991) 517–522

5. Church, K.W., Gale, W.A.: Probability scoring for spelling correction. Statistics and Computing 1(2) (December 1991) 93–103

6. Golding, A.R., Roth, D.: A winnow-based approach to context-sensitive spelling correction. Machine Learning 34(1-3) (1999) 107–130

7. Golding, A.R., Schabes, Y.: Combining trigram-based and feature-based methods for context-sensitive spelling correction. In: Proceedings of the 34th annual meeting on Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1996) 71–78

8. Yarowsky, D.: Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1994) 88–95

9. Gamon, M., Aue, A., Smets, M.: Sentence-level MT evaluation without reference translations: Beyond language modeling. In: European Association for Machine Translation (EAMT) (2005) 103–111

10. Sjobergh, J.: Chunking: an unsupervised method to find errors in text. In: Werner, S. (ed.): Proceedings of the 15th NoDaLiDa conference (2005) 180–185

11. Wang, C., Seneff, S.: High-quality speech translation for language learning. In: Proc. of InSTIL, Venice, Italy (2004)

12. Eeg-Olofsson, J., Knutsson, O.: Automatic grammar checking for second language learners - the use of prepositions. In: NoDaLiDa, Reykjavik, Iceland (2003)

13. Chodorow, M., Leacock, C.: An unsupervised method for detecting grammatical errors. In: Proceedings of NAACL'00 (2000) 140–147

14. Atwell, E.S.: How to detect grammatical errors in a text without parsing it. In: Proceedings of the third conference on European chapter of the Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (1987) 38–45


15. Islam, A., Inkpen, D.: An unsupervised approach to preposition error correction. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'10), Beijing (August 2010) 1–4

16. Felice, R.D., Pulman, S.G.: A classifier-based approach to preposition and determiner error correction in L2 English. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, Coling 2008 Organizing Committee (August 2008) 169–176

17. Fouvry, F.: Constraint relaxation with weighted feature structures. In: Proceedings of the 8th International Workshop on Parsing Technologies, Nancy, France (2003) 23–25

18. Vogel, C., Cooper, R.: Robust chart parsing with mildly inconsistent feature structures. In: Schter, A., Vogel, C. (eds.): Nonclassical Feature Systems. Volume 10. Centre for Cognitive Science, University of Edinburgh (1995). Working Papers in Cognitive Science.

19. Wagner, J., Foster, J., van Genabith, J.: A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007) 112–121

20. Andersen, O.E.: Grammatical error detection using corpora and supervised learning. In: Nurmi, V., Sustretov, D. (eds.): Proceedings of the 12th Student Session of the European Summer School for Logic, Language and Information (2007)

21. Foster, J., Vogel, C.: Parsing ill-formed text using an error grammar. Artif. Intell. Rev. 21(3-4) (2004) 269–291

22. Izumi, E., Uchimoto, K., Isahara, H.: SST speech corpus of Japanese learners' English and automatic detection of learners' errors. ICAME Journal 28 (2004) 31–48

23. Brockett, C., Dolan, W.B., Gamon, M.: Correcting ESL errors using phrasal SMT techniques. In: ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, Morristown, NJ, USA, Association for Computational Linguistics (2006) 249–256

24. Lee, J.S.Y.: Automatic Correction of Grammatical Errors in Non-native English Text. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science (June 2009)

25. Izumi, E., Supnithi, T., Uchimoto, K., Isahara, H., Saiga, T.: Automatic error detection in the Japanese learners' English spoken data. In: Companion Volume to Proc. ACL03 (2003) 145–148

26. Islam, A., Inkpen, D.: Real-word spelling correction using Google Web 1T n-gram data set. In: Cheung, D.W.L., Song, I.Y., Chu, W.W., Hu, X., Lin, J.J. (eds.): Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, ACM (November 2009) 1689–1692

27. Islam, A., Inkpen, D.: Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2 (July 2008) 10:1–10:25

28. Hirst, G., Budanitsky, A.: Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering 11(1) (March 2005) 87–111


Simulating the Effect of Emotional Stress on Task Performance Using OCC

Dreama Jain and Ziad Kobti

School of Computer Science, University of Windsor, Windsor, ON, Canada, N9B 3P4

{jainh,kobti}@uwindsor.ca

Abstract. In this study we design and implement an artificial emotional response algorithm using the Ortony, Clore and Collins theory in an effort to understand and better simulate the response of intelligent agents in the presence of emotional stress. We first develop a general model to outline generic emotional agent behaviour. Agents are then socially connected and surrounded by objects, or other actors, that trigger various emotions. A case study is built using a basic hospital model where nurses servicing patients interact in various static and dynamic emotional scenarios. The simulated results show that an increase in emotional stress leads to higher error rates in nurse task performance.

Keywords: Multi-agent system, emotions, behavior, affective computing.

1 Introduction

Theories of emotion are rooted in psychology [1, 2, 4]. The Artificial Intelligence community has been showing growing interest in emotional psychology, particularly in artificial emotional response, to formulate a more realistic social adaptation [5-7]. In previous work [11] we describe three emotional theories and design a computer algorithm for each. The Ortony, Clore and Collins theory (OCC) [1] is selected for this study since comparative work between OCC, Frijda's theory [2] and Scherer's theory [3] produces relatively similar results in simulated benchmarks. OCC captures the cognitive structure of emotions. It is commonly used in computational models of emotions. According to this theory, there are cognitions and interpretations that lead to the generation of an emotion. These cognitions are determined by the interaction of events, agents, and objects.

There is neurological evidence, among others, showing a correlation between emotional stress and task performance. This can be trivially observed in human behaviour, as one underperforms or has an increased likelihood of error when performing a given task under stress. We aim to replicate this behaviour in a simulated session by first creating a generic algorithm outlining such artificial behaviour, and second by testing agent task performance in the presence of emotional stress. Consequently we build a multi-agent simulation which generates emotions in agents and observes the influence of emotions on agent behavior.

In the next section, we highlight related work which has been done so far in this area and discuss the psychological theories that we have used in creating the algorithms for our simulation.


Next we detail the model, the process of emotion generation and its influence on behaviour. We then perform some experiments using a case study of a hospital system, generating emotions in patients and nurses using the OCC theory under different settings, followed by concluding remarks.

2 Related Work

Gratch and Marsella [8, 9] introduce a domain-independent framework of emotion known as Emotion and Adaptation (EMA), which not only implements appraisal of events but also generates a coping process for the event and the emotion generated. The authors use a doctor's example to generate emotions for a child patient and then generate coping strategies for that situation. In [9] the authors present a method to evaluate their EMA model by comparing it with human behaviour using a stress and coping questionnaire. They conclude that their model's predictions are very close to the questionnaire results, with some limitations.

Adam, Herzig and Longin [10] describe recent work on building emotional agents with the help of BDI logic and the formalization of the emotional theory described by OCC [1]. Their work introduces a logical framework based on BDI (Belief, Desire and Intention) which consists of agents, actions and atomic formulae. With the help of full belief, probability, choice, and like/dislike, various emotions are formalized and a task is performed after decision-making. The authors conducted a case study related to Ambient Intelligence and developed a logical formalization of the twenty-two emotions described by OCC. They state that their research models the triggering of emotions from mental states.

3 Generic Model

In the generic model we build a multi-agent system where agents are placed randomly on a two-dimensional grid. An agent represents a human with an emotional response governed by the underlying emotional theory, in this case OCC. Several objects are randomly distributed on the grid, and each elicits an emotion such as liking or disliking of the object. The positions of the objects are fixed, while the agents move according to a given move-speed parameter. Agents can interact with their neighbours on the grid within a set communication distance.

At every time step, an agent moves randomly on the grid in order to interact with its neighbours. The agent's current position is checked relative to nearby objects and a corresponding emotion is reflected in the agent. Next an event is triggered and the agent checks for neighbours in the surrounding area, depending upon the communication distance parameter. According to the emotional theory, a corresponding emotion is then generated. These steps are performed for all agents.

According to the OCC theory, if an object is found then, according to its value, it is recognized as either liked or disliked. If there are no neighbours around the agent, then a wellbeing, prospect or attribution emotion is generated depending upon the triggered event. If there are neighbours, then for every neighbouring agent an emotion is generated from the fortune-of-others or attribution emotions, again depending upon the


event triggered. A history of generated emotions is kept up to date. A wellbeing emotion is either pleased or displeased. In prospect emotions we use a set probability for an event to happen, according to which hope, fear, satisfaction, fears-confirmed, disappointment and relief emotions can be generated. Attribution emotions generate approval and gratification or disapproval and remorse for the agent's own actions; for another agent's action they can generate approval and gratitude or disapproval and anger. Fortune-of-others emotions can again lead to a pleased or displeased emotion, but this time due to some other agent's action.
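As an illustration of the dispatch just described, the following Python sketch generates one emotion per time step. The grouping of labels follows the text, but the exact label strings, the Agent class and the probabilities are our own illustrative choices rather than details taken from the paper.

import random
from dataclasses import dataclass, field

# Hypothetical OCC emotion labels grouped as in the text; the label strings are ours.
WELLBEING         = ["pleased", "displeased"]
PROSPECT          = ["hope", "fear", "satisfaction", "fears-confirmed",
                     "disappointment", "relief"]
ATTRIBUTION_SELF  = ["gratification", "remorse"]
ATTRIBUTION_OTHER = ["gratitude", "anger"]
FORTUNE_OF_OTHERS = ["pleased-for-other", "displeased-for-other"]

@dataclass
class Agent:
    history: list = field(default_factory=list)   # generated emotion history

def appraise(agent, object_value=None, neighbours=(), event_prob=0.5):
    """One appraisal step of the generic OCC dispatch described above."""
    if object_value is not None:
        # An encountered object is recognized as liked or disliked by its value.
        agent.history.append("like" if object_value > 0 else "dislike")
    if not neighbours:
        # No neighbours: a wellbeing, prospect or attribution (self) emotion.
        category = random.choice([WELLBEING, PROSPECT, ATTRIBUTION_SELF])
        agent.history.append(random.choice(category))
    else:
        # Neighbours present: a fortune-of-others or attribution (other) emotion
        # is generated for every neighbour, depending on the triggered event.
        for _ in neighbours:
            category = FORTUNE_OF_OTHERS if random.random() < event_prob else ATTRIBUTION_OTHER
            agent.history.append(random.choice(category))
    return agent.history[-1]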

To observe the influence of emotion on the agent, at every time step an agent performs some task. If the agent's emotional state is happy or otherwise positive, the agent performs the task logically, that is, in the way the task is expected to be performed. On the other hand, if the agent's emotional state is unstable and negative, with emotions such as despair, disappointment, sadness, or disgust, the agent may not perform the task as efficiently as expected.

We define a task to be a process composed of a sequence of steps. If each step is performed the way it is expected, the task is said to be completed correctly. We represent a task as a weighted directed graph. Every step of the task (node of the graph) has a weight associated with it, and these weights are summed in order to check the completion of the graph. A task can have, for instance, around 4 to 10 steps.

The weight of the task is defined as the sum of the individual weights of its steps. These weights represent the attention required by the agent to perform each step; in other words, the attention factor of a step is the weight associated with the corresponding node of the graph. A step can have an attention factor varying from 0 to 100. A task is performed by traversing the graph and summing the weights (attention factors) attached to the visited nodes. Logically, a task is completed if all of its steps are performed, that is, all nodes of the graph are visited, so at the completion of the task we obtain a final sum of the attention factors of all steps. A task is performed logically if the emotional state of the agent is positive or neutral. When the emotional state of the agent is negative, the agent is assumed to commit mistakes or perform the task differently. Under the influence of negative emotion, humans tend to make mistakes, sometimes skip a step while performing a task, or make emotional decisions that are not logical. With this motivation in mind, in our simulation an agent under the influence of negative emotion [12] tends to miss one or more steps while performing the task, according to its current emotional state. When the agent misses one or more steps, the total weight of the task differs from the one that would have been obtained by performing it logically. We plot this difference between logical and emotional task completion to observe the behaviour of the agents; the average task attention for all nurses, achieved logically and emotionally, is plotted on the graph.

Once an agent performs the same task over and over a number of times, the agent learns, and its likelihood of making mistakes, even under the influence of negative emotion, is reduced. For every agent we track how many times it has performed the same task and update this count every time. Once a threshold is reached, for example when the agent has performed the same task over


20 times, the agent learns from its experience and gets used to performing the task without making further mistakes, even under the influence of negative emotion.
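A minimal sketch of this task mechanism follows, assuming a simple chain of steps. The skip probability and the use of a valence sign for the emotional state are our own simplifications; the 0-100 attention weights and the learning threshold of 20 repetitions come from the text.

import random

def perform_task(step_weights, emotion_valence, times_performed=0,
                 skip_prob=0.2, learn_threshold=20):
    """Traverse the task graph once and return the total attention achieved."""
    # After the task has been repeated learn_threshold times (20 in the text),
    # the agent no longer skips steps, even under negative emotion.
    experienced = times_performed >= learn_threshold
    total = 0
    for weight in step_weights:            # each weight is a step's attention factor (0-100)
        if emotion_valence < 0 and not experienced and random.random() < skip_prob:
            continue                       # a step is missed under negative emotion
        total += weight
    return total

# Example: a 6-step task performed logically vs. under negative emotion.
task = [40, 70, 55, 90, 30, 65]
print(perform_task(task, emotion_valence=+1))    # always the full sum (350)
print(perform_task(task, emotion_valence=-1))    # may fall short of 350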

4 Case Study: Nurse-Patient Hospital System

We use a case study inspired by the hospital system described in [13]. In this model there are two kinds of agents: patients and nurses. Initially all patients and nurses are allocated to patient rooms and nurse offices, respectively. There are 38 patient agents and 5 nurse agents. The patient agents have fixed locations, while the nurse agents move from room to room with the main task of servicing the patients. Patients buzz when they need a nurse. A nurse follows a path from her room to the room of the patient who buzzed in order to service that patient. The path is represented by a weighted directed graph in which nodes represent particular areas and edges represent the ability to travel between two adjacent areas. Following this path, nurses serve the patients, but along the way a nurse agent may see another nurse agent and they may interact. The hospital system runs for a number of time steps, where one time step is equivalent to 12.5 seconds of real time.

Initially every patient agent has some emotion which depends on the severity of that patient's condition; a more severe patient may have more negative emotions. When a nurse serves a patient, the effect of the patient's emotion is reflected in the nurse's emotional state. For example, if a nurse sees a patient in pain and disappointed by his condition, the nurse may feel sad and displeased; if a nurse sees a patient recovering and satisfied, the nurse may also feel satisfied and pleased. In this scenario we use the OCC theory to generate emotions in both patients and nurses. Moreover, when a nurse interacts with other nurses, their emotional states are also affected. This update in emotion then causes a change in the behaviour of the nurses while they perform other tasks, such as preparing medications, documentation, etc. In our simulation we observe the effect of emotion on the nurses' behaviour while they perform their tasks.

5 Experiments and Results

We used three different settings to perform experiments with the simulation, and ran the simulation 10 times with each setting. In the first setting the patient agents' emotions are fixed and the nurses do not interact with other nurses. The simulation shows that whenever there are more nurses with a negative emotional state, such as displeased, the nurses tend to skip some steps in their tasks. This does not happen frequently, however, because the patients' emotional states are constant and more nurses have a positive emotional state. In the second setting nurses interact with other nurses while the patients' emotions remain fixed. In this setting, nurses communicate with each other if they are in the same room, and their emotional states may change when they communicate. As more nurses interact, their emotional states change more often; if they have a negative emotional state, they tend to skip one or more steps while performing their tasks. The comparison between the purely logical way of performing a


task and the emotionally performed task shows, in this case, a large difference. The more nurses interact, the more often their emotional states change, and the more often they make mistakes. Since the patients' emotions are constant, any negative emotion that is generated tends to multiply among nurses as they influence each other, and a large pattern of mistakes made by nurses is seen. In the third setting, the patients' emotional states change with time; a patient's emotion also changes when a nurse visits. We see a pattern: when the number of unhappy patients increases, the number of unhappy nurses also increases and task performance is affected. When the number of unhappy patients decreases, there are fewer unhappy nurses and consequently more tasks are performed logically. Figure 1 shows the task performance of the nurses, logically and emotionally.

[Figure 1 plots Task Attention (y-axis, 0-450) against Time Step (x-axis, 0-150), with one curve for logical and one for emotional task performance.]

Fig. 1. Graph showing task attention vs. time for logical and emotional performance of the nurse for Setting 3

6 Conclusion and Future Work

We define a generic emotional model and apply it to the nurses and patients in the hospital simulation system. We observe that the model is able to generate emotions for the nurses depending on the emotions of the patients, and that task performance is strongly dependent on the emotions of the nurses. We conclude that our model generates emotions according to the situations agents encounter, and that the general model can be implemented in other simulations to


generate emotions and observe behaviour under their influence. The model can still be improved by adding learning and adaptation capabilities to the agents: when an agent encounters the same situation repeatedly, for example a nurse seeing a patient in pain again and again, it adapts to the situation and no longer generates the same emotion. Moreover, the personality of an agent can also affect the generation of emotion, so defining agent personality is left as future work.

References

1. Ortony, A., Clore, G., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press, Cambridge (1988)
2. Frijda, N.H.: The Emotions. Cambridge University Press, Cambridge (1986)
3. Scherer, K.: Emotion as a multicomponent process: a model and some cross-cultural data. Review of Personality and Social Psychology 5, 37–63 (1984)
4. Lazarus, R.S.: Emotion and Adaptation. Oxford University Press, Oxford (1991)
5. Sloman, A.: Motives, Mechanisms, and Emotions. In: Boden, M. (ed.) The Philosophy of Artificial Intelligence. Oxford University Press, Oxford (1990)
6. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc., Englewood Cliffs (1995)
7. Reilly, N.: Believable Social and Emotional Agents. Doctoral thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA (1996)
8. Gratch, J., Marsella, S.: A domain independent framework for modeling emotion. Jour. of Cog. Sys. Res. 5(4), 269–306 (2004)
9. Gratch, J., Marsella, S.: Evaluating a Computational Model of Emotion. Autonomous Agents and Multi-Agent Systems 11(1), 23–43 (2005)
10. Adam, C., Herzig, A., Longin, D.: A logical formalization of the OCC theory of emotions. Synthese 168, 201–248 (2009)
11. Jain, D., Kobti, Z.: Emotionally Responsive General Artificial Agent Simulation. In: FLAIRS 2011, AAAI Proceedings (to appear, 2011)
12. Dijksterhuis, A.: Think Different: The Merits of Unconscious Thought in Preference Development and Decision Making. J. of Pers. and Soc. Psych. 87(5), 586–598 (2004)
13. Bhandari, G., Kobti, Z., Snowdon, A.W., Nakhwal, A., Rahman, S., Kolga, C.A.: Agent-Based Modeling and Simulation as a Tool for Decision Support for Managing Patient Falls in a Dynamic Hospital Setting. In: Schuff, D., Paradice, D., Burstein, F., Power, D.J., Sharda, R. (eds.) Decision Support, Annals of Information Systems, vol. 14, pp. 149–162. Springer, Heidelberg (2011)


Base Station Controlled Intelligent Clustering Routing in Wireless Sensor Networks

Yifei Jiang and Haiyi Zhang

Jodrey School of Computer Science, Acadia University, Wolfville, Nova Scotia, Canada B4P 2R6

Abstract. The main constraints of a Wireless Sensor Network (WSN) are its limited energy and bandwidth. In industrial settings, a WSN deployed with massive node density produces a large amount of redundant sensory traffic, which decreases the network lifetime. We investigate the problem of energy-efficient routing for a WSN in a radio-harsh environment and propose a novel approach that creates optimal routing paths by using a Genetic Algorithm (GA) and Dijkstra's algorithm executed at the Base Station (BS). To demonstrate the feasibility of our approach, formal analysis and simulation results are presented.

Keywords: Genetic Algorithm; Dijkstra’s algorithm; WSN.

1 Introduction

Most Wireless Sensor Networks (WSNs) use battery-powered sensor nodes. Across the field of WSNs, the main problem is network lifetime, which is constrained by the limited energy supply of every sensor node. The primary function of a WSN is to perform data communication between the BS and all sensor nodes, and WSNs are expected to work for a long duration; the limited energy resources put constraints on energy usage. Moreover, given the limited bandwidth, a large amount of redundant sensory traffic is generated under massive node density. This increases the load on the sensors, which in turn drains their power quickly. For this reason, many attempts have been made to prolong node lifetime and to eliminate redundant data [1], [2], [3]. Our proposed technique builds on the essence of these technologies. Specifically, we propose a Base Station Controlled Intelligent Clustering Routing (BSCICR) protocol that uses an aggregation method to handle redundant data and creates energy-efficient multi-hop routing over cluster-heads (CHs) by running Dijkstra's algorithm at the BS. Moreover, we improve the fitness function of the GA to produce better clusters and CHs. The main contribution of this paper is therefore an energy-efficient routing protocol for a large-scale WSN used in a radio-harsh environment, which balances the energy load among all participating nodes. The BSCICR protocol is based on a clustering scheme; however, it has the following specific features: 1. All source nodes are randomly deployed in the WSN with the same initial energy. 2. The BS is located far away from the sensor field


and has a powerful processor and a persistent, stable energy supply. 3. The source nodes and the BS are all stationary. 4. The non-CH nodes perform sensing and transmitting tasks at a fixed rate, while the CHs only perform data aggregation and transmitting operations. To calculate the energy consumption in BSCICR, we utilize the first-order radio model described in [1].

2 System Design

In a dynamic environment, the topology of a WSN changes constantly due to the increasing number of dead nodes. In BSCICR, a GA is used to create globally optimal clusters and CHs based on the network topology; changes in the topology therefore greatly influence the accuracy of the GA's result. For this reason, the scheduler used in BSCICR maintains relatively accurate GA results. The online scheduler generates a data aggregation tree, i.e., the energy-efficient routing paths produced by Dijkstra's algorithm executed at the BS. The generated schedule is defined as the data aggregation tree along with its frequency (interval). To explain, if an aggregation tree Ti is used for n rounds, then at the nth round all living nodes send their current energy status along with their data packets to the BS. Once the scheduler at the BS receives this information, the BS updates the network topology accordingly, and Dijkstra's algorithm at the BS creates a new energy-efficient data aggregation tree Ti+1 based on the up-to-date topology.
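The sketch below shows one way the BS-side scheduler could rebuild the aggregation tree with Dijkstra's algorithm over the current CHs; the squared-distance edge cost (a stand-in for the radio energy model) and all function names are our assumptions, not details from the paper.

import heapq

def dijkstra(graph, source):
    """Plain Dijkstra over an adjacency dict {node: {neighbour: edge_cost}}."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def rebuild_aggregation_tree(ch_positions, bs_position):
    """Return a parent map (CH -> next hop toward the BS) over the living CHs."""
    nodes = dict(ch_positions)
    nodes["BS"] = bs_position
    graph = {u: {v: (pu[0] - pv[0]) ** 2 + (pu[1] - pv[1]) ** 2
                 for v, pv in nodes.items() if v != u}
             for u, pu in nodes.items()}
    _, prev = dijkstra(graph, "BS")
    # Following the predecessor pointers from any CH leads back to the BS.
    return {ch: prev.get(ch, "BS") for ch in ch_positions}

# Example with three CHs (coordinates are invented):
print(rebuild_aggregation_tree({"CH1": (10, 10), "CH2": (40, 15), "CH3": (70, 20)},
                               bs_position=(100, 100)))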

After a schedule is generated, the BS broadcasts it as multiple packets to the WSN. Meanwhile, in order to decrease the overall network latency, the BS sends out synchronization pulses to all sensor nodes; this ensures that all clusters start their data transmission phase at the same time. The schedule consists of two single packets, named IntervalSchedulePacket and ClusterNodeSchedulePacket, shown in Figure 1. The IntervalSchedulePacket on the left is composed of five elements: Frequency, NodeID, CHID, ClusterID and ChildList. The last four elements represent the addressing scheme used in BSCICR. This addressing scheme is based on the source nodes' attributes and geographical deployment positions, and takes the form < NodetypeID, LocationID >. For instance, NodeID and CHID are the source nodes' identification numbers, which indicate whether a source node is a non-CH node or a CH node, respectively. ClusterID and ChildList describe the composition of each cluster; the (n − 2) bytes in ChildList are the maximum size the child list can have if there are n source nodes in the corresponding cluster. BSCICR utilizes a "Round-Robin" scheduling scheme to minimize collisions when non-CH nodes transmit data to CHs. Frequency in IntervalSchedulePacket is therefore the interval used in the "Round-Robin" scheduling scheme, denoted in milliseconds. For example, if one of the non-CH nodes starts transmitting its sensory data, the other nodes only transmit their sensory data after certain intervals. After non-CH nodes finish the sensing process, they only turn their radio on when it is their turn to transmit; otherwise, they are set to the sleep state (radio off). This can also


Fig. 1. Schedule packets

provide the benefit of delaying the first node death for the whole WSN. The other schedule packet in Figure 1, ClusterNodeSchedulePacket, is the core schedule packet of our approach and is only assigned to and executed by CHs. It can be divided into two parts: a Control Segment and a Process Segment. The Control Segment uses the same < NodetypeID, LocationID > addressing scheme. Its other three elements are CurrentRound, LastRoundFlag and RoutingSequenceList. The first represents the number of the current data gathering round; it is initially set to 1 and increased by 1 after every data gathering round. LastRoundFlag indicates whether the current round is the last data gathering round; when it is set to "True", it triggers the scheduler to generate and disseminate a new schedule, in which all elements of the Control Segment change. The third element describes a routing sequence constructed by Dijkstra's algorithm that guides the CHs in transmitting the gathered data to the BS in an energy-efficient way. The Process Segment contains the processing code used to eliminate redundant data. This code is dispatched by the BS; it is also encapsulated at the BS to further process the data gathered from all CHs after each data gathering round.
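For concreteness, the two schedule packets can be pictured as the following Python records; the field names follow Figure 1, while the types and defaults are our own illustrative choices (the real packets are fixed-size byte layouts).

from dataclasses import dataclass, field
from typing import List

@dataclass
class IntervalSchedulePacket:
    frequency_ms: int                     # Round-Robin transmission interval
    node_id: str                          # <NodetypeID, LocationID> address
    ch_id: str
    cluster_id: str
    child_list: List[str] = field(default_factory=list)   # at most n-2 entries

@dataclass
class ClusterNodeSchedulePacket:
    # Control Segment
    node_id: str
    ch_id: str
    cluster_id: str
    current_round: int = 1                # incremented after each gathering round
    last_round_flag: bool = False         # True triggers a new schedule at the BS
    routing_sequence: List[str] = field(default_factory=list)  # built by Dijkstra
    # Process Segment: aggregation code dispatched by the BS
    process_code: bytes = b""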

Genetic Algorithm. With the constantly changing number of sensor devices in a WSN used in a radio-harsh environment, we utilize a GA to generate an approximately optimal solution. To the best of our knowledge, researchers still use the same methods listed in [4] for the selection approach, crossover and mutation, so we do not explain them here. We choose the permutation method to encode a chromosome. In permutation encoding, every chromosome is a string of numbers that indicates the routing sequence. In BSCICR, a chromosome is a collection of genes and represents a single cluster of a given WSN. Each chromosome has a fixed length determined by the number of source nodes in the WSN. The other key components and operators of the GA, such as the fitness and the fitness function, are described below.

A. Fitness Parameters. The fitness parameters below are designed to define the fitness function of the GA. Most importantly, they are the critical guidelines for minimizing the energy consumption during the data transmission phase and prolonging the network lifetime. 1. Node Distance (CD): the sum of the spatial transmission distances from all non-CH nodes to their corresponding CHs. For a cluster in a given WSN with j source nodes, where the CH has coordinates (x_ch, y_ch) and a non-CH node has coordinates (x_non-ch, y_non-ch), CD is calculated as CD = Σ_{i=1}^{j-1} √((x_ch − x_non-ch)² + (y_ch − y_non-ch)²). CD is a very


important parameter utilized to control the size of clusters in the cluster setup phase. For a large-scale WSN, the value of CD will be very large, and thus the energy consumption will be higher under the aforementioned radio model. Hence, to achieve energy-efficient routing, we need to keep the value of CD small. 2. Routing Distance (D): the spatial transmission distance between any CH and the BS. If the coordinates of the CH and the BS are (x_ch, y_ch) and (x_bs, y_bs), respectively, then D = √((x_bs − x_ch)² + (y_bs − y_ch)²). Parameter D is the core part of the BSCICR protocol. According to the radio model, the energy consumed by the transmit amplifier of a sensor node is proportional to D⁴; therefore, for energy efficiency, the value of D should be small. 3. Average CHs Distance (CH): the average distance among all CHs. In BSCICR, a shortest routing path is generated by executing Dijkstra's algorithm at the BS. The routing path here is a routing sequence list, which contains the multi-hop transmission paths among CHs. After a certain number of data gathering rounds, a few new CHs are generated by the GA, so the distances between CHs change. Therefore, taking the radio model into account, CH should also be kept small to reduce the energy cost of routing the gathered data between

CHs. CH is calculated as CH = (Σ_{i=1}^{n} √((x_n − x_i)² + (y_n − y_i)²)) / (n(n−1)/2), where (x_n, y_n) and (x_i, y_i) are the coordinates of any two different CHs and n (n ≥ 1) is the number of CHs; for a complete graph with n vertices there are at most n(n−1)/2 edges. 4. Energy Consumption (Ec): the energy consumed by any cluster of a given WSN. For a given cluster C with j source nodes, Ec is defined as Ec = Σ_{i=1}^{j} E_T(i,CH) + j × E_R + (j − 1) × E_DA + E_T(CH,BS). In this equation, the first term is the sum of the energy consumed in transmitting the sensory data from every non-CH node to its CH node. The second term is the energy dissipated at the CH for receiving the gathered data from all non-CH nodes. The third and fourth terms represent the energy spent on the data aggregation operation and on transmitting the aggregated data from the CH to the BS, respectively. To design an energy-efficient routing protocol, on the precondition of ensuring good data communication, the smaller the value of Ec the better. 5. Number of Data Gathering Rounds (R): a predefined number dispatched by the BS. According to R, the GA decides when to start reproducing the next generation (the population). The value of R can be adjusted based on the current energy status of all source nodes in the WSN. Furthermore, if R is assigned a larger value for the current GA population, it indicates that this population has a better fitness value than others and will be used for a longer period. A reasonably large value of R helps the fitness function of the GA generate small variations in the best fitness value of the chromosomes. 6. Percentage of CHs (P): the ratio of the total number of active CHs, N_CH, over the total number of participating source nodes, N_PS (non-CH nodes and CHs), in the WSN, defined as P = N_CH / N_PS × 100%. Here, only


alive nodes are considered participating nodes. In BSCICR, P is computed after the cluster initialization phase, since we can obtain an optimal value of P from the aforementioned optimal number of clusters k, which equals N_CH at that moment. Although after a certain number of data gathering rounds R the value of N_PS may decrease because some participating nodes run out of power, we keep the same value of P for every data gathering round in order to distribute the participating nodes evenly among the clusters. In this way, the energy load on each CH is kept as low as possible in every data gathering round, which extends the network lifetime by postponing the first node death.

B. Fitness Function. The fitness function is defined over the above fitness parameters and is used to measure the quality of the represented chromosomes. In BSCICR, the GA is executed at the BS, which gives the BS the ability to determine the optimal cluster formation yielding the minimum energy consumption at run time. The fitness function f(x) is represented as f(x) = Σ_i (α_i × f(x_i)), ∀ f(x_i) ∈ {CD, D, CH, Ec, R, P}. In this expression, α_i is a set of arbitrarily assigned weights for the above fitness parameters. After every generation, the best-fit chromosome is evaluated and all six fitness parameters are updated as Δf_i = f(x_{i+1}) − f(x_i), where Δf_i is the change in the value of a fitness parameter and the index i (i ≥ 1) is the number of generations; Δf_i is thus the difference between the fitness values of the current and previous populations. After every generation, the six fitness parameters are evaluated to assess the improvement. The initial weight α_i is updated as α_i = α_{i−1} + c_i · Δf_i, where c_i = 1 / (1 + e^{−f_i}) improves the value of the weights based on previous experience [4]. A suitable range for α_i is given in the Simulation section.
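The fitness computation and the weight update reduce to a few lines; the sketch below assumes the six parameters are held in a dict, and the concrete numbers in the example are invented for illustration only.

import math

PARAMS = ("CD", "D", "CH", "Ec", "R", "P")

def fitness(values, weights):
    """f(x) = sum over i of alpha_i * f(x_i) for the six fitness parameters."""
    return sum(weights[k] * values[k] for k in PARAMS)

def update_weights(weights, prev_values, curr_values):
    """alpha_i = alpha_(i-1) + c_i * delta_f_i, with c_i = 1 / (1 + exp(-f_i))."""
    updated = {}
    for k in PARAMS:
        delta = curr_values[k] - prev_values[k]          # delta_f_i between generations
        c = 1.0 / (1.0 + math.exp(-curr_values[k]))
        updated[k] = weights[k] + c * delta
    return updated

# One GA generation with made-up parameter values:
w    = {"CD": 0.3, "D": 0.3, "CH": 0.1, "Ec": 0.2, "R": 0.05, "P": 0.05}
gen1 = {"CD": 120.0, "D": 80.0, "CH": 25.0, "Ec": 0.40, "R": 10, "P": 0.05}
gen2 = {"CD": 100.0, "D": 78.0, "CH": 22.0, "Ec": 0.35, "R": 10, "P": 0.05}
w = update_weights(w, gen1, gen2)
print(round(fitness(gen2, w), 2))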

3 Simulation

The simulator is implemented in Java under the Eclipse development environment, and the communication channel in the WSN is assumed to be ideal. Figure 2 shows the graphical user interface of our simulator. The layout of the simulator consists of three parts. The first part, at the top of the simulator, contains the control and input panels.

Fig. 2. Wireless Sensor Network Simulator


All simulation parameters can be adjusted manually through the input text fields of the simulator. For example, the size of the WSN can be scaled by configuring the network and cluster size in step 2. When scaling, the general layout of the WSN remains the same due to the limited size of the simulator window (800×800 pixels), but the spacing among nodes is adjusted according to the given disk size. Moreover, the user can define the network's initial state, as well as choose the desired data structure and algorithm implementation. The graph panel, in the middle of the simulator, displays the graphical results, which represent the generated energy-efficient routing paths and the optimal CHs. For visualization, the energy-efficient routing paths are shown in red, distinguishing them from the connected graph shown on the left in black. For example, once all simulation parameters are set in steps 1 and 2, pressing Create and then Connect in step 3 produces the complete graph with all CHs connected, shown in black; pressing Simulate then shows the shortest routing paths as the coloured graph. In addition, the BS is displayed in blue and red in the two graphs, respectively, distinguishing it from the CHs shown in black. The output panel, at the bottom of the simulator, shows the data results of all simulations of the different protocols. In this panel the user can check the energy status (minimum, average, maximum and standard), the maximum number of data gathering rounds achieved, the basic network configuration, and the selected algorithms and simulation scenarios.

4 Conclusions and Future Work

We proposed a GA-based solution for a large-scale WSN used in a radio-harsh environment. We described the WSN as a set of chromosomes (a population), which the GA uses to compute approximately optimal routing paths. By utilizing Dijkstra's algorithm, we were able to transform the dynamic topology of the entire network into a complete graph. Further study of the GA with an improved fitness function is our next step to improve the approach.

References

1. Heinzelman, W.R., Chandrakasan, A., Balakrishnan, H.: Energy-efficient communication protocol for wireless microsensor networks. In: Hawaii International Conference on System Sciences (2000)
2. Muruganathan, S., Ma, R.B.D., Fapojuwo, A.: A centralized energy-efficient routing protocol for wireless sensor networks. IEEE Communications Magazine 43, S8–S13 (2005)
3. Hussain, S., Matin, A.W., Islam, O.: Genetic Algorithm for Hierarchical Wireless Sensor Networks. Journal of Networks 2(5) (2007)
4. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning (1989)


Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures

Colette Joubarne and Diana Inkpen

School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada K1N 6N5

[email protected], [email protected]

Abstract. Despite the growth in digitization of data, there are still many languages without sufficient corpora to achieve valid measures of semantic similarity. If it could be shown that manually-assigned similarity scores from one language can be transferred to another language, then semantic similarity values could be used for languages with fewer resources. We test an automatic word similarity measure based on second-order co-occurrences in the Google n-gram corpus for English, German, and French. We show that the scores manually assigned in Rubenstein and Goodenough's experiments for 65 English word pairs can be transferred directly into German and French. We do this by conducting human evaluation experiments for the French word pairs (and by using similarly produced scores for German). We show that the correlation between the automatically assigned semantic similarity scores and the scores assigned by human evaluators is not very different when using Rubenstein and Goodenough's scores across languages, compared to the language-specific scores.

1 Introduction

Semantic similarity refers to the degree to which two words are related. Measures of semantic similarity are useful for tasks such as information retrieval, data mining, question answering, and text summarization. As indicated by Cramer [2], many applications such as question answering, topic detection, and text summarization rely on semantic relatedness measures based on word nets and/or corpus statistics. However, these approaches require large and varied corpora, which are often not available for languages other than English. If it could be shown that measures of semantic similarity have a high correlation across languages, then values for semantic similarity could be assigned to translated n-grams, thus enabling one set of values to be applied to many languages.

Determining semantic similarity is routinely performed by humans, but it is a complex task for computers. Gabrilovich and Markovitch [3] point out that humans do not judge text relatedness only based on words. Identification of similarity involves reasoning at a much deeper level that manipulates concepts. Measures of similarity for humans are based on the larger context of their background and experience. Language


is not merely a different collection of characters, but is founded on a culture that impacts the variety and subtlety of semantically similar words. For example, in French, often described as the "language of love", the verbs "to like" and "to love" both translate to "aimer". The word pair "cock, rooster" from Rubenstein and Goodenough [10] translates to "coq, coq" in French and "Hahn, Hahn" in German.

Rubenstein and Goodenough [10] defined the baseline for the comparison of semantic similarity measures. However, the fact that translation is not a 1:1 relation introduces difficulty in the use of a baseline. Understanding whether it is possible to use translated words to measure semantic similarity using corpora from another language is the goal of this experiment.

2 Related Work

Automatically assigning a value to the degree of semantic similarity between two words has been shown to be quite difficult [5]. Rubenstein and Goodenough [10] presented human subjects with 65 noun pairs and asked them how similar they were on a scale from 0 to 4. Miller and Charles [8] took a subset of this data (30 pairs) and repeated this experiment. Their results were highly correlated (97%) to those of the previous study.

Semantic similarity is a fundamental task in Natural Language Processing; therefore, many different approaches to automating measures of word semantic similarity have been studied. Jarmasz and Szpakowicz [7] used a computerized version of Roget's Thesaurus to calculate the semantic distance between the word pairs, achieving a correlation of 0.82 with Rubenstein and Goodenough's [10] results. Budanitsky and Hirst [1] compared five different measures of semantic similarity based on WordNet. They found that when comparing the correlation of each measure with Rubenstein and Goodenough's [10] human evaluator scores, the difference between the automatic measures was small (within 0.05). Islam and Inkpen [6] introduced Second Order Co-occurrence PMI as a measure of semantic similarity, and achieved a 0.71 correlation with Rubenstein and Goodenough [10] when measured using the British National Corpus (BNC)1.

Hassan and Mihalcea [4] use the interlanguage links found in Wikipedia to produce a measure of relatedness using explicit semantic analysis. They achieved a correlation with the Miller and Charles [8] word pairs between 0.32 and 0.50 for Spanish, Arabic and Romanian. Not surprisingly, they found that better results were achieved for languages with a larger Wikipedia. Mohammad et al. [9] proposed a new method to determine semantic distance by combining text from a language with fewer corpora available, such as German, with a knowledge source in a language with large corpora available, such as English. They combined German text with an English thesaurus to create cross-lingual distributional profiles of concepts, achieving a correlation of 0.81 with Rubenstein and Goodenough's word pairs [10].

Typically, two approaches have been used to solve multilingual problems: rule-based systems and statistical learning from parallel corpora. Rule-based systems usually have low accuracy, and parallel corpora can be difficult to find. Our approach

1 http://www.natcorp.ox.ac.uk/


will be to use manual translation and language-specific corpora, in order to measure and compare semantic similarity for English, French and German, using second-order co-occurrence.

3 Data

The data used was the Google n-gram corpus, which included n-grams (n=1-5) generated from roughly 100 billion word tokens from the web for each language. Only the unigrams, bigrams and 5-grams were used for this project. Since the purpose is to compare the semantic similarity of nouns only, and to compare results achieved on the same data, it was decided that removal of non-alphabetic characters and stemming of plurals was sufficient for our purposes.2

The word pairs were taken from Rubenstein and Goodenough [10] and translated into French using a combination of Larousse French-English dictionary, Le Grand dictionnaire terminologique, maintained by the Office quebecois de la langue francaise, a couple of native speakers and a human translator. In some cases where the semantic similarity of the word pair was high, the direct translation of each word in the word pair resulted in the same word. In these cases the pair was left out completely. The semantic similarity of the translated words was then evaluated by human judges.

The 18 evaluators, who had French as their first language, were asked to judge the similarity of the French word pairs. They were instructed to indicate, for each pair, their opinion of how similar in meaning the two words are on a scale of 0-4, with 4 for words that mean the same thing and 0 for words that mean completely different things. The results were averaged over the 18 responses (with the exception of three word pairs, where a respondent left the score blank, so these were only averaged over 17). For 71% of the word pairs there was good agreement amongst the evaluators, with over half of the respondents agreeing on their scores; however, in 23% of the cases there was high disagreement, with scores ranging from 0-4. The results can be seen in Appendix A3, which presents the word pairs for the three languages used in our study together with the similarity scores according to human judges.

The German translation of the word pairs, including human evaluation of similarity, was borrowed from Mohammad et al [9]. Some of the word pairs do not match exact translations. Since the focus of their study was on the comparison between scores from human evaluators and automated results, they addressed the issue of semantically similar words resulting in identical words during translation, by choosing another related word.

A comparison of the frequencies for similarity values amongst all evaluators for each language, presented in Table 1, shows that the English and German scores are similarly distributed, whereas the French scores are more heavily weighted around a score of 0 and 1.

2 Stopword removal and stemming were performed during further research, but it was found that results were significantly worse with stopword removal and stemming, and relatively unchanged with stopword removal alone. Stopword lists were taken from the Multilingual Resources at the University of Neuchatel. The Lingua stemming algorithms were used.

3 Available at http://www.site.uottawa.ca/~mjoub063/wordsims.htm


Table 1. Frequency of similarity scores

Similarity Score   Frequency (English)   Frequency (German)   Frequency (French)
0                  0                     4                    15
1                  25                    19                   23
2                  12                    16                   5
3                  8                     4                    10
4                  20                    22                   12

4 Methodology

Unigram and bigram counts were taken directly from the 1-gram and 2-gram files, taking into account characters and accents in the French and German alphabets. Second order counts were generated from the 5-gram data.

Two measures of semantic similarity were used, point-wise mutual information and second order co-occurrence point-wise mutual information. These measures were calculated for each set of word pairs, and compared to the baseline measures from the original data set, as well as the new values generated by human evaluators.

The point-wise mutual information (PMI) measure is a corpus-based measure, as opposed to a dictionary-based measure of semantic similarity. PMI measures the more general sense of semantic relatedness, where two words are related by their proximity of use without necessarily being similar. The PMI score between two words w1 and w2 is defined as the logarithm of the probability of the two words appearing together divided by the product of the probabilities of each word occurring separately. PMI was chosen because it scales well to larger corpora, and it has been shown to have the highest correlation amongst corpus-based measures [6].

Second order co-occurrence PMI (SOC-PMI) is also a corpus-based measure of semantic relatedness, based on how many words appear in the neighbourhood of both target words. The SOC-PMI score between two words w1 and w2 is based on the probability of a word y appearing with w1 and of y appearing with w2 within a given window, in separate contexts. SOC-PMI was chosen because it fits well with the Google n-gram corpora: the frequencies for a window of size 5 are easily obtained from the 5-gram counts. The formula can be found in Islam and Inkpen [6].
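As a sketch of how PMI is obtained from the n-gram files (this is our own minimal reading, not the authors' code), the probabilities are estimated from unigram and bigram counts and combined as log(P(w1, w2) / (P(w1) P(w2))); SOC-PMI then aggregates such scores over the neighbour words found in the 5-gram windows.

import math

def pmi(count_w1, count_w2, count_pair, total_unigrams, total_bigrams):
    """PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ), from raw corpus counts."""
    if count_pair == 0:
        return float("-inf")                  # the pair never co-occurs
    p1 = count_w1 / total_unigrams
    p2 = count_w2 / total_unigrams
    p12 = count_pair / total_bigrams
    return math.log(p12 / (p1 * p2))

# Hypothetical counts for one word pair (all numbers are invented):
print(pmi(count_w1=1_200_000, count_w2=300_000, count_pair=4_500,
          total_unigrams=13_000_000_000, total_bigrams=10_000_000_000))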

5 Results

The PMI and SOC-PMI scores were calculated for each set of word pairs and compared to both the scores collected by Rubenstein and Goodenough [10] and the language-specific scores collected from human evaluators (see Table 2).

Table 2. Pearson correlation of calculated PMI and SOC-PMI scores with R&G scores and new human evaluator scores

Language   PMI vs. R&G   SOC-PMI vs. R&G   PMI vs. Evaluators   SOC-PMI vs. Evaluators
English    0.41          0.61              n/a                  n/a
French     0.34          0.19              0.29                 0.17
German     0.40          0.27              0.47                 0.31


6 Discussion

Our best correlation of 0.61 for the English SOC-PMI is not as good as that achieved by Islam and Inkpen [6]. However, their correlation of 0.73 was achieved using the BNC. The higher results could possibly be explained by the lack of noise in the BNC (discussion of noise issues found in Google n-gram corpus appears in Section 7), as well as the ability to use a larger window than supported by the Google 5-grams.

The correlation of the SOC-PMI scores and the original scores was slightly lower than for the human scores for the German word pairs, and slightly higher for the French word pairs. Almost 2/3 of the French and German word pairs had a SOC-PMI of 0. This is reflected in the poor correlation values and is likely due to the fact that the French and German corpora were approximately 1/10 the size of the English corpus.

7 Conclusion and Future Work

Given the lack of data for over 2/3 of the French and German pairs, it is not possible to make any claims with certainty; however, since the results were not significantly improved by using language-specific human evaluation, they do suggest that it might be possible to transfer semantic similarity across languages. While further work is needed to confirm our hypothesis, we have produced a set of human evaluator scores for French which can be used in future work.

Although results were improved from earlier work, given the larger corpora for English, it appears that larger French and German corpora are still required to draw any significant conclusions. The Google n-gram corpora, for both French and German, contain approximately 13 billion tokens each; however, many of these tokens are not words. There are strings of repeating combinations of letters and many instances of multiple words in one token. For example, there are roughly 500-1000 tokens containing “abab” or “cdcd” and every other combination. There are 2000 occurrences of “voyageurdumonde” and 5000 of “filleougarcon”. Future work of this type with the Google n-grams should consider using a dictionary to filter out these kinds of tokens.
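A simple dictionary-based filter of the kind suggested here could look as follows; the word list, the repetition pattern and the function name are ours, chosen only to illustrate the idea.

import re

def is_valid_token(token, dictionary):
    """Reject non-alphabetic strings, runs of repeated letter pairs, and
    run-together strings that are not dictionary words."""
    if not token.isalpha():
        return False
    if re.search(r"(..)\1{2,}", token):       # catches runs such as "ababab"
        return False
    return token.lower() in dictionary

french_words = {"midi", "voyageur", "monde", "fille", "garcon"}
for t in ["midi", "voyageurdumonde", "abababab"]:
    print(t, is_valid_token(t, french_words))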

Another approach would be to select words that are common in all of the languages of interest, and that result in unique word pairs after translation. A new baseline would have to be created. This would require some study of word frequencies, and effort being spent in having the semantic similarity of the word pairs evaluated by human evaluators.

Budanitsky and Hirst [1] suggest a different approach. In their comparison of five measures of semantic similarity, they argue that comparing only to human evaluator scores is not sufficient, and that what we are really interested in is the relationship between the concepts for which the words are merely surrogates; the human judgments we need are of the relatedness of word senses, not of words. They attempt to define such an experiment, and find that the effectiveness of the five measures varies considerably when compared this way. The idea of using the


relatedness of word senses, not of words, could possibly overcome some of the issues4 encountered when translating the word pairs.

Acknowledgements

We address our thanks to the Social Sciences and Humanities Research Council (SSHRC) and to the Natural Sciences and Engineering Research Council (NSERC) of Canada for supporting this research. We thank Aminul Islam for sharing his code for SOC-PMI. We thank Saif Mohammad for sharing the German word pair similarity scores. We also thank Stan Szpakowicz for his comments on the draft of this paper.

References

1. Budanitsky, A., Hirst, G.: Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh (2001)

2. Cramer, I.: How Well Do Semantic Relatedness Measures Perform? A Meta-Study. In: Proceedings of STEP 2008 Conference, vol. 1, pp. 59–70 (2008)

3. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In: Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India (January 2007)

4. Hassan, S., Mihalcea, R.: Cross-lingual Relatedness using Encyclopedic Knowledge. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, pp. 1192–1201 (August 2009) (to appear)

5. Inkpen, D., Désilets, A.: Semantic Similarity for Detecting Recognition Errors in Automatic Speech Transcripts. In: EMNLP 2005, Vancouver, Canada (2005)

6. Islam, A., Inkpen, D.: Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038 (May 2006)

7. Jarmasz, M., Szpakowicz, S.: Roget’s Thesaurus and semantic similarity. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003), Borovets, Bulgaria, pp. 212–219 (2003)

8. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1), 1–28 (1991)

9. Mohammad, S., Gurevych, I., Hirst, G., Zesch, T.: Cross-lingual distributional profiles of concepts for measuring semantic distance. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL 2007), Prague, Czech Republic (2007)

10. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Communications of the ACM 8(10), 627–633 (1965)

4 In some cases the translation of word pairs resulted in the same word, and in other cases the result produced a phrase or a more obscure word. For example, "midday, noon" = "midi, midi", "woodland" = "région boisée", and "mound" = "monticle".


A Supervised Method of Feature Weighting for Measuring Semantic Relatedness

Alistair Kennedy1 and Stan Szpakowicz1,2

1 SITE, University of Ottawa, Ottawa, Ontario, Canada
{akennedy,szpak}@site.uottawa.ca

2 Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

Abstract. The clustering of related words is crucial for a variety of Natural Language Processing applications. Many known techniques of word clustering use the context of a word to determine its meaning; words which frequently appear in similar contexts are assumed to have similar meanings. Word clustering usually applies a weighting of contexts based on some measure of their importance. One of the most popular measures is Pointwise Mutual Information. It increases the weight of contexts where a word appears regularly but other words do not, and decreases the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. We present a method of supervised feature weighting. It identifies contexts shared by pairs of words known to be semantically related or unrelated, and then uses Pointwise Mutual Information to weight these contexts by how well they indicate closely related words. We use Roget's Thesaurus as a source of training and evaluation data. This work is a step towards adding new terms to Roget's Thesaurus automatically, and doing so with high confidence.

1 Introduction

Pointwise Mutual Information (PMI) is a measure of association between two values of two random variables. PMI has been applied to a variety of Natural Language Processing (NLP) tasks, and shown to work well when identifying contexts indicative of a given word. In effect, PMI can be used to give higher weights to contexts in which a word occurs frequently, but other words appear rarely, while giving lower weight to contexts with distributions closer to random. Finding these weights requires no actual training data, so it is essentially an unsupervised method of context weighting, an observation also made in [1]. In our paper we show how to incorporate supervision into the process of context weighting. We learn appropriate weights for the contexts from known sets of related and unrelated words extracted from a thesaurus. PMI is then calculated for each context: we measure the association between pairs of words which appear in that context and pairs of words which are known to be semantically related. The PMI scores can then be used to apply a weight to the contexts in which a word is found. This is done by building a word-context matrix which records the counts


of how many times each word appears in each context. By applying our weighting technique to this matrix, we are effectively training a measure of semantic relatedness (MSR). In our experiments, unsupervised PMI measures association between a context and a word, while supervised PMI measures association between a context and synonymy. We also perform experiments combining these supervised and unsupervised methods of learning semantic relatedness between word pairs.
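The supervised weighting can be pictured with the following toy sketch, which computes, for every context, the PMI between the event "a word pair shares this context" and the event "the pair is listed as related in the thesaurus". This is our reading of the description above, sketched over plain sets; the actual implementation in the paper (counts, smoothing, cut-offs) differs.

import math
from itertools import combinations

def supervised_context_weights(context_words, related_pairs):
    """Return a weight per context: PMI(pair shares context, pair is related)."""
    pairs_per_context, all_pairs = {}, set()
    for c, words in context_words.items():
        pairs = {frozenset(p) for p in combinations(sorted(words), 2)}
        pairs_per_context[c] = pairs
        all_pairs |= pairs

    n = len(all_pairs)
    p_related = len(all_pairs & related_pairs) / n
    weights = {}
    for c, pairs in pairs_per_context.items():
        p_context = len(pairs) / n
        p_joint = len(pairs & related_pairs) / n
        weights[c] = (math.log(p_joint / (p_context * p_related))
                      if p_joint > 0 else float("-inf"))
    return weights

# Toy example (contexts are <relation, word> pairs; the data is invented):
contexts = {("obj-of", "drink"): {"tea", "coffee", "water"},
            ("mod", "hot"):      {"tea", "coffee", "day"}}
related = {frozenset({"tea", "coffee"})}     # e.g. listed together in Roget's
print(supervised_context_weights(contexts, related))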

Our system uses data from two versions of Roget's Thesaurus, from 1911 and from 1987, for our supervised context weighting method. We also compare the two versions of Roget's and determine how its age and size affect it as a source of training data. We use SuperMatrix [2], a system which implements a variety of MSRs. Specifically, we use its Cosine similarity and PMI MSRs. The corpus we use for building our word-context matrix is Wikipedia.

Motivation

This work is designed to be a step towards automatically updating Roget's Thesaurus through identifying semantically related words and clustering them. Our goal is not to create a new thesaurus from scratch but rather to update an existing one. We can therefore try to use the existing thesaurus as a tool for learning how words are related, which in turn can help update Roget's. Rather than relying on unsupervised word similarity metrics, we can use Roget's Thesaurus to train potentially superior word similarity metrics. This has been partially inspired by [3], where machine learning is used to learn from a corpus words related by hypernymy. Training on known hypernym and non-hypernym pairs in WordNet [4] allows the system to learn to identify hypernyms for adding to WordNet. Roget's is structured quite differently from WordNet, so the technique of [3] is not appropriate here, but we adopt the "bootstrapping" idea of using a lexical resource to aid in its own expansion.

2 Related Work

A variety of corpus-based Measures of Semantic Relatedness (MSRs) have been developed by NLP researchers – see [5] for an in-depth review. Corpus-based MSRs generally work by representing word w as a vector of contexts in which w appears. The context can be as broad as the document where w appears [6], or as specific as one word, for example in a verb-object relationship with w [7].

Contexts are most often determined using a dependency parser to extract from the text triples 〈w, r, w′〉, where word w is related to another word w′ by relationship r. The context of w is then the pair 〈r, w′〉. This technique has been widely applied [8–11].

There have been attempts to incorporate some supervision into the process of learning semantic distance. In [12], a function consisting of weighted combinations of precision and recall of contexts is proposed for measuring semantic relatedness. The function has two thresholds which the authors optimize using a set of training instances. Many variations on their measure were evaluated on the task of predicting how closely word clusters match those of a thesaurus (as we do), and on pseudo-word-sense disambiguation. This involves minimal supervision: only two thresholds are learned.

There is also related work on learning weights for short-document similarity. In [13, 14] a method of learning weights in a word-document matrix was proposed. The authors weighted terms to learn document similarity rather than weighting contexts to learn word similarity, and they minimized a loss function rather than applying PMI. They compared their system against TF.IDF weighting of documents. The documents they used were actually queries, and the task was to identify advertisements relevant to a given query.

[15] presents another related project. A combination of supervised and unsupervised learning determines whether one verb can be a paraphrase of another. Unsupervised learning is used to bootstrap instances where one verb can be replaced by another. These bootstrapped examples are then used to train a classifier which can tell in what contexts one word can replace another.

The supervised method of learning synonyms in [1] is probably the work most closely related to ours. A variety of methods for identifying synonymy, both distributional and pattern-based, is followed by machine learning to combine them. This combination was found to improve on the individual methods. We do not use supervision to combine methods of identifying synonyms but rather to determine the weights for a measure of semantic relatedness.

PMI itself has been widely used in NLP. In [16], PMI is used to learn word sentiment by measuring the association between a phrase and other words known to be positive or negative. PMI has also been applied to named entity extraction from text [17] and query classification into types [18]. In [19], PMI is used in an unsupervised manner to assign weights to a word-context matrix. This process is further described in Section 3.

3 Unsupervised Use of PMI for Measuring Semantic Relatedness

We use PMI for both supervised and unsupervised learning of context weights. In this section we describe how PMI is used in an unsupervised way. PMI is actually a measure of association between two events, x and y:

\mathrm{PMI}(x, y) = \log\left(\frac{P(x, y)}{P(x)\,P(y)}\right)    (1)

When those two events are a particular word and a particular context, we can measure the association between them and use it as a weighting scheme for measuring semantic distance [19]. This is what is calculated when using PMI for unsupervised term-context matrix weighting. To create the term-context matrix we used a tool called SuperMatrix.
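To make Equation 1 concrete, the following minimal Python sketch computes PMI from raw counts; the function name and the toy numbers are illustrative and are not part of SuperMatrix.

import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise Mutual Information of two events x and y (Equation 1).

    count_xy -- number of observations where x and y occur together
    count_x, count_y -- marginal counts of x and of y
    total -- total number of observations
    """
    if count_xy == 0:
        return float("-inf")          # the events never co-occur
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy example: a word seen 50 times, a context seen 80 times,
# co-occurring 10 times among 10000 observations.
print(pmi(10, 50, 80, 10000))         # positive: above-chance association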


3.1 SuperMatrix

SuperMatrix [2] is a tool which implements a large variety of MSRs on a word-context matrix. These include variations on PMI [20], Lin’s measure [8], and the measures proposed in [12]. A further family of variations, referred to as the RankWeight Function (RWF) [21, 22], has also been implemented and shown to enhance many of these measures. RWF is interesting in that it applies one context weighting function on top of another; likewise, we apply different weighting methods on top of each other when we combine supervised and unsupervised context weighting.

To use SuperMatrix, we give it a single query word q and ask it to return the set of 100 words w1..w100 most closely related to q. (Scores for each word, in the range 〈0..1〉, are provided, but we only need the rank.)

To construct a word-context matrix on which to run the SuperMatrix MSRs, we applied the same methods as [8]. We parsed with Minipar [23] a corpus comprising about 70% of Wikipedia (a dump of August 2010; 70% was the most data we could process on a computer with 4 GB of RAM). The parsing results supply dependency triples 〈w, r, w′〉. We split these triples into two parts: a word w and a pair 〈r, w′〉 – the context in which w is found. Examples of triples are 〈time, mod, unlimited〉 and 〈time, conj, motion〉, where the word “time” appears in a context with the modifier “unlimited” and in a conjunction with “motion”.

The word-context matrix is constructed from these dependency triples. Each row corresponds to a word w and each column to a context C; the cell at their intersection records count(w, C), the number of times w is found in C. As we learn either supervised or unsupervised weights, we change the values in this matrix from raw counts to more appropriate weights. Each row of the matrix is essentially a vector representing a word, and the distance between two words is the distance between their vectors.

To reduce noise, only words appearing 50 or more times and contexts appearing 5 or more times are included. This gives us a total of 32,743 words and 321,152 contexts. The average word appears in approximately 480 unique contexts, while the average context appears as a feature of around 50 words. We used only nouns in our experiments.
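The construction of the matrix can be sketched as follows in Python; the triple format and the frequency thresholds follow the description above, but the data structures are a simplification for illustration, not SuperMatrix’s own.

from collections import Counter, defaultdict

def build_matrix(triples, min_word=50, min_context=5):
    """Build a word-context count matrix from dependency triples <w, r, w'>.

    Each context is the pair (r, w'); a cell holds count(w, C).  Words seen
    fewer than min_word times and contexts seen fewer than min_context times
    are dropped, mirroring the noise-reduction thresholds described above."""
    word_freq = Counter(w for w, _, _ in triples)
    ctx_freq = Counter((r, w2) for _, r, w2 in triples)

    matrix = defaultdict(Counter)           # matrix[word][context] = count
    for w, r, w2 in triples:
        if word_freq[w] >= min_word and ctx_freq[(r, w2)] >= min_context:
            matrix[w][(r, w2)] += 1
    return matrix

# Toy run with the thresholds lowered so the example produces output.
triples = [("time", "mod", "unlimited")] * 3 + [("time", "conj", "motion")] * 2
m = build_matrix(triples, min_word=1, min_context=1)
print(m["time"][("mod", "unlimited")])      # 3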

3.2 Applying Unsupervised PMI

A PMI score determines to what extent a word and a context appear together beyond random chance. In this case we have the probabilities P(x) of seeing the word, P(y) of seeing the context and P(x, y) of seeing both together. This is calculated for all contexts in all word vectors. The actual distance between two words a and b is the distance between their context vectors, A and B respectively. One of the most common means of measuring distance between vectors – and indeed the measure we apply – is cosine similarity:

\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}    (2)


Vectors which are closer together are assumed to represent more similar meanings, while vectors that are farther apart are assumed to represent less related meanings. Our two unsupervised MSRs are plain cosine similarity and PMI weighting combined with cosine similarity.
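A compact sketch of the two unsupervised MSRs follows: PMI weighting of the matrix cells (Equation 1) and cosine similarity between the resulting sparse vectors (Equation 2). Clipping negative PMI weights to zero is an assumption made for this sketch; the paper does not state how negative unsupervised weights are handled.

import math
from collections import Counter

def pmi_weight(matrix):
    """Replace raw counts with PMI(word, context) scores (Equation 1)."""
    total = sum(c for row in matrix.values() for c in row.values())
    word_tot = {w: sum(row.values()) for w, row in matrix.items()}
    ctx_tot = Counter()
    for row in matrix.values():
        ctx_tot.update(row)
    weighted = {}
    for w, row in matrix.items():
        weighted[w] = {}
        for ctx, count in row.items():
            score = math.log((count / total) /
                             ((word_tot[w] / total) * (ctx_tot[ctx] / total)))
            weighted[w][ctx] = max(score, 0.0)   # assumption: clip negatives
    return weighted

def cosine(a, b):
    """Cosine similarity between two sparse context vectors (Equation 2)."""
    dot = sum(v * b.get(ctx, 0.0) for ctx, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

Ranking the 100 words most closely related to a query q then amounts to computing cosine between its weighted vector and every other word’s vector and keeping the top 100.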

4 Supervised Learning of Context Weights

In this section we describe how a weight for each context is learned. For this we need training data, so we turn to Roget’s Thesaurus for lists of known related and unrelated words.

4.1 Roget’s Thesaurus

Roget’s Thesaurus is a nine-level hierarchical thesaurus. The levels, from top to bottom, are Class → Section → Sub-Section → Head Group → Head → Part of Speech → Paragraph → Semicolon Group → Words/Phrases. The earliest published versions of Roget’s date from the 1850s, but it has been under constant revision: new editions are released every few years. We use two versions of Roget’s. Open Roget’s [24] is a publicly available Java implementation intended for use in NLP research, built on Roget’s data from 1911 (rogets.site.uottawa.ca). The second version is proprietary, based on data from the 1987 edition [25]. Generally we prefer to work with public-domain resources; still, the 1987 Roget’s Thesaurus gives us an opportunity to see how a newer and larger resource compares with an older and smaller one.

Roget’s contains a variety of words and phrases divided into four main parts of speech: Nouns, Verbs, Adjectives and Adverbs. In our experiments we work only with Nouns. The main concepts in Roget’s are often considered to be represented by the Heads, of which there are usually about 1000. The division into parts of speech occurs between the Head and the Paragraph, so that each main concept (Head) contains words in different parts of speech. The smallest grouping in Roget’s is the Semicolon Group (SG), while the next smallest is the Paragraph. SGs group together near-synonyms, while Paragraphs tend to contain somewhat more loosely related words. An example of some of the Noun SGs and Paragraphs from the Head for “Language” can be seen in Figure 1. Each SG is delimited by a semicolon, while Paragraphs start with an italicized word/phrase and end in a period.

Our evaluation requires information from the SGs and Paragraphs in Roget’s. Table 1 shows the statistics of those groupings: the counts of Noun Paragraphs and SGs, their average sizes in words, and the total count of all Nouns. The latter includes duplicates when a noun appears in two or more SGs. A phrase counts as a single word, although the individual words inside it could be used as well. The 1911 Roget’s has more Paragraphs, but the 1987 version has more SGs, more words and a higher average number of words in each grouping. The 1987 Thesaurus should be better for evaluation: it simply has more labeled data.


language; phraseology; speech; tongue, lingo, vernacular; mother tongue, vulgar tongue, native tongue; household words; King’s English, Queen’s English; dialect.
confusion of tongues, Babel, pasigraphie; pantomime; onomatopoeia; betacism, mimmation, myatism, nunnation; pasigraphy.
lexicology, philology, glossology, glottology; linguistics, chrestomathy; paleology, paleography; comparative grammar.

Fig. 1. Excerpt from the Head for “Language” in the 1911 Roget’s Thesaurus

Table 1. Counts of Semicolon Groups and Paragraphs, their average sizes, and all Nouns in Roget’s Thesaurus

Year   Para Count   Words per Para   SG Count   Words per SG   Noun Count
1911   4495         10.3             19215      2.4            46308
1987   2884         39.7             31174      3.7            114473

4.2 Supervised Weighting

We want to measure the association between a pair of words appearing in a context and a pair of words appearing in the same SG. For each context C, all the words w1..wn which appear in C are collected and all pairs of these words are recorded. C is a pair 〈r, w′〉, and each word wi in w1..wn appears in the triple 〈wi, r, w′〉 in the parsed Wikipedia. We then find in Roget’s all words in the same SG as each wi in w1..wn, and record these pairs. Only words also found in our word-context matrix are included in these counts. These groups of word pairs can be treated as events for which we measure the Pointwise Mutual Information, effectively giving the context C a score. Words which appear in our set of 500 test cases are not included when learning the weights of the contexts. To calculate the PMI, we count the following pairs of words wi, wj (C is a context):

– wi and wj are in the same SG and share C [True Positives (tp)];
– wi and wj are in different SGs and share C [False Positives (fp)];
– wi and wj are in the same SG and only one of them appears in C [False Negatives (fn)];
– wi and wj are in different SGs and only one of them appears in C [True Negatives (tn)].

We define the probability of an event x as P(x) = x/(tp + tn + fn + fp). Essentially we build a confusion matrix and from it calculate the probabilities. Next, we calculate the PMI for context C, effectively giving a score to this context:

\mathrm{score}(C) = \log\left(\frac{P(tp)}{P(tp + fp)\,P(tp + fn)}\right)    (3)

This is repeated for every context in our word-context matrix. Once all the scores have been generated, we can use them to re-weight the matrix: for every word wi which appears in a given context C, its count count(wi, C) is multiplied by score(C).

Calculating this number for all contexts is not trouble-free. For one, not all contexts will appear in the training data. To handle this, we normalize every score(C) calculated in Equation 3 so that the average score(C) is 1; next, we assume that any unseen context also has a weight of 1; finally, we multiply the count of context C by score(C) for every word in which C appears. A second problem is that PMI may give a negative score when the two events are less likely to occur together than by chance; in such situations we set score(C) to zero. A third problem is that the supervised PMI is often calculated with a fairly small number of true positives and false negatives, so it may be difficult to get a very reliable score. The unsupervised PMI matrix weighting, on the other hand, uses the distributions of a word and a context across the whole matrix, so it often has more data to work with. It may, then, be optimistic to think that supervised PMI will on its own outperform unsupervised PMI. The more interesting experiments will be to see the effects of combining supervised and unsupervised PMI MSRs.
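The following sketch shows how Equation 3 and the re-weighting step could be implemented; it assumes the tp/fp/fn/tn counts have already been collected per context, and that the scores are normalised outside the sketch so that their average is 1.

import math

def context_score(tp, fp, fn, tn):
    """Supervised PMI score of a context C (Equation 3), clipped at zero."""
    total = tp + fp + fn + tn
    if total == 0 or tp == 0:
        return 0.0
    p_tp = tp / total
    p_share_c = (tp + fp) / total        # pairs that share the context
    p_same_sg = (tp + fn) / total        # pairs in the same Semicolon Group
    return max(math.log(p_tp / (p_share_c * p_same_sg)), 0.0)

def reweight(matrix, scores):
    """Multiply count(w, C) by score(C); contexts unseen in training keep weight 1."""
    return {w: {ctx: cnt * scores.get(ctx, 1.0) for ctx, cnt in row.items()}
            for w, row in matrix.items()}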

4.3 Experiment Setup

The problem on which we evaluate our technique is that of ranking closely related words. We select a random set of 500 words found in our SuperMatrix matrix and in both the 1911 and 1987 Roget’s Thesaurus, from a possible set of 11,725. These 500 words were not used for the matrix weighting described in Section 4.2. For each of these words we use our MSRs to generate a ranked list of the 100 most closely related words in our matrix. These lists are evaluated for accuracy at various levels of recall using Roget’s Thesaurus as a gold standard; specifically, we measure the accuracy at the top 1, 5, 10, 20, 40 and 80 words. We take words from a list of the top 100, but not all of these 100 words will appear in Roget’s, so there are cases in which we cannot find all 40 or 80 words to perform our evaluation. In such cases we simply evaluate on all the usable words from that list of 100.

As shown in Table 1, the newer and larger 1987 version contains more words known to be semantically related than the 1911 version, so we use only it for evaluation. We measure accuracy at identifying words in the same SG and in the same Paragraph. This is done because, when adding new words to Roget’s, one may want to take advantage of both the closely related words (SG) and the more loosely related words (Paragraph).
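The accuracy measure can be sketched as follows; the handling of candidates missing from Roget’s reflects one reading of the setup described above, and the names are illustrative.

def accuracy_at_k(ranked, related, in_thesaurus, ks=(1, 5, 10, 20, 40, 80)):
    """Accuracy of a ranked candidate list against the thesaurus gold standard.

    related      -- gold-standard words in the same SG (or Paragraph) as the query
    in_thesaurus -- all words covered by the thesaurus; candidates outside it
                    are skipped, so fewer than k words may be evaluated"""
    judged = [w for w in ranked if w in in_thesaurus]
    scores = {}
    for k in ks:
        top = judged[:k]
        scores[k] = (sum(w in related for w in top) / len(top)) if top else 0.0
    return scores

# Hypothetical query with three judged candidates, two of them related.
print(accuracy_at_k(["funding", "tax", "agency"],
                    related={"funding", "tax"},
                    in_thesaurus={"funding", "tax", "agency"},
                    ks=(1, 2, 3)))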

In our evaluation we run six different MSRs. We use unsupervised cosine similarity and unsupervised PMI as low and high baselines. We also test cosine similarity when context weights are learned using the 1987 and the 1911 Roget’s Thesaurus; these MSRs are denoted 1987-Cosine and 1911-Cosine, and can be compared with the unsupervised PMI MSR. Finally, we attempt to combine the supervised and unsupervised matrix weighting. This is done by first applying the weighting learned through supervision to the word-context matrix and then using the unsupervised PMI MSR on that matrix, once again for both versions of Roget’s. These MSRs are denoted 1987-PMI and 1911-PMI. Although this may not seem intuitive, it is not so different from the RWF measures, in that two ranking methods are combined. Sample lists generated with two of these measures, 1911-Cosine and PMI, appear in Figure 2.

1911-Cosine – backbencher (0.715), spending (0.657), bureaucracy (0.645), funding (0.619), agency (0.616)
PMI – incentive (0.200), funding (0.192), tax (0.187), tariff (0.180), payment (0.176)

Fig. 2. The top 5 words related to “Subsidy”, with their similarity scores using the supervised 1911-Cosine MSR and unsupervised PMI

5 Experiment Results

We evaluate our new supervised MSRs as well as the unsupervised MSRs on two kinds of problems. In the first, we evaluate the ranked lists by calculating their accuracy at finding words in the same SG; in the second, by determining accuracy at finding words in the same Paragraph.

5.1 Ranking Words by Semicolon Group

We count the number of words found to be in the same SG and those known to be in different SGs in Roget’s Thesaurus. From this we calculate the accuracy of each MSR for the top 1, 5, 10, 20, 40 and 80 related words – see Table 2. In evaluating our results, we broke the data into 25 sets of 20 lists and performed Student’s t-test to measure statistical significance at p < 0.05. The numbers are in bold when a supervised MSR shows a statistically significant improvement over its unsupervised counterpart.

Table 2. Evaluation results for identifying related words in the same Semicolon Groups

Measure        Top 1   Top 5   Top 10   Top 20   Top 40   Top 80
Cosine         .110    .070    .052     .039     .031     .024
PMI            .368    .243    .188     .136     .100     .072
1987-Cosine    .146    .092    .071     .055     .042     .034
1987-PMI       .378    .240    .187     .136     .101     .073
1911-Cosine    .146    .097    .073     .055     .042     .034
1911-PMI       .372    .242    .189     .138     .100     .073

Our lower baseline MSR – cosine similarity – does quite poorly. In comparison, 1987-Cosine and 1911-Cosine give a relative improvement of 30–40%. Supervised learning of context weights using PMI improved the cosine similarity MSR by a statistically significant margin in all cases. Surprisingly, in a number of cases 1911-Cosine performs slightly better than 1987-Cosine. Figure 2 may suggest why supervised PMI did worse than unsupervised PMI: the latter tended to retrieve closer synonyms, while the former selected many other related words.

Supervised matrix weighting with PMI (1911-Cosine and 1987-Cosine) did not work as well as unsupervised matrix weighting with PMI; as noted in Section 4, this is not entirely unexpected. Combining the supervised and unsupervised PMI-weighted methods does in some cases show an advantage: 1987-PMI and 1911-PMI showed a statistically significant improvement only when the top 40 and top 20 words were counted, respectively. That said, in a few cases combining these measures actually hurt results, although never in a statistically significant manner; most often results improved slightly. It is easier to show a change to be statistically significant as more related words are considered, because the accuracy estimate becomes more reliable. This is tested further when we evaluate on Paragraphs rather than on SGs.
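The significance test can be sketched as below, assuming a paired Student’s t-test over the 25 per-fold accuracies; the paper does not state whether its t-test was paired, so this is one natural reading.

from scipy import stats

def significant_improvement(acc_new, acc_baseline, alpha=0.05):
    """Paired t-test over per-fold accuracies (e.g. 25 folds of 20 query words).

    Returns True when the new MSR's mean accuracy is significantly higher
    than the baseline's at significance level alpha."""
    t, p = stats.ttest_rel(acc_new, acc_baseline)
    return t > 0 and p < alpha

# Hypothetical per-fold accuracies for four folds.
print(significant_improvement([0.15, 0.14, 0.16, 0.15], [0.11, 0.10, 0.12, 0.11]))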

5.2 Ranking Words by Paragraph

The experiments from Section 5.1 are repeated on Paragraphs – see Table 3. Obviously accuracy at all levels of recall is higher in this evaluation, because there are far more related words in the same Paragraph than in the same SG. Another interesting observation is that the improvement from combining supervised and unsupervised PMI matrix weighting was statistically significant much more often. 1987-PMI showed a statistically significant improvement over PMI when the top 20 or more closest words were used in evaluation. For 1911-PMI the improvement was statistically significant for the top 10 or more closest words. We found improvements of up to 3% when mixing the supervised and unsupervised matrix weighting.

Table 3. Evaluation results for identifying related words in the same Paragraphs

Measure        Top 1   Top 5   Top 10   Top 20   Top 40   Top 80
Cosine         .256    .206    .173     .148     .127     .110
PMI            .624    .524    .466     .401     .345     .287
1987-Cosine    .298    .240    .208     .180     .157     .138
1987-PMI       .644    .523    .470     .406     .349     .291
1911-Cosine    .296    .240    .209     .182     .160     .141
1911-PMI       .640    .533    .478     .416     .352     .295

Once again, the MSRs trained on the 1911 Roget’s often performed better than those trained on the 1987 version. It is easier to show statistically significant improvements for Paragraphs than for SGs, because the number of positive candidates is higher. The data in Table 1 suggest that a word may have only a few other words in the same SG, while it will often have dozens of words in the same Paragraph. As a result, when we perform a t-test, each fold contains many more positive examples and so gives a better estimate of how much incorporating supervised weighting actually improves these MSRs.


5.3 Possible New Word Senses

We have not taken into account the possibility that new or missing senses of words are being discovered. If we look at the highest-ranked word in each list of candidates, we often find that it appears to be closely related, but Roget’s sometimes labels it as not belonging in that Paragraph or SG. The following are a few of the more obvious examples of closely related words which did not appear in the same Paragraph: invader – invasion; infant – newborns; mafia – mob; evacuation – airlift. Although not all the candidates labeled as unrelated are as closely related as these pairs, it appears clear that the accuracies we report should be considered lower bounds on the actual accuracy.

6 Analysis and Conclusion

We have clearly shown that supervised weighting of word-context matrices is a significant improvement over unweighted cosine similarity. Our method of supervised weighting of word-context matrices with PMI was not as effective as unsupervised term weighting with PMI. We found, however, that combining supervised and unsupervised matrix weighting schemes often showed a statistically significant improvement. This was particularly the case when identifying more loosely related words in the same Paragraph, rather than limiting related words to the same SG. Combining supervised and unsupervised learning never hurt the results in a statistically significant manner. There simply are not enough words in the average SG to prove that incorporating supervised training helps the PMI MSR; this is supported by the fact that when enough data is used, at the top 10-20 related words, the evaluation on Paragraphs does show a statistically significant improvement.

One surprise was that weighting the word-context matrix with the 1911 Roget’s Thesaurus often performed slightly better than its counterpart weighted with the 1987 version. This is difficult to explain, but the differences between the two trained systems tended to be quite small. It does suggest that the 1911 version of Roget’s provides sufficient data for weighting these contexts despite its smaller size. This is particularly good news, because the 1987 version is not publicly available, while the 1911 version is.

6.1 Future Work

The long-term motivation for this work is automatic updating of Roget’s Thesaurus with new words. The results we present here suggest that the first step toward that goal has been successful. Next, ranked lists will be used to determine which SGs and Paragraphs are good candidate locations for a word to be added.

We applied two versions of Roget’s Thesaurus to train our system, but it is quite possible to use other resources, including WordNet. It is also possible to use functions other than PMI for learning matrix weights; likelihood ratio tests are known to work well on rare events and should be considered [26]. Finally, let us note that we have only used our supervised matrix weighting technique to enhance the Cosine similarity and PMI MSRs. Many other measures are available via SuperMatrix, and there are other resources to which supervised matrix weighting could be applied.

Acknowledgments

Our research is supported by the Natural Sciences and Engineering Research Council of Canada and the University of Ottawa.

References

1. Hagiwara, M., Ogawa, Y., Toyama, K.: Supervised synonym acquisition using distributional features and syntactic patterns. Journal of Natural Language Processing 16, 59–83 (2005)
2. Broda, B., Jaworski, D., Piasecki, M.: Parallel, Massive Processing in SuperMatrix – a General Tool for Distributional Semantic Analysis of Corpus. In: Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 373–379 (2010)
3. Snow, R., Jurafsky, D., Ng, A.Y.: Semantic Taxonomy Induction from Heterogenous Evidence. In: Proceedings of COLING/ACL 2006, Sydney, Australia (2006)
4. Fellbaum, C. (ed.): WordNet: an Electronic Lexical Database. MIT Press, Cambridge (1998)
5. Turney, P.D., Pantel, P.: From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141–188 (2010)
6. Crouch, C.J.: A Cluster-Based Approach to Thesaurus Construction. In: SIGIR 1988: Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 309–320. ACM, New York (1988)
7. Ruge, G.: Automatic Detection of Thesaurus Relations for Information Retrieval Applications. In: Foundations of Computer Science: Potential – Theory – Cognition, to Wilfried Brauer on the Occasion of his Sixtieth Birthday, pp. 499–506. Springer, London (1997)
8. Lin, D.: Automatic Retrieval and Clustering of Similar Words. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 768–774. Association for Computational Linguistics, Morristown (1998)
9. Curran, J.R., Moens, M.: Improvements in Automatic Thesaurus Extraction. In: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pp. 59–66 (2002)
10. Yang, D., Powers, D.M.: Automatic Thesaurus Construction. In: Dobbie, G., Mans, B. (eds.) Thirty-First Australasian Computer Science Conference (ACSC 2008). CRPIT, vol. 74, pp. 147–156. ACS, Wollongong (2008)
11. Rychly, P., Kilgarriff, A.: An Efficient Algorithm for Building a Distributional Thesaurus (and other Sketch Engine Developments). In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 41–44. Association for Computational Linguistics, Prague (2007)
12. Weeds, J., Weir, D.: Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity. Computational Linguistics 31(4), 439–475 (2005)
13. Yih, W.-t.: Learning Term-Weighting Functions for Similarity Measures. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, vol. 2, pp. 793–802. Association for Computational Linguistics, Morristown (2009)
14. Hajishirzi, H., Yih, W.-t., Kolcz, A.: Adaptive Near-Duplicate Detection via Similarity Learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 419–426. ACM, New York (2010)
15. Connor, M., Roth, D.: Context Sensitive Paraphrasing with a Global Unsupervised Classifier. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenic, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 104–115. Springer, Heidelberg (2007)
16. Turney, P., Littman, M.: Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus. NRC Technical Report ERB-1094, Institute for Information Technology, National Research Council Canada (2002)
17. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised Named-Entity Extraction from the Web: an Experimental Study. Artificial Intelligence 165, 91–134 (2005)
18. Kang, I.H., Kim, G.: Query Type Classification for Web Document Retrieval. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2003, pp. 64–71. ACM, New York (2003)
19. Pantel, P.A.: Clustering by Committee. PhD thesis, University of Alberta (2003)
20. Evert, S.: The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, Universität Stuttgart (2004)
21. Piasecki, M., Szpakowicz, S., Broda, B.: Automatic Selection of Heterogeneous Syntactic Features in Semantic Similarity of Polish Nouns. In: Matousek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 99–106. Springer, Heidelberg (2007)
22. Broda, B., Derwojedowa, M., Piasecki, M., Szpakowicz, S.: Corpus-based Semantic Relatedness for the Construction of Polish WordNet. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), Marrakech (2008)
23. Lin, D.: Dependency-Based Evaluation of MINIPAR. In: Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation (1998)
24. Kennedy, A., Szpakowicz, S.: Evaluating Roget’s Thesauri. In: Proceedings of ACL 2008: HLT, pp. 416–424. Association for Computational Linguistics, Morristown (2008)
25. Kirkpatrick, B. (ed.): Roget’s Thesaurus of English Words and Phrases. Longman, Harlow (1987)
26. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 61–74 (1993)


Anomaly-Based Network Intrusion Detection Using Outlier Subspace Analysis: A Case Study

David Kershaw¹, Qigang Gao¹, and Hai Wang²

¹ Faculty of Computer Science, Dalhousie University
{kershaw,qggao}@cs.dal.ca
² Sobey School of Business, St. Mary’s University
[email protected]

Abstract. This paper employs SPOT (Stream Projected Outlier deTector) as a prototype system for anomaly-based intrusion detection and evaluates its performance against other major methods. SPOT is capable of processing high-dimensional data streams and detecting novel attacks which exhibit abnormal behavior, making it a good candidate for network intrusion detection. This paper demonstrates that SPOT is effective at distinguishing between normal and abnormal processes in a UNIX system call dataset.

1 Introduction

Intrusion detection is a field of study which focuses on detecting unwanted behaviours in a computer network. As networked computers are increasingly used to store sensitive materials, the demand for secure networks has intensified, yet preventing unwanted behaviours in network transactions remains difficult. There are two major strategies for intrusion detection systems (IDS): misuse-based detection and anomaly-based detection. Misuse-based detection uses well-studied patterns as indicators of intrusive activities; it is also known as signature-based detection, since a predefined set of signatures is used as an outline of network attacks. Anomaly-based detection differs from this approach: it attempts to model a system’s normal behaviour and then statistically determine whether new actions fall within the normal range, with behaviour outside this range considered possibly harmful. The advantages of anomaly-based network intrusion detection have led to research efforts in recent years to find effective anomaly-based methods for protecting security data, which is often high-dimensional and streaming. SPOT (Stream Projected Outlier deTector) was recently proposed to handle this challenge [8]. This paper applies SPOT [8] as a prototype IDS, taking advantage of its ability to handle high-dimensional streaming data, and compares its results with another well-known IDS, STIDE (Sequence-based Intrusion Detection Method) [2]. It is posited that SPOT will perform as well as existing methods and contribute to a unique problem domain: finding abnormalities in high-dimensional streaming data. SPOT is tested against the UNIX system call dataset from the University of New Mexico (UNM) [2]. The rest of the paper is organized as follows: first, an overview of anomaly-based intrusion detection is presented through a discussion of some influential papers; second, a summary of the methodology of SPOT; third, the experimental setup and results; and last, a brief conclusion discussing the findings and ideas for future research.

2 Anomaly Based Intrusion Detection on UNIX System Calls

Protecting computer systems from intruders has been the goal of much previous research. To protect properly against threats, an intrusion detection system (IDS) is normally configured. An IDS is a software and/or hardware system intended to protect against (usually external) penetration. Vulnerabilities are continually being found in desktop software residing on a host system. By attempting to detect abnormal system behaviour via system calls, we can hope to augment a host’s current security level. First, two relevant studies are reviewed.

In [2], Forrest et al. defined a unique method of modeling the properties of natural immune systems by analyzing short series of UNIX system calls in common processes. The authors posited that these series can define a ‘self’, where abnormalities are detected as previously unseen sequences, i.e. changes in behaviour. These definitions must be flexible enough to allow legitimate activities, such as patching and program installation, while detecting abnormal, possibly harmful activities. The set of system-call combinations a program exhibits is consistently small, allowing a program’s self-definition to be identified. A two-step process is proposed: first, build a database of normal behaviour by scanning traces and recording the observed sequences; second, examine new traces which could contain new patterns representing abnormal behaviour. The authors conclude that their method has the potential to operate as an effective online immunity system (STIDE).
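The sequence-enumeration idea can be sketched in a few lines of Python; this is an illustration of the approach, not the original STIDE implementation.

def normal_database(traces, window=6):
    """Collect every system-call sequence of length `window` seen in normal runs."""
    db = set()
    for trace in traces:                              # one trace per process
        for i in range(len(trace) - window + 1):
            db.add(tuple(trace[i:i + window]))
    return db

def unseen_sequences(trace, db, window=6):
    """Sequences in a new trace that never occurred during training."""
    return [tuple(trace[i:i + window])
            for i in range(len(trace) - window + 1)
            if tuple(trace[i:i + window]) not in db]

# Hypothetical traces of system-call numbers (window shortened for the example).
db = normal_database([[5, 3, 5, 6, 4, 5, 3, 5]], window=3)
print(unseen_sequences([5, 3, 5, 9, 4, 5], db, window=3))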

In [3], Forrest et al. examine four data models used for intrusion detection in system call data. Their methods originate from four different ideologies: 1) Enumeration of Sequences, 2) Frequency-Based Methods, 3) Data Mining Methods, and 4) Finite State Machines, with the goal of comparing their false positive and false negative rates relative to each ideology’s complexity. This study uses live, real-world traces for its datasets. The authors conclude that variations in detection are more indicative of differences between data streams than of an analysis method’s complexity.

Our research attempts to augment these studies by examining the UNM data through subspace analysis using SPOT. SPOT employs methods designed to accommodate real-time, high-dimensional data streams, detecting data in outlying subspaces which could indicate network attacks. Using this method, it is possible to create a normal profile of privileged UNIX processes, detect abnormalities which could indicate potential attacks, and compare results with STIDE.


3 SPOT for Outlier Detection in High Dimensional Data Streams

3.1 Methodology of SPOT

SPOT is an outlier detection approach capable of quickly processing high-dimensional streaming data, which makes it a strong candidate for a prototype anomaly-based network IDS [8]. SPOT addresses two major challenges: 1) finding the outlying subspaces which house the projected outliers and 2) effectively analyzing streaming data. The first challenge results from the exponential growth of the constructed hypercube, which is infeasible to inspect in real time. The second challenge arises because streaming data arrives in order and can be processed only once. SPOT makes three major contributions to the domain of high-dimensional outlier detection in data streams:

1. It employs a window-based time model and decaying data summaries which allow the streaming data to be processed promptly and effectively.

2. It builds a Sparse Subspace Template (SST), a group of subspaces which are used to detect outliers.

3. It employs a multi-objective genetic algorithm (MOGA) to produce the subspaces used in the construction of the SST.

Fig. 1. Architecture of SPOT [8]

Fig. 1 shows the architecture of SPOT. SPOT can be used for offline learning and online learning. In offline learning, the SST is created; it comprises the set of subspaces most likely to contain projected outliers. Its components are the Fixed SST Subspaces (FS), the Unsupervised SST Subspaces (US) and the Supervised SST Subspaces (SS). The FS comprises all the available subspaces up to a maximum set cardinality. The US uses MOGA to search for the set of subspaces most likely to contain the highest number of projected outliers. The SS is a place for domain experts to put subspaces already considered outlying based upon existing domain knowledge, allowing flexibility by directing SPOT and focusing the search. In the online learning stage, the incoming data is processed and the SST is updated appropriately.

3.2 SPOT Based Network Intrusion Detection System

An anomaly-based IDS’s function is to model a system’s normal behaviour effectively and detect deviating activities. Here, SPOT and STIDE are implemented as IDSs and attempt to model the execution of UNIX processes. A model of normal behaviour is established using the set of system calls a UNIX process generates during its execution. This model is further defined by including order, i.e. by taking sequences of system calls of a certain length. This length is known as the window length and is often set to six; however, further research indicates that an optimal window length can be determined [6]. A process’s normal execution generates a specific set of system call combinations. Using this collection, a process’s normal behaviour can be modelled, and abnormal behaviour is identified by any sequence outside the normal set. The raw datasets provided by UNM are in the format pid, syscall, where pid = process id and syscall = system call number. The system call number represents a unique UNIX system call (e.g. 5 = open). For SPOT, the data must be converted into a multi-dimensional, vector-like format. The number of dimensions in the converted data depends on the size of the chosen sliding window.

The window is reset once the process id switches, preserving the sequence information. SPOT is able to map this dataset to a multidimensional space and create a normal profile through the relationships between subspaces. Each sequence is mapped, populating the cells of a hypercube. Each subspace’s density is measured by the amount of data mapped to it. Therefore, the hypercube and its associated subspace densities represent the profile of normal behaviour.
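A minimal sketch of this conversion is shown below; the record format follows the description above, while the function itself is an illustration rather than SPOT’s actual preprocessing code.

def to_window_vectors(records, window=6):
    """Turn (pid, syscall) records into fixed-length sliding-window vectors.

    The window is reset whenever the process id changes, so no vector spans
    two processes; each vector has `window` dimensions and can be mapped
    into SPOT's multi-dimensional space."""
    vectors, current, last_pid = [], [], None
    for pid, syscall in records:                 # records arrive in stream order
        if pid != last_pid:                      # process id switched: reset window
            current, last_pid = [], pid
        current.append(syscall)
        if len(current) >= window:
            vectors.append(current[-window:])
    return vectors

# Hypothetical stream of (pid, syscall) pairs, e.g. 5 = open.
stream = [(101, 5), (101, 3), (101, 5), (101, 6), (102, 5), (102, 4)]
print(to_window_vectors(stream, window=3))       # only pid 101 yields vectors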

4 Experiment and Results

The UNM datasets have been widely used for testing the effectiveness of different algorithms [3]. The same datasets are employed here to evaluate SPOT’s effectiveness on ordered system call data, with results then compared to STIDE. The data is split into training and testing sets: the training data consists solely of normal traces, while the test data consists of a mix of normal and intrusive traces. The intrusive traces represent both real and simulated attacks which were injected into the data. A detector window length of six was chosen, as previous research has shown that a detector window of at least six is necessary for the detection of anomalies [6]. The first step of offline learning is the construction of the SST. Once the SST is fully constructed, the detection stage begins. During testing, any arriving data point mapped to a cell with low density receives a high abnormality value. SPOT places these data points in an “Outlier Repository”, where a user can inspect their outlierness. With an outlierness number associated with each trace, a comparison can be made between normal and intrusive traces.

Forrest et al. identified two measurements for evaluating their data modeling methods: the false positive percentage, the ratio of normal data classified as anomalous to the total normal data used in testing, and the true positive percentage, the ratio of detected intrusions to all intrusions in the test data [3]. These measurements are used as the benchmark for system evaluation. For a trace to be considered anomalous, at least one sequence per thousand must be classified anomalous. SPOT’s sensitivity threshold is a value for the outlierness score: if a data point exceeds this value, it is determined anomalous. In STIDE’s case, the threshold value is the LFC size, or locality frame count size, the size of the frame STIDE employs to detect local anomalies.

Fig. 2. Overall Average True Positive Results for STIDE and SPOT across all datasets

The results from all datasets are displayed for both true and false positive rates in Fig. 2; the figure represents averages across all datasets. The x-axis shows the thresholds for each data modelling method, while the y-axis shows the percentages. One panel of Figure 2 compares STIDE’s and SPOT’s true positive rates. It shows that STIDE reaches maximum effectiveness at a threshold of 2 and then tapers slowly upwards. SPOT, meanwhile, reaches maximum effectiveness around a threshold of 10-12, surpassing STIDE’s maximum, but is slow to gain ground on STIDE initially. The other panel of Figure 2 compares the false positive rates. Initially, STIDE’s and SPOT’s percentages are comparable; however, around a threshold of 10, STIDE’s false positive rate jumps significantly upwards, while SPOT’s rate does not change considerably. Toward the end of the threshold range, SPOT’s rate trends upwards toward STIDE’s. Typically, a balance of true positive rate against false positive rate is required to determine which method performed best, depending upon the goal of the intrusion detection system.
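For clarity, the trace-level decision rule and the two rates can be sketched as follows; the threshold of one anomalous sequence per thousand follows the description above, while the function names are illustrative.

def trace_is_anomalous(sequence_flags, per_thousand=1):
    """Flag a trace when at least `per_thousand` of every 1000 sequences
    in it were individually classified as anomalous."""
    if not sequence_flags:
        return False
    return sum(sequence_flags) / len(sequence_flags) >= per_thousand / 1000.0

def tp_fp_rates(predictions, labels):
    """True/false positive percentages over test traces.

    predictions -- booleans from trace_is_anomalous
    labels      -- True for intrusive traces, False for normal traces"""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    intrusive = sum(labels)
    normal = len(labels) - intrusive
    tpr = 100.0 * tp / intrusive if intrusive else 0.0
    fpr = 100.0 * fp / normal if normal else 0.0
    return tpr, fpr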


5 Conclusion

Using the well-studied UNM dataset is an effective means of standardizing and comparing the implementation and results of a chosen data modeling method. SPOT is a well-suited data modeling tool for this domain. SPOT’s ability to handle high dimensionality allows any desired window size to be specified. Also, SPOT examines each n-dimensional subspace, allowing for the inspection of outlying subspaces within the entire window. This is a clear advantage over STIDE, which depends on a predetermined window size and does not allow additional flexibility during its operation. A further advantage is SPOT’s assignment of outlierness statistics to each data point (and subspace). By assigning a specific score, SPOT can discern the degree to which a data point is anomalous, in addition to the subspaces it inhabits. This allows users to specify thresholds for their individual systems. Finally, SPOT can be implemented online, which is necessary in this domain. As SPOT is a relatively new tool, its strengths and weaknesses are still being determined, and applying SPOT to different datasets will provide insight into its adaptability. Further research should focus on SPOT’s core ability to process streaming high-dimensional data in real time in domains where online examination of data is critical.

References

[1] Aggarwal, C.C., Yu, P.S.: Outlier Detection for High Dimensional Data. In: SIGMOD 2001 (2001)
[2] Forrest, S.A., Hofmeyr, S.A., Somayaji, A., Longstaff, T.A.: A Sense of Self for UNIX Processes. In: Proceedings of the 1996 IEEE Symposium on Security and Privacy, pp. 120–128 (2001)
[3] Forrest, S.A., Warrender, C., Perlmutter, B.: Detecting Intrusions Using System Calls: Alternative Data Models. In: Proceedings of the 1999 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 133–145 (1999)
[4] Garcia-Teodoro, P., Diaz-Verdejo, J., Macia-Fernandez, G., Vazquez, E.: Anomaly-Based Network Intrusion Detection: Techniques, Systems and Challenges. Computers and Security 28, 18–28 (2009)
[5] Symantec Global Internet Security Threat Report: Trends for 2008, Vol. XIV (2008) (published April 2009)
[6] Tan, K., Maxion, R.: Why 6? Defining the Operational Limits of Stide, an Anomaly-Based Intrusion Detector. In: Proceedings of the 1996 IEEE Symposium on Security and Privacy, pp. 133–145 (2002)
[7] Wenke, J., Salvatore, J.: Framework for Construction Features and Models for Intrusion Detection Systems. In: TISSEC, pp. 227–261 (2000)
[8] Zhang, J.: Towards Outlier Detection for High-Dimensional Data Streams Using Projected Outlier Analysis Strategy. PhD Thesis, Dalhousie University (2008)


Evaluation and Application of Scenario Based Design on Thunderbird

Bushra Khawaja and Lisa Fan

Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada S4S 0A2
{khawajab,fan}@cs.uregina.ca

Abstract. The scenario based design (SBD) approach has been widely used to improve the user interface (UI) design of interactive systems. In this paper, the effectiveness of the SBD approach is shown in improving the UI design of the Thunderbird email system. First, an empirical evaluation of the system was performed based on user comments. Then, a low-fidelity prototype of the modified interfaces was developed. Furthermore, the new design interfaces were evaluated using two evaluation methods: a) the GOMS keystroke level model was used to compare the efficiency of the two interfaces; b) a heuristic evaluation of the system was performed using Nielsen’s usability heuristics. The evaluation results show that the efficiency of accomplishing important and frequently discussed tasks is improved significantly. Applying the SBD approach to email systems is concluded to be a promising way to enhance usability.

Keywords: Scenario-based Design (SBD), Usability, GOMS Keystroke Level Model (KLM-GOMS), Email Systems, Thunderbird 3 (TB-3).

1 Introduction

The scenario based design (SBD) methodology is extensively used to improve the user interface (UI) design of interactive systems. In human-computer interaction (HCI), user comments from discussion forums are used to evaluate the UI design of systems. The discussions of real users are very useful, as they compare the strengths and weaknesses of the system with other similar systems. A user interaction scenario is a story about people and their activities [4]. A good scenario should include seven characteristic elements, i.e. setting, actors, task goals, plans, actions, events and evaluation [6]. Scenario based design is a user-centered approach developed by Rosson and Carroll over the last ten years. It is an iterative approach that follows the phases of writing problem, activity, information and interaction scenarios based on user interactions. After each phase, claims analysis is done to highlight the important design features and to analyze their positive and negative implications. When there is more than one design alternative, it helps to choose the best design. Real-world and technological metaphors are also brainstormed iteratively before each phase to generate innovative design ideas. Explaining the SBD approach in detail is beyond the scope of this paper; however, a couple of references might be helpful for interested readers [4], [5], [6].

During the last decade, the SBD approach has been applied to a number of systems. It was applied in the redesign of a hospital information system [2] and used to improve the design of a digital library of geographical resources [10]. In another work, the phases of SBD were followed to design an interface for a geospatial data infrastructure [1]. In a recent work, three scenario-based methods were applied to develop a web browser system [9]. Most of these works found the approach very effective for improving the user interface design of systems. Email systems face a number of usability problems in terms of efficiency and ease of use due to the increasing bulk of messages. To our knowledge, the SBD method has never been applied to an email system before, where usability should be the most important concern. This is what motivated us to apply the method to the TB email system.

This study highlights usability issues in the Thunderbird 3 email system using the SBD method. The rest of the paper is organized as follows: Section 2 briefly describes the research methodology and experiments by discussing UI design problems of the existing system and the suggested improvements, illustrated with interfaces. Section 3 discusses the implications for the new design using two evaluation methods, i.e. KLM-GOMS and heuristic evaluation. Lastly, Section 4 summarizes the work and states potential future work in this realm.

2 Methodology and Experiments

The discussion forums on the Mozilla Messaging webpage were used as data for writing scenarios [7]. The ‘ideas under construction’ categories and a few threads, such as message pane layout, new message, and address book in new tab, were studied in more detail. The messages that exhibited thoughtful suggestions about UI design and provided details for most of the elements of a scenario were used to write scenarios. A low-fidelity prototype of the modified interfaces was developed. Providing detailed scenarios is beyond the space limitation of this paper; however, the improved features are listed briefly in Table 1 with their problems in the existing design and the suggested solutions in the new design. To better explain the new design ideas, the interfaces for each improved feature are shown in three figures: the first three features in Figure 1, the next three in Figure 2 and the last two in Figure 3.

Fig. 1. Compact header (top most), attachment bar (centre), blocked content bar (lower most)


Table 1. Considered features with existing design problems and suggested solutions

1. Header (Figure 1, top most)
   Expanded header – Existing design: expanded by default; an add-on must be installed to compact it. New design: display a compact header by default, with a ‘+’ icon to expand.
   Compact header – Existing design: subject on the left (may be blank), sender’s address on the right beside the date. New design: sender on the left, important tools on the right, i.e. ‘Reply, Forward, Junk and Delete’.

2. Attachment bar (Figure 1, centre)
   Existing design: wasted space; the whole list of attachments is displayed regardless of the number of attachments. New design: left and right scroll arrows when there are more than 5 attachments, and a ‘More’ link with a number indicating how many attachments are not shown.

3. Blocked content bar (Figure 1, lower most)
   Existing design: wide, with wasted space; a lengthy link to always load content sits on a second line. New design: narrowed to one line, with ‘always show content’ as a button on the same line.

4. New folder creation (Figure 2, centre)
   Existing design: time consuming; ‘New Folder’ and ‘New Subfolder’ options are provided in the right-click menu. New design: ‘New Folder’ is provided under the Inbox; a pop-up window asks ‘Create as a subfolder of:’, set to ‘Inbox’ by default.

5. Folder hierarchy (Figure 2, left most)
   Existing design: confusing; the default folder for the main view is named ‘Local Folders’. New design: ‘Local Folders’ is changed to ‘Thunderbird’.
   Existing design: lengthy name for the account folder, i.e. the email address, which creates confusion. New design: shortened to username-account server name, e.g. Dennis-Hotmail.

6. Open message options (Figure 2, right most)
   Existing design: time consuming and provided in an irrelevant menu, i.e. Tools -> Options -> Advanced -> Reading & Display. New design: a menu with ‘Open Message’ options is provided in the main tools, along with an information bubble to explain that the options work by double clicking.

7. Tabbed interface (Figure 3)
   Existing design: a new tab is difficult to notice, as the tab bar with the ‘Inbox’ tab is already present on the main page and always stays there. New design: ‘Inbox’ is made a main tool instead of a tab; double clicking a message for the first time pops up a tab bar with a highlighted tab, which makes it more noticeable.
   Existing design: inconsistent; messages open in tabs, while composing messages and the address book open in new windows. New design: composing messages and the address book open in tabs by default.

8. Listing and closing all tabs (Figure 3)
   Existing design: an unclear icon for listing all tabs. New design: a clear, prominent icon at the right-most end of the tab bar.
   Existing design: closing tabs is time consuming; a confusing ‘Close other tabs’ option is provided in the right-click menu of the ‘x’ button, and there is no option to close all tabs together. New design: a prominent red ‘x’ button to close all tabs at once, plus a menu button with the options ‘Close Tab’, ‘Close Other Tabs’ and ‘Close All Tabs’.



Fig. 2. Folder column pane (left most), new folder creation interface (centre), open message menu (right most) i.e. New Tab, New Window, Existing Windows, Existing Tab, Conversation

Fig. 3. Tabbed interface: tab bar shows the control buttons on each tab, Close and List All Tabs options on the right most of tab bar

3 Results and Implications for New Design

This section presents the results of the two evaluation methods: a) KLM-GOMS is used to compare the time required to perform tasks involving the features discussed above using the existing and new design interfaces; b) Nielsen’s heuristic evaluation is performed to measure the usability of the new design ideas.

GOMS Keystroke Level Model (KLM-GOMS). The time chart for the general operators provided by Card et al. [3] is used for the calculations. Table 2 shows the calculated total time (in seconds) required to accomplish five tasks using the existing and new design interfaces; an illustrative operator-time calculation appears after the discussion of Table 2.

Table 2. Comparison of total task accomplishment time using KLM-GOMS

Task                                            Existing Design   New Design   Percentage Improvement
1. Replying using ‘Reply’ from header bar       9.70 secs         3.05 secs    68%
2. Creating new folder named ‘Friends’          13.25 secs        9.25 secs    30%
3. Changing message opening settings            14.95 secs        5.70 secs    62%
4. Writing a message & locating address book    15.05 secs        5.70 secs    62%
5. Comparing two emails’ contents               36.90 secs        23.3 secs    37%


In Table 2, it can be clearly seen from the comparative task accomplishment values that the time required to accomplish tasks using the new design is far less than with the existing design, with an average of 52% less time required for these five tasks. Also, most of the tools provided at hand in the new design improve flexibility and reduce mental effort significantly; for instance, the open message, new folder and compact header tools, the Inbox in the main tools, and the control buttons on tabs.
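Purely to illustrate how KLM-GOMS totals are obtained, the sketch below sums operator times for a hypothetical operator sequence; the operator values are approximate textbook figures, not the exact time chart or operator sequences used for Table 2.

# Approximate KLM operator times in seconds (K = keystroke/click, P = point
# with mouse, H = home hands on a device, M = mental preparation).
OPERATOR_TIME = {"K": 0.20, "P": 1.10, "H": 0.40, "M": 1.35}

def klm_time(operators):
    """Total predicted task time for a sequence of KLM operators."""
    return sum(OPERATOR_TIME[op] for op in operators)

# Hypothetical sequences: several menu steps in the old design versus a
# single click on 'Reply' in the new compact header.
print(round(klm_time("MPKMPKMPK"), 2))   # old design
print(round(klm_time("MPK"), 2))         # new design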

Heuristic Evaluation: Nielsen’s Usability Heuristics. The usability heuristics provided by Jakob Nielsen [8] are very useful for evaluating the UI design of interactive systems; they are general rules of thumb for evaluating usability. The heuristics were kept in mind during the entire study and were later used to evaluate and illustrate the positive implications of the suggested design. A few important usability heuristics are discussed briefly below for the features described in Section 2.

• Visibility of System Status: All the system messages in TB-3 appear on the screen just for a few seconds that keep the users uninformed about what is going on. These messages are suggested to stay there for at least 20 seconds. For example, the messages that are shown while creating folder and downloading messages.

• User Control and Freedom: Control buttons on tabs make a great difference to tabbing in email systems. As seen in Table 2, task 5, they improve efficiency and give the user a feeling of control. So, email systems should support control buttons on tabs.

• Consistency and Standards: The tabbed interface is made consistent throughout the system in the new design by suggesting that message composition and the address book should open in tabs. Standards such as showing the sender name on the header instead of the subject, and the ‘New Folder’ option in the folder column pane, are followed.

• Error Prevention: Rather than requiring users to try the confusing ‘Local Folders’ option to create a new folder, it is renamed to ‘Thunderbird’ to prevent errors, and a clear ‘New Folder’ option is provided in the folder column for creating a folder.

• Recognition Rather than Recall: To change the default settings for opening messages, users no longer have to memorize a sequence of options as in the existing design. Instead, this can be done using ‘Open Message’ from the main toolbar.

• Flexibility and Efficiency of Use: is improved significantly in the new design. For instance, the ‘Open Message’ and ‘New Folder’ options are provided at hand; ‘Inbox’ is provided in the main tools instead of as a tab (making it easier and more flexible to access at any time, such as while writing a message in a tab); and control buttons on each tab make it easier to switch between emails. Tools such as Reply and Forward provided on the compact header give users the affordance to perform tasks efficiently with much less mental effort.

4 Conclusion and Future Work

In this paper, we studied the effectiveness of using the scenario based design (SBD) approach to improve the user interface (UI) design of the Thunderbird email system. Following the phases of the SBD approach, UI design improvements are suggested for a few very important and frequently discussed features.


The comparative evaluation results using KLM-GOMS show that the new design significantly reduces the mental effort and time required to perform the important tasks. For instance, replying, composing messages, locating the address book, comparing emails and creating folders can be done more efficiently using the new design. Moreover, Nielsen’s heuristics, when used as rules of thumb while redesigning, can be very useful for improvement. The heuristic evaluation results show that the new UI design ideas satisfy most of Nielsen’s usability heuristics, which in turn enhances usability. The empirical evaluation of the two interfaces implies that the new design developed using the SBD approach provides a number of improvements in terms of flexibility, efficiency and ease of use. In the future, we plan to use the SBD approach to evaluate and compare the UI design of a couple of popular email systems. We also envision designing an email system that keeps the strengths of the existing systems and overcomes their weaknesses as much as possible.

References

1. Aditya, T., Ormeling, F.J., Kraak, M.J.: Advancing a National Atlas-Based Portal for Improved Use of a Geospatial Data Infrastructure: Applying Scenario-Based Development. In: Proceedings of Map Asia (2009)

2. Bardram, J.: Scenario-based Design of Cooperative Systems. In: Group Decision and Negotiation. LNCS, vol. 9(3), pp. 237–250. Springer, Heidelberg (1974)

3. Card, S.K., Moran, T.P., Newell, A.: The Psychology of Human-Computer Interaction. L. Erlbaum, Hillsdale (1983)

4. Carroll, J.M., Rosson, M.B.: Human-Computer Interaction Scenarios as a Design Representation. In: 23rd Hawaii International Conference on System Sciences, Software Track, pp. 555–561. IEEE Computer Society Press, Los Alamitos (1990)

5. Carroll, J.M.: Making Use: Scenario-Based Design of Human-Computer Interactions. MIT Press, Cambridge (2000)

6. Carroll, J.M., Rosson, M.B.: Usability Engineering: Scenario-Based Development of Human-Computer Interaction. Morgan Kaufmann, San Francisco (2002)

7. Community-powered Support for Mozilla Messaging (Online), http://getsatisfaction.com/mozilla_messaging (accessed: November 15, 2010)

8. Nielsen, J.: Ten Usability Heuristics (Online), http://www.useit.com/papers/heuristic/heuristic_list.html (accessed: January 15, 2010)

9. Petkovic, D., Raikundalia, G.K.: An Experience with Three Scenario-Based Methods: Evaluation and Comparison. International Journal of Computer Science and Network Security 9(1), 180–185 (2009)

10. Theng, Y.L., Goh, D.H., Lim, E.P., Liu, Z., Ming, Y., Pang, N.L.S., Wong, P.B.: Applying Scenario-based Design and Claims Analysis to the Design of a Digital Library of Geography Examination Resources. Information Processing and Management 41(1), 23–40 (2005)


Improving Phenotype Name Recognition

Maryam Khordad1, Robert E. Mercer1, and Peter Rogan1,2

1 Department of Computer Science, 2 Department of Biochemistry

The University of Western Ontario, London, ON, Canada
{mkhordad,progan}@uwo.ca, [email protected]

Abstract. Due to the rapidly increasing amount of biomedical literature, automatic processing of biomedical papers is extremely important. Named Entity Recognition (NER) in this type of writing has several difficulties. In this paper we present a system to find phenotype names in biomedical literature. The system is based on MetaMap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. From an initial basic system that uses only these preexisting tools, five rules that capture stylistic and linguistic properties of this type of literature are proposed to enhance the performance of our NER tool. The tool is tested on a small corpus and the results (precision 97.6% and recall 88.3%) demonstrate its performance.

1 Introduction

During the last decade biomedicine has developed tremendously. Every day many biomedical papers are published and a great amount of information is produced. Due to the large number of applications of biomedical data, the need for Natural Language Processing (NLP) systems to process this amount of new information is increasing. Current NLP systems try to extract different kinds of knowledge from the biomedical literature, such as protein–protein interactions [1] [2] [3] [4] [5], new hypotheses [6] [7] [8], relations between drugs, genes and cells [9] [10] [11], protein structure [12] [13] and protein function [14] [15]. In all of these applications, recognizing the biomedical objects, or Named Entity Recognition (NER), is a fundamental step and obviously affects the final result.

Over the past years it has turned out that finding the names of biomedical objects in the literature is a difficult task. Some problematic factors are: the existence of millions of entity names, a constantly growing number of entity names, the lack of naming agreement prior to a standard name being accepted, an extreme use of abbreviations, the use of numerous synonyms and homonyms, and the fact that some biological names are complex names that consist of many words, like “increased erythrocyte adenosine deaminase activity”. Even biologists do not agree on the boundaries of the names [16].

Named Entity Recognition in the biomedical domain has been extensively studied and, as a consequence, many methods have been proposed. Some methods like MetaMap [17] and mgrep [18] are generic methods and find all kinds of entities in the text. Some methods, however, are specialized to recognize particular types of entities like gene or protein names [13] [19], diseases and drugs [9] [20] [21], mutations [22] or properties of protein structures [13].

NER techniques are usually classified into three categories [16]. Dictionary-based techniques like [19] match phrases from the text against some existing dictionaries. Rule-based techniques like [23] make use of some rules to find entity names in the text. Machine learning techniques like [24] transform the NER task into a classification problem.

In this paper we focus on phenotype name recognition in biomedical literature. A phenotype is defined as the genetically determined observable characteristics of a cell or organism, including the result of any test that is not a direct test of the genotype [25]. The phenotype of an organism is determined by the interaction of its genetic constitution and the environment. Skin color, height and behavior are some examples of phenotypes. We are developing a system that uses existing databases (the UMLS Metathesaurus [26] and the Human Phenotype Ontology (HPO) [27]) to find phenotype names1. Our tool is based on MetaMap [17] to find name phrases and their semantic types. The tool uses these semantic types and some stylistic and linguistic rules to find human phenotype names in the text.

2 Phenotype Name Recognition

The last few years have seen a remarkable growth of NER techniques in the biomedical domain. However, these techniques tend to emphasize finding the names of genes, proteins, diseases and drugs. Although many specialized dictionaries are available, we are not aware of a dictionary that is both comprehensive and ideally suited for phenotype name recognition. For example, the Unified Medical Language System (UMLS) Metathesaurus [26] is a very large, multi-purpose, and multi-lingual vocabulary database that contains more than 1.8 million concepts. These concepts come from more than 100 source vocabularies. The Metathesaurus is linked to the other UMLS Knowledge Sources – the Semantic Network and the SPECIALIST Lexicon. All concepts in the Metathesaurus are assigned to at least one semantic type from the Semantic Network. However, the Semantic Network does not contain Phenotype as a semantic type, so it alone is not adequate to distinguish between phenotypes and other objects in text. In addition, some phenotype names do not exist in the UMLS Metathesaurus at all. The Online Mendelian Inheritance in Man (OMIM) [28] is the most important information source about human genes and genetic phenotypes [27]. Over five decades MIM and then OMIM have achieved great success, and OMIM is now used in the daily work of geneticists around the world. Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic features in its clinical synopsis section, which makes it inappropriate for data mining usage [27]. The Human Phenotype Ontology (HPO) [27] is an ontology that was developed using information from OMIM and is specifically related to human phenotypes. The HPO contains approximately 10,000 terms. Nevertheless, this ontology is not complete and we had several problems finding phenotype names in it. First, some acronyms and abbreviations are not available in the HPO. Second, although the HPO contains synonyms of phenotypes, there are still some synonyms that are not included in the HPO. For example, the HPO contains ENDOCRINE ABNORMALITY, but not ENDOCRINE DISORDER. Third, in some cases adjectives and other modifiers are added to phenotype names, making it difficult to find these phenotype names in the ontology. For example, ACUTE LEUKEMIA is in the HPO, but an automatic system would not suggest that ACUTE MYELOID LEUKEMIA is a phenotype simply by searching in the HPO. Fourth, new phenotypes are continuously being introduced to the biomedical world. The HPO is constantly being refined, corrected, and expanded manually, but this process is not fast enough, nor can the inclusion of new phenotypes be guaranteed.

1 This paper describes linguistic techniques to determine the sequence of words that is a descriptive phrase for a phenotype. A phenocopy is an environmental condition that mimics a phenotype and hence would have the same descriptive phrase as the phenotype name. We are not distinguishing between phenotype and phenocopy.

3 Background

3.1 Named Entity Recognition

Named entities are phrases that contain the names of people, companies, cities, etc., and specifically in biomedical text entities such as genes, proteins, diseases, drugs, or organisms. Consider the following sentence as an example:

– The RPS19 gene is involved in Diamond-Blackfan anemia.

There are two named entities in this sentence: RPS19 gene and Diamond-Blackfan anemia.

Named Entity Recognition (NER) is the task of finding references to known entities in natural language text. An NER technique may consist of some natural language processing methods like part-of-speech (POS) tagging and parsing.

Part-of-speech tagging is the process of assigning a part-of-speech or other syntactic class marker to each word in the text [29]. A part-of-speech is a linguistic category of words, such as noun, verb, adjective or preposition, which is generally defined by the syntactic or morphological behavior of the word.

Parsing is the process of syntactic analysis that recognizes the structure of sentences with respect to a given grammar. Using parsing we can find which groups of words are, for example, noun phrases and which ones are verb phrases. Complete and efficient parsing is beyond the capability of current parsers; shallow parsing is an alternative.

Shallow parsers decompose each sentence partially into some phrases and after that they find the local dependencies between phrases. They do not analyze the internal structure of phrases. Each phrase is tagged by one of a set of predefined grammatical tags such as Noun Phrase, Verb Phrase, Prepositional Phrase, Adverb Phrase, Subordinated Clause, Adjective Phrase, Conjunction Phrase, and List Marker [30].

An important syntactic concept that is applied in our tool is the head of a phrase. The head is the central word in a phrase that determines the syntactic role of the whole phrase. For example, in both phrases “low set ears” and “the ears”, ears is the head.

3.2 MetaMap

MetaMap [17] is a widely used program developed by the National Library of Medicine (NLM). MetaMap provides a link between biomedical text and the structured knowledge in the Unified Medical Language System (UMLS) Metathesaurus by mapping phrases in the text to concepts in the UMLS Metathesaurus. To achieve this goal it analyzes the input text in several lexical and semantic steps.

First, MetaMap tokenizes the input text. In the tokenization process the input text is broken into meaningful elements, like words. After part-of-speech tagging and shallow parsing using the SPECIALIST Lexicon, MetaMap has broken the text into phrases. Phrases undergo further analysis to allow mapping to UMLS concepts. Each phrase is mapped to a set of candidate concepts and scores are calculated that represent how well the phrase matches the candidates. An optional last step is word sense disambiguation (WSD), which chooses the best candidate with respect to the surrounding text [17].

MetaMap is configurable and there are options for the vocabularies and data models in use, the output format and the algorithmic computations. Human-readable output is one of the output formats. MetaMap’s human-readable output generated from the input text “at diagnosis.” in the sentence “The platelet and the white cell counts are usually normal but neutropenia, thrombopenia or thrombocytosis have been noted at diagnosis.” is shown in Fig. 1. As can be seen, MetaMap found 6 candidates for this phrase and finally, after WSD, it mapped the phrase to the “diagnosis aspect” concept. In UMLS each Metathesaurus concept is assigned to at least one semantic type. In Fig. 1 the semantic type of each concept is given in the brackets following it. Semantic types are categorized into groups that are subdomains of biomedicine such as Anatomy, Living Beings and Disorders [31]. These groups are called Semantic Groups (SG). Each semantic type belongs to one and only one SG.
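As an illustration of how the final mappings and their semantic types can be read off this human-readable format, the sketch below scans the Mappings section shown in Fig. 1 with a regular expression. It is keyed only to the layout of the figure; it is not MetaMap's API nor part of the authors' system.

```python
# Sketch: extract (score, concept, semantic type) triples from the Mappings
# section of MetaMap's human-readable output, as formatted in Fig. 1.
# Keyed to the figure's layout only; this is not MetaMap's API.
import re

MAPPING_LINE = re.compile(r"^\s*(\d+)\s+(.+?)\s+\[([^\]]+)\]\s*$")

def parse_mappings(output):
    in_mappings, triples = False, []
    for line in output.splitlines():
        if ">>>>> Mappings" in line:
            in_mappings = True
        elif "<<<<< Mappings" in line:
            in_mappings = False
        elif in_mappings:
            match = MAPPING_LINE.match(line)
            if match:
                score, concept, sem_type = match.groups()
                triples.append((int(score), concept, sem_type))
    return triples

example = """>>>>> Mappings
Meta Mapping (1000):
1000 diagnosis (diagnosis aspect) [Qualitative Concept]
<<<<< Mappings"""
print(parse_mappings(example))
# [(1000, 'diagnosis (diagnosis aspect)', 'Qualitative Concept')]
```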

3.3 Human Phenotype Ontology (HPO)

An ontology, defined in Artificial Intelligence and related areas, is a structured representation of knowledge in a domain. In fact, an ontology is a structure of concepts and the relationships among them. The Human Phenotype Ontology (HPO) [27] is an ontology that tries to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. The HPO was constructed using information initially obtained from the Online Mendelian Inheritance in Man (OMIM) [28], after which synonym terms were merged and the hierarchical structure was created between terms according to their semantics. The hierarchical structure in the HPO represents the subclass relationship. The HPO currently contains over 9500 terms describing phenotypic features.


Phrase: "at diagnosis.">>>>> Phrasediagnosis<<<<< Phrase>>>>> CandidatesMeta Candidates (6):1000 Diagnosis [Finding]1000 Diagnosis (Diagnosis:Impression/interpretation of study:

Point in time:^Patient:Narrative) [Clinical Attribute]1000 Diagnosis (Diagnosis:Impression/interpretation of study:

Point in time:^Patient:Nominal) [Clinical Attribute]1000 diagnosis (diagnosis aspect) [Qualitative Concept]1000 DIAGNOSIS (Diagnosis Study) [Research Activity]928 Diagnostic [Functional Concept]

<<<<< Candidates>>>>> MappingsMeta Mapping (1000):1000 diagnosis (diagnosis aspect) [Qualitative Concept]

<<<<< Mappings

Fig. 1. MetaMap output for “at diagnosis”


4 Proposed Method

The development of our system began when we could not find a comprehensive resource for phenotype name recognition. In order to recognize phenotype names (e.g. “thumb duplication”) in the literature, we integrated the available knowledge in the UMLS Metathesaurus and the HPO. By examining the positive and negative results we developed five additional rules; when using them the performance of our system improved significantly. A block diagram of our system’s processing is shown in Fig. 2. The system performs the following steps:

I MetaMap chunks the input text into phrases and assigns the UMLS semantic types associated with each noun phrase. We used the strict model and the word sense disambiguation embedded in MetaMap.

II The Disorder Recognizer analyzes the MetaMap output to find phenotypes and phenotype candidates. This part is original to our system and is described in detail in Section 4.1.

III OBO-Edit [32] is an open source Java program that provides facilities to edit or search ontology files in OBO format. In this step, phenotype candidates from the previous step are searched for in the HPO. Phenotype candidates that are found in the HPO are recognized as phenotypes.

IV The Result Merger merges the phenotypes found by the Disorder Recognizer and OBO-Edit and produces the output, which is the final list of phenotypes found in the input text.


Fig. 2. System block diagram

4.1 Disorder Recognizer

After the input text is processed by MetaMap, a semantic type has been assigned to each phrase. The UMLS Semantic Network contains 133 semantic types. Unfortunately, Phenotype is not available among these semantic types and it is not easy for non-experts to determine which semantic types are related to phenotypes. These semantic types are categorized into 15 Semantic Groups (SG) [31] that are more general and more comprehensive for non-experts. The Semantic Group Disorders contains semantic types that are close to the meaning of phenotype. This semantic group has been used elsewhere [33] to map terminologies between the Mammalian Phenotype Ontology (MPO) [34] and the Online Mendelian Inheritance in Man (OMIM) [28]. The Semantic Group Disorders contains the following semantic types: Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, Sign or Symptom.

Our initial system used MetaMap and the Semantic Group Disorders to recognize phenotypes. However, a number of errors remained with this rudimentary system. After some analysis of these errors, it was decided to apply some post-processing steps to overcome the remaining problems:

1. A number of errors were caused by the use of acronyms. MetaMap has the ability to recognize acronym references, but its database does not contain all acronyms. In addition, some acronyms are used for more than one concept and this ambiguity causes problems for MetaMap. Typically, papers indicate the local unambiguous reference for each acronym at its first usage. Using this knowledge, we create a list of acronym references for each paper using BioText [35] and use this list to process the acronyms found in the remainder of the text (a simplified sketch of this step appears after the rule list below). So the first rule is:


Rule 1. Resolve the acronym referencing problem by making and using a list of acronyms occurring in each paper.

2. Several phenotypes are phrases containing more than one biomedical or clinical term. The complete phrases of some of these phenotypes are not available in the UMLS. The UMLS often finds separate concepts for the biomedical and clinical terms in these phrases. Fig. 3 represents the UMLS output for “[The] presented learning disabilities”. There are two separate concepts in the MetaMap output. The first one is “presented”, which is assigned to the semantic type [Idea or Concept], and the second one is “learning disabilities”, with the semantic type [Mental or Behavioral Dysfunction]. As “presented” is only an adjective modifying “learning disabilities” in this case, the whole phrase should be considered as one phenotype. So, in these situations the semantic type of the noun phrase head is the most important part, and our system should consider the head’s semantic type in order to recognize the semantic type of the whole phrase. So we have the rule:

Rule 2. The semantic type of a noun phrase is the semantic type assigned by MetaMap to its head.

3. Some phenotypes, like “large ventricles”, that are not recognized by MetaMap follow a common template. They begin with special modifiers followed by terms that have the Semantic Groups Anatomy or Physiology. This class of phenotypes is mentioned in [31], where a list of 100 special modifiers having to do with some sort of unusual aspect or dysfunction (like “large”, “defective” and “abnormal”) is given. This list was developed by noticing the modifiers that occur most frequently with MPO terms. For our purposes, we found the list incomplete and we have added three more modifiers found in our small corpus. The three added terms are “missing”, “malformed”, and “underdeveloped”. More modifiers will need to be included. The rule is:

Rule 3. If a phrase is “modifier (from the list of special modifiers) + [Anatomy] or [Physiology]”, it is a phenotype name.

4. A number of the semantic types in the Semantic Group Disorders include concepts that are not phenotypes, leading to false positives in phenotype name recognition. For example, MetaMap assigns “responsible” to the semantic type “Finding”. The word “responsible” is clearly not a phenotype. On the other hand, “overgrowth”, which is a phenotype, is assigned to the semantic type “Finding”, too. The problematic semantic types are: Finding, Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning, Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction. Therefore, if a phrase is assigned to one of these semantic types we cannot be sure that it is a phenotype. We consider the phrases with these semantic types as phenotype candidates that need further analysis. A search for the phenotype candidates in the HPO in step III of the process described above confirms whether each phenotype candidate is a phenotype or not. If a phenotype candidate is found in the HPO, it is recognized as a phenotype. While making the candidate list we should consider Rules 4 and 5 below.

5. In some cases the phenotype is in plural form but only the singular form is available in the HPO. One example is “deep set eyes”: it is not in the HPO, but “deep set eye” is. So, if the singular form is available in the HPO, the plural form is a phenotype.

Rule 4. If the singular form of a phrase is a phenotype, the plural form is a phenotype, too.

6. A phenotype candidate may contain adjectives and adverbs in addition to the phenotype as found in the HPO. In these situations the complete phrase may not be in the HPO, so the system removes the adjectives and adverbs in the phenotype candidate and searches for the head of the phrase.

Rule 5. If the head of a phenotype candidate phrase is a phenotype, the whole phrase is a phenotype.
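As referenced in the discussion of Rule 1, the sketch below builds a per-paper acronym list from “long form (ACRONYM)” patterns at first use and expands later occurrences. The real system relies on the BioText (Schwartz–Hearst) algorithm [35]; this regex version is only a rough approximation.

```python
# Simplified sketch of Rule 1's acronym handling: collect "long form (ACRONYM)"
# definitions and expand later occurrences of the short form. The actual
# system uses the BioText (Schwartz-Hearst) algorithm [35]; this regex
# version only roughly approximates that behaviour.
import re

DEF_PATTERN = re.compile(r"((?:[A-Z][\w-]*\s+){1,6})\(([A-Z]{2,})\)")

def acronym_map(text):
    mapping = {}
    for long_form, short in DEF_PATTERN.findall(text):
        mapping.setdefault(short, long_form.strip())
    return mapping

def resolve_acronyms(text):
    for short, long_form in acronym_map(text).items():
        text = re.sub(rf"\b{short}\b(?!\))", long_form, text)
    return text

sample = "The Human Phenotype Ontology (HPO) was used. HPO terms were matched."
print(resolve_acronyms(sample))  # the second "HPO" is expanded to its long form
```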

In summary, the system analyzes all noun phrases, one by one. If a phrase contains an acronym, the reference for the acronym is first resolved using Rule 1. If the phrase matches Rule 3, it is added to the phenotype list; otherwise the semantic type of the phrase is identified by the semantic type of its head according to Rule 2. If the semantic type is in the Semantic Group Disorders, the phrase is recognized as either a phenotype or a phenotype candidate. Phenotype candidates are added to the phenotype candidate list along with their heads, and their singular form if they are plural (according to Rules 4 and 5), to be processed in step III.
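The rule flow just summarized can be condensed into a short procedure. In the sketch below, every helper passed in (noun_phrases, head_of, semantic_type, in_hpo, singular, resolve_acronyms) is a hypothetical placeholder standing in for MetaMap, OBO-Edit, the acronym list and a morphological normalizer; the sketch also simplifies by treating every Disorder-type phrase as a candidate rather than distinguishing the sure cases.

```python
# Condensed sketch of the Disorder Recognizer's rule flow (Rules 1-5 and
# step III). All helper callables are hypothetical placeholders, not the
# authors' implementation, and sure phenotypes vs. candidates are simplified.
DISORDER_TYPES = {
    "Acquired Abnormality", "Anatomical Abnormality",
    "Cell or Molecular Dysfunction", "Congenital Abnormality",
    "Disease or Syndrome", "Experimental Model of Disease", "Finding",
    "Injury or Poisoning", "Mental or Behavioral Dysfunction",
    "Neoplastic Process", "Pathologic Function", "Sign or Symptom",
}
SPECIAL_MODIFIERS = {"large", "defective", "abnormal",
                     "missing", "malformed", "underdeveloped"}  # excerpt only

def recognize(text, noun_phrases, head_of, semantic_type, in_hpo,
              singular, resolve_acronyms):
    phenotypes, candidates = [], []
    for phrase in noun_phrases(resolve_acronyms(text)):          # Rule 1
        head = head_of(phrase)
        first_word = phrase.lower().split()[0]
        if first_word in SPECIAL_MODIFIERS and \
           semantic_type(head) in {"Anatomy", "Physiology"}:      # Rule 3
            phenotypes.append(phrase)
        elif semantic_type(head) in DISORDER_TYPES:               # Rule 2
            # Rules 4 and 5: also try the singular form and the head.
            candidates += [phrase, singular(phrase), head]
    phenotypes += [c for c in candidates if in_hpo(c)]            # step III
    return phenotypes
```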

Phrase: "[The] presented learning disabilities">>>>> Phrasepresented learning disabilities<<<<< Phrase>>>>> CandidatesMeta Candidates (9):

901 Learning Disabilities [Mental or Behavioral Dysfunction]882 Learning disability (Learning disability - specialty)

[Biomedical Occupation or Discipline]827 Learning [Mental Process]827 Disabilities (Disability) [Finding]743 Disabled (Disabled Persons) [Patient or Disabled Group]743 Disabled [Qualitative Concept]660 Presented (Presentation) [Idea or Concept]627 Present [Quantitative Concept]627 Present (Present (Time point or interval)) [Temporal Concept]

<<<<< Candidates>>>>> MappingsMeta Mapping (901):

660 Presented (Presentation) [Idea or Concept]901 Learning Disabilities [Mental or Behavioral Dysfunction]

<<<<< Mappings

Fig. 3. An example of Rule 2: MetaMap output for “[The] presented learning disabilities”

5 Evaluation

The system has been evaluated on a corpus containing 120 sentences with 110 phenotype phrases. These sentences were collected from 4 randomly chosen full-text journal articles specialized in human genetics. Not all of these sentences contain phenotypes.


Table 1. Results

Method                 Precision   Recall   F-measure
Basic Form             88.78       74.21    80.84
Applying Only Rule 1   89.38       78.9     83.81
Applying Only Rule 2   97.19       75.91    85.24
Applying Only Rule 3   89.09       76.56    82.35
Applying Only Rule 4   88.9        75.78    81.32
Applying Only Rule 5   89.38       78.9     83.81
Applying All Rules     97.58       88.32    92.71

Table 2. Three sources of errors

Cause of error: MetaMap parser (20% of errors)
– “Partial hypoplasia of the corpus callosum”: MetaMap finds two separate phrases: “Partial hypoplasia”; “of the corpus callosum”
– “missing vertebrae”: MetaMap finds two separate phrases: “missing”; “vertebrae”

Cause of error: MetaMap WSD (25% of errors)
– “learning deficit”: [Functional Concept] chosen instead of [Disease or Syndrome]
– “triphalangeal thumb”: [Gene or Genome] chosen instead of [Congenital Abnormality]
– “aplastic anemia”: [Gene or Genome] chosen instead of [Disease or Syndrome]
– “osteosarcoma”: [Gene or Genome] chosen instead of [Neoplastic Process]
– “diabetes insipidus”: [Functional Concept] chosen instead of [Disease or Syndrome]

Cause of error: Phenotype candidates not in HPO (25% of errors)
– “thumb duplication”
– “thrombopenia”
– “increased erythrocyte adenosine deaminase activity”
– “macrocytosis”

Precision, recall and F-measure are typically used to measure the performance of NER tools. Precision is the percentage of correct entity names among all entity names found and can be seen as a measure of soundness. Recall is the percentage of correct entity names found compared to all correct entity names in the corpus and can be used as a measure of completeness. F-measure is the harmonic mean of equally weighted precision and recall. The performance of differently configured systems is shown in Table 1. The basic form is the integration of UMLS and HPO using none of the rules discussed above.
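In symbols, with TP, FP and FN denoting true positives, false positives and false negatives (this symbolic form is added here only for reference, it is not spelled out in the original text):

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2\,P\,R}{P + R}
```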

The results of adding each of the rules are listed in the table. Some errors result from inadequacies in our method, but other errors are caused by incorrect information provided by the systems that we use. Examples of phenotype names not found by our tool as a result of MetaMap mistakes and HPO incompleteness are shown in Table 2. Some errors are a result of an incorrect parse. In some cases MetaMap has true candidates, but after WSD a wrong candidate is chosen. And finally, in several examples MetaMap finds reasonable phenotype candidates but they are not found in the HPO. The percentage of total errors that each of these three sources causes is shown in the table.

6 Summary

Biomedical literature is an important source of information that is growing rapidly. The need for automatic processing of this amount of information is undeniable. One of the basic obstacles to achieving this aim is the recognition of biomedical objects in text. We have presented a system to improve phenotype name recognition. This system integrates two knowledge sources, UMLS and HPO, and MetaMap in an innovative way to find phenotype names in biomedical texts. In essence, our approach applies specific rules to enhance recognition of named entities which originate from specific dictionaries and ontologies.

To test the performance of this system, a small corpus has been used, giving recognition results of 97.6% precision and 88.3% recall. BioMedLEE [36] is a system that extracts a broad variety of phenotypic information from biomedical literature. This system was adapted from MedLEE [37], a clinical information extraction NLP system. To evaluate BioMedLEE, 300 randomly chosen journal titles were used, and BioMedLEE achieved 64% precision and 77.1% recall. We wanted to compare the performance of our system against this reported performance, but we did not have access to the software nor to the corpus used in [36].

In some cases the errors are caused by the tools used: the MetaMap parser and Word Sense Disambiguation function, and the incompleteness of the HPO. Our future aim is to find solutions to these remaining problems and to improve the accuracy of our system. In addition, we plan to build a larger corpus, evaluate the performance of our system more accurately, and compare it to BioMedLEE’s performance on this corpus.

References

1. Leroy, G., Chen, H., Martinez, J.D.: A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical Informatics 36(3), 145–158 (2003)

2. He, X., DiMarco, C.: Using lexical chaining to rank protein-protein interactions in biomedical texts. In: BioLink 2005: Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Conference of the Association for Computational Linguistics (2005) (poster presentation)

3. Fundel, K., Kuffner, R., Zimmer, R.: Relex - relation extraction using dependency parse trees. Bioinformatics 23(3), 365–371 (2007)


4. Ng, S.K., Wong, M.: Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Informatics 10, 104–112 (1999)

5. Yu, H., Zhu, X., Huang, M., Li, M.: Discovering patterns to extract protein-protein interactions from the literature: Part II. Bioinformatics 21(15), 3294–3300 (2005)

6. Swanson, D.R.: Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30(1), 7–18 (1986)

7. Hristovski, D., Peterlin, B., Mitchell, J.A., Humphrey, S.M.: Using literature-based discovery to identify disease candidate genes. I. J. Medical Informatics 74(2-4), 289–298 (2005)

8. Hristovski, D., Friedman, C., Rindflesch, T.C., Peterlin, B.: Exploiting semantic relations for literature-based discovery. In: AMIA Annual Symposium Proceedings, pp. 349–353 (2006)

9. Rindflesch, T.C., Tanabe, L., Weinstein, J.N., Hunter, L.: EDGAR: Extraction of drugs, genes and relations from the biomedical literature. In: Pacific Symposium on Biocomputing, vol. 5, pp. 514–525 (2000)

10. Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A.: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics (Oxford, England) 17(suppl. 1), S74–S82 (2001)

11. Tanabe, L., Scherf, U., Smith, L.H., Lee, J.K., Hunter, L., Weinstein, J.N.: MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. BioTechniques 27(6) (1999)

12. Humphreys, K., Demetriou, G., Gaizauskas, R.: Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. In: Pacific Symposium on Biocomputing, pp. 505–516 (2000)

13. Gaizauskas, R., Demetriou, G., Artymiuk, P.J., Willett, P.: Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 19(1), 135–143 (2003)

14. Andrade, M.A., Valencia, A.: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600–607 (1998)

15. Valencia, A.: Automatic annotation of protein function. Current Opinion in Structural Biology 15(3), 267–274 (2005)

16. Leser, U., Hakenberg, J.: What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics 6(4), 357–369 (2005)

17. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Annual Symposium Proceedings, pp. 17–21 (2001)

18. Dai, M., Shah, N.H., Xuan, W., Musen, M.A., Watson, S.J., Athey, B.D., Meng, F.: An efficient solution for mapping free text to ontology terms. In: AMIA Summit on Translational Bioinformatics, San Francisco, CA (2008)

19. Krauthammer, M., Rzhetsky, A., Morozov, P., Friedman, C.: Using BLAST for identifying gene and protein names in journal articles. Gene 259(1-2), 245–252 (2000)

20. Xu, R., Supekar, K., Morgan, A., Das, A., Garber, A.: Unsupervised method for automatic construction of a disease dictionary from a large free text collection. In: AMIA Annual Symposium Proceedings, pp. 820–824 (2008)

21. Segura-Bedmar, I., Martinez, P., Segura-Bedmar, M.: Drug name recognition and classification in biomedical texts: A case study outlining approaches underpinning automated systems. Drug Discovery Today 13(17-18), 816–823 (2008)


22. Horn, F., Lau, A.L., Cohen, F.E.: Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 20(4), 557–568 (2004)

23. Fukuda, K., Tamura, A., Tsunoda, T., Takagi, T.: Toward information extraction: identifying protein names from biological papers. In: Pacific Symposium on Biocomputing, pp. 707–718 (1998)

24. Nobata, C., Collier, N., Tsujii, J.: Automatic term identification and classification in biology texts. In: The 5th NLPRS Proceedings, pp. 369–374 (1999)

25. Strachan, T., Read, A.: Human Molecular Genetics, 3rd edn. Garland Science/Taylor & Francis Group (2003)

26. Humphreys, B.L., Lindberg, D.A., Schoolman, H.M., Barnett, G.O.: The Unified Medical Language System: an informatics research collaboration. J. Am. Med. Inform. Assoc. 5(1), 1–11 (1998)

27. Robinson, P.N., Mundlos, S.: The Human Phenotype Ontology. Clinical Genetics 77(6), 525–534 (2010)

28. McKusick, V.: Mendelian Inheritance in Man and Its Online Version, OMIM. The American Journal of Human Genetics 80(4), 588–604 (2007)

29. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. Prentice Hall, Englewood Cliffs (2008)

30. Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: an overview. J. Comput. Biol. 10(6), 821–855 (2003)

31. McCray, A.T., Burgun, A., Bodenreider, O.: Aggregating UMLS Semantic Types for Reducing Conceptual Complexity. Proceedings of Medinfo 10(pt 1), 216–220 (2001)

32. Day-Richter, J., Harris, M.A., Haendel, M., Obo, T.G.O., Lewis, S.: OBO-Edit - an ontology editor for biologists. Bioinformatics 23(16), 2198–2200 (2007)

33. Burgun, A., Mougin, F., Bodenreider, O.: Two approaches to integrating phenotype and clinical information. In: AMIA Annual Symposium Proceedings, pp. 75–79 (2009)

34. Smith, C., Goldsmith, C.A., Eppig, J.: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biology 6(1), R7+ (2004)

35. Schwartz, A.S., Hearst, M.A.: A simple algorithm for identifying abbreviation definitions in biomedical text. In: Pacific Symposium on Biocomputing, pp. 451–462 (2003)

36. Chen, L., Friedman, C.: Extracting phenotypic information from the literature via natural language processing. Medinfo 11(Pt 2), 758–762 (2004)

37. Friedman, C., Alderson, P.O., Austin, J.H., Cimino, J.J., Johnson, S.B.: A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association 1(2), 161–174 (1994)


Classifying Severely Imbalanced Data

William Klement1, Szymon Wilk2, Wojtek Michalowski3, and Stan Matwin4,*

1 Thomas Jefferson Medical College, PA, [email protected]

2 Poznan University of Technology, [email protected]

3 Telfer School of Management, Uni. of Ottawa, [email protected]

4 SITE, University of Ottawa, [email protected]
* Affiliated with the Institute of Computer Science, Polish Academy of Sciences.

Abstract. Learning from data with severe class imbalance is difficult. Established solutions include: under-sampling, adjusting the classification threshold, and using an ensemble. We examine the performance of combining these solutions to balance the sensitivity and specificity for binary classifications, and to reduce the MSE score for probability estimation.

Keywords: Classification, Class Imbalance, Sampling, Ensembles.

1 Introduction

In medical domains, severe class imbalance is common and is difficult to cope with. For example, classifying head injury patients to assess their need for CT scans, examining mammogram images to detect breast cancer, or predicting heart failure are all tasks that deal with severely imbalanced data because, usually, there are fewer patients who suffer from an acute condition (positives) than not (negatives). Other domains face this problem as well; fraud detection [7], anomaly detection, information retrieval [13], and detecting oil spills [12] are a few examples. When faced with severe class imbalance, most machine learning methods struggle to achieve a balanced performance. The distinction between the problem of severe class imbalance and the problem of a small minority class is crucial. Often, the problem of insufficient data occurs in conjunction with severe class imbalance, and dealing with both problems is a major difficulty.

The paper presents an experimental evaluation of selected methods used by researchers to counter class imbalance. Our experiment examines combining three techniques: under-sampling, classification threshold selection, and using an ensemble of classifiers (by averaging their predicted probabilities) to help the Naive Bayes method overcome the imbalance problem. The Naive Bayes method is favored because it computes probabilities and thus allows for the assessment of probability estimation. It is common for medical practitioners to rely on probabilistic estimates when making a diagnosis.



Initially, this study was presented in [10] (an unarchived publication) with preliminary results using an ensemble of exactly ten classifiers. This paper presents similar results with more extensive experiments and calculates the number of members in the ensemble based on the class imbalance in the data. This paper also examines the performance with respect to probability scores. Based on our study, our recommendations have successfully been used in [1].

Our results show that combining under-sampling, threshold selection, and ensemble learning is effective in achieving a balance in classification performance on both classes simultaneously in binary domains. Furthermore, the results suggest that adjusting the classification threshold reduces the mean squared errors of probability estimates computed by the Naive Bayes classifier. After a brief review of classification under class imbalance, a description of our experiment design and results follows, and we close with concluding remarks.

2 Classification with Severe Class Imbalance

In the presence of class imbalance, Provost [15] attributes the struggle of machine learning methods to their maximizing accuracy and to the assumption that a classifier will operate on the same distribution as the training data. Thus, a classifier is likely to predict the majority class [6]. In medical decision-making domains, the minority class is usually the critical class, which we wish to predict with higher sensitivity at little or no cost to specificity. Therefore, accuracy is not only skewed by the imbalance, but also inappropriate. Of the many proposed solutions, we review only those relevant to this paper due to space constraints; a comprehensive review of this problem can be found in [9].

Most solutions rely on either adjusting the data for balancing or on modifying the learning algorithm to become sensitive to the imbalance. Data sampling relates to the identification of the correct distribution for a learning algorithm, and the appropriate sampling strategy for a given data set [15,3]. Sampling modifies the class distribution in the training data by increasing or decreasing the frequency of one class. Respectively, these are known as over-sampling and under-sampling. While the former is based on generating instances in the minority class (by duplication or by sophisticated data generation schemes), the latter relies on removing examples from the majority class. However, under-sampling has been shown to outperform over-sampling [6], and over-sampling can lead to over-fitting [2].

Alternative solutions modify the learning algorithm to address class imbalance. Cost-based learning is among such techniques, where instances in the minority class are assigned misclassification costs different from those in the majority class [4,5,14,18]. However, determining the cost of one class over the other (particularly in medical domains) is a difficult problem, which we avoid in this paper. We have addressed this issue in the cost-based sampling study published in [11]. Finally, adjusting the probability estimation or adjusting the classification threshold can also help counter the problem [15].


3 Measures against Class Imbalance

To counter the effect of class imbalance, Provost [15] calls for adjusting the decision threshold when the classifier produces probability estimates rather than labels [8], a case in which classifications are made by imposing a threshold on the probabilities. In this study, the threshold selection (TS) is based on maximizing the F-measure, which represents the harmonic mean of precision and recall. While recall represents the true positive rate, precision is the fraction of true positives among those instances predicted as positive by the classifier. Thus, maximizing the F-measure for the minority class leads to maximizing precision and recall, and the performance on both classes is expected to be balanced.
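The sketch below illustrates threshold selection by F-measure maximization, assuming the minority class is labelled 1 and that predicted positive-class probabilities and true labels are available; it is an illustration of the idea, not the authors' Weka-based implementation.

```python
# Sketch of threshold selection (TS) by F-measure maximization over the
# minority (positive) class; illustrative only, not the paper's Weka code.
def f_measure(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(labels, probs):
    """Return the candidate threshold with the highest F-measure."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        f = f_measure(labels, preds)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```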

Our under-sampling approach (US) balances the training data by keeping all points in the minority class while randomly selecting, without replacement, an equal proportion of points from the majority class. If the data contains n points and n+ of those are in the minority class, then US selects a balanced sample of size 2n+, i.e. s = 2n+/n of the data (expressed as a percentage in Table 1), without replacement. Effectively, this selects all n+ and a random sample of the majority class of equal size.

However, random under-sampling can potentially eliminate information from the training set by excluding instances from the majority class. To counter this potential harm, we construct an ensemble model (EN) consisting of m Naive Bayes (NB) classifiers constructed from various data samples, each consisting of the entire minority class and an equal, random subset of the majority class. Effectively, members in this ensemble have similar expertise on the minority class but various skills on the majority class. The ensemble of m models is then combined by averaging their predicted probabilities. In our experiment, the number m is based on the class imbalance ratio. For instance, if n+ (the positives) is less than n− (the negatives), then m is the closest integer value to n/n+.
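The construction of US and EN described above can be sketched as follows; scikit-learn's GaussianNB stands in for the paper's Weka Naive Bayes learner, and the surrounding code is illustrative rather than the authors' implementation.

```python
# Sketch of US + EN: each ensemble member is trained on all minority examples
# plus an equally sized random subset of the majority class, and the
# m = round(n / n_plus) members are combined by averaging probabilities.
# GaussianNB stands in for the paper's Weka Naive Bayes learner.
import random
from sklearn.naive_bayes import GaussianNB

def undersample(X, y, rng):
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    idx = pos + rng.sample(neg, len(pos))   # all positives + equal negatives
    return [X[i] for i in idx], [y[i] for i in idx]

def train_usen(X, y, seed=0):
    rng = random.Random(seed)
    m = max(1, round(len(y) / sum(y)))      # ensemble size from imbalance ratio
    return [GaussianNB().fit(*undersample(X, y, rng)) for _ in range(m)]

def predict_proba(members, X):
    """Average the members' predicted probabilities of the positive class."""
    per_member = [clf.predict_proba(X)[:, 1] for clf in members]
    return [sum(col) / len(members) for col in zip(*per_member)]
```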

4 Experiments and Results

Our experiment aims to identify which combinations of TS, US, and EN can help the Naive Bayes method (NB) achieve a balanced classification performance on imbalanced data. By “balanced” we mean as equally high performance as possible on both classes. NB is used because of its ability to produce probabilities. This feature enables us to assess the performance of probability estimates for our models. All our models are trained and tested using the Weka [17] software. We examine the performance of all learning models (there are eight) which combine the TS, US, and EN techniques. The first is NB, a single Naive Bayes classifier trained on the original imbalanced data with none of these techniques; this model helps establish a baseline performance. The TS model is a single Naive Bayes whose classification threshold is adjusted to maximize the F-measure. US is a single Naive Bayes classifier trained on a sample of the training set obtained as described in Section 3. USTS is the last single Naive Bayes model and combines both the US and TS techniques. USEN is an ensemble of m Naive Bayes learners, where m is defined in Section 3, using under-sampling prior to training each member in the ensemble.


Table 1. The data contain n points, n+ are positive, s is a percentage of n for under-sampling, and m is the number of members of the constructed ensemble

Data    n       n+     s = 2n+/n (%)   m = n/n+   Description
MedD    409     48     21              9          Undisclosed prospective medical data
SPECT   267     55     41              5          SPECT images [16]
Adult   40498   9640   47.6            4.2        Adult data [16]
DIS     3772    58     3               65         Thyroid disease data [16]
8HR     2534    160    12.6            16         Ozone Level Detection [16]
HPT     155     32     41.3            3          Hepatitis Domain [16]
HYP     3163    151    9.5             21         Hypo-thyroid disease [16]
1HR     2536    73     5.8             35         Ozone Level Detection [16]
SIKEU   3163    293    18.5            11         Sick-Euthyroid data [16]
SIK     3772    231    12.2            66         Sick Thyroid disease [16]

Finally, USTSEN combines all three techniques together, i.e., the USTSEN model is an ensemble similar to USEN but with the addition of TS to each member in the ensemble. The remaining combinations, EN and TSEN, are omitted because, in addition to space limitations, they failed to produce performance different from NB and TS respectively.

Our experiment includes ten binary classification data sets, listed in Table 1. They are mostly obtained from the UCI repository [16], with the exception of the MedD data. The latter is a prospectively collected medical dataset that describes an acute patient condition; unfortunately, we are unable to disclose any details of the MedD data due to intellectual property and privacy issues. The models NB, TS, US, USTS, USEN, and USTSEN are each tested with 10-fold cross-validation runs executed 1000 times. In each run, we record (as a percentage) the sensitivity, the specificity, the mean squared error (MSE), and the area under the ROC curve (AUC). The latter shows little to no change in value, and therefore we omit its results due to space limitations. As a summary, we present the averages and standard deviations over the one thousand runs for the remaining metrics. It is important to note that, from a medical perspective, the focus of performance is on sensitivity first and specificity second.
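A minimal sketch of the per-run metrics as we read the description above (sensitivity, specificity and the MSE of the predicted positive-class probabilities, reported as percentages); the paper's own Weka-based evaluation code is not shown, so this is only an interpretation of that description.

```python
# Sketch of the per-run metrics: sensitivity, specificity and the mean squared
# error of predicted positive-class probabilities, all reported as percentages.
def run_metrics(labels, probs, threshold=0.5):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    specificity = 100.0 * tn / (tn + fp) if tn + fp else 0.0
    mse = 100.0 * sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)
    return sensitivity, specificity, mse
```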

The sensitivity and specificity results are shown in Table 2. Values in bold indicate higher sensitivity than specificity. In the top part of the table, the NB and TS models suffer from the overwhelming negatives; in most cases, their specificities are much higher than their sensitivities. The US model shows a clear improvement in sensitivity while compromising little specificity. However, the USTS, USEN, and USTSEN models achieve a clear advantage and consistently produce higher sensitivity with reasonable specificity. Moreover, the ensemble models USEN and USTSEN show lower standard deviations than US and USTS.

If we consider the sensitivity and specificity, TS alone fails to counter the imbalance, but when combined with US (USTS) the performance improves. So what is TS good for? Consider the MSE results shown in Table 3, where bold values are the lowest. TS achieves lower MSE scores, indicating better probability estimates, which are particularly useful in medical domains.


Table 2. Average Sensitivity / Average Specificity

Data    NB                      TS                      US
MedD    66.8±0.7 / 91.8±0.3     60.3±3.5 / 92.5±1.0     82.3±3.1 / 86.4±0.8
SPECT   76.3±0.3 / 79.6±0.6     66.3±2.6 / 86.3±1.0     80.2±1.9 / 70.0±1.7
Adult   51.6±0.1 / 93.3±0.0     77.8±0.5 / 82.3±0.3     60.1±0.3 / 90.7±0.1
DIS     45.1±3.5 / 96.7±0.1     60.0±3.5 / 95.8±0.2     83.8±2.7 / 61.9±5.0
8HR     85.0±0.3 / 66.5±0.2     48.7±1.4 / 89.8±0.4     85.5±0.5 / 64.1±0.4
HPT     70.1±1.7 / 87.3±0.8     68.1±3.7 / 87.2±1.7     80.0±3.7 / 81.4±2.1
HYP     77.4±0.7 / 98.9±0.1     77.9±1.4 / 98.8±0.1     93.3±1.5 / 96.5±0.3
1HR     81.8±1.1 / 70.5±0.2     51.7±1.3 / 91.3±0.3     83.3±1.4 / 67.9±0.6
SIKEU   89.6±0.4 / 83.6±0.2     65.6±1.5 / 96.4±0.3     92.5±0.2 / 68.9±1.1
SIK     77.6±0.6 / 93.7±0.1     59.1±1.4 / 98.0±0.2     89.7±0.8 / 82.3±1.0

Data    USTS                    USEN                    USTSEN
MedD    87.7±3.1 / 81.7±1.7     82.2±2.2 / 86.4±0.4     88.6±1.8 / 83.9±0.6
SPECT   77.9±3.9 / 70.3±5.5     80.0±1.6 / 69.9±1.0     78.1±1.8 / 71.3±2.1
Adult   91.1±0.3 / 70.0±0.4     60.1±0.1 / 90.7±0.0     90.2±0.2 / 71.4±0.2
DIS     77.4±3.2 / 83.4±3.7     82.8±1.2 / 62.9±0.9     80.4±1.2 / 83.5±0.5
8HR     85.6±1.6 / 62.6±2.6     85.5±0.3 / 64.0±0.2     85.3±0.4 / 64.4±0.3
HPT     79.8±4.4 / 78.0±3.3     80.7±2.9 / 81.1±1.2     80.8±3.0 / 79.8±1.4
HYP     96.7±1.1 / 94.8±0.4     94.1±0.7 / 96.6±0.1     96.9±0.6 / 95.5±0.1
1HR     87.9±3.2 / 60.2±2.9     83.3±0.9 / 67.8±0.2     84.6±1.1 / 65.9±0.3
SIKEU   88.2±1.1 / 85.2±1.0     92.5±0.1 / 68.9±0.4     90.3±0.5 / 82.7±0.4
SIK     87.3±1.2 / 86.6±1.0     89.5±0.3 / 82.9±0.2     88.2±0.4 / 86.4±0.2

Table 3. Average Mean Squared Error (MSE)

Data NB TS US USTS USEN USTSEN

MedD    8.2±0.1    8.0±0.4    11.6±0.5   13.5±0.8   11.3±0.2   13.0±0.3
SPECT   17.7±0.2   14.8±0.4   22.8±0.9   22.2±2.2   22.4±0.4   20.9±0.9
Adult   13.9±0.0   12.8±0.0   13.3±0.1   14.7±0.1   13.2±0.0   14.7±0.0
DIS     3.7±0.0    4.1±0.1    27.4±3.6   15.7±2.4   23.9±0.4   13.6±0.2
8HR     31.7±0.1   15.0±0.7   33.7±0.4   33.2±1.0   33.3±0.1   32.1±0.3
HPT     13.7±0.5   13.8±0.7   16.4±1.3   17.4±1.4   15.1±0.7   15.7±0.8
HYP     1.8±0.0    1.8±0.0    3.0±0.2    3.7±0.2    2.7±0.0    3.4±0.0
1HR     28.5±0.2   10.2±0.9   30.9±0.6   32.8±1.2   30.0±0.2   31.4±0.3
SIKEU   11.2±0.1   5.8±0.1    22.8±0.9   12.9±0.6   22.5±0.3   12.7±0.2
SIK     5.3±0.0    3.6±0.1    13.4±0.7   11.0±0.6   12.7±0.1   10.4±0.1

These estimates represent how likely a patient is to belong to the positive class, in this case the minority class. Models using under-sampling (US, USTS, USEN, USTSEN) produce higher MSE scores: they are able to counter the class imbalance for classification but not necessarily for probability estimation. In addition, USEN and USTSEN show lower MSE deviations than US and USTS. Although under-sampling increases the standard deviation due to the random exclusion of data points, the construction of an ensemble model seems to provide a remedy.


5 Conclusions

Combining under-sampling with threshold selection while using a voted ensemble successfully shifts the focus of Naive Bayes to the minority class. This combination builds an effective model when dealing with severe class imbalance. Sampling increases performance deviations, but the ensemble provides a remedy. Adjusting the classification threshold alone fails to counter the imbalance for classification, but it succeeds for probability estimation by reducing MSE. Future experiments may include other models, e.g., decision trees or rule-based methods.

References

1. Błaszczyński, J., Deckert, M., Stefanowski, J., Wilk, S.: Integrating Selective Pre-processing of Imbalanced Data with Ivotes Ensemble. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) RSCTC 2010. LNCS, vol. 6086, pp. 148–157. Springer, Heidelberg (2010)

2. Chawla, N.V., Japkowicz, N., Kolcz, A. (eds.): ICML 2003 Workshop on Learning from Imbalanced Data Sets (2003)

3. Chawla, N.V., Cieslak, D.A., Hall, L.O., Joshi, A.: Automatically countering imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery 17(2), 225–252 (2008)

4. Domingos, P.: Metacost: A general method for making classifiers cost-sensitive. In: KDD 1999, pp. 155–164 (1999)

5. Drummond, C., Holte, R.C.: Exploiting the cost (in)sensitivity of decision tree splitting criteria. In: ICML 2000, pp. 239–246 (2000)

6. Drummond, C., Holte, R.C.: Severe Class Imbalance: Why Better Algorithms Aren’t the Answer. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 539–546. Springer, Heidelberg (2005)

7. Fawcett, T., Provost, F.: Adaptive Fraud Detection. Data Mining and Knowledge Discovery (1), 291–316 (1997)

8. Flach, P.A., Matsubara, E.T.: A Simple Lexicographic Ranker and Probability Estimator. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenic, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 575–582. Springer, Heidelberg (2007)

9. He, H., Garcia, E.A.: Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284 (2009)

10. Klement, W., Wilk, S., Michalowski, W., Matwin, S.: Dealing with Severely Imbalanced Data. In: ICEC 2009 Workshop, PAKDD 2009 (2009)

11. Klement, W., Flach, P., Japkowicz, N., Matwin, S.: Cost-based Sampling of Individual Instances. In: Canadian AI 2009, pp. 86–97 (2009)

12. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Machine Learning (30), 195–215 (1998)

13. Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. In: ICML 1994, pp. 179–186 (1994)


14. Margineantu, D.: Class probability estimation and cost-sensitive classification decisions. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI), vol. 2430, pp. 270–281. Springer, Heidelberg (2002)

15. Provost, F.: Learning with Imbalanced Data Sets 101. Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets (2000)

16. Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html

17. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

18. Zadrozny, B., Langford, J., Abe, N.: Cost-Sensitive Learning by Cost-Proportionate Example Weighting. In: IEEE ICDM 2003 (2003)


Simulating Cognitive Phenomena with a Symbolic Dynamical System

Othalia Larue

GDAC Research Laboratory, Computer Science Department, Université du Québec à Montréal, C.P. 8888, succursale Centre-ville, Montréal, QC, H3C 3P8

[email protected]

Abstract. We present a new tool: a symbolic dynamical approach to the simulation of cognitive processes. Complex Auto-Adaptive System is a symbolic dynamical system implemented in a multi-agent system. We describe our methodology and prove our claims by presenting simulation experiments of the Stroop and Wason tasks. We then explain our research plan: the integration of an emotion model in our system, the implementation of a higher-level control organization, and the study of its application to cognitive agents in general.

Keywords: Cognitive Simulation, Cognitive architecture, Symbolic dynamism.

1 Introduction

Complex Auto-Adaptive System (CAAS) [1] is a multi-agent system that models the dynamics of knowledge activation and suppression as the system performs tasks or solves problems. Our aim is to introduce CAAS and motivate it as a simulation tool for cognitive scientists.

There already are simulations of cognitive processes in neural networks [2]; however, the opaque nature of neural processing makes it difficult to understand how the psychological processes emerge from the neural dynamics. Contrary to neural networks, and similarly to what the system presented here does in part, production systems [3] use a symbolic approach for the simulation of cognitive processes, making it easier to interpret activity in the system; but the sequentiality inherent to fixed mechanisms for rule selection limits the simulation of cognitive processes. CAAS implements a symbolic dynamical system [4], which we believe to be a hybrid alternative since it allows the use of discrete symbolic representations but exhibits dynamical change in time. In CAAS, symbolic dynamics refers to the dynamical, real-time interaction of populations of minimal symbolic agents, out of which emerges a continuously changing geometrical representation of the environment.

We cannot claim the fine-grained neurological plausibility of some neural networks, but CAAS is neurocognitively plausible at the higher (functional) level of gross neurological structure, and it does bring to our simulations the readability that neural network simulations lack. Thanks to the similarities between the mesoscopic dynamical activity of brains and CAAS dynamical activity, our simulations can also account for phenomena that are reproduced in a limited way in production systems.

2 Methodology

CAAS is implemented in a multi-agent system featuring a reactive agent architecture. Agents in CAAS are assigned to five organisations depending on their roles in the system: frontal agents, structuring agents, morphological agents, analysis agents, and PSE agents (Pattern of Specific Agents), each having the same basic structure (a communication module, a type, and a state defined by its level of activation).

The frontal agents extract knowledge from inputs to the system. Their state of activation is determined by the presence of information as input usable by the software or library they use, and by the structuring agents, who send top-down perceptual control messages to them. The structuring agents bear the knowledge of the system (its knowledge base). Each structuring agent is assigned as its role one term (knowledge item) from the ontology (general or specific to the task). States of activation are determined by the number of messages exchanged between agents and organisations. The morphological agents generate statistics about communications between the structuring agents and regulate the activation of the structuring agents. The analysis agents produce graphs of the communications between the agents from those statistics. Finally, the PSE agents orientate the communication organisation between the structuring agents. Each PSE agent is assigned a global shape (morphology) as its role. CAAS functions by comparing geometrical representations of the activity of its symbol-bearing Structuring agents with geometrical representations that serve as its goals, in order to alter the activation of the agents of the Structuring organisation. The Structuring organisation, however, is not only influenced by the Frontal agents, but also by the PSE agents, which get the Morphological agents to send activity-promoting or inhibiting signals to the Structuring agents according to which goal or goals are currently active (given to the system by its user).
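Purely as an illustration of this organisation (the class, role names, and fields below are our assumptions, not the CAAS implementation), the five populations can be pictured as minimal agents sharing a communication module, a type, and an activation level.

# Illustrative sketch only: minimal agents grouped into the five CAAS organisations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Agent:
    role: str                 # e.g. an ontology term for a structuring agent
    organisation: str         # "frontal", "structuring", "morphological", "analysis", "PSE"
    activation: float = 0.0   # state: level of activation
    inbox: List[str] = field(default_factory=list)   # communication module (messages)

    def send(self, other: "Agent", message: str) -> None:
        # Message traffic between agents is what the morphological
        # organisation later summarises statistically.
        other.inbox.append(f"{self.organisation}:{self.role} -> {message}")

# A tiny population: one agent per organisation.
population = [
    Agent("colour-word input", "frontal"),
    Agent("red", "structuring"),
    Agent("traffic statistics", "morphological"),
    Agent("communication graph", "analysis"),
    Agent("goal morphology", "PSE"),
]
population[0].send(population[1], "perceived: RED")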

The cognitive control system located in the DLPFC corresponds, in our system, to the PSE organisation. The information processed in the posterior brain regions is represented in our system by the regulating messages sent from the Morphological organisation to the Structuring organisation through the goal-oriented behaviour of the PSE organisation. With the following experiments, the Stroop and Wason tasks, we demonstrate the cognitive plausibility of our system in action, as well as its limitations.

3 Experiments and Results

3.1 Stroop Task

The Stroop task is a standard psychology test that measures interference between competing cognitive processes. We have already implemented a classical Stroop task, and provide an accurate simulation of the time effects and working memory variations on cognitive control by "weakening" the system (by reducing the messages of regulation sent by the Morphological agents to the Structuring agents). We observe disorders similar to those reported in the literature concerned with the functional understanding of cognitive disorders and differences in individuals. Such experiments are a clear demonstration of our previous claim about the ability of the system to correctly reproduce the mesoscopic dynamical activity. Here in the Stroop task, failure in the interaction between the Morphological agents and the Structuring agents and weakening of regulation messages in the system mirrors the failure of the response conflict monitoring system, executed in the anterior cingulate cortex (ACC), and of a cognitive control system located in the dorsolateral prefrontal cortex (DLPFC). The response conflict monitoring system detects conflicts due to interference between processes. The cognitive control system modifies information processing in posterior brain regions to reduce the conflicts thus found.

3.2 Wason Task

The Wason task is a common selection task in the psychology of reasoning, testing the subject's ability to search for counter-examples when testing a hypothesis. In order to further define the limits of the present system, and what we will need to add to the system to upgrade it to an efficient simulation tool, we implemented this test. Our first experiment is a classic version of the Wason card selection task [7], where subjects are asked which card(s) they believe must be turned in order to verify a conditional statement such as "If there is an A on one side of the card, then there is a 3 on the other side". Otherwise, we used the same experimental setting as that of the Stroop task, using a one-way link in the ontology to specify rules to the system. Currently only able to perform Modus Ponens (affirming the antecedent) and not Modus Tollens (denying the consequent), the system unsurprisingly reproduced the error that most humans make: asked which card(s) it would turn to verify the rule, it correctly answered "A" but never answered "7", as logic mandates. To help the system see that, we made the negation explicit. In the second experiment, we thus added four elements to the ontology: "P", "Q", "notP", "notQ", and linked the Structuring agents as follows: "A"-"P"; "3"-"Q"; "D"-"notP"; "7"-"notQ". Furthermore, in order to make the need to verify the contraposed rule explicit in the system's "Verification" goal, we added a one-way link from "notQ" to "notP". Using the same input as in the previous experiment, the system was now able to answer correctly that cards "A" and "7" should be turned. This method was very artificial, and as we will point out in our plan of research, we would like to give the system the ability to create/abstract this type of link itself.
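A minimal sketch of what this second experiment's ontology change amounts to (the dictionary encoding is an illustrative assumption, not the CAAS data structure): once the negations and the one-way link from "notQ" to "notP" are explicit, a simple traversal from the "7" card reaches "notP", so both "A" and "7" matter for verification.

# Sketch: explicit negation terms plus the added one-way link make the
# contrapositive reachable. The graph encoding below is ours, for exposition only.
links = {
    "A": ["P"], "3": ["Q"], "D": ["notP"], "7": ["notQ"],
    "P": ["Q"],          # the rule: if P then Q
    "notQ": ["notP"],    # added one-way link enabling Modus Tollens
}

def reachable(start, graph):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Turning "A" exercises the rule directly; turning "7" now reaches "notP",
# so both cards are relevant to verification.
print(reachable("A", links))   # {'P', 'Q'}
print(reachable("7", links))   # {'notQ', 'notP'}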

4 Plan for Research

Our first research approach was to identify the strengths and limitations of the present system. The method we employed was the simulation of various cognitive tasks (see the sections above). The Stroop task [5] simulations proved the validity of the approach, but also showed that we are not currently able to incorporate emotional materials in our Stroop task simulations. The Wason task [7] emphasized another missing aspect of our current approach: a higher-level organization, with which we would not have to resort to the artificial manipulations of our simulation. CAAS was primarily developed for engineering purposes. But we believe that its properties are particularly suitable to cognitive simulations, and we are thus adapting it for this purpose by incorporating emotions and reasoning, important dimensions of cognitive processes. Such additions are also compatible with a parallel thread of our research: the implementation of CAAS in cognitive agents.

Therefore, a first research topic is the addition of emotions in CAAS. One way of doing this could be the addition of emotional structuring agents related to other semantic Structuring agents. Emotional structuring agents could be a way to implement in our system the type of neuromodulation involved in human emotions. Neuromodulation could thus be implemented in the system by variations in the regulation messages sent to these agents. A second research topic is the development of the Higher Level Organization (HLO). Our system needs to be further developed in order to have the ability to construct a knowledge base and generate reasoning models [6] (similar perhaps to Johnson-Laird's mental models). Reasoning models could be generated at a level on top of the PSE agents. Using information provided by analysis agents to PSE agents, the HLO could refine the mental models. These reasoning models could be new goals sent to the PSE agents, which the HLO would be able to link to task-specific goals. In the processing of a Wason task, we would be able to witness the construction of a wrong model, which could be corrected by the presence of new information in the environment of the system (i.e., a change in the wording of the task). A new higher-level organisation would help to abstract a style of configuration among agents (its shape) and transformations upon them (for example: negation). We could thus be able to propose new simulations in cognitive psychology and the psychology of reasoning.

References

1. Camus, M.: Morphology Programming with an Auto-Adaptive System. In: ICAI 2008 (2008)
2. Eliasmith, C.: Dynamics, control, and cognition. In: Robbins, P., Aydede, M. (eds.) Cambridge Handbook of Situated Cognition. CUP, Oxford (2009)
3. Roelofs, A.: Goal-referenced selection of verbal action. Psychological Review 110, 88–125 (2003)
4. Dale, R., Spivey, M.J.: From apples and oranges to symbolic dynamics. JETAI 26, 317–342 (2005)
5. Stroop, J.R.: Studies of interference in serial verbal reactions. J. of Exp. Psy. 18, 643–662 (1935)
6. Johnson-Laird, P., Byrne, R.: Deduction. Lawrence Erlbaum Associates, Mahwah
7. Wason, P.C.: Natural and contrived experience in a reasoning problem. In: Foss, B.M. (ed.) New Horizons in Psychology. Penguin, NY (1966)


Finding Small Backdoors in SAT Instances

Zijie Li and Peter van Beek

Cheriton School of Computer Science, University of Waterloo

Waterloo, Ontario, Canada N2L 3G1

Abstract. Although propositional satisfiability (SAT) is NP-complete, state-of-the-art SAT solvers are able to solve large, practical instances. The concept of backdoors has been introduced to capture structural properties of instances. A backdoor is a set of variables that, if assigned correctly, leads to a polynomial-time solvable sub-problem. In this paper, we address the problem of finding all small backdoors, which is essential for studying value and variable ordering mistakes. We discuss our definition of sub-solvers and propose algorithms for finding backdoors. We experimentally compare our proposed algorithms to previous algorithms on structured and real-world instances. Our proposed algorithms improve over previous algorithms for finding backdoors in two ways. First, our algorithms often find smaller backdoors. Second, our algorithms often find a much larger number of backdoors.

1 Introduction

Propositional satisfiability (SAT) is a core problem in AI. The applications are numerous and include software and hardware verification, planning, and scheduling. Even though SAT is NP-complete in general, state-of-the-art SAT solvers can solve large, practical problems with thousands of variables and clauses. To explain why current SAT solvers scale well in practice, Williams, Gomes, and Selman [13,14] propose the concept of weak and strong backdoors to capture structural properties of instances. A weak backdoor is a set of variables for which there exists a value assignment that leads to a polynomial-time solvable sub-problem. For a strong backdoor, every value assignment should lead to a polynomial-time solvable sub-problem.

In this paper, we address the problem of finding all small backdoors in SAT instances. A small backdoor is a backdoor such that no proper subset is also a backdoor. This problem is important for studying problem hardness, which is generally represented as the time used or the number of nodes expanded by a SAT solver. In addition, identifying all small backdoors is a first step to investigating how value and variable ordering mistakes affect the performance of backtracking algorithms—the ultimate goal of our research. A variable ordering heuristic can make a mistake by selecting a variable not in the appropriate backdoor. A value ordering heuristic can make a mistake by assigning the backdoor variable a value that does not lead to a polynomial sub-problem.


Backdoors are defined with respect to sub-solvers, which in turn can be defined algorithmically or syntactically. Algorithmically defined sub-solvers are polynomial-time techniques of current SAT solvers, such as unit propagation. Syntactically defined sub-solvers are polynomial-time tractable classes, such as 2SAT and Horn. The size of backdoors with respect to purely syntactically defined sub-solvers is relatively large. On the other hand, it is possible that a simplified sub-problem is polynomial-time solvable before an algorithmically defined sub-solver finds a solution. Therefore, we propose a sub-solver that first applies unit propagation, and then checks polynomial-time tractable classes.

We propose both systematic and local search algorithms for finding backdoors. The systematic search algorithms are guaranteed to find all minimal sized backdoors but are unable to handle large instances. Kilby, Slaney, Thiebaux, and Walsh [7] propose a local search algorithm to find small weak backdoors. Building on their work, we propose two local search algorithms for finding small backdoors. Our first algorithm incorporates our definition of sub-solver with Kilby et al.'s algorithm. Our second algorithm is a novel local search technique. We experiment on large real-world instances, including the instances from SAT-Race 2008, to compare our proposed algorithms to previous algorithms. Our algorithms based on our proposed sub-solvers can find smaller backdoors and significantly larger numbers of backdoors than previous algorithms.

2 Background

In this section, we review the necessary background in propositional satisfiability and backdoors in SAT instances.

We consider propositional formulas in conjunctive normal form (CNF). A literal is a Boolean variable or its negation. A clause is a disjunction of literals. A clause with one literal is called a unit clause, and the literal in the unit clause is called a unit literal. A propositional formula F is in conjunctive normal form if it is a conjunction of clauses.

Given a propositional formula in CNF, the problem of determining whether there exists a variable assignment that makes the formula evaluate to true is called the propositional satisfiability problem, or SAT. Propositional satisfiability is often solved using backtracking search. A backtracking search for a solution to a SAT instance can be seen as performing a depth-first traversal of a search tree. The search tree is generated as the search progresses and represents alternative choices that may have to be examined in order to find a solution or prove that no solution exists. Exploring a choice is also called branching, and the order in which choices are explored is determined by a variable ordering heuristic. When specialized to SAT solving, backtracking algorithms are often referred to as being DPLL-based, in honor of Davis, Putnam, Logemann, and Loveland, the authors of one of the earliest works in the field [1].

Let F denote a propositional formula. We use the value 0 interchangeably with false and the value 1 interchangeably with true. The notation F[v = 0] represents a new formula, called the residual formula, obtained by removing all clauses that contain the literal ¬v and deleting the literal v from all clauses. Similarly, the notation F[v = 1] represents the residual formula obtained by removing all clauses that contain the literal v and deleting the literal ¬v from all clauses. Let aS be a set of assignments. The residual formula F[aS] is obtained by cumulatively reducing F by each of the assignments in aS.

Example 1. For example, the formula F = (x ∨ ¬y) ∧ (x ∨ y ∨ z) ∧ (y ∨ ¬z ∨ w) ∧ (¬w ∨ ¬z ∨ v) ∧ (¬v ∨ u) is in CNF. Suppose x is assigned false. The residual formula is given by F[x = 0] = (¬y) ∧ (y ∨ z) ∧ (y ∨ ¬z ∨ w) ∧ (¬w ∨ ¬z ∨ v) ∧ (¬v ∨ u).

As is clear, a CNF formula is satisfied only if each of its clauses is satisfied, and a clause is satisfied only if at least one of its literals is equivalent to true. In a unit clause, there is no choice and the value of the literal is said to be forced. The process of unit propagation repeatedly assigns all unit literals the value true and simplifies the formula (i.e., the residual formula is obtained) until no unit clause remains or a conflict is detected. A conflict occurs when implications for setting the same variable to both true and false are produced.

Example 2. Consider again the formula F[x = 0] given in Example 1. The unit clause (¬y) forces y to be assigned 0. The residual formula is F[x = 0, y = 0] = (z) ∧ (¬z ∨ w) ∧ (¬w ∨ ¬z ∨ v) ∧ (¬v ∨ u). In turn, the unit clause (z) forces z to be assigned 1. Similarly, the assignments w = 1, v = 1, and u = 1 are forced.
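For illustration, a minimal unit propagation routine over a clause-list representation might look as follows (the signed-integer encoding and the function names are our assumptions, not code from the paper or from any particular solver); it reproduces the forced assignments of Example 2.

# Sketch: unit propagation over a CNF given as a list of clauses of signed integers
# (variable v is the literal v, its negation is -v).
def simplify(clauses, lit):
    """Residual formula F[lit]: drop satisfied clauses, delete the falsified literal."""
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]

def unit_propagate(clauses):
    """Repeatedly assign unit literals; return (residual formula, forced assignment),
    or (None, None) if a conflict is detected."""
    assignment = {}
    while True:
        if any(len(c) == 0 for c in clauses):
            return None, None                    # empty clause: conflict
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            return clauses, assignment
        for lit in units:
            var, val = abs(lit), lit > 0
            if assignment.get(var, val) != val:
                return None, None                # same variable forced both ways
            assignment[var] = val
            clauses = simplify(clauses, lit)

# Example 2: F[x = 0] with y=2, z=3, w=4, v=5, u=6 -- every variable gets forced.
residual, forced = unit_propagate([[-2], [2, 3], [2, -3, 4], [-4, -3, 5], [-5, 6]])
print(residual, forced)   # [] {2: False, 3: True, 4: True, 5: True, 6: True}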

Williams, Gomes, and Selman [13] formally define weak and strong backdoors. The definitions rely on the concept of a sub-solver A that, given a formula F, in polynomial time either rejects the input or correctly solves F. A sub-solver can be defined either algorithmically or syntactically. For example, a DPLL-based SAT solver can be modified to be an algorithmically defined sub-solver by using just unit propagation, and returning "reject" if branching is required, "unsatisfiable" if a contradiction is encountered, and "satisfiable" if a solution is found. Examples of tractable syntactic classes include 2SAT, Horn, anti-Horn, and RHorn formulas. A formula is 2SAT if every clause contains at most two literals, Horn if every clause has at most one positive literal, anti-Horn if every clause has at most one negative literal, and renamable Horn (RHorn) if it can be transformed into Horn by a uniform renaming of variables.
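For concreteness, the purely syntactic classes can be tested with a few lines over a clause-list encoding of signed-integer literals (the helper names are ours); RHorn is omitted because recognising a renaming needs more machinery than a one-line check.

# Quick syntactic-class checks; an illustrative sketch, not the authors' code.
def is_2sat(clauses):
    return all(len(c) <= 2 for c in clauses)

def is_horn(clauses):
    return all(sum(1 for l in c if l > 0) <= 1 for c in clauses)

def is_anti_horn(clauses):
    return all(sum(1 for l in c if l < 0) <= 1 for c in clauses)

# (x OR not y) AND (not x OR not y OR z): at most one positive literal per clause, hence Horn.
example = [[1, -2], [-1, -2, 3]]
print(is_2sat(example), is_horn(example), is_anti_horn(example))   # False True False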

A weak backdoor is a subset of variables such that some value assignment leads to a polynomial-time solvable sub-problem.

Definition 1 (Weak Backdoor). A nonempty subset S of the variables is a weak backdoor in F for a sub-solver A if there exists an assignment aS to the variables in S such that A returns a satisfying assignment of F[aS].

Example 3. Consider once again the formula F[x = 0] given in Example 2. After unit propagation every variable has been assigned a value and the formula F is satisfied. Hence, x is a weak backdoor in F with respect to unit propagation.

A strong backdoor is a subset of variables such that every value assignment leads to a polynomial-time solvable sub-problem.


Table 1. Summary of previous experimental studies, where DPLL means the sub-solver used was defined algorithmically based on a DPLL solver; otherwise the sub-solver was defined syntactically using the given tractable class

Study                  Sub-solvers         Instance domains
Williams et al. [13]   DPLL                structured
Interian [6]           2SAT, Horn          random 3SAT
Dilkina et al. [2]     DPLL, Horn, RHorn   graph coloring, planning, game theory, automotive configuration
Paris et al. [9]       RHorn               random 3SAT, SAT competition
Kottler et al. [8]     2SAT, Horn, RHorn   SAT competition, automotive configuration
Samer & Szeider [11]   Horn, RHorn         automotive configuration
Ruan et al. [10]       DPLL                quasigroup completion, graph coloring
Kilby et al. [7]       DPLL                random 3SAT
Gregory et al. [4]     DPLL                planning, graph coloring, quasigroup
Dilkina et al. [3]     DPLL                planning, circuits

Definition 2 (Strong Backdoor). A nonempty subset S of the variables is a strong backdoor in F for a sub-solver A if for all assignments aS to the variables in S, A returns a satisfying assignment or concludes unsatisfiability of F[aS].

A minimal backdoor for an instance is a backdoor S such that for every other backdoor S′, |S| ≤ |S′|. A small backdoor refers to a backdoor S such that no proper subset of S is also a backdoor. A minimal backdoor can be viewed as a global minimum, and a small backdoor can be viewed as a local minimum.

3 Related Work

In this section, we review previous work on algorithms for finding weak and strong backdoors in SAT and their experimental evaluation (see Table 1).

In terms of algorithms, Williams, Gomes, and Selman [13,14] present a systematic algorithm which searches every subset of variables for a backdoor. We modify this algorithm to find minimal backdoors. Interian [6] and Kilby, Slaney, Thiebaux, and Walsh [7] each propose a local search algorithm for finding backdoors. Our algorithms build on Kilby et al.'s. We discuss this all in detail in Section 4.

In terms of experimental results, Dilkina, Gomes, and Sabharwal [2] show that strong Horn backdoors can be considerably larger than strong backdoors with respect to DPLL sub-solvers. Kottler, Kaufmann, and Sinz [8] compare 2SAT, Horn and RHorn and find that RHorn usually results in smaller backdoors. Samer and Szeider [11] compare Horn and RHorn strong backdoors and find as well that RHorn gives smaller backdoors. In general, previous experimental results show that backdoors with respect to syntactically defined sub-solvers have larger sizes than backdoors with respect to algorithmically defined sub-solvers.

In contrast to previous work, which considers DPLL, 2SAT, Horn, and RHorn sub-solvers independently, we combine syntactic and algorithmic sub-solvers into a single sub-solver. We also propose improved algorithms and experimentally evaluate our proposals on larger, more varied instances.


Table 2. Algorithms for finding weak and strong backdoors

Exact      Our exact algorithm for finding minimal weak backdoors in satisfiable instances;
Strong     Our exact algorithm for finding minimal strong backdoors in unsatisfiable instances;
Kilby      Kilby et al.'s [7] local search algorithm for finding small weak backdoors;
KilbyImp   Kilby et al.'s [7] algorithm that incorporates our definition of sub-solver;
Tabu       Our proposed local search algorithm for finding small weak backdoors.

4 Algorithms for Finding Backdoors

In this section, we introduce how we define sub-solvers and describe several algorithms for finding backdoors (see Table 2).

In our proposed framework, we define the sub-solver both algorithmically and syntactically. Specifically, given a partial assignment to a subset of variables S, we first apply unit propagation and then check the following conditions to see if the resulting formula F belongs to a polynomial-time tractable class:

1. if F is satisfied;
2. if F is 2SAT;
3. if F is satisfied after assigning 0 (false) to every unassigned variable;
4. if F is satisfied after assigning 1 (true) to every unassigned variable.

If one of the above conditions is true, then S is a backdoor set. The first two conditions are trivial. The third condition covers (a superset of) Horn formulas, while the last condition covers (a superset of) anti-Horn formulas. If F is a Horn formula, it can be satisfied by assigning 0 to all variables unless it has unit clauses with a single positive literal. However, after unit propagation, F is guaranteed to have at least two unassigned literals in each clause. A similar reasoning applies if F is anti-Horn. In specifying our algorithms, we make use of the following low-level procedures, where F is a CNF formula and v is a Boolean value.

isSatisfied(F)   return true iff F is already satisfied;
is2SAT(F)        return true iff F is a 2SAT formula;
isSat2SAT(F)     return true iff F is a satisfiable 2SAT formula;
setVal(F, v)     return true iff F is satisfied after assigning v to every unassigned variable.
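To make the combined check concrete, here is a small sketch that mirrors the four conditions above on a clause-list encoding of signed-integer literals. The function names and the simplified propagation routine are our own illustrative assumptions, not the authors' implementation; in particular, the second condition only tests membership in 2SAT, after which a real sub-solver would still solve the 2SAT instance in polynomial time.

# Sketch of the combined sub-solver check; an illustration, not the authors' code.
def simplify(clauses, lit):
    # Residual formula: drop clauses satisfied by lit, remove the falsified literal.
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]

def unit_propagate(clauses):
    # Returns the residual formula, or None if a conflict (empty clause) arises.
    while True:
        if any(len(c) == 0 for c in clauses):
            return None
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            return clauses
        for lit in units:
            clauses = simplify(clauses, lit)

def satisfied_by_constant(clauses, value):
    # Conditions 3 and 4: give the same truth value to every unassigned variable.
    want_positive = (value == 1)
    return all(any((l > 0) == want_positive for l in c) for c in clauses)

def tractable_after_propagation(clauses):
    # True iff the residual formula falls under one of the four conditions above.
    residual = unit_propagate(clauses)
    if residual is None:
        return False                                   # conflict: no satisfying extension here
    return (len(residual) == 0                         # 1. already satisfied
            or all(len(c) <= 2 for c in residual)      # 2. a 2SAT formula (poly-time solvable)
            or satisfied_by_constant(residual, 0)      # 3. all-false satisfies it
            or satisfied_by_constant(residual, 1))     # 4. all-true satisfies it

# Example 1 with x assigned false (y=2, z=3, w=4, v=5, u=6): propagation satisfies
# every clause, so the check succeeds.
print(tractable_after_propagation([[-2], [2, 3], [2, -3, 4], [-4, -3, 5], [-5, 6]]))  # True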

4.1 Exact Algorithms: Exact and Strong

We describe exact algorithms, which are suitable for small instances with small backdoors. Algorithm Exact takes as input a formula F and finds all backdoors of size at most k by performing an exhaustive search. We run algorithm Exact with k = 1, 2, . . . , n − 1, until minimal backdoors are found.

The algorithm calls procedure expand(V, S, k), which explores the variables of F in a depth-first manner. Given a set of variables V, a set of minimal backdoors S, and a positive integer k, the procedure returns true iff V is a backdoor and, as a side-effect, S is updated. Given a value assignment to the variables in V, V is a backdoor if there is no conflict after unit propagation and the resulting formula F is in one of the polynomial-time tractable classes. The procedure recursively calls itself with one more variable added to V and k − 1.

Algorithm: Exact(F, k)
  S ← ∅
  for i ← 0 to n − 1 do expand({x_i}, S, k)
  return S

Procedure: expand(V, S, k)
  foreach value assignment a_V of V do
    if unit propagation of a_V does not result in conflicts then
      if isSatisfied(F) ∨ isSat2SAT(F) ∨ setVal(F, 0) ∨ setVal(F, 1) then
        S ← S ∪ V; return true
  if k ≤ 1 then return false
  j ← index of the last variable in V
  for i ← (j + 1) to n − 1 do expand(V ∪ {x_i}, S, k − 1)
  return false

The Exact algorithm can easily be modified to give algorithm Strong, which finds minimal strong backdoors in unsatisfiable instances. The idea is that if every value assignment to a set of variables V results in conflicts during unit propagation, then we are able to conclude the unsatisfiability of the instance. Thus, V is added to the list of strong backdoors.

4.2 Local Search Algorithms: Kilby, KilbyImp, and Tabu

The exact algorithms based on depth-first search are complete, but do not scale up to instances with larger backdoors. Here we discuss local search algorithms. In the local search algorithms, each search state s is a backdoor, and the cost of a node s is the cardinality of the backdoor s.

Kilby et al. [7] propose algorithm Kilby for finding small weak backdoors using local search. Given a formula F, the DPLL solver Satz-rand is first used to solve F, recording the set W of branching literals and the solution M. The set W is an initial backdoor, as Satz-rand is able to solve F[W] without branching. Then, algorithm Kilby takes the inputs F, W, and M to find small backdoors. The set B is the current smallest backdoor. The algorithm has three constants: RestartLimit, which controls the number of restarts, a technique for escaping from local minima; IterationLimit, which controls the amount of search between restarts; and CardMult, which defines the neighbors of the current candidate backdoor W. In each iteration, the algorithm randomly selects from M a set Z of |W| × CardMult literals that are not in W. The set Z of literals is appended to W, and procedure minWeakBackdoor is called to reduce the set W ∪ Z of literals into a small backdoor, which is the next search state.


Algorithm: Kilby(F, W, M)
  S ← ∅; B ← W
  RestartLimit ← 2; RestartCount ← 0; IterationLimit ← √n × 3; CardMult ← 2
  while RestartCount < RestartLimit do
    RestartCount ← RestartCount + 1
    W ← B
    for i ← 0 to IterationLimit do
      Z ← |W| × CardMult literals chosen randomly from M \ W
      W ← minWeakBackdoor(F, W ∪ Z)
      if |W| ≤ |B| then S ← S ∪ W
      if |W| < |B| then B ← W; RestartCount ← 0
  return S

Procedure: minWeakBackdoor(F, I)
  W ← ∅
  while I ≠ ∅ do
    choose literal l ∈ I; I ← I \ {l}
    if DPLL applied to F[W ∪ I] requires branching then
      // The following if statement is added in our sub-solver
1     if ¬is2SAT(F) ∧ ¬setVal(F, 0) ∧ ¬setVal(F, 1) then
        W ← W ∪ {l}
  return W

Kilby et al. use a simple sub-solver, which applies Satz-rand's unit propagation. We modify their algorithm Kilby to use the more sophisticated sub-solver we define. Algorithm KilbyImp is the local search algorithm that results from adding Line 1 in procedure minWeakBackdoor. One further difference is that our DPLL solver is Minisat, which has a powerful pre-processor.

We also propose a novel algorithm Tabu, which uses local search techniques, including tabu search, a best improvement strategy, and auxiliary local search. The search state W is the current candidate backdoor, and tabuList is a list of previously visited search states. The tabu tenure is set to 30 to prevent our Tabu from revisiting the last 30 search states. When the tabu list is full, the oldest state is replaced by the new state. The procedure searchNeighbors(W, S, M) evaluates the neighborhood of W and updates W with the best improving neighbor not in tabuList. The while loop stops if no new small backdoors have been found in the last RestartLimit iterations. The procedure localImprovement(S, M) is an auxiliary local search over the neighborhood of newly found small backdoors.

The procedure searchNeighbors(W, S, M) explores all IterationLimit neighbors of the current backdoor W to find a best non-tabu candidate backdoor. This is in contrast to Algorithm Kilby, which selects the first neighbor s′ encountered in the neighborhood of s without considering the cost of s′, i.e., |s′|. The value of minCost is the minimal size of the backdoors in Neighbor. If minCost is no larger than the size of the current smallest backdoor, then all the backdoors in Neighbor of size minCost are added to the list of small backdoors S. A small backdoor of size minCost is randomly selected from Neighbor to be the next search state. When minCost is larger than the size of the current smallest backdoor, the search can escape from local minima by making worse moves. If every non-tabu candidate backdoor in Neighbor has a larger size than the current smallest backdoor, the search moves to a best candidate backdoor from Neighbor.

Algorithm: Tabu(F, W, M)
  W ← minWeakBackdoor(F, W)
  preSize ← |S|; RestartLimit ← 2; RestartCount ← 0; tabuList ← ∅
  while RestartCount < RestartLimit do
    RestartCount ← RestartCount + 1
    cost ← searchNeighbors(W, S, M)
    if cost = 0 then break
    tabuList ← tabuList ∪ W
    if |S| > preSize then RestartCount ← 0
    preSize ← |S|
  tabuList ← ∅
  localImprovement(S, M)
  return S

Procedure: searchNeighbors(W, S, M)
  IterationLimit ← √n × 2; CardMult ← 2; Neighbor ← ∅; Cost ← ∅
  for i ← 0 to IterationLimit do
    Z ← |W| × CardMult literals chosen randomly from M \ W
    W ← minWeakBackdoor(F, W ∪ Z)
    if W ∉ tabuList then
      Neighbor ← Neighbor ∪ W
      Cost ← Cost ∪ |W|
  if |Neighbor| = 0 then return 0
  minCost ← min(Cost)
  if minCost ≤ current smallest backdoor size then
    S ← S ∪ {B ∈ Neighbor | |B| = minCost}
  W ← select a backdoor of size minCost randomly from Neighbor
  return minCost

The procedure localImprovement(S, M) is an auxiliary local search that attempts to find more minimal backdoors by replacing variables in s. The inspiration for the procedure is the observation that some variables appear in most backdoors and some backdoor sets only differ from each other by one variable.

Procedure: localImprovement(S, M)
  foreach new backdoor B ∈ S, B ∉ tabuList do
    tabuList ← tabuList ∪ B
    foreach literal l ∈ M \ B do
      B ← minWeakBackdoor(F, B ∪ l)
      if |B| ≤ current minimum backdoor size then S ← S ∪ B


Table 3. Size, percentage, and number of minimal backdoors found by the Exact algorithm when applied to small real-world instances with n variables and m clauses

Instance                  n      m      BD size (%)   # BDs
grieu-vmpc-s05-24s        576    49478  3 (0.52%)     143
een-tip-sat-texas-tp-5e   17985  153    1 (0.01%)     2
anomaly                   48     182    1 (2.08%)     2
medium                    116    661    1 (0.86%)     5
huge                      459    4598   2 (0.44%)     89
bw large.a                459    4598   2 (0.44%)     89
bw large.b                1087   13652  2 (0.18%)     7

5 Experimental Evaluation

In this section, we describe experiments on structured and real-world SAT instances to compare the algorithms shown in Table 2. The set of satisfiable test instances consists of planning instances from SATLIB [5] and all but six of the satisfiable real-world instances from SAT-Race 2008 (the instances excluded were those that Minisat was unable to solve within the competition time limit). The set of unsatisfiable test instances is from the domain of automotive configuration [12]. The instances were all pre-processed with Minisat, which can sometimes greatly reduce the number of clauses. The experiments were run on the Whale cluster of the SHARCNET system (www.sharcnet.ca). Each node of the cluster is equipped with four Opteron CPUs at 2.2 GHz and 4.0 GB memory.

5.1 Experiments on Finding Weak Backdoors

Algorithm Exact is able to find all minimal backdoors for instances with small backdoors (see Table 3). The sizes of minimal backdoors in the blocks world instances are smaller than those reported by Dilkina et al. [3], who report percentages between 1.09% and 4.17%, even though they used clause learning in addition to unit propagation. The reason is that our sub-solver not only applies unit propagation, but also tests for polynomial-time syntactic classes. Systematic algorithms do not scale up to instances with larger backdoors, though.

We also compared the small backdoors found by the local search algorithms, Kilby, KilbyImp, and Tabu. With different initial solutions as inputs, the local search algorithms were run repeatedly until a cutoff time was reached. Only the smallest backdoors found by the algorithms were recorded. The cutoff time was set to 3 hours for instances with fewer than 10,000 variables (see Table 4) and 15 hours for larger instances (see Table 5). For each instance, the algorithm that found the smallest backdoors among the three local search algorithms is highlighted, with the largest number of backdoors used to break ties.

When the cutoff time was reached, we waited for the algorithms to finish the current iteration. Because Tabu takes longer to complete one iteration than Kilby and KilbyImp, the time when Tabu found small backdoors in some SAT-Race 2008 instances was a little longer than 15 hours.


Table 4. Size, percentage, and number of small backdoors found by the local search algorithms within a cutoff of 3 hours when applied to real-world instances with n variables (n < 10,000) and m clauses

                                           Kilby                 KilbyImp              Tabu
Instance                  n      m         BD size (%)   # BDs   BD size (%)   # BDs   BD size (%)   # BDs

SAT Competition 2002
apex7 gr rcs w5.shuffled  1500   11136     77 (5.13%)    1       47 (3.13%)    4       53 (3.53%)    42885
dp10s10.shuffled          8372   8557      9 (0.11%)     10520   9 (0.11%)     9573    9 (0.11%)     59399
bart11.shuffled           162    675       15 (9.26%)    4190    14 (8.64%)    2903    14 (8.64%)    45044

SAT-Race 2005 and 2008
grieu-vmpc-s05-24s        576    49478     3 (0.52%)     143     3 (0.52%)     143     3 (0.52%)     143
grieu-vmpc-s05-27r        729    71380     4 (0.55%)     710     4 (0.55%)     660     4 (0.55%)     3271
simon-mixed-s02bis-01     2424   13793     8 (0.33%)     566     8 (0.33%)     566     8 (0.33%)     10440
simon-s02b-r4b1k1.2       2424   13811     8 (0.33%)     394     7 (0.29%)     3       7 (0.29%)     16

Blocks world planning
bw large.c                3016   50237     4 (0.13%)     1934    3 (0.10%)     15      3 (0.10%)     15
bw large.d                6325   131607    6 (0.10%)     790     5 (0.08%)     69      6 (0.10%)     640

Logistics planning
logistics.a               828    3116      20 (2.42%)    147     20 (2.42%)    6675    24 (2.90%)    584257
logistics.b               843    3480      16 (1.90%)    1688    15 (1.78%)    9789    16 (1.90%)    7634
logistics.c               1141   5867      26 (2.28%)    18      25 (2.19%)    387     28 (2.45%)    424467
logistics.d               4713   16588     25 (0.53%)    39      22 (0.47%)    61      28 (0.59%)    36610

The longest time recorded was 168 seconds after the 15-hour cutoff time. It is possible that Kilby and KilbyImp would have found smaller backdoors during this leeway. Although Tabu takes longer in one iteration than Kilby and KilbyImp, Tabu is sometimes able to find a larger number of backdoors in the given time, and for instances that have small backdoors of size less than 10, a remarkably larger number. For many more of these real-world instances, KilbyImp outperformed Kilby and Tabu in finding small backdoors. Both Kilby and KilbyImp select the first candidate backdoor encountered. The Tabu algorithm searches the entire neighborhood for the best improvement, which can be too expensive when the backdoor size and the total number of variables are large.

Williams et al. [13] experimented on practical instances with fewer than 10,000 variables and showed that such instances had relatively small backdoors. We extend their result to the SAT-Race 2008 instances, which have a huge number of variables and clauses. The SAT-Race 2008 instances have backdoors that consist of hundreds of variables. However, the backdoor size is usually less than 0.5% of the total number of variables. Thus, our results agree with Williams et al. that practical instances generally have small tractable structures.

5.2 Experiments on Finding Strong Backdoors

In previous work [11,2], unsatisfiable SAT benchmarks from automotive configuration [12] were used in the experiments. Among the 84 unsatisfiable instances, Minisat concludes the unsatisfiability of 71 instances after pre-processing. We applied the Strong algorithm to find minimal strong backdoors for the remaining 13 instances (see Table 6). The sizes of minimal strong backdoors range from 1 to 3, which are smaller than the sizes reported in [11,2].


Table 5. Size, percentage, and number of small backdoors found by the local search algorithms within a cutoff of 15 hours when applied to real-world instances with n variables (n > 10,000) and m clauses. An entry of timeout indicates that the local search algorithm failed to find any small backdoor within the cutoff time.

                                              Kilby                  KilbyImp               Tabu
Instance                 n         m          BD size (%)    # BDs   BD size (%)    # BDs   BD size (%)    # BDs
ibm-2002-04r-k80         104450    238773     252 (0.24%)    10      154 (0.15%)    53      184 (0.18%)    2
ibm-2002-11r1-k45        156626    290625     307 (0.20%)    3       282 (0.18%)    7       344 (0.22%)    2
ibm-2002-18r-k90         175216    370661     360 (0.21%)    3       331 (0.19%)    6       496 (0.28%)    1
ibm-2002-20r-k75         151202    319192     319 (0.21%)    4       275 (0.18%)    17      384 (0.25%)    1
ibm-2002-22r-k75         191166    399095     453 (0.24%)    4       424 (0.22%)    3       551 (0.29%)    2
ibm-2002-22r-k80         203961    427792     499 (0.25%)    1       466 (0.23%)    4       605 (0.30%)    1
ibm-2002-23r-k90         222291    469900     537 (0.24%)    2       534 (0.24%)    1       624 (0.28%)    2
ibm-2002-29r-k75         64686     258748     81 (0.13%)     11      58 (0.09%)     26      59 (0.09%)     1
ibm-2004-01-k90          64699     201260     148 (0.23%)    2       87 (0.13%)     5       93 (0.14%)     8
ibm-2004-1 11-k80        262808    565220     696 (0.27%)    4       648 (0.25%)    1       732 (0.28%)    1
ibm-2004-23-k100         207606    481764     524 (0.25%)    2       455 (0.22%)    1       618 (0.30%)    4
ibm-2004-23-k80          165606    379170     465 (0.28%)    2       441 (0.27%)    1       550 (0.33%)    1
ibm-2004-29-k55          37714     123699     67 (0.18%)     16      52 (0.14%)     21      49 (0.13%)     6381
ibm-2004-3 02 3-k95      73525     169473     1297 (1.76%)   1       238 (0.32%)    2       251 (0.34%)    1
mizh-md5-47-3            65604     153650     179 (0.27%)    1       179 (0.27%)    1       265 (0.40%)    4
mizh-md5-47-4            65604     153778     184 (0.28%)    2       190 (0.29%)    1       232 (0.35%)    2
mizh-md5-47-5            65604     153896     181 (0.28%)    2       181 (0.28%)    2       235 (0.36%)    1
mizh-md5-48-2            66892     157184     203 (0.30%)    1       203 (0.30%)    1       289 (0.43%)    1
mizh-md5-48-5            66892     157466     189 (0.28%)    6       189 (0.28%)    6       238 (0.36%)    1
mizh-sha0-35-3           48689     115548     258 (0.53%)    1       254 (0.52%)    2       238 (0.49%)    1
mizh-sha0-35-4           48689     115631     237 (0.49%)    1       237 (0.49%)    1       210 (0.43%)    1
mizh-sha0-36-1           50073     120102     261 (0.52%)    1       261 (0.52%)    1       219 (0.44%)    1
mizh-sha0-36-3           50073     120212     249 (0.50%)    1       260 (0.52%)    4       209 (0.42%)    5
mizh-sha0-36-4           50073     120279     237 (0.47%)    1       237 (0.47%)    1       220 (0.44%)    1
post-c32s-gcdm16-22      129652    88631      12 (0.01%)     133     12 (0.01%)     133     11 (0.01%)     126
velev-fvp-sat-3.0-b18    35853     968394     228 (0.64%)    3       212 (0.59%)    1       227 (0.63%)    1
velev-vliw-sat-4.0-b4    520721    13348080   timeout                timeout                933 (0.18%)    1
velev-vliw-sat-4.0-b8    521179    13378580   timeout                timeout                timeout
een-tip-sat-nusmv-t5.B   61933     42043      109 (0.18%)    6       88 (0.14%)     35      92 (0.15%)     14318
een-tip-sat-vis-eisen    18607     12801      8 (0.04%)      6087    8 (0.04%)      16466   8 (0.04%)      36941
narain-vpn-clauses-8     1461772   4572347    timeout                timeout                timeout
palac-sn7-ipc5-h16       114548    218043     10 (0.01%)     46      10 (0.01%)     46      10 (0.01%)     1533
palac-uts-l06-ipc5-h34   187667    606674     10 (0.01%)     152     10 (0.01%)     152     10 (0.01%)     102
schup-l2s-motst-2-k315   507145    590065     timeout                timeout                timeout
simon-s03-w08-15         132555    269328     233 (0.18%)    26      115 (0.09%)    31      152 (0.12%)    4

We found smaller backdoors because we applied a systematic search algorithm, and we defined sub-solvers both syntactically and algorithmically.

6 Conclusion

We presented exact algorithms for finding all minimal weak backdoors in satisfiable instances and all minimal strong backdoors in unsatisfiable instances. Building on Kilby et al.'s local search algorithm Kilby, we described our improved local search algorithms KilbyImp and Tabu for finding small weak backdoors. We empirically evaluated the algorithms on structured and real-world SAT instances. The experimental results show that our algorithms based on our proposed sub-solvers can find smaller backdoors and significantly larger numbers of backdoors than previous algorithms.


Table 6. Size and number of minimal strong backdoors found by the Strong algorithm when applied to automotive configuration instances with n variables and m clauses

Instance           n     m     BD size   # BDs
C168 FW SZ 128     1698  5425  3         6
C202 FS RZ 44      1750  6199  2         26
C210 FS RZ 23      1755  5778  3         17
C210 FS SZ 103     1755  5775  2         3
C210 FW RZ 57      1789  7405  2         4
C210 FW SZ 128     1789  7412  1         3
C220 FV SZ 65      1728  4496  1         2
C168 FW SZ 66      1698  5401  1         3
C202 FW SZ 87      1799  8946  3         90
C210 FS RZ 38      1755  5763  2         4
C210 FW RZ 30      1789  7426  3         16
C210 FW SZ 106     1789  7417  2         3
C210 FW UT 8630    2024  9721  1         2

In future work, we intend to use our algorithms for finding backdoors to study value and variable ordering mistakes and their effect on the performance of backtracking algorithms.

References

1. Davis, M., Logemann, G., Loveland, D.: A machine program for theorem proving. Commun. ACM 5(7), 394–397 (1962)
2. Dilkina, B., Gomes, C.P., Sabharwal, A.: Tradeoffs in the complexity of backdoor detection. In: Bessiere, C. (ed.) CP 2007. LNCS, vol. 4741, pp. 256–270. Springer, Heidelberg (2007)
3. Dilkina, B., Gomes, C.P., Sabharwal, A.: Backdoors in the context of learning. In: Kullmann, O. (ed.) SAT 2009. LNCS, vol. 5584, pp. 73–79. Springer, Heidelberg (2009)
4. Gregory, P., Fox, M., Long, D.: A new empirical study of weak backdoors. In: Stuckey, P.J. (ed.) CP 2008. LNCS, vol. 5202, pp. 618–623. Springer, Heidelberg (2008)
5. Hoos, H.H., Stutzle, T.: SATLIB: An online resource for research on SAT. In: Gent, I.P., Maaren, H.v., Walsh, T. (eds.) SAT 2000, pp. 283–292. IOS Press, Amsterdam (2000)
6. Interian, Y.: Backdoor sets for random 3-SAT. Paper presented at SAT 2003 (2003)
7. Kilby, P., Slaney, J., Thiebaux, S., Walsh, T.: Backbones and backdoors in satisfiability. In: Proc. of AAAI, pp. 1368–1373 (2005)
8. Kottler, S., Kaufmann, M., Sinz, C.: Computation of Renameable Horn Backdoors. In: Kleine Buning, H., Zhao, X. (eds.) SAT 2008. LNCS, vol. 4996, pp. 154–160. Springer, Heidelberg (2008)
9. Paris, L., Ostrowski, R., Siegel, P., Sais, L.: Computing Horn strong backdoor sets thanks to local search. In: Proc. of ICTAI, pp. 139–143 (2006)
10. Ruan, Y., Kautz, H., Horvitz, E.: The backdoor key: A path to understanding problem hardness. In: Proc. of AAAI, pp. 124–130 (2004)
11. Samer, M., Szeider, S.: Backdoor trees. In: Proc. of AAAI, pp. 363–368 (2008)
12. Sinz, C., Kaiser, A., Kuchlin, W.: Formal methods for the validation of automotive product configuration data. AI EDAM 17(1), 75–97 (2003)
13. Williams, R., Gomes, C., Selman, B.: Backdoors to typical case complexity. In: Proc. of IJCAI, pp. 1173–1178 (2003)
14. Williams, R., Gomes, C., Selman, B.: On the connections between backdoors and heavy-tails on combinatorial search. Paper presented at SAT 2003 (2003)


Normal Distribution Re-Weighting for Personalized Web Search

Hanze Liu and Orland Hoeber*

Department of Computer Science, Memorial University
St. John's, NL, Canada
{hl5458,hoeber}@mun.ca

* M.Sc. Supervisor.

Abstract. Personalized Web search systems have been developed to tailor Web search to users' needs based on their interests and preferences. A novel Normal Distribution Re-Weighting (NDRW) approach is proposed in this paper, which identifies and re-weights significant terms in vector-based personalization models in order to improve the personalization process. Machine learning approaches will be used to train the algorithm and discover optimal settings for the NDRW parameters. Correlating these parameters to features of the personalization model will allow this re-weighting process to become automatic.

1 Introduction

Web search is an essential tool for today's Web users. Web search systems such as Google, Yahoo! and Bing have been introduced to the public and achieved great success. However, traditional search engines share a fundamental problem: they commonly return the same search results to different users under the same query, ignoring the individual search interests and preferences of those users. This problem has hindered conventional Web search engines in their efforts to provide accurate search results to the users. To address the problem, personalized Web search has been introduced as a way to learn the individual search interests and preferences of users, and use this information to tailor the Web search to meet each user's specific information needs [6].

Personalized Web search employs personalization models to capture and represent users' interests and preferences, which are usually stored in the form of term vectors (see [2,6] for a review of vector-based models for personalized search). High-dimensional vectors are used to represent each user's interest in specific terms that might be present in the search results. These vectors are then used to provide a personalized re-ranking of the search results. In this research, we focus on improving personalized Web search through refining such vector-based personalization models.

The goal in our research is to develop methods to automatically identify and re-weight the significant terms in the target model. This approach is inspired by Luhn's seminal work in automatic text processing [5], in which he suggests that the "resolving power" of significant terms follows a normal distribution placed over a term list ranked by the frequency of term occurrence. In other words, Luhn suggests that the mid-frequency terms are more content bearing than either common terms or rare terms, and so are better indicators for the subject of the text. This idea has been widely utilized in the fields of automatic text summarization [8] and Web search coverage testing [1]. However, to the best of our knowledge, it has not been explored in the literature of personalized Web search. In the following sections, we will demonstrate how we could borrow Luhn's idea to improve the vector-based models used in personalized Web search.

2 Normal Distribution Re-Weighting (NDRW)

The first step in the NDRW approach is to rank the terms in the vector-based personalization model according to their frequency, resulting in a term histogram as illustrated in Fig. 1. Luhn's suggestion is that high-frequency and low-frequency terms are not valuable. By placing a normal distribution curve over top of the term histogram, we can assign a significance value to each term, reducing the weight of the terms near the two ends, and giving more weight to the valuable terms in the middle range.

To calculate the term significance (TS) value for each term, we employ the following formula:

TS(i) = normdist(s ∗ r_i) = (1 / √(2πσ²)) e^(−(s ∗ r_i − μ)² / (2σ²))     (1)

where r_i is the rank of a given term i, and s is a predetermined step size between any two adjacent terms along the x-axis. There are three parameters in this function that affect the shape of the normal distribution curve, and therefore the TS value for a given term. μ is the mean of the distribution; it decides the location of the centre of the normal distribution curve. σ² is the variance of the distribution; it describes how concentrated the distribution curve is around the mean. The step size s affects the steepness of the distribution curve given a constant variance. Once appropriate parameters are chosen for μ, σ² and s, which specify the location and shape of the normal distribution curve, TS values can be calculated for each term and used to re-weight the personalization model.

Fig. 1. NDRW re-weights the terms using a normal distribution curve
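As a concrete illustration of this re-weighting step, the short sketch below ranks a toy term profile by frequency, computes TS values with Equation (1), and scales the original weights; the parameter values, the helper names, and the toy profile are illustrative assumptions rather than settings from miSearch.

# Sketch: compute TS values over a frequency-ranked term vector and re-weight it.
# The mu, sigma^2 and s values below are placeholders; the paper's point is that
# they need to be learned rather than hand-picked.
import math

def term_significance(rank, mu, sigma2, s):
    """Normal-distribution significance of the term at position `rank` (0-based)."""
    x = s * rank
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def ndrw_reweight(profile, mu, sigma2, s):
    """profile: dict term -> frequency. Returns dict term -> frequency * TS(rank)."""
    ranked = sorted(profile, key=profile.get, reverse=True)   # rank 0 = most frequent
    return {t: profile[t] * term_significance(r, mu, sigma2, s)
            for r, t in enumerate(ranked)}

toy_profile = {"the": 120, "search": 40, "jaguar": 25, "habitat": 18, "speed": 9, "xyzzy": 1}
print(ndrw_reweight(toy_profile, mu=2.0, sigma2=1.5, s=1.0))
# Mid-ranked terms such as "jaguar" are damped far less than "the" or "xyzzy".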

miSearch [3] is an existing vector-based personalized Web search system that is used as the baseline system in this research. The novel feature of this personalization system is that it maintains multiple topic profiles for each user to avoid the noise which normally exists within single-profile personalization models. The topic profiles in miSearch are term vectors in which terms are extracted from the clicked result documents and weighted by term frequency. We have implemented the NDRW approach within this system to re-weight the terms in the topic profiles, and have been able to improve the accuracy of the ranked search results list by carefully choosing the NDRW parameters. However, an important part of this research is to automatically determine these parameters based on features within the target vector-based model. The process by which we plan to achieve this is outlined in the remainder of this paper.

3 Automatic Algorithm for NDRW

In order to develop the automatic algorithm for NDRW, we plan to employ a supervised machine learning scheme. There are three main steps in this plan: preparing the training data and test data for the learning process, defining the evaluation metrics to guide the learning, and training the optimum parameters and the algorithm.

Twelve queries were selected from the TREC 2005 Hard Track [7] for previous evaluations on the baseline miSearch system [3]. We will continue to use this test collection as the training data for our experiments. These queries were intentionally chosen because of their ambiguity. For each query, 50 search results have been collected and judged for relevance. The value of the personalization approach will be decided based on whether the relevant documents can be moved to the top of the search results list. For the test data, we will select another 12 ambiguous queries from this test collection and provide relevance judgements on the documents retrieved.

We will use average precision (AP) measured over the top-10 and top-20 documents as the evaluation metric. In order to facilitate the experiments, a test program will be implemented to automatically apply NDRW to the target personalization models with associated test queries, and directly output the resulting AP values, given a set of NDRW parameters.

To train the optimum parameters for each set of training data, Particle Swarm Optimization (PSO) [4] will be employed. The test program mentioned above will play the role of the fitness function in the PSO. The fitness value will be calculated by 60% of the top-10 AP value plus 40% of the top-20 AP value. Each particle contains three parameters (μ, σ² and s), and the optimum parameters are achieved when particles converge to the global best fitness value for a given set of training data.
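A rough sketch of such a PSO loop over the three parameters is given below; the evaluate() stub stands in for the test program described above, and the swarm size, inertia, and acceleration constants are generic defaults, not values chosen in this work.

# Sketch: particle swarm optimization over the three NDRW parameters (mu, sigma^2, s).
import random

def evaluate(mu, sigma2, s):
    # Placeholder fitness; in the planned experiments this would run NDRW on the
    # training queries and return 0.6 * AP@10 + 0.4 * AP@20.
    return -((mu - 3.0) ** 2 + (sigma2 - 2.0) ** 2 + (s - 0.5) ** 2)

bounds = [(0.0, 10.0), (0.1, 10.0), (0.01, 2.0)]      # assumed ranges for (mu, sigma^2, s)
w, c1, c2, n_particles, n_iters = 0.7, 1.5, 1.5, 20, 100

pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
vel = [[0.0, 0.0, 0.0] for _ in range(n_particles)]
pbest = [p[:] for p in pos]
pbest_fit = [evaluate(*p) for p in pos]
gbest = max(zip(pbest_fit, pbest))[1][:]

for _ in range(n_iters):
    for i in range(n_particles):
        for d in range(3):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (w * vel[i][d]
                         + c1 * r1 * (pbest[i][d] - pos[i][d])
                         + c2 * r2 * (gbest[d] - pos[i][d]))
            lo, hi = bounds[d]
            pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
        fit = evaluate(*pos[i])
        if fit > pbest_fit[i]:
            pbest_fit[i], pbest[i] = fit, pos[i][:]
            if fit > evaluate(*gbest):
                gbest = pos[i][:]

print("best (mu, sigma^2, s):", gbest)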


After gathering the optimum parameters for each set of training data, it may be possible to discover relationships between the optimum parameters and the features within the corresponding personalization models. Furthermore, by analyzing these relationships, we may be able to establish general rules for choosing the NDRW parameters.

With the established rules, the algorithm for automatically choosing the parameters can be constructed. We can then verify its quality by applying it to the test data, measuring the degree to which the AP is improved and how close the parameters are to the optimal parameters for each test query.

4 Conclusion and Future Work

In order to improve personalized Web search, we proposed a novel Normal Distribution Re-Weighting (NDRW) approach to identify and re-weight significant terms in vector-based personalization models. Currently, we are working on the main task of this research, which is to develop an automatic algorithm for choosing NDRW parameters based on the features of the target model. In the future, we plan to conduct user evaluations to measure the benefit of using the NDRW technique for improving personalized Web search in realistic settings.

References

1. Dasdan, A., D’Alberto, P., Kolay, S., Drome, C.: Automatic retrieval of similarcontent using search engine query interface. In: Proceedings of the ACM Conferenceon Information and Knowledge Management, pp. 701–710 (2009)

2. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for person-alized information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) AdaptiveWeb 2007. LNCS, vol. 4321, pp. 54–89. Springer, Heidelberg (2007)

3. Hoeber, O., Massie, C.: Automatic topic learning for personalized re-ordering ofweb search results. In: Snasel, V., Szczepaniak, P.S., Abraham, A., Kacprzyk, J.(eds.) Advances in Intelligent Web Mastering - 2. Advances in Intelligent and SoftComputing, vol. 67, pp. 105–116. Springer, Heidelberg (2010)

4. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: Proceedings of IEEEInternational Conference on Neural Networks, vol. IV, pp. 1942–1948 (1995)

5. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Researchand Development 2, 159–165 (1958)

6. Micarelli, A., Gasparetti, F., Sciarrone, F., Gauch, S.: Personalized Search on theWorld Wide Web. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web2007. LNCS, vol. 4321, pp. 195–230. Springer, Heidelberg (2007)

7. National Institute of Standards and Technology. TREC 2005 Hard Track, http:

//trec.nist.gov/data/t14 hard.html

8. Shen, D., Chen, Z., Yang, Q., Zeng, H., Zhang, B., Lu, Y., Ma, W.: Web-page clas-sification through summarization. In: Proceedings of the International ACM/SIGIRConference on Research and Development in Information Retrieval, pp. 242–249(2004)


Granular State Space Search

Jigang Luo and Yiyu Yao

Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2

{luo226,yyao}@cs.uregina.ca

Abstract. Hierarchical problem solving, in terms of abstraction hierarchies or granular state spaces, is an effective way to structure a state space for speeding up a search process. However, the problem of constructing and interpreting an abstraction hierarchy is still not fully addressed. In this paper, we propose a framework for constructing granular state spaces by applying results from granular computing and rough set theory. The framework is based on the addition of an information table to the original state space graph so that all the states grouped into the same abstract state are graphically and semantically close to each other.

1 Introduction

State space search is widely used for problem solving in artificial intelligence. Hierarchical problem solving using an abstraction hierarchy is one of the most popular approaches to speed up state space search [1, 5, 6]. One major issue that impacts the search efficiency of an abstraction hierarchy is backtracking [3]. Many methods have been proposed and investigated for constructing a good abstraction hierarchy that has as few backtrackings as possible [2, 3]. However, in the existing methods the semantic information of states is not explicitly used.

Granular computing is an emerging field of study dealing with problem solving at multiple levels of granularity. The triarchic theory of granular computing [8–10] provides a conceptual model for thinking, problem solving and information processing with hierarchical structures, of which an abstraction hierarchy is an example. Rough set theory provides a systematic and semantically meaningful way to granulate a universe with respect to an information table [4]. Based on results from rough set theory and granular computing, in this paper we propose a framework for constructing abstraction hierarchies by using semantic information. An information table is used to represent semantic information about states. Consequently, a better abstraction hierarchy can be constructed such that states in an abstract state are close to each other both semantically and graphically. This may not only prevent backtrackings but also lead to a better understanding of a problem.

2 Abstraction Hierarchy

A state space can be modeled by a graph G = (S, E), where S is a nonempty set of states and E is a nonempty set of edges connecting the states. An edge (s1, s2) is in E if s1 can be transformed to s2 by one operator in a problem solving process. A solution to a problem is an edge path from state s_start to state s_goal, where s_start is the start state and s_goal is a goal state of the problem.

An abstraction of a state space (S, E) is another state space (S′, E′) such that the following two conditions are satisfied: 1) there exists a partition on S whose blocks are mapped one-to-one onto the states in S′; if a state s belonging to a partition block is mapped to a state s′, then s is a pre-image of s′ and s′ is the image of s, and the set of all pre-images of s′ is the pre-image set of s′; 2) (s′1, s′2) is in E′ if and only if there is an edge in E that connects a pre-image of s′1 to a pre-image of s′2.

An abstraction of a state space may also have an abstraction, so there is a series of abstractions such that each one is more abstract than the previous one. The state space and the series of abstractions form an abstraction hierarchy for the state space. In an abstraction hierarchy the original state space is the level 0 state space, the first abstraction is the level 1 state space, and so on. The search through an abstraction hierarchy first finds a solution in the highest level state space, then in the lower level state spaces one by one, until a solution in the level 0 state space is found.

An abstraction hierarchy may speed up search for the following reasons: 1) a higher level state space has fewer states than the original state space, so search in a higher level state space is relatively faster than search in the original state space; 2) once a solution in a higher level state space is found, it can serve as a guide to finding a solution in the original state space: one only needs to search in the pre-image sets of the abstract states that belong to the abstract solution [3, 7].
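The following sketch illustrates this refinement strategy for a two-level hierarchy: search the abstract space first, then search the original space only inside the pre-image sets of the abstract solution. The data-structure choices (dictionaries mapping states to abstract states and abstract states to pre-image sets) and the function names are assumptions made for illustration.

```python
from collections import deque

def bfs(start, goals, neighbours, allowed=None):
    """Breadth-first search; `allowed` optionally restricts the states visited."""
    prev, queue = {start: None}, deque([start])
    while queue:
        u = queue.popleft()
        if u in goals:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in neighbours(u):
            if v not in prev and (allowed is None or v in allowed):
                prev[v] = u
                queue.append(v)
    return None

def hierarchical_search(start, goal, neighbours, abstract_of, pre_images,
                        abstract_neighbours):
    """Two-level search: solve the abstract space, then refine the abstract
    solution by searching only inside its pre-image sets."""
    abstract_path = bfs(abstract_of[start], {abstract_of[goal]},
                        abstract_neighbours)
    if abstract_path is None:
        return None
    allowed = set().union(*(pre_images[a] for a in abstract_path))
    return bfs(start, {goal}, neighbours, allowed)
```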

An abstraction hierarchy does not always speed up search, due to backtracking [3]. A good abstraction hierarchy should satisfy two requirements: 1) backtrackings are reduced to as few as possible; 2) all states in the same pre-image set are semantically close. The first requirement guarantees the efficiency of the search process and the second requirement helps give a better understanding of the state space. In the next section we propose a granular computing approach to construct abstraction hierarchies that satisfy both requirements.

3 A Granular Computing Model for Constructing Abstraction Hierarchies

In existing methods for constructing an abstraction hierarchy, one typically uses the structural information about a state space. States that are close to each other according to their distance in the state space graph are grouped to form abstract states. Such a granulation may not necessarily reflect the semantic closeness of different states. To resolve this problem, we propose a new framework by combining attribute-oriented granulation from rough set theory, for generating semantically meaningful abstractions, with graph-oriented verification, for selecting graphically meaningful abstractions. The proposed model has three main components. They are explained in this section.



Attribute-Oriented Granulation
Information tables [4] are an important knowledge representation method in granular computing. In an information table, a set of attributes is used to describe a set of objects.

Definition 1. An information table is a tuple (U, At, {Va | a ∈ At}, {Ia | a ∈ At}), where U is a finite nonempty set of objects, At is a finite nonempty set of attributes, Va is a nonempty set of values for a ∈ At, and Ia: U −→ Va is an information function for a ∈ At.

Each information function Ia is a total function that maps an object of U to exactly one value in Va. An information table can be conveniently presented in a table form. In an information table for representing a state space, U is the set of states, At is the set of attributes for describing the states, Va is the set of values that the attribute a could possess, and Ia determines the value of attribute a for every state. If Ia(s) = k we say that the state s's value for a is k.

We adopt the three-disk Hanoi problem from Knoblock [3] as an example to illustrate the main ideas. For the three-disk Hanoi problem, there are three pegs, peg1, peg2 and peg3, and three disks, A, B and C. Disk A is bigger than disk B and disk B is bigger than disk C. A disk can be put on any peg, and the top disk on one peg can be moved to the top of another peg. There is a constraint that no disk may have a bigger disk on top of it. At the beginning all the disks are on peg1, and we need to find a way to move all the disks to peg3.

We can construct an information table for the 27 states in the problem as (U, At, {Va | a ∈ At}, {Ia | a ∈ At}), where U is the set of all 27 states and At has three attributes A, B, C that represent the positions of disks A, B and C, respectively. VA has three values 1, 2, 3 that indicate A is on peg1, peg2 and peg3, respectively. For the same reason VB and VC also have three values 1, 2, 3. Ia is a function that maps a state to a value: Ia(s) = n if and only if, in state s, disk a is on pegn. This information table can be presented in table form by Fig. 1(i). Columns are labeled by attributes, rows are labeled by state names. A row is a description of a state. For example, the row s3 describes the state where A is on peg1, B is on peg1 and C is on peg3.
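The information table of Fig. 1(i) can also be generated programmatically. The sketch below enumerates the 27 states and their attribute values; the dictionary representation and the state-naming scheme are assumptions chosen for illustration (the enumeration is consistent with the example of state s3 given above).

```python
from itertools import product

def hanoi_information_table():
    """Return {state_name: {attribute: peg}} for the three-disk Hanoi problem.

    Attributes A, B and C give the peg (1, 2 or 3) on which disks A, B and C sit.
    """
    table = {}
    for idx, (a, b, c) in enumerate(product((1, 2, 3), repeat=3), start=1):
        table[f"s{idx}"] = {"A": a, "B": b, "C": c}
    return table

table = hanoi_information_table()
print(table["s3"])  # {'A': 1, 'B': 1, 'C': 3} -- A on peg1, B on peg1, C on peg3
```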

Definition 2. Suppose U is a universe of a domain. A Granule of U is a nonempty subset of U. A Granulation of U is a partition of U, that is, a set of granules of U such that the intersection of any two granules is empty and the union of all the granules is U. For two granulations of U, G and G′, G is a refined granulation of G′, or G′ is a coarsened granulation of G, written G ≼ G′, iff for every g ∈ G, there is a g′ ∈ G′ such that g ⊆ g′. If G0 ≼ G1 ≼ G2 ≼ · · · ≼ Gn (n ≥ 1), we say G0 ≼ G1 ≼ G2 ≼ · · · ≼ Gn is an (n + 1)-level Granulation Hierarchy, G0 is the level 0 granulation, G1 is the level 1 granulation, and so on.

If a universe is described by an information table, a granulation can be constructed by using a subset of attributes [4]. Let F ⊆ At be a subset of attributes. According to F, we can define an equivalence relation RF as:

x RF y ⇔ ∀a ∈ F (Ia(x) = Ia(y))



Fig. 1. State Space and Information Table: (a) the original state space; (b) the {A, B} induced abstraction; (c) the {A} induced abstraction; (d) the {A} induced information table; (e) the {A, B} induced information table; (i) an information table for the three-disk Hanoi problem



That is, two objects are equivalent if and only if they have the same values on all attributes in F. The equivalence class containing x is given by [x]RF = {y ∈ U | x RF y}. The partition U/RF = {[x]RF | x ∈ U} induced by RF is a granulation of U, and every equivalence class is a granule.

Let G be the granulation of a state space induced by an attribute set F. The set of attributes At − F consists of all attributes that are not used. We delete all the columns in the information table that correspond to attributes in At − F, delete duplicated rows to get a new information table, and take the resulting rows as new states. In this way we obtain all the abstract states; every abstract state corresponds to an equivalence class, or granule. We then create an abstraction by connecting two abstract states by an edge as long as there is an edge between two states in their pre-image sets. As this abstraction is created by the granulation G induced by F, we call it the F induced abstraction. For example, Fig. 1(e) is the {A, B} induced information table and Fig. 1(b) is the {A, B} induced abstraction.
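A sketch of the attribute-oriented granulation and the induced abstraction described above (the function names and the representation of edges as pairs of states are assumptions made for illustration):

```python
def granulate(table, F):
    """Partition the states of an information table by the attribute subset F.

    table -- {state: {attribute: value}}
    F     -- iterable of attribute names
    Returns {value_tuple_on_F: set_of_states}; each value is one granule,
    i.e. the pre-image set of one abstract state.
    """
    F = tuple(F)
    granules = {}
    for state, row in table.items():
        key = tuple(row[a] for a in F)
        granules.setdefault(key, set()).add(state)
    return granules

def induced_abstraction(edges, granules):
    """Connect two abstract states iff some edge joins their pre-image sets."""
    state_to_key = {s: k for k, members in granules.items() for s in members}
    return {(state_to_key[s1], state_to_key[s2])
            for s1, s2 in edges
            if state_to_key[s1] != state_to_key[s2]}

# For example, granulate(table, ["A", "B"]) yields nine granules, which
# correspond to the abstract states of the {A, B} induced abstraction.
```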

Backtracking-Free Abstraction Selection
An abstraction of a state space is called a backtracking-free abstraction if a solution found in the abstract state space can be refined into a solution in the original space. That is, it is not necessary to backtrack to another abstract state in the higher level once we reach a lower level. Granulation based on a subset of attributes does not consider the graphical information of a state space, so an attribute-oriented abstraction may not necessarily be a backtracking-free abstraction. Thus, graphical information should be used to select graphically meaningful, i.e., backtracking-free, abstractions. One can analyze the edges of the state space and select only the subsets of attributes that induce backtracking-free abstractions.

Recall that [x]RF is an abstract state consisting of many states in a lower level. An incoming state of [x]RF is a state that has an incoming edge from a state outside [x]RF, and an outgoing state of [x]RF is a state that has an outgoing edge to a state outside [x]RF. To avoid backtracking, we can select the subsets of attributes that induce abstractions satisfying the following condition: for any equivalence class [x]RF, any pair of incoming and outgoing states is connected by a path that passes only through states in [x]RF.
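This condition can be checked directly on the state space graph. The sketch below runs, for every granule, a breadth-first search restricted to the granule and verifies that each incoming state reaches every outgoing state without leaving the granule (function names are assumptions made for illustration):

```python
from collections import deque

def is_backtracking_free(edges, granules):
    """Check that, in every granule, each incoming state can reach every
    outgoing state through states of that granule only."""
    succ = {}
    for s1, s2 in edges:
        succ.setdefault(s1, set()).add(s2)

    for granule in granules.values():
        incoming = {t for s, t in edges if t in granule and s not in granule}
        outgoing = {s for s, t in edges if s in granule and t not in granule}
        for src in incoming:
            seen, queue = {src}, deque([src])
            while queue:                      # BFS restricted to this granule
                u = queue.popleft()
                for v in succ.get(u, ()):
                    if v in granule and v not in seen:
                        seen.add(v)
                        queue.append(v)
            if not outgoing <= seen:
                return False
    return True
```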

Constructing Abstraction Hierarchies
Suppose G1 is the granulation induced by a subset of attributes A1 and G2 is the granulation induced by another subset of attributes A2. If A1 ⊃ A2, then G1 is a refined granulation of G2. If A0 ⊃ A1 ⊃ A2 ⊃ · · · ⊃ An, the granulations induced by A0, A1, · · · , An form an (n + 1)-level granulation hierarchy. It provides an (n + 1)-level abstraction hierarchy.

If every abstraction in an (n + 1)-level abstraction hierarchy is a backtracking-free abstraction, we obtain a backtracking-free abstraction hierarchy. By combining attribute-oriented granulation and backtracking-free abstraction selection, one can easily construct a backtracking-free abstraction hierarchy.



Take the three-disk Hanoi problem as an example: Fig. 1(a) shows the original state space; since every edge is bidirectional, we do not draw arrows on the edges. From all possible attribute-oriented granulations we select the subsets of attributes that induce backtracking-free abstractions; they are {A} and {A, B}. As {A} ⊂ {A, B}, we can create a three-level abstraction hierarchy as shown by Fig. 1: level 0 (Fig. 1(a)) is the original state space, level 1 (Fig. 1(b)) is induced by {A, B}, and level 2 (Fig. 1(c)) is induced by {A}.

4 Conclusion

In this paper, we propose a granular computing model for constructing abstraction hierarchies. Our model introduces an information table for describing states and combining states into abstract states. This approach guarantees that all the pre-images of the same abstract state are semantically close, and that every abstract state is semantically meaningful. An abstraction hierarchy constructed by our model not only avoids backtracking, but also gives a better understanding of a problem.

References

1. Bacchus, F., Yang, Q.: Downward refinement and the efficiency of hierarchical problem solving. Artificial Intelligence 71, 43–100 (1994)

2. Holte, R.C., Mkadmi, T., Zimmer, R.M., MacDonald, A.J.: Speeding up problem solving by abstraction: a graph oriented approach. Artificial Intelligence 85, 321–361 (1996)

3. Knoblock, C.A.: Generating Abstraction Hierarchies: An Automated Approach to Reducing Search in Planning. Kluwer Academic Publishers, Boston (1993)

4. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston (1991)

5. Sacerdoti, E.D.: Planning in a hierarchy of abstraction spaces. Artificial Intelligence 5, 115–135 (1974)

6. Shell, P., Carbonell, J.: Towards a general framework for composing disjunctive and iterative macro-operators. In: Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 596–602 (1989)

7. Yang, Q., Tenenberg, J.D.: Abtweak: Abstracting a nonlinear, least commitment planner. In: Proceedings of the Eighth National Conference on Artificial Intelligence, pp. 204–209 (1990)

8. Yao, Y.Y.: Artificial intelligence perspectives on granular computing. In: Pedrycz, W., Chen, S.H. (eds.) Granular Computing and Intelligent Systems. Springer, Berlin (2011)

9. Yao, Y.Y.: A unified framework of granular computing. In: Pedrycz, W., Skowron, A., Kreinovich, V. (eds.) Handbook of Granular Computing, pp. 401–410. Wiley, New York (2008)

10. Yao, Y.Y.: Granular computing: past, present and future. In: 2008 IEEE International Conference on Granular Computing, pp. 80–85 (2008)


Comparing Humans and Automatic Speech Recognition Systems in Recognizing Dysarthric Speech

Kinfe Tadesse Mengistu and Frank Rudzicz

University of Toronto, Department of Computer Science
6 King's College Road, Toronto, Ontario, Canada

{kinfe,frank}@cs.toronto.edu

Abstract. Speech is a complex process that requires control and coordination of articulation, breathing, voicing, and prosody. Dysarthria is a manifestation of an inability to control and coordinate one or more of these aspects, which results in poorly articulated and hardly intelligible speech. Hence individuals with dysarthria are rarely understood by human listeners. In this paper, we compare and evaluate how well dysarthric speech can be recognized by an automatic speech recognition (ASR) system and naïve adult human listeners. The results show that despite the encouraging performance of ASR systems, and contrary to the claims in other studies, on average human listeners perform better in recognizing single-word dysarthric speech. In particular, the mean word recognition accuracy of speaker-adapted monophone ASR systems on stimuli produced by six dysarthric speakers is 68.39%, while the mean percentage of correct responses of 14 naïve human listeners on the same speech is 79.78%, as evaluated using a single-word multiple-choice intelligibility test.

Keywords: speech recognition, dysarthric speech, intelligibility.

1 Introduction

Dysarthria is a neurogenic motor speech impairment which is characterized by slow, weak, imprecise, or uncoordinated movements of the speech musculature [1], resulting in unintelligible speech. This impairment results from damage to neural mechanisms that regulate the physical production of speech and is often accompanied by other physical handicaps that limit interaction with modalities such as standard keyboards. Automatic speech recognition (ASR) can, therefore, assist individuals with dysarthria to interact with computers and control their environments. However, the deviation of dysarthric speech from the assumed norm in most ASR systems makes the benefits of current speaker-independent (SI) speech recognition systems unavailable to this population of users.




Although reduced intelligibility is one of the distinguishing characteristics of dysarthric speech, it is also characterized by highly consistent articulatory errors [1]. The consistency of errors in dysarthric speech can, in principle, be exploited to build an ASR system specifically tailored to a particular dysarthric speaker, since ASR models do not necessarily require intelligible speech as long as consistently articulated speech is available. However, building a speaker-dependent (SD) model trained on spoken data from an individual dysarthric speaker is practically infeasible due to the difficulty of collecting a large enough amount of training data from a dysarthric subject. Therefore, a viable alternative is to adapt an existing SI model to the vocal characteristics of a given dysarthric individual.

The purpose of this study is to compare naïve human listeners and speaker-adapted automatic speech recognition (ASR) systems in recognizing dysarthric speech and to investigate the relationship between intelligibility and ASR performance. In earlier studies, it has been shown that ASR systems may outperform human listeners in recognizing impaired speech [2–4]. However, since intelligibility is typically a relative rather than an absolute measure [5], these results do not necessarily generalize. Intelligibility may vary depending on the size and type of vocabulary used, the familiarity of the listeners with the intended message or the speakers, the quality of recording (i.e. the signal-to-noise ratio), and the type of response format used.

Yorkston and Beukelman [6] compared three different types of response formats: transcription, sentence completion, and multiple choice. In transcription, listeners were asked to transcribe the word or words that had been spoken. In sentence completion, listeners were asked to complete sentences from which a single word had been deleted. In the multiple choice format, listeners selected the spoken word from a list of phonetically similar alternatives. Their results indicated that transcription was associated with the lowest intelligibility scores, while multiple choice tasks were associated with the highest scores. This clearly shows that listeners' performance can vary considerably depending on the type of response format used. Therefore, when comparing human listeners and an ASR system, the comparison should be made on a level ground; i.e., both should be given the same set of alternative words (foils) from which to choose. In other words, it would be unfair to compare an ASR system and a human listener without having a common vocabulary, and since the innate vocabulary of our participants is unknown (but may exceed 17,000 base words [7]), we opt for a small common vocabulary. Hence, the multiple choice response format is chosen in this paper.

2 Method

2.1 Speakers

The TORGO database consists of 15 subjects, of which eight are dysarthric (five males, three females) and seven are non-dysarthric control subjects (four males, three females) [8]. All dysarthric participants have been diagnosed by a speech-language pathologist according to the Frenchay Dysarthria Assessment [9] to determine the severity of their deficits. According to this assessment, four speakers (i.e., F01, M01, M02, and M04) are severely dysarthric, one speaker (M05) is moderately-to-severely dysarthric, and one subject (F03) is moderately dysarthric. Two subjects (M03 and F04) have very mild dysarthria and are not considered dysarthric in this paper, as their measured intelligibility is not substantially different from that of the non-dysarthric speakers in the database.

2.2 Speech Stimuli

Three hours of speech are recorded from each subject in multiple sessions, in which an average of 415 utterances are recorded from each dysarthric speaker and 800 from each control subject. The single-word stimuli in the database include repetitions of English digits, the international radio alphabets, the 20 most frequent words in the British National Corpus (BNC), and a set of words selected by Kent et al. to demonstrate phonetic contrasts [5]. The sentence stimuli are derived from the Yorkston-Beukelman assessment of intelligibility [10] and the TIMIT database [11]. In addition, each participant is asked to describe in his or her own words the contents of a few photographs that are selected from standardized tests of linguistic ability, so as to include dictation-style speech in the database.

A total of 1004 single-word utterances were selected from the recordings of the dysarthric speakers and 808 from the control speakers for this study. These consist of 607 unique words. Each listener is presented with 18% of the data (single-word utterances) from each dysarthric subject, where 5% of randomly selected utterances are repeated for intra-listener agreement analysis, resulting in a total of 180 utterances from the six dysarthric individuals. In addition, a total of 100 single-word utterances are selected from three male and three female control subjects, comprising about 6% of the utterances from each speaker. Altogether, each participant listens to a total of 280 speech files, which are presented in a random order. Inter-listener agreement is measured by ensuring that each utterance is presented to at least two listeners.

2.3 Listeners

Fourteen native North American English speakers who had no previous familiarity with dysarthric speech and without hearing or vision impairment were recruited as listeners. The listening task consisted of a closed-set multiple-choice selection in which listeners were informed that they would be listening to a list of single-word utterances spoken by individuals with and without speech disorders, in a random order. For every spoken word, a listener was required to select the word that best matched his/her interpretation from among a list of eight alternatives. Four of the seven foils were automatically selected from phonetically similar words in the pronunciation lexicon, differing from the true word in one or two phonemes. The other three foils were generated by an HMM-based speech recognizer trained on the entire data to produce an N-best list, such that the first three unique words different from the target word were selected. Listeners were allowed to replay prompts as many times as they wanted.

3 Intelligibility Test Results

For each listener, the percentages of correct responses out of the 180 dysarthric prompts and 100 non-dysarthric prompts were calculated separately. The correct percentages were then averaged across the 14 listeners to compute the mean recognition score of naïve human listeners on dysarthric and non-dysarthric speech. Accordingly, the mean recognition score of human listeners is 79.78% for stimuli produced by dysarthric speakers and 94.4% for stimuli produced by control speakers. Figure 1 depicts the recognition scores of the 14 naïve listeners on stimuli produced by dysarthric and control speakers.

Fig. 1. Word recognition score of 14 naïve human listeners

To measure the intelligibility of stimuli produced by a speaker, the responses of all listeners for the stimuli produced by that speaker are collected together and the percentage of correct identifications is computed. Accordingly, for severely dysarthric speakers, the intelligibility score ranged from 69.05% to 81.88%, with the mean score being 75.2%. Speaker M05, who is moderately-to-severely dysarthric, had 87.88% of his words correctly recognized, and the moderately dysarthric speaker F03 had 90% of her words recognized correctly. These results are presented in Figure 2.



Fig. 2. Intelligibility score of six dysarthric speakers as rated by 14 naïve human listeners

On average, listeners agreed on common utterances between 72.2% and 81.6% of the time, with the mean inter-listener agreement being 77.2%. The probability of chance agreement here is 12.5%, since there are 8 choices per utterance.

Intra-listener reliability is measured as the proportion of times that a listener identifies the same word across two presentations of the same audio prompt. The mean intra-listener agreement across all listeners is 88.5%, with the lowest being 79.6% and the highest being 96.3% (listeners 7 and 10).

4 ASR Experiments and Results

4.1 Data Description

The speaker-independent (SI) acoustic models are built using a subset of the TORGO database consisting of over 8400 utterances recorded from six dysarthric speakers, two speakers with very mild dysarthria, and seven control subjects. The SI models are trained and evaluated using the leave-one-out method; i.e., data from one speaker are held out for evaluation while all the remaining data from the other speakers are used for training. The held-out data from the test speaker are divided into an evaluation set and an adaptation set. The evaluation set consists of all unique single-word stimuli spoken by the test dysarthric speaker (described in Section 2.2), while the remaining data are later used as the adaptation set to adapt an SI acoustic model to the vocal characteristics of a particular dysarthric speaker.



4.2 Acoustic Features

We compare the performance of acoustic models based on Mel-Frequency Cepstral Coefficients (MFCCs), Linear Predictive Coding-based Cepstral Coefficients (LPCCs), and Perceptual Linear Prediction (PLP) coefficients with various feature parameters, including the use of Cepstral Mean Subtraction (CMS) and the use of the 0th order cepstral coefficient as the energy term instead of the log of the signal energy. The use of CMS was found to be counterproductive in all cases. This is because single-word utterances are very short and CMS is only useful for utterances longer than 2–4 seconds [12]. The recognition performances of the baseline SI monophone models based on MFCC and PLP coefficients with the 0th order cepstral coefficient are comparable (39.94% and 39.5%), while LPCC-based models gave the worst baseline recognition performance of 34.33%. A further comparison of PLP and MFCC features on speaker-adapted systems showed that PLP-based acoustic models outperformed MFCC-based systems by 2.5% absolute. As described in [13], PLP features are more suitable in noisy conditions due to the use of a different non-linearity compression, i.e., the cube root instead of the logarithm on the filter-bank output. The data used in these experiments contain considerable background noise as well as other types of noise produced by the speakers due to hyper-nasality and breathy voices. These aspects may explain why PLP performed better than MFCCs and LPCCs in these experiments. The rest of the experiments presented in this paper are based on PLP acoustic features. PLP incorporates the known perceptual properties of human hearing, namely critical band frequency resolution, pre-emphasis with an equal loudness curve, and the power law model of hearing.

A feature vector containing 13 cepstral components, including the 0th order cepstral coefficient, and the corresponding delta and delta-delta coefficients, comprising 39 dimensions, is generated every 15 ms for dysarthric speech and every 10 ms for non-dysarthric speech.
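For illustration, the sketch below extracts an analogous 39-dimensional feature stream with the librosa library; it uses MFCCs rather than the PLP front end of HTK used in this work (librosa does not provide a PLP implementation), and the 16 kHz sampling rate and the file names in the usage comments are assumptions.

```python
import librosa
import numpy as np

def cepstral_features(wav_path, frame_shift_s=0.015):
    """13 cepstral coefficients (including c0) plus deltas and delta-deltas,
    one 39-dimensional frame every `frame_shift_s` seconds."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(frame_shift_s * sr)
    c = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    d1 = librosa.feature.delta(c)
    d2 = librosa.feature.delta(c, order=2)
    return np.vstack([c, d1, d2]).T  # shape: (n_frames, 39)

# feats = cepstral_features("dysarthric_utt.wav")          # 15 ms frame shift
# feats = cepstral_features("control_utt.wav", 0.010)      # 10 ms frame shift
```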

4.3 Speaker-Independent Baseline Models

The baseline SI systems consist of 40 left-to-right, 3-state monophone hidden Markov models and one single-state short pause (sp) model, with 16 Gaussian mixture components per state. During recognition, the eight words that are used as alternatives for every spoken test utterance during the listening experiments are formulated as an eight-word finite-state grammar, which is automatically parsed into the format required by the speech recognizer. The pronunciation lexicon is based on the CMU pronunciation dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). All ASR experiments are performed using the Hidden Markov Model Toolkit (HTK) [14].
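To make the recognition grammar concrete, the snippet below writes an HTK-style eight-word grammar for one test utterance and compiles it with HTK's HParse tool into a word network; the file names are hypothetical, and the grammar layout follows the standard HTK book convention rather than anything specified in the paper.

```python
import subprocess

def build_word_network(alternatives, gram_path="gram", wdnet_path="wdnet"):
    """Write an eight-word HTK-style grammar and compile it with HParse.

    alternatives -- the eight candidate words (target word plus seven foils)
                    shown to the listeners for this utterance.
    """
    words = " | ".join(w.upper() for w in alternatives)
    with open(gram_path, "w") as f:
        f.write(f"$word = {words};\n")
        f.write("( SENT-START $word SENT-END )\n")
    # Assumes HTK is installed and HParse is on the PATH.
    subprocess.run(["HParse", gram_path, wdnet_path], check=True)
```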

The mean recognition accuracy of the baseline SI monophone models using PLP acoustic features on single-word recognition, where eight alternatives are provided for each utterance, is 39.5%. The poor performance of the SI models in recognizing dysarthric speech is not surprising, since data from each dysarthric speaker deviate considerably from the training data. Word-internal triphone models show little improvement over the baseline monophone models for the dysarthric data in our database. Hence, we use the monophone models as our baseline in the rest of the experiments.

4.4 Acoustic and Lexical Model Adaptation

To improve recognition accuracy, the SI models are tailored to the vocal characteristics of each dysarthric subject. Here we use a 3-level cascaded adaptation procedure. First we use maximum likelihood linear regression (MLLR) adaptation followed by maximum a posteriori (MAP) estimation to adapt each SI model to the vocal characteristics of a particular dysarthric subject. We then analyze the pronunciation deviations of each dysarthric subject from the canonical form and build an associated speaker-specific pronunciation lexicon that incorporates their particular pronunciation behavior.

Using the adaptation data from a particular speaker, we perform a two-pass MLLR adaptation. First, a global adaptation is performed, which is then used as an input transformation to compute more specific transforms using a regression class tree with 42 terminals. We then carry out 2 to 5 consecutive iterations of maximum a posteriori (MAP) adaptation, using the models that have been transformed by MLLR as the priors and maximizing the posterior probability using prior knowledge about the model parameter distribution. This process resulted in a 25.81% absolute (43.07% relative) improvement.

Using speaker-dependent (SD) pronunciation lexicons, constructed as described in [15], during recognition improved the word recognition rate further by an average of 3.18% absolute (8.64% relative). The SD pronunciation lexicons consist of multiple pronunciations for some words that reflect the particular pronunciation pattern of each dysarthric subject. In particular, we listened to 25% of the speech data from each dysarthric subject and carefully analyzed the pronunciation deviations of each subject from the norm; i.e., the desired phoneme sequence as determined by the CMU pronunciation dictionary was compared against the actual phoneme sequences observed, and the deviations were recorded. These deviant pronunciations were then encoded into the generic pronunciation lexicon as alternatives to existing pronunciations [15]. Figure 3 depicts the performance of the baseline and speaker-adapted (SA) models on dysarthric speech.

In total, the cascaded approach of acoustic and lexical adaptation improved the recognition accuracy significantly, by 28.99% absolute (47.94% relative) over the baseline, yielding a mean word recognition accuracy of 68.39%.

For non-dysarthric speech, the mean word recognition accuracy of the SI baseline monophone models is 71.13%. After acoustic model adaptation, the mean word recognition accuracy rises to 88.55%.

5 Discussion of Results

Fig. 3. ASR performance on dysarthric speech

When we compare the performance of the speaker-adapted ASR systems with the intelligibility ratings of the human listeners on dysarthric speech, we observe that in most cases human listeners are more effective at recognizing dysarthric speech. However, an ASR system recognized more stimuli produced by speaker F01 than the human listeners did. Figure 4 summarizes the results.

Fig. 4. Human listeners vs. ASR system recognition scores on dysarthric speech

Humans are typically robust at speech recognition even in the presence of very low signal-to-noise ratios [16]. This may partially explain their relatively high performance here. Dysarthric speech contains not only distorted acoustic information due to imprecise articulation but also undesirable acoustic noise due to improper breathing, which severely degrades ASR performance. Due to the remarkable ability of human listeners to separate and pay selective attention to the different sound sources in a noisy environment [17], the acoustic noise due to improper breathing has less impact on human listeners than on ASR systems. For instance, the audible noise produced by breathy voices and hyper-nasality is strong enough to confuse ASR systems, while human listeners can easily ignore it. This suggests that noise resilience is an area that should be investigated further to improve ASR performance on dysarthric speech. Furthermore, approaches to deal with other features of dysarthric speech, such as stuttering, prosodic disruptions, and inappropriate intra-word pauses, are areas for further investigation in order to build an ASR system with performance comparable to human listeners in recognizing dysarthric speech.

Although there appears to be some relationship between intelligibility ratings and ASR performance, the latter is especially affected by the level of background noise and the involuntary noise produced by the dysarthric speakers. The impact of hyper-nasality and breathy voice appears to be more severe on ASR systems than on the intelligibility ratings among human listeners for single-word utterances. F01, for instance, is severely dysarthric, but the ASR performs better than the human listeners because most of the errors in her speech could be offset by acoustic and lexical adaptation. M04, on the other hand, who is also severely dysarthric, was relatively more intelligible but was the least well understood by the corresponding speaker-adapted ASR system, since this speaker is characterized by breathy voice, prosodic disruptions, and stuttering.

6 Concluding Remarks

In this paper we compared naïve human listeners and speaker-adapted automatic speech recognition systems in recognizing dysarthric speech. Since intelligibility may vary widely depending on the type of stimuli and response format used, our basis of comparison is designed so that both the human listeners and the ASR systems are compared on a level ground. Here, we use a multiple-choice format with a closed set of eight alternatives, where the same set of alternatives is provided for every single-word utterance to both the ASR systems and the human listeners. Although there is one case in which a speaker-adapted ASR system performed better than the aggregate of human listeners, in most cases the human listeners are more effective in recognizing dysarthric speech than ASR systems. However, the mean word recognition accuracy of the speaker-adapted ASR systems (68.39%) relative to the baseline of 39.5% is encouraging. Future work ought to concentrate on improved methods to deal with breathy voice, stuttering, prosodic disruptions, and inappropriate pauses in dysarthric speech to further improve ASR performance.

Acknowledgments. This research project is funded by the Natural Sciences and Engineering Research Council of Canada and the University of Toronto.

References

1. Yorkston, K.M., Beukelman, D.R., Bell, K.R.: Clinical Management of Dysarthric Speakers. Little, Brown and Company (Inc.), Boston (1988)

2. Carlson, G.S., Bernstein, J.: Speech recognition of impaired speech. In: Proceedings of the RESNA 10th Annual Conference, pp. 103–105 (1987)

3. Stevens, G., Bernstein, J.: Intelligibility and machine recognition of deaf speech. In: Proceedings of the RESNA 8th Annual Conference, pp. 308–310 (1985)

4. Sharma, H.V., Hasegawa-Johnson, M., Gunderson, J., Perlman, A.: Universal access: speech recognition for talkers with spastic dysarthria. In: Proceedings of INTERSPEECH 2009, pp. 1451–1454 (2009)

5. Kent, R.D., Weismer, G., Kent, J.F., Rosenbek, J.C.: Toward phonetic intelligibility testing in dysarthria. Journal of Speech and Hearing Disorders 54, 482–499 (1989)

6. Yorkston, K.M., Beukelman, D.R.: A comparison of techniques for measuring intelligibility of dysarthric speech. Journal of Communication Disorders 11, 499–512 (1978)

7. Goulden, R., Nation, P., Read, J.: How large can a receptive vocabulary be? Applied Linguistics 11, 341–363 (1990)

8. Rudzicz, F., Namasivayam, A., Wolff, T.: The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation (2011) (in press)

9. Enderby, P.: Frenchay Dysarthria Assessment. International Journal of Language & Communication Disorders 15(3), 165–173 (1980)

10. Yorkston, K.M., Beukelman, D.R.: Assessment of Intelligibility of Dysarthric Speech. C.C. Publications Inc., Tigard (1981)

11. Zue, V., Seneff, S., Glass, J.R.: Speech database development at MIT: TIMIT and beyond. Speech Communication 9(4), 351–356 (1990)

12. Alsteris, L.D., Paliwal, K.K.: Evaluation of the Modified Group Delay Feature for Isolated Word Recognition. In: Proceedings of the International Symposium on Signal Processing and Applications, pp. 715–718 (2005)

13. Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America 87(4), 1738–1752 (1990)

14. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book (Revised for HTK Version 3.4). Cambridge University Engineering Department (2006)

15. Mengistu, K.T., Rudzicz, F.: Adapting Acoustic and Lexical Models to Dysarthric Speech. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2011) (in press)

16. Lippmann, R.: Speech recognition by machines and humans. Speech Communication 22(1), 1–15 (1997)

17. Bregman, A.S.: Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, Cambridge (1990)


A Context-Aware Reputation-Based Model of Trust for Open Multi-agent Environments

Ehsan Mokhtari1, Zeinab Noorian1, Behrouz Tork Ladani2, and Mohammad Ali Nematbakhsh2

1 University of New Brunswick, Canada
2 University of Isfahan, Iran

{ehsan.mokhtari,z.noorian}@unb.ca
{ladani,nematbakhsh}@eng.ui.ac.ir

Abstract. In this paper we propose a context-aware reputation-based trust model for multi-agent environments. Due to the lack of a general method for the recognition and representation of the context notion, we propose a functional ontology of context for evaluating trust (FOCET) as the building block of our model. In addition, a computational reputation-based trust model based on this ontology is developed. Our model benefits from powerful reasoning facilities and the capability of adjusting the effect of context on trust assessment. Simulation results show that an appropriate context weight results in the enhancement of the total profit in open systems.

1 Introduction

A wide range of open distributed systems, including e-business, peer-to-peer systems, web services, pervasive computing environments and the semantic web, are built in open, uncertain environments. The building blocks for constructing these systems are autonomous agents that act and interact flexibly and intelligently. In the absence of legal enforcement procedures in these environments, trust is of central importance in establishing mutual understanding and confidence among participants. There are different approaches to trust modeling, including socio-cognitive, game-theoretical, security-oriented, modal-logic and other operational approaches [10]. The constituent element for evaluating trust in real and virtual societies is reputation. Reputation refers to a perception that an agent has of others' intentions and norms [9]. In large open multi-agent systems where interactions are infrequent, it is not always possible to evaluate the trustworthiness of peers based only on direct experiences. Thereby, the social dimension of agents is developed to gather reputation information from other members of the society.

Trust and reputation are context-dependent notions [9],[16]. That is, satisfactory interaction outcomes in a particular context do not necessarily assure high-quality interaction results with the same transaction partner in different contexts [10]. Nevertheless, most existing trust models have neglected this issue and evaluate trust regardless of the negotiated contexts.




Many efforts have been made to model the notion of context with different approaches. The main approaches to context modeling can be categorized as Key Value, Markup, Graphical, Object Oriented, Logic Based and Ontology Based modeling [15]. Strang et al. [15] have shown that the most promising assets for context modeling can be found in the ontology category. Several models which claim to use context as one of their elements in trust evaluation have been developed [13],[2],[9],[8],[7]. However, the existing context-aware trust models suffer from the lack of a functional and applicable context recognition and representation method. Moreover, existing trust models do not consider reputation values together with their associated contexts, which is another shortcoming.

In this paper we propose a context-aware reputation-based trust model for multi-agent environments to address such deficiencies. We propose a functional ontology of context for evaluating trust named FOCET. It provides agents with an extensible ontology to adopt different context elements with different importance levels pertaining to their subjective requirements and environmental conditions. Based on these principles, a computational model of trust is developed which aggregates several parameters to derive the trustworthiness of participants.

We begin with a description of some of the related work in the context modeling and context-aware trust evaluation areas. Subsequently, we provide a detailed presentation of the proposed ontology of context and discuss the relevant context reasoning issues. In the next sections, we describe the different constituent elements of our trust and reputation management mechanisms and then present a computational trust model. The evaluation framework and experimental results are discussed in the subsequent sections. Finally, we conclude by explaining some of the open problems and possible future work in this field.

2 Related Works

There is a wide variety of reputation-based trust models in the literature [1],[9],[13],[7]. Reputation mechanisms have been widely used in online electronic commerce systems; online reputation mechanisms (e.g. those on eBay and Amazon Auctions [6],[14]) are probably the most widely used ones. Reputation in these models is a global single value representing a user's overall trustworthiness. These trust models are not well-suited for dynamic environments where providers offer different types of services with different satisfaction degrees. Reputation is clearly a context-dependent quantity [9]. However, there are a few trust models which consider context as a determinant factor in their trust evaluation. Liu et al. [8] introduced a reputation model that incorporates time and the context as services presented by each entity. Zhu et al. [7] considered environmental influences on agents' behaviors as context; they presented a solution to cope with the fair reputation evaluation problem for agents who are in bad areas. Context is applied by Essin [16] as a sub-factor in determining the action valuation and the subject's stake in the action. Blaze et al. [2] proposed a system named PolicyMaker in which the set of local policies is assumed as the context under which trust is evaluated. Rey et al. [13] present a formalized graph-based context model for evaluating trust; they simply describe the context as a set of keywords and propose a new data structure named the context graph for context representation. Wang et al. [18] proposed a context-aware computational trust model for multi-agent systems; they consider m types of context information and exploit a Bayesian network for trust estimation. However, features like the influence of reputation from one context on similar ones are missing. Since the most promising assets for context modeling are found in ontological methods [15], the approaches of [13] and [18] suffer from the lack of an efficient presentation of context and of reasoning capabilities.

Our work differs from others in a number of ways. We present an ontological context model which is built on an extensible core ontology description called FOCET. Using an ontology as the cornerstone of the context model, reasoning capabilities can be comprehensively applied to context data. Also, the proposed model provides means for agents to subjectively use different context features. Further, in this model agents are able to assign different weights to their selected context features in order to adjust the effect of context on trust evaluation pertaining to different environmental circumstances.

3 Ontology of Context

The notion of context has been defined by different authors [15]. Dey et al. [4] have defined context within a comprehensive approach: context is any information that is required to characterize the situation of an entity, where an entity could be a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves [4].

In the proposed model, we have derived effective input factors for context recognition and representation based on Dey et al.'s [4] definition of context. Inspired by SOUPA [3], we introduce the main key concepts which influence context in trust evaluation and represent them in a structured form through the ontology. That is, we have developed the Functional Ontology of Context for Evaluating Trust (FOCET), which represents the core ontology for context in trust applications. FOCET contains eight main categories: Environment, Culture, Spatial Factors, Temporal Factors, History, Subject, User Profile and Policy (Figure 1). These main categories can be elaborated by adding sub-ontologies as their extensions [3]. We briefly describe each category of the FOCET core in the following:

Fig. 1. Representation of FOCET dimensions



Environment: An environment provides the conditions under which agents exist [11]. In other words, the environment defines the world states and properties in which the participants operate. For example, in this paper we consider three different types of environments: 1) high-risk, 2) low-risk and 3) mid-risk. A detailed description is given in Section 7.

Culture: Culture includes the natural, essential, inherited and permanent features of an agent. Agents having different cultures may have different priorities and preferences toward the same service. Thus, various aspects of culture such as language, nationality and morality can be integrated into the Culture dimension of FOCET for the trust evaluation process.

Spatial factors: Context depends on the agent's space and location. In order to identify an agent's spatial properties, characteristics such as location and vicinity are considered as spatial factors. This dimension may be beneficial in particular conditions in which agents prefer to communicate mostly with participants in a certain vicinity.

Temporal factors: Time is a vital aspect of the human understanding and classification of context, because most statements are related over the temporal dimension [21]. Features like an agent's age, life cycle and communication time are considered as temporal factors. For example, some participants may consider older agents more trustworthy than younger ones.

History: Each participant maintains its previous interaction records and observations in the History dimension of FOCET. This might include environment dynamics, population tendency and interaction outcomes [10].

Subject: This feature addresses diverse aspects of the transaction criteria. It comprises detailed descriptions of the provider's identity, interaction contents and utilities.

Policy: A policy is a deliberate plan of actions to guide decisions and achieve rational outcomes; it can guide actions toward those that are most likely to achieve a desired outcome. A policy could consist of security, setup, communication, relationship and event-handling policies. As an example, the trust threshold to commit to a transaction could be adaptively adjusted based on observation of the environmental circumstances.

Many concepts of the context ontology are semantically interrelated. We propose a set of public rules for FOCET based on common existing knowledge about the concepts. These rules will be integrated with the ontology in order to be available to all agents in the society. Using public rules, agents will benefit from an automatic reasoning process to refine and complement initial context data. For example, if provider P, which offers a pickup & delivery service, is located in Fredericton, Canada, we could deduce that the time zone for this provider is 4:00 hours behind GMT. Also, P is restricted by federal and governmental business laws; hence, it would not be able to commit to the delivery of certain goods which are prohibited by the government of New Brunswick or the federal government of Canada. Figure 2 exhibits a sample public rule written in CLIPS. Aside from the public rules, each agent can use its private knowledge and reasoning capability to improve its decision-making process. This private knowledge can be presented by a set of private inference rules which the agent applies on FOCET. Each agent will define its own private rules based on its own knowledge, using a predefined framework. For instance, one can infer about the same provider P that cross-continent delivery services would take longer using this provider than a provider located in Toronto, due to the fact that there is no direct flight to most of the main cities in the world from the Fredericton airport.

Fig. 2. Sample public inference rule in FOCET

4 Reputation Component

In this model, reputation can be categorized according to information sources [10],[6]. Direct reputation refers to previous direct experiences with particular agents, and indirect reputation refers to information about the target agent which is received from other sources of information, such as advisors or trusted third parties.

Reputation is clearly a context-dependent concept [9],[10]. That is, the high reputation of a provider for a particular service does not necessarily cascade to the other services it offers. For example, provider P's high reputation for an inventory service should not affect its reputation for a delivery service. However, in most models [5],[20],[17],[19] reputation information is communicated regardless of the negotiated contexts. In these models, a consumer agent evaluates the influence degree of recommending agents simply based on their deviations from its own opinions. The proposed context-aware trust model takes a different approach and provides a consumer agent with a mechanism to examine different aspects of the recommending agents' negotiated contexts and to evaluate their influence degree based upon their degree of similarity and relevancy to the prospected context of the future transaction. In this model we assume the recommending agents are honest, but they might have different influence degrees in different transaction contexts.

The reputation component employs FOCET as the constitutional element for the representation, transmission, storage and retrieval of context information. It incorporates every reputation value within a set of context data presented by FOCET features.



5 Trust Component

As aforementioned, each provider may offer different kinds of services in different contexts. These contexts might be totally different or have some features in common. To measure the effect of a particular context on another one, we define two individual metrics: 1) the Weight Matrix (WM), which includes the importance level of each dimension of context, and 2) the Relevancy Matrix (RM), which indicates the similarity degree of each feature in the first context with the corresponding one in the second context.

5.1 Weight Matrix

The context dimensions in FOCET might have different importance levels depending on the application's requirements. For example, the user profile and policy concepts are of central importance in an e-auction system, while cultural characteristics may not be applicable at all. Therefore, we define the WM to handle this issue. WM is a 1×n matrix, where n is the number of FOCET concepts and each β in [0, 1] refers to the importance degree of the corresponding concept.

WM = [β1, β2, ..., βn] (1)

5.2 Relevancy Matrix

Aside from the heterogeneity problem of participants, homogeneous participants who contextually model the reputation information might have different terminologies and conventions to represent context concepts. This issue may result in diverse perceptions of a unique concept. For example, one may use delivery to describe a particular service while another uses shipping instead. Although both services express the same concept, the terms standing for them are different. To rectify this, we introduce a Relevancy Matrix (RM) to measure the similarity degree of a particular context's features with the corresponding ones in another context. To calculate RM, we exploit the WordNet [12] API to measure a conceptual and semantic relation between different elements. RM is an n×1 matrix, where n is the number of FOCET features and each υ in the range [0, 1] signifies the similarity level (Equation 2).

RM = [υ1, υ2, ..., υn] (2)
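The paper obtains the relevancy values υ from the WordNet::Similarity API [12]; purely as an illustration, the sketch below approximates such a score with NLTK's WordNet path similarity, which is an assumption and not the authors' exact measure.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def relevancy(term_a: str, term_b: str) -> float:
    """Approximate similarity in [0, 1] between two context feature terms."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(term_a)
              for s2 in wn.synsets(term_b)]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

# Compare a feature term with a near-synonym vs. an unrelated term
print(relevancy("delivery", "shipping"), relevancy("delivery", "policy"))
```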

5.3 Context Effect Factor

Given the WM and RM matrices, we can measure the influence of participants' experiences in different contexts on the prospected transaction context Ecxt with provider P. We call this measure the Context Effect Factor CEF(Par,Hcxt,Ecxt,P), where Par ∈ C ∪ R denotes a particular participant in the community and Hcxt refers to a typical previous context in which the participant has negotiated with P. It can be computed as follows:

CEF(Par,Hcxt,Ecxt,P) = WM × RM = ( ∑_{i=1}^{n} [ (1 − βi) + βi × υi ] ) / n    (3)


Here, n is the number of realized features in FOCET; therefore, CEF(Par,Hcxt,Ecxt,P) is a scalar value in the range [0, 1]. To clarify, suppose that consumer C intends to interact with provider P in context T. C would calculate the influence degree of each previous interaction context H experienced with P against the context T as CEF(C,H,T,P).
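A minimal sketch of Equation (3), assuming the weight and relevancy values are already available as Python lists (the numbers below are hypothetical, for illustration only):

```python
def cef(wm: list[float], rm: list[float]) -> float:
    """Context Effect Factor of a past context on the prospected one (Eq. 3)."""
    assert len(wm) == len(rm)
    n = len(wm)
    return sum((1 - b) + b * u for b, u in zip(wm, rm)) / n

# Two FOCET dimensions weighted heavily, the rest ignored (illustrative values).
WM = [0.9, 0.8, 0.0, 0.0]   # importance of each FOCET concept
RM = [1.0, 0.4, 0.2, 0.7]   # similarity of past vs. prospected context
print(cef(WM, RM))          # scalar in [0, 1]; equals 1.0 when every weighted feature matches
```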

5.4 Trust Dynamics

Evidently, trust information might lose its credibility as time progresses. This is due to the fact that transaction peers might change their behavior over time. In such a case, despite the honesty of recommending agents in providing reputation information, their information might not be credible. Thus, in order to capture the risk of dynamicity in agents' behavior, we should consider recent information from participants as more important than old information. In doing so, consumer agents subjectively specify a degree of decay λ, with 0 ≤ λ ≤ 1, based on their policies, in order to adaptively reduce the influence of old reputation information. We formulate the time-dependent influence degree of participants CEF′(Par,Hcxt,Ecxt,P) as follows:

CEF′(Par,Hcxt,Ecxt,P) = e^(−λΔt) × CEF(Par,Hcxt,Ecxt,P)    (4)

where Δt indicates the elapsed time since the previous interaction took place and can be determined by the temporal factor concept in FOCET.
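Equation (4) simply discounts the factor exponentially with the age of the interaction; a small sketch under the same illustrative assumptions as above:

```python
import math

def decayed_cef(cef_value: float, delta_t_days: float, decay: float) -> float:
    """Time-discounted Context Effect Factor (Eq. 4); decay is the agent's λ in [0, 1]."""
    return math.exp(-decay * delta_t_days) * cef_value

print(decayed_cef(0.88, delta_t_days=30, decay=0.01))  # older evidence counts less
```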

6 Computational Model

In the proposed context-aware trust model, we define two individual metrics to evaluate the trustworthiness of potential transaction partners: 1) Direct Trust DT(C,P,Ecxt), which is based solely on consumer C's direct experiences with provider P in context Ecxt, and 2) Indirect Trust IT(R,P,Ecxt), which derives trust from the recommending agents' R reputation reports. We can formalize DT(C,P,Ecxt) as follows:

DT(C,P,Ecxt) = ( ∑_{Hcxti ∈ H(C,P)} CEF′(C,Hcxti,Ecxt,P) × vi ) / ( n × |H(C,P)| )    (5)

where H(C,P) is the collection of previous contexts Hcxt of consumer C with provider P, and vi represents the rating value of Hcxti. Also, n indicates the number of context elements in FOCET. In addition, IT(R,P,Ecxt) can be formulated as:

IT(R,P,Ecxt) = ( ∑_{Rj ∈ R} ∑_{Hcxti ∈ H(Rj,P)} CEF′(Rj,Hcxti,Ecxt,P) × vi ) / ( n × |R| × ∑_{j=1..|R|} |H(Rj,P)| )    (6)

where R is the set of recommender agents which provide indirect reputation data.


Consumer agents may assign different significance levels ω to the DT(C,P,Ecxt) and IT(R,P,Ecxt) components based upon their policies. Therefore, a linear weighted combination of these values is used to build the final trust value. Thus, the overall trust τ(C,P,Ecxt) can be calculated as:

τ(C,P,Ecxt) = ω × DT(C,P,Ecxt) + (1 − ω) × IT(R,P,Ecxt)    (7)
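A compact sketch of Equations (5)–(7), reusing the illustrative `cef`/`decayed_cef` helpers above; the record layout (lists of (decayed CEF, rating) pairs) is an assumption made for illustration, not the paper's data structure.

```python
def direct_trust(history: list[tuple[float, float]], n_features: int) -> float:
    """Eq. 5: history holds (decayed CEF, rating) pairs for C's own interactions with P."""
    if not history:
        return 0.0
    return sum(c * v for c, v in history) / (n_features * len(history))

def indirect_trust(reports: dict[str, list[tuple[float, float]]], n_features: int) -> float:
    """Eq. 6: reports maps each recommender to its (decayed CEF, rating) pairs about P."""
    count = sum(len(hist) for hist in reports.values())
    if not reports or count == 0:
        return 0.0
    total = sum(c * v for hist in reports.values() for c, v in hist)
    return total / (n_features * len(reports) * count)

def overall_trust(dt: float, it: float, omega: float) -> float:
    """Eq. 7: linear combination controlled by the consumer's policy weight ω."""
    return omega * dt + (1 - omega) * it

tau = overall_trust(direct_trust([(0.65, 0.9)], 4),
                    indirect_trust({"R1": [(0.5, 0.7)], "R2": [(0.4, 0.8)]}, 4),
                    omega=0.6)
print(tau)
```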

7 Simulation Setting

We have implemented a multi-agent environment consisting of 50 service providers, each of which provides 50 services randomly selected from a service pool containing 100 services. The providers deliver a high quality of service for some of their services, depending on the environmental circumstances. They might also change their behavior, presenting varying quality-of-service rates for the same service during the simulation. There are 100 agents who are able to act both as consumers and as advisors. The simulation is run for 500 days and the system is monitored during this period. Each consumer C may initiate one business transaction with a provider P in context H per day. The trustworthiness of P is calculated by C using its past direct reputation records of P regarding H along with the indirect reputation data gathered from the advisors R about P regarding context H. C will commit a transaction if the calculated trust level for P exceeds the expected trust threshold of C. This trust threshold is subjectively determined based on the policy dimension of FOCET exploited by each consumer. C will achieve a gain proportionate to the value of the committed transaction if the provider satisfies the expected quality of service; otherwise, the consumer will suffer the same amount of loss. In order to examine the efficiency of our approach, we consider three different types of environments: 1) low-risk, 2) mid-risk and 3) high-risk environments. In the low-risk environment, the majority of providers offer a satisfactory quality of service and the values of the transactions are low. Therefore, failure in delivering the expected quality of service would not result in a significant loss for consumers. In such an environment, consumers have a low trust threshold for committing transactions with providers. That is, they will initiate business interactions with providers with a minimum level of confidence about their quality of service. On the other hand, since the high-risk environment is mostly populated by low-quality service providers, the chance of consumers dealing with unqualified providers increases substantially. In this environment, the values of services offered by providers are mostly high; thus, a failure in delivering the expected quality of service will result in a significant loss for consumers. This inherent characteristic of the high-risk environment requires consumers to have a high trust threshold for initiating transactions with providers. In the mid-risk environment, high-quality and low-quality service providers are almost uniformly distributed. Business transactions in this environment usually have average values as well.
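The per-day decision each consumer makes can be summarized by a small rule; the sketch below illustrates only this commit/gain logic (the threshold, value and outcome variables are hypothetical), not the authors' simulator.

```python
def daily_step(trust: float, threshold: float, value: float, provider_delivers: bool) -> float:
    """Return the consumer's payoff for one candidate transaction."""
    if trust < threshold:        # not confident enough: no transaction, no payoff
        return 0.0
    return value if provider_delivers else -value   # gain, or an equal loss
```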


8 Experimental Results and Analysis

In this section we analyze the effect of context in different types of environments. We examine the functionality of this approach when consumers have different preferences in using context in their evaluations.

8.1 Low-Risk Environment

As can be observed in Figure 3, increasing the context weight in low-risk environments results in a decrease in the consumers' gain. Since most of the providers offer high-quality services in these environments, most of the transactions lead to a gain. To put this in perspective, suppose that provider P, which usually offers high-quality services, has not presented satisfactory service in just a few transactions of its delivery service to Europe during a certain period, due to particular reasons such as inclement weather. However, in a low-risk environment P would probably provide a high-quality delivery service in other contexts, such as delivery to North America. When the context has less influence in trust evaluation, consumer C propagates P's good reputation in other contexts to this one. Therefore, having just a few bad reputation records of P would not prevent C from carrying out further delivery transactions with P. However, as the context weight increases in C's trust evaluation process, C would abort future delivery transactions, which would probably be associated with a considerable gain, just because of P's few bad reputation records in this context. As a result, the total number of committed transactions and their associated gain would be reduced.

Fig. 3. The influence of different context weights in a low-risk environment

8.2 High-Risk Environment

In a high-risk environment, as the context weight increases, the propagation of providers' reputation from one context to another decreases proportionately. That is, consumers with high context weights rely solely on the reputation information obtained from similar contexts in their trustworthiness evaluation. As such, future transactions in a specific context H with a particular provider who has a bad reputation in other contexts fairly similar to H would be avoided. This feature shows its usefulness specifically in a high-risk environment, where the transaction value is high and failure in delivering a high quality of service is


associated with a huge loss. Through the analytical approach of the context-aware model, consumers with reasonable context weights can cautiously avoid interacting with unqualified providers and increase their total profit substantially (Figure 4).

Fig. 4. The influence of different context weights in a high-risk environment

8.3 Mid-Risk Environment

Figure 5 demonstrates the effect of the context weight in a mid-risk environment. In this environment some providers present high-quality services while the others provide low-quality ones. The effect of context in mid-risk environments is highly dependent on the ratio of high- to low-quality service providers. That is, if high-quality providers are dominant in the environment, a consumer with a low context weight may benefit from more successful interaction results compared with a consumer whose context influence is high. On the contrary, the basis of the context-aware trust evaluation enables consumers with a high context weight to achieve a larger gain in this environment when low-quality providers take over the community.

Fig. 5. The influence of different context weights in a mid-risk environment

9 Conclusion and Future Work

In this paper we presented a context-aware reputation-based trust model for multi-agent systems. This model benefits from a functional ontology of context for evaluating trust (FOCET) along with a computational model of trust based on it. FOCET is an extensible core ontology for context recognition and representation which contains eight main categories: Environment, Culture, Spatial


Factors, Temporal Factors, History, Subject, User profile and Policy. Exploiting proper public and private inference rules in FOCET, we are able to complement raw context data by adding facts deduced from existing known ones. Using the weight and relevancy matrices, we are able to scale (up or down) the effect of context in evaluating trust.

There are a number of directions in which we could extend our model. Uncertainty is an important factor which is usually discussed in context and trust modeling. Uncertainty can emanate from information acquired from agents and their environments as well as from the ontology inference rules. We intend to extend our model to support uncertainty in context reasoning and trust evaluation. Furthermore, since there is no authority in open environments to force agents to perform honestly, they should be provided with a means to form a trust network of trusted peers. Using such a trust network, the indirect reputation data would be more reliable. Experimental results demonstrate that various context weights have different effects on the total profit in different types of environments. Building an adaptive context-aware trust model to adjust this optimum context weight would be another milestone. Moreover, incorporating other sources of trust, namely credentials, competence and capability, with reputation and context would result in a comprehensive trust model for a wide range of real-world applications.

References

1. Abdul-Rahman, A., Hailes, S.: A distributed trust model. In: Proceedings of the 1997 Workshop on New Security Paradigms, pp. 48–60. ACM, New York (1997)

2. Blaze, M., Feigenbaum, J., Lacy, J.: Decentralized trust management. In: Proceedings of the 1996 IEEE Symposium on Security and Privacy, SP 1996. IEEE Computer Society, Los Alamitos (1996)

3. Chen, H., Perich, F., Finin, T., Joshi, A.: SOUPA: Standard ontology for ubiquitous and pervasive applications. In: International Conference on Mobile and Ubiquitous Systems: Networking and Services, pp. 258–267 (2004)

4. Dey, A.K.: Understanding and using context. Personal Ubiquitous Comput. 5, 4–7 (2001)

5. Shadbolt, N.R., Huynh, T.D., Jennings, N.R.: An integrated trust and reputation model for open multi-agent systems. In: AAMAS, pp. 119–154 (2006)

6. Jøsang, A., Ismail, R., Boyd, C.: A survey of trust and reputation systems for online service provision. Decision Support Systems 43(2), 618–644 (2007); Emerging Issues in Collaborative Commerce

7. Lei, Z., Nyang, D., Lee, K., Lim, H.: Computational intelligence and security

8. Liu, J., Issarny, V.: Enhanced reputation mechanism for mobile ad hoc networks, pp. 48–62 (2004)

9. Mui, L.: Computational Models of Trust and Reputation: Agents, Evolutionary Games, and Social Networks. PhD thesis, Massachusetts Institute of Technology (2003)

10. Noorian, Z., Ulieru, M.: The state of the art in trust and reputation systems: a framework for comparison. J. Theor. Appl. Electron. Commer. Res. 5, 97–117 (2010)


11. Odell, J.J., Van Dyke Parunak, H., Fleischer, M., Brueckner, S.A.: Modeling Agents and Their Environment. In: Giunchiglia, F., Odell, J.J., Weiss, G. (eds.) AOSE 2002. LNCS, vol. 2585, pp. 16–31. Springer, Heidelberg (2003)

12. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity: measuring the relatedness of concepts. In: Demonstration Papers at HLT-NAACL 2004, pp. 38–41. Association for Computational Linguistics (2004)

13. Ray, I., Ray, I., Chakraborty, S.: An interoperable context sensitive model of trust. Journal of Intelligent Information Systems

14. Resnick, P., Zeckhauser, R., Swanson, J., Lockwood, K.: The value of reputation on eBay: A controlled experiment. Experimental Economics, 79–101 (2006)

15. Strang, T., Linnhoff-Popien, C.: A context modeling survey. In: Workshop on Advanced Context Modelling, Reasoning and Management, UbiComp 2004 – The Sixth International Conference on Ubiquitous Computing, Nottingham, England (2004)

16. Viljanen, L.: Towards an Ontology of Trust. In: Katsikas, S.K., Lopez, J., Pernul, G. (eds.) TrustBus 2005. LNCS, vol. 3592, pp. 175–184. Springer, Heidelberg (2005)

17. Jennings, N.R., Luck, M., Teacy, W.T.L., Patel, J.: TRAVOS: Trust and reputation in the context of inaccurate information sources. Journal of Autonomous Agents and Multi-Agent Systems (2006)

18. Wang, Y., Li, M., Dillon, E., Cui, L.g., Hu, J.j., Liao, L.j.: A context-aware computational trust model for multi-agent systems. In: IEEE International Conference on Networking, Sensing and Control, ICNSC 2008, pp. 1119–1124 (2008)

19. Whitby, A., Jøsang, A., Indulska, J.: Filtering out unfair ratings in Bayesian reputation systems. In: Proceedings of the 7th International Workshop on Trust in Agent Societies (2004)

20. Zhang, J., Cohen, R.: Evaluating the trustworthiness of advice about seller agents in e-marketplaces: A personalized approach. Electronic Commerce Research and Applications (2008)

21. Zimmermann, A., Lorenz, A., Oppermann, R.: An Operational Definition of Context. In: Kokinov, B., Richardson, D.C., Roth-Berghofer, T.R., Vieu, L. (eds.) CONTEXT 2007. LNCS (LNAI), vol. 4635, pp. 558–571. Springer, Heidelberg (2007)


Pazesh: A Graph-Based Approach

to Increase Readability of Automatic Text Summaries

Nasrin Mostafazadeh1, Seyed Abolghassem Mirroshandel1, Gholamreza Ghassem-Sani1, and Omid Bakhshandeh Babarsad2

1 Computer Engineering Department, Sharif University of Technology, Tehran, Iran

2 Mechatronics Research Laboratory, Computer Engineering Department, Qazvin Azad University, Qazvin, Iran

{mostafazadeh,mirroshandel}@ce.sharif.edu

[email protected], [email protected]

Abstract. Today, research on automatic text summarization treats readability as one of the most important aspects of a summarizer's performance. In this paper, we present Pazesh: a language-independent graph-based approach for increasing the readability of summaries while preserving the most important content. Pazesh accomplishes this task by constructing a special path of salient sentences which passes through topic centroid sentences. The results show that Pazesh compares favorably with previously published results on benchmark datasets.

1 Introduction

Research in the automatic text summarization (ATS) area dates back to the late 1950s [1], though solving the problem in a substantial manner still seems to require a long trail of work. Among a variety of different approaches to this problem, graph-based methods have noticeably attracted attention. For the first time in ATS history, Salton [2] proposed a graph as a model for the input text. Recent graph-based approaches compute sentence importance based on the eigenvector centrality concept, applying ranking algorithms (e.g., TextRank [3], LexRank [4]). Today, most summarization research is focused on the extractive genre, which tends to select a number of sentences out of the initial text. Normally there are many topic shifts in a text, and highly scored sentences can come from diverse important topics, which requires careful selection of output sentences. Some methods have already been devised to optimize the search problem of finding the best scoring summary [5,6] and to order text entities based on the chronological order of events [7]. Such methods might construct a sentence-to-sentence coherent body of information, but they neglect preserving the most important content.

In this paper, we introduce Pazesh: a new extractive, graph-based, and language-independent approach to address both the readability and informativeness criteria of single-document summaries. At first, Pazesh segments the text


in order to find topic centroid sentences. Then it ranks text entities using its specially constructed graphs, and at last it finds the most precious path passing through the centroid sentences. The obtained evaluation results show that our algorithm performs well on both the readability and informativeness aspects. The rest of this paper is organized as follows: Section 2 introduces the centroid finding and graph construction phases. In Section 3, the main phase of the algorithm, addressing readability, is revealed. Section 4 evaluates Pazesh, and finally Section 5 concludes the paper's overall idea and focus.

2 Finding Centroid Sentences

Pazesh follows three steps for finding topic centroid sentences: segmenting the text into coherent partitions, scoring the sentences of each segment, and scoring the segments individually. A topic is what a discourse segment or a sentence is about. A text can be segmented by its different topics, denoted by sentences. In Pazesh, we utilize a segmentation algorithm to find the topic sentences to be used as landmarks of the final coherent path. Here we have used a simple partitioning approach, the TextTiling [8] algorithm, which is a method for partitioning a text into a number of coherent multi-paragraph partitions representing the text's subtopics. This algorithm does not rely on semantic relationships between concepts for discovering subtopic structures, so it is language-independent [8] and is highly suitable for Pazesh.

After the segmentation phase, we construct a weighted undirected graph for each segment. The sentences within a segment are the graph's nodes and the similarities between sentences are the graph's weighted edges. We weight word w of segment S, denoted W(w,S), by freq(w,S) × IDF(w), where freq(w,S) is the number of sentences in segment S containing w, and IDF(w), the inverse document frequency of word w, equals log(#all sentences / #sentences containing w). Then the similarity between the sentences of each segment can be measured using the cosine similarity formula as follows:

edge(S1, S2) = ( ∑_{w ∈ S1 ∩ S2} W(w,S)² ) / ( √(∑_{w ∈ S1} W(w,S)²) × √(∑_{w ∈ S2} W(w,S)²) )    (1)

where S is the segment containing S1 and S2. Now that we have the segment graphs, we use the PageRank [9] scoring function to score the nodes of these graphs. The use of ranking algorithms in ATS was first introduced in TextRank. PageRank does not require deep linguistic knowledge and is highly suitable for Pazesh. The suggested PR(V) score of a vertex V is computed as follows:

PR(Vi) = 0.25 + 0.25 × ∑_{Vj ∈ Ln(Vi)} ( wij / ∑_{Vk ∈ Ln(Vj)} wjk ) × PR(Vj)    (2)

where Ln(V) is the set of nodes linked to V. We call the above score the segment-global score of each sentence. We define the centroid sentence of each segment as its representative and most salient entity, i.e. the one having the highest segment-global score.
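A compact sketch of this scoring step (Equations 1–2), assuming each sentence is already represented by its W(w,S) weights; the damping constants are copied from Equation (2) as printed, and the data structures are illustrative only.

```python
import math
from collections import Counter

def edge_weight(s1: Counter, s2: Counter) -> float:
    """Cosine similarity between two sentences' W(w,S) weight vectors (Eq. 1)."""
    shared = set(s1) & set(s2)
    num = sum(s1[w] * s2[w] for w in shared)
    den = math.sqrt(sum(v * v for v in s1.values())) * math.sqrt(sum(v * v for v in s2.values()))
    return num / den if den else 0.0

def pagerank(weights: list[list[float]], iters: int = 30) -> list[float]:
    """Weighted PageRank over a segment graph (Eq. 2, with the paper's 0.25 constants)."""
    n = len(weights)
    pr = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            s = 0.0
            for j in range(n):
                out_j = sum(weights[j])
                if weights[j][i] > 0 and out_j > 0:
                    s += weights[j][i] / out_j * pr[j]
            new.append(0.25 + 0.25 * s)
        pr = new
    return pr

# Toy 3-sentence segment: edge weights as an adjacency matrix
W = [[0, 0.5, 0.2], [0.5, 0, 0.4], [0.2, 0.4, 0]]
print(pagerank(W))
```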


In Pazesh we achieve both non-redundancy and noise removal by constructing and ranking another graph, the segment graph, which has the segments as its nodes and segment similarities as its edges.

The final graph of the whole text is a directed weighted graph with each sentence as a node and sentence similarities as edges. This graph is constructed to compute the document-global importance of each sentence. The same cosine similarity function is calculated here for all sentences of the text. We apply the PageRank algorithm to this final graph and generate each sentence's document-global score. To tune this final graph for the path-finding phase, we make it a directed acyclic graph (DAG). The direction of each edge is from a predecessor sentence in the initial text to a successor sentence. Also, we tag all edges weighted below a certain threshold γ as shallow edges, and the rest of the edges are tagged as deep edges. The final graph looks like Fig. 1.

Fig. 1. A sample final graph. Nodes in gray are centroids of the corresponding segment.

3 Addressing Readability

Text readability is a measure of how well and easily a text conveys its intended meaning. In the Document Understanding Conference (DUC), the linguistic quality markers defined to evaluate the readability aspects of summaries are: Grammaticality, Non-Redundancy, Referential Clarity, Focus, and Structure and Coherence. In Pazesh, we address these readability criteria as follows. By being committed to the chronological order of the initial document and its structure (as a grammatically accepted text), we can avoid fundamental grammatical errors, though being extractive undesirably leads the output to still have some errors. We address non-redundancy and focus by filtering centroid sentences and by retaining the informativeness of the output summary. However, in Pazesh no alterations are made to the input text's sentences, so repeated use of nouns or noun phrases is probable. Also, implementing Pazesh's idea for addressing referential clarity has been left as future work. Finally, cohesion can be defined as the "links" that hold the sentences of a text together and give the whole text a clear meaning. Here, we use this term to denote the lexical relationship of sentences.


In the case of our final graph, in order to have a cohesive output, each sentence should be followed by another sentence already linked to it. Therefore, the final cohesive text can be viewed as a path. Pazesh's path is built using deep edges (introduced earlier) to guarantee the readability of the path. Since topic shifts from one centroid to another are sensible, we accept shallow edges only as connectors between different segments. The path obtained from the final graph should pass all centroid sentences (the landmarks) and have the highest accumulative sentence score. Passing through centroids has two outcomes: firstly, the centroids, as the most salient sentences, are guaranteed to be included in the summary; secondly, the remaining sentences of the summary connect the centroids, thus they come from prominent sub-topic segments and are important themselves. In Fig. 2 the path-finding method is applied to the graph in Fig. 1 and the output paths are depicted in gray.
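A minimal sketch of such a path search over the DAG, assuming sentences are numbered in document order and that the node scores, a boolean deep-edge matrix and the centroid indices are given; this illustrates the idea (Paz1-style accumulated sentence score) and is not the authors' exact algorithm.

```python
def best_segment_path(scores, deep, start, end):
    """Highest accumulated-sentence-score path from start to end using deep edges only
    (simple DP over the document-ordered DAG)."""
    best = {start: (scores[start], [start])}   # node -> (score of best path ending here, path)
    for v in range(start + 1, end + 1):
        for u in range(start, v):
            if u in best and deep[u][v]:
                cand = (best[u][0] + scores[v], best[u][1] + [v])
                if v not in best or cand[0] > best[v][0]:
                    best[v] = cand
    # fall back to a direct (possibly shallow) connector if no deep path exists
    return best.get(end, (scores[start] + scores[end], [start, end]))[1]

def pazesh_path(scores, deep, centroids):
    """Concatenate the best deep-edge paths between consecutive centroids."""
    path = [centroids[0]]
    for a, b in zip(centroids, centroids[1:]):
        path += best_segment_path(scores, deep, a, b)[1:]
    return path
```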

Fig. 2. Path-finding phase of Pazesh: compression ratio = 0.4. Connecting two centroids is possible through 3 different paths. Each path has a different accumulative sentence score / edge weight. Note that the path can go beyond the last centroid in order to meet the summary length constraint. In scenario Paz1, Path1 would be the output. In scenario Paz2, Path1 and Path3 have identical scores and one of them would be selected.

4 Evaluation

To evaluate the system, we used two distinct evaluation methods: an automatic evaluation with the ROUGE toolkit1 and a manual readability evaluation based on the DUC readability assessment guidelines. The informativeness of Pazesh was evaluated in the context of an extractive single-document summarization task, using 567 news articles of DUC 2002. For each article, the evaluation task is to generate a 100-word summary. For evaluation, we used the recall score of three different

1 ROUGE is available at http://berouge.com/


metrics of ROUGE: ROUGE-1 (unigram-based), ROUGE-2 (bigram-based), and ROUGE-W. Two of Pazesh's evaluation scenarios are: constructing the final path based on Paz1 (accumulative sentence scores) vs. Paz2 (edge weights). These settings of Pazesh are compared with baselines in two categories: 1) outperforming informative systems (not being readable), including SentenceRank and another single-document extractive summarizer called Method1 [10]; 2) a readable system, A* search [5]. Due to the lack of a universal assessment methodology and the small number of systems using identical measurements, a high-quality, large-scale comparison of our system's readability with previous work is not feasible. Table 1 reports the ROUGE scores.

Table 1. ROUGE scores of different systems. ‘-’: not reported.

System         ROUGE-1  ROUGE-2  ROUGE-W
Paz1           0.38     0.24     0.08
Paz2           0.23     0.19     0.05
SentenceRank   0.45     0.19     0.15
Method1        0.47     0.2      0.16
A*             0.37     0.08     -

Comparing Paz1 with Paz2 reveals that taking the sentence scores into account results in undisputedly better informativeness than considering edge weights. The ROUGE-2 score is competitive with the informative systems, since n-gram lengths greater than 1 in ROUGE estimate the fluency of summaries. However, the overall results do not outperform systems focusing only on the informativeness factor, though they are competitive with them to some degree. This is due to the fact that preserving both informativeness and readability, especially in cases with a summary length constraint, is a trade-off.

Since 2006, a separate set of quality questions has been introduced by DUC to evaluate the readability aspects of summaries. However, there are still no automatic evaluations to assess such aspects. For the manual evaluation, we used 20 documents of the same dataset and the same scenarios used in the automatic evaluation. The evaluation was accomplished by ten human judges who had read the DUC guidelines for readability assessment beforehand. For each scenario, three different variations of the summary length constraint were applied. Table 2 shows the results of our manual evaluation on a five-point scale. The results show that Pazesh performs very well on the criteria it intended to address: coherence and focus. The overall results at ratio 0.6 outperform ratio 0.3 and the no-ratio (minimum possible length) setting. This was expected, since the readability aspect of a text is inherently more meaningful in a long text than in a short one. Comparing the results reveals that Paz2 outperforms Paz1 on the coherence aspect, which could be expected, though Paz1 performs stronger than Paz2 on the focus aspect. Putting it all together, the obtained results show that Pazesh has accomplished its intended mission: meeting both readability and informativeness.


Table 2. Readability Assessment. Left to right: ratio=0.6, ratio=0.3, ratio=not given.

Criterion             Paz1  Paz2    Paz1  Paz2    Paz1  Paz2
Grammaticality         5     5       5     5       5     5
Non-redundancy         4     5       5     5       5     5
Referential clarity    2     3       2     2       1     2
Focus                  5     4       3     3       4     3
Struct. & Coherence    5     5       3     4       3     4

5 Conclusion

In this paper, we introduced Pazesh: a graph-based approach to address both the readability and informativeness of automatic text summaries. This is accomplished by constructing a path of highly ranked sentences, as a readable sequence, passing through centroid sentences. As shown in the experimental results, Pazesh is a powerful and also simple summarizer system.

Acknowledgments. Special thanks go to Dr. Yahya Tabesh and Ali Moini for igniting the initial motivation for this research. This research is honoured to be supported by a grant from Iran's National Foundation of Elites.

References

1. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Research and Development 2(2), 159–165 (1958)

2. Salton, G., Singhal, A., Mitra, M., Buckley, C.: Automatic text structuring and summarization. Information Processing and Management 33(2), 193–207 (1997)

3. Mihalcea, R.: Graph-based ranking algorithms for sentence extraction, applied to text summarization. In: Proceedings of ACL, Spain (2004)

4. Erkan, G., Radev, D.R.: LexRank: Graph-based lexical centrality as salience in text summarization. JAIR 22, 457–479 (2004)

5. Aker, A., Cohn, T., Gaizauskas, R.: Multi-document summarization using A* search and discriminative training. In: Proceedings of EMNLP, USA, pp. 482–491 (2010)

6. Riedhammer, K., Gillick, D., Favre, B., Hakkani-Tür, D.: Packing the meeting summarization knapsack. In: Proceedings of the Interspeech Conference, Australia, pp. 2434–2437 (2008)

7. Barzilay, R., Elhadad, N., McKeown, K.R.: Inferring strategies for sentence ordering in multidocument news summarization. JAIR 17, 35–55 (2002)

8. Hearst, M.A.: TextTiling: segmenting text into multi-paragraph subtopic passages. CL 23(1), 33–64 (1997)

9. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, USA (1998)

10. Xiaojun, W., Jianwu, Y., Jianguo, X.: Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In: Proceedings of the 45th ACL, Czech Republic, pp. 552–559 (2007)


Textual and Graphical Presentation of

Environmental Information

Mohamed Mouine

RALI-DIRO, Universite de Montreal
CP 6128, Succ. Centre-ville
Montreal (Quebec) H3C 3J7
[email protected]

http://www-etud.iro.umontreal.ca/~mouinemo/

Abstract. The evolution of artificial intelligence has followed our needs. At first, there was a need for the production of information, followed by the need to store digital data. Following the explosion in the amount of generated and stored data, we needed to find the information we require. The problem is now how to present this information to the user. In this paper we present ideas and research directions that we want to explore in order to develop new approaches and methods for the synthetic presentation of objective information.

Keywords: information visualization, artificial intelligence, text, graph.

1 Introduction

In recent years there has been an explosion in the volume of generated data in all fields of knowledge. The most difficult task has become the analysis and exploration of this data. Data mining allows us to locate the information needed. Information visualization and visual data mining can help to cope with the information flow when combined with some textual description. In this thesis we develop methods to automate the task of exploring the information and presenting it to the user in the easiest way. We will build a climate bulletin generator. The content of the resulting bulletins will be a combination of text and graphics. This generation of bulletins must take into account the type of output device.

2 State of the Art

My thesis is in the line of the work of [5], who developed a model to solve the problem of generating integrated text and graphics in statistical reports. It considered several criteria such as the intention of the writer, the types of variables, the relations between these variables and the data values. Since then, many researchers have worked on the same subject. [6] presents some techniques to visualize and interact with various data sets. The process of visualization


depends on the input type and the choices of output (user profiles, types of output device, ...). The input of a visualization process can be data, information and/or knowledge [3]. The reader may also refer to [2], where the author gives an overview of the field of information visualization. There are even those who have studied every last little detail (the choice of colors, shapes, location, ...) to produce a presentation that meets the user's expectations [8]. Good visualization is a concept that can be judged on several criteria related to the user himself, the system, the types of output device, etc. [1].

3 Problem Statement

In order to summarize and analyze large amounts of information, we expect to develop a method that automatically generates a visual report (graph, image, text, ...). Through this approach, we want to allow the user to easily retrieve all the information used in generating this report without having to visit the whole mass of information.

3.1 MeteoCode

RALI1, to which I belong, has started a project2 in collaboration with Environment Canada (EC), which already publishes a large amount of meteorological information in XML form; this type of information is called MeteoCode. The selective display of personalized information would allow EC to provide the public with forecasts better targeted in time and space than those currently produced. The latter also exhibit gaps (breakdowns, problems, ...) that we will try to improve on. This is reflected by the fact that these forecasts are limited to a few tens of words found in regional weather forecasts. Already, more than 1000 weather reports are issued twice a day. Given the size of Canada, they have to stay general and cannot show all the details that are available in the MeteoCode.

Based on the information in the MeteoCode, we want to develop a climate bulletin generator for an address or postal code given by the user, most often his or her own. In addition, regional weather information must also be made available in different modes: graphics, web, weather radio and autoresponders. An important goal of our project is to study the development of innovative approaches for communicating user-relevant meteorological information while taking into account some temporal and geographic aggregation.

Since the MeteoCode is already in XML and validated by an XML schema, we are convinced that the input is easily analyzed. Thus, we will focus on determining the most appropriate way to present data meaningfully for the type of output device. Given the size of the data, we will develop special-purpose techniques for aggregating data in space and time.

1 RALI comprises computer scientists and linguists with experience in automatic language processing. It is the largest NLP-oriented university laboratory in Canada.

2 http://rali.iro.umontreal.ca/EnvironmentalInfo/index.en.html


Experiments were conducted over the last two years by members of the RALI to illustrate the type of information available from EC. Web prototypes have been developed to display weather information graphically using Protovis, which is based on Scalable Vector Graphics (SVG), and another using alphanumeric information placed geographically using Google Maps. A third experiment was performed using jqPlot, a jQuery plugin for creating graphics. These experiments improve interactivity with the user.

This allowed us to experiment with different ways of combining information published daily by Environment Canada with other Web-based approaches. Although these prototypes were not put into production, they showed the potential of integrating environmental information with Web applications so that it becomes more accessible and useful.

3.2 Graphic

To use graphics effectively in the automatic generation of reports, I will first draw some ideas from PostGraphe [4], which generated statistical reports containing text and graphics using an annotated description of the data. The user specifies his intentions to the system, the types of data to be presented and the relationships between data.

Given the variety of available devices (Web, text, TV, PDA, etc.), it is not possible to adapt the MeteoCode for each output device. On the other hand, the same information should not be presented in exactly the same way on all devices. Each type of device imposes its own constraints and offers new opportunities. In doing so, information must be accessible for all types of devices while ensuring that the meaning of the information remains intact.

We will also need to develop good techniques for producing natural language summaries, and for this we will build on the results of the SumTime project [9]. This project developed an architecture for generating short summaries of large time-series data. We want to build on the model used to choose the right words to be used in the summary.

3.3 Summarize: Good Forecast and Location Precision

In our project two types of data3 can be used. The first type is SCRIBE4, which contains raw predictions in the form of matrices. This information is generated automatically by a numerical weather prediction model. This output is then fed to another system that allows meteorologists to comment on and change the predictions somewhat. The result of this change is the file called MeteoCode, in XML format. The second type of data is the GRIB file (a flat file): raw model forecasts at very high resolution.

3 The difference between the two types of data is the location accuracy and the quality of the forecast.
4 Good short-term forecasts and medium resolution (weather stations).

To further summarize the data, we will need to perform a spatio-temporal clustering of these data based on the similarity of meteorological conditions, the relationships between clusters and the conditions found (spectral clustering algorithms [7] are known to aggregate data matrices according to their similarities). To meet this need we plan to use spectral clustering algorithms. These algorithms will be applied to the GRIB file and to the MeteoCode file after transforming its contents into matrix form. In a second step, we will try to do the same work but using the SCRIBE file directly instead of the MeteoCode file, which is the result of a manual change by a meteorologist. The purpose of this clustering is to reduce the number of possible descriptions of weather conditions. Each condition could be described by the closest local kernels. Finally, a good report should draw the attention of the user to unusual phenomena and conditions, the detection of which could be based on a simple technique for estimating the density of the local kernels.
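As an illustration of the planned spectral clustering step only (the real inputs would come from the GRIB/MeteoCode matrices; the array shape, feature choice and parameters here are hypothetical), one could proceed roughly as follows with scikit-learn:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical matrix: one row per (location, time) cell, columns are weather variables
# such as temperature, precipitation probability and wind speed.
conditions = np.random.rand(500, 3)

# Group similar weather conditions to reduce the number of distinct descriptions.
labels = SpectralClustering(n_clusters=8, affinity="nearest_neighbors",
                            random_state=0).fit_predict(conditions)

# Each cluster can then be verbalized once and reused for all the cells it covers.
for k in range(8):
    print(k, conditions[labels == k].mean(axis=0))
```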

4 Conclusion

I hope that by the end of this project we will have advanced the science of the synthetic presentation of objective information. The new approaches and methods used in this work could also find applications in other fields in which information changes over time in large quantities. The visual presentation and automatic report generation aspects can also be applied in finance, education, medicine, etc.

References

1. Bonnel, N., Chevalier, M.: Criteres d'evaluation pour les interfaces des systemes de recherche d'information. In: CORIA, pp. 109–115 (2006)

2. Chen, C.: Information visualization. Wiley Interdisciplinary Reviews: Computational Statistics 2(4), 387–403 (2010)

3. Chen, M., Ebert, D., Hagen, H., Laramee, R.S., Van Liere, R., Ma, K.L., Ribarsky, W., Scheuermann, G., Silver, D.: Data, information, and knowledge in visualization. Computer Graphics and Applications 29(1), 12–19 (2008)

4. Fasciano, M., Lapalme, G.: Intentions in the coordinated generation of graphics and text from tabular data. Knowledge and Information Systems 2(3), 310–339 (2000)

5. Fasciano, M.: Generation integre de textes et des graphiques statistiques (1996)

6. Heer, J., Bostock, M., Ogievetsky, V.: A tour through the visualization zoo. Communications of the ACM 53(6), 59–67 (2010)

7. Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)

8. Ware, C.: Information visualization: perception for design. Morgan Kaufmann, San Francisco (2004)

9. Yu, J., Reiter, E., Hunter, J., Mellish, C.: Choosing the content of textual summaries of large time-series data sets. Natural Language Engineering 13(01), 25–49 (2007)


Comparing Distributional and Mirror

Translation Similarities for Extracting Synonyms

Philippe Muller1 and Philippe Langlais2

1 IRIT, Univ. Toulouse & Alpage, INRIA
2 DIRO, Univ. Montreal

Abstract. Automated thesaurus construction by collecting relations between lexical items (synonyms, antonyms, etc.) has a long tradition in natural language processing. This has been done by exploiting dictionary structures or distributional context regularities (cooccurrence, syntactic associations, or translation equivalents), in order to define measures of lexical similarity or relatedness. Dyvik proposed to use aligned multilingual corpora and to define similar terms as terms that often share their translations. We evaluate the usefulness of this similarity for the extraction of synonyms, compared to the more widespread distributional approach.

1 Introduction

Automated thesaurus construction by collecting relations between lexical items has a long tradition in natural language processing. Most effort has been directed at finding synonyms, or rather "quasi-synonyms" [1], lexical items that have similar meanings in some contexts. Other lexical relations such as antonymy, hypernymy, hyponymy, meronymy and holonymy are also considered, and some thesauri also include semantically associated items with less easily definable properties (e.g. the Moby thesaurus).

From the beginning, a lot of different resources have been used towards that goal. Machine readable dictionaries appeared first and generated a lot of effort aimed at the extraction of semantic information, including lexical relations [2], or were used to define a semantic similarity between lexical items [3]. Also popular was distributional analysis, comparing words via their common contexts of use, or syntactic dependencies [4,5], in order to define another kind of semantic similarity. These approaches went on using more readily available resources in more languages [6]. More recently, a similar approach has gained popularity using bitexts in parallel corpora. Lexical items are considered similar when they are often aligned with the same translations in another language, instead of being associated with the same context words in one language [7,8]. A variation on this principle, proposed by [9], is to consider translation "mirrors": words that are translations of the same words in a parallel corpus, as they are supposed to be semantically related. Although this idea has not been evaluated for synonym extraction, it is the basis of some paraphrase extraction work, i.e. finding equivalent phrases of varying lengths in one language; see for instance [10].


Evaluations of this line of work vary but are often disappointing. Lexical similarities usually bring together heterogeneous lexical associations and semantically related terms that are not easy to sort out. Synonymy is probably the easiest relation to check, as references are available in many languages, even though they may be incomplete (e.g. WordNet for English) and synonym extraction is supposed to complement the existing thesauri.

If these approaches have the semantic potential most authors assume, there is still a lot to be done to harness that potential. One path is to select the most relevant associations output by the aforementioned approaches (dictionary-based, distribution-based, or translation-based), as in the work of [11], hopefully making possible a classification of lexical pairs into the various targeted lexical relations. Another is to combine these resources and possibly other sources of information; see for instance [8].

We take a step in this latter direction here, by testing Dyvik's idea on lexical relation extraction. Translation mirrors have not been precisely evaluated in such a framework, and the way they can be combined with distributional information has not been investigated yet. We also pay particular attention to the frequency of the words under consideration, as polysemy and frequency variations of semantic variants seem to play an important role in some existing evaluations. Indeed, we show that mirror translations fare better overall than a reference distributional approach in the preselection of synonym candidate pairs, both on nouns and verbs, according to the different evaluations we performed.

The remainder of this paper is organised as follows: we present the resources we considered in Section 2 and our experimental protocol in Section 3. We analyze our results in Section 4. We relate our results to comparable approaches addressing the same issue in Section 5 and finally conclude our work in Section 6.

2 Resources and Input

We considered two reference databases in this work:

– the WordNet lexical database,1 provided through the NLTK package API.2 WordNet provides a reference for the following lexical relations: synonyms, antonyms, hypernyms, hyponyms, holonyms, meronyms. Each lemma present in WordNet has on average 5–6 synonyms, or 8–10 related terms if all lexical relations are taken together.

– the Moby thesaurus,3 which provides not only synonyms but more loosely related terms. This resource is much richer and less strict than WordNet, as each target has an average of about 80 related terms.

To estimate the frequencies of the words considered, we used data provided by the freely available Wacky corpus.4

1 http://wordnet.princeton.edu
2 http://www.nltk.org
3 http://www.gutenberg.org/dirs/etext02/mthes10.zip
4 http://wacky.sslmit.unibo.it


In order to compare the similarities induced by the distributional and mirror approaches, we have selected at random two sets of 1000 lexical items, a set of nouns and a set of verbs, that we will call "targets". We imposed an arbitrary minimal frequency threshold on the targets (> 1000). The statistics of the two references, with respect to the test sets of targets considered, are shown in Table 1.

Table 1. Reference characteristics for the two target sets considered: median frequency in the Wacky corpus, mean number of associated terms, median, minimum and maximum number. (NB: Moby mixes verbs and nouns, so we considered terms having a noun form or a verb form in each case.)

                                            number of associations
Pos    Median frequency   Reference        mean    med    min   max
Nouns  3,538              WordNet syns     3.6     2.0    1     36
Nouns  3,538              Moby             73.8    57.0   3     509
Verbs  11,136             WordNet syns     5.6     4.0    1     47
Verbs  11,136             Moby             113.2   90.0   6     499

3 Protocol

We consider similar terms derived either by a translation mirror approach (Section 3.1) or a syntactic distributional approach (Section 3.2). Each approach provides a set of associated terms, or "candidates", ranked according to the similarity considered. These ranked candidates are then evaluated with respect to a reference for different lexical relations, either keeping the n-best candidates or the candidates above a given threshold. Details of the evaluation are presented below. As an example, Table 2 shows the candidates proposed by the translation mirrors for the randomly chosen target term groundwork. Note the huge difference in coverage between WordNet and Moby.

3.1 Translation Mirrors

The translation mirror approach is based on the assumption that words in a language E that are often aligned in a parallel corpus with the same word in another language F are semantically related. For instance, the French words manger and consommer are often both aligned with, and probable translations of, the English word eat.

For the translation mirror based approach, we used a French–English bitext of 8.3 million pairs of phrases in translation relation coming from the Canadian Hansards (transcripts of parliamentary debates). This bitext is used by the bilingual concordancer TSRali5 and was kindly made available to us by the maintainers of the application. We lemmatized both French and English sentences

5 http://www.tsrali.com/


Table 2. First ten candidate associations proposed by our translation mirror approach for the target term groundwork, and synonyms according to WordNet as well as a sample of related terms according to Moby. Underlined candidates belong to the WordNet reference, while those in bold are present in Moby; both are also reported in the reference they belong to. Words marked with ∗ are absent from the Hansards.

Candidates    WordNet           Moby
base          base              arrangement
basis         basis             base
foundation    cornerstone       basement
land          foot              basis
ground        fundament∗        bed
job           foundation        bedding
field         substructure∗     bedrock
plan          understructure∗   bottom
force                           briefing
development                     cornerstone
                                ... [47 more]

using TreeTagger.6 Then, we trained statistical translation models in both directions7 (English-to-French and French-to-English), running the Giza++ toolkit in its standard setting.8 Our translation mirror approach makes use of the lexical distributions of the two models9 we obtained this way, pe2f and pf2e (see Table 3 for an example). More specifically, we compute the likelihood that the word s is related to the target word w as:

p(s|w) ≈ ∑_{f ∈ τe2f(w)} p^δ1_e2f(f|w) × p^δ2_f2e(s|f),    where τe2f(w) = {f : pe2f(f|w) > 0}

where τe2f(w) stands for the set of French words associated with w by the model pe2f. In practice, two thresholds, δ1 and δ2, control the noise of the lexical distributions:

p^δ_•(t|s) = p_•(t|s) if p_•(t|s) ≥ δ, and 0 otherwise

In the evaluations below we considered only the first 200 lemmas for each target, in order to compare it with the available distributional candidates presented in the following section.
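A small sketch of this mirror computation, assuming the two lexical distributions have already been extracted from the IBM models into nested dictionaries; the table contents below are made-up toy values, not the paper's Hansard estimates.

```python
def mirror_scores(w: str, p_e2f: dict, p_f2e: dict,
                  delta1: float = 0.01, delta2: float = 0.01) -> dict:
    """p(s|w): sum over pivot translations f of thresholded p_e2f(f|w) * p_f2e(s|f)."""
    scores = {}
    for f, p_fw in p_e2f.get(w, {}).items():
        if p_fw < delta1:
            continue
        for s, p_sf in p_f2e.get(f, {}).items():
            if p_sf < delta2 or s == w:   # skipping the target itself is an illustrative choice
                continue
            scores[s] = scores.get(s, 0.0) + p_fw * p_sf
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

# Toy distributions in the spirit of Table 3
p_e2f = {"eat": {"manger": 0.39, "consommer": 0.08}}
p_f2e = {"manger": {"eat": 0.6, "feed": 0.1},
         "consommer": {"consume": 0.22, "eat": 0.09, "use": 0.18}}
print(mirror_scores("eat", p_e2f, p_f2e))   # ranked mirror candidates for 'eat'
```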

3.2 Distributional Similarity

The distributional similarity we used is taken straight from the work of [5], as we believe it represents this kind of approach well.

6 www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
7 IBM models are not symmetrical.
8 http://code.google.com/p/giza-pp/
9 We used IBM model 4.


Table 3. The 10 most likely associations to the words consommer and eat according to the lexical distributions pf2e(•|consommer) and pe2f(•|eat) respectively

pf2e(•|consommer): consume (0.22), use (0.18), be (0.1), eat (0.092), consumption (0.048), consuming (0.037), take (0.023), drink (0.019), burn (0.012), consumer (0.011), ...

pe2f(•|eat): manger (0.39), consommer (0.08), se (0.036), de (0.031), nourrir (0.028), avoir (0.027), du (0.023), alimentation (0.017), gruger (0.016), qui (0.014), ...

Also, a thesaurus computed by Lin is freely available,10 which eases reproducibility. Lin used a dependency-based syntactic parser to count occurrences of (head lemma, relation, dep. lemma), where relation is a syntactic dependency relation. Each lemma is thus associated with counts for a set F of features (rel, other lemma), either as the head of a relation with another lemma or as a dependent. For instance, the verb eat has the features (has-subj, man), (has-obj, fries), (has-obj, pie), etc.

Let c be the function giving the number of occurrences of a triplet (w, rel, w′) and let V be the vocabulary:

c(∗, rel, w) = ∑_{w′ ∈ V} c(w′, rel, w)        c(w, rel, ∗) = ∑_{w′ ∈ V} c(w, rel, w′)        c(∗, rel, ∗) = ∑_{w′ ∈ V} c(∗, rel, w′)

I(w, rel, w′) = log [ ( c(w, rel, w′) × c(∗, rel, ∗) ) / ( c(w, rel, ∗) × c(∗, rel, w′) ) ]

||w|| = ∑_{(r,w′) ∈ F(w)} I(w, r, w′)

I is the specificity of a relation (w, rel, w′), defined as the mutual information between the triplet elements [5]. We denote by ||w|| the total information quantity associated with w.

Finally, the similarity between two lemmas w1 and w2 measures the extent to which they share specific syntactic contexts, using the information quantity of their shared contexts, normalised by the sum of their total information quantities.

sim(w1, w2) = ( ∑_{(r,w) ∈ F(w1) ∩ F(w2)} [ I(w1, r, w) + I(w2, r, w) ] ) / ( ||w1|| + ||w2|| )

The available thesaurus lists the closest 200 lemmas for each word in a given vocabulary.
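For illustration only, a minimal sketch of Lin's measure given triplet counts; the toy in-memory count table below stands in for counts extracted from a parsed corpus.

```python
import math
from collections import defaultdict

counts = defaultdict(int)            # (head, rel, dep) -> frequency, e.g. from a parsed corpus
counts[("eat", "has-obj", "pie")] = 3
counts[("eat", "has-obj", "fries")] = 2
counts[("devour", "has-obj", "pie")] = 1

def c(w=None, rel=None, wp=None):
    """Marginal count, with None acting as the wildcard slot ∗."""
    return sum(v for (a, r, b), v in counts.items()
               if (w is None or a == w) and (rel is None or r == rel) and (wp is None or b == wp))

def I(w, rel, wp):
    return math.log(c(w, rel, wp) * c(None, rel, None) / (c(w, rel, None) * c(None, rel, wp)))

def features(w):
    return {(r, b) for (a, r, b) in counts if a == w}

def sim(w1, w2):
    shared = features(w1) & features(w2)
    num = sum(I(w1, r, b) + I(w2, r, b) for r, b in shared)
    den = sum(I(w1, r, b) for r, b in features(w1)) + sum(I(w2, r, b) for r, b in features(w2))
    return num / den if den else 0.0

print(sim("eat", "devour"))
```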

4 Experiments and Results

Following the protocol introduced above, we evaluated the outputs of the lexical similarities based on the n-best candidates, varying n, or based on varying similarity

10 http://webdocs.cs.ualberta.ca/~lindek/Downloads/sim.tgz


thresholds, both for the distribution-based approach and the mirror approach. We have two different test sets to evaluate differences between nouns and verbs. As shown in Table 1, the syntactic categories differ in the number of synonyms or other related lexical items they possess, and it is likely that this also impacts the approaches we investigated; see [12] on the role of frequency in that perspective. We considered for evaluation only the items that were common to the reference and the lexicon covered by the resources used. For instance, some synonyms from WordNet have no occurrences in the Hansard or in Lin's database, and this can be seen as a preprocessing filter of rare items.

Moreover, both approaches we compare are sensitive to the typical frequencies of the targets considered. In both cases, all senses of a word are conflated in the computation and it is likely that more frequent usages dominate less frequent ones. We wanted to evaluate the role played by this factor, and we took this into account in our evaluations by adding a varying frequency threshold on the candidates considered. For a set of values ci, we filtered out candidates with frequencies less than ci in a reference corpus (the Wacky corpus, mentioned above).11

Additionally, we took out a list of the most common items in the candidates of the target sets. We arbitrarily removed those terms that appear in more than 25% of the candidate lists (this threshold could be tuned on a development set in further experiments). This includes very common nouns (e.g. thing, way, etc.) and verbs (e.g. have, be, come), as well as terms that are over-represented in the Hansard corpus (e.g. house), since alignment errors induce some noise for very frequent items. Finally, we combined the candidate lists produced by the two approaches by filtering out, for each approach, candidates that are not present in the other's candidate list.
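As an illustration only (our own sketch, with hypothetical function names and data structures; the 25% threshold and the frequency cutoffs ci are the ones described above), these filtering and combination steps could look as follows:

def filter_candidates(cand_lists, corpus_freq, min_freq, common_ratio=0.25):
    # cand_lists: {target: ranked list of candidates}; corpus_freq: {word: frequency in the reference corpus}
    n_targets = len(cand_lists)
    doc_freq = {}
    for cands in cand_lists.values():                      # in how many candidate lists a term appears
        for c in set(cands):
            doc_freq[c] = doc_freq.get(c, 0) + 1
    return {target: [c for c in cands
                     if corpus_freq.get(c, 0) >= min_freq              # frequency threshold ci
                     and doc_freq[c] / n_targets <= common_ratio]      # drop overly common terms
            for target, cands in cand_lists.items()}

def combine(mirror_lists, lin_lists):
    # keep only candidates proposed by both approaches, preserving the mirror ranking
    return {t: [c for c in mirror_lists.get(t, []) if c in set(lin_lists.get(t, []))]
            for t in mirror_lists}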

We are interested in two aspects of the evaluation: how much of the reference is covered by our approaches, and how reliable they are; that is, we want the top of the candidate list to be as precise as possible with respect to an ideal reference. To do so, we evaluate our approaches according to precision and recall12

at different points in the n-best list or at different threshold values. We also compute typical information retrieval measures to estimate the relevance of the ranking: mean average precision (MAP) and mean reciprocal rank (MRR). MAP computes the precision at each point where a relevant term appears in a list of candidates; MRR is the average of the inverses of the ranks of the first relevant term in a list. Last, we looked at the precision of each method assuming an "oracle" gives it the right number of candidates to consider for each target, a measure called R-precision in the information retrieval literature.

So for instance, the 10 candidates of Table 2 evaluated against the WordNet reference would receive a precision of 3/10 and a recall of 3/5 (and not 3/8, because understructure, substructure and fundament are absent from the Hansard). R-precision would also be 3/5, since all correct candidates are found at ranks less than the reference size (5 synonyms). Precision at rank 1 would be 1, while precision at rank 5 would be 3/5. The MAP would be 0.63 = 6.29/10 = (1/1 + 2/2 + 3/3 + 3/4 + . . . + 3/10) / 10, and the MRR would be 1 in this case because the first candidate is correct. It would be 1/2 if only the second candidate were correct, etc.

11 The thresholds were chosen to correspond to different ranges of lexical items.
12 For the sake of readability, we report precision and recall as percentages.
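The following small sketch (ours) reproduces these figures; the relevance pattern, three correct candidates at ranks 1-3 out of 10 with a 5-synonym reference, is the example above, and the MAP variant implemented is the one used in that example (precision averaged over every rank of the list).

def precision_at(rel, k):
    # rel: list of 0/1 relevance flags for a ranked candidate list
    return sum(rel[:k]) / k

def evaluate(rel, ref_size):
    p1, p5 = precision_at(rel, 1), precision_at(rel, 5)
    recall = sum(rel) / ref_size
    r_prec = sum(rel[:ref_size]) / ref_size                                  # "oracle" cut at reference size
    map_ = sum(precision_at(rel, k) for k in range(1, len(rel) + 1)) / len(rel)
    mrr = next((1.0 / (i + 1) for i, r in enumerate(rel) if r), 0.0)
    return p1, p5, recall, r_prec, map_, mrr

print(evaluate([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], ref_size=5))
# (1.0, 0.6, 0.6, 0.6, 0.628..., 1.0): P1 = 1, P5 = 3/5, recall = 3/5, R-precision = 3/5, MAP ~ 0.63, MRR = 1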

Our experiments led to the observation that it is better to cut the n-best list at a given rank than to try to find a good similarity threshold, and we thus only detail results for the first method.

4.1 WordNet

Table 4 shows the results for nouns with respect to synonyms in WordNet. For each approach we report precision at ranks n = 1, ..., 100 in the candidate list, MAP, MRR, the R-precision, and the number of considered synonym pairs from the reference (‖ref‖), with respect to which the overall recall is computed. We also report the influence of different frequency filters. A line with f>5000 means we consider only candidates and reference items with a frequency above 5000 in Wacky.

As the WordNet reference has few synonyms, one should focus on precisions at low ranks (1 and 5) as well as the oracle R-precision: all others are bound to be quite low. The other cutoffs make more sense for the evaluation with respect to Moby, and are here for comparison. This being noted, Table 4 calls for several comments. First of all, we observe that the precision of the mirror approach at rank 1 peaks at 22% while overall recall tops out at 60%, a much better score than the distributional approach we tested. Second, it is noticeable that filtering out less frequent candidates benefits the mirror approach much more than the distributional one. It might be a consequence of having a smaller corpus to start with, in which rarer words have less reliable alignment probabilities.

Third, we observe that combining the candidates of both approaches yields a significant boost in precision at the cost of recall. This is encouraging since we tested very simple combination scenarios: the one reported here consists of intersecting both lists of candidates.

Table 4. Results (percentages) for nouns, micro-averaged, with respect to synonyms in WordNet

        n-best       P1    P5   P10   P20  P100   MAP   MRR  R-prec  ‖ref‖  recall
Mirror  f>1        16.4   5.1   3.8   2.7   1.3  11.9  15.1   16.6   2342    50.0
        f>5000     19.1   5.4   3.8   2.6   1.2  11.3  13.2   17.5   1570    54.8
        f>20000    22.1   5.7   3.9   2.5   1.1   9.8  11.4   22.7   1052    60.6
Lin     f>1        17.4   5.2   3.5   2.5   1.5  11.7  14.3   14.7   2342    35.9
        f>5000     16.5   5.0   3.5   2.5   1.6   9.2  10.8   16.7   1570    36.6
        f>20000    17.5   4.5   3.3   2.5   1.6   7.3   8.4   20.1   1052    36.9
M/L     f>1        25.8   7.5   5.7   4.4   3.8  15.9  17.6   22.0   2342    29.3
        f>5000     27.4   7.4   5.5   4.3   3.8  12.7  13.6   24.6   1570    31.1
        f>20000    26.1   6.4   4.7   3.5   2.6   9.7  10.4   28.9   1052    32.7


Last, the results on verbs are quite similar to those for nouns, with a better precision at low ranks and at higher frequency cutoffs, even though the oracle evaluation is roughly the same for all configurations. Again, filtering one method with the other yields better results, with oracle precision between 20% and 27%, similar to what is observed on nouns.

We observe similar behavior on verbs: a much higher recall for mirrors and a better precision on frequent candidates (best P1 is 25 against 23), although Lin is higher at P1 without a frequency filter (30 against 23); precisions beyond P1 are roughly the same or better for mirrors.

4.2 Moby

Table 5 shows the results for nouns with respect to the related terms in the Moby thesaurus. We expected that this reference would be closer to what is recovered by a distributional similarity, and that is indeed the case for nouns: Lin's precision is superior across the board, by as much as 10 points at rank 1. However, both methods are comparable on verbs. One notable fact is that both similarities capture much more than just synonymy, so the scores are much higher than on WordNet, and this can be considered somewhat of a surprise for the mirror translations, since this method should capitalise on translation relations only.

Also, in almost all cases, the overall recall is higher with the translation mirror method, an observation consistent with our experiments on WordNet. Filtering out low-frequency words has mixed effects: precision is slightly lower for f>20000 than for f>1, but the corresponding recall of high-frequency related terms is higher. The combination of the two methods consistently improves precision (again to the detriment of recall). As a conclusion, related terms do appear in mirror translations, even if they seem to do so with lower similarity scores than synonyms, and we have to investigate more precisely what is happening (translation approximations or errors, or a better coverage of synonymy in the Moby thesaurus than in WordNet).

Table 5. Results (percentages) for nouns, micro-averaged, with respect to related terms in Moby

        n-best       P1    P5   P10   P20  P100   MAP   MRR  R-prec  ‖ref‖  recall
Mirror  f>1        33.7  15.8  13.3  11.0   7.0  18.5  40.1   11.0  60774    18.1
        f>5000     32.7  14.5  12.1   9.8   6.1  18.7  38.1   11.8  43294    21.6
        f>20000    30.3  13.2  10.7   8.6   5.3  18.1  34.9   12.8  28488    26.7
Lin     f>1        44.8  19.9  16.4  13.4   9.5  26.6  46.8   14.7  60774    15.4
        f>5000     40.7  18.5  15.0  12.5   9.3  25.6  41.6   15.0  43294    16.3
        f>20000    39.4  16.1  13.5  11.2   8.4  23.3  35.2   16.8  28488    16.8
M/L     f>1        53.1  25.1  21.4  18.1  15.2  46.6  22.9   25.0  60774     9.4
        f>5000     52.4  23.0  19.3  16.6  13.7  30.7  41.2   23.4  43294    10.9
        f>20000    45.9  19.4  16.5  14.0  11.2  24.6  32.6   21.6  28488    12.5


4.3 Error Analysis

The kind of evaluation we presented above has a few shortcomings. The main reference we used for synonymy does not have a large number of synonyms per entry, and if one of our objectives is to extend existing resources, we cannot estimate the interest of the items we find that are absent from that reference. Using a larger thesaurus such as Moby only partially solves the problem, since there is no distinction between lexical relations, and some related terms do not correspond to any classical lexical function. In order to evaluate our output more precisely, but on a much smaller scale, we have looked at a sample of items that are absent from the reference, to measure the amount of actual errors. To do this, we took a number of terms which are the first candidates proposed by the mirror approach for a target, but are absent from WordNet. We found a number of different phenomena, on a sample of 100 cases:

– 25% of words that are part of a multi-word expression, which were probably aligned to the same target translation, such as sea/urchin;

– 18% of words that are actually synonyms, according to other thesauri we could check manually,13 such as torso/chest;

– 13% hypernyms, listed in WordNet or in www.thesaurus.com, e.g. twitch/movement;

– 6% morphologically related items such as accountant/accounting, probably because of a pos-tag ambiguity in the pivot language, here the French word comptable, which can be a noun or an adjective.

Among the remaining errors, a probably common one is the polysemy of a pivot translation (e.g. the English word aplomb translated into French as assurance, which can also mean insurance in English). This is hard to quantify exactly in the sample without looking in detail at all related aligned word pairs. Of the remaining miscellaneous errors, some bear on rare occurrences in the input corpus that we should have filtered out beforehand. All in all, we can see there is room for easy improvement. Only polysemy is a hard problem to address, and this is so for any kind of approach relying on distributional data.

In addition to that, we are currently looking at items that were not considered in the evaluation because there was no synonym for them in WordNet, but for which there are mirror translations (such as whopper/lie). Although we cannot yet quantify the presence of truly related lexical items, the few examples we looked at seem to reflect the analysis above.

5 Related Work

There are several lines of work that are comparable to what we presented here, with a variety of objectives, evaluation methodologies and input data. Paraphrase extraction shares some of our objectives and some of the resources we considered.

13 Such as http://www.thesaurus.com.


Synonym extraction and thesaurus building also overlap our goals and evaluation methods. Also, work striving to design and compare semantic similarities is the closest in nature, if not in objectives.

Paraphrase acquisition is usually evaluated on the acceptability of substitutions in context, and only small-scale human judgments of the results give an indication of the lexical functions captured: [13] reports that 90% of their pattern-based extracted paraphrases are valid, mixing synonyms, hypernyms and coordinate terms, but with no indication of coverage. Similarly, [14] or [15] precisely evaluate the presence of synonyms on similarity lists on a small subset of synonym/antonym pairs, which makes it hard to extrapolate to the kind of data we used, where we aim at a much larger coverage.

Closer to our methodology, several studies evaluate the classification of a set of word pairs as synonyms or not, either directly on the candidates selected for each target, as we do here, or on resampled word-pair test sets that make the task accessible to common learning techniques. The former method (non-resampled, which is also ours) is more realistic and of course gives rather low scores: [7] use alignment vectors on a set of language pairs, and syntactic argument vectors, and similarity is defined in a comparable way between the vectors; the study in [8] also uses a syntactic distributional similarity and a distance in a dictionary-based lexical network. The first study only looks at the first three candidates in Dutch, with respect to a synonym reference (EuroWordNet), and considers only nouns. P1 scores range from 17.7 to 22.5% on alignment candidates, with distributional similarity at 8%, and the combination at 19.9%. The authors have an updated experiment in [16], still on Dutch nouns, and reach 30% for P1, but do not explain the differences in their setup. The second study applies linear regressions on all similarity scores, with different target frequencies and similarity thresholds, and reaches a maximum f-score of 23% on nouns and 30% on verbs in one of its many settings. The reference here was the union of WordNet and Roget's, which places it somewhere between WordNet and Moby with respect to coverage.

A different setting is resampled evaluation, where a classifier is trained and tested on a set of word pairs with an a priori ratio of synonyms and non-synonyms. It is only relevant if a good preselection method allows one to reach the assumed proportions of synonyms in the training and test sets [17]. Our results could actually be considered as an input to such methods.

Taken alone, the distributional similarities in [18] show results that are comparable to ours or better on Moby, but slightly lower on WordNet. His test set is larger, and split differently with respect to word frequencies. His results are lower than what we obtain here with Lin's own data (as we also noticed for [7]), so we can assume that our comparison is representative with respect to the distributional approach and is a fair one.

Mirror translations thus reach comparable or better results than distributional similarity and alignment similarities for synonyms in English, and we have shown that the different methods can be usefully combined in a simple way.


Besides, mirror translations are simpler to compute than the best similarities between n × n alignment or co-occurrence vectors, where n is the size of the vocabulary.

As a secondary evaluation, authors often use TOEFL synonymy tests [19, 6] where the task is to distinguish the synonym of a given word in a given context among four candidate items. This is a sort of easier word disambiguation test where the task is to separate a word from unrelated distractors, instead of distinguishing between word senses. We are planning to test the mirror translations against such available benchmarks in the near future. Another way to evaluate the relevance of similarity measures between words is derived from the data collected by [20], where humans are asked to judge the similarity or relatedness of items on a scale. This is an interesting way of providing an intrinsic evaluation of these associations, but it covers only a very limited part of the vocabulary (about 300 words, with only a couple of associations for each).

6 Conclusion

Our different experiments confirm the variety of lexical associations one can find for words paired with so-called semantic similarity measures. While the mirror and the distributional approaches we considered in this work both seem correlated to the references considered, our objective is to be able to pinpoint more precise lexical functions, as they are needed for different tasks (paraphrase substitution, translation lexical choice, etc). With respect to synonyms, our experiments indicate that mirror translations provide a better filter than syntactic distribution similarity. While alignment data have been less studied as a source of similarity than syntactic distributions, we hope we succeeded in showing that they are worth investigating. We also note that finding mirrors is computationally simpler than finding the best similarities between alignment or distributional vectors, the latter method being the closest in spirit to our approach.

Our longer-term objective is to reproduce supervised classification of synonym word pairs; any similarity alone scores quite low as a synonymy descriptor, but experiments such as [17] show that it is feasible to reliably label word pairs with lexical functions if the proportion of candidates is more balanced than the very low natural proportion, and this means designing a filter as we do here.

The complementarity of the resources considered here is still an open question, although we show that intersecting similarities as simply as we did here already provides some gain in precision. A more interesting path is probably to combine this with pattern-based approaches, either as another filter or to help select productive patterns to start with. The main problem for word similarity measures based on any kind of distributional regularity remains to deal with polysemy, especially when different senses have very different frequencies of use. Lastly, we plan to investigate the use of multiple language pairs to improve the precision of the predictions of the mirror approach.


References

1. Edmonds, P., Hirst, G.: Near-Synonymy and lexical choice. Computational Linguistics 28(2), 105–144 (2002)
2. Michiels, A., Noel, J.: Approaches to thesaurus production. In: Proceedings of Coling 1982 (1982)
3. Kozima, H., Furugori, T.: Similarity between words computed by spreading activation on an English dictionary. In: Proceedings of the Conference of the European Chapter of the ACL, pp. 232–239 (1993)
4. Niwa, Y., Nitta, Y.: Co-occurrence vectors from corpora vs. distance vectors from dictionaries. In: Proceedings of Coling 1994 (1994)
5. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of Coling 1998, Montreal, vol. 2, pp. 768–774 (1998)
6. Freitag, D., Blume, M., Byrnes, J., Chow, E., Kapadia, S., Rohwer, R., Wang, Z.: New experiments in distributional representations of synonymy. In: Proceedings of CoNLL, pp. 25–32 (2005)
7. van der Plas, L., Tiedemann, J.: Finding synonyms using automatic word alignment and measures of distributional similarity. In: Proceedings of the COLING/ACL Poster Sessions, pp. 866–873 (2006)
8. Wu, H., Zhou, M.: Optimizing synonyms extraction with mono and bilingual resources. In: Proceedings of the Second International Workshop on Paraphrasing. Association for Computational Linguistics, Sapporo (2003)
9. Dyvik, H.: Translations as semantic mirrors: From parallel corpus to wordnet. In: The Theory and Use of English Language Corpora, ICAME 2002 (2002)
10. Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 597–604 (2005)
11. Zhitomirsky-Geffet, M., Dagan, I.: Bootstrapping distributional feature vector quality. Computational Linguistics 35(3), 435–461 (2009)
12. Weeds, J.E.: Measures and Applications of Lexical Distributional Similarity. PhD thesis, University of Sussex (2003)
13. Barzilay, R., McKeown, K.R.: Extracting paraphrases from a parallel corpus. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (2001)
14. Lin, D., Zhao, S., Qin, L., Zhou, M.: Identifying synonyms among distributionally similar words. In: Proceedings of IJCAI 2003, pp. 1492–1493 (2003)
15. Curran, J.R., Moens, M.: Improvements in automatic thesaurus extraction. In: Proceedings of the ACL 2002 Workshop on Unsupervised Lexical Acquisition, pp. 59–66 (2002)
16. Lonneke, P., Tiedemann, J., Manguin, J.: Automatic acquisition of synonyms for French using parallel corpora. In: Proceedings of the 4th International Workshop on Distributed Agent-Based Retrieval Tools (2010)
17. Hagiwara, M., Ogawa, Y., Toyama, K.: Supervised synonym acquisition using distributional features and syntactic patterns. Journal of Natural Language Processing 16(2), 59–83 (2009)
18. Ferret, O.: Testing semantic similarity measures for extracting synonyms from a corpus. In: Proceedings of LREC (2010)
19. Turney, P.: A uniform approach to analogies, synonyms, antonyms, and associations. In: Proceedings of Coling 2008, pp. 905–912 (2008)
20. Miller, G., Charles, W.: Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1), 1–28 (1991)


Generic Solution Construction in Valuation-Based Systems

Marc Pouly

Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg

Abstract. Valuation algebras abstract a large number of formalisms for automated reasoning and enable the definition of generic inference procedures. Many of these formalisms provide some notion of solutions. Typical examples are satisfying assignments in constraint systems, models in logics or solutions to linear equation systems. Contrary to inference, there is no general algorithm to compute solutions in arbitrary valuation algebras. This paper states formal requirements for the presence of solutions and proposes a generic algorithm for solution construction based on the results of a previously executed inference scheme. We study the application of generic solution construction to semiring constraint systems, sparse linear systems and algebraic path problems, and show that the proposed method generalizes various existing approaches for specific formalisms in the literature.

Keywords: solution construction in valuation algebras, local computation, semiring constraint systems, sparse matrix techniques.

1 Introduction

In recent years, various formalisms for automated inference have been proposed. Important examples are probability potentials from Bayesian networks, belief functions from Dempster-Shafer theory, different constraint systems and logics, Gaussian potentials and density functions, relational algebra, possibilistic formalisms, systems of equations and inequalities over fields and semirings, and many more. Inference based on these formalisms is often a computationally hard task which is successfully addressed by methods exploiting tree-decompositions. Moreover, since all the above formalisms satisfy some common algebraic properties pooled in the valuation algebra framework [7, 18], it is possible to provide generic tree-decomposition algorithms for the computation of inference with arbitrary valuation algebras. Thus, instead of re-inventing such methods for each different formalism, it is sufficient to verify a small axiomatic system to gain access to efficient generic procedures and implementations [12]. This is known as the local computation framework. Although the framework is not limited to them, many valuation algebras are defined over variable systems and express in some sense which assignments of values to variables are "valid" or "preferred" over others. Typical examples are models in logics, satisfying assignments in crisp constraint systems,


solutions to linear equations, or assignments that optimize the cost function of a soft constraint system. Subsequently, we refer to such assignments as solutions. In the general case, solutions cannot be obtained by applying generic tree-decomposition procedures. Given a constraint system, for example, these algorithms only tell us whether the system is satisfiable, but they generally do not find solutions. However, a common way to describe local computation procedures is by message-passing on a graphical structure called a join tree. The nodes of a join tree exchange messages and combine incoming messages with their content until the result of the inference problem is found. For specific valuation algebras, it has been shown that solutions can be computed efficiently from the results of the inference process, i.e. from the node contents at the end of the message-passing. This process has been described for crisp constraint systems, linear inequalities and most probable assignments in Bayesian networks in [3], and for specific soft constraints in [17]. A generalization of the latter to a larger class of semiring valuation algebras [8] can be found in [11]. Also, it is known that Gaussian variable elimination in sparse regular systems corresponds to local computation in the valuation algebra of linear systems [7], where solution construction is achieved by the usual back-substitution process. All these approaches construct solutions based on the results of a previous local computation process, but they are always limited to specific valuation algebras. In this paper, we aim for a generic analysis of solution construction. We state sufficient requirements for the existence of solutions in valuation algebras and derive a generic algorithm for the identification of solutions. In the second part, we show how existing approaches are generalized by this framework and also apply solution construction to other valuation algebras which have not yet been considered under this perspective, namely Gaussian potentials and quasi-regular valuations. The family of quasi-regular valuation algebras is used to model path problems with varying semantics. These algebras are shown for the first time to instantiate the valuation algebra framework, which, besides generic solution construction, is the second key contribution of this paper.

The outline of this paper is as follows: we first give a short introduction to valuation algebras, state the inference problem as the main computational task, and present the fusion algorithm for the solution of inference problems with arbitrary valuation algebras. Section 3 gives general requirements for the presence of solutions in a valuation algebra and lists some simple properties which are used in Section 3.1 to define a generic solution construction scheme. Finally, we study in Section 4 several instantiations of this framework and also show how existing approaches in the literature are generalized by this scheme.

2 Valuation Algebras and Local Computation

The basic elements of a valuation algebra are so-called valuations. Intuitively, a valuation can be regarded as a representation of knowledge about the possible values of a set of variables. If r denotes the universe of variables, then each valuation φ refers to a finite set of variables d(φ) ⊆ r called its domain. Let P(r) be the power set of r and Φ a set of valuations with domains in P(r). We assume three operations defined in (Φ, P(r)):


– Labeling: Φ → P(r); φ ↦ d(φ),
– Combination: Φ × Φ → Φ; (φ, ψ) ↦ φ ⊗ ψ,
– Variable Elimination: Φ × r → Φ; (φ, X) ↦ φ−X.

The following axioms are imposed on (Φ,P(r)):

1. Commutative Semigroup: Φ is associative and commutative under ⊗.
2. Labeling: For φ, ψ ∈ Φ, d(φ ⊗ ψ) = d(φ) ∪ d(ψ).
3. Variable Elimination: For φ ∈ Φ and X ∈ d(φ), d(φ−X) = d(φ) − {X}.
4. Commutativity of Elimination: For φ ∈ Φ with d(φ) = s and X, Y ∈ s, (φ−X)−Y = (φ−Y)−X.
5. Combination: For φ, ψ ∈ Φ with X ∉ d(φ) and X ∈ d(ψ), (φ ⊗ ψ)−X = φ ⊗ ψ−X.

These axioms require natural properties regarding knowledge modeling. The first axiom indicates that if knowledge comes in pieces, the sequence does not influence their combination. The labeling axiom tells us that the combination of valuations gives knowledge over the union of the involved variables. The third axiom ensures that eliminated variables disappear from the domain of a valuation. The fourth axiom says that the order of variable elimination does not matter, and the combination axiom states that we may either combine a new piece with the already given knowledge and focus afterwards on the desired domain, or first eliminate the uninteresting parts of the new knowledge and combine it afterwards. A system (Φ, P(r)) satisfying the above axioms is called a valuation algebra. More general definitions of valuation algebras exist to cover formalisms based on general lattices instead of variable systems [7]. But since solutions are variable assignments, the above definition is appropriate. Due to axiom 4 the elimination order of variables is not significant. We may therefore write

φ↓s = φ−{X1,...,Xk}

if a non-empty set of variables {X1, . . . , Xk} = d(φ) − s is eliminated. This is called the projection of φ to s ⊂ d(φ). A listing of formalisms that adopt the structure of a valuation algebra is given in the introduction. We refer to Section 4 and [7, 11, 13] for further examples and next focus on the main computational interest in valuation algebras.

2.1 The Inference Problem

Given a set of valuations {φ1, . . . , φn} ⊆ Φ and a query x ⊂ d(φ1) ∪ . . . ∪ d(φn), the inference problem consists in computing

(φ1 ⊗ · · · ⊗ φn)↓x = (φ1 ⊗ · · · ⊗ φn)−{X1,...,Xk} (1)

for {X1, . . . , Xk} = (d(φ1) ∪ . . . ∪ d(φn)) − x. The complexity of combination and variable elimination generally depends on the size of the factor domains and


often shows an exponential behaviour. According to axioms 2 and 3, the domains of valuations grow under combination and shrink under variable elimination. Efficient inference algorithms therefore confine the size of intermediate results, which can be achieved by alternating the two operations. This strategy is called local computation, and the valuation algebra axioms proved sufficient for the definition of general local computation schemes which solve inference problems independently of the underlying formalism. Local computation algorithms include fusion [16], bucket elimination [3] and collect [18] for single queries, and more specialized architectures for the computation of multiple queries [5–7, 9].

2.2 The Fusion Algorithm

Let us first consider the elimination of a variable Y ∈ d(φ1) ∪ . . . ∪ d(φn) from a set {φ1, . . . , φn} ⊆ Φ of valuations. This operation can be performed as follows:

FusY({φ1, . . . , φn}) = {ψ−Y} ∪ {φi : Y ∉ d(φi)}   where   ψ = ⊗_{i : Y ∈ d(φi)} φi. (2)

The fusion algorithm then follows by a repeated application of this operation:

(φ1 ⊗ · · · ⊗ φn)−{X1,...,Xk} = ⊗ FusXk(· · · (FusX1({φ1, . . . , φn})) · · ·).

Proofs are given in [7]. In every step i = 1, . . . , k of the fusion algorithm, the combination in (2) creates an intermediate factor ψi with domain d(ψi). Then, the variable Xi is eliminated only from ψi in (2). We define the label λ(i) = d(ψi) − {Xi} and observe that λ(k) = x. The domains of all intermediate results of the fusion algorithm are therefore bounded by the largest label plus one. In other words, the smaller the labels are, the more efficient local computation is. We further remark that the labels depend on the chosen elimination sequence for the variables {X1, . . . , Xk}. Regrettably, finding the elimination sequence that leads to the smallest labels is NP-complete [1], but good heuristics exist that achieve reasonable execution time [4]. The fusion algorithm can be represented graphically: we create a node for each step i = 1, . . . , k carrying label λ(i). Then, if ψj with j < i occurs as a factor in the combination (2) of ψi, a directed edge is drawn from node j to node i. Node i is then called the child i = ch(j) of node j. The resulting graph is a tree directed towards the root node k that satisfies the running intersection property [7], i.e. if i, j are two nodes and X ∈ λ(i) ∩ λ(j), then X ∈ λ(m) for all nodes m on the unique path between i and j. Labeled trees satisfying this property are called join trees.
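As a reading aid only, here is a minimal sketch (ours, not the author's) of the fusion scheme for one concrete valuation algebra; the WeightedConstraint class below uses min-sum (weighted constraint) semantics, and the variable frames and example factors are invented.

from itertools import product
from functools import reduce

DOMAINS = {"X": [0, 1], "Y": [0, 1], "Z": [0, 1]}     # toy variable frames (assumption)

class WeightedConstraint:
    """Toy valuation: a cost table over its variables, combined by addition (min-sum)."""
    def __init__(self, variables, table):
        self.vars = tuple(variables)                  # d(phi)
        self.table = dict(table)                      # {tuple of values: cost}

    def combine(self, other):                         # phi (x) psi: costs add on the union domain
        union = tuple(dict.fromkeys(self.vars + other.vars))
        table = {}
        for vals in product(*(DOMAINS[v] for v in union)):
            a = dict(zip(union, vals))
            table[vals] = (self.table[tuple(a[v] for v in self.vars)]
                           + other.table[tuple(a[v] for v in other.vars)])
        return WeightedConstraint(union, table)

    def eliminate(self, var):                         # phi^{-X}: minimise over X
        rest = tuple(v for v in self.vars if v != var)
        table = {}
        for vals, cost in self.table.items():
            key = tuple(val for v, val in zip(self.vars, vals) if v != var)
            table[key] = min(cost, table.get(key, float("inf")))
        return WeightedConstraint(rest, table)

def fusion(factors, elimination_order):
    """Repeated application of Fus_Y (equation (2)), then combine what remains."""
    factors = list(factors)
    for y in elimination_order:
        touching = [f for f in factors if y in f.vars]
        factors = [f for f in factors if y not in f.vars]
        psi = reduce(lambda a, b: a.combine(b), touching)
        factors.append(psi.eliminate(y))
    return reduce(lambda a, b: a.combine(b), factors)

# Invented example: query {Z}, eliminate X and Y
phi1 = WeightedConstraint(("X", "Y"), {(0, 0): 1, (0, 1): 4, (1, 0): 2, (1, 1): 0})
phi2 = WeightedConstraint(("Y", "Z"), {(0, 0): 3, (0, 1): 0, (1, 0): 1, (1, 1): 5})
print(fusion([phi1, phi2], ["X", "Y"]).table)         # minimum cost for each value of Z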

3 Solutions in Valuation Algebras

Let ΩX be the set of possible values of a variable X ∈ r. Then, the set of possible assignments to a non-empty set of variables s ⊆ r is given by the Cartesian product Ωs = ∏_{X∈s} ΩX. We refer to the elements in Ωs as configurations of s and define by convention, for the empty variable set, Ω∅ = {⋄}. Further, we


write x↓t for the projection of x ∈ Ωs to t ⊆ s. The definition of solutions in general valuation algebras must be independent of the actual semantics. Instead, we define solutions in terms of a structural property. Assume a valuation φ ∈ Φ with domain d(φ) = s, t ⊆ s and a configuration x ∈ Ωt. We write W^t_φ(x) for the set of all configurations y ∈ Ωs−t such that (x, y) leads to a "preferred" value of φ among all configurations z ∈ Ωs with z↓t = x. It is required that the extension y can either be computed directly by extending x to the domain s, or step-wise by first extending x to u and then to s, for t ⊆ u ⊆ s.

Definition 1. For φ ∈ Φ with t ⊆ s = d(φ) and x ∈ Ωt, a set W^t_φ(x) is called a configuration extension set of φ from t to s, given x, if for all u with t ⊆ u ⊆ s,

W^t_φ(x) = { z ∈ Ωs−t : z↓u−t ∈ W^t_{φ↓u}(x) and z↓s−u ∈ W^u_φ(x, z↓u−t) }.

Solutions lead to the "preferred" value of φ among all configurations in Ωs. Hence, if such a system of configuration extension sets is present in a valuation algebra, we may characterize a solution to φ ∈ Φ as an extension from the empty configuration to the domain of φ.

Definition 2. The solution set cφ of φ ∈ Φ is defined as cφ = W^∅_φ(⋄).

Examples of such systems of configuration extension sets and their induced solution sets for concrete valuation algebras are given in Section 4. The following lemma is an immediate consequence of the definition of solutions. It says that every solution to a projection of φ is also a projection of some solution to φ.

Lemma 1. For φ ∈ Φ and t ⊆ d(φ) it holds that c_{φ↓t} = (cφ)↓t.

We therefore refer to a projection x↓t of a solution x ∈ cφ as a partial solution to φ with respect to t ⊆ d(φ). If s, t ⊆ d(φ) are two subsets of the domain of φ ∈ Φ, the partial solutions of φ with respect to s ∪ t may be obtained by extending the partial solutions of φ with respect to s from s ∩ t to t. This is the statement of the following theorem that follows from Definition 1 and Lemma 1.

Theorem 1. For s, t ⊆ d(φ) we have

(cφ)↓s∪t = { z ∈ Ωs∪t : z↓s ∈ (cφ)↓s and z↓t−s ∈ W^{s∩t}_{φ↓t}(z↓s∩t) }.

3.1 Generic Solution Construction

We now focus on the efficient computation of solutions for a valuation φ ∈ Φ that is given as a factorization φ = φ1 ⊗ . . . ⊗ φn. Since computing φ is in general intractable, the proposed method assembles solutions to φ using partial solution extension sets obtained from the results of a previous run of the fusion algorithm (or any other local computation scheme). At the end of the fusion algorithm, φ↓λ(k) is the only known projection of φ. But if we alternatively execute a


multi-query local computation architecture, then φ↓λ(i) is obtained for all nodes i = 1, . . . , k. Knowing these projections would allow us to build the complete solution set cφ. Due to Lemma 1 and Definition 2, we have for the root node:

(cφ)↓λ(k) = W^∅_{φ↓λ(k)}(⋄). (3)

It follows from Theorem 1 and the running intersection property that this partial solution set can be extended step-wise to the complete solution set cφ.

Lemma 2. For i = k − 1, . . . , 1 and s = λ(k) ∪ . . . ∪ λ(i + 1) we have

(cφ)↓s∪λ(i) = { z ∈ Ωs∪λ(i) : z↓s ∈ (cφ)↓s and z↓λ(i)−s ∈ W^{λ(i)∩λ(ch(i))}_{φ↓λ(i)}(z↓λ(i)∩λ(ch(i))) }.

Note that the domains of the configuration extension sets computed in Lemma 2 are always bounded by the label λ(i) of the corresponding join tree node. Hence, this algorithm adopts the complexity of a local computation scheme. However, an important disadvantage is that we require a multi-query architecture to obtain the projections of φ to all labels. We therefore aim at a procedure which is based on the results of the fusion algorithm only. Lemma 1 shows how solution sets behave under the operation of projection. The following lemma supposes a similar property for combination and shows that the projections to λ(i) in (3) can then be replaced by the factors ψi obtained from the fusion algorithm.

Lemma 3. If configuration extension sets satisfy the property that for all φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and x ∈ Ωu we have

W^{u∩t}_{φ2}(x↓u∩t) ⊆ W^u_{φ1⊗φ2}(x), (4)

then it holds for i = k − 1, . . . , 1 and s = λ(k) ∪ . . . ∪ λ(i + 1) that

(cφ)↓s∪λ(i) ⊇ { z ∈ Ωs∪λ(i) : z↓s ∈ (cφ)↓s and z↓λ(i)−s ∈ W^{λ(i)∩λ(ch(i))}_{ψi}(z↓λ(i)∩λ(ch(i))) }.

The proof, given in [13], is based on the correctness of the Shenoy-Shafer architecture [7]. Hence, if we assume that configuration extension sets are always non-empty, then at least one solution to φ can be computed as follows: we execute a complete run of the fusion algorithm, build the configuration extension set for the root node using (3), and apply Lemma 3 to step-wise extend (cφ)↓λ(k) to the domain d(φ). The result of this process is a non-empty subset of the solution set cφ. Alternatively, the proof of Lemma 3 also shows that if equality holds in (4), then it also does in the statement below. In this case all solutions can be found based on the results of the fusion algorithm.

Theorem 2. If inclusion (4) holds and configuration extension sets are non-empty, then a solution can be found based on the results of the fusion algorithm. If equality holds in (4), then all solutions can be found with the same procedure.
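To give a feel for this backward pass (again our own sketch, not the paper's algorithm, and restricted to the min-sum case where extension sets are argmin sets), the following code records the intermediate factors ψi during elimination and then extends a best root configuration one eliminated variable at a time; valuations are plain (variables, cost table) pairs and all data are invented.

from itertools import product
from functools import reduce

DOMS = {"X": [0, 1], "Y": [0, 1], "Z": [0, 1]}            # toy frames (assumption)

def combine(phi, psi):                                    # min-sum valuations: (vars, table)
    u = tuple(dict.fromkeys(phi[0] + psi[0]))
    out = {}
    for vals in product(*(DOMS[v] for v in u)):
        a = dict(zip(u, vals))
        out[vals] = (phi[1][tuple(a[v] for v in phi[0])]
                     + psi[1][tuple(a[v] for v in psi[0])])
    return (u, out)

def eliminate(phi, x):
    rest = tuple(v for v in phi[0] if v != x)
    out = {}
    for vals, c in phi[1].items():
        key = tuple(v2 for v, v2 in zip(phi[0], vals) if v != x)
        out[key] = min(c, out.get(key, float("inf")))
    return (rest, out)

def fusion_with_trace(factors, order):
    """Fusion run that keeps each intermediate factor psi_i, as needed by Lemma 3."""
    trace = []
    for x in order:
        touching = [f for f in factors if x in f[0]]
        factors = [f for f in factors if x not in f[0]]
        psi = reduce(combine, touching)
        trace.append((x, psi))                            # psi_i still contains X_i
        factors.append(eliminate(psi, x))
    return reduce(combine, factors), trace

def construct_solution(root, trace):
    """Backward pass: extend a best root configuration by one eliminated variable at a time."""
    sol = dict(zip(root[0], min(root[1], key=root[1].get)))
    for x, psi in reversed(trace):                        # X_k, ..., X_1
        sol[x] = min(DOMS[x],
                     key=lambda v: psi[1][tuple(v if w == x else sol[w] for w in psi[0])])
    return sol

phi1 = (("X", "Y"), {(0, 0): 1, (0, 1): 4, (1, 0): 2, (1, 1): 0})
phi2 = (("Y", "Z"), {(0, 0): 3, (0, 1): 0, (1, 0): 1, (1, 1): 5})
root, trace = fusion_with_trace([phi1, phi2], ["X", "Y"])
print(construct_solution(root, trace))                    # one minimum-cost assignment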


4 Instantiations

We now survey some examples of valuation algebras and show that they provide a suitable notion of solutions to apply generic solution construction.

Semiring Constraint Systems. Semirings are algebraic structures with two binary operations + and × over a set A of values. A tuple 〈A, +, ×, 0, 1〉 is called a commutative semiring if both operations are associative and commutative and if × distributes over +. 0 and 1 are the neutral elements with respect to + and ×. A semiring valuation φ with domain d(φ) = s ⊆ r is a function φ : Ωs → A that associates a value from a commutative semiring with each configuration x ∈ Ωs. Combination and variable elimination are defined as follows:

– Combination: For φ, ψ ∈ Φ with d(φ) = s, d(ψ) = t and x ∈ Ωs∪t

(φ ⊗ ψ)(x) = φ(x↓s) × ψ(x↓t). (5)

– Variable Elimination: For φ ∈ Φ with X ∈ d(φ) = s and x ∈ Ωs−{X}

φ−X(x) = ∑_{z∈ΩX} φ(x, z). (6)

It was shown by [8] that every commutative semiring induces a valuation algebra by the above mapping and operations. Among them, semiring constraint systems are of particular interest. These are the valuation algebras induced by so-called c-semirings [2]. Here, we only require the weaker property of idempotency, i.e. for all a ∈ A we have a + a = a. It can easily be shown that idempotent semirings provide a partial order [11] satisfying a + b = sup{a, b} for all a, b ∈ A. Moreover, if this order is total, we have a + b = max{a, b}. Rewriting the inference problem (1) with empty query for this family of valuation algebras gives

φ↓∅(⋄) = max{φ(x), x ∈ Ωs}. (7)

Hence, the inference problem for valuations induced by totally ordered, idempotent semirings turns into an optimization task. This covers crisp constraints induced by the Boolean semiring 〈{0, 1}, max, min, 0, 1〉, weighted constraints induced by the tropical semiring 〈N ∪ {0, +∞}, min, +, ∞, 0〉, probabilistic constraints induced by the t-norm semiring 〈[0, 1], max, ×, 0, 1〉 or bottleneck constraints induced by the semiring 〈R ∪ {−∞, ∞}, max, min, −∞, ∞〉. Equation (7) motivates the following definition of configuration extension sets for valuation algebras induced by totally ordered, idempotent semirings: for φ ∈ Φ with d(φ) = s, t ⊆ s and x ∈ Ωt we define

W^t_φ(x) = { y ∈ Ωs−t : φ(x, y) = φ↓t(x) }. (8)

Theorem 3. Configuration extension sets in valuation algebras induced by totally ordered, idempotent semirings satisfy the property of Definition 1.


This follows directly from (8) and shows that configuration extension sets in semiring constraint systems instantiate the framework of Section 3. We next specialize the general solution sets in Definition 2:

cφ = W^∅_φ(⋄) = { y ∈ Ωs : φ(y) = φ↓∅(⋄) }. (9)

Note that this indeed corresponds to the notion of solutions in constraint systems derived in equation (8). Furthermore, we also see that several possibilities to define configuration extension sets may exist in a valuation algebra. Instead of the configurations that map to the maximum value, we could also consider all other configurations that do not satisfy this property and modify (8) accordingly, which then amounts to the search for counter-models. This liberty comes from the fact that Definition 1 does not impose semantical restrictions on configuration extension sets, as for example giving a definition of "preferred" values.
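On small examples, (8) and (9) can be evaluated by brute force; the following sketch (ours) does so for a probabilistic constraint over the semiring 〈[0, 1], max, ×, 0, 1〉, with toy frames and a toy table as assumptions.

from itertools import product

OMEGA = {"X": [0, 1], "Y": [0, 1]}                        # toy frames (assumption)

def project(phi_vars, phi, t):
    # phi↓t under the semiring addition max
    out = {}
    for vals, a in phi.items():
        key = tuple(v for var, v in zip(phi_vars, vals) if var in t)
        out[key] = max(a, out.get(key, float("-inf")))
    return out

def extension_set(phi_vars, phi, t, x):
    # W^t_phi(x) of equation (8): all extensions y with phi(x, y) = phi↓t(x)
    target = project(phi_vars, phi, t)[x]
    rest = [v for v in phi_vars if v not in t]
    full = lambda y: tuple({**dict(zip(t, x)), **dict(zip(rest, y))}[v] for v in phi_vars)
    return [y for y in product(*(OMEGA[v] for v in rest)) if phi[full(y)] == target]

# Probabilistic constraint over {X, Y}; its solution set c_phi is extension_set(..., t=(), x=())
phi = {(0, 0): 0.2, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.3}
print(extension_set(("X", "Y"), phi, t=("X",), x=(0,)))   # [(1,)]
print(extension_set(("X", "Y"), phi, t=(), x=()))         # c_phi: [(0, 1), (1, 0)]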

Lemma 4. In a valuation algebra induced by a totally ordered, idempotent semiring we have, for φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and x ∈ Ωu,

W^{u∩t}_{φ2}(x↓u∩t) ⊆ W^u_{φ1⊗φ2}(x).

Proof. Assume x ∈ Ωu and y ∈ W^{u∩t}_{φ2}(x↓u∩t). It follows from equation (8) that

φ1(x↓s) × φ2(x↓u∩t, y) = φ1(x↓s) × φ2↓u∩t(x↓u∩t).

We conclude y ∈ W^u_{φ1⊗φ2}(x) from applying axiom 5 to the above expression, i.e.

(φ1 ⊗ φ2)(x, y) = φ1(x↓s) × φ2(x↓u∩t, y) = φ1(x↓s) × φ2↓u∩t(x↓u∩t) = (φ1 ⊗ φ2↓u∩t)(x) = (φ1 ⊗ φ2)↓u(x).

Finally, we also conclude from (8) that these configuration extension sets are always non-empty. Altogether, this meets the requirements of Theorem 2. Therefore, applying the generic solution construction algorithm to valuation algebras induced by totally ordered, idempotent semirings always identifies at least one solution based on the results of a previously executed run of the fusion algorithm. For crisp constraints and probabilistic constraints this reproduces the approach presented in [3]. In addition, the class of valuation algebras induced by totally ordered, idempotent semirings also generalizes the formalisms studied in [17], and thus the corresponding scheme for computing solutions by local computation. However, it is shown in [11] that generally only inclusion holds in (4). This is for example the case for bottleneck constraints. The following theorem states a sufficient semiring property to guarantee equality, which then means that all solutions are found by the above procedure.

Lemma 5. If a totally ordered, idempotent semiring satisfies that for a, b, c ∈ A and c ≠ 0, a < b implies a × c < b × c, then equality holds in (4).

The proof is similar to Lemma 4 but requires excluding the case φ↓∅(⋄) = 0. For φ ∈ Φ with d(φ) = s, φ↓∅(⋄) = 0 implies that cφ = Ωs. Hence, all configurations are solutions and there is no need to determine cφ algorithmically. Excluding this case is therefore not limiting. The complete proof is given in [11].


Linear Equation Systems. We discuss solution construction in linear systems by focussing on symmetric, positive-definite systems. Some comments on general systems will be given below. In this context, it is more convenient to work with index sets instead of variables directly. Hence, we slightly change our notation and consider variables Xi taking indices from a set r = {1, . . . , m}. A system AX = b is said to be an s-system if X is the variable vector whose components Xi have indices in s ⊆ r, A is a real-valued, symmetric, positive-definite s × s matrix and b is a real s-vector. Such systems are fully determined by the pair (A, b), and we write d(A, b) = s for the domain of this system. By convention we write (⋄, ⋄) for the only possible system with empty domain. Now, suppose that we want to eliminate the variable Xi from the s-system AX = b with i ∈ s. We decompose the system with respect to {i} and s − {i} and obtain

[ A{i},{i}     A{i},s−{i}    ] [ X{i}    ]   [ b{i}    ]
[ As−{i},{i}   As−{i},s−{i}  ] [ Xs−{i}  ] = [ bs−{i}  ].

Then, the operation of variable elimination is defined as

(A, b)−i = ( As−{i},s−{i} − As−{i},{i}(A{i},{i})−1A{i},s−{i},  bs−{i} − As−{i},{i}(A{i},{i})−1b{i} ).

This corresponds to standard Gaussian variable elimination. We remark that the matrix component of the right-hand system is still symmetric, positive-definite. Next, consider an s-system A1X1 = b1 and a t-system A2X2 = b2 with s, t ⊆ r. The combination of the two systems is defined by component-wise addition

(A1, b1) ⊗ (A2, b2) = ( A1↑s∪t + A2↑s∪t,  b1↑s∪t + b2↑s∪t ). (10)

The notation A↑s∪t and b↑s∪t means vacuous extension to the union domain s ∪ t by adding a corresponding number of zeros. If we write Φ for the set of all possible systems with domains in r, then the algebra (Φ, P(r)) is isomorphic to the valuation algebra of Gaussian potentials studied in [7, 13]. In other words, (Φ, P(r)) is itself a valuation algebra. Factorizations (A, b) = (A1, b1) ⊗ . . . ⊗ (An, bn) of symmetric, positive-definite systems reflect the sparsity pattern contained in the matrix A. In contrast to semiring valuation algebras, equation systems are of polynomial size. Factorizations may therefore be produced by decomposing an existing system, but there are also applications where such factorizations occur naturally. An important example is the normal equations in the least squares method [13]. We next define configuration extension sets for symmetric, positive-definite systems: for a real t-vector x, φ = (A, b) and t ⊆ d(φ),

W^t_φ(x) = { (As−t,s−t)−1 bs−t − (As−t,s−t)−1 As−t,t x }. (11)

This satisfies the property of Definition 1. Hence, we obtain from Definition 2

cφ = W^∅_φ(⋄) = { A−1b }.

Note that cφ indeed corresponds to the singleton set containing the unique solution x = A−1b to the symmetric, positive-definite system AX = b.
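A small numerical sketch (ours, using NumPy; the 3 × 3 system is invented) of one such elimination step followed by the back-substitution of (11):

import numpy as np

def eliminate(A, b, i):
    """Gaussian elimination of variable i: returns the reduced system over the remaining indices."""
    keep = [j for j in range(len(b)) if j != i]
    A_red = A[np.ix_(keep, keep)] - np.outer(A[keep, i], A[i, keep]) / A[i, i]
    b_red = b[keep] - A[keep, i] * b[i] / A[i, i]
    return A_red, b_red

def extend(A, b, s, t, x_t):
    """Back-substitution following (11): extend a solution over t to the indices in s - t."""
    rest = [j for j in s if j not in t]
    return np.linalg.solve(A[np.ix_(rest, rest)], b[rest] - A[np.ix_(rest, t)] @ x_t)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

A1, b1 = eliminate(A, b, 0)                               # eliminate X_0
x_rest = np.linalg.solve(A1, b1)                          # solve the reduced system over {1, 2}
x_0 = extend(A, b, s=[0, 1, 2], t=[1, 2], x_t=x_rest)     # back-substitute for X_0
print(np.concatenate([x_0, x_rest]))                      # equals np.linalg.solve(A, b)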


Lemma 6. Symmetric, positive-definite systems satisfy the property that for all φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and every u-vector x we have

W^{u∩t}_{φ2}(x↓u∩t) = W^u_{φ1⊗φ2}(x).

The straightforward proof can be found in [13]. Again, this property allows the application of the generic solution construction algorithm of Section 3.1 to solve factorized or decomposed symmetric, positive-definite systems. In fact, this corresponds to the standard back-substitution process that follows Gaussian variable elimination. In combination with local computation for the variable elimination process, this is generally referred to as a sparse matrix technique [14]. The above theory can also be applied to the valuation algebra of arbitrary linear systems developed in [7]. Since general systems may have no or infinitely many solutions, solution construction identifies the affine solution space rather than a single solution [13]. Finally, it is also shown in [7] that the valuation algebra of linear systems belongs to a larger class of valuation algebras called context valuation algebras that further includes systems of inequalities and various logics. All these formalisms provide a suitable notion of configuration extension sets and qualify for generic solution construction.

Quasi-Regular Valuation Algebras. Many important applications in computer science can be reduced to path problems in labeled graphs with values from a semiring. This is known as the algebraic path problem and typically includes the computation of shortest paths, connectivities, maximum capacities or reliabilities, but also other applications that are not directly related to graphs, such as partial differentiation, matrix multiplication or tasks related to Markov chains. Examples are given in [15]. If r denotes a finite set of indices, the algebraic path problem requires solving a fixpoint equation system X = MX + b, where X is an s-vector of variables with indices from s ⊆ r, M is an s × s matrix with values from a semiring and b is an s-vector of semiring values. Such a system provides a solution M∗b if the underlying semiring 〈A, +, ×, 0, 1〉 is quasi-regular, i.e. if for each element a ∈ A there exists a∗ ∈ A such that aa∗ + 1 = a∗a + 1 = a∗. The solution M∗ to the fixpoint equation system can then be computed by the well-known Floyd-Warshall-Kleene algorithm [10]. For example, the Boolean semiring is quasi-regular with 0∗ = 1∗ = 1. The tropical semiring of non-negative integers is quasi-regular with a∗ = 0 for all a ∈ N ∪ {0, ∞}, the probabilistic semiring is quasi-regular with a∗ = 1 for a ∈ [0, 1], and the bottleneck semiring is quasi-regular with a∗ = ∞ for all a ∈ R ∪ {−∞, ∞}. If M represents the adjacency matrix of a weighted graph with edge weights from the Boolean semiring, then M∗ gives the connectivity matrix. Alternatively, we may choose the tropical semiring to obtain shortest distances, or the probabilistic semiring for maximum reliabilities. Instead of computing M∗ directly, we focus on computing a single row of this matrix. In terms of shortest distances, this corresponds to the single-source shortest distance problem. To determine the i-th row with i ∈ s, we specify an s-vector b with b(i) = 1 and b(j) = 0 for all j ∈ s − {i}. Then, the solution to the system X = MX + b is M∗b, which clearly corresponds to the i-th row of M∗. Similar to the above valuation algebra of symmetric, positive-definite


systems, we represent such systems as pairs (M, b) and consider Φ to be the set of all possible s-systems with s ⊆ r. Then, (Φ, P(r)) again forms a valuation algebra for every quasi-regular semiring [13]. The combination rule is equal to (10) with semiring addition replacing addition of reals. Also, the operations of variable elimination are closely related: for φ = (M, b) with d(φ) = s and i ∈ s,

(M, b)−i = ( Ms−{i},s−{i} + Ms−{i},{i}(M{i},{i})∗M{i},s−{i},  bs−{i} + Ms−{i},{i}(M{i},{i})∗b{i} ).

Configuration extension sets are defined for φ = (M, b) with d(φ) = s, t ⊆ s and an arbitrary t-vector x of values from a quasi-regular semiring as

W^t_φ(x) = { (M↓s−t,s−t)∗ (M↓s−t,t x + b↓s−t) }.

This again fulfills the requirements of Definition 1. Specializing Definition 2 to quasi-regular valuation algebras then gives

cφ = W^∅_φ(⋄) = {M∗b}, (12)

which indeed corresponds to the solution to the fixpoint system X = MX + b.

Lemma 7. Quasi-regular valuation algebras satisfy the property that for all φ1, φ2 ∈ Φ with d(φ1) = s, d(φ2) = t, s ⊆ u ⊆ s ∪ t and every u-vector x we have

W^{u∩t}_{φ2}(x↓u∩t) = W^u_{φ1⊗φ2}(x).

Factorized fixpoint equation systems, which represent the sparsity pattern in the total matrix M, can thus be solved by generic solution construction. In the case of the single-source shortest path problem, the fusion algorithm computes the shortest distances between a selected node and all other nodes, and solution construction then identifies the actual paths for these distances.
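For intuition, here is a small sketch (ours) of the elimination step above in the tropical semiring (min, +), where a∗ = 0 for every non-negative weight; eliminating an index corresponds to by-passing that node, and the b-component of the last remaining index holds the single-source shortest distance. The three-node graph is invented.

INF = float("inf")

def star(a):
    """a* in the tropical semiring (min, +): a* = 0 for every non-negative weight."""
    return 0.0

def eliminate(M, b, i):
    """One elimination step for the tropical semiring, where semiring addition is min
    and semiring multiplication is +:
    M'[j][k] = min(M[j][k], M[j][i] + M[i][i]* + M[i][k]),
    b'[j]    = min(b[j],    M[j][i] + M[i][i]* + b[i])."""
    s = [j for j in M if j != i]
    a = star(M[i][i])
    M2 = {j: {k: min(M[j][k], M[j][i] + a + M[i][k]) for k in s} for j in s}
    b2 = {j: min(b[j], M[j][i] + a + b[i]) for j in s}
    return M2, b2

# Toy graph (invented): M[j][k] is the weight of edge k -> j, INF if absent
M = {0: {0: INF, 1: INF, 2: INF},
     1: {0: 2.0, 1: INF, 2: INF},
     2: {0: 4.0, 1: 1.0, 2: INF}}
b = {0: 0.0, 1: INF, 2: INF}        # unit vector at the source node 0 (one = 0, zero = INF)

for i in (0, 1):                    # eliminate everything except the target node 2
    M, b = eliminate(M, b, i)
print(b)                            # {2: 3.0}: shortest distance 0 -> 2 via node 1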

5 Conclusion

The valuation algebra framework abstracts inference formalisms and enables the definition of generic inference procedures based on tree-decomposition techniques. Many important instances of this framework are defined over variable systems and determine the assignments of values to variables that are "preferred" over others. In contrast to inference, there is no generic procedure to determine such variable assignments, called solutions, although many specialized approaches for particular formalisms exist. This paper states formal requirements for the presence of solutions in valuation algebras and derives a generic algorithm to compute a single solution or all solutions to a factorized valuation. These computations are based on the intermediate results of the previously executed inference algorithm and therefore adopt the same complexity. In the second part of this paper, we instantiated the generic solution construction scheme to semiring constraint systems and linear systems over fields and observed that both instantiations


correspond to the well-known specialized approaches in these application domains. Finally, we presented for the first time a new family of instances called quasi-regular valuation algebras, used to represent and solve sparse path problems in semiring-weighted graphs. Here, the generic inference algorithm for valuation algebras is used to determine the optimum path weight, and solution construction delivers the corresponding sequence of graph nodes that describes the path.

References

1. Arnborg, S., Corneil, D., Proskurowski, A.: Complexity of finding embeddings in a k-tree. SIAM J. of Algebraic and Discrete Methods 8, 277–284 (1987)
2. Bistarelli, S., Montanari, U., Rossi, F., Verfaillie, G., Fargier, H.: Semiring-based CSPs and valued CSPs: Frameworks, properties and comparison. Constraints 4(3) (1999)
3. Dechter, R.: Bucket elimination: a unifying framework for reasoning. Artif. Intell. 113, 41–85 (1999)
4. Dechter, R.: Constraint Processing. Morgan Kaufmann Publishers, San Francisco (2003)
5. Jensen, F., Lauritzen, S., Olesen, K.: Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly 4, 269–282 (1990)
6. Kask, K., Dechter, R., Larrosa, J., Fabio, G.: Bucket-tree elimination for automated reasoning. Artif. Intell. 125, 91–131 (2001)
7. Kohlas, J.: Information Algebras: Generic Structures for Inference. Springer, Heidelberg (2003)
8. Kohlas, J., Wilson, N.: Semiring induced valuation algebras: Exact and approximate local computation algorithms. Artif. Intell. 172(11), 1360–1399 (2008)
9. Lauritzen, S., Spiegelhalter, D.: Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Stat. Soc. B 50, 157–224 (1988)
10. Lehmann, D.: Algebraic structures for transitive closure. Technical report, Department of Computer Science, University of Warwick (1976)
11. Pouly, M.: A Generic Framework for Local Computation. PhD thesis, Department of Informatics, University of Fribourg (2008)
12. Pouly, M.: Nenok - a software architecture for generic inference. Int. J. on Artif. Intel. Tools 19, 65–99 (2010)
13. Pouly, M., Kohlas, J.: Generic Inference - A Unifying Theory for Automated Reasoning. Wiley & Sons, Chichester (2011)
14. Rose, D.: A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations. In: Read, R. (ed.) Graph Theory and Computing. Academic Press, London (1972)
15. Rote, G.: Path problems in graphs. Computing Suppl. 7, 155–198 (1990)
16. Shenoy, P.: Valuation-based systems: A framework for managing uncertainty in expert systems. In: Zadeh, L., Kacprzyk, J. (eds.) Fuzzy Logic for the Management of Uncertainty, pp. 83–104. Wiley & Sons, Chichester (1992)
17. Shenoy, P.: Axioms for dynamic programming. In: Gammerman, A. (ed.) Computational Learning and Probabilistic Reasoning, pp. 259–275. Wiley & Sons, Chichester (1996)
18. Shenoy, P., Shafer, G.: Axioms for probability and belief-function propagation. In: Shafer, G., Pearl, J. (eds.) Readings in Uncertain Reasoning, pp. 575–610. Morgan Kaufmann Publishers, San Francisco (1990)


Cross-Lingual Word Sense Disambiguation for Languages with Scarce Resources

Bahareh Sarrafzadeh, Nikolay Yakovets, Nick Cercone, and Aijun An

Department of Computer Science and Engineering, York University, Canada

{bahar,hush,nick,aan}@cse.yorku.ca

Abstract. Word Sense Disambiguation has long been a central problem in computational linguistics. Word Sense Disambiguation is the ability to identify the meaning of words in context in a computational manner. Statistical and supervised approaches require a large amount of labeled resources as training datasets. In contradistinction to English, the Persian language has neither any semantically tagged corpus to aid machine learning approaches for Persian texts, nor any suitable parallel corpora. Yet due to the ever-increasing development of Persian pages in Wikipedia, this resource can act as a comparable corpus for English-Persian texts.

In this paper, we propose a cross-lingual approach to tagging the word senses in Persian texts. The new approach makes use of English sense disambiguators, the Wikipedia articles in both English and Persian, and a newly developed lexical ontology, FarsNet. It overcomes the lack of knowledge resources and NLP tools for the Persian language. We demonstrate the effectiveness of the proposed approach by comparing it to a direct sense disambiguation approach for Persian. The evaluation results indicate a comparable performance to the utilized English sense tagger.

Keywords: Word Sense Disambiguation, WordNet, Languages with Scarce Resources, Cross-Lingual, Extended Lesk, FarsNet, Persian.

1 Introduction

Human language is ambiguous, so that many words can be interpreted in multiple ways depending on the context in which they occur. While humans rarely think about the ambiguities of language, machines need to process unstructured textual information, which must be analyzed in order to determine the underlying meaning.

Word Sense Disambiguation (WSD) heavily relies on knowledge. Without knowledge, it would be impossible for both humans and machines to identify the words’ meaning. Unfortunately, the manual creation of knowledge resources is an expensive and time-consuming effort, which must be repeated every time the disambiguation scenario changes (e.g., in the presence of new domains, different languages, and even sense inventories) [1]. This is a fundamental problem which pervades approaches to WSD, and is called the knowledge acquisition bottleneck.

With the huge amounts of information on the Internet and the fact that this information is continuously growing in different languages, we are encouraged to investigate cross-lingual scenarios where WSD systems are also needed. Despite the large number of WSD systems for languages such as English, to date no large-scale and highly accurate WSD system has been built for the Farsi language due to the lack of labeled corpora and monolingual and bilingual knowledge resources.

In this paper we propose a novel cross-lingual approach to WSD that takes advantage of available sense disambiguation systems and linguistic resources for the English language. Our approach demonstrates the capability to overcome the knowledge acquisition bottleneck for languages with scarce resources. This method also provides sense-tagged corpora to aid supervised and semi-supervised WSD systems. The rest of this paper is organized as follows: after reviewing related work in Section 2, we describe the proposed cross-lingual approach in Section 3 and a direct approach to WSD in Section 4, which is followed by evaluation results and a discussion in Section 5. In Section 6 our concluding remarks are presented and future extensions are proposed.

2 Related Work

We can distinguish different approaches to WSD based on the amount of supervision and knowledge they demand. Hence we can classify different methods into 4 groups [1]: Supervised, Unsupervised, Semi-supervised and Knowledge-based.

Generally, supervised approaches to WSD have obtained better results than unsupervised methods. However, obtaining labeled data is not usually easy for many languages, including Persian, as there is no sense-tagged corpus for this language.

The objective of Knowledge-based WSD is to exploit knowledge resources such as WordNet [2] to infer the senses of words in context. These methods usually have lower performance than their supervised alternatives, but they have the advantage of wider coverage, thanks to the use of large-scale knowledge resources.

The recent advancements in corpus linguistics technologies, as well as the availability of more and more textual data, encourage many researchers to take advantage of comparable and parallel corpora to address different NLP tasks. The following subsection reviews some of the related works which address WSD using a cross-lingual approach.

2.1 Cross-Lingual Approaches

Parallel corpora present a new opportunity for combining the advantages of supervised and unsupervised approaches, as well as an opportunity for exploiting translation correspondences in the text. Cross-lingual approaches to WSD disambiguate target words by labelling them with the appropriate translation. The main idea behind this approach is that the plausible translations of a word in context restrict its possible senses to a subset [3].

In recent studies [4–7], it has been found that approaches that use cross-lingual evidence for WSD attain state-of-the-art performance in all-words disambiguation. However, the main problem of these approaches lies in the knowledge acquisition bottleneck: there is a lack of parallel and comparable corpora for several languages, including Persian, which can potentially be relieved by collecting corpora on the Web. To overcome this problem, we utilized Wikipedia pages in both Persian and English. Before introducing our WSD system, a brief survey of WSD systems for the Persian language follows.

2.2 Related Work for Persian

The lack of efficient, reliable linguistic resources and fundamental text processing modules makes the Persian language difficult for computer processing. In recent years there have been two branches of effort to eliminate this shortage [8].

Some researchers are working to provide linguistic resources and fundamental processing units. FarsNet [9] is an ongoing project to develop a lexical ontology to cover Persian words and phrases. It is designed to contain a Persian WordNet in its first phase and grow to cover verbs’ argument structures in its second phase. The included words and phrases are selected according to BalkaNet [10] base concepts and the most frequent Persian words and phrases in the utilized corpora. Therefore, the Persian WordNet closely follows the lines and principles of Princeton WordNet, EuroWordNet and BalkaNet, to maximize its compatibility with these WordNets and to be connected to the other WordNets in the world, enabling cross-lingual tasks such as Machine Translation, multilingual Information Retrieval and the development of multilingual dictionaries and thesauri. FarsNet 1.0 relates synsets in each POS category by the set of WordNet 2.1 relations. FarsNet also contains inter-lingual relations connecting Persian synsets to English synsets (in Princeton WordNet 3.0). [11] exploits an English-Persian parallel corpus which was manually aligned at the word level and sense-tagged a set of observations as a training dataset from which a decision tree classifier is learned. [8] devised a novel approach based on WordNet, eXtended WordNet [12] and the verb parts of FarsNet to extend the Lesk algorithm [13] and find the appropriate sense of a word in an English sentence. Since FarsNet was not released at the time that work was published, they manually translated a portion of WordNet to perform WSD for the Persian side. [14] defined heuristic rules based on the grammatical role, POS tags and co-occurring words of both the target word and its neighbours to find the best sense.

Others work on developing algorithms with less reliance on linguistic resources. We refer to statistical approaches [15–17] using monolingual corpora for solving the WSD problem in Farsi texts. Also, conceptual categories in a Farsi thesaurus have been utilized to discriminate senses of Farsi homographs in [18].

Our proposed approach is unique, when compared to most cross-lingual approaches, in the sense that we utilize a comparable corpus, automatically extracted from Wikipedia articles, which can be available for many language pairs, even languages with scarce resources, and our approach is not limited to sense-tagged parallel corpora. Second, thanks to the availability of FarsNet, our method tags Persian words using sense tags in the same language instead of using either a sense inventory of another language or translations provided by a parallel corpus. Therefore the results of our work can be applied to many monolingual NLP tasks such as Information Retrieval and Text Classification, as well as bilingual ones including Machine Translation and cross-lingual tasks. Moreover, the extended version of the Lesk algorithm has never been exploited to address WSD for Persian texts. Finally, taking advantage of available mappings between synsets in WordNet and FarsNet, we were able to utilize an English sense tagger which uses WordNet as a sense inventory to sense-tag Persian words.

3 Introducing the Cross-Lingual Approach: Persian WSD Using Tagged English Words

This approach consists of two separate phases. In the first phase we utilize an English WSD system to assign sense tags to words appearing in English sentences. In the second phase we transfer these senses to corresponding Persian words. Since by design these two phases are distinct, the first phase can be considered as a black box and different English WSD systems can be employed. What is more, the corresponding Persian words can be Persian pages in Wikipedia or Persian sentences in the aligned corpus.

We created a comparable corpus by collecting Wikipedia pages which are available in both English and Persian and whose Persian articles are not shorter than 250 words. This corpus contains about 35000 words for the Persian side and 74000 words for English.

Therefore, the Cross-Lingual system contains three main building blocks: English Sense Disambiguation (first phase), English to Persian Transfer (transition to the second phase) and Persian Sense Disambiguation (second phase). These components are described in the following sections. Figure 1 shows the system architecture for the Cross-Lingual approach.

3.1 English Sense Disambiguation

As mentioned, different English Sense Disambiguation systems can be employed in this phase. In this system we utilized the Perl-based application SenseRelate [19] for the English WSD phase. SenseRelate uses WordNet to perform knowledge-based WSD.

This system allows a user to specify a range of settings to control the desired disambiguation. We selected the Extended Lesk algorithm, which leads to the most accurate disambiguation [19].

As an input to SenseRelate we provided the plain untagged text of English Wikipedia pages, preprocessed according to the application's preconditions.


Fig. 1. Cross-Lingual System Architecture

We also provided a tweaked stopword list that is more extensive than the one which came bundled with the application (the initial list is available at http://members.unine.ch/jacques.savoy/clef/persianST.txt and was modified and extended according to the application requirements). SenseRelate will tag all ambiguous words in the input English texts using WordNet as a sense repository.

3.2 English to Persian Transfer

Running SenseRelate on the input English sentences, we obtain English words tagged with sense labels. Each of these sense labels corresponds to a synset in WordNet containing that word in a particular sense. Most of these synsets have been mapped to their counterparts in FarsNet. In order to take advantage of these English tags for assigning appropriate senses to Persian words, we first transfer these synsets from English to Persian using the interlingual relations provided by FarsNet. As FarsNet is mapped to WordNet 3.0, there are two inter-lingual relations, equal-to and near-equal-to, between FarsNet and WordNet synsets. Due to the relatively small size of FarsNet we used both relations and did not distinguish between them.

Exploiting these mappings, we match each WordNet synset which is assigned to a word in an English sentence to its corresponding synset in FarsNet. For this part, we developed a Perl-based XML parser and integrated the results into the output provided by SenseRelate.

Along with transferring senses, we also need to transfer Wikipedia pages from English to Persian. Here, we choose the pages which are available in both languages. Hence we can work with the pages describing the same title in Persian.


3.3 Persian Sense Disambiguation

There are two different heuristics for disambiguating senses [1]:

– one sense per collocation: nearby words strongly and consistently contribute to determining the sense of a word, based on their relative distance, order, and syntactic relationship;

– one sense per discourse: a word is consistently referred to with the same sense within any given discourse or document.

The first heuristic is applicable to any available parallel corpus for English-Persian texts, and we can assign the same sense as the English word to its translation appearing in the aligned Persian sentence. In this case, we obtain a very high accuracy, although our system would be limited to this specific type of corpus.

Alternatively, since parallel corpora are not easy to obtain for many language pairs, we utilize Wikipedia pages which are available in both English and Farsi as a comparable corpus. We used these pages in order to investigate the performance of our system on such a corpus, which is easier to collect for languages with scarce resources.

Note that although Farsi pages are not the direct translation of English pages, the context is the same for all corresponding pages, which implies many common words appear in both pages. Consequently, we can assume domain-specific words appear with similar senses in both languages.

Based on the second hypothesis, as the context of both texts is the same, for each matched synset in FarsNet which contains a set of Persian synonym words, we find all these words in the Persian text and we assign the same sense as the English label to them. Since there may be English words which occur multiple times in the text and could receive different sense tags from SenseRelate, we transfer the most common sense to the Persian equivalents. Here we can use either the “most frequent” sense provided by WordNet as the “most common” sense or choose the most locally frequent sense (i.e., in that particular context). Since the second heuristic is more plausible, we opted to apply the most frequent sense of each English word in that text to its Persian translations. As an example, consider that SenseRelate assigned the second sense of the noun “bank” to this word in the following sentence: “a bank is a financial institution licensed by a government.”, and that this sense is the most frequent sense in this English article. The Persian equivalent noun (i.e., “bank”) has six different senses. Among them we select the sense which is mapped to the second sense of the word bank in WordNet, and we assign this sense from FarsNet to “bank”.
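To make the transfer step concrete, the following minimal Python sketch labels Persian tokens with the FarsNet synset mapped from the most common WordNet tag of each English word; the tagged_english list, the wordnet_to_farsnet mapping and the farsnet_members dictionary are hypothetical stand-ins for the SenseRelate output and the FarsNet resources described above, not the authors' actual implementation.

    from collections import Counter

    def transfer_senses(tagged_english, wordnet_to_farsnet, farsnet_members, persian_tokens):
        # tagged_english:     list of (english_word, wordnet_synset_id) pairs
        # wordnet_to_farsnet: WordNet synset id -> FarsNet synset id
        #                     (via the equal-to / near-equal-to relations)
        # farsnet_members:    FarsNet synset id -> set of Persian member words
        # persian_tokens:     tokens of the comparable Persian article
        votes = {}
        for word, synset_id in tagged_english:
            votes.setdefault(word, Counter())[synset_id] += 1
        persian_tags = {}
        for word, counts in votes.items():
            # one sense per discourse: keep the most common tag of this word
            wordnet_id = counts.most_common(1)[0][0]
            farsnet_id = wordnet_to_farsnet.get(wordnet_id)
            if farsnet_id is None:
                continue  # no inter-lingual mapping for this synset
            for token in persian_tokens:
                if token in farsnet_members.get(farsnet_id, set()):
                    persian_tags[token] = farsnet_id
        return persian_tags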

We consider 3 possible scenarios:

1. An English word has more than one sense, while the equivalent Persian word only has one sense. So, SenseRelate disambiguates the senses for this English word, and the equivalent Persian word does not need disambiguation. For example “free” in English is a polysemic word which can mean both “able to act at will” and “costing nothing”, while we have different words for these senses in Persian (“azad” and “majani” respectively). In this case we are confident that the transferred sense must be the correct sense for the Persian word.

2. Both the English and the Persian words are polysemous, so, as their contexts are the same, the senses should be the same. In this case we use the mappings between synsets in WordNet and FarsNet. For example, the word “branch” in English and its Persian equivalent “shakheh” are both polysemous with a similar set of senses. For example, if SenseRelate assigned the 5th sense (i.e., “a stream or river connected to a larger one”) of this word to its occurrence in an English sentence, the mapped synset in FarsNet would also correspond to this sense of the Persian “shakheh”.

3. The third scenario happens when an English word has only one sense, while the Persian equivalent has more than one. In this case, as the context of both texts is the same, the Persian word is more likely to occur with the same sense as the English word. For example, the noun “Milk” in English has only one meaning, while its translation in Farsi (i.e., “shir”) has three distinct meanings: milk, lion and (water) tap. However, since SenseRelate assigns a synset with the gloss “a white nutritious liquid secreted by mammals and used as food by human beings” to this word, the first sense will be selected for “shir”.

In summary, for all 3 possible scenarios we utilize the mappings from WordNet synsets to FarsNet ones. However, according to our evaluation results, the first case usually leads to more accurate results and the third case results in the lowest accuracy. Nonetheless, when it comes to domain-specific words, all three cases result in a high precision rate.

4 Direct Approach: Applying Extended Lesk for Persian WSD

Thanks to the newly developed FarsNet, the Lesk method (gloss overlap) is applicable to Persian texts as well. Since it is worthwhile to investigate the performance of this knowledge-based method, which has not as yet been employed for disambiguating Persian words, and to compare the results of the Cross-Lingual and Direct approaches, in the second part of this experiment the Extended Lesk algorithm has been applied directly to Persian.

4.1 WSD Using the Lesk Algorithm

The Lesk algorithm uses dictionary definitions (glosses) to disambiguate a polysemous word in a sentence context. The original algorithm counts the number of words that are shared between two glosses. The more overlapping the glosses are, the more related the senses are. To disambiguate a word, the gloss of each of its senses is compared to the glosses of every other word in a phrase. A word is assigned the sense whose gloss shares the largest number of words in common with the glosses of the other words.
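The gloss-overlap idea can be illustrated with the short Python sketch below, which scores each candidate sense of a target word by the number of words its gloss shares with the glosses of the other context words; the glosses dictionary and the whitespace tokenizer are simplifying assumptions rather than part of any system used in this paper.

    def tokens(text):
        return set(text.lower().split())

    def lesk(target, context_words, glosses):
        # glosses: dict mapping (word, sense_id) -> dictionary definition of that sense
        best_sense, best_score = None, -1
        for (word, sense_id), gloss in glosses.items():
            if word != target:
                continue
            gloss_tokens = tokens(gloss)
            # count words shared with the glosses of every other context word's senses
            score = sum(len(gloss_tokens & tokens(other_gloss))
                        for (other, _), other_gloss in glosses.items()
                        if other != target and other in context_words)
            if score > best_score:
                best_sense, best_score = sense_id, score
        return best_sense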


The major limitation of this algorithm is that dictionary glosses are often quite brief and may not include sufficient vocabulary to identify related senses. An improved version of the Lesk algorithm, Extended Lesk [20], has been employed to overcome this limitation.

4.2 Extended Gloss Overlap

The Extended Lesk algorithm extends the glosses of the concepts to include the glosses of other concepts to which they are related according to a given concept hierarchy.

Synsets are connected to each other through explicit semantic relations that are defined in WordNet. These relations only connect word senses that are used in the same part of speech. Noun synsets are connected to each other through hypernym, hyponym, meronym, and holonym relations. There are other types of relations between different parts of speech in WordNet, but we focused on these four types in this paper. These relations are also available for Persian synsets in FarsNet.

Thus, the extended gloss overlap measure combines the advantages of gloss overlaps with the structure of a concept hierarchy to create an extended view of relatedness between synsets.
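A rough sketch of the extended overlap follows, assuming NLTK's WordNet interface is available: the extended gloss of a synset concatenates its own definition with those of its hypernyms, hyponyms, part meronyms and member holonyms (a subset of the relations listed above), and relatedness is the plain count of shared words rather than the phrase-based scoring of [20].

    from nltk.corpus import wordnet as wn  # assumes the NLTK WordNet data is installed

    def extended_gloss(synset):
        # gather the synset's gloss plus the glosses of directly related synsets
        related = (synset.hypernyms() + synset.hyponyms() +
                   synset.part_meronyms() + synset.member_holonyms())
        text = " ".join([synset.definition()] + [s.definition() for s in related])
        return set(text.lower().split())

    def extended_overlap(synset_a, synset_b):
        # relatedness = number of words shared by the two extended glosses
        return len(extended_gloss(synset_a) & extended_gloss(synset_b))

    # e.g. extended_overlap(wn.synsets('branch')[0], wn.synsets('stream')[0])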

4.3 Applying Extended Lesk to Persian WSD

In order to compare the results of the Direct and Cross-Lingual approaches, the output from the Cross-Lingual phase is used as an input to the knowledge-based (direct) phase. Each tagged word from the input is considered as a target word to receive a second sense tag based on the Extended Lesk algorithm. We adopted the method described in [20] to perform WSD for the Persian language. Persian glosses were collected using the semantic relations implemented for FarsNet. STeP-1 [21] was used for tokenizing glosses and stemming the content words.

5 Evaluation

5.1 Cross-Lingual Approach

The results of this method have been evaluated on comparable English and Persian Wikipedia pages. Seven human experts, who are all native Persian speakers, were involved in the evaluation process; they evaluated each tagged word as “the best sense assigned”, “almost accurate” or “wrong sense assigned”. The second option covers cases in which the assigned sense is not the best available sense for a word in a particular context, but is very close to the correct meaning (not a wrong sense); this is influenced by the evaluation metric proposed by Resnik and Yarowsky in [22]. Currently the tagged words from each Wikipedia article were evaluated by one evaluator only. Evaluation results indicate an error rate of 25% for these pages. Table 1 summarizes these results.


Table 1. Evaluation Results

                   Cross-Lingual           Direct                  Baseline
                   P     R     F-Score     P     R     F-Score     P     R     F-Score
Best Sense         68%   0.35  0.48        51%   0.35  0.44        39%   0.35  0.40
Almost Accurate    7%                      9%                      8%
Wrong Sense        25%                     40%                     53%

Our results indicate that the domain-specific words, which usually occur frequently in both the English and Persian texts, are highly likely to receive the correct sense tag.

Due to the relatively smaller size of Persian texts, this system suffers from a low recall of 35%. However, as Wikipedia covers more and more Persian pages every day, soon we will be able to overcome this bottleneck.

According to the evaluation results, our Cross-Lingual method gained an F-score of 0.48, which is comparable to the 0.54 F-score of SenseRelate using Extended Lesk [19]. (F-Score is calculated as 2 × (1 − ErrorRate) × Recall / ((1 − ErrorRate) + Recall), where ErrorRate is the percentage of words that have been assigned the wrong sense.) This indicates that the performance of our approach can reach the F-score of the utilized English tagger. Employing a more accurate English sense tagger would thus further improve the WSD results for Persian words.
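As a quick arithmetic check against the Table 1 values (an error rate of 25% and a recall of 0.35), the formula gives F-Score = 2 × (1 − 0.25) × 0.35 / ((1 − 0.25) + 0.35) = 0.525 / 1.10 ≈ 0.48, matching the reported figure.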

This system can be further evaluated by comparing its output to the results of assigning either random senses or the first sense to words. Since the senses in FarsNet are not sorted based on their frequency of usage (in contrast to WordNet), we decided to use the first sense appearing in FarsNet (for each POS). Assigning the first sense to all tagged Persian words, the performance decreased significantly in terms of accuracy. The results in Table 1 indicate that applying our novel approach results in a 28% improvement in accuracy in comparison with this selected baseline. However, assigning the most frequent sense to Persian words would be a more realistic baseline, which yields a better estimation of our system's performance. Thus, once frequency of usage information is provided for FarsNet senses, we anticipate that this problem will be minimized.

5.2 Direct Knowledge-Based Approach

As mentioned, the output of the Cross-Lingual method was tagged again using the Direct approach. Overall, 53% of the words received a different tag using the Direct approach. Table 1 indicates the evaluation results for this approach.

5.3 Comparison: Knowledge Based vs. Cross-Lingual

Both systems employ the Extended Lesk algorithm. While the Cross-Lingual method applies Extended Lesk on the English side and transfers senses to Persian words, the Direct approach works with Persian text directly. In other words, the former considers the whole text as the context and assigns one sense per discourse, and the latter considers surrounding words and assigns one sense per collocation. Furthermore, the Cross-Lingual method exploits WordNet for extending the glosses, which covers more words, senses and semantic relations than FarsNet, which is employed by the Direct method.

The main advantage of the Cross-Lingual method is that we can utilize any highly accurate English sense disambiguator for the first phase while the Persian side remains intact.

On the other hand, this approach assigns the same tag (the most common sense) to all occurrences of a word, which sacrifices accuracy. Moreover, if there is no English text with the same context available for a Persian corpus, this method cannot be applied. However, collecting comparable texts over the web is not difficult. Finally, when the bilingual texts are not the direct translation of one another, the system coverage will be limited to the common words in both English and Persian texts. So, the Cross-Lingual method mainly works well for domain words and not for all the words appearing in the Persian texts.

Although performing Persian WSD directly on Persian texts seems more promising, the evaluation results indicate a better performance for the Cross-Lingual system. The reasons for this observation have been investigated and are as follows:

1. Lack of reliable NLP tools for the Persian language. While STeP-1 has just been made available as a tokenizer and a stemmer, there is no POS tagger for Persian, which complicated the disambiguation process.

2. Lack of comprehensive linguistic resources for the Persian language. FarsNet is a very valuable resource for the Persian language. However, it is still at a preliminary stage of development and does not cover all words and senses in Persian. In terms of size it is significantly smaller (10000 synsets) than WordNet (more than 117000 synsets), and it covers roughly 9000 relations between both senses and synsets.

3. More ambiguity for Farsi words. Disambiguating a Farsi word is a big challenge. Due to the fact that short vowels are not written in the Farsi script, one needs to consider all types of homographs, including heteronyms and homonyms. Moreover, there is no POS tagger to disambiguate Farsi words, which dramatically increases the ambiguity for many Farsi words.

6 Conclusion and Future Work

A large number of WSD systems for widespread languages such as English are available. However, to date no large-scale and highly accurate WSD system has been built for the Farsi language due to the lack of labeled corpora and monolingual and bilingual knowledge resources.

In this paper we overcame this problem by taking advantage of English sense disambiguators, the availability of articles in both languages in Wikipedia, and the newly developed lexical ontology, FarsNet, in order to address WSD for Persian. The evaluation results of the Cross-Lingual approach show a 28% improvement in accuracy in comparison with the first-sense baseline. The Cross-Lingual approach performed better than the knowledge-based approach that is directly applied to Persian sentences. However, one of the main reasons for this performance is that the lack of NLP tools and comprehensive knowledge resources for Persian introduces many challenges for systems investigating this language.

In a first step, this paper examined a novel idea for cross-lingual WSD in terms of plausibility, feasibility and performance. The ultimate results of our approach demonstrate a comparable performance to the utilized English sense tagger. Therefore, in the next step we will replace SenseRelate with another English sense tagger with a higher F-score. By gaining higher accuracy and recall for the Persian WSD system, we can exploit it as part of a bootstrapping system to create the first sense-tagged corpus to aid supervised WSD approaches for the Persian language. Finally, as the available tools and resources improve for the Persian language, the Direct approach can be employed to address WSD for Persian texts directly when no comparable English text is available.

Acknowledgements. We would like to thank Prof. Shamsfard from the Natural Language Processing Research Laboratory of Shahid Beheshti University (SBU) for providing us with the FarsNet 1.0 package.

References

1. Navigli, R.: Word sense disambiguation: A survey. ACM Computing Surveys (2009)

2. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography (1990)

3. Brown, P.F., Pietra, S.A.D., Pietra, V.J.D., Mercer, R.L.: A statistical approach to sense disambiguation in machine translation. In: Proceedings of the Workshop on Speech and Natural Language (1991)

4. Diab, M., Resnik, P.: An unsupervised method for word sense tagging using parallel corpora. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (2002)

5. Mihltz, M., Pohl, G.: Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation. In: Proceedings of the 5th Conference on Language Resources and Evaluation (2006)

6. Tufis, D., Ion, R., Ide, N.: Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets. In: Proceedings of the 20th International Conference on Computational Linguistics (2004)

7. Tufis, D., Koeva, S.: Ontology-Supported Text Classification Based on Cross-Lingual Word Sense Disambiguation. In: Proceedings of the 7th International Workshop on Fuzzy Logic and Applications: Applications of Fuzzy Sets Theory (2007)

8. Motazedi, Y., Shamsfard, M.: English to Persian machine translation exploiting semantic word sense disambiguation. In: 14th International CSI Computer Conference, CSICC 2009 (2009)

9. Shamsfard, M., Hesabi, A., Fadaei, H., Mansoory, N., Famian, A., Bagherbeigi, S., Fekri, E., Monshizadeh, M., Assi, S.M.: Semi Automatic Development of FarsNet; The Persian WordNet. In: Proceedings of the 5th Global WordNet Conference (2010)

10. Stamou, S., Oflazer, K., Pala, K., Christoudoulakis, D., Cristea, D., Tufis, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M.: BALKANET: A Multilingual Semantic Network for the Balkan Languages. In: Proceedings of the 1st Global WordNet Association Conference (2002)

11. Faili, H.: An experiment of word sense disambiguation in a machine translation system. In: International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2008 (2008)

12. Harabagiu, S.M., Miller, G.A., Moldovan, D.I.: WordNet 2 - a morphologically and semantically enhanced resource (1999)

13. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the 5th Annual International Conference on Systems Documentation (1986)

14. Saedi, C., Shamsfard, M., Motazedi, Y.: Automatic Translation between English and Persian Texts. In: Proceedings of the 3rd Workshop on Computational Approaches to Arabic-script Based Languages (2009)

15. Mosavi Miangah, T., Delavar Khalafi, A.: Word Sense Disambiguation Using Target Language Corpus in a Machine Translation System (June 2005)

16. Soltani, M., Faili, H.: A statistical approach on Persian word sense disambiguation. In: The 7th International Conference on Informatics and Systems, INFOS 2010 (2010)

17. Mosavi Miangah, T.: Solving the Polysemy Problem of Persian Words Using Mutual Information Statistics. In: Proceedings of the Corpus Linguistics Conference (CL 2007) (2007)

18. Makki, R., Homayounpour, M.: Word Sense Disambiguation of Farsi Homographs Using Thesaurus and Corpus. In: Advances in Natural Language Processing (2008)

19. Pedersen, T., Kolhatkar, V.: WordNet::SenseRelate::AllWords: a broad coverage word sense tagger that maximizes semantic relatedness. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session (2009)

20. Banerjee, S.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 805–810 (2003)

21. Shamsfard, M., Sadat Jafari, H., Ilbeygi, M.: STeP-1: A Set of Fundamental Tools for Persian Text Processing. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010) (2010)

22. Resnik, P., Yarowsky, D.: Distinguishing systems and distinguishing senses: new evaluation methods for Word Sense Disambiguation. Nat. Lang. Eng. (1999)


COSINE: A Vertical Group Difference Approach to Contrast Set Mining

Mondelle Simeon and Robert Hilderman

Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
{simeon2m,hilder}@cs.uregina.ca

Abstract. Contrast sets have been shown to be a useful mechanism for describing differences between groups. A contrast set is a conjunction of attribute-value pairs that differ significantly in their distribution across groups. These groups are defined by a selected property that distinguishes one from the other (e.g., customers who default on their mortgage versus those who don’t). In this paper, we propose a new search algorithm which uses a vertical approach for mining maximal contrast sets on categorical and quantitative data. We utilize a novel yet simple discretization technique, akin to simple binning, for continuous-valued attributes. Our experiments on real datasets demonstrate that our approach is more efficient than two previously proposed algorithms, and more effective in filtering interesting contrast sets.

1 Introduction

Discovering the differences between groups is a fundamental problem in many disciplines. Groups are defined by a selected property that distinguishes one group from the other, for example, gender (male and female students) or year of admission (students admitted from 2001 to 2010). The group differences sought are novel, implying that they are not obvious or intuitive; potentially useful, implying that they can aid in decision-making; and understandable, implying that they are presented in a format easily understood by human beings. For example, financial institutions may be interested in analyzing historical mortgage data to understand the differences between individuals who default and those who don’t. Analysis may reveal that individuals who are married have lower default rates. Contrast set mining [1] [2] [3] [4] has been developed as a data mining task which aims to efficiently identify differences between groups from observational multivariate data.

The contrast set mining techniques previously proposed have all been based on a horizontal mining approach that has been restricted to categorical attributes or a limited number of quantitative attributes. In this paper, we propose a new vertical mining approach for generating contrast sets, which can be applied to any number of categorical and quantitative attributes. This technique allows simultaneous candidate generation and support counting, unlike horizontal approaches, and it allows for efficient pruning of the search space. A novel yet simple discretization method, based on the statistical properties of the data values, is utilized in order to produce intervals for continuous-valued attributes.

The remainder of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we describe the contrast set mining problem. In Section 4, we provide an overview of the vertical data format and the search framework for contrast set mining. In Section 5, we introduce our algorithm for mining maximal contrast sets. In Section 6, we present a summary of experimental results from a series of mining tasks. In Section 7, we conclude and suggest areas for future work that are being considered.

2 Related Work

The STUCCO (Search and Testing for Understandable Consistent Contrasts) algorithm [1] [2], which is based on the Max-Miner rule discovery algorithm [5], was introduced as a technique for mining contrast sets. The objective of STUCCO is to find statistically significant contrast sets from grouped categorical data. It employed a breadth-first search to enumerate the search space, used the chi-squared (χ2) test to measure independence, and employed a modified Bonferroni statistic to limit type-1 errors resulting from multiple hypothesis tests. This algorithm formed the basis for a method proposed to discover negative contrast sets [6] that can include negation of terms in the contrast set. The main difference was their use of Holm’s sequential rejective method [7] for the independence test.

The CIGAR (Contrasting Grouped Association Rules) algorithm [3] was proposed as a contrast set mining technique that not only considers whether the difference in support between groups is significant, but also specifically identifies which pairs of groups are significantly different and whether the attributes in a contrast set are correlated. CIGAR utilizes the same general approach as STUCCO; however, it focuses on controlling Type II error through increasing the significance level for the significance tests, and by not correcting for multiple comparisons.

Contrast set mining has also been attempted on continuous data. One of the earliest attempts focussed on the formal notion of a time series contrast set [8] and proposed an efficient algorithm to discover timeseries contrast sets on timeseries and multimedia data. The algorithm utilizes a SAX alphabet [9] to convert continuous data to discrete data (discretization). Another approach utilized a modified equal-width binning interval where the approximate width of the intervals is provided as a parameter to the model [4]. The methodology used is similar to STUCCO, with the discretization process added so that it takes place before enumerating the search space.

3 Problem Definition

Let A = {a1, a2, · · · , an} be a set of distinct attributes. We use Q and C to denote the set of quantitative attributes and the set of categorical attributes, respectively. Let V(ak) be the set of possible values that each ak can take on. An attribute-interval pair, denoted as ak : [vkl, vkr], is an attribute ak associated with an interval [vkl, vkr], where ak ∈ A, and vkl, vkr ∈ V(ak). Further, if ak ∈ C then vkl = vkr, and if ak ∈ Q, then vkl ≤ vkr. A transaction T is a set of values {x1, x2, x3, · · · , xn}, where xj ∈ V(aj) for 1 ≤ j ≤ n. A database D is a set of transactions. A database has a class F, which is a set F = {a1, a2, · · · , ak}, where ∀ak ∈ A and 1 ≤ |F| < |A|. A group, G, is a conjunction of distinct class attribute-interval pairs. Formally,

G = {a1 : [v1l, v1r] ∩ · · · ∩ an : [vnl, vnr]}, ai, aj ∈ F, ai ≠ aj, ∀i, j

A quantitative contrast set, X, is a conjunction of attribute-interval pairs having distinct attributes defined on groups G1, G2, · · · , Gn. Formally,

X = {a1 : [v1l, v1r] ∩ · · · ∩ an : [vnl, vnr]}, ai, aj ∈ A − F, ai ≠ aj, ∀i, j

∃ X ∩ G1, X ∩ G2, · · · , X ∩ Gn : Gi ∩ Gj = ∅, ∀i ≠ j

Henceforth, a contrast set refers to a quantitative contrast set. Given a contrast set, X, we define its attribute-interval set, denoted as AI(X), as the set {ai : [vil, vir] | ai : [vil, vir] ∈ X}. A contrast set X is called k-specific if the cardinality of its attribute-interval set, |AI(X)|, is equal to k. Given two contrast sets, X and Y, we say that X is a subset of Y, denoted as X ⊂ Y, if AI(X) ⊂ AI(Y).

The frequency of a quantitative contrast set X in D, denoted as freq(X), is the number of transactions in D where X occurs. The tidset of a contrast set, X, is the set t(X) ⊆ T, consisting of all the transactions which contain X. The diffset of a contrast set, X, is the set d(X) ⊆ T, consisting of all the transactions which do not contain X. The support of X for a group Gi, denoted as supp(X, Gi), is the percentage of transactions in the database that belong to Gi where X occurs. A contrast set is called maximal if it is not a subset of any other contrast set.

A contrast set, X, is called a group difference if and only if the following four criteria are satisfied:

∃ i, j : supp(X, Gi) ≠ supp(X, Gj)    (1)

max_{i,j} |supp(X, Gi) − supp(X, Gj)| ≥ ε    (2)

freq(X) ≥ σ    (3)

max_{i=1..n} { supp(Y, Gi) / supp(X, Gi) } ≤ κ    (4)

where ε is a threshold called the minimum support difference, σ is a minimum frequency threshold, κ is a threshold called the maximum subset support ratio, and Y ⊂ X with |AI(Y)| = |AI(X)| + 1.


Table 1. Dataset

TID A B C D E

1 1 0 1 1 1

2 0 1 1 0 1

3 1 1 0 0 1

4 1 0 1 1 1

5 0 0 0 1 1

The first criterion ensures that the contrast set represents a true difference between the groups. Contrast sets that meet this criterion are called significant. The second criterion ensures the effect size. Contrast sets that meet this criterion are called large. The third criterion ensures that the contrast set occurs in a large enough number of transactions. Contrast sets that meet this criterion are called frequent. The fourth criterion ensures that the support of the contrast set in each group is different from that of its superset. Contrast sets that meet this criterion are called specific.

The task of finding all group differences from the set of all contrast sets becomes prohibitively expensive because of a possibly exponentially sized search space. However, a more manageable task is to find the set of maximal group differences. Our goal then is to find all the maximal group differences in a given dataset (i.e., all the maximal contrast sets that satisfy Equations 1, 2, 3, and 4).
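A condensed Python sketch of the four filters follows; the per-group supports, the p-value of the independence test (Section 5.1) and the related contrast sets consulted by criterion (4) are assumed to be supplied by the caller, so this is an illustration of the definition rather than the COSINE implementation.

    from itertools import combinations

    def is_group_difference(supports, freq, p_value, related_supports,
                            epsilon, sigma, kappa, alpha=0.05):
        # supports:         dict group -> support of X in that group
        # related_supports: list of dicts, one per contrast set Y used in criterion (4)
        # p_value:          result of the independence test of Section 5.1
        significant = p_value < alpha                                      # Eq. (1)
        large = max(abs(supports[a] - supports[b])
                    for a, b in combinations(supports, 2)) >= epsilon      # Eq. (2)
        frequent = freq >= sigma                                           # Eq. (3)
        specific = all(                                                    # Eq. (4)
            max((y[g] / supports[g] for g in supports if supports[g] > 0),
                default=0.0) <= kappa
            for y in related_supports)
        return significant and large and frequent and specific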

4 Background

4.1 Data Format

Our algorithm uses a vertical data format given that we manipulate the tidsets in determining the frequency of the contrast sets. Mining algorithms using the vertical format have been shown to be very effective and usually outperform horizontal approaches [10] [11]. We specifically utilize diffsets, which have been shown to substantially improve the running time of algorithms that use them instead of the traditional tidsets [11] [12].
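For the sample dataset in Table 1, the vertical format and the diffset shortcut can be pictured with the small sketch below; this is only an illustration of the representation, not the data structures used in the actual system.

    # Vertical view of Table 1: item -> tidset (transactions containing the item)
    tidsets = {'A': {1, 3, 4}, 'B': {2, 3}, 'C': {1, 2, 4},
               'D': {1, 4, 5}, 'E': {1, 2, 3, 4, 5}}
    all_tids = {1, 2, 3, 4, 5}

    # Diffset of an item: the transactions that do NOT contain it
    diffsets = {item: all_tids - tids for item, tids in tidsets.items()}

    # Frequency of the 2-specific contrast set AC from tidsets ...
    freq_ac = len(tidsets['A'] & tidsets['C'])           # 2 (TIDs 1 and 4)
    # ... or from the diffset relative to the prefix A: d(AC) = t(A) - t(C)
    d_ac = tidsets['A'] - tidsets['C']                    # {3}
    freq_ac_via_diffset = len(tidsets['A']) - len(d_ac)   # 2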

4.2 Search for Quantitative Contrast Sets

Our algorithm uses a backtracking search paradigm in order to enumerate all maximal group differences. Backtracking algorithms are useful because they allow us to iterate through all the possible configurations of the search space. Consider a sample dataset shown in Table 1 with five attributes, A, B, C, D, and E, each taking on values of 0 and 1 indicating absence and presence, respectively, in a transaction. Each transaction is identified by a TID. The full search space tree is shown in Figure 1.

The root of the tree corresponds to the combine set {A, B, C, D, E}, which is composed of the 1-specific contrast sets from the items shown in Table 1.


Fig. 1. Search Tree: Square indicates maximal contrast sets

All these contrast sets share the empty prefix in common. The leftmost child of the root consists of all the subsets containing A as the prefix, i.e. the set {AB, AC, AD, AE}, and so on. A combine set lists the contrast sets that the prefix can be extended with to obtain a new contrast set. Clearly no subtree of a node that fails to satisfy Equations 1, 2, 3, and 4 has to be examined. The main advantage of this approach is that it allows us to break up the original search space into independent sub-problems. The subtree rooted at A can be treated as a completely new problem such that the contrast sets under it can be enumerated, prefixed with the contrast set A, and so on.

Formally, for a set of contrast sets with prefix P, [P] = {X1, X2, · · · , Xn}, the intersection of PXi with all of PXj with j > i is performed to obtain a new combine set [PXi] where the contrast set PXiXj meets Equations 1, 2, 3, and 4. For example, from [A] = {B, C, D, E}, we obtain [AB] = {C, D, E}, [AC] = {D, E}, [AD] = {E}, [AE] = {} for the next level of the search tree. A node with an empty combine set such as [AE] need not be explored further.

4.3 Distribution Difference

We utilize an interestingness measure, referred to in this paper as the distribution difference, which measures how different the group support in the contrast set is from the entire dataset [4]. Formally, the distribution difference of a contrast set, X, is

Distribution Difference(X) = Σ_{i=1..n} | (n(X, Gi) / n(X)) × (N / n(Gi)) − 1 |

where n is the number of groups, N is the total number of transactions in the dataset, n(Gi) is the number of transactions that belong to Gi, n(X) is the number of transactions where X occurs, and n(X, Gi) is the number of transactions in group Gi where X occurs.
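A small sketch of this measure is given below; the two dictionaries of counts are hypothetical inputs.

    def distribution_difference(n_x_g, n_g):
        # n_x_g: group -> transactions in that group containing X
        # n_g:   group -> total transactions in that group
        n_x = sum(n_x_g.values())      # transactions containing X
        big_n = sum(n_g.values())      # transactions in the whole dataset
        return sum(abs((n_x_g[g] / n_x) * (big_n / n_g[g]) - 1) for g in n_g)

    # Example: X occurs 30 times in G1 (of 100) and 10 times in G2 (of 100) -> 1.0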


5 Our Proposed Approach

In this section we introduce our vertical approach to contrast set mining and describe it using the dataset in Table 1.

5.1 Tests for Significance

Like STUCCO, in order to determine if a contrast set is significant we use a 2×G contingency table where the row represents the truth of the contrast set, and the column indicates group membership. We use the standard test for independence of variables in contingency tables, the χ2 statistic. To correct for small sample sizes (i.e., less than 1000), we use Fisher’s exact test when the number of groups is two, and Yates’ correction otherwise. Also like STUCCO, we use a Bonferroni-like adjustment to reduce the number of false discoveries.
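A hedged sketch of this test is shown below, assuming SciPy is available; it collapses the small-sample handling and the modified Bonferroni adjustment into a plain alpha / n_tests correction, so it only approximates the procedure described here.

    from scipy.stats import chi2_contingency, fisher_exact

    def contrast_set_significant(counts_true, counts_false, alpha, n_tests=1):
        # counts_true[i] / counts_false[i]: transactions in group i with / without the contrast set
        table = [counts_true, counts_false]              # 2 x G contingency table
        if len(counts_true) == 2 and sum(counts_true) + sum(counts_false) < 1000:
            _, p = fisher_exact(table)                   # exact test for small 2 x 2 tables
        else:
            _, p, _, _ = chi2_contingency(table)         # chi-squared test of independence
        return p < alpha / max(n_tests, 1)               # crude Bonferroni-style adjustment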

5.2 Comparison of Contrasting Groups

In determining statistical significance, when we reject the null hypothesis, we can conclude that a significant difference exists amongst the groups. When there are only two groups, we know that the difference lies between “Group 1 and not Group 1 (i.e., Group 2)”. However, when there are more than two groups, we do not have enough information to determine specifically amongst which groups the differences lie. We use a set of 2 × 2 contingency tables representing the absence and presence of each group and determine with which pairs there is a significant difference. This is referred to as the one versus all approach.

Formally, with the one versus all approach, for a contrast set X, where ∃i P(X|Gi), we determine

P(X|Gi) ≠ P(X|¬Gi), ∀i.    (5)
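The one versus all comparison amounts to collapsing the G groups into “group i” versus “everything else” before testing, as in the following sketch, which reuses the hypothetical count vectors from the previous example.

    def one_versus_all_tables(counts_true, counts_false):
        # yield a 2 x 2 table (group i vs. all other groups) for each group i
        total_true, total_false = sum(counts_true), sum(counts_false)
        for i, (t, f) in enumerate(zip(counts_true, counts_false)):
            yield i, [[t, total_true - t],
                      [f, total_false - f]]
    # each table can then be fed to a 2 x 2 test to locate where the difference lies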

5.3 Discretization

In order to determine intervals for quantitative attributes, we use a discretization approach to determine the endpoints of the intervals. Our algorithm uses statistical properties of the values (i.e., the mean and standard deviation) to determine where an interval begins and ends. This makes our approach simple, akin to simple binning methods, which use a fixed number of intervals, yet more responsive to the distribution of the values in determining the number of intervals. Our Discretize algorithm, shown in Algorithm 1, takes a set of values for a quantitative attribute and returns a list of cut-points.

The algorithm starts by sorting the values in ascending order. The minimum, maximum, mean and standard deviation, Vmin, Vmax, Vmean, Vsd, respectively, are determined. Vmean is the first cut-point. The algorithm then generates further cut-points until they come within half a standard deviation of the minimum and maximum values. For example, assume that the minimum and maximum values for an attribute in a set of transactions are 19.4 and 45.8, respectively, with a mean of 28.5 and a standard deviation of 3.5.


Algorithm 1. Discretize Algorithm
Input: A set of values V
Output: A list of cut-points C

1: Discretize(V)
2: C = ∅
3: Sort V
4: Calculate Vmin, Vmax, Vmean, Vsd
5: Lcp = Vmean − Vsd
6: Rcp = Vmean + Vsd
7: while Lcp ≥ Vmin + 0.5 × Vsd do
8:   C = C ∪ Lcp
9:   Lcp = Lcp − Vsd
10: end while
11: while Rcp ≤ Vmax − 0.5 × Vsd do
12:   C = C ∪ Rcp
13:   Rcp = Rcp + Vsd
14: end while

Initially, Lcp would be 28.5 − 3.5 = 25.0 and Rcp would be 28.5 + 3.5 = 32.0. Since both values are more than a standard deviation away from the minimum and maximum values, they are added to C. The process is repeated, generating additional cut-points of 21.5, 35.5, 39, and 42.5.
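A direct Python transcription of Algorithm 1 is given below; whether the standard deviation is the population or sample value, and the order in which the cut-points are returned, are assumptions not fixed by the pseudocode.

    import statistics

    def discretize(values):
        # returns cut-points for a quantitative attribute (sketch of Algorithm 1)
        values = sorted(values)
        v_min, v_max = values[0], values[-1]
        v_mean = statistics.mean(values)
        v_sd = statistics.pstdev(values)
        cuts = [v_mean]                        # the mean is the first cut-point
        left, right = v_mean - v_sd, v_mean + v_sd
        while left >= v_min + 0.5 * v_sd:      # walk left one standard deviation at a time
            cuts.append(left)
            left -= v_sd
        while right <= v_max - 0.5 * v_sd:     # walk right one standard deviation at a time
            cuts.append(right)
            right += v_sd
        return cuts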

5.4 Mining Maximal Group Differences

In order to find all the maximal group differences in a given dataset, i.e., all the quantitative contrast sets that satisfy Equations 1, 2, 3, and 4, we present our algorithm, COSINE (Contrast Set Exploration using Diffsets), in Algorithm 2. It adapts several tenets of the backtracking search technique first proposed in [11] for contrast set mining.

COSINE begins by first determining all the 1-specific quantitative contrast sets from the V of each attribute in the dataset not in the class F, and storing them in B (lines 1-6). Attributes which are quantitative are discretized using our Discretize algorithm to determine a V set from which 1-specific quantitative contrast sets can be generated. For each element in B, COSINE determines its diffset, Dx, its frequency, Fx, and the cardinality of its potential combine set, Cx. It then uses a one versus all approach to determine with which specific groups the differences lie, then adds the contrast sets that satisfy Equations 1, 2, and 3 into a combine set C0 (lines 8-14). C0 is then sorted in ascending order of the cardinality of Cx, then by the frequency, Fx (line 17). Using these two criteria to order the combine set has been shown to more likely eliminate many branches in the search tree from consideration and to produce a smaller backtracking tree [11]. COSINE then calls a subroutine, MINE, presented in Algorithm 3, with C0, M, which will hold all our maximal group differences at the end, and the prefix, P0 (line 18). If we consider the example in Figure 1, COSINE starts at the root of the tree with P0 = ∅, and with {A, B, C, D, E}, sorted as {E, D, C, B, A}, as C0.


Algorithm 2. COSINE(D, F)
Input: Dataset D and class F
Output: The set of all maximal group differences M

1: for each i ∈ A, A ∈ D, i ∉ F do
2:   if i ∈ Q then
3:     V(i) = Discretize(i)
4:   end if
5:   B = B ∪ V(i)
6: end for
7: C0 = {}
8: for each x ∈ B do
9:   Determine Dx, Fx, and |Cx|
10:  if significant(x) & large(x) & frequent(x) then
11:    Determine P(x|Gi) ≠ P(x|¬Gi), ∀i
12:    C0 = C0 ∪ {x}
13:  end if
14: end for
15: Sort C0 in increasing |Cx| then in increasing Fx
16: MINE(P0, C0, M)
17: return M

MINE first determines Pl+1, which is simply x. Secondly, it determines a new possible set of combine elements for Pl+1, Hl+1, by first stripping the previous prefix Pl from Pl+1, creating P′l+1. It then determines, from the list of elements in Cl, those which are greater than (appear after) Pl+1. For any such element, y, MINE strips it of the prefix Pl, creating y′. It then checks whether the attribute-interval sets of P′l+1 and y′ are different. P′l+1 and y′ are 1-specific contrast sets, and if they have the same attribute-interval set, it means they originate from the same attribute and cannot be part of a new contrast set, as we require contrast sets to have unique attributes. If they are not equal, y is added to Hl+1 (lines 4-12). In our example, P1 = {E}, and since P0 = {}, then H1 = {D, C, B, A}.

MINE next determines if the cardinality of the current set of maximal contrast sets, Ml, is greater than zero. If it is, MINE checks if Pl+1 ∪ Hl+1 is subsumed by an existing maximal set. If yes, the current and subsequent contrast sets in Cl can be pruned away (lines 13-17). If not, an extension is necessary. MINE then creates a new combine set, Cl+1, by combining the prefix Pl+1 with each member y of the possible set of combine elements, Hl+1, to create a new contrast set z. For each z, it calculates its diffset, Dz, and its frequency, Fz, then determines whether Equations 1, 2, 3, and 4 are satisfied. Each combination, z, that satisfies the criteria is added to a new combine set Cl+1 (lines 20-27). Cl+1 is sorted in increasing order of the frequency of its members. Re-ordering a combine set in increasing order of frequency has been shown to more likely produce small combine sets at the next level [12].


Algorithm 3. MINE(Pl, Cl, Ml)

1: for each x ∈ Cl do
2:   Pl+1 = {x}
3:   Hl+1 = ∅
4:   Let P′l+1 = Pl+1 − Pl
5:   for each y ∈ Cl do
6:     if y > Pl+1 then
7:       Let y′ = y − Pl
8:       if AI(y′) ≠ AI(P′l+1) then
9:         Hl+1 = Hl+1 ∪ {y}
10:      end if
11:    end if
12:  end for
13:  if |Ml| > 0 then
14:    if Z ⊇ Pl+1 ∪ Hl+1 : Z ∈ Ml then
15:      return
16:    end if
17:  end if
18:  LMl+1 = ∅
19:  Cl+1 = ∅
20:  for each y ∈ Hl+1 do
21:    z = Pl+1 ∪ {y}
22:    Determine Dz and Fz
23:    if significant(z) & large(z) & frequent(z) & specific(z) then
24:      Determine P(x|Gi) ≠ P(x|¬Gi), ∀i
25:      Cl+1 = Cl+1 ∪ {z}
26:    end if
27:  end for
28:  Sort Cl+1 by increasing Fz, ∀z ∈ Cl+1
29:  if Cl+1 = ∅ then
30:    if Z ⊉ Pl+1 : Z ∈ Ml then
31:      Ml = Ml ∪ Pl+1
32:    end if
33:  else
34:    Ml+1 = {M ∈ Ml : x ∈ M}
35:  end if
36:  if Cl+1 ≠ ∅ then
37:    MINE(Pl+1, Cl+1, Ml+1)
38:  end if
39:  Ml = Ml ∪ Ml+1
40: end for

This suggests that contrast sets with a lower frequency at one level are less likely to produce contrast sets that meet our frequency threshold on the next level. In our example, M1 = ∅, and C1 = {ED, EC, EB, EA}.

After creating the new combine set, Cl+1, if it is empty and Pl+1 is not a subset of any maximal contrast set in Ml, Pl+1 is added to Ml (lines 29-32).


Table 2. Dataset Description

Data Set Description # Transactions # Attributes # Groups

Census Census data 32561 14 2

Mushroom Mushroom characteristics 8124 22 2

Thyroid Thyroid disease data 7200 21 3

Pendigits Handwritten digits 10992 16 10

Otherwise, a new set of local maximal contrast sets, Ml+1, is created based on the notion of progressive focusing [11] [12], whereby only the contrast sets in Ml that contain all the contrast sets in Pl are added to Ml+1 (line 34). This allows the number of maximal contrast sets of interest to be narrowed down as recursive calls are made. If Cl+1 is not empty, MINE is called again with Pl+1, Cl+1, and the set of new local maximal contrast sets, Ml+1 (lines 36-38). After the recursion completes, the set of maximal contrast sets, Ml, is updated with the elements from Ml+1 (line 39). From our example, since C1 is not empty, we skip the superset check and create M1 = {}. In our example, COSINE calls MINE with E and {ED, EC, EB, EA}. This process continues until all the maximal contrast sets are identified.
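The maximality bookkeeping of lines 13-17 and line 34 can be sketched as follows, with contrast sets represented as plain frozensets of attribute-interval pairs; the tidset/diffset machinery and the statistical filters are deliberately left out, so this is only an illustration.

    def subsumed(candidate, maximal_sets):
        # lines 13-17: prune if prefix plus combine elements are contained in a known maximal set
        return any(candidate <= m for m in maximal_sets)

    def progressive_focus(maximal_sets, prefix):
        # line 34: keep only the maximal sets that contain every element of the prefix
        return [m for m in maximal_sets if prefix <= m]

    # e.g. subsumed(frozenset({'E', 'D'}), [frozenset({'E', 'D', 'C'})])  -> True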

6 Experimental Results

In this section, we present the results of an experimental evaluation of the COSINE algorithm, which was implemented in Java and run on an Intel dual core processor with 4GB of memory. Discovery tasks were performed on four datasets obtained from the UCI Machine Learning Repository [13]. The characteristics of the four datasets are shown in Table 2.

6.1 Efficiency Evaluation

We ran a series of discovery tasks on the Census, Mushroom, Thyroid, and Pendigits datasets in order to compare the efficiency of COSINE with that of STUCCO and CIGAR. We implemented STUCCO and CIGAR in the same language and ran them on the same platform as COSINE. Although they each have different objectives and thus place different constraints on the search process, STUCCO, CIGAR, and COSINE all use the support difference as a constraint; thus we can measure the time taken to complete the discovery task as the support difference varies. We ran STUCCO, CIGAR, and COSINE, using a significance level of 0.95, on the four datasets, of which the Mushroom dataset and a subset of the Census dataset were utilized in [1] and [3]. Figure 2 shows the results comparing the run time to the minimum support difference. We use a minimum frequency threshold of 0 and a maximum subset support ratio of 0 for COSINE. The results have been averaged over 10 consecutive runs. We use the same parameters for CIGAR as outlined in [3] for these datasets. We also ran COSINE without controlling for Type I errors, referred to as COSINE-1 in Figure 2.


Fig. 2. CPU time (s) versus minimum support difference (%) for COSINE, STUCCO, CIGAR, and COSINE-1 on the (a) Census, (b) Mushroom, (c) Thyroid, and (d) Pendigits datasets. [Plots omitted.]

On all four datasets, both COSINE and COSINE-1 outperformed STUCCO and CIGAR. This observation was most acute on the Mushroom dataset when the minimum support difference is 0. Above a minimum support difference of 10, there is no difference in runtime amongst STUCCO, COSINE, and COSINE-1. The run time observed for STUCCO on the Mushroom dataset is consistent with that in [1]. On both the Thyroid and Census datasets, the difference in runtime becomes negligible as the minimum support difference increases, while on the Pendigits dataset the runtime gap between STUCCO and COSINE remains substantial even at the largest support difference measured, 30. For all four datasets, CIGAR consistently has the longest runtime.

6.2 Interestingness Evaluation

In this section, we examine the effectiveness of the maximum subset ratio in terms of the interestingness of the contrast sets that are discovered. Table 3 shows the average distribution difference of the maximal contrast sets discovered for each of the four datasets as the maximum subset support ratio is varied. These results were generated with a minimum frequency threshold of 0, a significance level of 0.95, and a minimum support difference of 0.

For each of the four datasets, as the maximum subset support ratio is varied from 0 to 0.5, we can observe an increase in the average distribution difference


Table 3. Effectiveness of the Maximum Subset Support Ratio (average distribution difference)

             Maximum Subset Ratio
Data Set     0      0.01   0.05   0.1    0.5
Census       0.35   0.45   1.37   2.17   2.54
Mushroom     0.87   1.23   1.45   2.01   2.76
Thyroid      1.24   1.55   1.87   2.98   3.21
Pendigits    1.98   2.34   2.87   3.41   3.65

of the contrast sets discovered. This indicates that the contrast sets discovered have a substantially different distribution amongst the groups than that of the entire dataset, and are therefore interesting. The maximum subset support ratio thus serves as a good filter for producing interesting contrast sets.

7 Conclusion

In this paper, we introduced and demonstrated an approach for mining maximal group differences: COSINE. COSINE mined maximal contrast sets that are significant, large, frequent and specific from categorical and quantitative data, and utilized a discretization technique for continuous-valued attributes that uses their mean and standard deviation to determine the number of intervals. We compared our approach with two previous contrast set mining approaches, STUCCO and CIGAR, and found our approach to be more efficient. Finally, we showed that the maximum subset support ratio was effective in filtering interesting contrast sets. Future work will examine further search space reduction techniques.

References

1. Bay, S.D., Pazzani, M.J.: Detecting change in categorical data: Mining contrast sets. In: KDD, pp. 302–306 (1999)
2. Bay, S.D., Pazzani, M.J.: Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov. 5(3), 213–246 (2001)
3. Hilderman, R., Peckham, T.: A statistically sound alternative approach to mining contrast sets. In: AusDM, pp. 157–172 (2005)
4. Simeon, M., Hilderman, R.J.: Exploratory quantitative contrast set mining: A discretization approach. In: ICTAI (2), pp. 124–131 (2007)
5. Bayardo Jr., R.J.: Efficiently mining long patterns from databases. In: SIGMOD Conference, pp. 85–93 (1998)
6. Wong, T.T., Tseng, K.L.: Mining negative contrast sets from data with discrete attributes. Expert Syst. Appl. 29, 401–407 (2005)
7. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70 (1979)
8. Lin, J., Keogh, E.J.: Group SAX: Extending the notion of contrast sets to time series and multimedia data. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 284–296. Springer, Heidelberg (2006)


9. Lin, J., Keogh, E.J., Lonardi, S., Chiu, B.Y.: A symbolic representation of time series, with implications for streaming algorithms. In: DMKD, pp. 2–11 (2003)
10. Savasere, A., Omiecinski, E., Navathe, S.B.: An efficient algorithm for mining association rules in large databases. In: VLDB, pp. 432–444 (1995)
11. Zaki, M.J., Gouda, K.: Fast vertical mining using diffsets. In: KDD, pp. 326–335 (2003)
12. Gouda, K., Zaki, M.J.: Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov. 11(3), 223–242 (2005)

13. Asuncion, A., Newman, D.: UCI machine learning repository (2007)


Hybrid Reasoning for Ontology Classification

Weihong Song1,2, Bruce Spencer1,2, and Weichang Du1

1 Faculty of Computer Science, University of New Brunswick
2 National Research Council, Canada
{song.weihong,bspencer,wdu}@unb.ca

Abstract. Ontology classification is an essential reasoning task for ontology-based systems. Tableau and resolution are the two dominant types of reasoning procedures for ontology reasoning. Complex ontologies are often built on more expressive description logics and are usually highly cyclic. When reasoning over complex ontologies, both approaches may have difficulties in terms of reasoning results and performance, but for different ontology types. In this research, we investigate a hybrid reasoning approach, which employs well-defined strategies to decompose and modify a complex ontology into subsets of ontologies based on the capabilities of different reasoners, processes the subsets with suitable individual reasoners, and combines the individual classification results into the overall classification result. The objective of our approach is to detect more subsumption relationships than individual reasoners for complex ontologies, and to improve overall reasoning performance.

Keywords: Hybrid reasoning, Complex ontology, Classification, Tableau, Resolution.

1 Challenge for Ontology Classification

Ontology classification, which means computing the subsumption relation between all pairs of concepts, is the foundation for other ontology reasoning problems. Our task in this paper is classification, i.e., TBox [1] reasoning. We consider an ontology to be complex in one of two cases: it uses an expressive language, or it is highly cyclic. Ontologies use Description Logic (DL) to define concepts, and more complex ontologies require more expressive languages. For example, SROIQ(D) is more expressive than SHIQ. For the second situation, definitions are cyclic when concepts are defined in terms of themselves or in terms of other concepts that indirectly refer to them. When numerous concepts in the ontology are cyclic, we say the ontology is highly cyclic. These two situations are independent; an ontology may be cyclic but use a simple DL, or it may use a complex DL but be acyclic.

We often encounter ontologies that exhibit both of the complexities mentioned: an expressive DL language and a highly cyclic structure. Because of functional and performance issues, such ontologies often cannot be classified by any available reasoner individually, in terms of functionality and/or machine capacity. If they can, they require powerful computers with large memory and much time.


Table 1. Results of Performance Evaluation (Mem: memory; T: time; C/I: C: complete result, I: incomplete result; M: Mbytes; S: second; -: the reasoner failed to return a result)

Ontology      HermiT                       Pellet                       CB
              Mem(M)   T(S)       C/I      Mem(M)   T(S)      C/I       Mem(M)  T(S)    C/I
Dolce all     28.77    202.91     C        146.48   132.00    C         12.50   0.10    I
SnomedCT      4403.20  3420.02    C        8704.01  1200.00   C         767.00  62.00   C
Galen-Heart   4408.00  190800.05  C        -        -         -         21.00   0.91    C
Galen-Full    -        -          -        -        -         -         947.00  66.78   C
FMAC          3891.20  3494.00    C        -        -         -         212.00  11.40   I
FMA           -        -          -        -        -         -         580.11  32.66   I

2 Problem: Limitations of the DL Reasoning Procedures

Next, we illustrate the functional and performance issues of the two reasoning procedures. First we describe the functional problem. Tableau-based procedures are applicable to both simple and expressive DL languages, including ones as expressive as SROIQ(D). However, tableau procedures are not able to deal with ontologies whose concepts are highly cyclic. Representative reasoners using the tableau procedure are Pellet [6] and HermiT [5]. Resolution-based reasoning procedures can only cope with a smaller and less expressive subset of DL, such as SHIQ. However, they are very effective for the second kind of complexity and can handle highly cyclic ontologies. CB [3] and KAON2 [4] are examples of reasoners applying resolution. If an ontology has both kinds of complexity, it is very hard for a single reasoner to classify it. FMA is such an example; it is one of the largest and most complex medical ontologies. None of the current reasoners can fully process it because of its two types of complexity. Even its subset FMA-constitutionalPartForNS (FMAC) [2] cannot be fully classified by any current resolution reasoner because its DL language is beyond SHIQ, and none of the current tableau reasoners can handle the biomedical ontology Galen-Full, because it is highly cyclic. The functional problem is one of the dominant problems that prevent the reasoning technology from being widely used. Table 1 illustrates all the results using a powerful computer (Intel Xeon 8-core CPU, 40GB memory, 64-bit OS).

Another major problem is memory space and time performance. Resolution reasoners are often significantly faster and also use much less memory than tableau reasoners for the same task. Tableau-based reasoning builds large structures which greatly consume memory space. For the large ontology SNOMED-CT, as well as the medium-size but complex ontologies FMAC and Galen-Heart, the tableau reasoners need huge amounts of memory, from about 3891MB to 8704MB; this amount of memory is not available on a common PC. As for time efficiency, CB took about 1/20 and 1/50 of the time of Pellet and HermiT, respectively, when processing SNOMED-CT. On the Galen-Heart ontology, CB finished in under a second, compared to more than 190,000 seconds for HermiT.


3 Proposed Solution: Hybrid Reasoning

We propose a hybrid reasoning approach to classify complex ontologies. Each kind of reasoning procedure, and its corresponding reasoners, has its advantages and limitations. We propose to assemble the two kinds of reasoning technology together to accomplish complex reasoning tasks that cannot be done with acceptable performance by any individual reasoner.

An ontology is composed of many concepts; the definition of a concept consists of many axioms. The basic idea of our approach is to separate the ontology based on the reasoning abilities of different reasoners, i.e., to separate the two sorts of complexity into different pieces of the ontology and assign them to the more capable or efficient reasoner. In more detail: suppose we have a complex ontology T, a resolution-based reasoner Rr and a tableau-based reasoner Rt. We construct Tr, which is derived from T by removing axioms that contain elements from languages that are more expressive than, and beyond, Rr's capability, with the result that Tr is completely classified by Rr. But this result may be just a partial classification result for T. After that, we use Rt to do a second round of classification on certain selected concepts C′ whose classification results may be affected by removing the axioms in T − Tr. The purpose of this is to reduce the number of concepts that the tableau reasoner must work on to compensate for Rr's incomplete result, and thus to obtain sound and complete results with better overall performance. It is possible that C′ is still highly cyclic. Then similar ideas of simplifying the ontology, as used for Tr, can be employed on Tt, which is a subset of C′. This time we remove a different source of complexity: we remove some axioms in C′ that lead to many definition cycles in order to construct the simplified ontology fragment Tt; we might also use the previous reasoning results and inject some subsumption results into Tt. Then we let Rt do reasoning on Tt and obtain more classification results. Further iteration rounds may be required; each time we will use the previous classification results.
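The control flow just described can be sketched as follows; this is an illustration under our own assumptions, not the authors' code. The reasoner wrappers (classify_res, classify_tab) and the decomposition helpers (drop_unsupported, affected, break_cycles) are hypothetical parameters that stand for the strategies the proposal still has to design.

def hybrid_classify(T, classify_res, classify_tab, drop_unsupported, affected, break_cycles, max_rounds=3):
    # T: set of axioms; classify_res/classify_tab return sets of subsumption pairs.
    subs = set()
    T_r = drop_unsupported(T)          # remove axioms beyond the resolution fragment (e.g. beyond SHIQ)
    subs |= classify_res(T_r)          # complete for T_r, sound but possibly partial for T
    for _ in range(max_rounds):
        C = affected(T, T_r, subs)     # concepts whose subsumptions may be missing
        if not C:
            break
        T_t = break_cycles(T, C, subs) # simplified cyclic fragment, sound results injected
        new = classify_tab(T_t)        # tableau round; still sound by Proposition 1 below
        if new <= subs:
            break                      # nothing new was derived: stop iterating
        subs |= new
    return subs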

The approach is based on the following observation.

Proposition 1. Let T′ be a terminology consisting of a subset of the axioms in a consistent terminology T. For any p, if T′ entails p, then T entails p.

Based on this proposition, all the subsumption relationships we get from reasoning on Tr hold on the entire terminology T; this follows from the monotonicity of DL entailment, since every model of T is also a model of T′ ⊆ T. And if we further inject some of these sound results into the ontology, the future results will also be sound. However, there may be other subsumption relationships that we neglected because of removing axioms. In other words, the subsumptions we get from Rr or Rt are sound but not complete, and we need more rounds of classification using Rt and Rr to make the results complete.

The benefits of this hybrid reasoning approach lie in two aspects:

1. Functional aspect: Compared with resolution-based reasoners, our approach is able to do reasoning on a more expressive DL. Compared with tableau-based reasoners, the proposed approach is able to do classification correctly on some highly cyclic concepts.


2. Performance aspect: Our approach outperforms tableau-based reasoners since we allocate some classification tasks to the more efficient resolution-based reasoners.

4 Challenges and Potential Impacts of This Proposal

Our challenge is to identify and solve the new problems that arise when applying our approach, so that the hybrid reasoner can be effective.

1. How to reduce the size of C′. When we remove an axiom from the definitions of a concept A, the basic way to identify the concepts C′ affected by this removal is to find all the concepts which depend on A, together with their transitive dependents (a sketch of this basic dependency computation is given after this list). Depending on the concrete axiom removed and on the kind of dependency relationship with A, the challenge is to find concepts that will in fact not be affected by the removal and to eliminate them from C′.

2. How to break cycles in Tt. We need to define various properties of the cyclic situation, including the number of concepts contributing to a cycle and the number of cycles a concept contributes to. We also need a strategy to identify these properties and to study their effects on tableau reasoners. The challenge lies in how to choose the axioms to be removed and how to inject previous reasoning results to obtain Tt, so that removing axioms has minimal negative effect on the completeness of the classification result, while the cycles are reduced enough that Rt can classify Tt and supply additional classification results.

3. How to ensure completeness of the classification results through the combination of different strategies, while improving the efficiency of the reasoning in most situations.
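As an illustration of the dependency analysis in challenge 1, the following sketch computes the concepts that transitively depend on a concept A from a concept-dependency graph; this coarse candidate set is exactly what the challenge asks to shrink. The dictionary-based graph representation is an assumption for illustration and is not taken from the paper.

from collections import deque

def transitive_dependents(depends_on, a):
    # depends_on maps each concept to the concepts that refer to it in their
    # definitions (its direct dependents). Returns the coarse candidate set C'.
    affected, queue = set(), deque([a])
    while queue:
        c = queue.popleft()
        for d in depends_on.get(c, ()):
            if d not in affected:
                affected.add(d)
                queue.append(d)
    return affected

# Example: B and C are defined in terms of A, and D in terms of C.
deps = {"A": ["B", "C"], "C": ["D"]}
print(transitive_dependents(deps, "A"))   # {'B', 'C', 'D'}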

Current reasoners are prevented from classifying ontologies in many cases. With hybrid reasoning, an ontology's classification is not limited to one particular reasoner; instead the task is given to a combination of different reasoners. This strategy is adaptable to different language features, and in many cases the overall performance is enhanced compared to a single reasoner.

References

1. Baader, F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2010)
2. Glimm, B., Horrocks, I., Motik, B.: Optimized Description Logic Reasoning via Core Blocking. Automated Reasoning, 457–471 (2010)
3. Kazakov, Y.: Consequence-Driven Reasoning for Horn SHIQ Ontologies. In: Proc. of IJCAI 2009, pp. 2040–2045 (2009)


4. Motik, B., Studer, R.: KAON2 - A Scalable Reasoning Tool for the Semantic Web. In: Proceedings of the 2nd European Semantic Web Conference (ESWC 2005), Heraklion, Greece (2005)
5. Shearer, R., Motik, B., Horrocks, I.: HermiT: A Highly-Efficient OWL Reasoner. In: Proceedings of the 5th International Workshop on OWL: Experiences and Directions (OWLED 2008), pp. 26–27 (2008)
6. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A Practical OWL-DL Reasoner. Web Semantics: Science, Services and Agents on the World Wide Web 5(2), 51–53 (2007)


Subspace Mapping of Noisy Text Documents

Axel J. Soto1, Marc Strickert2, Gustavo E. Vazquez3, and Evangelos Milios1

1 Faculty of Computer Science, Dalhousie University, [email protected]

2 Institute for Vision and Graphics, Siegen University, Germany
3 Dept. of Computer Science, Univ. Nacional del Sur, Argentina

Abstract. Subspace mapping methods aim at projecting high-dimensional data into a subspace where a specific objective function is optimized. Such dimension reduction allows the removal of collinear and irrelevant variables for creating informative visualizations and task-related data spaces. These specific and generally de-noised subspaces enable machine learning methods to work more efficiently. We present a new and general subspace mapping method, Correlative Matrix Mapping (CMM), and evaluate its abilities for category-driven text organization by assessing neighborhood preservation, class coherence, and classification. This approach is evaluated for the challenging task of processing short and noisy documents.

Keywords: Subspace Mapping, Compressed Document Representation.

1 Introduction

Many data-oriented areas of science drive the need for faithfully representing data containing thousands of variables. Therefore, methods for considerably reducing the number of variables are desired, focusing on subsets that are minimally redundant and maximally task-relevant. Different approaches for subspace mapping, manifold learning, and dimensionality reduction (DR) were proposed earlier [1,2]. A current challenge in information representation is the huge amount of text documents being produced at increasing rates. Using the well-known vector space representation, or "bag of words" model, a corpus of documents is described by the set of words that each document contains. This approach yields a document-term matrix containing thousands of unique terms, and thus is very likely to be sparse.

The text mining communities have developed methods for automatic clustering and classification of document topics using specific metrics and kernels. Yet fully developed human-in-the-loop approaches that enable the user to perform visual data exploration and visual data mining are rare. While the automatic learning of data is crucial, visualization is another key aspect for providing an intuitive interface to the contained information and for interactive tuning of the


data/text mining algorithms. This makes DR methods indispensable for interactive text corpus exploration.

In this paper, we present an application of a recent DR method, Correlative Matrix Mapping (CMM), which has been successfully applied in other domains [3]¹ in the context of regression problems. This method is based on an adaptive matrix metric aiming at a maximum correlation of all pairwise distances in the generated subspace and the associated target distances. Preliminary work [4]¹ showed some capabilities of this approach for the expert-guided visualization of labeled text corpora by integrating user feedback on the basis of the interpretable low-dimensional mapped document space. Here, we provide a comprehensive comparison of CMM and other competitive DR methods for creating representative low-dimensional subspaces. Since machine learning methods rely on distance calculations, we investigate how such projections with label-driven distance metrics can improve representations of short and noisy text documents. We refer to noisy documents as the ones that are not properly written in terms of spelling and grammatical structure. Such documents are quite common in business environments such as aircraft maintenance records, online help desk or customer survey applications, and their analysis is thus highly relevant. Still, much work in the information extraction literature is focused on well-formed text documents.

2 Correlative Matrix Mapping (CMM)

Given n m-dimensional data vectors x_j ∈ X ⊂ R^m, 1 ≤ j ≤ n, such that each x_j is associated to a q-dimensional vector l_j ∈ L ⊂ R^q. For text corpora, n is the number of labeled documents in the corpus, m is the number of terms in the corpus, and l_j is the vector representation of the label of the document x_j. CMM aims at finding a subspace of X where the pairwise distances D^λ_X are in maximum correlation with those on the label space (D_L). Thus, pairwise distances in the document-term space are sought to be in maximum correlation with those distances on the label space. Here, D_L is used as the Euclidean distance on the label space, and the λ superscript in D^λ_X indicates the parameters of the adaptive distance

(D^λ_X)_{i,j} = ((x_i − x_j)^T · λ · λ^T · (x_i − x_j))^{1/2},

where λ is an m×u matrix and u is specified by the user. This distance metric resembles a Mahalanobis distance, where Λ = λ · λ^T has rank u. We obtain the parameter matrix as

λ = argmax_{λ*} r(D_L, D^{λ*}_X)    (1)

where r is the Pearson correlation. Locally optimal solutions for (1) can be obtained by gradient methods using its derivative with respect to λ [3]. It is worth noting that while the number of rows of the λ matrix is constrained by the number of terms in X, i.e. the document vector dimensionality, the number of columns u, i.e. the dimensionality of the subspace, is defined by the user.

1 CMM was called differently in previous works but created naming conflicts therein.


Note that λ^T · X defines a u-dimensional subspace that is an informative representation of the input space focused on its label association. If visualization is the ultimate goal, a choice of u ≤ 3 is recommended. New documents with unknown labels can also be projected into the new space by using the optimized λ matrix. An open source package with the implementation of CMM is available at [5].
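The quantities involved can be illustrated with a small numpy sketch: the adaptive pairwise distances D^λ_X, the Euclidean label distances D_L, and the Pearson correlation r(D_L, D^λ_X) that a gradient method would maximize over λ. This is only an illustration of the formulas above, written under our own assumptions; it is not the released CMM implementation, and the toy data are made up.

import numpy as np

def adaptive_distances(X, lam):
    # (D^lambda_X)_{i,j} = ((x_i - x_j)^T lam lam^T (x_i - x_j))^(1/2):
    # project with the m x u matrix lambda, then take Euclidean distances.
    Z = X @ lam
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def euclidean_distances(L):
    diff = L[:, None, :] - L[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cmm_objective(X, L, lam):
    # Pearson correlation r(D_L, D^lambda_X) over the off-diagonal pairs (to be maximized).
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)
    dx = adaptive_distances(X, lam)[iu]
    dl = euclidean_distances(L)[iu]
    return np.corrcoef(dl, dx)[0, 1]

# Toy usage: 5 documents, 20 terms, 4 equidistant one-hot labels, u = 2.
rng = np.random.default_rng(0)
X = (rng.random((5, 20)) > 0.7).astype(float)
L = np.eye(4)[rng.integers(0, 4, size=5)]
lam = rng.standard_normal((20, 2))
print(cmm_objective(X, L, lam))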

3 Experiments

We selected four alternative DR methods that make use of label information and allow exact out-of-sample extension, and we used them to compare to CMM. Linear Discriminant Analysis (LDA) aims at finding optimal discriminant directions by maximizing the ratio of the between-class variance to the within-class variance [7]. Since its solutions require inverses of covariance matrices, it usually has ill-conditioning problems for high-dimensional data. Therefore, we also calculate a simplification of this approach based on the diagonal matrices of the covariance matrices, referred to as LDAd. Canonical Correlation Analysis (CCA) is a well-known technique for finding correlations between two sets of multidimensional variables by projecting them onto two lower-dimensional spaces in which they are maximally correlated [8]. Although this method is strongly related to CMM in the sense that both look for optimal correlations, CMM does not adapt data and label spaces, but adapts distances in the data space. Neighborhood Component Analysis (NCA) aims at learning a linear transformation of the input space such that the k-Nearest Neighbors method performs well in the transformed space [9]. The method uses a probability function to estimate the probability p_{i,j} that a data point i selects a data point j as its neighbor after data mapping. The method maximizes the expected number of data points correctly classified under the current transformation matrix. Maximally Collapsing Metric Learning (MCML) aims at learning a linear mapping where all points in the same class are mapped into a single location, while all points in other classes are mapped to other locations, i.e. as far as possible from data points of different classes [10]. This algorithm uses a probabilistic selection rule as in NCA. However, unlike NCA the optimization problem is convex, and thus the MCML transformation can be completely specified from the objective function. The Matlab Toolbox for Dimensionality Reduction [2] was used for all methods except for CCA, which was taken from the Statistics Toolbox [6].
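For readers who want to set up comparable baselines today, rough analogues of LDA, CCA and NCA exist in scikit-learn; note that the paper itself used the Matlab toolboxes cited above, that MCML has no scikit-learn counterpart, and that the data below are random placeholders rather than the ASRS corpus.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import NeighborhoodComponentsAnalysis

rng = np.random.default_rng(0)
X = (rng.random((200, 300)) > 0.95).astype(float)   # toy binary document-term matrix
y = rng.integers(0, 4, size=200)                    # four topic labels
Y = np.eye(4)[y]                                    # one-hot label vectors for CCA

X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
X_cca = CCA(n_components=2).fit(X, Y).transform(X)
X_nca = NeighborhoodComponentsAnalysis(n_components=2, random_state=0).fit_transform(X, y)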

3.1 Data

We used the publicly available Aviation Safety Reporting System (ASRS) Database Report Set [11] and extracted the narrative fields of documents belonging to 4 out of 24 topics: Bird or animal strike records, Emergency medical service incidents, Fuel management issues and Inflight weather encounters. Each topic has 50 documents, thus providing a total of 200 documents. 6048 rare terms were discarded, yielding 1829 unique terms. Two major challenges are faced. First, the


average length of each document is only a few sentences, which makes it difficult to extract statistically significant terms. Second, texts are riddled with acronyms, ad hoc abbreviations and misspellings.

Binary representations are used for the document-term matrix, i.e. the component for the kth term of the jth document x_j is 0 if the term is not present and 1 otherwise. This binary weighting approach is appropriate given the short length of the documents, for which the frequency of a term might inflate its importance. In the case of CCA and CMM, the four label vectors (0,0,0,1), (0,0,1,0), (0,1,0,0), and (1,0,0,0) are used for class representation, thus inducing equidistant classes. In LDA, NCA and MCML, integer values are used for class assignment, because they do not quantify label dissimilarities.
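A minimal sketch of this representation follows, using scikit-learn's CountVectorizer as a stand-in for the authors' preprocessing; the toy narratives and the class names are invented, and the paper's term filtering is not reproduced.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["bird strike on approach rwy 27",           # invented toy narratives
        "fuel qty misread, diverted to alternate",
        "pax medical emergency, req paramedics"]
topics = ["strike", "fuel", "medical"]

# Binary bag of words: entry (j, k) is 1 iff term k occurs in document j.
X = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Equidistant class representation for CCA/CMM: one unit vector per class.
label_vectors = dict(zip(sorted(set(topics)), np.eye(len(set(topics)))))
L = np.array([label_vectors[t] for t in topics])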

For each experiment, 80% of the corpus was used for training, while the remaining documents were held out for testing. This process was restarted 10 times, so that a new testing set was obtained in each iteration, implementing a repeated random sub-sampling validation scheme. All the applied DR algorithms showed convergence during the optimization phase, with the exception of MCML which, despite its convex cost function, required a time-limiting stopping criterion because of its excessive run time. Since NCA and CMM use iterative methods for optimization, different early stopping criteria were sought using a portion of the training set. Otherwise, these methods are likely to overfit the training data. Since meaningful visualization is desirable for many tasks, all our experiments were deliberately constrained to 2- and 3-dimensional subspaces.

3.2 Assessing Subspace Mapping Performance

We divide the assessments applied to the methods into three types. The first aims at evaluating the embedding without using label information. Two performance metrics are used: the area under the extrusion/intrusion tendency curve (B) and neighborhood ranking preservation (Q) [12]. B quantifies the tendency to commit systematic neighborhood rank order errors for data pairs in the projection space (B is not bounded; the closer to zero, the better), while Q measures k-ary neighborhood preservation (Q varies between 0 and 1; the closer to one, the better). The second quality class considers label information, namely cohesion, which is the ratio of the pairwise Euclidean distances of documents belonging to a same class to the pairwise distances of documents of different classes. The third class of assessments also uses label information. It evaluates the potential of supervised learning methods to exploit the given low-dimensional space for classification. It may be argued that the better the classification accuracy is, the better the projection is. Classifying from a low-dimensional space may produce better results due to the removal of collinear or irrelevant variables. Thereby, we used k-nearest neighbors (kNN), Decision Trees (DT), and Support Vector Machines (SVM) using a Radial Basis Function kernel (rbf) and a multi-layer perceptron kernel (mlp).
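As an example of the second assessment type, a sketch of the cohesion score is given below, computed here as the ratio of the mean intra-class to the mean inter-class pairwise Euclidean distance in the projected space; the exact averaging used by the authors is an assumption on our part.

import numpy as np

def cohesion(Z, y):
    # Ratio of same-class to different-class pairwise distances (lower = tighter classes).
    n = Z.shape[0]
    d = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    same = np.equal.outer(y, y)
    iu = np.triu_indices(n, k=1)          # each unordered pair counted once
    intra = d[iu][same[iu]]
    inter = d[iu][~same[iu]]
    return intra.mean() / inter.mean()

# Toy usage on a random 2D projection of six documents in three classes.
print(cohesion(np.random.rand(6, 2), np.array([0, 0, 1, 1, 2, 2])))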


4 Results

We will focus on the results obtained on the testing set, while the training set results are still available for the reader. Table 1 shows the average of the computed metrics of the different DR methods when the data are projected into a 2D space. It can be observed that most methods have an intrusive embedding, i.e. a tendency to positive rank errors in the subspace. Not surprisingly, NCA has the highest average preservation of the k-ary neighborhoods, since this is what its mapping is trying to capture. Yet the difference with CMM is not statistically significant when a Dunnett test [13] is performed with a 1% familywise probability of error.

Although CMM does not have the lowest cohesion value, due to the variance of this metric no significant difference can be established here. Looking at the performance of the classification methods, CMM significantly outperforms all the other methods with the exception of NCA with the kNN method, for which no significant difference with CMM was found.

Results for the 3D projection show very similar behavior to that shown for the projections into the 2D space (Table 2). We can see that LDAd has good classification accuracy when DT is used, although no significant differences with CMM and NCA were found. We can also see that CMM improved on most of the metrics.

In summary, LDA and CCA had poor performance on most metrics. These methods compute their optimal value in closed form (using eigenvectors or inverses of covariance matrices), and thus the computation might get corrupted due to the large number of variables and relatively small number of documents. LDAd has better performance than LDA. However, most of the components of the parameter matrix in LDAd are zero. This yields a cluttered projection of the data points onto a few locations, which is not convenient in most cases.

The remarkably poor performance of MCML might be due to an underfitting situation. It is worth noting that MCML is the most compute-intensive method by far, and its calculations last more than 50 times the amount of time spent on any other algorithm. Moreover, delaying its stopping criterion does not seem to dramatically improve its performance. Finally, it is important to note that

Table 1. Comparison of DR methods using 2D spaces: rank-based quality measures (Q/B), cohesion, and classification accuracies of four classifiers.

           LDA            LDAd           NCA            MCML           CCA            CMM
           Train  Test    Train  Test    Train  Test    Train  Test    Train  Test    Train  Test
Q          0.534  0.563   0.515  0.503   0.599  0.613   0.533  0.548   0.521  0.539   0.544  0.596
B          0.015  0.049   0.007  0.014   0.039  0.046  -0.058 -0.040   0.009  0.025   0.042  0.033
Cohesion   0.113  0.324   0.118  0.134   0.158  0.197   0.259  0.305   0.000  0.312   0.058  0.210
kNN        0.842  0.250   0.626  0.590   0.950  0.703   0.731  0.373   0.988  0.358   0.986  0.685
SVMrbf     0.238  0.258   0.238  0.258   0.714  0.363   0.474  0.375   0.738  0.310   0.984  0.638
SVMmlp     0.237  0.230   0.249  0.230   0.349  0.358   0.348  0.300   0.738  0.378   0.824  0.605
DT         0.884  0.268   0.652  0.605   0.955  0.608   0.785  0.360   0.988  0.335   0.991  0.683


Table 2. Comparison of DR methods using 3D spaces: rank-based quality measures (Q/B), cohesion and classification accuracies of four classifiers.

           LDA            LDAd           NCA            MCML           CCA            CMM
           Train  Test    Train  Test    Train  Test    Train  Test    Train  Test    Train  Test
Q          0.537  0.573   0.536  0.527   0.608  0.618   0.540  0.559   0.517  0.548   0.560  0.617
B          0.017  0.046   0.014  0.019   0.049  0.043  -0.071 -0.043  -0.004  0.026   0.079  0.040
Cohesion   0.085  0.324   0.149  0.173   0.157  0.202   0.253  0.298   0.006  0.314   0.081  0.213
kNN        0.872  0.268   0.716  0.673   0.976  0.750   0.773  0.450   0.991  0.415   0.990  0.743
SVMrbf     0.238  0.258   0.233  0.263   0.622  0.310   0.755  0.405   0.811  0.265   0.974  0.650
SVMmlp     0.237  0.230   0.246  0.248   0.363  0.328   0.411  0.305   0.833  0.325   0.971  0.658
DT         0.928  0.263   0.761  0.723   0.972  0.653   0.828  0.415   0.991  0.343   0.990  0.695

CMM, on either the 2D or the 3D projection, is the most stable method, since it obtains the first or second best values for all the metrics. More specifically, CMM is the only method that has consistent classification accuracy when SVM is used. Additional results that were not included here due to space constraints can be looked up in [14].

5 Conclusions

Subspace mapping allows visualization of high-dimensional spaces in an informative plotting space, suitable for visual data mining methods. Additionally, projections into low-dimensional spaces allow a reduction of the storage of data points and lead to improved prediction capacity of a subsequently applied supervised method. We emphasize the advantages of applying linear subspace transformations, since they provide a simple interpretation of the new space. Moreover, they guarantee exact out-of-sample extensions. Methods that make use of the calculation of eigenvectors may not be the best option when the input data dimensionality is considerably high.

This paper described the applicability of different DR methods to short and noisy text documents. This is the first work where CMM is compared against other well-established DR methods. From the results shown in Section 4 we can state that our proposed method CMM represents a competitive subspace mapping method, with the advantage of more stable behavior than the other methods tested in this work. NCA was its closest competitor, especially for Q and kNN.

As future work, we plan to extend this development by considering a semi-supervised scenario. In this case the system can automatically classify documents and, at the same time, the user can provide feedback about reclassifying a document or indicating the irrelevance of a term. Moreover, the system should adapt its behavior based on the user feedback and correct future actions.

We thank NSERC, PGI-UNS (24/ZN16), the DFG Graduate School 1564, and MINCyT-BMBF (AL0811 - ARG 08/016) for their financial support.


References

1. Zhang, J., Huang, H., Wang, J.: Manifold Learning for Visualizing and Analyzing High-Dimensional Data. IEEE Intell. Syst. 25, 54–61 (2010)
2. van der Maaten, L., Postma, E., van den Herik, J.: Dimensionality Reduction: A Comparative Review. Tilburg University, TiCC TR 2009-005 (2009)
3. Strickert, M., Soto, A.J., Vazquez, G.E.: Adaptive Matrix Distances Aiming at Optimum Regression Subspaces. In: European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning - ESANN 2010, pp. 93–98 (2010)
4. Soto, A.J., Strickert, M., Vazquez, G.E., Milios, E.: Adaptive Visualization of Text Documents Incorporating Domain Knowledge. In: Challenges of Data Visualization, NIPS 2010 Workshop (2010)
5. Machine Learning Open Source Software, http://mloss.org
6. Matlab Statistics Toolbox, http://www.mathworks.com/products/statistics/
7. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience, Hoboken (2004)
8. Hardoon, D.R., Szedmak, S.R., Shawe-Taylor, J.R.: Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Comput. 16, 2639–2664 (2004)
9. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighborhood Components Analysis. Adv. Neural Inf. Process. Syst. 17, 513–520 (2005)
10. Globerson, A., Roweis, S.: Metric Learning by Collapsing Classes. Adv. Neural Inf. Process. Syst. 18, 451–458 (2006)
11. Aviation Safety Reporting System, http://asrs.arc.nasa.gov/
12. Lee, J.A., Verleysen, M.: Quality Assessment of Dimensionality Reduction: Rank-Based Criteria. Neurocomputing 72, 1431–1443 (2009)
13. Dunnett, C.W.: A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. J. Am. Stat. Assoc. 50, 1096–1121 (1955)
14. Soto, A.J., Strickert, M., Vazquez, G.E., Milios, E.: Technical Report, Dalhousie University (in preparation), http://www.cs.dal.ca/research/techreports


Extending AdaBoost to Iteratively Vary Its Base Classifiers

Erico N. de Souza and Stan Matwin*,**

School of Information Technology and Engineering
University of Ottawa

Ottawa, ON, K1N 6N5, [email protected]

[email protected]

Abstract. This paper introduces AdaBoost Dynamic, an extension of the AdaBoost.M1 algorithm by Freund and Schapire. In this extension we use different "weak" classifiers in subsequent iterations of the algorithm, instead of AdaBoost's fixed base classifier. The algorithm is tested with various datasets from the UCI database, and results show that the algorithm performs as well as AdaBoost with the best possible base learner for a given dataset. This result therefore relieves a machine learning analyst from having to decide which base classifier to use.

1 Introduction

In [2, 3], Freund and Schapire introduced AdaBoost, a classifier induction method that converts a "weak" PAC learner with performance only slightly better than a random classifier into a stronger, high accuracy algorithm. The final model is the weighted sum of each "weak" classifier applied to the dataset.

Freund and Schapire use only one "weak" PAC learner in the AdaBoost algorithm, and in this paper we extend this definition to allow different "weak" learners to be used across iterations. A similar approach is presented by Rodríguez et al. [6], which discusses a supervised classification method for time series.

AdaBoost tries to improve the quality of the learner iteratively, considering only part of the distribution, i.e. in each iteration it calculates a new weight for the current distribution and applies the same weak learner. This is a good approach, but we can try to improve the solution by applying one weak learner in a certain iteration and a different one in the next, because, depending on the type of the data, it is possible that a certain distribution is better fitted with a different base learner.

[2–4] present solutions considering only one weak classifier for all iterations, and this motivated our research to verify whether different classifiers executed in each iteration give an improvement. This paper presents a new algorithm, called AdaBoost Dynamic, that uses different standard weak learners - like decision trees, neural networks, Bayesian networks, etc. - applied to different datasets from UCI [1]. The idea is to relieve the machine learning analyst from the choice among different possible base learners, letting the system iteratively and automatically define the best model for the data.

* The author is also affiliated with the Institute of Computer Science, Polish Academy of Sciences, Poland.
** The authors acknowledge the support of NSERC and MITACS for this research.

This paper is organized as follows: Section 2 discusses the AdaBoost.M1 algorithm as well as the proposed modifications. Section 3 presents the experimental results comparing AdaBoost Dynamic with AdaBoost.M1 using various "weak" classifiers. Finally, some conclusions are presented in Section 4.

2 Algorithm Modification

The original AdaBoost.M1, proposed in [3], takes a training set of m examples S = 〈(x1, y1), ..., (xm, ym)〉, where xi is an instance drawn from some space X and represented in some manner (typically, a vector of attribute values), and yi ∈ Y is the class label of xi. The second parameter is the WeakLearner algorithm. This algorithm will be called in a series of rounds that update a distribution Dt. The WeakLearner algorithm is generic and must be chosen by the user, respecting the requirement that it must correctly classify at least 1/2 of the data set.

This original algorithm considers that only one WeakLearner is boosted, and the final output hypothesis is given by Hfinal(x) = argmax_{y∈Y} Σ_{t: ht(x)=y} log(1/βt) [3]. In order to use different algorithms, an array of WeakLearners was added as input to the original algorithm, and in each iteration another WeakLearner from the array is executed. The number of WeakLearners in the input array may be the same as the number of iterations, but this is not mandatory. The restriction remains that each WeakLearner in the array must correctly classify at least 1/2 of the data set.

AdaBoost Dynamic is presented in Table 1. This is the AdaBoost.M1 algorithm with the proposed modifications. The algorithm resembles the original one, except that one of the inputs is a list of WeakLearners and line 3 of the algorithm calls a different WeakLearner in each iteration. In this case, the final output will be Hfinal(x) = argmax_{y∈Y} Σ_{t: ft(hj(x))=y} log(1/βt). This means that the new output hypothesis is calculated considering a function ft(hj(x)) in each iteration, where hj(x) ≠ hj+1(x). The function ft is defined just to vary the weak learner in each iteration.

3 Results

The algorithm was implemented in Weka [5]. This first implementation used 10 different weak learners, executed in the following order:

– Neural Network;
– Naive Bayes implementation;
– Decision Stumps;
– Bayes Networks;
– Random Tree;
– Random Forest;
– SVM (Support Vector Machine);
– Bagging;
– ZeroR rules;
– Naive Bayes Tree.

Table 1. AdaBoost Dynamic algorithm with the proposed modification

Input: sequence of m examples 〈(x1, y1), ..., (xm, ym)〉 with labels yi ∈ Y = {1, ..., k};
       list W of the WeakLearner algorithms;
       integer T specifying the number of iterations
1. Initialize D1(i) = 1/m for all i
2. Do for t = 1...T
3.   Call W[j], providing it with the distribution Dt
4.   Get back hypothesis ht : X → Y
5.   Calculate the error of ht: εt = Σ_{i: ht(xi) ≠ yi} Dt(i)
6.   If εt > 1/2, then abort loop
7.   Set βt = εt / (1 − εt)
8.   Update distribution Dt: Dt+1(i) = (Dt(i)/Zt) × βt if ht(xi) = yi, and (Dt(i)/Zt) × 1 otherwise, where Zt is a normalization constant
9.   If Length(W) = j then j = 1, else j = j + 1
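A compact Python sketch of this procedure is given below, cycling through a user-supplied list of scikit-learn base estimators. The paper's implementation is in Weka; the estimators, names and details here are illustrative only and are not the authors' code.

import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def adaboost_dynamic(X, y, weak_learners, T=100):
    # AdaBoost.M1-style loop that rotates over a list of weak learners.
    m = len(y)
    D = np.full(m, 1.0 / m)                            # step 1: uniform weights
    models, betas = [], []
    for t in range(T):
        h = clone(weak_learners[t % len(weak_learners)])   # steps 3/9: next learner in the list
        h.fit(X, y, sample_weight=D)
        miss = h.predict(X) != y
        eps = D[miss].sum()                            # step 5: weighted training error
        if eps == 0:                                   # perfect weak learner: keep it and stop
            models.append(h); betas.append(1e-10)
            break
        if eps > 0.5:                                  # step 6: abort as in AdaBoost.M1
            break
        beta = eps / (1.0 - eps)                       # step 7
        D = D * np.where(miss, 1.0, beta)              # step 8: down-weight correctly classified examples
        D = D / D.sum()
        models.append(h)
        betas.append(beta)
    return models, betas

def predict(models, betas, X):
    # Weighted vote: argmax_y sum over {t: h_t(x)=y} of log(1/beta_t).
    classes = np.unique(np.concatenate([m.classes_ for m in models]))
    votes = np.zeros((X.shape[0], len(classes)))
    for h, b in zip(models, betas):
        pred = h.predict(X)
        w = np.log(1.0 / max(b, 1e-10))
        for c_idx, c in enumerate(classes):
            votes[pred == c, c_idx] += w
    return classes[votes.argmax(axis=1)]

# Example list of weak learners (illustrative): [GaussianNB(), DecisionTreeClassifier(max_depth=1)]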

All weak learner algorithms in the list were used with only their default configuration from Weka, which makes all experiments repeatable with other datasets. The datasets used in the experiment were collected from the UCI database [1]. The presented results were obtained with 10-fold cross-validation and a pairwise t-test with a confidence level of 95%. For all examples, the total number of iterations considered was 100.

18 datasets were chosen from the UCI database for this experiment. SVM was not able to classify some of these datasets because of missing data for some attributes. When an algorithm was not able to execute on a dataset, this is indicated with NA (Not Available). Tables 2 - 4 present the comparison of AdaBoost Dynamic with AdaBoost.M1 using the same weak learners that were used in AdaBoost Dynamic. These tables use the following notation: if an algorithm is statistically better than AdaBoost Dynamic, the symbol ◦ appears at its side; in the opposite situation the symbol • appears. Lack of a symbol indicates that the difference in performance is not significant. Table 2 shows the results of the comparison between AdaBoost Dynamic and AdaBoost.M1 with Random Forest, Bayes Network, Naive Bayes and Decision Stump. Results show that Decision Stump had weak performance in nine of the simulations, and only in two simulations was its performance statistically better than AdaBoost Dynamic; in all other experiments the performance was equal. Considering the other algorithms, AdaBoost with Random Forest and AdaBoost with Naive Bayes had worse performance four times and better performance two times and once, respectively. AdaBoost with Bayesian Network had worse performance five times and better performance in two experiments.

Table 3 shows the results of comparing AdaBoost Dynamic and AdaBoost.M1 with Random Tree, ZeroR, Naive Bayes Trees (NBTrees) and SVM as base classifiers. In this case, the worst "weak" classifier was ZeroR, which was statistically inferior to AdaBoost Dynamic 11 times - for more than 50% of the datasets. Only once was this classifier better than AdaBoost Dynamic. Another "weak" classifier with a low accuracy level was Random Tree. In this case the classifier was statistically inferior to AdaBoost Dynamic eight times, and all other datasets yielded equal performance. NBTrees was statistically superior to AdaBoost Dynamic four times and inferior only two times. SVM was inferior four times and had no better result than AdaBoost Dynamic.

Table 4 shows the results comparing AdaBoost Dynamic and AdaBoost.M1 with Bagging and with Neural Network. Bagging had inferior results six times and offered a better result only once. AdaBoost.M1 with Neural Network had only one superior result, and in the other experiments performance equal to AdaBoost Dynamic.

Using AdaBoost.M1, the analyst has to search for the best "weak" learner to be applied to the dataset, which makes the work hard. AdaBoost Dynamic offers a solution to this problem, because it uses various algorithms, trying to improve the final hypothesis with each one. In this way, the approach leaves the search for the model to the machine instead of the machine learning analyst.

Table 2. Results of the T-Test comparing the implemented solution (AdaBoost Dynamic (1)) with AdaBoost.M1 with Random Forest (2), Bayes Network (3), Naive Bayes (4) and Decision Stump (5)

Dataset            (1)            (2)              (3)              (4)              (5)
contact-lenses     74.17±28.17    75.67±28.57      74.33±26.74      76.17±25.43      72.17±27.12
iris               95.13±4.63     94.73±5.04       93.73±5.98       95.07±5.73       94.60±5.33
segment            95.81±2.06     96.88±1.96       93.84±1.93 •     83.84±3.85 •     29.43±1.08 •
soybean            93.45±2.82     92.16±2.91       93.35±2.65       92.05±3.05       27.96±2.13 •
weather            68.00±43.53    65.00±41.13      63.00±40.59      59.00±42.86      67.00±40.34
weather.symbolic   61.00±41.79    72.00±38.48      71.00±37.05      67.00±39.07      67.50±39.81
au1-balanced       76.95±3.86     80.71±3.19 ◦     73.15±4.23       76.16±3.91       77.90±3.44
au1                72.25±3.79     73.96±3.61       74.02±0.86       72.60±1.99       72.95±2.46
CTG                100.00±0.00    99.89±0.25       99.86±0.27       99.35±0.53 •     45.30±0.27 •
BreastTissue       64.28±11.46    71.14±11.75      65.17±12.54      67.75±13.69      40.65±4.97 •
crx                81.42±4.49     84.96±3.87 ◦     86.28±3.77 ◦     81.06±4.14       86.17±4.22 ◦
car                99.14±0.92     93.87±2.02 •     90.60±2.34 •     90.25±2.48 •     70.02±0.16 •
cmc                54.08±3.81     50.83±3.69 •     50.15±3.85 •     49.04±3.98 •     42.70±0.25 •
glass              96.31±3.95     97.84±3.22       97.54±3.27       93.37±5.63       67.82±2.50 •
zoo                95.66±5.83     89.92±8.49       96.05±5.61       96.95±4.75       60.43±3.06 •
blood              77.93±3.56     73.13±4.11 •     75.01±4.09 •     77.01±3.07       78.84±3.40
balance-scale      89.71±4.72     74.74±4.85 •     74.44±6.86 •     92.13±2.90       71.77±4.24 •
post-operative     56.11±13.25    58.33±10.98      66.44±8.79 ◦     66.89±8.05 ◦     67.11±8.19 ◦

◦, • statistically significant improvement or degradation

Page 401: [Lecture Notes in Computer Science] Advances in Artificial Intelligence Volume 6657 ||

388 E.N. de Souza and S. Matwin

Table 3. Results of the T-Test comparing the implemented solution (AdaBoost Dynamic (1)) with AdaBoost.M1 with Random Tree (2), ZeroR (3), NBTrees (4) and SVM (5). The values with NA are due to the fact that SVM does not process datasets with missing values.

Dataset            (1)            (2)              (3)              (4)              (5)
balance-scale      89.71±4.72     78.09±3.88 •     45.76±0.53 •     80.78±4.59 •     91.47±3.48
blood              77.93±3.56     72.92±4.11 •     76.21±0.41       77.61±3.68       71.95±3.64 •
cmc                54.08±3.81     49.38±4.09 •     42.70±0.25 •     51.90±4.52       54.82±4.13
glass              96.31±3.95     92.53±7.64       35.51±2.08 •     95.72±5.02       98.22±2.89
post-operative     56.11±13.25    59.78±12.61      70.00±5.12 ◦     56.56±14.05      NA
zoo                95.66±5.83     60.75±20.49 •    40.61±2.92 •     95.84±5.97       60.24±12.40 •
au1                72.25±3.79     68.50±4.24 •     74.10±0.30       76.30±3.92 ◦     71.96±3.96
au1-balanced       76.95±3.86     73.65±4.35       58.86±0.26 •     80.19±3.15 ◦     76.75±3.67
BreastTissue       64.28±11.46    66.91±12.70      19.01±1.42 •     68.67±11.97      21.96±5.10 •
car                99.14±0.92     84.90±3.35 •     70.02±0.16 •     98.58±0.85       99.39±0.68
contact-lenses     74.17±28.17    75.50±29.91      64.33±23.69      76.17±25.43      79.50±24.60
iris               95.13±4.63     93.53±5.48       33.33±0.00 •     94.33±5.47       96.53±4.29
weather            68.00±43.53    61.50±43.14      70.00±33.33      71.00±38.39      52.50±40.44
weather.symbolic   61.00±41.79    70.50±38.99      70.00±33.33      67.00±39.07      61.50±41.96
crx                81.42±4.49     84.23±5.20       55.51±0.67 •     86.30±3.96 ◦     NA
CTG                100.00±0.00    97.37±1.76 •     27.23±0.16 •     98.13±1.40 •     NA
segment            96.59±1.58     94.49±1.95 •     15.73±0.33 •     98.19±1.13 ◦     56.22±4.19 •

◦, • statistically significant improvement or degradation

Table 4. Results of the T-Test comparing the implemented solution (AdaBoost Dynamic (1)) with AdaBoost.M1 with Bagging (2) and AdaBoost.M1 with Neural Network (3)

Dataset            (1)            (2)              (3)
au1                72.25±3.79     73.56±3.32       73.22±3.84
au1-balanced       76.95±3.86     79.49±2.81       NA
BreastTissue       64.28±11.46    70.51±13.11      65.31±11.38
soybean            93.45±2.82     88.05±3.18 •     93.35±2.68
balance-scale      89.71±4.72     77.25±4.38 •     93.06±3.13 ◦
blood              77.93±3.56     73.04±4.16 •     78.47±2.85
cmc                54.08±3.81     51.21±3.63 •     54.33±3.90
glass              96.31±3.95     97.75±3.23       95.89±4.00
post-operative     56.11±13.25    57.22±12.27      57.22±12.96
zoo                95.66±5.83     42.59±4.93 •     95.66±5.83
crx                81.42±4.49     83.36±5.01       83.14±4.18
CTG                100.00±0.00    99.99±0.08       100.00±0.00
segment            96.59±1.58     98.33±0.98 ◦     96.45±1.59
car                99.14±0.92     97.11±1.29 •     99.40±0.65
contact-lenses     74.17±28.17    74.17±28.17      74.17±28.17
iris               95.13±4.63     94.13±5.21       96.20±4.37
weather            68.00±43.53    68.00±40.53      64.00±44.43
weather.symbolic   61.00±41.79    69.50±38.86      61.00±41.79

◦, • statistically significant improvement or degradation

4 Conclusion

This work introduced a small modification to AdaBoost.M1, developed by Freund and Schapire. The modification is motivated by the fact that the original approach only boosted one "weak" learner. This work proposes that we can use different "weak" learners across iterations, allowing a different weak learner to be used in each iteration relative to the previous one.


The authors have made available an extended version of this work at http://www.site.uottawa.ca/~edeso096/AI_2011_extended.pdf with a proof that, even using different "weak" learners in each iteration, the algorithm has the same upper bound as the AdaBoost.M1 approach. This is because the same assumptions and requirements made for AdaBoost.M1 are kept in place for AdaBoost Dynamic.

Experimental results suggest that, for a large majority of the datasets, the performance of AdaBoost Dynamic is as good as that of AdaBoost.M1 with the best single weak learner. AdaBoost Dynamic can therefore be used as a default algorithm that provides a benchmark basis to which other weak learners can be compared. AdaBoost Dynamic will be successfully used if the analyst is not sure about the best base learner to be used with AdaBoost for a particular dataset.

One possible improvement to AdaBoost Dynamic is to implement a technique to check which is the best "weak" classifier for a given distribution in a given iteration. This would allow the algorithm to offer the best weight for a distribution, and offer a better final hypothesis. Another line of work is to investigate whether the order of execution of the weak learners has some influence on the final hypothesis result.

Another improvement is to change the error test in line 6 of Table 1. The test only verifies whether the hypothesis error is bigger than 0.5, and then exits the loop. It is possible to allow the execution of different classifiers in the same iteration, checking whether any of them has a hypothesis error smaller than 0.5. We are working on this modification.

References

1. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
2. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Saitta, L. (ed.) Proceedings of the 13th International Conference on Machine Learning, pp. 148–156 (1996)
3. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
4. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2), 337–407 (2000)
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explorations 11(1) (2009)
6. Rodríguez, J.J., Alonso, C.J., Boström, H.: Boosting interval based literals. Intell. Data Anal. 5, 245–262 (2001)


Parallelizing a Convergent Approximate Inference Method

Ming Su1 and Elizabeth Thompson2

1 Department of Electrical Engineering
2 Department of Statistics
University of Washington

{mingsu,eathomp}@u.washington.edu

Abstract. The ability to efficiently perform probabilistic inference tasks is critical to large scale applications in statistics and artificial intelligence. Dramatic speedup might be achieved by appropriately mapping the current inference algorithms to the parallel framework. Parallel exact inference methods still suffer from exponential complexity in the worst case. Approximate inference methods have been parallelized and good speedup is achieved. In this paper, we focus on a variant of the Belief Propagation algorithm. This variant has better convergence properties and is provably convergent under certain conditions. We show that this method is amenable to coarse-grained parallelization and propose techniques to optimally parallelize it without sacrificing convergence. Experiments on a shared memory system demonstrate that near-ideal speedup is achieved with reasonable scalability.

Keywords: Graphical Model, Approximate Inference, Parallel Algorithm.

1 Introduction

The ability to efficiently perform probabilistic inference tasks is critical to large scale applications in statistics and artificial intelligence. In particular, such problems arise in the analysis of genetic data on large and complex pedigrees [1] or data at large numbers of markers across the genome [2]. The ever-evolving parallel computing technology suggests that dramatic speed-up might be achieved by appropriately mapping the existing sequential inference algorithms to the parallel framework. Exact inference methods, such as variable elimination (VE) and the junction tree algorithm, have been parallelized and reasonable speedup achieved [3–7].

However, the complexity of exact inference methods for a graphical model is exponential in the tree-width of the graph. For graphs with large tree-width, approximate methods are necessary. While it has been demonstrated empirically that loopy and generalized BP work extremely well in many applications [8], Yedidia et al. [9] have shown that these methods are not guaranteed to converge for loopy graphs. Recently a promising parallel approximate inference method was presented by Gonzalez et al. [10], where loopy Belief Propagation (BP)


was optimally parallelized, but without a guarantee of convergence. The UPS algorithm [11] has gained popularity due to its reasonably good performance and ease of implementation [12, 13]. More importantly, the convex relaxation method, which incorporates UPS as a special case, is guaranteed to converge under mild conditions [14].

In this paper, we develop an effective parallel generalized inference method with special attention to the UPS algorithm. Even though the generalized inference method possesses a structural parallelism that is straightforward to extract, problems of imbalanced load and excessive communication overhead can result from ineffective task partitioning and sequencing. We focus on solving these two problems and on demonstrating the performance of the efficiently parallelized algorithms on large-scale problems using a shared memory system.

2 Convex Relaxation Method and Subproblem Construction

The convex relaxation method relies on the notion of region graphs to facilitate the Bethe approximation. In the Bethe approximation, one minimizes the Bethe free energy function and uses its solution to obtain an estimate of the partition function and the true marginal distributions [14]. The Bethe free energy is a function of terms known as the pseudo-marginals. Definitions and examples of the Bethe approximation, Bethe region graphs, and pseudo-marginals can be found in [9, 15]. The UPS algorithm and the convex relaxation method are based on the fact that if the graphical model admits a tree-structured Bethe region graph, the associated Bethe approximation is exact [9, 15]. That is, minimization of the Bethe free energy is a convex optimization problem.

We obtain a convex subproblem by fixing the pseudo-marginals associated with a selected subset of inner regions to a constant vector. The convex relaxation method works by first finding a sequence of such convex subproblems and then repeatedly solving them until convergence. Graphically, the subproblems are defined over a sequence of tree-structured subgraphs. Simple schemes for finding these subgraphs in grid graphs are proposed in [11]. However, these schemes are not optimal and cannot be extended to general graphs. We present a hypergraph spanning tree algorithm that is more effective and is applicable to general graphs. With the hypergraph representation, the problem of finding these subgraphs, which otherwise requires ad hoc treatment in bipartite region graphs, becomes well-defined. The definitions of hypergraphs, hyperedges, hypergraph spanning trees, and hyperforests can be found in [16].

In the hypergraph representation, nodes and hyperedges correspond to outer regions and inner regions, respectively. Specifically, an inner region can be regarded as a set whose elements are the adjacent outer regions. In the Greedy Sequencing procedure developed by [14], all outer regions are included in each subproblem. The sequence of tree-structured subgraphs corresponds to a sequence of spanning hypertrees. In general, a spanning tree in a hypergraph may not exist, and even determining its existence is strongly NP-complete [16].



Fig. 1. (a) MapReduce flowchart for a sequence of size 2; (b) Coarsening by contracting edges 3, 4, and 5

We develop a heuristic, hyperspan, by extending Kruskal's minimum spanning tree algorithm for ordinary graphs. We apply hyperspan repeatedly to obtain a sequence of spanning hyperforests. In this context, the convergence criterion of [14] translates to the condition that every hyperedge has to appear in at least one spanning forest. The Greedy Sequencing procedure guarantees that, even in the worst case, the convergence criterion is still satisfied. Interestingly, for a grid graph model of arbitrary size, the greedy sequencing procedure returns a sequence of size two, which is optimal.
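To make the construction concrete, the following sketch shows one way such a Kruskal-style heuristic could be organized (illustrative Java only; the class name, the union-find representation, and the acceptance test of requiring all endpoints of a hyperedge to lie in distinct components are our assumptions, not details taken from the paper's implementation):

import java.util.*;

/** Illustrative Kruskal-style greedy construction of a spanning hyperforest.
    A hyperedge (inner region) is accepted only if all of its incident nodes
    (outer regions) currently lie in distinct components; accepted edges are
    merged via union-find. */
class HyperSpan {
    private final int[] parent;

    HyperSpan(int numNodes) {
        parent = new int[numNodes];
        for (int i = 0; i < numNodes; i++) parent[i] = i;
    }

    private int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    /** Returns the hyperedges (given as node-index arrays) selected for one hyperforest. */
    List<int[]> spanningHyperforest(List<int[]> hyperedges) {
        List<int[]> selected = new ArrayList<>();
        for (int[] e : hyperedges) {
            Set<Integer> roots = new HashSet<>();
            boolean acyclic = true;
            for (int v : e) acyclic &= roots.add(find(v));   // all endpoints in distinct components?
            if (!acyclic) continue;
            int base = find(e[0]);
            for (int v : e) parent[find(v)] = base;          // merge the touched components
            selected.add(e);
        }
        return selected;
    }
}

Calling spanningHyperforest repeatedly on the hyperedges not yet selected would then yield the sequence of spanning hyperforests used to define the convex subproblems.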

3 Parallel and Distributed Inference

In the greedy sequencing procedure, if a subproblem is defined on a forest rather than on a tree, we can run Iterative Scaling (IS) on the disconnected components independently and consequently in parallel. This suggests a natural way of extracting coarse-grained parallelism uniformly across the sequence of subproblems. The basic idea is to partition the hypertree or even the hyperforest into a prescribed number, t, of components and assign the computation associated with each component to a separate processing unit. No communication cost is incurred among the independent computation tasks. This maps to a coarse-grained MapReduce framework [17], as shown in Figure 1(a). Note that synchronization, accomplished by software barriers, is still required at the end of each inner iteration. In this paper, we focus only on mapping the algorithm to a shared memory system.

Task partitioning is performed using the multilevel hypergraph partitioning program hMETIS [18]. Compared to alternative programs, it has a much shorter solution time and, more importantly, it produces balanced partitions with significantly fewer cut edges. The convergence criterion states that every hyperedge has to appear in at least one spanning forest [14]. This means no hyperedge is allowed to always be a cut edge. A simple technique, edge contraction, prevents a hyperedge from being a cut edge. When a hyperedge is contracted, it is replaced by a super node containing this edge and all nodes that are adjacent to this edge. All other edges that were previously adjacent to any of these nodes become


adjacent to the super node (Figure 1(b)). After we partition once, we can contract a subset of cut edges, resulting in a coarsened hypergraph; repartitioning this coarsened hypergraph will not place any cut on the contracted edges.
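A minimal sketch of this contraction step, under the assumption that nodes are identified by integer ids and a contracted hyperedge is represented by a fresh super-node id (our own convention, not the paper's data structures):

import java.util.*;

/** Illustrative contraction of one hyperedge: its incident nodes are mapped onto a
    single super node, and every other hyperedge is rewritten to reference that
    super node instead (cf. Figure 1(b)). */
class Contraction {
    /** nodeOf maps every original node id to its current (possibly super) node id. */
    static void contract(Set<Integer> hyperedge, Map<Integer, Integer> nodeOf, int superNodeId) {
        for (int v : hyperedge) nodeOf.put(v, superNodeId);
    }

    /** Rewrites a hyperedge under the current node mapping, dropping duplicates. */
    static Set<Integer> rewrite(Set<Integer> hyperedge, Map<Integer, Integer> nodeOf) {
        Set<Integer> rewritten = new HashSet<>();
        for (int v : hyperedge) rewritten.add(nodeOf.getOrDefault(v, v));
        return rewritten;
    }
}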

Near-optimal speedup is only achieved when we have perfect load balancing. Knowing that the IS solution time is proportional to the number of nodes, we perform weighted partitionings. The weight of a regular node is 1; for a super node, the weight is the number of contained regular nodes. Reasonable load balance is achieved through weighted partitioning when the average interaction between adjacent random variables is not too high. For high interaction, partitioning-based static load balancing (SLB) performs poorly. In Section 4, we show this effect and propose some techniques to accommodate it.

We adopted the common multithreading scheme where, in general, n threads are created on an n-core system and each thread is assigned to a separate core. Thread synchronization ensures that all subproblems converge. We use nonblocking sends and blocking receives because they are more efficient for this implementation. For efficiency, pseudo-marginals are sent and received in one package rather than individually. Sender and receiver, respectively, use a predefined protocol to pack and unpack the aggregate into individual pseudo-marginal messages.
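As a rough illustration of the packing step (plain Java, independent of the MPI library used in the paper; the fixed equal-length layout of the pseudo-marginal vectors is an assumption made only for this sketch):

/** Illustrative packing of k pseudo-marginal vectors of equal length m into one
    flat buffer, and the corresponding unpacking on the receiver side. */
class MarginalPackager {
    static double[] pack(double[][] marginals) {
        int m = marginals[0].length;
        double[] buffer = new double[marginals.length * m];
        for (int i = 0; i < marginals.length; i++)
            System.arraycopy(marginals[i], 0, buffer, i * m, m);
        return buffer;
    }

    static double[][] unpack(double[] buffer, int m) {
        double[][] marginals = new double[buffer.length / m][m];
        for (int i = 0; i < marginals.length; i++)
            System.arraycopy(buffer, i * m, marginals[i], 0, m);
        return marginals;
    }
}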

Our experimental environment is a shared memory 8-core system with two Intel Xeon Quad Core E5410 2.33 GHz processors running Debian Linux. We implemented the algorithms in the Java programming language using MPJ Express, an open source Java message passing interface (MPI) library that allows application developers to write and execute parallel applications for multicore processors and computer clusters/clouds.

4 Experiments and Results

The selected class of test problems consists of 100 × 100 Ising models, with joint distribution
\[ P(x) \propto \exp\Big(\sum_{i \in V} \alpha_i x_i + \sum_{(i,j) \in E} \beta_{ij} x_i x_j\Big), \]
where $V$ and $E$ are the nodes and edges of the graph. The $\alpha_i$ are drawn uniformly from $[-1, 1]$ and the $\beta_{ij}$ uniformly from $[-\beta, \beta]$. When $\beta > 1$, loopy BP fails to converge even for small graphs.
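For concreteness, such a test problem could be generated as follows (an illustrative sketch; the array-based edge layout is our own choice):

import java.util.Random;

/** Illustrative generation of a 100x100 grid Ising model with node parameters
    alpha_i ~ U[-1,1] and edge parameters beta_ij ~ U[-beta,beta]. */
class IsingGrid {
    final double[][] alpha = new double[100][100];
    final double[][] betaRight = new double[100][99];   // horizontal edges (i,j)-(i,j+1)
    final double[][] betaDown  = new double[99][100];   // vertical edges (i,j)-(i+1,j)

    IsingGrid(double beta, long seed) {
        Random rng = new Random(seed);
        for (int i = 0; i < 100; i++)
            for (int j = 0; j < 100; j++)
                alpha[i][j] = -1 + 2 * rng.nextDouble();
        for (int i = 0; i < 100; i++)
            for (int j = 0; j < 99; j++)
                betaRight[i][j] = -beta + 2 * beta * rng.nextDouble();
        for (int i = 0; i < 99; i++)
            for (int j = 0; j < 100; j++)
                betaDown[i][j] = -beta + 2 * beta * rng.nextDouble();
    }
}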

Due to synchronization, the slowest task determines the overall performance. The SLB introduced in Section 3 performs worse as β increases. In practice, we apply two runtime techniques to mitigate the problem. First, a dynamic load balancing (DLB) scheme is developed. Instead of partitioning the graph into n components and distributing them to n threads, we partition the graph into more components and put them into a task pool. At runtime, each thread fetches a task from the pool once it finishes its current task. The use of each core is maximized and the length of the bottleneck task is shortened. The second technique is early termination (ET) of the bottleneck task. A thread is terminated when all other threads have become idle and no task is left in the pool. However, terminating a task prematurely has two undesirable effects. First, it breaks the convergence requirement. Second, it may change the convergence rate. In order to ensure convergence, we can occasionally switch back to non-ET mode, especially when oscillation of messages is detected.
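The DLB scheme can be pictured with standard Java concurrency utilities (a sketch only; early termination is not modeled, and the task objects are placeholders for the per-component IS computations):

import java.util.concurrent.*;

/** Illustrative dynamic load balancing: the hypergraph is split into more components
    than there are cores, the resulting tasks are put into a shared pool, and idle
    worker threads pick up the next task as soon as they finish their current one. */
class DynamicLoadBalancer {
    static void run(BlockingQueue<Runnable> taskPool, int cores) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(cores);
        Runnable task;
        while ((task = taskPool.poll()) != null)
            workers.submit(task);                  // idle cores fetch the next task
        workers.shutdown();                        // barrier: wait for the inner iteration to finish
        workers.awaitTermination(1, TimeUnit.HOURS);
    }
}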


Fig. 2. (a) Load balance: DLB & ET vs. SLB. Normalized load (w.r.t. the largest) shown for each core; three cases are listed: 2 cores (upper left), 4 cores (upper right), and 8 cores (bottom). (b) Speedup: DLB & ET vs. SLB.

With β = 1.1, we randomly generated 100 problems. The number of cores ranges from 2 up to 8 to demonstrate both raw speedup and scalability. Speedup is defined as the ratio between sequential and parallel elapsed time. At this interaction level, the sequential run time exceeds one minute, motivating parallelization, and SLB starts to perform poorly. Figure 2(a) shows that with SLB, poor balance results irrespective of the number of cores used. This is dramatically mitigated by DLB and ET. Notice that almost perfect balance is achieved for a small number of cores (2, 4), but with 8 cores the load is less balanced.

The average speedup over the 100 problems is shown in Figure 2(b), both for SLB and for DLB with ET. DLB and ET universally improved the speedup, and the improvement became more prominent as the number of cores increased. With DLB and ET, the speedup approaches the ideal case until the number of cores reaches 6. We attribute this drop in speedup to two factors. First, as shown in Figure 2(a), even with DLB and ET, the load becomes less balanced as the number of cores increases. Second, there is an increased level of resource contention in terms of memory bandwidth. The BP algorithm frequently accesses memory, and as more tasks run in parallel, the number of concurrent memory accesses also increases.

5 Discussion

In this paper, we proposed a heuristic for subproblem construction. This heuristic has been shown to be effective and is provably optimal for grid graphs. Thorough testing on a complete set of benchmark networks will be important for evaluating the performance of the heuristic. Our parallel implementation is at the algorithmic level, which means it can be combined with lower-level parallelization techniques proposed by other researchers. Experiments on a shared memory system exhibit near-ideal speedup with reasonable scalability. Further exploration is necessary to demonstrate that the speedup scales up in practice on large distributed memory systems, such as clusters.


Acknowledgments. This work is supported by NIH grant HG004175.

References

1. Cannings, C., Thompson, E.A., Skolnick, M.H.: Probability functions on complex pedigrees. Advances in Applied Probability 10, 26–61 (1978)

2. Abecasis, G.R., Cherny, S.S., Cookson, W.O., Cardon, L.R.: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics 30, 97–101 (2002)

3. Shachter, R.D., Andersen, S.K.: Global Conditioning for Probabilistic Inference in Belief Networks. In: UAI (1994)

4. Pennock, D.: Logarithmic Time Parallel Bayesian Inference. In: UAI, pp. 431–443 (1998)

5. Kozlov, A., Singh, J.: A Parallel Lauritzen-Spiegelhalter Algorithm for Probabilistic Inference. In: Proceedings of the 1994 Conference on Supercomputing, pp. 320–329 (1994)

6. Namasivayam, V.K., Pathak, A., Prasanna, V.K.: Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference. In: 18th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2006), pp. 167–176 (2006)

7. Xia, Y., Prasanna, V.K.: Parallel exact inference on the cell broadband engine processor. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12 (2008)

8. Botetz, B.: Efficient belief propagation for vision using linear constraint nodes. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2007)

9. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Generalized belief propagation. In: NIPS, pp. 689–695. MIT Press, Cambridge (2000)

10. Gonzalez, J., Low, Y., Guestrin, C., O'Hallaron, D.: Distributed Parallel Inference on Large Factor Graphs. In: UAI (2009)

11. Teh, Y.W., Welling, M.: The unified propagation and scaling algorithm. In: NIPS, pp. 953–960 (2001)

12. Carbonetto, P., de Freitas, N., Barnard, K.: A statistical model for general contextual object recognition. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 350–362. Springer, Heidelberg (2004)

13. Xie, Z., Gao, J., Wu, X.: Regional category parsing in undirected graphical models. Pattern Recognition Letters 30(14), 1264–1272 (2009)

14. Su, M.: On the Convergence of Convex Relaxation Method and Distributed Optimization of Bethe Free Energy. In: Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, Florida (2010)

15. Heskes, T.: Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In: NIPS, pp. 343–350 (2002)

16. Tomescu, I., Zimand, M.: Minimum spanning hypertrees. Discrete Applied Mathematics 54, 67–76 (1994)

17. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: Proceedings of the Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004)

18. Karypis, G., Kumar, V.: hMETIS: A Hypergraph Partitioning Package (1998), http://glaros.dtc.umn.edu/gkhome/fetch/sw/hmetis/manual.pdf


Reducing Position-Sensitive Subset Ranking to Classification

Zhengya Sun1, Wei Jin2, and Jue Wang1

1 Institute of Automation, Chinese Academy of Sciences, 2 Department of Computer Science, North Dakota State University

Abstract. A widespread approach to ranking works by reducing it to a set of binary preferences and applying well-studied classification techniques. The basic question addressed in this paper is whether an accurate classifier transfers directly into a good ranker. In particular, we explore this reduction for subset ranking, which is based on optimization of the DCG metric (Discounted Cumulative Gain), a standard position-sensitive performance measure. We propose a consistent reduction framework, guaranteeing that the minimal DCG regret is achievable by learning pairwise preferences assigned with importance weights. This fact allows us to further develop a novel upper bound on the DCG regret in terms of pairwise regrets. Empirical studies on benchmark datasets validate the proposed reduction approach with improved performance.

1 Introduction

Supervised rank learning tasks often boil down to the problem of ordering a finite subset of instances in an observable feature space. This task is referred to as subset rank learning [8]. One straightforward and widely known solution for subset ranking is based on a reduction to binary classification tasks considering all pairwise preferences on the subset. Numerous ranking algorithms fall within the scope of this approach, i.e., building ranking models by running classification algorithms on binary preference problems [5–7, 9, 10, 12]. Ranking models are often evaluated by position-sensitive performance measures [15], which assign each rank position a discount factor to emphasize the quality near the top. This presumable difference poses the question of whether an accurate classifier transfers directly into a good ranker. Applications of the aforementioned algorithms seem to support a positive answer. In this paper, we attempt to provide theoretical support for this phenomenon based on well-established regret transform principles [4, 14], a mainstay of reduction analysis. Roughly speaking, regret here describes the gap between the incurred loss and the minimal loss.

Relevant work has shown that the ranking problem can be solved robustly and efficiently with binary classification techniques. The proved regret bounds for those reductions, however, mostly focus on measures that are not position-sensitive. For example, Balcan et al. [2] proved that the regret of ranking, as measured by the Area Under the ROC Curve (AUC), is at most twice that of the induced binary classification. Ailon and Mohri [1] described a


randomized reduction which guarantees that the pairwise misranking regret is not more than the binary classification regret. These inspiring results lead one to seek regret guarantees for ranking under position-sensitive criteria, which have gained enormous popularity in practice, such as Discounted Cumulative Gain (DCG) [11]. Although [18] demonstrates, to some extent, the usefulness of position-sensitive ranking using importance weighted classification techniques, there is a lack of theoretical analysis of the principles behind a successful reduction. The following critical questions remain unexplored:

• guarantees that the reduction is consistent, in the sense that given optimal (zero-regret) binary classifiers, the reduction can yield an optimal ranker, such that the expected position-sensitive performance measure is maximized;

• regret bounds which demonstrate that the decrease of the classification regret provides a reasonable approximation for the decrease of the ranking regret of interest.

Our current study aims at addressing these problems. Although the first aspect has been analogously pointed out in [8], there has been, to our knowledge, no comprehensive theoretical analysis. We characterize the DCG metric by a combination of 'relevance gain' and 'position discount', and prove that, under suitable assumptions, a sufficient condition for a consistent reduction is given by learning pairwise preferences assigned with importance weights according to relevance gains. In particular, we derive an importance weighted loss function for the reduced binary problems that exhibits good properties in preserving an optimal ranking. Such properties provide reassurance that optimizing the resulting binary loss in expectation does not hinder the search for a zero-regret ranker, and allow such a search to proceed within the scope of off-the-shelf classification algorithms.

Subsequently, we quantify, for a reduction with the consistency guarantee, how much of the classification regret can at most be transferred into position-sensitive ranking regret. Our regret analysis is based on rank-adjacent transposition strategies, which are first used to convert the DCG regret into multiple pairwise regrets. This, coupled with the majorization inequality proved by Balcan et al. [2], allows us to derive an upper bound in terms of the sum of the importance weighted classification regrets over the induced binary problems. This bound is scaled by a position-discount factor, i.e., 2 times the maximum deviation between adjacent position discounts (< 1). This constant does not depend on how many instances are ranked, and can be regarded as an improvement over that obtained for subset ranking using the regression approach [8]. Our results reveal the underlying connection between position-sensitive ranking and binary classification, namely that improving the classification accuracy can reasonably be expected to enhance the position-sensitive ranking performance.

This paper is organized as follows. Section 2 formulates the subset ranking problem and analyzes its optimal behavior. Section 3 presents the pairwise classification formulation and describes a generic reduction framework for subset ranking. Section 4 is devoted to the proof of our main results, and Section 5 presents empirical evidence on benchmark datasets. Conclusions are drawn in Section 6.


2 Subset Ranking Problem

We consider the subset ranking problem described as follows. Provided with labeled subsets, the ranker learns to predict a mapping from a finite subset to an ordering over the instances in it. Each labeled subset is assumed to be generated as $S = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$, where $x_i$ is an instance in some feature space $\mathcal{X}$, and the associated relevance label $y_i$ belongs to the set $\mathcal{Y} = \{0, \ldots, l-1\}$, with $l-1$ representing the highest relevance and $0$ the lowest.

2.1 Notation

We denote the finite subset as $X = \{x_i\}_{i=1}^{n} \in \mathcal{U}$, where $\mathcal{U}$ is the set of all finite subsets of $\mathcal{X}$, and the associated relevance label set as $Y = \{y_i\}_{i=1}^{n}$. For simplicity, the size of the subset $n$ remains fixed throughout our analysis. We represent the ordering as a permutation $\pi$ on $[n] = \{1, \ldots, n\}$, using $\pi(i)$ to denote the ranked position given to the instance $x_i$, and $\pi^{-1}(j)$ to denote the index of the instance ranked at the $j$th position. The set of all possible permutations is denoted as $\Omega$. For the sake of brevity, we define an instance assignment vector $x = [x_{\pi^{-1}(1)}, \ldots, x_{\pi^{-1}(n)}]$ and a relevance assignment vector $y = [y_{\pi^{-1}(1)}, \ldots, y_{\pi^{-1}(n)}]$ according to $\pi$, where $x_{\pi^{-1}(i)}$ represents the instance ranked at the $i$th position and $y_{\pi^{-1}(i)}$ represents the relevance label assigned to the instance ranked at the $i$th position.

2.2 Discounted Cumulative Gain (DCG)

Based on the perfect ordering $\bar\pi$, which is in non-increasing order of the relevance labels, we evaluate the quality of the estimated ordering with $\mathrm{DCG}(\pi, Y)$. Unlike other ranking measures such as AUC, DCG not only assesses each instance $i$ by a relevance gain $g(y_i, Y)$, but also discriminates each position $\pi(i)$ by a discount factor $d_{\pi(i)}$, allowing the evaluation to concentrate on the top rank positions [11]. Let $g(y_i, Y) = 2^{y_i} - 1$ and $d_{\pi(i)} = \frac{1}{\log_2(1 + \pi(i))}$; then we have
\[ \mathrm{DCG}(\pi, Y) = \sum_{i=1}^{n} g(y_i, Y) \cdot d_{\pi(i)}. \]
The discount factor defined above is positive and strictly decreasing, i.e., for all $\pi(i) < \pi(j)$, $d_{\pi(i)} > d_{\pi(j)} > 0$. When only the top $k$ ($k < n$) instances need to be ranked correctly, $d_{\pi(i)}$ is set to zero for $\pi(i) > k$.
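Using the gain and discount just defined, the metric can be computed directly; the following sketch assumes 1-based positions and integer relevance labels (illustrative code, not taken from the paper):

/** Computes DCG(pi, Y) with gain g(y) = 2^y - 1 and discount d = 1/log2(1 + position).
    rank[i] gives the position assigned to instance i; k <= 0 means no truncation. */
class Dcg {
    static double dcg(int[] rank, int[] labels, int k) {
        double sum = 0.0;
        for (int i = 0; i < labels.length; i++) {
            int position = rank[i];
            if (k > 0 && position > k) continue;          // discount set to zero beyond top k
            double gain = Math.pow(2, labels[i]) - 1;
            double discount = 1.0 / (Math.log(1 + position) / Math.log(2));
            sum += gain * discount;
        }
        return sum;
    }
}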

2.3 Ranking Formulations

In the standard supervised learning setup, the ranking problem that we are investigating can be defined as follows.


Definition 1. (position-sensitive subset ranking) Assume that each labeled subset $S = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$ is generated independently at random according to some (unknown) underlying distribution $\mathcal{D}$. The ranker works with a ranking function space $\mathcal{H} = \{h : \mathcal{U} \to \Omega\}$ which maps a set $X \in \mathcal{U}$ to a permutation $\pi$, namely $h(X) = \pi$. The position-sensitive ranking loss of the predictor $h$ on a labeled subset $S$ is defined as
\[ \ell_{rank}(h, S) = \mathrm{DCG}(\bar\pi, Y) - \mathrm{DCG}(h(X), Y). \]
The learning goal is to find a predictor $h$ so that the expected position-sensitive ranking loss with respect to $\mathcal{D}$, given by
\[ L_{rank}(h, \mathcal{D}) = \mathbb{E}_{S \sim \mathcal{D}}\, \ell_{rank}(h, S) = \mathbb{E}_{X}\, L_{rank}(h, X), \tag{1} \]
is as small as possible, where
\[ L_{rank}(h, X) = \mathbb{E}_{Y|X}\, \ell_{rank}(h, S). \tag{2} \]

The loss $\ell_{rank}$ quantifies our intuitive notion of 'how far the predicted permutation is from the perfect permutation' based on the DCG metric. The loss attains its minimum of zero when the subset $X$ is ranked in non-increasing order of the relevance labels in $Y$, and is maximal when it is ranked in non-decreasing order. To characterize the optimal ranking rule with the minimum loss in (1), it is reasonable to analyze its conditional formulation $L_{rank}(h, X)$ as a starting point.

Lemma 1. Given a set $X \in \mathcal{U}$, define the optimal subset ranking function $h^*$ as a minimizer of the conditional expectation in (2). Let $\pi = h^*(X)$ be the output permutation of $h^*$. Then for any $d_{\pi(i)} > d_{\pi(j)}$, $i, j \in [n]$, it holds that
\[ \mathbb{E}_{(y_i, y_j, Y)|(x_i, x_j, X)}\big(g(y_i, Y) - g(y_j, Y)\big) \ge 0. \]

Lemma 1 explicitly states that, given $X$, the optimal subset ranking is in non-increasing order of the relevance gain functions in expectation. The proof of Lemma 1 is straightforward and omitted due to space limitations.

3 Reductions to Binary Classification

In this section, we turn to the reduction method, which decomposes subset ranking problems into importance weighted binary classification problems considering all weighted pairwise preferences between two instances.

3.1 Classification Formulations

In importance weighted binary classification, each instance-label pair is supplied with a non-negative weight which specifies the importance of predicting this instance correctly [3]. The corresponding formulation [3, 14] can be naturally extended to learning pairwise preferences, and is defined as follows.


Procedure 1. Binary_Train (labeled set S of size n, binary classification learning algorithm A)
  Set T = ∅.
  for all ordered pairs (i, j) with i, j ∈ [n], i ≠ j:
    Set w_ij(X, Y) = |g(y_i, Y) − g(y_j, Y)|.
    Add to T the importance weighted example ((x_i, x_j, X), I(y_i > y_j), w_ij(X, Y)).
  end for
  Return c = A(T).

Procedure 2. Rank_Predict (instance set X, binary classifier c)
  for each x_i ∈ X:
    f(x_i, X) = (1/2) Σ_{j≠i} (c(x_i, x_j, X) − c(x_j, x_i, X) + 1), where x_j ∈ X.
  end for
  Sort X in non-increasing order of f(x_i, X).
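The two procedures translate almost literally into code; the following sketch is illustrative only (the example container and the classifier interface are our own placeholders), using the gain $g(y) = 2^y - 1$ from Section 2.2:

import java.util.*;

/** Illustrative rendering of Procedures 1 and 2. */
class SubsetRankingReduction {
    record WeightedExample(int i, int j, boolean label, double weight) {}

    interface PairClassifier { int prefer(int i, int j); }   // returns 0 or 1 for c(x_i, x_j, X)

    /** Procedure 1: build the importance-weighted example set with w_ij = |g(y_i)-g(y_j)|. */
    static List<WeightedExample> binaryTrainExamples(int[] labels) {
        List<WeightedExample> t = new ArrayList<>();
        for (int i = 0; i < labels.length; i++)
            for (int j = 0; j < labels.length; j++)
                if (i != j) {
                    double w = Math.abs((Math.pow(2, labels[i]) - 1) - (Math.pow(2, labels[j]) - 1));
                    t.add(new WeightedExample(i, j, labels[i] > labels[j], w));
                }
        return t;
    }

    /** Procedure 2: degree function f and the induced ordering (non-increasing in f). */
    static Integer[] rankPredict(int n, PairClassifier c) {
        double[] f = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j) f[i] += 0.5 * (c.prefer(i, j) - c.prefer(j, i) + 1);
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(f[b], f[a]));   // sort by f, descending
        return order;
    }
}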

Definition 2. (importance weighted binary classification for pairwise preferences) Assume that each triple $t_{ij} = ((x_i, x_j), I(y_i > y_j), w_{ij}) \in (\mathcal{X} \times \mathcal{X}) \times \{0, 1\} \times [0, +\infty)$ is generated at random according to some (unknown) underlying distribution $\mathcal{P}$, where $I(\cdot)$ is 1 when the argument is true and 0 otherwise, and the component in $[0, +\infty)$ indicates the importance of classifying the pair correctly. The classifier works with a preference function space $\mathcal{C} = \{c : \mathcal{X} \times \mathcal{X} \to \{0, 1\}\}$ which maps an ordered pair $(x_i, x_j) \in \mathcal{X} \times \mathcal{X}$ to a binary relation. The importance weighted classification loss of the predictor $c$ on a triple $t_{ij}$ is defined as
\[ \ell_{class}(c, t_{ij}) = \tfrac{1}{2}\, w_{ij} \cdot I(y_i > y_j) \cdot \big(1 - c(x_i, x_j) + c(x_j, x_i)\big). \tag{3} \]
The learning goal is to find a predictor $c$ such that the expected importance weighted classification loss with respect to $\mathcal{P}$, given by
\[ L_{class}(c, \mathcal{P}) = \mathbb{E}_{t_{ij} \sim \mathcal{P}}\, \ell_{class}(c, t_{ij}), \tag{4} \]
is as small as possible.

When learning pairwise preferences, the binary classifier $c$ decides for each ordered pair $(x_i, x_j)$ whether $x_i$ or $x_j$ is preferred. A perfect prediction preserves the target preference between the two alternatives, i.e., $y_i > y_j \Leftrightarrow c(x_i, x_j) - c(x_j, x_i) = 1$, and a non-zero loss is incurred otherwise. When $w_{ij} = 1$, the expected loss $L_{class}$ is simply the probability that a discordant pair occurs, assuming that ties are broken at random.

3.2 Ranking a Subset with Binary Classifiers

We introduce a general framework for ranking a subset with a binary classifier, which unifies a large family of pairwise ranking algorithms such as Ranking SVMs [10], RankBoost [9], and RankNet [5]. This framework is composed of the two procedures described below.


The training procedure (Binary_Train) takes a set $S$ of labeled instances in $\mathcal{X} \times \{0, \ldots, l-1\}$ and transforms every pair of labeled instances into two binary classification examples, each of which is augmented with a non-negative weight. By running a binary learning algorithm $A$ on the transformed example set $T$, a classifier of the form $c : \mathcal{X} \times \mathcal{X} \times \mathcal{U} \to \{0, 1\}$ is obtained, where $x_i, x_j \in X$.

We then define the induced distribution $\tilde{\mathcal{D}}$ for the binary classifier $c$. To generate a sample from this distribution, we first draw a random labeled set $S$ from the original distribution $\mathcal{D}$, and subsequently draw uniformly from $S$ an ordered pair $(i, j)$, which is translated into $((x_i, x_j, X), I(y_i > y_j), w_{ij}(X, Y))$. We define the importance weight function $w_{ij}(X, Y)$ by
\[ w_{ij}(X, Y) = |g(y_i, Y) - g(y_j, Y)|. \tag{5} \]

Intuitively, the larger the difference between the relevance gains associated with two different examples, the more important it is to predict the preference between them correctly. In theory, this choice of weights enjoys sound regret properties, which will be investigated in the next section.

The test procedure (Rank_Predict) assigns a preference degree to each instance $x_i$ according to the degree function $f(x_i, X)$, which increases by 1 if $x_i$ is strictly preferred to $x_j$, i.e., $c(x_i, x_j, X) - c(x_j, x_i, X) = 1$, and by $\tfrac{1}{2}$ if $x_i$ is regarded as equally good as $x_j$, i.e., $c(x_i, x_j, X) - c(x_j, x_i, X) = 0$. The instances are then sorted in non-increasing order of the preference degrees.

4 Regret Analysis

We now apply the well-established regret transform principle to analyze the reduction from subset ranking to binary classification. We first prove a guarantee on the consistency of the reduction when zero regret is attained. Then we provide novel regret bounds when non-zero regret is attained.

4.1 Consistency of Reduction Methods

We shall rewrite (4) by replacing the original distribution $\mathcal{P}$ with the induced distribution $\tilde{\mathcal{D}}$ due to the reduction:
\[ L_{class}(c, \tilde{\mathcal{D}}) = \frac{1}{Z}\, \mathbb{E}_{S \sim \mathcal{D}} \sum_{(i,j)} \ell_{class}(c, t_{ij}, S) = \mathbb{E}_{X}\, L_{class}(c, X), \tag{6} \]
where $Z = \mathbb{E}_{S \sim \mathcal{D}} \sum_{(i,j)} w_{ij} \cdot I(y_i > y_j)$ is the normalization constant, and
\[ L_{class}(c, X) = \frac{1}{Z}\, \mathbb{E}_{Y|X} \sum_{(i,j)} \ell_{class}(c, t_{ij}, S) = \frac{1}{Z} \sum_{i,j} \mathbb{E}_{(y_i, y_j, Y)|(x_i, x_j, X)}\big(\ell_{class}(c, t_{ij}, S) + \ell_{class}(c, t_{ji}, S)\big). \tag{7} \]


Lemma 2. Given a set $X \in \mathcal{U}$, define the optimal subset preference function $c^* \in \mathcal{C}$ as a minimizer of (7). Let the importance weights be defined as in (5). Then for $c^*(x_i, x_j, X) - c^*(x_j, x_i, X) = 1$, it holds that
\[ \mathbb{E}_{(y_i, y_j, Y)|(x_i, x_j, X)}\big(g(y_i, Y) - g(y_j, Y)\big) \ge 0. \]

Proof. Note that (7) takes its minimum when each conditional expectation term in the summation achieves its minimum. Substituting (3) and (5) into (7), we have
\begin{align*}
&\tfrac{1}{2}\,\mathbb{E}_{(y_i,y_j,Y)|(x_i,x_j,X)}\big[w_{ij}\, I(y_i > y_j)\,\big(1 - c(x_i,x_j,X) + c(x_j,x_i,X)\big) \\
&\qquad\qquad + w_{ij}\, I(y_j > y_i)\,\big(1 - c(x_j,x_i,X) + c(x_i,x_j,X)\big)\big] \\
&= \tfrac{1}{2}\,\mathbb{E}_{(y_i,y_j,Y)|(x_i,x_j,X)}\big[\big(c(x_j,x_i,X) - c(x_i,x_j,X)\big)\,\big(I(y_i > y_j) + I(y_j > y_i)\big)\,\big(g(y_i,Y) - g(y_j,Y)\big) \\
&\qquad\qquad + \big(I(y_i > y_j) + I(y_j > y_i)\big)\, w_{ij}\big] \\
&= \tfrac{1}{2}\,\big[\big(c(x_j,x_i,X) - c(x_i,x_j,X)\big)\,\mathbb{E}_{(y_i,y_j,Y)|(x_i,x_j,X)}\big(g(y_i,Y) - g(y_j,Y)\big) + \mathbb{E}_{(y_i,y_j,Y)|(x_i,x_j,X)}\, w_{ij}\big].
\end{align*}
Assume by contradiction that $\mathbb{E}_{(y_i,y_j,Y)|(x_i,x_j,X)}\big(g(y_i,Y) - g(y_j,Y)\big) < 0$. Considering any $k, k' \in \{1, \ldots, n\}$, there exists a preference function $c \in \mathcal{C}$ such that $c(x_k, x_{k'}, X) - c(x_{k'}, x_k, X) = c^*(x_k, x_{k'}, X) - c^*(x_{k'}, x_k, X)$ when $\{k, k'\} \ne \{i, j\}$, and $c(x_i, x_j, X) - c(x_j, x_i, X) = -1$. Then we get that $L_{class}(c, X) < L_{class}(c^*, X)$, which contradicts the subset preference optimality of $c^*$. □

The above lemma, together with the result obtained in Lemma 1, allows us to derive the following statement.

Theorem 1. Consider position-sensitive subset ranking using importance weighted classification. Let the importance weights be defined as in (5). Let $\mathrm{Rank\_Predict}(c^*)$ be an ordering induced by the optimal subset preference function $c^*$ with respect to $X \in \mathcal{U}$. Then it holds that
\[ L_{rank}(\mathrm{Rank\_Predict}(c^*), X) = L_{rank}(h^*, X), \]
where $h^*$ is the optimal subset ranking function that minimizes the conditional expectation in (2) with respect to $X$.

The theorem states conditions that lead to a consistent reduction method in the sense that, given an optimal (zero-regret) binary classifier, the reduction can yield a ranker with minimal expected loss conditioned on $X$.

4.2 Regret Bounds

Here, regret quantifies the difference between the achieved loss and the optimal loss in expectation. More precisely, the regret of $h$ on the subset $X$ is
\[ R_{rank}(h, X) = L_{rank}(h, X) - L_{rank}(h^*, X), \tag{8} \]
where $h^*$ is the optimal subset ranking function as defined previously.


Similarly, the regret of $c$ on the subset $X$ is
\[ R_{class}(c, X) = L_{class}(c, X) - L_{class}(c^*, X), \tag{9} \]
where $c^*$ is the optimal subset preference function as defined previously. Note that $R_{class}(c, X)$ is scaled by a normalization constant which depends on the sum of the importance weights of the induced pairwise preferences, while no such constant appears in $R_{rank}(h, X)$. For fairness and simplicity, we leave out the normalization constant and, with a slight abuse of notation, work in the following with the cumulative (unnormalized) classification regret $Z \cdot R_{class}(c, X)$, which we continue to denote by $R_{class}(c, X)$. We then provide an upper bound that relates the subset ranking regret $R_{rank}(h, X)$ to this cumulative classification regret. Before continuing, we need to present some auxiliary results for proving the regret bounds.

Definition 3. (proper pairwise regret) Given a set $X$, for any two instances $x_i, x_j \in X$, we denote the pairwise loss of ordering $x_i$ before $x_j$ by
\[ L_{pair}(x_i, x_j, X) = \mathbb{E}_{(y_i, y_j, Y)|(x_i, x_j, X)}\, w_{ij}(X, Y) \cdot I(y_j > y_i), \]
and denote the associated pairwise regret by
\[ R_{pair}(x_i, x_j, X) = \max\big(0,\; L_{pair}(x_i, x_j, X) - L_{pair}(x_j, x_i, X)\big). \]
If $L_{pair}(x_i, x_j, X) - L_{pair}(x_j, x_i, X) \ge 0$, then $R_{pair}(x_i, x_j, X)$ is called proper.

The above definition parallels the proper pairwise regret defined in [2] with respect to the AUC loss function.

Lemma 3. Let the importance weights be defined as in (5). For any $i, j, k \in [n]$, if $R_{pair}(x_i, x_j, X)$ and $R_{pair}(x_j, x_k, X)$ are proper, then
\[ R_{pair}(x_i, x_k, X) = R_{pair}(x_i, x_j, X) + R_{pair}(x_j, x_k, X). \]
The proof of Lemma 3 is straightforward and omitted due to space limitations.

Lemma 4. For any sequence $(a_1, \ldots, a_n)$, let $(a_{(1)}, \ldots, a_{(n)})$ be the sequence sorting the values of $(a_1, \ldots, a_n)$ in non-increasing order. For all $i \in \mathbb{N}$, let $\binom{i}{2} = \frac{i(i-1)}{2}$. If $(a_{(1)}, \ldots, a_{(n)})$ is majorized by $(n-1, \ldots, 0)$, then for any $j \in [n-1]$ it holds that
\[ \sum_{u=1}^{j} \sum_{v=j+1}^{n} I(a_v \ge a_u) \;\le\; 2 \cdot \left( \sum_{v=j+1}^{n} a_v - \binom{n-j}{2} \right). \]

This proof has appeared in [2]. Majorization was originally introduced in [16]: a sequence $(a_1, \ldots, a_n)$ majorizes a sequence $(b_1, \ldots, b_n)$ if and only if $a_1 \ge \ldots \ge a_n$, $b_1 \ge \ldots \ge b_n$, $\sum_{j=1}^{k} a_j \ge \sum_{j=1}^{k} b_j$ for $k < n$, and $\sum_{j=1}^{n} a_j = \sum_{j=1}^{n} b_j$.

In what follows, we re-index the instances in $X$ according to the target permutation $\bar\pi$, i.e., so that $j = \bar\pi^{-1}(j)$. Taking $\bar\pi$ as the target permutation, any permutation $\pi$ on the same set can be transformed into $\bar\pi$ via successive rank-adjacent transpositions [13]. By flipping one discordant pair with adjacent ranks, we come by an intermediate


permutation. Let $\pi^{(i)}$ denote the intermediate permutation obtained after $i$ transposition operations. For convenience of modeling, we map each discordant pair in the set $\Gamma = \{(v, u) : u < v,\; \pi(v) < \pi(u)\}$ to the number of adjacent transpositions required to flip it. Specifically, we adopt the transposition strategy of choosing the instance $x_j$ in increasing order of $j$ and transposing the discordant pairs associated with $x_j$. More precisely, letting $u^- \in \{1, \ldots, u-1\}$ and $u^+ \in \{u+1, \ldots, n\}$, we have
\[ i = \sum_{u^-} \tau_1(u^-, \pi) + \sum_{u^+} I\big(\pi(u^+) < \pi(u)\big) \cdot I\big(\pi(v) \le \pi(u^+)\big), \]
where $\tau_1(u^-, \pi) = \sum_j I\big(\pi(j) < \pi(u^-)\big) \cdot I(u^- < j)$ can be interpreted as the total number of discordant pairs associated with $x_{u^-}$. Equipped with these preparations, we are in a position to prove the upper regret bound for the subset ranking problem:

Theorem 2. Consider position-sensitive subset ranking on $X$ using importance weighted classification. Let the importance weights be defined as in (5). Then for any binary classifier $c$, the following bound holds:
\[ R_{rank}(\mathrm{Rank\_Predict}(c), X) \;\le\; 2\,(d_1 - d_2) \cdot R_{class}(c, X). \tag{10} \]

Proof. Fix $c$. Let $h = \mathrm{Rank\_Predict}(c)$ and $h^* = \mathrm{Rank\_Predict}(c^*)$ (by Theorem 1, $h^*$ attains the optimal conditional loss), and write $\pi = h(X)$ and $\bar\pi = h^*(X)$. By the definition of $R_{rank}(h, X)$, we can rewrite the left-hand side of (10) as
\[ R_{rank}(h, X) = \mathbb{E}_{Y|X}\big(\mathrm{DCG}(h^*(X), Y) - \mathrm{DCG}(h(X), Y)\big). \]

We then obtain that
\begin{align*}
R_{rank}(h, X) &= \mathbb{E}_{Y|X} \sum_{j=1}^{n} d_j \cdot \big(g(y_{\bar\pi^{-1}(j)}, Y) - g(y_{\pi^{-1}(j)}, Y)\big) \\
&= \mathbb{E}_{Y|X} \sum_{(v,u) \in \Gamma} \big(d_{\pi^{(i)}(u)} - d_{\pi^{(i)}(v)}\big) \cdot \big(g(y_u, Y) - g(y_v, Y)\big) \\
&\le \max_i \big(d_{\pi^{(i)}(u)} - d_{\pi^{(i)}(v)}\big) \sum_{(v,u) \in \Gamma} R_{pair}(x_v, x_u, X) \\
&\le (d_1 - d_2) \cdot \sum_{(v,u) \in \Gamma} \sum_{j=u}^{v-1} R_{pair}(x_{j+1}, x_j, X) \\
&= (d_1 - d_2) \cdot \sum_{j=1}^{n-1} \big|\{u \le j < v : \pi(v) < \pi(u)\}\big| \cdot R_{pair}(x_{j+1}, x_j, X) \\
&= (d_1 - d_2) \cdot \sum_{j=1}^{n-1} \Big[ \sum_{u=1}^{j} \sum_{v=j+1}^{n} I\big(f(x_v, X) \ge f(x_u, X)\big) \Big] \cdot R_{pair}(x_{j+1}, x_j, X). \tag{11}
\end{align*}

The second equality is due to the fact that
\[ \mathrm{DCG}(\bar\pi, Y) - \mathrm{DCG}(\pi, Y) = \sum_{i=1}^{\gamma} \big(\mathrm{DCG}(\pi^{(i)}, Y) - \mathrm{DCG}(\pi^{(i-1)}, Y)\big), \]


where $\pi^{(0)} = \pi$ and $\gamma = |\Gamma|$ denotes the total number of inversions in $\pi$ (note that $\pi^{(\gamma)}$ is equivalent to $\bar\pi$). The second inequality follows from the fact that $(d_j - d_{j+1})$ is monotonically decreasing in $j$, together with repeated application of Lemma 3. The third equality follows from algebra, and the fourth from the fact that Rank_Predict outputs a permutation in non-increasing order of the degree function $f$.

The term on the right-hand side of (10) can be written as
\begin{align*}
R_{class}(c, X)
&= \sum_{u,v} \mathbb{E}_{(y_u, y_v, Y)|(x_u, x_v, X)}\big(\ell_{class}(c, t_{uv}) + \ell_{class}(c, t_{vu}) - \ell_{class}(c^*, t_{uv}) - \ell_{class}(c^*, t_{vu})\big) \\
&= \tfrac{1}{2} \sum_{u,v} \mathbb{E}_{(y_u, y_v, Y)|(x_u, x_v, X)}\big(I(y_u > y_v) + I(y_v > y_u)\big) \cdot \big(g(y_u, Y) - g(y_v, Y)\big) \\
&\qquad \cdot \big[\big(-c(x_u, x_v, X) + c(x_v, x_u, X)\big) + \big(c^*(x_u, x_v, X) - c^*(x_v, x_u, X)\big)\big] \\
&= \tfrac{1}{2} \sum_{u,v} \big[\big(-c(x_u, x_v, X) + c(x_v, x_u, X)\big) + \big(c^*(x_u, x_v, X) - c^*(x_v, x_u, X)\big)\big] \\
&\qquad \cdot \big(\mathbb{E}_{(y_u, Y)|(x_u, X)}\, g(y_u, Y) - \mathbb{E}_{(y_v, Y)|(x_v, X)}\, g(y_v, Y)\big) \\
&= \tfrac{1}{2} \sum_{u<v} \big(-c(x_u, x_v, X) + c(x_v, x_u, X) + 1\big) \cdot R_{pair}(x_v, x_u, X) \\
&= \tfrac{1}{2} \sum_{u<v} \big(-c(x_u, x_v, X) + c(x_v, x_u, X) + 1\big) \cdot \sum_{j=u}^{v-1} R_{pair}(x_{j+1}, x_j, X) \\
&= \tfrac{1}{2} \sum_{j=1}^{n-1} \Big( 2 \cdot \big|\{u \le j < v : c(x_v, x_u, X) = 1,\ c(x_u, x_v, X) = 0\}\big| \\
&\qquad + \big|\{u \le j < v : c(x_v, x_u, X) = c(x_u, x_v, X)\}\big| \Big) \cdot R_{pair}(x_{j+1}, x_j, X) \\
&= \sum_{j=1}^{n-1} \Big[ \sum_{u=1}^{j} \sum_{v=j+1}^{n} \frac{c(x_v, x_u, X) - c(x_u, x_v, X) + 1}{2} \Big] \cdot R_{pair}(x_{j+1}, x_j, X) \\
&= \sum_{j=1}^{n-1} \Big[ \sum_{v=j+1}^{n} \sum_{u \ne v} \frac{c(x_v, x_u, X) - c(x_u, x_v, X) + 1}{2} - \binom{n-j}{2} \Big] \cdot R_{pair}(x_{j+1}, x_j, X) \\
&= \sum_{j=1}^{n-1} \Big[ \sum_{v=j+1}^{n} f(x_v, X) - \binom{n-j}{2} \Big] \cdot R_{pair}(x_{j+1}, x_j, X). \tag{12}
\end{align*}

The fourth equality follows from Theorem 1 and some algebra. The last equality uses the definition of the degree function $f$. Comparing (11) and (12), we obtain the desired bound from Lemma 4. □

The above theorem derives an upper bound on the regret ratio with a constant factor of less than 2 (since $d_1 - d_2 < 1$), which extends and improves on previous work in the literature [1, 2, 8, 18]. We will show that the bound is also


the best possible. Consider a 3-element lower bound example: let the distribution have all its mass on a single 3-element subset $X = \{(x_0, 0), (x_1, 1), (x_2, 2)\}$. Take a classifier $c$ such that $c(x_0, x_1) = 0$, $c(x_1, x_0) = 1$; $c(x_0, x_2) = 0$, $c(x_2, x_0) = 1$; $c(x_1, x_2) = 1$, $c(x_2, x_1) = 1$. Then it is easy to check that $R_{class}(c, X)$ is 1, and the worst case for $R_{rank}(h, X)$ is 0.74, which is exactly $2 \cdot (d_1 - d_2)$.
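For concreteness, these numbers can be checked directly from the gain and discount definitions of Section 2.2 (gains 0, 1, 3 for labels 0, 1, 2 and discounts $d_i = 1/\log_2(1+i)$); in the worst case the tie between $x_1$ and $x_2$ is broken the wrong way, ranking $x_1$ first:
\[
\begin{aligned}
d_1 &= \tfrac{1}{\log_2 2} = 1, \qquad d_2 = \tfrac{1}{\log_2 3} \approx 0.6309,\\
\mathrm{DCG}(\bar\pi, Y) &= 3\,d_1 + 1\,d_2 + 0\,d_3 \approx 3.631, \qquad \mathrm{DCG}(\pi, Y) = 1\,d_1 + 3\,d_2 + 0\,d_3 \approx 2.893,\\
R_{rank}(h, X) &\approx 3.631 - 2.893 = 0.738 \approx 0.74 = 2\,(d_1 - d_2) \cdot R_{class}(c, X).
\end{aligned}
\]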

5 Experiments

While the focus of this work is a theoretical investigation of the reduction approach from subset ranking to classification, we have also conducted experiments that study its empirical evidence, in particular the effect of the importance weights our analysis suggests. We used a public benchmark data set called OHSUMED [17], collected from medical publications; it contains altogether 106 queries and 16,140 query-document pairs. For each query-document pair, 45 ranking features are extracted and a 3-level relevance judgement is provided, i.e., definitely relevant, possibly relevant, or not relevant.

For computational reasons, we employed a well-established classification principle with desirable properties [19]: modified Huber loss plus l2-regularization. We then evaluated the linear-form solutions with and without the proposed importance weighting scheme, referred to as IMPairRank and PairRank, respectively. In addition, two Letor baselines that aim at directly optimizing (normalized) DCG were chosen for comparison. All the results presented below were averaged over the five off-the-shelf folds, each of which consists of a training, validation, and test set. The validation set was used to identify the best set of parameters, which was then verified on the test set.

Table 1. Test NDCG for different ranking methods

            IMPairRank   PairRank   AdaRank-NDCG   SmoothRank
NDCG@1      0.5804       0.5553     0.5330         0.5576
NDCG@2      0.5151       0.4981     0.4922         0.5149
NDCG@3      0.5095       0.5048     0.4790         0.4964
NDCG@7      0.4671       0.4649     0.4596         0.4667
NDCG@10     0.4568       0.4512     0.4496         0.4568

It is interesting to note that IMPairRank achieves better test NDCG results at the given positions. In fact, it achieves the highest value at all of the top ten positions except two. This means that the proposed weighting scheme makes traditional pairwise classification a better approximation to the position-sensitive metric, and comparable with state-of-the-art baselines that directly optimize NDCG, which effectively confirms the theory.

6 Conclusion

In this paper, we attempted to provide a theoretical analysis supporting subset ranking using binary classification, and derived novel, instructive conclusions that


extend and improve the existing reduction approaches for subset ranking. The potential usefulness of the theory is validated through experiments on a benchmark data set for learning to rank.

Acknowledgments. This work was supported partially by NNSFC 60921061.

References

1. Ailon, N., Mohri, M.: An efficient reduction of ranking to classification. In: Proc. 21st COLT, pp. 87–98 (2008)

2. Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., Sorkin, G.B.: Robust reductions from ranking to classification. In: Bshouty, N.H., Gentile, C. (eds.) COLT. LNCS (LNAI), vol. 4539, pp. 604–619. Springer, Heidelberg (2007)

3. Beygelzimer, A., Dani, V., Hayes, T., Langford, J., Zadrozny, B.: Error limiting reductions between classification tasks. In: Proc. 22nd ICML, pp. 49–56 (2005)

4. Beygelzimer, A., Langford, J., Ravikumar, P.: Error-correcting tournaments. In: Gavalda, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS, vol. 5809, pp. 247–262. Springer, Heidelberg (2009)

5. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proc. 22nd ICML, pp. 89–96 (2005)

6. Burges, C.J.C., Ragno, R., Le, Q.V.: Learning to rank with non-smooth cost functions. In: Proc. 19th NIPS, pp. 193–200. MIT Press, Cambridge (2006)

7. Cortes, C., Mohri, M., Rastogi, A.: Magnitude-preserving ranking algorithms. In: Proc. 24th ICML, pp. 169–176 (2007)

8. Cossock, D., Zhang, T.: Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory 54, 5140–5154 (2008)

9. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4, 933–969 (2003)

10. Herbrich, R., Graepel, T., Obermayer, K.: Support vector learning for ordinal regression. In: Proc. 9th ICANN, pp. 97–102 (1999)

11. Jarvelin, K., Kekalainen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20, 422–446 (2002)

12. Joachims, T.: Optimizing search engines using clickthrough data. In: Proc. 8th KDD, pp. 133–142. ACM Press, New York (2002)

13. Kendall, M.G.: A new measure of rank correlation. Biometrika 30, 81–93 (1938)

14. Langford, J., Beygelzimer, A.: Sensitive error correcting output codes. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 158–172. Springer, Heidelberg (2005)

15. Le, Q.V., Smola, A.J.: Direct optimization of ranking measures. In: CoRR, abs/0704.3359 (2007)

16. Marshall, A., Olkin, I.: Inequalities: Theory of Majorization and its Applications. Mathematics in Science and Engineering, vol. 143 (1979)

17. Microsoft Research Asia: LETOR 3.0: benchmark datasets for learning to rank. Microsoft Corporation (2008)

18. Sun, Z.Y., Qin, T., Tao, Q., Wang, J.: Robust sparse rank learning for non-smooth ranking measures. In: Proc. 32nd SIGIR, pp. 259–266 (2009)

19. Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics 32, 56–85 (2004)


Intelligent Software Development Environments: Integrating Natural Language Processing with the Eclipse Platform

Rene Witte, Bahar Sateli, Ninus Khamis, and Juergen Rilling

Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada

Abstract. Software engineers need to be able to create, modify, and analyze knowledge stored in software artifacts. A significant amount of these artifacts contains natural language, like version control commit messages, source code comments, or bug reports. Integrated software development environments (IDEs) are widely used, but they are only concerned with structured software artifacts – they do not offer support for analyzing unstructured natural language and relating this knowledge to the source code. We present an integration of natural language processing capabilities into the Eclipse framework, a widely used software IDE. It allows executing NLP analysis pipelines through the Semantic Assistants framework, a service-oriented architecture for brokering NLP services based on GATE. We demonstrate a number of semantic analysis services helpful in software engineering tasks, and evaluate one task in detail: the quality analysis of source code comments.

1 Introduction

Software engineering is a knowledge-intensive task. A large amount of that knowledge is embodied in natural language artifacts, like requirements documents, user's guides, source code comments, or bug reports. While knowledge workers in other domains now routinely make use of natural language processing (NLP) and text mining algorithms, software engineers still have only limited support for dealing with natural language artifacts. Existing software development environments (IDEs) can only handle syntactic aspects (e.g., formatting comments) and some basic forms of analysis (e.g., spell-checking). More sophisticated NLP analysis tasks have been proposed for software engineering, but have so far not been integrated with common software IDEs and therefore not been widely adopted.

In this paper, we argue that software engineers can benefit from modern NLP techniques. To be successfully adopted, this NLP must be seamlessly integrated into the software development process, so that it appears alongside other software analysis tasks, like static code analysis or performance profiling. As software engineers are end users, not experts in computational linguistics, NLP services must be presented at a high level of abstraction, without exposing the details of language analysis. We show that this kind of NLP can be brought to software engineers in a generic fashion through a combination of modern software


engineering and semantic computing approaches, in particular service-oriented architectures (SOAs), semantic Web services, and ontology-based user and context models.

We implemented a complete environment for embedding NLP into software development that includes a plug-in for the Eclipse framework (http://www.eclipse.org/), allowing a software engineer to run any analysis pipeline deployed in GATE [1] through the Semantic Assistants framework [2]. We describe a number of use cases for NLP in software development, including named entity recognition and quality analysis of source code comments. An evaluation with end users shows that these NLP services can support software engineers during the software development process.

Our work is significant because it demonstrates, for the first time, how a major software engineering framework can be enhanced with natural language processing capabilities and how a direct integration of NLP analysis with code analysis can provide new levels of support for software development. Our contributions include (1) a ready-to-use, open source plug-in to integrate NLP services into the Eclipse software development environment (IDE); (2) novel NLP services suitable for interactive execution in a software engineering scenario; and (3) an evaluation of a software comment quality assurance service demonstrating the usefulness of NLP services, evaluated against annotations manually created by a large group of software engineering students.

2 Software Engineering Background

From a software engineer's perspective, natural language documentation contains valuable information on both functional and non-functional requirements, as well as information related to the application domain. This knowledge is often difficult or impossible to extract from source code alone [3].

One of our application scenarios is the automation of source code comment quality analysis, which so far has to be performed manually. The motivation for automating this task arises from the ongoing shift in development methodologies from document-driven (e.g., the waterfall model) towards agile development (e.g., Scrum). This paradigm shift leads to situations where the major documentation, such as software requirements specifications or design and implementation decisions, is only available in the form of source code comments. Therefore, the quality of this documentation becomes increasingly important for developers attempting to perform the various software engineering and maintenance tasks [4].

Any well-written computer program should contain a sufficient number of comments to permit people to understand it. Without documentation, future developers and maintainers are forced to make dangerous assumptions about the source code, scrutinize the implementation, or even interrogate the original author if possible [5]. Development programmers should prepare these comments while they are coding and update them as the programs change. There exist different types of guidelines for in-line documentation, often in the form


of programming standards. However, quality assurance for these comments, beyond syntactic features, currently has to be performed manually.

3 Design of the NLP/Eclipse Integration

We start the description of our work by discussing the requirements and design decisions for integrating NLP with the Eclipse platform.

3.1 Requirements

Our main goal is to bring NLP to software engineers by embedding it into a current software development environment used for creating, modifying, and analysing source code artifacts. There are a number of constraints for such an integration: It must be possible to use NLP on existing systems without requiring extensive re-installations or re-configurations on the end user's side; it must be possible to execute NLP services remotely, so that it is not necessary to install heavy-weight NLP tools on every system; the integration of new services must be possible for language engineers without requiring extensive system knowledge; it must be generic, i.e., not tied to a concrete NLP service, so that new services can be offered by the server and dynamically discovered by the end user; and the services must be easy to execute from an end user's perspective, without requiring knowledge of NLP or semantic technologies. Our solution to these requirements is a separation of concerns, which directly addresses the skill sets and requirements of computational linguists (developing new NLP analysis pipelines), language engineers (integrating these services), and end users (requesting these services). The Web service infrastructure for brokering NLP services has been previously implemented in the open source Semantic Assistants architecture [2] (Fig. 1).

Developing new client plug-ins is one of the extension points of the Semantic Assistants architecture, bringing further semantic support to commonly used tools. Here, we chose the Eclipse platform, which is a major software development framework used across a multitude of languages, but the same ideas can be implemented in other IDEs (like NetBeans).

3.2 An Eclipse Plug-in for NLP

Eclipse is a multi-language software development environment, comprising an IDE and an extensible plug-in system. Eclipse is not a monolithic program but rather a small kernel that employs plug-ins in order to provide all of its functionality. The main requirements for an NLP plug-in are: (1) a GUI integration that allows users to enquire about available assistants and (2) to execute a desired NLP service on a set of files or even complete projects inside the workspace, without interrupting the user's task at hand. (3) On each enquiry request, a list of NLP services relevant to the user's context must be dynamically generated and presented to the user. The user does not need to be concerned about making any changes on the client side – any new NLP service existing in the

Fig. 1. The Semantic Assistants architecture, brokering NLP pipelines through Web services to connected clients, including the Eclipse client described here

project resources must be automatically discovered through its OWL metadata, maintained by the architecture. Finally, (4) NLP analysis results must be presented in a form that is consistent with the workflow and visualization paradigm in a software IDE; e.g., mapping detected NL 'defects' to the corresponding line of code in the editor, similar to code warnings displayed in the same view.

3.3 Converting Source Code into an NLP Corpus

A major software engineering artifact is source code. If we aim to support NLP analysis in the software domain, it must be possible to process source code using standard NLP tools, e.g., in order to analyze comments, identifiers, strings, and other NL components. While it is technically possible to load source code into a standard NLP tool, the unusual distribution of tokens will have a number of side effects on standard analysis steps, like part-of-speech tagging or sentence splitting. Rather than writing custom NLP tools for the software domain, we propose to convert a source code file into a format amenable to NLP tools.

In the following, we focus on Java due to space restrictions, but the same ideas apply to other programming languages as well. To convert Java source code into a standard representation, it is possible to apply a Java fact extraction tool such as JavaML, Japa, or JavaCC and transform the output into the desired format. The tool that provides the most information regarding the constructs found in Javadoc comments [6] is the Javadoc tool. Javadoc's standard doclet generates API documentation using the HTML format. While this is convenient for human consumption, automated NLP analysis applications require a more structured XML format. When loading HTML documents generated using the standard doclet into an NLP framework (Fig. 2, left), the elements of an HTML tag are interpreted as being entities of an annotation. For example, the Java package (org.argouml.model) is interpreted as being of the type h2. This is because the Javadoc standard doclet extraction tool marked up the package using the <h2></h2> tags. As a result, additional processing is required in order to

Fig. 2. Javadoc generated documentation loaded within an NLP Framework

identify the entity as being a package. In contrast, an XML document (Fig. 2, right), where the elements of the XML tags coincide with the encapsulated entity, clearly identifies them as being a Package, Class, etc. For transforming the Javadoc output into an XML representation, we designed a doclet capable of generating XML documents. The SSL Javadoc Doclet [7] converts class, instance variable, and method identifiers and Javadoc comments into an XML representation, thereby creating a corpus that NLP services can analyse more easily.
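To illustrate the idea (this is a minimal sketch, not the actual SSL Javadoc Doclet), a doclet based on the legacy com.sun.javadoc API could emit such an XML view of classes, methods, and their comments as follows:

    import com.sun.javadoc.ClassDoc;
    import com.sun.javadoc.MethodDoc;
    import com.sun.javadoc.RootDoc;

    /** Minimal sketch of a doclet that emits classes, methods, and their
     *  Javadoc comments as XML (illustrative only; not the SSL Javadoc Doclet). */
    public class XmlCorpusDoclet {

        /** Entry point called by the javadoc tool for a custom doclet. */
        public static boolean start(RootDoc root) {
            StringBuilder xml = new StringBuilder("<corpus>\n");
            for (ClassDoc cls : root.classes()) {
                xml.append("  <Class name=\"").append(cls.qualifiedName()).append("\">\n");
                xml.append("    <Comment>").append(escape(cls.commentText())).append("</Comment>\n");
                for (MethodDoc method : cls.methods()) {
                    xml.append("    <Method name=\"").append(method.name()).append("\">\n");
                    xml.append("      <Comment>").append(escape(method.commentText())).append("</Comment>\n");
                    xml.append("    </Method>\n");
                }
                xml.append("  </Class>\n");
            }
            xml.append("</corpus>");
            System.out.println(xml);   // a real doclet would write one corpus file per run
            return true;
        }

        private static String escape(String text) {
            return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }

Such a doclet would be invoked through the standard javadoc tool, e.g., javadoc -doclet XmlCorpusDoclet -docletpath . -sourcepath src <packages>, here simply printing the XML corpus to standard output.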

4 Implementation

The Semantic Assistants Eclipse plug-in has been implemented as a Java Archive (JAR) file that ships with its own specific implementation and an XML description file that is used to introduce the plug-in to the Eclipse plug-in loader. The plug-in is based on the Model-View-Controller pattern, providing flexibility in presenting annotations generated by various NLP services to the user. User interaction is realized through the Eclipse Standard Widget Toolkit, and service invocations are implemented as Eclipse Job instances, allowing the asynchronous execution of language services.

On each invocation of an NLP service, the plug-in connects to the Semantic Assistants server through the Client-Side Abstraction Layer (CSAL) utility classes. Additional input dialogues are presented to the user to provide NLP service run-time parameters after interpreting the OWL metadata of the selected service. Then, the execution is instantiated as a job, allowing the underlying operating system to schedule and manage the lifecycle of the job. As the execution of the job is asynchronous and runs in the background (if so configured by the user), two Eclipse view parts will be automatically opened to provide real-time logs and the retrieved annotations once the NLP analysis is completed.
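A minimal sketch of such a job, using the standard Eclipse Jobs API, could look as follows; the call into the Client-Side Abstraction Layer is only indicated by a comment, since its exact signature is not given here:

    import org.eclipse.core.runtime.IProgressMonitor;
    import org.eclipse.core.runtime.IStatus;
    import org.eclipse.core.runtime.Status;
    import org.eclipse.core.runtime.jobs.Job;

    /** Sketch of scheduling an NLP service invocation as an Eclipse Job. */
    public final class ServiceInvocationSketch {

        /** Schedules the invocation asynchronously; the platform's job manager runs it. */
        public static void invokeAsynchronously(final String serviceName) {
            Job job = new Job("Semantic Assistants: " + serviceName) {
                @Override
                protected IStatus run(IProgressMonitor monitor) {
                    monitor.beginTask("Invoking " + serviceName, IProgressMonitor.UNKNOWN);
                    try {
                        // A real implementation would call the Semantic Assistants server
                        // through the CSAL classes here and hand the returned annotations
                        // to the 'Semantic Assistants' view on the UI thread.
                        return Status.OK_STATUS;
                    } finally {
                        monitor.done();
                    }
                }
            };
            job.setUser(true);  // show progress to the user
            job.schedule();     // run asynchronously in the background
        }
    }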

Eventually, after a successful execution of the selected NLP service, a set of retrieved results is presented to the user in a dedicated 'Semantic Assistants' view part. The NLP annotations are contained inside dynamically generated tables, presenting one annotation instance per row and providing a one-to-one mapping of annotation instances to entities inside the software artifacts. The plug-in also offers additional, Eclipse-specific features. For instance, when executing source code related NLP services, special markers are dynamically generated to attach annotation instances to the corresponding document (provided the invocation

results contain the position of the generated annotations in the code). This offers a convenient way for users to navigate directly from annotation instances in the Semantic Assistants view to the line of code in the project where it actually belongs, in the same fashion as navigating from compiler warnings and errors to their location in the code.

5 Applications: NLP in Software Development

In this section, we discuss application examples, showing how software engineers can benefit from integrated NLP services. One of them, the quality analysis of source code comments, is presented with a detailed evaluation.

5.1 Working with NLP Services in Eclipse

Once the Semantic Assistants plug-in is successfully installed, users can start using the NLP services directly from the Eclipse environment on the resources available within the current workspace. One of the features of our plug-in is a new menu entry in the standard Eclipse toolbar.

This menu entry allows a user to enquire about available NLP services related to their context. Additionally, users can manually configure the connection to the Semantic Assistants server, which can run locally or remotely. Upon selecting the 'Available Assistants' option, the plug-in connects to the Semantic Assistants server and retrieves the list of available language services generated by the server through reading the NLP service OWL metadata files. Each language service has a name and a brief description explaining what it does. The user then selects individual files or even complete projects as input resources, and finally the relevant NLP service to be executed. The results of a successful service invocation are shown to the user in an Eclipse view part called "Semantic Assistants". In this view, a table is generated dynamically based on the server response and contains all the parsed annotation instances.

For example, in Fig. 5, the JavadocMiner service has been invoked on a Java source code file. Some of the annotations returned by the server bear a lineNumber feature, which attaches an annotation instance to a specific line in the Java source file. After double-clicking on the annotation instance in the Semantic Assistants view, the corresponding resource (here, a .java file) will be opened in an editor and an Eclipse warning marker will appear next to the line defined by the annotation's lineNumber feature.

5.2 Named Entity Recognition

The developed plug-in makes it possible to execute any NLP pipeline deployed in GATE, not just software engineering services. For example, standard information extraction

Fig. 3. Semantic Assistants invocation dialogue in Eclipse, selecting artifacts to send for analysis

(IE) becomes immediately available to software developers. Fig. 4 shows a sample result set of an ANNIE invocation, a named entity recognition service running on the licensing documentation of a Java class. ANNIE can extract various named entities such as Person, Organization, or Location. Here, each row in the table represents a named entity and its corresponding resource file, and bears the exact offset of the entity inside the textual data so it can be easily located. NE recognition can allow a software engineer to quickly locate important concepts in a software artifact, like the names of developers, which is important for a number of tasks, including traceability link analysis.

5.3 Quality Analysis of Source Code Comments

The goal of our JavadocMiner tool [4] is to enable users to automatically assess the quality of source code comments. The JavadocMiner is also capable of providing users with recommendations on how a Javadoc comment may be improved

Fig. 4. Retrieved NLP Annotations from the ANNIE IE Service

based on the “How to Write Doc Comments for the Javadoc Tool” guidelines.2

Directly integrating this tool with the Eclipse framework now allows software engineers to view defects in natural language in the same way as defects in their code.

In-line Documentation and Javadoc. Creating and maintaining documentation has been widely considered an unfavourable and labour-intensive task within software projects [8]. The documentation generators currently developed are designed to lessen the effort needed by developers when documenting software, and have therefore become widely accepted and used. The Javadoc tool [6] provides an inter-weaved representation where documentation is directly inserted into Java source code in the form of comments that are ignored by compilers.

Different types of comments are used to document the different types of identifiers. For example, a class comment should provide insight into the high-level knowledge of a program, e.g., which services are provided by the class, and which other classes make use of these services [9]. A method comment, on the other hand, should provide a low-level understanding of its implementation.

When writing comments for the Javadoc tool, there are a number of guideline specifications that should be followed to ensure high quality comments. The specifications include details such as: (1) Use third person, declarative, rather than second person, prescriptive; (2) Do not include any abbreviations when writing comments; (3) Method descriptions need to begin with verb phrases; and (4) Class/interface/field descriptions can omit the subject and simply state the object. These guidelines are well suited for automation through NLP analysis.
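As an illustration only (a constructed example, not taken from the ArgoUML corpus), a comment following these guidelines might look like this:

    import java.io.File;

    public class CommentGuidelineExample {

        /**
         * Returns the number of Java source files directly contained in the given
         * directory. (Third person, declarative, starting with a verb phrase.)
         *
         * @param directory the directory to scan; non-directories yield a count of 0
         * @return the number of entries whose names end in ".java"
         */
        public static int countJavaFiles(File directory) {
            File[] entries = directory.listFiles();
            if (entries == null) {
                return 0; // not a directory, or an I/O error occurred
            }
            int count = 0;
            for (File entry : entries) {
                if (entry.isFile() && entry.getName().endsWith(".java")) {
                    count++;
                }
            }
            return count;
        }
    }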

Automated Comment Quality Analysis. Integrating the JavadocMiner with our Eclipse plug-in provides for a completely new style of software development, where the analysis of natural language is interweaved with the analysis of code.

Fig. 5 shows an example of an ArgoUML3 method doesAccept loaded within the Eclipse IDE. After analyzing the comments using the JavadocMiner, the developer is made aware of some issues regarding the comment: (1) The PARAMSYNC metric detected an inconsistency between the Javadoc @param annotation and the method parameter list: the developer should modify the annotation to begin with the name of the parameter being documented, "objectToAccept" instead of "object", as indicated in PARAMSYNC Explanation. (2) The readability metrics [4] detected the Javadoc comment as being below the Flesch threshold (FLESCHMetric and FleschExplanation) and above the Fog threshold (FOGMetric and FOGExplanation), which indicates a comment that exceeds the readability thresholds set by the user. (3) Because the comment does not use a third-person writing style as stated in guideline (1), the JavadocMiner generates a recommendation MethodCommentStyle that explains the steps needed in order for the comment to adhere to the Javadoc guidelines.

2 http://oracle.com/technetwork/java/javase/documentation/index-137868.html

3 ArgoUML, http://argouml.tigris.org/

Fig. 5. NLP analysis results on an ArgoUML method within Eclipse

End-User Evaluation. We performed an end-user study to compare how well automated NLP quality analysis in a software framework can match human judgement, by comparing the parts of the in-line documentation that were evaluated by humans with the results of the JavadocMiner. For our case study, we asked 14 students from an undergraduate level computer science class (COMP 354) and 27 students from a graduate level software engineering course (SOEN 6431) to evaluate the quality of Javadoc comments taken from the ArgoUML open source project [10]. For our survey, we selected a total of 110 Javadoc comments:

Fig. 6. A Sample Question from the Survey

15 class and interface comments, 8 field comments, and 87 constructor and method comments. Before participating in the survey, the students were asked to review the Javadoc guidelines discussed earlier. The students had to log into the free online survey tool Kwik Surveys4 using their student IDs, ensuring that all students completed the survey only once. The survey included a set of general questions such as the level of general (Table 1, left) and Java (Table 1, right) programming experience.

The students were able to rate the comments as either Very Poor, Poor, Good, or Very Good, as shown in Fig. 6, giving the comments a 50% chance of being positively or negatively classified. This also enabled us to know how strongly the participants felt about their sentiments, compared to using just a Good or Bad

4 Kwik Surveys, http://www.kwiksurveys.com/

Table 1. Years of general and Java programming experience of study participants

Class       General Experience                 Java Experience
            0 Years   1-2 Years   3+ Years     0 Years   1-2 Years   3+ Years
COMP 354    11%       31%         58%          7%        61%         32%
SOEN 6431   2%        22%         76%          10%       49%         41%

selection. From the 110 manually assessed comments, we selected a total of 67 comments: 5 class and interface comments, 2 field comments, and 60 constructor and method comments that had strong agreement (≥ 60%) as being of either good (39 comments) or bad (28 comments) quality.

When comparing the student evaluation of method comments with some of the NL measures of the JavadocMiner (Table 2), we found that the comments that were evaluated negatively contained half as many words (14) compared to the comments that were evaluated as being good. Despite the insufficient documentation of the bad comments, the Flesch, Fog and Kincaid readability indices indicated text with a higher density, or more complex material, which the students found hard to understand. All of the methods in the survey

Fig. 7. A Sample Answer from the Survey

contained parameter lists that needed to be documented using the @param annotation. When analysing the results of the survey, we found that most students failed to analyze the consistency between the code and comments, as shown in Fig. 7. Our JavadocMiner also detected a total of 8 abbreviations being used within comments, which none of the students mentioned.

Finally, 12 of the 39 comments that were rated by the students as being good were not written in the third person according to the guidelines, a detail that all students also failed to mention.
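For reference, the Flesch, Fog, and Kincaid values reported in Table 2 follow the standard readability formulas; since the paper does not restate the exact parameterization used by the JavadocMiner, the sketch below gives the textbook definitions and assumes that sentence, word, syllable, and complex-word counts (words with three or more syllables) are obtained elsewhere:

    /** Textbook readability formulas behind the Flesch, Fog, and Kincaid columns. */
    final class Readability {
        static double fleschReadingEase(int sentences, int words, int syllables) {
            return 206.835 - 1.015 * ((double) words / sentences)
                           - 84.6 * ((double) syllables / words);
        }
        static double gunningFog(int sentences, int words, int complexWords) {
            return 0.4 * (((double) words / sentences)
                          + 100.0 * ((double) complexWords / words));
        }
        static double fleschKincaidGrade(int sentences, int words, int syllables) {
            return 0.39 * ((double) words / sentences)
                 + 11.8 * ((double) syllables / words) - 15.59;
        }
    }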

6 Related Work

We are not aware of similar efforts for bringing NLP into the realm of software development by integrating it tightly with a software IDE.

Some previous work exists on NLP for software artifacts. Most of this research has focused on analysing texts at the specification level, e.g., in order to automatically convert use case descriptions into a formal representation [11] or to detect inconsistent requirements [12]. In contrast, we aim to support the roles of software developer, maintainer, and quality assurance engineer.

Table 2. Method Comments Evaluated by Students and the JavadocMiner

Student Evaluation   Avg. Number of Words   Avg. Flesch   Avg. Fog   Avg. Kincaid
Good                 28.03                  39.2          12.63      10.55
Bad                  14.79                  5.58          13.98      12.66

There has been effort in the past that focused on analyzing source code comments. For example, in [13] human annotators were used to rate excerpts from Jasper Reports, Hibernate and jFreeChart as being either More Readable, Neutral or Less Readable, as determined by a "Readability Model". The authors of [14] manually studied approximately 1000 comments from the latest versions of Linux, FreeBSD and OpenSolaris. The work attempts to answer questions such as 1) what is written in comments; 2) whom the comments are written for or written by; 3) where the comments are located; and 4) when the comments were written. The authors made no attempt to automate the process.

Automatically analyzing comments written in natural language to detect code-comment inconsistencies was the focus of [15]. The authors explain that such inconsistencies may be viewed as an indication of either bugs or bad comments. The authors implement a tool called iComment that was applied to 4 large Open Source Software projects: Linux, Mozilla, Wine and Apache, and detected 60 comment-code inconsistencies, 33 new bugs and 27 bad comments.

None of the works mentioned in this section attempted to generalize the integration of NLP analysis into the software development process, which is a major focus of our work.

7 Conclusions and Future Work

We presented a novel integration of NLP into software engineering, through a plug-in for the Eclipse platform that makes it possible to execute any existing GATE NLP pipeline (like the ANNIE information extraction system) through a Web service. The Eclipse plug-in, as well as the Semantic Assistants architecture, is distributed as open source software.5 Additionally, we presented an example NLP service, automatic quality assessment of source code comments.

We see the importance of this work in two areas: First, we opened up the domain of NLP to software engineers. While some existing work addressed analysis services before, they have not been adopted in software engineering, as they were not integrated with common software development tools and processes. And second, we demonstrate the importance of investigating interactive NLP, which so far has received less attention than the typical offline corpus studies. Our case study makes a strong case against a human's ability to manage the various aspects of documentation quality without the (semi-)automated help of NLP tools such as the JavadocMiner. By embedding NLP within the Eclipse IDE, developers need to spend less effort when analyzing their code, which we believe will lead to a wider adoption of NLP in software engineering.

5 See http://www.semanticsoftware.info/semantic-assistants-eclipse-plugin

Acknowledgements. This research was partially funded by an NSERC Discovery Grant. The JavadocMiner was funded in part by a DRDC Valcartier grant (Contract No. W7701-081745/001/QCV).

References

1. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Annual Meeting of the ACL (2002)

2. Witte, R., Gitzinger, T.: Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients. In: Domingue, J., Anutariya, C. (eds.) ASWC 2008. LNCS, vol. 5367, pp. 360–374. Springer, Heidelberg (2008)

3. Lindvall, M., Sandahl, K.: How well do experienced software developers predict software change? Journal of Systems and Software 43(1), 19–27 (1998)

4. Khamis, N., Witte, R., Rilling, J.: Automatic Quality Assessment of Source Code Comments: The JavadocMiner. In: Hopfe, C.J., Rezgui, Y., Metais, E., Preece, A., Li, H. (eds.) NLDB 2010. LNCS, vol. 6177, pp. 68–79. Springer, Heidelberg (2010)

5. Kotula, J.: Source Code Documentation: An Engineering Deliverable. In: Int. Conf. on Technology of Object-Oriented Languages, p. 505. IEEE Computer Society, Los Alamitos (2000)

6. Kramer, D.: API documentation from source code comments: a case study of Javadoc. In: SIGDOC 1999: Proceedings of the 17th Annual International Conference on Computer Documentation, pp. 147–153. ACM, New York (1999)

7. Khamis, N., Rilling, J., Witte, R.: Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet. In: New Challenges for NLP Frameworks, Valletta, Malta, ELRA, May 22, pp. 41–45 (2010)

8. Brooks, R.E.: Towards a Theory of the Comprehension of Computer Programs. International Journal of Man-Machine Studies 18(6), 543–554 (1983)

9. Nurvitadhi, E., Leung, W.W., Cook, C.: Do class comments aid Java program understanding? In: Frontiers in Education (FIE), vol. 1 (November 2003)

10. Bunyakiati, P., Finkelstein, A.: The Compliance Testing of Software Tools with Respect to the UML Standards Specification – The ArgoUML Case Study. In: Dranidis, D., Masticola, S.P., Strooper, P.A. (eds.) AST, pp. 138–143. IEEE, Los Alamitos (2009)

11. Mencl, V.: Deriving behavior specifications from textual use cases. In: Proceedings of Workshop on Intelligent Technologies for Software Engineering, pp. 331–341. Oesterreichische Computer Gesellschaft, Linz (2004)

12. Kof, L.: Natural language processing: Mature enough for requirements documents analysis? In: Montoyo, A., Munoz, R., Metais, E. (eds.) NLDB 2005. LNCS, vol. 3513, pp. 91–102. Springer, Heidelberg (2005)

13. Buse, R.P.L., Weimer, W.R.: A metric for software readability. In: Proc. Int. Symp. on Software Testing and Analysis (ISSTA), New York, NY, USA, pp. 121–130 (2008)

14. Padioleau, Y., Tan, L., Zhou, Y.: Listening to programmers – Taxonomies and characteristics of comments in operating system code. In: ICSE 2009, pp. 331–341. IEEE Computer Society, Washington, DC (2009)

15. Tan, L., Yuan, D., Krishna, G., Zhou, Y.: /*icomment: bugs or bad comments?*/. In: SOSP 2007: Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, pp. 145–158. ACM, New York (2007)

Partial Evaluation for Planning in Multiagent Expedition

Y. Xiang and F. Hanshar

University of Guelph, Canada

Abstract. We consider how to plan optimally in a testbed, multiagent expedition (MAE), by centralized or distributed computation. As optimal planning in MAE is highly intractable, we investigate speedup through partial evaluation of a subset of plans, whereby only the intended effect of a plan is evaluated when certain conditions hold. We apply this technique to centralized planning and demonstrate significant speedup in runtime while maintaining optimality. We investigate the technique in distributed planning and analyze the pitfalls.

1 Introduction

We consider a class of stochastic multiagent planning problems termed multiagent expedition (MAE) [8]. A typical instance consists of a large open area populated by objects as well as mobile agents. Agent activities include moving around the area, avoiding dangerous objects, locating objects of interest, and object manipulation depending on the nature of the application. Successful manipulation of an object may require proper actions of a single agent or may require cooperation of multiple agents coordinating through limited communication. Success of an agent team is evaluated based on the quantity of objects manipulated as well as the quality of each manipulation. MAE is an abstraction of practical problems such as planetary expedition or disaster rescue [3].

Planning in MAE may be achieved by centralized or distributed computation. Its centralized version can be shown to be a partially observable Markov decision process (POMDP) and its distributed version can be shown to be a decentralized POMDP (DEC-POMDP). A number of techniques have been proposed for solving POMDPs [4,6]. The literature on DEC-POMDPs is growing rapidly, e.g., [1,5]. Optimal planning is highly intractable in general for either POMDPs or DEC-POMDPs. Inspired by branch-and-bound techniques to improve planning efficiency [2], we propose a method, partial evaluation, that focuses on the intended effect of a plan and skips evaluation of unintended effects when certain conditions are met.

We focus on on-line planning. We experiment with partial evaluation for centralized planning in MAE and demonstrate a significant speedup in runtime while maintaining plan optimality. We also examine its feasibility in distributed planning. It is found to be limited by local optimality without guaranteed global optimality, or by intractable agent communication. This result yields insight into distributed planning that suggests future research on approximate planning.

The remainder of the paper is organized as follows: Section 2 reviews background on MAE. Sections 3-6 present partial evaluation for centralized planning, with experimental results reported in Section 7. Section 8 first reviews background on collaborative design networks (CDNs), a multiagent graphical model for distributed decision making, and then investigates partial evaluation for distributed planning based on CDNs.

2 Background on Multiagent Expedition

In MAE, an open area is represented as a grid of cells (Figure 1(a)). At any cell, an agent can move to an adjacent cell by the actions north, south, east, west, or remain there (halt). An action has an intended effect (e.g., north in Figure 1(d)) and a number of unintended effects (other outcomes in (d)), quantified by transition probabilities.

Fig. 1. a) Grid of cells and reward distribution in MAE. b) Cell reward distribution. c) Agent's perceivable area. d) Intended effect (arrow) of action north.

The desirability of a cell is indicated by a numerical reward. A neutral cell has a reward of a base value β. The reward at a harmful cell is lower than β. The reward at an interesting cell is higher than β and can be further increased through agent cooperation.

When a physical object is manipulated (e.g., by digging), cooperation is often most effective when a certain number of agents are involved, and the per-agent productivity is reduced with more or fewer agents. We denote the most effective level by λ. Figure 1(b) shows the reward distribution of a single cell with λ = 2. At this cell, the reward collected by a single agent is 0.3; if two agents cooperate at the cell, each receives 0.8. Reward decreases with more than λ agents, promoting only effective cooperation.

After a cell has been visited by any agent, its reward is decreased to β. As a result, wandering within a neighbourhood is unproductive. Agents have no prior knowledge of how rewards are distributed in the area. Instead, at any cell, an agent can reliably perceive its location and the reward distribution within a small radius (e.g., the shaded cells in Figure 1(c)). An agent can also perceive the location of another agent and communicate if the latter is within a given radius.

Each agent's objective is to move around the area, cooperate as needed, and maximize the team reward over a finite horizon based on local observations and

limited communication. For a team of n agents and horizon h, there are 5^{nh} joint plans, each of which has 5^{nh} possible outcomes. With n = 6 and h = 2, a total of 5^{24} ≈ 6 × 10^{16} uncertain outcomes need to be evaluated. Hence, solving MAE optimally is highly intractable.

In the following, we refer to maximization of reward and utility interchangeably, with the following assumption: a utility is always in [0, 1], no matter if it is the utility of an action, a plan (a sequence of actions), a joint action (simultaneous actions by multiple agents), or a joint plan (a sequence of joint actions). In each case, the utility is mapped linearly from [min reward, max reward], with min reward and max reward properly defined accordingly.

3 Partial Evaluation

We study how to speed up planning in the context of MAE, based on one idea: partial evaluation. Let a be an action with two possible outcomes: an intended one and an unintended one. The intended outcome has probability p_1 and utility u_1, and the unintended outcome probability p_2 = 1 − p_1 and utility u_2, respectively. Its expected utility is evaluated as

eu = p_1 u_1 + p_2 u_2.  (1)

Let a′ be an alternative action with the same outcome probabilities p_1 (for intended) and p_2, and utilities u_3 and u_4, respectively. Its expected utility is eu′ = p_1 u_3 + p_2 u_4. The alternative action a′ is dominated by a if

eu − eu′ = eu − p_1 u_3 − p_2 u_4 > 0.  (2)

From Eqn (2), the following holds:

u_3 < eu/p_1 − (p_2/p_1) u_4  (3)

Letting u_max denote the maximum utility achievable, we have

eu/p_1 − (p_2/p_1) u_max ≤ eu/p_1 − (p_2/p_1) u_4.  (4)

Eqn (3) is guaranteed to hold if we maintain

u_3 < eu/p_1 − ((1 − p_1)/p_1) u_max ≡ t.  (5)

When the number of alternative actions is large, the above idea can be used to speed up the search for the best action: for an unevaluated action a′, if u_3 satisfies Eqn (5), discard a′. We say that a′ is partially evaluated. Otherwise, eu′ will be fully evaluated. If eu′ exceeds eu, then a will be updated to a′, eu updated to eu′, and u_1 updated to u_3.

Eqn (5) allows a more efficient search without losing optimality, and is an exact criterion for partial evaluation. The actual speed-up depends on the threshold t for u_3. The larger the value of t, the fewer actions must be fully evaluated, and the more efficient the search.
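A minimal sketch of the resulting search loop follows, assuming the two-outcome model above; the Action type and the array-based enumeration of alternatives are illustrative, not from the paper:

    /** Two-outcome action model: an intended and an aggregated unintended utility. */
    final class Action {
        final double intendedUtility;    // u_1 for the incumbent, u_3 for a candidate
        final double unintendedUtility;  // u_2 / u_4
        Action(double intended, double unintended) {
            this.intendedUtility = intended;
            this.unintendedUtility = unintended;
        }
    }

    final class PartialEvaluationSearch {
        /** Returns the action with the highest expected utility, discarding
         *  candidates whose intended utility falls below the threshold of Eqn (5). */
        static Action best(Action[] actions, double p1, double uMax) {
            double p2 = 1.0 - p1;
            Action best = actions[0];
            double eu = p1 * best.intendedUtility + p2 * best.unintendedUtility; // Eqn (1)
            for (int i = 1; i < actions.length; i++) {
                Action candidate = actions[i];
                double t = eu / p1 - (p2 / p1) * uMax;   // threshold of Eqn (5)
                if (candidate.intendedUtility < t) {
                    continue;                            // partially evaluated: discard
                }
                double euCandidate = p1 * candidate.intendedUtility
                                   + p2 * candidate.unintendedUtility;  // full evaluation
                if (euCandidate > eu) {                  // update the incumbent
                    best = candidate;
                    eu = euCandidate;
                }
            }
            return best;
        }
    }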

Consider the value of u_max. When utility is bounded by [0, 1], we have the obvious option u_max = 1. That is, we derive u_max from the global utility distribution over all outcomes of actions. Threshold t increases as u_max decreases. Hence, it is desirable to use a smaller u_max while maintaining Eqn (4). One option to achieve this is to use u_max from the local utility distribution over only the outcomes of the current alternative actions. The trade-off is the following: with u_max = 1, it is a constant; with the localized u_max, it must be updated before each planning step.

4 Single-Agent Expedition

In single-agent expedition, an action a has an intended outcome and four unintended ones. We assume that the intended outcomes of all actions have the same probability p_1, and the unintended outcomes have the same probability (1 − p_1)/4. Hence, we have

eu = p_1 u_1 + ∑_{i=1}^{4} u_{2,i} (1 − p_1)/4,  (6)

where u_{2,i} is the utility of the ith unintended outcome. Comparing Eqn (1) and Eqn (6), we have

p_2 u_2 = ∑_{i=1}^{4} u_{2,i} (1 − p_1)/4 = (1 − p_1) ((1/4) ∑_{i=1}^{4} u_{2,i}).

If we aggregate the four unintended outcomes as an equivalent single unintended outcome, then this outcome has probability p_2 = 1 − p_1 and utility u_2 = (1/4) ∑_{i=1}^{4} u_{2,i}.

Let ua_max (where 'a' in 'ua' refers to 'agent') denote the maximum utility of outcomes. Substituting u_2 in Eqn (2) by (1/4) ∑_{i=1}^{4} u_{2,i}, repeating the analysis after Eqn (2), and noting that (1/4) ∑_{i=1}^{4} u_{2,i} is upper-bounded by ua_max, we have an exact criterion for partial evaluation:

u_3 < t = (1/p_1) eu − ((1 − p_1)/p_1) ua_max  (7)

As discussed in the last section, the smaller the value of ua_max, the more efficient the search. Since ua_max was replacing (1/4) ∑_{i=1}^{4} u_{4,i} (compare Eqns (3) and (5)), we can alternatively replace (1/4) ∑_{i=1}^{4} u_{4,i} with an upper bound tighter than ua_max. Since (1/4) ∑_{i=1}^{4} u_{4,i} is essentially the average utility over unintended outcomes, we can replace ua_max by α ua_avg, where ua_avg is the average (local) utility of outcomes and α ≥ 1 is a scaling factor. This yields the following:

u_3 < t = (1/p_1) eu − ((1 − p_1)/p_1) α ua_avg  (8)

According to Chebyshev's inequality, the smaller the variance of utilities over outcomes, the closer to 1 the α value can be without losing planning optimality.

5 Single Step MAE by Centralized Planning

Next, we consider multiagent expedition with n agents. Each agent action has k alternative outcomes o_1, ..., o_k, where o_1 is the intended outcome with probability p. A joint action by n agents consists of a tuple of n individual actions and is denoted by a. The intended outcome of a is the tuple made of the intended outcomes of the individual actions, and is unique. We denote the utility of the intended outcome of a by u. Outcomes of individual agent actions are independent of each other given the joint action plan. Hence, the intended outcome of a has probability p^n. The expected utility of a is

eu = p^n u + ∑_i p_i u_i,  (9)

where i indexes unintended outcomes, u_i is the utility of an unintended outcome, and p_i is its probability. Note that p_i ≠ p_j in general for i ≠ j, and p^n + ∑_i p_i = 1.

Let a′ be an alternative joint action whose intended outcome has utility u′. Denote the expected utility of a′ by eu′. The joint action a′ is dominated by joint action a if

eu − eu′ = eu − p^n u′ − ∑_i p_i u′_i > 0.  (10)

Eqn (10) can be rewritten as follows:

u′ < p^{−n} (eu − ∑_i p_i u′_i)

Let uts_avg (where 't' in 'uts' refers to 'team' and 's' refers to 'single step') denote the average utility of outcomes of joint actions. From

0 < p_i < 1 − p^n,   0 < p_i/(1 − p^n) < 1,   ∑_i p_i/(1 − p^n) = 1,   ∑_i p_i u′_i = (1 − p^n) ∑_i (p_i/(1 − p^n)) u′_i,

we have the expected value of ∑_i (p_i/(1 − p^n)) u′_i (a weighted mean with normalized weights) to be uts_avg, and the expected value of ∑_i p_i u′_i to be (1 − p^n) uts_avg.

We can choose α ≥ 1 (e.g., based on Chebyshev's inequality) so that with high probability ∑_i p_i u′_i ≤ (1 − p^n) α uts_avg, and hence eu − ∑_i p_i u′_i ≥ eu − (1 − p^n) α uts_avg. It then follows from Eqn (10) that the joint action a′ is dominated by a with high probability if the following holds,

u′ < t = eu/p^n − ((1 − p^n)/p^n) α uts_avg,  (11)

in which case a′ can be discarded without full evaluation. Note that the condition is independent of k.

In order for any agent Ag to compute u′, it needs to know the intended outcome of the action in a′ for each other agent, and use this information to determine if any cooperation occurs in the intended outcome of a′. To do so, it suffices for

Ag to know the current location of each agent as well as a′. Ag also needs to know the unilateral or cooperative reward associated with the intended outcome to calculate u′. When other agents are outside of the observable area of Ag, this information must be communicated to Ag. Similarly, in order to compute uts_avg, Ag needs to collect from the other agents the average rewards in their local areas.

Alternatively, following a similar analysis, we could base the threshold t on uts_max, the maximum utility achievable by the outcome of any joint action, and test u′ by the following condition:

u′ < t = eu/p^n − ((1 − p^n)/p^n) uts_max  (12)

Since uts_max > α uts_avg, the search is less efficient, but its probability of obtaining the optimal plan is 1. To compute uts_max, Ag needs to collect from the other agents the maximum rewards in their local areas, instead of the average rewards as in the case of uts_avg.
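As a small illustration, once an agent has aggregated eu, p, n, and uts_avg or uts_max, the thresholds of Eqns (11) and (12) can be computed directly; the helper methods below are ours, not part of the paper:

    final class JointThresholds {
        /** Threshold of Eqn (11), based on the scaled average utility uts_avg. */
        static double thresholdAvg(double eu, double p, int n, double alpha, double utsAvg) {
            double pn = Math.pow(p, n);                        // probability of the intended joint outcome
            return eu / pn - ((1.0 - pn) / pn) * alpha * utsAvg;
        }

        /** Threshold of Eqn (12), based on the maximum achievable utility uts_max. */
        static double thresholdMax(double eu, double p, int n, double utsMax) {
            double pn = Math.pow(p, n);
            return eu / pn - ((1.0 - pn) / pn) * utsMax;
        }
    }

For the multi-step case of Eqns (15) and (16) below, p^n is simply replaced by p^{hn}.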

6 Multi-step MAE by Centralized Planning

Consider multiagent expedition with horizon h ≥ 2 (a single step is equivalent to h = 1). Each agent selects a sequence a of h actions. The n agents collectively select a joint plan A (an n × h array). The intended outcome of joint plan A is made of the intended outcomes of all individual actions of all agents. Assume that the outcome of each individual action of each agent is independent of the outcomes of its own past actions and is independent of the outcomes of actions of other agents (as is the case in MAE). Then the probability of the intended outcome of joint plan A is p^{hn}.

We denote the utility of the intended outcome of A by u. The expected utility of A is then

eu = p^{hn} u + ∑_i p_i u_i,  (13)

where i indexes unintended outcomes, u_i is the utility of an unintended outcome, and p_i is its probability. Note that p^{hn} + ∑_i p_i = 1.

Let A′ be an alternative joint plan whose intended outcome has utility u′. Denote the expected utility of A′ by eu′. The joint plan A′ is dominated by A if

eu − eu′ = eu − p^{hn} u′ − ∑_i p_i u′_i > 0.  (14)

Through an analysis similar to that in the last section, and from the similarity of Eqns (14) and (10), we can conclude the following: Let utm_avg (where 'm' in 'utm' refers to 'multi-step') denote the average utility of outcomes of joint plans. Let α ≥ 1 be a scaling factor. With a large enough α value, the joint plan A′ is dominated with high probability by plan A if the following inequality holds,

u′ < t = eu/p^{hn} − ((1 − p^{hn})/p^{hn}) α utm_avg,  (15)

in which case A′ can be discarded without full evaluation.

In order for any agent Ag to compute u′, it needs to know A′, the current location of each agent, and the unilateral or cooperative rewards associated with the intended outcomes. In order to compute utm_avg, Ag needs to collect from the other agents the average rewards in their local areas.

To increase the probability of plan optimality to 1, Ag can use the following test, at the price of a less efficient search:

u′ < t = eu/p^{hn} − ((1 − p^{hn})/p^{hn}) utm_max  (16)

7 Centralized Planning Experiment

The experiment aims to provide empirical evidence on the efficiency gain and optimality of partial evaluation in multi-step MAE by centralized planning. Two MAE environments are used that differ in the transition probability p_t (0.8 or 0.9) for intended outcomes. Agent teams of size n = 3, 4 or 5 are run. The base reward is β = 0.05. The most effective level of cooperation is set at λ = 2. The planning horizon is h = 2.

Several threshold values from Section 6 are tested. The first, utm_{max,1} = 1, corresponds to the global maximum reward. The second, utm_max, corresponds to the local maximum reward for each agent. The third, utm_{avg,α} = α utm_avg, corresponds to the average reward over outcomes, scaled up by α. We report results for α = 1 as well as for a lower bound on α that yields an optimal plan, found by increasing α in 0.25 increments.

Tables 1 and 2 show the results for different values of p_t. Each row corresponds to an experiment run. Full% refers to the percentage of plans fully evaluated. BFR denotes the team reward of the best joint plan found, and an asterisk indicates that the plan is optimal. BFR% denotes the ratio of BFR to the reward of the optimal plan. Time denotes the runtime in seconds.

The results show that partial evaluation based on utm_{max,1} is conservative: all plans are fully evaluated in 4 out of 6 runs. Second, utm_max finds an optimal plan

Table 1. Experiments with p_t = 0.9

n  Threshold      Full%   BFR     BFR%    Time
3  utm_max,1      48.87   3.192*  100     3.3
3  utm_max        0.780   3.192*  100     0.3
3  utm_avg,1      0.172   3.102   97.18   0.1
3  utm_avg,3      0.812   3.192*  100     0.3
4  utm_max,1      83.51   4.940*  100     142.6
4  utm_max        0.053   4.940*  100     2.5
4  utm_avg,1      0.046   4.940*  100     1.9
5  utm_max,1      100     5.262*  100     4671.2
5  utm_max        0.002   5.046   95.89   52.2
5  utm_avg,1      0.001   5.046   95.89   52.1
5  utm_avg,5      0.19    5.262*  100     62.4

Table 2. Experiments with p_t = 0.8

n  Threshold      Full%   BFR     BFR%    Time
3  utm_max,1      100     2.407*  100     6.2
3  utm_max        2.0     2.407*  100     0.2
3  utm_avg,1      0.16    2.327   96.67   0.1
3  utm_avg,3      2.25    2.407*  100     0.2
4  utm_max,1      100     3.630*  100     167.3
4  utm_max        0.068   3.630*  100     19.6
4  utm_avg,1      0.051   3.630*  100     19.0
5  utm_max,1      100     3.902*  100     6479.5
5  utm_max        0.002   3.745   95.97   53.5
5  utm_avg,1      0.001   3.745   95.97   52.3
5  utm_avg,4.5    1.704   3.902*  100     136.0

in 4 out of 6 runs, and utm_{avg,1} in 2 out of 6 runs. Third, partial evaluation based on utm_max and utm_{avg,α} shows a significant speedup on all runs. For example, with p_t = 0.8, n = 5 and utm_{avg,α}, an optimal plan is found when α = 4.5 and only 1.7% of joint plans are fully evaluated. The planning takes 136 seconds, or 2% of the runtime (108 min) of utm_{max,1}, which evaluates all plans fully.

Table 3. Mean (μ) and standard deviation (σ) of team rewards over all plans

n  # Plans     μ (p_t = 0.9)   σ (p_t = 0.9)   μ (p_t = 0.8)   σ (p_t = 0.8)
3  15,625      0.558           0.342           0.542           0.260
4  390,625     0.738           0.462           0.713           0.352
5  9,765,625   0.914           0.514           0.882           0.342

Table 3 shows the mean and standard deviation of team rewards over all joint plans for n = 3, 4 and 5, and p_t = 0.8 and 0.9. The mean team reward in each case is no more than 23% of the corresponding optimal reward in Tables 1 and 2. For example, for n = 5 and p_t = 0.8, the optimal reward from Table 2 is 3.902 whereas the mean reward is 0.882, approximately 23% of the magnitude of the optimal plan. This signifies that the search space is full of low-reward plans with very few good plans. Searching such a plan space is generally harder than searching a space full of high-reward plans. The result demonstrates that partial evaluation is able to traverse the search space, skip full evaluation of many low-reward plans, and find high-reward plans. This is true even for the relatively aggressive threshold utm_{avg,1}, which achieves at least 95% of the optimal reward (see Table 2).

8 Partial Evaluation in Distributed Planning

8.1 Collaborative Design Networks

Distributed planning in MAE can be performed based on multiagent graphical models, known as collaborative design networks (CDNs) [8], whose background is reviewed in this subsection. CDN is motivated by industrial design in supply chains. An agent responsible for a component encodes design knowledge into a design network (DN) S = (V, G, P). The domain is a set of discrete variables V = D ∪ T ∪ M ∪ U. D is a set of design parameters. T is a set of environmental factors of the product under design. M is a set of objective performance measures and U is a set of subjective utility functions of the agent.

Dependence structure G = (V, E) is a directed acyclic graph (DAG) whose nodes are mapped to elements of V and whose set E of arcs is from the following legal types: Arc (d, d′) (d, d′ ∈ D) signifies a design constraint. Arc (d, m) (m ∈ M) represents dependency of performance on design. Arc (t, t′) (t, t′ ∈ T) represents dependency between environmental factors. Arc (t, m) signifies dependency of performance on environment. Arc (m, m′) defines a composite performance measure. Arc (m, u) (u ∈ U) signifies dependency of utility on performance.

P is a set of potentials, one for each node x, formulated as a probability distribution P(x|π(x)), where π(x) are the parent nodes of x. P(d|π(d)), where d ∈ D, encodes a design constraint. P(t|π(t)) and P(m|π(m)), where t ∈ T, m ∈ M, are typical probability distributions. Each utility variable has a space {y, n}. P(u = y|π(u)) is a utility function util(π(u)) ∈ [0, 1]. Each node u is assigned a weight k ∈ [0, 1], where ∑_U k = 1. With P thus defined, ∏_{x ∈ V \ U} P(x|π(x)) is a joint probability distribution (JPD) over D ∪ T ∪ M. Assuming additive independence among utility variables, the expected utility of a design d is EU(d) = ∑_i k_i (∑_m u_i(m) P(m|d)), where d (bold) is a configuration of D, i indexes utility nodes in U, m (bold) is a configuration of the parents of u_i, and k_i is the weight of u_i.

Each supplier is a designer of a supplied component. Agents, one per supplier,

form a collaborative design system. Each agent embodies a DN called a subnet, and agents are organized into a hypertree: each hypernode corresponds to an agent and its subnet. Each hyperlink (called an agent interface) corresponds to design parameters shared by the two subnets, which renders them conditionally independent. They are public, and other subnet variables are private. The hypertree specifies to whom an agent communicates directly. Each subnet is assigned a weight w_i, representing a compromise of preferences among agents, where ∑_i w_i = 1. The collection of subnets {S_i = (V_i, G_i, P_i)} forms a CDN.

Figure 2 shows a trivial CDN for agents A0, A1, A2.

Fig. 2. Subnets G0, G1, G2 (left) and hypertree (right) of a CDN. Design nodes are denoted by s if public and d if private, performance nodes by m, and utility nodes by u.

The product ∏_{x ∈ V \ ∪_i U_i} P(x|π(x)) is a JPD over ∪_i (D_i ∪ T_i ∪ M_i), where P(x|π(x)) is associated with node x in a subnet. The expected utility of a design d is EU(d) = ∑_i w_i (∑_j k_{ij} (∑_m u_{ij}(m) P(m|d))), where d is a configuration of ∪_i D_i, i indexes subnets, j indexes utility nodes {u_{ij}} in the ith subnet, m is a configuration of the parents of u_{ij}, and k_{ij} is the weight associated with u_{ij}. Given a CDN, decision-theoretical optimal design is well defined.
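To make the aggregation concrete, the following sketch computes EU(d) for a single fixed design d from the subnet weights and per-node utility tables; the array layout, which enumerates the parent configurations m by index, is an illustrative simplification and not the CDN implementation:

    /** Sketch of EU(d) = sum_i w_i * sum_j k_ij * sum_m u_ij(m) * P(m|d) for one design d. */
    final class CdnExpectedUtility {
        // w[i]: weight of subnet i; k[i][j]: weight of utility node j in subnet i;
        // util[i][j][m] and prob[i][j][m]: utility and probability of parent configuration m.
        static double expectedUtility(double[] w, double[][] k,
                                      double[][][] util, double[][][] prob) {
            double eu = 0.0;
            for (int i = 0; i < w.length; i++) {               // subnets
                double subnetEu = 0.0;
                for (int j = 0; j < k[i].length; j++) {        // utility nodes of subnet i
                    double nodeEu = 0.0;
                    for (int m = 0; m < util[i][j].length; m++) {  // parent configurations
                        nodeEu += util[i][j][m] * prob[i][j][m];
                    }
                    subnetEu += k[i][j] * nodeEu;
                }
                eu += w[i] * subnetEu;
            }
            return eu;
        }
    }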

Agents evaluate local designs in batch before communicating over agent interfaces. An arbitrary agent is chosen as the communication root. Communication is divided into collect and distribute stages. Collect messages propagate expected utility evaluations of local designs inwards along the hypertree towards the root. A receiving agent knows the best utility of every local configuration when extended by partial designs in downstream agents. At the end of the collect stage, the root agent knows the expected utility of the optimal design. Distribute messages propagate outwards along the hypertree from the root. After the distribute stage, each agent has

identified its local design that is globally optimal (collectively maximizing EU(d)). Computation (incl. communication) is linear in the number of agents [7] and is efficient for sparse CDNs.

8.2 Distributed Per-plan Evaluation

We consider partial evaluation in distributed planning based on CDNs. Each MAE agent uses a DN to encode its actions (moves) as design nodes, outcomes of actions as performance nodes, and rewards as utility nodes. The hypertree for a team of agents (A, B, C) and the DN for agent B are shown in Figure 3. An agent only models and communicates with adjacent agents on the hypertree. Movement nodes are labelled mv, performance nodes are labelled ps, and utility nodes are labelled rw.

Fig. 3. (a) DN for MAE agent B. (b) Hypertree.

Fig. 4. Message collection, where D_x is the domain of x

As shown earlier, partial evaluation relies on sequentially evaluating (fully or partially) individual joint plans. A distributed per-plan evaluation involves four technical issues: (1) How can a joint plan be evaluated fully? (2) How can it be evaluated partially? (3) As the root agent drives sequential per-plan evaluations, how can it know the total number of joint plans when it does not know other agents' private variables? (4) When a given joint plan is being evaluated, how does each agent know which local plan should be evaluated when it does not know the joint plan as a whole?

First, the existing distributed MAE planning by CDN [8] processes all plans in one batch. At the end of the collect stage, the root agent knows the utility of the optimal plan. If we reduce the batch to a single joint plan, at the end of the collect stage, the root would know the expected utility of that plan.

Second, to evaluate a joint plan partially, instead of passing expected utility, collect messages should contain utility based only on the intended outcome.

Third, we propose a method for the root to determine the total number of joint plans. Consider the hypertree in Figure 4 over agents A, B, C and D with root A. Assume that x, y and z are the only action variables and are public (there are no private action variables in MAE). Each agent i maintains a counting variable d_i: the number of joint plans over the agents downstream from i. Root A initiates message collection along the hypertree (Figure 4). Leaf agent D passes to C the message

d_D = 1 (no downstream agent). C passes d_C = d_D · |D_z| = 4 to B, and B passes d_B = d_C · |D_y| = 12 to A. In the end, A computes the total number of joint plans as d_A = d_B · |D_x| = 24.

Fourth, as any joint plan is evaluated, each agent needs to know how to instantiate its local (public) variables accordingly. For instance, B needs to know the values of x and y, but not z. We assume that the order of the domain values of each public variable, e.g., x ∈ (x_0, x_1), is known to the corresponding agents. Joint plans are lexicographically ordered based on the domains of the public variables. Hence, the 0th joint plan corresponds to (x_0, y_0, z_0), and the 22nd to (x_1, y_2, z_2).

We propose a message distribution for each agent to determine the values of its local variables according to the current joint plan. Each agent i maintains a working index wr_i. Root A sets wr_A to the index of the current joint plan. Each other agent receives wr_i in a message. The index of a variable, say x, is denoted by x_inx.

Suppose A initiates message distribution with wr_A = 22. A computes x_inx = ⌊(wr_A mod d_A)/d_B⌋ = 1, where mod and ⌊ ⌋ denote the modulo and floor operations. A passes the index wr_B = wr_A mod d_A = 22 to B. B computes x_inx = ⌊wr_B/d_B⌋ = 1 and y_inx = ⌊(wr_B mod d_B)/d_C⌋ = 2. B passes to C the index wr_C = wr_B mod d_B = 10. Similar computations at C and D determine z_inx = 2.
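The index arithmetic generalizes to a chain of agents by repeated floor division and modulo. The following sketch, with an array of domain sizes ordered from root to leaf (an illustrative simplification of the hypertree), reproduces the example above, decoding plan index 22 into (x_1, y_2, z_2):

    /** Sketch of decoding a joint-plan index into local variable indices. */
    final class PlanIndexDecoder {
        // domainSizes ordered from root to leaf, e.g., {2, 3, 4} for |Dx|, |Dy|, |Dz|.
        static int[] decode(int planIndex, int[] domainSizes) {
            int k = domainSizes.length;
            int[] d = new int[k + 1];
            d[k] = 1;                                   // leaf agent: d_D = 1
            for (int i = k - 1; i >= 0; i--) {
                d[i] = d[i + 1] * domainSizes[i];       // collect stage: d_C = 4, d_B = 12, d_A = 24
            }
            int[] indices = new int[k];
            int wr = planIndex % d[0];                  // working index at the root
            for (int i = 0; i < k; i++) {
                indices[i] = wr / d[i + 1];             // floor division picks the local value
                wr = wr % d[i + 1];                     // remainder is passed downstream
            }
            return indices;                             // decode(22, {2, 3, 4}) yields {1, 2, 2}
        }
    }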

The above can be combined for distributed planning with partial evaluation. It consists of a sequence of message collections followed by one message distribution. The first collection fully evaluates the first joint plan. Local maximum and average utilities from the agents are also collected and aggregated for use in all subsequent evaluations (Section 6).

The second collection calls for a partial evaluation (Section 3) of the next joint plan. Upon receiving the response, A determines if the second joint plan needs full evaluation or can be discarded. If full evaluation is needed, A issues the next collection as a full evaluation of the second plan. Otherwise, a call for partial evaluation of the third joint plan is issued. This process continues until all joint plans are evaluated.

One distribution is used after all plans are evaluated to communicate the optimal plan. If the 22nd joint plan is optimal, a message distribution as described earlier suffices for each agent to determine its optimal local plan.

It can be shown that the above protocol achieves the same level of optimality as centralized planning. However, for each joint plan, one round of communication is required, resulting in an amount of communication exponential in the number of agents and the horizon length. This differs from the existing method for planning in CDNs (Section 8.1), where two rounds of communication are sufficient.

8.3 Aggregation of Local Evaluation

Given the above analysis, we consider an alternative that attempts to avoid intractable communication: each agent applies partial evaluation to evaluate all local plans in a single batch. The results are then assembled through message passing in order to obtain the optimal joint plan. After local evaluation, agent i has a set E_i of fully evaluated local plans and a set L_i from partial evaluation. From the analysis in Section 3, E_i contains the locally optimal plan at i.

Table 4. Utilities for MAE team

Joint Plan   U_A   U_B   U_C   U_ABC
P1           0.3   0.3   0.3   0.9
P2           0.6   0.1   0.1   0.8
P3           0.1   0.6   0.1   0.8
P4           0.1   0.1   0.6   0.8

Consider some selected joint plans in Table 4 for agents A, B and C. Each row corresponds to an evaluated joint plan. Each agent i evaluates expected utilities locally, as shown in the U_i column. Overall expected utilities are given in the U_ABC column as the sum of the local values. Joint plan P2 is the best according to the evaluation by agent A. P3 and P4 are the best according to B and C, respectively. All of them are inferior to P1.

From the above illustration, the following can be concluded: optimal planning cannot be obtained from independent local partial evaluations in general. It cannot be obtained based on the E_i, the L_i, or their combination.

9 Conclusion

The main contribution of this work is the method of partial evaluation for centralized planning in uncertain environments such as MAE. The key assumption on the environment is that each agent action has a distinguished intended outcome whose probability is independent of (or approximately independent of) the chosen action. This assumption seems to be valid for many problem domains where actions normally achieve their intended consequences and failures are rare occurrences. We devised simple criteria to divide planning computation into full and partial evaluations, allowing only a small subset of alternative plans to be fully evaluated while maintaining optimal or approximately optimal planning. Significant efficiency gains are obtained in our experiments.

On the other hand, extending the method to distributed planning has resulted in unexpected outcomes. Two very different schemes are analyzed. One evaluates individual plans distributively, which demands an intractable amount of agent communication. The other evaluates local plans in batch and assembles the joint plan distributively, but is unable to guarantee a globally optimal joint plan. These analyses discover pitfalls in distributed planning and facilitate the development of more effective methods. As such, we are currently exploring other schemes of distributed planning that can benefit from partial evaluation.

Acknowledgements

We acknowledge financial support from an NSERC Discovery Grant to the first author, and from an NSERC Postgraduate Scholarship to the second author.

References

1. Besse, C., Chaib-draa, B.: Parallel rollout for online solution of Dec-POMDPs. In: Proc. 21st Inter. Florida AI Research Society Conf., pp. 619–624 (2008)

2. Corona, G., Charpillet, F.: Distribution over beliefs for memory bounded Dec-POMDP planning. In: Proc. 26th Conf. on Uncertainty in AI, UAI 2010 (2010)

3. Kitano, H.: Robocup rescue: a grand challenge for multi-agent systems. In: Proc. 4th Int. Conf. on MultiAgent Systems, pp. 5–12 (2000)

4. Murphy, K.: A survey of POMDP solution techniques. Tech. rep., U.C. Berkeley (2000)

5. Oliehoek, F., Spaan, M., Whiteson, S., Vlassis, N.: Exploiting locality of interaction in factored Dec-POMDPs. In: Proc. 7th Inter. Conf. on Autonomous Agents and Multiagent Systems, pp. 517–524 (2008)

6. Ross, S., Pineau, J., Chaib-draa, B., Paquet, S.: Online planning algorithms for POMDPs. J. of AI Research, 663–704 (2008)

7. Xiang, Y., Chen, J., Havens, W.: Optimal design in collaborative design network. In: Proc. 4th Inter. Joint Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2005), pp. 241–248 (2005)

8. Xiang, Y., Hanshar, F.: Planning in multiagent expedition with collaborative design networks. In: Kobti, Z., Wu, D. (eds.) Canadian AI 2007. LNCS (LNAI), vol. 4509, pp. 526–538. Springer, Heidelberg (2007)

Author Index

Aaron, Eric 1
Aavani, Amir 13
Abdelsalam, Wegdan 26
Afzal, Naveed 32
Ahmed, Maher 26
Al-Obeidat, Feras 56
An, Aijun 347
Anton, Calin 44

Bakhshandeh Babarsad, Omid 313
Bediako-Asare, Henry 50
Belacel, Nabil 56
Buffett, Scott 50

Calvo, Borja 186
Carenini, Giuseppe 122
Cercone, Nick 347
Chaffar, Soumaya 62
Chaib-draa, Brahim 86
Chali, Yllias 68
Charton, Eric 74
Chau, Siu-Cheung 26
Chinaei, Hamid R. 86
Chiu, David 26
Connor, Patrick 92
Connors, Warren A. 174

De Angelis, Silvio 170
de Souza, Erico N. 384
Do, Thang M. 104
Du, Weichang 372

Ebrahim, Yasser 26
Elinas, Pantelis 110

Fan, Lisa 240
Farzindar, Atefeh 32
Ferguson, Daniel S. 110
FitzGerald, Nicholas 122
Fleming, Michael W. 50
Fowler, Ben 128
Frunza, Oana 140

Gagnon, Michel 74
Gao, Qigang 234
Gaudette, Lisa 146
Ghassem-Sani, Gholamreza 313
Guo, Yuanyuan 158

Hanshar, F. 420
Hasan, Sadid A. 68
Higgins, Machel 170
Hilderman, Robert 359
Hoeber, Orland 281
Hollesen, Paul 174

Imam, Kaisar 68
Inkpen, Diana 62, 140, 192, 216
Irurozki, Ekhine 186
Islam, Aminul 192

Jain, Dreama 204
Japkowicz, Nathalie 146
Jiang, Yifei 210
Jin, Wei 396
Joty, Shafiq 122
Joubarne, Colette 216

Kennedy, Alistair 222
Kershaw, David 234
Khamis, Ninus 408
Khawaja, Bushra 240
Khordad, Maryam 246
Klement, William 258
Kobti, Ziad 204

Ladani, Behrouz Tork 301
Langlais, Philippe 323
Larue, Othalia 265
Li, Zijie 269
Liu, Fei 104
Liu, Hanze 281
Liu, Xiaobo 158
Loke, Seng W. 104
Lozano, Jose A. 186
Luo, Jigang 285

Matwin, Stan 258, 384
Mendoza, Juan Pablo 1
Mengistu, Kinfe Tadesse 291

Mercer, Robert E. 246
Michalowski, Wojtek 258
Milios, Evangelos 377
Mirroshandel, Seyed Abolghassem 313
Mitchell, David 13
Mitkov, Ruslan 32
Mokhtari, Ehsan 301
Mostafazadeh, Nasrin 313
Mouine, Mohamed 319
Muller, Philippe 323
Murray, Gabriel 122

Nematbakhsh, Mohammad Ali 301
Noorian, Zeinab 301

Ozell, Benoit 74

Pouly, Marc 335

Rilling, Juergen 408
Rogan, Peter 246
Rudzicz, Frank 291

Sarrafzadeh, Bahareh 347
Sateli, Bahar 408
Silver, Daniel L. 128
Simeon, Mondelle 359
Song, Weihong 372
Soto, Axel J. 377
Spencer, Bruce 372
Strickert, Marc 377
Su, Ming 390
Sun, Zhengya 396
Szpakowicz, Stan 222

Ternovska, Eugenia 13
Thompson, Elizabeth 390
Trappenberg, Thomas 92, 174

van Beek, Peter 269
Vazquez, Gustavo E. 377

Wang, Hai 234
Wang, Jue 396
Ward, Christopher 170
Wilk, Szymon 258
Witte, Rene 408
Wu, Xiongnan 13

Xiang, Y. 420

Yakovets, Nikolay 347
Yao, Yiyu 285

Zhang, Haiyi 210
Zhang, Harry 158