
A NON-DUAL APPROACH TO MEASURING SEMANTIC DISTANCE

BY

INTEGRATING ONTOLOGICAL AND DISTRIBUTIONAL INFORMATION

WITHIN A NETWORK-FLOW FRAMEWORK

by

Vivian Yuen-Chong Tsang

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

Copyright © 2008 by Vivian Yuen-Chong Tsang

I believe that much unseen is also here.

Walt Whitman, Song of the Open Road

It is said that through asceticism certain Buddhists

come to see an entire landscape in a bean.

Roland Barthes, S/Z


Abstract

A Non-dual Approach to Measuring Semantic Distance

by

Integrating Ontological and Distributional Information within a Network-Flow Framework

Vivian Yuen-Chong Tsang

Doctor of Philosophy

Graduate Department of Computer Science

University of Toronto

2008

Text comparison is a key step in many natural language processing (NLP) applications in which

texts can be classified based on their semantic distance (how similar or different the texts are).

For example, comparing the local context of an ambiguous word with that of a known word can

help identify the sense of the ambiguous word. Typically, a distributional measure is used to

capture the implicit semantic distance between two pieces of text. In this thesis, we introduce

an alternative method of measuring the semantic distance between texts as a non-dual combination of distributional information and ontological knowledge. We define non-dualism as

combining two distinct components such that they are seamless in the combination. We achieve

this non-dual combination by proposing a novel distance measure within a network-flow formalism. First, we represent each text as a collection of frequency-weighted concepts within

an ontology. Then, we make use of a network-flow method which provides an efficient way

of measuring the semantic distance between two texts by taking advantage of the ontological

structure. We evaluate our method in a variety of NLP tasks.

In our task-based evaluation, we find that our method performs well on two of three tasks.

We introduce a novel approach to analysing the sensitivity of our network-flow method to any

dataset (represented as a collection of frequency-weighted concepts). Given that the ontological and the distributional components are intricately knitted together in our method, we find that a non-dual approach, rather than a purely distributional or graphical analysis, is more appropriate and more effective in explaining the performance inconsistency.

Finally, we address a complexity issue that arises from the overhead required to incorporate

more sophisticated concept-to-concept distances into the network-flow framework. We propose a graph transformation method which generates a pared-down network that requires less time

to process. The new method achieves a significant speed improvement, and does not seriously

hamper performance as a result of the transformation, as indicated in our analysis.


Acknowledgements

I would like to thank, first and foremost, my family for their emotional support. My appreciation can only be expressed with a Greek symbol, µ.

Much thanks to my advisor, Suzanne Stevenson, for planting the initial seed for thinking about distance as moving dark soil. Digging and moving earth turned out to be rather strenuous. Her patience and encouragement are much appreciated.

Much kudos to suzgrp for their support, emotional and otherwise. In particular, I would like to thank Afsaneh Fazly, whose careful editing comments are indispensable; and Afra Alishahi, who borrowed a book by Michel Foucault and allowed it to sit on her desk for about three hours. . . Though I never cared much for deconstructionism (still don't), the book kept me thinking about (mis)interpretations.

I would like to thank Prof. Derek Corneil and Frank Chu for their helpful discussions on

network-flow methods.

Finally, much thanks to my Sifu, Dorje Jidgral, and my Vajra comrades, who made me

realize meaning is one (integral piece) and not one or two or more.


Contents

1 Introduction  1
1.1 Distributional Approaches  4
1.2 Ontological Approaches  6
1.3 Graph-based Approaches in NLP  8
1.4 Our Combined Approach to Semantic Distance  9

2 The Network Flow Method  15
2.1 An Intuitive Overview  16
2.2 Minimum Cost Flow  18
2.3 Semantic Distance as MCF  20
2.4 Ontological and Distributional Factors in MCF  21

3 Task-based Evaluation  25
3.1 Task 1: Verb Alternation Detection  27
3.1.1 Experimental Setup  28
3.1.2 Results and Analysis  30
3.2 Task 2: Name Disambiguation  35
3.2.1 Experimental Methodology  36
3.2.2 Results and Analysis  39
3.3 Task 3: Document Classification  44
3.3.1 Experimental Setup  45
3.3.2 Results and Analysis  47
3.4 Summary  51

4 Measuring Coherence of Semantic Profiles  53
4.1 Profile Coherence  54
4.2 Separate Distributional and Ontological Approaches  56
4.3 Integrating Distributional and Ontological Factors  58
4.3.1 Profile Density  58
4.3.2 Finding the Ancestor Set for Profile Density  62
4.3.3 Results and Analysis  63
4.3.4 The Impact of the Number of Ancestors  65
4.4 Summary  66

5 Graph Transformation  69
5.1 Solving the MCF Problem Using a Non-additive Distance  70
5.2 Network Transformation  73
5.2.1 Path Shape in a Hierarchy  74
5.2.2 Network Reconstruction  75
5.3 Analysing the Transformed Network  77
5.3.1 Distance Distortion  77
5.3.2 Junction Selection  80
5.4 Evaluating the Transformed Network  81
5.4.1 Junction Selection  82
5.4.2 Results and Analysis  82
5.5 Summary  85

6 Conclusions  87
6.1 Summary of Contributions  89
6.2 Short-term Improvements: Within the MCF Framework  91
6.3 Long-Term Research Directions  92

Bibliography  95


List of Tables

1.1 A representation of two texts as word frequency vectors.  4
1.2 Word frequency distributions of four different texts. Italicized frequencies in each row reflect the difference between Text A and the corresponding text.  6
1.3 Concept frequency distributions of the four texts in Table 1.2.  6
3.1 Accuracies on development data.  31
3.2 Accuracies on test data.  32
3.3 Average accuracies on raw, Li and Abe, and Clark and Weir profiles.  33
3.4 Accuracies on development data on profiles generated using Clark and Weir's (2002) method. Best accuracies in each condition are shown in boldface.  34
3.5 Accuracies on test data on profiles generated using Clark and Weir's (2002) method. Best accuracies in each condition are shown in boldface.  34
3.6 The pairs to be identified, the raw frequency, and the relative frequency of the majority name.  37
3.7 Network-flow results (accuracy) using 200 training instances on the random samples and their average performance.  40
3.8 Performance results using 200 instances per gold standard profile.  40
3.9 SVM results using 200 training instances.  41
3.10 Average classification results of the network flow method using 200, 100, and 50 training data per classification task.  41
3.11 The performance results of Pedersen et al. (2005) (Ped05), as well as network flow (NF) and SVM using 100 training instances, ranked in the order of the JS divergence.  43
3.12 Average classification results using 10 and 30 training documents per newsgroup.  47
3.13 Average classification results using 30 and 10 training documents per newsgroup.  50
4.1 Summary of task-based results.  54
4.2 The normalized profile density scores for each dataset at five different values of α, as well as the average scores across the α values.  64
4.3 The normalized density scores at five different values of α, as well as the average scores, calculated using Jiang and Conrath's (1997) distance.  65
4.4 The norm_density3 scores at five different values of α, as well as the average scores, calculated using edge distance.  66
4.5 The norm_density3 scores at five different values of α, as well as the average scores, calculated using Jiang and Conrath's (1997) distance.  66
5.1 Name disambiguation results (accuracy) at a glance.  83


List of Figures

1.1 The content of three texts.  2
1.2 An illustration of two profiles within an ontology.  10
1.3 Two variations of Figure 1.2.  13
1.4 A path from S to D via their common ancestor A.  14
2.1 A small text represented as a collection of weighted nodes in a fragment of WordNet.  16
2.2 Two subgraphs with varying degrees of overlap.  17
2.3 An illustration of flow entering and exiting node i.  19
2.4 An example of transporting the weights at the square nodes (supply nodes) to the triangle nodes (demand nodes).  22
3.1 Two noisy profiles, one represented by squares, the other, triangles.  48
3.2 The same two profiles in Figure 3.1. The profile masses that are "subtracted" are shaded in grey.  50
4.1 Examples of two profiles.  55
4.2 Two examples of profile density within an ontology.  59
4.3 Two profiles with equal density value.  60
4.4 Two profile examples with different numbers of ancestors but of equal norm_density value.  61
5.1 A bipartite network between the S and D profiles.  71
5.2 An example ontology with two profiles, S and D.  72
5.3 An example ontology with two profiles, S and D. Some common ancestors of the profile nodes are highlighted (JS and JD nodes).  75
5.4 Fragments of the transformed ontology with two profiles, S and D. The common ancestors of the profile nodes are labeled JS and JD.  76
5.5 The fully transformed ontology with two profiles, S and D. The common ancestors of the profile nodes are labeled JS and JD.  76
5.6 The original ontology, the bipartite graph, and the fully transformed graph with two profiles, S and D. In the fully transformed graph, the common ancestors of the profile nodes are labeled JS and JD.  78
5.7 Three clusters of concepts.  84


Chapter 1

Introduction

Rosencrantz: What are you playing at?

Guildenstern: Words. Words. They’re all we have to go on.

Tom Stoppard, Rosencrantz and Guildenstern are Dead

In this thesis, we address the problem of comparing the semantic content of natural language

texts. Given two texts, we measure their semantic distance by comparing the words in one

text with those in the other. Representing texts as bags of words, a simple way of measuring

the distance between two texts is to count the number of words they have in common. Such a measure, however, ignores the fact that the same notion may be expressed using different, though semantically related, words. Consider the simple example in Figure 1.1. Text A and Text B have more words in common than Text A and Text C have. But because both Text A and Text C contain semantically similar words (dairy products), whereas the content of Text B mostly consists of words of another type (automobiles), we consider Text A to be less similar to Text B than to Text C. It is thus important to take into account the contribution of each word, as well as groups of semantically related words, to the overall semantic distance between texts.

Text A: . . . brie . . . yoghurt . . . milk . . . milk . . .
Text B: . . . brie . . . van . . . car . . . trucks . . .
Text C: . . . camembert . . . camembert . . . cheese . . .

Figure 1.1: The content of three texts.

Distributional methods for semantic distance are successfully and widely used in comparing texts that are represented as bags of words with associated frequencies of occurrence (e.g., Lee, 2001; Weeds et al., 2004). In document classification, for example, the content of a document may be represented as a word frequency vector, which is compared using a distributional distance to each of the word frequency vectors of the content of other documents. In this way, distributional distance between word vectors implicitly captures the semantic distance between two texts (prepositional phrase attachment (Pantel and Lin, 2000); document classification (Scott and Matwin, 1998; Rennie, 2001; Al-Mubaid and Umair, 2006); and spelling correction (Budanitsky and Hirst, 2001)).

Semantic distance can also be measured more explicitly by using the relations in an ontology as the direct encoding of semantic association. Numerous measures have been proposed, for example, for capturing the distance between two individual concepts in WordNet (Fellbaum, 1998), typically relying on the synonymy (synset) and hyponymy (is-a) relations. (For an overview of such methods, see Budanitsky and Hirst, 2006.) Using an ontological measure to compare two texts (collections of words instead of single words) might involve mapping each word of a text to its appropriate concept(s) in the ontology, and then calculating the aggregate distance between the two resulting sets of concepts across the ontological relations. For example, one might calculate the semantic distance between the two texts as the average, minimum, maximum, or summed ontological distance between the individual elements of the two sets of concepts (Corley and Mihalcea, 2005).

As noted above, each of these approaches to text comparison (distributional and ontological) encodes information not contained in the other. Distributional distance captures important information about the frequency of occurrence of words that comprise the target text, while ontological distance captures essential semantic knowledge that has been encoded in the relations of an ontology. In response, previous work has attempted to combine distributional and ontological information in computing semantic distance. For example, some ontological measures use corpus frequencies of words to yield concept weights that are taken into account in measuring the distance between two concepts (Resnik, 1995; Jiang and Conrath, 1997). However, these methods are restricted to finding the distance between two individual concepts, not the aggregate distance between the two sets of concepts corresponding to two texts. Other researchers have developed measures of semantic distance between texts that apply distributional distances to concept vectors of frequencies rather than to word vectors (McCarthy, 2000; Mohammad and Hirst, 2006). However, these approaches only make pointwise comparisons across the concept vectors, and do not take into account the important ontological relations among the concepts. What has been missing is an approach to semantic distance between texts that can truly integrate the distributional and ontological (relational) information, drawing more fully on their complementary advantages.

Given the complementary nature of distributional and ontological methods, our goal is to develop a semantic distance method that achieves the advantages of the two. We thus propose a novel graph-based method that seamlessly combines the distributional and the ontological factors. In other words, we see distributional and ontological information as two distinct but not separate (non-dual) parts of a semantic distance measure. The key is that both word frequency (distributional information) and word meaning (ontological knowledge) contribute to the underlying text meaning. Moreover, word meaning should not serve only to partition the semantic space, as is the case in a purely distributional approach. The relationship between word meanings (ontological relations among concepts) should also be taken into account.

The rest of this chapter is organized as follows. In Section 1.1, we use an example to explain in detail which aspects of semantic distance a distributional method captures. We further elaborate on how existing distributional methods have tried to incorporate ontological information, and argue that such an approach is not sufficient. In Section 1.2, we present how some of the existing ontological measures take into account distributional information in their calculation. Again, we argue that such methods still lack an appropriate account of distributional properties of texts. Our proposed method for seamlessly combining the two factors involves the use of a graph-based framework. In Section 1.3, we thus briefly look at the current graph-based approaches in NLP. In Section 1.4, we provide an outline of our proposal, and present the organization of the thesis.

words    w1    w2    w3    . . .    wn−1    wn
Text A   a1    a2    a3    . . .    an−1    an
Text B   b1    b2    b3    . . .    bn−1    bn

Table 1.1: A representation of two texts as word frequency vectors, where wi represents a word appearing in a text, ai is the frequency of wi in Text A, and bi is the frequency of wi in Text B.

1.1 Distributional Approaches

By representing a text as a frequency distribution of words, a text can be viewed as a point in an n-dimensional space, with n being the total number of unique words. Each word, wi, where 1 ≤ i ≤ n, represents one dimension (see Table 1.1). The semantic distance between two texts can be approximated by the spatial or distributional distance¹ of the corresponding two points in the n-dimensional space. For example, the Euclidean distance between Text A and Text B (from Table 1.1), represented as frequency vectors, ~a and ~b, is calculated as:

distance_{Euclidean}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}    (1.1)

where a_i is the frequency of word w_i in Text A, and b_i is the frequency of the same word in Text B. Other spatial and distributional distances are calculated in a similarly pointwise manner (a_1 is compared to b_1, a_2 to b_2, and so on), i.e., each dimension (a word) is considered independent of the other dimensions.

¹Throughout the thesis, we often use the words "spatial" and "distributional" interchangeably to refer to frequency-based distance measures. However, we do note the difference between the two, as some distributional measures, e.g., KL divergence, are not strictly distances by definition, since they do not obey the triangle inequality:

distance(x, z) ≤ distance(x, y) + distance(y, z)
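The pointwise character of Equation 1.1 is easy to see in code. The following is a minimal sketch, with invented five-word frequency vectors in the style of Table 1.1; note how each dimension is compared independently of the others:

```python
import math

def euclidean_distance(a, b):
    """Pointwise Euclidean distance between two frequency vectors (Eq. 1.1)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Hypothetical word-frequency vectors over a shared vocabulary w1 .. w5.
text_a = [0.20, 0.20, 0.20, 0.20, 0.20]
text_b = [0.15, 0.25, 0.20, 0.20, 0.20]

print(euclidean_distance(text_a, text_b))
```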


Generally, recent work on text comparison tends to be word-based and distributional (e.g., Lee, 2001; Weeds et al., 2004; Pedersen et al., 2005; Al-Mubaid and Umair, 2006). Words may be grouped into a smaller number of related terms using matrix factorization (e.g., SVD) or other clustering techniques (e.g., Pereira et al., 1993; Scott and Matwin, 1998; McCarthy, 2000; Mohammad and Hirst, 2006). However, regardless of how we partition the semantic space by grouping similar words, the individual elements (clusters of words) are compared in a pointwise manner, i.e., each element in one distribution is only compared to the corresponding element in the other distribution, and the distance across elements still cannot be taken into consideration.

Consider the example in Table 1.2, where the four vectors represent the frequency distributions of four texts. Distributionally, each of Texts B, C, or D is only slightly different from Text A. That is, Texts B, C, and D result, respectively, from displacing a mass of 0.05 from camembert in Text A to brie, milk, or car. Moreover, Text A is equally far away from Text B, Text C, and Text D:

distance_{distrib}(A, B) = distance_{distrib}(A, C) = distance_{distrib}(A, D)    (1.2)

However, by only looking at pointwise differences between word frequency distributions, one cannot take into account the fact that the words themselves are semantically related in varying degrees; the semantic distance between different words may contribute to the overall text distance. For example, camembert is similar to brie (both are cheeses), but less similar to milk (dairy products) and rather different from van and car (entities). If we displace a frequency mass in a distribution from one word (e.g., camembert) to another word (e.g., brie, milk, or car), the impact on the overall distance should not only depend on the size of the mass, but also on the source and the destination words of the displacement. In our example, because brie, milk, and car in Texts B, C, and D are not equally distant from camembert in Text A, we expect the distance from Text A to reflect as such:

distance(A, B) < distance(A, C) < distance(A, D)    (1.3)


words    camembert  brie  milk  van   car
Text A   0.2        0.2   0.2   0.2   0.2
Text B   0.15       0.25  0.2   0.2   0.2
Text C   0.15       0.2   0.25  0.2   0.2
Text D   0.15       0.2   0.2   0.2   0.25

Table 1.2: Word frequency distributions of four different texts. Italicized frequencies in each row reflect the difference between Text A and the corresponding text.
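The equality in Equation 1.2 can be checked directly for the Euclidean case. The sketch below uses the frequency vectors of Table 1.2; all three distances from Text A come out identical, since a pointwise measure sees only the size of the displaced mass, never which words are involved:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Word-frequency distributions from Table 1.2
# (columns: camembert, brie, milk, van, car).
text_a = [0.20, 0.20, 0.20, 0.20, 0.20]
text_b = [0.15, 0.25, 0.20, 0.20, 0.20]  # 0.05 moved to brie
text_c = [0.15, 0.20, 0.25, 0.20, 0.20]  # 0.05 moved to milk
text_d = [0.15, 0.20, 0.20, 0.20, 0.25]  # 0.05 moved to car

d_ab = euclidean(text_a, text_b)
d_ac = euclidean(text_a, text_c)
d_ad = euclidean(text_a, text_d)
# All three are equal, contrary to the intuition of Equation 1.3.
print(d_ab, d_ac, d_ad)
```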

concepts  dairy products  automobiles
Text A    0.6             0.4
Text B    0.6             0.4
Text C    0.6             0.4
Text D    0.55            0.45

Table 1.3: Concept frequency distributions of the four texts in Table 1.2.

In order to take the semantic relations among words into account, one may consider grouping the words into, for example, dairy products and automobiles (Table 1.3). Now car belongs to automobiles and the cheeses are grouped under dairy products, hence Text D is now less similar to Text A than Text B and Text C are. However, such a method still does not completely alleviate the problem of pointwise comparison: removing the fine-grained distinction between words renders the first three texts indistinguishable. In this example, the differences among the first three texts come from their difference in the frequency of words grouped under dairy products, but this difference is no longer captured in the new representation. Generally, regardless of the representation used, distributional techniques simply lack the flexibility to allow inter-word or inter-concept comparison that can reflect the fine-grained semantic distinctions between texts.
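A short sketch makes the loss of information concrete. Grouping the Table 1.2 vectors into the two concepts of Table 1.3 does separate Text D from the others, but it collapses Texts A, B, and C onto the same point:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Group Table 1.2 word vectors into the two concepts of Table 1.3:
# dairy products = camembert + brie + milk; automobiles = van + car.
def group(word_vec):
    camembert, brie, milk, van, car = word_vec
    return [camembert + brie + milk, van + car]

texts = {
    "A": [0.20, 0.20, 0.20, 0.20, 0.20],
    "B": [0.15, 0.25, 0.20, 0.20, 0.20],
    "C": [0.15, 0.20, 0.25, 0.20, 0.20],
    "D": [0.15, 0.20, 0.20, 0.20, 0.25],
}
grouped = {name: group(v) for name, v in texts.items()}
# Texts A, B, and C all map to (0.6, 0.4): the coarse representation
# captures D's difference but erases the distinctions among A, B, C.
for name in "BCD":
    print(name, euclidean(grouped["A"], grouped[name]))
```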

1.2 Ontological Approaches

Ontological approaches to semantic distance alleviate the problem of not capturing the fine-grained semantic distinctions among words by taking advantage of the semantic relations between concepts in an ontology. Since an ontology provides a graph structure, given that concepts are connected via ontological relations, the semantic distance between two concepts can be measured as the graphical distance within the ontology. The most straightforward way is to count the number of edges on the shortest path connecting the two concepts. Alternatively, if the ontology has a hierarchical structure (e.g., WordNet), one can consider a similarity measure² such as Wu and Palmer's (1994) that uses the depth of concepts in the calculation:

similarity_{wp}(c_1, c_2) = \frac{2 \times depth(lowest\_common\_ancestor(c_1, c_2))}{depth(c_1) + depth(c_2)}    (1.4)

Note that Wu and Palmer's (1994) measure does not consider graphical distance (i.e., the connecting edges between concepts) in its calculation. In fact, a number of popular measures ignore the underlying graphical structure as well. For example, Lin (1998) proposes the following measure:

similarity_{lin}(c_1, c_2) = \frac{2 \times IC(lowest\_common\_ancestor(c_1, c_2))}{IC(c_1) + IC(c_2)}    (1.5)

in which IC(concept) stands for the information content of a concept, a notion proposed by Resnik (1995), and is estimated as:

IC(concept) = -\log(p(concept))    (1.6)

Similar to Wu and Palmer's (1994) measure, Lin's (1998) measure does not consider graphical distance in its calculation.
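To make these measures concrete, here is a minimal sketch over an invented toy hierarchy (the concept names, depths, and corpus probabilities are illustrative assumptions, not WordNet values). It computes the edge-counting distance, Wu and Palmer's (1994) similarity (Equation 1.4), and Lin's (1998) similarity (Equations 1.5 and 1.6):

```python
import math

# A toy is-a hierarchy (child -> parent); "entity" is the root.
parent = {
    "food": "entity", "vehicle": "entity",
    "dairy": "food", "cheese": "dairy",
    "camembert": "cheese", "brie": "cheese", "milk": "dairy",
    "car": "vehicle", "van": "vehicle",
}

def ancestors(c):
    """Path from concept c up to the root, inclusive."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def depth(c):
    return len(ancestors(c)) - 1  # the root has depth 0

def lca(c1, c2):
    """Lowest common ancestor: first concept on c1's upward path shared by c2."""
    a2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in a2)

def edge_distance(c1, c2):
    """Number of edges on the shortest path through the LCA."""
    a = lca(c1, c2)
    return (depth(c1) - depth(a)) + (depth(c2) - depth(a))

def sim_wu_palmer(c1, c2):  # Eq. 1.4
    return 2 * depth(lca(c1, c2)) / (depth(c1) + depth(c2))

# Hypothetical corpus probabilities for information content (Eq. 1.6);
# counts propagate upward, so p(ancestor) >= p(descendant).
p = {"entity": 1.0, "food": 0.5, "vehicle": 0.5, "dairy": 0.4,
     "cheese": 0.25, "camembert": 0.1, "brie": 0.1, "milk": 0.15,
     "car": 0.3, "van": 0.2}

def IC(c):
    return -math.log(p[c])

def sim_lin(c1, c2):  # Eq. 1.5
    return 2 * IC(lca(c1, c2)) / (IC(c1) + IC(c2))

print(edge_distance("camembert", "brie"))  # siblings under "cheese"
print(edge_distance("camembert", "car"))   # connected only via the root
print(sim_wu_palmer("camembert", "brie"))
print(sim_lin("camembert", "brie"))
```

In the thesis setting these computations would of course run over WordNet rather than a hand-built tree.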

Although these methods are often used for measuring the distance between two words, it is

not straightforward to generalize them for measuring text distance. First, using these measures

for text comparison implies that each text needs to be represented not in terms of words, but in

terms of the concepts in an ontology. Second, to account for the word frequency distribution

in texts, the concepts have to be weighted accordingly. Then, the comparison task becomes

a task of calculating the distance between two concept frequency distributions. As we have

²We take the inverse of the similarity value to obtain distance.

emphasized earlier, by taking a purely distributional route, one can no longer take advantage of

the ontological structure to make finer-grained inter-word or inter-concept distinctions between

texts.

One approach to comparing two texts might involve calculating the aggregate distance between the two resulting sets of concepts across the ontological relations. For example, we mentioned Corley and Mihalcea's (2005) work, in which the semantic distance between the two texts is calculated as the average, minimum, maximum, or summed ontological distance between the individual elements of the two sets of concepts. However, this approach ignores distributional information of the texts, and hence treats all concepts as equally important in determining the distance. Recall from Section 1.1 that the approaches which take the distribution of concepts into account (e.g., McCarthy, 2001; Mohammad and Hirst, 2006) tend to ignore the ontological relations among the concepts.

Our proposal is to capture both types of information with the aid of a graph-based method.

We will return to the details of our proposal in Section 1.4, after a brief description of current

uses of graph methods in NLP.

1.3 Graph-based Approaches in NLP

In recent years, we have seen an increasing use of graph-based methods in NLP (e.g., Pang and Lee, 2004; Mihalcea, 2005; Navigli and Velardi, 2005). The graph-theoretic approach is popular due to its elegance in representation, as well as the existence of a large array of efficient algorithms for graph processing. Graphs in general are a convenient mathematical formalism to represent words or more complex semantic entities as nodes and the relationships between them as edges.³ One of the most straightforward NLP examples is the use of WordNet as a graph for measuring semantic relatedness (Rada et al., 1989; Wu and Palmer, 1994).

One popular graph method for NLP is the minimum-cut algorithm. For example, both Pang and Lee (2004) and Barzilay and Lapata (2005) use minimum cut for two vastly different applications, document polarity classification and content selection. In these works, the sentences are represented as nodes in a graph, and the edge connecting each pair of nodes is weighted with an association score between the sentences (e.g., the distance between the sentences in the text). The minimum-cut method partitions the nodes by finding the minimum cut (the set of connecting edges with the minimum aggregate edge weights). Thus, the sentences are classified into different categories based on the node partition.

³The reverse is possible, though less intuitive, by using nodes to represent relations and edges for semantic entities. The choice of representation clearly depends on the NLP task itself.
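As an illustration of the idea (not of Pang and Lee's or Barzilay and Lapata's actual systems), the sketch below runs a textbook max-flow computation (Edmonds-Karp) on a tiny invented "sentence" graph and reads off a minimum s-t cut as the set of nodes still reachable from the source in the residual graph:

```python
from collections import deque

def min_cut_partition(n, cap, s, t):
    """Edmonds-Karp max-flow on an n-node graph with capacity matrix cap;
    returns the set of nodes on the source side of a minimum s-t cut."""
    flow = [[0] * n for _ in range(n)]

    def bfs():
        prev = [-1] * n
        prev[s] = s
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if prev[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    prev[v] = u
                    q.append(v)
        return prev

    while True:
        prev = bfs()
        if prev[t] == -1:
            break
        # find the bottleneck along the augmenting path, then push it
        v, aug = t, float("inf")
        while v != s:
            aug = min(aug, cap[prev[v]][v] - flow[prev[v]][v])
            v = prev[v]
        v = t
        while v != s:
            flow[prev[v]][v] += aug
            flow[v][prev[v]] -= aug
            v = prev[v]
    prev = bfs()
    return {v for v in range(n) if prev[v] != -1}

# Hypothetical example: nodes 0 and 4 are seed sentences of two classes;
# edge weights are association scores between the sentences.
cap = [[0] * 5 for _ in range(5)]
def link(u, v, w):
    cap[u][v] = cap[v][u] = w
link(0, 1, 3); link(1, 2, 3); link(2, 3, 1); link(3, 4, 3); link(2, 4, 1)

source_side = min_cut_partition(5, cap, 0, 4)
print(sorted(source_side))  # sentences grouped with seed node 0
```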

Another popular graph method is the random walk algorithm, which is successfully employed by the PageRank algorithm for ranking webpages (Brin and Page, 1998). The intuition behind the algorithm is that the "popularity" (score) of a node depends on the "popularity" of its neighbours. The more neighbours one has and/or the more popular the neighbours are, the higher its popularity. This algorithm is useful when one wants to classify an item based on the information contributed by related items. For example, Mihalcea (2006) uses random walk for word sense disambiguation. In this work, each node represents an ambiguous (test) word, or a (training) word labelled with one of its senses. Each edge indicates that the corresponding two words co-occur in some context. The sense of an ambiguous word is determined by the sense of its most relevant neighbour(s), by randomly traversing the graph until an equilibrium state has been reached.
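The "popularity" recursion can be sketched in a few lines of power iteration. This is the generic PageRank update on an invented word co-occurrence graph, not Mihalcea's (2006) system:

```python
def pagerank(neighbours, damping=0.85, iters=100):
    """Power-iteration PageRank on an undirected co-occurrence graph."""
    nodes = list(neighbours)
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            # a node's score depends on the scores of its neighbours
            incoming = sum(score[u] / len(neighbours[u]) for u in neighbours[v])
            new[v] = (1 - damping) / len(nodes) + damping * incoming
        score = new
    return score

# Hypothetical co-occurrence graph of five words.
graph = {
    "bank": ["money", "river", "loan"],
    "money": ["bank", "loan"],
    "loan": ["bank", "money"],
    "river": ["bank", "water"],
    "water": ["river"],
}
ranks = pagerank(graph)
best = max(ranks, key=ranks.get)
print(best)  # the best-connected node ends up with the highest score
```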

A graph-based method is necessary for us to take advantage of the intrinsic graph structure of an ontology. More importantly, we need to choose an appropriate graph-based method which calculates a text distance that is simultaneously distributional and ontological. In the next section, we give an overview of our proposal which allows us to achieve this requirement.

1.4 Our Combined Approach to Semantic Distance

In our method, an ontology is treated as a graph in the usual manner, in which the concepts are nodes and the relations are edges. A text can be represented as a collection of concepts in the ontology, by mapping the words in the text into their corresponding concepts, which are weighted according to the word frequencies. (We call the resulting set of frequency-weighted concepts a semantic profile.) We can then use a graph-based method over the ontology to calculate the frequency-weighted semantic distance between two profiles representing the two texts to be compared.

Figure 1.2: An illustration of two profiles within an ontology (the outer triangle). Each shape represents the nodes of one profile (representing a text) and the size represents the mass (frequency) at a particular node in the ontology. Relations (edges) between concept nodes are omitted for simplicity.

Consider Figure 1.2, where we show a diagrammatic representation of an ontology (the

large open triangle) with two profiles representing two texts, one indicated with filled squares

and the other with filled triangles. The location of a filled shape indicates the location of a

profile concept in the ontology, and its size indicates its frequency within the profile. We

omit edges between the nodes for simplicity of the diagram, but note that we assume we have

a hierarchical, connected ontology (e.g., hyponymy links). Our proposal is to calculate the

distance between the two profiles by determining how much effort is required to transport,

along the ontological links, the frequency mass from all of the squares to “fill” the available

space in the triangles (or vice versa). The amount of mass to move and the amount of space

available are indicated by the size of the squares and triangles, respectively. The degree of effort required to transport one to the other indicates the degree of semantic distance.

Clearly, a graph-based method is necessary for us to take advantage of the intrinsic graph structure of an ontology. More importantly, it is crucial to select an appropriate graph-based method that calculates a text distance which is simultaneously distributional and ontological. As we have illustrated in Figure 1.2, to compare two texts, we calculate the distance between the two corresponding profiles as the amount of "effort" required to transform one profile to match the other graphically. To account for the ontological component of the distance, observe that each profile can be viewed as a subgraph of the bigger graph representing the ontology. The edges that connect the two profiles are key to calculating the ontological (graphical) distance between them. To account for the distributional component, observe that each profile node is weighted according to the word-frequency distribution of a text. The distributional difference can serve as a weighting factor to the ontological distance. In short, the weighted graphical distance is the desired distance. Of all the existing graph formalisms, network flow best fits this specific set of requirements.

In this thesis, we explore a three-pronged approach to examining our non-dual framework for text comparison. First, we demonstrate the usefulness of our method in three different NLP tasks. Next, we examine the distributional and ontological sensitivity of our method to the different types of texts involved in the task-based experiments. Finally, we look into the method from an algorithmic perspective. Below, we present a detailed outline of the thesis, and summarize the main contributions of our work.

In Chapter 2, we present our network-flow formalism for text comparison. Specifically, we achieve our goal via a minimum-cost flow formulation. For our task, we have (i) a graph structure based on the ontology; (ii) ontological distance (i.e., graphical distance) defined between concepts; and (iii) the profiles for each text (a concept frequency distribution). Given this information, a minimum-cost flow problem definition allows us to (i) find a set of paths connecting the two profiles such that (ii) the weighted sum of the paths' distances, based on the distributional difference of the two profiles, is minimized. Clearly, the resulting aggregate distance is the desired text distance, as it accounts for the ontological distance as well as the distributional difference between the texts.


Chapter 3 presents our task-based evaluation, testing our method in three NLP tasks: verb alternation detection (Section 3.1), name disambiguation (Section 3.2), and document classification (Section 3.3). These applications are selected because they can be cast as a text comparison task. However, they vary in how the set of words to be compared is determined. In the first task, the words have a particular syntactic relation to a target verb. In the second task, the syntactic restriction is relaxed such that words appearing within a local window of an ambiguous name are considered. Finally, in the last task, the window-size restriction is also relaxed such that words within a document are included.

Somewhat disappointingly, our method is not consistently successful across the three tasks. Our network-flow method is found to be superior to state-of-the-art distributional methods in verb alternation detection and name disambiguation, but not so in the final task. To explain the performance differential, we analyze various properties of the datasets in Chapter 4. We begin by using simple distributional and graphical measures for our analysis, but they fail to explain our method's behaviour on the three datasets. This is unsurprising, given that there are intricate interactions between the two types of knowledge within the network-flow method. We propose a non-dually combined approach, called profile density, to measure the distributional and ontological coherence of a set of frequency-weighted concepts. Intuitively, profile density within an ontology is analogous to the geographical sense of population density. The idea is based on the observation that data that are dispersed throughout the ontology are difficult to separate into distinct classes. In contrast, data that are concentrated within a number of distinct regions of the ontology suggest a high semantic coherence and therefore can be classified more easily—distinct clusters of related concepts suggest a possible classification.

Consider two variations of Figure 1.2 in Figure 1.3. In comparison to diagram (a), the two profiles in diagram (b) are clearly more easily recognized as two separate clusters, which suggests they may belong to two distinct classes. Similar to our network-flow formulation for text comparison, both the mass at the individual concept nodes and the distance between the masses play a role in determining the density of a dataset. Indeed, by taking a combined approach, we will show that profile density is a good indicator of the "classifiability" of a dataset using our network-flow method.

Figure 1.3: Two variations of Figure 1.2.

Next, in Chapter 5, we take a different perspective by examining how the use of sophisticated concept-to-concept distances (distances more sophisticated than edge distance, such as those of Wu and Palmer, 1994; Jiang and Conrath, 1997; Lin, 1998) impacts the efficiency of our method. One key feature of our network-flow method is that it incorporates the ontological distance between concepts into the overall semantic distance. However, more sophisticated measures may cause a processing bottleneck. Algorithms solving minimum-cost flow problems take a greedy approach; their efficiency rests on the assumption that the distance between any two nodes is additive, i.e., the distance of a path equals the sum of the distances of its parts. For example, consider calculating the edge distance of the S-D path (thick edges) in Figure 1.4. Edge distance is additive: since each edge constitutes a distance of one, the path has a distance of five. However, many ontological distances do not fit this additive criterion. To solve the minimum-cost flow exactly, the non-additive distance has to be made additive, which can be done by adding an edge between every pair of non-adjacent nodes. (The graphical issues will be explained in further detail in the chapter.) However, generating the extra edges results in an explosion in processing time. In this chapter, we focus on how we can alleviate this processing bottleneck.
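The additivity issue can be illustrated with a toy tree of our own (echoing the A-shaped S-A-D path): edge distance along a path sums exactly, while a depth-based measure such as Wu and Palmer's (1994) does not.

```python
# Toy illustration (our own tree, not the thesis's data): S and D sit at
# depth 4 under their common ancestor A at depth 2, with one intermediate
# node on each side. Depth is counted from the root R at depth 1.

depth = {"R": 1, "A": 2, "B": 3, "C": 3, "S": 4, "D": 4}

def wup_distance(a, b, lcs):
    # 1 minus Wu-Palmer similarity; lcs = lowest common ancestor of a, b
    return 1.0 - 2.0 * depth[lcs] / (depth[a] + depth[b])

# Edge distance is additive: S..A (2 edges) + A..D (2 edges) = S..D (4).
edge_sa, edge_ad, edge_sd = 2, 2, 4
assert edge_sa + edge_ad == edge_sd

# Wu-Palmer distance is not: the two halves overshoot the whole path.
halves = wup_distance("S", "A", "A") + wup_distance("A", "D", "A")
whole = wup_distance("S", "D", "A")
assert round(halves, 3) == 0.667 and whole == 0.5
```

Adding a direct S-D edge that carries the true distance (0.5 here) is what restores additivity, at the cost of the extra edges discussed above.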


Figure 1.4: A path from S to D via their common ancestor A.

Our solution to alleviating the bottleneck is based on the observation that in an ontology, any path between two nodes passes through their common ancestor, resulting in an A-shaped path (e.g., the S-D path in Figure 1.4). We propose a novel graph transformation method for constructing an approximate network which mimics the structure of the more precise network by retaining the overall path shape. This way, the transformed network reduces the number of extra edges required, making the text comparison process computationally practical. Moreover, we can estimate the true non-additive distance by calculating it additively on the transformed network. Because the transformed network is structurally similar to the original network, the degree of distance distortion is small. In our evaluation, we will show that it is possible to accommodate non-additive ontological distances without either the expensive processing or significant information loss as a result of the transformation.

Finally, in Chapter 6, we summarize the contributions of each strand of our work and

propose some general directions for future extensions.

Chapter 2

The Network Flow Method

Fred: – and one thing that keeps cropping up is this about “subtext.” Songs, novels, plays – they all have a subtext, which I take to mean a hidden message or import of some kind.

Ted nods.

Fred: So subtext we know. But what do you call the meaning, or

message, that’s right there on the surface, completely open and

obvious? They never talk about that. What do you call what’s

above the subtext?

Ted: The text.

Fred: Okay. That’s right . . . But they never talk about that.

Whit Stillman, Barcelona

As noted in Chapter 1, we treat an ontology as a graph and represent a text as a semantic profile—a collection of nodes in the graph (concepts in the ontology), each having a weight (its frequency). For example, in Figure 2.1, a small text consisting of the words cheese and wheat (among other words) with frequencies of 4 and 10, respectively, is represented as a small weighted subgraph in an ontology by uniformly distributing the word frequencies among the associated concepts. In this way, a text is a weighted subgraph within a larger graph (with the thickness of the boxes in the figure indicating weight), and two such weighted subgraphs are connected via a set of paths in the graph.
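The uniform frequency-splitting step can be sketched as follows. The word-to-concept sense sets below are invented for illustration; in the thesis the mapping comes from WordNet.

```python
# Sketch of building a semantic profile: each word's corpus frequency is
# divided uniformly among its associated concepts, then the profile is
# normalized. The concept sets here are hypothetical, not WordNet's.
from collections import defaultdict

def build_profile(word_freqs, word_to_concepts):
    profile = defaultdict(float)
    for word, freq in word_freqs.items():
        concepts = word_to_concepts[word]
        for c in concepts:
            profile[c] += freq / len(concepts)
    total = sum(profile.values())
    return {c: w / total for c, w in profile.items()}  # normalize to 1

freqs = {"cheese": 4, "wheat": 10}
senses = {"cheese": ["dairy_product", "money"],  # hypothetical senses
          "wheat": ["grain"]}
p = build_profile(freqs, senses)
# cheese's mass (4) splits evenly over its two senses; total mass is 1
assert p["dairy_product"] == p["money"] == 2 / 14
assert abs(sum(p.values()) - 1.0) < 1e-9
```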


Figure 2.1: A small text represented as a collection of weighted nodes in a fragment of WordNet.

Our goal is to measure the distance between two subgraphs (representing two texts to be compared), taking into account both the ontological distance between the component concepts and their frequency distributions. To achieve this, we measure the amount of "effort" needed to transform one profile to match the other graphically: the more similar they are, the less effort it takes to transform one into the other. In Section 2.1, we first give the intuitive motivation for the approach in terms of the properties of semantic distance that we want to capture by considering "transport effort". We then present the mathematical formulation of our graph-based method as a minimum cost flow (MCF) problem in Section 2.2, and describe the formulation of our task within this network flow framework in Section 2.3. In Section 2.4, we return to the properties identified in Section 2.1 to explain how they are reflected in the MCF formulation.

2.1 An Intuitive Overview

Let us return to our diagrammatic representation of an ontology (the large open triangle) with two profiles, shown in Figure 2.2. One profile is indicated with filled squares and the other with filled triangles. The location of a filled shape indicates the location of a profile concept in the ontology, and its size indicates its frequency within the profile. We omit edges between the nodes for simplicity of the diagram, but we assume we have a hierarchical, connected ontology. Recall that our goal is to calculate the similarity between the two profiles by determining how much effort is required to transport, along the ontological links, the frequency mass from all of the squares to "fill" the available space in the triangles. The amount of mass to move and the amount of space available are indicated by the size of the squares and triangles, respectively. The degree of effort required to transport one to the other indicates the degree of semantic distance.

Figure 2.2: Two subgraphs (one represented by squares, the other, triangles) with varying degrees of overlap, and therefore, similarity within an ontology. Figure (b) differs from Figure (a) in terms of the ontological distance between the square and the triangle clusters. Figure (c) differs from Figure (a) in terms of the size of the individual squares.

The transport effort is determined by both the amount of mass to move and the graphical distance over which it must travel. First consider the graphical (ontological) distance between the profiles. Assume the calculated distance between the two profiles in Figure 2.2(a) is d. In Figure 2.2(b), the triangle profile is exactly the same. By contrast, while the square profile has the same internal properties (same frequency distribution and graphical structure), its location is further from the triangles. Since the two profiles occupy more distant portions of the ontological space, they are less semantically similar than in Figure 2.2(a). As desired, the extra ontological distance over which the square frequency mass must be transported to the triangles will cause the calculated distance in Figure 2.2(b) to be larger than d.

Next consider the effect of varying the frequency distribution over the profile nodes. Again,

in Figure 2.2(c), the triangle profile is exactly the same as in Figure 2.2(a). However, while the

nodes of the square profile in Figure 2.2(c) are in the same locations as in Figure 2.2(a), their

distributional properties are different. The bulk of the frequency distribution is now shifted

closer to the nodes of the triangle profile. Since the two profiles have more distributional

weight located closer within the ontology, this indicates that the semantic space they occupy

is more similar than in Figure 2.2(a). Correspondingly, since much of the mass of the square

profile needs to travel less far to fill the space of the triangle nodes, the calculated distance in

Figure 2.2(c) will be less than d.

These intuitive examples show that calculating semantic distance as "transport effort" captures in a well-motivated way both the ontological distance between the profiles and their weighting by the distributional amounts of the concept nodes. Next we turn to a mathematical formulation that captures these properties in a network flow framework.

2.2 Minimum Cost Flow

Our intuitive "transport effort" examples above can be viewed as a supply-demand problem, in which we find the minimum cost flow (MCF) from the supply profile to the demand profile to meet the requirements of the latter. Mathematically, let G = (N, E) be a connected graph representing an ontology, where N is the set of nodes representing the individual concepts, and E is the set of edges representing the relations between the concepts.1 Each edge has a cost c : E → R, which is the ontological distance of the edge. Each node i ∈ N is associated

1 Most ontologies are connected; in the case of a forest, adding an arbitrary root node yields a connected graph.


Figure 2.3: An illustration of flow entering and exiting node i.

with a value b(i) such that b : N → R indicates its available supply (b(i) > 0), its demand (b(i) < 0), or neither (b(i) = 0). The goal is to find a flow from supply nodes to demand nodes that satisfies the supply/demand constraints of each node and minimizes the overall "transport cost".

First, we have to define a function to describe the flow entering node i via an incoming edge (h, i) and exiting i via an outgoing edge (i, j). Let IN_i be the set of edges (h, i) with a flow entering node i, and similarly, OUT_i be the set of edges (i, j) with a flow exiting node i. Then, the flow entering and exiting node i is captured by x : E → R, such that the combined incoming flow from the entering edges IN_i is ∑_{(h,i)∈IN_i} x(h, i), and the combined outgoing flow via the exiting edges OUT_i is ∑_{(i,j)∈OUT_i} x(i, j) (see Figure 2.3). A valid flow x must be found such that the net flow at each node—the difference between its exiting flow and its entering flow—equals its specified supply or demand constraint. For example, in Figure 2.2, where the squares represent the supply and the triangles represent the demand, a solution for x would allow us to transport all the weight at the squares to fill the triangles, via a set of routes connecting them.

Formally, the MCF problem can be stated as:

Minimize  z(x) = ∑_{(i,j)∈E} c(i, j) · x(i, j)    (2.1)

subject to  ∑_{(i,j)∈OUT_i} x(i, j) − ∑_{(h,i)∈IN_i} x(h, i) = b(i),  ∀i ∈ N    (2.2)

and  x(i, j) ≥ 0,  ∀(i, j) ∈ E    (2.3)

The constraint specified by eqn. (2.2) ensures that the difference between the flow entering and exiting each node i matches its supply or demand (b(i)) exactly. The next constraint, eqn. (2.3), ensures that the flow is transported from the supply to the demand but not in the opposite direction. The calculation of z in eqn. (2.1) (which is subject to these constraints) multiplies the amount of flow travelling along each edge, x(i, j), by the transportation cost of using that edge, c(i, j). Taking the summation over all edges of the product c(i, j) · x(i, j) yields the desired "transport effort" of using the supply to fill the demand.
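For intuition about this formulation, note that when the graph is a strict tree with unit edge costs, the optimal flow is forced: the flow across each edge must equal the net supply of the subtree below it. The sketch below is our own illustration of that special case, not the thesis's solver; real ontologies such as WordNet are not strict trees, so the general problem needs a proper MCF algorithm.

```python
# Minimum-cost flow between two profiles on a TREE with unit edge costs
# (our illustrative special case). The flow on each (child, parent) edge
# is the net supply b(i) summed over the child's subtree, so the MCF
# distance is the sum of |subtree net supply| over all edges.

def tree_mcf_distance(parent, b):
    """parent: dict child -> parent (root maps to None).
    b: dict node -> net supply (> 0) or demand (< 0); must sum to zero."""
    assert abs(sum(b.get(n, 0.0) for n in parent)) < 1e-9
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)
    def subtree_sum(n):  # net supply in the subtree rooted at n
        return b.get(n, 0.0) + sum(subtree_sum(c) for c in children.get(n, []))
    # cost = sum over edges of |flow| * 1 (unit edge cost)
    return sum(abs(subtree_sum(c)) for c, p in parent.items() if p is not None)

parent = {"root": None, "food": "root", "grain": "root",
          "cheese": "food", "wheat": "grain"}
b = {"cheese": 1.0, "wheat": -1.0}  # all supply at cheese, demand at wheat
# cheese -> food -> root -> grain -> wheat: 4 unit-cost edges, flow 1.0
assert tree_mcf_distance(parent, b) == 4.0
```

On general (non-tree) graphs, and with the normalized profiles used in this chapter, a standard minimum-cost flow solver plays the role of this closed form.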

2.3 Semantic Distance as MCF

To cast our text comparison task into this framework, we first represent each text as a semantic profile in an ontology. The profile of one text is chosen as the supply (S) and the other as the demand (D); our distance measure is symmetric, so this choice is arbitrary. In our examples in Section 2.1, the square profile was seen as the supply and the triangle profile as the demand. The concept frequencies of the profiles are normalized, so that the total supply equals the total demand.

The cost of the routes between nodes is determined by a semantic distance measure defined over the nodes in the ontology. A relation (such as hyponymy) between two concepts i and j is represented by an edge (i, j), and the cost c on the edge (i, j) can be defined as the semantic distance between i and j within the ontology. For simplicity, we use edge distance as our semantic distance measure c in this thesis; that is, each edge (i, j) has a cost of 1, and the distance between any two concepts is the number of edges separating them.2

2 Some semantic distances, such as those of Lin (1998) and Resnik (1995), do not take into account the underlying graph structure of the ontology in calculating the distance between two concepts. Using this type of distance in our MCF framework requires an extra graph transformation step; see Chapter 5 for more details.

Next, we must determine the value of b(i) at each concept node i. In the simple case, i occurs in only one profile or the other. If i ∈ S, b(i) is set to the normalized supply frequency, f_S(i). If i ∈ D, b(i) is set to the negative of the normalized demand frequency, −f_D(i), since demand is indicated by a value less than zero. However, i may be part of both the supply and demand profiles, and then b(i) must be set to the net supply/demand at node i. Thus we have:

b(i) = f_S(i) − f_D(i)    (2.4)

For example, if the supply profile contains a node car with a frequency of 0.25, and the same node in the demand profile has a frequency of 0.7, then b(car) is −0.45. In other words, the node car has a net demand of 0.45.

Recall that our goal is to transport all the supply to meet the demand—the key step is to determine the optimal routes between S and D such that the constraints in eqn. (2.2) and eqn. (2.3) are satisfied. The total distance of the routes, or the MCF—z(x) in eqn. (2.1)—is the distance between the two semantic profiles.

2.4 Ontological and Distributional Factors in MCF

To see how the factors of ontological distance and frequency distribution play out in the MCF formulation, let us return to our square and triangle profile example. Consider a hypothetical zoomed-in area of the earlier diagram in Figure 2.2(a), shown in Figure 2.4. Here we assume that the square nodes have a net supply (b(i) > 0) and the triangle nodes have a net demand (b(i) < 0).3 The size of the square and triangle nodes in the figure indicates |b(i)|—i.e., the relative supply/demand, respectively. The circles indicate nodes with neither supply nor demand constraints—i.e., b(i) = 0. Each arrow from node i to node j indicates the source

3 Earlier we made the simplifying assumption that square nodes were the supply profile and triangle nodes the demand profile. We have now seen that a node can belong to both profiles, so its characterization is more accurately stated in terms of net supply/demand. Thus, for example, a square node may belong to just the supply profile or to both the supply and demand profiles; the defining factor is that it has a net supply.


Figure 2.4: An example of transporting the weights at the square nodes (supply nodes) to the triangle nodes (demand nodes). The circle nodes have zero supply/demand requirement.

and destination for transported flow from a square node to a triangle. The length of an arrow represents the ontological distance, c(i, j), and the width indicates the amount of flow, x(i, j). Note that both the ontological distance between nodes and the node weights are important in determining the minimum cost flow. For example, the mass at the rightmost square has to be distributed over the two triangles, while the mass at the leftmost square is transported over a path with one edge (as indicated by the arrow nearby) instead of a path with three edges (with two circle nodes on the path). The aggregated length and width of the three arrows correspond to the minimum cost flow, i.e., the semantic distance between the profiles represented by the squares and triangles.

It is clear that ontological information plays a crucial role in the MCF formulation. If the squares were further away from the triangles in the ontology in Figure 2.4—i.e., if more edges separated the squares and the triangles—the sets of concepts they represent would be less semantically similar. In other words, the length of the arrows (representing c(i, j)) would be greater, and the resulting MCF would be larger, reflecting the greater semantic distance between the profiles. Distributional information in this method is equally critical to the distance calculation, because it determines the amount of supply/demand at each node. If the squares in Figure 2.4 were more uniformly sized, the two profiles would be more semantically similar because the weight would be distributed more similarly across the ontological space. In this case, less flow would have to travel from the rightmost square to the leftmost triangle (i.e., the corresponding arrow would be thinner, representing a smaller x(i, j)), and the resulting MCF would therefore be smaller. Finally, although MCF is a graph method, the minimum cost between two profiles has been shown to be a distributional distance between the profiles as well—MCF is equivalent to the Mallows distance on probability distributions (Levina and Bickel, 2001).4 In short, our MCF method captures the desired property that both the ontological distance between profile nodes and their frequency distributions determine the overall semantic distance between two profiles.
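The equivalence with the Mallows distance can be checked by hand in the smallest nontrivial case: with two supply points and two demand points, the transportation plan has a single free parameter, so the optimum of the linear objective lies at an endpoint of its feasible interval. The numbers and helper below are our own illustration, not from the thesis.

```python
# Mallows distance for two 2-point distributions: the joint distribution
# F = (f_ij) is determined by f11 alone, so we evaluate the (linear)
# cost at the two endpoints of f11's feasible interval. Toy data is ours.

def mallows_2x2(x, y, d):
    """x, y: two-point distributions summing to 1; d[i][j]: ground distance."""
    lo = max(0.0, x[0] - y[1])   # feasibility bounds on f11
    hi = min(x[0], y[0])
    best = float("inf")
    for f11 in (lo, hi):         # linear objective: optimum at an endpoint
        f = [[f11, x[0] - f11],
             [y[0] - f11, x[1] - (y[0] - f11)]]
        cost = sum(f[i][j] * d[i][j] for i in range(2) for j in range(2))
        best = min(best, cost)
    return best

x, y = [0.5, 0.5], [0.3, 0.7]
d = [[0.0, 2.0], [2.0, 0.0]]     # moving between different points costs 2
# only 0.2 of mass must move (0.5 -> 0.3), at cost 2 per unit:
assert abs(mallows_2x2(x, y, d) - 0.4) < 1e-9
```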

Now that we have presented the network-flow framework for measuring text distance, in the

next three chapters, we examine our method in more detail, both empirically and analytically.

First, we perform a traditional task-based evaluation of our method in three text comparison

tasks (Chapter 3). Then, we examine the distributional and graphical properties of the three sets

4 The Mallows distance between two (discrete) probability distributions, X and Y, is defined as:

M_F(X, Y) = ∑_{i=1}^{m} ∑_{j=1}^{n} f_ij ‖x_i − y_j‖    (2.5)

where X = {x_1, x_2, ..., x_m} and Y = {y_1, y_2, ..., y_n}. F = (f_ij) is the joint distribution of X and Y, subject to the following constraints:

f_ij ≥ 0,  1 ≤ i ≤ m,  1 ≤ j ≤ n    (2.6)

∑_{i=1}^{m} f_ij = y_j,  1 ≤ j ≤ n    (2.7)

∑_{j=1}^{n} f_ij = x_i,  1 ≤ i ≤ m    (2.8)

∑_{i=1}^{m} ∑_{j=1}^{n} f_ij = ∑_{i=1}^{m} x_i = ∑_{j=1}^{n} y_j = 1    (2.9)

The Mallows distance is highly similar to our MCF definition (eqn. (2.1) to eqn. (2.3)). X and Y can represent the frequency distributions of the texts; the joint distribution, f_ij, is analogous to the amount of flow transported from node i to node j; and ‖x_i − y_j‖ is analogous to the concept-to-concept distance between node i and node j.


of data in relation to our method's performance (Chapter 4). Finally, we examine the method

from an algorithmic perspective (Chapter 5).

Chapter 3

Task-based Evaluation

In surfaces, perfection is less interesting. For instance, a page

with a poem on it is less attractive than a page with a poem on it

and some tea stains.

Anne Carson, “The Art of Poetry No. 88.” Interview with Will

Aitken. The Paris Review, Issue 171, Fall 2004.

We evaluate our network-flow method on three different NLP tasks that can be formulated as

text comparison problems based on semantic distance between the texts. In each case, the

texts to be compared are treated as bags of words with associated frequencies. The tasks are

chosen to reflect different types of relations used to extract the relevant words, to see if a

varying amount of constraint on the words comprising a text influences the performance of our

method.

In verb alternation detection (Section 3.1), we identify which verbs, out of a set of target and filler verbs, allow a certain variation in the syntactic expression of their underlying argument structure. The task is achieved by comparing the set of head words that occur with the verb in each of two different syntactic positions (e.g., subject of intransitive and object of transitive). In this task, the words that comprise the texts to be compared have a particular syntactic relation to the verb under consideration. In proper name disambiguation (Section 3.2), a variant of word sense disambiguation (WSD), we classify the sense of an ambiguous name


according to its local context. We compare the text comprising the ambiguous instance to texts representing each of the known referents of the name. Here, the words of a text are extracted from a small window of occurrence around the target name token (25 words on each side), regardless of syntactic relations among the words. For the known referents, the words from these windows are aggregated across a small set of labelled instances. In document classification (Section 3.3), a text is classified into one of a restricted number of topic categories. The text to be classified consists of all the words in a document; for each topic, it is compared to a set of words corresponding to a small set of known documents for that topic. The extracted words are not constrained by syntactic relation (as in verb alternation) or even by distance to a target element (as in name disambiguation).
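The windowing step used for name disambiguation can be sketched as follows; the tokenization and example sentence are our own, and only the window size (25 words per side) comes from the text.

```python
# Sketch of the +/-25-word context window around a target name token.
# The example sentence and whitespace tokenization are illustrative.

def context_window(tokens, target_index, size=25):
    lo = max(0, target_index - size)
    return (tokens[lo:target_index] +
            tokens[target_index + 1:target_index + 1 + size])

toks = "the Martin I mean is the author of the novel".split()
assert context_window(toks, toks.index("Martin"), size=3) == \
    ["the", "I", "mean", "is"]
```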

In each case, the resulting bag of words for a text must be mapped into a semantic profile—a frequency-weighted set of concepts in an ontology. Because all three of our tasks involve general domain text, we use WordNet (Fellbaum, 1998). (A domain-restricted task may motivate the use of a domain-specific ontology, such as UMLS for comparing medical texts, as in Bodenreider 2004.) Because the noun hierarchy of the WordNet ontology is the most developed, we restrict our semantic profiles to use only the nouns from the bag of words corresponding to a text.

The bag of nouns with their associated frequencies must be mapped to the appropriate concepts in WordNet. A simple method is to distribute the frequency of each word to its corresponding concepts. For example, Ribas (1995) maps the word frequency to the most specific concept(s) for the word, while Resnik (1993) distributes the word frequency across the most specific concept(s) as well as their hypernyms. Other approaches estimate the appropriate probability distribution over a set of concepts to represent a given bag of nouns as a whole, rather than mapping each noun individually to its concepts (Li and Abe, 1998; Clark and Weir, 2002). For all three of our tasks, we map each noun individually to its most specific concepts, uniformly dividing the word frequency among them. In verb alternation, we also experiment with the possibility of finding the best set of frequency-weighted concepts for the full bag of nouns, to see if this affects the performance of our method.

The precise classification experiment performed using these semantic profiles is described

in detail below in the section for each task. In each case, we compare the performance of our

MCF method on the semantic profiles to one or more purely distributional methods using the

original word frequencies.

3.1 Task 1: Verb Alternation Detection

Verb alternation refers to variations in the syntactic expression of verbal arguments. If a verb participates in an alternation, the same underlying semantic argument may appear in varying positions (slots) of the verb's subcategorization frames. For example, the following sentences show that the argument undergoing the melting action can appear as the subject of an intransitive use of melt (1a) or as the object of a transitive use (1b).

1a. The chocolate melted.

1b. The cook melted the chocolate.

This type of intransitive/transitive pairing is known as the causative alternation because of the

explicit expression of the causer (the cook) in the transitive alternant.

It has long been hypothesized that the semantics of a verb andits relations to its argu-

ments at least partially determine the syntactic expression of those arguments (see Pinker,

1989, among others). Influential work by Levin (1993) showedthat this relationship could be

exploited “in reverse” by using alternation behaviour as anindicator of the underlying seman-

tics of a verb—specifically, that verbs undergoing the same sets of alternations form classes

with similar semantics. Computational linguists have built on this work by demonstrating that

statistical cues to alternation behaviour can be used to automatically place verbs into semantic

classes (e.g., Merlo and Stevenson, 2001; Schulte im Walde, 2006).

28 CHAPTER 3. TASK-BASED EVALUATION

Detection of verb alternation behaviour can be cast as a text comparison problem (Merlo

and Stevenson, 2001; McCarthy, 2000). Consider an alternation, such as the causative illus-

trated in (1) above. The set of nouns appearing in the subject of the intransitive (such as

chocolate) have the same relation to the verb as the set of nouns appearing in the object of the

transitive. Because the verb places constraints on what kinds of entities can be in that relation

(here, things that are meltable), the two sets of nouns should be similar. Hence, to identify a

particular alternation for a verb, the set of nouns in a certain slot of one of its subcategorization

frames is compared to the set of nouns in the alternating slot for that semantic argument in

another subcategorization frame.

For example, Merlo and Stevenson (2001) devise a simple lemma overlap score that counts

the number of tokens appearing in both of the relevant syntactic slots. McCarthy (2000) instead

compares two semantic profiles in WordNet that contain the concepts corresponding to

the nouns from the two argument positions. In McCarthy’s method, the profiles are first generalized

to a set of higher level nodes in the hierarchy (starting with the method of Li and

Abe, 1998); next, skew divergence is used to find the distance between the resulting vectors

of concepts. Here we use our network flow method to directly compare the semantic profiles

corresponding to the noun sets. Our method allows us to compare sets of weighted concepts as

in McCarthy (2000), but using a distance method that applies within the ontology graph, rather

than simply using a distributional distance measure over concept vectors.
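The simpler of the two prior approaches can be illustrated with a toy version of a lemma overlap cue in the spirit of Merlo and Stevenson (2001); the scoring here is illustrative, not their exact formula:

```python
def overlap_score(subj_intrans, obj_trans):
    """Count the lemmas that appear in both syntactic slots.
    An illustrative overlap cue, not the published formula."""
    return len(set(subj_intrans) & set(obj_trans))

melt_score = overlap_score(
    ["chocolate", "butter", "snow"],   # subjects of intransitive uses
    ["chocolate", "butter", "metal"],  # objects of transitive uses
)
# melt_score == 2: "chocolate" and "butter" occur in both slots,
# suggesting the two slots share an underlying semantic argument.
```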

3.1.1 Experimental Setup

3.1.1.1 Experimental Verbs

We evaluate our method on the causative alternation. As noted above, in this alternation the

target syntactic slots for comparison are the subject of the intransitive (Subj-Intrans) and the

object of the transitive (Obj-Trans). (These are the positions of the chocolate in (1a) and (1b)

above, respectively.) To identify verbs undergoing this alternation, we randomly select verbs


from among Levin classes that are indicated to allow the causative alternation. This allows

us to test our method’s ability to detect alternation behaviour among verbs from a range of

semantic classes, which may differ in other respects.

We refer to the verbs that are expected to undergo the causative alternation as causative

verbs. For comparison, we randomly select an equal number of filler verbs, subject to the

constraint that their Levin classes do not allow a causative alternation. (Specifically, none of

the classes containing a filler verb allows an alternation in which the same underlying argument

appears in the Subj-Intrans slot as well as the Obj-Trans slot.) The full set of potential causative

and filler verbs are filtered according to corpus counts, as described next.

3.1.1.2 Corpus Data and Argument Extraction

We use a randomly selected 35M-word portion of the British National Corpus (BNC, Burnard,

2000). The text is parsed using the RASP parser of Briscoe and Carroll (2002), and subcategorization

frames are extracted using the system of Briscoe and Carroll (1997). Each subcategorization

frame entry for a verb includes a list of the observed argument heads per slot along

with their frequencies. For each verb/slot pair, we can thus extract the set of nouns used in that

slot along with their frequency of occurrence.

Verbs are filtered from the potential list of experimental items if they occur less than 10

times in our corpus in either the transitive or intransitiveframe. The verbs are then divided

into multiple frequency bands: high (at least 450 instances), medium (between 150 and 400

instances), and low (between 10 and 100 instances). An equal number of verbs of each type

(causative and filler) is randomly selected within each band, yielding a total of 120 experimen-

tal verbs in balanced datasets of 60 items for development and 60 items for testing. We evaluate

our method on the full set of 60 verbs in each of the datasets, as well as individually on the

three frequency bands of 20 verbs each.
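The band assignment just described can be sketched as follows; the treatment of counts that fall between the stated bands (e.g., 101–149) is not specified in the text and is an assumption here:

```python
def frequency_band(count):
    """Band cutoffs from Section 3.1.1.2; verbs occurring fewer than
    10 times in either frame are filtered out. Returning None for
    counts between the stated bands is an assumption."""
    if count >= 450:
        return "high"
    if 150 <= count <= 400:
        return "medium"
    if 10 <= count <= 100:
        return "low"
    return None

# Invented corpus counts for illustration.
bands = {verb: frequency_band(count)
         for verb, count in
         {"melt": 500, "boil": 200, "crack": 50, "sear": 5}.items()}
```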


3.1.1.3 Comparing Semantic Profiles

For each verb, we create a semantic profile for each of the Subj-Intrans and Obj-Trans slots.

We map the argument head frequencies from the extracted subcategorization frame for the verb

to the corresponding nodes in WordNet, as described in the introduction of this chapter. (We

also consider here a different profile generation method, discussed later in Section 3.1.2.2.)

We then calculate the network flow distance between the two semantic profiles for each verb,

yielding a distance calculation for that verb. Recall that we expect verbs that participate in

the alternation to have more similar semantic profiles corresponding to the Subj-Intrans and

Obj-Trans nouns. We thus rank all the verbs by the distance calculation, and (as in McCarthy,

2000) set a threshold to divide the verbs into causative (smaller distance values) and non-

causative (larger distance values). Following McCarthy, we experimented with both the mean

and median values as the threshold, but found little difference. We report the results using the

median distance as the threshold, since this provided more consistent results with our method.
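The median-threshold classification can be sketched as follows; treating a verb whose distance falls exactly at the threshold as non-causative is an assumption, since the text does not specify tie handling:

```python
import statistics

def classify_by_median(distances):
    """Rank verbs by the distance between their Subj-Intrans and
    Obj-Trans profiles and split at the median, as described above:
    smaller distances are labelled causative."""
    threshold = statistics.median(distances.values())
    return {verb: ("causative" if dist < threshold else "non-causative")
            for verb, dist in distances.items()}

# Invented distance values for illustration.
labels = classify_by_median(
    {"melt": 0.2, "break": 0.3, "laugh": 0.8, "smile": 0.9})
```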

3.1.2 Results and Analysis

We present results on both development and test data, and also examine the effect of using

alternative profile generation methods. Because we label all verbs in our experiments, we use

accuracy as the performance measure; the random baseline (given our balanced datasets) is

50%. We compare our network-flow distance (NF) to a number of other distance measures

including probability distributional distances given by Jensen-Shannon divergence (JS) and

skew divergence (skew div) (Lee, 2001), as well as the general vector distances of cosine,

Manhattan distance, and Euclidean distance.

3.1.2.1 Development and Test Results

On the development data, our network flow distance performs better than or as well as all

other measures on the individual frequency bands. (See Table 3.1. Best performance in each


               All      Frequency Bands          Avg of
               Verbs    High    Medium    Low    Bands

NF             0.60     0.70    0.70      0.70   0.70
cosine         0.57     0.60    0.60      0.60   0.60
Manhattan      0.63     0.70    0.70      0.70   0.70
Euclidean      0.47     0.40    0.50      0.40   0.43
skew div       0.57     0.60    0.60      0.50   0.57
JS             0.60     0.70    0.60      0.70   0.67

Table 3.1: Accuracies on development data by the network-flow method (NF), cosine, Manhattan distance, Euclidean distance, skew divergence (skew div), and Jensen-Shannon divergence (JS). Best accuracies in each condition are shown in boldface.

condition is shown in boldface.) However, on all verbs combined (the “All Verbs” column) the

performance of our method is not the best, and indeed is worse than the performance on the

individual frequency bands.

In response to this trend on development data, we examined the distance values across the

frequency bands. We found that low frequency verbs tend to have smaller distances between

the two slots and high frequency verbs tend to have larger distances. As a result, the threshold

for all verbs lies in between the thresholds for each of these frequency bands. When classifying

all verbs, the frequency effect may result in more false positives for low frequency verbs

(which have generally smaller distance values), and more false negatives for high frequency

verbs (which have generally larger distance values). The column labelled “Avg of Bands” of

Table 3.1 shows the performance when averaging the results across the individual frequency

bands. For most methods, including ours, the “Avg of Bands” results are much better than

when considering all verbs together (the “All Verbs” column).

Table 3.2 reports the performance on the unseen test data, which is similar to that on development

data. Again, we find that our method is tied for the best performance in all conditions

except for all verbs combined. Here, taking the average of the frequency bands does not help

performance of our method compared to “All Verbs”, but neither does it hurt (and for most

methods “Avg of Bands” does better or the same as “All Verbs”). We conclude that separating

items by frequency may be required to achieve robust results in this type of task.


               All      Frequency Bands          Avg of
               Verbs    High    Medium    Low    Bands

NF             0.67     0.60    0.80      0.60   0.67
cosine         0.50     0.60    0.50      0.50   0.53
Manhattan      0.63     0.60    0.80      0.60   0.67
Euclidean      0.60     0.50    0.70      0.50   0.57
skew div       0.63     0.60    0.80      0.60   0.67
JS             0.70     0.60    0.80      0.60   0.67

Table 3.2: Accuracies on test data by the network-flow method (NF), cosine, Manhattan distance, Euclidean distance, skew divergence (skew div), and Jensen-Shannon divergence (JS). Best accuracies in each condition are shown in boldface.

Although our method is tied for best in every condition except “All Verbs”, neither is our

method distinguished from several of the other distance measures. Given the relatively small

amounts of data per verb (with profiles averaging about 900 nodes in size), it is possible that the

raw profiles suffer from a sparse data problem and are not sufficiently capturing the conceptual

similarities among alternating slots. McCarthy (2000) addressed this issue by using a technique

for generalizing concept nodes prior to comparing profiles. We explore this issue next.

3.1.2.2 Comparing Different Profile Generation Methods

Our above experiments use semantic profiles created directly from the word frequencies, as

described earlier. However, research has explored the possibility of generalizing this kind of

“raw” data to a semantic profile that more appropriately reflects the coherent concepts ex-

pressed in the original set of weighted concept nodes. This can be especially useful when

creating semantic profiles from small amounts of data, given the noise introduced in the mapping

of words to concepts.1 To explore the effect of different profile generation methods on this

task, we consider here two approaches, that of Li and Abe (1998) and Clark and Weir (2002).

Both these methods start with a semantic profile generated as described earlier in the chapter

and attempt to find the set of nodes in the ontology that appropriately generalize the concepts

1Because we divide the frequency of a word uniformly among all the word’s concepts, with no attempt at disambiguation or informed weighting, much noise is introduced. Given small amounts of data, the noise may be sufficient to mislead our network flow method.


               raw             Li and Abe      Clark and Weir
               Dev     Test    Dev     Test    Dev     Test

NF             0.70    0.67    0.50    0.67    0.73    0.70
Manhattan      0.70    0.67    0.57    0.67    0.60    0.57
skew div       0.57    0.67    0.53    0.67    0.68    0.60
JS             0.67    0.67    0.63    0.67    0.63    0.53

Table 3.3: Average accuracies by the network-flow method (NF), Manhattan distance (Man), skew divergence (skew div), and Jensen-Shannon divergence (JS) on different profiles: original (“raw”), Li and Abe, and Clark and Weir profiles. Best accuracies in each condition are shown in boldface.

in the “raw” profile and calculate the probability estimate of the resulting set of generalized

concepts.

Table 3.3 compares the performance of the network flow distance with that of several other

measures on the original (“raw”) profiles, the Li and Abe profiles, and the Clark and Weir

profiles. Results are reported for the average of the individual frequency bands, since that pro-

duced the best results overall in our earlier experiments. The results for cosine and Euclidean

distance are omitted, since they perform worse overall than the other measures.

The best results across both development and test data are achieved by our network flow

method on the Clark and Weir profiles. Considering the results across all profile types, the

network flow approach is most consistent, achieving the best (or tied for best) performance in all

but one condition (development data with Li and Abe profiles). The distributional methods

(Manhattan, skew div, JS) in most cases perform worse on the generalized profiles than on the

“raw” profiles. (The one exception is that skew divergence does better on development data on

the Clark and Weir profiles.)

Overall, then, it seems that raw data is likely best for a purely distributional method, but

the Clark and Weir profiles enable the network flow method to outperform them by exploiting

the graph structure of the ontology. Indeed, when comparing our method to the others on the

Clark and Weir profiles for the individual frequency bands (Table 3.4 and Table 3.5), we find

that much of our performance advantage comes on the low frequency verbs. This indicates that


               All      Frequency Bands          Avg of
               Verbs    High    Medium    Low    Bands

NF             0.73     0.70    0.80      0.70   0.73
cosine         0.67     0.70    0.40      0.60   0.57
Manhattan      0.67     0.65    0.75      0.40   0.60
Euclidean      0.60     0.65    0.70      0.50   0.62
skew div       0.67     0.70    0.75      0.60   0.68
JS             0.67     0.65    0.75      0.50   0.63

Table 3.4: Accuracies on development data on profiles generated using Clark and Weir’s (2002) method. Best accuracies in each condition are shown in boldface.

               All      Frequency Bands          Avg of
               Verbs    High    Medium    Low    Bands

NF             0.67     0.70    0.80      0.60   0.70
cosine         0.50     0.50    0.50      0.50   0.50
Manhattan      0.50     0.60    0.70      0.40   0.57
Euclidean      0.53     0.50    0.80      0.40   0.57
skew div       0.67     0.60    0.80      0.40   0.60
JS             0.57     0.50    0.70      0.40   0.53

Table 3.5: Accuracies on test data on profiles generated using Clark and Weir’s (2002) method. Best accuracies in each condition are shown in boldface.

the combination of our method with a suitable generalization technique is especially important

when dealing with sparse data.

We examine the data further to discover why the Li and Abe profiles yield poorer perfor-

mance in most cases on the development data. We find that Li and Abe’s method tends to

generate profiles with more general concepts. For example, when given an original set of concepts

such as Edam, Brie, Sockeye, and Chinook, the method may produce a single general

concept such as food instead of the two concepts cheese and salmon that capture the two kinds

of food that are indicated. The loss of semantic information from using overly general concepts

may produce the decrease in performance.

For comparison, we also apply McCarthy’s (2000) method to our test dataset, and find

that it achieves only 0.60 on all verbs and 0.53 averaged over the three frequency bands. Her

method is especially poor on low frequency verbs (below chance at 0.40). We hypothesize that


her method is less robust to low frequency counts because it may overgeneralize the data by

first applying Li and Abe’s (1998) method, and then generalizing the nodes even further.

We see that while some amount of generalization of the semantic profiles is useful in this

task, overgeneralization may be harmful. We leave it to future work to explore the interaction

of our network flow method with different types of profile generation across various tasks.

Since the next two tasks we consider use larger amounts of data, we only experiment with raw

profiles in those cases.

3.2 Task 2: Name Disambiguation

Interest in the NLP problem of name disambiguation has increased as the growth of the World

Wide Web has led to large numbers of ambiguous name references in on-line text. For example,

websites or documents containing the name John Edwards may refer to the U.S. presidential

candidate for 2008, an NBA basketball player, or a British medical geneticist. As in word

sense disambiguation, an ambiguous name may be resolved by comparing its local textual

context—the set of words it co-occurs with—with the local textual contexts of the name when

its reference is known. For example, the text surrounding the name John Edwards in its various

uses is very likely to include distinguishing words such as politician vs. game vs. research.

Many approaches have been proposed for resolving name ambiguity by using distributional

methods over contextual information (e.g., Han et al., 2005; Pedersen et al., 2005; Xu et al.,

2003).

In this section, we present the application of our network flow distance measure to a name

disambiguation task, and demonstrate the benefits of combining ontological and distributional

knowledge in this task. The particular task we examine is one of “pseudo name disambiguation”,

in which the texts containing matched pairs of different names are extracted, and then the

two different names are replaced by a single symbol, leading to an ambiguous “name” across

the two sets of texts. The goal is to recover the correct target name in each instance. For exam-


ple, the names of two soccer players (Ronaldo and David Beckham) form one disambiguation

task, while the names of an ethnic group and a diplomat (Tajikand Rolf Ekeus) form another.

This task was established by Pedersen et al. (2005) to provide “annotated” experimental data

(with each text indicating the correct name), without the need for expensive manual annotation.

In Pedersen et al.’s (2005) work, an unsupervised method of name discrimination through

text clustering was used to address this task. This is infeasible for a method like ours, in

which each distance calculation requires access to an ontology. (The worst-case complexity of

clustering with our method is quadratic in the size of the ontology used; a detailed discussion

can be found in Chapter 5.) Instead, we use a supervised methodology, but experiment with

varying small amounts of data in a minimally supervised approach. Although our method

requires extra manual effort in the form of data annotation for training, we find that the amount

of annotated data required is modest.

3.2.1 Experimental Methodology

3.2.1.1 Corpus Data

We use Pedersen et al.’s (2005) dataset, which was taken from the Agence France Press English

Service portion of the GigaWord English corpus distributedby the Linguistic Data Consortium.

They extracted the local context of six pairs of names of varying confusability, including: the

names of two soccer players (Ronaldo and David Beckham); an ethnic group and a diplomat

(Tajik and Rolf Ekeus); two companies (Microsoft and IBM); two politicians (Shimon Peres

and Slobodan Milosevic); a nation and a nationality (Jordan and Egyptian); and two countries

(France and Japan). For each name instance, the extracted text consists of 50 words (25 words

to the left and to the right of the target name), with the target name obfuscated. For example,

for the task of distinguishing “David Beckham” and “Ronaldo”, the target name in each in-

stance becomes “DavidBeckhamRonaldo”. Each pair of names thus serves as one of six name

disambiguation tasks. Table 3.6 shows the number of instances per task (name pair). The “Ma-


Name 1          Count     Name 2               Count     Total     Majority
Ronaldo         1700      David Beckham        752       2452      0.69
Tajik           3002      Rolf Ekeus           1071      4073      0.74
Microsoft       3401      IBM                  2406      5807      0.59
Shimon Peres    7686      Slobodan Milosevic   6048      13734     0.56
Jordan          25039     Egyptian             21392     46431     0.54
Japan           116379    France               110435    226814    0.51

Table 3.6: The pairs to be identified, the raw frequency, and the relative frequency of the majority name.

jority” column also indicates the relative frequency of the majority name in each pair, which

we adopt as the baseline accuracy.

3.2.1.2 Classification Using the Network-Flow Method

As mentioned above, we take a supervised approach, in which name instances are classified

with the use of annotated training data. To generate our training data, we randomly select a

portion of the instances for each of the 12 names. All the training instances for a name are

used to form a single aggregate semantic profile, which serves as the gold-standard for that

name. The remaining instances serve as test data; for each of these, we build an individual

semantic profile. All profiles are generated as described in the introduction of this chapter, i.e.,

each frequency count for a word is distributed uniformly among the corresponding concepts

in WordNet. A gold-standard profile is constructed in exactly the same way except that its

word frequency vector is created by aggregating the word counts from all the relevant training

instances. Note that there is nothing special about such a profile or how it is formed; it simply

aggregates counts from multiple contexts.
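Building a gold-standard profile by aggregation can be sketched as follows; `build_profile` stands in for the word-to-concept mapping described above, and an identity builder is used here for illustration:

```python
from collections import Counter

def gold_standard_profile(training_contexts, build_profile):
    """Aggregate the word counts from all training instances of a
    name, then build one semantic profile from the combined counts."""
    combined = Counter()
    for context_words in training_contexts:
        combined.update(context_words)
    return build_profile(combined)

# Toy contexts for one name; with an identity builder, the gold
# standard is simply the aggregated word-frequency vector.
gold = gold_standard_profile(
    [["goal", "match"], ["match", "cup"]], build_profile=dict)
```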

To classify a name instance, we measure the network-flow distance between the individual

profile of the ambiguous instance and each of the two gold-standard profiles for that task. The

name whose gold-standard profile has the shortest distance to the instance profile is the name

assigned to the ambiguous instance. For example, assume we have a “DavidBeckhamRonaldo”

instance to be classified. We compare its profile to each of the gold standard profiles for “David


Beckham” and “Ronaldo” by measuring the distance between each of the two pairs of profiles.

If the instance profile has a shorter distance to the profile for “David Beckham” than to that of

“Ronaldo,” then it is classified as “David Beckham,” otherwise as “Ronaldo.”
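The nearest-profile decision rule can be sketched as follows; the L1 (Manhattan) distance here is only a stand-in for the network-flow measure, and the profiles are invented toy counts:

```python
def disambiguate(instance_profile, gold_profiles, distance):
    """Assign the name whose gold-standard profile is closest to the
    instance profile under the given distance measure."""
    return min(gold_profiles,
               key=lambda name: distance(instance_profile,
                                         gold_profiles[name]))

# Illustrative L1 distance over the shared vocabulary, standing in
# for the network-flow measure.
def l1(p, q):
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

name = disambiguate(
    {"goal": 3, "match": 1},
    {"David Beckham": {"goal": 2, "match": 2},
     "Ronaldo": {"goal": 1, "brazil": 4}},
    distance=l1)
# The instance is closer to the "David Beckham" profile (L1 = 2)
# than to the "Ronaldo" profile (L1 = 7).
```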

3.2.1.3 Evaluation Methodology

We use the accuracy of labelling all instances as our evaluation measure. To compare to prior

results using F-measure, we report that in some tables. Since we label all instances, accuracy

and F-measure are equivalent, using 2rp/(r + p) as the definition of F-measure.
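The equivalence can be checked directly: with precision p = correct/labelled and recall r = correct/total, labelling all instances makes labelled = total, so r = p = accuracy and 2rp/(r + p) reduces to accuracy.

```python
def f_measure(correct, labelled, total):
    """F = 2rp/(r + p), with precision p = correct/labelled and
    recall r = correct/total."""
    p = correct / labelled
    r = correct / total
    return 2 * r * p / (r + p)

# When all instances are labelled (labelled == total), precision and
# recall both equal accuracy, so F-measure equals accuracy.
accuracy = 8 / 10
f = f_measure(correct=8, labelled=10, total=10)
```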

The random baseline for our task is the accuracy of labelling all instances with the pre-

dominant name, as shown in the “Majority” column of Table 3.6. Since we use the dataset of

Pedersen et al. (2005), we compare our performance to their distributional method (reporting

their best results both with and without singular value decomposition). Because their method

is an unsupervised one, we also train and test a supervised learner using distributional data

(LIBSVM by Chang and Lin, 2001). For each set of training data, we remove stopwords and

use the remaining words as input features for the SVM. We then obtain the optimal parameters

(i.e., optimal values for cost and gamma in LIBSVM) by using 10-fold cross-validation over

the training data. Finally, we perform classification on the test data using those parameters.

This enables us to compare our results to a purely distributional method with access to the

same training data.
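The baseline training procedure can be sketched with scikit-learn's SVC (a wrapper around LIBSVM) as a stand-in for LIBSVM itself; the features and labels below are invented, and 2-fold cross-validation replaces the 10-fold procedure only because the toy dataset is tiny:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = [[i, i % 3] for i in range(20)]   # toy stand-ins for word-count features
y = [0] * 10 + [1] * 10               # toy name labels

# Grid search over cost (C) and gamma, analogous to the parameter
# search described in the text (which uses 10-fold cross-validation).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=2)
grid.fit(X, y)
predictions = grid.predict(X)
```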

Because our method is supervised, it is important to minimize the amount of annotated

data required to build the gold-standard profiles.2 Since it is unclear a priori what amount of

training data is sufficient, we experiment with several quantities. We initially select 200 random

instances per pair of names, respecting the relative proportions of the two names overall. (200

instances constitute about 0.1–10% of the data per pair of names.) Subsequently, we decrease

the quantity further, to one-half and one-quarter the original amount (100 and 50 instances,

2Lengthy training time can also be an issue for a supervised method, but here “training” is the straightforward task of building an aggregate semantic profile.


respectively) to observe how the performance is influenced by the amount of data used to

construct the gold standard profiles.3 To reduce the impact of possible skewed sampling of

training data, we repeat the random sampling five times, with no overlap between the random

samples. We report the performance of each sample set as well as the average over the five

samples.

3.2.2 Results and Analysis

3.2.2.1 Initial Experiments

Table 3.7 shows the performance of our method over five random samples of 200 training

instances per task. Observe that the performance over the five rounds varies very little (a

maximum difference of 0.08, and most are much closer). This shows the robustness of our

method to different make-ups of training data. Table 3.8 shows the average performance of

our method, in comparison to the chance (majority) baseline, as well as the results produced

by the unsupervised method of Pedersen et al. (2005) (with singular value decomposition—

SVD—reported as Ped05SVD, and without SVD as Ped05), and the supervised SVM on the

same training data as our method. Observe that our method notonly significantly outperforms

the random baseline, it is moreover the best performer amongst all the methods (paired t-test,

p < 0.05).

There are cases for which Pedersen et al.’s methods have at best chance performance (Mi-

crosoft/IBM and Japan/France). The authors suggest that these pairs of names arise in contexts

of news text in which there are “no consistently strong discriminating features” useful in the

clustering algorithm. (Interestingly, this is the case even with SVD, where words are grouped

into a small number of unnamed concepts.) Even the SVM has difficulty with these pairs, also

3We also experiment with 400 training instances to see whether increasing the amount of training data helps. The performance benefit is minimal: two tasks have the same average performance, three improve by 1%, and one by 2%, with an improvement in the average over all the tasks of 1.25%. A paired t-test between the results on 400 and 200 training instances yields a high p value (p = 0.73), indicating that the differences between the two are statistically insignificant.


                     Random Samples                  Average of
Name Pair            1      2      3      4      5   Samples
Ronaldo/Beckham      0.78   0.83   0.76   0.79   0.84   0.80
Tajik/Ekeus          0.98   0.98   0.97   0.96   0.98   0.97
Microsoft/IBM        0.73   0.72   0.73   0.74   0.73   0.73
Peres/Milosevic      0.96   0.96   0.97   0.96   0.97   0.96
Jordan/Egyptian      0.79   0.78   0.78   0.77   0.76   0.77
Japan/France         0.79   0.73   0.77   0.70   0.73   0.75

Table 3.7: Network-flow results (accuracy) using 200 training instances on the random samples and their average performance.

Name Pair            Majority   Ped05   Ped05SVD   SVM200   NF200
Ronaldo/Beckham      0.69       0.73    0.65       0.85     0.80
Tajik/Ekeus          0.74       0.96    0.89       0.90     0.97
Microsoft/IBM        0.59       0.51    0.59       0.62     0.73
Peres/Milosevic      0.56       0.97    0.94       0.90     0.96
Jordan/Egyptian      0.54       0.59    0.62       0.72     0.77
Japan/France         0.51       0.51    0.50       0.48     0.75
Unweighted Average   0.61       0.71    0.70       0.75     0.84
Weighted Average     0.53       0.55    0.55       0.55     0.77

Table 3.8: Performance results for the network flow (NF) method using 200 instances per gold standard profile, SVM using 200 training vectors, and Ped05 and Ped05SVD (the best results without and with SVD, respectively, in Pedersen et al., 2005). The weighted average is calculated based on the number of instances in each pair of names. The best result for each name pair is indicated in boldface.

performing at just around chance. Yet our method performs well above chance for these pairs.

In general, SVM produces results that are little better on average than the unsupervised results

in Pedersen et al. (2005) (with some tasks performing better, and some worse). This shows that

the performance improvement by the network-flow method does not depend solely on access

to training data. Instead, it seems that the use of ontological relations in calculating distance

can significantly enhance the discriminatory power over simply using words.

Note that there is one difference between the data used in the SVM and the network-flow

experiments: the SVM is trained using all words as features,while only WordNet noun con-

cepts are used in the network-flow experiments. It is possible that using just nouns or a mapping


Name Pairs           NF     Concepts Only   Nouns Only   All Words
Ronaldo/Beckham      0.80   0.85            0.86         0.85
Tajik/Ekeus          0.97   0.96            0.90         0.90
Microsoft/IBM        0.73   0.61            0.63         0.62
Peres/Milosevic      0.96   0.87            0.91         0.90
Jordan/Egyptian      0.77   0.72            0.72         0.72
Japan/France         0.75   0.51            0.49         0.48
Unweighted Average   0.84   0.77            0.75         0.75
Weighted Average     0.77   0.57            0.56         0.55

Table 3.9: SVM results using 200 training instances.

                     Number of Training Instances
Name Pair            200    100    50
Ronaldo/Beckham      0.80   0.79   0.76
Tajik/Rolf Ekeus     0.97   0.98   0.96
Microsoft/IBM        0.73   0.73   0.72
Peres/Milosevic      0.96   0.97   0.94
Jordan/Egyptian      0.77   0.74   0.70
Japan/France         0.75   0.75   0.70
Unweighted Average   0.83   0.83   0.80
Weighted Average     0.77   0.76   0.72

Table 3.10: Average classification results of the network flow method using 200, 100, and 50 training instances per classification task. The weighted average is calculated based on the number of test instances per task.

of nouns to WordNet concepts could bring the performance of the SVM into line with our net-

work flow measure. We thus perform two replications of the SVM experiments, one using only

nouns as features and one using noun concepts as features (and the relevant frequencies as the

feature values in both cases). However, both of these approaches produce little to no improvement

over the all-words results (see Table 3.9). We conclude that our network-flow method is

superior to, and more consistent than, the purely distributional methods, and that this difference

is attributable to the integration of distributional and ontological (relational) information in our

measure.


3.2.2.2 Reducing the Amount of Training Data

Because, in contrast to Pedersen et al. (2005), we use a supervised approach, we want to de-

termine whether we can reduce our dependence on training data. Here, we report experiments

using one-half (100 instances) and one-quarter (50 instances) of the training data used above.

As before, we repeat the random sampling of the training instances five times in each case, and

report the average performance here.

Table 3.10 shows the network flow performance for 200, 100, and 50 training instances.

Numerically, the results do not differ by much when the training data is reduced from 200 to

100 instances, and a paired t-test finds the difference to be non-significant. The performance

drop is more pronounced in the 50-instance experiment, where every pair of names shows

some drop in performance compared to 100 instances. Here, a paired t-test shows that the

performance drop in the 50-instance experiment is statistically significant (p = 0.04). Despite

this, we still outperform the other methods: our results using 50 training instances are much

better than those of Pedersen et al. (2005) in all but one task, and even better overall than the

SVM using 200 training instances (compare the SVM column of Table 3.8).

For comparison, we also train the SVM on 100 training instances, and find a decrease

of 3% on average from using 200 training instances. We conclude that our method is more

robust to minimal training conditions. To explore the least amount of training data needed

for our measure, we further reduce the amount for producing gold-standard profiles to 20 and

5 instances per task, and observe a continual drop in performance. The performance of one

task (Ronaldo/David Beckham) drops below chance with 20 training instances and another

(Microsoft/IBM) drops below chance with 5. For this set of data, we conclude that a minimum

of 50 instances per task are required to provide enough discriminating power for our method.

Although unsupervised methods have the advantage of requiring no training data, in our

case, 50 to 100 training instances constitute only a very small portion of the data, as well as

a small amount of annotation effort in absolute terms. We conclude that the (small) labelling

effort is justified by the performance gain achieved using our minimally-supervised approach.


Name Pair            JS     Ped05   SVM (100)   NF (100)
Japan/France         0.31   0.51    0.45        0.75
Jordan/Egyptian      0.31   0.59    0.73        0.74
Microsoft/IBM        0.46   0.51    0.54        0.73
Peres/Milosevic      0.68   0.97    0.84        0.97
Ronaldo/Beckham      0.69   0.73    0.84        0.79
Tajik/Rolf Ekeus     1.01   0.96    0.89        0.98
Standard deviation   0.27   0.21    0.18        0.12

Table 3.11: The performance results of Pedersen et al. (2005) (Ped05), as well as network flow (NF) and SVM using 100 training instances, ranked in the order of the JS divergence.

3.2.2.3 The Contribution of Textual Data

The better performance of our method, in comparison to a state-of-the-art supervised learner,

indicates that sensitivity to word frequency distribution alone is not sufficiently discriminating

for this task. To further investigate this hypothesis, we create an aggregate word frequency

vector using all disambiguated instances of each name, and then compare the context vectors

of the two names in each disambiguation task. The comparison is done by measuring their

distance using a symmetric distributional measure, Jensen-Shannon divergence:

JS(p, q) = 1/2 [D(p ‖ avg(p, q)) + D(q ‖ avg(p, q))]        (3.1)
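As a minimal sketch (our own illustration, not code from the thesis), eqn. (3.1) can be computed directly from its definition; here D is the KL divergence, and we use log base 2 so that two completely disjoint distributions have a JS divergence of 1:

```python
import math

def kl(p, q):
    # D(p || q); terms with p_i = 0 contribute nothing to the sum.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Eqn (3.1): average KL divergence of p and q from their midpoint.
    avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, avg) + kl(q, avg))
```

Unlike KL divergence itself, JS is symmetric: js(p, q) equals js(q, p) for any two distributions over the same vocabulary.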

If a method is indeed sensitive to distributional information, we expect to see a positive correlation between the distributional distances of the context vectors and the performance results.

Indeed, Table 3.11 shows that, generally, the larger the distributional distance, the better the

name disambiguation methods perform.

In addition, we calculate the Pearson r correlation between each set of performance results

with the distributional distances given in column JS of the table. That is, each set of results

is compared to the JS divergences measured on the “All” aggregate vectors. If the method

producing the results is a supervised one, we compare it with the JS divergences measured on

the aggregate vectors created using the same training data. For all comparisons, the correlation


coefficients are high (Pearson: r ≥ 0.6, p < 0.05; with one exception, between JS and Ped05,

p = 0.07).4 This confirms our hypothesis that all methods are sensitive to the distributional

information of texts.
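The correlation itself is a straightforward computation. The sketch below (our own illustration, using the JS and NF (100) columns of Table 3.11) implements Pearson r from its definition; scipy.stats.pearsonr would give the same coefficient:

```python
import math

def pearson_r(xs, ys):
    # Pearson product-moment correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# JS divergence and NF (100) accuracy columns of Table 3.11
js_col = [0.31, 0.31, 0.46, 0.68, 0.69, 1.01]
nf_col = [0.75, 0.74, 0.73, 0.97, 0.79, 0.98]
r = pearson_r(js_col, nf_col)  # approximately 0.83, i.e. r >= 0.6
```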

In spite of the above results, we argue that our network-flow method is not sensitive only

to distributional information. Observe that on the three pairs of names that are the most similar

distributionally (Japan/France, Jordan/Egyptian, and Microsoft/IBM), our method consistently

does better than Pedersen et al.'s (2005) results and almost always better than SVM (the exception is on the Ronaldo/Beckham pair). This observation is further confirmed by calculating the

standard deviation on the performance on the six name pairs (last row in Table 3.11). The re-

sults produced by our method have the smallest standard deviation (0.12), while the accuracies

of the two distributional methods yield standard deviation values much closer to the standard

deviation on the JS distance (0.18 and 0.21 vs. 0.27). Therefore, we conclude that the distri-

butional methods are more susceptible to the distributional “signal”, noise or otherwise, in the

data, whereas in addition to capturing the distributional distinctions, our method is also able

to detect semantic distinctions between texts. This is yet another piece of evidence that onto-

logical information can complement distributional information, especially in cases where word

frequency distribution alone does not have sufficient discriminating power. In Chapter 4, we

will return to this discussion by examining further the distributional as well as the ontological

properties of textual data from different sources.

3.3 Task 3: Document Classification

Document classification is an NLP task in which a previously unseen document is given a topic

label (or a set of such labels) based on its subject matter. For example, a financial document

discussing the fluctuation of crude oil prices may be labelled “commerce” or “crude oil” in the

Reuters corpus. In our version of the task, each document has a single topic label. Document

4 Note that Spearman rank correlation is non-parametric, and therefore more conservative than the Pearson r correlation. For comparison, we also calculated the Spearman rank correlation and obtained similar results.


classification is typically performed by comparing the text of an unlabelled document to the

text of documents whose topics (labels) are known, and assigning the label of the closest such

document (e.g, Joachims, 2002; Iwayama et al., 2003; Esuli et al., 2006; Nigam et al., 2006).

This task is thus similar to the name disambiguation task in the previous section, and our

approach is similar as well: here again, we form gold-standard profiles from a small collection

of texts of known classes, and then compare each test instance to each of the gold-standard

profiles. As in name disambiguation, we experiment with different amounts of training data

for creating the gold-standard profiles.

There are two differences of note in comparison to name disambiguation. First, in document classification we use the entire set of words comprising the document to create a semantic profile, rather than a smaller window around a target word. Second, while each ambiguous

name instance in the earlier task had exactly two potential labels (and thus there were two gold-

standard profiles for comparison), the number of labels in the document classification task is

much larger, leading to more ambiguity in the task.

3.3.1 Experimental Setup

3.3.1.1 Corpus Data

Our data is a corpus of articles from 20 different Usenet newsgroups released by Mitchell

(1999). Since each newsgroup corresponds to a topic, the articles can be classified using

the (single) newsgroup label. We use the collection maintained by Rennie (2001), in which

all the duplicates (cross-posts) are removed, resulting in 18,828 articles. The articles are approximately evenly distributed among the 20 newsgroups. Stopwords and article headers are

removed before processing each text.

Work that relies on word frequency vectors to represent the texts in document classification

has revealed the importance of preprocessing the word frequency data to emphasize those terms

that are likely to be most meaningful. For example, word frequencies have typically been


weighted by inverse document frequencies (tf · idf ), to lessen the impact of very common but

less distinguishing words. According to Rennie (2001), their best system on the same corpus

uses the (log tf + 1)/(log idf) weighting scheme. In order to compare our system to theirs, we use this

same word weighting scheme in the creation of the word vectors that are used to produce our

semantic profiles.5
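As a generic illustration of the idf idea discussed above (not the specific log-transformed variant, whose exact form follows Rennie, 2001), a tf·idf-weighted word vector might be built as follows; the function name and input shapes are our own assumptions:

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, df, n_docs):
    # Weight each term's frequency by inverse document frequency,
    # down-weighting very common but less distinguishing words.
    # df maps each term to the number of documents containing it.
    tf = Counter(doc_tokens)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
```

A term occurring in every document receives weight zero, while a term concentrated in few documents retains most of its frequency mass.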

3.3.1.2 Training and Evaluation

As mentioned before, we treat the classification task similarly to name disambiguation, tak-

ing a minimally supervised approach. We randomly select a small number of documents as

training data for creating the gold-standard semantic profiles. We use 10 or 30 documents per

newsgroup, or approximately 1–3% of the documents. The remaining documents are used as

testing data. Again, we use a random sample of documents for each gold-standard profile, re-

peated five times to minimize the impact of possible skewed sampling. We report the average

accuracy over the five samples.

Because there are 20 possible topic labels, the random baseline is very low, at 5%. (Using

the predominant label raises this only slightly.) A more informative evaluation of our method

is to compare to a state-of-the-art approach that is purely distributional. A comparison to

Rennie (2001) is natural, since we use the same dataset. However, they trained an SVM on 30

documents per class and tested on 10% of the documents, repeated 10 times. Since our training

approach differs somewhat (training on 10 or 30 documents per class, testing on all remaining

documents, repeated 5 times), we also replicate their SVM experiment using our training and

test sets. As in the name disambiguation task, we use the LIBSVM software package (Chang

and Lin, 2001) and tune the classifier in the training phase for the best SVM parameters prior

to the testing.

5 We have experimented with using raw word frequencies as well as tf · idf to produce profiles. Both methods yield approximately the same results as the (log tf + 1)/(log idf) frequency weighting scheme.


Training Size / Class   NF     SVM (Noun Concepts)   SVM (Nouns/Words)   SVM (All words)
10                      31.2   42.7                  47.8                47.1
30                      32.0   61.4                  66.4                66.2

Table 3.12: Average classification results using 10 and 30 training documents per newsgroup.

3.3.2 Results and Analysis

3.3.2.1 Initial Results

Table 3.12 presents the classification results using 10 and 30 training documents per class for

our network flow and SVM methods. Our network flow method performs well above the ran-

dom baseline, but is far from achieving state-of-the-art results. The SVM experiments using all

words in the document perform much better than our network-flow method, and are consistent

with the accuracy of 68.7% achieved by Rennie (2001) using an SVM. One possible reason is

that the SVM is trained on all words (minus stopwords and article headers), while our network

flow method applies to noun concepts only. As in our name disambiguation task, we also train

the SVM on just the nouns in a document (rather than all words), and also on the nouns mapped

to concepts (i.e., a concept frequency vector rather than a word frequency vector). The SVM

performance on noun-only data is similar to that of all words, while there is a marked decrease

in performance on concepts, but SVM still outperforms our method.

The poorer SVM performance on concept frequencies suggests that concept frequency vectors are less easily distinguishable than word frequency vectors. Recall, however, that we found

no difference with these various training approaches for SVM in name disambiguation. It is

possible that the mapping from words to concepts is a problem here because the full text is

used, rather than a relatively small window around a target word. Since each word can map

to multiple (potentially unrelated) concepts, the use of a larger, unconstrained bag of words

may lead to a high degree of ambiguity, introducing more noise in the semantic profile than

our method can handle. This may also explain why the network flow method does not im-

prove with additional training data, showing virtually no improvement (0.8% difference). We


Figure 3.1: Two noisy profiles, one represented by squares, the other, triangles.

speculate that the amount of noise in a semantic profile based on a larger amount of text may

increase along with the increase in the training size, offsetting any potential gain from having

additional data.

If this hypothesis is correct, it is natural to ask why the SVM result using concepts shows a

substantial increase in accuracy from 10 to 30 training documents. If larger texts yield noisier

semantic profiles, why does this not negatively affect the SVM as well? This highlights a

fundamental distinction of our approach: our method is novel because it finds the distance

between concepts as embedded in a graph (the ontology), not just between context vectors.

Generally, our thesis is that this is an advantage of our model: it entails that all concepts

generated from a text play a role in determining the distance of that text from another. As we

noted earlier, this allows us to find similarity between texts that use related but not equivalent

concepts. For example, our measure will find greater similarity between a text that discusses

“milk” and one that discusses “cheese” than between one that discusses “milk” and one that

discusses “bread”. A vector distance would find each of these equally dissimilar, because there

are no concepts in common, and there is no way to relate “milk” to “cheese”.

However, the performance of our method in this document classification task reveals a

potential drawback of this property of our method. Because it takes all concepts into account

in determining distance, it is more susceptible to noise. Figure 3.1 illustrates the problem. We


see that the square and triangle profiles are noisy—that is, they each have a number of nodes

that are not part of their coherent semantic content. These noisy aspects of the two profiles

are less separated in ontological space, making the two profiles more similar according to our

measure than their “true” semantic content would indicate. Because a vector representation of

concepts does not form connections between differing concepts, it is not led astray in the way

our method is.

3.3.2.2 Removing Noise from the Profiles

Our conjecture is that the poor performance of our network flow method is due to noise introduced in the mapping of each word to all of its concepts (i.e., not just the relevant ones to the

topic). This effect could also be exacerbated by the fact that in using the full document, we

may have a higher number of less relevant words than when a profile is formed from a more

constrained set of words (as in verb alternation detection and name disambiguation). If this

hypothesis is true, then the noisy (irrelevant) concepts should be distributed within each profile

according to some prior probability distribution. If we knew that distribution, then we could

“subtract out” the noise and form more semantically coherent profiles. Referring to Figure 3.1,

the idea is that we would like to remove the small, dispersed squares and triangles, leaving only

the larger ones that form a semantically more coherent set.

We test this idea, experimenting with two possible noise distributions. The first is sim-

ply the uniform distribution, and the second is a distribution determined empirically using

frequency counts from a domain-general corpus. For the latter, we determine a distribution

over concepts based on the nouns in the BNC. Because the BNC is a balanced corpus, the

distribution of its nouns can be considered a prior that is treatable as noise compared to the

distribution in a newsgroup posting which is specific to a particular topic. In each case, we cre-

ate a semantic profile representing the expected noise, and then “subtract” the resulting noise

profile from each of our gold-standard semantic profiles in the document classification task.

The “subtraction” is actually a process of setting to zero all of the semantic profile frequencies


Figure 3.2: The same two profiles as in Figure 3.1. The profile masses that are “subtracted” are shaded in grey.

Training Size / Class   NF     NF − Uniform   NF − BNC
10                      31.2   28.2           27.4
30                      32.0   37.2           35.6

Table 3.13: Average classification results using 30 and 10 training documents per newsgroup, using the original profiles (NF), and using profiles after the “noise subtraction” process described in the text (“NF − Uniform” and “NF − BNC” are results subtracting the uniform distribution and the BNC noun frequency distribution, respectively).

that are less than the noise value for that concept. Any node with a value higher than the noise

value for that node is expected to be a potentially relevant concept. We leave such nodes at

their original value so that they are more distinguished from the remaining values (now set to

zero). Figure 3.2 illustrates the result of applying this kind of noise reduction to the profiles in

Figure 3.1.
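The “subtraction” step described above might be sketched as follows (our own illustration: profiles are maps from concept to frequency mass, and the uniform noise level shown is one of the two noise choices discussed in the text):

```python
def denoise(profile, noise):
    # Zero out concepts whose mass falls below the expected noise level;
    # concepts at or above the noise level keep their original mass.
    return {c: (m if m >= noise.get(c, 0.0) else 0.0)
            for c, m in profile.items()}

def uniform_noise(profile):
    # Uniform noise profile: equal mass on every concept the profile uses.
    level = 1.0 / len(profile)
    return {c: level for c in profile}
```

For example, a profile {a: 0.5, b: 0.3, c: 0.1, d: 0.1} against a uniform noise level of 0.25 keeps a and b at their original mass and zeroes c and d.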

Table 3.13 presents the network-flow results on the noise-subtracted data, showing a 3–5%

increase in the performance using 30 training documents per class. The performance decreases

with noise-subtraction when we have only 10 training documents per class, suggesting that

there may not be enough data in this case to use this simplistic subtractive method.

Interestingly, subtracting the uniform noise distribution from the profiles has a more fa-

vorable effect than subtracting the BNC noise distribution. The BNC distribution is perhaps

inappropriate for our data. Newsgroup data includes a variety of subjects which may make it


more similar to a balanced corpus than we had originally anticipated; thus, what we are treating as a “noise” distribution in this case may not actually represent noise. That said, there is

a small but notable increase even using the BNC noise distribution when we have sufficient

training data. The idea of subtracting out noise seems promising, but we leave the appropriate

representation of noise, and the mechanism for removing it effectively, as an area of future

research.

3.4 Summary

In this chapter, we have presented a task-based evaluation of our network-flow method for text

comparison. Compared to a traditional distributional approach, we have demonstrated that

a non-dual approach to text comparison can add semantic sensitivity. What distinguishes our

approach from traditional distributional methods is that our method does not attempt to parti-

tion the semantic space (i.e., grouping words into related concepts), but rather it lets the onto-

logical structure as well as the frequency distribution of the target texts determine which words

are compared. As shown in the first two tasks, the non-dual combination is a strength in cases

where both types of knowledge offer discriminating power in text comparison. However, in the

last task, the semantic relation between words does not provide additional benefits—in fact,

frequency information is sufficient for classification. This suggests that either the ontological

relations provided by WordNet are inappropriate for this task, or the data is not semantically

coherent enough to take advantage of the ontological relations. In the next chapter, we will

examine the factors that contribute to the sensitivity of our method on a dataset.


Chapter 4

Measuring Coherence of Semantic Profiles

(Nine on the Fourth. Because you are ready, there is much to

gain. Do not hesitate. Gather friends around you. As a hair clasp

holds hair together.)

Yao Text of Hexagram 16, Line 4, Yijing

We have seen a performance difference across the three tasks we used in evaluation: the

network-flow method outperforms purely distributional measures on verb alternation detection

and name disambiguation, but does poorly on document classification compared to a distri-

butional approach. (See Table 4.1 for a summary of the results.) We use the same ontology

(WordNet) and the same concept distance (number of edges) in our network-flow measure

across all three tasks, hence there must be some difference in the three datasets themselves that

impacts the ability of our method to distinguish the semantic profiles corresponding to one class

of data (one usage of an ambiguous name, for example) from the profiles of a different class of

data (the other usage of the name). In this section, we develop a measure that can capture this

property and explain the performance differential we have observed for our method.


Verb Alt’n Detection   random   Manhattan   skew div   JS     NF
Development Avg        0.50     0.70        0.57       0.67   0.70
Test Avg               0.50     0.67        0.67       0.67   0.67

Name Disamb’n          random   SVM (100)   SVM (200)  NF (100)   NF (200)
Unweighted Avg         0.61     0.72        0.75       0.83       0.83
Weighted Avg           0.53     0.52        0.55       0.76       0.77

Document Class’n       random   SVM (10)    SVM (30)   NF (10)    NF (30)
20 newsgroups          0.05     0.43        0.61       0.31       0.32

Table 4.1: Summary of task-based results. The numbers in parentheses indicate the number of training instances used. The best result for each task is shown in bold.

4.1 Profile Coherence

Our goal is to find a property of individual semantic profiles that, when averaged across the

profiles in a dataset, indicates how well our method will be able to distinguish profiles of

different classes in that dataset. That is, we aim to learn about the overall separability of

the classes in a dataset by investigating the properties of individual profiles that comprise the

dataset. Our hypothesis is that the important factor for our method is what we refer to as

profile coherence: the degree to which profile mass is concentrated within a constrained space

(or set of constrained spaces) of the ontology. The more spatially coherent the sets of weighted

concepts are for the profiles in a dataset, the more likely it is that our method will be able to

distinguish contrasting profiles. Conversely, less coherent profiles, whose frequency mass is

more distributed across a wider area of the ontology, will be more difficult to separate into

classes.

For example, consider the square and triangle profiles in Figure 4.1. Coherent profiles

have their profile mass (the concept weights) focused within small, distinct regions of the

ontology, as in Figure 4.1(a). These types of profiles tend to be highly distinguishable from

each other. Less coherent profiles, whose mass is more dispersed through the ontology, such

as those in Figure 4.1(b), are likely to be less distinguishable. Note, however, that it is not

simply occupying greater or fewer nodes in the hierarchy that determines profile coherence

(and distinguishability). The profiles in Figure 4.1(c) are “spread out” as in (b), but are more


(a)

(b) (c)

Figure 4.1: Examples of two profiles (indicated by squares and triangles) of varying coherence. The profiles in (a) are more distinguishable than those in (b) and (c); the profile in (c) is in turn more distinguishable than that in (b). The degree of distinguishability of these profiles is reflected in their degree of coherence.

coherent (and distinguishable) due to having areas of high mass.

The considerations illustrated in Figure 4.1 suggest that both distributional and ontological

factors contribute to the coherence of a semantic profile, and that we must determine a suitable

measure of coherence that captures both factors. A simpler, alternative hypothesis is that either

purely distributional or purely ontological factors may sufficiently capture the coherence of a

semantic profile. To explore these ideas, we examine different ways to assess the coherence

of the semantic profiles in our example datasets. We develop various measures of coherence,

and then evaluate whether the degree of coherence as determined by each measure indeed

corresponds to the performance of our network-flow method on the datasets. We expect a

useful measure of profile coherence to have a high average value across the datasets on which

we perform well (verb alternation and name disambiguation), and a low average value across


the dataset on which we perform poorly (document classification).

In Section 4.2, we briefly review several measures intended to separately capture the distributional or ontological coherence of a semantic profile. We show that such measures are

insufficient for accounting for the performance differences of our method across the datasets.

In Section 4.3, we develop a novel measure to capture the coherence of our profiles in terms of

both distributional and ontological information. This measure, called profile density, expresses

the degree to which a semantic profile forms a coherent clustering of weighted concepts in an

ontology. We demonstrate that our profile density measure can account for the performance

differential across our datasets.

4.2 Separate Distributional and Ontological Approaches

We explored several (unsuccessful) means for capturing profile coherence with a purely distri-

butional or purely ontological measure. While we could not exhaustively investigate all pos-

sible measures of this kind, the underlying reasons for the lack of success of these measures

in explaining the differing performance of our method across the datasets convinced us of the

need for a measure that integrates distributional and ontological factors (which we present in

the following section). We mention the single-factor measures here for completeness.

Potential Distributional Coherence. Recall that Section 3.3.2.2 shows that removing the

“noise” distribution from each profile improves the document classification performance of

our method. In other words, subtracting the noise distribution from a profile makes it distri-

butionally more distinct from other profiles. Based on this observation, we hypothesize that

the less a profile resembles a noise distribution over the ontology, the more coherent it is—that

is, the more likely the frequency mass is situated in meaningful clusters of concepts. To test

this hypothesis, we calculate the average distance (using KL-divergence, Kullback and Leibler,

1951) of the profiles in a dataset from a profile created from a noise distribution (the uniform

distribution of words, or their distribution in the BNC, as in Section 3.3.2.2). Higher values of


this measure indicate further distance from the uniform distribution.
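This coherence check can be sketched directly (our own illustration: profiles and the noise distribution are probability vectors over a shared concept vocabulary, and KL divergence requires the noise distribution to be nonzero wherever a profile is):

```python
import math

def kl(p, q):
    # KL divergence D(p || q) (Kullback and Leibler, 1951).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def avg_noise_distance(profiles, noise):
    # Average distance of a dataset's profiles from the noise profile;
    # higher values would indicate greater distributional coherence
    # under the hypothesis described in the text.
    return sum(kl(p, noise) for p in profiles) / len(profiles)
```

A profile identical to the noise distribution scores 0, while a sharply peaked profile scores well above 0.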

Potential Ontological Coherence. Here we consider two observations. First, we hypothesize that profiles with fewer concepts are more coherent, since a smaller number of concepts

is more likely to be less dispersed in the ontology. We simply use average profile size to capture this property (here, smaller values of profile size indicate greater coherence). Second, we

hypothesize that profiles whose concepts have greater specificity are more coherent, because

use of less specific concepts is indicative of vagueness and potential lack of coherence. Since

specificity corresponds well to depth in WordNet, we use a simple measure of average profile

depth to indicate the specificity of the set of concepts in a profile (here, greater values of depth

should indicate greater coherence).

Analysis of the Single-Factor Measures. For each task, we calculate the average of

each of the hypothesized distributional and ontological coherence measures over the profiles

in the dataset, and find that there is no consistent correspondence with the performance of our

network-flow method across the tasks. Despite the intuitions and observations presented above,

these results are not surprising. For example, the profiles of a dataset may all be distribution-

ally very similar overall to the noise profile, supposedly indicating low coherence, but they

may be quite coherent in the actual ontological space they occupy. Similarly, the profiles in a

dataset may all have a small average depth in the ontology or large size (again supposedly indi-

cating low coherence), but their distributional properties (the weights on the concepts that are

occupied) may yield coherent clusters of mass in the profile. This analysis then confirms our

hypothesis that, because distributional and ontological information are intertwined in the representation of a semantic profile, a useful measure of profile coherence must take into account

an integration of these two information sources.


4.3 Integrating Distributional and Ontological Factors

As noted earlier, and tentatively confirmed by the above results, we assume that the interaction

of distributional and ontological factors determines the coherence of profiles—i.e., a coherent

profile has its frequency mass concentrated within a reasonably constrained space (or set of

constrained spaces) of the ontology. We observe that this is similar to the geographical notion

of population density, which is determined by the population mass divided by the area occu-

pied. Here we extend the geographical definition of density within our network framework by

relating population mass to distributional weights on concepts, and occupied area to the spread

of the weighted concepts in the ontology. We call the resulting measure of profile coherence

profile density.

4.3.1 Profile Density

To adapt the definition of geographical density to our problem, we first need to determine

the analogs of population mass and occupied area in a semantic profile. The profile mass at

each concept node is directly analogous to the population mass. Defining the occupied area

within an ontology is not as straightforward, as there is no simple definition of area within a

graph.1 We develop a definition of area that captures the actual spatial spread of the profile

mass through the ontology.

To begin, we note that any subgraph of the WordNet hypernym hierarchy is hierarchical

itself. Thus, any region of the ontology that contains some profile mass is a hierarchy rooted

at some common ancestor of those profile nodes.2 As shown in Figure 4.2, the more dispersed

(less closely clustered together) a set of nodes is, the further away their common ancestor is.

That is, a highly related (and spatially constrained) set of concept nodes can be generalized

to a more specific ancestor concept (i.e., near the descendants, as in Figure 4.2(b)), while

1 Agirre and Rigau (1996) use the number of nodes within a subgraph as its area, but this fails to take into account how dispersed the nodes are throughout the ontology.

2 Although WordNet contains instances of multiple inheritance, the rate is low. As a result, the likelihood of a set of profile nodes sharing multiple ancestors is low as well.


(a) (b)

Figure 4.2: Two examples of profile density within an ontology. The hollow triangles are the common ancestors of the filled triangles, which are concept nodes in the profile. The profile in (a) is fairly dispersed, requiring a single but distant ancestor node. The profile in (b) is more clustered; two ancestor nodes are required but each is close to its descendants.

a semantically distant set of concepts will be generalized to a semantically general ancestor

concept (i.e., far from the descendants, as in Figure 4.2(a)). The ontological distance between

a set of nodes and their common ancestor thus indicates how closely clustered the descendant

nodes are.

Next, note that any semantic profile can be represented by a set of ancestor nodes, and these

ancestor nodes capture the spatial clusterings of the profile mass. For example, the profile in

Figure 4.2(a) is represented by one ancestor node, and that in Figure 4.2(b) by two such nodes.

Combining these observations, we see that given a suitable manner for identifying ancestor

nodes to represent a profile, we can use the combined ontological distance between each of

those nodes and their descendants as an indication of how closely clustered the concepts of the

profile are. We can now complete our definition of profile density by using the total distance

between each identified ancestor and its descendants as an indication of the occupied area of

the ontology.

Formally, let P be a profile and A be a set of ancestor concept nodes such that each profile

node d ∈ P is guaranteed to have an ancestor a ∈ A. (We will explain in Section 4.3.2 how to


(a) (b)

Figure 4.3: These two profiles have equal density value given our original profile density formula in eqn. (4.1), but are suitably distinguished (with the profile in (b) having higher density than that in (a)) by the norm density formula in eqn. (4.2). See the text for discussion.

find the set A.) The profile density of P is then defined as follows:

profile density(P) = Σ_{a ∈ A} Σ_{d ∈ P, d ∈ descendant(a)} mass(d) / distance(d, a)        (4.1)

where mass(d) is the profile mass (concept frequency) at node d, and distance(d, a) is the

distance in the ontology between node d and node a, as given by a suitable concept-to-concept

distance measure (such as the edge distance that we have used in our task-based evaluations).
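Eqn. (4.1) transcribes directly into code. In the sketch below (our own; the names and input shapes are illustrative assumptions), a profile maps concept nodes to mass, ancestors maps each identified ancestor to the profile nodes it covers, and distance is any concept-to-concept measure such as edge count:

```python
def profile_density(profile, ancestors, distance):
    # Eqn (4.1): for each ancestor a and each profile descendant d,
    # accumulate mass(d) / distance(d, a).
    return sum(profile[d] / distance(d, a)
               for a, descendants in ancestors.items()
               for d in descendants)
```

For instance, ten nodes of mass 0.1, each one edge from a single ancestor, give a density of 1, matching the Figure 4.3(a) computation discussed below.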

There is one more subtle detail we must address. Consider the two examples in Figure 4.3,

where the distance between each ancestor and all its descendants is the same (here, say, a

distance of 1), but the distribution of the profile mass differs. The first diagram has ten equally

weighted profile nodes, and the second has two. Our current formulation in eqn. (4.1) yields

a density of 1 for both diagrams (i.e.,(0.1/1) ∗ 10 = 1 = (0.5/1) ∗ 2). However, the profile

mass in diagram (a) is distributed among more nodes than thatin diagram (b). Intuitively, the

second profile is more densely clustered and should have a higher density value.

4.3. INTEGRATING DISTRIBUTIONAL AND ONTOLOGICAL FACTORS

Figure 4.4: Two profile examples with different numbers of ancestors but of equal norm density value.

Looking more closely at our density formula in eqn. (4.1), observe that the number of profile nodes has an impact on the calculation—that is, density increases as the number of profile nodes increases due to the inner summation in the formula. To achieve an appropriate density measure, then, we normalize the original density value by the number of profile nodes, resulting in a normalized density for a profile:

    norm density(P) = density(P) / sizeof(P)
                    = (1 / sizeof(P)) · Σ_{a ∈ A} Σ_{d ∈ P, d ∈ descendant(a)} mass(d) / distance(d, a)    (4.2)

Returning to our example in Figure 4.3, eqn. (4.2) assigns the first profile a normalized density of 0.1, and the second profile a normalized density of 0.5. The modified measure now appropriately distinguishes the two profile densities, indicating that the profile in Figure 4.3(a) is less tightly clustered than the profile in Figure 4.3(b).
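As a quick check of eqn. (4.2) on the Figure 4.3 configuration (a sketch with made-up node names; all descendant-ancestor distances are 1, as in the figure):

```python
def norm_density(profile, ancestors, descendants, edge_distance):
    # Eqn. (4.2): the eqn. (4.1) density divided by the number of profile nodes.
    density = sum(mass / edge_distance(d, a)
                  for a in ancestors
                  for d, mass in profile.items() if d in descendants(a))
    return density / len(profile)

dist = lambda d, a: 1.0                  # every node is 1 edge from its ancestor
ten = {f"n{i}": 0.1 for i in range(10)}  # Figure 4.3(a): ten nodes, mass 0.1 each
two = {"m0": 0.5, "m1": 0.5}             # Figure 4.3(b): two nodes, mass 0.5 each
print(round(norm_density(ten, ["a"], lambda a: set(ten), dist), 6))  # 0.1
print(round(norm_density(two, ["a"], lambda a: set(two), dist), 6))  # 0.5
```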

In addition to profile size, we also consider that the number of ancestors may have an impact on the density calculation. Consider the example in Figure 4.4. Both profiles contain four profile nodes (filled triangles), but they are generalized to different numbers of ancestors (hollow triangles). Again, for simplicity, assume the distance from each ancestor to each of its descendants is one. Using the current norm density formulation yields a density value of 0.25 for both diagrams (i.e., (0.25/1) × 4/4 = 0.25). Although the distribution of profile mass and the descendant–ancestor distances are the same in both cases, the first profile can be viewed as more densely clustered than the second, as it can be generalized to a smaller number of ancestors.


To account for this difference, we observe that, as with the number of profile nodes, the number of ancestors can have an influence on how densely clustered a profile is—more ancestors result in a less densely clustered profile. To achieve the desired inversely proportional relationship (between the number of ancestors and density), we explore two variations:

    norm density2(P) = density(P) / (sizeof(P) × sizeof(A))    (4.3)

    norm density3(P) = density(P) / (sizeof(P) + sizeof(A))    (4.4)

In both cases, we increase the denominator in eqn. (4.2) by the number of ancestors, sizeof(A), such that density decreases when the number of ancestors increases. In eqn. (4.3), we divide density by the product of the number of profile nodes (sizeof(P)) and the number of ancestors (sizeof(A)). In eqn. (4.4), we divide by their sum. Returning to our example in Figure 4.4, the first diagram has a norm density2 of 0.25 and a norm density3 of 0.2; the second diagram has a norm density2 of 0.125 and a norm density3 of 0.167.
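The Figure 4.4 numbers can be reproduced directly from eqns. (4.3) and (4.4) in a small sketch; the density of 1.0 follows from four nodes of mass 0.25 at distance 1.

```python
def norm_density2(density, n_nodes, n_ancestors):
    return density / (n_nodes * n_ancestors)   # eqn. (4.3)

def norm_density3(density, n_nodes, n_ancestors):
    return density / (n_nodes + n_ancestors)   # eqn. (4.4)

density = 4 * (0.25 / 1)  # four nodes, mass 0.25 each, distance 1 => 1.0
print(norm_density2(density, 4, 1), norm_density3(density, 4, 1))  # 0.25 0.2
print(norm_density2(density, 4, 2),
      round(norm_density3(density, 4, 2), 3))  # 0.125 0.167
```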

Note that norm density2 penalizes considerably more than norm density3 because of the multiplication (instead of addition) in the denominator. The size of the ancestor set depends not only on how densely the profile nodes are clustered, but also on how conservative the method that searches for the ancestor set is. In the extreme case, we may have an ancestor set that is the same as the profile itself. Dividing by sizeof(A) over-penalizes density, hence we opt for norm density3 for a less severe penalty. We will report results for norm density and norm density3.

4.3.2 Finding the Ancestor Set for Profile Density

As noted earlier, our definition of profile density depends on identifying a suitable set of ancestor nodes of the concept nodes in the profile: the distance of the ancestors to the profile nodes indirectly indicates the degree to which the profile nodes are spatially clustered close together.

Thus, given a profile P, we need to find A, the set of nodes that are ancestors of the profile nodes d ∈ P. (The nodes a ∈ A correspond to the hollow triangles indicated in Figures 4.2 and 4.3.) Recall that these ancestor nodes are intended to be a set of concepts that serves as an appropriate generalization of the nodes in the profile—each ancestor in a sense represents a coherent cluster of profile nodes. However, we do not know a priori what the appropriate level of generalization is—we simply want a level that gives a useful assessment of how clustered together the profile nodes are.

For this purpose, we make use of Clark and Weir’s (2002) method for generalizing a set

of weighted concept nodes in an ontology. As we noted in Section 3.1, given a frequency

distribution over all concept nodes, Clark and Weir (2002) use a statistical method to search for

the set of nodes (i.e., our node set A) that best generalize the original weighted concepts. This method is particularly appropriate for our purposes because it includes a parameter, α ∈ (0, 1), that controls the level of generalization. We vary α over five values (0.05, 0.25, 0.5, 0.75, and 0.95) to obtain five different (more to less generalized) sets of ancestors. In our analysis,

we calculate the density using each ancestor set in order to evaluate the impact of the precise

choice of ancestor nodes on our measure.

4.3.3 Results and Analysis

For each of the three tasks in our earlier task-based evaluation, we calculate the profile density

of the corresponding dataset. We define the profile density of a dataset to be the average of the

normalized density values over its profiles (eqn. (4.2) and eqn. (4.4)). For the verb alternation

detection task, we perform the analysis on all 240 profiles used in the task (120 verbs, with

2 profiles per verb, one for the subject slot, one for the object slot). In the remaining two

tasks, because each instance profile is compared to a gold-standard profile, we believe that

the performance depends primarily on the coherence of the gold-standard profiles. We thus

perform our analysis on the gold-standard profiles only. For name disambiguation, we have

60 profiles (5 samplings with 12 gold-standard profiles each); for document classification,

we have 100 profiles (5 samplings with 20 gold-standard profiles each). For each profile, we

calculate the normalized density using each of five ancestor sets (based on the α value, as noted


α value               0.05     0.25     0.5      0.75     0.95     Avg
Verb Alternation      5.59e-4  5.90e-4  6.32e-4  7.14e-4  8.87e-4  6.76e-4
Name Disamb’n (200)   8.93e-5  9.89e-5  1.08e-4  1.18e-4  1.35e-4  1.10e-4
Name Disamb’n (100)   1.11e-4  1.26e-4  1.38e-4  1.52e-4  1.78e-4  1.41e-4
Doc Class’n (30)      5.25e-5  5.94e-5  6.59e-5  7.43e-5  8.78e-5  6.80e-5
Doc Class’n (10)      8.03e-5  8.85e-5  9.87e-5  1.11e-5  1.33e-5  5.84e-5

Table 4.2: The normalized profile density scores for each dataset at five different values of α, as well as the average scores across the α values.

above). For the concept-to-concept distance measure, distance(d, a) in eqn. (4.2), we use edge distance, the same measure used in the tasks in earlier sections.

We expect that, if our profile density measure does indeed reflect the coherence of a dataset,

then we will see a correspondence between the density values and the performance of our network-flow method. Higher density values indicate a profile whose weighted concepts form more coherent clusters in the ontology. Specifically, then, we expect higher density values for

the datasets from our verb alternation detection and name disambiguation tasks (on which our

method had better performance than distributional methods), and lower density values for the

document classification dataset (on which our method had worse performance than a purely

distributional method).

Table 4.2 shows the profile densities of each dataset using edge distance. (For comparison,

we have also computed the density values using Jiang and Conrath’s (1997) distance and it

yields similar results. See Table 4.3.) First note that the density values are relatively stable

across all values of α, indicating that the precise level of generalization is not critical to the

usefulness of our density measure. Next, observe that, as predicted, the document classification

dataset is shown to have the lowest density for both training set sizes. This observation is in

accord with our hypothesis that the profile density measure indicates the coherence of the

profiles in a dataset and is therefore informative about the network-flow performance on that

dataset.

Interestingly, we also observe that, across all values of α and training set sizes, the verb alternation dataset has the largest densities, followed by the name disambiguation dataset, then


α value               0.05    0.25    0.5     0.75    0.95    Avg
Verb Alternation      2.25e3  2.81e3  3.72e3  5.29e3  9.44e3  4.70e3
Name Disamb’n (200)   6.35e2  7.99e2  9.75e2  1.24e3  1.66e3  1.06e3
Name Disamb’n (100)   5.85e2  8.11e2  1.01e3  1.31e3  1.85e3  1.11e3
Doc Class’n (30)      3.30e2  4.51e2  5.74e2  7.60e2  1.10e3  6.43e2
Doc Class’n (10)      2.92e2  3.86e2  5.24e2  7.05e2  1.10e3  6.02e2

Table 4.3: The normalized density scores at five different values of α, as well as the average scores, calculated using Jiang and Conrath’s (1997) distance.

the document classification data. (The differences between all three datasets are statistically significant, p ≪ 0.05.) This result might stem from the fact that there are varying degrees of

constraint placed upon the data in the three tasks. In verb alternation, the nouns used to generate

a profile appear either all in the subject or all in the object position of the target verb. In name

disambiguation, we loosen the restriction to include all nouns in a small window surrounding

the target word. Lastly, in document classification, the only restriction on the nouns used to

generate a profile is that they appear in the same document. This suggests that the syntactic

and semantic constraints placed upon a set of nouns can have an impact on the coherence of

the profile created from them.

This latter observation suggests that our profile density measure may be useful not only in indicating the ability of our network-flow method to distinguish relevant profiles. More generally, it may also reflect the varying degrees of syntactic and semantic constraints placed upon the set of words that generate a profile. Our profile density measure may indeed be generally useful as a measure of semantic coherence of a set of concepts in an ontology (Gurevych et al., 2003), a matter we plan to explore in future work.

4.3.4 The Impact of the Number of Ancestors

Table 4.4 shows the norm density3 score of each dataset using edge distance, and Table 4.5 using Jiang and Conrath’s (1997) distance. Similar to the norm density scores seen previously, we observe that verb alternation data has the highest density, and document classification data is


α value               0.05     0.25     0.5      0.75     0.95     Avg
Verb Alternation      4.74e-4  4.97e-4  5.24e-4  5.70e-4  6.57e-4  5.44e-4
Name Disamb’n (200)   8.29e-5  9.02e-5  9.65e-5  1.04e-4  1.15e-4  9.78e-5
Name Disamb’n (100)   1.04e-4  1.16e-4  1.25e-4  1.35e-4  1.51e-4  1.26e-4
Doc Class’n (30)      5.03e-5  5.61e-5  6.13e-5  6.78e-5  7.78e-5  6.27e-5
Doc Class’n (10)      7.28e-5  7.86e-5  8.50e-5  9.12e-5  7.59e-5  8.07e-5

Table 4.4: The norm density3 scores at five different values of α, as well as the average scores, calculated using edge distance.

α value               0.05    0.25    0.5     0.75    0.95    Avg
Verb Alternation      1.91e3  2.37e3  3.07e3  4.20e3  6.93e3  3.69e3
Name Disamb’n (200)   5.85e2  7.23e2  8.68e2  1.08e3  1.41e3  9.34e2
Name Disamb’n (100)   5.46e2  7.42e2  9.10e2  1.15e3  1.57e3  9.84e2
Doc Class’n (30)      3.14e2  4.24e2  5.32e2  6.93e2  9.75e2  5.88e2
Doc Class’n (10)      2.80e2  3.64e2  4.87e2  6.41e2  9.68e2  5.48e2

Table 4.5: The norm density3 scores at five different values of α, as well as the average scores, calculated using Jiang and Conrath’s (1997) distance.

the least dense. Although norm density3 produces smaller density values than norm density, the difference is small. For the data from the three tasks, the number of ancestors appears to

have negligible impact on the overall density.

4.4 Summary

In summary, our analysis in this section has shown that both distributional and ontological

properties contribute to the coherence of a profile, but neither alone is indicative of the network-

flow performance in a particular task. Our new measure of profile density serves as a tool for

analyzing profiles that integrates their distributional and ontological coherence, and provides a

post-hoc means for explaining the performance differential of our method across the different

tasks we performed here. The results also point to the possibility of devising a diagnostic tool

for the suitability of the network-flow method on novel data. An analysis of the data and results

across a larger set of tasks will allow us to investigate the possibility of determining a density


threshold that would be indicative of expected positive results with our method.


Chapter 5

Graph Transformation

(Gone, gone. Gone all the way. Everyone gone over to the other shore. Enlightenment!)

Prajñāpāramitā Hṛdaya Sūtra

Thus far, for simplicity, we have only considered edge distance as the distance between each

pair of nodes (i.e., c(i, j)). In other words, one edge constitutes a distance of one, and the distance of a path is the number of edges it has. This is an appropriate node-to-node (or concept-to-concept) distance within the MCF framework, because the MCF problem definition assumes c(i, j) to be additive (the distance of a path equals the sum of the distances of the edges on the path). Note, however, that there exist a number of non-additive distances widely used to measure the semantic distance between two concepts in an ontology such as WordNet. In this chapter, we first focus on the impact on accuracy and efficiency of using a non-additive distance within our network-flow framework. Then we will present our novel solution that allows us to maximize both accuracy and efficiency. Finally, we will present our evaluation

and analysis.



5.1 Solving the MCF Problem Using a Non-additive Distance

In this section, we will discuss the issues relevant to solving the MCF problem accurately and

efficiently. We will describe the impact of using non-additive distances and offer two possible

(but not ideal) solutions. The first solution allows us to calculate the final distance exactly but is

computationally expensive, while the opposite is true for the second solution (the final distance

is approximated but the method is efficient). Based on these two solutions, we propose a third

possibility: we trade off the exactness of the calculation with efficiency, a discussion we will

return to in Section 5.2.

Recall that a distance is additive if the distance between any two nodes is the sum of the

distances of the edges connecting them. That is, if the edges (j_0, j_1), (j_1, j_2), . . . , (j_{n−1}, j_n), where i = j_0 and j_n = k, lie along a path connecting node i and node k, then the additive distance between them is:

    distance(i, k) = Σ_{m=0}^{n−1} distance(j_m, j_{m+1})    (5.1)
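Eqn. (5.1) simply says that a path's distance is the sum of its per-edge distances; a trivial sketch (with a hypothetical per-edge distance table):

```python
# Eqn. (5.1): the distance of a path is the sum of its edge distances.
# `edge_dist` is a made-up lookup of individual edge distances.
def path_distance(path, edge_dist):
    return sum(edge_dist[(path[m], path[m + 1])] for m in range(len(path) - 1))

edge_dist = {("i", "j1"): 1, ("j1", "j2"): 1, ("j2", "k"): 1}
print(path_distance(["i", "j1", "j2", "k"], edge_dist))  # 3, the edge distance
```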

Interestingly, the additivity issue arises from the MCF problem definition itself. Note that both

the objective function, eqn. (2.1), and the constraints, eqn. (2.2) and eqn. (2.3), are expressed

in terms of the flow and/or the cost of the individual edges. At each step of the search, to consider an edge as part of the solution, not only does it have to satisfy the constraints, it has to be the cheapest (i.e., locally). Observe that the objective function, eqn. (2.1), in particular, is expressed as a linear combination of edge costs. Thus, the locally cheapest edges would eventually lead to the cheapest route globally. Essentially, the problem is defined in a way that additivity is assumed to hold (and therefore this greedy approach works). However, many existing concept-to-concept distances, such as those proposed by Resnik (1995) and Lin (1998), are non-additive. Moreover, these measures have often been shown to be superior to the simple

edge distance in comparing WordNet concepts (Jarmasz and Szpakowicz, 2003; Weeds, 2003).

Unfortunately, for these non-additive concept-to-concept distances, the cheapest set of edges

locally does not yield the cheapest set of edges globally, rendering the use of MCF without


Figure 5.1: A bipartite network between the S and D profiles.

modification infeasible for these distances.

In order to solve the MCF problem exactly, the underlying graph structure must be changed

such that the non-additive distance can be calculated additively. More specifically, given a non-

additive distance, for any two non-adjacent concepts, i and j (i.e., concepts that are separated by two or more edges), one can add a new edge (i, j) and assign the non-additive distance to the

edge. Thus, any pair of nodes is separated by exactly one edge—locally optimal distance equals

globally optimal distance. Note that adding a new edge for each pair of non-adjacent nodes

results in a complete graph. Hence, the number of edges generated as well as the processing

time are drastically increased. Alternatively, one can consider using only the profile nodes

and build a complete bipartite network based on the larger complete graph. For example, two

profiles, each with seven nodes, will result in a graph with 49 edges (Figure 5.1). The number of edges generated is still quadratic in the number of nodes required.¹
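The quadratic blow-up is easy to see by counting edges in the complete bipartite construction (a sketch; the profile node names are made up):

```python
from itertools import product

# One edge per supply-demand pair: |S| * |D| edges in total.
def complete_bipartite_edges(supply, demand):
    return list(product(supply, demand))

S = [f"s{i}" for i in range(7)]
D = [f"d{i}" for i in range(7)]
print(len(complete_bipartite_edges(S, D)))  # 49 edges, as in Figure 5.1
```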

Now, let us consider an alternative solution in which the processing time is reduced. Instead

of calculating the exact non-additive distance for every path in the original graph, one may

consider assigning the non-additive distance to the individual edges only, and approximating

¹Empirically, we also find the process of generating bipartite graphs impractical. For example, for the verb alternation experiment, with an average of 900 nodes per profile, the process can take as long as 10 days. The code is scripted in Perl. The experiment was performed on a machine with two P4 Xeon CPUs running at 3.6 GHz, with a 1 MB cache and 6 GB of memory. The above method does not scale up well for tasks with a comparable or larger number of comparisons.


Figure 5.2: An example ontology with two profiles, S and D.

the distance for non-adjacent nodes as:

    distance_NA(i, k) ≈ Σ_{m=0}^{n−1} distance_NA(j_m, j_{m+1})    (5.2)

However, this solution is also not ideal. The additive version of the non-additive distance

grows monotonically with the number of edges, but not every non-additive distance has such

a growth rate. The difference between the true non-additive distance and the approximated

additive version may therefore increase as the number of edges on a path increases.
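To see this growth of the approximation error, here is a small illustrative toy (not the thesis code) using the distance form of Wu and Palmer's (1994) measure, (depth(i) + depth(k)) / (2 · depth(LCA)), on a path described only by its node depths:

```python
# Wu-Palmer-style distance from node depths; for a path given as a list of
# depths, the lowest common ancestor of the endpoints is the shallowest node.
def wp(depth_i, depth_k, depth_lca):
    return (depth_i + depth_k) / (2.0 * depth_lca)

def additive_wp(depths):
    # Eqn. (5.2): approximate by summing per-edge distances; each edge's
    # LCA is simply its shallower endpoint.
    return sum(wp(a, b, min(a, b)) for a, b in zip(depths, depths[1:]))

path = [3, 2, 1, 2, 3]  # up to a depth-1 ancestor and back down
print(wp(path[0], path[-1], min(path)))  # exact: 3.0
print(additive_wp(path))                 # additive approximation: 5.5
```

The longer the path (i.e., the more edges it crosses), the further the additive sum drifts from the exact value.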

Consider the ontology in Figure 5.2, in which there are two profiles, labelled S and D; the edges connecting the profile nodes are highlighted (thick edges). If we use a non-additive distance, the number of edges separating two nodes would not be indicative of the true distance between them. For example, consider the two shaded nodes, one labelled S, the other labelled D, connected by a path highlighted by very thick edges. This path contains seven edges (an edge distance of 7). In comparison, using Wu and Palmer’s (1994) measure, for example, yields a distance of 5.²

Alternatively, we can approximate the non-additive distance additively as in eqn. (5.2). Consider the same doubly-edged path again. The additive version of Wu and Palmer’s (1994) distance yields a distance of 2(3/2) + 2(5/4) + 2(7/6) + 9/8 ≈ 8.96. In comparison to the exact distance (a distance of 5) between S and D, there is clearly a substantial difference under the additive approximation.

In spite of the above shortcomings, both methods have their advantages. By enumerating every path as an edge, the non-additive distance can be calculated precisely. By approximating the distance of a path additively, the construction of the graph can be done efficiently; the original graph is unchanged in this case. Since both advantages cannot be achieved simultaneously, our idea is to trade off the exactness of the distance calculation against the efficiency of the network construction such that both factors are maximized. Our method will be presented in detail in the next section.

²Here, we assign the root node a depth of 1 to avoid division by zero. The distance of the S-D path is calculated as:

    distance_wp(S, D) = (depth(S) + depth(D)) / (2 · depth(A)) = (5 + 5) / (2 · 1) = 5

5.2 Network Transformation

In this section, we present our method of alleviating the processing bottleneck by reducing the load of generating a large number of edges. Instead of generating a complete bipartite graph, we generate a graph that approximates both the structure of the original network and that of the complete bipartite network. The goal is to construct a new network such that (i) efficiency is improved by reducing the number of edges generated, and (ii) the resulting distance distortion does not hamper performance significantly. We first discuss the graphical property that is relevant to our method in Section 5.2.1, and then propose our graph transformation method in Section 5.2.2.


5.2.1 Path Shape in a Hierarchy

To understand our transformation method, let us further examine the graphical properties of

an ontology as a network. In a hierarchical network (e.g., WordNet, UMLS (Bodenreider, 2004)), calculating the distance between two concept nodes usually involves travelling “up” and “down” the hierarchy. The simplest route is a single hop from a child to its parent or vice versa. Generally, travelling from one node i to another node j consists of an A-shaped path ascending from node i to a common ancestor of i and j, and then descending to node j, as shown in Figure 5.2 (very thick edges, with the end nodes and their common ancestor

italicized).

Interestingly, observe that the A-shaped path relating two nodes via their common ancestor is relevant to the design of a number of concept-to-concept distances. For example, distances that are defined in terms of Resnik’s (1995) information content (IC), −log(p(concept)), such

as Jiang and Conrath’s (1997) and Lin’s (1998) measures, consider both the (lowest) common

ancestor as well as the two nodes of interest in the distance calculation.

Recall that our goal is to trade off the exactness of the distance calculation with the effi-

ciency of the network reconstruction. We propose to take advantage of the path shape in two

ways. First, we construct a new network that preserves only the node-ancestor-node relation

for every pair of nodes. This way, we limit the total number of edges between each node pair.

Because the non-additive distance is approximated over paths with a limited number of edges,

the distortion effect is reduced. Second, because the number of ancestors is smaller than the

number of profile nodes, we can construct a new network requiring far fewer edges than the

complete bipartite graph. The key is to select a set of ancestors that reduces the reconstruction

time considerably. The details of the network reconstruction will be described next.


Figure 5.3: An example ontology with two profiles, S and D. Some common ancestors of the profile nodes are highlighted (JS and JD nodes).

5.2.2 Network Reconstruction

Let us return to our example in Figure 5.2. Finding the A-shaped path between any two nodes

involves finding the corresponding lowest common ancestor. However, we encounter a circular requirement: the ancestors are found only when the minimum-cost routes are determined by solving the MCF. Without solving the MCF, we can only select a less precise set of ancestors for each profile instead, which we will refer to as junction nodes. These nodes serve as the

bridge between the supply and demand profile nodes, through which the flow is transported

from the supply nodes to the demand nodes. (For ease of explanation, we assume that the

junction nodes have been selected for this section. We will discuss how they are selected in the

next section.)

Now consider a slight modification of Figure 5.2 in Figure 5.3, where some ancestor nodes

are highlighted as JS and JD nodes. Each S node has an ancestor with the JS label, and similarly,

each D node has a JD ancestor. Consider the same two S and D nodes connected by the thick double edges. Now they are connected via a three-segment path: S to JS, JS to JD, and finally, JD to D.


Figure 5.4: Fragments of the transformed ontology with two profiles, S and D. The common ancestors of the profile nodes are labeled JS and JD.

Figure 5.5: The fully transformed ontology with two profiles, S and D. The common ancestors of the profile nodes are labeled JS and JD.

Note that we have to pass through two J nodes. In comparison to the original A-shaped path,

the tip (one common ancestor A) is now elongated (the path from JS to JD).

Assuming that we have selected the J (junction) nodes, the next step is to construct the

transformed graph. Figure 5.4 gives a snapshot of the process. For each J node, we connect

it to its descendants in the corresponding profile and associate each added edge with the true


non-additive distance in the original graph. (For ease of understanding, we show edge distance

here.) Next, we connect each JS node to each JD node, and again, associate each edge with the

corresponding non-additive distance in the original network. The two target S and D nodes in the current example are now connected via an A-shaped path with an elongated JS-JD

tip (thick edges). The completely transformed graph is shown in Figure 5.5.
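A sketch of the resulting edge set (all names hypothetical; `junction_of` maps each profile node to its junction ancestor): profile-to-junction legs plus a small JS × JD bipartite core.

```python
from itertools import product

def transformed_edges(S, D, JS, JD, junction_of):
    edges = [(s, junction_of[s]) for s in S]   # S -> JS legs
    edges += [(junction_of[d], d) for d in D]  # JD -> D legs
    edges += list(product(JS, JD))             # JS x JD bipartite core
    return edges

S = [f"s{i}" for i in range(7)]
D = [f"d{i}" for i in range(7)]
JS, JD = ["js0", "js1"], ["jd0"]
junction_of = {s: ("js0" if i < 4 else "js1") for i, s in enumerate(S)}
junction_of.update({d: "jd0" for d in D})
print(len(transformed_edges(S, D, JS, JD, junction_of)))  # 16 edges vs. 49 bipartite
```

With junction sets much smaller than the profiles, |S| + |D| + |JS|·|JD| edges replace the |S|·|D| edges of the complete bipartite graph.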

5.3 Analysing the Transformed Network

In summary, Figure 5.6 presents all three networks. Figure (a) is the original network and is

only feasible when the concept-to-concept distance is additive. In the case of non-additivity,

we can replace each A-shaped path (S-D path) in Figure (a) with an edge to create the bipartite

graph in Figure (b). Alternatively, we can select a set of junction nodes from the original

network, then replace each A-shaped S-D path from the original network with an S-JS-JD-D path in Figure (c), such that the bipartite portion (between JS nodes and JD nodes) is shrunk

considerably. We refer to this graph as the transformed network.

Recall that we have two objectives for adapting the MCF framework for non-additive

concept-to-concept distances: (i) the adaptation should not severely compromise the exact-

ness of the distance calculation; (ii) in comparison to the algorithm using additive distances,

the resulting method should be relatively efficient. In Section 5.3.1, we will address the first

objective by analysing the distortion effect of the distance approximation. In Section 5.3.2, we

will address the second objective by examining how we can keep the overall processing cost

low in the junction selection.

5.3.1 Distance Distortion

To address the first objective of minimizing the distance distortion, let us first define the cost

function on the transformed network. For each supply-demand node pair, S and D, the precise concept-to-concept distance is simply distance_NA(S, D) (both in the original network and the


Figure 5.6: The original ontology, the bipartite graph, and the fully transformed graph with two profiles, S and D. In the fully transformed graph, the common ancestors of the profile nodes are labeled JS and JD.

bipartite network). Now that the S-D path (A-shaped in the original network, a single edge in the bipartite network) is replaced with an S-JS-JD-D path in Figure (c), the transformed distance between S and D, distance_trans(S, D), becomes:

    distance_trans(S, D) = distance_NA(S, JS) + distance_NA(JS, JD) + distance_NA(JD, D)    (5.3)

Here, the transformed distance becomes the additive sum of three edges in the new network.

Because each path between a supply node and a demand node is fixed at three edges, the

transformed distance no longer depends on the number of edges along the path in the original

5.3. ANALYSING THE TRANSFORMED NETWORK 79

network (cf. eqn. (5.2)). As a result, we reduce the distortion effect on the transformed distance.

Although each path has exactly three edges, there is still some distortion from approximating the concept-to-concept distance additively. To illustrate the distortion effect, consider Jiang and Conrath’s (1997) distance, which measures the difference in information content between two concepts and their lowest common ancestor, i.e.,

    distance_jc(S, D) = IC(S) + IC(D) − 2 IC(LCA(S, D))    (5.4)

After the transformation, distance_jc_trans(S, D) becomes:

    distance_jc_trans(S, D) = distance_jc(S, JS) + distance_jc(JS, JD) + distance_jc(JD, D)
                            = [IC(S) + IC(JS) − 2 IC(JS)] +
                              [IC(D) + IC(JD) − 2 IC(JD)] +
                              [IC(JS) + IC(JD) − 2 IC(LCA(JS, JD))]
                            = IC(S) + IC(D) − 2 IC(LCA(JS, JD))    (5.5)

where JS and JD are the junction ancestors of S and D, respectively. The transformation replaces the lowest common ancestor LCA(S, D) in eqn. (5.4) with some other common ancestor (LCA(JS, JD)). Unless LCA(JS, JD) = LCA(S, D), the distance is distorted by using a less precise quantity, IC(LCA(JS, JD)).
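The telescoping in eqn. (5.5) can be verified numerically (a sketch with made-up concept probabilities; IC(c) = −log p(c)):

```python
import math

IC = lambda p: -math.log(p)

def jc(p_a, p_b, p_lca):
    # Eqn. (5.4): distance_jc = IC(a) + IC(b) - 2 IC(LCA(a, b)).
    return IC(p_a) + IC(p_b) - 2 * IC(p_lca)

# Hypothetical probabilities; ancestors are more probable (lower IC).
p_S, p_D, p_JS, p_JD = 0.01, 0.02, 0.05, 0.08
p_LCA_SD, p_LCA_J = 0.1, 0.3   # LCA(JS, JD) is an ancestor of LCA(S, D)

trans = (jc(p_S, p_JS, p_JS)        # S-JS leg: the LCA is JS itself
         + jc(p_JS, p_JD, p_LCA_J)  # JS-JD leg
         + jc(p_JD, p_D, p_JD))     # JD-D leg: the LCA is JD itself
# The IC(JS) and IC(JD) terms cancel, leaving eqn. (5.5):
assert abs(trans - (IC(p_S) + IC(p_D) - 2 * IC(p_LCA_J))) < 1e-9
print(jc(p_S, p_D, p_LCA_SD) < trans)  # True: the transformed distance is inflated
```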

Note that the degree of distortion depends on the distance and the choice of junction nodes.

In the current example, we use the information content of a concept as given by its maximum

likelihood estimate based on its frequency in a large corpus. An increment in the frequency

of a concept leads to an increment in the frequency of all its ancestors. Due to the frequency

percolation, concepts with a small depth tend to accumulate higher counts than those deeper in the hierarchy. Thus, we expect the information content of a concept to be higher than that of its ancestors, because a concept is more semantically specific than its ancestors (the notion of semantic specificity is captured by the use of the negative log function in the definition of IC).

The transformed distance is distorted accordingly, i.e., IC(LCA(JS, JD)) ≤ IC(LCA(S, D)), as LCA(JS, JD) is an ancestor of LCA(S, D) and thus semantically less specific. In the next section, we will discuss the impact of the choice of junction nodes in relation to the distance distortion.

For other concept-to-concept distances, the analysis is similar. Given that these measures

are also defined in terms of the two concepts of interest and their common ancestor, our approx-

imation minimizes the distortion from the additive calculation by using two ancestors instead

of one, as in eqn. (5.3).

5.3.2 Junction Selection

Now we turn to our second objective. Returning to Figure 5.6 (c), observe that the middle bi-

partite portion between JS and JD nodes (in the transformed network) is considerably smaller in size than the bipartite graph in Figure (b). Therefore, to significantly reduce the amount of time

generating the transformed network, we need to choose the junction set (or the set of ancestors)

that contains considerably fewer nodes than the supply and demand profiles do. Selection of

junction nodes is a key component of the network transformation. Trivially, a junction consist-

ing of profile nodes yields a network equivalent to the complete bipartite network. The key is

to select a junction that is considerably smaller in size than its corresponding profile, hence,

cutting down the number of edges generated, which results insignificant savings in complexity.

Note that there is a tradeoff between the overall computational efficiency and the similarity between the transformed network and the complete bipartite network, and therefore the degree of distance distortion. The closer the junctions are to the corresponding profiles, the more closely the transformed network resembles the complete bipartite network. Though the distance calculation is then more accurate, such a network is also more expensive to process. On the other hand, a junction contains fewer nodes as it approaches the root level; the transformed network then differs more from the complete bipartite network, and thus there is more distortion in the transformed concept-to-concept distance. Clearly, it is important to balance the two factors.


Selecting junction nodes involves finding a small set of ancestor nodes representing the profile nodes in a hierarchy. In other words, the junction can be viewed as an alternative representation of the profile which is also a generalization of the profile nodes. Finding a generalization of a profile is explored in the works of Clark and Weir (2002) and Li and Abe (1998). Unfortunately, the complexity of these algorithms is quadratic (the former) or cubic (the latter) in the number of nodes in a network, which is unacceptably expensive for our transformation method, given that generating the complete bipartite graph itself is just as expensive. Note that to ensure every profile node has an ancestor node in the junction, the selection process has a linear lower bound. To keep the cost low, it is best to keep the junction selection process linear in complexity; if this is not possible, it should at least be significantly less expensive than quadratic in the number of nodes. We will explore the process empirically in section 5.4.1.
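A minimal sketch of such a linear-time coverage check (the helper name and toy hierarchy are hypothetical; the thesis does not prescribe this exact procedure): walk each profile node upward until a junction node is found, and add the profile node itself to the junction when no ancestor covers it.

```python
def ensure_coverage(profile, junction, parent):
    """Return a junction in which every profile node has an ancestor.

    profile: iterable of node names; junction: initial set of candidate
    ancestors; parent: dict mapping each node to its parent (root absent).
    Each profile node is walked to the root once, so the cost is linear
    in the number of profile nodes (times the hierarchy depth).
    """
    junction = set(junction)
    for node in profile:
        p = parent.get(node)
        while p is not None and p not in junction:
            p = parent.get(p)
        if p is None:
            # No ancestor found in the junction (e.g., the profile node is
            # itself an ancestor of a junction node): add the node itself.
            junction.add(node)
    return junction

# Hypothetical toy hierarchy.
parent = {"dog": "animal", "animal": "entity", "idea": "abstraction"}
print(ensure_coverage(["dog", "idea"], {"entity"}, parent))
# "dog" is covered via "entity"; "idea" is not, so it joins the junction.
```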

5.4 Evaluating the Transformed Network

To demonstrate the change in processing time and performance, we compare our transformation method with the original MCF method on the name disambiguation task (Section 3.2), given the large number of comparisons required (nearly 300,000).

Recall that in our name disambiguation experiment, we use the data collected by Pedersen et al. (2005) for the same name disambiguation task. This data is taken from the Agence France Press English Service portion of the GigaWord English corpus distributed by the Linguistic Data Consortium. It consists of the contexts of six pairs of names, chosen to reflect a range of confusability between names. Each pair of names serves as one of six name disambiguation tasks. Each name instance consists of a window of 50 words with the target name obfuscated. The goal is to recover the correct target name in each instance. In Section 5.4.1, we describe our experimental setup for junction selection. Our results are presented in Section 5.4.2.


5.4.1 Junction Selection

We reported earlier that a complete bipartite graph with 900 nodes is too expensive to process. Our first attempt is to select a junction on the basis of the number of nodes it contains. Here, the junctions we select are simple to find by taking a top-down approach. We start at the nine top root nodes of WordNet (nodes of zero depth) and proceed downwards. We limit the search to the top two levels because the second level consists of 158 nodes, while the following level consists of 1307 nodes, which clearly exceeds 900 nodes. Here, we select the junction consisting of eight of the top root nodes (siblings of entity) and the children of entity, given that entity is semantically more general than its siblings.3

In our current experiment, we use Jiang and Conrath's distance for its ease of analysis. As shown in section 5.3.1, only one term in the distance, IC(LCA(i, j)), is replaced because of the use of the junction nodes. Any change in the performance (in comparison to our method without the transformation) can be attributed to the distance distortion resulting from this term being replaced. The analysis of experimental results (next section) is made easy because we can assess the goodness of the transformation given the selected junction: a significant degradation in performance is an indication that the junction nodes should be brought closer to the profile nodes, yielding a more precise distance.

5.4.2 Results and Analysis

To compare the two variants of our method, we perform our name disambiguation experiment

using 100 and 200 training instances per ambiguous name to create the gold standard profiles.

See Table 5.1 for the results. Comparing the results using the full network and the transformed

network, observe that there is very little performance degradation; in fact, in most cases, there

3. The complexity of this selection process is linear in the number of profile nodes because all profile nodes must be examined to ensure they have an ancestor in the junction, as it is possible that a profile node is an ancestor of a junction node. Thus, we have to add any such profile nodes to the junction. This process can only be avoided by using root nodes as junction nodes exclusively.


Name Pairs         Baseline  200 (Full)  200 (Trans)  100 (Full)  100 (Trans)
Ronaldo/Beckham    0.69      0.80        0.88         0.79        0.84
Tajik/Rolf Ekeus   0.74      0.97        0.99         0.98        0.99
Microsoft/IBM      0.59      0.73        0.75         0.73        0.71
Peres/Milosevic    0.56      0.96        0.99         0.97        0.99
Jordan/Egyptian    0.54      0.77        0.76         0.74        0.76
Japan/France       0.51      0.75        0.82         0.75        0.83
Weighted Average   0.53      0.77        0.82         0.76        0.82

Table 5.1: Name disambiguation results (accuracy) at a glance. The baseline is the relative frequency of the majority name. "200" and "100" give the averaged results (over five different runs) using 200 and 100 randomly selected training instances per ambiguous name. The weighted average is calculated based on the number of test instances per task. "Full" and "Trans" refer to the results using the full network (pre-transformation, edge distance) or the pared-down network (with transformation, Jiang and Conrath's measure), respectively.

is some increase in accuracy (the difference is significant, paired t-test with p ≪ 0.05).4 The increase in performance on the transformed network is interesting. Clearly the transformed distance is less precise than the true distance; however, a concept-to-concept distance that is more sophisticated than the edge distance may not only compensate for the distance distortion from the transformation, but also improve the performance.

In our experiment, we use junction nodes with a small depth. Such nodes distort the distance more than those with a larger depth. Surprisingly, our experiment indicates that using such nodes produces equally good or better performance. This suggests that selecting a junction with a larger depth, at least for the data in this task, is not necessary.

As mentioned earlier, Jiang and Conrath's distance is more sophisticated than the simple edge distance, which may compensate for the distance distortion. Moreover, the name disambiguation data was previously shown to be easily classifiable by our method, based on the good performance on the full network (see Chapter 3) and the moderate density value (see Chapter 4). In other words, not only do the profile nodes cluster closely together, nodes of similar profiles cluster more closely than nodes of dissimilar profiles. Consider Figure 5.7, where there

4. Because we run the experiment five times per name pair per experimental condition (two training sizes and two network variants), six name pairs yield 30 results per experimental condition. The t-tests are calculated to compare the performances of the two network variants.


Figure 5.7: Three clusters of concepts.

are three shapes, each representing a profile with its nodes clustered within the shaded area, and we would like to measure the distance between the triangle profile (nodes belonging to the class "cheeses") and each of the square profiles (nodes belonging to either the class "pasta" or "shoes"). Regardless of the junctions chosen, the right-most square profile ("shoes") is still spatially far from the triangle profile ("cheeses"). In comparison, the "pasta" square profile is much closer to the triangle profile, as indicated by the shorter arrows. Therefore, the overall classifiability of a dataset may influence the type of junction nodes that are most effective; for example, the negative impact of imprecise junctions on highly classifiable datasets should be small. For future research, one could examine the junction selection process as a function of the overall classifiability of a dataset.

In comparison to our reported running time on the pre-transformation network (120 comparisons running for 10 days), on the same machine, making 12,000 comparisons can now be accomplished within two hours. In terms of complexity, if we have n profile nodes and j junction nodes, the number of edges to be processed is O(n + j²). Given that our junctions have far fewer nodes than the original profiles, the running time is much less than quadratic in the number of profile nodes.
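To make the O(n + j²) saving concrete, a quick back-of-the-envelope comparison (the split of the 900 profile nodes and the roughly 160-node junction are illustrative assumptions, not figures from the thesis):

```python
def edge_counts(n_supply, n_demand, j_supply, j_demand):
    """Compare edge counts of the two network constructions.

    Complete bipartite: every supply node connects to every demand node.
    Transformed: each profile node connects to its junction side (O(n)),
    and the two junction sides form a small bipartite core (O(j^2)).
    """
    complete = n_supply * n_demand
    transformed = (n_supply + n_demand) + j_supply * j_demand
    return complete, transformed

complete, transformed = edge_counts(450, 450, 160, 160)
print(complete)     # 202500 edges in the complete bipartite network
print(transformed)  # 26500 edges in the transformed network
```

Even with a junction of 160 nodes per side, the transformed network carries roughly an order of magnitude fewer edges.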


5.5 Summary

In this chapter, we have presented a method of incorporating non-additive concept-to-concept distances by transforming the underlying ontological structure. Because the pre-transformation network is inefficient to process, we proposed a novel technique that mimics the structure of the more computationally intensive network. Our evaluation shows that it is possible to transform the structure of the original network without hampering the network-flow method's ability to make fine-grained semantic distinctions, while drastically reducing the computational complexity. Our transformed network offers a competitive alternative to the pre-transformation network.


Chapter 6

Conclusions

About six weeks ago Gertrude Stein said, it does not look to me

as if you were ever going to write that autobiography. You know

what I am going to do? I am going to write it for you. I am going

to write it as simply as Defoe did the autobiography of Robinson

Crusoe. And she has and this is it.

Gertrude Stein, The Autobiography of Alice B. Toklas

In this thesis, we have presented a graph-theoretic approach to calculating the semantic distance between two texts (collections of words). Our method takes advantage of the relational semantic information among words provided by an ontology, and is simultaneously sensitive to distributional information taken from a corpus. Given a suitable ontology, a word frequency distribution for a text can be transformed into a frequency distribution over concept nodes. Hence, each text is treated as a weighted and connected subgraph within a larger graph (the ontology). By incorporating the semantic distance between individual concepts, the ontology becomes a metric space in which we calculate the distance between two texts as the minimum-cost flow between the corresponding subgraphs.
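The overall computation can be sketched with a toy example (the four-concept hierarchy and frequencies below are hypothetical; the thesis uses WordNet and dedicated MCF solvers at much larger scale). One text supplies frequency mass, the other demands it, and the text distance is the cost of the cheapest flow over ontology edges:

```python
import networkx as nx

# Toy ontology: each link carries unit cost (edge distance).
G = nx.DiGraph()
for u, v in [("animal", "dog"), ("animal", "cat"),
             ("root", "animal"), ("root", "tool")]:
    # Flow may travel either way along an ontology link.
    G.add_edge(u, v, weight=1)
    G.add_edge(v, u, weight=1)

# Text A's concept frequencies act as supplies (negative demand) and
# text B's as demands; total supply must equal total demand.
supply = {"dog": 2, "cat": 1}
demand = {"cat": 2, "tool": 1}
for node in G.nodes:
    G.nodes[node]["demand"] = demand.get(node, 0) - supply.get(node, 0)

# Semantic distance = cost of the minimum-cost flow between the texts.
print(nx.min_cost_flow_cost(G))
```

Here two units of mass must leave "dog": one reaches "cat" at cost 2 and one reaches "tool" at cost 3, so the distance is 5.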

We have explored a three-pronged approach in examining our graph-theoretic method for text comparison. First, we have evaluated our network-flow approach in three different text comparison tasks. We selected these tasks so that we can test our method across texts with varying degrees of constraint on the words comprising them. Second, in relation to


the task-based evaluation, we have examined the classifiability of a dataset (a set of texts, each represented as a collection of frequency-weighted concepts) with respect to our network-flow method. Classifiability is defined as how well the data can be clustered together within the graphical structure of the ontology, or how semantically coherent the concepts are in the dataset. Accordingly, we have devised a novel measure called profile density to measure this semantic coherence and, by extension, to serve as an indirect indication of classifiability. Finally, we have examined a computational efficiency issue of our method that stems from the need to generalize our network-flow method to non-additive concept-to-concept distances such as those of Wu and Palmer (1994) and Jiang and Conrath (1997). Incorporation of these concept-to-concept distances requires an expensive pre-processing step if computed exactly. Instead, we have developed a graph transformation method which allows us to reduce the computational complexity without significant performance degradation.

We address the problem of text comparison in terms of the semantic distance between texts. In particular, we are interested in examining how differences in word frequency and in word meaning contribute to the overall text distance. Our method is unique in that we combine the two factors via a network-flow framework. Via the intrinsic graphical structure of an ontology and the formation of semantic profiles, a text can be thought of as a weighted collection of concepts which are connected via ontology links. Then, the distance between two texts is their ontological distance weighted by their difference in word frequency.

The idea of non-dualism can be traced back to the Sanskrit term advaita, which refers to things that are distinct while not being separate (Katz, 2007). This is an apt description of our approach, as the distributional and the ontological information function as one integral piece (i.e., transporting frequency masses in a connected graph) instead of two (e.g., two separate features fed into a machine learning method). The key connective tissue comes from the ontological relations, which link two words or concepts via a path within the ontology. If we took a purely distributional approach, each text would be treated as a point in an n-dimensional space, where each word would occupy one dimension, completely orthogonal (unrelated) to other


words/dimensions. For the purpose of measuring distance, our method has the flexibility to

allow inter-word or inter-concept comparison. Furthermore, MCF has been shown to measure

the distributional distance between two frequency distributions. Our method is the first to

introduce such a non-dual combination systematically.

6.1 Summary of Contributions

In this section, we will summarize our contributions with emphasis on our experimental results.

Applying Our Graph-Theoretic Model to Measuring Text Distance. In Chapter 3, we

evaluated our network-flow method in three different NLP tasks: verb alternation detection,

name disambiguation, and document classification. We selected these tasks to test our method

on data with varying degrees of syntactic and semantic constraints imposed on them. In the

first task, the words have a particular syntactic relation to a target verb. In the second task,

the syntactic restriction is relaxed such that words appearing within a local window of an

ambiguous name are considered. Finally, in the last task, the window size restriction is relaxed

further so that words within a document are included.

The results show that our method is superior to other distributional methods in the first two tasks. In the verb alternation task, our method achieves an average accuracy of 0.67 on randomly selected verbs and is the best method in most conditions. In the name disambiguation task, where the syntactic restriction on the text is relaxed, our method achieves an average accuracy of 0.83 (weighted) and 0.76 (unweighted). In contrast, purely distributional approaches at best reach weighted and unweighted accuracies of 0.72 and 0.52.

Our method is less successful in the document classification task, in which the window size restriction is further removed. In this task, distributional methods reach an accuracy of 0.61 or above, whereas our method achieves an accuracy only in the low 0.30s. Increasing the window size clearly introduces more noise into the data. In an attempt to remove the noise from the data, we created a noise frequency distribution of concepts and subtracted it from the data. The noise


removal results in a slight increase in the performance but is not sufficient to improve it to be

on par with the best distributional results.

Measuring Semantic Coherence within an Ontology. Because of the network-flow framework, there is intricate interaction between the ontological and distributional information used in the calculation of the minimum-cost flow. Therefore, in Chapter 4, we proposed a non-dual approach to measuring profile density as an indicator of how well our network-flow method can classify a dataset. Our analysis shows that profile density correlates very well with the performance of our method: the datasets of verb alternation detection and name disambiguation are denser, hence more easily classifiable, than the document classification data.

In our task-based evaluation, our data has varying degrees of syntactic and semantic constraints. Interestingly, the degree of constraint influences the relatedness, or the semantic coherence, within the dataset. In the first task, the words have the most restrictions imposed on them: in addition to syntactic constraints, because we selected verbs from a handful of semantic classes, the verbs exert a high degree of selectional restriction on their arguments. Not surprisingly, this dataset has the highest profile density and therefore semantic coherence. As we relax the constraints further, the density values decrease. The name disambiguation data has the next highest density. The document classification data has the least degree of restriction and, indeed, is the least dense.

Maximizing the Accuracy and Efficiency in the Calculation of Non-additive Distances. In Chapter 5, we addressed the problem of incorporating non-additive distances into our network-flow framework via a graph transformation method. Because the MCF problem definition assumes additivity to hold for the concept-to-concept distance, the use of a non-additive distance becomes impractical without modification: exactness of the distance calculation and efficiency cannot be simultaneously achieved. In this chapter, we introduced a graph transformation method that constructs a new graph in which we can balance the two factors. In our evaluation, we compared the name disambiguation results on the transformed graph vs. those


on the original graph. Not only have we improved the speed (120 comparisons in 10 days vs. 12,000 comparisons in two hours), there is no major performance degradation; in fact, our results on the transformed graph showed some performance improvement.

Our result suggests that there is a link between semantic distance and density. We have shown that density is an indicator of classifiability using our text distance. Given a moderate to high density value and good performance on the full network, nodes of similar profiles are closer in distance than those of dissimilar profiles, regardless of the precision of the junction nodes selected. We have shown that this is indeed the case for the name disambiguation data using highly imprecise junction nodes. We conclude that for a highly classifiable dataset, an approximate network is sufficiently precise to yield comparable results.

6.2 Short-term Improvements: Within the MCF Framework

Text Representation. Currently, we have used a simple profile representation by uniformly distributing word counts to relevant concepts. More accurate frequency estimates of the concepts would clearly result in more accurate classification, especially in the document classification task. There exist statistical methods, such as those of Li and Abe (1998) and Clark and Weir (2002), which produce probability estimates over a collection of concepts. However, when applied to every profile, these methods are impractical given their complexity. One low-complexity option is to pre-process the whole dataset once by pruning out concepts onto which low-frequency and/or highly ambiguous words are mapped; we can then form more "accurate" profiles by considering only the remaining concepts. In this way, we improve the frequency estimates of the concepts with very little extra overhead attached to the profile generation.
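The current uniform profile construction can be sketched as follows (the sense inventory is a hypothetical stand-in for a WordNet lookup): each word's corpus count is split evenly over all concepts the word can map to.

```python
from collections import defaultdict

def build_profile(word_counts, senses):
    """Distribute each word's count uniformly over its candidate concepts.

    word_counts: word -> corpus frequency
    senses: word -> list of concept names the word maps to
    """
    profile = defaultdict(float)
    for word, count in word_counts.items():
        concepts = senses.get(word, [])
        if not concepts:
            continue  # word not covered by the ontology
        share = count / len(concepts)
        for concept in concepts:
            profile[concept] += share
    return dict(profile)

# Hypothetical example: "bank" is two-ways ambiguous.
senses = {"bank": ["financial_institution", "river_bank"],
          "money": ["currency"]}
print(build_profile({"bank": 4, "money": 2}, senses))
# {'financial_institution': 2.0, 'river_bank': 2.0, 'currency': 2.0}
```

Ambiguous words thus dilute their mass over all senses, which is exactly the imprecision the pruning option above is meant to reduce.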

The Use of Different Ontological Relations. Currently, we use only the hyponymy links within the ontology to capture the semantic relations among words or concepts. It has been shown that readers process the content of a text by relating the concepts in a variety of ways


other than hyponymy (Morris, 2006). It is thus possible that the inclusion of other ontological relations will be useful in some applications. As a result, the graphical representation would no longer be hierarchical, which poses little problem, as the MCF definition makes no assumption about the graph structure. However, we may have to reconsider what a reasonable concept-to-concept distance is. Many distance measures consider the two target concepts as well as their common ancestors. Given that there are many relations, there are potentially multiple common ancestors. Hence, computing the distance between two concepts becomes more complicated and computationally inefficient (e.g., Hirst and St-Onge, 1998, who consider all ontological relations in WordNet). A new method of measuring concept-to-concept distance may be necessary to account for the more complicated graph structure.

Classifiability within the Network-Flow Framework. We use profile density to analyse the behaviour of our MCF method on three NLP tasks. One area worthy of exploration is the use of profile density as an indicator of the overall classifiability of a dataset under the MCF framework. Currently, we are able to rank the three datasets in terms of their density values, but further examination is needed to determine a reliable indicator of classifiability within our network-flow framework. One straightforward method is to test on a wider variety of texts to establish a meaningful threshold (range) for predicting whether the MCF method will be useful for a task.

6.3 Long-Term Research Directions

Semantic Coherence. We suggest that not only is our profile density useful in predicting the performance of our network-flow method on unseen data, it can also be useful for measuring the semantic coherence of a text in general. Note that a text that is semantically coherent tends to form profiles with highly frequent and highly related concepts within an ontology. Coincidentally, our profile density formulation measures the overall coherence of a collection of concepts by taking into account the distance between the concepts as well as their frequencies. For example, if we relax the notion of a text to include a collection of verbal arguments (e.g.,


nouns appearing as the direct object of a verb), and not just the words appearing sequentially in a document, the semantic coherence of a text can be thought of as the selectional preference strength a verb imposes on its arguments. As future work, we intend to investigate profile density as an indicator of selectional preference strength.

Verb Alternation Discovery. Verb alternation discovery is a generalized version of verb alternation detection. The idea is to discover possible alternation behaviour given a labelled slot and an unlabelled slot, which is potentially useful in tasks such as semantic role labelling and in detecting non-compositionality (McCarthy et al., 2007). The assumption here is that, given a verb, syntactic slots with the same role label would have similar selectional preference strength. For example, the THEME slots (subject of the intransitive and direct object of the transitive) of the verb melt are likely to be "meltable" things. Given that both slots would have similar selectional preferences, the instances of their selectional preference strength (or profile density as selectional preference strength) would also be similar. Then, the goal is to detect new alternations if an unlabelled syntactic slot is shown to have selectional strength similar to that of a known slot.

Hot-spot Detection. Finally, profile density measures the graphical density of a collection of weighted nodes. A useful extension is hot-spot detection. Given a collection of weighted nodes, the idea is to detect clusters ("hot spots") by measuring the density of subsets of nodes in comparison to the overall graphical density. Since the current work assumes a hierarchical graphical structure (e.g., the hyponymy hierarchy in WordNet), subset partition is made possible with the use of ancestors (lowest common ancestors) of the nodes. Hot-spot detection is useful in applications such as verb sense detection. Consider the verb pour as in "I pour some milk into the glass" vs. "The Bank of England poured £3M into Northern Rock". Here, there are two distinct senses of the verb: a liquid-displacement sense and a financial sense. Assuming the corpus counts of the direct objects reflect the two senses, there would be two detectable hot spots in the ontology. Generally, we believe profile density may offer a quantitative measure


for semantic coherence and other related NLP applications.

Bibliography

Agirre, E. and Rigau, G. (1996). Word sense disambiguation using conceptual density. In Proceedings of the 12th International Conference of Computational Linguistics (COLING-1996), pages 16–22, Copenhagen, Denmark.

Al-Mubaid, H. and Umair, S. A. (2006). A new text categorization technique using distributional clustering and learning logic. IEEE Transactions on Knowledge and Data Engineering, 18(9).

Barzilay, R. and Lapata, M. (2005). Collective content selection for concept-to-text generation. In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP).

Bodenreider, O. (2004). The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32:D267–D270.

Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.

Computer Networks and ISDN Systems, 30(1–7):107–117.

Briscoe, T. and Carroll, J. (1997). Automatic extraction of subcategorization from corpora. In Proceedings of the 5th Applied Natural Language Processing Conference (ANLP), pages 356–363.

Briscoe, T. and Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 1499–1504.

Budanitsky, A. and Hirst, G. (2001). Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the Workshop on WordNet and Other Lexical Resources, in the North American Chapter of the Association for Computational Linguistics (NAACL-2001), Pittsburgh, PA.

Budanitsky, A. and Hirst, G. (2006). Evaluating WordNet-based measures of semantic distance.

Computational Linguistics, 32(1):13–47.

Burnard, L. (2000). The British National Corpus Users Reference Guide. Oxford University Computing Services, Oxford, UK.

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Clark, S. and Weir, D. (2002). Class-based probability estimation using a semantic hierarchy.

Computational Linguistics, 28(2):187–206.

Corley, C. and Mihalcea, R. (2005). Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment.

Esuli, A., Fagni, T., and Sebastiani, F. (2006). TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE'06), pages 13–24, Glasgow, UK.

Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database. MIT Press.

Gurevych, I., Malaka, R., Porzel, R., and Zorn, H.-P. (2003). Semantic coherence scoring using an ontology. In Proceedings of the Joint Human Language Technology and Northern Chapter of the Association for Computational Linguistics Conference (HLT-NAACL), pages 88–95, Edmonton, Canada.


Han, H., Zha, H., and Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Joint Conference on Digital Libraries (JCDL'05).

Hirst, G. and St-Onge, D. (1998). Lexical chains as representations of context for the detection

and correction of malapropisms. In Fellbaum (1998), pages 305–332.

Iwayama, M., Fujii, A., Kando, N., and Marukawa, Y. (2003). An empirical study on retrieval models for different document genres: Patents and newspaper articles. In Proceedings of the 26th ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 251–258.

Jarmasz, M. and Szpakowicz, S. (2003). Roget's Thesaurus and semantic similarity. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003), pages 212–219, Borovets, Bulgaria.

Jiang, J. and Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, pages 19–33.

Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines – Methods, Theory, and Algorithms. Kluwer/Springer.

Katz, J. (2007). One: Essential Writings on Nonduality. Sentient Publications.

Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical

Statistics, 22:79–86.

Lee, L. (2001). On the effectiveness of the skew divergence for statistical language analysis.

In Artificial Intelligence and Statistics, pages 65–72.

Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Levina, E. and Bickel, P. (2001). The earth mover's distance is the Mallows distance: Some insights from statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, volume 2, pages 251–256.

Li, H. and Abe, N. (1998). Word clustering and disambiguation based on co-occurrence data.

In Proceedings of COLING-ACL 1998, pages 749–755.

Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning.

McCarthy, D. (2000). Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of Applied Natural Language Processing and North American Chapter of the Association for Computational Linguistics (ANLP-NAACL 2000), pages 256–263.

McCarthy, D. (2001). Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, Brighton, UK.

McCarthy, D., Venkatapathy, S., and Joshi, A. K. (2007). Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2007), Prague, Czech Republic.

Merlo, P. and Stevenson, S. (2001). Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):393–408.

Mihalcea, R. (2005). Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling. In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP).


Mihalcea, R. (2006). Random walks on text structures. In Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing) 2006, pages 249–262. Springer-Verlag.

Mitchell, T. (1999). 20 newsgroups usenet articles. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html.

Mohammad, S. and Hirst, G. (2006). Distributional measures of concept-distance: A task-oriented evaluation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia.

Morris, J. (2006). Readers' subjective perceptions of lexical cohesion and implications for computers' interpretations of text meaning. In Proceedings of CaSTA Conference on Breadth of Text, University of New Brunswick, Canada.

Navigli, R. and Velardi, P. (2005). Structural semantic interconnections: A knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7).

Nigam, K., McCallum, A., and Mitchell, T. (2006). Semi-supervised Text Classification Using EM, pages 33–56. MIT Press, Boston, MA, USA.

Pang, B. and Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd ACL, pages 271–278.

Pantel, P. and Lin, D. (2000). An unsupervised approach to prepositional phrase attachment using contextually similar words. In Proceedings of the Association for Computational Linguistics (ACL-00), pages 101–108, Hong Kong.

Pedersen, T., Purandare, A., and Kulkarni, A. (2005). Name discrimination by clustering similar context. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics.


Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.

Pinker, S. (1989). Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge, MA.

Rada, R., Mili, H., Bicknell, E., and Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19:17–30.

Rennie, J. (2001). Improving multi-class text classification with naive Bayes. Master's thesis, Massachusetts Institute of Technology.

Resnik, P. (1993). Selection and Information: A Class-Based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, Philadelphia, PA, USA.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence.

Ribas, F. (1995). On learning more appropriate selectional restrictions. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 112–118, Dublin, Ireland.

Schulte im Walde, S. (2006). Experiments on the automatic induction of German semantic verb classes. Computational Linguistics, 32(2):159–194.

Scott, S. and Matwin, S. (1998). Text classification using WordNet hypernyms. In Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing Systems, pages 45–51.

Weeds, J. (2003). Measures and Applications of Lexical Distributional Similarity. PhD thesis, University of Sussex, Sussex, UK.


Weeds, J., Weir, D., and McCarthy, D. (2004). Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-2004).

Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138.

Xu, W., Liu, X., and Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th ACM SIGIR International Conference on Research and Development in Information Retrieval.