
ASM Kernel: Graph Kernel using Approximate Subgraph Matching for Relation Extraction

Nagesh C. Panyam, Karin Verspoor, Trevor Cohn and Kotagiri Ramamohanarao
Department of Computing and Information Systems, The University of Melbourne, Australia
npanyam@student.unimelb.edu.au
{karin.verspoor, t.cohn, kotagiri}@unimelb.edu.au

Abstract

Kernel methods have been widely studied in several natural language processing tasks such as relation extraction and sentence classification. In this work, we present a new graph kernel that is derived from a distance measure described in prior work as Approximate Subgraph Matching (ASM). The classical ASM distance, shown to be effective for event extraction, is not a valid kernel and was primarily designed to work with rule-based systems. We modify this distance suitably to render it a valid kernel (the ASM kernel) and enable its use in powerful learning algorithms such as Support Vector Machines (SVM).

We compare the ASM kernel with SVMs to the classical ASM with a rule-based approach on two relation extraction tasks, and show improved performance with the kernel-based approach. Compared to other kernels such as the Subset Tree kernel and the Partial Tree kernel, the ASM kernel outperforms in relation extraction tasks and is of comparable performance in a general sentence classification task. We describe the advantages of the ASM kernel, such as its flexibility and ease of modification, which offer further directions for improvement.

1 Introduction

Many natural language processing tasks such as relation extraction or question classification are cast as supervised classification problems (Bunescu and Mooney, 2005), with the object to classify being an entity pair or a sentence. Traditional approaches have typically focussed on transforming the input into a feature vector, which is then classified using learning algorithms such as decision trees or SVMs. A primary limitation of this approach has been the manual effort required to construct a rich set of features that can yield high performance classification. This effort is evident in the construction of features from the syntactic parse of the text, which is often represented as an ordered structure such as a tree or a graph. Linearizing a highly expressive structure such as a graph, by transforming it into a flat array of features, is inherently hard. This problem of constructing explicit feature sets for complex objects is generally overcome by kernel methods for classification. Kernel methods allow for an implicit exploration of a vast high dimensional feature space and shift the focus from feature engineering to similarity score design. Importantly, such a kernel must be shown to be symmetric and positive semi-definite (Burges, 1998) to be valid for use with kernelized learning algorithms such as SVM. Deep learning based approaches (Zeng et al., 2014; Xu et al., 2015) are other alternatives that eliminate manual feature engineering effort. However, in this work we are primarily focussed on kernel methods.
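As a concrete illustration of this validity condition (not part of the original paper), a Gram matrix computed over a training set can be checked empirically for symmetry and positive semi-definiteness. A minimal NumPy sketch:

```python
import numpy as np

def is_valid_kernel(K, tol=1e-8):
    """Check the two conditions a Gram matrix must satisfy to come
    from a valid kernel: symmetry and positive semi-definiteness."""
    if not np.allclose(K, K.T, atol=tol):
        return False
    # Eigenvalues of a symmetric matrix are real; PSD means none are
    # (significantly) negative.
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

# A Gram matrix built from explicit feature vectors (K = X X^T) is
# symmetric and PSD by construction.
X = np.random.rand(5, 3)
print(is_valid_kernel(X @ X.T))   # True
```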

In NLP, kernel methods have been effectively used for relation extraction and sentence classification. Subset tree kernels (SSTK) and partial tree kernels (PTK) were developed to work with constituency parse trees and basic dependency parse trees. However, these kernels are not suitable for arbitrary graph structures such as the enhanced dependency parses (Manning et al., 2014). Secondly, tree kernels can only handle node labels and not edge labels. As a workaround, these kernels require that the original dependency graphs be heuristically altered to translate edge labels into special nodes, creating different syntactic representations such as the grammatical relation centered tree (Croce et al., 2011). These limitations were overcome with Approximate Subgraph
Matching (ASM) (Liu et al., 2013), which was designed to be a flexible distance measure that handles arbitrary graphs with edge labels and edge directions. However, the classic ASM is not a valid kernel and therefore cannot be used with powerful learning algorithms like SVM. It was therefore used in a rule-based setting, where it was shown to be effective for event extraction (Kim et al., 2011).

1.1 Contributions

In this work, our primary contribution is a new graph kernel (the ASM kernel), derived from the classical approximate subgraph matching distance, that:

- is flexible, working directly with graphs with cycles and edge labels.
- is a valid kernel for use with powerful learning algorithms like SVM.
- outperforms the classical ASM distance with a rule based method for relation extraction.
- outperforms tree kernels for relation extraction and is of comparable performance for a sentence classification task.

2 Methods

In this section, we first describe the classical ASM distance measure that was originally proposed in (Liu et al., 2013). We then discuss the modifications we introduce to transform this distance measure into a symmetric L2 norm in a valid feature space. This step allows us to enumerate the underlying feature space and to elucidate the mapping from a graph to a vector in a high dimensional feature space. We then define the ASM kernel as a dot product in this high dimensional space of well defined features. Besides establishing the validity of the kernel, the feature map clarifies the semantics of the kernel and helps the design of interpretable models.
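To make the dot-product construction concrete, here is a minimal Python sketch; the feature keys are invented placeholders standing in for the ASM feature space developed below:

```python
from collections import Counter

def dot_product_kernel(phi_g1, phi_g2):
    """Any kernel defined as K(g1, g2) = <phi(g1), phi(g2)> for an
    explicit feature map phi is symmetric and positive semi-definite
    by construction, hence valid for use with SVMs."""
    return sum(v * phi_g2[k] for k, v in phi_g1.items())

# Toy explicit feature maps (hypothetical path-based features):
# keys identify features, values are counts.
phi_a = Counter({("len=2", ("nmod:of", "acl")): 1,
                 ("len=3", ("acl", "nmod:by")): 2})
phi_b = Counter({("len=2", ("nmod:of", "acl")): 1})
print(dot_product_kernel(phi_a, phi_b))   # 1
```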

2.1 Classic ASM distance

We describe the classic ASM distance in the context of a binary relation extraction task. Consider two sample sentences drawn from the training set and test set of such a task corpus, as illustrated in Figure 1. Entity annotations, given for the whole corpus, are character spans referring to two entities in a sentence. In the illustrated example, the entities are chemicals (metoclopramide and pentobarbital) and diseases (dyskinesia and amnesia). The training data also contains relation annotations, which are related entity pairs such as (metoclopramide, dyskinesia). We assume that the relation (causation) is implied by the training sentence and then try to infer a similar relation, or its absence, in the test sentence.

Preprocessing The first step in the ASM event extraction system is to transform each sentence into a graph whose nodes represent tokens in the sentence. Node labels are derived from the corresponding token's properties, such as the word lemma, the part of speech (POS) tag, or a combination of both. The node labels for entities are usually designated as Entity1 and Entity2. This process is referred to as entity blinding and is known to improve generalization (Thomas et al., 2011). Labelled edges are given by a dependency parser (Manning et al., 2014). A graph from a test sentence is referred to as a main graph. Given a training sentence and its corresponding graph, we extract the subgraph within it that consists of only those nodes that represent the entities or belong to the shortest path[1] between the two entities. This is referred to as a rule subgraph (see Figure 1a).
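A minimal sketch of this preprocessing step, assuming tokenization and dependency edges are already available from a parser; it uses networkx, and the node ids, labels, and toy sentence fragment are illustrative only:

```python
import networkx as nx

def build_rule_subgraph(tokens, edges, e1, e2):
    """Build a dependency graph and extract the rule subgraph: the two
    entity nodes plus all nodes on the shortest (undirected) path
    between them. `tokens` maps node id -> label (e.g. lemma or POS);
    `edges` are (head, dependent, dependency-label) triples."""
    g = nx.DiGraph()
    for node, label in tokens.items():
        g.add_node(node, label=label)
    for head, dep, rel in edges:
        g.add_edge(head, dep, label=rel)

    # Entity blinding: replace entity token labels with generic markers.
    g.nodes[e1]["label"] = "Entity1"
    g.nodes[e2]["label"] = "Entity2"

    # Shortest path by number of edges, ignoring edge direction.
    path = nx.shortest_path(g.to_undirected(), e1, e2)
    return g.subgraph(path).copy()

# Toy fragment (structure and labels invented for illustration):
tokens = {0: "case", 1: "dyskinesia", 2: "caused", 3: "metoclopramide"}
edges = [(0, 1, "nmod:of"), (1, 2, "acl"), (2, 3, "nmod:by")]
rule = build_rule_subgraph(tokens, edges, e1=3, e2=1)
print(sorted(rule.nodes))   # [1, 2, 3]
```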

Approximate Subgraph Isomorphism The main idea in ASM is that a test sentence is considered to be of the same type, or to express the same relation, as a training sentence if we can find a subgraph isomorphism of the rule graph (training sentence) in the main graph (test sentence). Exact subgraph isomorphism (boolean) is considered too strict and is expected to hurt generalization. Instead, ASM tries to compute a measure (a real number) of subgraph isomorphism. This measure is referred to as the Approximate Subgraph Matching distance. If the ASM distance between a rule graph and a main graph is within a predefined threshold, then the test sentence is considered positive, i.e., of the same relation type as the rule graph.
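The resulting decision rule is a simple threshold test; a sketch, where the distance function and threshold are placeholders to be supplied:

```python
def classify(rule_graph, main_graph, distance_fn, threshold):
    """Rule-based ASM decision: the test sentence (main graph) is
    labelled positive if it lies within `threshold` of the rule graph
    under the given ASM distance function."""
    return distance_fn(rule_graph, main_graph) <= threshold
```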

ASM distance We first compute an injective mapping M from the rule graph to the main graph. An injective matching scheme essentially maps each node of the rule subgraph to a node in the main graph with an identical label. If no matching scheme can be found, then the ASM distance is set to a very large quantity (∞). Following the node matching, we do not demand a matching of edges between the two graphs, as in a typical exact isomorphism search.
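For illustration, a naive backtracking search for one label-preserving injective mapping could look as follows; this is a simplified sketch, not the matching procedure of the published ASM implementation:

```python
def injective_mapping(rule_nodes, main_nodes):
    """Find one injective map from rule-graph nodes to main-graph nodes
    such that mapped nodes carry identical labels. `rule_nodes` and
    `main_nodes` map node id -> label. Returns None if no map exists."""
    rule_ids = list(rule_nodes)

    def backtrack(i, mapping, used):
        if i == len(rule_ids):
            return dict(mapping)
        r = rule_ids[i]
        for m, label in main_nodes.items():
            if m not in used and label == rule_nodes[r]:
                mapping[r] = m
                result = backtrack(i + 1, mapping, used | {m})
                if result is not None:
                    return result
                del mapping[r]
        return None

    return backtrack(0, {}, set())

# Toy example with invented labels:
rule = {0: "Entity1", 1: "cause", 2: "Entity2"}
main = {10: "Entity1", 11: "learn", 12: "cause", 13: "Entity2"}
print(injective_mapping(rule, main))   # {0: 10, 1: 12, 2: 13}
```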

[1] Throughout this paper, shortest path refers to the path with the least number of edges in the undirected version of the graph.


[Figure 1: Sample dependency graphs from two sentences expressing a relation of type causation between two entities.
(a) Graph from the training sentence "A case of tardive dyskinesia caused by metoclopramide" (entity mentions metoclopramide and dyskinesia); the rule subgraph within is shown with a surrounding box.
(b) Main graph from the test sentence "Learning of rats under amnesia caused by pentobarbital" (entity mentions pentobarbital and amnesia).
Edges carry dependency labels such as det, case, amod, acl, nmod:of, nmod:under, and nmod:by.]

Instead, we compute the difference between these edges to obtain the approximate subgraph matching (ASM) distance. The ASM distance is a weighted summation of three components, namely the structural distance, the label distance and the directionality distance. These are described below, with the aid of the notation described in Table 1. Note that edge directions are interpreted as special directional labels of type forward or backward.

The structural distance (SD), label distance (LD) and directionality distance (DD) for a pair of matched paths $P^r_{x,y}$ (in the rule graph) and $P^m_{x,y}$ (in the main graph) are defined as:

$$SD(P^r_{x,y}, P^m_{x,y}) = \left| Len(P^r_{x,y}) - Len(P^m_{x,y}) \right|$$

$$LD(P^r_{x,y}, P^m_{x,y}) = \#\left( EL(P^r_{x,y}) \,\triangle\, EL(P^m_{x,y}) \right)$$

$$DD(P^r_{x,y}, P^m_{x,y}) = \#\left( DL(P^r_{x,y}) \,\triangle\, DL(P^m_{x,y}) \right)$$

where $Len(\cdot)$ is the number of edges in a path, $EL(\cdot)$ and $DL(\cdot)$ collect the edge labels and directional labels along a path, $\triangle$ denotes the symmetric difference, and $\#$ gives the size of a collection.
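A sketch of these three path-level components, representing each path by its edge labels and directional labels; the weighting scheme that combines them into the full ASM distance is not shown, and the labels in the toy example are invented:

```python
from collections import Counter

def sym_diff_size(a, b):
    """Size of the multiset symmetric difference between two label
    collections (the #(... triangle ...) operation defined above)."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())

def path_distances(rule_path, main_path):
    """Compute (SD, LD, DD) for a pair of matched paths. Each path is
    a dict with 'edge_labels' and 'directions' (forward/backward)."""
    sd = abs(len(rule_path["edge_labels"]) - len(main_path["edge_labels"]))
    ld = sym_diff_size(rule_path["edge_labels"], main_path["edge_labels"])
    dd = sym_diff_size(rule_path["directions"], main_path["directions"])
    return sd, ld, dd

# Toy example with hypothetical labels:
rule = {"edge_labels": ["nmod:of", "acl", "nmod:by"],
        "directions": ["forward", "backward", "forward"]}
main = {"edge_labels": ["nmod:under", "acl", "nmod:by"],
        "directions": ["forward", "backward", "forward"]}
print(path_distances(rule, main))   # (0, 2, 0)
```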