Simple Algorithms for Complex Relation Extraction with Applications to Biomedical IE

Ryan McDonald, Fernando Pereira, Seth Kulick
CIS and IRCS, University of Pennsylvania, Philadelphia, PA

Scott Winters, Yang Jin, Pete White
Division of Oncology, Children’s Hospital of Pennsylvania, Philadelphia, PA

ACL 2005



Abstract

• Simple two-stage method for extracting complex (n-ary) relations between named entities in text
  – First stage: create a graph from pairs of entities that are likely to be related
  – Second stage: find maximal cliques in the graph; each clique is a candidate relation instance

• Experiments on biomedical text


Introduction - 1/2

• n-ary relation
  – A relation is defined by a schema (t1, …, tn), where each ti is an entity type.
  – An instance (tuple) of the relation is a list of entities (e1, …, en) such that, for each i, either Type(ei) = ti or ei = ⊥ (the i-th argument is missing).

• Example
  – Entity types: {person, job, company}
    • “John Smith is the CEO at Inc. Corp.”
    • (John Smith, CEO, Inc. Corp.)
    • “Everyday John Smith goes to his office at Inc. Corp.”
    • (John Smith, ⊥, Inc. Corp.)
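The definition above can be made concrete with a small check. This is only a sketch: the lookup table `ENTITY_TYPE` and the function `is_valid_instance` are hypothetical names introduced here, and `None` stands in for a missing argument.

```python
from typing import Optional, Sequence

# Hypothetical entity-type lookup for the example sentences (illustration only).
ENTITY_TYPE = {
    "John Smith": "person",
    "CEO": "job",
    "Inc. Corp.": "company",
}

def is_valid_instance(schema: Sequence[str], tup: Sequence[Optional[str]]) -> bool:
    """True if every slot is either missing (None) or holds an entity of the slot's type."""
    if len(schema) != len(tup):
        return False
    return all(e is None or ENTITY_TYPE.get(e) == t for t, e in zip(schema, tup))

schema = ("person", "job", "company")
full = ("John Smith", "CEO", "Inc. Corp.")
partial = ("John Smith", None, "Inc. Corp.")   # job argument is missing
swapped = ("CEO", "John Smith", "Inc. Corp.")  # types do not match the schema

print(is_valid_instance(schema, full))     # True
print(is_valid_instance(schema, partial))  # True
print(is_valid_instance(schema, swapped))  # False
```

Both example tuples from the slide pass the check, while a tuple whose entities sit in the wrong slots does not.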


Introduction - 2/2

• Applications
  – Question answering
  – Automatic database generation
  – Intelligent document searching and indexing

• Most relation extraction systems focus on:
  – Binary relations, such as
    • the employee-of relation
    • the protein-protein interaction relation
  – Extracting keyphrases to represent relations in social networks from the Web (Matsuo et al., IJCAI-07)


Previous Work

• Zelenko et al., 2003
  – Binary relations in news text
    • “John Smith, not Jane Smith, works at IBM.”
    • (John Smith, IBM): positive
    • (Jane Smith, IBM): negative

• Miller et al., 2000
  – Identify all relations
    • Relation extraction from probabilistic parse trees

• Rosario and Hearst, 2004
  – Extracting seven relationships between treatments and diseases


Definitions

• n-ary relation
  – A relation is defined by a schema (t1, …, tn), where each ti is an entity type.
  – An instance (tuple) of the relation is a list of entities (e1, …, en) such that, for each i, either Type(ei) = ti or ei = ⊥.

• Maximal clique
  – Let G = (V, E) be an undirected graph with vertex set V and edge set E.
  – A clique C of G is a subgraph of G in which there is an edge between every pair of vertices.
  – A maximal clique of G is a clique C = (Vc, Ec) such that there is no other clique C’ = (Vc’, Ec’) with Vc ⊂ Vc’.
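The two clique definitions translate directly into code. The sketch below uses a hypothetical toy graph; `is_clique` and `is_maximal_clique` are illustrative helper names. It relies on the fact that a clique is maximal exactly when no single further vertex can be added to it.

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every pair of distinct vertices is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

def is_maximal_clique(vertices, all_vertices, edges):
    """True if `vertices` is a clique and no other vertex can be added to it."""
    if not is_clique(vertices, edges):
        return False
    others = set(all_vertices) - set(vertices)
    return not any(is_clique(set(vertices) | {v}, edges) for v in others)

# Toy graph (hypothetical): a triangle a-b-c plus a pendant edge c-d.
V = {"a", "b", "c", "d"}
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}

print(is_clique({"a", "b"}, E))                  # True
print(is_maximal_clique({"a", "b"}, V, E))       # False: c can still be added
print(is_maximal_clique({"a", "b", "c"}, V, E))  # True
print(is_maximal_clique({"c", "d"}, V, E))       # True
```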


Methods : Classifying Binary Relations-1/3

• Example: entity types {person, job, company}
  – “John and Jane are CEOs at Inc. Corp. and Biz. Corp. respectively.”
  – 12 possible tuples

• Problems with building a classifier over whole tuples
  – Exponential run time: the number of candidate tuples grows exponentially with the number of arguments
  – How to handle incomplete but correct instances such as (John, ⊥, Inc. Corp.)
    • If they are labeled as negative, the model might incorrectly disfavor features that correlate John with Inc. Corp.
    • If they are labeled as positive, the model may tend to prefer the shorter and more compact incomplete relations.
    • If they are ignored, the data would be heavily skewed towards negative instances.
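The count of 12 candidate tuples can be reproduced by brute-force enumeration. The slide does not spell out the counting rule, so the sketch below assumes that a candidate must have at least two filled arguments (the `min_args` parameter is that assumption); `None` stands for a missing argument.

```python
from itertools import product

# Entities per type in the example sentence.
persons = ["John", "Jane"]
jobs = ["CEO"]
companies = ["Inc. Corp.", "Biz. Corp."]

def candidate_tuples(*slots, min_args=2):
    """Enumerate tuples over the slots, allowing missing arguments (None),
    and keep those with at least `min_args` filled positions (an assumption)."""
    for tup in product(*[list(s) + [None] for s in slots]):
        if sum(e is not None for e in tup) >= min_args:
            yield tup

candidates = list(candidate_tuples(persons, jobs, companies))
print(len(candidates))  # 12
```

Each slot contributes one extra option for "missing", so the raw product has (2+1)·(1+1)·(2+1) = 18 tuples before filtering. That product is what grows exponentially as the number of arguments increases.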


Methods : Classifying Binary Relations-2/3

• Solution
  – The set of all possible entity pairs is much smaller than the set of all possible complex relation instances.
  – Train a classifier to identify pairs of related entities.

• Positive: (John, CEO), (John, Inc. Corp.), (CEO, Inc. Corp.), (CEO, Biz. Corp.), (Jane, CEO) and (Jane, Biz. Corp.)

• Negative: (John, Biz. Corp.) and (Jane, Inc. Corp.)
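The positive and negative pairs listed above can be derived mechanically from the gold tuples. The following is a minimal sketch, assuming that only pairs of entities with different types are candidates (which matches the eight pairs on the slide); `label_pairs` is a hypothetical helper name.

```python
from itertools import combinations

# Entity -> type for the example sentence.
entity_type = {
    "John": "person", "Jane": "person",
    "CEO": "job",
    "Inc. Corp.": "company", "Biz. Corp.": "company",
}

# Gold complex relation instances for the sentence.
gold = [("John", "CEO", "Inc. Corp."), ("Jane", "CEO", "Biz. Corp.")]

def label_pairs(entity_type, gold):
    """Label each cross-type entity pair positive if both entities share a gold tuple."""
    related = set()
    for tup in gold:
        filled = [e for e in tup if e is not None]  # skip missing arguments
        related.update(frozenset(p) for p in combinations(filled, 2))
    labels = {}
    for a, b in combinations(entity_type, 2):
        if entity_type[a] != entity_type[b]:  # assumption: cross-type pairs only
            labels[(a, b)] = frozenset((a, b)) in related
    return labels

labels = label_pairs(entity_type, gold)
positives = [p for p, y in labels.items() if y]
negatives = [p for p, y in labels.items() if not y]
print(len(positives), len(negatives))  # 6 2
```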


Methods : Classifying Binary Relations-3/3

• Learning a binary relation classifier
  – A standard maximum entropy classifier (Berger et al., 1996), as implemented in MALLET (McCallum, 2002)
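For two classes, a maximum entropy classifier has the same form as logistic regression. The sketch below is a tiny pure-Python version trained by per-example gradient ascent. It is not MALLET, and the feature names and training examples are invented for illustration, since the slides do not list the features that were used.

```python
import math

def predict(weights, feats):
    """Probability that the entity pair described by `feats` is related."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

def train_maxent(examples, epochs=200, lr=0.5):
    """Fit a binary maximum entropy (logistic regression) model.
    `examples` is a list of (set_of_binary_features, label) with label in {0, 1}."""
    weights = {}
    for _ in range(epochs):
        for feats, y in examples:
            p = predict(weights, feats)
            for f in feats:
                # Log-likelihood gradient for an active binary feature is (y - p).
                weights[f] = weights.get(f, 0.0) + lr * (y - p)
    return weights

# Hypothetical features and labels (illustration only).
train = [
    ({"types=person-company", "same_clause"}, 1),
    ({"types=person-job", "same_clause"}, 1),
    ({"types=person-company", "other_clause"}, 0),
    ({"types=job-company", "other_clause"}, 0),
]
w = train_maxent(train)
print(predict(w, {"types=person-company", "same_clause"}) > 0.5)   # True
print(predict(w, {"types=person-company", "other_clause"}) > 0.5)  # False
```

The predicted probability is what later serves as an edge weight when relations are reconstructed from pairs.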


Methods : Reconstructing Complex Relations

• Example: suppose the binary classifier returns the following related pairs
  – (John, CEO), (John, Inc. Corp.), (John, Biz. Corp.), (CEO, Inc. Corp.), (CEO, Biz. Corp.) and (Jane, CEO)
  – Relation graph: Figure 2a
  – Cliques: Figure 2b

• Algorithm for finding all maximal cliques
  – Bron and Kerbosch, 1973
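As an illustration, the basic Bron–Kerbosch recursion (without pivoting) can be run on the six pairs from this example. This is a sketch of the textbook algorithm, not the authors' implementation; the graph is built from the pairs listed on the slide.

```python
def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique of the graph given as {vertex: set_of_neighbours}."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r  # r cannot be extended, so it is maximal
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}  # v has been fully explored
        x = x | {v}  # exclude v from later cliques to avoid duplicates

# Pairs accepted by the binary classifier in the example.
pairs = [
    ("John", "CEO"), ("John", "Inc. Corp."), ("John", "Biz. Corp."),
    ("CEO", "Inc. Corp."), ("CEO", "Biz. Corp."), ("Jane", "CEO"),
]
adj = {}
for a, b in pairs:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

cliques = {frozenset(c) for c in bron_kerbosch(adj)}
for c in sorted(cliques, key=lambda s: sorted(s)):
    print(sorted(c))
```

The graph has three maximal cliques: {John, CEO, Inc. Corp.}, {John, CEO, Biz. Corp.} and {Jane, CEO}. The last one corresponds to a tuple whose company argument is missing.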


Methods : Probabilistic Cliques

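• The above approach has a major shortcoming: it assumes that the output of the binary classifier is absolutely correct.

• Weight of a clique C: the geometric mean of its edge weights,
  $w(C) = \left( \prod_{e \in E_C} w(e) \right)^{1/|E_C|}$
  – $E_C$: the set of edges in C
  – w(e): weight of edge e, i.e. the probability that the binary classifier assigns to the corresponding pair

• A valid tuple: $w(C) \ge 0.5$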
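A small numeric sketch of the clique weight, assuming (as in the original paper) that the weight is the geometric mean of the edge probabilities inside the clique. The probabilities below are invented for illustration and `clique_weight` is a hypothetical helper name.

```python
import math
from itertools import combinations

def clique_weight(clique, edge_prob):
    """Geometric mean of the edge probabilities inside `clique`.
    Assumes the clique has at least two vertices."""
    edges = [frozenset(p) for p in combinations(clique, 2)]
    return math.prod(edge_prob[e] for e in edges) ** (1.0 / len(edges))

# Hypothetical classifier probabilities (illustration only).
edge_prob = {
    frozenset(("John", "CEO")): 0.9,
    frozenset(("John", "Inc. Corp.")): 0.8,
    frozenset(("CEO", "Inc. Corp.")): 0.4,
}

weight = clique_weight(("John", "CEO", "Inc. Corp."), edge_prob)
print(round(weight, 2))  # 0.66
print(weight >= 0.5)     # True
```

In this toy case one edge falls below 0.5, yet the clique as a whole is still accepted because the two strong edges pull the mean up. A hard threshold on each individual edge would have rejected it.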


Experiments-1/2

• Extracting genomic variation events from biomedical text (McDonald et al., 2004)
  – Schema: (var-type, location, initial-state, altered-state)
  – “At codons 12 and 61, the occurrence of point mutations from G/A to T/G were observed”
    • (point mutation, codon 12, G, T)
    • (point mutation, codon 61, A, G)

• 447 abstracts selected from MEDLINE
  – 4691 sentences
  – 4773 entities and 1218 relations
  – Of the 1218 relations:
    • 760 have two ⊥ arguments, 283 have one ⊥ argument, and 175 have no ⊥ arguments
    • 38% cannot be handled using binary relations
    • 4% of the annotated relations are non-sentential (their arguments span more than one sentence)
    • Maximum recall: 96%, because relations are extracted within a single sentence
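The percentages on this slide follow from the counts. The check below reads the three counts as the number of missing (⊥) arguments in a four-argument relation, which is an interpretation rather than something the slide states outright.

```python
total = 1218
# Number of missing arguments -> number of relations (interpretation of the slide).
by_missing = {2: 760, 1: 283, 0: 175}

# The three groups account for every annotated relation.
covered = sum(by_missing.values())

# Relations with three or four filled arguments are not binary.
beyond_binary = by_missing[1] + by_missing[0]
share = 100 * beyond_binary / total

# 4% of relations are non-sentential, which caps recall for a sentence-level system.
max_recall = 100 - 4

print(covered == total)  # True
print(round(share))      # 38
print(max_recall)        # 96
```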


Experiments-2/2

• MC
  – Uses the maximum entropy binary classifier coupled with the maximal clique complex relation reconstructor.

• PC
  – Same as MC, except that it uses the probabilistic clique complex relation reconstructor.

• NE
  – A maximum entropy classifier that naively enumerates all possible relation instances, as described under “Methods : Classifying Binary Relations-1/3”.


Experiments : Results-1/2


Experiments : Results-2/2


Conclusions and Future Work

• Complex relation extraction
  – Binary relation learning: maximum entropy classifier
  – Finding maximal cliques in the graph
  – Genomic variation relations

• Future work
  – Parse trees
  – Learn how to cluster vertices into relational groups
  – A vertex/entity can participate in one or more relations


• Learning Field Compatibilities to Extract Database Records from Unstructured Text – M Wick, A Culotta, A McCallum (EMNLP 2006)

• Using Dependency Parsing and Probabilistic Inference to Extract Relationships between Genes – B Goertzel, H Pinto, A Heljakka, IF Goertzel, M (BioNLP 2006)

• Relation Extraction for Semantic Intranet Annotations – L Specia, C Baldassarre, E Motta (Technical Report, kmi.open.ac.uk)