
Page 1: Learning to Extract Relations for Protein Annotation

Learning to Extract Relations

Learning to Extract Relations for Protein Annotation

Jee-Hyub Kim (a), Alex Mitchell (b, c), Teresa K. Attwood (b, c), Melanie Hilario (a)

(a) University of Geneva

(b) University of Manchester

(c) European Bioinformatics Institute

ISMB/ECCB2007, 23 July 2007

1 / 32

Page 2: Learning to Extract Relations for Protein Annotation

Contents

- Introduction
- Related Work
- Problem and Approach
- Methods
- Experimental Results
- Conclusion

Page 3: Learning to Extract Relations for Protein Annotation

Introduction

Protein Annotation

- Definition
  - Given a protein sequence or name, describe the protein with all relevant information.
- Two main methods
  - Sequence analysis (e.g., BLAST, CLUSTALW)
  - Literature analysis
- Traditionally done manually by human annotators; automation is needed.
- Text mining has been used for automation.
- Two main tasks in text mining
  - Information retrieval (IR): retrieve relevant documents.
  - Information extraction (IE): extract specific pieces of information from text for pre-defined entities and their relations.

Page 4: Learning to Extract Relations for Protein Annotation

Adaptive Information Extraction

- IE is domain-specific.
  - Generally, a system developed for one domain cannot be reused for a new domain.
- Developing IE systems requires a significant amount of domain knowledge.
  - Developing IE rules
  - Defining relations
  - These are the two main bottlenecks.
- Many new domains still need their own IE systems.
  - e.g., cell cycle, tissue specificity, etc.

Page 5: Learning to Extract Relations for Protein Annotation

Related Work

IE System Development: Developing IE Rules

- Knowledge engineering (KE) based approach
  - Knowledge engineers write hand-crafted rules with the help of domain experts (e.g., biologists).
  - Not scalable
- Machine learning (ML) based approach
  - Aims to increase the robustness and coverage of IE rules
  - ML has been used to learn IE rules.
    - From annotated corpora, pre-labelled corpora, or raw corpora
- Labor required to develop IE systems
  - KE > ML (annotated corpora > pre-labelled corpora > raw corpora)

Page 6: Learning to Extract Relations for Protein Annotation

IE System Development: Defining Relations

- All previous IE systems assume that the relations to be extracted are already defined.
- It is hard to specify precisely all possible relations to extract, especially in complex and dynamically evolving domains (e.g., the biological domain).
- Positioning (ML-based approach)

| Corpora \ Relations | Pre-defined | Not defined |
|---|---|---|
| Annotated | Soderland (1999), Freitag (2000), Califf and Mooney (2003) | |
| Pre-labelled | Riloff (1996) | Our work |
| Raw | Hasegawa et al. (2004) | Collier (1996) |

Page 7: Learning to Extract Relations for Protein Annotation

Problem and Approach

Our Problem

- Goal
  - To alleviate the burden of developing IE systems for biologists
- Problem definition
  - Given relevant sentences that describe protein X in terms of any topic Y, together with irrelevant sentences
  - Learn to extract relations for protein annotation
- Two sub-problems
  - What to extract
  - How to extract

Page 8: Learning to Extract Relations for Protein Annotation

Bottom-up Approach

Figure: From sentences to relations

Page 9: Learning to Extract Relations for Protein Annotation

Methods

System Architecture

Figure: Overall IE system architecture

Page 10: Learning to Extract Relations for Protein Annotation

Analyzing Sentences: MBSP

- Memory-Based Shallow Parser (MBSP)
  - Developed by Walter Daelemans
  - Extended with named entity taggers and an SVO relation finder
- Provides various types of information: part-of-speech (POS) tags, subject-verb-object (SVO) relations, named entities (NE), etc.
- Adapted to the biological domain on the basis of the GENIA corpus
  - 97.6% accuracy on POS tagging
  - 71.0% accuracy on protein named entity recognition

Page 11: Learning to Extract Relations for Protein Annotation

Analyzing Sentences: Example

- INPUT: Examples of this are the RNA-binding protein containing the RNA-binding domain (RBD) ...
- OUTPUT:

| Chunk | Syntactic | Semantic | SVO relation |
|---|---|---|---|
| Examples | noun phrase | | subject of 'are' |
| of | preposition | | |
| this | noun phrase | | |
| are | verb phrase | | |
| the RNA-binding protein | noun phrase | protein | subject of 'contain' |
| containing | verb phrase | | |
| the RNA-binding domain (RBD) | noun phrase | domain | object of 'contain' |
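The analyzed sentence above can be held as one small record per chunk. The sketch below is illustrative only: the `Chunk` class, its field names, and the `arguments_of` helper are assumptions made for this example, not MBSP's actual output format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    """One analyzed chunk (hypothetical record, not MBSP's real output format)."""
    text: str
    syntactic: str                         # e.g. "noun phrase"
    semantic: Optional[str] = None         # e.g. "protein", "domain"
    svo: Optional[Tuple[str, str]] = None  # (role, verb), e.g. ("subject", "contain")


# The example sentence from the slide, transcribed row by row.
sentence = [
    Chunk("Examples", "noun phrase", svo=("subject", "are")),
    Chunk("of", "preposition"),
    Chunk("this", "noun phrase"),
    Chunk("are", "verb phrase"),
    Chunk("the RNA-binding protein", "noun phrase", "protein", ("subject", "contain")),
    Chunk("containing", "verb phrase"),
    Chunk("the RNA-binding domain (RBD)", "noun phrase", "domain", ("object", "contain")),
]


def arguments_of(chunks, verb):
    """Map each SVO role of `verb` to the chunk that fills it."""
    return {c.svo[0]: c for c in chunks if c.svo is not None and c.svo[1] == verb}


args = arguments_of(sentence, "contain")
print(args["subject"].text, "|", args["object"].text)
```

Keeping the syntactic, semantic, and SVO layers on the same record makes it easy to ask relational questions such as "which chunk is the object of 'contain'?", which is the kind of information the rule learner needs.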

Page 12: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Inductive Logic Programming (ILP)

- We applied ILP to learn IE rules.
  - ILP is an ML algorithm that induces rules from examples.
  - Its outputs are readable and interpretable by domain experts.
  - It can deal with relational information (e.g., parse trees).
- Problem setting

  B ∧ H ⊨ E

  - Given B (background knowledge) and E (examples), find H (hypothesis).
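To make the problem setting concrete, here is a toy check of whether a candidate hypothesis H, together with background knowledge B, covers the positive examples and none of the negative ones. It is a deliberately simplified, propositional sketch: real ILP systems work with first-order clauses and search for H, whereas this code only tests a given H. The fact strings and sentence ids are invented for illustration.

```python
# Background knowledge B: facts that hold for each sentence (hypothetical data).
B = {
    "s1": {"subj_of:contain", "dobj_of:contain:domain"},
    "s2": {"subj_of:contain", "dobj_of:contain:domain", "np:protein"},
    "s3": {"subj_of:be"},
}

# Examples E: relevant (positive) and irrelevant (negative) sentences.
E_pos = {"s1", "s2"}
E_neg = {"s3"}

# Hypothesis H: a set of rules; each rule is the set of facts its body requires.
H = [frozenset({"subj_of:contain", "dobj_of:contain:domain"})]


def covers(hypothesis, facts):
    """True if at least one rule body is satisfied by the given facts."""
    return any(rule <= facts for rule in hypothesis)


def explains(background, hypothesis, pos, neg):
    """True if H covers every positive example and no negative example."""
    complete = all(covers(hypothesis, background[e]) for e in pos)
    consistent = not any(covers(hypothesis, background[e]) for e in neg)
    return complete and consistent


print(explains(B, H, E_pos, E_neg))
```

A rule with an empty body would cover every sentence, including the negative ones, so it would fail the consistency half of the check.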

Page 13: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Representation

- B (background knowledge)
  - Linguistic heuristics (for single-slot IE patterns)
    - e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun prep <np>, etc.
  - Sentence descriptions (i.e., analyzed sentences)
- E (examples)
  - Positive and negative examples (i.e., relevant and irrelevant sentences)
- H (hypothesis)
  - A set of IE rules
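As a rough illustration of the single-slot heuristics, the sketch below builds pattern strings for the two simplest cases, `<subj> verb` and `verb <dobj>`, in the notation used for rules elsewhere in these slides. The function name and its arguments are hypothetical, and the other heuristics (prepositional patterns) are left out.

```python
def single_slot_patterns(verb, subj_sem=None, dobj_sem=None):
    """Build single-slot pattern strings around one verb.

    subj_sem / dobj_sem are semantic classes such as "protein" or "domain";
    "*" stands for "any class". A slot is skipped when its class is None.
    """
    patterns = []
    if subj_sem is not None:
        patterns.append(f"<subj:{subj_sem}> vp:{verb}")
    if dobj_sem is not None:
        patterns.append(f"vp:{verb} <dobj:{dobj_sem}>")
    return patterns


patterns = single_slot_patterns("contain", subj_sem="*", dobj_sem="domain")
# Two single-slot patterns can be conjoined into one multi-slot rule.
rule = " & ".join(patterns)
print(rule)
```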

Page 14: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Generalization

Figure: From a sentence to an IE rule

Page 15: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Step 1

Figure: From a sentence to an IE rule

Page 16: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Step 2

Figure: From a sentence to an IE rule

Page 17: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Step 3

Figure: From a sentence to an IE rule

Page 18: Learning to Extract Relations for Protein Annotation

Learning IE Rules: Step 4

Figure: From a sentence to an IE rule

WRAcc(Rule) = coverage(Rule) × (accuracy(Rule) − accuracy(Head ← true))
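The WRAcc score can be computed from four counts. The sketch below uses the usual reading of the terms: coverage is the fraction of examples the rule body matches, accuracy is the fraction of matched examples that are positive, and accuracy(Head ← true) is the fraction of positives in the whole set. The counts in the usage line are made up for illustration and are not from the experiments.

```python
def wracc(n_total, n_pos, n_covered, n_covered_pos):
    """Weighted relative accuracy of a rule Head <- Body.

    n_total:        number of examples
    n_pos:          examples where Head is true (positives)
    n_covered:      examples where Body is true
    n_covered_pos:  examples where both Head and Body are true
    """
    if n_covered == 0:
        return 0.0  # a rule that matches nothing gets a neutral score
    coverage = n_covered / n_total
    accuracy = n_covered_pos / n_covered
    default_accuracy = n_pos / n_total  # accuracy(Head <- true)
    return coverage * (accuracy - default_accuracy)


# Hypothetical counts: 100 sentences, 40 relevant; the rule matches 20, 18 of them relevant.
score = wracc(n_total=100, n_pos=40, n_covered=20, n_covered_pos=18)
print(round(score, 3))
```

The score is positive only when the rule is more accurate than always predicting the head, and the coverage factor favours rules that match many sentences over very specific ones.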

Page 19: Learning to Extract Relations for Protein Annotation

Selecting IE Rules

- Why is this step necessary?
  - Rules are learned to classify sentences, not to extract information.
  - Spurious IE rules are learned in the previous step.
  - These need to be filtered out by domain experts.
- Rules are presented to users together with supporting information.
  - Example
    - RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9]
    - S: Myocilin is a secreted glycoprotein that forms multimers and contains a leucine zipper and an olfactomedin domain.

Page 20: Learning to Extract Relations for Protein Annotation

Transforming Rules

- Once IE rules are selected, they are transformed into relations.
  - Mapping between IE rules and relations

| IE Rules | Relations |
|---|---|
| extract | argument |
| trigger | relation name |
| syntactic tag | argument position |

- Example
  - <subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y)
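The mapping above can be sketched as a small string transformation. This is a minimal illustration under stated assumptions: it handles only rules with `subj` and `dobj` slots around a single `vp:` trigger, and the regular expressions, the `POSITION` table, and the function name are all hypothetical rather than the system's actual implementation.

```python
import re

# <tag:semantic-class>[Variable], e.g. <dobj:domain>[Y]
SLOT = re.compile(r"<(\w+):([^>]*)>\[(\w+)\]")
# vp:verb, e.g. vp:contain
TRIGGER = re.compile(r"\bvp:(\w+)")
# Assumed ordering: the syntactic tag decides the argument position.
POSITION = {"subj": 0, "dobj": 1}


def rule_to_relation(rule):
    """Turn an IE rule string into a relation template such as contain(X,Y)."""
    triggers = set(TRIGGER.findall(rule))
    if len(triggers) != 1:
        raise ValueError("expected exactly one trigger verb")
    name = triggers.pop()                       # trigger -> relation name
    slots = sorted(SLOT.findall(rule), key=lambda s: POSITION[s[0]])
    args = ",".join(var for _tag, _sem, var in slots)  # extracts -> arguments
    return f"{name}({args})"


relation = rule_to_relation("<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y]")
print(relation)
```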

Page 21: Learning to Extract Relations for Protein Annotation

Grouping Rules

- Post-processing rules

| Pattern | Example |
|---|---|
| A verb (active form) B | A activate B |
| B be ed-participle by A | B be activated by A |
| nominal form (with suffix -tion) of verb of B by A | activation of B by A |
| A be nominal form (with suffix -or) of verb of B | A is an activator of B |
| A be ... that verb (active form) B | A is ... that activates B |
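The effect of these patterns is that different surface forms collapse onto one relation. The sketch below shows this for the single verb "activate" with hand-written regular expressions over plain strings; this is an assumption made to keep the example short, since the system itself presumably works on analyzed chunks and derives the verb forms rather than hard-coding them.

```python
import re

# One regex per surface pattern in the table, hard-coded for "activate".
# More specific patterns come first so the plain active form matches last.
PATTERNS = [
    re.compile(r"^(?P<A>.+?) (?:is|are) .*?that activates? (?P<B>.+)$"),
    re.compile(r"^(?P<B>.+?) (?:is|are|be) activated by (?P<A>.+)$"),
    re.compile(r"^activation of (?P<B>.+?) by (?P<A>.+)$"),
    re.compile(r"^(?P<A>.+?) (?:is|are|be) an? activator of (?P<B>.+)$"),
    re.compile(r"^(?P<A>.+?) activates? (?P<B>.+)$"),
]


def normalize(phrase):
    """Return ("activate", A, B) for any supported surface form, else None."""
    for pattern in PATTERNS:
        match = pattern.match(phrase)
        if match:
            return ("activate", match.group("A"), match.group("B"))
    return None


for phrase in [
    "BCMA activates NF-kappaB",
    "NF-kappaB is activated by BCMA",
    "activation of NF-kappaB by BCMA",
    "BCMA is an activator of NF-kappaB",
    "BCMA is a receptor that activates NF-kappaB",
]:
    print(normalize(phrase))
```

All five phrases normalize to the same triple, which is what allows rules learned from different constructions to be grouped under one relation.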

Page 22: Learning to Extract Relations for Protein Annotation

Applying Rules to Extract Relations

- Now we have a set of IE rules for each relation.
  - IE rule and relation
    - RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
    - TRIGGER: promote
    - RELATION: promote(X,Y)
- Example
  - INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at least in part, through induction of Cox-2, suppression of TGF-beta signaling, and establishment of a TGF-beta-resistant, hyperproliferative state in the colonic epithelium.
  - OUTPUT: promote('PKC beta II','colon cancer')
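Rule application can be sketched as slot filling over the analyzed sentence. In the example below the chunk analysis is written by hand and covers only the three relevant chunks; the dictionary layout of `rule` and `chunks` and the `apply_rule` function are assumptions for illustration, not the system's data structures.

```python
# Hand-written analysis of the relevant part of the input sentence (assumed, not parser output).
chunks = [
    {"text": "PKC beta II", "sem": "protein", "role": "subj", "verb": "promote"},
    {"text": "promotes", "sem": None, "role": "vp", "verb": "promote"},
    {"text": "colon cancer", "sem": "disease", "role": "dobj", "verb": "promote"},
]

# RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
rule = {
    "trigger": "promote",
    "slots": {"X": ("subj", "protein"), "Y": ("dobj", "disease")},
    "args": ["X", "Y"],
}


def apply_rule(rule, chunks):
    """Fill every slot of the rule from the chunks; return None if any slot is unfilled."""
    bindings = {}
    for var, (role, sem) in rule["slots"].items():
        for chunk in chunks:
            same_verb = chunk["verb"] == rule["trigger"]
            sem_ok = sem == "*" or sem == chunk["sem"]
            if same_verb and chunk["role"] == role and sem_ok:
                bindings[var] = chunk["text"]
                break
        else:
            return None  # no chunk satisfies this slot
    args = ",".join(f"'{bindings[v]}'" for v in rule["args"])
    return f"{rule['trigger']}({args})"


output = apply_rule(rule, chunks)
print(output)
```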

Page 23: Learning to Extract Relations for Protein Annotation

Experimental Results

Experiments

- Our IE system was applied to the annotation of PRINTS (a protein family database).
- Development corpora

| Topic | Positives | Negatives | Class distribution (positive-negative) |
|---|---|---|---|
| Disease | 777 | 1403 | 36-64% |
| Function | 1268 | 2625 | 33-67% |
| Structure | 1159 | 1750 | 40-60% |

- 80% for training, 20% for test
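The class distributions in the table follow directly from the sentence counts, and the 80/20 split fixes the size of each partition. The short check below recomputes both; how the split was actually drawn (random, stratified, or otherwise) is not stated in the slides, so only the sizes are computed, and the function names are illustrative.

```python
# (positives, negatives) per topic, as listed in the table.
corpora = {
    "Disease": (777, 1403),
    "Function": (1268, 2625),
    "Structure": (1159, 1750),
}


def class_distribution(pos, neg):
    """Percentage of positive and negative sentences, rounded to whole numbers."""
    total = pos + neg
    return round(100 * pos / total), round(100 * neg / total)


def split_sizes(n, train_tenths=8):
    """Sizes of an 80/20 split of n sentences (integer arithmetic, remainder to test)."""
    n_train = n * train_tenths // 10
    return n_train, n - n_train


for topic, (pos, neg) in corpora.items():
    print(topic, class_distribution(pos, neg), split_sizes(pos + neg))
```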

Page 24: Learning to Extract Relations for Protein Annotation

Evaluation

- All extracted relations were manually evaluated by domain experts.
- Results

| Topic | Learned rules | Selected rules | Relations | Precision (%) | Recall (%) | F1-measure (%) |
|---|---|---|---|---|---|---|
| Disease | 55 | 32 | 21 | 75 | 18.3 | 29.4 |
| Function | 125 | 64 | 23 | 66.3 | 15.1 | 24.6 |
| Structure | 146 | 76 | 20 | 85.3 | 61 | 71.1 |

- cf. a recent work on extracting regulatory gene/protein networks
  - F1-measure of 44% (Saric et al., 2006)
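The F1-measure column is the harmonic mean of precision and recall. The snippet below recomputes it from the precision and recall values in the table as a consistency check; the function and variable names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per topic, as listed in the table.
results = {
    "Disease": (75.0, 18.3),
    "Function": (66.3, 15.1),
    "Structure": (85.3, 61.0),
}

f1_scores = {topic: round(f1(p, r), 1) for topic, (p, r) in results.items()}
print(f1_scores)
```

Because the harmonic mean is dominated by the smaller value, the low recall for Disease and Function pulls their F1 well below their precision.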

Page 25: Learning to Extract Relations for Protein Annotation

Protein Annotation Examples

- Discovered relations

| Topic | Relations |
|---|---|
| Disease | be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc. |
| Function | induce, block, mediate, is_a, belong, act, etc. |
| Structure | contain, form, share, lack, bind, encode, be_conserved, etc. |

- Annotation example for protein NF-kappaB
  - be_implicated(,'NF-kappaB', in:'the pathogenesis')
  - regulate('IkappaBalpha', 'NF-kappaB'), activate('BCMA', 'NF-kappaB')
  - be_composed('NF-kappaB', of:'heterodimeric complexes')

Page 26: Learning to Extract Relations for Protein Annotation

Limitations

- Failure to find inverse relation
  - S: Expression of uPAR in tumor extracts also inversely correlates with prognosis in many forms of cancer.
  - RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y]
  - RELATION: correlate(expression('expression',of:'uPAR'), with:'prognosis')
- Anaphora problem
  - S: Whereas the overall structure resembles that of the NF-kappaB p50-DNA complex, pronounced differences are observed within the 'insert region'.
  - RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y]
  - RELATION: resemble('the overall structure','that')

Page 27: Learning to Extract Relations for Protein Annotation

Conclusion

- Proposed a methodology for developing IE systems with resources that can be provided by biologists.
- Learned relations as well as IE rules without annotated corpora.
- Annotated proteins with structured information (i.e., predicate-argument structure) in terms of any topic.
- Validated the methodology over different topics (function, structure, disease, cancer) in the biomedical domain.
- Advantage: will alleviate the burden of developing IE systems for users who have little or no formal IE training.

Page 28: Learning to Extract Relations for Protein Annotation

Thank you for your attention!

Page 29: Learning to Extract Relations for Protein Annotation

Appendix

For Further Reading

For Further Reading I

Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177–210, 2003.

R. Collier. Automatic template creation for information extraction, an overview, 1996.

Dayne Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169–202, 2000.

Page 30: Learning to Extract Relations for Protein Annotation

For Further Reading II

Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora. In ACL, pages 415–422, 2004.

Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, 1996.

Page 31: Learning to Extract Relations for Protein Annotation

For Further Reading III

Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza Ouzounova, and Isabel Rojas. Extracting regulatory gene expression networks from PubMed. In ACL, pages 191–198, 2004.

Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
