Learning to Extract Relations for Protein Annotation

Jee-Hyub Kim (a), Alex Mitchell (b,c), Teresa K. Attwood (b,c), Melanie Hilario (a)

(a) University of Geneva

(b) University of Manchester

(c) European Bioinformatics Institute

ISMB/ECCB 2007, 23 July 2007


Contents

Introduction

Related Work

Problem and Approach

Methods

Experimental Results

Conclusion


Introduction

Protein Annotation

- Definition
  - Given a protein sequence or name, describe the protein with all relevant information.
- Two main methods
  - Sequence analysis (e.g., BLAST, CLUSTALW)
  - Literature analysis
- Traditionally done manually by human annotators; automation is needed.
- Text mining has been used for automation.
- Two main tasks in text mining
  - Information retrieval (IR): to retrieve relevant documents
  - Information extraction (IE): to extract certain pieces of information from text for pre-defined entities and their relations

Adaptive Information Extraction

- IE is domain-specific.
  - Generally, a system developed for one domain cannot be reused for a new domain.
- Developing IE systems requires a significant amount of domain knowledge.
  - Developing IE rules
  - Defining relations
  - These are the two main bottlenecks.
- Many new domains still need their own IE systems.
  - e.g., cell cycle, tissue specificity, etc.


Related Work

IE System Development: Developing IE Rules

- Knowledge engineering (KE) based approach
  - Knowledge engineers write hand-crafted rules with the help of domain experts (e.g., biologists).
  - Not scalable
- Machine learning (ML) based approach
  - To increase the robustness and coverage of IE rules
  - ML has been used to learn IE rules.
    - From annotated corpora, pre-labelled corpora, or raw corpora
- Labor required to develop IE systems
  - KE > ML (annotated corpora > pre-labelled corpora > raw corpora)

IE System Development: Defining Relations

- All previous IE systems assume that the relations to be extracted are already defined.
- It is hard to specify precisely all possible relations to extract, especially in complex and dynamically evolving domains (e.g., the biological domain).
- Positioning (ML-based approach)

  Corpora \ Relations   Pre-defined                          Not defined
  Annotated             Soderland (1999), Freitag (2000),    -
                        Califf and Mooney (2003)
  Pre-labelled          Riloff (1996)                        Our work
  Raw                   Hasegawa et al. (2004)               Collier (1996)


Our Problem

- Goal
  - To alleviate the burden of developing IE systems for biologists
- Problem definition
  - Given relevant sentences that describe protein X in terms of any topic Y, and irrelevant sentences
  - Learn to extract relations for protein annotation
- Two sub-problems
  - What to extract
  - How to extract



Bottom-up Approach

Figure: From sentences to relations


Methods

System Architecture

Figure: Overall IE system architecture


Analyzing Sentences: MBSP

- Memory-Based Shallow Parser (MBSP)
  - Developed by Walter Daelemans
  - Extended with named entity taggers and an SVO relation finder
- Provides various types of information: part-of-speech (POS) tags, subject-verb-object (SVO) relations, named entities (NE), etc.
- Adapted to the biological domain on the basis of the GENIA corpus
  - 97.6% accuracy on POS tagging
  - 71.0% accuracy on protein named entity recognition

Analyzing Sentences: Example

- INPUT: Examples of this are the RNA-binding protein containing the RNA-binding domain (RBD) ...
- OUTPUT:

  Chunk                          Syntactic      Semantic   SVO relation
  Examples                       noun phrase    -          subject of 'are'
  of                             preposition    -          -
  this                           noun phrase    -          -
  are                            verb phrase    -          -
  the RNA-binding protein        noun phrase    protein    subject of 'contain'
  containing                     verb phrase    -          -
  the RNA-binding domain (RBD)   noun phrase    domain     object of 'contain'
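The analyzed sentence above can be held in a small data structure before rule learning. The following is a minimal sketch, not the MBSP output format: the `Chunk` class, its field names, and the `arguments_of` helper are assumptions introduced for illustration; only the chunk texts and tags are taken from the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    """One row of the analyzer output (field names are illustrative)."""
    text: str
    syntactic: str                          # e.g. "noun phrase"
    semantic: Optional[str] = None          # e.g. "protein", "domain"
    svo: Optional[Tuple[str, str]] = None   # (role, verb), e.g. ("subject", "contain")


# The example sentence, transcribed row by row from the OUTPUT table.
sentence: List[Chunk] = [
    Chunk("Examples", "noun phrase", svo=("subject", "are")),
    Chunk("of", "preposition"),
    Chunk("this", "noun phrase"),
    Chunk("are", "verb phrase"),
    Chunk("the RNA-binding protein", "noun phrase", "protein", ("subject", "contain")),
    Chunk("containing", "verb phrase"),
    Chunk("the RNA-binding domain (RBD)", "noun phrase", "domain", ("object", "contain")),
]


def arguments_of(verb: str, chunks: List[Chunk]) -> Dict[str, str]:
    """Collect the chunks attached to `verb`, keyed by their SVO role."""
    return {c.svo[0]: c.text for c in chunks if c.svo is not None and c.svo[1] == verb}


print(arguments_of("contain", sentence))
```

Keeping the SVO role next to each chunk makes it straightforward to look up the subject and object of a given verb, which is the information the IE rules in the following slides match against.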

Learning IE Rules: Inductive Logic Programming (ILP)

- We applied ILP to learn IE rules.
  - ILP is an ML algorithm that induces rules from examples.
  - Its outputs are readable and interpretable by domain experts.
  - It can deal with relational information (e.g., parse trees).
- Problem setting

  B ∧ H ⊨ E

  - Given B (background knowledge) and E (examples), find H (hypothesis).

Learning IE Rules: Representation

- B (background knowledge)
  - Linguistic heuristics (for single-slot IE patterns)
    - e.g., <subj> verb, verb <dobj>, verb preposition <np>, noun prep <np>, etc.
  - Sentence descriptions (i.e., analyzed sentences)
- E (examples)
  - Positive and negative examples (i.e., relevant and irrelevant sentences)
- H (hypothesis)
  - A set of IE rules

Learning IE Rules: Generalization

Figure: From a sentence to an IE rule

Learning IE Rules: Step 1



Weighted relative accuracy (WRAcc) of a rule:

WRAcc(Rule) = coverage(Rule) × (accuracy(Rule) − accuracy(Head ← true))
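The WRAcc formula can be computed directly from example counts. The sketch below assumes the usual definitions from the rule-learning literature, which the slide does not spell out: coverage is the fraction of all examples the rule body covers, accuracy(Rule) is the fraction of covered examples that are positive, and accuracy(Head ← true) is the overall fraction of positives. The function name, parameter names, and counts are hypothetical.

```python
def wracc(n_total: int, n_pos: int, n_covered: int, n_covered_pos: int) -> float:
    """Weighted relative accuracy of a rule Head <- Body.

    Assumed definitions (not stated on the slide):
      coverage(Rule)         = n_covered / n_total
      accuracy(Rule)         = n_covered_pos / n_covered
      accuracy(Head <- true) = n_pos / n_total
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_covered == 0:
        return 0.0  # a rule that covers nothing has zero coverage
    coverage = n_covered / n_total
    rule_accuracy = n_covered_pos / n_covered
    default_accuracy = n_pos / n_total
    return coverage * (rule_accuracy - default_accuracy)


# Hypothetical counts: 1000 sentences, 400 relevant; the rule covers 10, 9 of them relevant.
print(wracc(n_total=1000, n_pos=400, n_covered=10, n_covered_pos=9))
```

Under these definitions a rule scores above zero only when it is more accurate than always predicting the positive class, and the coverage factor keeps very specific rules from dominating.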

Selecting IE Rules

- Why is this step necessary?
  - Rules are learned to classify sentences, not to extract information.
  - Spurious IE rules are learned in the previous step.
  - These need to be filtered out by domain experts.
- Rules are provided to users together with supporting information.
  - Example
    - RULE: <subj:*> vp:contain & vp:contain <dobj:domain> [9, 0.9]
    - S: Myocilin is a secreted glycoprotein that forms multimers and contains a leucine zipper and an olfactomedin domain.

Transforming Rules

- Once IE rules are selected, they are transformed into relations.
- Mapping between IE rules and relations

  IE Rules         Relations
  extract          argument
  trigger          relation name
  syntactic tag    argument position

- Example
  - <subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y] → contain(X,Y)
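The mapping above (extract to argument, trigger to relation name, syntactic tag to argument position) can be sketched as a small string transformation. This is an illustrative sketch only: the regular expressions, the `rule_to_relation` function, and the choice to prefix non-subject/object arguments with their tag are assumptions based on the rule notation shown in the slides, not the authors' implementation.

```python
import re

# <tag:class>[VAR] marks an extract; vp:verb marks the trigger.
SLOT = re.compile(r"<(\w+):([^>]+)>\[(\w+)\]")
TRIGGER = re.compile(r"\bvp:(\w+)")

# Assumed argument positions: subject first, direct object second, anything else after.
ORDER = {"subj": 0, "dobj": 1}


def rule_to_relation(rule: str) -> str:
    """Turn an IE rule string into a relation signature such as contain(X,Y)."""
    triggers = set(TRIGGER.findall(rule))
    if len(triggers) != 1:
        raise ValueError(f"expected exactly one trigger verb, found {sorted(triggers)}")
    name = triggers.pop()
    slots = sorted(SLOT.findall(rule), key=lambda s: ORDER.get(s[0], 2))
    args = [var if tag in ORDER else f"{tag}:{var}" for tag, _sem, var in slots]
    return f"{name}({','.join(args)})"


print(rule_to_relation("<subj:*>[X] vp:contain & vp:contain <dobj:domain>[Y]"))
```

The trigger verb supplies the relation name and the syntactic tag of each extract fixes where its variable lands in the argument list.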

Grouping Rules

- Post-processing rules

  Pattern                                                 Example
  A verb (active form) B                                  A activate B
  B be ed-participle by A                                 B be activated by A
  nominal form (with suffix -tion) of verb of B by A      activation of B by A
  A be nominal form (with suffix -or) of verb of B        A is an activator of B
  A be ... that verb (active form) B                      A is ... that activates B
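To illustrate how the five surface patterns in the table could be grouped under one relation, here is a minimal sketch using regular expressions over single-token arguments. Everything in it is an assumption for illustration: the `LEMMA` mini-lexicon, the regexes, and the `normalize` function are hypothetical, and a real system would work on parsed chunks with a proper morphological analyzer rather than raw strings.

```python
import re
from typing import Optional, Tuple

# Hypothetical mini-lexicon mapping surface forms to a verb lemma.
LEMMA = {
    "activate": "activate",
    "activates": "activate",
    "activated": "activate",
    "activation": "activate",
    "activator": "activate",
}

BE = r"(?:is|are|be)"

# One regex per table row; the more specific patterns are tried before the plain active form.
PATTERNS = [
    re.compile(rf"^(?P<b>\S+) {BE} (?P<v>\w+ed) by (?P<a>\S+)$"),           # B be ed-participle by A
    re.compile(r"^(?P<v>\w+tion) of (?P<b>\S+) by (?P<a>\S+)$"),            # -tion nominal of B by A
    re.compile(rf"^(?P<a>\S+) {BE} (?:an? )?(?P<v>\w+or) of (?P<b>\S+)$"),  # A be -or nominal of B
    re.compile(rf"^(?P<a>\S+) {BE} .+ that (?P<v>\w+) (?P<b>\S+)$"),        # A be ... that verb B
    re.compile(r"^(?P<a>\S+) (?P<v>\w+) (?P<b>\S+)$"),                      # A verb B
]


def normalize(phrase: str) -> Optional[Tuple[str, str, str]]:
    """Return (verb lemma, A, B) if the phrase matches one of the patterns."""
    for pattern in PATTERNS:
        m = pattern.match(phrase)
        if m and m.group("v") in LEMMA:
            return LEMMA[m.group("v")], m.group("a"), m.group("b")
    return None


for phrase in [
    "A activates B",
    "B is activated by A",
    "activation of B by A",
    "A is an activator of B",
    "A is a protein that activates B",
]:
    print(phrase, "->", normalize(phrase))
```

All five phrasings reduce to the same (lemma, A, B) triple, which is the sense in which rules that differ only in surface form can be grouped under a single relation.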

Applying Rules to Extract Relations

- Now we have a set of IE rules for each relation.
- IE rule and relation
  - RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
  - TRIGGER: promote
  - RELATION: promote(X,Y)
- Example
  - INPUT: Our data demonstrate that PKC beta II promotes colon cancer, at least in part, through induction of Cox-2, suppression of TGF-beta signaling, and establishment of a TGF-beta-resistant, hyperproliferative state in the colonic epithelium.
  - OUTPUT: promote('PKC beta II','colon cancer')
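The extraction step can be sketched as matching a rule's slots against the chunks of an analyzed sentence. The code below is a simplified illustration under stated assumptions: the chunk dictionaries are hand-written stand-ins for the analyzer output (only the chunks relevant to the rule are listed), and `apply_rule` with its slot format is a hypothetical helper, not the system's actual matcher.

```python
from typing import Dict, List, Optional, Tuple


def apply_rule(chunks: List[Dict], trigger: str,
               slots: List[Tuple[str, str]]) -> Optional[str]:
    """Fill each (role, semantic class) slot with a chunk attached to `trigger`.

    A semantic class of "*" matches any chunk. Returns None if a slot cannot be filled.
    """
    args = []
    for role, sem_class in slots:
        match = next(
            (c for c in chunks
             if c.get("role") == (role, trigger)
             and sem_class in ("*", c.get("sem"))),
            None,
        )
        if match is None:
            return None
        args.append(match["text"])
    quoted = ",".join(f"'{a}'" for a in args)
    return f"{trigger}({quoted})"


# Assumed analysis of the INPUT sentence, reduced to the chunks the rule needs.
chunks = [
    {"text": "PKC beta II", "sem": "protein", "role": ("subj", "promote")},
    {"text": "promotes", "syn": "vp"},
    {"text": "colon cancer", "sem": "disease", "role": ("dobj", "promote")},
]

# RULE: <subj:protein>[X] vp:promote & vp:promote <dobj:disease>[Y]
rule_slots = [("subj", "protein"), ("dobj", "disease")]

print(apply_rule(chunks, "promote", rule_slots))
```

If either the subject or the object fails its semantic constraint, the sketch returns nothing rather than a partial relation.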



Experiments

- Our IE system was applied to the annotation of PRINTS (a protein family database).
- Development corpora

  Topic       Positives   Negatives   Class distribution
  Disease     777         1403        36-64%
  Function    1268        2625        33-67%
  Structure   1159        1750        40-60%

- 80% for training, 20% for testing
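The class distribution column follows from the positive and negative counts. As a quick check, the short sketch below recomputes it; the `class_distribution` helper is introduced here for illustration, and rounding to whole percentages is an assumption about how the column was produced.

```python
# (positives, negatives) per topic, copied from the development corpora table.
corpora = {
    "Disease": (777, 1403),
    "Function": (1268, 2625),
    "Structure": (1159, 1750),
}


def class_distribution(positives: int, negatives: int) -> str:
    """Share of positive vs. negative sentences, rounded to whole percentages."""
    pos_pct = round(100 * positives / (positives + negatives))
    return f"{pos_pct}-{100 - pos_pct}%"


for topic, (pos, neg) in corpora.items():
    print(topic, pos + neg, class_distribution(pos, neg))
```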



Evaluation

- All the extracted relations were manually evaluated by domain experts.
- Results

  Topic       Learned rules   Selected rules   Relations   Precision (%)   Recall (%)   F1-measure (%)
  Disease     55              32               21          75              18.3         29.4
  Function    125             64               23          66.3            15.1         24.6
  Structure   146             76               20          85.3            61           71.1

- cf. a recent work on extracting regulatory gene/protein networks
  - F1-measure of 44% (Saric et al., 2006)
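The F1-measure column is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values; the `f1` function name is ours, and the inputs are the percentages from the results table.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from the results table.
results = {
    "Disease": (75.0, 18.3),
    "Function": (66.3, 15.1),
    "Structure": (85.3, 61.0),
}

for topic, (p, r) in results.items():
    print(topic, round(f1(p, r), 1))
```

Because the harmonic mean is pulled toward the smaller value, the low recall on Disease and Function dominates their F1 despite the relatively high precision.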



Protein Annotation Examples

- Discovered relations

  Disease     be_associated, is_a, be_mutated, be_caused, be_deleted, contribute, etc.
  Function    induce, block, mediate, is_a, belong, act, etc.
  Structure   contain, form, share, lack, bind, encode, be_conserved, etc.

- Annotation example for protein NF-kappaB

  be_implicated(,'NF-kappaB', in:'the pathogenesis')
  regulate('IkappaBalpha', 'NF-kappaB'), activate('BCMA', 'NF-kappaB')
  be_composed('NF-kappaB', of:'heterodimeric complexes')



Limitations

- Failure to find inverse relations
  - S: Expression of uPAR in tumor extracts also inversely correlates with prognosis in many forms of cancer.
  - RULE: np:expression <of:protein>[X] & vp:correlate <with:prognosis>[Y]
  - RELATION: correlate(expression('expression',of:'uPAR'), with:'prognosis')
- Anaphora problem
  - S: Whereas the overall structure resembles that of the NF-kappaB p50-DNA complex, pronounced differences are observed within the 'insert region'.
  - RULE: <subj:structure>[X] vp:resemble & vp:resemble <dobj:C>[Y]
  - RELATION: resemble('the overall structure','that')

Conclusion

- Proposed a methodology for developing IE systems with resources that can be provided by biologists.
- Learned relations as well as IE rules without annotated corpora.
- Annotated proteins with structured information (i.e., predicate-argument structures) in terms of any topic.
- Validated the methodology over different topics (function, structure, disease, cancer) in the biomedical domain.
- Advantage: it will alleviate the burden of developing IE systems for users who have little or no formal IE training.

Thank you for your attention!


Appendix

For Further Reading

Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177–210, 2003.

R. Collier. Automatic template creation for information extraction, an overview, 1996.

Dayne Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2/3):169–202, 2000.

Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora. In ACL, pages 415–422, 2004.

Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, 1996.

Jasmin Saric, Lars Juhl Jensen, Peer Bork, Rossitza Ouzounova, and Isabel Rojas. Extracting regulatory gene expression networks from PubMed. In ACL, pages 191–198, 2004.

Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
