Gleaning Relational Information from Biomedical Text

Mark Goadrich
Computer Sciences Department
University of Wisconsin - Madison

Joint Work with Jude Shavlik and Louis Oliphant

CIBM Seminar - Dec 5th 2006



Outline

– The Vacation Game
– Formalizing with Logic
– Biomedical Information Extraction
– Evaluating Hypotheses
– Gleaning Logical Rules
– Experiments
– Current Directions

The Vacation Game

Positive: [images]

Negative: [images]

The Vacation Game

Positive
– Apple
– Feet
– Luggage
– Mushrooms
– Books
– Wallet
– Beekeeper

Negative
– Pear
– Socks
– Car
– Fungus
– Novel
– Money
– Hive

Positive (double letters emphasized)
– aPPle
– fEEt
– luGGage
– mushrOOms
– bOOks
– waLLet
– bEEkeeper

The Vacation Game

My Secret Rule
– The word must have two adjacent letters which are the same letter.

Found by using inductive logic
– Positive and Negative Examples
– Formulating and Eliminating Hypotheses
– Evaluating Success and Failure
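The secret rule is easy to test mechanically; a minimal Python sketch, using the word lists from the slides:

```python
def has_double_letter(word):
    # True if the word has two adjacent letters which are the same letter
    return any(a == b for a, b in zip(word, word[1:]))

positives = ["apple", "feet", "luggage", "mushrooms", "books", "wallet", "beekeeper"]
negatives = ["pear", "socks", "car", "fungus", "novel", "money", "hive"]

assert all(has_double_letter(w) for w in positives)
assert not any(has_double_letter(w) for w in negatives)
```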

Inductive Logic Programming

Machine Learning
– Classify data into categories
– Divide data into train and test sets
– Generate hypotheses on train set and then measure performance on test set

In ILP, data are Objects …
– person, block, molecule, word, phrase, …

… and Relations between them
– grandfather, has_bond, is_member, …

Formalizing with Logic

apple

a b c d e f g h i j k l m n o p q r s t u v w x y z

w2169

a p p l e → w2169_1, w2169_2, w2169_3, w2169_4, w2169_5

Objects

Relations

Formalizing with Logic

word(w2169).  letter(w2169_1).
has_letter(w2169, w2169_2).
has_letter(w2169, w2169_3).
next(w2169_2, w2169_3).
letter_value(w2169_2, 'p').
letter_value(w2169_3, 'p').

pos(X) :- has_letter(X, A), has_letter(X, B),
          next(A, B), letter_value(A, C),
          letter_value(B, C).

(As before: the word object w2169 is 'apple', with letter objects w2169_1 … w2169_5. In the rule, pos(X) is the head, the remaining literals are the body, and the uppercase terms X, A, B, C are Variables.)

Biomedical Information Extraction

*image courtesy of SEER Cancer Training Site

[figure: unstructured abstracts → Structured Database]

Biomedical Information Extraction

http://www.geneontology.org

Biomedical Information Extraction

NPL3 encodes a nuclear protein with an RNA recognition motif and similarities to a family of proteins involved in RNA metabolism.

ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.

Mutations in the COL3A1 gene have been implicated as a cause of type IV Ehlers-Danlos syndrome, a disease leading to aortic rupture in early adult life.

Biomedical Information Extraction

The dog running down the street tackled and bit my little sister.

[parse tree image]

Biomedical Information Extraction

NPL3   encodes  a        nuclear  protein  with …
noun   verb     article  adj      noun     prep

[parse tree: the sentence breaks into noun phrases, a verb phrase, and a prep phrase]

MedDict Background Knowledge

http://cancerweb.ncl.ac.uk/omd/

MeSH Background Knowledge

http://www.nlm.nih.gov/mesh/MBrowser.html

GO Background Knowledge

http://www.geneontology.org

Some Prolog Predicates

Biomedical Predicates
– phrase_contains_medDict_term(Phrase, Word, WordText)
– phrase_contains_mesh_term(Phrase, Word, WordText)
– phrase_contains_mesh_disease(Phrase, Word, WordText)
– phrase_contains_go_term(Phrase, Word, WordText)

Lexical Predicates
– internal_caps(Word), alphanumeric(Word)

Look-ahead Phrase Predicates
– few_POS_in_phrase(Phrase, POS)
– phrase_contains_specific_word_triple(Phrase, W1, W2, W3)
– phrase_contains_some_marked_up_arg(Phrase, Arg#, Word, Fold)

Relative Location of Phrases
– protein_before_location(ExampleID)
– word_pair_in_between_target_phrases(ExampleID, W1, W2)

Still More Predicates

High-scoring words in protein phrases
– bifunction, repress, pmr1, …

High-scoring words in location phrases
– golgi, cytoplasm, er

High-scoring words BETWEEN protein & location
– across, cofractionate, inside, …

Biomedical Information Extraction

Given: Medical Journal abstracts tagged with biological relations

Do: Construct system to extract related phrases from unseen text

Our Gleaner Approach

Develop fast ensemble algorithms focused on recall and precision evaluation

Using Modes to Chain Relations

[diagram linking Sentence, Phrase, and Word objects via predicates: alphanumeric(…), internal_caps(…), verb(…), phrase_child(…, …), long_sentence(…), phrase_parent(…, …), noun_phrase(…)]

Growing Rules From Seed

NPL3 encodes a nuclear protein with …

prot_loc(ab1392078_sen7_ph0, ab1392078_sen7_ph2, ab1392078_sen7).

phrase_contains_novelword(ab1392078_sen7_ph0, ab1392078_sen7_ph0_w0).
phrase_next(ab1392078_sen7_ph0, ab1392078_sen7_ph1).
…
noun_phrase(ab1392078_sen7_ph2).
word_child(ab1392078_sen7_ph2, ab9018277_sen5_ph11_w3).
…
avg_length_sentence(ab1392078_sen7).
…

[diagram labels: Phrase, Sentence, Phrase, Word, Word]

Growing Rules From Seed

prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).

Rule Evaluation

Prediction vs Actual
Positive or Negative, True or False

                 actual
prediction  |  Pos  |  Neg
    Pos     |  TP   |  FP
    Neg     |  FN   |  TN

Focus on positive examples

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 Score = 2PR / (P + R)
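These metrics are easy to sketch in Python; the confusion-matrix counts below are hypothetical, not from the talk:

```python
def recall(tp, fn):
    # fraction of actual positives that the rule covers
    return tp / (tp + fn)

def precision(tp, fp):
    # fraction of predicted positives that are actually positive
    return tp / (tp + fp)

def f1_score(p, r):
    # harmonic mean of precision P and recall R: 2PR / (P + R)
    return 2 * p * r / (p + r)

# hypothetical counts for one rule
tp, fp, fn = 40, 10, 150
p, r = precision(tp, fp), recall(tp, fn)
print(f"Recall {r:.2f}  Precision {p:.2f}  F1 {f1_score(p, r):.2f}")
```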

Protein Localization Rule 1

prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_alphanumeric(Protein, E),
    phrase_contains_some_internal_cap_word(Protein, E),
    phrase_next(Protein, _),
    different_phrases(Protein, Location),
    one_POS_in_phrase(Location, noun),
    phrase_contains_some_arg2_10x_word(Location, _),
    phrase_previous(Location, _),
    avg_length_sentence(Sentence).

0.15 Recall    0.51 Precision    0.23 F1 Score

Protein Localization Rule 2

prot_loc(Protein, Location, Sentence) :-
    phrase_contains_some_marked_up_arg2(Location, C),
    phrase_contains_some_internal_cap_word(Protein, _),
    word_previous(C, _).

0.86 Recall    0.12 Precision    0.21 F1 Score

Precision-Focused Search

[search space image]

Recall-Focused Search

[search space image]

F1-Focused Search

[search space image]

Aleph - Learning

Aleph learns theories of rules (Srinivasan, v4, 2003)
– Pick positive seed example
– Use heuristic search to find best rule
– Pick new seed from uncovered positives and repeat until threshold of positives covered

Learning theories is time-consuming
Can we reduce time with ensembles?
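The covering loop above can be sketched as follows; `find_best_rule`, standing in for Aleph's heuristic clause search, and the stopping threshold are hypothetical placeholders:

```python
def learn_theory(positives, negatives, find_best_rule, coverage=0.95):
    """Greedy covering: learn the best rule for a seed, drop the
    positives it covers, repeat until enough positives are covered."""
    theory = []
    uncovered = set(positives)
    allowed_remaining = len(positives) * (1 - coverage)
    while len(uncovered) > allowed_remaining:
        seed = next(iter(uncovered))                 # pick a positive seed
        rule = find_best_rule(seed, uncovered, negatives)
        covered = {p for p in uncovered if rule(p)}
        if not covered:                              # no progress: stop
            break
        theory.append(rule)
        uncovered -= covered
    return theory
```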

Gleaner

Definition of Gleaner
– One who gathers grain left behind by reapers

Key Ideas of Gleaner
– Use Aleph as underlying ILP rule engine
– Search rule space with Rapid Random Restart
– Keep wide range of rules usually discarded
– Create separate theories for diverse recall

Gleaner - Learning

[precision vs. recall plot]

– Create B Bins
– Generate Clauses
– Record Best per Bin

Gleaner - Learning

[plot: best clause kept per recall bin for Seed 1, Seed 2, Seed 3, …, Seed K]

Gleaner - Ensemble

Rules from bin 5 score each example by how many of them match:

pos1: prot_loc(…)  12
pos2: prot_loc(…)  47
pos3: prot_loc(…)  55
neg1: prot_loc(…)   5
neg2: prot_loc(…)  14
neg3: prot_loc(…)   2
neg4: prot_loc(…)  18
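A bin's rules act as an ensemble: an example's score is simply how many of the bin's rules cover it. A minimal sketch, where the lambda rules are hypothetical stand-ins for learned clauses:

```python
def score_examples(rules, examples):
    # Gleaner-style ensemble score: number of rules that cover each example
    return {ex: sum(1 for rule in rules if rule(ex)) for ex in examples}

# hypothetical stand-in rules over sentence strings
rules = [
    lambda s: "protein" in s,
    lambda s: s.endswith("."),
    lambda s: len(s) > 20,
]
scores = score_examples(rules, ["a nuclear protein sentence.", "short."])
```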

Gleaner - Ensemble

Sort examples by score; each score threshold yields one precision-recall point:

Score  Example               Precision  Recall
55     pos3: prot_loc(…)     1.00       0.05
52     neg28: prot_loc(…)    0.50       0.05
47     pos2: prot_loc(…)     0.66       0.10
…
18     neg4: prot_loc(…)     0.12       0.85
17     neg475: prot_loc(…)   0.13       0.90
17     pos9: prot_loc(…)     0.12       0.90
16     neg15: prot_loc(…)

[recall vs. precision plot, axes 0 to 1.0]
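Sweeping a threshold over the ensemble scores traces out one recall-precision point per distinct score. A sketch, assuming `labels[ex]` marks the positive examples:

```python
def pr_points(scores, labels):
    """One (recall, precision) point per distinct score threshold,
    predicting positive when score >= threshold."""
    total_pos = sum(1 for ex in labels if labels[ex])
    points = []
    for t in sorted(set(scores.values()), reverse=True):
        predicted = [ex for ex, s in scores.items() if s >= t]
        tp = sum(1 for ex in predicted if labels[ex])
        points.append((tp / total_pos, tp / len(predicted)))
    return points
```

Each prefix of the score-sorted list is one candidate classifier; lowering the threshold raises recall and usually lowers precision.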

Gleaner - Overlap

For each bin, take the topmost curve

[recall vs. precision plot]

How to use Gleaner

– Generate Test Curve
– User Selects Recall Bin
– Return Classifications Ordered By Their Score

[recall vs. precision plot; e.g. Recall = 0.50, Precision = 0.70]

Aleph Ensembles

We compare to ensembles of theories

Algorithm (Dutra et al., ILP 2002)
– Use K different initial seeds
– Learn K theories containing C rules
– Rank examples by the number of theories

Need to balance C for high performance
– Small C leads to low recall
– Large C leads to converging theories

Evaluation Metrics

Area Under Recall-Precision Curve (AURPC)
– All curves standardized to cover full recall range
– Averaged AURPC over 5 folds

Number of clauses considered
– Rough estimate of time

[recall vs. precision plot, axes 0 to 1.0]
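AURPC can be approximated from a curve's points. Note that exact interpolation between recall-precision points is nonlinear, so the trapezoidal sketch below (over points standardized to span the full recall range) is only a rough approximation:

```python
def aurpc(points):
    # points: list of (recall, precision) pairs covering recall 0..1
    pts = sorted(points)                       # order by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2      # trapezoid between points
    return area
```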

YPD Protein Localization

Hand-labeled dataset (Ray & Craven ’01)
– 7,245 sentences from 871 abstracts
– Examples are phrase-phrase combinations
– 1,810 positive & 279,154 negative

1.6 GB of background knowledge
– Structural, Statistical, Lexical and Ontological
– In total, 200+ distinct background predicates

Experimental Methodology

Performed five-fold cross-validation

Variation of parameters
– Gleaner (20 recall bins)
  – # seeds = {25, 50, 75, 100}
  – # clauses = {1K, 10K, 25K, 50K, 100K, 250K, 500K}
– Ensembles (0.75 minacc, 1K and 35K nodes)
  – # theories = {10, 25, 50, 75, 100}
  – # clauses per theory = {1, 5, 10, 15, 20, 25, 50}

PR Curves - 100,000 Clauses

PR Curves - 1,000,000 Clauses

Protein Localization Results

Genetic Disorder Results

Current Directions

– Learn diverse rules across seeds
– Calculate probabilistic scores for examples
– Directed Rapid Random Restarts
– Cache rule information to speed scoring
– Transfer learning across seeds
– Explore Active Learning within ILP

Take-Home Message

Biology, Gleaner and ILP
– Challenging problems in biology can be naturally formulated for Inductive Logic Programming
– Many rules are constructed and evaluated in ILP hypothesis search
– Gleaner makes use of those rules that are not the highest scoring ones for improved speed and performance

Acknowledgements

USA DARPA Grant F30602-01-2-0571
USA Air Force Grant F30602-01-2-0571
USA NLM Grant 5T15LM007359-02
USA NLM Grant 1R01LM07050-01
UW Condor Group
David Page, Vitor Santos Costa, Ines Dutra, Soumya Ray, Marios Skounakis, Mark Craven, Burr Settles, Jesse Davis, Sarah Cunningham, David Haight, Ameet Soni