
Page 1:

Biomedical Information Extraction using Inductive Logic Programming

Mark Goadrich and Louis Oliphant
Advisor: Jude Shavlik

Acknowledgements to NLM training grant 5T15LM007359-02

Page 2:

Abstract

Automated methods are needed for finding relevant information in the large body of biomedical literature. Information extraction (IE) is the process of finding facts in unstructured text, such as biomedical journal articles, and placing those facts in an organized system. Our research mines facts about a relationship (e.g., protein localization) from PubMed abstracts. We use Inductive Logic Programming (ILP) to learn a set of logical rules that explain when and where a relationship occurs in a sentence. We build rules by finding patterns in syntactic as well as semantic information for each sentence in a training corpus that has been previously marked with the relationship. These rules can then be applied to unmarked text to find new instances of the relation. Major research issues in this approach include handling unbalanced data, searching the enormous space of clauses, learning probabilistic logical rules, and incorporating expert background knowledge.

Page 3:

The Central Dogma

Discoveries: protein-protein interactions, protein localizations, genetic diseases

Most knowledge is stored in articles

Just Google it?

*image courtesy of National Human Genome Research Institute

Page 4:

World of Publishing

Current:
- authors write articles in Word, LaTeX, etc., and publish in conferences, journals, and so on
- humans index and extract the relevant information (time- and cost-intensive)

Future?
- all published articles available on the Web
- semantic web: an extension of HTML for content
- articles automatically annotated and indexed into searchable databases

Page 5:

Information Extraction

Given: a set of abstracts tagged with biological relationships between phrases

Do: learn a theory (e.g., a set of inference rules) that accurately extracts these relations

Page 6:

Training Data

Page 7:

Why Use ILP?

- KDD Cup 2002: handcrafted logical rules did best on the IE task
- Hypotheses are comprehensible: they are written in first-order predicate calculus (FOPC) and aim to cover only positive examples
- Background knowledge is easily incorporated: expert advice, linguistic knowledge of English (parse trees), and biomedical knowledge (e.g., MeSH)

Page 8:

ILP Example: Family Tree

Positive examples:
daughter(mary, ann)
daughter(eve, tom)

Negative examples:
daughter(tom, ann)
daughter(eve, ann)
daughter(ian, tom)
daughter(ian, ann)
...

Background knowledge:
mother(ann, mary), mother(ann, tom)
father(tom, eve), father(tom, ian)
female(ann), female(mary), female(eve)
male(tom), male(ian)

[Family-tree diagram: Ann is the mother of Mary and Tom; Tom is the father of Eve and Ian]

Possible rules (scored in the sketch below):
daughter(A,B) if male(A) and father(B,A)
daughter(A,B) if mother(B,A)
daughter(A,B) if female(A) and male(B)
daughter(A,B) if female(A) and mother(B,A)
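Below is a minimal Python sketch, not the authors' system, showing how these candidate rules can be scored against the listed examples and background facts; each rule body is encoded as a predicate over the pair (A, B).

```python
# Minimal sketch: score each candidate daughter(A,B) rule against the
# positive examples and the four listed negatives, using the background
# knowledge above as plain Python sets.

female = {"ann", "mary", "eve"}
male = {"tom", "ian"}
mother = {("ann", "mary"), ("ann", "tom")}   # mother(X, Y): X is Y's mother
father = {("tom", "eve"), ("tom", "ian")}

positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann"), ("ian", "tom"), ("ian", "ann")}

rules = {
    "male(A) and father(B,A)":   lambda a, b: a in male and (b, a) in father,
    "mother(B,A)":               lambda a, b: (b, a) in mother,
    "female(A) and male(B)":     lambda a, b: a in female and b in male,
    "female(A) and mother(B,A)": lambda a, b: a in female and (b, a) in mother,
}

for body, covers in rules.items():
    tp = sum(covers(a, b) for a, b in positives)
    fp = sum(covers(a, b) for a, b in negatives)
    print(f"daughter(A,B) if {body}: {tp}/2 positives, {fp}/4 negatives")
```

An ILP system searches this space of clauses, preferring rules that cover many positives while covering no negatives.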

Page 9:

Sundance Parsing

[Parse tree for the sentence "smf1 and smf2 are mitochondrial membrane_proteins": an NP-Conj segment (smf1/unk, and/conj, smf2/unk), a VP segment (are/cop), and an NP segment (mitochondrial/unk, membrane_proteins/noun)]

Sentence structure predicates:
parent(smf1, np-conj seg), parent(np-conj seg, sentence), child(np-conj seg, smf1), child(sentence, np-conj seg), next(smf1, and), next(np-conj seg, vp seg), after(np-conj seg, np seg), ...

Part-of-speech predicates:
noun(membrane_proteins), verb(are), unk(smf1), noun_phrase(np seg), verb_phrase(vp seg), ...

Lexical word predicates:
novelword(smf1), novelword(smf2), alphabetic(and), alphanumeric(smf1), ...

Biomedical knowledge predicates:
in_med_dict(mitochondrial), go_mitochondrial_membrane(smf1), go_mitochondrion(smf1), ...
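One simple way to make such ground facts queryable is to store them as tuples in a set; this representation is our illustrative assumption, not Sundance's actual output format.

```python
# Illustrative sketch: store each ground fact from the parse as a
# (predicate, args...) tuple so learned rules can query the sentence.

facts = {
    ("parent", "smf1", "np-conj seg"),
    ("parent", "np-conj seg", "sentence"),
    ("next", "smf1", "and"),
    ("noun", "membrane_proteins"),
    ("verb", "are"),
    ("novelword", "smf1"),
    ("alphanumeric", "smf1"),
    ("in_med_dict", "mitochondrial"),
}

def holds(pred, *args):
    """True if the ground fact pred(args...) is present in the fact base."""
    return (pred, *args) in facts

print(holds("novelword", "smf1"))      # True
print(holds("noun", "smf1"))           # False
```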

Page 10:

Sample Learned Rule

gene_disease(E,A) :-
    isa_np_segment(E), isa_np_segment(A),
    prev(A,B), pp_segment(B),
    child(A,C), next(C,D), alphabetic(D), novelword(C),
    child(E,F), alphanumeric(F).

[Diagram: within the sentence, E and A are noun phrases and B is a prepositional phrase preceding A; C is a novel word inside A followed by an alphabetic word D, and F is an alphanumeric word inside E]

Page 11:

Ensembles for Rules

N heads are better than one...
- learn multiple (sets of) rules with the training data
- aggregate the results by voting on the classification of testing data (sketched below)
- Bagging (Breiman '96): each rule-set gets one vote
- Boosting (Freund and Schapire '96): each rule gets a weighted vote
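A sketch of the two aggregation schemes at classification time; the rules here are illustrative stand-ins, not learned ILP theories.

```python
# Sketch: how bagging and boosting ensembles combine votes at test time.

def bagging_vote(rule_sets, example):
    """Bagging: each rule-set casts one unweighted vote."""
    votes = sum(1 if rules(example) else -1 for rules in rule_sets)
    return votes > 0

def boosting_vote(weighted_rules, example):
    """Boosting: each rule casts a vote scaled by its learned weight."""
    score = sum(w if rule(example) else -w for rule, w in weighted_rules)
    return score > 0

# Stand-in "rule-sets" over sentences:
rule_sets = [lambda s: "kinase" in s, lambda s: s.endswith("ase"),
             lambda s: len(s) > 20]
print(bagging_vote(rule_sets, "smf1 binds kinase"))   # True (2 of 3 vote yes)
```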

Page 12:

Drawing a PR Curve

Conf   Class   Prec   Rec
.98    +       1.00   0.20
.97    +       1.00   0.40
.84    -       0.66   0.40
.78    +       0.75   0.60
.55    +       0.80   0.80
.43    -       0.66   0.80
.23    -       0.57   0.80
.22    -       0.50   0.80
.12    +       0.55   1.00

[Plot: the corresponding precision-recall curve, with recall on the x-axis and precision on the y-axis]
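These points can be recomputed by sweeping a threshold down the confidence-sorted predictions; a minimal sketch (differences such as 0.67 vs. 0.66 arise only because the table truncates rather than rounds):

```python
# Sketch: recompute the table's precision/recall points by sweeping a
# threshold down the confidence-sorted predictions.

preds = [(0.98, '+'), (0.97, '+'), (0.84, '-'), (0.78, '+'), (0.55, '+'),
         (0.43, '-'), (0.23, '-'), (0.22, '-'), (0.12, '+')]

total_pos = sum(1 for _, cls in preds if cls == '+')
tp = fp = 0
for conf, cls in preds:                    # already sorted by confidence
    if cls == '+':
        tp += 1
    else:
        fp += 1
    print(f"{conf:.2f}  prec={tp / (tp + fp):.2f}  rec={tp / total_pos:.2f}")
```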

Page 13:

Testset Results

[Plot: precision (y-axis) vs. recall (x-axis), both from 0% to 100%, comparing four systems on the test set: Craven Group, Boosting, Rule Quality, and Bagging]

Page 14:

Handling Large Skewed Data

5-fold cross-validation:
train: 1,007 positive / 240,874 negative
test: 284 positive / 243,862 negative

With a 95%-accurate rule set:
270 true positives, but 12,193 false positives!
recall = 270 / 284 = 95.1%
precision = 270 / 12,463 ≈ 2.2%
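A quick check of that arithmetic:

```python
# Quick check: a rule set that is 95% accurate on each class still buries
# 270 true positives under roughly 12,000 false positives.

pos, neg = 284, 243_862
tp = round(0.95 * pos)                      # 270 true positives
fp = round(0.05 * neg)                      # 12,193 false positives
print(f"recall    = {tp / pos:.1%}")        # 95.1%
print(f"precision = {tp / (tp + fp):.1%}")  # 2.2%
```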

Page 15:

Handling Large Skewed Data

Ways to handle the data:
- assign different costs to each class (it is much more important to not cover negatives)
- under-sample with bagging, leaving negatives under-represented; the key is to pick good negatives
- filter the data to restore the equal ratio in the testing data; use naïve Bayes to learn the relational parts

Page 16:

Filters to Reduce Negatives

[Pipeline diagram: positive and negative examples pass through a noun-phrase filter and are split into gene and disease parts; each part goes through a naïve Bayes filter, and the parts are joined back into positive and negative predictions. The stages are labeled with positive-to-negative ratios of 1:485, 1:1,979, and 1:39]
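As a rough stand-in for one of the naïve Bayes filter stages, the sketch below scores candidate phrases and drops confident negatives; the training phrases, character-n-gram features, and threshold are our assumptions, not the actual system.

```python
# Rough stand-in for one naive Bayes filter stage: score each candidate
# phrase and discard confident negatives before the expensive ILP search.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_phrases = ["smf1 gene", "p53 kinase", "brca1",               # gene-like
                 "the patient", "these results", "cell membrane"]  # not
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
clf = MultinomialNB().fit(vec.fit_transform(train_phrases), labels)

def keep(phrase, threshold=0.3):
    """Keep a candidate unless the filter is confident it is negative."""
    return clf.predict_proba(vec.transform([phrase]))[0, 1] >= threshold

candidates = ["smf2", "the following", "tp53"]
print([c for c in candidates if keep(c)])
```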

Page 17:

Probabilistic Rules

Logical rules are too strict and often overfit, so we add a probabilistic weight to each rule based on its accuracy on a tuning set.

Learning the parameters (see the sketch below):
- make each rule a binary feature
- use any standard machine learning algorithm (naïve Bayes, perceptron, logistic regression, ...) to learn the weights
- assign a probability to each example based on the weights
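A sketch of this recipe with logistic regression standing in for the "standard machine learning algorithm"; the rules and tuning data are illustrative.

```python
# Sketch: each learned rule becomes a binary feature, and a standard
# learner sets the weights on a tuning set. Rules and data are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(example, rules):
    """One binary feature per rule: did the rule fire on this example?"""
    return [1 if rule(example) else 0 for rule in rules]

rules = [lambda ex: "kinase" in ex, lambda ex: ex.endswith("ase"),
         lambda ex: ex[0].isupper()]

tune_examples = ["Src kinase", "protease", "the cell", "Rab5", "membrane"]
tune_labels = [1, 1, 0, 1, 0]

X = np.array([featurize(ex, rules) for ex in tune_examples])
model = LogisticRegression().fit(X, tune_labels)
print(model.coef_)                          # one learned weight per rule
print(model.predict_proba(X)[:, 1])         # P(positive) for each example
```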

Page 18:

Weighted Exponential Model

$$P(o \mid f) = \frac{1}{Z} \, e^{\sum_{i=1}^{N} w_i f_i}$$

where $w_i$ is a weight for each feature $f_i$.

Taking logs we get

$$\log P(o \mid f) = \sum_{i=1}^{N} w_i f_i - \log Z$$

We need to set the $w_i$ to maximize the log probability of the tuning set.
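A small numeric sketch of this model for a single example with two candidate outcomes; the feature values and weights are illustrative. The gradient of the tuning-set log probability takes the classic observed-minus-expected-features form, which the last lines use for one ascent step.

```python
import numpy as np

# Weighted exponential model for one example with two candidate outcomes
# (e.g., "relation" vs. "no relation"). Values are illustrative stand-ins.

w = np.array([1.5, -0.7, 0.3])       # one weight per rule/feature
f = np.array([[1.0, 0.0, 1.0],       # feature vector if outcome = relation
              [0.0, 1.0, 0.0]])      # feature vector if outcome = none

def prob(w, f):
    """P(o|f) = exp(w . f(o)) / Z, with Z summing over the outcomes."""
    scores = np.exp(f @ w)
    return scores / scores.sum()

p = prob(w, f)
print(p, np.log(p))                  # probabilities and their logs

# One gradient-ascent step on log P(correct outcome): the gradient is the
# observed feature vector minus the model's expected feature vector.
correct = 0
w += 0.1 * (f[correct] - p @ f)
print(prob(w, f))                    # probability of the correct outcome rises
```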

Page 19:

Weighted Exponential Model

Page 20:

Incorporating Background Knowledge

Creation of predicates that capture salient features:
endsIn(word, 'ase')
occursInAbstractNtimes(word, 5)

Incorporation of prior knowledge into the learning system:
protein(word) if endsIn(word, 'ase') and occursInAbstractNtimes(word, 5).
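These predicates are straightforward to implement; a sketch with our assumed semantics (we read "N times" as "at least N times"):

```python
# Sketch implementations of the salient-feature predicates named above.

def ends_in(word, suffix):
    return word.endswith(suffix)

def occurs_in_abstract_n_times(word, n, abstract):
    # "N times" read as "at least N times"; that reading is an assumption.
    return abstract.lower().split().count(word.lower()) >= n

def protein(word, abstract):
    """The hand-coded prior-knowledge rule from the slide."""
    return ends_in(word, "ase") and occurs_in_abstract_n_times(word, 5, abstract)
```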

Page 21:

Searching in Large Spaces

- Probabilistic bottom clause: probabilistically remove the least significant predicates from the "bottom clause"
- Random rule generation: in place of hill-climbing, randomly select rules of a given length from the bottom clause, retaining only those rules which do well on a tuning set (see the sketch after this list)
- Learn the coverage of clauses: neural networks, Bayesian learning, etc.
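A sketch of the random-rule-generation idea; every name and the stub scoring function below are illustrative.

```python
import random

def random_rules(bottom_clause, rule_len, n_samples, score, min_score=0.5):
    """Sample rule bodies of a fixed length from the bottom clause's
    literals; retain only those that score well (e.g., on a tuning set)."""
    kept = []
    for _ in range(n_samples):
        body = random.sample(bottom_clause, rule_len)
        if score(body) >= min_score:
            kept.append(body)
    return kept

# Literals drawn from the sample learned rule's bottom clause (illustrative):
bottom = ["isa_np_segment(A)", "prev(A,B)", "pp_segment(B)",
          "child(A,C)", "novelword(C)", "next(C,D)", "alphabetic(D)"]

# Stub scorer standing in for tuning-set precision:
stub_score = lambda body: 0.8 if "novelword(C)" in body else 0.2
print(random_rules(bottom, rule_len=3, n_samples=10, score=stub_score)[:3])
```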

Page 22:

References

Stuart J. Nelson, Tammy Powell, and Betsy L. Humphreys. The Unified Medical Language System (UMLS) Project. In: Encyclopedia of Library and Information Science. Forthcoming.

Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 1999. MIT Press.

Ellen Riloff. The Sundance Sentence Analyzer. 2002.

Ines de Castro Dutra et al. An Empirical Evaluation of Bagging in Inductive Logic Programming. 2002. In Proceedings of the International Conference on Inductive Logic Programming, Sydney, Australia.

Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. 2000. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).

Soumya Ray and Mark Craven. Representing Sentence Structure in Hidden Markov Models for Information Extraction. 2001. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-2001).

Tina Eliassi-Rad and Jude Shavlik. A Theory-Refinement Approach to Information Extraction. 2001. In Proceedings of the 18th International Conference on Machine Learning.

M. Craven and J. Kumlien. Constructing Biological Knowledge Bases by Extracting Information from Text Sources. 1999. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77-86, Germany.

Leo Breiman. Bagging Predictors. 1996. Machine Learning, 24(2):123-140.

Yoav Freund and Robert E. Schapire. Experiments with a New Boosting Algorithm. 1996. In International Conference on Machine Learning, pages 148-156.