Ontology Engineering approaches based on semi-automated curation of the primary literature

Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy
Biomedical Knowledge Engineering Group, Information Sciences Institute, University of Southern California

Where’s all the knowledge?

Image taken from U.S. Geological Survey Energy Resource Surveys Program

The primary research literature...
• is the end-product of all scientific research
• forms the basis for human understanding of the subject
• is written in natural language
• is structured
• is interpretable
• is expensive
• is terse

Precision and imprecision in biological representation

Imprecise

• High-level concepts (conceptual model): ‘Stress’, ‘energy balance’, ‘homeostasis’, ‘glucoprivation’

• Independent variables (assay: define model system): 2-deoxyglucose (2DG) administered intravenously to rats; look for activation in ‘stress-responsive’ neurons

• Dependent variables (experiment: perform measurements): MAP-K and pERK activate in neurons in PVH, BST and CEAl

Precise

Partitioning the literature

The problem with knowledge: an over-abundance of data

Corpus Preparation for Natural Language Processing

The Journal of Comparative Neurology is the foremost international journal for neuroanatomy. We downloaded ~12,000 PDFs in total from 1970-2005.

We preprocessed papers with consistent formatting from vols. 204-490 (1982-2005), providing a corpus of 9,474 PDF files. This corpus contains 99,094,318 words.

Active Learning / Information Extraction Methodology

The logical structure of a tract-tracing experiment

Tracer Chemical [1]

Injection Site [1]
    Location
        brain structure
        topography
        side

Labeled region [1...*]
    Location
        brain structure
        topography
        ipsi-contra relative to injection site?
    Label type (‘anterograde’ or ‘retrograde’)
    Label density
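The schema above can be sketched as a small set of Python dataclasses. This is only an illustration of the structure shown on the slide: the class names, field names and the sample values at the bottom are hypothetical and are not taken from the authors' system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Location:
    brain_structure: str
    topography: Optional[str] = None
    side: Optional[str] = None         # used for the injection site
    ipsi_contra: Optional[str] = None  # used for labeled regions, relative to the injection site


@dataclass
class InjectionSite:
    location: Location


@dataclass
class LabeledRegion:
    location: Location
    label_type: str                    # 'anterograde' or 'retrograde'
    label_density: Optional[str] = None

    def __post_init__(self) -> None:
        if self.label_type not in ("anterograde", "retrograde"):
            raise ValueError(f"unknown label type: {self.label_type!r}")


@dataclass
class TractTracingExperiment:
    tracer_chemical: str               # cardinality [1]
    injection_site: InjectionSite      # cardinality [1]
    labeled_regions: List[LabeledRegion]  # cardinality [1...*]

    def __post_init__(self) -> None:
        if not self.labeled_regions:
            raise ValueError("an experiment needs at least one labeled region")


# Made-up placeholder values, only to show how the pieces fit together.
example = TractTracingExperiment(
    tracer_chemical="tracer-X",
    injection_site=InjectionSite(Location("structure-A", side="left")),
    labeled_regions=[
        LabeledRegion(
            Location("structure-B", ipsi_contra="ipsilateral"),
            label_type="anterograde",
            label_density="dense",
        )
    ],
)
print(example.tracer_chemical, len(example.labeled_regions))
```

The two `__post_init__` checks encode the only constraints visible on the slide: the label type takes one of two values, and an experiment has one or more labeled regions.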

Annotated XML Example from Albanese & Minciacchi, 1983, JCN 216:406-420

Legend: expt. label, delineation, injection, labeling description

Recall, Precision and F-Score

Field Labeling Results – overall label level

System Features                                         Precision  Recall  F-Score
Baseline                                                0.3926     0.1673  0.2346
Lexicon                                                 0.5689     0.3771  0.4536
Lexicon + Surface Words                                 0.7415     0.6817  0.7103
Lexicon + Surface Words + Window Words                  0.7843     0.7039  0.7420
Lexicon + Surface + Window Words + Dependency features  0.7756     0.7347  0.7546

Preliminary data from a training set of 14 documents + testing on 16 documents
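As a quick consistency check, the F-Score column agrees with the usual balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the slides do not spell out the formula) and recomputes the column from the reported precision and recall; the function and variable names are illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (system features, precision, recall, reported F-Score), copied from the table.
RESULTS = [
    ("Baseline", 0.3926, 0.1673, 0.2346),
    ("Lexicon", 0.5689, 0.3771, 0.4536),
    ("Lexicon + Surface Words", 0.7415, 0.6817, 0.7103),
    ("Lexicon + Surface Words + Window Words", 0.7843, 0.7039, 0.7420),
    ("Lexicon + Surface + Window Words + Dependency features", 0.7756, 0.7347, 0.7546),
]

for name, p, r, reported in RESULTS:
    recomputed = f_score(p, r)
    # Small differences are expected because P and R are already rounded.
    print(f"{name}: reported {reported:.4f}, recomputed {recomputed:.4f}")
```

Any remaining differences are in the fourth decimal place, which is what rounding of the inputs would produce.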

Field Labeling Results - Confusion Matrices

Counts (axes: machine labels, human labels)

                     O      injectionLocation  injectionSpread  labelingDescription  labelingLocation  tracerChemical  Total
O                    41087  141                97               338                  1751              6               43420
injectionLocation    545    744                48               6                    820               1               2164
injectionSpread      126    43                 147              11                   155               0               482
labelingDescription  1121   5                  0                3773                 82                47              5028
labelingLocation     1988   224                110              27                   9251              0               11600
tracerChemical       108    1                  12               0                    0                 623             744
Total                44975  1158               414              4155                 12059             677
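Per-label precision, recall and F-score can be read off a confusion matrix like this one. The sketch below is a minimal illustration, not the authors' evaluation code. It assumes that rows are human labels and columns are machine labels; the extracted slide names both axes but does not say which is which, so precision and recall swap if the orientation is the other way round.

```python
LABELS = [
    "O",
    "injectionLocation",
    "injectionSpread",
    "labelingDescription",
    "labelingLocation",
    "tracerChemical",
]

# ASSUMPTION: COUNTS[i][j] = tokens with human label i and machine label j.
COUNTS = [
    [41087, 141, 97, 338, 1751, 6],
    [545, 744, 48, 6, 820, 1],
    [126, 43, 147, 11, 155, 0],
    [1121, 5, 0, 3773, 82, 47],
    [1988, 224, 110, 27, 9251, 0],
    [108, 1, 12, 0, 0, 623],
]


def per_label_scores(counts, labels):
    """Return {label: (precision, recall, f_score)} from a square count matrix."""
    scores = {}
    for i, label in enumerate(labels):
        correct = counts[i][i]
        row_total = sum(counts[i])                 # all tokens a human gave this label
        col_total = sum(row[i] for row in counts)  # all tokens the machine gave this label
        precision = correct / col_total if col_total else 0.0
        recall = correct / row_total if row_total else 0.0
        denom = precision + recall
        f = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f)
    return scores


for label, (p, r, f) in per_label_scores(COUNTS, LABELS).items():
    print(f"{label:20s} P={p:.4f} R={r:.4f} F={f:.4f}")
```

Most of the off-diagonal mass sits in the O row and O column, and injectionLocation is more often labeled labelingLocation (820) than labeled correctly (744).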

Generalizing the methodology: ‘Histology’

[from Gonzalo-Ruiz et al 1992, JCN 321: 300-311]


Time and effort

• Current performance achieved by annotating 40 documents
• Each document contains 97 sentences (in results section) on average
• Annotation rate
    ~40 Sent/hr (no support)
    ~115 Sent/hr (after 20 documents)
• Time taken to annotate document to train system to perform at this standard: ~65 hours with no support
• Estimate ~2 months for a 50% RA (20 hours / week)
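The effort figures can be recombined in a few lines of arithmetic. The sketch below uses only the numbers on the slide and tabulates two readings. Both scenarios are assumptions made for illustration, since the slide does not state how the ~65 hour figure was derived.

```python
DOCUMENTS = 40
SENTENCES_PER_DOC = 97           # average, results section only
RATE_NO_SUPPORT = 40             # sentences per hour
RATE_WITH_SUPPORT = 115          # sentences per hour, after 20 documents
DOCS_BEFORE_SUPPORT = 20


def annotation_hours(n_docs: int, sentences_per_hour: float) -> float:
    """Hours needed to annotate n_docs documents at a given rate."""
    return n_docs * SENTENCES_PER_DOC / sentences_per_hour


# Reading 1 (assumption): every document is annotated at the unsupported rate.
hours_all_unsupported = annotation_hours(DOCUMENTS, RATE_NO_SUPPORT)

# Reading 2 (assumption): the first 20 documents are annotated at the
# unsupported rate and the remaining 20 at the faster, supported rate.
hours_mixed = (
    annotation_hours(DOCS_BEFORE_SUPPORT, RATE_NO_SUPPORT)
    + annotation_hours(DOCUMENTS - DOCS_BEFORE_SUPPORT, RATE_WITH_SUPPORT)
)

print(f"all unsupported: {hours_all_unsupported:.1f} h")
print(f"mixed rates:     {hours_mixed:.1f} h")
```

Under the second reading the total comes to roughly 65 hours, which is close to the figure quoted on the slide.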

Can we discover the schema from the text?

Given a large review or a grant proposal specific to a single laboratory

Annotate independent and dependent variables in papers.

Can we learn and extract these patterns?

An example from current set of annotations

10 independent variables:
• age
• species
• sex
• weight
• agonist/antagonist combinations (9)
• primary antibody
• preparation
• protocol
• brain region

1 dependent variable:
• signal density

Acknowledgements

Funding: Information Sciences Institute, seed funding * National Library of Medicine (RO1-LM07061) * NSF (LONI MAP project) * HBP (USCBP)

Neuroscience consultants: Alan Watts * Larry Swanson * Arshad Khan * Rick Thompson * Joel Hahn * Lori Gorton * Kim Rapp

Computer Scientists: Eduard Hovy * Donghui Feng * Patrick Pantel

Developers: Tommy Ingulfsen * Wei-Cheng Cheng
