Ontology Engineering approaches based on semi-automated curation of the primary literature
Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy
Biomedical Knowledge Engineering Group, Information Sciences Institute, University of Southern California
Where’s all the knowledge?
Image taken from U.S. Geological Survey Energy Resource Surveys Program
The primary research literature...
… is the end-product of all scientific research
… forms the basis for human understanding of the subject
… is written in natural language
… is structured
… is interpretable
… is expensive
… is terse
Precision and imprecision in biological representation
Conceptual model (high-level concepts): ‘Stress’, ‘energy balance’, ‘homeostasis’, ‘glucoprivation’
Assay: define model system (independent variables): 2-deoxyglucose (2DG) administered intravenously to rats; look for activation in ‘stress-responsive’ neurons
Experiment: perform measurements (dependent variables): MAP-K and pERK activate in neurons in PVH, BST and CEAl
Scale: imprecise (conceptual model) to precise (measurements)
Partitioning the literature
The problem with knowledge: an over-abundance of data
Corpus Preparation for Natural Language Processing
The Journal of Comparative Neurology is the foremost international journal for neuroanatomy. We downloaded ~12,000 PDFs in total from 1970-2005.
We preprocessed papers with consistent formatting from vols. 204-490 (1982-2005), providing a corpus of 9,474 PDF files. This corpus contains 99,094,318 words.
Active Learning / Information Extraction Methodology
The logical structure of a tract-tracing experiment
Tracer Chemical [1]
Injection Site [1]
    Location: brain structure, topography, side
Labeled region [1...*]
    Location: brain structure, topography, ipsi-/contra- relative to injection site?
    Label type: ‘anterograde’, ‘retrograde’
    Label density
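The schema on this slide can be sketched as a small set of record types. This is only an illustration: the class and field names below are assumptions derived from the slide's labels, not the group's actual data model, and the example values are placeholders rather than results from any paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Location:
    brain_structure: str
    topography: Optional[str] = None


@dataclass
class InjectionSite:
    location: Location
    side: Optional[str] = None  # assumed values such as "left" or "right"


@dataclass
class LabeledRegion:
    location: Location
    ipsi_contra: Optional[str] = None    # relative to the injection site
    label_type: Optional[str] = None     # 'anterograde' or 'retrograde'
    label_density: Optional[str] = None


@dataclass
class TractTracingExperiment:
    tracer_chemical: str                 # cardinality [1]
    injection_site: InjectionSite        # cardinality [1]
    labeled_regions: List[LabeledRegion] # cardinality [1...*]

    def __post_init__(self):
        # Enforce the [1...*] cardinality shown on the slide.
        if not self.labeled_regions:
            raise ValueError("an experiment needs at least one labeled region")


# Placeholder values only, chosen to show the shape of one record.
example = TractTracingExperiment(
    tracer_chemical="tracer X",
    injection_site=InjectionSite(Location("structure A"), side="left"),
    labeled_regions=[
        LabeledRegion(
            Location("structure B"),
            ipsi_contra="ipsilateral",
            label_type="anterograde",
            label_density="dense",
        )
    ],
)
```

Each field of such a record corresponds to a span that the information extraction step has to find in the text of a results section.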
Annotated XML Example from Albanese & Minciacchi, 1983, JCN 216:406-420
[Figure: annotated XML excerpt. Legible tag labels: expt. label, delineation, injection, labeling, description]
Recall, Precision and F-Score
Field Labeling Results – overall label level
System Features                                         Precision  Recall  F-Score
Baseline                                                0.3926     0.1673  0.2346
Lexicon                                                 0.5689     0.3771  0.4536
Lexicon + Surface Words                                 0.7415     0.6817  0.7103
Lexicon + Surface Words + Window Words                  0.7843     0.7039  0.7420
Lexicon + Surface + Window Words + Dependency features  0.7756     0.7347  0.7546
Preliminary data from a training set of 14 documents + testing on 16 documents
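The transcript does not say which F-score variant was used. A minimal check, assuming the balanced F-score (the harmonic mean of precision and recall), recomputes the last column from the first two. The function name `f_score` and the variable names are illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from the results table.
rows = {
    "Baseline": (0.3926, 0.1673, 0.2346),
    "Lexicon": (0.5689, 0.3771, 0.4536),
    "Lexicon + Surface Words": (0.7415, 0.6817, 0.7103),
    "Lexicon + Surface Words + Window Words": (0.7843, 0.7039, 0.7420),
    "Lexicon + Surface + Window Words + Dependency features": (0.7756, 0.7347, 0.7546),
}

# Largest difference between the recomputed and the reported F-score.
max_gap = max(abs(f_score(p, r) - f) for p, r, f in rows.values())

for name, (p, r, f) in rows.items():
    print(f"{name}: recomputed {f_score(p, r):.4f}, reported {f:.4f}")
```

Under that assumption the recomputed values agree with the reported ones to within the rounding of the four-decimal inputs.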
Field Labeling Results - Confusion Matrices
Counts (axes are labeled ‘machine labels’ and ‘human labels’ on the slide; the last row and column are totals)
                     O      injectionLocation  injectionSpread  labelingDescription  labelingLocation  tracerChemical  Total
O                    41087  141                97               338                  1751              6               43420
injectionLocation    545    744                48               6                    820               1               2164
injectionSpread      126    43                 147              11                   155               0               482
labelingDescription  1121   5                  0                3773                 82                47              5028
labelingLocation     1988   224                110              27                   9251              0               11600
tracerChemical       108    1                  12               0                    0                 623             744
Total                44975  1158               414              4155                 12059             677
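Per-label precision and recall can be read off a confusion matrix like this one. The sketch below assumes that rows are human labels and columns are machine labels; the transcript does not preserve which axis is which, so swapping the assumption swaps precision and recall. The names `LABELS`, `COUNTS` and `per_label_scores` are illustrative.

```python
LABELS = [
    "O", "injectionLocation", "injectionSpread",
    "labelingDescription", "labelingLocation", "tracerChemical",
]

# Counts copied from the slide. ASSUMPTION: rows = human labels,
# columns = machine labels.
COUNTS = [
    [41087, 141, 97, 338, 1751, 6],
    [545, 744, 48, 6, 820, 1],
    [126, 43, 147, 11, 155, 0],
    [1121, 5, 0, 3773, 82, 47],
    [1988, 224, 110, 27, 9251, 0],
    [108, 1, 12, 0, 0, 623],
]


def per_label_scores(counts, labels):
    """Return {label: (precision, recall, f1)} from a square count matrix."""
    scores = {}
    for i, label in enumerate(labels):
        correct = counts[i][i]
        row_total = sum(counts[i])                 # all tokens a human gave this label
        col_total = sum(row[i] for row in counts)  # all tokens the machine gave this label
        precision = correct / col_total if col_total else 0.0
        recall = correct / row_total if row_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


SCORES = per_label_scores(COUNTS, LABELS)
for label, (p, r, f1) in SCORES.items():
    print(f"{label:20s} P={p:.4f} R={r:.4f} F={f1:.4f}")
```

The row and column sums of `COUNTS` match the totals printed on the slide, which is a useful check that the matrix was transcribed correctly.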
Generalizing the methodology: ‘Histology’
[from Gonzalo-Ruiz et al 1992, JCN 321: 300-311]
Time and effort
Current performance achieved by annotating 40 documents
Each document contains 97 sentences (in results section) on average
Annotation rate: ~40 Sent/hr (no support); ~115 Sent/hr (after 20 documents)
Time taken to annotate document to train system to perform at this standard: ~65 hours with no support
Estimate ~2 months for a 50% RA (20 hours / week)
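The slide does not show how the ~65 hour figure was derived. As a rough worked example, the sketch below computes annotation time from the numbers given on the slide under two readings, both of which are assumptions: every document annotated at the unsupported rate, and the first 20 documents at the unsupported rate with the remaining 20 at the faster rate.

```python
DOCS = 40                 # documents annotated
SENTENCES_PER_DOC = 97    # average sentences per results section
RATE_UNSUPPORTED = 40     # sentences per hour with no support
RATE_SUPPORTED = 115      # sentences per hour after 20 documents
SWITCH_AFTER_DOCS = 20    # ASSUMPTION: the faster rate applies from document 21 on


def annotation_hours(docs, sentences_per_doc, rate):
    """Hours needed to annotate `docs` documents at `rate` sentences per hour."""
    return docs * sentences_per_doc / rate


# Reading 1: all 40 documents at the unsupported rate.
hours_all_unsupported = annotation_hours(DOCS, SENTENCES_PER_DOC, RATE_UNSUPPORTED)

# Reading 2: first 20 documents unsupported, remaining 20 at the faster rate.
hours_mixed = (
    annotation_hours(SWITCH_AFTER_DOCS, SENTENCES_PER_DOC, RATE_UNSUPPORTED)
    + annotation_hours(DOCS - SWITCH_AFTER_DOCS, SENTENCES_PER_DOC, RATE_SUPPORTED)
)

print(f"all unsupported: {hours_all_unsupported:.1f} h")
print(f"mixed rates:     {hours_mixed:.1f} h")
```

Of the two readings, the mixed-rate one comes out close to the ~65 hours quoted on the slide, but that match is an observation about the arithmetic, not a statement of how the authors computed it.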
Can we discover the schema from the text?
Given a large review or a grant proposal specific to a single laboratory
Annotate independent and dependent variables in papers.
Can we learn and extract these patterns?
An example from the current set of annotations
10 independent variables:
• age
• species
• sex
• weight
• agonist/antagonist combinations (9)
• primary antibody
• preparation
• protocol
• brain region
1 dependent variable:
• signal density
Acknowledgements
Funding: Information Sciences Institute, seed funding * National Library of Medicine (RO1-LM07061) * NSF (LONI MAP project) * HBP (USCBP)
Neuroscience consultants: Alan Watts * Larry Swanson * Arshad Khan * Rick Thompson * Joel Hahn * Lori Gorton * Kim Rapp
Computer Scientists: Eduard Hovy * Donghui Feng * Patrick Pantel
Developers: Tommy Ingulfsen * Wei-Cheng Cheng