mining literature and medical records
Lars Juhl Jensen
Mining literature and medical records
>10 km
exponential growth
~45 seconds per paper
outline
information retrieval
named entity recognition
augmented browsing
information extraction
text corpora
web resources
electronic health records
medical text mining
questions
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
MeSH terms
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
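The indexing, stemming and Boolean lookup steps above can be sketched in a few lines. This is a toy illustration, not how PubMed is implemented: the suffix-stripping `stem` function, the `DOCS` collection and all function names are assumptions made for the example, the phrase "cell cycle" is treated as two independent words, and query expansion with MeSH terms is left out.

```python
import re
from collections import defaultdict

def stem(word):
    # Crude stemming: strip a few common word endings (illustration only).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lower-case, split on non-alphanumerics, then stem each token.
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]

def build_index(docs):
    # Inverted index: stemmed term -> set of document ids, for fast lookup.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    # Every query word must occur in the document; "AND" itself is dropped.
    terms = tokenize(query.replace(" AND ", " "))
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits

# Hypothetical three-document collection.
DOCS = {
    1: "Cell cycles in budding yeast",
    2: "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
       "phosphorylated Swe1",
    3: "Yeast metabolism",
}
INDEX = build_index(DOCS)
print(search_and(INDEX, "yeast AND cell cycle"))
```

Document 2 is about the yeast cell cycle but contains none of the query words, so a purely word-based lookup does not return it.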
named entity recognition
identify the concepts
computer
as smart as a dog
teach it specific tricks
comprehensive lexicon
small molecules
proteins
cellular components
tissues
organisms
environments
diseases
phenotypes
behaviors
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
orthographic variation
prefixes and postfixes
CDC28 vs. Cdc28p
Myc vs. c-Myc
singular and plural forms
noun and adjective forms
flexible matching
upper- and lower-case
spaces and hyphens
disambiguation
homonyms
“black list”
unfortunate names
SDS
a
scalable implementation
>10 km
<10 hours
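A minimal sketch of dictionary-based tagging with the ingredients listed above: flexible matching on case and hyphens, simple handling of a "p" postfix and a "c-" prefix, and a black list for unfortunate names. The tiny `LEXICON`, the `BLACKLIST` contents and all function names are assumptions for illustration. Multi-word names, plural and adjective forms, and context-based disambiguation are not handled here.

```python
import re

# Hypothetical mini-lexicon: normalized name -> identifier.
LEXICON = {
    "clb2": "CLB2", "cdc28": "CDC28", "cdk1": "CDK1",
    "swe1": "SWE1", "cdc5": "CDC5", "myc": "MYC", "sds": "SDS",
}
# Names that are usually something else in running text.
BLACKLIST = {"sds", "a"}

def candidates(token):
    lowered = token.lower()
    norm = lowered.replace("-", "")      # ignore case and hyphens
    yield norm
    if norm.endswith("p"):               # postfix: Cdc28p -> cdc28
        yield norm[:-1]
    if lowered.startswith("c-"):         # prefix: c-Myc -> myc
        yield norm[1:]

def lookup(token):
    for cand in candidates(token):
        if cand in BLACKLIST:
            return None
        if cand in LEXICON:
            return LEXICON[cand]
    return None

def tag(text):
    hits = []
    for m in re.finditer(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        # Try the whole hyphenated token first, then each of its parts.
        spans = [(m.start(), m.group())]
        if "-" in m.group():
            offset = m.start()
            for part in m.group().split("-"):
                spans.append((offset, part))
                offset += len(part) + 1
        for start, surface in spans:
            identifier = lookup(surface)
            if identifier:
                hits.append((start, start + len(surface), surface, identifier))
                break
    return hits

SENTENCE = ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
            "phosphorylated Swe1 and this modification served as a priming "
            "step to promote subsequent Cdc5-dependent Swe1 "
            "hyperphosphorylation and degradation")
for hit in tag(SENTENCE):
    print(hit)
```

Each hit carries character offsets, so the same output could drive highlighting in a browser or feed a later extraction step.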
augmented browsing
show relevant information
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Reflect
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
browser add-on
Firefox
Google Chrome
Safari
Internet Explorer
PDF viewer
Utopia Documents
web services
still too much to read
information extraction
formalize the facts
the starting point
named entity recognition
two approaches
co-mentioning
within documents
within paragraphs
within sentences
weighted counts
co-mentioning score
absolute co-mentionings
relative overrepresentation
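One way to combine these ingredients is sketched below: each document contributes a weighted count to a pair of entities depending on whether they co-occur in the document, in a paragraph and in a sentence, and the final score mixes the absolute weighted count with its overrepresentation relative to the two entities' marginal counts. The weights, the exponent `ALPHA`, the exact formula and the toy `CORPUS` are assumptions chosen for illustration; the slides do not give them.

```python
from collections import defaultdict
from itertools import combinations

# Assumed illustrative parameters (not taken from the slides).
W_DOC, W_PAR, W_SENT, ALPHA = 1.0, 2.0, 0.2, 0.6

def pairs(entities):
    return set(combinations(sorted(entities), 2))

def weighted_counts(corpus):
    # corpus: list of documents; document: list of paragraphs;
    # paragraph: list of sentences; sentence: set of entity identifiers.
    counts = defaultdict(float)
    for doc in corpus:
        par_pairs, sent_pairs, doc_entities = set(), set(), set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= set(sentence)
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            counts[pair] += (W_DOC
                             + W_PAR * (pair in par_pairs)
                             + W_SENT * (pair in sent_pairs))
    return counts

def scores(counts):
    # score = C_ij^ALPHA * (C_ij * C_total / (C_i * C_j))^(1 - ALPHA)
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    return {
        (a, b): c ** ALPHA * (c * total / (marginal[a] * marginal[b])) ** (1 - ALPHA)
        for (a, b), c in counts.items()
    }

# Hypothetical corpus of two documents.
CORPUS = [
    [[{"CDC28", "SWE1"}, {"CDC5"}]],   # one paragraph, two sentences
    [[{"CDC28"}], [{"SWE1"}]],         # two paragraphs, one sentence each
]
COUNTS = weighted_counts(CORPUS)
SCORES = scores(COUNTS)
for pair, score in sorted(SCORES.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

The pair mentioned together in a sentence and again in a second document ends up with the highest score, while pairs that only share a paragraph rank lower.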
NLP
Natural Language Processing
grammatical analysis
part-of-speech tagging
noun, verb, etc.
multiword detection
semantic tagging
binding, regulation, etc.
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
extract stated facts
handle negations
high precision
poor recall
highly domain specific
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
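The trade-off between high precision and poor recall can be seen even in a very small rule-based extractor. The sketch below is not the grammar-based pipeline from the slides: it replaces part-of-speech tagging and parsing with two hand-written regular expressions, uses a crude pattern (`ENT`) as a stand-in for named entity recognition, and all names in it are assumptions for illustration.

```python
import re

# Crude stand-in for recognized gene/protein names, e.g. HAP1 or Cdc28.
ENT = r"[A-Z][A-Za-z]*\d+"

# Active voice: "A (does not) (directly) phosphorylated B"
ACTIVE = re.compile(
    rf"\b(?P<a>{ENT})\s+(?:(?P<neg>does not|did not)\s+)?(?:directly\s+)?"
    rf"(?P<verb>phosphorylate[sd]?|binds?)\s+(?P<b>{ENT})\b"
)
# Passive voice: "B is (not) controlled by A"
PASSIVE = re.compile(
    rf"\b(?P<b>{ENT})\s+(?:is|are|was|were)\s+(?:(?P<neg>not)\s+)?"
    rf"(?P<verb>controlled|regulated|phosphorylated)\s+by\s+(?P<a>{ENT})\b"
)
# Semantic tagging of cue verbs (hypothetical mapping).
REL_TYPES = {
    "phosphorylat": "phosphorylation",
    "bind": "binding",
    "control": "regulation",
    "regulat": "regulation",
}

def relation_type(verb):
    for stem, relation in REL_TYPES.items():
        if verb.lower().startswith(stem):
            return relation
    return "unknown"

def extract(sentence):
    # Returns (agent, relation, target, negated) tuples.
    facts = []
    for pattern in (ACTIVE, PASSIVE):
        for m in pattern.finditer(sentence):
            facts.append((m["a"], relation_type(m["verb"]), m["b"],
                          m["neg"] is not None))
    return facts

print(extract("Cdc28 directly phosphorylated Swe1"))
print(extract("The expression of the cytochrome genes CYC1 and CYC7 "
              "is controlled by HAP1"))
print(extract("HAP1 does not bind CYC1"))
```

The second sentence yields only the HAP1 to CYC7 relation because the patterns do not resolve the coordination "CYC1 and CYC7", and the full Clb2/Cdc28/Swe1 sentence yields nothing at all: what is extracted is stated in the text, but much is missed.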
text corpora
a body of text
most use abstracts
few use full-text articles
no access
~22 million abstracts
~1.8 million free articles
~1.4 million Elsevier articles
~7.5 million patents
web resources
information on proteins
iHOP
STRING
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
text mining channel
what is known
not in databases
human proteins
co-mentioning dominates
NLP provides actions
homology transfer
STITCH
small molecules
COMPARTMENTS
subcellular localization
DISEASES
human diseases
search for a protein
search for a disease
STRING payload
evidence viewers
electronic health records
what happens at a hospital
Jensen et al., Nature Reviews Genetics, 2012
two types of data
structured data
Jensen et al., Nature Reviews Genetics, 2012
unstructured data
clinical narrative
getting access
patient consent
opt-out
opt-in
ethical approval
medical question
no explorative studies
data security
not anonymized
not transferable
hospital IT systems
not standardized
clinical narrative
not normal language
trouble for NLP
in native language
not English
few tools
no dictionaries
by busy doctors and nurses
typos
medical text mining
what is possible?
a psychiatric corpus
clinical narrative
in Danish
dictionaries
diseases
drugs
adverse drug reactions
disease comorbidity
Jensen et al., Nature Reviews Genetics, 2012
multiple testing
comorbidity matrix
Roque et al., PLoS Computational Biology, 2011
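Because every pair of diseases is tested, the p-values have to be corrected for multiple testing. The sketch below assumes a one-sided Fisher's exact test per disease pair followed by a Bonferroni correction; the slides do not state which test or correction was used, and the patient data and disease labels are made up.

```python
from itertools import combinations
from math import comb

def fisher_upper_tail(k, n_a, n_b, n):
    # P(X >= k) for the overlap X between n_a and n_b patients drawn
    # from n patients (hypergeometric upper tail).
    tail = sum(comb(n_a, x) * comb(n - n_a, n_b - x)
               for x in range(k, min(n_a, n_b) + 1))
    return tail / comb(n, n_b)

def comorbidity(patients, alpha=0.05):
    # patients: patient id -> set of disease labels mined from the records.
    n = len(patients)
    diseases = sorted(set().union(*patients.values()))
    tests = []
    for a, b in combinations(diseases, 2):
        n_a = sum(a in found for found in patients.values())
        n_b = sum(b in found for found in patients.values())
        k = sum(a in found and b in found for found in patients.values())
        tests.append((a, b, k, fisher_upper_tail(k, n_a, n_b, n)))
    m = len(tests)  # number of tests, used for the Bonferroni correction
    return [(a, b, k, min(1.0, p * m)) for a, b, k, p in tests if p * m <= alpha]

# Hypothetical cohort: 8 patients with diseases A and B, 12 with disease C.
PATIENTS = {i: {"disease_A", "disease_B"} for i in range(8)}
PATIENTS.update({i: {"disease_C"} for i in range(8, 20)})

for a, b, k, p_corrected in comorbidity(PATIENTS):
    print(a, b, k, p_corrected)
```

The surviving pairs and their corrected p-values are the entries one would fill into a comorbidity matrix.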
patient clustering
Jensen et al., Nature Reviews Genetics, 2012
clustering algorithm
Roque et al., PLoS Computational Biology, 2011
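As a rough illustration of patient clustering, the sketch below compares patients by the cosine similarity of their disease profiles and merges every pair above a threshold (single-linkage grouping via union-find). This is a swapped-in, simplified technique: the slides only say "clustering algorithm", and the threshold, patient ids and disease terms are assumptions.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two binary profiles given as sets.
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def cluster(patients, threshold=0.5):
    # patients: patient id -> set of disease terms found in the record.
    parent = {p: p for p in patients}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for p, q in combinations(patients, 2):
        if cosine(patients[p], patients[q]) >= threshold:
            parent[find(p)] = find(q)      # merge the two groups

    groups = {}
    for p in patients:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=lambda group: sorted(group))

# Hypothetical patient profiles.
PROFILES = {
    "p1": {"depression", "insomnia"},
    "p2": {"depression", "insomnia", "anxiety"},
    "p3": {"schizophrenia"},
    "p4": {"schizophrenia", "diabetes"},
}
print(cluster(PROFILES))
```

Each resulting group is a candidate patient stratum that can then be characterized by the diseases overrepresented in it.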
patient stratification
temporal correlation
drug treatment
adverse drug events
Eriksson et al., in preparation, 2012
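The idea of temporal correlation can be illustrated by counting how often a possible adverse event is mentioned shortly after a drug is started compared with before. The function below is only a sketch under assumed inputs (a treatment start date and the dates of event mentions in one patient's notes) with an arbitrary 30-day window; it is not the method of the cited work, which the slides do not describe.

```python
from datetime import date, timedelta

def events_around_drug_start(drug_start, event_dates, window_days=30):
    # Count event mentions before treatment and within the window after it.
    window_end = drug_start + timedelta(days=window_days)
    before = sum(d < drug_start for d in event_dates)
    after = sum(drug_start <= d <= window_end for d in event_dates)
    return before, after

# Hypothetical patient: drug started 10 January 2012.
START = date(2012, 1, 10)
MENTIONS = [date(2012, 1, 2), date(2012, 1, 15),
            date(2012, 2, 1), date(2012, 3, 20)]
print(events_around_drug_start(START, MENTIONS))
```

Aggregated over many patients, an excess of mentions after the start of treatment would flag a drug and event pair for closer review.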
pharmacovigilance
thank you!
questions?