text mining
Lars Juhl Jensen
Text mining
>10 km
exponential growth
~45 seconds per paper
corpus
most use abstracts
few use full-text articles
no access
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
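The retrieval steps above (indexing for fast lookup, stemming of word endings, a user-specified Boolean query) can be sketched in a few lines. This is a minimal illustration, not how PubMed works internally: the three documents are made up, `stem` is a deliberately crude suffix stripper standing in for a real stemmer, and dynamic query expansion is not covered.

```python
from collections import defaultdict

def stem(word):
    """Crude stemmer: strip a trailing 'ing' or 's' if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split into words, stem each word."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [stem(tok) for tok in cleaned.split()]

def build_index(docs):
    """Inverted index: stemmed term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Return the ids of documents that contain every term of an AND query."""
    terms = [t for part in query.split(" AND ") for t in tokenize(part)]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

# Hypothetical mini-corpus (titles invented for illustration).
DOCS = {
    1: "Cell cycles in budding yeasts",
    2: "Yeast metabolism under stress",
    3: "The cell cycle in human cells",
}
INDEX = build_index(DOCS)
RESULT = search_and(INDEX, "yeast AND cell cycle")
```

Because both the documents and the query pass through the same stemmer, "yeasts" and "cycles" in document 1 match the query terms "yeast" and "cycle".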
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
still too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
identify the concepts
comprehensive lexicon
small molecules
proteins
cellular components
tissues
diseases
environments
organisms
orthographic expansion
prefixes and postfixes
Cdc28 vs. Cdc28p
singular and plural forms
flexible matching
upper- and lower-case
spaces and hyphens
“black list”
SDS
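The dictionary-based recognition steps above can be sketched as follows. This is a toy illustration under stated assumptions: the lexicon, its identifiers, and the `SDS_GENE` entry are hypothetical, `expand` only adds a "p" postfix and a plural "s", and flexible matching is approximated by lowercasing and by joining up to two adjacent alphanumeric tokens so that spaces and hyphens inside a name are ignored. The black list is assumed to hold names such as "SDS" that usually mean something else in text.

```python
import re

# Hypothetical lexicon: entity identifier -> known names and synonyms.
LEXICON = {
    "CDC28": ["Cdc28", "Cdk1"],
    "SWE1": ["Swe1"],
    "CLB2": ["Clb2"],
    "CDC5": ["Cdc5"],
    "SDS_GENE": ["SDS"],  # made-up entry to demonstrate the black list
}
BLACKLIST = {"sds"}  # normalized names that should never be tagged

def normalize(name):
    """Flexible matching key: lowercase, with spaces and hyphens removed."""
    return re.sub(r"[\s\-]+", "", name.lower())

def expand(name):
    """Orthographic expansion: add a 'p' postfix (Cdc28p) and a plural form."""
    return {name, name + "p", name + "s"}

def build_matcher(lexicon, blacklist):
    """Map every normalized name variant to its entity, skipping blacklisted names."""
    matcher = {}
    for entity, names in lexicon.items():
        for name in names:
            if normalize(name) in blacklist:
                continue
            for variant in expand(name):
                matcher[normalize(variant)] = entity
    return matcher

def tag(text, matcher, max_tokens=2):
    """Return (matched text, entity) pairs, trying the longest token window first."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits = []
    for i in range(len(tokens)):
        for n in range(max_tokens, 0, -1):
            window = tokens[i:i + n]
            if len(window) < n:
                continue
            key = "".join(tok for tok, _, _ in window)
            if key in matcher:
                hits.append((text[window[0][1]:window[-1][2]], matcher[key]))
                break
    return hits

MATCHER = build_matcher(LEXICON, BLACKLIST)
HITS = tag("Clb2-bound Cdc28p phosphorylated Swe1 in SDS buffer; "
           "Cdc-28 and CDC5 were unaffected.", MATCHER)
```

In this example "Cdc28p", "Cdc-28" and "CDC5" are all recognized despite differences in postfix, hyphenation and case, while "SDS" is skipped because it is on the black list.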
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
information extraction
formalize the facts
the starting point
named entity recognition
two approaches
co-mentioning
within documents
within paragraphs
within sentences
weighted counts
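Co-mentioning with weighted counts can be sketched as below. The weights and the toy corpus are assumptions chosen for illustration (the slides do not give values); the only idea carried over is that a pair mentioned in the same sentence counts for more than one sharing only a paragraph or a document. Each document is represented as a list of paragraphs, each paragraph as a list of sentences, and each sentence as the set of entity identifiers already found by named entity recognition.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions contribute more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def pairs(entities):
    """All unordered entity pairs, as alphabetically sorted tuples."""
    return combinations(sorted(entities), 2)

def comention_scores(corpus):
    """Sum the weights of every sentence, paragraph and document sharing a pair."""
    scores = Counter()
    for document in corpus:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities |= sentence
            for pair in pairs(par_entities):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            scores[pair] += WEIGHTS["document"]
    return scores

# Toy corpus: two documents with pre-tagged entities.
CORPUS = [
    [  # document 1
        [{"CDC28", "SWE1"}, {"CDC5"}],  # paragraph 1, two sentences
        [{"CLB2"}],                     # paragraph 2, one sentence
    ],
    [  # document 2
        [{"CDC28"}, {"SWE1"}],          # one paragraph, two sentences
    ],
]
SCORES = comention_scores(CORPUS)
```

Here the pair (CDC28, SWE1) scores highest because it co-occurs in one sentence, two paragraphs and two documents, whereas (CDC28, CLB2) only shares a document.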
NLP (Natural Language Processing)
part-of-speech tagging
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
handle negations
high precision
poor recall
highly domain specific
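The trade-off above (high precision, poor recall, highly domain specific) shows up even in a toy rule. The sketch below is not the tagging and parsing pipeline from the slides; it swaps in a single hand-written regular expression for the verb phrase "is controlled by", a made-up gene list, and a simple negation check. Every name in it is an assumption for illustration.

```python
import re

GENES = {"CYC1", "CYC7", "HAP1"}  # hypothetical gene/protein lexicon

# One rule: "<targets> is/are [not] controlled by <regulator>".
RULE = re.compile(
    r"(?P<targets>.+?)\s+(?:is|are)\s+(?P<negation>not\s+)?"
    r"controlled\s+by\s+(?P<regulator>\w+)"
)

def extract_regulation(sentence):
    """Return (regulator, target) pairs stated by the rule; skip negated sentences."""
    match = RULE.search(sentence)
    if match is None or match.group("negation"):
        return []
    regulator = match.group("regulator")
    if regulator not in GENES:
        return []
    targets = [w for w in re.findall(r"\w+", match.group("targets")) if w in GENES]
    return [(regulator, target) for target in targets]

POSITIVE = extract_regulation(
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
)
NEGATED = extract_regulation("The expression of CYC1 is not controlled by HAP1")
MISSED = extract_regulation("HAP1 regulates CYC1")
```

The rule extracts both relations from the example sentence and correctly ignores the negated one, but it misses "HAP1 regulates CYC1" entirely, which is the poor recall the slides warn about.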
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
text/data integration
augmented browsing
Reflect
show relevant information
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
guilt by association
heterogeneous evidence
knowledge
experiments
text mining
predictions
common identifiers
quality scores
web interface
STRING
proteins
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
Frishman et al., Modern Genome Annotation, 2009
STITCH
small molecules
Kuhn et al., Nucleic Acids Research, 2012
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
TISSUES
human tissue expression
tissues.jensenlab.org
DISEASES
human diseases
evidence viewers
web services
bulk download
summary
text mining
simpler
more useful
less boring
thank you!
questions?