text mining

Post on 10-May-2015


Lars Juhl Jensen

Text mining

>10 km

exponential growth

~45 seconds per paper

corpus

most use abstracts

few use full-text articles

no access

information retrieval

find the relevant papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

PubMed

indexing

fast lookup

stemming

word endings

dynamic query expansion
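The retrieval steps above (indexing for fast lookup, stemming of word endings, a Boolean query) can be sketched in a few lines. This is a minimal illustration, not how PubMed is implemented: the three toy documents, the function names, and the crude suffix-stripping rule are all assumptions made for the example, and a real system would use a proper stemmer.

```python
import re
from collections import defaultdict


def stem(word):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the text, split it into alphanumeric tokens, stem each one."""
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Inverted index: stemmed term -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return the IDs of documents that contain every query term."""
    terms = [term for term in tokenize(query) if term != "and"]
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


# Hypothetical mini-corpus, for illustration only.
docs = {
    1: "Cell cycles in budding yeast",
    2: "Yeast genetics",
    3: "The cell cycle of human cells",
}
index = build_index(docs)
print(sorted(search_and(index, "yeast AND cell cycle")))  # prints [1]
```

Because both the documents and the query pass through the same stemming step, the query term "cycle" also matches "cycles" in document 1.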

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find that

still too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

identify the concepts

comprehensive lexicon

small molecules

proteins

cellular components

tissues

diseases

environments

organisms

orthographic expansion

prefixes and postfixes

Cdc28 vs. Cdc28p

singular and plural forms

flexible matching

upper- and lower-case

spaces and hyphens

“black list”

SDS
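A dictionary-based tagger along the lines listed above might look like the following sketch. The tiny lexicon, the type labels, the two expansion rules ("p" postfix and plural "s") and all function names are assumptions chosen for illustration; a real lexicon would hold many more names and map each one to a database identifier.

```python
import re

# Illustrative lexicon: name -> entity type. "SDS" is included only to show
# the effect of the black list.
LEXICON = {
    "Clb2": "protein",
    "Cdc28": "protein",
    "Cdk1": "protein",
    "Swe1": "protein",
    "Cdc5": "protein",
    "SDS": "protein",
}
BLACKLIST = {"sds"}


def normalize(name):
    """Flexible matching: ignore case, spaces and hyphens."""
    return re.sub(r"[\s-]+", "", name.lower())


def expand(name):
    """Orthographic expansion: add a 'p' postfix and a plural form."""
    return {name, name + "p", name + "s"}


def build_matcher(lexicon, blacklist):
    """Map every normalized variant that is not blacklisted to its entry."""
    matcher = {}
    for name, entity_type in lexicon.items():
        for variant in expand(name):
            key = normalize(variant)
            if key not in blacklist:
                matcher[key] = (name, entity_type)
    return matcher


def tag(text, matcher, max_tokens=3):
    """Return (matched text, lexicon name, type) tuples, longest match first."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            key = "".join(token[0] for token in tokens[i:i + n])
            # Tokens may only be joined across spaces and hyphens.
            gaps_ok = all(
                re.fullmatch(r"[\s-]*", text[tokens[j][2]:tokens[j + 1][1]])
                for j in range(i, i + n - 1)
            )
            if gaps_ok and key in matcher:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append((text[start:end], *matcher[key]))
                i += n
                break
        else:
            i += 1
    return hits


matcher = build_matcher(LEXICON, BLACKLIST)
sentence = ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
            "phosphorylated Swe1 and this modification served as a priming "
            "step to promote subsequent Cdc5-dependent Swe1 "
            "hyperphosphorylation and degradation")
for hit in tag(sentence, matcher):
    print(hit)
```

With this setup, "Cdc28p" and "cdc-28" both resolve to the lexicon entry Cdc28, while a bare "SDS" in the text is ignored because of the black list.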

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

information extraction

formalize the facts

the starting point

named entity recognition

two approaches

co-mentioning

within documents

within paragraphs

within sentences

weighted counts
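Co-mentioning with weighted counts can be sketched as follows. The weights are hypothetical values picked for the example (the slides do not give any), and the input is assumed to be already tagged: each document is a list of paragraphs, each paragraph a list of sentences, each sentence a list of recognized entity names.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count for more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}


def pairs(entities):
    """All unordered pairs of distinct entities, in a canonical order."""
    return combinations(sorted(set(entities)), 2)


def comention_scores(documents):
    """Sum weighted co-mention counts over sentences, paragraphs, documents."""
    scores = Counter()
    for document in documents:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities.update(sentence)
            for pair in pairs(par_entities):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities.update(par_entities)
        for pair in pairs(doc_entities):
            scores[pair] += WEIGHTS["document"]
    return scores


# One toy document with two paragraphs (illustrative data only).
documents = [
    [
        [["Cdc28", "Swe1"], ["Cdc5"]],  # paragraph 1: two sentences
        [["Clb2"]],                     # paragraph 2: one sentence
    ]
]
scores = comention_scores(documents)
print(scores[("Cdc28", "Swe1")])  # prints 7.0 (sentence + paragraph + document)
print(scores[("Cdc28", "Cdc5")])  # prints 3.0 (paragraph + document)
print(scores[("Cdc28", "Clb2")])  # prints 1.0 (document only)
```

In this sketch a pair mentioned in the same sentence also collects the paragraph and document weights, so the score grows with how closely the two names appear together.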

NLP (Natural Language Processing)

part-of-speech tagging

semantic tagging

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

handle negations

high precision

poor recall

highly domain specific
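The trade-off listed above (high precision, poor recall, domain specificity) shows up even in a toy rule. The sketch below is not a part-of-speech tagger or parser; it swaps in a single hand-written regular expression as a much simpler stand-in, and the pattern, the entity-name shape it assumes (letters followed by digits), and the function name are all assumptions made for the example.

```python
import re

# One hand-written rule: "<agent> [did not] [directly] phosphorylate(s/d) <target>".
PATTERN = re.compile(
    r"(?P<agent>[A-Z][A-Za-z]*\d+)"        # entity such as Cdc28 (assumed shape)
    r"(?:\s*\([^)]*\))?"                   # optional parenthetical after the agent
    r"\s+(?P<neg>(?:does|did)\s+not\s+)?"  # optional negation
    r"(?:directly\s+)?"                    # optional adverb
    r"(?P<verb>phosphorylate[sd]?)"        # relation verb
    r"\s+(?P<target>[A-Z][A-Za-z]*\d+)"    # second entity
)


def extract(sentence):
    """Return (agent, relation, target, negated), or None if the rule fails."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("agent"), "phosphorylation",
            match.group("target"), match.group("neg") is not None)


print(extract("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
              "phosphorylated Swe1"))
# prints ('Cdc28', 'phosphorylation', 'Swe1', False)
print(extract("Cdc28 did not phosphorylate Swe1"))
# prints ('Cdc28', 'phosphorylation', 'Swe1', True)
print(extract("Swe1 is phosphorylated by Cdc28"))
# prints None
```

The last call illustrates the recall problem: the passive sentence states the same fact, but the rule was not written for it, so nothing is extracted.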

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

text/data integration

augmented browsing

Reflect

show relevant information

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

O’Donoghue et al., Journal of Web Semantics, 2010

guilt by association

heterogeneous evidence

knowledge

experiments

text mining

predictions

common identifiers

quality scores

web interface

STRING

proteins

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

Frishman et al., Modern Genome Annotation, 2009

STITCH

small molecules

Kuhn et al., Nucleic Acids Research, 2012

COMPARTMENTS

subcellular localization

compartments.jensenlab.org

TISSUES

human tissue expression

tissues.jensenlab.org

DISEASES

human diseases

evidence viewers

web services

bulk download

summary

text mining

simpler

more useful

less boring

thank you!

questions?
