text mining


Upload: lars-juhl-jensen

Post on 10-May-2015

TRANSCRIPT

Page 1: Text mining

Lars Juhl Jensen

Text mining

>10 km

Page 2: Text mining

exponential growth

Page 3: Text mining
Page 4: Text mining
Page 5: Text mining

~45 seconds per paper

Page 6: Text mining

corpus

Page 7: Text mining

most use abstracts

Page 8: Text mining

few use full-text articles

Page 9: Text mining

no access

Page 10: Text mining
Page 11: Text mining

information retrieval

Page 12: Text mining

find the relevant papers

Page 13: Text mining

ad hoc retrieval

Page 14: Text mining

user-specified query

Page 15: Text mining

“yeast AND cell cycle”

Page 16: Text mining

PubMed

Page 17: Text mining
Page 18: Text mining

indexing

Page 19: Text mining

fast lookup
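
To make the indexing and fast-lookup slides concrete, here is a minimal sketch of an inverted index with a Boolean AND query. The toy corpus, the document IDs, and the function names are made up for illustration; this does not describe how PubMed is implemented, and the phrase "cell cycle" is simply treated as two separate terms.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_and(index, query_terms):
    """Return IDs of documents containing every query term (Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical toy corpus; the IDs and texts are made up for illustration.
docs = {
    1: "Cell cycle regulation in yeast",
    2: "Yeast metabolism under stress",
    3: "The cell cycle of human cells",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
print(hits)
```

The index is built once; each query then only intersects a few precomputed sets instead of scanning every document.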

Page 20: Text mining

stemming

Page 21: Text mining

word endings
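
A rough sketch of what stripping word endings can look like. The suffix list and the `toy_stem` helper are assumptions made for this example; it is a deliberately crude stemmer, not the Porter algorithm or whatever a given search engine uses.

```python
# Hypothetical suffix list, checked in order (longer endings first).
SUFFIXES = ("ations", "ation", "ating", "ated", "ates", "ate",
            "ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("phosphorylated", "phosphorylation", "phosphorylates", "yeast"):
    print(w, "->", toy_stem(w))
```

Mapping several inflected forms to one stem lets a query for one form retrieve documents that use another.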

Page 22: Text mining

dynamic query expansion
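
One way to picture dynamic query expansion is as rewriting each query term into an OR-group of synonyms before the search runs. The synonym table below is hypothetical and tiny; a real system would draw on curated vocabularies, and the function names are invented for this sketch.

```python
# Hypothetical synonym table, made up for illustration.
SYNONYMS = {
    "cdc28": {"cdc28", "cdk1"},
    "yeast": {"yeast", "saccharomyces cerevisiae"},
}

def expand_query(terms, synonyms=SYNONYMS):
    """Replace each term with the sorted list of its synonyms (itself included)."""
    return [sorted(synonyms.get(t.lower(), {t.lower()})) for t in terms]

def to_boolean(expanded):
    """Render the expanded query as an AND of OR-groups."""
    groups = ["(" + " OR ".join(group) + ")" for group in expanded]
    return " AND ".join(groups)

query = to_boolean(expand_query(["Cdc28", "yeast"]))
print(query)
```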

Page 23: Text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 24: Text mining

no tool will find that

Page 25: Text mining

still too much to read

Page 26: Text mining

computer

Page 27: Text mining

as smart as a dog

Page 28: Text mining

teach it specific tricks

Page 29: Text mining
Page 30: Text mining
Page 31: Text mining

named entity recognition

Page 32: Text mining

identify the concepts

Page 33: Text mining

comprehensive lexicon

Page 34: Text mining

small molecules

Page 35: Text mining

proteins

Page 36: Text mining

cellular components

Page 37: Text mining

tissues

Page 38: Text mining

diseases

Page 39: Text mining

environments

Page 40: Text mining

organisms

Page 41: Text mining

orthographic expansion

Page 42: Text mining

prefixes and postfixes

Page 43: Text mining

Cdc28 vs. Cdc28p

Page 44: Text mining

singular and plural forms
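
A minimal sketch of orthographic expansion, assuming it means generating spelling variants of each lexicon name ahead of matching. Only two of the variations from the slides are covered: the "p" postfix (Cdc28 vs. Cdc28p) and a naive plural. Prefixes are left out because the slides do not say which ones are meant, and `expand_name` is a name invented for this example.

```python
def expand_name(name):
    """Generate simple orthographic variants of a protein name.

    Covers a trailing "p" postfix and a naive plural formed by
    appending "s"; anything more elaborate is out of scope here.
    """
    return {name, name + "p", name + "s"}

variants = expand_name("Cdc28")
print(sorted(variants))
```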

Page 45: Text mining

flexible matching

Page 46: Text mining

upper- and lower-case

Page 47: Text mining

spaces and hyphens

Page 48: Text mining

“black list”

Page 49: Text mining

SDS
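
The flexible-matching and black-list slides can be sketched as a dictionary tagger that ignores case, spaces, and hyphens, and that refuses to match blocked names. The mini-lexicon, the entity types, and the reason for blocking SDS (that in running text it usually refers to the detergent) are assumptions made for this example, not details from the slides.

```python
import re

# Hypothetical mini-lexicon and black list, made up for illustration.
LEXICON = {"Cdc28": "protein", "cell cycle": "process", "SDS": "protein"}
BLACKLIST = {"sds"}

def normalize(text):
    """Lower-case and drop spaces and hyphens so variants compare equal."""
    return re.sub(r"[\s\-]+", "", text.lower())

def build_matcher(lexicon, blacklist):
    """Index lexicon names by normalized form, skipping blacklisted ones."""
    blocked = {normalize(name) for name in blacklist}
    return {
        normalize(name): (name, kind)
        for name, kind in lexicon.items()
        if normalize(name) not in blocked
    }

def tag(text, matcher, max_words=3):
    """Return (surface string, canonical name, type) for each lexicon hit."""
    words = re.findall(r"[A-Za-z0-9\-]+", text)
    hits = []
    i = 0
    while i < len(words):
        # Prefer the longest span of words starting at position i.
        for n in range(min(max_words, len(words) - i), 0, -1):
            surface = " ".join(words[i:i + n])
            entry = matcher.get(normalize(surface))
            if entry:
                hits.append((surface, entry[0], entry[1]))
                i += n
                break
        else:
            i += 1
    return hits

matcher = build_matcher(LEXICON, BLACKLIST)
hits = tag("CDC28 drives the cell-cycle; samples were run on SDS gels", matcher)
print(hits)
```

Normalizing both the lexicon and the text to the same form means "CDC28" and "cell-cycle" match their lexicon entries, while the blocked name is never tagged.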

Page 50: Text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 51: Text mining

information extraction

Page 52: Text mining

formalize the facts

Page 53: Text mining

the starting point

Page 54: Text mining

named entity recognition

Page 55: Text mining

two approaches

Page 56: Text mining

co-mentioning

Page 57: Text mining

within documents

Page 58: Text mining

within paragraphs

Page 59: Text mining

within sentences

Page 60: Text mining

weighted counts
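
A sketch of co-mention scoring with weighted counts, assuming that entities have already been recognized and that closer co-mentions should count more. The weights and the nested-list input format are invented for this example; the slides do not give the actual weighting scheme.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count more. Values are made up.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def comention_scores(documents, weights=WEIGHTS):
    """Sum weighted co-mention counts for every unordered entity pair.

    `documents` is a list of documents; each document is a list of
    paragraphs; each paragraph is a list of sentences; each sentence is
    the set of entity names already recognized in it.
    """
    scores = Counter()

    def add_pairs(entities, weight):
        for pair in combinations(sorted(entities), 2):
            scores[pair] += weight

    for doc in documents:
        doc_entities = set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                add_pairs(sentence, weights["sentence"])
                par_entities |= set(sentence)
            add_pairs(par_entities, weights["paragraph"])
            doc_entities |= par_entities
        add_pairs(doc_entities, weights["document"])
    return scores

docs = [
    [  # one document
        [{"Cdc28", "Swe1"}, {"Cdc5"}],  # paragraph 1: two sentences
        [{"Clb2"}],                      # paragraph 2: one sentence
    ],
]
scores = comention_scores(docs)
print(scores.most_common())
```

In this sketch a pair mentioned in the same sentence also collects the paragraph and document weights, so sentence-level co-mentions dominate the ranking.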

Page 61: Text mining

NLP: Natural Language Processing

Page 62: Text mining

part-of-speech tagging

Page 63: Text mining

semantic tagging

Page 64: Text mining

sentence parsing

Page 65: Text mining

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Page 66: Text mining

handle negations
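
As a much simpler stand-in for the tagging and parsing pipeline on the preceding slides, the sketch below extracts "X is controlled by Y" relations with a single regular expression and flags negated statements. The pattern, the function name, and the second half of the example sentence (GENEX, GENEY) are made up for illustration; a real NLP system would work from a proper parse rather than one hard-coded verb phrase.

```python
import re

# Hypothetical pattern for a single verb phrase, with an optional negation.
PATTERN = re.compile(
    r"(?P<target>[A-Z][A-Za-z0-9]+) is (?P<neg>not )?controlled by "
    r"(?P<regulator>[A-Z][A-Za-z0-9]+)"
)

def extract_regulation(sentence):
    """Return (regulator, target, negated) triples found in the sentence."""
    return [
        (m.group("regulator"), m.group("target"), m.group("neg") is not None)
        for m in PATTERN.finditer(sentence)
    ]

facts = extract_regulation(
    "CYC1 is controlled by HAP1, but GENEX is not controlled by GENEY."
)
print(facts)
```

Without the negation flag, the second clause would be recorded as a positive regulatory relation.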

Page 67: Text mining

high precision

Page 68: Text mining

poor recall
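
For reference, precision is the fraction of extracted facts that are correct, and recall is the fraction of true facts that were extracted. The sketch below computes both on made-up sets to show what "high precision, poor recall" looks like numerically; the numbers are illustrative, not results from the talk.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Made-up example: 2 extracted relations, both correct, out of 8 true ones.
gold = {("A", f"B{i}") for i in range(8)}
predicted = {("A", "B0"), ("A", "B1")}
p, r = precision_recall(predicted, gold)
print(p, r)
```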

Page 69: Text mining

highly domain specific

Page 70: Text mining

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Page 71: Text mining

text/data integration

Page 72: Text mining

augmented browsing

Page 73: Text mining

Reflect

Page 74: Text mining

show relevant information

Page 75: Text mining

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010

Page 76: Text mining

guilt by association

Page 77: Text mining
Page 78: Text mining

heterogeneous evidence

Page 79: Text mining

knowledge

Page 80: Text mining

experiments

Page 81: Text mining

text mining

Page 82: Text mining

predictions

Page 83: Text mining

common identifiers

Page 84: Text mining

quality scores
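
Once evidence from different sources is mapped to common identifiers and given a quality score between 0 and 1, the scores for one pair can be merged into a single value. The sketch below uses one common approach, treating the scores as independent probabilities and computing 1 minus the product of (1 - s). The channel names echo the slides, but the numbers are made up, and this is not claimed to be the exact scheme used by any particular database.

```python
from math import prod

def combine_scores(scores):
    """Combine independent evidence scores as 1 - prod(1 - s)."""
    scores = list(scores)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    return 1.0 - prod(1.0 - s for s in scores)

# Hypothetical scores for one protein pair from three evidence channels.
evidence = {"experiments": 0.6, "text mining": 0.5, "predictions": 0.2}
combined = combine_scores(evidence.values())
print(round(combined, 2))
```

Under this rule, several moderate pieces of evidence add up to a higher combined score than any single one, and the result never exceeds 1.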

Page 85: Text mining

web interface

Page 86: Text mining

STRING

Page 87: Text mining

proteins

Page 88: Text mining

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

Page 89: Text mining

Frishman et al., Modern Genome Annotation, 2009

Page 90: Text mining

STITCH

Page 91: Text mining

small molecules

Page 92: Text mining

Kuhn et al., Nucleic Acids Research, 2012

Page 93: Text mining

COMPARTMENTS

Page 94: Text mining

subcellular localization

Page 95: Text mining

compartments.jensenlab.org

Page 96: Text mining

TISSUES

Page 97: Text mining

human tissue expression

Page 98: Text mining

tissues.jensenlab.org

Page 99: Text mining

DISEASES

Page 100: Text mining

human diseases

Page 101: Text mining
Page 102: Text mining

evidence viewers

Page 103: Text mining
Page 104: Text mining

web services

Page 105: Text mining

bulk download

Page 106: Text mining

summary

Page 107: Text mining

text mining

Page 108: Text mining

simpler

Page 109: Text mining

more useful

Page 110: Text mining

less boring

Page 111: Text mining

thank you!

Page 112: Text mining

questions?