Large-scale integration of data and text Lars Juhl Jensen


Post on 15-Jul-2015


Large-scale integration of data and text

Lars Juhl Jensen

three parts

association networks

guilt by association

protein networks

STRING

2000+ genomes

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

gene neighborhood

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
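The phylogenetic-profile idea can be sketched in a few lines: genes that are present and absent in the same genomes are candidates for functional association. The gene names, the five genomes, and the use of Jaccard similarity are illustrative assumptions, not the scoring that STRING actually applies.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) of each gene across five genomes.
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (0, 1, 1, 0, 1),
}

def jaccard(p, q):
    """Genomes containing both genes divided by genomes containing either."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def rank_pairs(profiles):
    """Return (similarity, gene, gene) tuples, most similar pair first."""
    scored = [
        (jaccard(profiles[g], profiles[h]), g, h)
        for g, h in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, reverse=True)

ranked = rank_pairs(profiles)
```

Here geneA and geneB share an identical profile and therefore rank first; with 2000+ genomes the profiles simply become longer vectors.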

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

physical interactions

Jensen & Bork, Science, 2008

genetic interactions

Beyer et al., Nature Reviews Genetics, 2007

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

not same species

hard work

(students)

parsers

mapping files

quality scores

phylogenetic profiles

affinity purification

von Mering et al., Nucleic Acids Research, 2005

score calibration

gold standard

von Mering et al., Nucleic Acids Research, 2005

implicit weighting by quality

common scale
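A minimal sketch of score calibration, assuming simple binning: raw scores from one evidence type are grouped into bins, and each bin is assigned the fraction of its pairs that appear in the gold standard. The pairs, bin edges, and gold standard below are made up for illustration, and binning stands in for whatever curve fitting a real pipeline would use.

```python
from bisect import bisect_right

def calibrate(scored_pairs, gold_standard, bin_edges):
    """Estimate, per raw-score bin, the fraction of pairs found in the gold standard.

    scored_pairs: list of (pair, raw_score)
    gold_standard: set of pairs accepted as true associations
    bin_edges: ascending raw-score thresholds separating the bins
    """
    hits = [0] * (len(bin_edges) + 1)
    totals = [0] * (len(bin_edges) + 1)
    for pair, raw in scored_pairs:
        b = bisect_right(bin_edges, raw)
        totals[b] += 1
        hits[b] += pair in gold_standard
    # Empty bins get None because no estimate is possible.
    return [h / t if t else None for h, t in zip(hits, totals)]

def calibrated_score(raw, bin_edges, table):
    """Look up the calibrated confidence for a raw score."""
    return table[bisect_right(bin_edges, raw)]

# Hypothetical raw scores and gold standard.
scored_pairs = [
    (("a", "b"), 0.10), (("a", "c"), 0.20),
    (("b", "c"), 0.60), (("c", "d"), 0.70),
    (("d", "e"), 0.90), (("a", "e"), 0.95),
]
gold_standard = {("b", "c"), ("d", "e"), ("a", "e")}
bin_edges = [0.5, 0.8]

table = calibrate(scored_pairs, gold_standard, bin_edges)
```

Because every evidence type is calibrated against the same gold standard, the resulting values live on a common scale, and noisy sources are down-weighted automatically because their bins contain fewer gold-standard pairs.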

homology-based transfer

orthologous groups

Franceschini et al., Nucleic Acids Research, 2013
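Homology-based transfer can be illustrated as follows: an association observed between two proteins in one species is copied to the members of the same orthologous groups in another species. The group assignments, the `species:protein` naming, and the fixed `PENALTY` factor are hypothetical simplifications, not the published procedure.

```python
from itertools import product

# Hypothetical orthologous-group assignments: protein -> group.
GROUP_OF = {
    "yeast:CDC28": "OG1", "human:CDK1": "OG1",
    "yeast:CLB2": "OG2", "human:CCNB1": "OG2",
}
PENALTY = 0.8  # assumed down-weighting for transferred evidence

def transfer(interactions, source, target):
    """Map scored interactions from the source species onto the target
    species through shared orthologous groups."""
    members = {}
    for protein, group in GROUP_OF.items():
        if protein.startswith(target + ":"):
            members.setdefault(group, []).append(protein)

    transferred = {}
    for (a, b), score in interactions.items():
        if not (a.startswith(source + ":") and b.startswith(source + ":")):
            continue
        group_a, group_b = GROUP_OF.get(a), GROUP_OF.get(b)
        for ta, tb in product(members.get(group_a, []), members.get(group_b, [])):
            key = tuple(sorted((ta, tb)))
            # Keep the best transferred score if several sources map to one pair.
            transferred[key] = max(transferred.get(key, 0.0), score * PENALTY)
    return transferred

yeast_links = {("yeast:CDC28", "yeast:CLB2"): 0.9}
human_links = transfer(yeast_links, "yeast", "human")
```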

Exercise 1

Query STRING for human TYMS

Show network in confidence mode

Show up to 20 interaction partners

Show only experimental evidence

Show also low-confidence links

text mining

>10 km

too much to read

exponential growth

~40 seconds per paper

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive lexicon

cyclin dependent kinase 1

CDC2

orthographic variation

expansion rules

prefixes and suffixes

CDC2

hCdc2

flexible matching

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

“black list”

SDS
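The dictionary-based tagging steps above can be sketched as one small matcher: a lexicon of names, an expansion rule for prefixes, flexible matching of spaces and hyphens, and a black list for ambiguous names. The tiny lexicon, the single `h` prefix, and the case-insensitive matching are assumptions made for illustration only.

```python
import re

# Hypothetical mini-lexicon mapping names to an identifier.
LEXICON = {"cyclin dependent kinase 1": "CDK1", "CDC2": "CDK1", "SDS": "SDS"}
BLACKLIST = {"sds"}  # names too ambiguous to tag, such as the detergent SDS
PREFIXES = ("h",)    # assumed expansion rule: optional species prefix

def name_pattern(name):
    """Build a regex tolerating space/hyphen variation and an optional prefix."""
    tokens = re.split(r"[\s-]+", name)
    body = r"[\s-]?".join(re.escape(t) for t in tokens)
    prefix = "(?:" + "|".join(re.escape(p) for p in PREFIXES) + ")?"
    return re.compile(r"\b" + prefix + body + r"\b", re.IGNORECASE)

def tag(text):
    """Return (start, end, matched_text, identifier) for each lexicon hit."""
    hits = []
    for name, identifier in LEXICON.items():
        if name.lower() in BLACKLIST:
            continue
        for m in name_pattern(name).finditer(text):
            hits.append((m.start(), m.end(), m.group(), identifier))
    return sorted(hits)

text = "hCdc2 (cyclin-dependent kinase 1) was separated by SDS gel electrophoresis."
hits = tag(text)
```

Both `hCdc2` and `cyclin-dependent kinase 1` resolve to the same identifier, while `SDS` is skipped because it is on the black list.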

text corpus

~22 million abstracts

Medline

~2 million full-text articles

restricted access

information extraction

co-mentioning

counting

within documents

within paragraphs

within sentences

scoring scheme

score calibration
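A rough sketch of co-mention counting: each pair of entities earns a weighted count for appearing in the same document, the same paragraph, and the same sentence, and the count is then blended with an observed-over-expected ratio so that frequently mentioned entities are not favored. The weights, the exponent `ALPHA`, and the exact blending formula are placeholder assumptions, and the resulting scores would still need calibration against a gold standard.

```python
from collections import defaultdict
from itertools import combinations

# Placeholder weights for co-mentions at each level (assumed values).
W_DOC, W_PAR, W_SENT = 1.0, 2.0, 0.2
ALPHA = 0.6  # assumed exponent balancing raw count against enrichment

def pairs(entities):
    """All unordered entity pairs, each as an alphabetically sorted tuple."""
    return set(combinations(sorted(entities), 2))

def weighted_counts(documents):
    """documents: list of documents; a document is a list of paragraphs;
    a paragraph is a list of sentences; a sentence is a set of entity names."""
    counts = defaultdict(float)
    for doc in documents:
        sent_pairs, par_pairs, doc_entities = set(), set(), set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= sentence
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        # Each level contributes at most once per document.
        for pair in pairs(doc_entities):
            counts[pair] += W_DOC
            counts[pair] += W_PAR * (pair in par_pairs)
            counts[pair] += W_SENT * (pair in sent_pairs)
    return counts

def score(counts):
    """Blend each weighted count with its observed-over-expected ratio."""
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (a, b), c in counts.items():
        marginal[a] += c
        marginal[b] += c
    return {
        (a, b): c ** ALPHA * (c * total / (marginal[a] * marginal[b])) ** (1 - ALPHA)
        for (a, b), c in counts.items()
    }

# Two hypothetical documents with already-tagged entities.
documents = [
    [[{"TYMS", "DHFR"}, {"TP53"}]],  # one paragraph, two sentences
    [[{"TYMS"}], [{"TP53"}]],        # two paragraphs, one sentence each
]
counts = weighted_counts(documents)
scores = score(counts)
```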

NLP

Natural Language Processing

grammatical analysis

part-of-speech tagging

what you learned in school

pronoun pronoun verb preposition noun

semantic tagging

words of special interest

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Saric et al., Proceedings of ACL, 2004

more precise

worse recall

web resources

general approach

text mining

curated knowledge

experimental data

computational predictions

common identifiers

quality scores

score calibration

visualization

STRING

protein networks

Szklarczyk et al., Nucleic Acids Research, 2015

string-db.org

STITCH

chemical networks

Kuhn et al., Nucleic Acids Research, 2014

stitch-db.org

PubChem

pathways

drug targets

high-throughput screens

COMPARTMENTS

subcellular localization

Binder et al., Database, 2014

compartments.jensenlab.org

Gene Ontology

GO annotations

UniProtKB

model organism databases

sequence-based predictions

PSORT

YLoc

TISSUES

tissue expression

Santos et al., submitted, 2015

tissues.jensenlab.org

Brenda Tissue Ontology

high-throughput studies

EST libraries

microarrays

RNA-Seq

mass spectrometry

immunohistochemistry

DISEASES

disease associations

Frankild et al., Methods, 2015

diseases.jensenlab.org

Disease Ontology

genetics studies

Genetics Home Reference

NHGRI GWAS Catalog

DistiLD

cancer mutation data

COSMIC

Exercise 2

Find TYMS-related diseases: http://diseases.jensenlab.org

Find some inhibitors of TYMS: http://stitch-db.org

Assess their tissue specificity: http://tissues.jensenlab.org

thank you!