Integration of heterogeneous data

Posted on 10-May-2015

DESCRIPTION

10th Course in Bioinformatics and Systems Biology for Molecular Biologists, Schloss Hohenkammer, Hohenkammer, Germany, March 15, 2010.

TRANSCRIPT

Integration of heterogeneous data

Lars Juhl Jensen

data mining

text mining

interaction networks

Kuhn et al., Nucleic Acids Research, 2010

parts lists

630 genomes

2.5 million proteins

~74,000 small molecules

many databases

different formats

model organism databases

Ensembl

RefSeq

PubChem

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

conserved neighborhood

operons

Korbel et al., Nature Biotechnology, 2004

bidirectional promoters

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
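A phylogenetic profile records in which genomes a gene is present; genes with similar profiles are candidates for a functional link. The sketch below is only an illustration of that idea: the Jaccard index, the function name and the toy genome labels are assumptions, not the scoring scheme used in the cited work.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles.

    Each profile is given as the set of genomes in which the gene occurs.
    Jaccard is an illustrative choice, not the method from the slides.
    """
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0  # two empty profiles carry no signal
    return len(a & b) / len(union)


# Hypothetical toy profiles: gene name -> genomes containing the gene.
profiles = {
    "geneA": {"genome1", "genome2", "genome5"},
    "geneB": {"genome1", "genome2", "genome5"},
    "geneC": {"genome3", "genome4"},
}

same = profile_similarity(profiles["geneA"], profiles["geneB"])
different = profile_similarity(profiles["geneA"], profiles["geneC"])
```

Genes that always occur together (geneA and geneB here) get the highest similarity, while genes that never share a genome get the lowest.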

experimental data

gene coexpression
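Coexpression is commonly quantified as the correlation between two expression profiles measured across the same conditions. As a minimal sketch, assuming Pearson correlation is the measure of interest (the slides do not say which one is used), it can be computed as follows; the function name and the toy values are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values for two genes across three conditions.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```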

protein interactions

Jensen & Bork, Science, 2008

genetic interactions

Beyer et al., Nature Reviews Genetics, 2007

small molecule interactions

in vitro binding assays

cellular activity assays

many databases

GEO (Gene Expression Omnibus)

BIND (Biomolecular Interaction Network Database)

BioGRID (General Repository for Interaction Datasets)

DIP (Database of Interacting Proteins)

IntAct

MINT (Molecular Interactions Database)

HPRD (Human Protein Reference Database)

PDB (Protein Data Bank)

BindingDB

CTD (Comparative Toxicogenomics Database)

DrugBank

GLIDA (GPCR-Ligand Database)

MATADOR

PDSP Ki (Psychoactive Drug Screening Program)

PharmGKB (Pharmacogenomics Knowledge Base)

different formats

different identifiers

partially redundant

Campillos & Kuhn et al., Science, 2008

curated knowledge

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

Gene Ontology

MIPS (Munich Information Center for Protein Sequences)

KEGG (Kyoto Encyclopedia of Genes and Genomes)

MetaCyc

Reactome

PID (NCI-Nature Pathway Interaction Database)

high confidence

different formats

different identifiers

partially redundant

literature mining

>10 km

human readable

not computer readable

different names

text corpus

MEDLINE

SGD (Saccharomyces Genome Database)

The Interactive Fly

OMIM (Online Mendelian Inheritance in Man)

thesaurus

co-mentioning

statistical methods
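Co-mentioning can be sketched as counting how often two entities from the thesaurus appear in the same text. The example below is a simplified illustration under stated assumptions: the function name, the tiny thesaurus and the three sentences are made up, and a real pipeline would score the counts statistically rather than use them raw.

```python
import re
from collections import Counter
from itertools import combinations


def count_comentions(documents, thesaurus):
    """Count documents in which each pair of entities is mentioned together.

    thesaurus maps a lowercase synonym to a canonical identifier, so that
    different names for the same entity are counted as one.
    """
    pair_counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        found = {thesaurus[token] for token in tokens if token in thesaurus}
        # Sort so that each unordered pair has a single key.
        for pair in combinations(sorted(found), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical thesaurus and corpus, for illustration only.
thesaurus = {"hap1": "HAP1", "hap1p": "HAP1", "cyc1": "CYC1", "cyc7": "CYC7"}
documents = [
    "HAP1 controls CYC1 expression.",
    "CYC1 and CYC7 are cytochrome genes.",
    "Hap1p binds the CYC1 promoter.",
]
counts = count_comentions(documents, thesaurus)
```

Because "HAP1" and "Hap1p" resolve to the same canonical identifier, the first and third sentences both contribute to the same pair.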

NLP (Natural Language Processing)

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

restricted access

Reflect

augmented browsing

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

integration

the easy problems

many databases

different formats

different identifiers

partially redundant

parsers

thesaurus

bookkeeping
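The "easy" part of integration amounts to parsing each source, translating its identifiers through a thesaurus, and keeping track of which source reported what. A minimal sketch of the last two steps follows; the record layout, the function name and all identifiers are hypothetical.

```python
def merge_interactions(records, synonyms):
    """Map source-specific identifiers to canonical ones and merge duplicates.

    records  : iterable of (source, id_a, id_b) tuples from parsed databases
    synonyms : dict mapping a source-specific identifier to a canonical one
    Returns a dict: canonical pair -> set of sources that reported it.
    Identifiers missing from the thesaurus are kept unchanged.
    """
    merged = {}
    for source, id_a, id_b in records:
        a = synonyms.get(id_a, id_a)
        b = synonyms.get(id_b, id_b)
        key = tuple(sorted((a, b)))  # interactions are treated as undirected
        merged.setdefault(key, set()).add(source)
    return merged


# Hypothetical parsed records from two databases using different identifiers.
records = [
    ("database_one", "alias_x", "alias_y"),
    ("database_two", "proteinY", "proteinX"),
    ("database_two", "proteinX", "proteinZ"),
]
synonyms = {"alias_x": "proteinX", "alias_y": "proteinY"}
merged = merge_interactions(records, synonyms)
```

After mapping, the first two records collapse into one interaction supported by both databases, which is what makes the partial redundancy visible.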

the hard problems

many data types

not comparable

variable quality

raw quality scores

intergenic distances

Korbel et al., Nature Biotechnology, 2004

correlations

reproducibility

von Mering et al., Nucleic Acids Research, 2005

score calibration

gold standard

von Mering et al., Nucleic Acids Research, 2005
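Calibration turns raw quality scores, which are not comparable between data types, into the fraction of predictions that a gold standard confirms at that score level. The sketch below uses simple binning to show the idea; the binning, the function name and the toy data are assumptions, and it treats every pair absent from the gold standard as a negative, which is a simplification.

```python
def calibrate(scored_pairs, gold_positives, bin_edges):
    """Estimate, per raw-score bin, the fraction of pairs in the gold standard.

    scored_pairs   : list of (pair, raw_score)
    gold_positives : set of pairs regarded as true
    bin_edges      : ascending edges; bin i covers [edge[i], edge[i + 1])
    Returns one fraction per bin, or None for a bin without any pairs.
    """
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, raw in scored_pairs:
        for i in range(n_bins):
            if bin_edges[i] <= raw < bin_edges[i + 1]:
                totals[i] += 1
                if pair in gold_positives:
                    hits[i] += 1
                break
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical raw scores and gold standard.
scored_pairs = [
    (("a", "b"), 0.9),
    (("a", "c"), 0.8),
    (("b", "c"), 0.2),
    (("c", "d"), 0.1),
]
gold_positives = {("a", "b")}
calibrated = calibrate(scored_pairs, gold_positives, [0.0, 0.5, 1.01])
```

Once every evidence type is expressed on this common scale, scores from different sources can be compared and combined.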

spread over 630 genomes

transfer by orthology

von Mering et al., Nucleic Acids Research, 2005

two modes

COG mode

von Mering et al., Nucleic Acids Research, 2005

protein mode

von Mering et al., Nucleic Acids Research, 2005

combine all evidence

P = 1-(1-P1)(1-P2)(1-P3) …
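The combination formula above can be read as "the link is wrong only if every individual line of evidence is wrong", which treats the evidence types as independent. A minimal sketch of that calculation follows; the function name and the input checks are illustrative.

```python
def combine_scores(probabilities):
    """Combine per-evidence probabilities as P = 1 - (1 - P1)(1 - P2)...

    Assumes the individual probabilities are independent and calibrated.
    """
    remaining = 1.0  # probability that every line of evidence is wrong
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        remaining *= 1.0 - p
    return 1.0 - remaining


# Two moderately confident sources reinforce each other.
combined = combine_scores([0.5, 0.5])
```

With two sources at 0.5 each, the combined score is 0.75; the combined score is never lower than the strongest single source.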

visualize

Kuhn et al., Nucleic Acids Research, 2010

access

access for humans

web interfaces

access for computers

web services

REST (Representational State Transfer)

SOAP (Simple Object Access Protocol)
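With a REST interface, a program asks for data by requesting a URL that encodes the resource, the output format and the query parameters. The sketch below only builds such a URL and sends nothing; the base URL, the path layout and the parameter names are hypothetical and do not describe the actual interface of any of the resources mentioned here.

```python
from urllib.parse import urlencode


def build_query_url(base_url, output_format, resource, **params):
    """Build a REST-style URL of the form base/format/resource?key=value.

    The path layout is an assumption made for illustration only.
    """
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    return f"{base_url}/{output_format}/{resource}?{query}"


# Hypothetical request: interactions for one yeast gene as tab-separated text.
url = build_query_url(
    "https://example.org/api",
    "tsv",
    "interactions",
    identifier="HAP1",
    species=4932,  # NCBI taxonomy identifier of Saccharomyces cerevisiae
)
```

The resulting string could be passed to any HTTP client, which is what makes REST convenient for scripted access.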

Acknowledgments

STITCH

– Michael Kuhn

– Damian Szklarczyk

– Andrea Franceschini

– Monica Campillos

– Christian von Mering

– Lars Juhl Jensen

– Andreas Beyer

– Peer Bork

Reflect

– Sean O’Donoghue

– Heiko Horn

– Sune Frankild

– Evangelos Pafilis

– Michael Kuhn

– Nigel Brown

– Reinhardt Schneider

STRING

– Christian von Mering

– Michael Kuhn

– Manuel Stark

– Samuel Chaffron

– Chris Creevey

– Jean Muller

– Tobias Doerks

– Philippe Julien

– Alexander Roth

– Milan Simonovic

– Jan Korbel

– Berend Snel

– Martijn Huynen

– Peer Bork
