Integration of heterogeneous data

Lars Juhl Jensen

Posted on 10-May-2015

10th Course in Bioinformatics and Systems Biology for Molecular Biologists, Schloss Hohenkammer, Hohenkammer, Germany, March 15, 2010.

data mining

text mining

interaction networks

Kuhn et al., Nucleic Acids Research, 2010

parts lists

630 genomes

2.5 million proteins

~74,000 small molecules

many databases

different formats

model organism databases

Ensembl

RefSeq

PubChem

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

conserved neighborhood

operons

Korbel et al., Nature Biotechnology, 2004

bidirectional promoters

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
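
A phylogenetic profile can be pictured as a presence/absence vector across genomes, with genes that share similar profiles being candidates for a functional link. The sketch below is only an illustration under that reading: the gene names, the six-genome profiles, and the choice of Jaccard similarity are all assumptions, not something stated on the slides.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (1 = gene present)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over six genomes (toy data, not from the talk).
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 0),
    "geneC": (0, 0, 1, 0, 1, 0),
}

print(profile_similarity(profiles["geneA"], profiles["geneB"]))  # 0.75
print(profile_similarity(profiles["geneA"], profiles["geneC"]))  # 0.0
```

Other similarity measures (for example mutual information) would slot into the same structure.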

experimental data

gene coexpression

protein interactions

Jensen & Bork, Science, 2008

genetic interactions

Beyer et al., Nature Reviews Genetics, 2007

small molecule interactions

in vitro binding assays

cellular activity assays

many databases

GEO: Gene Expression Omnibus

BIND: Biomolecular Interaction Network Database

BioGRID: General Repository for Interaction Datasets

DIP: Database of Interacting Proteins

IntAct

MINT: Molecular Interactions Database

HPRD: Human Protein Reference Database

PDB: Protein Data Bank

BindingDB

CTD: Comparative Toxicogenomics Database

DrugBank

GLIDA: GPCR-Ligand Database

MATADOR

PDSP Ki: Psychoactive Drug Screening Program

PharmGKB: Pharmacogenomics Knowledge Base

different formats

different identifiers

partially redundant

Campillos & Kuhn et al., Science, 2008

curated knowledge

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

Gene Ontology

MIPS: Munich Information Center for Protein Sequences

KEGG: Kyoto Encyclopedia of Genes and Genomes

MetaCyc

Reactome

PID: NCI-Nature Pathway Interaction Database

high confidence

different formats

different identifiers

partially redundant

literature mining

>10 km

human readable

not computer readable

different names

text corpus

MEDLINE

SGD: Saccharomyces Genome Database

The Interactive Fly

OMIM: Online Mendelian Inheritance in Man

thesaurus

co-mentioning

statistical methods
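
Co-mentioning with a statistical score might look roughly like the sketch below: count how often two names appear in the same document and compare that with what would be expected if they were mentioned independently. The toy documents, the function names, and the observed/expected ratio are assumptions chosen for illustration; the slides do not specify which statistic is used.

```python
from collections import Counter
from itertools import combinations


def comention_counts(documents):
    """Count documents mentioning each entity and each unordered entity pair."""
    single = Counter()
    pairs = Counter()
    for entities in documents:
        unique = sorted(set(entities))  # one count per document, stable pair order
        single.update(unique)
        pairs.update(combinations(unique, 2))
    return single, pairs


def observed_over_expected(pair, single, pairs, n_docs):
    """Ratio of observed co-mentions to the count expected under independence."""
    a, b = pair
    expected = single[a] * single[b] / n_docs
    return pairs[pair] / expected


# Toy corpus: each document is the list of names already recognized in it.
docs = [
    ["CYC1", "CYC7", "HAP1"],
    ["CYC1", "HAP1"],
    ["GAL4"],
    ["CYC7", "GAL4"],
]
single, pairs = comention_counts(docs)
print(pairs[("CYC1", "HAP1")])  # 2
print(observed_over_expected(("CYC1", "HAP1"), single, pairs, len(docs)))  # 2.0
```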

NLP: Natural Language Processing

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

restricted access

Reflect

augmented browsing

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

integration

the easy problems

many databases

different formats

different identifiers

partially redundant

parsers

thesaurus

book keeping
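
The "easy problems" (parsers, a thesaurus, and bookkeeping) can be sketched as follows: once each database has been parsed into simple records, a thesaurus maps every name to one canonical identifier, and redundant records are merged while keeping track of their sources. The identifiers `P0001` and `P0002`, the database names, and the record layout are hypothetical placeholders, not real accessions or formats.

```python
def merge_interactions(records, thesaurus):
    """Map names to canonical identifiers and merge redundant interaction records.

    records:   iterable of (name_a, name_b, source_database) tuples
    thesaurus: dict from lower-cased name/synonym to canonical identifier
    """
    merged = {}
    for name_a, name_b, source in records:
        try:
            id_a = thesaurus[name_a.lower()]
            id_b = thesaurus[name_b.lower()]
        except KeyError:
            continue  # unknown name: skipped in this sketch
        key = tuple(sorted((id_a, id_b)))  # interactions are treated as unordered
        merged.setdefault(key, set()).add(source)
    return merged


# Hypothetical thesaurus and parsed records from two made-up databases.
thesaurus = {
    "cyc1": "P0001",
    "iso-1-cytochrome c": "P0001",
    "hap1": "P0002",
}
records = [
    ("CYC1", "HAP1", "db_one"),
    ("hap1", "iso-1-cytochrome c", "db_two"),
    ("CYC1", "UNKNOWN", "db_two"),
]
print(merge_interactions(records, thesaurus))
```

Here the two records that use different names for the same pair collapse into a single entry that remembers both sources.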

the hard problems

many data types

not comparable

variable quality

raw quality scores

intergenic distances

Korbel et al., Nature Biotechnology, 2004

correlations

reproducibility

von Mering et al., Nucleic Acids Research, 2005

score calibration

gold standard

von Mering et al., Nucleic Acids Research, 2005
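
One simple way to picture score calibration against a gold standard is to bin the raw scores of a data type and ask, per bin, what fraction of the scored pairs the gold standard confirms. The sketch below does exactly that on toy data; the binning scheme, the pair names, and the gold-standard set are assumptions, and the published calibration procedure may differ (for example by fitting a smooth curve instead of using bins).

```python
from bisect import bisect_right


def calibrate(scored_pairs, gold_positive, bin_edges):
    """Per raw-score bin, return the fraction of pairs confirmed by the gold standard.

    scored_pairs:  dict mapping a pair to its raw score
    gold_positive: set of pairs considered true by the gold standard
    bin_edges:     sorted thresholds; n edges define n + 1 bins
    """
    totals = [0] * (len(bin_edges) + 1)
    hits = [0] * (len(bin_edges) + 1)
    for pair, raw_score in scored_pairs.items():
        index = bisect_right(bin_edges, raw_score)
        totals[index] += 1
        if pair in gold_positive:
            hits[index] += 1
    # None marks bins that received no pairs at all.
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy raw scores (e.g. correlations) and a made-up gold standard.
scored_pairs = {
    ("a", "b"): 0.90, ("a", "c"): 0.80, ("b", "d"): 0.85,
    ("b", "c"): 0.40, ("c", "d"): 0.30, ("a", "d"): 0.10,
}
gold_positive = {("a", "b"), ("b", "d"), ("b", "c")}

print(calibrate(scored_pairs, gold_positive, bin_edges=[0.5]))
```

With a single threshold at 0.5, one of the three low-scoring pairs and two of the three high-scoring pairs are confirmed, so the calibrated values are about 0.33 and 0.67. Scores calibrated this way are on a common probabilistic scale, which is what makes different data types comparable.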

spread over 630 genomes

transfer by orthology

von Mering et al., Nucleic Acids Research, 2005

two modes

COG mode

von Mering et al., Nucleic Acids Research, 2005

protein mode

von Mering et al., Nucleic Acids Research, 2005

combine all evidence

P = 1 - (1 - P_1)(1 - P_2)(1 - P_3) …
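
The formula above treats each evidence type as an independent chance of being right: the combined probability is one minus the probability that every channel is wrong. A minimal sketch of that calculation follows; the channel names and values are made up for illustration.

```python
from math import prod


def combine_evidence(probabilities):
    """Combine per-channel probabilities as P = 1 - (1 - P_1)(1 - P_2)..."""
    probabilities = list(probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical calibrated scores for one protein pair.
channel_scores = {"neighborhood": 0.4, "coexpression": 0.5, "text_mining": 0.2}
combined = combine_evidence(channel_scores.values())
print(round(combined, 3))  # 1 - 0.6 * 0.5 * 0.8 = 0.76
```

Note that the combined score is never lower than the strongest single channel, and that the independence assumption is what allows weak evidence from several sources to add up.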

visualize

Kuhn et al., Nucleic Acids Research, 2010

access

access for humans

web interfaces

access for computers

web services

REST: Representational State Transfer

SOAP: Simple Object Access Protocol
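
In a REST-style web service, a query is typically just a URL that encodes the method, the output format, and the parameters, so a script can build it from strings. The sketch below shows only that idea: the base URL `https://example.org/api`, the path layout, and the parameter names are hypothetical and do not describe the actual API of any resource named in this talk.

```python
from urllib.parse import urlencode


def build_query_url(base_url, output_format, method, **params):
    """Build a REST-style query URL of the form base/format/method?key=value&..."""
    query = urlencode(sorted(params.items()))  # sorted for a reproducible URL
    return f"{base_url}/{output_format}/{method}?{query}"


# Hypothetical endpoint and parameters, for illustration only.
url = build_query_url(
    "https://example.org/api", "tsv", "interactions",
    identifier="CYC1", limit=10,
)
print(url)  # https://example.org/api/tsv/interactions?identifier=CYC1&limit=10
```

The resulting URL could then be fetched with any HTTP client; for a real service, the path and parameter names should be taken from its own documentation.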

Acknowledgments

STITCH
– Michael Kuhn
– Damian Szklarczyk
– Andrea Franceschini
– Monica Campillos
– Christian von Mering
– Lars Juhl Jensen
– Andreas Beyer
– Peer Bork

Reflect
– Sean O’Donoghue
– Heiko Horn
– Sune Frankild
– Evangelos Pafilis
– Michael Kuhn
– Nigel Brown
– Reinhardt Schneider

STRING
– Christian von Mering
– Michael Kuhn
– Manuel Stark
– Samuel Chaffron
– Chris Creevey
– Jean Muller
– Tobias Doerks
– Philippe Julien
– Alexander Roth
– Milan Simonovic
– Jan Korbel
– Berend Snel
– Martijn Huynen
– Peer Bork

larsjuhljensen