STRING: Prediction of protein networks through integration of diverse large-scale data sets

Lars Juhl Jensen, EMBL Heidelberg

The problem ...

Prediction of protein function

• Homology based methods
– Simple sequence similarity searches (BLAST)
– Profile searches (PSI-BLAST)
– Databases of conserved domains (Pfam, SMART)

• Non-homology based methods working on sequence
– Prediction from sequence derived features (ProtFun)
– Prediction from genomic context (STRING)

• Prediction from high-throughput experimental data
– Microarray gene expression data
– Protein-protein interaction screens
– ...

Prediction of functional associations

“Protein mode”

Separate network for each species

“COG mode”

One network covering all species

STRING provides a protein network based on integration of diverse types of evidence

Genomic Neighborhood

Species Co-occurrence

Gene Fusions

Database Imports

Exp. Interaction Data

Co-expression

Literature Co-mentioning

Score calibration against a common reference

• Many diverse types of evidence
– The quality of each is judged by very different raw scores
– These are all calibrated against the same reference set
– This is the key to obtaining a consistent scoring scheme

• Requirements for a reference
– Must represent a compromise of all the types of evidence
– Broad species coverage

• Our reference is KEGG maps
– Two proteins are “related” if on a common KEGG map
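The calibration idea above can be sketched as follows. This is a minimal illustration, not the STRING implementation: it assumes equal-sized bins of raw scores and reports, per bin, the fraction of protein pairs found on a common KEGG map. The function name `calibrate`, the binning strategy, and the toy data are all hypothetical.

```python
from bisect import bisect_right

def calibrate(raw_scores, same_map, n_bins=4):
    """Return a function mapping a raw score to the fraction of reference
    pairs in its bin that share a KEGG map (assumed binning scheme)."""
    pairs = sorted(zip(raw_scores, same_map))
    size = max(1, len(pairs) // n_bins)
    edges, fractions = [], []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        edges.append(chunk[0][0])  # lowest raw score in this bin
        fractions.append(sum(1 for _, hit in chunk if hit) / len(chunk))

    def to_probability(raw):
        # Pick the last bin whose lower edge is <= raw (first bin if below all).
        idx = max(0, bisect_right(edges, raw) - 1)
        return fractions[idx]

    return to_probability

# Toy data: raw scores of one evidence type and whether each pair is "related".
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
related = [False, False, False, True, False, True, True, True]
to_probability = calibrate(raw, related)
```

Because every evidence type is passed through the same kind of mapping against the same reference, the calibrated values become comparable across evidence types.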

Integrating physical interaction screens

Make binary representation of complexes

Yeast two-hybrid data sets are inherently binary

Calculate score from number of (co-)occurrences

Calculate score from non-shared partners

Calibrate against KEGG maps

Infer associations in other species

Combine evidence from experiments
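A rough sketch of the first two steps for complex purification data: each complex is expanded into all member pairs, and each pair gets a raw score from the (co-)occurrence counts. The slides do not give the scoring formula; the observed/expected ratio used here is an assumption for illustration only, as is the function name `cooccurrence_scores`.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(purifications):
    """Expand complexes into protein pairs and score each pair by
    observed/expected co-occurrence (assumed raw score, not STRING's)."""
    singles, pairs = Counter(), Counter()
    for members in purifications:
        unique = sorted(set(members))
        singles.update(unique)                 # occurrences of each protein
        pairs.update(combinations(unique, 2))  # binary representation
    total = len(purifications)
    return {
        pair: count * total / (singles[pair[0]] * singles[pair[1]])
        for pair, count in pairs.items()
    }

scores = cooccurrence_scores([
    ["A", "B", "C"],
    ["A", "B"],
    ["C", "D"],
    ["A", "B", "D"],
])
```

The raw scores would then be calibrated against KEGG maps like any other evidence type.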

Gene fusion: predicting physical interactions

Detect multiple proteins matching to one protein

Exclude overlapping alignments

Infer associations in other species

Calibrate against KEGG maps
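The first two steps can be sketched as below, assuming alignments are already available as tuples of (query protein, candidate fusion protein, start, end on the fusion protein). The tuple layout, the function name `fusion_candidates`, and the example coordinates are hypothetical; only the logic (several proteins hitting one protein, overlapping alignments excluded) follows the slide.

```python
from itertools import combinations

def fusion_candidates(hits):
    """Return (protein1, protein2, fused_protein) triples where two different
    proteins align to non-overlapping regions of the same protein."""
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((query, start, end))
    candidates = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            overlaps = s1 <= e2 and s2 <= e1  # closed-interval overlap test
            if q1 != q2 and not overlaps:
                candidates.add((min(q1, q2), max(q1, q2), target))
    return candidates

hits = [
    ("trpA", "fused1", 1, 250),
    ("trpB", "fused1", 260, 600),
    ("trpC", "fused1", 200, 400),  # overlaps both other alignments
]
candidates = fusion_candidates(hits)
```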

Mining microarray expression databases

Re-normalize arrays by modern method to remove biases

Build expression matrix

Combine similar arrays by PCA

Construct predictor by Gaussian kernel density estimation

Calibrate against KEGG maps

Infer associations in other species
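One way to picture the kernel density step: estimate the density of expression correlations separately for gene pairs that are related (on a common KEGG map) and unrelated, then combine the two densities with Bayes' rule. This is a sketch under assumptions; the slides do not state the input feature, the bandwidth, or the prior, and the names `gaussian_kde` and `make_predictor` are hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth):
    """One-dimensional Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def make_predictor(related, unrelated, bandwidth=0.1):
    """Return P(related | correlation) from two class-conditional KDEs."""
    f_rel = gaussian_kde(related, bandwidth)
    f_unrel = gaussian_kde(unrelated, bandwidth)
    prior = len(related) / (len(related) + len(unrelated))

    def probability(correlation):
        num = prior * f_rel(correlation)
        den = num + (1 - prior) * f_unrel(correlation)
        return num / den if den > 0 else prior

    return probability

# Toy correlations for pairs on / not on a common KEGG map.
predict = make_predictor(
    related=[0.8, 0.9, 0.85, 0.7],
    unrelated=[0.0, 0.1, -0.1, 0.05, 0.2, -0.2],
)
```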

Gene neighborhood: predicting co-expression

Identify runs of adjacent genes with the same direction

Score each gene pair based on intergenic distances

Infer associations in other species
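A small sketch of the run detection, assuming genes are given as (name, start, end, strand) tuples sorted by start coordinate. The `max_gap` cutoff is a hypothetical parameter, and the second function only returns the summed intergenic distance per pair as a raw measure; how that distance is turned into a score is not stated on the slide.

```python
from itertools import combinations

def find_runs(genes, max_gap=300):
    """Group adjacent genes on the same strand into runs (assumed gap cutoff)."""
    runs, current = [], []
    for gene in genes:
        if current:
            prev = current[-1]
            if gene[3] != prev[3] or gene[1] - prev[2] > max_gap:
                runs.append(current)
                current = []
        current.append(gene)
    if current:
        runs.append(current)
    return [run for run in runs if len(run) > 1]

def pair_distances(run):
    """Summed intergenic distance (bp) between every pair of genes in a run."""
    gaps = [max(b[1] - a[2], 0) for a, b in zip(run, run[1:])]
    return {
        (run[i][0], run[j][0]): sum(gaps[i:j])
        for i, j in combinations(range(len(run)), 2)
    }

genes = [
    ("a", 1, 100, "+"), ("b", 120, 300, "+"), ("c", 310, 500, "+"),
    ("d", 600, 900, "-"), ("e", 2000, 2500, "-"),
]
runs = find_runs(genes)
distances = pair_distances(runs[0])
```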

Co-mentioning in the scientific literature

Associate abstracts with species

Identify gene names in title/abstract

Count (co-)occurrences of genes

Test significance of associations


Infer associations in other species
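The counting and significance steps might look like this. The slides do not say which test is used; a one-sided hypergeometric tail probability is assumed here purely as an example, and `comention_pvalue` is a hypothetical name.

```python
from math import comb

def comention_pvalue(n_abstracts, n_a, n_b, n_both):
    """P(at least n_both co-mentions) if genes A and B were mentioned
    independently of each other (hypergeometric tail, assumed test)."""
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_abstracts - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_abstracts, n_b)

# 10 abstracts; gene A in 3, gene B in 3, both together in all 3.
p = comention_pvalue(10, 3, 3, 3)
```

With these toy counts there is exactly one way out of 120 to place all three mentions of B in the abstracts that mention A, so `p` is 1/120.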

Phylogenetic profile: co-mentioning in genomes

Align all proteins against all

Calculate best-hit profile

Join similar species by PCA

Calculate PC profile distances
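A simplified sketch of the profile steps. It builds a best-hit profile per protein and compares profiles by Euclidean distance; the PCA step that joins similar species is deliberately left out, and both the input layout and the distance measure are assumptions rather than details taken from the slides.

```python
import math

def best_hit_profile(protein, hits, species):
    """hits: {(protein, species): [similarity scores]}.
    Returns the best score per species, 0.0 where there is no hit."""
    return [max(hits.get((protein, sp), ()), default=0.0) for sp in species]

def profile_distance(p, q):
    """Euclidean distance between two profiles (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

species = ["s1", "s2", "s3"]
hits = {
    ("A", "s1"): [0.9, 0.4], ("A", "s3"): [0.8],
    ("B", "s1"): [0.85], ("B", "s3"): [0.8],
    ("C", "s2"): [0.7],
}
profile_a = best_hit_profile("A", hits, species)
profile_b = best_hit_profile("B", hits, species)
profile_c = best_hit_profile("C", hits, species)
```

Proteins A and B occur in the same species and end up close together, while C has a complementary profile and ends up far from both.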


COG based vs. similarity based transfer

• Resolution of the mapping
– COGs result in many-to-many
– Sequence similarity should resolve with better detail

• Our scoring scheme
– Pairwise alignment scores are normalized by self-hit
– These scores are transformed using exp(-k1/x), where k1=0.7
– Missing values are estimated
– Divide each score by the column and row sum

• This gives a quantitative score for protein correspondence
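The scoring scheme can be sketched as below, with two loudly stated assumptions. First, missing values are simply left out instead of being estimated. Second, the slide does not say how the row and column sums are combined; the sketch divides by their total, which is only one possible reading. The function name and input layout are hypothetical; k1=0.7 is from the slide.

```python
import math

K1 = 0.7  # value given on the slide

def correspondence_scores(alignment, self_hit):
    """alignment: {(source, target): raw alignment score};
    self_hit: {source: raw score of the protein aligned to itself}."""
    transformed = {}
    for (src, tgt), raw in alignment.items():
        x = raw / self_hit[src]  # normalize by self-hit
        transformed[(src, tgt)] = math.exp(-K1 / x) if x > 0 else 0.0
    row_sum, col_sum = {}, {}
    for (src, tgt), value in transformed.items():
        row_sum[src] = row_sum.get(src, 0.0) + value
        col_sum[tgt] = col_sum.get(tgt, 0.0) + value
    # Assumed normalization: divide by row sum plus column sum.
    return {
        (src, tgt): value / (row_sum[src] + col_sum[tgt])
        for (src, tgt), value in transformed.items()
        if value > 0
    }

unique = correspondence_scores({("a1", "b1"): 100.0}, {"a1": 100.0})
split = correspondence_scores(
    {("a1", "b1"): 100.0, ("a1", "b2"): 50.0}, {"a1": 100.0}
)
```

Under this reading a protein with a single counterpart scores 0.5, and a protein with two candidate counterparts has its score divided between them in favor of the better alignment.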

[Figure: matrices with axes “Source species” and “Target species”]

Transfer and combination of evidence

• Evidence scores are multiplied by “correspondence scores”

• From each set of closely related species (a clade) only the best scoring evidence of each type is transferred

• The best evidence from each clade is “added” and scaled: score_transfer = k3 * (1 - (1 - k2*clade1) * (1 - k2*clade2) * ...)

• In-species and transferred evidence is “added” and a total combined score calculated
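The transfer rule above, written out as a short sketch. The formula is the one on the slide; the values of k2 and k3 are not given there, so the ones used in the example are placeholders, and the function names and input layout are hypothetical.

```python
from math import prod

def best_per_clade(evidence):
    """evidence: iterable of (clade, evidence_score, correspondence_score).
    Keeps the best correspondence-weighted score from each clade."""
    best = {}
    for clade, score, correspondence in evidence:
        weighted = score * correspondence
        best[clade] = max(best.get(clade, 0.0), weighted)
    return best

def transfer_score(clade_scores, k2, k3):
    """score_transfer = k3 * (1 - (1 - k2*clade1) * (1 - k2*clade2) * ...)"""
    return k3 * (1.0 - prod(1.0 - k2 * s for s in clade_scores))

best = best_per_clade([
    ("clade1", 0.9, 0.5),
    ("clade1", 0.6, 1.0),
    ("clade2", 0.5, 0.8),
])
# k2 = k3 = 1.0 are placeholder values, not the ones used in STRING.
total = transfer_score(best.values(), k2=1.0, k3=1.0)
```

With k2 = k3 = 1 the rule reduces to treating the clades as independent sources: the combined score is one minus the probability that every clade's evidence is wrong.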

Combining multiple types of evidence from several species

The next step in data integration: predicting the type of interaction

Information extraction from PubMed: extracting specific types of associations

• Tokenization and multi word detection

• Part-of-speech tagging

• Semantic labeling
– Gene names
– Cue words for entity recognition
– Verbs for relation extraction

• Named entity chunking
– A CASS grammar recognizes noun chunks relevant for gene transcription
– [nxgene The GAL4 gene]

• Relation chunking
– Our CASS grammar also recognizes relations between entities:

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

• Output and visualization
– TIGERSearch for inspection
– Script for extracting a binary representation of the relations
– Should later go into STRING
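To illustrate what a binary representation of a relation chunk could look like, here is a toy sketch that parses the bracket notation and expands one relation into gene pairs. It is not the script mentioned above: it handles only the single "is controlled by" pattern from the example, and the function names and the "regulates" label are hypothetical.

```python
import re

def parse_chunks(text):
    """Parse '[label word ...]' notation into nested (label, children) tuples."""
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)

    def parse(pos):
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "[":
                label = tokens[pos + 1]
                children, pos = parse(pos + 2)
                items.append((label, children))
            elif tok == "]":
                return items, pos + 1
            else:
                items.append(tok)
                pos += 1
        return items, pos

    return parse(0)[0]

def gene_names(node):
    """Collect words inside nxpg chunks, skipping conjunctions."""
    label, children = node
    names = []
    for child in children:
        if isinstance(child, tuple):
            names.extend(gene_names(child))
        elif label == "nxpg" and child not in ("and", "or"):
            names.append(child)
    return names

def binary_relations(text):
    """Expand one 'X is controlled by Y' chunk into (regulator, label, target)."""
    tree = parse_chunks(text)
    chunks = [item for item in tree if isinstance(item, tuple)]
    words = [item for item in tree if isinstance(item, str)]
    if len(chunks) == 2 and words == ["is", "controlled", "by"]:
        target, regulator = chunks
        return [(r, "regulates", t)
                for r in gene_names(regulator) for t in gene_names(target)]
    return []

relations = binary_relations(
    "[nxexpr The expression of [nxgene the cytochrome genes "
    "[nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]"
)
```

One relation chunk naming two target genes yields two binary relations, which is why the number of binary relations can exceed the number of relation chunks.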

We extract from active, passive, and nominalized sentence constructs

[nx_prom the ATR1 promoter region] [contain contains] [nx_uas_pt [dt-a a] [bs binding site] [for for] [nx_activator the GCN4 activator protein]]

[nx_expr RNR1 expression] [bez is] [repv reduced] [by by] [nx_oprd CLN1 or CLN2 overexpression]

[dt-the the] [binding binding] [of of] [nx_prot GCN4 protein] [to to] [nx_prom the SER1 promoter in vitro]

A high confidence regulatory network

• We manage to extract a satisfactory number of relations
– 422 relation chunks
– 597 binary relations
– 441 unique binary relations

• Activation/repression assigned for ~50% of relations

• High accuracy: 83-90% on event extraction

• “Arrows” generally point from known transcription factors to other genes

More STRING to come

• Adding more large scale data sets and more species

• New types of genomic context evidence
– White seminar by Jan Korbel in May

• Assign specific interaction types to functional associations
– Expand text mining to cover more interaction types
– Predict interaction types from evidence types

• Interpreting the network
– Discover functional modules/pathways
– Network topology and network motifs
– White seminar by Christian von Mering in June

Acknowledgments

• The STRING team
– Christian von Mering
– Berend Snel
– Martijn Huynen
– Daniel Jaeggi
– Steffen Schmidt
– Mathilde Foglierini
– Peer Bork

• ArrayProspector web service
– Julien Lagarde
– Chris Workman

• NetView visualization tool
– Sean Hooper

• Text mining together with EML
– Jasmin Saric
– Rossitza Ouzounova
– Isabel Rojas

• All my other “partners in crime” on various projects
– The Steinmetz Group
– The Furlong Group

• Web resources
– string.embl.de
– www.bork.embl.de/ArrayProspector
– www.bork.embl.de/synonyms

Thank you!
