prediction of protein function from sequence derived protein features

Center for Biological Sequence Analysis Prediction of Protein Function from Sequence Derived Protein Features Lars Juhl Jensen

Upload: lars-juhl-jensen

Post on 11-May-2015

DESCRIPTION

Technical University of Denmark, Lyngby, October 23, 2002

TRANSCRIPT

Page 1: Prediction of protein function from sequence derived protein features

Center for Biological Sequence Analysis

Prediction of Protein Function from Sequence Derived Protein Features

Lars Juhl Jensen

Page 2: Prediction of protein function from sequence derived protein features

Function unknown for 40% of human proteins

Page 3: Prediction of protein function from sequence derived protein features

Pairwise alignment

>carp Cyprinus carpio growth hormone 210 aa vs.

>chicken Gallus gallus growth hormone 216 aa

scoring matrix: BLOSUM50, gap penalties: -12/-2

40.6% identity; Global alignment score: 487

10 20 30 40 50 60 70

carp MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD

:: . : ...:.: . : :. . :: :::.:.:::: :::. ..:: . .::..: .: .:: :.

chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE

10 20 30 40 50 60 70 80

80 90 100 110 120 130 140 150

carp YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN

: ::.:::..:..: ..:::.:. ::.:: : : ::. .:.:. :. ... ::: ::. ::..:.. : .: .

chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G

90 100 110 120 130 140 150 160

170 180 190 200 210

carp DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL

.: : .. : . . .:. : ... ::.:::::.:::::::.: .::: .::::.

chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI

170 180 190 200 210
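The identity figure reported for an alignment like the one above can be sketched as follows. This is a minimal illustration, not the alignment program's own code: the function name `percent_identity` is hypothetical, and dividing by the number of alignment columns is an assumption (tools differ in the denominator they use, so the result need not reproduce the 40.6% quoted on the slide).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both rows hold the same residue
    # and that residue is not a gap character.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b)
        if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# First six columns of the carp/chicken alignment: M and A match, the rest do not.
print(round(percent_identity("MA--RV", "MAPGSW"), 1))
```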

Page 4: Prediction of protein function from sequence derived protein features

Functional assignment: alignment versus prediction

Alignment is good for transferring knowledge about the function of homologous proteins

For orphan proteins there is no knowledge to transfer

Orphan sequences must thus be handled by true prediction tools rather than alignment

Develop a prediction method that works for orphans but only requires sequence input

Assign a possible function to as many of the orphans as possible

Screen the human genome for novel pharmaceutical targets such as transcription factors and receptors

Page 5: Prediction of protein function from sequence derived protein features

The paradigm: sequence to structure to function

Structure does play a very important role for the function of proteins

Structure is not very useful for prediction of protein function
• For proteins of unknown function, the structure is rarely known
• Prediction of 3D structure from sequence is a very difficult unsolved problem
• Prediction of protein function from structure is by many considered an even harder problem

Predicted secondary structure/fold can be used

Page 6: Prediction of protein function from sequence derived protein features

1AOZ (129 aa) vs. 1PLC (99 aa)

scoring matrix: BLOSUM50, gap penalties: -12/-2

15.5% identity; Global alignment score: -23

10 20 30 40 50 60

1AOZ SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH

.. .. : ... . . ..: . :...: . .: ...:.

1PLC ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD

10 20 30 40

70 80 90 100 110 120

1AOZ WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI

.: :. . . : . :::: .. . .:. : : ::. :..

1PLC EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT

50 60 70 80 90

1AOZ VDPPQGKKE

:.

1PLC VN-------

Page 7: Prediction of protein function from sequence derived protein features

An enzyme and a non-enzyme from the Cupredoxin superfamily

Page 8: Prediction of protein function from sequence derived protein features

Function prediction from post translational modifications

Proteins with similar function may not be related in sequence

Still they must perform their function in the context of the same cellular machinery

Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function

Page 9: Prediction of protein function from sequence derived protein features

Functional classes predicted

Functional role (Monica Riley categories)
• The original scheme had 14 categories
• We reduce it to 12 categories by skipping the category “other” and combining replication and transcription

Enzyme prediction
• Enzyme vs. non-enzyme
• Major enzyme class in the EC system

Gene Ontology
• A subset of classes can be predicted

Systems related categories
• For example “cell cycle regulated”

Page 10: Prediction of protein function from sequence derived protein features

The concept of ProtFun

Predict as many biologically relevant features as we can from the sequence

Train artificial neural networks for each category, also optimizing the feature combinations

Assign a probability for each category from the NN outputs
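The first step above, deriving features from the sequence alone, can be illustrated with two very simple physical/chemical features. This is a minimal sketch and not the ProtFun feature set: the function name `charge_features` and the choice of D/E as negative and K/R as positive residues are assumptions made for illustration.

```python
from collections import Counter

# Assumed residue groupings for this sketch.
NEGATIVE = set("DE")  # aspartate, glutamate
POSITIVE = set("KR")  # lysine, arginine


def charge_features(sequence: str) -> dict:
    """Return length plus the fractions of negative and positive residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    negative = sum(counts[aa] for aa in NEGATIVE)
    positive = sum(counts[aa] for aa in POSITIVE)
    return {
        "length": n,
        "negative_fraction": negative / n,
        "positive_fraction": positive / n,
    }


print(charge_features("DEKRAA"))
```

Each such number would become one input to a neural network; only the sequence is needed to compute it.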

Page 11: Prediction of protein function from sequence derived protein features

Training of neural networks

Human protein sequences from SWISS-PROT were assigned to functional classes based on their keywords by using the EUCLID dictionary

The set of sequences was divided into a test and a training set with no significant sequence similarity between the two sets

Neural networks were first trained for single features and subsequently for combinations of the best performing features
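Going from single features to combinations of the best performing ones can be sketched as a greedy forward selection. The slides do not say which search strategy was used, so this is an assumed stand-in: `greedy_feature_combination`, `toy_evaluate` and the toy gains are hypothetical, and `evaluate` stands for "train a network on these features and score it on the test set".

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def greedy_feature_combination(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    max_features: int = 5,
) -> Tuple[List[str], float]:
    """Add one feature at a time, keeping it only if the score improves."""
    remaining = set(features)
    chosen: List[str] = []
    best_score = float("-inf")
    while remaining and len(chosen) < max_features:
        # Score every one-feature extension of the current combination.
        scored = [
            (evaluate(frozenset(chosen + [f])), f) for f in sorted(remaining)
        ]
        score, feature = max(scored)
        if score <= best_score:
            break  # no extension helps, so stop
        chosen.append(feature)
        remaining.remove(feature)
        best_score = score
    return chosen, best_score


# Toy scoring: invented per-feature gains minus a cost per input feature.
toy_gain = {
    "transmembrane_helices": 0.30,
    "signal_peptide": 0.20,
    "hydrophobicity": 0.05,
}


def toy_evaluate(combo: FrozenSet[str]) -> float:
    return sum(toy_gain[f] for f in combo) - 0.08 * len(combo)


print(greedy_feature_combination(toy_gain, toy_evaluate))
```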

Page 12: Prediction of protein function from sequence derived protein features

Prediction performance on cellular role categories

Page 13: Prediction of protein function from sequence derived protein features

Page 14: Prediction of protein function from sequence derived protein features

An enzyme and a non-enzyme from the Cupredoxin superfamily

Page 15: Prediction of protein function from sequence derived protein features

# Functional category             1AOZ   1PLC
Amino_acid_biosynthesis           0.126  0.070
Biosynthesis_of_cofactors         0.100  0.075
Cell_envelope                     0.429  0.032
Cellular_processes                0.057  0.059
Central_intermediary_metabolism   0.063  0.041
Energy_metabolism                 0.126  0.268
Fatty_acid_metabolism             0.027  0.072
Purines_and_pyrimidines           0.439  0.088
Regulatory_functions              0.102  0.019
Replication_and_transcription     0.052  0.089
Translation                       0.079  0.150
Transport_and_binding             0.032  0.052

# Enzyme/nonenzyme                1AOZ   1PLC
Enzyme                            0.773  0.310
Nonenzyme                         0.227  0.690

# Enzyme class                    1AOZ   1PLC
Oxidoreductase (EC 1.-.-.-)       0.077  0.077
Transferase (EC 2.-.-.-)          0.260  0.099
Hydrolase (EC 3.-.-.-)            0.114  0.071
Lyase (EC 4.-.-.-)                0.025  0.020
Isomerase (EC 5.-.-.-)            0.010  0.068
Ligase (EC 6.-.-.-)               0.017  0.017

Similar structure different functions

Many examples exist of structurally similar proteins which have different functions

Two PDB structures from the Cupredoxin superfamily were shown
• 1AOZ is an enzyme
• 1PLC is not an enzyme

Despite their structural similarity, our method correctly predicts 1AOZ as an enzyme and 1PLC as a non-enzyme
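The enzyme/non-enzyme call for the two proteins can be read off the probabilities on this slide. The sketch below simply takes the class with the higher probability; that decision rule and the helper name `most_probable_class` are assumptions for illustration, while the numbers are the ones shown for 1AOZ and 1PLC.

```python
# Enzyme/nonenzyme probabilities as shown on the slide.
predictions = {
    "1AOZ": {"Enzyme": 0.773, "Nonenzyme": 0.227},
    "1PLC": {"Enzyme": 0.310, "Nonenzyme": 0.690},
}


def most_probable_class(scores: dict) -> str:
    """Return the class name with the highest probability."""
    return max(scores, key=scores.get)


calls = {protein: most_probable_class(s) for protein, s in predictions.items()}
print(calls)  # {'1AOZ': 'Enzyme', '1PLC': 'Nonenzyme'}
```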

Page 16: Prediction of protein function from sequence derived protein features

Evolution conserves protein features and function

Protein features are more conserved between orthologs than paralogs

This leads to ProtFun predicting orthologs to be more likely to share function than paralogs

That prediction is fully consistent with the notion that it is best to infer function from orthologous proteins

Page 17: Prediction of protein function from sequence derived protein features

ProtFun performance for other organisms

Our predictors work in general for eukaryotes

Some categories work quite well for prokaryotes
• Most metabolism categories
• Transport and binding

While other categories fail
• Energy metabolism
• Regulatory functions

[Figure: prediction performance by functional category and organism]

Organisms: hsap, dmel, cele, atha, scer, spom, ssol, aful, mthe, phor, mtub, rpxx, nmen, ecol, hinf, cjej, tmar, bsub, ctra, aqua, synec

Functional categories: Amino acid biosynthesis, Biosynthesis of cofactors, Cell envelope, Cellular processes, Central intermediary metab., Energy metabolism, Fatty acid metabolism, Purines and pyrimidines, Regulatory functions, Replication and transcription, Translation, Transport and binding

Page 18: Prediction of protein function from sequence derived protein features

Mapping category performances onto input features

[Figure, two panels sharing the same organism axis]

Organisms: hsap, dmel, cele, atha, scer, spom, ssol, aful, mthe, phor, mtub, rpxx, nmen, ecol, hinf, cjej, tmar, bsub, ctra, aqua, synec

Panel 1, functional categories: Amino acid biosynthesis, Biosynthesis of cofactors, Cell envelope, Cellular processes, Central intermediary metab., Energy metabolism, Fatty acid metabolism, Purines and pyrimidines, Regulatory functions, Replication and transcription, Translation, Transport and binding

Panel 2, input features: Extinction coefficient, Hydrophobicity, Negative residues, Positive residues, O-glycosylation, S/T-phosphorylation, Y-phosphorylation, N-glycosylation, PEST regions, Secondary structure, Subcellular localization, Low complexity regions, Signal peptides, Transmembrane helices

Page 19: Prediction of protein function from sequence derived protein features

Performance contribution of sequence derived features

The correlations between features and function are conserved for eukaryotes

Some correlations extend to archaea and bacteria
• Physical/chemical properties
• Secondary structure and transmembrane helices

Other correlations only hold for eukaryotes
• PTMs and subcellular localization features

[Figure: performance contribution by input feature and organism]

Organisms: hsap, dmel, cele, atha, scer, spom, ssol, aful, mthe, phor, mtub, rpxx, nmen, ecol, hinf, cjej, tmar, bsub, ctra, aqua, synec

Input features: Extinction coefficient, Hydrophobicity, Negative residues, Positive residues, O-glycosylation, S/T-phosphorylation, Y-phosphorylation, N-glycosylation, PEST regions, Secondary structure, Subcellular localization, Low complexity regions, Signal peptides, Transmembrane helices

Page 20: Prediction of protein function from sequence derived protein features

Are our classes meaningful?

Page 21: Prediction of protein function from sequence derived protein features

A better classification system: the Gene Ontology

Standardized by the Gene Ontology Consortium

Proteins can belong to multiple classes

Different kinds of function can be annotated:
• Molecular function
• Biological process
• Cellular component

GO assigns the “function” at several levels of detail
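Because function is assigned at several levels of detail, a protein annotated with a specific class also belongs to every more general class above it. A minimal sketch of that propagation follows. The parent links in `TOY_PARENTS` are assumed for illustration and are not copied from the Gene Ontology; `with_ancestors` is a hypothetical helper.

```python
# Toy hierarchy: child term -> list of more general parent terms (assumed links).
TOY_PARENTS = {
    "Voltage-gated_ion_channel": ["Ion_channel"],
    "Cation_channel": ["Ion_channel"],
    "Ion_channel": ["Transporter"],
    "Transporter": [],
}


def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already expanded via another path
        seen.add(term)
        stack.extend(parents.get(term, []))
    return seen


print(sorted(with_ancestors(["Voltage-gated_ion_channel"], TOY_PARENTS)))
```

A term may list several parents, so the same traversal also covers proteins and classes that sit in more than one branch.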

Page 22: Prediction of protein function from sequence derived protein features

Training of the Gene Ontology predictor

GO numbers were assigned to all human SWISS-PROT and TREMBL entries based on matches to InterPro

Classes annotated to fewer than 20 different InterPro families were discarded

The sequences were split into five sets of equal size such that significant similarity exists only within sets, not between sets

Using this data set, neural networks were trained in sets of five, constituting a five-fold cross-validation

Single feature neural nets were first trained on each remaining category

Neural networks using combinations of features were trained on promising categories resulting in 14 good predictors
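The similarity-aware five-way split can be sketched as below, assuming the sequences have already been grouped into clusters of mutually similar sequences (how that clustering was done is not stated on the slide). Whole clusters are placed into folds, so no similar pair ends up in different sets. The function names and the greedy balancing are assumptions for illustration.

```python
def assign_clusters_to_folds(clusters, n_folds=5):
    """Place whole clusters into folds, always filling the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    # Largest clusters first gives a more even split.
    for cluster in sorted(clusters, key=len, reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(cluster)
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Hypothetical sequence identifiers grouped by similarity.
toy_clusters = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"], ["d1"], ["e1"], ["f1"]]
toy_folds = assign_clusters_to_folds(toy_clusters)
for train, test in cross_validation_splits(toy_folds):
    print(len(train), len(test))
```

The folds are only approximately equal in size here; exact balance depends on the cluster sizes.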

Page 23: Prediction of protein function from sequence derived protein features

Prediction performance on Gene Ontology categories

Predicts many pharmaceutically interesting classes

70% of hormones and receptors can be predicted at a false positive rate of only 5%

All categories can be predicted with a sensitivity of 50% at a false positive rate of 10%
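Sensitivity and false positive rate, the two quantities quoted above, follow from counting predictions at a chosen score threshold. The sketch below uses invented labels and scores purely to show the arithmetic; `sensitivity_and_fpr` is a hypothetical helper, not part of ProtFun.

```python
def sensitivity_and_fpr(labels, scores, threshold):
    """Return (sensitivity, false positive rate) at the given threshold."""
    tp = fn = fp = tn = 0
    for is_member, score in zip(labels, scores):
        predicted = score >= threshold
        if is_member and predicted:
            tp += 1
        elif is_member:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)          # fraction of true members found
    false_positive_rate = fp / (fp + tn)  # fraction of non-members flagged
    return sensitivity, false_positive_rate


# Invented toy data: two class members, four non-members.
toy_labels = [True, True, False, False, False, False]
toy_scores = [0.9, 0.4, 0.6, 0.2, 0.1, 0.3]
print(sensitivity_and_fpr(toy_labels, toy_scores, threshold=0.5))  # (0.5, 0.25)
```

Raising the threshold lowers the false positive rate at the cost of sensitivity, which is the trade-off behind figures such as "70% at a false positive rate of 5%".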

Page 24: Prediction of protein function from sequence derived protein features

Features usage

Transmembrane helices important for prediction of
• Receptors
• Transporters
• Ion channels

Subcellular localization good for predicting
• Receptors
• Transcription (regulation)

Page 25: Prediction of protein function from sequence derived protein features

############## ProtFun 2.0 predictions ##############

>ENSP00000257015

# Functional category                Prob     Odds
Amino_acid_biosynthesis              0.021    0.955
Biosynthesis_of_cofactors            0.032    0.444
Cell_envelope                     => 0.661   10.836
Cellular_processes                   0.039    0.534
Central_intermediary_metabolism      0.042    0.667
Energy_metabolism                    0.043    0.478
Fatty_acid_metabolism                0.043    3.308
Purines_and_pyrimidines              0.164    0.675
Regulatory_functions                 0.014    0.087
Replication_and_transcription        0.020    0.075
Translation                          0.033    0.750
Transport_and_binding                0.834    2.034

# Enzyme/nonenzyme                   Prob     Odds
Enzyme                               0.202    0.705
Nonenzyme                         => 0.798    1.118

# Enzyme class                       Prob     Odds
Oxidoreductase (EC 1.-.-.-)          0.055    0.264
Transferase (EC 2.-.-.-)             0.032    0.093
Hydrolase (EC 3.-.-.-)               0.077    0.243
Isomerase (EC 4.-.-.-)               0.020    0.426
Ligase (EC 5.-.-.-)                  0.010    0.313
Lyase (EC 6.-.-.-)                   0.017    0.334

# Gene Ontology category             Prob     Odds
Signal_transducer                    0.493    2.304
Receptor                          => 0.734    4.318
Hormone                              0.001    0.154
Structural_protein                   0.001    0.036
Transporter                          0.050    0.459
Ion_channel                          0.035    0.614
Voltage-gated_ion_channel            0.002    0.091
Cation_channel                       0.010    0.217
Transcription                        0.050    0.391
Transcription_regulation             0.021    0.168
Stress_response                      0.364    4.136
Immune_response                      0.477    5.612
Growth_factor                        0.117    8.357
Metabolism                           0.142    0.307
Metal_ion_transport                  0.013    0.394

Possible novel receptor

No BLAST matches against SWISS-PROT with an E-value below 1

A Pfam search yielded a questionable match to TGF-beta type III receptors (E-value 0.28)

While this match is not significant on its own, it supports the predictions
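Output of this kind is easy to post-process when screening many sequences. The parser below is a sketch under an assumed layout: one category per line as `name [=>] prob odds`, with section headers starting with `#`. That layout is inferred from the slide, not from ProtFun documentation, and `parse_protfun_lines` is a hypothetical helper.

```python
import re

# name, optional "=>" marker, probability, odds (assumed line layout).
LINE = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<mark>=>\s+)?(?P<prob>\d+\.\d+)\s+(?P<odds>\d+\.\d+)\s*$"
)


def parse_protfun_lines(lines):
    """Group category lines under their '# ...' section headers."""
    results = {}
    section = None
    for raw in lines:
        line = raw.strip()
        # Skip blanks, the '>' identifier line and the '####' banner.
        if not line or line.startswith(">") or line.startswith("##"):
            continue
        if line.startswith("#"):
            # "# Enzyme/nonenzyme Prob Odds" -> "Enzyme/nonenzyme"
            section = re.sub(r"\s+Prob\s+Odds$", "", line.lstrip("# ").strip())
            results[section] = []
            continue
        match = LINE.match(line)
        if match and section is not None:
            results[section].append({
                "name": match.group("name"),
                "flagged": match.group("mark") is not None,
                "prob": float(match.group("prob")),
                "odds": float(match.group("odds")),
            })
    return results


sample = """\
# Gene Ontology category Prob Odds
Signal_transducer 0.493 2.304
Receptor => 0.734 4.318
Hormone 0.001 0.154
""".splitlines()

parsed = parse_protfun_lines(sample)
flagged = [row["name"] for row in parsed["Gene Ontology category"] if row["flagged"]]
print(flagged)  # ['Receptor']
```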

Page 26: Prediction of protein function from sequence derived protein features

Summary

A method for prediction of “protein function” has been developed for human proteins

This method has been successfully applied to a number of different categorization systems

The feature usage of the neural networks is in agreement with current biological knowledge

Cross-species tests show that the prediction methods developed on human proteins work for most eukaryotes

The evolutionary aspects of “feature space” have been discussed

Page 27: Prediction of protein function from sequence derived protein features

Acknowledgements

Other people at CBS
• David Ussery
• Marie Skovgaard
• Ulrik de Lichtenberg
• Thomas Skøt Jensen
• Anne Mølgaard

The EUCLID team at CNB/CSIC, Madrid
• Alfonso Valencia
• Damien Devos
• Javier Tamames

The ProtFun team at CBS
• Søren Brunak
• Ramneek Gupta
• Can Kesmir
• Kristoffer Rapacki
• Hans-Henrik Stærfeldt
• Henrik Nielsen
• Nikolaj Blom
• Claus A.F. Andersen
• Anders Krogh
• Steen Knudsen
• Chris Workman

Page 28: Prediction of protein function from sequence derived protein features

Thank you!