Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites


M.W. Mak and S.Y. Kung, ICASSP’09

Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites

M.W. Mak, The Hong Kong Polytechnic University

S.Y. Kung, Princeton University


Contents

1. Introduction
   • Proteins and Their Subcellular Locations
   • Importance of Protein Cleavage-Site Prediction
   • Information in Amino Acid Sequences
   • Existing Approaches to Cleavage Site Prediction

2. Conditional Random Field (CRF)
   • CRF for Cleavage Site Prediction

3. Experiments and Results
   • Effectiveness of Different Feature Functions
   • Effect of Varying Window Size
   • Fusion with SignalP


Proteins and Their Destination

• A protein consists of a sequence of amino acids.

• Newly synthesized proteins need to pass across intracellular membranes to reach their destinations.

Source: http://redpoll.pharmacy.ualberta.ca


Signal Peptide

Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.

• A short segment of 20 to 100 amino acids, known as the signal peptide, contains information about the destination (address) of the protein.

• The signal peptide is cleaved off from the resulting mature protein when the protein passes across the membrane.

[Figure: signal peptide, cleavage site, and mature protein. Source: http://nobelprize.org]


Importance of Cleavage Site Prediction

• Defects in the protein sorting process can cause serious diseases, e.g., kidney stones.

Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html


Importance of Cleavage Site Prediction

• Many proteins (e.g., insulin) are produced in living cells. For these proteins to be secreted out of the cell, they are provided with a signal peptide.

Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

[Figure: bioreactor]


Information in Sequences

• Signal peptides contain some regular patterns.

• Although the patterns exhibit substantial variation, they can be detected by machine learning tools.

[Figure: example sequences, annotated with the region rich in hydrophobic amino acids (AA) and the cleavage site]


Existing Methods

• Weight matrices (PrediSi)

• Neural networks (SignalP 1.1)

• HMMs (SignalP 3.0)


Weight Matrices

M A R S S L F T F L C L A V F I N G C L S Q I E Q Q

Score at position t = 16+0+8+6+78+7+7+13+10+6+8+6+0+6+7=178

[Figure: weight matrix of 20 amino acids (AA) by 15 positions, aligned to the sequence around positions t-1, t, t+1]
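As a rough illustration of how a weight-matrix score like the one above is formed, the sketch below sums one weight per window position. The matrix is a hypothetical stand-in: it reuses the fifteen weights shown on the slide and places them on an arbitrarily chosen window of the example sequence, because the actual matrix and window alignment are not given here. The names `toy_matrix`, `window_score`, and `START` are illustrative only.

```python
from typing import Dict, List

SEQ = "MARSSLFTFLCLAVFINGCLSQIEQQ"  # example sequence from the slide
SLIDE_WEIGHTS = [16, 0, 8, 6, 78, 7, 7, 13, 10, 6, 8, 6, 0, 6, 7]  # 15 positions
START = 5  # ASSUMPTION: arbitrary window start, chosen only for illustration


def toy_matrix(seq: str, start: int, weights: List[int]) -> List[Dict[str, int]]:
    """Build a hypothetical matrix: one column per position, amino acid -> weight.

    Only the residue found in the chosen window gets a weight in each column;
    every other amino acid implicitly scores 0.
    """
    window = seq[start:start + len(weights)]
    return [{aa: w} for aa, w in zip(window, weights)]


def window_score(seq: str, start: int, matrix: List[Dict[str, int]]) -> int:
    """Sum the weights of the residues in the window beginning at `start`."""
    window = seq[start:start + len(matrix)]
    if len(window) < len(matrix):
        raise ValueError("window runs past the end of the sequence")
    return sum(col.get(aa, 0) for col, aa in zip(matrix, window))


matrix = toy_matrix(SEQ, START, SLIDE_WEIGHTS)
# Score every window that fits; a predictor would report the best-scoring one.
scores = [window_score(SEQ, s, matrix) for s in range(len(SEQ) - len(matrix) + 1)]
print(scores[START])
```

With a real matrix, each column would hold weights for all 20 amino acids, and the highest-scoring window would indicate the predicted cleavage site.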


SignalP-HMM

Source: Nielsen and Krogh

[Figure: SignalP HMM, with regions labelled Signal Peptide and Mature protein]


Conditional Random Fields

• Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it gives a label for each of the observations.

• Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging
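To make "find the most likely label sequence" concrete, here is a minimal sketch of Viterbi decoding, the dynamic-programming search commonly used with linear-chain CRFs. It assumes the weighted feature sums have already been collapsed into per-position state scores and label-pair transition scores. All scores, the three-word sentence, and the tag names are made up for illustration and do not come from the talk.

```python
from typing import Dict, List, Tuple


def viterbi(state_scores: List[Dict[str, float]],
            trans_scores: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total score.

    state_scores[t][y]        : score of label y at position t
    trans_scores[(y_prev, y)] : score of moving from y_prev to y (default 0.0)
    """
    best = [{y: state_scores[0][y] for y in labels}]  # best score ending in y
    back: List[Dict[str, str]] = []                   # back-pointers per step
    for t in range(1, len(state_scores)):
        cur, ptr = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + trans_scores.get((p, y), 0.0))
            cur[y] = (best[-1][prev] + trans_scores.get((prev, y), 0.0)
                      + state_scores[t][y])
            ptr[y] = prev
        best.append(cur)
        back.append(ptr)
    # Trace back from the best final label.
    y = max(labels, key=lambda lab: best[-1][lab])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


# Hypothetical scores for the sentence "the dog barks".
LABELS = ["DET", "NOUN", "VERB"]
STATE = [
    {"DET": 2.0, "NOUN": 0.1, "VERB": 0.1},
    {"DET": 0.1, "NOUN": 1.0, "VERB": 1.2},
    {"DET": 0.0, "NOUN": 0.8, "VERB": 1.0},
]
TRANS = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0, ("DET", "VERB"): -1.0}

print(viterbi(STATE, TRANS, LABELS))
```

In this toy example, picking the best label at each position independently would tag the second word as VERB; the transition scores are what pull the joint decoding toward DET, NOUN, VERB.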


Advantages of CRF

• Avoids computing the likelihood p(observation|label). Instead, the posterior p(label|observation) is computed directly.

• Able to model long-range dependencies without making the inference problem intractable.

• Guarantees a global optimum.

[Figure: the sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q with a "Depends on" annotation]


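CRF for Cleavage Site Prediction

[Figure: amino acid sequence with its label sequence; the cleavage site is marked]

Label set: $L = \{S, C, M\}$

[Equation: CRF model, annotated with: weights; transition features; state features; n-grams of amino acids; positions $t = 1, \dots, T$, where $T$ is the length of the sequence]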
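For reference, a linear-chain CRF with transition features, state features, and weights is commonly written in the form below. The symbols $\lambda_i$, $\mu_j$, $f_i$, $g_j$, and $Z(\mathbf{x})$ are standard textbook notation assumed here, not necessarily the notation used on the slide.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left(
      \sum_{t=1}^{T} \left[
        \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, \mathbf{x}, t)
        + \sum_{j} \mu_j \, g_j(y_t, \mathbf{x}, t)
      \right]
    \right),
  \qquad y_t \in L
```

Here $f_i$ are transition features, $g_j$ are state features, $\lambda_i$ and $\mu_j$ are their weights, $T$ is the length of the sequence, and $Z(\mathbf{x})$ normalizes over all label sequences.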


CRF for Cleavage Site Prediction

e.g., bi-gram and query sequence = T Q T W A G S H S . . .

$b(\mathbf{x}, 5) = \text{WA}$

e.g., $y_{t-1} = C$ and $y_t = M$
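The bi-gram example on this slide (position 5 of T Q T W A G S H S giving WA) can be sketched as a small feature extractor. The sketch assumes the n-gram ends at 1-indexed position t, which is the reading consistent with that example. The function names and the binary indicator form are illustrative assumptions, not the authors' code.

```python
def ngram(x: str, t: int, n: int) -> str:
    """Return the n-gram of x that ends at 1-indexed position t.

    ASSUMPTION: the n-gram covers positions t-n+1 .. t. An empty string is
    returned when the n-gram would start before the sequence does.
    """
    if not 1 <= t <= len(x):
        raise IndexError("t must lie within the sequence")
    start = t - n
    return x[start:t] if start >= 0 else ""


def state_feature(y_t: str, x: str, t: int, label: str, gram: str) -> int:
    """Hypothetical binary state feature: fires when the label at t is `label`
    and the n-gram ending at t equals `gram`."""
    return int(y_t == label and ngram(x, t, len(gram)) == gram)


def transition_feature(y_prev: str, y_t: str,
                       prev_label: str = "C", label: str = "M") -> int:
    """Hypothetical binary transition feature: fires on prev_label -> label."""
    return int(y_prev == prev_label and y_t == label)


QUERY = "TQTWAGSHS"
print(ngram(QUERY, 5, 2))
```

A trained model would attach one weight to each such indicator; the label/n-gram pairs that receive features are determined by the training data.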


Experiments

• Data: 1937 protein sequences extracted from Swiss-Prot 56.5. The cleavage-site locations of these sequences were biologically determined.

• Ten-fold cross-validation.

• For 1st-order state features, up to 5-grams of amino acids.

• For 2nd-order state features, up to bi-grams of amino acids.

• CRF++ software was used.
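A ten-fold cross-validation split over 1937 sequences could be set up along the following lines. This is only a sketch: the shuffling, the seed, and the helper name `k_fold_indices` are assumptions, and the talk does not say how the folds were actually formed.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds over n items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # fixed seed for repeatability
    folds = [idx[i::k] for i in range(k)]     # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


N_SEQUENCES = 1937  # dataset size stated on the slide
splits = list(k_fold_indices(N_SEQUENCES, k=10))
print(sorted(len(test) for _, test in splits))
```

Each sequence lands in exactly one test fold, so every sequence is predicted once by a model that never saw it during training.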


Results

Effectiveness of Different Feature Functions:

Observations:
(1) Transition features by themselves perform poorly.
(2) Once they are combined with state features, performance improves.

[Figure: results for (Transition only) and (Transition + State)]


Results

Effect of Varying the Window Size:

[Equation: Window Size, defined as a max{·} expression involving n and d]

e.g., query sequence = T Q T W A G S H S . . . with Window Size = 5


Results

Compared with Other Predictors

Observations:
(1) CRF is slightly better than SignalP.
(2) CRF is complementary to SignalP.

Predictor                  Accuracy
SignalP (HMM and NN)       81.88%
PrediSi (Weight matrix)    77.06%
CRF                        82.19%
CRF + SignalP              85.03%
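One way to read the accuracies above is as relative error reduction, i.e., the fraction of a baseline's errors that a better predictor removes. The short calculation below uses only the four reported accuracies. The helper name is illustrative, and the metric itself is a choice made here for interpretation, not one stated in the talk.

```python
ACCURACY = {  # percentages as reported in the table
    "SignalP": 81.88,
    "PrediSi": 77.06,
    "CRF": 82.19,
    "CRF + SignalP": 85.03,
}


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's error rate removed by the new predictor."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


for base in ("SignalP", "CRF"):
    rer = relative_error_reduction(ACCURACY[base], ACCURACY["CRF + SignalP"])
    print(f"CRF + SignalP vs {base}: {rer:.1%} fewer errors")
```

On these figures the fused predictor removes roughly a sixth of the errors made by either predictor alone, which is what one would expect if the two make partly different mistakes.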


Web Server

http://158.132.148.85:8080/CSitePred/faces/Page1.jsp


Available in May 2009
