machine learning algorithms for protein structure prediction

Machine Learning Algorithms for Protein Structure Prediction. Jianlin Cheng, Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, 2006


TRANSCRIPT

Page 1: Machine Learning Algorithms for Protein Structure Prediction

Machine Learning Algorithms for Protein Structure Prediction

Jianlin Cheng

Institute for Genomics and Bioinformatics

School of Information and Computer Sciences

University of California, Irvine

2006

Page 2: Machine Learning Algorithms for Protein Structure Prediction

Outline

I. Introduction

II. 1D Prediction

III. 2D Prediction (Beta-Sheet Topology)

IV. 3D Prediction (Fold Recognition)

V. Publications and Bioinformatics Tools

Page 3: Machine Learning Algorithms for Protein Structure Prediction

Importance of Protein Structure Prediction

Sequence (AGCWY……) → Structure → Function (Cell)

Page 4: Machine Learning Algorithms for Protein Structure Prediction

Four Levels of Protein Structure

Primary Structure (a directional sequence of amino acids/residues)

[Figure: chain running from the N terminus to the C terminus; Residue 1 and Residue 2 joined by a peptide bond]

Secondary Structure (helix, strand, coil)

[Figure: Alpha Helix, Beta Strand / Sheet, Coil]

Page 5: Machine Learning Algorithms for Protein Structure Prediction

Four Levels of Protein Structure

Tertiary Structure

Quaternary Structure (complex)

G Protein Complex

Page 6: Machine Learning Algorithms for Protein Structure Prediction

1D: Secondary Structure Prediction

Sequence: MWLKKFGINLLIGQSV…

Prediction: CCCCHHHHHCCCSSSSS… (C = Coil, H = Helix, S = Strand)

Method: Neural Networks + Alignments

Accuracy: 78%

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
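The accuracy figure above can be read as a per-residue score. The following is a minimal sketch, assuming accuracy means the fraction of residues whose predicted state matches the observed state; the function name and the example strings are illustrative and not taken from the slides.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted state equals the observed state.

    Assumption: both strings use the same alphabet (e.g. C, H, S) and are aligned
    position by position.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical five-residue example: four of five states agree.
example_accuracy = per_residue_accuracy("CCHHS", "CCHHC")
```

The same function applies unchanged to two-state strings such as exposed/buried labels.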

Page 7: Machine Learning Algorithms for Protein Structure Prediction

1D: Solvent Accessibility Prediction

Sequence: MWLKKFGINLLIGQSV…

Prediction: eeeeeeebbbbbbbbeeeebbb… (e = Exposed, b = Buried)

Method: Neural Networks + Alignments

Accuracy: 79%

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

Page 8: Machine Learning Algorithms for Protein Structure Prediction

1D: Disordered Region Prediction Using Neural Networks

Sequence: MWLKKFGINLLIGQSV…

Prediction: OOOOODDDDOOOOO… (D = Disordered Region)

Method: 1D-RNN

93% TP at 5% FP

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005

Page 9: Machine Learning Algorithms for Protein Structure Prediction

1D: Protein Domain Prediction Using Neural Networks

Sequence: MWLKKFGINLLIGQSV…

Prediction: NNNNNNNBBBBBNNNN… (B = Boundary)

Pipeline: 1D-RNN + SS and SA → Boundary → Inference/Cut → Domains (Domain 1, Domain 2)

Example: HIV capsid protein

Top ab-initio domain predictor in CAFASP4

Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.

Page 10: Machine Learning Algorithms for Protein Structure Prediction

1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine

• First method to predict energy changes from sequence accurately

• Useful for protein engineering, protein design, and mutagenesis analysis

…MWLAVFILINLK… → Support Vector Machine

Correlation = 0.76

Cheng, Randall, and Baldi. Proteins, 2006

Page 11: Machine Learning Algorithms for Protein Structure Prediction

2D: Contact Map Prediction

3D Structure → 2D Contact Map

[Figure: n × n contact map; rows i and columns j both run over residues 1 … n]

Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005

Distance Threshold = 8 Å
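As an illustration of how a 3D structure reduces to a 2D contact map, here is a minimal sketch. It assumes one representative 3D coordinate per residue (which atom is used is not stated on the slide) and treats two residues as in contact when their distance is below the 8 Å threshold; the function name and the toy coordinates are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 if residues i and j are in contact.

    Assumptions: coords holds one (x, y, z) point per residue, in angstroms, and
    "contact" means distance strictly below the threshold.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy coordinates: residues 0 and 1 are 3.8 A apart, residue 2 is far away.
toy_map = contact_map([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

The resulting matrix is symmetric with ones on the diagonal, which is why only one triangle of a contact map carries information.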

Page 12: Machine Learning Algorithms for Protein Structure Prediction

2D: Disulfide Bond Prediction

[Figure: disulfide bond between Cysteine i and Cysteine j]

Methods: 2D-RNN, Support Vector Machine, Graph Matching

[1] Baldi, Cheng, Vullo. NIPS, 2004.

[2] Cheng, Saigo, Baldi. Proteins, 2006.

Page 13: Machine Learning Algorithms for Protein Structure Prediction

2D: Prediction of Beta-Sheet Topology

[Figure: chain from N terminus to C terminus, with a Beta Sheet, a Beta Strand, and a Beta Residue Pair labeled]

Cheng and Baldi, Bioinformatics, 2005

• Ab-Initio Structure Prediction

• Fold Recognition

• Protein Design

• Protein Folding

Page 14: Machine Learning Algorithms for Protein Structure Prediction

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Level 1: Beta Sheets

[Figure: two beta sheets with strand orders 4 5 and 2 1 3 6 7]

Page 15: Machine Learning Algorithms for Protein Structure Prediction

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Level 1: Beta Sheets

Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel)

[Figure: two beta sheets with strand orders 4 5 and 2 1 3 6 7]

Page 16: Machine Learning Algorithms for Protein Structure Prediction

An Example of Beta-Sheet Topology

Structure of Protein 1VJG

Level 1: Beta Sheets

Level 2: Strand, Strand Pair, Strand Alignment, Pairing Direction (Antiparallel or Parallel)

Level 3: Beta Residue, Residue Pair (H-bond)

[Figure: two beta sheets with strand orders 4 5 and 2 1 3 6 7]

Page 17: Machine Learning Algorithms for Protein Structure Prediction

Three-Stage Prediction of Beta-Sheets

• Stage 1: Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)

• Stage 2: Use beta-residue pairing probabilities to align beta-strands

• Stage 3: Predict beta-strand pairs and beta-sheet topology using graph algorithms

Page 18: Machine Learning Algorithms for Protein Structure Prediction

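Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks

2D-RNN: O = f(I)

Input Matrix I (m×m)

Iij: input features at position (i, j): 20 for residues, 3 for SS (secondary structure), 2 for SA (solvent accessibility)

Output / Target Matrix (m×m)

Oij: Pairing Prob.

Tij: 0/1

…AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK…. (positions i and j marked on the sequence)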

Page 19: Machine Learning Algorithms for Protein Structure Prediction

An Example (Target)

Protein 1VJG

Beta-Residue Pairing Map (Target Matrix)

[Figure: pairing map with strands 1 2 3 4 5 6 7 marked]

Page 20: Machine Learning Algorithms for Protein Structure Prediction

An Example (Target)

Protein 1VJG

Beta-Residue Pairing Map (Target Matrix)

[Figure: pairing map with strands 1 2 3 4 5 6 7 marked; Antiparallel and Parallel pairings labeled]

Page 21: Machine Learning Algorithms for Protein Structure Prediction

An Example (Prediction)

Page 22: Machine Learning Algorithms for Protein Structure Prediction

Stage 2: Beta-Strand Alignment

• Use output probability matrix as scoring matrix

• Dynamic programming

• Disallow gaps and use the simplified search algorithm

[Figure: two strands of lengths m and n. Antiparallel: 1 … m against n … 1. Parallel: 1 … m against 1 … n.]

Total number of alignments = 2(m+n-1)
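One way to arrive at this count, assuming gaps are disallowed and every alignment must pair at least one residue: slide the strand of length n along the strand of length m. The relative offset d (a symbol introduced here for illustration) then takes m + n - 1 values, and each offset can be used in either pairing direction.

```latex
\[
\underbrace{\bigl|\{\, d : -(n-1) \le d \le m-1 \,\}\bigr|}_{\text{offsets}}
\times
\underbrace{2}_{\text{parallel, antiparallel}}
= (m + n - 1) \times 2 = 2(m + n - 1)
\]
```

For example, strands of lengths m = 5 and n = 4 give 2(5 + 4 - 1) = 16 candidate alignments, few enough to score exhaustively.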

Page 23: Machine Learning Algorithms for Protein Structure Prediction

Strand Alignment and Pairing Matrix

• The alignment score is the sum of the pairing probabilities of the aligned residues

• The best alignment is the alignment with the maximum score

• Strand Pairing Matrix

Strand Pairing Matrix of 1VJG
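The scoring rule above can be sketched as an exhaustive search over all ungapped alignments of two strands. This is a minimal illustration, not the authors' implementation: it assumes P[i][j] holds the predicted pairing probability between residue i of the first strand and residue j of the second, and the function name and the toy matrix are hypothetical.

```python
def best_ungapped_alignment(P):
    """Score every ungapped alignment of two strands and return the best one.

    P is an m x n matrix of pairing probabilities (rows: strand A, columns: strand B).
    Returns ((score, direction, offset), number_of_alignments_tried).
    """
    m, n = len(P), len(P[0])
    best = None
    tried = 0
    for direction in ("parallel", "antiparallel"):
        # offset = index in strand A that lines up with the first slot of strand B
        for offset in range(-(n - 1), m):
            score = 0.0
            for j in range(n):
                i = offset + j
                if 0 <= i < m:
                    # antiparallel pairing reads strand B in reverse order
                    col = j if direction == "parallel" else n - 1 - j
                    score += P[i][col]
            tried += 1
            if best is None or score > best[0]:
                best = (score, direction, offset)
    return best, tried


# Toy 3 x 2 probability matrix in which the antiparallel register scores highest.
toy_P = [[0.1, 0.9], [0.8, 0.1], [0.0, 0.0]]
(best_score, best_direction, best_offset), n_tried = best_ungapped_alignment(toy_P)
```

Running this for every pair of strands and keeping each best score is one way to fill a strand pairing matrix like the one shown for 1VJG.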

Page 24: Machine Learning Algorithms for Protein Structure Prediction

Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology

(a) Seven strands of protein 1VJG in sequence order

(b) Beta-sheet topology of protein 1VJG

Page 25: Machine Learning Algorithms for Protein Structure Prediction

Minimum Spanning Tree Like Algorithm

Strand Pairing Graph (SPG)

Strand Pairing Matrix

(a) Complete SPG

Page 26: Machine Learning Algorithms for Protein Structure Prediction

Minimum Spanning Tree Like Algorithm

Strand Pairing Graph (SPG)

Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints

Algorithm: Minimum Spanning Tree Like Algorithm

Strand Pairing Matrix

(a) Complete SPG (b) True Weighted SPG

Page 27: Machine Learning Algorithms for Protein Structure Prediction

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG

       1     2     3     4     5     6     7
1      0
2    1.3     0
3    .94   .37     0
4    .02   .02   .04     0
5    .02   .02   .03   1.9     0
6    .10   .05   .74   .04   .04     0
7    .02   .02   .03   .02   .02   .20     0

Step 1: Pair strand 4 and 5

Topology so far: 4 5

Page 28: Machine Learning Algorithms for Protein Structure Prediction

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG (unchanged from Page 27)

Step 2: Pair strand 1 and 2

Topology so far: 4 5; 2 1 (N terminus marked)

Page 29: Machine Learning Algorithms for Protein Structure Prediction

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG (unchanged from Page 27)

Step 3: Pair strand 1 and 3

Topology so far: 4 5; 2 1 3 (N terminus marked)

Page 30: Machine Learning Algorithms for Protein Structure Prediction

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG (unchanged from Page 27)

Step 4: Pair strand 3 and 6

Topology so far: 4 5; 2 1 3 6 (N terminus marked)

Page 31: Machine Learning Algorithms for Protein Structure Prediction

An Example of MST Like Algorithm

Strand Pairing Matrix of 1VJG (unchanged from Page 27)

Topology: 4 5; 2 1 3 6 7 (N and C termini marked)

Step 5: Pair strand 6 and 7
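The five steps above can be reproduced with a greedy, Kruskal-style procedure over the strand pairing matrix. The sketch below is only an approximation of the slides' "MST like" algorithm: the slides do not list the constraints, so the limit of two partners per strand (max_partners), the ban on cycles, and the score cutoff (min_score = 0.15) are all assumptions chosen for illustration. The matrix values are the ones shown for 1VJG.

```python
# Lower triangle of the 1VJG strand pairing matrix, as shown on the slides.
LOWER = [
    [0],
    [1.3, 0],
    [.94, .37, 0],
    [.02, .02, .04, 0],
    [.02, .02, .03, 1.9, 0],
    [.10, .05, .74, .04, .04, 0],
    [.02, .02, .03, .02, .02, .20, 0],
]


def mst_like_pairing(scores, max_partners=2, min_score=0.15):
    """Greedily accept the highest-scoring strand pairs that satisfy the constraints.

    scores maps (a, b) with a < b to an alignment score. The constraints used here
    (partner limit, no cycles, score cutoff) are hypothetical stand-ins.
    """
    parent = {}

    def find(x):
        # union-find root lookup with path halving, used to reject cycles
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    partners = {}
    chosen = []
    for (a, b), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if s < min_score:
            break  # remaining pairs score too low to be accepted
        if partners.get(a, 0) >= max_partners or partners.get(b, 0) >= max_partners:
            continue
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # pairing would close a cycle within a sheet
        parent[root_a] = root_b
        partners[a] = partners.get(a, 0) + 1
        partners[b] = partners.get(b, 0) + 1
        chosen.append((a, b))
    return chosen


# Strand numbers are 1-based, matching the slides.
pair_scores = {(j + 1, i + 1): LOWER[i][j] for i in range(7) for j in range(i)}
pairs = mst_like_pairing(pair_scores)
```

Under these assumptions the accepted pairs come out in the same order as Steps 1 to 5, and the two connected components correspond to the two sheets (4 5 and 2 1 3 6 7).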

Page 32: Machine Learning Algorithms for Protein Structure Prediction

1. Beta Residue Pairing

Method                                 Specificity/Sensitivity   Ratio of Improvement
BetaPairing                            41%                       17.8
CMAPpro (Pollastri and Baldi, 2002)    27%                       11.7

2. Beta Strand Alignment

Method                                             Alignment Accuracy   Pairing Direction
BetaPairing                                        66%                  84%
Statistical Potential (Hubbard, 1994)              40%                  X
Pseudo-energy (Zhu and Braun, 1999)                35%                  X
Information Theory (Steward and Thornton, 2002)    37%                  X

3. Beta Strand Pairing

Method     Specificity   Sensitivity   % of non-local pairs
MST Like   53%           59%           20%

Page 33: Machine Learning Algorithms for Protein Structure Prediction

3D Structure Prediction

• Ab-Initio Structure Prediction

Physical force field – protein folding

Contact map - reconstruction

MWLKKFGINLLIGQSV… → Simulation → …… → Select structure with minimum free energy

• Template-Based Structure Prediction

Query protein MWLKKFGINKH… → Fold Recognition (Protein Data Bank) → Template → Alignment

Page 34: Machine Learning Algorithms for Protein Structure Prediction

A Machine Learning Information Retrieval Framework for Fold Recognition

MWLKKFGIN……

Protein Data Bank

Fold Recognition

Alignment

Template

Query Protein

Cheng and Baldi, Bioinformatics, 2006

Machine Learning Ranking

Page 35: Machine Learning Algorithms for Protein Structure Prediction

Classic Fold Recognition Approaches

Sequence - Sequence Alignment (Needleman and Wunsch, 1970. Smith and Waterman, 1981)

Query: ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL

Template: ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL

Works for >40% sequence identity (close homologs in protein family)

Alignment (similarity) score
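For readers unfamiliar with how such an alignment score is computed, here is a minimal sketch of the Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) are placeholder assumptions; real fold recognition tools use substitution matrices and tuned gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b by dynamic programming.

    Keeps only the previous row of the DP table, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy example: three matches and one gap.
toy_score = needleman_wunsch_score("ACGT", "AGT")
```

Smith-Waterman local alignment differs mainly in clamping each cell at zero and taking the maximum over the whole table.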

Page 36: Machine Learning Algorithms for Protein Structure Prediction

Classic Fold Recognition Approaches

Profile - Sequence Alignment (Altschul et al., 1997)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

More sensitive for distant homologs in superfamily. (> 25% identity)

Average Score

Page 37: Machine Learning Algorithms for Protein Structure Prediction

Classic Fold Recognition Approaches

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN

Position Specific Scoring Matrix or Hidden Markov Model (positions 1 … n):

     1     2     …     n
A    0.4
C    0.1
…
W    0.5

More sensitive for distant homologs in superfamily. (> 25% identity)

Profile - Sequence Alignment (Altschul et al., 1997)
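To make the idea of a position-specific profile concrete, the sketch below builds per-column residue frequencies from an aligned family and scores a sequence against them. It is a simplified illustration under stated assumptions: real PSSMs such as those from PSI-BLAST use log-odds scores, pseudocounts, and sequence weighting, and the ungapped average score used here is a hypothetical stand-in. The toy family is the first six columns of the query family shown on the slide.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        profile.append({aa: c / len(alignment) for aa, c in counts.items()})
    return profile


def profile_sequence_score(profile, sequence):
    """Average, over columns, of the profile frequency of the sequence's residue.

    Assumption: the sequence is already aligned to the profile without gaps.
    """
    total = sum(col.get(aa, 0.0) for col, aa in zip(profile, sequence))
    return total / len(profile)


toy_family = ["ITAKPA", "ITAKPE", "ITAKPA", "ITAKPQ"]
toy_profile = column_frequencies(toy_family)
toy_profile_score = profile_sequence_score(toy_profile, "ITAKPQ")
```

Conserved columns contribute a frequency of 1.0 for the conserved residue, while variable columns spread their weight, which is what makes a profile more sensitive than a single sequence.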

Page 38: Machine Learning Algorithms for Protein Structure Prediction

Classic Fold Recognition Approaches

Profile - Profile Alignment (Rychlewski et al., 2000)

Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL

Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM

Family profiles (one per family):

Profile over positions 1 … m:

     1     2     …     m
A    0.3
C    0.5
…
W    0.2

Profile over positions 1 … n:

     1     2     …     n
A    0.1
C    0.4
…
W    0.5

More sensitive for very distant homologs. (> 15% identity)

Page 39: Machine Learning Algorithms for Protein Structure Prediction

Classic Fold Recognition Approaches

Sequence - Structure Alignment (Threading) (Bowie et al., 1991. Jones et al., 1992. Godzik, Skolnick, 1992. Lathrop, 1994)

Query: MWLKKFGINLLIGQS….

[Figure: the Query is fit onto the Template Structure, giving a Fitness Score]

Useful for recognizing similar folds without sequence similarity. (no evolutionary relationship)

Page 40: Machine Learning Algorithms for Protein Structure Prediction

Integration of Complementary Approaches

[Figure: the Meta Server sends the Query over the Internet to FR Server 1, FR Server 2, and FR Server 3, then forms a Consensus of their results]

1. Reliability depends on availability of external servers

2. Decisions are made on only a handful of candidates

(Lundstrom et al., 2001. Fischer, 2003)

Page 41: Machine Learning Algorithms for Protein Structure Prediction

Machine Learning Classification Approach

Support Vector Machine (SVM)

[Figure: Proteins classified into Class 1, Class 2, …, Class m]

Classify individual proteins to several or dozens of structure classes (Jaakkola et al., 2000. Leslie et al., 2002. Saigo et al., 2004)

Problem 1: can’t scale up to thousands of protein classes

Problem 2: doesn’t provide templates for structure modeling

Page 42: Machine Learning Algorithms for Protein Structure Prediction

Machine Learning Information Retrieval Framework

[Figure: each Query-Template Pair is passed to a Relevance Function (e.g., SVM) and labeled + or -; the resulting Score 1, Score 2, …, Score n are used to Rank the templates]

• Extract pairwise features

• Comparison of two pairs (four proteins)

• Relevant or not (one score) vs. many classes

• Ranking of templates (retrieval)
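The retrieval view of fold recognition can be sketched in a few lines: score every query-template pair with a relevance function, then sort. Everything below is illustrative; `toy_features` and `toy_relevance` are hypothetical stand-ins for the pairwise feature extraction and the trained relevance function (e.g., an SVM) described in the talk.

```python
def rank_templates(query, templates, extract_features, relevance):
    """Return (template, score) pairs sorted from most to least relevant."""
    scored = [(relevance(extract_features(query, t)), t) for t in templates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(t, s) for s, t in scored]


def toy_features(query, template):
    # Hypothetical single feature: fraction of identical positions.
    same = sum(a == b for a, b in zip(query, template))
    return [same / len(query)]


def toy_relevance(features):
    # Hypothetical relevance function: pass the single feature through.
    return features[0]


ranking = rank_templates("MWLKK", ["MWLKA", "AAAAA", "MWAAA"], toy_features, toy_relevance)
```

Because the relevance function sees only pairwise features, adding a new template requires no retraining over classes, which is the contrast with the classification approach drawn in the bullets above.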

Page 43: Machine Learning Algorithms for Protein Structure Prediction

Pairwise Feature Extraction

• Sequence / Family Information Features: Cosine, correlation, and Gaussian kernel

• Sequence – Sequence Alignment Features: Palign, ClustalW

• Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST

• Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM

• Structural Features: Secondary structure, solvent accessibility, contact map, beta-sheet topology
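As an example of the first feature group, the sketch below computes cosine, correlation, and Gaussian kernel similarities between two proteins. It assumes each protein is summarized by its 20-dimensional amino acid composition vector; the slides do not say which vectors are compared, so that choice, the `gamma` value, and the function names are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def correlation(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def gaussian_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


query_vec = composition("MWLKKFGINLLIGQSV")
template_vec = composition("MWLKKFGINKH")
pair_features = [
    cosine(query_vec, template_vec),
    correlation(query_vec, template_vec),
    gaussian_kernel(query_vec, template_vec),
]
```

Each function returns one number per query-template pair, so the three values can be concatenated with alignment and structural scores into a single feature vector.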

Page 44: Machine Learning Algorithms for Protein Structure Prediction

Pairwise Feature Extraction

Page 45: Machine Learning Algorithms for Protein Structure Prediction

Relevance Function: Support Vector Machine Learning

Training Data Set: Positive Pairs (Same Folds) and Negative Pairs (Different Folds)

Training/Learning: Support Vector Machine

[Figure: Feature Space with a separating Hyperplane]

Page 46: Machine Learning Algorithms for Protein Structure Prediction

Relevance Function: Support Vector Machine Learning

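f(x) = \sum_{i} \alpha_i y_i K(x, x_i) + b    (1)

K is Gaussian Kernel: K(x, x_i) = \exp(-\gamma \lVert x - x_i \rVert^2)    (2)

Here x_i are the training pairs with labels y_i in {+1, -1}, \alpha_i are the learned weights, and b is the bias.

[Figure: separating hyperplane with a Margin on each side]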
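The standard form of an SVM decision function with a Gaussian kernel is a weighted sum of kernel evaluations against the support vectors plus a bias. The sketch below evaluates that textbook form directly; the support vectors, weights, and `gamma` are made-up toy values, and a real relevance function would obtain them from training.

```python
import math


def gaussian_kernel(x, z, gamma):
    """exp(-gamma * squared Euclidean distance between x and z)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_decision(x, support_vectors, labels, alphas, bias, gamma=1.0):
    """Sum of alpha_i * y_i * K(x, x_i) over the support vectors, plus the bias."""
    return bias + sum(
        alpha * y * gaussian_kernel(x, sv, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


# Toy model: one positive and one negative support vector (hypothetical values).
toy_svs = [(0.0, 0.0), (2.0, 0.0)]
toy_labels = [+1, -1]
toy_alphas = [1.0, 1.0]

near_positive = svm_decision((0.0, 0.0), toy_svs, toy_labels, toy_alphas, bias=0.0)
near_negative = svm_decision((2.0, 0.0), toy_svs, toy_labels, toy_alphas, bias=0.0)
midpoint = svm_decision((1.0, 0.0), toy_svs, toy_labels, toy_alphas, bias=0.0)
```

The sign of the value gives the relevant/not-relevant decision, and the raw value can serve as the score used to rank templates.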

Page 47: Machine Learning Algorithms for Protein Structure Prediction

Training and Cross-Validation

• Standard benchmark (Lindahl’s dataset, 976 proteins)

• 976 x 975 query-template pairs (about 7,468 positives)

[Figure: queries 1, 2, 3, …, 976, each with 975 pairs (Query 1’s pairs, Query 2’s pairs, …)]

Train / Learn: 90% of queries (1 – 878)

Test: 10% of queries (879 – 976)

Rank 975 templates for each query
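The split shown above is by query, so all 975 pairs of a given query fall on the same side. A minimal sketch of that bookkeeping, using the protein counts from the slide and 1-based protein indices as an assumption:

```python
def query_template_pairs(queries, n_proteins):
    """Yield (query, template) index pairs, pairing each query with every other protein."""
    for q in queries:
        for t in range(1, n_proteins + 1):
            if t != q:
                yield (q, t)


N_PROTEINS = 976
train_queries = range(1, 879)               # queries 1 - 878 (about 90%)
test_queries = range(879, N_PROTEINS + 1)   # queries 879 - 976 (about 10%)

n_test_pairs = sum(1 for _ in query_template_pairs(test_queries, N_PROTEINS))
n_total_pairs = (len(train_queries) + len(test_queries)) * (N_PROTEINS - 1)
```

Rotating which block of queries is held out gives the remaining cross-validation folds.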

Page 48: Machine Learning Algorithms for Protein Structure Prediction

Results for Top Five Ranked Templates

• Family: close homologs, more identity

• Superfamily: distant homologs, less identity

• Fold: no evolutionary relation, no identity

Method       Family   Superfamily   Fold
PSI-BLAST    72.3     27.9          4.7
HMMER        73.5     31.3          14.6
SAM-T98      75.4     38.9          18.7
BLASTLINK    78.9     40.6          16.5
SSEARCH      75.5     32.5          15.6
SSHMM        71.7     31.6          24
THREADER     58.9     24.7          37.7
FUGUE        85.8     53.2          26.8
RAPTOR       77.8     50            45.1
SPARKS3      86.8     67.7          47.4
FOLDpro      89.9     70.0          48.3
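A top-five figure of this kind can be computed as follows. This is a sketch under an assumed definition: a query counts as a hit if at least one correct template appears among its k highest-ranked templates, and the result is the percentage of queries that are hits. The data structures and names are hypothetical.

```python
def top_k_sensitivity(rankings, correct, k=5):
    """Percentage of queries with at least one correct template in the top k.

    rankings: {query: [templates, best first]}
    correct:  {query: set of templates that share the query's family/superfamily/fold}
    """
    hits = sum(
        1
        for query, ranked in rankings.items()
        if any(t in correct[query] for t in ranked[:k])
    )
    return 100.0 * hits / len(rankings)


toy_rankings = {"q1": ["a", "b", "c"], "q2": ["c", "b", "a"]}
toy_correct = {"q1": {"c"}, "q2": {"c"}}
top1 = top_k_sensitivity(toy_rankings, toy_correct, k=1)
top3 = top_k_sensitivity(toy_rankings, toy_correct, k=3)
```

Evaluating the same rankings against family, superfamily, and fold level truth sets would give the three columns of the table.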

Page 49: Machine Learning Algorithms for Protein Structure Prediction

Specificity-Sensitivity Plot (Family)

Page 50: Machine Learning Algorithms for Protein Structure Prediction

Specificity-Sensitivity Plot (Superfamily)

Page 51: Machine Learning Algorithms for Protein Structure Prediction

Specificity-Sensitivity Plot (Fold)

Page 52: Machine Learning Algorithms for Protein Structure Prediction

Advantages of MLIR Framework

• Integration

• Accuracy

• Extensibility

• Simplicity

• Reliability

• Completeness

• Potentials

Disadvantages

• Slower than some alignment methods

Page 53: Machine Learning Algorithms for Protein Structure Prediction

A CASP7 Example: T0290

Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFMVQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVVFGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP

FOLDpro → Predicted Structure

Compare with the experimental structure: RMSD = 1 Å
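For reference, the RMSD between a predicted and an experimental structure is the root of the mean squared distance between corresponding atoms. The sketch below assumes the two coordinate sets are already optimally superimposed and list the same atoms in the same order; the superposition step itself (for example, the Kabsch algorithm) is not shown, and the coordinates are toy values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom is displaced by exactly 1 angstrom along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_rmsd = rmsd(predicted, experimental)
```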

Page 54: Machine Learning Algorithms for Protein Structure Prediction

Publications and Bioinformatics Tools

1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. [DIpro 1.0]

2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. [DIpro 2.0]

3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. [BETApro]

4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. [SSpro 4/ACCpro 4/CMAPpro 2]

5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. [DISpro]

Page 55: Machine Learning Algorithms for Protein Structure Prediction

Publications and Bioinformatics Tools

6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. [Sigmoid]

7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. [MUpro]

8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006.

9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. [DOMpro]

10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. [FOLDpro]

Page 56: Machine Learning Algorithms for Protein Structure Prediction

Acknowledgements

• Pierre Baldi

• G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang

• Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich

Page 57: Machine Learning Algorithms for Protein Structure Prediction
Page 58: Machine Learning Algorithms for Protein Structure Prediction

Markov Models

Page 59: Machine Learning Algorithms for Protein Structure Prediction
Page 60: Machine Learning Algorithms for Protein Structure Prediction
Page 61: Machine Learning Algorithms for Protein Structure Prediction

1D-Recursive Neural Network

Page 62: Machine Learning Algorithms for Protein Structure Prediction

2D-Recursive Neural Network

Page 63: Machine Learning Algorithms for Protein Structure Prediction
Page 64: Machine Learning Algorithms for Protein Structure Prediction

2D-RNNs

Page 65: Machine Learning Algorithms for Protein Structure Prediction

2D RNNs