Machine Learning Algorithms for Protein Structure Prediction
Jianlin Cheng
Institute for Genomics and Bioinformatics
School of Information and Computer Sciences
University of California Irvine
2006
Outline
I. Introduction
II. 1D Prediction
III. 2D Prediction (Beta-Sheet Topology)
IV. 3D Prediction (Fold Recognition)
V. Publications and Bioinformatics Tools
Importance of Protein Structure Prediction
AGCWY……
Sequence Structure Function
Cell
Four Levels of Protein Structure
Primary Structure (a directional sequence of amino acids/residues)
Secondary Structure (helix, strand, coil)
N C…
Residue1
Alpha Helix Beta Strand / Sheet Coil
Residue2
Peptide bond
Four Levels of Protein Structure
Tertiary Structure
Quaternary Structure (complex)
G Protein Complex
1D: Secondary Structure Prediction
Coil
MWLKKFGINLLIGQSV…
CCCCHHHHHCCCSSSSS…
Accuracy: 78%
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
Neural Networks+ Alignments
Strand
Helix
1D: Solvent Accessibility Prediction
Exposed
Buried
MWLKKFGINLLIGQSV…
eeeeeeebbbbbbbbeeeebbb…
Accuracy: 79%
Neural Networks+ Alignments
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
1D: Disordered Region Prediction Using Neural Networks
MWLKKFGINLLIGQSV…
OOOOODDDDOOOOO…
Disordered Region
1D-RNN
93% TP at 5% FP
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2005
1D: Protein Domain Prediction Using Neural Networks
MWLKKFGINLLIGQSV…
NNNNNNNBBBBBNNNN…
Boundary
Domain 1 Domain 2 Domains
1D-RNN + SS and SA
Inference/Cut
HIV capsid protein
Top ab-initio domain predictor in CAFASP4
Cheng, Sweredoski, Baldi. Data Mining and Knowledge Discovery, 2006.
1D: Predict Single-Site Mutation From Sequence Using Support Vector Machine
• First method to predict energy changes from sequence accurately
• Useful for protein engineering, protein design, and mutagenesis analysis
…MWLAVFILINLK…
Support Vector Machine
Correlation = 0.76
Cheng, Randall, and Baldi. Proteins, 2006
2D: Contact Map Prediction
3D Structure 2D Contact Map
[Figure: n × n contact map with residues 1…n on both axes; entry (i, j) marks a contact between residues i and j]
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids Research, 2005
Distance Threshold = 8 Å
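A contact map can be derived from 3D coordinates by thresholding pairwise distances. The sketch below is only an illustration: it assumes one representative atom per residue (the slide does not say which atom is used), leaves the diagonal at 0, and uses made-up toy coordinates.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix: entry (i, j) is 1 when residues i and j
    are closer than `threshold` angstroms (diagonal left at 0 by assumption)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = 1
    return cmap

# Toy coordinates (hypothetical, not taken from a real structure).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
for row in toy_map:
    print(row)
```

The resulting matrix is symmetric, which is why a contact map is usually drawn as a single triangle.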
2D: Disulfide Bond Prediction
Disulfide Bond
Cysteine j
Cysteine i
2D-RNN
Graph Matching
[1] Baldi, Cheng, Vullo. NIPS, 2004.
[2] Cheng, Saigo, Baldi. Proteins, 2005
Support Vector Machine
yes
2D: Prediction of Beta-Sheet Topology
N terminus
C terminus
Cheng and Baldi, Bioinformatics, 2005
Beta Sheet
Beta Strand
Beta Residue Pair
• Ab-Initio Structure Prediction
• Fold Recognition
• Protein Design
• Protein Folding
An Example of Beta-Sheet Topology
Structure of Protein 1VJG
Level 1: Beta sheets
Level 2: Strands, strand pairs, strand alignment, pairing direction (antiparallel or parallel)
Level 3: Beta residues, residue pairs (H-bond)
Strand order within the two sheets: 4 5 and 2 1 3 6 7
Three-Stage Prediction of Beta-Sheets
• Stage 1: Predict beta-residue pairing probabilities using 2D-Recursive Neural Networks (2D-RNN, Baldi and Pollastri, 2003)
• Stage 2: Use beta-residue pairing probabilities to align beta-strands
• Stage 3: Predict beta-strand pairs and beta-sheet topology using graph algorithms
Stage 1: Prediction of Beta-Residue Pairings Using 2D-Recursive Neural Networks
Input Matrix I (m×m)
2D-RNN: O = f(I)
Output / Target Matrix (m×m)
Iij: input features for residue pair (i,j): 20 for residues, 3 for SS, 2 for SA
Oij: pairing probability
Tij: 0/1
…AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK….
i j
An Example (Target)
Protein 1VJG Beta-Residue Pairing Map (Target Matrix)
[Figure: pairing map over strands 1 to 7, with antiparallel and parallel pairings marked]
An Example (Prediction)
Stage 2: Beta-Strand Alignment
• Use output probability matrix as scoring matrix
• Dynamic programming
• Disallow gaps and use the simplified search algorithm
Antiparallel: residues 1…m aligned against residues n…1
Parallel: residues 1…m aligned against residues 1…n
Total number of alignments = 2(m+n-1)
Strand Alignment and Pairing Matrix
• The alignment score is the sum of the pairing probabilities of the aligned residues
• The best alignment is the alignment with the maximum score
• Strand Pairing Matrix
Strand Pairing Matrix of 1VJG
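The ungapped strand alignment can be sketched as an exhaustive search over every offset in both directions, scoring each candidate by the sum of the pairing probabilities of its aligned residues. This is a minimal illustration, not the author's implementation; the function name, the toy probability matrix, and the residue indices are all hypothetical.

```python
def best_strand_alignment(prob, strand_a, strand_b):
    """Try every ungapped alignment of two strands in both directions.

    prob[i][j] is the predicted pairing probability of residues i and j;
    strand_a and strand_b list residue indices in sequence order.
    Returns ((score, direction, pairs), number_of_alignments_tried).
    """
    m, n = len(strand_a), len(strand_b)
    best = (float("-inf"), None, [])
    tried = 0
    for direction, b in (("parallel", strand_b), ("antiparallel", strand_b[::-1])):
        # Slide strand b along strand a: m + n - 1 offsets per direction.
        for offset in range(-(n - 1), m):
            pairs = [(strand_a[k + offset], b[k])
                     for k in range(n) if 0 <= k + offset < m]
            score = sum(prob[i][j] for i, j in pairs)
            tried += 1
            if score > best[0]:
                best = (score, direction, pairs)
    return best, tried

# Toy 7-residue probability matrix (hypothetical values).
toy_prob = [[0.0] * 7 for _ in range(7)]
toy_prob[0][6] = 0.9
toy_prob[1][5] = 0.8
(best_score, best_direction, best_pairs), n_tried = best_strand_alignment(
    toy_prob, [0, 1, 2], [5, 6])
print(best_score, best_direction, best_pairs, n_tried)
```

With strands of length m = 3 and n = 2 the search visits 2(3+2-1) = 8 alignments, matching the count given for the simplified search.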
Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Topology
(a) Seven strands of protein 1VJG in sequence order
(b) Beta-sheet topology of protein 1VJG
Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
Goal: Find a set of connected subgraphs that maximize the sum of the alignment scores and satisfy the constraints
Algorithm: Minimum Spanning Tree Like Algorithm
[Figure: Strand Pairing Matrix; (a) Complete SPG; (b) True Weighted SPG]
An Example of MST Like Algorithm
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1 to 7)
Strand   1     2     3     4     5     6     7
1        0
2        1.3   0
3        .94   .37   0
4        .02   .02   .04   0
5        .02   .02   .03   1.9   0
6        .10   .05   .74   .04   .04   0
7        .02   .02   .03   .02   .02   .20   0
Step 1: Pair strand 4 and 5 (score 1.9)
Sheets so far: 4 5
Step 2: Pair strand 1 and 2 (score 1.3 in the Strand Pairing Matrix of 1VJG)
Sheets so far: 4 5 and 2 1
Step 3: Pair strand 1 and 3 (score .94 in the Strand Pairing Matrix of 1VJG)
Sheets so far: 4 5 and 2 1 3
Step 4: Pair strand 3 and 6 (score .74 in the Strand Pairing Matrix of 1VJG)
Sheets so far: 4 5 and 2 1 3 6
Step 5: Pair strand 6 and 7 (score .20 in the Strand Pairing Matrix of 1VJG)
Final sheets: 4 5 and 2 1 3 6 7
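The five steps can be reproduced with a greedy, Kruskal-style selection over the 1VJG strand pairing scores. The sketch below is an illustration under stated assumptions, not the published algorithm: the constraints (at most two partners per strand, no cycles) and the `min_score` cutoff of 0.15 are hypothetical choices made so that the toy run stops where the slides do.

```python
def greedy_strand_pairing(scores, max_partners=2, min_score=0.15):
    """Pick strand pairs from highest to lowest score.

    scores: {(i, j): alignment score} with i < j.
    Assumed constraints (not from the slides): each strand has at most
    `max_partners` partners, no cycles, and pairs below `min_score` are ignored.
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = {}
    chosen = []
    for (i, j), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s < min_score:
            break
        if degree.get(i, 0) >= max_partners or degree.get(j, 0) >= max_partners:
            continue
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue  # would close a cycle within one sheet
        parent[root_i] = root_j
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
        chosen.append((i, j))
    return chosen

# Lower triangle of the strand pairing matrix of 1VJG, as given in the slides.
rows = [
    [0],
    [1.3, 0],
    [.94, .37, 0],
    [.02, .02, .04, 0],
    [.02, .02, .03, 1.9, 0],
    [.10, .05, .74, .04, .04, 0],
    [.02, .02, .03, .02, .02, .20, 0],
]
pair_scores = {(c + 1, r + 1): rows[r][c] for r in range(7) for c in range(r)}
selected_pairs = greedy_strand_pairing(pair_scores)
print(selected_pairs)
```

In this run the pair (2, 3) with score .37 is skipped because strand 3 already has two partners, which is why the selection jumps from (3, 6) to (6, 7).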
1. Beta Residue Pairing
Method                                Specificity/Sensitivity   Ratio of Improvement
BetaPairing                           41%                       17.8
CMAPpro (Pollastri and Baldi, 2002)   27%                       11.7
2. Beta Strand Alignment
Method                                            Alignment Accuracy   Pairing Direction
BetaPairing                                       66%                  84%
Statistical Potential (Hubbard, 1994)             40%                  X
Pseudo-energy (Zhu and Braun, 1999)               35%                  X
Information Theory (Steward and Thornton, 2002)   37%                  X
3. Beta Strand Pairing
Method     Specificity   Sensitivity   % of non-local pairs
MST Like   53%           59%           20%
3D Structure Prediction
• Ab-Initio Structure Prediction
  Physical force field (protein folding)
  Contact map (reconstruction)
  MWLKKFGINLLIGQSV…
  Simulation
  Select structure with minimum free energy
• Template-Based Structure Prediction
  Query protein: MWLKKFGINKH…
  Protein Data Bank
  Fold Recognition
  Template
  Alignment
A Machine Learning Information Retrieval Framework for Fold Recognition
MWLKKFGIN……
Protein Data Bank
Fold Recognition
Alignment
Template
Query Protein
Cheng and Baldi, Bioinformatics, 2006
Machine Learning Ranking
Classic Fold Recognition Approaches
Sequence - Sequence Alignment (Needleman and Wunsch, 1970; Smith and Waterman, 1981)
Query:    ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template: ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
Works for >40% sequence identity (close homologs in protein family)
Alignment (similarity) score
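A global alignment score in the style of Needleman and Wunsch can be computed with a short dynamic program. The sketch below is a simplified illustration: the match, mismatch, and linear gap values are assumptions chosen for readability, whereas practical tools typically use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b
    (simple match/mismatch scoring with a linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("ITAKPAKTPT", "ITAKPQWLKT"))
```

Keeping only two rows of the table is enough when just the score is needed; recovering the alignment itself would require the full table or a traceback structure.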
Classic Fold Recognition Approaches
Profile - Sequence Alignment (Altschul et al., 1997)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
Average Score
More sensitive for distant homologs in superfamily. (> 25% identity)
Classic Fold Recognition Approaches
Profile - Sequence Alignment (Altschul et al., 1997)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Position Specific Scoring Matrix or Hidden Markov Model (columns are positions 1…n of the query family):
     1     2   …   n
A    0.4
C    0.1
…
W    0.5
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in superfamily. (> 25% identity)
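A position-specific profile can be illustrated by counting residue frequencies in each column of the aligned query family and scoring another sequence against those columns. This is a deliberately simplified sketch: real position specific scoring matrices use log-odds scores with pseudocounts and allow gaps, and the function names here are hypothetical.

```python
from collections import Counter

def build_profile(aligned_family):
    """Column-wise residue frequencies from equal-length aligned sequences."""
    length = min(len(seq) for seq in aligned_family)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_family)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile

def score_against_profile(profile, seq):
    """Ungapped score: sum of the profile frequency of each residue of seq
    at its position (positions beyond the shorter input are ignored)."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, seq))

# Query family as shown on the slide.
query_family = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]
family_profile = build_profile(query_family)
template = "ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN"
print(family_profile[5])
print(score_against_profile(family_profile, template))
```

Column 6 of this family holds A twice and E and Q once each, so the profile records the variability that a single query sequence would hide.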
Classic Fold Recognition Approaches
Profile - Profile Alignment (Rychlewski et al., 2000)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Query profile (positions 1…n):
     1     2   …   n
A    0.1
C    0.4
…
W    0.5
Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
Template profile (positions 1…m):
     1     2   …   m
A    0.3
C    0.5
…
W    0.2
More sensitive for very distant homologs. (> 15% identity)
Classic Fold Recognition Approaches
Sequence - Structure Alignment (Threading) (Bowie et al., 1991; Jones et al., 1992; Godzik and Skolnick, 1992; Lathrop, 1994)
Query: MWLKKFGINLLIGQS….
Template Structure
Fit
Fitness Score
Useful for recognizing similar folds without sequence similarity (no evolutionary relationship).
Integration of Complementary Approaches
Meta Server
FR Server1
FR server2
FR server3
Query
Internet
Consensus
1. Reliability depends on availability of external servers
2. Makes decisions on a handful of candidates
(Lundstrom et al.,2001. Fischer, 2003)
Machine Learning Classification Approach
Proteins
Class 1
Class 2
Class m
Classify individual proteins into several or dozens of structure classes (Jaakkola et al., 2000; Leslie et al., 2002; Saigo et al., 2004)
Problem 1: can’t scale up to thousands of protein classes
Problem 2: doesn’t provide templates for structure modeling
Support Vector Machine (SVM)
Machine Learning Information Retrieval Framework
Query-Template Pair
Relevance Function (e.g., SVM): + / -
Score 1, Score 2, …, Score n
Rank
• Extract pairwise features
• Comparison of two pairs (four proteins)
• Relevant or not (one score) vs. many classes
• Ranking of templates (retrieval)
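The retrieval idea reduces to scoring every query-template pair with one relevance function and sorting the templates by that score. The sketch below uses a hypothetical stand-in scorer (shared three-residue words) purely to make the ranking step concrete; in the framework described here the score would come from a trained model such as an SVM.

```python
def rank_templates(query, templates, relevance):
    """Return (template, score) pairs sorted from most to least relevant."""
    scored = [(t, relevance(query, t)) for t in templates]
    return sorted(scored, key=lambda ts: ts[1], reverse=True)

def toy_relevance(query, template):
    """Hypothetical stand-in for a trained scorer: fraction of the query's
    3-residue words that also occur in the template."""
    words = {query[i:i + 3] for i in range(len(query) - 2)}
    if not words:
        return 0.0
    return sum(w in template for w in words) / len(words)

# Toy template library (made-up sequences).
ranked = rank_templates("MWLKKFGIN",
                        ["AAAAAAAAA", "MWLKKAAAA", "MWLKKFGIN"],
                        toy_relevance)
for template, score in ranked:
    print(template, round(score, 3))
```

Because each pair yields a single score, adding a new template to the library only means scoring one more pair rather than retraining a multi-class model.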
Pairwise Feature Extraction
• Sequence / Family Information Features: cosine, correlation, and Gaussian kernel
• Sequence – Sequence Alignment Features: Palign, ClustalW
• Sequence – Profile Alignment Features: PSI-BLAST, IMPALA, HMMer, RPS-BLAST
• Profile – Profile Alignment Features: ClustalW, HHSearch, Lobster, Compass, PRC-HMM
• Structural Features: secondary structure, solvent accessibility, contact map, beta-sheet topology
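The three sequence-information similarity measures named above (cosine, correlation, Gaussian kernel) can be illustrated on amino-acid composition vectors. The choice of composition vectors and the `gamma` value are assumptions for this sketch; the slides do not say which vector representation the features were computed on.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 amino acids in seq."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def correlation(u, v):
    """Pearson correlation: cosine of the mean-centered vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def gaussian_kernel(u, v, gamma=1.0):
    """exp(-gamma * squared Euclidean distance); gamma is an assumed value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

query_vec = composition("ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL")
template_vec = composition("ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN")
pair_features = [cosine(query_vec, template_vec),
                 correlation(query_vec, template_vec),
                 gaussian_kernel(query_vec, template_vec)]
print(pair_features)
```

Each measure turns a pair of proteins into one number, so the three values can sit side by side in the pairwise feature vector.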
Relevance Function: Support Vector Machine Learning
Positive Pairs(Same Folds)
Negative Pairs(Different Folds)
Training/Learning
Support Vector Machine
Training Data Set
Feature Space
Hyperplane
Relevance Function: Support Vector Machine Learning
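$f(x) = \sum_i \alpha_i y_i K(x, x_i) + b$   (1)
K is Gaussian Kernel: $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$   (2)
[Figure: separating hyperplane with a margin on each side]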
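A kernel SVM scores a point as a bias plus a weighted sum of kernel evaluations against its support vectors, and the sign of that score gives the predicted class. The sketch below evaluates such a function with a Gaussian kernel; the support vectors, coefficients, bias, and `gamma` are made-up toy values, not parameters of any trained model.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * squared Euclidean distance between x and y)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decision(x, support_vectors, labels, alphas, bias, gamma=0.5):
    """Bias plus the alpha- and label-weighted kernel sum over support vectors."""
    return bias + sum(alpha * y * gaussian_kernel(x, sv, gamma)
                      for sv, y, alpha in zip(support_vectors, labels, alphas))

# Toy model (hypothetical): one negative and one positive support vector.
toy_svs = [(0.0, 0.0), (2.0, 2.0)]
toy_labels = [-1, +1]
toy_alphas = [1.0, 1.0]
toy_bias = 0.0

for point in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    print(point, svm_decision(point, toy_svs, toy_labels, toy_alphas, toy_bias))
```

In a fold recognition setting the input x would be the feature vector of one query-template pair, and the raw score (not just its sign) is what gets used for ranking.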
Training and Cross-Validation
• Standard benchmark (Lindahl’s dataset, 976 proteins)
• 976 x 975 query-template pairs (about 7,468 positives)
[Figure: queries 1 to 976, each with its 975 query-template pairs, grouped by query (Query 1’s pairs, Query 2’s pairs, …)]
Train / Learn: 90% of queries (1 – 878)
Test: 10% of queries (879 – 976)
Rank 975 templates for each query
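The split shown here is made by query, so all 975 pairs of a given query land on the same side. The sketch below reproduces one such split with the numbers from the slide; the function names are hypothetical and the other cross-validation folds are not shown.

```python
def split_queries(n_queries=976, train_fraction=0.9):
    """Split query indices 1..n_queries into a leading training block
    and a trailing test block."""
    n_train = int(n_queries * train_fraction)
    train = list(range(1, n_train + 1))
    test = list(range(n_train + 1, n_queries + 1))
    return train, test

def pairs_for_query(query, n_proteins=976):
    """All (query, template) pairs for one query, excluding itself."""
    return [(query, t) for t in range(1, n_proteins + 1) if t != query]

train_queries, test_queries = split_queries()
print(train_queries[0], train_queries[-1], test_queries[0], test_queries[-1])
print(len(pairs_for_query(1)))
```

Splitting by query rather than by pair keeps every test query unseen during training, so its ranking of 975 templates is a fair evaluation.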
Results for Top Five Ranked Templates
• Family: close homologs, more identity
• Superfamily: distant homologs, less identity
• Fold: no evolutionary relation, no identity
Method      Family   Superfamily   Fold
PSI-BLAST   72.3     27.9          4.7
HMMER       73.5     31.3          14.6
SAM-T98     75.4     38.9          18.7
BLASTLINK   78.9     4.06          16.5
SSEARCH     75.5     32.5          15.6
SSHMM       71.7     31.6          24
THREADER    58.9     24.7          37.7
FUGUE       85.8     53.2          26.8
RAPTOR      77.8     50            45.1
SPARKS3     86.8     67.7          47.4
FOLDpro     89.9     70.0          48.3
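One common way to compute a top-five figure is the percentage of queries for which at least one correct template appears among the five highest-ranked templates. The sketch below implements that reading on toy data; whether the table uses exactly this definition is an assumption, and all names and data are hypothetical.

```python
def top_k_success(rankings, correct, k=5):
    """Percent of queries with at least one correct template in the top k.

    rankings: {query: [templates, best first]}
    correct:  {query: set of correct templates}
    Queries without any correct template are skipped.
    """
    hits = 0
    evaluated = 0
    for query, ranked in rankings.items():
        truth = correct.get(query, set())
        if not truth:
            continue
        evaluated += 1
        if any(t in truth for t in ranked[:k]):
            hits += 1
    return 100.0 * hits / evaluated if evaluated else 0.0

# Toy rankings (made-up identifiers).
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["c", "a", "b"], "q3": ["a", "b", "c"]}
toy_correct = {"q1": {"c"}, "q2": {"c"}}
print(top_k_success(toy_rankings, toy_correct, k=1))
print(top_k_success(toy_rankings, toy_correct, k=3))
```

Running the same function separately on family, superfamily, and fold level truth sets would give one number per column.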
Specificity-Sensitivity Plot (Family)
Specificity-Sensitivity Plot (Superfamily)
Specificity-Sensitivity Plot (Fold)
Advantages of MLIR Framework
• Integration
• Accuracy
• Extensibility
• Simplicity
• Reliability
• Completeness
• Potentials
Disadvantages
• Slower than some alignment methods
A CASP7 Example: T0290
Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFMVQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVVFGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP
Compare with the experimental structure: RMSD = 1 Å
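RMSD measures the average distance between corresponding atoms of two structures. The sketch below assumes the two coordinate sets are already optimally superimposed and list matching atoms in the same order; the superposition step itself is not shown, and the coordinates are toy values.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates, assumed to be already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy example: the second structure is the first shifted by 1 angstrom along x.
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
experimental = [(x + 1.0, y, z) for x, y, z in predicted]
print(rmsd(predicted, experimental))
```

A uniform 1 Å shift gives an RMSD of 1 Å here only because no superposition is applied; after optimal superposition the same pair would have an RMSD of 0.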
FOLDpro
Predicted Structure
Publications and Bioinformatics Tools
1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. [DIpro 1.0]
2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. [DIpro 2.0]
3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. [BETApro]
4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. [SSpro 4/ACCpro 4/CMAPpro 2]
5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. [DISpro]
6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. [Sigmoid]
7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. [MUpro]
8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006.
9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. [DOMpro]
10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. [FOLDpro]
Acknowledgements
• Pierre Baldi
• G. Wesley Hatfield, Eric Mjolsness, Hal Stern, Dennis Decoste, Suzanne Sandmeyer, Richard Lathrop, Gianluca Pollastri, Chin-Rang Yang
• Mike Sweredoski, Arlo Randall, Liza Larsen, Sam Danziger, Trent Su, Hiroto Saigo, Alessandro Vullo, Lucas Scharenbroich