
Ab-initio protein structure prediction

Jaroslaw Pillardy
Computational Biology Service Unit
Cornell Theory Center, Cornell University
Ithaca, NY USA

Methods for predicting protein structure

1. Homology (comparative) modeling – relies on the similarity of the sequence of the protein under study to the sequences of proteins with known structure. Very reliable if sequence similarity ≥ 40%.

2. Threading (fold recognition) – finding the best structure from the database, which fits the sequence studied in terms of a chosen score function.

3. Energy-based (ab initio): search of the structure as a global minimum of a designed potential-energy function, following Anfinsen’s thermodynamic hypothesis.

Ab-initio methods: advantages

• They do not require existence of a similarly shaped (folded) protein in a database in order to predict a structure. Entirely new and unknown folds may be found this way.

• The only information necessary is the sequence; however, experimental knowledge about secondary structure and other structural features may be used.

• They perform very well for short sequences, where other methods are not as good.

• In some cases they may give insight into the physical properties of a protein (folding pathways, statistical ensembles).

Ab-initio methods: limitations

• They are usually slower than threading and homology modeling by a few orders of magnitude. They usually require a large parallel computer cluster, and even then it may take a few days using 50-100 processors to calculate something …

• The reliability of the models produced diminishes fast with protein size; sequences shorter than 130-150 amino acids work best. It is possible to treat larger proteins if some other information is used (e.g. secondary structure).

• The correct prediction has to be chosen from several possibilities (families) that are close in energy. A prediction based on one (lowest-energy) family only is much less reliable.

• Typical resolution of a model is 3-6Å.

Ab-initio methods

•Combination (R. A. Friesner, Columbia University)

Uses a simplified representation of the protein chain, with an energy function derived from protein statistics (PDB) and secondary structure elements frozen during the search (secondary structure prediction is used before the search). Monte Carlo with minimization is used as the global search.

•Lattice folding (J. Skolnick, Danforth Plant Science Center)

Uses a simplified representation of the protein chain on a lattice. Secondary structure prediction with multiple sequence alignment is used before the search. Threading is used to derive some constraints. A Monte Carlo search is carried out on the lattice, the results are clustered, and the 5 best models are refined off-lattice.

•FRAGFOLD (D. T. Jones, Brunel University, England)

Uses simplified representation of protein chain (Cα and Cβ only) with a special energy function. Fragment database is generated using standard secondary structure elements, and the energy function is optimized with simulated annealing.

Ab-initio methods

•Rosetta (D. Baker, University of Washington)

Builds a database of short sequence fragments (9 aa) and then uses a Monte Carlo procedure to search conformational space. The resulting conformations are filtered based on backbone contacts, clustered, and re-evaluated with a more complicated energy function.

•UNRES (H. A. Scheraga, Cornell University)

Uses simplified representation of protein chain with a sophisticated, physics-based energy function. Genetic algorithm or Monte Carlo with minimization are used for global search. Structure is refined at the all-atom level with ECEPP force field.

Hierarchical Approach to Protein-Structure Prediction

Stage 1: Global optimization of the potential-energy function in a simplified representation of the polypeptide chain. This is the key stage of the algorithm.

Stage 2: Conversion of the lowest-energy structures to the all-atom representation.

Stage 3: Limited energy optimization of the all-atom structures.

lowest in energy at the simplified level

What is Global Optimization?

Local minimization (optimization) is an algorithm leading to the closest local minimum, i.e. the minimum of the attraction basin that contains the given starting point in the function domain.

Global minimization (optimization) is an algorithm designed to find the local minimum having the lowest (extreme) value of the function being optimized.
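The difference between the two definitions can be seen on a toy problem. The sketch below is only an illustration: the 1-D function `f`, the gradient-descent settings, and the multistart strategy are assumptions chosen for brevity, not methods from these slides. A single local minimization stays in the basin where it starts, while repeating it from many starting points and keeping the lowest result recovers the global minimum of this small function.

```python
import math

def f(x):
    # Toy 1-D "energy" with several local minima (illustrative assumption).
    return 0.1 * x * x + math.sin(3.0 * x)

def df(x):
    # Analytical derivative of f.
    return 0.2 * x + 3.0 * math.cos(3.0 * x)

def local_minimize(x, step=0.01, tol=1e-10, max_iter=100_000):
    # Plain gradient descent: slides downhill inside the current basin.
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

def multistart_minimize(starts):
    # Crude global search: minimize from every start, keep the lowest minimum.
    return min((local_minimize(x0) for x0 in starts), key=f)

x_local = local_minimize(4.0)
x_global = multistart_minimize([-6.0 + 0.5 * k for k in range(25)])
print(f"local search from x=4: x = {x_local:.3f}, f = {f(x_local):.3f}")
print(f"multistart search:     x = {x_global:.3f}, f = {f(x_global):.3f}")
```

Exhaustive multistart is feasible here only because the function has a handful of minima in one dimension.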

Thermodynamic hypothesis (Anfinsen, 1973):

The native structure corresponds to the global minimum of the free energy (F)

In most cases oscillations around the most stable structure are small and the lowest energy structure well represents the minimum-free-energy ensemble.

For proteins, energy functions may incorporate the solvent (all-atom models) and possibly other degrees of freedom (simplified models), but do not include the entropic free energy of backbone degrees of freedom. However, these entropic contributions are small in native states of proteins, since structural fluctuations are generally small. In this case free energy (F) may be substituted by energy (E) for optimization.

Why do global optimization at all?

Global optimization: can it be done?

Global optimization is a very hard (NP-complete) problem, impossible to solve in the general case as well as in most physically interesting cases. The problem lies in the huge number of local minima and in the fact that the function is usually defined in a very high-dimensional space:

• For an N-residue protein in an all-atom representation there are an estimated ~10^N local minima in a 6N-dimensional space (even if many of these minima are inaccessible for steric reasons, it is a large number …)

• For a cluster of 55 identical Lennard-Jones particles the number of local minima is estimated to be ~10^10 (in a 159-dimensional space, not counting permutations); for a 147-particle cluster this number grows to ~10^60 (in a 435-dimensional space)
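The dimensions quoted for the Lennard-Jones clusters follow from simple counting: 3 Cartesian coordinates per particle, minus 3 overall translations and 3 overall rotations. A minimal sketch (the function names are hypothetical, chosen for this example):

```python
def cluster_dimensions(n_particles):
    # 3 coordinates per particle minus 3 translations and 3 rotations.
    return 3 * n_particles - 6

def protein_dimensions(n_residues, angles_per_residue=6):
    # Torsional space: about 6 dihedral angles per residue, as assumed above.
    return angles_per_residue * n_residues

for n in (55, 147):
    print(f"LJ{n}: {cluster_dimensions(n)} dimensions")
print(f"100-residue protein: {protein_dimensions(100)} dimensions")
```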

Global optimization: can it be done?

General answer is: NO.

It can be done in some particular cases, provided that:

• Knowledge about the physics of the system under study is used to simplify the search, making it feasible

• The global optimization method is coupled with the potential function, i.e. they are designed and tuned together

The thermodynamic hypothesis must be supplemented by the requirement that the energy hypersurface is searchable, i.e. that the global energy minimum is distinguished from the others not only by the energy at the minimum, but also by the heights of barriers …

The most important feature of a potential energy hypersurface is its hierarchical organization, which helps the global optimization method to get from high-energy to low-energy structures.

This feature may be described in a 2D graph through relationships between groups of minima by using:

- the maximum energy along the optimal trajectories connecting minima (disconnectivity tree)

- the temperature at which particular minima kinetically confine the system

The dependence of the grouping on the relation parameter (E or T) is usually represented graphically by a tree, where the root corresponds to all the minima being equivalent, branches are groups of equivalent minima (at a given parameter value), and leaves are single minima.
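The grouping behind such a tree can be sketched with a union-find structure: at a given energy threshold, two minima belong to the same group if they are connected by a path whose highest point does not exceed the threshold. The barrier table below is invented for illustration, and the function name `group_minima` is hypothetical; a real analysis would use transition states located on the actual surface.

```python
def group_minima(n_minima, barriers, threshold):
    """Group minima connected by barriers not higher than `threshold`.

    barriers: dict mapping a pair (i, j) to the energy of the barrier
    (transition state) directly connecting minima i and j.
    """
    parent = list(range(n_minima))

    def find(i):
        # Find the representative of i's group, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), e_barrier in barriers.items():
        if e_barrier <= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n_minima):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical barrier energies between four minima.
barriers = {(0, 1): -1.0, (1, 2): 0.5, (2, 3): -0.5}
for e in (-2.0, 0.0, 1.0):
    print(e, group_minima(4, barriers, e))
```

Scanning the threshold from high to low energy splits one group into branches and finally into single minima, which is the tree read from root to leaves.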

Hierarchical energy/free energy surface

[Figure: an energy surface and the corresponding disconnectivity tree]

[Figure: disconnectivity graphs for the Lennard-Jones 6-12 clusters LJ38, LJ55 and LJ75]

Disconnectivity graph: minima diverge on the plot when the energy on the Y axis is greater than the highest energy on the lowest-energy pathway connecting two minima; otherwise they are represented together by a single vertical line.

J. P. K. Doye, M. A. Miller, and D. J. Wales J. Chem. Phys. 1999, 111, 8417-8428.

P. N. Mortenson and D. J. Wales J. Chem. Phys. 2001, 114, 6443-6454.

[Figure: Ac-(Ala)8-NHMe with AMBER95; panels: constant ε and distance-dependent ε]

Global Optimization: top-down versus local-minima-based

A top-down method tries to explore the tree-like structure of minima, starting from its root and progressing to the end of the longest branch (the global minimum). The most important problem for such a method is how to choose the appropriate branch when a bifurcation occurs and a group of minima splits. A top-down method in this sense is equivalent to some kind of deformation.

Examples: Branch-and-Bound, Diffusion Equation Method, Distance Scaling Method, Gaussian Density Annealing, Gaussian Packet Annealing, Self-Consistent Basin-to-Deformed-Basin Mapping, Simulated Annealing.

A local-minima-based method does not explore the tree-like structure of minima. It starts from a local minimum (or minima), progresses by finding more local minima, and builds branches of the disconnectivity graph by aggregating local minima. The most important problems here are how to cover as large a space as possible (usually solved by similarity classification into families) and how to obtain trial structures of as low energy as possible from the current structure(s) (solved by “smart” Monte Carlo moves, Genetic Algorithm, etc.).

Examples: Monte Carlo-Minimization, Conformational Space Annealing, Electrostatically-Driven Monte Carlo, Conformation-Family Monte Carlo.

CSA and CFMC fit here as methods that try to recover the “hierarchy” by different similarity measures (RMS, angles) and divide the space into groups represented by a few structures (CFMC) or a single structure (CSA).


At present local-minima-based methods perform better than top-down methods, or at least they are much more cost-effective.

The task of branch evaluation (predicting which branch leads to lower-energy minima) is by no means easier than designing effective algorithms for “smart” structure generation, but structure generation is by far more intuitive.

The task of choosing a deformation that leads to the global minimum is strongly linked to the branch evaluation problem and is very difficult as well.


A good potential function is a necessary element of any algorithm for structure prediction.

Good (predictive) potential function:

(a) Should reproduce the experimental structure within a certain accuracy

(b) The structures corresponding to the lowest-energy minima found for the potential should represent plausible structures, and one of them, preferably the global minimum, should correspond to the observed experimental structure

(c) The lowest energy structure(s) corresponding to the native structure should be separated from other, non-physical structures by an energy gap

Potential Function

Potential Function for Proteins

• Fast and scaling well with the size of the protein: global optimization requires a large number of local minimizations to converge.

• Easy to parameterize: the potential should be coupled with a global optimization method, which requires several Z-score/global optimization iterations

• Close to physical reality: this feature helps with the transferability of a potential.
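The slides do not define the Z-score or the energy gap numerically. One common convention, assumed here, measures how far the native energy lies below the mean energy of non-native structures in units of their standard deviation, and takes the gap as the distance to the lowest non-native energy. The energies in this sketch are made up for illustration.

```python
import statistics

def z_score(native_energy, decoy_energies):
    # Assumed definition: (E_native - mean(E_decoys)) / std(E_decoys).
    mean = statistics.mean(decoy_energies)
    std = statistics.pstdev(decoy_energies)
    return (native_energy - mean) / std

def energy_gap(native_energy, decoy_energies):
    # Gap between the native structure and the lowest non-native structure.
    return min(decoy_energies) - native_energy

decoys = [-90.0, -100.0, -110.0]   # hypothetical energies, kcal/mol
native = -130.0                    # hypothetical native energy, kcal/mol
print(f"Z-score = {z_score(native, decoys):.2f}")
print(f"gap     = {energy_gap(native, decoys):.1f} kcal/mol")
```

Under this convention a more negative Z-score and a larger positive gap both indicate a potential that singles out the native structure more clearly.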

All-atom: geometry

[Figure: all-atom polypeptide backbone fragment (C′, O, N, H, Cα) with the dihedral angles ϕ, ψ, ω and χ marked]

Variables:

3 backbone dihedral angles (ϕ, ψ, ω)

average of 3 side-chain dihedral angles (χ)

6 per residue (with fixed bond lengths and bond angles)

At least 7 centers of interaction per residue, ~15 on average

Computational expense scales as ~15N*(15N-1)/2 ≈ 225N^2/2

United-residue: geometry

Variables:

1 backbone dihedral angle
1 backbone virtual bond angle
1 side-chain rotation angle
1 side-chain tilt angle

4 per residue

Only 2 centers of interaction per residue

Computational expense scales as ~2N*(2N-1)/2 ≈ 2N^2

Side-chains are represented as ellipsoids (Gay-Berne potential)

Interaction centers are marked in colors
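The cost estimates for the two representations come from counting pairs of interaction centers, m(m-1)/2 for m centers. A short sketch, assuming the average of 15 centers per residue for the all-atom chain and 2 for the united-residue chain quoted above (the function name is hypothetical):

```python
def pair_count(n_residues, centers_per_residue):
    # Number of distinct pairs among all interaction centers: m*(m-1)/2.
    m = n_residues * centers_per_residue
    return m * (m - 1) // 2

n = 100
all_atom = pair_count(n, 15)
united = pair_count(n, 2)
print(f"all-atom pairs:       {all_atom}")
print(f"united-residue pairs: {united}")
print(f"ratio:                {all_atom / united:.1f}")
```

For a 100-residue chain this gives 1,124,250 pairs against 19,900, a ratio of roughly 56 per energy evaluation in favor of the united-residue model.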

All-atom representation of a polypeptide chain in solution (explicit water)

United-residue (UNRES) representation of a polypeptide chain

X – principal degrees of freedom (variables that define the Cα trace of a polypeptide chain)

Y – secondary or “less important” degrees of freedom that are averaged out

UNRES: formulation

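$$
U = \sum_{i<j} U_{SC_i SC_j}
  + w_{SCp} \sum_{i \neq j} U_{SC_i p_j}
  + w_{el} \sum_{i<j-1} U_{p_i p_j}
  + w_{tor} \sum_{i} U_{tor}(\gamma_i)
  + w_{b} \sum_{i} U_{b}(\theta_i)
  + w_{rot} \sum_{i} U_{rot}(\alpha_{SC_i}, \beta_{SC_i})
  + \sum_{m=2}^{N_{corr}} w_{corr}^{(m)} U_{corr}^{(m)}
$$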

- Three components (out of 13) were derived using the PDB (USCSC, Ub, Urot)

- The USCSC parametrization depends on the statistical probability of side-chain contacts in the PDB; these probabilities have not changed over the years (1977). Moreover, these values correlate highly (R=0.94) with the experimentally determined octanol partitioning coefficients. This suggests that these parameters reflect physical interactions.

- Ub and Urot are not specific and serve only to maintain reasonable values of virtual-bond angles and positions of side-chain centroids with respect to the Cα trace.

UNRES

UNRES: correlation terms

[Figure: diagrams of the correlation terms, with panels labeled 2nd order (electrostatic), 2nd order (local), 3rd order (local), 3rd order (local and electrostatic), 4th order, 4th order (local and electrostatic), 5th order, and 6th order]

UNRES OVERVIEW : Types of residues

The number of different residue types depends on the energy term considered, i.e. different numbers of amino-acid-related parameters are used for different energy terms.

• USCSC uses all 20 amino acids; parameters for each interacting pair are considered independent. This energy term is the only one that is fully sequence-dependent.

• Urot and Ub also use all 20 amino acids; however, they represent only internal energies within the residue, not residue-residue interactions.

• All other energy terms (which are cumulant-based) use a simplified set of 3 amino acids: Ala, Gly and Pro.

Good Global Optimization Methods: Common features

• Find many low-energy local minima with different topologies. The global minimum may not correspond to the native structure; therefore both the global minimum and distinct low-energy conformations should be found

• Consider the conformational space of local minima only, same as in the Monte Carlo-Minimization (MCM)

• Search as large a part of conformational space as possible in early stages and then narrow the search to a smaller region.

A road to global minimum

[Figure: plot with axes Energy (kcal/mol) and RMSD from native]

Conformational Space Annealing

• Finds many low-energy local minima with different topologies. The global minimum may not correspond to the native structure; therefore both the global minimum and distinct low-energy conformations should be found

• Considers the conformational space of local minima only, same as in the Monte Carlo-Minimization (MCM)

• Uses a genetic algorithm for structure generation

• Searches the whole conformational space in early stages and then narrows the search to a smaller region.

[Flowchart of the CSA procedure; * marks arbitrary numbers]

1. Generate 50* random conformations and minimize their energies; these form the First Bank.

2. Copy the First Bank to the Bank and set D cut = ½ D ave.

3. Select 20* seeds from the Bank.

4. Generate 600 conformations (30* for each seed) by modifying the seeds, and minimize their energies.

5. Update the Bank and reduce D cut.

6. All used as a seed? If no, return to step 3.

7. GMEC found? If yes, stop. If no, generate 50* random conformations, minimize their energies, add them to both the Bank and the First Bank, and return to step 3.

CSA is parallelized at the coarse-grain level: minimizations of newly generated structures are carried out in parallel, each minimization on a different processor.
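The bank-based loop of CSA can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the separable test function, the crossover-plus-perturbation move, the fixed number of rounds (instead of the "all used as a seed?" and "GMEC found?" tests), the 0.9 reduction factor, and the floor on the cutoff. The sketch takes the initial cutoff to be half of the average pairwise distance in the first bank, and it updates the bank by replacing either a similar member (closer than the cutoff) or the highest-energy member, and only when the trial is lower in energy.

```python
import math
import random

def energy(p):
    # Toy separable surface with many local minima (illustrative assumption).
    return sum(0.1 * c * c + math.sin(3.0 * c) for c in p)

def minimize(p, step=0.01, tol=1e-8, max_iter=20_000):
    # Gradient descent to the local minimum of the current basin.
    p = list(p)
    for _ in range(max_iter):
        g = [0.2 * c + 3.0 * math.cos(3.0 * c) for c in p]
        if max(abs(gi) for gi in g) < tol:
            break
        p = [c - step * gi for c, gi in zip(p, g)]
    return tuple(p)

def update_bank(bank, trial, d_cut):
    # Replace a similar member if one lies within d_cut, else the worst one,
    # and only if the trial has lower energy than the member it replaces.
    nearest = min(range(len(bank)), key=lambda i: math.dist(bank[i], trial))
    if math.dist(bank[nearest], trial) < d_cut:
        target = nearest
    else:
        target = max(range(len(bank)), key=lambda i: energy(bank[i]))
    if energy(trial) < energy(bank[target]):
        bank[target] = trial

def csa(dim=3, n_bank=20, n_seeds=5, n_children=4, rounds=30, seed=1):
    rng = random.Random(seed)
    first_bank = [minimize([rng.uniform(-6, 6) for _ in range(dim)])
                  for _ in range(n_bank)]
    bank = list(first_bank)
    pairs = [(a, b) for i, a in enumerate(bank) for b in bank[i + 1:]]
    d_ave = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    d_cut, d_min = d_ave / 2.0, d_ave / 5.0   # d_min is a hypothetical floor
    for _ in range(rounds):
        for s in rng.sample(bank, n_seeds):
            for _ in range(n_children):
                partner = rng.choice(bank)
                # Genetic-style move: mix coordinates, then perturb one.
                child = [s[k] if rng.random() < 0.5 else partner[k]
                         for k in range(dim)]
                child[rng.randrange(dim)] += rng.uniform(-2.5, 2.5)
                update_bank(bank, minimize(child), d_cut)
        d_cut = max(d_min, 0.9 * d_cut)       # annealing in conformational space
    return bank, first_bank

bank, first_bank = csa()
print("lowest energy, first bank:", round(min(map(energy, first_bank)), 3))
print("lowest energy, final bank:", round(min(map(energy, bank)), 3))
```

Because a bank member is replaced only by a lower-energy trial, the lowest energy in the bank can never rise; shrinking the cutoff shifts the search from broad coverage toward refinement of the surviving regions.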

Monte Carlo-Minimization (MCM)

Generate a random set of structures and locally minimize them.

1. Select the lowest-energy one as the “generative” structure.

2. Carry out a random change (perturbation) of the “generative” structure to produce a new conformation.

3. Minimize the energy of the new conformation.

4. Compare its energy to the energy of the “generative” structure by means of the Metropolis criterion [a higher-energy structure is accepted with probability exp(-∆E/kT)].

5. If accepted by the Metropolis criterion, the new (minimized) structure becomes the “generative” structure; otherwise the “generative” structure remains unchanged.

6. Return to step 2 and iterate.
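The MCM steps above can be sketched on a toy 1-D function. The function, the gradient-descent minimizer, and all parameter values (`kT`, `max_move`, number of steps) are assumptions for illustration; a real application would perturb dihedral angles and minimize a protein energy function.

```python
import math
import random

def energy(x):
    # Toy 1-D "energy" with several local minima (illustrative assumption).
    return 0.1 * x * x + math.sin(3.0 * x)

def local_minimize(x, step=0.01, tol=1e-10, max_iter=100_000):
    # Gradient descent to the minimum of the current basin.
    for _ in range(max_iter):
        g = 0.2 * x + 3.0 * math.cos(3.0 * x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

def mcm(n_steps=500, kT=0.5, max_move=2.5, n_start=5, seed=0):
    rng = random.Random(seed)
    # Random starting structures, each locally minimized.
    start = [local_minimize(rng.uniform(-6.0, 6.0)) for _ in range(n_start)]
    generative = min(start, key=energy)          # step 1
    best = generative
    for _ in range(n_steps):
        # Steps 2-3: perturb the generative structure, then minimize.
        trial = local_minimize(generative + rng.uniform(-max_move, max_move))
        delta = energy(trial) - energy(generative)
        # Steps 4-5: Metropolis test on the minimized energies.
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            generative = trial
        if energy(generative) < energy(best):
            best = generative
    return best

best = mcm()
print(f"lowest minimum found: x = {best:.3f}, E = {energy(best):.3f}")
```

Because every trial is minimized before the Metropolis test, the walk moves between local minima rather than between arbitrary points, which is what distinguishes MCM from plain Metropolis Monte Carlo.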

Conformation-Family Monte Carlo (CFMC)

• Uses the Metropolis criterion to move between families

• Uses the Boltzmann distribution to choose conformation from a family

• Does not move between structures, but between families

• It is equivalent to smoothed “staircase deformation” of a potential function

[Figure panels: Original function, MCM, CFMC]
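The two selection rules named in the bullets can be sketched as follows. This is a simplified illustration, not the published CFMC procedure: the families are a fixed, hypothetical table of energies, each family is represented by its lowest energy in the Metropolis test (an assumption), and no new conformations are generated.

```python
import math
import random

def boltzmann_probabilities(energies, kT):
    # Normalized Boltzmann weights, shifted by the minimum for stability.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def pick_conformation(energies, kT, rng):
    # Choose a conformation index within a family by its Boltzmann weight.
    probs = boltzmann_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]

def metropolis_accept(e_old, e_new, kT, rng):
    delta = e_new - e_old
    return delta <= 0.0 or rng.random() < math.exp(-delta / kT)

def cfmc_step(families, current, kT, rng):
    # Propose a move to another family; compare family energies
    # (assumed here to be the lowest energy within each family).
    candidate = rng.choice([name for name in families if name != current])
    if metropolis_accept(min(families[current]), min(families[candidate]), kT, rng):
        current = candidate
    return current, pick_conformation(families[current], kT, rng)

rng = random.Random(0)
families = {"A": [0.0, 0.4], "B": [-5.0]}    # hypothetical energies
print(cfmc_step(families, "A", kT=1.0, rng=rng))
print(boltzmann_probabilities([0.0, 1.0], kT=1.0))
```

Treating each family as a single state with one energy is what flattens the surface inside a family and produces the "staircase" picture.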

Application to UNRES polypeptide chains: Protein A

• Family definition was based on RMSD-based clustering (similar to minimal tree clustering method)

• New structures were produced by various kinds of angle perturbations (large for moves between families and small for improving a given family) and by averaging

• The interfamily cut-off varied from 5 Å at the beginning to 2 Å at the end of the simulation

• The lowest-energy family was always found

Representatives of the lowest-energy family (native)

The lowest and next-to-the-lowest (3.2Å RMSD)

The lowest and the fourth (2.5Å RMSD)

Three groups of families found by the CFMC: native, intermediate, and mirror image

The distribution of families found by the CFMC for Protein A

[Figure: energy plotted against RMSD from native and RMSD from mirror]

Superposition of the calculated (yellow) and experimental (red) structure of 1FSD RMSD=2.8 Å

PROTEINS: Structure, Function, and Genetics Suppl 3:149-170 (1999)

AB INITIO: ASSESSMENT

Analysis and Assessment of Ab Initio Three-Dimensional Prediction, Secondary Structure, and Contacts Prediction

C.A. Orengo, J.E. Bray, T. Hubbard, L. LoConte, and L. Sillitoe

Protein HDEA (T61) ……… the most impressive prediction was that of Scheraga’s group,23 using more classical ab initio methods, for which 76 residues could be superposed with an rmsd of 5.3 Å (using the Hubbard method) or 55 residues superposed to 3.81 Å (using the LCS plots of Zemla et al.14) (Fig. 9). Their method uses no information from sequence alignments, secondary structure prediction, or threading.

HDEA

RMSD=4.2 Å for 61 residues (80%, residues 25-85)

HDEA Segment

RMSD=2.9 Å for 27 residues (36%, residues 16-42)

CASP3 target T0061 (1BG8)

CASP4 target T0102: AS48 (bacteriocin AS-48, E. faecalis), 70 amino acids, cyclical. RMSD=4.2 Å; native=blue, calculated=red
