v25 – protein docking, fft

V25 protein docking, FFT Fast Fourier Transform

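Matching densities

Orienting the two lattices can be done with respect to 6 degrees of freedom: 3 for translation along x, y, and z, and 3 for rotation by the angles α, β, and γ. Among all these possibilities, one wishes to identify the relative orientation (x, y, z, α, β, γ) that minimizes the sum of squared differences

$$E(x,y,z,\alpha,\beta,\gamma) = \sum_{l,m,n} \left[ A_{l,m,n} - \left( R_{\alpha,\beta,\gamma}\, T_{x,y,z}\, B \right)_{l,m,n} \right]^2$$

Here, $R_{\alpha,\beta,\gamma}$ is a three-dimensional rotation matrix and $T_{x,y,z}$ is a translation operator that translates molecule B to the position x, y, z. Expanding the square leaves the cross term as the only one that depends on the relative placement, so minimizing the sum of squared errors is equivalent to maximizing the linear cross-correlation of A and B,

$$C(x,y,z,\alpha,\beta,\gamma) = \sum_{l,m,n} A_{l,m,n} \cdot \left( R_{\alpha,\beta,\gamma}\, T_{x,y,z}\, B \right)_{l,m,n}$$

for a given translation vector (x, y, z) and rotation (α, β, γ). Intuitively, we want to compute the overlap of the two densities after placing the two lattices on top of each other. But what does 'on top of each other' mean in mathematical terms?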

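What is the complexity of computing this correlation?

Task: compute the linear cross-correlation of A and B,

$$C(x,y,z,\alpha,\beta,\gamma) = \sum_{l,m,n} A_{l,m,n} \cdot \left( R_{\alpha,\beta,\gamma}\, T_{x,y,z}\, B \right)_{l,m,n}$$

for a given translation vector (x, y, z) and rotation (α, β, γ).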

Let R be the set of all possible rotations of molecule B. Certain combinations of rotations around the 3 Euler angles lead to the same final orientation. Those can be omitted, and we obtain a minimal set of rotations R.

With N grid points along each axis, we need $O(N^3)$ operations to compute each value $C_{xyz}$. This leads to a total time of $O(N^6)$ for computing the translational component of C at all grid points, and $O(|R| \, N^6)$ for the complete algorithm.
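The $O(N^6)$ cost of the translational scan is visible in a direct implementation: three loops over the translation and three more over the grid points. The following is a minimal sketch, not code from the lecture. It assumes cubic grids stored as nested lists, cyclic (wrap-around) shifts, and a molecule B that has already been rotated; the name `direct_correlation` is illustrative.

```python
def direct_correlation(a, b):
    """Brute-force translational cross-correlation of two N x N x N grids.

    c[x][y][z] = sum over (l, m, k) of a[l][m][k] * b[l-x][m-y][k-z],
    i.e. B is moved by (x, y, z) before the overlap is summed.
    Indices wrap around modulo N (an assumption of this sketch).
    """
    n = len(a)
    c = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(n):              # N^3 translations ...
        for y in range(n):
            for z in range(n):
                total = 0.0
                for l in range(n):  # ... times N^3 grid points each
                    for m in range(n):
                        for k in range(n):
                            total += a[l][m][k] * b[(l - x) % n][(m - y) % n][(k - z) % n]
                c[x][y][z] = total
    return c
```

Repeating this scan for every rotation in R gives the $O(|R| \, N^6)$ total.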

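Fast Fourier Transform
after: Numerical Recipes

Discrete Fourier transform of a function from a finite number of its sampled points. Suppose that we have N consecutive sampled values

$$h_k \equiv h(t_k), \qquad t_k \equiv k\Delta, \qquad k = 0, 1, 2, \ldots, N-1$$

so that the sampling interval is Δ. Let us assume that N is even.

The discrete Fourier transform of the N points $h_k$ is

$$H_n \equiv \sum_{k=0}^{N-1} h_k \, e^{2\pi i k n / N}$$

The formula for the discrete inverse Fourier transform, which recovers the set of $h_k$'s exactly from the $H_n$'s, is

$$h_k = \frac{1}{N} \sum_{n=0}^{N-1} H_n \, e^{-2\pi i k n / N}$$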
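The transform pair can be checked numerically with a direct summation. This is a small sketch using only the standard library. It uses the positive exponent in the forward transform, as in Numerical Recipes; many libraries use the opposite sign. The function names are illustrative.

```python
import cmath

def dft(h):
    """H_n = sum_k h_k * exp(+2*pi*i*k*n/N), evaluated directly."""
    n_pts = len(h)
    return [sum(h[k] * cmath.exp(2j * cmath.pi * k * n / n_pts) for k in range(n_pts))
            for n in range(n_pts)]

def inverse_dft(H):
    """h_k = (1/N) * sum_n H_n * exp(-2*pi*i*k*n/N), evaluated directly."""
    n_pts = len(H)
    return [sum(H[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts)) / n_pts
            for k in range(n_pts)]

samples = [1.0, 2.0, 0.0, -1.0]
recovered = inverse_dft(dft(samples))  # equals samples up to rounding error
```

Each of the N output values needs a sum over N inputs, which is where the quadratic cost of the direct method comes from.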

Fast Fourier Transform

How much computation is involved in computing the discrete Fourier transform of N points? Until the mid-1960s, the standard answer was this:

Define W as the complex number

$$W \equiv e^{2\pi i / N}$$

Then we can write

$$H_n = \sum_{k=0}^{N-1} W^{nk} \, h_k$$

The vector of $h_k$'s is multiplied by a matrix whose (n,k)th element is the constant W to the power n × k. The matrix multiplication produces a vector result whose components are the $H_n$'s.

This matrix multiplication requires $N^2$ complex multiplications, plus a smaller number of operations to generate the required powers of W.

So, the discrete Fourier transform appears to be an $O(N^2)$ process.

Fast Fourier Transform

However, the discrete Fourier transform (in 1 dimension) can be computed in $O(N \log_2 N)$ operations by an algorithm called the Fast Fourier Transform.

With $N = 10^6$, the difference between $O(N \log_2 N)$ and $O(N^2)$ is roughly 30 CPU seconds against 2 CPU weeks!
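The orders of magnitude can be reproduced with a short calculation. This sketch assumes one microsecond per operation, which is an assumption made here to turn operation counts into times, and it ignores constant factors.

```python
import math

SECONDS_PER_OPERATION = 1e-6   # assumed cost of one operation

n = 10**6
fft_seconds = n * math.log2(n) * SECONDS_PER_OPERATION   # N log2 N operations
direct_seconds = n**2 * SECONDS_PER_OPERATION             # N^2 operations
direct_days = direct_seconds / 86400                      # 86400 seconds per day
```

Under this assumption `fft_seconds` is about 20 seconds and `direct_days` is about 11.6 days, which is the same order as the figures quoted above.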

The FFT algorithm became generally known in the mid-1960s from the work of J.W. Cooley and J.W. Tukey. In fact, efficient methods to compute discrete Fourier transforms had been independently discovered many times, starting with Gauss in 1805.

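FFT by Danielson and Lanczos (1942)

Danielson and Lanczos showed that a discrete Fourier transform of length N can be rewritten as the sum of two discrete Fourier transforms, each of length N/2. One of the two is formed from the even-numbered points of the original N, the other from the odd-numbered points:

$$F_k = \sum_{j=0}^{N-1} e^{2\pi i j k / N} f_j
     = \sum_{j=0}^{N/2-1} e^{2\pi i k (2j) / N} f_{2j} + \sum_{j=0}^{N/2-1} e^{2\pi i k (2j+1) / N} f_{2j+1}$$

$$\phantom{F_k} = \sum_{j=0}^{N/2-1} e^{2\pi i k j / (N/2)} f_{2j} + W^k \sum_{j=0}^{N/2-1} e^{2\pi i k j / (N/2)} f_{2j+1}
     = F_k^e + W^k F_k^o$$

W is the same constant as before, $W = e^{2\pi i / N}$.

$F_k^e$: k-th component of the Fourier transform of length N/2 formed from the even components of the original $f_j$'s

$F_k^o$: k-th component of the Fourier transform of length N/2 formed from the odd components of the original $f_j$'s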
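The lemma can be verified numerically by building a length-N transform from the two half-length transforms. This is a minimal sketch with illustrative function names; the half-length transforms are still computed by direct summation here.

```python
import cmath

def dft(f):
    """Direct transform F_k = sum_j f_j * exp(+2*pi*i*j*k/N)."""
    n_pts = len(f)
    return [sum(f[j] * cmath.exp(2j * cmath.pi * j * k / n_pts) for j in range(n_pts))
            for k in range(n_pts)]

def danielson_lanczos(f):
    """F_k = F_k^e + W^k * F_k^o for an input of even length N."""
    n_pts = len(f)
    half = n_pts // 2
    w = cmath.exp(2j * cmath.pi / n_pts)
    f_even = dft(f[0::2])   # transform of the even-numbered points, length N/2
    f_odd = dft(f[1::2])    # transform of the odd-numbered points, length N/2
    # The half-length transforms are periodic in k with period N/2.
    return [f_even[k % half] + w**k * f_odd[k % half] for k in range(n_pts)]

data = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
combined = danielson_lanczos(data)   # agrees with dft(data) up to rounding
```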

FFT by Danielson and Lanczos (1942)

The wonderful property of the Danielson-Lanczos lemma is that it can be used recursively.

Having reduced the problem of computing $F_k$ to that of computing $F_k^e$ and $F_k^o$, we can do the same reduction of $F_k^e$ to the problem of computing the transform of its N/4 even-numbered input data and N/4 odd-numbered data.

We can continue applying the DL-Lemma until we have subdivided the data all the way down to transforms of length 1.

What is the Fourier transform of length one? It is just the identity operation that copies its one input number into its one output slot.
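Applying the lemma recursively down to length one gives a complete, if not optimized, FFT. The sketch below assumes the input length is a power of two and uses the same sign convention as the formulas above; `fft_recursive` is an illustrative name.

```python
import cmath

def fft_recursive(f):
    """Recursive FFT via the Danielson-Lanczos lemma (length must be a power of two)."""
    n_pts = len(f)
    if n_pts == 1:
        return list(f)                 # a one-point transform is the identity
    even = fft_recursive(f[0::2])      # F^e, length N/2
    odd = fft_recursive(f[1::2])       # F^o, length N/2
    half = n_pts // 2
    out = [0j] * n_pts
    for k in range(half):
        term = cmath.exp(2j * cmath.pi * k / n_pts) * odd[k]   # W^k * F_k^o
        out[k] = even[k] + term
        out[k + half] = even[k] - term   # uses W^(k + N/2) = -W^k
    return out
```

For example, `fft_recursive([1, 0, 0, 0])` returns four values equal to 1, the transform of a unit impulse at index 0.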

For every pattern of $\log_2 N$ e's and o's, there is a one-point transform that is just one of the input numbers $f_n$:

$$F_k^{eoeeoeo \cdots oee} = f_n \quad \text{for some } n$$

FFT by Danielson and Lanczos (1942)

The next trick is to figure out which value of n corresponds to which pattern of e's and o's in $F_k^{eoeeoeo \cdots oee} = f_n$. Answer: reverse the pattern of e's and o's, then let e = 0 and o = 1, and you will have, in binary, the value of n.

Idea: this works because the successive subdivisions of the data into even and odd are tests of successive low-order (least significant) bits of n.

This idea of bit reversal can be exploited in a very clever way which, along with the DL-Lemma, makes FFT practical:

Suppose we take the original vector of data fj and rearrange it into bit-reversed order, so that the individual numbers are in the order not of j, but of the number obtained by bit-reversing j.
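Both the pattern rule and the reordering are easy to express in code. The following sketch uses illustrative function names and assumes the array length is a power of two.

```python
def pattern_to_index(pattern):
    """Reverse the e/o pattern, read e as 0 and o as 1, interpret as binary."""
    bits = pattern[::-1].replace("e", "0").replace("o", "1")
    return int(bits, 2)

def bit_reverse(index, n_bits):
    """Reverse the lowest n_bits bits of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)   # append the current lowest bit
        index >>= 1
    return result

def bit_reversed_order(data):
    """Return data rearranged so that position j holds data[bit_reverse(j)]."""
    n_pts = len(data)
    n_bits = n_pts.bit_length() - 1            # log2(N) for a power of two
    return [data[bit_reverse(j, n_bits)] for j in range(n_pts)]
```

For an array of length 8, `bit_reversed_order(list(range(8)))` gives `[0, 4, 2, 6, 1, 5, 3, 7]`, and `pattern_to_index("eoo")` gives 6: even first (lowest bit 0), then odd twice (next two bits 1).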

FFT by Danielson and Lanczos (1942)

[Figure: Reordering an array (here of length 8) by bit reversal, (a) between two arrays, versus (b) in place.]

The points as given are the one-point transforms. We combine adjacent pairs to get two-point transforms, then combine adjacent pairs of pairs to get 4-point transforms, and so on until the first and second halves of the whole data set are combined into the final transform. Each combination takes of order N operations, and there are $\log_2 N$ combinations.

This, then, is the structure of an FFT algorithm.
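The two steps, bit-reversed reordering followed by $\log_2 N$ combination passes, can be sketched as an iterative routine. This is an illustrative implementation under the sign convention used above, not the Numerical Recipes code; it assumes the length is a power of two.

```python
import cmath

def fft_iterative(f):
    """Iterative FFT: bit-reversal reordering, then log2(N) combination passes."""
    n_pts = len(f)
    n_bits = n_pts.bit_length() - 1
    if n_pts < 1 or (1 << n_bits) != n_pts:
        raise ValueError("length must be a power of two")

    # Step 1: place each input value at its bit-reversed position.
    data = [0j] * n_pts
    for j in range(n_pts):
        rev = int(format(j, "b").zfill(n_bits)[::-1], 2) if n_bits else 0
        data[rev] = complex(f[j])

    # Step 2: combine transforms of length size/2 into transforms of length size.
    size = 2
    while size <= n_pts:
        half = size // 2
        w_step = cmath.exp(2j * cmath.pi / size)
        for start in range(0, n_pts, size):
            w = 1 + 0j
            for k in range(half):
                upper = data[start + k]
                lower = w * data[start + k + half]
                data[start + k] = upper + lower
                data[start + k + half] = upper - lower
                w *= w_step
        size *= 2
    return data
```

Each pass of the `while` loop touches all N values once, and there are $\log_2 N$ passes.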

Faster than FFT Shape Matching? Bioinformatics 23, 427 (2007)

Prediction of Assemblies from Pairwise Docking
Inbar et al., J. Mol. Biol. 349, 435 (2005)

CombDock: first fully automated approach for predicting hetero multimolecular assemblies based only on structural models of the protein subunits.

Problem appears more difficult than the pairwise docking problem; it is NP-hard.

Idea: exploit additional geometric constraints embraced in the combinatorial problem.

Input: a set of protein structural models.

Unlike a 3D puzzle, where two connected pieces in the puzzle solution match perfectly, we would like to tolerate some extent of penetration, due to the flexible nature of the proteins.

Pairwise docking: Katchalski-Katzir algorithm; FTDOCK (Gabb et al., J. Mol. Biol. (1997)).
Discretize proteins A and B on a grid.
Every node is assigned a value.
Use FFT to compute the correlation efficiently.

Output: solutions with best surface complementarity.
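The FFT step relies on the correlation theorem: the correlation over all cyclic translations is obtained from three transforms, which costs $O(N^3 \log N)$ per rotation instead of $O(N^6)$. The sketch below uses NumPy's `fftn` and `ifftn` and is not the FTDOCK implementation. It assumes real-valued grids, wrap-around translations, and no scoring scheme beyond the raw overlap; the function names are illustrative.

```python
import numpy as np

def fft_correlation(a, b):
    """Cyclic cross-correlation of two 3D grids for all translations at once.

    c[x, y, z] = sum over (l, m, n) of a[l, m, n] * b[l-x, m-y, n-z],
    so c[x, y, z] is the overlap after moving B by (x, y, z).
    """
    fa = np.fft.fftn(a)
    fb = np.fft.fftn(b)
    return np.real(np.fft.ifftn(fa * np.conj(fb)))

def best_translation(a, b):
    """Grid translation of B that maximizes the overlap with A."""
    c = fft_correlation(a, b)
    return np.unravel_index(np.argmax(c), c.shape)

# Toy example: single occupied nodes stand in for the two molecules.
grid_a = np.zeros((8, 8, 8))
grid_b = np.zeros((8, 8, 8))
grid_a[5, 2, 3] = 1.0
grid_b[1, 1, 1] = 1.0
shift = best_translation(grid_a, grid_b)   # B must move by (4, 1, 2)
```

In a full docking run this scan would be repeated for every rotation of B in the rotation set.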

(1) All pairs docking module
Inbar et al., J. Mol. Biol. 349, 435 (2005)

The module takes N protein structures as input and predicts pairwise interactions.

Perform pairwise docking for each of the N (N - 1) / 2 pairs of proteins. Keep the K best solutions for each pair of proteins.

Since pairwise docking is a difficult problem, the correct solution may only appear somewhere among the first few hundred solutions. K should therefore be set reasonably high.

Here, K was varied from dozens to hundreds.

(2) Combinatorial assembly module
Inbar et al., J. Mol. Biol. 349, 435 (2005)

Input: N subunits and N (N - 1) / 2 sets of K scored transformations. These are the candidate interactions.

Reduction to a spanning tree
Build a weighted graph representing the input:
each structural unit = vertex
each transformation = edge connecting the corresponding vertices
edge weight = score of the transformation

Since the input contains K transformations for each pair of subunits, we have a complete graph with K parallel edges between each pair of vertices.

(2) Combinatorial assembly module
Inbar et al., J. Mol. Biol. 349, 435 (2005)

For two subunits, each candidate complex is represented by an edge and the two vertices.

In the case of N structural units, a candidate complex is represented by a spanning tree, i.e. a subgraph of the input graph that connects all vertices and has no cycles.

Each spanning tree of the input graph represents a complex of all the input structural units. The problem of finding complexes is equivalent to finding spanning trees.

The number of spanning trees in a complete graph with no parallel edges is $N^{N-2}$ (Cayley's formula).

Since the input graph has K parallel edges between each pair of vertices, the number of spanning trees is $N^{N-2} K^{N-1}$.
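The count can be checked by brute force on a very small input graph. This sketch enumerates every choice of N - 1 edges and keeps the acyclic ones; the function names are illustrative and the enumeration is only practical for tiny N and K.

```python
from itertools import combinations

def count_formula(n_subunits, k_edges):
    """N^(N-2) * K^(N-1) spanning trees for K parallel edges per vertex pair."""
    return n_subunits ** (n_subunits - 2) * k_edges ** (n_subunits - 1)

def count_by_enumeration(n_subunits, k_edges):
    """Count spanning trees by testing every set of N-1 edges for cycles."""
    edges = [(u, v, copy)
             for u, v in combinations(range(n_subunits), 2)
             for copy in range(k_edges)]
    count = 0
    for subset in combinations(edges, n_subunits - 1):
        parent = list(range(n_subunits))   # union-find forest

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        acyclic = True
        for u, v, _ in subset:
            root_u, root_v = find(u), find(v)
            if root_u == root_v:           # edge would close a cycle
                acyclic = False
                break
            parent[root_u] = root_v
        if acyclic:                        # N-1 acyclic edges on N vertices span the graph
            count += 1
    return count
```

For N = 4 and K = 2 both functions give 128. For N = 10 and K = 100 the formula gives 10^26, which illustrates why exhaustive search is out of reach.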

Exhaustive searches are infeasible.

(2) Combinatorial assembly module: algorithm
Inbar et al., J. Mol. Biol. 349, 435 (2005)

The algorithm uses 2 basic principles:
(1) hierarchical construction of the spanning tree
(2) greedy selection of subtrees

Different trees share common subtrees. Therefore, generate trees with n vertices by connecting two trees of smaller size (that were previously generated) with an input edge.

Thus, the common parts of different trees are generated only once.

When connecting subtrees, validate only the inter-subtree constraints: one only needs to check for severe penetrations between pairs of subunits that belong to different subtrees.

(2) Combinatorial assembly module: algorithm
Inbar et al., J. Mol. Biol. 349, 435 (2005)

Stage 1: the algorithm constructs trees of size 1. Each tree contains a single vertex that represents a subunit.

Stage i: the tree complexes that consist of exactly i vertices (subunits) are generated by connecting two trees generated at a lower stage with an input edge transformation.

Tree complexes that fulfil the penetration constraint are kept for the next stages.

Because it is impractical to search all valid spanning trees, the algorithm performs a greedy selection of subtrees. For each subset of vertices, the algorithm keeps only the D best-scoring valid trees that connect them.

The tree score is the sum of its edge weights.
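The staged construction with greedy selection can be sketched on a toy graph. This is an illustration of the idea only, not the CombDock implementation. It assumes that a higher score is better, represents a tree by the set of indices of its edges, and stands in for the penetration test with a hypothetical callback `is_valid` that accepts everything by default.

```python
from itertools import combinations

def assemble(n_subunits, edges, keep_best, is_valid=lambda tree: True):
    """Hierarchical, greedy construction of spanning trees.

    edges: list of (u, v, score) candidate transformations.
    Returns up to keep_best (score, edge_index_set) pairs spanning all subunits.
    """
    # Stage 1: one single-vertex tree per subunit.
    by_subset = {frozenset([v]): [(0.0, frozenset())] for v in range(n_subunits)}

    for size in range(2, n_subunits + 1):
        candidates = {}                       # vertex subset -> {edge set: score}
        known = list(by_subset.items())       # trees from lower stages only
        for (set_a, trees_a), (set_b, trees_b) in combinations(known, 2):
            if len(set_a) + len(set_b) != size or set_a & set_b:
                continue
            for idx, (u, v, weight) in enumerate(edges):
                connects = (u in set_a and v in set_b) or (u in set_b and v in set_a)
                if not connects:
                    continue
                for score_a, tree_a in trees_a:
                    for score_b, tree_b in trees_b:
                        tree = tree_a | tree_b | {idx}
                        if is_valid(tree):    # hypothetical penetration check
                            bucket = candidates.setdefault(set_a | set_b, {})
                            bucket[tree] = score_a + score_b + weight
        # Greedy selection: keep only the best trees for each vertex subset.
        for subset, trees in candidates.items():
            ranked = sorted(((s, t) for t, s in trees.items()), key=lambda item: -item[0])
            by_subset[subset] = ranked[:keep_best]

    return by_subset.get(frozenset(range(n_subunits)), [])

toy_edges = [(0, 1, 5.0), (0, 1, 3.0), (1, 2, 4.0), (0, 2, 1.0)]
solutions = assemble(3, toy_edges, keep_best=2)
```

In this toy input the best tree uses edges 0 and 2 with score 9.0, and the second best uses edges 1 and 2 with score 7.0. Storing candidates in a dictionary keyed by edge set means that a tree reached through different splits is generated only once.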

Flowchart www.cs.tau.ac.il/~inbaryuv/combdoc/

Example
Inbar et al., J. Mol. Biol. 349, 435 (2005)

The construction of the third-best scoring solution of the arp2/3 complex (RMSD 1.2 Å). The combinatorial assembly algorithm is hierarchical: at the first stage, each complex consists of a single subunit.

At the ith stage it constructs complexes that consist of i subunits by connecting complexes of smaller size using one of the input candidate transformations.

The arp2/3 complex consists of seven subunits shown at the top. In this Figure we present only the complexes of the different stages that are relevant to the construction of the third-best scoring solution (at the bottom of the Figure).

Along with each complex is its corresponding subgraph, where the vertices represent the subunits and the edges represent the pairwise interactions that were used to construct the complex. In each graph, the red edge represents the transformation of the current stage, while blue edges represent transformations of previous stages.

Final scoring
Inbar et al., J. Mol. Biol. 349, 435 (2005)

The geometric score evaluates the shape complementarity between the subunits: check distances between surface points on adjacent subunits. Close surface points increase the score, penetrating surface points decrease the score.

The physico-chemical component of the final score counts the number of surface points that belong to non-polar atoms. This gives an estimate of the hydrophobic effect.

Clustering of solutions:
(1) Compute contact maps between subunits: an array of N (N - 1) bins. If two subunits are in contact within the complex, set the corresponding bit to 1, and to 0 otherwise.
(2) Superimpose complexes that have the same contact map and compute the RMSD between Cα atoms. If this distance is less than a threshold, consider the complexes as members of one cluster. For each cluster, keep only the complex with the highest score.
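The clustering step can be sketched as a single pass over the solutions in order of decreasing score. This is an illustrative simplification, not the published procedure: the contact map is stored as a set of subunit pairs rather than a bit array, the coordinates are assumed to be already superimposed, and the dictionary keys are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    squared = sum((p - q) ** 2
                  for point_a, point_b in zip(coords_a, coords_b)
                  for p, q in zip(point_a, point_b))
    return math.sqrt(squared / len(coords_a))

def cluster(complexes, threshold):
    """Keep the highest-scoring complex of each cluster.

    complexes: list of dicts with keys 'score', 'contacts' (set of subunit
    pairs in contact) and 'coords' (list of (x, y, z) tuples).
    """
    representatives = []
    for candidate in sorted(complexes, key=lambda c: c["score"], reverse=True):
        for kept in representatives:
            same_map = kept["contacts"] == candidate["contacts"]
            if same_map and rmsd(kept["coords"], candidate["coords"]) < threshold:
                break          # joins a cluster that already has a better member
        else:
            representatives.append(candidate)
    return representatives
```

Because candidates are visited from best to worst score, the first member of each cluster is automatically the one that is kept.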

Performance for known complexes Inbar et al., J. Mol. Biol. 349, 435 (2005)

Method works with different contact topologies.
Inbar et al., J. Mol. Biol. 349, 435 (2005)

The near-native solutions for two complexes with different contact topologies. Left: CombDock solution, Right: solution superposed on the crystal structure (gray thinner lines). (a) The sixth-best scoring solution for the IkBa/NF-kB complex of an unbound input, RMSD 1.9 Å. The p65 subunit was extracted from a homodimer structure (PDB 1BFT). The structure used for the IkBa subunit was generated by MODELLER6 v2 using bcl-3 (PDB 1k1b) as the template structure. (b) The second-best scoring solution of the VHL/elonginC/elonginB complex (PDB 1vcb), with an RMSD of 0.5 Å. Each complex consists of three subunits but, while in the IkBa/NF-kB complex all the subunits are in contact with each other, in the VHL/elonginC/elonginB complex elonginC is the core of the complex (in yellow), and VHL (in blue) and elonginB (in red) are not in contact. The algorithm was able to predict a near-native solution for both complexes regardless of their contact topologies.

Examples of large complexes
Inbar et al., J. Mol. Biol. 349, 435 (2005)

Left: CombDock solution, Right: solution superposed on the crystal structure (gray thinner lines).

The solutions are: (a) the third-best scoring assembly of the seven subunits of the arp2/3 complex, RMSD 1.2 Å; (b) the best-ranked complex of the ten subunits of RNA polymerase II, RMSD 1.4 Å.

Discussion of CombDock
Inbar et al., J. Mol. Biol. 349, 435 (2005)

For the five different targets, CombDock predicted at least one near-native solution and ranked it in the top ten for both bound and unbound cases.

Problem in evaluating performance: full sets of unbound structures are not available for complexes with a higher number of subunits.

It is unlikely that this version of the algorithm (using rigid protein conformations) will be able to correctly assemble such complexes if the input subunits undergo significant conformational changes. A future version should include hinge-bending movements of protein subunits.

Alber et al., Nature 450, 683 (2007)
