Consensus Fold Recognition Methods

Dongbo Bu

School of Computer Science

University of Waterloo

Joint work with S.C. Li, X. Gao, L. Yu, J. Xu, M. Li

Nov. 2006

Outline

• Background

• Consensus Prediction Methods

• ACE7: consensus method by identifying latent servers

• Experimental Results

• Future Work

Background

From sequence to structure

• The Rate Gap (motivation)
  – gene prediction is fast,
  – but experimental structure determination is slow

• The First Principle (possibility)
  – Sequence almost determines structure

• CASP Competition (benchmark)
  – A fair and objective examination

Computational Methods

• Homology Modeling: sequence-sequence alignment

• Threading: sequence-structure alignment

• Ab initio: database independent

Why Consensus?

• Observation:
  – no single server can reliably predict the best models for all the targets.
  – a particular structure prediction server may perform well on some targets, but badly on others.

• A natural idea to solve this issue:
  – to combine the strengths of different prediction methods to obtain better structural models.

What is a Consensus Method?

Formal Description

• Notations:
  – Target: the query protein sequence
  – Server: implementation of a prediction method
  – Model: a predicted structure

Classical Consensus Methods

Research History

• Early exploration of the consensus idea:
  – Combining many methods in one server.
  – INBGU (SHGU): D. Fischer, 2000
  – 3D-PSSM (Phyre): L. Kelley, 2000

• The first consensus server:
  – CAFASP-CONSENS: D. Fischer, 2001

• Successors:
  – Pcons/Pmodeller: J. Lundstrom, A. Elofsson, 2001
  – 3D-Jury: K. Ginalski, A. Elofsson, 2003
  – 3D-Shotgun: D. Fischer, 2003
  – ACE: L. Yu, J. Xu, M. Li, 2004

Three-step Process

• Step 1: Model Comparison
  – determine model similarities

• Step 2: Feature Extraction
  – formal description of a model

• Step 3: Model Selection
  – select a model, or part of it.

• Many machine learning techniques were introduced in the 3rd step.

3D-Shotgun: Majority Voting

• Basic Idea:
  – Reminiscent of “cooperative algorithms”

• Five Input Servers:
  – GONP, GONPM, PRFSEQ, SEQPPRF, SEQPMPRF

• Step 1. Model Comparison
  – For each initial model, find models with LOCAL similarity.

3D-Shotgun (cont)

• Step 2. Feature Extraction
  – For each model M, superimpose similar models upon M,
  – using the shared similarity to compute the transformation.
  – Build a multiple structure alignment A(M) as a result.
  – Feature:
    • the number of models that share a structural element with A(M).

3D-Shotgun (cont)

• Step 3. Selection
  – Majority Voting
  – Choose the structural element with the highest count.
  – The underlying rationale:
    • The recurring structural elements are most likely to be correct.
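The voting step can be illustrated with a minimal sketch. It assumes each input model has already been reduced to a set of structural-element labels; the labels and the `majority_vote` helper are hypothetical and are not part of 3D-Shotgun itself.

```python
from collections import Counter

def majority_vote(elements_per_model):
    """Return the structural element supported by the most models, and its count.

    elements_per_model: one set of (hypothetical) element labels per input model.
    """
    counts = Counter()
    for elements in elements_per_model:
        counts.update(set(elements))  # each model votes at most once per element
    # Break ties by label order so the result is deterministic.
    best = min(counts, key=lambda e: (-counts[e], e))
    return best, counts[best]

# Toy input: four models, three of which contain "helix_A".
models = [
    {"helix_A", "strand_B"},
    {"helix_A", "strand_C"},
    {"helix_A", "strand_B"},
    {"helix_D"},
]
winner, votes = majority_vote(models)
print(winner, votes)
```

In this toy input the recurring element wins even though no single model is trusted more than another.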

Confidence Assignment

• For each assembled model M’, a confidence score S’ is given as follows:

• Here,
  – k, l run over all the input models
  – S_{k,l} is the confidence score given by the individual server
  – Sim() adopts MaxSub.

Performance of 3D-Shotgun

CAFASP-Consensus and Pcons: Neural Network

• Step 1. Model Comparison
  – CAFASP-Consensus: check SCOP id, or run MaxSub
  – Pcons: LGScore2 to detect similarity

• Step 2: Feature Extraction
  – CAFASP-Consensus: number of similar models
  – Pcons:
    • ratio of the similar models
    • weighted f1
    • ratio of the similar 1st model

CAFASP-Consensus and Pcons: (cont)

• Step 3. Model Selection
  – Formulated into a machine learning problem
  – Attribute:
    • Log(LGScore2), significantly better than LGScore2.

Pmodeller = Pcons + ProQ

• ProQ:
  – a neural network package to measure the quality of a structure

• Pmodeller has an advantage over Pcons because a number of high-score but false-positive models are eliminated.

Performance of Pcons/Pmod

ACE: SVM Regression

• Step 1. Model Comparison
  – MaxSub

• Step 2. Feature Extraction
  – f1: the normalized similarity with all the other models
  – f2: the normalized similarity with the most similar one
  – f3: for each target, a measure of the divergence of server predictions.
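As a rough illustration of the three features, the sketch below computes them from a pairwise similarity matrix. The normalization is an assumption: similarities are taken to lie in [0, 1] already, f1 is a plain mean, and f3 is taken as one minus the mean pairwise similarity. The exact definitions used by ACE may differ, and `ace_style_features` is a hypothetical name.

```python
def ace_style_features(sim):
    """Compute per-model features from a symmetric similarity matrix.

    sim[i][j] is a pairwise similarity in [0, 1] (e.g. a MaxSub-like score)
    between models i and j of one target. Returns (f1, f2, f3) where
      f1[i] = mean similarity of model i to all other models,
      f2[i] = similarity of model i to its most similar other model,
      f3    = 1 - mean pairwise similarity (one divergence value per target).
    These definitions are assumptions made for illustration.
    """
    n = len(sim)
    f1, f2 = [], []
    for i in range(n):
        others = [sim[i][j] for j in range(n) if j != i]
        f1.append(sum(others) / len(others))
        f2.append(max(others))
    # The mean of f1 equals the mean over all off-diagonal pairs.
    f3 = 1.0 - sum(f1) / n
    return f1, f2, f3

# Toy target with three models; models 0 and 1 resemble each other.
sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
f1, f2, f3 = ace_style_features(sim)
print(f1, f2, f3)
```

A high f3 signals a target on which the servers disagree, where support from other models is weaker evidence.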

ACE (cont)

• Step 3: Selection
  – SVM Regression: to predict the model quality
  – Attribute:
    • MaxSub with the native structure

Performance of ACE

• In CASP6, ACE was ranked 2nd among 87 automatic servers.

• On LiveBench test set:

Other techniques

• 3D-Jury:
  – Rationale: the average of low-energy conformations is similar to the native structure.
  – Basic Idea: Mimic the averaging step by the following scoring function:
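A minimal sketch of a 3D-Jury-style score follows. It assumes a pairwise similarity matrix is already available, and that a model's score is the sum of its above-threshold similarities to the other models, divided by one plus the number of contributing pairs. The threshold of 40 and this normalization are assumptions and should be checked against the 3D-Jury publication.

```python
def jury_scores(sim, threshold=40.0):
    """Score each model by its thresholded similarity to every other model.

    sim[i][j]: pairwise similarity between models i and j (assumed here to be
    a MaxSub-like count of superimposable residues). Pairs at or below
    `threshold` contribute nothing; the sum is divided by
    1 + (number of contributing pairs).
    """
    n = len(sim)
    scores = []
    for i in range(n):
        support = [sim[i][j] for j in range(n)
                   if j != i and sim[i][j] > threshold]
        scores.append(sum(support) / (1 + len(support)))
    return scores

# Toy data: models 0-2 form a cluster, model 3 is an outlier.
sim = [
    [0, 90, 80, 10],
    [90, 0, 70, 20],
    [80, 70, 0, 30],
    [10, 20, 30, 0],
]
scores = jury_scores(sim)
best = scores.index(max(scores))
print(scores, best)
```

The model closest to the "average" of the cluster gets the highest score, while the outlier scores zero.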

Other techniques (cont)

• Robetta:
  – For each fragment, choose a local structure from a set, and assemble them to minimize an energy function.

• BPROMPT:
  – Bayesian Belief Network

• JPred:
  – Decision Tree

CASP7 Performance

ACE7: A Consensus Method by Identifying Latent Servers

Motivation

• Server Correlation:
  – Although consensus servers assume that each individual server is independent of the others, it is observed from CASP6 results that correlation exists between different servers to some degree.

• Negative Effect:
  – this kind of correlation sometimes makes a native-like model receive less support than the incorrect models.

Examination of ACE on CASP6 Dataset

• Observation:
  – If a native-like model receives support from only 1 or 2 servers, it is difficult to select it.

Source of Server Correlation

• Server Correlation:
  – some servers tend to generate similar results.

• Reason:
  – Roughly speaking, the correlations arise from the fact that these servers adopt similar techniques, including sequence alignment tools, secondary structure prediction methods, scoring functions, etc.

• Latent Servers:
  – Here, we use independent latent servers to represent the common features shared by these implicit servers.

ACE7: to reduce the server correlation

• Step 1. Adopting Maximum Likelihood to estimate the server correlation.

• Step 2. Employing the Principal Component Analysis technique to derive the latent servers.

• Step 3. Using an ILP model to weigh the latent servers.

Two Assumptions of ACE7

• Assumption 1:

– Here, we approximate C_{i,m} by:

• Assumption 2:

Maximum Likelihood Estimation of Server Correlation

Here,

Server Correlation

• Observation:
  – The server correlation is significant, given that there are thousands of candidate models.
  – some servers are correlated more tightly than others.
    • mGenThreader and RAPTOR (0.383) vs. FUGUE3 and Prospect (0.182).

• Implication:
  – These individual servers may be clustered into cliques according to their correlations;
  – the servers in a small clique may be underestimated under the simple “majority voting” rule.
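To make the notion of pairwise server correlation concrete, here is a sketch that uses the sample Pearson correlation as a stand-in. This is not the maximum-likelihood estimator of ACE7, whose likelihood is not reproduced here. The server labels "A", "B", "C" and the score lists are hypothetical.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(scores_by_server):
    """Return (sorted server names, matrix of pairwise correlations)."""
    names = sorted(scores_by_server)
    matrix = [[pearson(scores_by_server[a], scores_by_server[b])
               for b in names] for a in names]
    return names, matrix

# Hypothetical per-target scores for three servers.
scores_by_server = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],   # moves with A
    "C": [4, 1, 3, 2],   # largely unrelated to A
}
names, corr = correlation_matrix(scores_by_server)
print(names, corr)
```

Servers whose entries are close to 1 would fall into the same clique and effectively vote together.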

Uncovering the Latent Server

Uncovering the Latent Servers (cont)

• Using the PCA technique, the latent servers can be estimated as:
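The PCA step can be sketched in plain Python as follows, assuming a symmetric, positive-definite server correlation matrix is given. Leading eigenvectors are found by power iteration with deflation; each eigenvector gives the loadings of one latent server on the original servers. The 3-server matrix is a made-up example, and `principal_components` is a hypothetical helper rather than the ACE7 implementation.

```python
import math

def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def principal_components(corr, k, iters=200):
    """Return the k leading (eigenvalue, eigenvector) pairs of `corr`.

    Uses power iteration with deflation; assumes `corr` is symmetric with
    distinct positive leading eigenvalues.
    """
    n = len(corr)
    a = [row[:] for row in corr]
    components = []
    for _ in range(k):
        # Asymmetric start vector, to avoid starting orthogonal to a component.
        v = normalize([1.0 / (i + 1) for i in range(n)])
        for _step in range(iters):
            v = normalize(matvec(a, v))
        eigval = sum(x * y for x, y in zip(v, matvec(a, v)))  # Rayleigh quotient
        if max(v, key=abs) < 0:  # fix the sign for reproducibility
            v = [-x for x in v]
        components.append((eigval, v))
        # Deflate: remove the component just found.
        a = [[a[i][j] - eigval * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return components

# Hypothetical correlation matrix: servers 0 and 1 are correlated, 2 is not.
corr = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
comps = principal_components(corr, 2)
for eigval, vec in comps:
    print(round(eigval, 3), [round(x, 3) for x in vec])
```

In this toy case the first latent server loads equally on the two correlated servers, and the second represents the independent server alone.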

Explanation of Latent Servers

• Observation:
  – H1: represents MGTH and RAPT
  – H2: SPKS
  – H3: FUG3
  – H4: ST02
  – H5: PROS
  – H6: no preference

Construct a More Accurate Server

• Since latent servers are mutually independent, it is reasonable to assume:

• Key Point:
  – How to set the weight of each latent server?
  – An ILP model:
    • To maximize the gap between the scores of the native-like models and incorrect models.
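The weighting idea can be illustrated with a brute-force stand-in for an ILP solver. The sketch assumes a model's score is a weighted sum of its latent-server scores, and it searches small integer weights that maximize the smallest gap between native-like and incorrect models. The weight range, the normalization by the weight sum, and all data are assumptions, and no soft-margin slack is modeled.

```python
from itertools import product

def score(weights, latent_scores):
    """Weighted sum of a model's latent-server scores (assumed linear form)."""
    return sum(w * h for w, h in zip(weights, latent_scores))

def best_integer_weights(native_like, incorrect, max_weight=3):
    """Search integer weights in 0..max_weight for the largest margin.

    The margin is (lowest native-like score) - (highest incorrect score),
    divided by the weight sum so that scaling all weights up gains nothing.
    """
    dim = len(native_like[0])
    best_gap, best_w = None, None
    for w in product(range(max_weight + 1), repeat=dim):
        if sum(w) == 0:
            continue  # the all-zero weighting carries no information
        gap = (min(score(w, p) for p in native_like)
               - max(score(w, q) for q in incorrect)) / sum(w)
        if best_gap is None or gap > best_gap:
            best_gap, best_w = gap, w
    return best_w, best_gap

# Hypothetical latent-server scores (H1, H2) on a 0-100 scale.
native_like = [(10, 90)]           # supported only by the second latent server
incorrect = [(80, 20), (70, 10)]   # supported mainly by the first
weights, gap = best_integer_weights(native_like, incorrect)
print(weights, gap)
```

With equal weights the native-like model would tie with an incorrect one; the search instead puts all weight on the latent server that supports it.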

ILP Model (soft-margin idea)

Experiment on CASP7 Dataset

• Observation:
  – For T0363, ACE7 succeeds even though only one server votes for the native-like model.

Sensitivity of ACE7

• Observation:
  – ACE7 has a higher sensitivity than any individual server.

Future Work

Conclusion

• Though consensus methods rely on the structure clustering property, server correlation also brings a negative effect.

Future Work

• To find a better approximation of C_{i,m}.

• Using MaxSub instead of GDT.

• RAPTOR performs well in choosing the top 5 models, but always has difficulty choosing the top 1 model.

• Choosing the best of the top 5 models remains an open problem that we aim to address.

Thanks.
