
Structure Prediction (I): Secondary structure

DNA/Protein structure-function analysis and prediction

Lecture 7

Center for Integrative Bioinformatics VU

Faculty of Sciences

Protein secondary structure

[Figure: 20 amino acid types, a generic residue, peptide bond, alpha-helix, beta strands/sheet]

1 MKYNNHDKIR DFIIIEAYMF RFKKKVKPEV
31 DMTIKEFILL TYLFHQQENT LPFKKIVSDL
61 CYKQSDLVQH IKVLVKHSYI SKVRSKIDER
91 NTYISISEEQ REKIAERVTL FDQIIKQFNL
121 ADQSESQMIP KDSKEFLNLM MYTMYFKNII
151 KKHLTLSFVE FTILAIITSQ NKNIVLLKDL
181 IETIHHKYPQ TVRALNNLKK QGYLIKERST
211 EDERKILIHM DDAQQDHAEQ LLAQVNQLLA
241 DKDHLHLVFE

Protein primary structure

SARS Protein From Staphylococcus aureus

1 MKYNNHDKIR DFIIIEAYMF RFKKKVKPEV DMTIKEFILL TYLFHQQENT
SHHH HHHHHHHHHH HHHHHHTTT SS HHHHHHH HHHHS S SE

51 LPFKKIVSDL CYKQSDLVQH IKVLVKHSYI SKVRSKIDER NTYISISEEQ
EEHHHHHHHS SS GGGTHHH HHHHHHTTS EEEE SSSTT EEEE HHH

101 REKIAERVTL FDQIIKQFNL ADQSESQMIP KDSKEFLNLM MYTMYFKNII
HHHHHHHHHH HHHHHHHHHH HTT SS S SHHHHHHHH HHHHHHHHHH

151 KKHLTLSFVE FTILAIITSQ NKNIVLLKDL IETIHHKYPQ TVRALNNLKK
HHH SS HHH HHHHHHHHTT TT EEHHHH HHHSSS HHH HHHHHHHHHH

201 QGYLIKERST EDERKILIHM DDAQQDHAEQ LLAQVNQLLA DKDHLHLVFE
HTSSEEEE S SSTT EEEE HHHHHHHHH HHHHHHHHTS SS TT SS

First two levels of protein structure

Why predict when we can get the real thing?

PDB: 29326 protein structures

UniProt Release 3.5 consists of:
Swiss-Prot Release: 167089 protein sequences
TrEMBL Release: 1560235 protein sequences

Primary structure: No problems

Secondary structure: Overall 77% accurate at predicting

Tertiary structure: Overall 35% accurate at predicting

Quaternary structure: No reliable means of predicting yet

Function: Do you feel like guessing?

Secondary structure is derived from tertiary coordinates.

To get to tertiary structure we need NMR or X-ray crystallography.

We have an abundance of primary sequences, so why not use them?

ALPHA-HELIX: Hydrophobic-hydrophilic residue periodicity patterns

BETA-STRAND: Edge and buried strands, hydrophobic-hydrophilic residue periodicity patterns

OTHER: Loop regions contain a high proportion of small or polar residues such as alanine, glycine, serine and threonine.

Glycine is abundant in loops because of its flexibility, and proline for entropic reasons related to the rigid kink it introduces in the main chain. Because this kink is incompatible with helices and strands, proline is normally not observed in these two structures, although it can occur in the first two N-terminal positions of α-helices.

Some SSE rules that help

Edge

Buried

The use of computers to predict protein secondary structure began about 30 years ago (Nagano (1973) J. Mol. Biol., 75, 401), with methods working on single sequences.

The accuracy of the computational methods devised early on was in the range 50-56% (Q3). The highest accuracy was achieved by Lim with a Q3 of 56% (Lim, V. I. (1974) J. Mol. Biol., 88, 857). The most widely used method was that of Chou-Fasman (Chou, P. Y., Fasman, G. D. (1974) Biochemistry, 13, 211).

Random prediction would yield about 40% (Q3) correctness given the observed distribution of the three states H, E and C in globular proteins (with generally about 30% helix, 20% strand and 50% coil).
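The random baseline can be checked with a short calculation. The sketch below assumes a predictor that guesses each state at random with the same frequencies as the observed states (the 30/20/50 split quoted above); the function name is illustrative and not taken from any of the cited methods.

```python
def random_q3(freqs):
    """Expected Q3 when each state is guessed at random with its observed frequency.

    A residue is correct when the guessed state equals the observed state,
    which happens with probability p * p for a state of frequency p.
    """
    return sum(p * p for p in freqs.values())


# Approximate state frequencies in globular proteins, as quoted in the text.
observed = {"H": 0.30, "E": 0.20, "C": 0.50}
print(f"expected random Q3 = {random_q3(observed):.2f}")
```

Under these assumptions the expected Q3 is 0.09 + 0.04 + 0.25 = 0.38, in line with the figure of about 40%.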

Historical background

Nagano 1973 - Interactions of residues in a window of 6. The interactions were linearly combined to calculate interacting residue propensities for each SSE type (H, E or C) over 95 crystallographically determined protein tertiary structures.

Lim 1974 - Predictions are based on a set of complicated stereochemical prediction rules for helices and sheets based on their observed frequencies in globular proteins.

Chou-Fasman 1974 - Predictions are based on differences in residue type composition for three states of secondary structure: helix, strand and turn (i.e., neither helix nor strand). Neighbouring residues were checked for helices and strands, and predicted types were selected according to the higher scoring preference and extended as long as no residues rarely observed in that state (e.g. proline) were encountered and the scores remained high.

The older standard: GOR

The GOR method (version IV) was reported by the authors to achieve a single-sequence prediction accuracy of 64.4%, as assessed through jackknife testing over a database of 267 proteins with known structure. (Garnier, J. G., Gibrat, J.-F., Robson, B. (1996) In: Methods in Enzymology (Doolittle, R. F., Ed.) Vol. 266, pp. 540-53.)

The GOR method relies on the frequencies observed for residues in a 17-residue window (i.e. eight residues N-terminal and eight C-terminal of the central window position) for each of the three structural states.

The sliding window: GOR

Sliding window

Sequence of known structure

A constant window, n residues long, slides along the sequence

Central residue

The amino acid frequencies are converted to secondary structure propensities for the central window position using an information function based on conditional probabilities. As it is not feasible to sample all possible 17-residue fragments directly from the PDB (there are 20^17 possibilities), increasingly complex approximations have been applied.

In GOR I and GOR II, the 17 positions in the window were treated as being independent, and so single-position information could be summed over the 17-residue window.

In GOR III, this approach was refined by including pair frequencies derived from 16 pairs between each non-central and the central residue in the 17-residue window.

The current version, GOR IV, combines pair-wise information over all possible paired positions in a window.

H H H EE E E E

The frequencies of the residues in the window are converted to probabilities of observing an SSE type
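The independent-position idea of GOR I and II can be sketched as follows. This is a simplified illustration, not the published GOR parameterisation: the information function used here is the plain log-ratio log(P(S | residue at offset m) / P(S)) with pseudocounts, and the three training sequences are made-up toy data.

```python
import math
from collections import defaultdict

STATES = "HEC"
HALF = 8  # eight residues either side of the centre gives a 17-residue window


def train(examples, pseudo=1.0):
    """Count residues at each window offset around each central state."""
    state_n = defaultdict(float)  # state -> count
    joint_n = defaultdict(float)  # (state, offset, residue) -> count
    resid_n = defaultdict(float)  # (offset, residue) -> count
    for seq, states in examples:
        for i, s in enumerate(states):
            state_n[s] += 1
            for m in range(-HALF, HALF + 1):
                if 0 <= i + m < len(seq):
                    joint_n[(s, m, seq[i + m])] += 1
                    resid_n[(m, seq[i + m])] += 1
    total = sum(state_n.values())

    def info(s, m, a):
        # Single-position information: log( P(S | a at offset m) / P(S) ).
        p_cond = (joint_n[(s, m, a)] + pseudo) / (resid_n[(m, a)] + pseudo * len(STATES))
        p_prior = (state_n[s] + pseudo) / (total + pseudo * len(STATES))
        return math.log(p_cond / p_prior)

    return info


def predict(seq, info):
    """Sum the single-position information over the window; pick the best state."""
    result = []
    for i in range(len(seq)):
        offsets = [m for m in range(-HALF, HALF + 1) if 0 <= i + m < len(seq)]
        scores = {s: sum(info(s, m, seq[i + m]) for m in offsets) for s in STATES}
        result.append(max(STATES, key=lambda s: scores[s]))
    return "".join(result)


# Toy training data (hypothetical): one sequence per state.
toy = [("AAAAAAAAAA", "HHHHHHHHHH"),
       ("VVVVVVVVVV", "EEEEEEEEEE"),
       ("GGGGGGGGGG", "CCCCCCCCCC")]
info = train(toy)
print(predict("AAAAAA", info))
```

Summing over positions is only valid because the 17 positions are treated as independent. GOR III and IV replace the single-position terms with pair terms.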

Accuracy burst due to four separate improvements

1) Using Multiple sequence Alignments instead of single sequence input

2) More advanced decision making algorithms

3) Improvement of sequence database search tools
   1) PSI-BLAST (Altschul et al, 1997) - most widely used
   2) SAM (Karplus et al, 1998)

4) Increasingly larger database size (more candidates)

Using Multiple Sequence Alignments

Zvelebil et al. (1987) were the first to exploit multiple sequence alignments to predict secondary structure automatically, by extending the GOR method, and reported that predictions improved by 9% compared to single sequence prediction. Multiple alignments, as opposed to single sequences, offer a much improved means to recognise positional physicochemical features such as hydrophobicity patterns. Moreover, they provide better insight into the positional constraints of the amino acid composition. Finally, the placement of gaps in the alignment can be indicative of loop regions.

Levin et al. (1993) also quantified the effect and observed 8% increased accuracy when multiple alignments of homologous sequences with sequence identities of 25% were used.

As a consequence, the current state-of-the-art methods all use input information from multiple sequence alignments but are sensitive to alignment quality.
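The positional information an alignment provides can be illustrated with a per-column profile. The sketch below assumes the alignment is given as equal-length strings with "-" for gaps; the three aligned sequences are made-up toy data and the function name is illustrative.

```python
from collections import Counter


def column_profile(alignment):
    """Per-column residue frequencies and gap fraction of an alignment."""
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        residues = sum(counts.values())
        freqs = {aa: c / residues for aa, c in counts.items()} if residues else {}
        profile.append({"freqs": freqs, "gap_fraction": column.count("-") / n_seqs})
    return profile


# Toy alignment (hypothetical sequences).
alignment = ["MKV-LA",
             "MRVGLA",
             "MKI-LS"]
profile = column_profile(alignment)
for position, column in enumerate(profile, start=1):
    print(position, column["freqs"], round(column["gap_fraction"], 2))
```

Columns with a high gap fraction (column 4 here) are the kind of signal that can point to loop regions, while conserved columns expose positional constraints.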

Sequence cheY (PDB code 3chy)

AA |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|
INIT PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |
Iter 1 PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |
Iter 2 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |
Iter 3 PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
Iter 4 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |
Iter 5 PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
Iter 6 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |
Iter 7 PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
Iter 8 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |
Iter 9 PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |
DSSP | TT EEEE S HHHHHHHHHHHHHHT EEEESSHHHHHHHHHH EEEEES S|

AA |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|
INIT PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |
Iter 1 PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 2 PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 3 PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 4 PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 5 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 6 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
Iter 7 PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 8 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
Iter 9 PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
DSSP |SS HHHHHHHHHH TTTTT EEEEESS HHHHHHHHHTT SEEEESS HHHHHHHHHHHHHHHT |

Requires an initial training phase

TRAINING:

Sequence fragments of a certain length derived from a database of known structures are used, so that the central residue of such fragments can be assigned the true secondary structural state as a label.

Then a window of the same length is slid over the query sequence (or multiple alignment) and for each window the k most similar fragments are determined using a certain similarity criterion.

The distribution of the thus obtained k secondary structure labels is then used to derive propensities for three SSE states (H,E or C).
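The k-nearest-neighbour procedure described above can be sketched as follows. The similarity criterion used here (number of identical positions), the window length, the value of k and the two database entries are all assumptions chosen for illustration. Published methods use richer similarity scores.

```python
from collections import Counter


def fragments(seq, states, w):
    """Label each w-residue fragment with the true state of its central residue."""
    half = w // 2
    padded = "-" * half + seq + "-" * half
    return [(padded[i:i + w], states[i]) for i in range(len(seq))]


def identity(a, b):
    """Similarity criterion: number of identical, non-padding positions."""
    return sum(x == y and x != "-" for x, y in zip(a, b))


def knn_predict(query, database, w=7, k=3):
    labelled = [f for seq, states in database for f in fragments(seq, states, w)]
    half = w // 2
    padded = "-" * half + query + "-" * half
    prediction = []
    for i in range(len(query)):
        window = padded[i:i + w]
        # The k most similar database fragments vote with their labels.
        nearest = sorted(labelled, key=lambda f: identity(window, f[0]), reverse=True)[:k]
        votes = Counter(label for _, label in nearest)
        prediction.append(votes.most_common(1)[0][0])
    return "".join(prediction)


# Toy database of "known structures" (hypothetical).
database = [("AAAAAAA", "HHHHHHH"), ("VVVVVVV", "EEEEEEE")]
print(knn_predict("AAAA", database))
```

The vote counts among the k labels could equally be normalised into propensities for H, E and C instead of being reduced to a single state.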

Improved Methods: K-Nearest Neighbour

Sequence fragments from database of known structures

Sliding window

Central residue

Similarity good enough

Qseq

PSSHHE

A neural network has to be trained.

TRAINING:

Like k-NN but this time the information is used to adjust the weights of the internal connections for optimising the grouping of a set of input patterns into a set of output patterns.

It is normally difficult to understand the internal functioning of the network.

Beware of overtraining the network.

Improved Methods: Neural Networks

Neural networks are learning systems based upon complex non-linear statistics. They are organised as interconnected layers of input and output units, and can also contain intermediate (or "hidden") unit layers (neurons). Each unit in a layer receives information from one or more other connected units and determines its output signal based on the weights of the input signals (synapses).

Sliding window

Qseq

Sequence database of known structures

Central residue

Neural Network

The weights are adjusted according to the model used to handle the input data.

Neural networks

Training an NN:

Forward pass: the outputs are calculated and the error at the output units is calculated.

Backward pass: the output unit error is used to alter weights on the output units. Then the error at the hidden nodes is calculated (by back-propagating the error at the output units through the weights), and the weights on the hidden nodes are altered using these values.

For each data pair to be learned a forward pass and backwards pass is performed. This is repeated over and over again until the error is at a low enough level (or we give up).
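The forward and backward passes can be sketched with a very small network. This is a minimal illustration under stated assumptions: one hidden layer of three sigmoid units, plain gradient descent with a fixed learning rate, and a made-up toy task (output 1 when at least two of three inputs are 1). None of these choices come from the prediction methods discussed here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass; a constant 1.0 is appended as the bias input of each layer."""
    xb = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hidden]
    out = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, out


def train(data, n_hidden=3, rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def total_error():
        return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data)

    initial = total_error()
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(x, w_hidden, w_out)
            # Backward pass: error term at the output unit ...
            delta_out = (target - out) * out * (1.0 - out)
            # ... back-propagated through the output weights to the hidden units.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            for j, v in enumerate(hidden + [1.0]):
                w_out[j] += rate * delta_out * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] += rate * delta_hidden[j] * v
    return initial, total_error()


# Toy task (hypothetical): target is 1.0 when at least two of the three inputs are 1.0.
bits = (0.0, 1.0)
data = [([a, b, c], 1.0 if a + b + c >= 2 else 0.0) for a in bits for b in bits for c in bits]
initial, final = train(data)
print(f"squared error before training: {initial:.3f}, after: {final:.3f}")
```

A fixed number of epochs stands in for the "until the error is low enough (or we give up)" stopping rule. Monitoring the error on data not used for training is the usual guard against overtraining.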

Y = 1 / (1 + exp(-k · Σ(W_in · X_in))), where W_in is the weight and X_in is the input

The graph shows the output for k=0.5, 1, and 10, as the activation varies from -10 to 10.
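The effect of the gain k on the unit output can be reproduced numerically. The sketch below evaluates the sigmoid for the three values of k mentioned, using a single input with weight 1 so that the summed activation equals the input; the function name is illustrative.

```python
import math


def unit_output(weights, inputs, k=1.0):
    """Y = 1 / (1 + exp(-k * sum(W_in * X_in)))."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-k * activation))


for k in (0.5, 1.0, 10.0):
    outputs = [unit_output([1.0], [a], k) for a in (-10, -1, 0, 1, 10)]
    print(f"k={k}: " + "  ".join(f"{y:.3f}" for y in outputs))
```

A larger k makes the transition around zero activation steeper, so the unit behaves more like a hard threshold. The output is 0.5 at zero activation for any k.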

Diversity and alignment size give better predictions

PSIPRED (Jones, 1999), the reigning secondary structure prediction method for the last 5 years, incorporates multiple sequence information from database searching and neural nets.

The method exploits position specific scoring matrices (PSSMs) as generated by the PSI-BLAST algorithm (Altschul et al, 1997) and feeds those to a two-layered neural network.

Since the method invokes the PSI-BLAST database search engine to gather information from related sequences, the method only needs a single sequence as input. The accuracy of the PSIPRED method is 76.5%, as evaluated by the author.

An investigation into the effects of larger databases and more accurate sequence selection methods has shown that these improvements provide better and more diverse MSAs for secondary structure prediction. (Przybylski, D. and Rost, B. (2002) Proteins, 46, 197-205.)

The PHD method (Profile network from HeiDelberg) broke the 70% barrier of prediction accuracy (Rost and Sander, 1993).

PHD, PHDpsi, PROFsec

Since the original method, the BLAST search and MAXHOM alignment routines have been replaced by PSI-BLAST in PHDpsi, and more recently the use of complex bi-directional neural networks has given rise to PROFsec, which is a close competitor of PSIPRED and in many cases better.

Three neural networks:

1) A 13 residue window slides over the alignment and produces 3-state raw secondary structure predictions.

2) A 17-residue window filters the output of network 1. The output of the second network then comprises for each alignment position three adjusted state probabilities. This post-processing step for the raw predictions of the first network is aimed at correcting unfeasible predictions and would, for example, change (HHHEEHH) into (HHHHHHH).

3) A network for a so-called jury decision between networks 1 and 2 and a set of independently trained networks (extra predictions to correct for training biases). The predictions obtained by the jury network undergo a final simple filtering step that deletes predicted helices of one or two residues and changes them into coil.
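The final filtering step is simple enough to sketch directly. The code below assumes predictions are strings over H, E and C (with C for coil) and is an illustration of the rule as stated, not the PHD source code.

```python
import re


def remove_short_helices(states, min_len=3):
    """Change predicted helices shorter than min_len residues into coil ('C')."""
    def fix(match):
        run = match.group()
        return run if len(run) >= min_len else "C" * len(run)
    return re.sub(r"H+", fix, states)


print(remove_short_helices("CCHHCCHHHHEEHC"))
```

Helices of one or two residues are removed because a single α-helical turn already spans more than three residues, so shorter predicted helices are not physically meaningful.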

How to develop a secondary structure prediction method

[Flowchart]

Database of N sequences with known structure

Training set of K<N sequences with known structure (for jackknife test: K=N-1)

Test set of T<<N sequences with known structure (for jackknife test: T=1)

Method; trained method; prediction

Standard of truth; assessment method(s); prediction accuracy

For full jackknife test: repeat process N times and average prediction scores

Other method(s) prediction; method benchmark

A jackknife test is a test scenario for prediction methods that need to be tuned using a training database.

Its simplest form:

For a database containing N sequences with known tertiary (and hence secondary) structure, a prediction is made for one test sequence after training the method on the remaining training database containing the N-1 remaining sequences (one-at-a-time jackknife testing).

A complete jackknife test would involve N such predictions.

If N is large enough, meaningful statistics can be derived from the observed performance. For example, the mean prediction accuracy and associated standard deviation give a good indication of the sustained performance of the method tested.

If this is computationally too expensive, the database can be split into larger groups, which are then jackknifed.
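The one-at-a-time jackknife can be sketched as a generic loop around any trainable method. In the sketch below the "method" is a deliberately trivial stand-in (always predict the most common state in the training set) and the four database entries are made-up toy data; all function names are illustrative.

```python
from collections import Counter
from statistics import mean, stdev


def q3(predicted, observed):
    """Fraction of residues whose state is predicted correctly."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def jackknife(database, train, predict):
    """Train on N-1 entries, predict the held-out entry, repeat N times."""
    scores = []
    for i, (seq, observed) in enumerate(database):
        training_set = database[:i] + database[i + 1:]
        model = train(training_set)
        scores.append(q3(predict(model, seq), observed))
    return mean(scores), stdev(scores)


def train_majority(training_set):
    """Stand-in 'method': remember the most common state in the training set."""
    counts = Counter(s for _, states in training_set for s in states)
    return counts.most_common(1)[0][0]


def predict_majority(model, seq):
    return model * len(seq)


# Toy database of (sequence, observed states) pairs (hypothetical).
database = [("MKVL", "HHHH"), ("LLKA", "HHHC"), ("AEEL", "HHHH"), ("GSGS", "CCCC")]
mean_q3, sd_q3 = jackknife(database, train_majority, predict_majority)
print(f"mean Q3 = {mean_q3:.4f}, standard deviation = {sd_q3:.4f}")
```

Splitting the database into larger groups, as suggested above, only changes the loop: each iteration holds out one group instead of one sequence.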

The Jackknife test

Protein secondary structure: Standards of Truth

What is a standard of truth?

- a structurally derived secondary structure assignment

Why do we need one?

- it dictates how accurate our prediction is

How do we get it?

- methods use hydrogen-bonding patterns along the main-chain to define the Secondary Structure Elements (SSEs).

1) DSSP (Kabsch and Sander, 1983) – most popular

2) STRIDE (Frishman and Argos, 1995)

3) DEFINE (Richards and Kundrot, 1988)

Annotation:

Helix: 3/10-helix (G), α-helix (H), π-helix (I)

Strand: β-strand (E), isolated β-bridge (B)

Turn: H-bonded turn (T), bend (S)

Rest: Coil (“ “)
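For three-state assessment the annotation classes above have to be reduced to H, E and C. The sketch below uses one common reduction, which is an assumption here: the helix class maps to H, the strand class to E, and turns, bends and blanks to C. Other reductions exist and give slightly different accuracy figures.

```python
# Assumed reduction: helix class -> H, strand class -> E, everything else -> C.
TO_THREE_STATE = {"G": "H", "H": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(assignment):
    """Map an eight-state assignment string to the three states H, E and C."""
    return "".join(TO_THREE_STATE.get(state, "C") for state in assignment)


print(reduce_to_three_states("TT EEEE S HHHHT"))
```

Because the choice of reduction changes the standard of truth, it should be reported alongside any Q3 value.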

Assessing prediction accuracy

How do we decide how good a prediction is?

1) Qn: the number of residues whose state is correctly predicted, for n SSE states, over the total number of predicted residues (Q3 for the three states H, E and C)

2) SOV (segment overlap): the number of correctly predicted n SSE states over the total number of predictions, with higher penalties for core segment regions (Zemla et al, 1999)

3) MCC (Matthews correlation coefficient): the number of correctly predicted n SSE states over the total number of predictions, taking into account how many prediction errors were made for each state
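Two of the three measures are easy to compute per residue. The sketch below implements Q3 and the per-state Matthews correlation coefficient from the standard confusion counts; SOV is omitted because it needs segment-level bookkeeping. The observed and predicted strings are made-up toy data.

```python
import math


def q3(predicted, observed):
    """Fraction of residues predicted in the correct state."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def mcc(predicted, observed, state):
    """Matthews correlation coefficient for one state against the other two."""
    pairs = list(zip(predicted, observed))
    tp = sum(p == state and o == state for p, o in pairs)
    tn = sum(p != state and o != state for p, o in pairs)
    fp = sum(p == state and o != state for p, o in pairs)
    fn = sum(p != state and o == state for p, o in pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(f"Q3 = {q3(predicted, observed):.2f}")
for state in "HEC":
    print(f"MCC({state}) = {mcc(predicted, observed, state):.2f}")
```

Q3 alone can hide a method that over-predicts the most common state. The per-state MCC penalises this because false positives and false negatives both lower the score.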

Which one would you use?

• Biological information impact
• What are you testing?
• What is your prediction used for?

Making sense of the scores:
• Compare to your selected Standard Of Truth
• Use all three to get a better picture

Automated Evaluation Initiatives

The EVA Server

CASP (also includes fold recognition assessments) and CAFASP: biennial experiments

With the number of methods freely available online, biologists are puzzled and have no easy way of knowing which one to use.

These initiatives allow continual evaluation on sequences that are added to the PDB and use DSSP as a standard of truth.

LETS GO TO WEB …

The consensus superiority

Deriving a consensus from multiple methods is generally more accurate than any one individual method used.

Early on, Jpred (Cuff and Barton, 1998) investigated weighted and un-weighted multiple-method majority voting and found an upper limit of a 4% increase in accuracy.

Nowadays, any three top scoring methods can be improved by 1.5-2% by simple majority voting consensus. It is the three clocks on a boat scenario. If one clock goes wrong, the likelihood that the other two will go wrong at the same time and in the same way is very low.
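Simple majority voting over three predictions can be sketched in a few lines. The three prediction strings are made-up toy data, and the tie rule (when all three methods disagree, the state from the first listed method wins) is an assumption of this sketch.

```python
from collections import Counter


def majority_consensus(predictions):
    """Per-residue majority vote over equal-length prediction strings.

    Ties are resolved in favour of the state seen first in the column,
    i.e. the prediction of the first listed method.
    """
    consensus = []
    for column in zip(*predictions):
        state, _ = Counter(column).most_common(1)[0]
        consensus.append(state)
    return "".join(consensus)


methods = ["HHHEEC",
           "HHCEEC",
           "HCCEEE"]
print(majority_consensus(methods))
```

Voting per residue can produce fragmented segments (for example an isolated H inside a strand), which is the motivation for a segment-aware consensus.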

We are currently completing a dynamic programming consensus algorithm that produces an optimally segmented consensus which is more biologically correct than simple majority voting and intend to set it as a standard on EVA for method consensus evaluation.

Predictions set

Max observations are kept as correct

HHHEEEEC

E

A stepwise hierarchy

1) Sequence database searching
• PSI-BLAST, SAM-T2K

2) Multiple sequence alignment of selected sequences
• PSSMs, HMM models, MSAs

3) Secondary structure prediction of query sequences based on the generated MSAs
• Single methods: PHD, PROFsec, PSIPred, SSPro, JNET, YASPIN
• Consensus methods

[Flowchart]

Step 1: Database sequence search. Single sequence; sequence database; PSI-BLAST, SAM-T2K; homologous sequences; PSSM, check file, HMM model

Step 2: MSA. Homologous sequences; MSA method; MSA

Step 3: SS Prediction. Trained machine-learning algorithm(s); secondary structure prediction

[Flowchart, iterative variant]

Steps 1 to 3 as above, with two feedback loops: iterative MSA/SS prediction mutual optimisation, giving an optimised MSA and SS prediction; and iterative homologue detection by optimised information
