introduction and basic concepts · introduction and basic concepts laboratory of bioinformatics i...
TRANSCRIPT
Emidio Capriotti http://biofold.org/
Department of Pharmacy and Biotechnology (FaBiT) University of Bologna
Introduction and Basic Concepts
Laboratory of Bioinformatics I Module 2
March 10, 2020
Schedule and MaterialsThis module is a 58-hour course running for 8 weeks
March 10 - April 30, 2020
Schedule changes from week to week:
Tuesday, Thursday and Friday 14:00 - 17:00
In April more changes
Project submission deadline May 18, 2020
Course website
http://biofold.github.io/pages/courses/2020/lb1-2.html
Main Aims
• Knowledge of tools for sequence and structure analysis and their development
• Protein functional annotation
• Theoretical background of machine learning approaches
• Problem solving skills and development of basic tools.
Topics• Protein Geometrical Features and Protein Structural Alignment
• Multiple Sequence Alignment
• Hidden Markov Models for Sequence Alignment
• Methods for Building Hidden Markov Models for Proteins
• Protein Structure and Mapping Problems
• Introduction to Statistical Methods and Machine Learning
• Development of Structure Prediction Methods
• Module Project: Model a Protein Domain HMM
Take Home Message
• Protein structure is more conserved than sequence. Proteins sharing high sequence identity usually share similar structures, as proven by pair-wise structural alignment procedures.
•When the identity level is high enough, it is possible to exploit the results of pair-wise sequence alignment for transferring structural information between proteins.
Structural Alignment Given two sets of points A = (a1, a2, …, an) and B = (b1,b2,…bm) in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation G between the two subsets A(P) and B(Q) that minimizes a given distance metric D over all possible rigid body transformation G, i.e.
The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar.
Therefore there are essentially two problems in structure alignment: • Find the correspondence set (which is NP-hard), and • Find the alignment transform (which is O(n)).
Bourne P. 2012
€
RMSD =
(ai −bi)2
i=1
n
∑n
[ ]{ }))(()(min QBGPADG
−
The Foundation of Structural Bioinformatics
Chotiha and Lesk, EMBO Journal 1986 Chothia C, Lesk AM. The relation between the divergence of sequence andstructure in proteins. EMBO J. 1986
Why Sequence Alignment?
The measure of sequence similarity allow to make estimation about the structural similarity
Comparison of two sequences for measuring their similarity
• To define a distance between two sequences
• Develop an algorithm for finding the alignment with minimal distance
• To statistically evaluate the significance of the alignment
Sequence Distance ScoreWhich events do we consider? Mutation
It is necessary to define a score for the substitution of residue i with residue j Substitution Matrices s(i,j)
Distance among sequences
Which events do we consider?
MutationIt is necessary to define a score for the substitution ofresidue i with residue j
Substitution matrices s(i,j)
A: ALASVLIRLITRLYPB: ASAVHLNRLITRLYP
¦ ),(),( ii BAsBAScore
BLOSUM62
A:B:
Other eventsDeletion and Insertion: some residues can be inserted or deleted during the evolution
The (negative) score of a gap depends only on the length
σ(n) = -nd linearσ(n) = -d - (n-1) e (d: opening, e: extension)
Distance among sequences
Which events do we consider?
MutationDeletion and InsertionSome residues can be inserted or deleted
A: ALASVLIRLIT--YPB: ASAVHL---ITRLYP
)2()3(),(),( VV �� ¦ ii BAsBAScoreThe (negative) score of a gap depends only on its lengthV(n) = -nd linearV(n) = -d - (n-1)e affine (d: opening, e: extension)
N.B. All the scores are independent of the position alongthe sequence
A:B:
Alignment Algorithms
Algorithms for finding the minimum distance between two sequences
• Global alignment: Needleman-Wunsch: Global alignment-compare pairs of sequences on their whole length
• Local alignment: Smith-Waterman: Local alignment-compare pairs of sequences searching the most similar subsequences
Alignment SignificanceGiven an alignment with score S, is it significant?
Significance can be evaluated by comparing with the score distribution of random alignments
Significant alignments
Significance of an alignment
Given an alignment with score S, is it significant?
Significance can be evaluated by comparing with the score distribution of random alignments
Score
Occ
urre
nce
Structural Homology
Sander and Schneider, Proteins 1991
Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins. 1991;9(1):56-68
Based on the database of homology-derived secondary structure of proteins (HSSP).Define the relation between sequence similarity, structure similarity, and alignment length.
Twilight ZoneB.Rost
Fig. 3. Pairwise sequence identity versus alignment length. The originalHSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appearedto fit the true positives (homologues, A) better than the false positives (B).In contrast, the new curve proposed here (filled diamonds, eqn 2) was moreconservative in excluding false positives. Note that due to the huge numberof pairs the plots for true (A) and false (B) positives appeared almostequally densely populated (Figure 2 revealed the problem of such a scatterplot).
any threshold extracted from the sequence alignment n, thefollowing equations hold (for cumulative numbers):false negatives 1 true positives 5 all pairs of similar structure
true negatives 1 false positives 5 all pairs ofdissimilar structure.
Distance to HSSP thresholdThe HSSP-curve was originally defined by (Sander andSchneider, 1991):
pI(n) 5 n 1 290.15 · L–0.562, for L , 80{ 25 , doe L ˘ 80 (1)where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 corresponds to theoriginal HSSP-curve; n 5 5 to the official HSSP databasereleases; curve plotted in Figure 3). Once Schneider and Sander
88
(1991) had discovered the basic functional dependence betweensequence identity and alignment length, they merely had tofix two free parameters: the factor and the exponent. Bothwere chosen to fit the data observed in 1991, in particular toreach values of 25% around alignment length of 80, andvalues of 100% around alignment length of 10. The principlefunctional dependence described by eqn 1 also followsfrom statistics, as was recently shown in an elegant work(Alexandrov and Soloveyev, 1998). Let pi (i 5 1,..., 20) be theprobability that amino acid i occurs in a protein, and mij thescore for randomly aligning two amino acids i and j. The scoreS of an entire alignment can then be approximated by:
S 5 ,m. · L
where,m. is the expectation value ofmij, and L the alignmentlength. If the values ofmij are independent, Gaussian distributedvariables, it follows (after some elementary operations) thatthe relation between the standard deviation of the values ofmij (σm ), and the resulting score distribution (σS) is:
σm 5 L–0.5 · σsIn their original article Alexandrov and Soloveyev workout an appropriate re-scaling of the dynamic programmingalignment. However, this scheme cannot be applied after thealignment has been completed (as the threshold functions usedin this work), rather it has to be implemented into thealignment method.New curve for length-dependent significance of pairwisesequence identityI attempted to solve the problems of the original HSSP-curve(eqn 1; Results) by defining the following curve for theseparation of true and false positives (Figure 3, grey line withdotted circles):
pI(n 5 n 1 480 · L–0.32 · (1 1 e –L/1000) (2)
where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 plotted in Figure 3).The constraints in visually selecting the final function were (i)to maintain the functional form defined by eqn 1 (and suggestedby the statistics of Alexandrov and Soloveyev, 1998); (ii) tohit the 100% mark at alignments that are too short to revealanything about structural similarity (5 11 residues); (iii) tosaturate at levels around 20% sequence identity (reached forlength5 300); and (iv) to roughly reflect the observed gradient.Saturation for long alignments was realized by the functionalform of the exponent (note: the term 1 e–L/a resulted in anexponential decay). This ‘saturation’ constraint also afflictedthe particular value of the factor (0.32 rather than about 0.5as suggested by the distribution of the data, Figure 4).New curve for length-dependent significance of pairwisesequence similarityThe original HSSP-curve was derived for sequence identity,not for sequence similarity (Sander and Schneider, 1991). Thefunctional dependence between similarity and length appearedcomparable to the one between identity and length (Results).This prompted a similar definition for the separation betweentrue and false positives based on similarity:
pS(n 5 n 1 420 · L–0.335 · (1 1 e –L/2000) (3)
B.Rost
Fig. 3. Pairwise sequence identity versus alignment length. The originalHSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appearedto fit the true positives (homologues, A) better than the false positives (B).In contrast, the new curve proposed here (filled diamonds, eqn 2) was moreconservative in excluding false positives. Note that due to the huge numberof pairs the plots for true (A) and false (B) positives appeared almostequally densely populated (Figure 2 revealed the problem of such a scatterplot).
any threshold extracted from the sequence alignment n, thefollowing equations hold (for cumulative numbers):
false negatives 1 true positives 5 all pairs of similar structure
true negatives 1 false positives 5 all pairs ofdissimilar structure.
Distance to HSSP thresholdThe HSSP-curve was originally defined by (Sander andSchneider, 1991):
pI(n) 5 n 1 290.15 · L–0.562, for L , 80{ 25 , doe L ˘ 80 (1)
where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 corresponds to theoriginal HSSP-curve; n 5 5 to the official HSSP databasereleases; curve plotted in Figure 3). Once Schneider and Sander
88
(1991) had discovered the basic functional dependence betweensequence identity and alignment length, they merely had tofix two free parameters: the factor and the exponent. Bothwere chosen to fit the data observed in 1991, in particular toreach values of 25% around alignment length of 80, andvalues of 100% around alignment length of 10. The principlefunctional dependence described by eqn 1 also followsfrom statistics, as was recently shown in an elegant work(Alexandrov and Soloveyev, 1998). Let pi (i 5 1,..., 20) be theprobability that amino acid i occurs in a protein, and mij thescore for randomly aligning two amino acids i and j. The scoreS of an entire alignment can then be approximated by:
S 5 ,m. · L
where,m. is the expectation value ofmij, and L the alignmentlength. If the values ofmij are independent, Gaussian distributedvariables, it follows (after some elementary operations) thatthe relation between the standard deviation of the values ofmij (σm ), and the resulting score distribution (σS) is:
σm 5 L–0.5 · σsIn their original article Alexandrov and Soloveyev workout an appropriate re-scaling of the dynamic programmingalignment. However, this scheme cannot be applied after thealignment has been completed (as the threshold functions usedin this work), rather it has to be implemented into thealignment method.New curve for length-dependent significance of pairwisesequence identityI attempted to solve the problems of the original HSSP-curve(eqn 1; Results) by defining the following curve for theseparation of true and false positives (Figure 3, grey line withdotted circles):
pI(n 5 n 1 480 · L–0.32 · (1 1 e –L/1000) (2)
where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 plotted in Figure 3).The constraints in visually selecting the final function were (i)to maintain the functional form defined by eqn 1 (and suggestedby the statistics of Alexandrov and Soloveyev, 1998); (ii) tohit the 100% mark at alignments that are too short to revealanything about structural similarity (5 11 residues); (iii) tosaturate at levels around 20% sequence identity (reached forlength5 300); and (iv) to roughly reflect the observed gradient.Saturation for long alignments was realized by the functionalform of the exponent (note: the term 1 e–L/a resulted in anexponential decay). This ‘saturation’ constraint also afflictedthe particular value of the factor (0.32 rather than about 0.5as suggested by the distribution of the data, Figure 4).New curve for length-dependent significance of pairwisesequence similarityThe original HSSP-curve was derived for sequence identity,not for sequence similarity (Sander and Schneider, 1991). Thefunctional dependence between similarity and length appearedcomparable to the one between identity and length (Results).This prompted a similar definition for the separation betweentrue and false positives based on similarity:
pS(n 5 n 1 420 · L–0.335 · (1 1 e –L/2000) (3)
Rost, Proteins 1999
In the region above 20% of sequence identity, 90% of alignments correspond to homologous protein; while below 25% only 10%.
For sequences longer than 100 residues
Midnight zone:
Similar structuresare abundant but
cannot be recognized
Twilight zone:
Many false positives(similar sequences
with different structures)
Safe zone:
No false positives(all the similar
sequences have similar structures)
20% 30%Identity (%)
Sequence identity and structural identity
20% 30% % Identity
Comparative ModelingFlow chart of Comparative Modeling
Template selection
Sequence alignment
Model building
Model evaluation
Target
Model
PDB
>targetIIGGVESRPHSRPYMAHLEITTE RGFTATCGGFLITRQFVMTAAHC SGREITVTLGAHDVSKTES....
Quality check
Yes
No
Sequence matching
Template Model
0 50 100 150 200 250Residue
-40
-30
-20
-10
0
10
E/kT
E/kT?
target IIGGVESRPHSRPYMAHLEI 3RP2A IIGGVESIPHSRPYMAHLDItarget TTERGFTATCGGFLITRQ..3RP2A VTEKGLRVICGGFLISRQ..
Use of Predicted StructuresDepending off the sequence similarity with the template the predicted structure can be used for different purposes
•Comparative Modeling
•Threading
•Ab initio or De novo predictions
Baker and Sali (2001). Science, 294: 93-96.
ly (19, 20). Other factors such as templateselection and alignment accuracy usuallyhave a larger impact on the model accuracy,especially for models based on less than 40%sequence identity to the templates.
There is a wide range of applications ofprotein structure models (Figs. 1 and 2). Forexample, high- and medium-accuracy com-parative models frequently are helpful in re-fining functional predictions that have beenbased on a sequence match alone becauseligand binding is more directly determined bythe structure of the binding site than by itssequence. It is often possible to correctlypredict features of the target protein that donot occur in the template structure. The sizeof a ligand may be predicted from the volumeof the binding site cleft (Fig. 2A). For exam-ple, the complex between docosahexaenoicfatty acid and brain lipid-binding protein wasmodeled on the basis of its 62% sequenceidentity to the crystallographic structure ofadipocyte lipid-binding protein (PDB code1ADL) (21). A number of fatty acids wereranked for their affinity to brain lipid-bindingprotein consistently with site-directed mu-
tagenesis and affinity chromatography exper-iments, even though the ligand specificityprofile of this protein is different from that ofthe template structure. Another example isprediction of a binding site for a chargedligand based on a cluster of charged residueson the protein, as was done for mouse mastcell protease 7 (Fig. 2B) (22). The predictionof a proteoglycan binding patch was con-firmed by site-directed mutagenesis and hep-arin-affinity chromatography experiments.Fortunately, errors in the functionally impor-tant regions in comparative models are manytimes relatively low because the functionalregions, such as active sites, tend to be moreconserved in evolution than the restof the fold. The utility of low-accuracy com-parative models can be illustrated by a mo-lecular model of the whole yeast ribosome,whose construction was facilitated by fit-ting comparative models of many ribosom-al proteins into the electron microscopymap of the ribosomal particle (23). Thisexample also suggests that structuralgenomics of single proteins or their do-mains, combined with protein structure pre-
diction, may contribute substantially to ef-ficient structural characterization of largemacromolecular assemblies.
The accuracy and reliability of modelsproduced by de novo methods is much lowerthan that of comparative models based onalignments with more than 30% sequenceidentity, but the basic topology of a protein ordomain can in some cases be predicted rea-sonably well (Fig. 1, D and E). For roughly40% of proteins shorter than 150 amino acidsthat have been examined, one of the five mostcommonly recurring models generated byRosetta has sufficient global similarity to thetrue structure to recognize it in a search of theprotein structure database. Reasonable mod-els can in some cases be produced for do-mains of even very large proteins by usingmultiple sequence alignments to identify do-main boundaries (Fig. 1D).
The accuracy of de novo models is toolow for problems requiring high-resolutionstructure information. Instead, the low-reso-lution models produced by these methods canreveal structural and functional relationshipsbetween proteins not apparent from their ami-no acid sequences and provide a frameworkfor analyzing spatial relationships betweenevolutionarily conserved residues or betweenresidues shown experimentally to be func-tionally important. These applications are il-lustrated by examples from the recent CASP4blind protein structure prediction experiment(24, 25). The predicted structure of a proteininvolved in cell lysis (26 ) was found to bestructurally related to a protein with a similarfunction but no significant sequence similar-ity (Fig. 2B). The predicted structure of adomain of the mismatch repair protein MutS(27) (Fig. 1D) has structural similarity toproteins with related functions (28). Func-tionally important residues of the signalingprotein Frizzled (29) were clustered in thepredicted structure in a surface patch likely tobe involved in a key protein-protein interac-tion (Fig. 2C). Thus, in favorable cases denovo predictions can provide some of themost important functional insights obtainablefrom experimentally determined structures.
Modeling on a Genomic ScaleThreading and comparative modeling methodshave already been applied on a genomic scale(18, 30, 31). In total, domains in 58% of all600,000 known protein sequences were mod-eled with ModPipe (18) and MODELLER (9)and deposited into a comprehensive database ofcomparative models, ModBase (32–34). TheWeb interface to the database allows flexiblequerying for fold assignments, sequence-struc-ture alignments, models, and model assessmentsof interest. An integrated sequence/structureviewer, ModView, allows inspection and anal-ysis of the query results. ModBase will be in-creasingly interlinked with other applications
Fig. 1. Accuracy andapplication of proteinstructure models.Shown are the differ-ent ranges of applica-bility of comparativeprotein structure mod-eling, threading, and denovo structure predic-tion; the correspondingaccuracy of proteinstructure models; andtheir sample applica-tions. (A through C).Sample comparativemodels based on about60% (A), 40% (B), and30% (C) sequenceidentity to their tem-plate structure. (D andE) Examples of Rosettade novo structure pre-dictions for the CASP4structure predictionexperiment. Predictedstructures are in red,and actual structuresare in blue. The accura-cy of the models de-crease significantly ingoing from (A) to (E),but the overall struc-ture is still roughly cor-rect. (D) A domainfrom the 811-residueMUtS protein whichwas recognized as anautonomous unit froman alignment of ho-mologous sequences;such parsing of largeproteins into domainscan make structure
prediction more tractable.
5 OCTOBER 2001 VOL 294 SCIENCE www.sciencemag.org94
G E N O M E : U N L O C K I N G B I O L O G Y ’ S S T O R E H O U S E
on
July
4, 2
007
ww
w.s
cien
cem
ag.o
rgD
ownl
oade
d fro
m
Remote homologsSequences longer than 100 residues and sharing more the 30% of residues have similar structures (for shorter sequences the level of identity must be higher).
This DO NOT exclude that sequences sharing lower identity have similar structures.
Pairs proteins with similar structure and low sequence identity are referred as “remote homologs”
Example:Sperm Whale Myoglobin (1JP6:A)Bacterial Haemoglobin (1VHB:A) RMSD = 0.18 nm, Identity: 12%
aligned by TM-align
Sequence Identity InferenceCan we use sequence similarity to predict other features of an unknown protein?
Solution: Define a the sequence similarity threshold that allow a reliable transfer of annotation features.
In other words we need to find the problem specific twilight region
Midnight zone: Twilight zone: Safe zone:
?% ?%Identity (%)
Sequence identity and function inference
Are the thresholds for functional inference equal to those valid for structural inference?
If not, how would you fix the thresholds?
?% % Identity?%
Subcellular LocalizationSequence identity for reliably transferring subcellular localization is higher than that required for transferring structure.
Rost, Nair: Protein Sci. 2002 Dec; 11(12): 2836–2847.
IT IS NECESSARY TO START FROM EXPERIMENTAL DATA:Example: Sequence identity and subcellular localization.
Sequence identity for reliably transferring subcellular localization is higher than that required for transferring structure
Rost and Nair, Protein Sci. 2002
A false positiveQ9SLK0 (ICDHX_ARATH):Peroxisomal isocitrate dehydrogenase
Q9SRZ6 (ICDHC_ARATH): Cytosolic isocitrate dehydrogenase
84.2% identity (93.3% similar) in 417 aa overlap
Lalign at ExPASy
Functional AnnotationSequence identity for can be used for functional annotation measuring the identity and similarity between Gene Ontology terms.
Sangar V et al. BMC Bioinformatics 2007
Vineet Sangar1 Daniel J Blankenberg, Naomi Altman and Arthur M Lesk: BMC Bioinformatics 2007, 8:294
IT IS NECESSARY TO START FROM EXPERIMENTAL DATA:Example: Sequence identity and similarity between GO annotations
Dissimilar functionsP04385 (GAL1_YEAST) Galactokinase
Catalytic activity ATP + alpha-D-galactose = ADP + alpha-D-galactose 1-phosphate.
P13045 (GAL3_YEAST) Protein GAL3 The GAL3 regulatory function is required for rapid induction of the galactose system.
72.9% identity (90.5% similar) in 528 aa overlap
Lalign at ExPASy
Case StudyElectron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.
Back to the sequence-to-structure relation: Cytochrome C
HIS 19
CYS 18
CYS 15
MET 81
Feature key Position(s) Length Description
Binding sitei 15 – 15 1 Heme (covalent)Binding sitei 18 – 18 1 Heme (covalent)Metal bindingi 19 – 19 1 Iron (heme axial ligand)Metal bindingi 81 – 81 1 Iron (heme axial ligand)
Electron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.
PDB: 3zcf:A
UniProt: P99999
Back to the sequence-to-structure relation: Cytochrome C
HIS 19
CYS 18
CYS 15
MET 81
Feature key Position(s) Length Description
Binding sitei 15 – 15 1 Heme (covalent)Binding sitei 18 – 18 1 Heme (covalent)Metal bindingi 19 – 19 1 Iron (heme axial ligand)Metal bindingi 81 – 81 1 Iron (heme axial ligand)
Electron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.
PDB: 3zcf:A
UniProt: P99999
Homo vs HorseHuman Cytochrome C – Uniprot:P99999. PDB: 3ZCF:A Equine Cytochrome C – Uniprot: P00004. PDB 3O20:ACytochrome C (Homo vs. Horse)
Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:AEquine Cytochrome C – Uniprot: P00004. PDB 3O20:A
Structural alignment: RMSD= 0,035 nm88% sequence identity
1:A 20:A 40:A 60:A | | . | . | . | . | . | . |GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLEN||||||||||:.||.||||||||||||||||||||||||||||||:.||.|||||||.|.|:||||||||GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITWKEETLMEYLEN| | . | . | . | . | . | . |1:A 20:A 40:A 60:A
80:A 100:A . | . | . |
PKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE||||||||||||.|||||.||.||||||||||||PKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE
. | . | . | 80:A 100:A
Structural alignment: RMSD= 0.035 nm
88% sequence identity
Sequence vs Structure In this case the sequence alignment is the same of the structural alignment and the positions of the binding sites are conserved.
Lalign at ExPASy
Cytochrome C (Homo vs. Horse)
Structural alignment: RMSD= 0,035 nm88% sequence identity
88.6% identity (95.2% similar) in 105 aa overlap (1-105:1-105)
10 20 30 40 50 60Homo MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
:::::::::::..::.::::::::::::::::::::::::::::::..:: ::::::: :Horse MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITW
10 20 30 40 50 60
70 80 90 100 Homo GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
:.::::::::::::::::::::.::::: :: ::::::::::::Horse KEETLMEYLENPKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE
70 80 90 100
GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLEN||||||||||:.||.||||||||||||||||||||||||||||||:.||.|||||||.|.|:||||||||GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITWKEETLMEYLEN
PKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE||||||||||||.|||||.||.||||||||||||PKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE
Sequence alignment: 88% sequence identity
IDENTICAL TO STRUCTURAL ALIGNMENT
IT CAN BE SAFELY ADOPTED FOR MODELLING PROCEDURES
portion where structural and sequence alignments coincide
Sequence alignment: 88% sequence identity IDENTICAL TO STRUCTURAL ALIGNMENT
Homo vs Rhodobacter Sph.Human Cytochrome C — Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodobacter Sph. – Uniprot: P0C0X8. PDB 1CXC:A
Structural alignment: RMSD= 0,18 nm
28% sequence identity
Cytochrome C (Homo vs. Rhodobacter sphaeroides)
Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodobacter Sph. – Uniprot: P0C0X8. PDB 1CXC:A
Structural alignment: RMSD= 0,18 nm28% sequence identity
1:A 20:A 40:A | | . | . | . | . | . GDVEKGKKIFIMKCSQCHTVEKGG-------KHKTGPNLHGLFGRKTGQAPGYS-YTAANKNKG---IIW||.|.|.|.|. .|..||.:.... ..||||||:|..||..|....:. |....|..| :.|GDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGEGMKEAGAKGLAW| | . | . | . | . | . | . | 3:A 20:A 40:A 60:A
60:A 80:A 100:A | . | . | . | . | GEDTLMEYLENPKKYI--------PGTKMIFVGIKKKEERADLIAYLKKATNE.|:...:|.:.|.|:: ...||.| .:||..:...:.|||......DEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEADAHNIWAYLQQVAVR
. | . | . | . | . | 80:A 100:A 120:A
Sequence vs Structure (I)In this case the sequence alignment can be used for homology modeling after a refinement of the alignment because one binding site is not conserved.
Lalign at ExPASy
Structural alignment: RMSD= 0,18 nm
28% sequence identityStructural alignment: RMSD= 0,18 nm28% sequence identity
GDVEKGKKIFIMKCSQCHTVEKGG-------KHKTGPNLHGLFGRKTGQAPGYS-YTAANKNKG---IIW||.|.|.|.|. .|..||.:.... ..||||||:|..||..|....:. |....|..| :.|GDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGEGMKEAGAKGLAW
GEDTLMEYLENPKKYI--------PGTKMIFVGIKKKEERADLIAYLKKATNE.|:...:|.:.|.|:: ...||.| .:||..:...:.|||......DEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEADAHNIWAYLQQVAVR
Sequence alignment: 29% sequence identity
SIMILAR (NOT IDENTICAL) TO STRUCTURAL ALIGMENT
IT CAN BE ADOPTED FOR MODELLING PROCEDURES (WITH SOME REFINEMENT)
Global without end-gap score: 111; 29.3% identity (56.1% similar) in 123 aa overlap10 20 30 40 50
sp|P99 MGDVEKGKKIFIMKCSQCHTVEK-------GGKHKTGPNLHGLFGRKTG-QAPGYSYTA:: : : : : .:. ::.. : . ::::::.:. :: .: :: .:
sp|P0C QEGDPEAGAKAF-NQCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGE10 20 30 40 50
60 70 80 90 100 sp|P99 ANKN---KGIIWGEDTLMEYLENPKKYIP-------GTKMIFVGIKKKEERADLIAYLKK
. :. ::. : :. ...:...: :.. . . .::. . .. :::..sp|P0C GMKEAGAKGLAWDEEHFVQYVQDPTKFLKEYTGDAKAKGKMTFKLKKEADAHNIWAYLQQ
60 70 80 90 100 110
sp|P99 ATNE .. .
sp|P0C VAVRP120
Cytochrome C (Homo vs. Rhodobacter sphaeroides)
portion where structural and sequence alignments coincide
Homo vs Rhodobacter Pal.Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodopseudomons pal. – Uniprot: P00091. PDB 1I8O:A
Structural alignment: RMSD= 0,13 nm
29% sequence identity
Cytochrome C (Homo vs. Rhodopseudomonas palustris)
Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodopseudomons pal. – Uniprot: P00091. PDB 1I8O:A
Structural alignment: RMSD= 0,13 nm29% sequence identity
1:A 20:A 40:A 60:A | | . | . | . | . | . | . GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEY.|...|..:|. .|..||.. .|...||.|.|..|||.|.|.|:.|...|.|.| ::|..|.:..|xDAKAGEAVFK-QCMTCHRA---DKNMVGPALAGVVGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIVPY| | . | . | . | . | . | . 1:A 20:A 40:A 60:A
80:A 100:A | . | . | . |
LENPKKYIP--------------GTKMIFVGIKKKEERADLIAYLKKAT|..|..::. .|||.| .:...::|.|.:|||....LADPNAFLKKFLTEKGKADQAVGVTKMTF-KLANEQQRKDVVAYLATLK
| . | . | . | . | 80:A 100:A
Sequence vs Structure (II)In this case the sequence alignment needs to be fixed homology to because all the binding site shifted.
Lalign at ExPASy
Cytochrome C (Homo vs. Rhodopseudomonas palustris)
Structural alignment: RMSD= 0,13 nm29% sequence identity
GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEY
.|...|..:|. .|..||.. .|...||.|.|..|||.|.|.|:.|...|.|.| ::|..|.:..|
xDAKAGEAVFK-QCMTCHRA---DKNMVGPALAGVVGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIVPY
LENPKKYIP--------------GTKMIFVGIKKKEERADLIAYLKKAT
|..|..::. .|||.| .:...::|.|.:|||....
LADPNAFLKKFLTEKGKADQAVGVTKMTF-KLANEQQRKDVVAYLATLK
Sequence alignment: 29% sequence identity
IS IT SIMILAR TO STRUCTURAL ALIGMENT?
COULD IT BE SAFELY ADOPTED FOR MODELLING PROCEDURES?
Global without end-gap score: 152; 28.7% identity (63.0% similar) in 108 aa overlap
10 20 30
sp|P99 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGL
:.. :. .: .:: : ... :. .:: : :.
sp|P00 MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVF----KQCMTCHRADKNMVGPALGGV
10 20 30 40 50
40 50 60 70 80 90
sp|P99 FGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERA
:::.: : :..:. :.:.: ..: :....::..:. .. : ... : .. .
sp|P00 VGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIINYLNDPNAFL---KKFLTDKGKADQAV
60 70 80 90 100 110
100
sp|P99 DLIAYLKKATNE
. . : .::
sp|P00 GVTKMTFKLANEQQRKDVVAYLATLK
120 130 portion where structural and sequence alignments coincide
Structural alignment: RMSD= 0,13 nm
29% sequence identity
Homo vs ArabidopsisHuman Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C6A Arabidopsis Thaliana – Uniprot: Q93VA3. PDB 2CE0:A
Structural alignment: RMSD= 0,35 nm
13% sequence identity
Cytochrome C (Homo vs. Arabidopsis thaliana)
Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C6A Arabidopsis Thaliana – Uniprot: Q93VA3. PDB 2CE0:A
Structural alignment: RMSD= 0,35 nm13% sequence identity
1:A 20:A 40:A 60:A | | . | . | . | . | . | . GDVEKGKKIFIMKCSQCHTVEKGGKHKTGP---NLHG-LFGRKTGQAPGYSYTAANKNKGIIWGEDTLME.|:::|..:|...|..||... |... . .|.. ...|. .:..|:.:..LDIQRGATLFNRACAACHDTG--GNII--QPGATLFTKDLERN-----------------GVDTEEEIYR| | . | . | . | . | 3:A 20:A 40:A
80:A 100:A | . | . | . |
YLE---------NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE... ..|....|.......:... |...|..:.|....:VTYFGKGRMPGFGEKCTPRGQCTFGPRLQDE-EIKLLAEFVKFQADQ
. | . | . | . | . 60:A 80:A
Sequence vs Structure (III)In this case the sequence alignment is significantly different form the structural alignment.
Lalign at ExPASy
Cytochrome C (Homo vs. Arabidopsis thaliana)
Structural alignment: RMSD= 0,35 nm13% sequence identity
GDVEKGKKIFIMKCSQCHTVEKGGKHKTGP---NLHG-LFGRKTGQAPGYSYTAANKNKGIIWGEDTLME.|:::|..:|...|..||... |... . .|.. ...|. .:..|:.:..LDIQRGATLFNRACAACHDTG--GNII--QPGATLFTKDLERN-----------------GVDTEEEIYR
YLE---------NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE... ..|....|.......:... |...|..:.|....:VTYFGKGRMPGFGEKCTPRGQCTFGPRLQDE-EIKLLAEFVKFQADQ
Sequence alignment: 20% sequence identity
VERY DIFFERENT FROM STRUCTURAL ALIGMENT
IT CANNOT BE SAFELY ADOPTED FOR MODELLING PROCEDURES
Global without end-gap score: 3; 20.0% identity (43.8% similar) in 105 aa overlap10 20 30
Homo MGDVEKGKKIFIMKCSQCHTVEKGGKHKTG:...: .: : :: . :. . :
A.Thal DFLLKKLAPPLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDT-GGNIIQPG50 60 70 80 90 100
40 50 60 70 80 90sp|P99 PNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKE
.: ...: : . . .:. . . : : : . : : . ..sp|Q93 ATLFTKDLERNGVD-----TEEEIYRVTYFGKGRMPGFGE---KCTPRGQCTF-GPRLQD
110 120 130 140 150
100 sp|P99 ERADLIAYLKKATNE
:. :.: . : . sp|Q93 EEIKLLAEFVKFQADQGWPTVSTD
160 170 portion where structural and sequence alignments coincide
Structural alignment: RMSD= 0,35 nm
13% sequence identity
Search for Better AlignmentWhy is it not sufficient to align sequences (when identity is low) to recover information, not even for “important” residues?
Sequence alignments are «general» and treat each position in the same way There is no knowledge on the «important» sites
How can we detect the “important” residues starting from protein structures (even when information on catalytic sites is not available)?
Compare multiple structures and analyze the conservation of residues
How can we align sequences constraining the alignment of important residues?
Compare multiple sequences and check for the conservation of patterns Use alignment frameworks able to introduce positional dependences.