Post on 21-Dec-2015
Identifying functional residues of proteins from sequence info
Using MSA (multiple sequence alignment)
- search for remote homologs using HMMs or profiles
Remote homologs with no known structure
- given a large, diverse superfamily
- a protein may evolve a different function or subtype
- different substrate specificity or activity
- proteins with similar fold but different function
Past methods used phylogenetic trees
- map unknown protein to one of the branches of the tree produced
- but the protein may have diverged too long ago to be clearly identified
- co-evolution of multiple features
- possible convergent evolution of molecular function at the aa level
Other methodologies:
Analysis/prediction of subtype from sequence alignments
- characterization of aa residues, looking for significant substitutions
- gathering sequences into subgroups, comparing each subgroup
Principal component analysis (Casari et al., 1995)
- looks for functional residues conserved in protein families
Evolutionary Trace (Lichtarge et al)
Phylogenetic Inference (Sjolander et al)
Goal: identify regions conferring subfamily specificity
- secondary goal: predict subtypes of orphan sequences
Input to algorithm:
- multiple sequence alignment (MSA) of sequences in a protein family
- classification of the sequences in that MSA into subfamilies
For the given subtypes (or subfamilies) provided:
- get the MSA subalignment for each subfamily
- build an HMM profile for each subfamily MSA
- rationale: generate pseudocounts and account for statistical bias
For each subalignment profile
- the profile value for amino acid x at position i in subfamily j is the probability of finding x at position i in that subfamily
- at a given position, the profile values over all amino acids sum to 1
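The profile described above can be sketched in a few lines. This is only an illustration: it uses a flat add-one pseudocount rather than an HMM, and the names `column_profile`, `subfamily_profile` and `pseudocount` are hypothetical, not taken from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Return {aa: probability} for one alignment column of one subfamily.

    Assumption: a flat pseudocount is added to every amino acid; the slides
    use an HMM profile to generate pseudocounts instead.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps are skipped
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def subfamily_profile(sequences, pseudocount=1.0):
    """Return one {aa: probability} dict per column of a subalignment."""
    length = len(sequences[0])
    return [column_profile([s[i] for s in sequences], pseudocount)
            for i in range(length)]
```

Because of the pseudocount, no amino acid ever gets probability zero, which keeps later log-ratios finite.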
Relative Entropy
- measure of “distance” between two probability distributions
- relative entropy produces a value >= 0 (value of 0 for two identical distributions)
- computed for each position i in a subfamily s
- for each position, an RE value for a subfamily s vs s-bar (all other subfamilies)
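A minimal sketch of relative entropy between two amino acid distributions at one position, assuming the usual Kullback-Leibler form with natural logarithms. The function name and the dict-based representation are illustrative choices, not something the slides specify.

```python
import math

def relative_entropy(p, q):
    """Relative entropy sum_x p[x] * log(p[x] / q[x]), in nats.

    p: distribution in subfamily s, q: distribution in s-bar.
    Assumes q[x] > 0 wherever p[x] > 0 (pseudocounts guarantee this).
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy two-letter example: identical distributions give 0.
same = relative_entropy({"A": 0.5, "C": 0.5}, {"A": 0.5, "C": 0.5})
diff = relative_entropy({"A": 0.5, "C": 0.5}, {"A": 0.9, "C": 0.1})
```

Note that the measure is not symmetric: swapping `p` and `q` generally gives a different value, which is why “distance” is in quotes.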
Cumulative Relative Entropy
- given a set of relative entropies for each subfamily for each position
- combine them to produce a CRE for a given position i in the MSA across all subfamilies
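The formulas from the original slides are not preserved in this transcript. Assuming the standard definition of relative entropy and a plain sum over subfamilies (both are assumptions), the two quantities could be written as follows, with P the profile probability of amino acid x at position i:

```latex
\mathrm{RE}_i^{s} = \sum_{x} P_{i,x}^{s} \,\log \frac{P_{i,x}^{s}}{P_{i,x}^{\bar{s}}},
\qquad
\mathrm{CRE}_i = \sum_{s} \mathrm{RE}_i^{s}
```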
Given this set of cumulative relative entropy measures (one for each position in the MSA), take the Z-score
- standard statistical measure: the number of standard deviations above/below the mean
- tells you which residue positions vary strongly in aa distribution between subfamilies
- empirically, Z > 3 correlates with functional residues
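The Z-score step can be sketched as below. The use of the population standard deviation and the helper names are assumptions; only the Z > 3 cutoff comes from the slides.

```python
import statistics

def z_scores(cre):
    """Convert per-position CRE values to Z-scores."""
    mean = statistics.mean(cre)
    sd = statistics.pstdev(cre)  # assumption: population standard deviation
    if sd == 0:
        return [0.0] * len(cre)  # all positions identical, nothing stands out
    return [(value - mean) / sd for value in cre]

def high_scoring_positions(cre, threshold=3.0):
    """Return indices of positions whose Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(cre)) if z > threshold]

# Toy data: 19 unremarkable positions and one that differs strongly.
toy_cre = [0.1] * 19 + [5.0]
hits = high_scoring_positions(toy_cre)
```

With very short alignments no position can reach Z > 3, since the largest possible Z-score grows with the number of positions.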
For position i, which amino acid is dominant in a given subfamily?
- find probability of observing aa x at position i in subfamily s vs not-s
- take the aa with probability >= 0.5
- we now have a small set of aa residues which differ strongly between subfamilies of a protein family
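Picking the dominant amino acid per subfamily might look like this. The 0.5 cutoff is from the slides; the function names and the dict layout (`{subfamily: {aa: probability}}` for one position) are hypothetical.

```python
def dominant_residue(profile_column, cutoff=0.5):
    """Return the amino acid with probability >= cutoff, or None if there is none."""
    aa, prob = max(profile_column.items(), key=lambda item: item[1])
    return aa if prob >= cutoff else None

def dominant_by_subfamily(columns, cutoff=0.5):
    """columns: {subfamily name: {aa: probability}} for a single position."""
    return {name: dominant_residue(col, cutoff) for name, col in columns.items()}

# Toy position where two subfamilies prefer different residues.
toy_position = {
    "s1": {"D": 0.8, "E": 0.2},
    "s2": {"K": 0.6, "R": 0.4},
    "s3": {"A": 0.4, "G": 0.3, "S": 0.3},
}
dominant = dominant_by_subfamily(toy_position)
```

A subfamily with no residue at or above the cutoff is reported as `None` rather than forced into a choice.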
What exactly constitutes a family or subfamily?
- not always clear
- automated tree generation could not separate data into clear subfamilies
- use of PFAM alignments and SWISSPROT data
Subfamilies are not clearly defined in databases
- divided proteins from PFAM database into subfamilies based on SWISSPROT data
- keyword search limited to enzymatic activity string in SWISSPROT
- put into groups, then checked for obvious mistakes
- also eliminated divisions “easily discernable by sequence comparison”
- 62 groupings from 42 alignments remained
- randomly picked one grouping per alignment, giving 42 groupings over 42 alignments
Subfamily data
Four very large families to test their results on
- nucleotidyl cyclases
- eukaryotic protein kinases
- lactate/malate dehydrogenases
- trypsin-like serine proteases
Nucleotidyl cyclases
- membrane-attached or cytosolic, cyclize (GTP -> cGMP) or (ATP -> cAMP)
- found residues 1018, 938, which correlate with previous results
- also identified residues which have not been tested experimentally
Protein kinases
- phosphorylate serine/threonine or tyrosine residues
- compare to experimental result
- some ser/thr vs tyr kinase differences not detected
- inconsistency (no conservation) within the subfamily
- residues which were common to both ser/thr and tyr kinases
Subfamilies
Lactate/Malate Dehydrogenases
- common to a very wide variety of organisms
- highly divergent
- results mostly as expected
- but a few residues identified outside of active site
Serine Proteases
- cut protein backbone
- differing specificity as to where (what aa precedes cut)
- specificity pocket determines where protease can bind
- identified 2 out of 3 of experimentally-determined pocket residues
- (third had a low Z-score because of tolerance in one protein family)
- also identified a few residues outside of the active site
Subfamilies (cont)
Sequence Similarity
- straight % similarity with other sequences (ignoring gaps)
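A minimal sketch of this baseline, reading "% similarity" as percent identity over aligned columns where neither sequence has a gap (that reading is an assumption). The function names and the `{subfamily: [aligned sequences]}` input layout are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def nearest_subfamily(query, subfamilies):
    """Assign the query to the subfamily containing its most similar sequence."""
    return max(
        subfamilies,
        key=lambda name: max(percent_identity(query, s) for s in subfamilies[name]),
    )

toy_subfamilies = {"s1": ["ACDF", "ACEF"], "s2": ["WWWW"]}
assigned = nearest_subfamily("ACDE", toy_subfamilies)
```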
BLAST
- database search, assign to nearest subfamily with best alignment
HMM method
- align the query sequence to the HMMs of all subfamilies and assign it to the subfamily with the best alignment
- will attempt to do iterative optimization of match…
Profile method
- take original HMM, and probability profile
Sub-profile method
- only use residues in above formula that have a positive Z-score
- to reduce noise, restrict to values that have above average positive relative entropy
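The scoring formula referred to above is not preserved in this transcript. As a stand-in, the sketch below scores an aligned sequence against each subfamily profile by summed log-probability, which is an assumption and not necessarily the original formula. The sub-profile variant restricts the sum to positions with a positive Z-score. All names are hypothetical.

```python
import math

def profile_score(sequence, profile, positions=None):
    """Sum of log-probabilities of the sequence's residues under a profile.

    profile: list of {aa: probability}, one dict per alignment column.
    positions: optional subset of column indices to score.
    """
    if positions is None:
        positions = range(len(sequence))
    score = 0.0
    for i in positions:
        prob = profile[i].get(sequence[i])
        if prob:  # gaps and unknown symbols contribute nothing
            score += math.log(prob)
    return score

def assign_subfamily(sequence, profiles, z=None):
    """Pick the best-scoring subfamily; if z is given, use only positions with Z > 0."""
    positions = None
    if z is not None:
        positions = [i for i, zi in enumerate(z) if zi > 0]
    return max(profiles, key=lambda s: profile_score(sequence, profiles[s], positions))

toy_profiles = {
    "s1": [{"A": 0.9, "C": 0.1}, {"A": 0.5, "C": 0.5}],
    "s2": [{"A": 0.1, "C": 0.9}, {"A": 0.5, "C": 0.5}],
}
full = assign_subfamily("AC", toy_profiles)
sub = assign_subfamily("CA", toy_profiles, z=[1.2, -0.3])
```

In the toy data the second column is identical in both subfamilies, so dropping it (the sub-profile case) changes nothing; in real alignments such uninformative columns are the noise the restriction is meant to remove.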
Prediction of Protein Subfamily
Input: a multiple-sequence alignment
- each sequence is converted to a vector of size (20 * l) where l is length of the alignment
Generation of an N x (20*l) matrix
- one sequence produces a vector of dimension 20*l
- N sequences produce N vectors of dimension 20*l
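The encoding above can be sketched as follows, assuming a binary (one-hot) scheme in which each alignment position gets 20 slots and the slot for the observed amino acid is set to 1. How gaps are handled is not stated in the slides; here they are left as all zeros. Names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq):
    """Encode one aligned sequence of length l as a vector of length 20 * l."""
    vec = [0.0] * (20 * len(seq))
    for pos, aa in enumerate(seq):
        k = AA_INDEX.get(aa)
        if k is not None:  # assumption: gaps leave all 20 slots at zero
            vec[20 * pos + k] = 1.0
    return vec

def encode_alignment(sequences):
    """Encode N aligned sequences as an N x (20 * l) matrix (list of rows)."""
    return [encode_sequence(s) for s in sequences]

toy_matrix = encode_alignment(["AC-", "CC-"])
```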
Use Principal Component Analysis
- get the covariance matrix
- tells you how factors are correlated to one another
- eliminate covariance by finding eigenvectors/eigenvalues of covariance matrix
- largest eigenvalues and corresponding eigenvectors give you principal components
- i.e. the largest factors determining distribution of your dataset
- they take the three largest (the largest of which represents consensus sequence)
- project their 20*l dimensional data onto those 3 dimensions
- this can be used to predict a protein subfamily for a given protein
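The PCA steps listed above can be sketched with NumPy. This is textbook PCA with mean-centering, which is an assumption: the remark that the largest component represents the consensus sequence suggests the original analysis may not have centered the data, and that variant is not reproduced here. The function name is hypothetical.

```python
import numpy as np

def pca_project(matrix, n_components=3):
    """Project the rows of an N x D matrix onto its top principal components.

    Returns (projected data of shape N x n_components, eigenvalues in
    descending order).
    """
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)            # assumption: mean-centered PCA
    cov = np.cov(centered, rowvar=False)     # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]           # columns are principal components
    return centered @ components, eigvals[order]

# Toy data: two pairs of identical rows, standing in for two subfamilies.
toy = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
projected, top_eigvals = pca_project(toy)
```

Sequences from the same subfamily should land close together in the projected space, so an unknown sequence can be assigned to the cluster it falls nearest to.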
Casari, et al. (1995) A method to predict functional residues in proteins
Construction of a “comparison matrix”
- take the product matrix × (matrix transpose)
- solve for eigenvectors and eigenvalues as before
Columns of f represent amino acid values and positions
- becomes possible to examine individual amino acid residues and positions
- plotted on graph, shows residue correlation to type of protein subfamily
- does this actually work?
General Weirdness