identifying functional residues of proteins from sequence info using msa (multiple sequence...

Identifying functional residues of proteins from sequence info

Using MSA (multiple sequence alignment)
- search for remote homologs using HMMs or profiles

Remote homologs with no known structure
- given a large, diverse superfamily, a protein may evolve a different function or subtype
- different substrate specificity or activity
- proteins with similar fold but different function

Past methods used phylogenetic trees
- map unknown protein to one of the branches of the tree produced
- but the protein may have diverged too long ago to be clearly identified
- co-evolution of multiple features
- possible convergent evolution of molecular function at the amino acid (aa) level

Other methodologies:

Analysis/prediction of subtype from sequence alignments
- characterization of aa residues, looking for significant substitutions
- gathering sequences into subgroups, comparing each subgroup

Principal component analysis (Casari et al, 1995)
- looks for functional residues conserved in protein families

Evolutionary Trace (Lichtarge et al)

Phylogenetic Inference (Sjolander et al)

Goal: identify regions conferring sub-family specificity
- secondary goal: predict subtypes of orphan sequences

Input to algorithm:
- multiple sequence alignment (MSA) of sequences in a protein family
- classification of the sequences in that MSA into subfamilies

For the given subtypes (or subfamilies) provided:
- get the MSA subalignment for each subfamily
- build an HMM profile for each subfamily MSA
- rationale: generate pseudocounts and account for statistical bias

For each subalignment profile
- the profile value for amino acid x at position i in subfamily j is the probability of finding x at position i in that subfamily
- at any given position, the profile values sum to 1 over all amino acids
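A minimal sketch of such a per-position profile. The slides build the profile from an HMM; as a stand-in, this example uses plain column counts with a uniform additive pseudocount. The function name `column_profile`, the `pseudocount` parameter, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(subalignment, pseudocount=1.0):
    """Return one {amino acid: probability} dict per alignment column.

    Assumption: uniform additive pseudocounts replace the HMM-derived
    pseudocounts mentioned in the slides.
    """
    length = len(subalignment[0])
    profile = []
    for i in range(length):
        # Count residues in column i, skipping gaps and unknown characters.
        counts = Counter(seq[i] for seq in subalignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

# Hypothetical subfamily subalignment: three sequences, four columns.
subfamily = ["ACDK", "ACEK", "AC-K"]
profile = column_profile(subfamily)
```

Because of the pseudocount, every amino acid keeps a non-zero probability in every column, which keeps later log-ratios finite.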

Relative Entropy
- measure of “distance” between two probability distributions
- always >= 0 (exactly 0 for two identical distributions)
- computed for each position i in a subfamily s
- for each position, gives an RE value for subfamily s vs s-bar (all other subfamilies)
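The slide does not reproduce the formula, so the sketch below assumes the standard definition of relative entropy (Kullback-Leibler divergence), sum over x of p(x) * log(p(x) / q(x)), with natural logarithms. The two toy distributions are invented, and only three amino acids are listed to keep the example short.

```python
import math

def relative_entropy(p, q):
    """Relative entropy of distribution p with respect to q.

    Both arguments map amino acid -> probability. q must be non-zero
    wherever p is non-zero (pseudocounts normally guarantee this).
    """
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical distributions at one position: subfamily s vs all others.
p_s = {"A": 0.7, "G": 0.2, "S": 0.1}
p_not_s = {"A": 0.1, "G": 0.2, "S": 0.7}

re_same = relative_entropy(p_s, p_s)      # identical distributions
re_diff = relative_entropy(p_s, p_not_s)  # distributions that disagree
```

Note that the measure is not symmetric in general, which is why “distance” is in quotes.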

Cumulative Relative Entropy
- given a set of relative entropies, one for each subfamily at each position
- combine them to produce a CRE for a given position i in the MSA across all subfamilies
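A sketch of one way to compute this, assuming the cumulative value is the plain sum of the per-subfamily relative entropies (the slide does not show the formula). Count-based distributions with a uniform pseudocount stand in for the HMM profiles, and the subfamily names and sequences are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(sequences, i, pseudocount=1.0):
    """Amino acid distribution in column i, with additive pseudocounts."""
    counts = Counter(s[i] for s in sequences if s[i] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def relative_entropy(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

def cumulative_relative_entropy(subfamilies):
    """subfamilies: {name: [aligned sequences]}. Returns one CRE per column."""
    length = len(next(iter(subfamilies.values()))[0])
    cre = []
    for i in range(length):
        total = 0.0
        for name, members in subfamilies.items():
            # s-bar: every sequence that is not in subfamily `name`.
            others = [s for other, seqs in subfamilies.items()
                      if other != name for s in seqs]
            total += relative_entropy(column_distribution(members, i),
                                      column_distribution(others, i))
        cre.append(total)
    return cre

# Toy data: column 1 separates the subfamilies (K vs R), columns 0 and 2 do not.
toy = {
    "sub1": ["GKA", "GKA", "GKS"],
    "sub2": ["GRA", "GRS", "GRA"],
}
cre = cumulative_relative_entropy(toy)
```

In the toy data only the middle column gets a non-zero value, since it is the only column whose composition differs between the two groups.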

Given this set of cumulative relative entropy measures (one for each position in the MSA), take the Z-score of each
- standard statistical measure: the number of standard deviations above/below the mean
- tells you which residue positions vary strongly in aa distribution between subfamilies
- empirically, Z > 3 correlates with functional residues
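As a small illustration of the Z-score step, the sketch below standardises a list of per-position values and flags those above 3. The CRE values are invented, and using the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical CRE values for a 12-column alignment; the last column stands out.
cre = [0.1, 0.2, 0.15, 0.1, 0.2, 0.1, 0.15, 0.2, 0.1, 0.15, 0.2, 5.0]
z = z_scores(cre)

# Positions passing the empirical cut-off quoted in the slides.
flagged = [i for i, score in enumerate(z) if score > 3]
```

With very short alignments a single position can never reach Z > 3, so the cut-off is only meaningful for alignments with a reasonable number of columns.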

For position i, which amino acid is dominant in a given subfamily?
- find probability of observing aa x at position i in subfamily s vs not-s
- take the aa with probability >= 0.5
- we now have a small set of aa residues which differ strongly between subfamilies of a protein family
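A minimal sketch of picking the dominant residue, using raw column frequencies in place of profile probabilities. The helper name `dominant_residue`, the gap handling, and the example sequences are assumptions.

```python
from collections import Counter

def dominant_residue(sequences, i, threshold=0.5):
    """Return the amino acid whose frequency in column i is >= threshold, else None."""
    residues = [s[i] for s in sequences if s[i] != "-"]  # assumption: gaps are ignored
    if not residues:
        return None
    aa, count = Counter(residues).most_common(1)[0]
    return aa if count / len(residues) >= threshold else None

# Hypothetical subfamilies, examined at column 1.
conserved = ["GKA", "GKA", "GRS"]   # K in 2 of 3 sequences
variable = ["GKA", "GRA", "GQS"]    # no residue reaches 0.5

dominant_conserved = dominant_residue(conserved, 1)
dominant_variable = dominant_residue(variable, 1)
```

Running this over the high-Z positions for s and for not-s would give the residue pairs that distinguish the subfamilies.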

What exactly constitutes a family or subfamily?
- not always clear
- automated tree generation could not separate data into clear subfamilies
- use of PFAM alignments and SWISSPROT data

Subfamilies are not clearly defined in databases
- divided proteins from the PFAM database into subfamilies based on SWISSPROT data
- keyword search limited to the enzymatic activity string in SWISSPROT
- put into groups, then checked for obvious mistakes
- also eliminated divisions “easily discernable by sequence comparison”
- 62 groupings from 42 alignments remained
- randomly picked one grouping per alignment to produce 42 groupings over 42 alignments

Subfamily data

Four very large families to test their results on
- nucleotidyl cyclases
- eukaryotic protein kinases
- lactate/malate dehydrogenases
- trypsin-like serine proteases

Nucleotidyl cyclases
- membrane-attached or cytosolic, cyclize (GTP -> cGMP) or (ATP -> cAMP)
- found residues 1018 and 938, which correlate with previous results
- also identified residues which have not been tested experimentally

Protein kinases
- phosphorylate serine/threonine or tyrosine residues
- compare to experimental result
- some ser/thr vs tyr kinase differences not detected
- inconsistency (no conservation) within the subfamily
- residues which were common to both ser/thr and tyr kinases

Subfamilies

Lactate/Malate Dehydrogenases
- common to a very wide variety of organisms
- highly divergent
- results mostly as expected
- but a few residues identified outside of the active site

Serine Proteases
- cut protein backbone
- differing specificity as to where (what aa precedes the cut)
- specificity pocket determines where the protease can bind
- identified 2 out of 3 of the experimentally determined pocket residues
- (the third had a low Z-score because of tolerance in one protein family)
- also identified a few residues outside of the active site

Subfamilies (cont)

Sequence Similarity
- straight % similarity with other sequences (ignoring gaps)
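A sketch of this baseline, assuming that "similarity" means percent identity over aligned columns where neither sequence has a gap, and that the query is assigned to the subfamily holding its single most similar member. Both function names and the toy subfamilies are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def nearest_subfamily(query, subfamilies):
    """Assign the query to the subfamily containing its most similar member."""
    return max(
        subfamilies,
        key=lambda name: max(percent_identity(query, s) for s in subfamilies[name]),
    )

# Hypothetical aligned sequences.
identity = percent_identity("AC-DE", "ACGDF")
toy = {"sub1": ["GKAD", "GKSD"], "sub2": ["GRAE", "GRSE"]}
assigned = nearest_subfamily("GRA-", toy)
```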

BLAST
- database search, assign to nearest subfamily with best alignment

HMM method
- align sequence of unknown sub-type to the HMMs of all subfamilies and assign it to the best alignment
- will attempt to do iterative optimization of match…

Profile method
- take original HMM, and probability profile

Sub-profile method
- only use residues in the profile score that have a positive Z-score
- to reduce noise, restrict to values that have above-average positive relative entropy
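The scoring formula itself is not reproduced in these notes. The sketch below assumes a simple log-likelihood score (the sum of log profile probabilities of the query's residues) and restricts it to positive-Z positions for the sub-profile variant. The function names, profiles, and Z values are illustrative assumptions, not the authors' implementation.

```python
import math

def profile_score(sequence, profile, positions=None):
    """Assumed score: sum of log P(residue) under one subfamily profile."""
    if positions is None:
        positions = range(len(sequence))
    # Residues missing from the profile (e.g. gaps) are skipped.
    return sum(math.log(profile[i][sequence[i]])
               for i in positions if sequence[i] in profile[i])

def assign_subfamily(sequence, profiles, z=None):
    """Pick the best-scoring subfamily; with z given, use only positive-Z positions."""
    positions = None
    if z is not None:
        positions = [i for i, score in enumerate(z) if score > 0]
    return max(profiles, key=lambda name: profile_score(sequence, profiles[name], positions))

# Hypothetical two-column profiles; only column 1 distinguishes the subfamilies.
profiles = {
    "sub1": [{"G": 0.9, "A": 0.1}, {"K": 0.8, "R": 0.2}],
    "sub2": [{"G": 0.9, "A": 0.1}, {"K": 0.2, "R": 0.8}],
}
full = assign_subfamily("GR", profiles)
sub_profile = assign_subfamily("GR", profiles, z=[-0.5, 1.2])
```

Dropping the non-discriminating columns does not change the answer here, but on real alignments it removes positions that only contribute noise to the comparison.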

Prediction of Protein Subfamily

Input: a multiple-sequence alignment
- each sequence is converted to a vector of size (20 * l) where l is the length of the alignment

Generation of an N x (20*l) matrix
- one sequence produces a vector of dimension 20*l
- N sequences produce N vectors of dimension 20*l
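A sketch of the encoding, assuming a binary indicator scheme: each alignment column becomes a block of 20 entries with a 1 at the observed amino acid. Encoding a gap as an all-zero block is an assumption, as are the toy sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(sequence):
    """Turn an aligned sequence of length l into a 0/1 vector of length 20*l."""
    vector = []
    for residue in sequence:
        block = [0] * len(AMINO_ACIDS)
        if residue in AMINO_ACIDS:  # assumption: gaps stay all-zero
            block[AMINO_ACIDS.index(residue)] = 1
        vector.extend(block)
    return vector

# Hypothetical alignment: N = 3 sequences, l = 3 columns.
alignment = ["ACD", "A-D", "GCD"]
matrix = [encode(seq) for seq in alignment]  # N rows, 20*l columns
```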

Use Principal Component Analysis
- get the covariance matrix, which tells you how factors are correlated with one another
- eliminate covariance by finding the eigenvectors/eigenvalues of the covariance matrix
- the largest eigenvalues and their corresponding eigenvectors give you the principal components, i.e. the largest factors determining the distribution of your dataset
- they take the three largest (the largest of which represents the consensus sequence)
- project their 20*l dimensional data onto those 3 dimensions
- this can be used to predict a protein subfamily for a given protein
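The PCA steps above can be sketched in plain Python. This version centres the data before building the covariance matrix and finds the leading eigenvectors by power iteration with deflation; both choices are assumptions (the slides do not say whether the data is centred or which solver is used). The toy rows are invented, and only two components are extracted because the toy data has just two non-zero eigenvalues.

```python
import math
import random

def center(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_components(cov, k, iterations=500, seed=0):
    """Return k (eigenvalue, eigenvector) pairs, largest eigenvalue first."""
    rng = random.Random(seed)
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy; it is deflated below
    components = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append((eigenvalue, v))
        # Deflation: remove the component just found before looking for the next.
        for a in range(d):
            for b in range(d):
                cov[a][b] -= eigenvalue * v[a] * v[b]
    return components

def project(rows, components):
    return [[sum(x * c for x, c in zip(r, vec)) for _, vec in components]
            for r in rows]

# Toy data: the first three rows and the last three rows form two groups.
rows = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 1, 0],
        [0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
centred = center(rows)
components = top_components(covariance(centred), k=2)
coords = project(centred, components)
```

In the toy example the first coordinate has one sign for the first group and the opposite sign for the second, which is the kind of separation used to assign a sequence to a subfamily. The sign of an eigenvector is arbitrary, so only the separation is meaningful, not which group ends up positive.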

Casari, et al. (1995) A method to predict functional residues in proteins

Construction of a “comparison matrix”
- multiply the matrix by its transpose
- solve for eigenvectors and eigenvalues as before

Columns of f represent amino acid values and positions
- becomes possible to examine individual amino acid residues and positions
- plotted on a graph, shows residue correlation to type of protein subfamily
- does this actually work?

General Weirdness
