The use of 4-grams for Protein Classification and Sequence Comparison
Dror Tobi, ShannChing Chen, Ivet Bahar
The 4-gram Concept
Each sequence or group of sequences is represented as a vector in the 20^4-dimensional (160,000-dimensional) space of 4-grams
The % sequence identity between two sequences correlates with the cosine of the angle between their vectors
4-gram: a short sequence of four amino acids, e.g. AASD, QLIR, FGTY
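As a minimal sketch of the concept, the snippet below lists the 4-grams of a sequence. It assumes that 4-grams are taken as overlapping windows that slide one residue at a time, which the slides do not state explicitly; the function name `four_grams` is hypothetical.

```python
def four_grams(sequence):
    """Return every overlapping window of four residues in the sequence."""
    # Assumption: windows overlap and advance by one residue.
    return [sequence[i:i + 4] for i in range(len(sequence) - 3)]


# With 20 amino acids there are 20**4 = 160,000 possible 4-grams.
NUMBER_OF_POSSIBLE_4GRAMS = 20 ** 4

print(four_grams("AASDQLIR"))
```

A sequence shorter than four residues yields an empty list.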
Representation of Sequence(s) as 4-gram Vector(s)
Three steps:
1. Calculating 4-gram frequencies in the examined DB
2. Calculating 4-gram frequencies for a given sequence or a given family of sequences
3. Creating a 4-gram vector using a weight function
1. Calculating 4-gram frequencies in DB
We chose Swiss-Prot as the reference DB.
A table of the # of occurrences of each 4-gram was created:
AAAA 10929
AAAR 2230
..
VVVV 1402
The table enables us to calculate the database frequency of 4-gram i as
$$ f_i^{db} = \frac{\#\text{ of occurrences of 4-gram } i}{\#\text{ of occurrences of all 4-grams}} $$
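The database step can be sketched as follows. This is an illustration rather than the authors' code: the two-sequence `toy_db` is a hypothetical stand-in for Swiss-Prot, overlapping windows are assumed, and the function names are invented for the example.

```python
from collections import Counter


def count_four_grams(sequences):
    """Build the occurrence table: 4-gram -> number of occurrences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 3):  # assumption: overlapping windows
            counts[seq[i:i + 4]] += 1
    return counts


def db_frequencies(counts):
    """Divide each count by the total number of 4-gram occurrences."""
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


toy_db = ["AAAAR", "AAARV"]  # hypothetical two-sequence database
table = count_four_grams(toy_db)
freqs = db_frequencies(table)
print(table["AAAR"], freqs["AAAR"])
```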
2. Calculating 4-gram frequencies of a sequence (or family)
The 4-gram frequencies for a given sequence or family of sequences are calculated using a hash table.
Each 4-gram (xxxx) is entered into the hash table together with its count (n), and the family frequency of 4-gram i is then calculated as
$$ f_i^{f} = \frac{\#\text{ of occurrences of 4-gram } i}{\#\text{ of occurrences of all 4-grams (over all members of the family)}} $$
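A minimal sketch of the hash-table step, using a plain Python `dict` as the hash table. It also returns the average count per family member, assuming that "average" means the total count divided by the number of members; that reading, the overlapping windows, and the function name are assumptions.

```python
def family_frequencies(members):
    """Return (family frequency, average count per member) for each 4-gram."""
    counts = {}  # hash table: 4-gram -> occurrences over all family members
    for seq in members:
        for i in range(len(seq) - 3):  # assumption: overlapping windows
            gram = seq[i:i + 4]
            counts[gram] = counts.get(gram, 0) + 1
    total = sum(counts.values())
    freqs = {gram: n / total for gram, n in counts.items()}
    # Assumption: the average is taken per family member.
    avg_counts = {gram: n / len(members) for gram, n in counts.items()}
    return freqs, avg_counts


freqs, avg_counts = family_frequencies(["AASDA", "AASD"])
print(freqs, avg_counts)
```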
3. The 4-gram weight function
The weight of 4-gram i for sequence/family f is defined as:
$$ W_i^{f} = n_i^{f} \ln\frac{f_i^{f}}{f_i^{db}} $$
where $n_i^{f}$ is the average number of times 4-gram i appears in family f
If $f_i^{f} > f_i^{db}$ then $W_i > 0$
If $f_i^{f} = f_i^{db}$ then $W_i = 0$ (no important contribution)
If $f_i^{f} < f_i^{db}$ then $W_i < 0$
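The sign behaviour of the weight can be checked with a short sketch. It assumes the weight has the form W = n · ln(f_family / f_db), a reading that matches the three sign cases above; the function and parameter names are hypothetical.

```python
import math


def weight(avg_count, family_freq, db_freq):
    """Weight of a 4-gram, assuming W = n * ln(f_family / f_db)."""
    # Both frequencies must be positive; 4-grams absent from the
    # family or the database are not handled in this sketch.
    return avg_count * math.log(family_freq / db_freq)


over = weight(2.0, 0.02, 0.01)    # over-represented in the family
equal = weight(2.0, 0.01, 0.01)   # same frequency as in the database
under = weight(2.0, 0.005, 0.01)  # under-represented in the family
print(over, equal, under)
```

Because the weight grows with both the count and the log-ratio, a 4-gram has to be both frequent in the family and unusual relative to the database to get a large |W|.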
Building a 4-gram Vector (cont’d)
A 4-gram vector of length k is built from the k 4-grams with the highest |Wi| values. These 4-grams are referred to as the k most discriminative 4-grams.
The selection of the k most discriminative 4-grams is done using a heap data structure.
[Figure: heap holding (4-gram, weight) pairs: (xxxx1, w1), (xxxx5, w5), (xxxx9, w9), (xxxx1050, w1050), (xxxx1001, w1001)]
The vector elements are sorted according to their 4-gram identity using the quicksort algorithm.
[Figure: the sorted vector as an array of k elements (1, 2, ..., k), each holding a 4-gram identity and its weight]
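The selection and sorting steps can be sketched with Python's standard `heapq` module. This is an illustration under assumptions: `build_vector` and the toy weights are hypothetical, and Python's built-in `sorted` uses Timsort rather than quicksort, which does not change the result.

```python
import heapq


def build_vector(weights, k):
    """Keep the k 4-grams with the largest |W|, sorted by 4-gram identity."""
    # heapq.nlargest selects the top k items with a heap, without
    # sorting the whole weight table.
    top = heapq.nlargest(k, weights.items(), key=lambda item: abs(item[1]))
    # Sorting the (4-gram, weight) pairs orders them by 4-gram identity.
    return sorted(top)


toy_weights = {"AAAA": 0.1, "CCCC": -2.0, "DDDD": 1.5, "EEEE": 0.3}
vector = build_vector(toy_weights, 2)
print(vector)
```

Sorting by identity is what later allows two vectors to be compared in a single linear pass.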
Comparing two Vectors
Vector similarity is measured by the cosine of the angle between the two vectors
$$ \cos(\theta) = \frac{u \cdot v}{|u|\,|v|} $$
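[Figure: two 4-gram vectors side by side. The first holds (xxxx1, w1), (xxxx5, w5), (xxxx9, w9), (xxxx1050, w1050), (xxxx1001, w1001); the second holds (xxxx5, w5), (xxxx6, w6), (xxxx9, w9), (xxxx1056, w1056), (xxxx1001, w1001).]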
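Since each vector stores only k (4-gram, weight) pairs sorted by 4-gram identity, the cosine can be computed with one merge-style pass over the two lists. The sketch below assumes that representation; the function name and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors given as identity-sorted (4-gram, weight) lists."""
    dot = 0.0
    i = j = 0
    # Walk both sorted lists in step; only shared 4-grams add to the dot product.
    while i < len(u) and j < len(v):
        if u[i][0] == v[j][0]:
            dot += u[i][1] * v[j][1]
            i += 1
            j += 1
        elif u[i][0] < v[j][0]:
            i += 1
        else:
            j += 1
    norm_u = math.sqrt(sum(w * w for _, w in u))
    norm_v = math.sqrt(sum(w * w for _, w in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: empty vectors score zero
    return dot / (norm_u * norm_v)


u = [("AAAA", 3.0), ("CCCC", 4.0)]
v = [("CCCC", 1.0)]
print(cosine(u, v))
```

The pass touches at most len(u) + len(v) elements, so for a fixed k the cost of a comparison does not depend on the length of the original sequences.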
EC4 family classification
EC4 Test
1769 families (containing a total of 10,919 enzymes) defined at the fourth level of the EC classification (EC4, at Expasy) were considered (*). A 4-gram vector (model, probe vector) was built for each EC4 family.
The cosine between the probe vector for a given EC4 family and the 4-gram vector of each sequence in Swiss-Prot was calculated. All sequences were rank-ordered based on their cosine values.
(*) out of a total of ~4000 in SWISS-PROT release 27.7, excluding families that do not contain any sequences
Success Definition
% success is defined as the % of family members having a cosine value higher than that of any non-family sequence in the Swiss-Prot DB.
Example: for a family (F00X) that has five members, F001-F005:
F001 0.567
F003 0.456
F005 0.354
F002 0.333
P0SD 0.301
F004 0.255
…..
A case of 80% success: four of the five family members rank above P0SD, the highest-scoring non-family sequence. (Family members are colored blue on the slide.)
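The success measure can be reproduced on the slide's example with a short sketch. The function name `percent_success` and the data layout (a list of (sequence ID, cosine) pairs plus a set of family IDs) are assumptions made for the illustration.

```python
def percent_success(scored, family):
    """% of family members scoring above every non-family sequence."""
    best_non_family = max(
        (cos for seq_id, cos in scored if seq_id not in family),
        default=float("-inf"),
    )
    hits = sum(
        1 for seq_id, cos in scored
        if seq_id in family and cos > best_non_family
    )
    return 100.0 * hits / len(family)


scored = [
    ("F001", 0.567), ("F003", 0.456), ("F005", 0.354),
    ("F002", 0.333), ("P0SD", 0.301), ("F004", 0.255),
]
family = {"F001", "F002", "F003", "F004", "F005"}
print(percent_success(scored, family))  # 80.0
```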
EC4 Initial Results
EC4 Results
[Figure: bar chart of the # of families (y-axis, 0 to 1600) in each % success bin (x-axis: 100, > 95, > 90, > 85, > 80, < 80), with three series labeled 1K, 5K, and 10K]
EC 1.14.12.3: a case of failure
EC 1.14.12.3 is a family of four proteins. When we tested this family against Swiss-Prot, no family member had a higher cosine value than the highest cosine value among non-family members.
EC 1.14.12.3: Phylogenetic tree
•THIS DIOXYGENASE SYSTEM CONSISTS OF FOUR PROTEINS: THE TWO SUBUNITS OF THE HYDROXYLASE COMPONENT (BEDC1 AND BEDC2), A FERREDOXIN (BEDB) AND A FERREDOXIN REDUCTASE (BEDA).
Sequence homogeneity is a prerequisite for successful 4-gram classification
[Figure: two sub-family vectors and the family vector]
Preliminary Conclusions
4-gram classification is a fast way to classify/cluster sequences. 120,000 comparisons take ~4 min on a regular desktop.
Sequence homogeneity within a family is a prerequisite for successful classification.
The EC classification classifies enzymes according to their function, which does not necessarily correlate with classification based upon sequence similarity.
Use of 4-grams in Sequence Search
The 4-gram vector “as is” measures “sequence identity” and therefore can easily detect close sequences (>55% identity)
But what about sequences with low sequence identity (30-55%)?
Case of P03579 / P03581
43.6% identity; Global alignment score: 414

10 20 30 40 50 60
P03579 MPYTINSPSQFVYLSSAYADPVQLINLCTNALGNQFQTQQARTTVQQQFADAWKPVPSMT
: :.: .:::.::.. ::: . ..: :: .:.::::..: ... . . : :
P03581 MAYSIPTPSQLVYFTENYADYIPFVNRLINARSNSFQTQSGRDELREILIKSQVSVVSPI
10 20 30 40 50 60

70 80 90 100 110
P03579 VRFPASD-FYVYRYNSTLDPLITALLNSFDTRNRIIEVDNQPAPNTTEIVNATQRVDDAT
:::: .:.: . ... . ::::.: :::::.:::.:. .:.: .::..:.:::.
P03581 SRFPAEPAYYIYLRDPSISTVYTALLQSTDTRNRVIEVENSTNVTTAEQLNAVRRTDDAS
70 80 90 100 110 120

120 130 140 150
P03579 VAIRASINNLANELVRGTGMFNQAGFETASGLVW--TTTPAT-
.::. ....: . :. :::.::...::.::::.: :::: :
P03581 TAIHNNLEQLLSLLTNGTGVFNRTSFESASGLTWLVTTTPRTA
130 140 150 160
Cos(P03579, P03581) = 0.04
Improving Sensitivity using homology 4-grams
P03579 MPYTINSPSQFVYLSSAY
       : :.: .:::.::..  :
P03581 MAYSIPTPSQLVYFTENY

Identity 4-grams: SPSQ
Homology 4-grams: APSQ, NPSQ, TPSQ, …, SPSK
Identity 4-grams → Identity Vector; Homology 4-grams → Homology Vector
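The slides do not say how homology 4-grams are generated. One reading consistent with the SPSQ example is to replace a single residue at a time with a similar amino acid. The sketch below follows that reading; the `SIMILAR` table is a hypothetical two-entry stand-in for a real substitution scheme, and the function name is invented for the example.

```python
# Hypothetical similarity table: residue -> residues treated as similar.
# A real table would be derived from a substitution matrix.
SIMILAR = {"S": "ANT", "Q": "K"}


def homology_four_grams(gram, similar=SIMILAR):
    """Return the 4-grams that differ from `gram` by one similar residue."""
    neighbors = []
    for pos, residue in enumerate(gram):
        for substitute in similar.get(residue, ""):
            neighbors.append(gram[:pos] + substitute + gram[pos + 1:])
    return neighbors


print(homology_four_grams("SPSQ"))
```

With this toy table the identity 4-gram SPSQ yields seven neighbors, starting with APSQ, NPSQ, TPSQ and ending with SPSK.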
Including homology in vector comparison
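[Figure: the query sequence is represented by an identity vector and a homology vector; each is compared with the unknown sequence, giving the angles θi and θh]
Score = cos(θi) + λ cos(θh), where λ is the weight given to the homology term (λ = 1, 5, and 10 in the search results)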
4-gram Search Results
Sequence  % identity  Identity (fraction)  cos(θi)  cos(θh)  λ = 1  λ = 5  λ = 10
Q8UFL7    100.00%     1.00                 1        0.007    1.007  1.035  1.07
Q92Q49    64.30%      0.64                 0.248    0.053    0.301  0.513  0.778
Q98MC1    51.20%      0.51                 0.152    0.034    0.186  0.322  0.492
Q8YHH1    51.70%      0.52                 0.151    0.044    0.195  0.371  0.591
Q9A710    38.50%      0.39                 0.065    0.043    0.108  0.28   0.495
Q8VQ25    36.90%      0.37                 0.059    0.036    0.095  0.239  0.419
Q9ZE02    34.50%      0.35                 0.058    0.032    0.09   0.218  0.378
Q9CJL2    29.80%      0.30                 0.057    0.032    0.089  0.217  0.377
Q9PEI1    27.50%      0.28                 0.05     0.023    0.073  0.165  0.28
Q9K1G9    29.70%      0.30                 0.049    0.039    0.088  0.244  0.439
Q9JX32    29.90%      0.30                 0.049    0.039    0.088  0.244  0.439
Q9HXY3    30.40%      0.30                 0.048    0.029    0.077  0.193  0.338
Q92J66    33.90%      0.34                 0.048    0.043    0.091  0.263  0.478
Q8ZH59    29.50%      0.30                 0.047    0.025    0.072  0.172  0.297
Q8XZI4    29.20%      0.29                 0.044    0.024    0.068  0.164  0.284
Q8ZRP1    30.00%      0.30                 0.042    0.035    0.077  0.217  0.392
Q8Z9A4    30.20%      0.30                 0.042    0.035    0.077  0.217  0.392
P37764    31.80%      0.32                 0.042    0.039    0.081  0.237  0.432
P44936    27.80%      0.28                 0.041    0.028    0.069  0.181  0.321
Q9KA70    30.90%      0.31                 0.04     0.031    0.071  0.195  0.35
O67776    29.50%      0.30                 0.04     0.044    0.084  0.26   0.48
Q9KPV9    27.70%      0.28                 0.038    0.025    0.063  0.163  0.288
Q9ZMH8    28.10%      0.28                 0.035    0.029    0.064  0.18   0.325
The last three columns give the score cos(θi) + λ cos(θh) for λ = 1, 5, and 10.
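The three score columns in the results are consistent with adding the identity cosine to a weighted homology cosine. The sketch below checks that reading against the Q92Q49 row; the parameter name `lam` for the weighting factor is an assumption, since the original symbol did not survive in the slide text.

```python
def combined_score(cos_identity, cos_homology, lam):
    """Score = cos(identity angle) + lam * cos(homology angle)."""
    return cos_identity + lam * cos_homology


# Q92Q49 row: identity cosine 0.248, homology cosine 0.053.
for lam in (1, 5, 10):
    print(lam, round(combined_score(0.248, 0.053, lam), 3))
```

Raising the weighting factor lets the homology term dominate for distant sequences, whose identity cosine is close to zero.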
Correlation between cosine value and Sequence alignment % identity
Query   λ = 0  λ = 1  λ = 5  λ = 10  >30% Identity Found
O00584  0.95   0.97   1.00   0.97    2
O67369  0.97   0.97   0.94   0.87    1
P00299  0.86   0.89   0.93   0.91    37
P03579  0.89   0.93   0.97   0.96    30
P03860  0.98   0.98   0.98   0.98    1
P10101  0.99   0.98   0.98   0.97    1
P13341  0.99   0.99   0.99   0.99    1
P18568  1.00   1.00   1.00   1.00    3
P21723  0.90   0.92   0.95   0.92    595
P38097  0.98   0.99   0.99   0.99    1
P59542  0.78   0.84   0.93   0.94    32
P95354  0.88   0.89   0.90   0.87    4
P97222  0.87   0.90   0.94   0.93    139
Q06529  0.99   0.99   0.99   0.99    1
Q929U9  0.63   0.72   0.86   0.86    9
Q46804  0.94   0.94   0.97   0.99    2
Q8UFL7  0.73   0.79   0.92   0.91    13
λ is the weight on the homology cosine; λ = 0 uses the identity vector alone.
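The slides do not state which correlation measure was used. Assuming it is the Pearson correlation coefficient, a value of this kind could be computed as sketched below. The `pearson` helper is hypothetical, and the sample data are the first five rows of the Q8UFL7 search results, used only to show the call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# % identity and identity cosine for the first five hits of query Q8UFL7.
percent_identity = [100.0, 64.3, 51.2, 51.7, 38.5]
identity_cosine = [1.0, 0.248, 0.152, 0.151, 0.065]
print(pearson(percent_identity, identity_cosine))
```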
Conclusions
The use of homology 4-grams improves detection of distant sequences (30-55% sequence identity).
The 4-gram based method also seems suitable for sequence search.
After precalculating the sequences’ 4-gram vectors, two sequences can be compared in O(1) time with respect to sequence length, since the cost depends only on the fixed vector length k.