The Use of 4-grams for Protein Classification and Sequence Comparison

Dror Tobi, ShannChing Chen, Ivet Bahar

The 4-gram Concept

Each sequence or group of sequences is represented as a vector in the 20^4-dimensional (160,000-dimensional) space of 4-grams

The % sequence identity between two sequences correlates with the cosine of the angle between their vectors

4-gram: a short sequence of four amino acids, e.g. AASD, QLIR, FGTY
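As a rough illustration of the 4-gram representation, the sketch below lists the 4-grams of a sequence and maps each one to a coordinate in the 20^4-dimensional space. The use of overlapping windows, the skipping of windows with non-standard residues, and the names `four_grams` and `gram_index` are assumptions made for this example; the slides do not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def four_grams(sequence):
    """Return the overlapping 4-grams of a sequence (assumed windowing).

    Windows containing a non-standard residue are skipped.
    """
    grams = []
    for start in range(len(sequence) - 3):
        gram = sequence[start:start + 4]
        if all(residue in AA_INDEX for residue in gram):
            grams.append(gram)
    return grams

def gram_index(gram):
    """Map a 4-gram to an integer coordinate in [0, 20**4)."""
    index = 0
    for residue in gram:
        index = index * 20 + AA_INDEX[residue]
    return index

grams = four_grams("AASDQLIR")   # AASD, ASDQ, SDQL, DQLI, QLIR
```

With this encoding AAAA maps to coordinate 0 and YYYY to 20^4 - 1 = 159,999.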

Representation of Sequence(s) as 4-gram Vector(s)

Three steps:

1. Calculating 4-gram frequencies in the examined DB

2. Calculating 4-gram frequencies for a given sequence or a given family of sequences

3. Creating a 4-gram vector using a weight function

1. Calculating 4-gram frequencies in DB

We chose Swiss-Prot as the reference DB.

A table of the number of occurrences of each 4-gram was created:

AAAA    10929
AAAR    2230
..
VVVV    1402

The table enables us to calculate the database frequency of 4-gram i as

$$f_i^{db} = \frac{\#\text{ of occurrences of 4-gram } i}{\#\text{ of occurrences of all 4-grams}}$$

2. Calculating 4-gram frequencies of a sequence (or family)

The 4-gram frequencies for a given sequence or family of sequences are calculated using a hash table.

Each 4-gram is entered into the hash table, from which the 4-gram family frequency is calculated:

[Figure: hash table entries, each pairing a 4-gram (xxxx) with its count n]

$$f_i^f = \frac{\#\text{ of occurrences of 4-gram } i}{\#\text{ of occurrences of all 4-grams (over all members of the family)}}$$
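A minimal sketch of the hash-table step, using Python's `collections.Counter` (a dict subclass) as the hash table. Overlapping windows and the function name `family_frequencies` are assumptions for illustration only.

```python
from collections import Counter

def family_frequencies(sequences):
    """Count every overlapping 4-gram over all family members, then normalise.

    Returns (counts, freqs): raw occurrence counts, and the frequency of each
    4-gram relative to all 4-gram occurrences in the family.
    """
    counts = Counter()                     # hash table: 4-gram -> count
    for seq in sequences:
        for start in range(len(seq) - 3):
            counts[seq[start:start + 4]] += 1
    total = sum(counts.values())
    freqs = {gram: n / total for gram, n in counts.items()}
    return counts, freqs

# Toy family of two short sequences
counts, freqs = family_frequencies(["AASDQ", "AASDK"])
```

Here AASD occurs twice among four 4-gram occurrences, so its family frequency is 0.5. The same counting applied to the whole reference DB would give the database frequencies.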

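3. The 4-gram weight function

The weight of 4-gram i for sequence/family f is defined as:

$$W_i^f = n_i^f \ln\frac{f_i^f}{f_i^{db}}$$

where $n_i^f$ is the average number of times 4-gram i appears in family f, $f_i^f$ is its frequency in the family and $f_i^{db}$ is its frequency in the database

If $f_i^f > f_i^{db}$ then $W_i^f > 0$

If $f_i^f = f_i^{db}$ then $W_i^f = 0$ (no important contribution)

If $f_i^f < f_i^{db}$ then $W_i^f < 0$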
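The weight function can be sketched as follows. This is an illustration under stated assumptions: the average count is taken as the family count divided by the number of members, 4-grams absent from the family are left out (their weight would be zero), and 4-grams missing from the reference DB are skipped. The function and parameter names are hypothetical.

```python
import math

def four_gram_weights(family_counts, n_members, db_freqs):
    """Weight = (average count per member) * ln(family freq / DB freq)."""
    total = sum(family_counts.values())
    weights = {}
    for gram, count in family_counts.items():
        f_db = db_freqs.get(gram)
        if not f_db:
            continue                      # assumption: skip 4-grams unseen in the DB
        f_family = count / total          # family frequency of this 4-gram
        n_average = count / n_members     # average occurrences per family member
        weights[gram] = n_average * math.log(f_family / f_db)
    return weights

# Toy numbers chosen to show the three sign cases
family_counts = {"AASD": 2, "ASDQ": 1, "ASDK": 1}
db_freqs = {"AASD": 0.25, "ASDQ": 0.25, "ASDK": 0.5}
weights = four_gram_weights(family_counts, n_members=2, db_freqs=db_freqs)
```

In this toy case AASD is over-represented in the family (positive weight), ASDQ matches the database frequency (zero weight) and ASDK is under-represented (negative weight).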

Building a 4-gram Vector (cont’d)

A 4-gram vector of length k is built from the k 4-grams with the highest |W_i| values. These 4-grams are referred to as the k most discriminative 4-grams.

The selection of the k most discriminative 4-grams is done using a heap data structure.

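[Figure: heap of (4-gram, weight) pairs: (xxxx1, w1), (xxxx5, w5), (xxxx9, w9), (xxxx1050, w1050), (xxxx1001, w1001)]

The vector elements are sorted according to their 4-gram identity using the quicksort algorithm.

[Figure: the resulting vector with elements 1, 2, ..., k, each holding a 4-gram identity and its weight]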
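The selection and sorting steps might look like the sketch below. It uses `heapq.nlargest` from the standard library for the heap-based selection and the built-in `sorted` for ordering; note that `sorted` is Timsort rather than quicksort, so it is only a stand-in for the sort named on the slide. The name `build_vector` is an assumption.

```python
import heapq

def build_vector(weights, k):
    """Keep the k 4-grams with the largest |W|, ordered by 4-gram identity."""
    # Heap-based selection of the k most discriminative 4-grams
    top_k = heapq.nlargest(k, weights.items(), key=lambda item: abs(item[1]))
    # Sort the (4-gram, weight) pairs by 4-gram so two vectors can be merged later
    return sorted(top_k)

weights = {"AASD": 0.7, "QLIR": -1.2, "FGTY": 0.1, "ASDQ": 0.0}
vector = build_vector(weights, k=2)
```

Because selection is by absolute value, the strongly negative QLIR is kept alongside AASD, while the near-zero weights are dropped.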

Comparing two Vectors

Vector similarity is measured by the cosine of the angle between the two vectors

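$$\cos(\theta) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}|\,|\mathbf{v}|}$$

[Figure: two 4-gram vectors being compared. The first holds (xxxx1, w1), (xxxx5, w5), (xxxx9, w9), (xxxx1050, w1050), (xxxx1001, w1001); the second holds (xxxx5, w5), (xxxx6, w6), (xxxx9, w9), (xxxx1056, w1056), (xxxx1001, w1001)]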
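One way to compute the cosine between two such vectors is a single merge pass over the two lists, which relies on both being sorted by 4-gram identity. This is a sketch, not the authors' implementation; the list-of-pairs layout and the zero-vector convention are assumptions.

```python
import math

def cosine(u, v):
    """Cosine between two sparse vectors given as lists of (4-gram, weight)
    pairs, each sorted by 4-gram."""
    i = j = 0
    dot = 0.0
    while i < len(u) and j < len(v):
        if u[i][0] == v[j][0]:            # shared 4-gram contributes to the dot product
            dot += u[i][1] * v[j][1]
            i += 1
            j += 1
        elif u[i][0] < v[j][0]:
            i += 1
        else:
            j += 1
    norm_u = math.sqrt(sum(w * w for _, w in u))
    norm_v = math.sqrt(sum(w * w for _, w in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                        # assumption: empty vectors score zero
    return dot / (norm_u * norm_v)

u = [("AASD", 1.0), ("FGTY", 2.0)]
v = [("FGTY", 2.0), ("QLIR", 1.0)]
similarity = cosine(u, v)
```

In the toy example only FGTY is shared, giving a dot product of 4 and norms of sqrt(5) each, so the cosine is 0.8.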

EC4 family classification

EC4 Test

1769 families (containing a total of 10,919 enzymes) defined at level 4 of the EC classification (at Expasy) were considered (*). A 4-gram vector (model, probe vector) was built for each EC4 family.

The cosine between the probe vector for a given EC4 family and the 4-gram vector of each sequence in Swiss-Prot was calculated. All sequences were rank-ordered based on their cosine values.

(*) out of a total of ~4000 in SWISS-PROT release 27.7, excluding families that do not contain any sequences

http://us.expasy.org/enzyme/


Success Definition

% success is defined as the % of family members having a cosine value higher than that of any non-family sequence in the Swiss-Prot DB.

Example: for a family (F00X) that has five members, F001-5

F001    0.567
F003    0.456
F005    0.354
F002    0.333
P0SD    0.301    (non-family)
F004    0.255
…..

A case of 80% success: four of the five family members rank above the highest-scoring non-family sequence (P0SD).
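The success measure can be written out as a short function. This is a sketch that assumes the ranking is already sorted by descending cosine and ignores ties; the name `percent_success` is hypothetical.

```python
def percent_success(ranked, family):
    """Percentage of family members ranked above every non-family sequence.

    ranked: list of (sequence_id, cosine) in descending cosine order.
    family: set of sequence ids belonging to the family.
    """
    hits = 0
    for sequence_id, _cosine in ranked:
        if sequence_id not in family:
            break                         # first non-family sequence ends the run
        hits += 1
    return 100.0 * hits / len(family)

# The ranking from the slide example
ranked = [("F001", 0.567), ("F003", 0.456), ("F005", 0.354),
          ("F002", 0.333), ("P0SD", 0.301), ("F004", 0.255)]
family = {"F001", "F002", "F003", "F004", "F005"}
success = percent_success(ranked, family)   # 80.0
```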

EC4 Initial Results

[Figure: "EC4 Results" bar chart. x-axis: % success (bins 100, > 95, > 90, > 85, > 80, < 80); y-axis: # of families (0 to 1600); series: 1K, 5K, 10K]

EC 1.14.12.3: a case of failure

EC 1.14.12.3 is a family of four proteins. When we tested this family against Swiss-Prot, no family member had a higher cosine value than the highest cosine value of non-family members.

[Figure: EC 1.14.12.3 phylogenetic tree]

• This dioxygenase system consists of four proteins: the two subunits of the hydroxylase component (BEDC1 and BEDC2), a ferredoxin (BEDB) and a ferredoxin reductase (BEDA).

Sequence homogeneity is a prerequisite for successful 4-gram classification

[Figure: two sub-family vectors and the family vector]

Preliminary Conclusions

4-gram classification is a fast way to classify/cluster sequences. 120,000 comparisons take ~4 min on a regular desktop.

Sequence homogeneity within a family is a prerequisite for successful classification.

The EC classification classifies enzymes according to their function, which does not necessarily correlate with classification based upon sequence similarity.

Uses of 4-grams in Sequence Search

The 4-gram vector “as is” measures “sequence identity” and can therefore easily detect close sequences (>55% identity)

But what about sequences with low sequence identity (30-55%)?

Case of P03579 / P03581

43.6% identity; Global alignment score: 414

               10        20        30        40        50        60
P03579 MPYTINSPSQFVYLSSAYADPVQLINLCTNALGNQFQTQQARTTVQQQFADAWKPVPSMT
       : :.: .:::.::..  ::: . ..:   :: .:.::::..: ...  .  .   : :
P03581 MAYSIPTPSQLVYFTENYADYIPFVNRLINARSNSFQTQSGRDELREILIKSQVSVVSPI
               10        20        30        40        50        60

                70        80        90       100       110
P03579 VRFPASD-FYVYRYNSTLDPLITALLNSFDTRNRIIEVDNQPAPNTTEIVNATQRVDDAT
        ::::   .:.:  . ... . ::::.: :::::.:::.:.   .:.: .::..:.:::.
P03581 SRFPAEPAYYIYLRDPSISTVYTALLQSTDTRNRVIEVENSTNVTTAEQLNAVRRTDDAS
               70        80        90       100       110       120

     120       130       140       150
P03579 VAIRASINNLANELVRGTGMFNQAGFETASGLVW--TTTPAT-
       .::. ....: . :. :::.::...::.::::.:  :::: :
P03581 TAIHNNLEQLLSLLTNGTGVFNRTSFESASGLTWLVTTTPRTA
              130       140       150       160

Cos(P03579, P03581) = 0.04

Improving Sensitivity using homology 4-grams

P03579 MPYTINSPSQFVYLSSAY
       : :.: .:::.::..  :
P03581 MAYSIPTPSQLVYFTENY

Identity 4-grams: SPSQ

Homology 4-grams: APSQ, NPSQ, TPSQ, …, SPSK

[Figure: identity 4-grams feed the Identity Vector; homology 4-grams feed the Homology Vector]
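The slide example (SPSQ giving APSQ, NPSQ, TPSQ, …, SPSK) is consistent with generating homology 4-grams by substituting one residue at a time with a similar residue. The sketch below illustrates that reading only. The substitution rule and the `SIMILAR` table are hypothetical; the slides do not state which substitutions were treated as homologous.

```python
# Hypothetical similarity groups, for illustration only.
SIMILAR = {"S": "ANT", "Q": "EKR"}

def homology_four_grams(gram, similar=SIMILAR):
    """Return the 4-grams that differ from `gram` at exactly one position
    by a substitution listed in `similar` (an assumed rule)."""
    neighbours = []
    for position, residue in enumerate(gram):
        for substitute in similar.get(residue, ""):
            neighbours.append(gram[:position] + substitute + gram[position + 1:])
    return neighbours

homologs = homology_four_grams("SPSQ")
```

With this toy table SPSQ yields nine neighbours, beginning with APSQ, NPSQ and TPSQ and including SPSK.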

Including homology in vector comparison

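[Figure: the query sequence is represented by an identity vector and a homology vector; comparing them with the unknown sequence gives the angles θ_i and θ_h]

Score = cos(θ_i) + λ · cos(θ_h), where λ is the weight given to the homology term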

4-gram Search Results

Sequence  % identity  Identity    cos θ_i  cos θ_h  Score, homology weight
                      (fraction)                    = 1     = 5     = 10
Q8UFL7    100.00%     1.00        1        0.007    1.007   1.035   1.07
Q92Q49    64.30%      0.64        0.248    0.053    0.301   0.513   0.778
Q98MC1    51.20%      0.51        0.152    0.034    0.186   0.322   0.492
Q8YHH1    51.70%      0.52        0.151    0.044    0.195   0.371   0.591
Q9A710    38.50%      0.39        0.065    0.043    0.108   0.28    0.495
Q8VQ25    36.90%      0.37        0.059    0.036    0.095   0.239   0.419
Q9ZE02    34.50%      0.35        0.058    0.032    0.09    0.218   0.378
Q9CJL2    29.80%      0.30        0.057    0.032    0.089   0.217   0.377
Q9PEI1    27.50%      0.28        0.05     0.023    0.073   0.165   0.28
Q9K1G9    29.70%      0.30        0.049    0.039    0.088   0.244   0.439
Q9JX32    29.90%      0.30        0.049    0.039    0.088   0.244   0.439
Q9HXY3    30.40%      0.30        0.048    0.029    0.077   0.193   0.338
Q92J66    33.90%      0.34        0.048    0.043    0.091   0.263   0.478
Q8ZH59    29.50%      0.30        0.047    0.025    0.072   0.172   0.297
Q8XZI4    29.20%      0.29        0.044    0.024    0.068   0.164   0.284
Q8ZRP1    30.00%      0.30        0.042    0.035    0.077   0.217   0.392
Q8Z9A4    30.20%      0.30        0.042    0.035    0.077   0.217   0.392
P37764    31.80%      0.32        0.042    0.039    0.081   0.237   0.432
P44936    27.80%      0.28        0.041    0.028    0.069   0.181   0.321
Q9KA70    30.90%      0.31        0.04     0.031    0.071   0.195   0.35
O67776    29.50%      0.30        0.04     0.044    0.084   0.26    0.48
Q9KPV9    27.70%      0.28        0.038    0.025    0.063   0.163   0.288
Q9ZMH8    28.10%      0.28        0.035    0.029    0.064   0.18    0.325
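The three score columns in the table are consistent with adding the identity cosine to the homology cosine multiplied by a weight of 1, 5 or 10. The sketch below checks this on one row; the function and parameter names are assumptions made for the example.

```python
def combined_score(cos_identity, cos_homology, homology_weight=1.0):
    """Identity cosine plus a weighted homology cosine."""
    return cos_identity + homology_weight * cos_homology

# Row Q92Q49 of the results table: identity cosine 0.248, homology cosine 0.053
scores = [round(combined_score(0.248, 0.053, weight), 3) for weight in (1, 5, 10)]
```

For this row the three values come out as 0.301, 0.513 and 0.778, matching the tabulated scores.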

Correlation between cosine value and Sequence alignment % identity

Query     Correlation, homology weight     # of sequences with
          = 0     = 1     = 5     = 10     >30% identity found
O00584    0.95    0.97    1.00    0.97     2
O67369    0.97    0.97    0.94    0.87     1
P00299    0.86    0.89    0.93    0.91     37
P03579    0.89    0.93    0.97    0.96     30
P03860    0.98    0.98    0.98    0.98     1
P10101    0.99    0.98    0.98    0.97     1
P13341    0.99    0.99    0.99    0.99     1
P18568    1.00    1.00    1.00    1.00     3
P21723    0.90    0.92    0.95    0.92     595
P38097    0.98    0.99    0.99    0.99     1
P59542    0.78    0.84    0.93    0.94     32
P95354    0.88    0.89    0.90    0.87     4
P97222    0.87    0.90    0.94    0.93     139
Q06529    0.99    0.99    0.99    0.99     1
Q929U9    0.63    0.72    0.86    0.86     9
Q46804    0.94    0.94    0.97    0.99     2
Q8UFL7    0.73    0.79    0.92    0.91     13
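The slides do not say which correlation measure was used. Assuming a Pearson correlation coefficient, the calculation could be sketched as below. The sample values are the first five rows of the Q8UFL7 search results (% identity and score at homology weight 1); they are illustrative only and are not meant to reproduce the tabulated correlations.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

percent_identity = [100.0, 64.3, 51.2, 51.7, 38.5]
scores = [1.007, 0.301, 0.186, 0.195, 0.108]
r = pearson(percent_identity, scores)
```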

Conclusions

The use of homology 4-grams improves detection of distant sequences (30-55% sequence identity).

The 4-gram based method also appears suitable for sequence search.

After precalculation of the sequences’ 4-gram vectors, two sequences can be compared in time that is independent of sequence length: O(k) for vectors of length k, i.e. O(1) for a fixed k.
