2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology



0-7803-9387-2/05/$20.00 ©2005 IEEE

Biological Sequence Prediction using General Fuzzy Automata

Mansoor Doostfatemeh
Dept. of Computing and Information Science
University of Guelph
Guelph, ON N1G 2W1
E-mail: [email protected]

Stefan C. Kremer
Dept. of Computing and Information Science
University of Guelph
Guelph, ON N1G 2W1
E-mail: [email protected]

Abstract— This paper shows how the newly developed paradigm of General Fuzzy Automata (GFA) can be used as a biological sequence predictor. We consider the positional correlations of amino acids in a protein family as the basic criteria for prediction and classification of unknown sequences. It will be shown how the GFA formalism can be used as an efficient tool for the classification of protein sequences. The results show that this approach predicts the membership of an unknown sequence in a protein family better than profile Hidden Markov Models (HMMs), which are currently a popular and standard approach in biological sequence analysis.

I. INTRODUCTION

Protein families are sets of proteins that typically share common function, structure, sequence, and genetic origin. The correlations among these attributes allow researchers to infer one attribute on the basis of another. The recent advent of high-throughput genetic sequencing technologies has enabled researchers to readily sequence proteins. By identifying protein families, and classifying proteins into these families based on their amino-acid sequences, it is possible to infer the function, structure, and genetic origin of otherwise unstudied proteins. This, in turn, can provide invaluable insight into biological processes, medical advances, etc.

The process of identifying the family of a protein based on its residue sequence is normally performed using Hidden Markov Models (HMMs) implemented in a software package called HMMER (http://hmmer.wustl.edu/). HMMs are simple computational devices that can generate sequences according to probability distributions, or compute a probability for a given input sequence. In the latter case, a sequence is fed into the HMM one element at a time, and at the end of the sequence a probability is computed. For protein family identification, the sequences are proteins, the elements are the residues, and the computed probability is the probability of membership of a particular input protein in a given family [2], [10].

We propose a different and novel model for protein sequence prediction. We use a General Fuzzy Automaton (GFA) [3], [6] that we will show outperforms the commonly used HMM classifier. The GFA uses the mathematical paradigm of fuzzy set theory to compute a fuzzy membership in a protein family for a given protein. The fuzzy membership represents the degree to which the protein is a member of the family, and can be used in place of the HMM's probability computation to determine family membership. In the sequel, we will briefly introduce fuzzy automata theory. For more information, refer to [11], [12], [13], [3], [6].

A. General fuzzy automata

Automata are the prime example of general computational systems over discrete spaces [8]. A finite-state automaton is a discrete model of a system where each state represents a distinct internal situation (configuration) of the system. The states transit to each other based on some external events (inputs). In fact, a closer look at the systems and applications we deal with daily shows that almost all of them can be modeled as such, e.g., diagnosis systems, factory assembly lines, elevators, controllers, etc. A Deterministic Finite-state Automaton (DFA) is defined as:

Definition 1: [9] A DFA M is a 5-tuple machine denoted as M = (Q, Σ, δ, q0, F ), where:

• Q is a finite set of states, Q = {q0, q1, ..., qn}.
• Σ is a finite set of input symbols, Σ = {a0, a1, ..., am}.
• q0 ∈ Q is the start state of M .
• F ⊆ Q is the set of final states of M .
• δ : Q × Σ → Q is the transition map (function), i.e., δ(qi, ak) = qj , where qi, qj ∈ Q and ak ∈ Σ. This means that the automaton transits from state qi to state qj upon input symbol ak; qi is called the current state, and qj is called the next state.

In many applications there is some kind of vagueness in the system, which makes it difficult to make a sharp diagnosis or decision. Such systems can be better modelled by fuzzy automata, which can be defined as:

Definition 2: [11], [12] A Fuzzy Finite-state Automaton (FFA) M̃ is a 6-tuple machine denoted as M̃ = (Q, Σ, δ, R, Z, ω), where:

• Q is a finite set of states, Q = {q1, q2, ..., qn}.
• Σ is a finite set of input symbols, Σ = {a1, a2, ..., am}.
• R ∈ Q is the fuzzy start state of M̃ .
• Z is a finite set of output labels, Z = {z1, z2, ..., zl}.
• δ : Q × Σ × Q → (0, 1] is the fuzzy transition map.
• ω : Q → Z is the output map.

As can be seen, associated with each fuzzy transition there is a membership value (mv) in the (0, 1] interval. We call this membership value the weight of the transition. The weight (mv) of the transition from state qi (current state) to state qj (next state) upon input symbol ak is denoted δ(qi, ak, qj).

Also, it can be seen from Definition 2 that FFA have two interesting characteristics. First, in contrast to DFA, where at any time only one state is active, in FFA more than one state can be active at any time, each with its own mv (belonging to the [0, 1] interval). The second property of FFA is that there may be several transitions to the same state q at a specific time t. We call this situation multi-membership, and say that state q is multi-membership at time t.

However, this characterization of fuzzy automata suffered from serious insufficiencies [6], which motivated us to develop the formalism of GFA:

Definition 3: [6] (General fuzzy automaton) A GFA F̃ is an 8-tuple machine denoted as F̃ = (Q, Σ, R̃, Z, ω, δ̃, z1, z2), where Q, Σ, R̃, Z, ω are the same as in Definition 2, and:

• z1 : [0, 1] × [0, 1] → [0, 1] is a mapping function which is applied via δ̃ to assign mv's to the active states; it is thus called the membership assignment function.
• δ̃ : (Q × [0, 1]) × Σ × Q → [0, 1] is the augmented transition function; the mv of the destination state is computed via z1(µ, δ).
• z2 : [0, 1]* → [0, 1] is a multi-membership resolution strategy which resolves the multi-membership active states and assigns a single mv to them; it is thus called the multi-membership resolution function.

As can be seen, Definition 3 incorporates two new functions into the operation of a fuzzy automaton. The membership assignment function (z1) combines the mv of the current state (µ) with the weight of the transition (δ) to assign a new mv to the next state. The multi-membership resolution function (z2) resolves the multi-membership states and assigns a unified mv to them (a multi-membership state is the destination of several simultaneous transitions, each of which forces the multi-membership state to take a different mv).

The details of the GFA formalism and its significance can be found in [6]. Here, we just present a simple example to illustrate how a GFA operates, which in turn gives more insight into the way GFA contribute to biological sequence prediction. We use two notations which should be described: Qact(ti) denotes the set of active states of a GFA at time ti, and µti(qm) denotes the mv of state qm at time ti. Thus, µt2(q4) = 0.8 shows that the mv of q4 at time t2 is 0.8, and Qact(t2) = {(q4, 0.8), (q5, 0.25)} states that the active states at time t2 are q4 and q5, with mv's of 0.8 and 0.25 respectively.

Example 1: Consider the GFA in Figure 1, with several transition overlaps. It can be specified as F̃ = (Q, Σ, δ̃, R̃, Z, ω, z1, z2), where:

• Q = {q1, q2, q3, q4, q5} : set of states.
• Σ = {a, b} : set of input symbols.
• R̃ = {(q1, µt0(q1)), (q2, µt0(q2))} = {(q1, 0.66), (q2, 0.74)} : set of initial states.
• δ̃ : (Q × [0, 1]) × Σ × Q → [0, 1], computed via z1(µ, δ) : the augmented transition function.

[Figure 1: state-transition diagram of the GFA of Example 1, with states q1 through q5, start states q1 (µt0(q1) = 0.66) and q2 (µt0(q2) = 0.74), and weighted transitions on the input symbols a and b.]

Fig. 1. The GFA of Example 1

• z1 : mean function, z1(µ, δ) = (µ + δ)/2.
• z2 : maximum function, z2(ν1, ν2, ..., νn) = Max(ν1, ν2, ..., νn).

Note that the output mapping is not of interest in this paper; thus we do not specify Z and ω here.

Assuming that F̃ starts operating at time t0 and the next three inputs are a, b, b respectively (one at a time), the active states and their mv's at each time step are as follows:

• Obviously, at time t0:
Qact(t0) = R̃ = {(q1, 0.66), (q2, 0.74)}

• At t1, the input is a. Thus q2 and q3 get activated. Then:
µt1(q2) = δ̃((q1, µt0(q1)), a, q2) = z1(µt0(q1), δ(q1, a, q2))
        = z1(0.66, 0.03) = Mean(0.66, 0.03) = 0.345

But q3 is multi-membership at t1. Then:
µt1(q3) = z2 over i = 1, 2 of [z1[µt0(qi), δ(qi, a, q3)]]
        = z2[z1[µt0(q1), δ(q1, a, q3)], z1[µt0(q2), δ(q2, a, q3)]]
        = z2[z1(0.66, 0.31), z1(0.74, 0.89)]
        = Max[Mean(0.66, 0.31), Mean(0.74, 0.89)]
        = Max(0.485, 0.815) = 0.815

Then we have:
Qact(t1) = {(q2, µt1(q2)), (q3, µt1(q3))} = {(q2, 0.345), (q3, 0.815)}

• At t2, the input is b. q4 and q5 get activated, and neither of them is multi-membership. It is easy to verify that:
Qact(t2) = {(q4, 0.8675), (q5, 0.2325)}

• With the entry of b at time t3, q2 and q4 get activated. q4 is multi-membership. Then:
µt3(q4) = z2 over i = 4, 5 of [z1[µt2(qi), δ(qi, b, q4)]]
        = z2[z1[µt2(q4), δ(q4, b, q4)], z1[µt2(q5), δ(q5, b, q4)]]
        = Max[Mean(0.8675, 0.32), Mean(0.2325, 0.26)]
        = Max(0.59375, 0.24625) = 0.59375

And we have:
Qact(t3) = {(q2, µt3(q2)), (q4, µt3(q4))} = {(q2, 0.34125), (q4, 0.59375)} □
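The worked trace above is easy to check mechanically. The following is a minimal Python sketch of this GFA (the dictionary encoding and function names are ours, not part of the paper's formalism); the transition weights are the ones appearing in the computations above, as read off Figure 1:

```python
# Transitions of the GFA of Example 1, as (source, symbol, target) -> weight.
delta = {
    ('q1', 'a', 'q2'): 0.03, ('q1', 'a', 'q3'): 0.31,
    ('q2', 'a', 'q3'): 0.89, ('q2', 'b', 'q5'): 0.12,
    ('q3', 'b', 'q4'): 0.92, ('q4', 'b', 'q4'): 0.32,
    ('q5', 'b', 'q4'): 0.26, ('q5', 'b', 'q2'): 0.45,
}

def step(active, symbol):
    """One GFA step: z1 (mean) proposes an mv along each applicable
    transition; z2 (max) resolves multi-membership states."""
    candidates = {}
    for (src, sym, dst), weight in delta.items():
        if sym == symbol and src in active:
            mv = (active[src] + weight) / 2                # z1: mean
            candidates.setdefault(dst, []).append(mv)
    return {q: max(mvs) for q, mvs in candidates.items()}  # z2: max

active = {'q1': 0.66, 'q2': 0.74}    # R~, the initial active states at t0
for symbol in 'abb':
    active = step(active, symbol)
    print(active)
```

Running it reproduces the trace: at t1, q2 ≈ 0.345 and q3 ≈ 0.815; at t2, q4 ≈ 0.8675 and q5 ≈ 0.2325; at t3, the multi-membership resolution of q4 yields Max(Mean(0.8675, 0.32), Mean(0.2325, 0.26)) = 0.59375, with q2 at 0.34125.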

Although it is evident that the positional relation of the residues with respect to each other is very important in characterizing sequence structure and function, the traditional HMMs which are mostly used for protein sequence classification and prediction [2] base the probability of each symbol in the sequence (in our case, each residue in the protein) solely on the immediately preceding symbol. This captures some common relations between adjacent symbols and allows for prediction of membership. An extension of the basic HMM model to higher orders allows longer-term relationships between symbols to be captured as well.

By contrast, we build our GFA model not specifically on the relationship between adjacent symbols, but rather on the relationship between highly correlated symbols (regardless of linear distance). This allows us to capture not only proximal influences, but also distal ones. The fact that adjacent symbols in the sequence refer to amino acids which occur in succession along the protein backbone causes one such amino acid to affect its neighbor, and makes the HMM model a reasonable approximation. The three-dimensional structure of proteins (folds, disulphide bonds, etc.), however, suggests that other residues which may not be close in the primary structure of the protein's sequence may in fact be in physical proximity in the three-dimensional molecule. It is reasonable to assume, therefore, that such residues would influence each other and in combination help to define a protein family. Our model is capable of capturing both of the relations we have just mentioned. For this reason we expect it to be capable of superior performance over its simpler HMM variant.

One obvious limitation of our approach is that our model is more complicated. We attempt to minimize this, however, by considering only the most prevalent interactions between amino acids, thereby performing a dimensionality reduction on the parameter space of our model. With this optimization, we are able to show that our model outperforms the traditional HMM approach on a well-studied protein family. This paper thus provides a new and promising approach to protein family identification that warrants further study.

II. CORRELATIONS IN PROTEIN SEQUENCES

As an initial testbed, we worked on the globin family. Globins are oxygen-transporting molecules which occur across a broad range of species. We selected this family because it has served as the most-used benchmark protein family for other identification studies. We plan to examine other families in future work.

This family was read and aligned from the PFAM web site (http://pfam.wustl.edu/) [14]. The accession number was PF00042, and we selected the Seed Alignment option. With these options, the size of the globin family was 76, and the length of each sequence (after alignment) was 167. We then developed a procedure to compute the membership of a sequence in the globin family by summing evidence for membership. This evidence takes the form of correlations between specific residues at specific locations within the aligned proteins. Our process involves comparing the expected co-occurrence of two residues at specific locations (based on their independent occurrence rates) to their observed co-occurrence. The presence or absence of such co-occurrences then gives evidence for membership in the protein family.

To calculate the membership value (mv) of each sequence, we use the pair correlations of different residues (amino acids in this case) at different positions. To compute the value of a pair correlation, we considered both the single frequencies and the co-occurrence of every pair of residues. The following example shows the way pair correlations were calculated.

Example 2: In this globin family, the single frequencies of amino acid T at position 47 and amino acid L at position 113 are 28 and 25 respectively. We denote this as fT47 = 28 and fL113 = 25. Also, the co-occurrence of T47 and L113 is 22; we denote this co-occurrence as CORT47,L113 = 22. Then the correlation between T47 and L113 is calculated as follows:

Expected co-occurrence of T47 and L113 if they were independent = (28/76) × (25/76) × 76 = 9.21

Score attributed to the correlation = Real co-occurrence − Expected co-occurrence = 22 − 9.21 = 12.79.

Similarly, fT47 = 28, fW156 = 48, and CORT47,W156 = 8. Then the score attributed to the T47–W156 correlation is calculated as:

Expected co-occurrence of T47 and W156 if they were independent = (28/76) × (48/76) × 76 = 17.68

Score attributed to the correlation = Real co-occurrence − Expected co-occurrence = 8 − 17.68 = −9.68.

This means that T47 and L113 are positively correlated, while T47 and W156 are negatively correlated.
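The two scores above can be computed with a few lines of Python (the function name is ours):

```python
def correlation_score(f1, f2, co_occurrence, n_sequences):
    """Observed co-occurrence of two residues minus the co-occurrence
    expected if they occurred independently across the family."""
    expected = (f1 / n_sequences) * (f2 / n_sequences) * n_sequences
    return co_occurrence - expected

# The two pairs from Example 2 (globin family of 76 aligned sequences):
print(round(correlation_score(28, 25, 22, 76), 2))  # T47-L113: 12.79
print(round(correlation_score(28, 48, 8, 76), 2))   # T47-W156: -9.68
```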

All correlations are sorted in decreasing order of their absolute values. We select a subset of the top correlations as the basis for the construction of our GFA, and use the constructed GFA to evaluate any sequence (attributing a membership value (mv) to it). Each correlation is given a weight in such a way that the sum of all the weights is 1. The weight of a positive correlation is positive, and the weight of a negative correlation is negative (in the next section, we will briefly describe the calculation of the correlation weights and how they contribute to the mv of a sequence).
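The paper does not spell out the normalization; one reading consistent with the text above (signed weights whose absolute values sum to 1) can be sketched as follows, with a function name of our own choosing:

```python
def correlation_weights(scores):
    """Turn the selected correlation scores into signed weights: each
    weight keeps the sign of its score, and the absolute weights sum to 1."""
    total = sum(abs(s) for s in scores)
    return [s / total for s in scores]

# The two scores from Example 2: a positive and a negative correlation.
weights = correlation_weights([12.79, -9.68])
print(weights)  # positive score -> positive weight, negative -> negative
```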

The mv for an unknown sequence is calculated using our GFA as follows: if both residues of a positive correlation are present in a sequence, it will have a positive contribution to the mv of the sequence (its weight will be added to the mv of the sequence); if only one of them is present, it will have a negative contribution. A negative correlation affects the mv of a sequence in the reverse way. Tables I and II illustrate the contributions of positive and negative correlations, respectively.

III. CALCULATION OF CORRELATION WEIGHTS

Before talking about the calculation of the correlation weights, it should be noted that we experimented with two approaches regarding positive and negative correlations. In the first approach, we considered positive and negative correlations separately. Positive and negative correlations were sorted independently, and the top ones were selected for the construction of the GFA and then used for the prediction of sequences. For example, if we considered 500 correlations, the 250 most positive correlations and the 250 most negative ones were selected. In the second approach, however, positive and negative correlations were considered together: they were sorted according to their absolute values, and the top ones were selected. In practice, the second approach worked better; for globins, usually 15 to 20 percent of the selected correlations were negative and the rest were positive.

TABLE I
CONTRIBUTION OF A POSITIVE CORRELATION TO THE MV OF A SEQUENCE

R1        R2        Contribution
absent    absent    0.0
absent    present   −α
present   absent    −α
present   present   +α

R1 and R2 are the two residues (at two specific positions) whose correlation is being considered, and α is the weight of the correlation. The first row simply implies that the correlation is not applicable to the sequence under consideration.

TABLE II
CONTRIBUTION OF A NEGATIVE CORRELATION TO THE MV OF A SEQUENCE

R1        R2        Contribution
absent    absent    0.0
absent    present   +α
present   absent    +α
present   present   −α

As can be seen from Tables I and II, different correlations either increase or decrease the mv of a sequence. In other words, the mv of a sequence is formed of four components, or as we have called them, gains:

1) Positive gain due to the positive correlations (4th row of Table I). This gain is denoted POSpos.
2) Positive gain due to the negative correlations (2nd and 3rd rows of Table II). This gain is denoted POSneg.
3) Negative gain due to the positive correlations (2nd and 3rd rows of Table I). This gain is denoted NEGpos.
4) Negative gain due to the negative correlations (4th row of Table II). This gain is denoted NEGneg.

However, a tricky point should be considered in pair correlations. In a protein family, there are several correlations which relate the residues in two specific columns; but in a single sequence which is to be scored, only one of them may occur. In other words, in practice, the number of correlations which are absent (one residue is present while the other is absent) is usually much greater than the number of present correlations (both residues are present). This was quite observable in our experiments. For example, A23 (residue A at column 23) may be correlated (positively or negatively) with C59, G59, M59, P59, and R59, but in a specific sequence only one of these can occur, say A23 and M59. If all these correlations are positive, this means that the only correlation which makes a positive contribution is CORA23,M59, and all the others contribute negatively. Therefore, the final score will be strongly biased toward absent correlations. This analysis shows that the presence of a correlation should be weighted more heavily than its absence. In other words, the 4th rows of Tables I and II should be more dominant than the 2nd and 3rd rows, and thus should be given higher contributions (weights). Therefore, the algorithm which calculates the mv of the sequences should take this analytical and empirical observation into account.

To describe the mv calculation algorithm, we need to define some notation. The following is a list of the notation used for the related parameters:

• S : the sequence which is being processed.
• PCS : sum of the presence contributions to sequence S, i.e., from correlations whose presence caused the contribution (both residues were present).
• ACS : sum of the absence contributions to sequence S, i.e., from correlations whose absence caused the contribution (one residue was present while the other was absent).
• WPCS : the weight of PCS.
• WACS : the weight of ACS.

Now, Algorithm 1 shows how the weights of the different types of correlations are calculated, and how they contribute to the mv of a sequence:

Algorithm 1: Sequence Membership Value Calculation

1) PCS = ΣS POSpos + |ΣS NEGneg|  ⇒  1/PCS = 1 / (ΣS POSpos + |ΣS NEGneg|)

2) ACS = ΣS POSneg + |ΣS NEGpos|  ⇒  1/ACS = 1 / (ΣS POSneg + |ΣS NEGpos|)

3) WPCS = (1/PCS) / (1/PCS + 1/ACS)

4) WACS = (1/ACS) / (1/PCS + 1/ACS)

Finally, the mv of sequence S is calculated as:

5) mvS = WPCS × [ΣS POSpos + ΣS NEGneg] + WACS × [ΣS POSneg + ΣS NEGpos]

Note that among the four components of mvS, the first and the third are positive, and the second and the fourth are negative.
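A direct transcription of Algorithm 1 might look as follows. This is a sketch under our own naming; the four arguments are the summed gains as defined above (POSpos and POSneg are non-negative sums, NEGpos and NEGneg non-positive sums):

```python
def membership_value(pos_pos, pos_neg, neg_pos, neg_neg):
    """Algorithm 1: membership value of a sequence S from its four gains.
    pos_pos = sum of POSpos, pos_neg = sum of POSneg (both >= 0);
    neg_pos = sum of NEGpos, neg_neg = sum of NEGneg (both <= 0)."""
    pcs = pos_pos + abs(neg_neg)            # Step 1: presence contribution
    acs = pos_neg + abs(neg_pos)            # Step 2: absence contribution
    wpcs = (1 / pcs) / (1 / pcs + 1 / acs)  # Step 3: weight of PCS
    wacs = (1 / acs) / (1 / pcs + 1 / acs)  # Step 4: weight of ACS
    # Step 5: presence terms weighted by WPCS, absence terms by WACS.
    return wpcs * (pos_pos + neg_neg) + wacs * (pos_neg + neg_pos)
```

Because present correlations are much rarer than absent ones, PCS is typically the smaller of the two sums, so 1/PCS (and hence WPCS) dominates; this is exactly the bias toward presence argued for above.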

IV. IMPLEMENTATION

To implement this predictor, we use a special case of GFAwhich considers every correlation separately, and the weightsof the transitions are assigned according to the weights ofcorrelations with some adjustment. Before explaining the de-tails of this development, it’s time to review some fundamentalideas about the GFA. For more details, refer to [3], [6].

The details of GFA development as a biological sequence predictor will be better understood through an example.

Example 3: Consider the part of the multiple alignment of globin sequences shown in Figure 2, which is cited from [2].

Protein identifier   Alignment
HBA_HUMAN    . . . V G A # # H A G E Y . . .
HBB_HUMAN    . . . V # # # # N V D E V . . .
MYG_PHYCA    . . . V E A # # D V A G H . . .
GLB3_CHITP   . . . V K G # # # # # # D . . .
GLB5_PETMA   . . . V Y S # # T Y E T S . . .
LGB2_LUPLU   . . . F N A # # N I P K H . . .
GLB1_GLIDI   . . . I A G A D N G A G V . . .

Fig. 2. Ten positions (columns) from the multiple alignment of seven globin sequences from [1]. "#" represents a gap. Note that the aligned sequences are much longer; here we consider only 10 columns to illustrate our methodology. Cited from [2]

Even for this small aligned sample, there are about 20000 pair correlations (both positive and negative). However, we consider only 8 of them, as illustrated in Table III, to show our methodology on a small example.

TABLE III
THE EIGHT CORRELATIONS CONSIDERED IN EXAMPLE 3. R1 AND R2 ARE THE TWO RESIDUES OF THE CORRELATION. "#" REPRESENTS A GAP.

R1   freq   R2    freq   Real score   Value used in Fig. 3
A8   2      G9    2       1.42          1.5
N6   3      V10   2       1.14          1.3
A3   3      H10   2       1.14          1.1
#4   6      #5    6       0.86          0.9
G3   2      #4    6      −0.71         −0.7
G3   2      #5    6      −0.71         −0.5
#4   6      A8    2      −0.71         −0.3
#4   6      G9    2      −0.71         −0.1

Figure 3 shows a GFA which implements the correlations of Table III.

Regarding Example 3 and its related figure (Fig. 3), a few points are worth noting:

• Usually, in systems which are modelled by automata, the two paradigms of states and external events are essential (states are the different distinguishable internal situations of a system, while events are characterized as the inputs to the system, which in combination with the current state(s) specify the next state(s) of the system). Although there may not be a direct and obvious correspondence between a protein sequence and a state, we can say that the different states in Figure 3 show the instantaneous situation (contribution) of a correlation to the mv of a sequence. Also, the residues act as the external events, which cause the situations of the different correlations to change to new ones. The fact that several correlations are applicable to any sequence is perfectly characterized (modelled) by fuzzy automata in terms of the simultaneous activation of several states.

• Since Figure 3 is relatively crowded and some correlation scores are equal, we do not use the real scores of the correlations (the "Real score" column of Table III) as the basis for the transition weights in this figure. Rather, we use the numbers in the "Value used in Fig. 3" column of Table III. This is done to make all correlations distinct, which increases the readability of Figure 3. In fact, the real weights of the transitions are the relative weights of the correlations; but they are usually small numbers, and we do not list them here.

• We treat a gap like any other residue here. However, other options are possible; we will say more about this issue in the concluding remarks.

• For simplicity, most similar transitions are represented as one transition. For example, a transition labelled "ar, 0" takes place upon entrance of any residue (ar); in other words, it is in fact 21 transitions to the same state (the number of different residues, including the gap). Similarly, a transition labelled "Ā, 0.25" takes place upon entrance of any residue but "A"; thus it represents 20 transitions (including the gap) to the same state.

• As can be seen in Definition 3, two functions must be associated with a GFA: a membership assignment function (z1) and a multi-membership resolution function (z2). For this application they are defined as follows:

z1: We distinguish two different scenarios for z1, depending on whether the presence of a correlation contributes positively or negatively (see Tables I and II). Equations 1 and 2 characterize the scenarios for positive and negative correlations, respectively.

z1(µ, δ) = µ + δ            if µ · δ = 0
         = +[|µ| + |δ|]     if µ · δ > 0
         = −[|µ| + |δ|]     if µ · δ < 0        (1)

z1(µ, δ) = µ + δ            if µ · δ = 0
         = −[|µ| + |δ|]     if µ · δ > 0
         = +[|µ| + |δ|]     if µ · δ < 0        (2)

z2: In this special case of GFA, z2 adds the partial contributions of the different correlations; in other words, z2 implements Step 5 of Algorithm 1. Note that in Figure 3, multi-membership happens at the last state (qend).

• Negative weights in Figure 3 are included to increase readability and clarity. However, as we know, all weights in a fuzzy automaton belong to the [0, 1] interval. In practice, we normalize all negative weights (and consequently mv's) into the [0, 0.5) interval, while all positive weights and mv's are normalized into the [0.5, 1) interval.

• The transition weights (all nonzero weights) are in fact half of the weights of the applicable correlations. This is done to make the membership assignment function (z1) work correctly both for positively contributing and negatively contributing correlations.
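Equations 1 and 2 can be rendered directly in Python. The function names below are ours; "positive"/"negative" refer to the sign of a correlation's contribution when both residues are present:

```python
def z1_positive(mu, delta):
    """Eq. 1: membership assignment for a positively contributing
    correlation; like-signed operands accumulate with a positive sign."""
    if mu * delta == 0:
        return mu + delta
    magnitude = abs(mu) + abs(delta)
    return magnitude if mu * delta > 0 else -magnitude

def z1_negative(mu, delta):
    """Eq. 2: membership assignment for a negatively contributing
    correlation; the sign of the accumulated magnitude is flipped."""
    if mu * delta == 0:
        return mu + delta
    magnitude = abs(mu) + abs(delta)
    return -magnitude if mu * delta > 0 else magnitude
```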

V. SIMULATIONS AND EXPERIMENTS

After implementation of the aforementioned GFA, we per-formed several experiments to evaluate its generalization andprediction capability. To have a better performance analy-sis, we compared our results with the HMMER software,which is essentially the fundamental algorithm of PFAMdatabase. As mentioned earlier, HMMER uses Hidden Markov

Fig. 3. A GFA which represents (predicts) the partial alignment of Figure 2. “ar” means all residues, and shows that no matter what residue enters next, this transition will happen. A bar “−” over a label means its complement; e.g. “A, 0.25” means that upon entrance of “A” this transition takes place with a weight of 0.25, while “Ā, 0.25” means that upon entrance of any residue but “A” this transition takes place. “#” represents a gap.

Models (HMMs) for sequence analysis. The following points should be mentioned in this regard:

1) The training set was the small globin family, which contained 76 sequences of length 167 after alignment.

2) We decided to test the performance of our method and compare it against HMMER on 3 different types of data: (i) the same globins used to build the classifiers, (ii) other globins not used to build the classifiers, and (iii) non-globin sequences. For the latter case we developed 3 different kinds of negative exemplars. Specifically, we conducted the following 5 tests for both the GFA and HMMER systems:

a) Training: In order to evaluate the ability of the systems to incorporate the example data, we presented the training set itself back to the systems for classification. This also serves to establish a baseline for performance on known globins by providing labels for the numerical outputs of the two classifiers.

b) Random: Some sequences of length 167 residues were generated purely at random: any of the 21 possible residues was assigned randomly to any position. The purpose of this set was to generate non-globins. An effective classifier should give different numerical, and hence categorical, responses to these non-globins than to the training set.

c) Total frequency: The relative frequencies of residues in the training set were calculated, and then some sequences of length 167 were generated based on them. The purpose of this set was to generate more realistic non-globins. In actual proteins, some amino acids occur far more frequently than others, so it is natural to use this distribution for negative examples. Again, an effective classifier should give different numerical and categorical responses to these negative exemplars than to positive exemplars.

d) Null: The positional relative frequencies of residues in the training set were calculated. Then some sequences of length 167 were generated based on them. For example, if in the first column of the training sequences only the four residues D, K, N, and R occurred, with relative frequencies of 0.11, 0.27, 0.14, and 0.48 respectively, then the residues for the first column of the test sequences were generated with those probabilities. This dataset represents another attempt to present a more biologically plausible set of negative exemplars.

Fig. 4. Distribution of the different globin and non-globin categories calculated by the trained GFA and HMMER. The GFA chart shows the mv’s and the HMMER chart shows the scores of the different categories. Each box-plot illustrates the distribution of the mv’s (for GFA) or the scores (for HMMER) of one category. In each box-plot five points are observable: the bottom and top points show, respectively, the minimum and maximum value within the category, and the rectangular area shows the µ ± σ range. The bottom of the rectangle shows µ − σ, the middle point shows the mean, and the top of the rectangle shows µ + σ. 5000 pair correlations (out of 450,000) were used to construct the GFA.

e) Generalization: For this problem, the two classifiers were each built 76 times. Each time, one of the 76 sequences was removed from the data used to construct the models and the remaining 75 proteins were used. Then the performance on the missing protein was computed, and the performance over all 76 trials was evaluated. In other words, we conducted a 76-fold (leave-one-out) cross-validation. The purpose of this experiment was to evaluate the performance of the systems on data not used in their own construction. We also tried 5 other cases where, in each run, either 2, 3, 4, 5, or 6 sequences were taken out. For these cases, we did 100 runs per case, selecting the removed sequences randomly, and then averaged over these 100 runs.

In our experiments we call the above 5 categories Training, Random, Total frequency, Null, and Generalization, respectively.
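The Total frequency and Null negative-exemplar generators described above can be sketched roughly as follows. This is an illustrative Python sketch over a toy alignment; the helper names are our own, and sampling uniformly from a pooled list of residues is simply one way to draw according to relative frequencies, since duplicates weight the draw:

```python
import random

def total_frequency_sampler(aligned_seqs):
    """Sample every position from one residue distribution
    pooled over the whole training set (the 'Total frequency' set)."""
    pool = [r for s in aligned_seqs for r in s]
    length = len(aligned_seqs[0])
    return lambda: "".join(random.choice(pool) for _ in range(length))

def null_model_sampler(aligned_seqs):
    """Sample each position from that column's own residue
    distribution, i.e. the positional relative frequencies
    (the 'Null' set)."""
    columns = list(zip(*aligned_seqs))  # column-wise view of the alignment
    return lambda: "".join(random.choice(col) for col in columns)

# Toy alignment ("#" is a gap, as in Figure 3); the real experiments
# would use the 76 aligned globins of length 167.
train = ["DKNR", "KKNR", "RK#R"]
print(total_frequency_sampler(train)())
print(null_model_sampler(train)())
```

Note that the null sampler never emits a residue in a column where it never occurred in training, which is exactly what makes these sequences hard for a position-wise emission model to reject.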

Figure 4 summarizes the results of the experiments done by the GFA and HMMER.
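The leave-k-out protocol of the Generalization test (test e above) can be sketched generically like this. In this Python sketch, `build_classifier` and `score` are placeholders standing in for the GFA (or HMMER) training and scoring steps, which are not shown:

```python
import random
from statistics import mean

def leave_k_out(sequences, build_classifier, score, k=1, runs=None):
    """Exhaustive leave-one-out when k == 1 (one fold per sequence);
    otherwise `runs` random held-out subsets of size k, averaged."""
    n = len(sequences)
    if k == 1:
        folds = [[i] for i in range(n)]
    else:
        folds = [random.sample(range(n), k) for _ in range(runs or 100)]
    results = []
    for held_out in folds:
        train = [s for i, s in enumerate(sequences) if i not in held_out]
        model = build_classifier(train)   # e.g. GFA construction
        results.append(mean(score(model, sequences[i]) for i in held_out))
    return mean(results)
```

With 76 globins and k = 1 this reproduces the 76 folds described above; k = 2..6 with 100 random runs reproduces the additional cases.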

The following points can be noticed from Figure 4:

1) HMMER assigns the Random sequences (Category b) and the Total frequency sequences (Category c) very low scores compared to the Training sequences (Category a), indicating that they do not belong to the family. The GFA does this as well, though the membership values that it assigns to these non-globins lie closer to the Training set. Although the GFA does not classify these two categories as strongly negative, it still separates them from the training set quite acceptably. In fact, in all runs at most two sequences (out of the 100 generated for testing) had a mv greater than the minimum mv of the training set. Thus, we can still judge that the GFA distinguishes randomly generated negative members of a protein family with acceptable performance.

2) When we take some of the family sequences out to test the generalization capabilities of the two approaches, their performance drops (the scores for the Generalization data lie slightly lower than those for the Training data). However, the drop for HMMER is significantly larger than that for the GFA; compare the first and last box-plots of the GFA and HMMER charts in Figure 4 to see the difference. This holds for the case where we take one sequence out in each run. When the number of removed sequences increases (i.e., the training set size decreases), the drop for HMMER grows still larger compared to that of the GFA. Details of these experiments can be found in [7].

3) A very interesting phenomenon happens for the Null model sequences (Category d). As can be seen, HMMER is really confused by these sequences. The numerical values that HMMER generates for these non-globins lie entirely within the range of values generated for the globins of the Training and Generalization tests. This means that all of them are classified as members of the globin family, while none of the generated sequences actually belongs to the family. This confusion stems from the very nature of the HMM approach, where the emissions take place based on the relative frequencies of the residues in each position, independently of the other positions. The GFA, on the other hand, has been able to clearly distinguish these negative members by assigning them low membership values, with only some overlap with the positive exemplars. Even the maximum mv of these sequences is still much smaller than the range of one standard deviation around the mean mv of the training set.

VI. CONCLUSION

Although the simplicity of hidden Markov models has made them a popular tool in biological sequence prediction and classification, the fact that they cannot capture anything beyond the interactions between adjacent symbols (residues) limits their abilities for deep structural and functional analysis of protein families. An HMM cannot show any sensitivity to the mutual effects of two or more residues which may not be adjacent in the protein sequence backbone. This shortcoming motivated us to develop a correlation-based approach for this purpose. We employed the powerful tool of GFA to capture the inherent fuzziness of this task, and showed that it predicts the membership of sequences more reliably than HMMs. Specifically, an HMM can easily be fooled by negative examples generated by a null model, while the GFA does predict them as non-members.

The following issues are worth noting for future research:

1) Although the number of pair-wise correlations (correlations between two residues) is huge even for small protein families, we do not need to consider too many of them for building our GFA. For example, we experimented with globins with 76 members of length 167. There are more than 450,000 correlations for this small-size family. We tried the 500, 1000, 2000, 3000, 4000, and 5000-correlation cases, and selected the 5000-correlation case for more detailed comparisons. It should be noted, however, that even the 500-correlation case had acceptable performance on negative examples and was able to predict positive examples no worse than the HMM. As a rule of thumb we can say that the longer the family sequences, the larger the number of correlations that should be considered. Increasing the size of the family essentially improves the accuracy of the training sequence mv’s as a base for comparison: the larger the family size, the more accurate the training criterion.

2) The automaton employed for this purpose (see Figure 3) is a special case of GFA. However, it still shows the power and efficiency of GFA: this task cannot be done by any other type of conventional automaton. Moreover, while the calculation of the mv’s of the sequences can be done without using a GFA, the GFA calculation was much faster, at the price of more memory usage.

3) Our research achievements have prepared a very appealing ground for considering higher-order correlations. The GFA calculation of mv’s is not only very fast, but is also a perfect candidate for parallel processing, and it can be easily extended to triple correlations, quadruple correlations, etc., without considerable increase in complexity. Moreover, the GFA calculation is an efficient approach for online classification, where a mv may be needed at any point of the entering sequence that is to be classified.

4) We experimented with aligned sequences. This was due to the fact that alignment is no longer a problem: there are well-established alignment algorithms, and as long as we use the same alignment tool for the training and test sequences, the quality of the alignment does not seem to have a serious impact on the performance of the GFA or the reliability of the results. However, our approach has the potential to be extended to non-aligned sequences with some modifications. This is now under development, and we will have more to say about it in the future.

5) A limitation of our approach is the time complexity of GFA construction (deriving the correlations, selecting the top ones, and implementing them in the structure of the GFA). It is a lengthy process, and it will get more complicated for higher-order correlations. On a P-III personal computer, it takes about 5 minutes to build the GFA for the globin family with 5000 correlations. However, for each family, training takes place only once, and then the mv calculation of unseen sequences (classification) is very fast and comparable with HMM classification.
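The correlation-derivation and selection step mentioned in points 1 and 5 might be sketched roughly as follows. In this Python sketch, a "correlation" is a (position i, residue a, position j, residue b) event, and its weight is taken as the co-occurrence frequency across the aligned training set, which is only an illustrative stand-in for the correlation measure actually used in the paper:

```python
from collections import Counter
from itertools import combinations

def top_k_pair_correlations(aligned_seqs, k):
    """Count co-occurrences of residue pairs over all position pairs,
    then keep the k most frequent (position_i, res_a, position_j, res_b)
    events together with their relative frequency."""
    counts = Counter()
    for seq in aligned_seqs:
        for (i, a), (j, b) in combinations(enumerate(seq), 2):
            counts[(i, a, j, b)] += 1
    n = len(aligned_seqs)
    return [(pair, c / n) for pair, c in counts.most_common(k)]

# Toy alignment; the globin experiments rank > 450,000 such events
# and keep the top 500-5000.
train = ["DKNR", "KKNR", "RKNR"]
for pair, freq in top_k_pair_correlations(train, 3):
    print(pair, freq)
```

Enumerating all pairs is quadratic in the sequence length, which is consistent with the lengthy construction time reported above, while classification only consults the selected top-k events.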

This effort represents a first attempt to evaluate the applicability of GFA to protein family classification. Future work will focus on classification between multiple protein families.

REFERENCES

[1] D. Bashford, C. Chothia, and A.M. Lesk, “Determinants of a Protein Fold: Unique Features of the Globin Amino Acid Sequences”, J. of Molecular Biology, v. 196, pp. 199-216, 1987.

[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids”, Cambridge Univ. Press, 1998.

[3] M. Doostfatemeh and S.C. Kremer, “A fuzzy finite-state automaton that unifies a number of other popular computational paradigms”, in Proc. of the ANNIE 2003 Conf. (ANNIE ’03), ASME Press, 2003, pp. 441-446.

[4] M. Doostfatemeh and S.C. Kremer, “Representing generalized fuzzy automata in recurrent neural networks”, in Proc. of the 17th IEEE Canada Conf. (CCECE 2004, Niagara Falls), IEEE Press, 2004, pp. 1901-1906.

[5] M. Doostfatemeh and S.C. Kremer, “The Significance of Output Mapping in Fuzzy Automata”, in Proc. of the 12th Iranian Conf. on Elec. Eng. (ICEE 2004), Ferdowsi University, Iran, 2004, pp. 191-197.

[6] M. Doostfatemeh and S.C. Kremer, “New Directions in Fuzzy Automata”, Int. J. of Approximate Reasoning, v. 38, pp. 175-214, 2005.

[7] M. Doostfatemeh, “New Directions in Fuzzy Automata: a General and more Practical Formalism”, Ph.D. Thesis, Dept. of Computing and Information Science, University of Guelph, Canada, 2005.

[8] B.R. Gaines and L.J. Kohout, “The Logic of Automata”, Int. J. of General Systems, v. 2, pp. 191-208, 1976.

[9] J. Hopcroft and J. Ullman, “Introduction to Automata Theory, Languages, and Computation”, Reading, MA: Addison-Wesley, 1979.

[10] S.C. Kremer, “Hidden Markov Models and Neural Networks”, in Genetics, Genomics, Proteomics and Bioinformatics, vol. 4 (L.B. Jorde, P.F.R. Little, M.J. Dunn, and S. Subramaniam, eds.), John Wiley and Sons Ltd., 2005.

[11] J.N. Mordeson and D.S. Malik, “Fuzzy Automata and Languages: Theory and Applications”, Chapman & Hall/CRC, 2002.

[12] C.W. Omlin, C.L. Giles, and K.K. Thornber, “Equivalence in Knowledge Representation: Automata, RNNs, and Dynamical Fuzzy Systems”, Proc. of the IEEE, v. 87, n. 9, pp. 1623-1640, 1999.

[13] E. Santos, “Maximin Automata”, Information and Control, v. 13, pp. 363-377, 1968.

[14] E.L.L. Sonnhammer, S.R. Eddy, and R. Durbin, “A Comprehensive Database of Protein Domain Families Based on Seed Alignments”, Proteins, v. 26, pp. 405-420, 1998.