[lecture notes in computer science] artificial intelligence and computational intelligence volume...

F.L. Wang et al. (Eds.): AICI 2010, Part II, LNAI 6320, pp. 98–105, 2010. © Springer-Verlag Berlin Heidelberg 2010

Multiple Sequence Alignment Based on ABC_SA

Xiaojun Xu and Xiujuan Lei

College of Computer Science Shaanxi Normal University Xi’an, Shaanxi Province, China, 710062

{hbxuxiaojun,xjlei168}@163.com

Abstract. In this paper, we apply the artificial bee colony (ABC) algorithm and an improved variant of it to the multiple sequence alignment (MSA) problem. The improved method, named ABC_SA, introduces the Metropolis acceptance criterion into the search process of ABC to prevent the algorithm from sliding into a local optimum. The simulation results demonstrate that ABC_SA solves multiple sequence alignment effectively by increasing the diversity of the food sources, and that it is able to converge to the globally optimal alignment.

Keywords: Multiple Sequence Alignment (MSA), Artificial Bee Colony (ABC), Metropolis Acceptance Criteria, ABC_SA.

1 Introduction

Multiple sequence alignment (MSA) is a crucial problem in bioinformatics. It is useful for structure prediction, phylogenetic analysis and polymerase chain reaction (PCR) primer design. The alignment results reflect the similarity relations and biological characteristics of the sequences. Unfortunately, finding an accurate multiple sequence alignment has been shown to be NP-hard [1]. Several methods have therefore been proposed, which can be grouped into four broad classes. The first class comprises exact methods, which use a generalization of the Needleman algorithm [2] to align all the sequences simultaneously. Their main shortcoming is their complexity, which becomes even more critical as the number of sequences increases. The second class contains methods based on a progressive approach [3], which build up a multiple sequence alignment gradually by aligning the closest pair of sequences first and successively adding the more distant ones. These methods are easily trapped in local minima and can consequently produce poor-quality solutions. The third class covers methods [4] based on graph theory, of which the main representative is partial order alignment. Finally, the iterative methods of the fourth class have been shown to be promising. The basic idea is to start with an initial alignment that includes all the sequences and then to refine it iteratively through a series of suitable refinements called iterations. The main iterative methods include SA [5], GA [6], PSO [7] and so on. However, the performance of the above-mentioned methods on MSA is not ideal. ABC [8] was first proposed in 2005 and has not yet been applied in this area. In this paper, we apply the ABC algorithm and an improved variant of it to the MSA problem. The improved method, named ABC_SA, introduces the Metropolis acceptance criterion into the search process of ABC to prevent the algorithm from falling into stagnation. The results of the simulation experiments indicate that ABC_SA performs better in solving MSA.

2 Multiple Sequence Alignment

2.1 The Mathematical Model of Multiple Sequence Alignment

A biological sequence of length $l$ is a string of $l$ characters drawn from a finite alphabet $\Sigma$. For DNA sequences, $\Sigma$ contains four characters, which represent the four kinds of nucleotides. For protein sequences, $\Sigma$ contains twenty characters, which represent the twenty kinds of amino acids. Consider a group of $n$ $(n \ge 2)$ sequences $S = (s_1, s_2, \ldots, s_n)$, where $s_i = s_{i1} s_{i2} \ldots s_{i l_i}$, $1 \le i \le n$, $s_{ij} \in \Sigma$, $1 \le j \le l_i$, and $l_i$ is the length of the $i$-th sequence. An alignment of $S$ can then be defined as a matrix $A = (a_{ij})$, where $1 \le i \le n$, $1 \le j \le l$ and $\max_i(l_i) \le l \le \sum_{i=1}^{n} l_i$. The matrix $A$ has the following characteristics: (1) $a_{ij} \in \Sigma \cup \{-\}$, where the symbol $\{-\}$ denotes a gap; (2) the $i$-th row of $A$ is identical to the $i$-th sequence of $S$ when the gaps are deleted; (3) no column of $A$ consists only of gaps.
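The three characteristics can be checked mechanically. The following Python sketch is an illustration only: the function name `is_valid_alignment`, the use of `-` as the gap symbol, and the reading of characteristic (3) as "no position in which every sequence holds a gap" are assumptions made here, not part of the original model.

```python
GAP = "-"  # assumed gap symbol


def is_valid_alignment(rows, seqs, gap=GAP):
    """Check a candidate alignment (list of equal-length strings) against S."""
    if len(rows) != len(seqs) or len(rows) < 2:
        return False
    width = len(rows[0])
    # A must be a rectangular n x l matrix.
    if any(len(r) != width for r in rows):
        return False
    # (1) and (2): deleting the gaps from row i must give back sequence i,
    # which also guarantees that every non-gap entry comes from the alphabet.
    if any(r.replace(gap, "") != s for r, s in zip(rows, seqs)):
        return False
    # (3), as interpreted here: no position where all sequences hold a gap.
    if any(all(r[j] == gap for r in rows) for j in range(width)):
        return False
    # Length bound: max(l_i) <= l <= sum(l_i).
    lengths = [len(s) for s in seqs]
    return max(lengths) <= width <= sum(lengths)
```

For instance, `is_valid_alignment(["AC-G", "A-TG"], ["ACG", "ATG"])` holds, while the same call with `["A-CG", "A-TG"]` fails because the second position contains only gaps.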

2.2 Objective Function of Multiple Sequence Alignment

The objective functions commonly used for multiple sequence alignment are the Sum-of-Pairs (SP) function [9] and the COFFEE function [10]. We select the SPS function, which is based on SP, as the evaluation criterion. Suppose there are $N$ sequences to be aligned, the aligned length is $L$, and $c_{ij}$ is the $j$-th letter of the $i$-th aligned sequence. The score of the $j$-th column over all $N$ sequences is then defined as $SP(j)$:

$$SP(j) = \sum_{i=1}^{N-1} \sum_{k=1}^{N} p(c_{ij}, c_{kj}) \tag{1}$$

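where

$$p(c_{ij}, c_{kj}) = \begin{cases} +1, & c_{ij} = c_{kj} \text{ and } c_{ij}, c_{kj} \in \Sigma \\ 0, & c_{ij} \ne c_{kj} \text{ and } c_{ij}, c_{kj} \in \Sigma \\ 0, & c_{ij} = \text{'-'} \text{ or } c_{kj} = \text{'-'} \\ 0, & c_{ij} = \text{'-'} \text{ and } c_{kj} = \text{'-'} \end{cases} \tag{2}$$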

So the final score of $A$ is

$$SUM(A) = \sum_{j=1}^{L} SP(j) \tag{3}$$

If the sequences to be aligned come from a standard database, for example BAliBASE 1.0, a standard (reference) alignment $SA_{stand}$ is available, so we can compute a relative score SPS:

$$SPS = \frac{SUM(A)}{SUM(SA_{stand})} \tag{4}$$

We then evaluate $A$ using SPS.
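Formulas (1) to (4) can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, `-` is assumed as the gap symbol, and each unordered pair of sequences is counted once per column, which is the usual sum-of-pairs convention and an assumption here about the index range in formula (1).

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def pair_score(a, b, gap=GAP):
    """Formula (2): +1 for two identical residues, 0 in every other case."""
    return 1 if a != gap and a == b else 0


def column_score(rows, j):
    """SP(j): pair scores of column j, each unordered pair counted once."""
    return sum(pair_score(r1[j], r2[j]) for r1, r2 in combinations(rows, 2))


def sum_score(rows):
    """Formula (3): SUM(A), the column scores added over the aligned length."""
    return sum(column_score(rows, j) for j in range(len(rows[0])))


def sps(rows, reference_rows):
    """Formula (4): score of an alignment relative to the reference alignment."""
    return sum_score(rows) / sum_score(reference_rows)
```

With `rows = ["AC-G", "A-TG"]`, only the first and last columns contain matching residues, so `sum_score(rows)` is 2. The sketch does not guard against a reference alignment whose score is zero.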

3 Algorithm Description

3.1 Artificial Bee Colony Algorithm (ABC)

The ABC algorithm, which simulates the intelligent foraging behavior of honey bee swarms, was proposed by Karaboga in 2005 [8]. The ABC model consists of four elements: food sources, employed bees, onlookers and scout bees.

Food source: the value of a food source depends on many factors, such as its richness or the concentration of its energy. When MSA is solved with ABC, the position of a food source represents a possible alignment, which is assessed by formula (4). The higher the SPS value, the better the alignment.

Employed bees: they are associated with a particular food source which they are currently exploiting or are “employed” at. They carry with them information about this particular source. At the same time, an employed bee modifies the position in her memory according to local information and tests the value of the new source. Because MSA is a special, discrete problem, the modification differs from that in reference [8]. We define it as follows.

Suppose food source $A$ contains $N$ sequences, its aligned length is $l_A$, and the length of the $i$-th sequence without gaps is $l_i$. A new food source $B$ is obtained by randomly rearranging the $l_A - l_j$ gaps of a randomly chosen sequence $j$ $(1 \le j \le N)$. If the SPS score of $B$ is higher than that of $A$, the bee memorizes the position of $B$ instead of $A$; otherwise it keeps the position of $A$. If a food source cannot be improved within a predetermined number of cycles, called the “limit”, it is assumed to be abandoned. The bee whose food source has been exhausted becomes a scout.
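The employed-bee move can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a food source is held as a list of aligned strings with `-` as the gap symbol, the names `neighbour` and `employed_bee_step` are hypothetical, the fitness function is passed in by the caller, and the sketch does not check whether the new alignment still satisfies the constraints of Section 2.1.

```python
import random

GAP = "-"  # assumed gap symbol


def neighbour(rows, rng=random):
    """Copy `rows`, re-placing at random all gaps of one randomly chosen row."""
    j = rng.randrange(len(rows))
    width = len(rows[j])                      # aligned length l_A
    residues = rows[j].replace(GAP, "")       # sequence j without gaps
    # Choose l_A - l_j distinct columns that will hold the gaps.
    gap_cols = set(rng.sample(range(width), width - len(residues)))
    it = iter(residues)
    new_row = "".join(GAP if c in gap_cols else next(it) for c in range(width))
    return rows[:j] + [new_row] + rows[j + 1:]


def employed_bee_step(rows, trials, fitness, rng=random):
    """Greedy selection: keep B only if it scores higher than A.

    Returns the retained food source and its updated trial counter, which
    the caller compares against "limit" to decide on abandonment.
    """
    candidate = neighbour(rows, rng)
    if fitness(candidate) > fitness(rows):
        return candidate, 0
    return rows, trials + 1
```

Resetting the counter to zero on success and incrementing it on failure is one simple way to track how many cycles a source has gone without improvement.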

Onlookers: they wait in the nest and select a food source using the information shared by the employed bees. An artificial onlooker bee chooses a food source according to the probability value associated with that food source. Assuming there are $M$ food sources, the probability that the $i$-th food source is selected by an onlooker is $P(i)$:

$$P(i) = \frac{f(i)}{\sum_{j=1}^{M} f(j)} \tag{5}$$

where $f(i)$ is the SPS score of the $i$-th food source. Onlookers are therefore more likely to choose more profitable sources.
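Formula (5) corresponds to roulette-wheel selection. A small Python sketch, with hypothetical function names and assuming non-negative scores that are not all zero, could look like this:

```python
import random


def selection_probabilities(scores):
    """Formula (5): P(i) = f(i) / sum_j f(j) for every food source i."""
    total = sum(scores)
    return [s / total for s in scores]


def choose_source(scores, rng=random):
    """Return the index of a food source drawn with probability P(i)."""
    # random.choices normalises the weights itself, so the raw SPS
    # scores can be passed directly as weights.
    return rng.choices(range(len(scores)), weights=scores, k=1)[0]
```

For scores `[0.2, 0.3, 0.5]`, the third food source is chosen about half of the time.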

Scouts: they are searching the environment surrounding the nest for new food sources randomly.

3.2 ABC_SA Algorithm

When solving MSA, ABC may fail to reach a globally optimal alignment. Our analysis traces this shortcoming to the random search process of the scouts.

Suppose food source $A$ has not been improved for “limit” cycles, so it should be abandoned. The corresponding employed bee becomes a scout, which immediately finds a new food source $B$ to replace $A$. If the quality of $B$ is lower than that of $A$, the algorithm may fail to reach the globally optimal solution. To address this problem, we introduce the Metropolis acceptance criterion into the search process of the scouts. Whether $B$ is accepted is decided by Rule 1.

Rule 1: if $f(B) > f(A)$, $B$ is accepted; otherwise, calculate

$$P = \exp\left(\frac{f(B) - f(A)}{T_{cur}}\right) \tag{6}$$

If $P > rand$, then $B$ is accepted.

Thus the algorithm not only preserves the global optimum but also increases the diversity of the food sources, because a scout accepts a worse food source with a small probability. The main steps of ABC_SA are given below:

STEP 1: Initialize the colony size $popsize$, the maximal number of cycles $maxit$, the initial temperature $T_{ini}$, the terminal temperature $T_{end}$, the annealing parameter $alpha$ and so on. Set $iter = 1$ and $T_{cur} = T_{ini}$.

STEP 2: Calculate the fitness of each food source according to formula (4) and record the position of the best food source. The first half of the colony consists of the employed artificial bees and the second half of the onlookers. For every food source, there is only one employed bee. In other words, the number of employed bees is equal to the number of food sources around the hive.

STEP 3: Produce the new solutions of the employed bees and onlookers.

STEP 4: Produce the new solutions of the scouts according to Rule 1.

STEP 5: Set $iter = iter + 1$ and $T_{cur} = alpha \cdot T_{cur}$.

STEP 6: Output the best alignment if the termination condition is met. Otherwise return to STEP 2.
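Rule 1 and the cooling step of STEP 5 can be sketched in Python as below. The sketch rests on assumptions that the text does not state explicitly: `rand` is taken to be a uniform random number in [0, 1), and the exponent is taken to be $(f(B) - f(A)) / T_{cur}$, so that a worse food source is accepted with a probability below 1 that shrinks as the temperature falls. The function names are hypothetical.

```python
import math
import random


def accept_scout(f_a, f_b, t_cur, rng=random):
    """Rule 1: decide whether the scout's new source B replaces A."""
    if f_b > f_a:
        return True                       # a better source is always accepted
    # Metropolis criterion for a source that is not better than A.
    p = math.exp((f_b - f_a) / t_cur)
    return p > rng.random()               # rand assumed uniform in [0, 1)


def cool(t_cur, alpha):
    """STEP 5: geometric cooling, T_cur = alpha * T_cur."""
    return alpha * t_cur
```

At a high temperature almost any new source is accepted, which favours exploration; as $T_{cur}$ approaches the terminal temperature the rule behaves more and more like plain greedy acceptance.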

4 Key Problems

4.1 Encoding Method of the Food Source

We adopt a two-dimensional matrix encoding after analyzing the features of MSA. Suppose there are $N$ sequences to be aligned, with lengths $l_1, l_2, \ldots, l_N$, and the aligned length $L$ lies between $l_{max}$ and $1.2 \cdot l_{max}$. For each sequence $i$ we generate a one-dimensional vector $\partial_i$ that contains $L - l_i$ gap positions chosen at random from the integers 1 to $L$. The encoding of a food source can then be expressed as $\beta = [\partial_1, \partial_2, \ldots, \partial_N]$, as shown in Example 1.

Example 1. There are three sequences $s_1, s_2, s_3$ to be aligned, with $s_1 = rfydgeilyqsk$, $s_2 = adesvynpgn$ and $s_3 = ydepikqser$. A possible food source is $\beta = [\partial_1, \partial_2, \partial_3]$, where $\partial_1 = [2, 8, 13]$, $\partial_2 = [1, 4, 8, 11, 14]$ and $\partial_3 = [2, 4, 9, 13]$.
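A random food source in this encoding could be generated as in the following Python sketch. Several details are assumptions made for illustration: the name `random_food_source`, drawing $L$ uniformly from the allowed range, rounding the upper bound $1.2 \cdot l_{max}$ down to an integer, and storing each vector of gap positions in ascending order.

```python
import random


def random_food_source(seqs, rng=random, stretch=1.2):
    """Return (L, beta): an aligned length and one gap-position vector per sequence."""
    l_max = max(len(s) for s in seqs)
    # Aligned length between l_max and 1.2 * l_max (upper bound rounded down).
    L = rng.randint(l_max, int(stretch * l_max))
    beta = [
        # L - l_i distinct 1-based positions out of 1..L for sequence i.
        sorted(rng.sample(range(1, L + 1), L - len(s)))
        for s in seqs
    ]
    return L, beta
```

Each vector in `beta` then has exactly as many entries as its sequence needs gaps to reach length `L`.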


4.2 The Alignment Corresponding to Encoding Food Source

For the sequence $s = rfydgeilyqsk$, suppose the set of inserted gap positions is $\partial = (1, 4, 7)$. Inserting these three gaps into $s$ gives a new sequence $s'$ of length 15. For Example 1, the corresponding alignment is displayed in Example 2.

Example 2. An alignment of $s_1, s_2, s_3$
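Decoding a gap vector into an aligned row can be sketched in Python as follows. The sketch assumes that every entry of $\partial$ is a 1-based position in the aligned row, that entries are distinct and no larger than the aligned length, and that `_` is the gap symbol; the paper's own indexing convention may differ, and the name `insert_gaps` is hypothetical.

```python
GAP = "_"  # assumed gap symbol


def insert_gaps(seq, gap_positions, gap=GAP):
    """Build the aligned row for `seq` with gaps at the given 1-based positions."""
    width = len(seq) + len(gap_positions)   # aligned length
    gaps = set(gap_positions)
    residues = iter(seq)
    # Walk over the aligned positions; emit a gap or the next residue.
    return "".join(
        gap if pos in gaps else next(residues)
        for pos in range(1, width + 1)
    )
```

Under these assumptions, `insert_gaps("rfydgeilyqsk", (1, 4, 7))` returns `"_rf_yd_geilyqsk"`.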

5 Simulation Results and Discussion

In the simulation, ABC_SA was applied to align multiple sequences from BAliBASE 1.0, a benchmark for the evaluation of MSA programs. The BAliBASE 1.0 multiple alignments were constructed by manual comparison of structures and validated by structure-superposition algorithms. Thus, the alignments are unlikely to be biased toward any specific multiple alignment method [11]. To evaluate the performance of the ABC_SA algorithm, we selected nine sequence groups with different lengths and different similarities from BAliBASE 1.0.

We compare the results obtained by the ABC_SA approach with those obtained by the genetic algorithm (GA) [6], particle swarm optimization (PSO) [7], artificial bee colony (ABC) [8] and simulated annealing (SA) [5]. In the ABC_SA algorithm, the predetermined number is $limit = 5/10/10$, meaning that $limit$ is 5, 10 and 10 for short, medium and long sequences respectively, and the annealing parameter is $alpha = 0.95/0.97/0.99$. The experimental results indicate that our approach performs better than the other approaches. The parameter settings are shown in Table 1. Table 2 lists the characteristics of the nine test sequence groups.

Table 1. Parameter settings of the five algorithms

5.1 Simulation Results

In the simulation studies, the algorithm was run ten times for each group of sequences to be aligned. The best, mean and worst alignment results of the investigated algorithms are compared in Table 3, Table 4 and Table 5 for short, medium and long sequences respectively. Figures 1 to 4 show the convergence curves, which reveal the speed of convergence as the number of iterations increases.

Table 2. Characteristics of the nine test sequence groups

Table 3. Comparative results for short sequences

Table 4. Comparative results for medium sequences

Table 5. Comparative results for long sequences

Table 3 clearly shows that the ABC_SA algorithm solves MSA effectively when the sequences are short. The comparative results for SH3 are remarkable: the average accuracy obtained by ABC_SA is higher than that obtained by the other four approaches, and so are the best and worst accuracies. For the sequences 451c and lkrn we reach the same conclusion as for SH3. Table 4 shows the simulation results for the medium sequences, namely kinase, lpii and 5ptp. As can be seen, ABC_SA takes the leading position among the five algorithms. Table 5 presents the comparative results for long sequences. Long sequences are much longer than medium and short ones, so aligning them takes much more time, which is why we set the maximum number of iterations to 1200. From Table 5, we can conclude that the performance of ABC_SA is as good as that of ABC and better than that of the other three approaches.

Fig. 1 to Fig. 4 display the convergence curves for the sequences SH3, 451c, lpii and glg respectively. In these figures, the bold line shows the convergence of the ABC_SA algorithm, and the other four lines show that of SA, ABC, GA and PSO. As can be seen, the bold line is not only higher than the others but also settles at a stable value. In other words, the ABC_SA algorithm both obtains a better alignment and takes less time when solving sequence alignment.

Fig. 1. The curves of convergence for SH3

Fig. 2. The curves of convergence for 451c

Fig. 3. The curves of convergence for lpii

Fig. 4. The curves of convergence for glg

Consequently, the ABC_SA algorithm, which merges the Metropolis acceptance criterion into basic ABC, combines the advantages of both ABC and SA. The algorithm preserves the global optimum while increasing the diversity of the food sources, because a scout accepts a worse food source with a small probability decided by Rule 1. These strategies encourage exploration and make the algorithm perform better. Furthermore, the simulation results all demonstrate its validity.


6 Conclusion

We have successfully solved multiple sequence alignment based on ABC. To improve the performance of ABC, a new algorithm, ABC_SA, is proposed by introducing the Metropolis acceptance criterion into ABC. As expected, ABC_SA performs better than GA, PSO, ABC and SA. However, the accuracy of the alignments obtained by ABC_SA still differs from that of the reference alignments in BAliBASE 1.0. The ABC_SA algorithm has two drawbacks. One is that its alignments are worse than those from BAliBASE 1.0 and need to be improved. The other is that ABC_SA has too many parameters, each of which strongly influences the algorithm. Our next task is to overcome both problems.

Acknowledgement. This work was supported by the Natural Science Foundation of Shaanxi Province of China (2010JQ8034) and the Fundamental Research Funds for the Central Universities, Shaanxi Normal University (2009020016).

References

1. Lei, X.j., Sun, J.j., Ma, Q.z.: Multiple sequence alignment based on chaotic PSO. In: Advances in Computation and Intelligence: 4th International Symposium on Intelligence Computation and Application, pp. 351–360 (2009)

2. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 569–571 (1970)

3. Feng, D.F., Doolittle, R.F.: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Biology, 351–360 (1987)

4. Huo, H.W., Xiao, Z.W.: A Multiple Alignment Approach for DNA Sequences Based on the Maximum Weighted Path Algorithms. Journal of Software, 185–195 (2007)

5. Kim, J., Pramanik, S., Chung, M.J.: Multiple sequence alignment using simulated annealing. Computer Applications in the Biosciences, 419–426 (1994)

6. Horng, J.T., Wu, L.C., Lin, C.M., Yang, B.H.: A genetic algorithm for multiple sequence alignment. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 407–420 (2005)

7. Lei, C.W., Ruan, J.H.: A particle swarm optimization algorithm for finding DNA sequence motifs. In: IEEE Bioinformatics and Biomedicine Workshops, pp. 166–173 (2008)

8. Karaboga, D.: An idea based on honey bee swarm for numerical optimization. Technical Report-TR06, Erciyes University, Engineering Faculty, Computer Engineering Department (2005)

9. Thompson, J.D., Plewniak, F., Poch, O.: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research, 2682–2690 (1999)

10. Notredame, C., Holm, L., Higgins, D.G.: COFFEE: An objective function for multiple sequence alignments. Bioinformatics, 407–422 (1998)

11. Zhang, M., Fang, W.W., Zhang, J.H., Chi, Z.X.: MSAID: multiple sequence alignment based on a measure of information discrepancy. Computational Biology and Chemistry, 175–181 (2005)
