
International Journal of Computer Engineering and Applications, Volume IX, Issue V, May 2015

www.ijcea.com ISSN 2321-3469

J.Priyadharshini and Dr.S.P.Victor 39

MULTIPLE SEQUENCE ALIGNMENT USING STEM CELL ALGORITHM

J. Priyadharshini1 and Dr. S. P. Victor2

1Assistant Professor, Department of Computer Science and Applications, St. Joseph’s College for Women, Tirupur.

2Associate Professor and Head & Director of the Research Centre, Department of Computer Science, St. Xavier's College, Palayamkottai.

ABSTRACT:

This paper proposes a straightforward and well-organized approach to improving the precision of multiple sequence alignment. It analyses an existing genetic algorithm that estimates the similarity of the input sequences and, based on this measure, aligns the input sequences differently. In the genetic algorithm, the goodness of a string is represented by an objective function and a fitness function, the quantities to be optimized. The existing method requires a large number of iterations to reach the convergence point, so its convergence is slow. To overcome these limitations, a Stem Cell Optimization algorithm is devised, built on two mechanisms: self-renewal and the power relation.

Keywords : Chromosomes, genes, crossover, mutation, Genetic algorithm, Stem cell, MSA

[1] INTRODUCTION

Genetic algorithms are stochastic search algorithms that act on a population of possible solutions. They are based on the mechanics of population genetics and selection. The potential solutions are encoded as 'genes', that is, strings of characters from some alphabet. New solutions can be produced by 'mutating' members of the current population and by 'mating' two solutions together to form a new solution. The better solutions are selected to breed and mutate, and the worse ones are discarded [3].

Multiple Sequence Alignment (MSA) is one of the most challenging and active research problems in computational molecular biology. Being able to align multiple sequences of DNA, RNA, or amino acids is essential for biologists to determine similarity between sequences, which often indicates similarity in function and provides valuable evolutionary information [2].

[2] STATE OF THE ART

Basics of Genetic algorithm

1. Start


2. Initialization: The aligned sequence length is computed from the maximum number of gaps allowed with respect to the longest sequence in the set to be aligned. If the aligned length is denoted by length, an initial alignment is generated by inserting into each sequence i the required number of gaps, length - Sequence_Length(i). An initial population of several alignments is created in this manner. The size of the initial population can also be set by the user.

3. Chromosome Representation: Encode the alignments of the initial population into chromosomes using the representation scheme described later in this section.

4. Genetic Operations: Create a new population by repeating the following steps until the minimum desired fitness value is obtained or the desired N generations have been completed.

Selection: Using a selection scheme such as elitism with roulette wheel selection, or random selection, a few chromosomes are selected for the crossover and mutation operations.

Crossover: Crossover operations are performed on pairs of less fit chromosomes. Single-point crossover, two-point crossover and min-max crossover have been used.

Selection for next generation: Chromosomes with better fitness values are used to produce further fit chromosomes through crossover and mutation. Here we have experimented with a simple scheme in which child chromosomes whose fitness value is lower than that of their parents are discarded, i.e. the best two chromosomes among parent_i, parent_j, child_i and child_j are kept. One-point and two-point crossover schemes were tried.

Mutation: Mutation is performed on selected chromosomes. The mutations used are random gap shuffling, gap insertion and gap deletion.

Calculate the overall alignment fitness value of the alignments obtained from the crossover and mutation operations.

Discard the chromosomes whose fitness value is less than that of the parent chromosomes.

Save the alignment representation and its associated parameters.

5. Result: The best sequence alignment corresponds to the chromosome with the highest fitness value after N generations are done or once the desired minimum acceptable score is obtained.

6. End

[3] PROBLEM DEFINITION

Analysis of Existing Methodologies

Sample Input Sequence

>MMVHLTPMMKSAVTALWGKVNVDMVGGMALGRLLVVYPWTQRFFMSFGDLSTPDAVM

>MMGLSDGMWQLVLNVWGKVMADIPGHGQMVLIRLFKGHPMTLMKFDKFKHLKSMDMMKAS

>ALVMDNNAVAVSFSMMQMALVLKSWAILKKDSANIALRFFLKIFMVAPS

>MMRPMPMLIRQSWRAVSRSPLMHGTVLFARLFALMPDLLPLFQYNCRQFSSPMD




Initialization: We insert gaps into the input sequences to create an initial population of, say, 10 alignments.

lmax = 61, corresponding to the longest sequence,

length(N) = 1.2 * lmax = 74 (73.2 rounded up),

gap1 = length - len1 = 17,

gap2 = length - len2 = 13,

gap3 = length - len3 = 25,

gap4 = length - len4 = 20

>MMVHLT---PM---MKSAV-T-AL-WGKVNVDMVGGMALGR--LLV-VYPWTQ-R-FFMSF-GDLSTPDA--VM

>M-MGL--SDGM-WQ-LVL-N--VW-GKVM-ADIP-GHGQMVLIRLFKGHPMTL-MKFDKF-KHLKSMD-MMKAS

>A-LVMDNNA--VAV--S--FS--MM-Q--MA--LVL-KS-W-A-ILKKD---S-A-N-IALRFFLKIFM-VAPS

>MMRP-MPML-I-RQSWR--AVS-RS-P-LMHGT-VLF-ARLFALM--PDLLP--L---FQ-YNCRQF-SSP-MD
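The gap arithmetic of the initialization step can be checked with a short sketch. The sequence lengths 57, 61, 49 and 54 are the values implied by the gap counts quoted above, and rounding 1.2 * lmax up to the next integer is an assumption that reproduces the stated aligned length of 74.

```python
import math

# Sequence lengths implied by the worked example (assumed from the gap counts).
seq_lengths = [57, 61, 49, 54]

# Assumption: the aligned length is 1.2 times the longest sequence, rounded up.
GAP_FACTOR = 1.2
l_max = max(seq_lengths)
aligned_length = math.ceil(GAP_FACTOR * l_max)

# Number of gaps each sequence needs to reach the aligned length.
gaps = [aligned_length - n for n in seq_lengths]

print(aligned_length)  # 74
print(gaps)            # [17, 13, 25, 20]
```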

Chromosome Representation: The chromosomes are generated by encoding the aligned sequences. The gap positions in each sequence are used to represent a chromosome. The gap positions of all the sequences are concatenated into a single chromosome, in which the end of each sequence is marked by an absolute point. The value of an absolute point is equal to the length of each aligned sequence in the initial population. In this manner, ten chromosomes are produced, corresponding to the initial population of 10 alignments.



Reproduction/Selection: Chromosomes are selected from the population to act as parents, which are crossed over to produce offspring. According to Darwin's theory of survival of the fittest, the best ones should survive and create new offspring [4]. That is why the reproduction operator is sometimes known as the selection operator. The selection schemes used in our tool are:

Elitism: In this method, the best 20% of the chromosomes are first copied to the new population. The remaining chromosomes undergo genetic operations in the classical manner. Elitism can rapidly improve the performance of a genetic algorithm because it prevents the loss of the best solutions found so far. Table 1 shows 10 newly generated chromosomes and their fitness values.

Chrom    Fitness value
1        -597
2        -616
3        -622
4        -637
5        -660
6        -497
7        -694
8        -670
9        -654
10       -616

Table 1. Chromosomes and their fitness values

Applying elitism, the chromosome with the highest fitness value becomes part of the next generation:

chrom6, fitness = -497
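As a minimal sketch of the elitism step, the fragment below ranks the chromosomes of Table 1 and keeps the best 20%. The function name `select_elite` and the rule of truncating the elite count to an integer are assumptions made for illustration, not part of the original tool.

```python
# Fitness values from Table 1, keyed by chromosome number.
fitness = {1: -597, 2: -616, 3: -622, 4: -637, 5: -660,
           6: -497, 7: -694, 8: -670, 9: -654, 10: -616}

def select_elite(fitness_by_id, fraction=0.2):
    """Return the ids of the top `fraction` of chromosomes, best first."""
    count = max(1, int(len(fitness_by_id) * fraction))
    # Higher (less negative) fitness is better.
    ranked = sorted(fitness_by_id, key=fitness_by_id.get, reverse=True)
    return ranked[:count]

elite = select_elite(fitness)
print(elite)  # [6, 1]
```

With ten chromosomes, a 20% elite contains two members; chromosome 6 is ranked first, in line with the example above.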




Random Selection: In this method, randomly chosen chromosomes are copied to the new population. The remaining chromosomes undergo genetic operations to produce new chromosomes.

Random-Sel (chrom-pop_m: chromosome generation of m chromosomes)

{

For k = 1 to m: choose a random number r and calculate i = (r % k) + 1, which identifies the chromosome to be selected.

Save chrom_i as part of the next generation. Perform crossover and mutation operations on the remaining chromosomes.

}
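The Random-Sel procedure can be written as a short Python sketch. It follows the rule i = (r % k) + 1 literally; the function name `random_sel`, the range of the random integer and the decision to return the selected and remaining chromosomes as two lists are assumptions made for illustration.

```python
import random

def random_sel(population, rng=None):
    """Pick chromosomes following the Random-Sel rule i = (r % k) + 1.

    Returns (kept, remaining): the chromosomes copied unchanged to the next
    generation, and those left over for crossover and mutation.
    """
    if rng is None:
        rng = random.Random()
    m = len(population)
    chosen = set()
    for k in range(1, m + 1):
        r = rng.randrange(1_000_000)   # any non-negative random integer
        i = (r % k) + 1                # 1-based index in the range 1..k
        chosen.add(i - 1)              # store as a 0-based index
    kept = [population[j] for j in sorted(chosen)]
    remaining = [c for j, c in enumerate(population) if j not in chosen]
    return kept, remaining
```

Because k starts at 1, the first chromosome is always selected under this rule, and the same chromosome may be drawn more than once; the sketch keeps each selected chromosome only once.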

Applying selection

Crossover: Crossover is the process of taking more than one parent chromosome and producing child solutions from them. We have implemented three types of crossover: single-point crossover, two-point crossover and min-max crossover. Single-point crossover is performed by selecting two parents with higher fitness values, as shown in the example, and then selecting a single crossover point, which may be computed from a formula or chosen at random based on the length of the parents. Each such crossover results in two child chromosomes. As an experimental scheme, we have retained only those child chromosomes that have better fitness scores than their parents.

Crossover point = 0.6 * 74 = 44 (44.4 rounded down); the nearest absolute point is at the 32nd position, where the crossover is performed.
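The following is a minimal sketch of single-point crossover at an absolute point. It assumes that each chromosome is held as a list of rows, one list of gap positions per sequence, so that cutting between rows corresponds to cutting at an absolute point. The names `crossover_at_boundary` and `keep_best_two`, and the toy fitness function in the usage note, are hypothetical.

```python
def crossover_at_boundary(parent_a, parent_b, cut):
    """Single-point crossover that cuts between whole sequence rows.

    Each parent is a list of rows; a row is the list of gap positions of one
    sequence. Cutting only at row boundaries keeps every row's gap count intact.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must encode the same number of sequences")
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2

def keep_best_two(candidates, fitness):
    """Keep the two fittest of parents and children (higher fitness is better)."""
    return sorted(candidates, key=fitness, reverse=True)[:2]
```

For example, with `parent_a = [[0, 1], [2, 3], [4, 5]]`, `parent_b = [[1, 2], [0, 3], [3, 5]]` and `cut=1`, the first child takes row 0 from `parent_a` and rows 1 and 2 from `parent_b`. Passing both parents and both children to `keep_best_two` implements the rule of keeping the best two of parent_i, parent_j, child_i and child_j.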



Performing min-max crossover on the parent chromosomes of above illustration:

Mutation: Mutation is a genetic operator used to maintain genetic diversity from one generation of a population of chromosomes to the next. Mutation alters one or more gene values in a chromosome from its initial state. After crossover, the best chromosomes are selected, as many as the size of iPop (the initial population set), and mutation is applied to them.

Before Mutation:

M--MVHLTPMMK-AVTALWGKVNVDMVG-GMALGR-L-LV--VYP-WTQRFFMS-F-GDLSTPDAVM-GN

After Mutation:

MM-VHLT-PMMKSAV-TALWG-KVNVDMV-GGMALGRLL--VVYP-W-QRFFMSFG---DLSTPDAVMG-N
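A gap-shuffling mutation can be sketched as follows. This is one possible reading of "random gap shuffling", assuming that a single gap is removed and reinserted at a random column; the function name `shift_gap` is hypothetical.

```python
import random

def shift_gap(row, rng=None):
    """Move one randomly chosen gap to a new random column.

    The residues keep their order and the row keeps its length, so the
    result is still a valid gapped version of the same sequence.
    """
    if rng is None:
        rng = random.Random()
    gap_columns = [i for i, ch in enumerate(row) if ch == "-"]
    if not gap_columns:
        return row                      # nothing to mutate
    chars = list(row)
    del chars[rng.choice(gap_columns)]  # delete one gap ...
    chars.insert(rng.randrange(len(chars) + 1), "-")  # ... and reinsert it
    return "".join(chars)
```

Because only a gap character moves, the mutated row always has the same residues in the same order and the same number of gaps as the input row.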

Fitness function: The fitness function determines how "good" an alignment is. Fitness evaluation

methods play an important role in the performance of evolutionary algorithms. The most common

strategy that is used, albeit with significant variations, is called the "Sum-Of-Pair" Objective Function.

Table 2 illustrates the fitness scores.

In this method, for each location on the aligned sequences, one of three situations will occur: match,

mismatch or a gap.
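A minimal sketch of a Sum-of-Pairs score is given below, applied to the short three-row example alignment S1 to S3 from this paper. The scores for match, mismatch and gap are illustrative assumptions only; the text names PAM-250 and BLOSUM-45 as the scoring matrices, and a column in which both rows have a gap is assumed to contribute nothing.

```python
from itertools import combinations

# Illustrative scores only (assumed); a real run would use PAM-250 or BLOSUM-45.
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(row_a, row_b):
    """Score two aligned rows column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue                 # two gaps: no contribution (assumption)
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

def sum_of_pairs(rows):
    """Add up the pairwise scores over every pair of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

alignment = ["AT-GAT-CC", "-TAGCTACC", "A-A-ATAGC"]  # S1, S2, S3
print(sum_of_pairs(alignment))  # -11
```

With these assumed scores the three pairwise terms are -2 for (S1, S2), -5 for (S1, S3) and -4 for (S2, S3).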

S1  A T - G A T - C C
S2  - T A G C T A C C
S3  A - A - A T A G C

Table 2. Fitness scores

Fitness Score = fitness(s1,s2) + fitness(s1,s3) + fitness(s3,s2)

Two scoring matrices are used: PAM-250 and BLOSUM-45.

[4] PROPOSED WORK

The Stem Cells Algorithm (SCA) is based on the behavior of stem cells in reproducing themselves. SCA has a high speed of convergence and a low level of complexity, and it is easy to implement. It also avoids local minima in an intelligent manner. The proposed algorithm can be used to solve multiple




sequence alignment problems and is expected to rectify the problems encountered with the previously analyzed genetic algorithm because of its behavioral characteristics. Comparisons can be made with other evolutionary algorithms, such as particle swarm optimization (PSO), ant colony optimization (ACO) and artificial bee colony (ABC), in solving the multiple sequence alignment problem.

Two features are considered the main characteristics in the definition of stem cells:

1. Self-renewal: They are able to go through numerous cycles of cell division while maintaining the characteristics of the original cell.

2. Power: They have the capacity to differentiate into different types of specialized cells, and a single cell may also be able to divide into several cells.

The initial matrix of variables in the Stem Cells Algorithm (SCA) consists of the features of stem cells that are transferred to an organ or tissue of an adult person. The stem feature matrix is defined as Stem Cells = [SC1, SC2, ..., SCN], where N is the number of features, related to the number of variables of the problem under discussion. It should be noted that the number of self-renewals of the best cell depends on the type of problem, and each cell selected in the previous step can produce only a copy similar to itself, only its opposite point, or both.

$$SC_{Optimum}(t+1) = \zeta \, SC_{Optimum}(t) \qquad (1)$$

where t denotes the iteration number, $SC_{Optimum}$ is the best stem cell in each iteration and $\zeta$ is a random number with $\zeta \in [0,1]$; it can, however, also be set to an arbitrary fixed value such as $\zeta = 0.96$. If the selected cell is to renew itself as an exact copy, we take $\zeta = 1$, and if it produces its opposite point, the value $\zeta = 0.2$ can be used instead. The self-renewal process is defined as:

$$x_{ij}(t+1) = \mu_{ij}(t) + \phi\left(\mu_{ij}(t) - \mu_{kj}(t)\right) \qquad (2)$$

where $x_i$ denotes the position of the i-th stem cell, t is the iteration number, $\mu_k$ is a randomly selected stem cell, j is the solution dimension and, with $\tau = \mu_{ij}(t) - \mu_{kj}(t)$, $\phi(\tau)$ produces a random variable in the interval [−1, 1]. It should be noted that the best cell carried into the next iteration, together with its self-renewals, forms only one part of that iteration's population, and the size of this part should be fixed at the start of the algorithm. The other part of the population is generated entirely at random. This is likely to reduce the probability of being trapped in a local minimum, because the direction in which the algorithm is converging is continually tested.
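Equations (1) and (2) can be sketched in a few lines of Python. The sketch treats a stem cell as a list of real numbers and reads $\phi(\tau)$ as $\tau$ multiplied by a factor drawn uniformly from [−1, 1]; that reading, and the function names `renew_best` and `self_renew`, are assumptions made for illustration.

```python
import random

def renew_best(best, zeta=0.96):
    """Equation (1): scale the best stem cell by zeta (1.0 gives an exact copy)."""
    return [zeta * value for value in best]

def self_renew(cells, i, rng=None):
    """Equation (2): move cell i relative to a randomly chosen other cell k."""
    if rng is None:
        rng = random.Random()
    k = rng.choice([idx for idx in range(len(cells)) if idx != i])
    new_cell = []
    for mu_ij, mu_kj in zip(cells[i], cells[k]):
        phi = rng.uniform(-1.0, 1.0)          # assumed random factor in [-1, 1]
        new_cell.append(mu_ij + phi * (mu_ij - mu_kj))
    return new_cell
```

Under this reading, each coordinate of the renewed cell stays within a distance $|\tau|$ of the original coordinate, so cells that are already close to one another take small steps.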

This process continues until the objective of finding the best stem cell is achieved, which is equivalent to finding the parameters that give the lowest value of the cost function. Finally, the best stem cell is selected when it has the most power relative to the other cells. The comparative power relation is given by:

$$\varsigma = \frac{SC_{Optimum}}{\sum_{i=1}^{N} SC_i}$$

where $\varsigma$ denotes the comparative power of the stem cells, $SC_{Optimum}$ is the stem cell selected in terms of cost, and N is the final population size at which the best answer is obtained and the algorithm stops. The algorithm continues until the convergence criterion is satisfied or the specified number of iterations is reached [4].
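The comparative power relation is a simple ratio, sketched below under the assumption that each stem cell is summarized by a single scalar score. The function name `comparative_power` and the scores 25, 15 and 10 are illustrative.

```python
def comparative_power(scores, best_index):
    """Share of the total population score held by the selected stem cell."""
    total = sum(scores)
    if total == 0:
        raise ValueError("total score must be non-zero")
    return scores[best_index] / total

print(comparative_power([25, 15, 10], 0))  # 0.5
```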

[5] RESULTS AND DISCUSSION

Sample Input Sequence

Sequence 1: ATTGCCGACT

Sequence 2: AC

Sequence 3: GACCCTAG

The longest of these sequences is ten nucleotides. The longest sequence receives as many gaps as it has nucleotides, here ten, so the total alignment length is twenty; the shorter sequences receive more gaps to reach the same length. For each column, there is a 50% chance of inserting a nucleotide for Sequence 1, a 10% chance for Sequence 2, and a 40% chance for Sequence 3. The result might look something like this:

Sequence 1: ---AT--TGC--C-G-A-CT

Sequence 2: ---------AT---------

Sequence 3: --G---AC---C-C-TAG--

The sequences are now randomly aligned and have enough gaps so that the nucleotides can be

shifted later in the algorithm.
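One way to produce such a random starting alignment is sketched below. Instead of deciding column by column, the sketch draws the residue columns in one step; choosing k of the 20 columns uniformly at random gives every column a k/20 chance of holding a residue, which matches the 50%, 10% and 40% figures above. The function name `random_gapped_row` and the fixed seed are assumptions.

```python
import random

def random_gapped_row(sequence, aligned_length, rng=None):
    """Scatter `sequence` over `aligned_length` columns, filling the rest with gaps."""
    if rng is None:
        rng = random.Random()
    if aligned_length < len(sequence):
        raise ValueError("aligned length is shorter than the sequence")
    # Choose which columns hold residues; sorting keeps residue order intact.
    columns = sorted(rng.sample(range(aligned_length), len(sequence)))
    row = ["-"] * aligned_length
    for column, residue in zip(columns, sequence):
        row[column] = residue
    return "".join(row)

sequences = ["ATTGCCGACT", "AC", "GACCCTAG"]
aligned_length = 2 * max(len(s) for s in sequences)   # 20 in this example
rng = random.Random(42)                                # fixed seed for repeatability
rows = [random_gapped_row(seq, aligned_length, rng) for seq in sequences]
for row in rows:
    print(row)
```

Removing the gap characters from any generated row returns the original sequence, so the rows remain valid inputs for later shifting of nucleotides.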

Stem cell feature matrix: The stem cells are generated by encoding the aligned sequences. The gap positions in each sequence are used to represent a stem cell. The gap positions of all the sequences are concatenated into a single stem cell, in which the end of each sequence is marked by an absolute point. The value of an absolute point is equal to the length of each aligned sequence in the initial population (20 in this example). In this manner, three stem cells are produced, corresponding to an initial population of 3 alignments.

Stem cell 1: 0 1 2 5 6 10 11 13 15 17 20 0 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 19 20 0 1 3 4 5 8 9 10 12 14 18 19 20

Stem cell 2: 1 3 5 7 9 10 11 14 16 17 20 0 1 2 3 4 6 7 8 9 10 11 12 14 15 16 17 18 19 20 1 3 4 6 7 8 9 10 11 12 13 14 20

Stem cell 3: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
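The encoding can be sketched as follows, assuming 0-based gap positions followed by the absolute point, which is the convention that reproduces Stem cell 1 from the randomly aligned rows shown earlier. The function names are hypothetical.

```python
def encode_row(row):
    """Gap positions of one aligned row (0-based), followed by the absolute point."""
    gaps = [i for i, ch in enumerate(row) if ch == "-"]
    return gaps + [len(row)]

def encode_alignment(rows):
    """Concatenate the per-row encodings into a single stem cell."""
    cell = []
    for row in rows:
        cell.extend(encode_row(row))
    return cell

def decode_row(encoded, sequence):
    """Rebuild an aligned row from its encoding and the ungapped sequence."""
    *gaps, length = encoded
    gap_set = set(gaps)
    residues = iter(sequence)
    return "".join("-" if i in gap_set else next(residues) for i in range(length))

# Rows as printed in the random alignment example.
rows = ["---AT--TGC--C-G-A-CT",
        "---------AT---------",
        "--G---AC---C-C-TAG--"]
stem_cell_1 = encode_alignment(rows)
print(stem_cell_1[:11])  # [0, 1, 2, 5, 6, 10, 11, 13, 15, 17, 20]
```

Only the gap positions are stored, so decoding a row requires the original ungapped sequence as well.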




Here we adopt the "self-renewal" style of reproduction. Each alignment in the population is judged according to a power relation, which determines the chance of survival of each alignment. After all alignments have been scored, they are randomly selected using weighted probabilities. The population size stays the same, but there may be several copies of some alignments, while others disappear. For instance, suppose the 3 stem cells have the following scores:

Alignment 1 = 25

Alignment 2 = 15

Alignment 3 = 10

The total population fitness is the sum of all of the individual alignment fitness scores in the population

which is for this example fifty. This means that Alignment 1 will have a 50% chance (Alignment 1

individual fitness divided by the population total fitness) of being selected, Alignment 2 will have a 30%

chance, and Alignment 3 will have a 20% chance. A roulette wheel is created with one hundred slots on

it. Alignment 1 will occupy slots 1-50, Alignment 2 will occupy slots 51-80, and Alignment 3 will

occupy slots 81-100.
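The roulette wheel of this example can be sketched as a list of slot boundaries. The draws 23, 44 and 92 are the three numbers used in the worked example; in a real run they would be random integers between 1 and 100. The function names and the use of integer division (which assumes the scores divide the wheel evenly, as they do here) are assumptions.

```python
from bisect import bisect_left
from itertools import accumulate

def build_wheel(scores, slots=100):
    """Return the last slot owned by each alignment on a wheel of `slots` slots."""
    total = sum(scores)
    return list(accumulate(slots * s // total for s in scores))

def spin(wheel, slot):
    """Map a 1-based slot number to the 0-based index of the alignment owning it."""
    return bisect_left(wheel, slot)

scores = [25, 15, 10]
wheel = build_wheel(scores)        # [50, 80, 100]
draws = [23, 44, 92]               # example draws; normally random in 1..100
selected = [spin(wheel, d) + 1 for d in draws]
print(selected)  # [1, 1, 3]
```

The result corresponds to two copies of Alignment 1 and one copy of Alignment 3 in the next generation.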

A random number is generated to determine which alignments are added to the next generation. In the example, three random numbers between 1 and 100 are generated. If the numbers are 23, 44, and 92, then the alignments that occupy these slots on the roulette wheel are added to the next generation. The new population will consist of two copies of Alignment 1 and one copy of Alignment 3. Alignment 2, though not the lowest-scoring alignment, dies off. Let us take stem cell 3 for the self-renewal process, since stem cell 1 already has 2 copies.

Self Renewal: The ability of self-renewal can be expressed in two ways. In the first case, proliferation is asymmetrical: a stem cell divides into two cells, a daughter cell that is similar to the original stem cell and another cell that is fundamentally different from it. The other form of renewal is symmetrical: the original stem cell divides into a daughter cell that is similar to the original stem cell, while the other cell undergoes mitosis and generates two cells similar to the original stem cell [4].

Stem cell 3, Step 1: Proliferation (asymmetrical)

Daughter cell 1: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20

Daughter cell 2: 0 3 5 13 14 15 16 17 18 19 20 0 5 20 5 7 9 13 16 17 18 19 20

Step 2: Symmetrical

Daughter cell 1: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20

Daughter cell 2: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20

Daughter cell 3: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
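In the listing above, Daughter cell 2 of Step 1 coincides with the complement of the gap positions of Stem cell 3 within each row. The sketch below assumes that this complement is what the "opposite" daughter means here; the function names `split_rows` and `opposite_cell` are hypothetical.

```python
def split_rows(cell, length):
    """Split a flat stem cell into per-sequence gap lists using the absolute point."""
    rows, current = [], []
    for value in cell:
        if value == length:          # absolute point closes the current row
            rows.append(current)
            current = []
        else:
            current.append(value)
    return rows

def opposite_cell(cell, length):
    """Daughter whose listed positions are exactly those the parent does not list."""
    daughter = []
    for gaps in split_rows(cell, length):
        occupied = set(gaps)
        daughter.extend(i for i in range(length) if i not in occupied)
        daughter.append(length)
    return daughter

stem_cell_3 = [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 20,
               1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
               0, 1, 2, 3, 4, 6, 8, 10, 11, 12, 14, 15, 20]
daughter_2 = opposite_cell(stem_cell_3, 20)
print(daughter_2)
```

A symmetrical daughter, by contrast, is simply a copy of the parent list.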

Power Relation: Power, another feature of stem cells, refers to their capacity to differentiate into different types of cells. Stem cells can produce an organ from a component of that organ. These cells can also be converted into a large number of cells with properties similar to those of the original cells.

The cost of each existing stem cell is then calculated by a criterion (measure) function. Two memories are considered for each stem cell: a local memory and a global memory. The cost value of each stem cell is stored in the local memory. The cell with the lowest cost is considered the best cell in the first iteration. However, this choice is valid only within one region of cells. In each region, the best cell is then selected. The location and cost of the best cell in each region are stored in the global memory; the data from each region are then shared with the best cells of the other regions, and the best among them is selected and used in the subsequent iteration. The best cell of the previous iteration renews itself by producing a similar cell and its opposite.

In this example, let us assume that Daughter cell 2 of Step 1 has the lowest cost and has been selected as the best cell. According to the roulette wheel strategy, two copies of stem cell 1 and one copy of Daughter cell 2 of Step 1 form the aligned sequences, which can be used to reveal valid evolutionary information.

[6] CONCLUSION

We have introduced a new optimization algorithm, the Stem Cell Algorithm, which overcomes the problems of the previously analyzed genetic algorithm, another optimization algorithm that requires a larger number of iterations to reach the convergence point. The new algorithm is based on the natural behavior of stem cells in reproducing themselves. Self-renewal and the power relation are its main features; they are used to align multiple sequences in fewer iterations and with a high speed of convergence. The MSA is carried out with the Stem Cell Algorithm in order to reveal valuable evolutionary information.

REFERENCES

[1] Yongtao Ye, David W. Cheung, Yadong Wang, Siu-Ming Yiu, Tak-Wah Lam, Hing-Fung Ting, “GLProbs: Aligning multiple sequences adaptively”, ACM-BCB 2013, Washington, DC, USA.

[2] Dr. Pankaj Agarwal, “Alignment of Multiple Sequences using GA method”, International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS), IJETCAS 13-164, 2013, ISSN (Print): 2279-0047, ISSN (Online): 2279-0055.

[3] Jonathan Shapiro, “Genetic algorithms in Machine Learning and its applications”, Lecture Notes in Computer Science, Volume 2049, 2001, pp. 146-168.

[4] Mohammad Taherdangkoo, Mehran Yazdi, and Mohammad Hadi Bagheri, “Stem Cells Optimization Algorithm”, D.-S. Huang et al. (Eds.): ICIC 2011, LNBI 6840, pp. 394–403, 2012. © Springer-Verlag Berlin Heidelberg 2012.

[5] Amie Judith Radenbaugh, “Applications of genetic algorithms in bioinformatics”, San Jose State University, 2008.
