International Journal of Computer Engineering and Applications, Volume IX, Issue V, May 2015
www.ijcea.com ISSN 2321-3469
J.Priyadharshini and Dr.S.P.Victor 39
MULTIPLE SEQUENCE ALIGNMENT USING STEM CELL
ALGORITHM
J.Priyadharshini1 and Dr.S.P.Victor2
1Assistant Professor, Department of Computer Science and Applications, St. Joseph’s College for Women,
Tirupur.
2Associate Professor and Head & Director of the Research Centre, Department of Computer Science,
St. Xavier's College, Palayamkottai.
ABSTRACT:
This paper proposes a straightforward and well-organized approach to improving the precision of multiple sequence alignment. It first analyses an existing genetic algorithm, which estimates the similarity of the input sequences and aligns them based on this measure. In the genetic algorithm, the goodness of a string is represented by an objective function and a fitness function, the quantities to be optimized. The existing method requires a large number of iterations to reach the convergence point, and its convergence speed is slow. To overcome these limitations, a Stem Cell Optimization algorithm is devised, built on the operations of self-renewal and power relation.
Keywords : Chromosomes, genes, crossover, mutation, Genetic algorithm, Stem cell, MSA
[1] INTRODUCTION
Genetic algorithms are stochastic search algorithms, which act on a population of possible
solutions. They are based on the mechanics of population genetics and selection. The potential
solutions are encoded as 'genes', that is, strings of characters from some alphabet. New solutions can be
produced by `mutating' members of the current population, and by `mating' two solutions together to
form a new solution. The better solutions are selected to breed and mutate and the worse ones are
discarded.[3] Multiple Sequence Alignment (MSA) is one of the most challenging and active ongoing
research problems in the field of computational molecular biology. Being able to align multiple
sequences of DNA, RNA, or amino acids is essential for biologists to determine similarity in sequences
which often leads to similarity in function and provides valuable evolutionary information.[2]
[2] STATE OF THE ART
Basics of Genetic algorithm
1. Start
2. Initialization: The alignment length is computed from the maximum number of gaps allowed with
respect to the longest sequence in the set of sequences to be aligned. If the aligned sequences have
length "length", an initial alignment is generated by inserting length - Sequence_Length(i) gaps into
sequence i. An initial population of several alignments is created in this manner. The size of the
initial population can also be set by the user.
3. Chromosome Representation: Encode the alignments of initial population into chromosomes using
the representation scheme described later in the section.
4. Genetic Operations: Create a new population by repeating the following steps until the minimum
desired fitness value is obtained or the desired N generations are done.
Selection: Using selection schemes such as elitism with roulette wheel selection or random
selection, a few chromosomes are selected for the crossover and mutation operations.
Crossover : Crossover operations are performed on the pairs of less fit chromosomes. Single
point crossover, double point crossover and min-max crossover methods have been used.
Selection for next generation: Chromosomes with better fitness values are used to produce
further fit chromosomes through crossover and mutation. Here we have experimented with a
simple scheme in which child chromosomes whose fitness value is lower than that of their
parents are discarded, i.e. only the best two chromosomes among parent i, parent j, child i
and child j are kept. One-point and two-point crossover schemes are tried.
Mutation is performed on the selected chromosomes. The following mutations are used:
random gap shuffling, and insertion and deletion of gaps.
Calculate overall alignment fitness value of the obtained alignments from crossover & mutation
operations.
Discard the chromosomes, whose fitness value is less than the parent chromosomes.
Save the alignment representation & its associated parameters.
5. Result: The best sequence alignment corresponds to the chromosome with the highest fitness
value after N generations are done or the desired minimum acceptable score is obtained.
6. End
[3] PROBLEM DEFINITION
Analysis of Existing Methodologies
Sample Input Sequence
>MMVHLTPMMKSAVTALWGKVNVDMVGGMALGRLLVVYPWTQRFFMSFGDLSTPDAVM
>MMGLSDGMWQLVLNVWGKVMADIPGHGQMVLIRLFKGHPMTLMKFDKFKHLKSMDMMKAS
>ALVMDNNAVAVSFSMMQMALVLKSWAILKKDSANIALRFFLKIFMVAPS
>MMRPMPMLIRQSWRAVSRSPLMHGTVLFARLFALMPDLLPLFQYNCRQFSSPMD
Initialization: We insert gaps into the input sequences to build an initial population of, say, 10
alignments.
lmax = 61, corresponding to the longest sequence,
length = 1.2 * lmax = 73.2, rounded up to 74,
gap1 = length - len1 = 17,
gap2 = length - len2 = 13,
gap3 = length - len3 = 25,
gap4 = length - len4 = 20
>MMVHLT---PM---MKSAV-T-AL-WGKVNVDMVGGMALGR--LLV-VYPWTQ-R-FFMSF-GDLSTPDA--VM
>M-MGL--SDGM-WQ-LVL-N--VW-GKVM-ADIP-GHGQMVLIRLFKGHPMTL-MKFDKF-KHLKSMD-MMKAS
>A-LVMDNNA--VAV--S--FS--MM-Q--MA--LVL-KS-W-A-ILKKD---S-A-N-IALRFFLKIFM-VAPS
>MMRP-MPML-I-RQSWR--AVS-RS-P-LMHGT-VLF-ARLFALM--PDLLP--L---FQ-YNCRQF-SSP-MD
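The initialization arithmetic above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' tool: the function names are hypothetical, rounding 1.2 * lmax up to the next integer is an assumption (chosen because it reproduces the stated length of 74), and the sequence lengths are the ones implied by lmax = 61 and the four gap counts.

```python
import math
import random

def alignment_length(seq_lengths, factor=1.2):
    """Target alignment length: factor * longest sequence, rounded up (assumed)."""
    return math.ceil(factor * max(seq_lengths))

def gaps_needed(seq_lengths, length):
    """Number of gaps each sequence needs to reach the alignment length."""
    return [length - n for n in seq_lengths]

def insert_random_gaps(seq, length, rng):
    """Pad seq with '-' at random columns so that it has exactly `length` columns."""
    gap_cols = set(rng.sample(range(length), length - len(seq)))
    residues = iter(seq)
    return "".join("-" if col in gap_cols else next(residues) for col in range(length))

seq_lengths = [57, 61, 49, 54]  # implied by lmax = 61 and gap1..gap4 in the text
length = alignment_length(seq_lengths)
print(length, gaps_needed(seq_lengths, length))  # 74 [17, 13, 25, 20]
print(insert_random_gaps("GACCCTAG", 12, random.Random(0)))
```

Calling `insert_random_gaps` repeatedly with different random states would give the several alignments that make up an initial population.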
Chromosome Representation: The chromosomes are generated by encoding the sequences. The gap
positions in a sequence are used to represent it, and the gap positions of all the sequences are
combined into a single chromosome, in which the end of each sequence is marked by an absolute
point. The value of an absolute point is equal to the length of each aligned sequence in the initial
population. In this manner, ten chromosomes are produced, corresponding to the initial population of
10 alignments.
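A small sketch of this encoding may help. It assumes 0-based column numbering, and the function name `encode_chromosome` and the toy rows are hypothetical; they are not taken from the paper.

```python
def encode_chromosome(alignment):
    """Flatten an alignment into gap positions, closing each row with its absolute point."""
    chromosome = []
    for row in alignment:
        # Record every column that holds a gap in this row.
        chromosome.extend(col for col, ch in enumerate(row) if ch == "-")
        # The absolute point (the row length) marks the end of this sequence.
        chromosome.append(len(row))
    return chromosome

toy_alignment = ["A-TG-", "-ATG-", "AT--G"]  # hypothetical rows of length 5
print(encode_chromosome(toy_alignment))  # [1, 4, 5, 0, 4, 5, 2, 3, 5]
```

Because a gap position is always smaller than the alignment length, the absolute point can never be confused with a gap entry.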
Reproduction/Selection: Chromosomes are selected from the population to be parents to crossover and
produce offspring. According to Darwin’s Theory of survival of the fittest, the best ones should survive
and create new offspring [4]. That is why reproduction operator is sometimes known as the selection
operator. The various selection schemes that we used in our tool are:
Elitism: In this method, the best 20% of the chromosomes are first copied to the new population. The
remaining chromosomes undergo genetic operations in the classical manner. Elitism can rapidly increase
the performance of a genetic algorithm because it prevents losing the best solutions found. Table 1
shows 10 newly generated chromosomes and their fitness values.
Chrom Fitness value
1 -597
2 -616
3 -622
4 -637
5 -660
6 -497
7 -694
8 -670
9 -654
10 -616
Table 1. Chromosomes and their fitness values
Applying elitism, the chromosome with the highest fitness value becomes part of the next generation:
chrom6 Fitness= -497
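The elitism step can be sketched on the values of Table 1. The function name `elite` and the rounding of the elite size are assumptions. Note that a 20% elite of ten chromosomes keeps two of them, whereas the worked example above carries over only the single best chromosome, so both cases are shown.

```python
def elite(fitness_by_chrom, fraction=0.2):
    """Return the ids of the fittest chromosomes, best first (higher fitness is better)."""
    n_keep = max(1, round(fraction * len(fitness_by_chrom)))
    ranked = sorted(fitness_by_chrom, key=fitness_by_chrom.get, reverse=True)
    return ranked[:n_keep]

# Fitness values from Table 1, keyed by chromosome number.
table_1 = {1: -597, 2: -616, 3: -622, 4: -637, 5: -660,
           6: -497, 7: -694, 8: -670, 9: -654, 10: -616}
print(elite(table_1))       # [6, 1]
print(elite(table_1, 0.1))  # [6]
```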
Random Selection: In this method, randomly chosen chromosomes are copied to the new population. The
remaining chromosomes undergo genetic operations to produce new chromosomes.
Random-Sel (chrom-popm: chromosome generation of m chromosomes)
{
For k = 1 to m, choose a random number r and calculate i = (r % k) + 1, which gives the index of the
chromosome to be selected.
Save chrom i as part of the next generation. Perform crossover and mutation operations on the
remaining chromosomes.
}
Applying selection
Crossover: Crossover is the process of taking two or more parent chromosomes and producing a child
solution from them. Here, we have implemented three types of crossover: single-point crossover,
two-point crossover and max-min crossover. Crossover is performed by selecting two parents with higher
fitness values, as shown in the example, and then selecting a single crossover point, which may be
computed by a formula or determined randomly based on the length of the parents. Each such crossover
results in two child chromosomes. As an experimental scheme, we have retained only those child
chromosomes that have better fitness scores than their parents.
Crossover point = 0.6 * 74 ≈ 44; the nearest absolute point is at the 32nd position, where the crossover is performed.
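One way to read "crossover at the nearest absolute point" is that the cut always falls on a sequence boundary, so each sequence keeps a complete gap list and therefore the right number of gaps. The sketch below follows that reading; it is an interpretation rather than the authors' implementation, and the parent data and the function name are hypothetical.

```python
def single_point_crossover(parent_a, parent_b, cut):
    """Swap the per-sequence gap lists after index `cut` (a sequence boundary)."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

# Hypothetical parents: one gap-position list per sequence, alignment length 6.
parent_a = [[0, 1], [2, 5], [3, 4]]
parent_b = [[4, 5], [0, 3], [1, 2]]
child_a, child_b = single_point_crossover(parent_a, parent_b, cut=1)
print(child_a)  # [[0, 1], [0, 3], [1, 2]]
print(child_b)  # [[4, 5], [2, 5], [3, 4]]
```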
Performing min-max crossover on the parent chromosomes of above illustration:
Mutation: Mutation is a genetic operator used to maintain genetic diversity from one generation of
a population of chromosomes to the next. Mutation alters one or more gene values in a chromosome
from its initial state. After crossover, the best chromosomes are selected, as many as the size of
iPop (the initial population set), and mutation is applied to them.
Before Mutation:
M--MVHLTPMMK-AVTALWGKVNVDMVG-GMALGR-L-LV--VYP-WTQRFFMS-F-GDLSTPDAVM-GN
After Mutation:
MM-VHLT-PMMKSAV-TALWG-KVNVDMV-GGMALGRLL--VVYP-W-QRFFMSFG---DLSTPDAVMG-N
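A gap-shuffling mutation can be sketched as moving a single gap to a new column while leaving the residues in their original order. This is a minimal sketch of one of the listed mutation types; the function name `shuffle_one_gap` and the choice to move exactly one gap per call are assumptions.

```python
import random

def shuffle_one_gap(row, rng):
    """Move one randomly chosen gap to a random column; residues keep their order."""
    gap_cols = [i for i, ch in enumerate(row) if ch == "-"]
    if not gap_cols:
        return row  # nothing to shuffle
    chars = list(row)
    del chars[rng.choice(gap_cols)]                    # remove one gap
    chars.insert(rng.randrange(len(chars) + 1), "-")   # re-insert it elsewhere
    return "".join(chars)

before = "M--MVHLTPMMK-AVTALWG"  # prefix of the 'Before Mutation' row
after = shuffle_one_gap(before, random.Random(7))
print(before)
print(after)
```

The row length and the underlying sequence are unchanged by construction, so the mutated row is still a valid alignment row.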
Fitness function: The fitness function determines how "good" an alignment is. Fitness evaluation
methods play an important role in the performance of evolutionary algorithms. The most common
strategy that is used, albeit with significant variations, is called the "Sum-Of-Pair" Objective Function.
Table 2 illustrates the fitness scores.
In this method, for each location on the aligned sequences, one of three situations will occur: match,
mismatch or a gap.
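The three cases can be scored column by column and summed over all pairs of rows, as in the sketch below. The numeric scores are illustrative assumptions only (the paper scores residues with PAM-250 or BLOSUM-45), and ignoring columns in which both rows have a gap is also an assumption.

```python
from itertools import combinations

# Assumed illustrative scores; a substitution matrix would replace MATCH/MISMATCH.
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_fitness(a, b):
    """Score two aligned rows column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue           # a column of two gaps is ignored (assumption)
        elif x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total

def sum_of_pairs(rows):
    """Sum the pairwise scores over every pair of rows."""
    return sum(pair_fitness(a, b) for a, b in combinations(rows, 2))

rows = ["AT-GAT-CC", "-TAGCTACC", "A-A-ATAGC"]  # three short aligned rows
print(sum_of_pairs(rows))  # -11
```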
S1 A T - G A T - C C
S2 - T A G C T A C C
S3 A - A - A T A G C
Table 2. Fitness Scores
Fitness Score = fitness(S1,S2) + fitness(S1,S3) + fitness(S3,S2)
Two scoring matrices are used: PAM-250 and BLOSUM-45.
[4] PROPOSED WORK
The Stem Cells Algorithm (SCA) is based on the behavior of stem cells in reproducing themselves. SCA has
a high speed of convergence and a low level of complexity, and it is easy to implement. It also avoids
local minima in an intelligent manner. The proposed algorithm can be used to solve multiple
sequence alignment problems, where its behavioral characteristics are expected to rectify the problems
faced with the previously analyzed Genetic Algorithm. Comparisons can be made with other evolutionary
algorithms such as the particle swarm optimization (PSO) algorithm, the ant colony optimization (ACO)
algorithm and the artificial bee colony (ABC) algorithm in solving the multiple sequence alignment
problem.
Two features are considered the main characteristics in the definition of stem cells:
1. Self-renewal: They can go through various cycles of cell division while maintaining the
characteristics of the original cell.
2. Power: They have the capacity to differentiate into different types of specialized cells, although
a cell may also be able to divide into several cells.
The initial matrix of variables in the Stem Cells Algorithm (SCA) consists of the features of the stem
cells that are transferred to an organ or tissue of an adult person. The stem feature matrix is defined
as Stem Cells = [SC1, SC2, ..., SCN], where N is the number of features, related to the number of
variables of the problem under discussion. It should be noted that the number of self-renewals from the
best cell depends on the type of problem, and each cell selected in the previous step can produce only
a copy of itself, only its opposite point, or both.
$$SC_{Optimum}(t+1) = \zeta \, SC_{Optimum}(t) \qquad (1)$$
where t denotes the iteration number, SCOptimum is the best stem cell in each iteration, and ζ is a
random number with ζ ∈ [0,1]; it can also be chosen arbitrarily, for example ζ = 0.96. If the selected
cell is to renew itself exactly, we set ζ = 1, and if it is to produce its opposite point, a value such
as ζ = 0.2 can be used instead. The self-renewal process is defined as:
$$x_{ij}(t+1) = \mu_{ij}(t) + \phi\big(\mu_{ij}(t) - \mu_{kj}(t)\big) \qquad (2)$$
where x_i denotes the position of the ith stem cell, t is the iteration number, μ_k is a randomly
selected stem cell and j is the solution dimension. If μ_ij(t) − μ_kj(t) = τ, then ϕ(τ) produces a
random variable in the interval [−1,1]. It should be noted that the best cell carried into the next
iteration and its self-renewals form only one part of that iteration's population, and the size of this
part should be fixed at the beginning of the algorithm. The other part of the population is generated
completely at random. This is likely to reduce the probability of becoming trapped in a local minimum,
because the algorithm's movement toward the current point of convergence is always being tested.
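Equations (1) and (2) can be sketched as follows. The sketch rests on assumptions: ϕ is interpreted as a random factor in [−1, 1] that multiplies the difference between the two cells (the form used in artificial-bee-colony style updates), a fresh factor is drawn for each dimension, and the cell values are hypothetical.

```python
import random

def renew_best(sc_optimum, zeta):
    """Equation (1): scale the best stem cell by zeta (zeta = 1 copies it unchanged)."""
    return [zeta * value for value in sc_optimum]

def self_renew(cells, i, rng):
    """Equation (2): perturb cell i along its difference to a random other cell k."""
    k = rng.choice([idx for idx in range(len(cells)) if idx != i])
    return [mu_ij + rng.uniform(-1.0, 1.0) * (mu_ij - mu_kj)
            for mu_ij, mu_kj in zip(cells[i], cells[k])]

cells = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]  # hypothetical two-dimensional stem cells
print(renew_best(cells[0], 1.0))  # [1.0, 4.0]
print(self_renew(cells, 0, random.Random(3)))
```

Under this reading, each coordinate of the renewed cell stays within one inter-cell difference of its previous value, which keeps the search close to the current population.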
This process continues until the objective of finding the best stem cell is achieved, which is
equivalent to finding the parameters that give the lowest value of the cost function. Finally, the
best stem cell is selected as the one with the most power relative to the other cells. The comparative
power relation is obtained by:
$$\varsigma = \frac{SC_{Optimum}}{\sum_{i=1}^{N} SC_i}$$
where ς denotes the comparative power of the stem cells, SCOptimum is the stem cell selected in terms
of cost, and N is the final population size at which the best answer is obtained and the algorithm
stops. The algorithm continues until the convergence criterion is met or the specified number of
iterations is reached. [4]
[5] RESULTS AND DISCUSSION
Sample Input Sequence
Sequence 1: ATTGCCGACT
Sequence 2: AC
Sequence 3: GACCCTAG
The longest of these sequences is ten nucleotides. The number of gaps to be inserted is also ten for
this example, so the total alignment length is twenty. For each column, there is a 50% chance of
inserting a nucleotide for Sequence 1, a 10% chance for Sequence 2, and a 40% chance for Sequence 3.
The end result might look something like this:
Sequence 1: ---AT--TGC--C-G-A-CT
Sequence 2: ---------AT---------
Sequence 3: --G---AC---C-C-TAG--
The sequences are now randomly aligned and have enough gaps so that the nucleotides can be
shifted later in the algorithm.
Stem cell feature matrix: The stem cells are generated by encoding the sequences. The gap positions in
a sequence are used to represent it, and the gap positions of all the sequences are combined into a
single stem cell, in which the end of each sequence is marked by an absolute point. The value of an
absolute point is equal to the length of each aligned sequence in the initial population. In this
manner, three stem cells are produced, corresponding to an initial population of 3 alignments.
Stem cell 1: 0 1 2 5 6 10 11 13 15 17 20 0 1 2 3 4 5 6 7 8 11 12 13 14 15 16 17 18 19 20 0 1 3 4 5 8 9 10 12 14 18 19 20
Stem cell 2: 1 3 5 7 9 10 11 14 16 17 20 0 1 2 3 4 6 7 8 9 10 11 12 14 15 16 17 18 19 20 1 3 4 6 7 8 9 10 11 12 13 14 20
Stem cell 3: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
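Reading a stem cell back into an aligned row is a useful check of the encoding. The sketch below decodes one row at a time, assuming 0-based gap positions followed by the absolute point, which is what the listing of Stem cell 1 appears to use; the function name `decode_row` is hypothetical.

```python
def decode_row(gap_positions, residues):
    """Rebuild one aligned row; the last entry of gap_positions is the absolute point."""
    *gaps, length = gap_positions
    gap_set = set(gaps)
    letters = iter(residues)
    return "".join("-" if col in gap_set else next(letters) for col in range(length))

# First and third rows of Stem cell 1, with Sequence 1 and Sequence 3 of the sample input.
row_1 = decode_row([0, 1, 2, 5, 6, 10, 11, 13, 15, 17, 20], "ATTGCCGACT")
row_3 = decode_row([0, 1, 3, 4, 5, 8, 9, 10, 12, 14, 18, 19, 20], "GACCCTAG")
print(row_1)  # ---AT--TGC--C-G-A-CT
print(row_3)  # --G---AC---C-C-TAG--
```

Both decoded rows agree with the randomly aligned sequences shown earlier in this section.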
Here we adopt the “self-renewal” style of reproduction. Each alignment in the population is judged
according to a power relation, which determines its chance of survival. After all alignments have been
scored, they are randomly selected using weighted probabilities. The population size stays the same,
but there may be several copies of some alignments, while others disappear. For instance, suppose the
3 stem cells have the following scores:
Alignment 1 = 25
Alignment 2 = 15
Alignment 3 = 10
The total population fitness is the sum of all of the individual alignment fitness scores in the population
which is for this example fifty. This means that Alignment 1 will have a 50% chance (Alignment 1
individual fitness divided by the population total fitness) of being selected, Alignment 2 will have a 30%
chance, and Alignment 3 will have a 20% chance. A roulette wheel is created with one hundred slots on
it. Alignment 1 will occupy slots 1-50, Alignment 2 will occupy slots 51-80, and Alignment 3 will
occupy slots 81-100.
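The roulette wheel described here can be sketched directly. The function names `build_wheel` and `spin` are hypothetical, and rounding each share to whole slots is an assumption that happens to be exact for these scores.

```python
def build_wheel(scores, slots=100):
    """Assign each alignment a contiguous block of slots proportional to its score."""
    total = sum(scores.values())
    wheel, start = {}, 1
    for name, score in scores.items():
        width = round(slots * score / total)  # exact here; may need care for other scores
        wheel[name] = (start, start + width - 1)
        start += width
    return wheel

def spin(wheel, draw):
    """Return the alignment whose slot range contains `draw`."""
    for name, (low, high) in wheel.items():
        if low <= draw <= high:
            return name
    raise ValueError("draw outside the wheel")

wheel = build_wheel({"Alignment 1": 25, "Alignment 2": 15, "Alignment 3": 10})
print(wheel)  # {'Alignment 1': (1, 50), 'Alignment 2': (51, 80), 'Alignment 3': (81, 100)}
print([spin(wheel, d) for d in (23, 44, 92)])  # ['Alignment 1', 'Alignment 1', 'Alignment 3']
```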
A random number is generated to determine which alignments are added to the next generation.
In the example, three random numbers between 1 and 100 are generated. If the numbers are 23, 44, and
92, then the alignments that occupy these slots on the roulette wheel are added to the next generation.
The new population consists of two copies of Alignment 1 and one copy of Alignment 3. Alignment
2, though not the lowest-scoring alignment, dies off. Let us take stem cell 3 for the self-renewal
process, since stem cell 1 already has 2 copies.
Self Renewal: The ability of self-renewal can be expressed in two ways. In the first case,
proliferation is asymmetrical: a stem cell divides into two cells, a daughter cell that is similar to
the original stem cell and another cell that is fundamentally different from it. The other form of
renewal is symmetrical: the original stem cell divides into a daughter cell that is similar to the
original stem cell and another cell that undergoes mitosis and generates two cells similar to the
original stem cell [4].
Stem cell 3 - Step 1: Proliferation (Asymmetrical)
Daughter cell 1: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
Daughter cell 2: 0 3 5 13 14 15 16 17 18 19 20 0 5 20 5 7 9 13 16 17 18 19 20
Step 2: Symmetrical
Daughter cell 1: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
Daughter cell 2: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
Daughter cell 3: 1 2 4 6 7 8 9 10 11 12 20 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 1 2 3 4 6 8 10 11 12 14 15 20
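The listing of Daughter cell 2 in Step 1 is consistent with taking, for every sequence, the complement of the gap positions of Stem cell 3 within columns 0 to 19. The paper does not state this rule explicitly, so the sketch below is an inference from the numbers, and the function names are hypothetical.

```python
def split_rows(cell):
    """Split a flat stem cell into per-sequence gap lists using the absolute point."""
    length, rows, current = cell[-1], [], []
    for value in cell:
        if value == length:      # absolute point closes the current sequence
            rows.append(current)
            current = []
        else:
            current.append(value)
    return rows, length

def opposite_cell(cell):
    """Complement every row's gap positions within columns 0 .. length-1."""
    rows, length = split_rows(cell)
    flipped = []
    for gaps in rows:
        flipped.extend(col for col in range(length) if col not in gaps)
        flipped.append(length)
    return flipped

stem_cell_3 = [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 20,
               1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
               0, 1, 2, 3, 4, 6, 8, 10, 11, 12, 14, 15, 20]
print(opposite_cell(stem_cell_3))
# [0, 3, 5, 13, 14, 15, 16, 17, 18, 19, 20, 0, 5, 20, 5, 7, 9, 13, 16, 17, 18, 19, 20]
```

Applying `opposite_cell` twice returns the original stem cell, which matches the idea of a cell and its opposite point.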
Power Relation: Power, another feature of stem cells, is the capacity of stem cells to differentiate
into different types of cells. Stem cells are able to produce an organ from a component of that organ.
These cells can also be converted into a large number of cells with properties similar to those of the
original cells.
The cost of each existing stem cell is then calculated by a measure (criterion) function. Two memories
are considered for each stem cell: a local memory and a global memory. The cost value of each stem cell
is stored in the local memory. The cell with the lowest cost is considered the best cell in the first
iteration. However, this choice is valid only within one region of cells. In each region, the best cell
is selected. The location and cost of the best cell in each region are stored in the global memory; the
data from each region are then shared with the best cells of the other regions, and the best among them
is selected and used in the subsequent iteration. The best cell of the previous iteration renews itself
by producing its own copy and its opposite.
In this example, let us assume that Daughter cell 2 of Step 1 has the lowest cost and has been selected
as the best cell. According to the roulette wheel strategy, two copies of stem cell 1 and one copy of
Daughter cell 2 of Step 1 form the aligned sequences, which can be used to reveal valid evolutionary
information.
[6] CONCLUSION
We have introduced a new optimization algorithm, the Stem Cell Algorithm, intended to overcome the
problems of the previously introduced genetic algorithm, an optimization algorithm that requires a
larger number of iterations to reach the convergence point. The new algorithm is based on the natural
behavior of stem cells in reproducing themselves. Self-renewal and power relation are the main features
of this algorithm; they are used to align multiple sequences with fewer iterations and a higher speed
of convergence. This MSA is carried out with the Stem Cell Algorithm in order to reveal valuable
evolutionary information.
REFERENCES
[1] Yongtao Ye, David W. Cheung, Yadong Wang, Siu-Ming Yiu , Tak-Wah Lam, Hing-Fung Ting, “GLProbs:
Aligning multiple sequences adaptively”, ACM-BCB 2013 Washington, DC, USA.
[2] Dr. Pankaj Agarwal, “Alignment of Multiple Sequences using GA method”, International Journal of Emerging
Technologies in Computational and Applied Sciences (IJETCAS), IJETCAS 13-164; 2013, ISSN (Print):
2279-0047, ISSN (Online): 2279-0055
[3] Jonathan Shapiro, “Genetic algorithms in Machine Learning and its applications”, Lecture Notes in Computer
Science, Volume 2049, 2001, pp 146-168
[4] Mohammad Taherdangkoo, Mehran Yazdi, and Mohammad Hadi Bagheri, “Stem Cells Optimization
Algorithm”, D.-S. Huang et al. (Eds.): ICIC 2011, LNBI 6840, pp. 394–403, 2012. © Springer-Verlag Berlin
Heidelberg 2012
[5] Amie Judith Radenbaugh, “Applications of genetic algorithms in bioinformatics” San Jose State
University,2008