Short Communication
An Exact Formula for the Number of Alignments Between Two DNA Sequences
ANGELA TORRES^a, ALBERTO CABADA^b and JUAN J. NIETO^b,*
^aDepartamento de Psiquiatría, Radiología y Salud Pública, Facultad de Medicina, Universidad de Santiago de Compostela, Santiago de Compostela, Spain; ^bDepartamento de Análisis Matemático, Facultad de Matemáticas, Universidad de Santiago de Compostela, Santiago de Compostela, Spain
(Received 9 June 2003)
Given two DNA sequences, one is usually interested in measuring their similarity. For this purpose we have to consider different alignments of both the sequences. The number of alignments grows rapidly with the length of the sequences. In this short communication, we give an exact formula for the number of possible alignments using the theory of difference equations.
Keywords: DNA sequence; Alignment; Difference equation; Exact formula
INTRODUCTION
Mathematics, statistics and computer science play a relevant role in the study of genetic sequences and, generally, in molecular biology, bioinformatics and computational biology (Paun et al., 1998; Tang, 2000; Jamshidi et al., 2001; Lange, 2002; Morgenstern, 2002; Percus, 2002; Nieto et al., 2003; Torres and Nieto, 2003).
The comparison of genomic sequences (Forster et al., 1999; Gusev et al., 1999; Li et al., 2001; Liben-Nowell, 2001; Jiang et al., 2002) is performed using sequence alignment as a basic tool. It is obvious that the number of possible alignments grows rapidly. Here, we present an exact formula for the number of alignments using the theory of difference equations and a computer program to compute that number, and discuss some of its implications. Our method gives an alternative way of computing the number of possible gapped alignments. For example, for two sequences of length 8 and 16, respectively, the number of alignments is approximately $40 \times 10^{6}$. For two sequences of length 107 or greater, this number is greater than $10^{80}$.
ALIGNMENTS
The information archive in each organism is the genetic material, DNA or, in some viruses, RNA. DNA molecules are long chain molecules containing a message in a four-letter alphabet: A (adenine), G (guanine), C (cytosine) and T (thymine). In the RNA alphabet, T is replaced by U (uracil). The genetic information is encoded digitally as strings over this four-letter alphabet. A DNA sequence is just a string
\[ x = (x_1, x_2, \ldots, x_n) \]
where $x_i$, $i = 1, 2, \ldots, n$, is one of the four nucleotides A, G, C, T. We say that the sequence $x$ is of length $n$.
Given another sequence
\[ y = (y_1, y_2, \ldots, y_m), \]
we wish to compare both sequences, measure their similarity and determine their residue–residue correspondences. To compare the nucleotides or amino acids that appear at corresponding positions in two sequences, we must first assign those correspondences, and a sequence alignment is just the identification of residue–residue correspondences (Lesk, 2002). Any assignment of correspondences that preserves the order of the residues within the sequences is an alignment, and we allow the introduction of gaps. Thus, an alignment of sequences $x$ and $y$ is an arrangement of $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_m)$ and gaps in two rows, keeping the ordering of the $x_i$ and of the $y_j$, but not having any gap directly over another gap. This restriction implies that there are only a finite number of possible alignments.
For example, given the sequences $x$ = CGT and $y$ = ACTT, an alignment is
– C G T –
A C – T T
ISSN 1042-5179 print/ISSN 1029-2365 online © 2003 Taylor & Francis Ltd
DOI: 10.1080/10425170310001617894
*Corresponding author. Tel.: +34-981-563-100. Fax: +34-981-597-054. E-mail: [email protected]
DNA Sequence, December 2003 Vol. 14 (6), pp. 427–430
Mitochondrial DNA. Downloaded from informahealthcare.com by University of California Irvine on 10/28/14. For personal use only.
Sequence alignment is one of the basic tools of molecular biology, computational biology and bioinformatics (Altschul et al., 2001; Lesk, 2002; Sadovsky, 2003; Torres and Nieto, 2003).
The number of alignments $f(n,m)$ between two given sequences $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$ is given by the following recurrence relation (see Lange, 2002),
\[ f(n,m) = f(n-1,m) + f(n,m-1) + f(n-1,m-1), \tag{1} \]
together with the initial conditions
\[ f(0,m) = f(n,0) = 1, \quad n, m \in \{0, 1, 2, \ldots\}. \tag{2} \]
The initial conditions follow from the trivial fact that a sequence of length $n$ can be aligned with the empty sequence in only one way, namely by inserting $n$ gaps in the empty string. The general recurrence formula follows from the fact that every pair of gapped sequences ends with one of the three pairs
\[ \begin{pmatrix} x_n \\ - \end{pmatrix}, \quad \begin{pmatrix} - \\ y_m \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} x_n \\ y_m \end{pmatrix}, \]
and the remainder of the alignment to the left of one of these pairs constitutes an alignment between two shorter sequences.
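The case analysis on the last pair can be sketched in Python. This is only an illustration under our own naming: the functions `alignments` and `count_alignments` are hypothetical and are not the computer program mentioned in the Introduction. The first function enumerates alignments by choosing the last pair in each of the three possible ways, and the second evaluates recurrence (1) with initial conditions (2) row by row.

```python
def alignments(x, y):
    """Yield every alignment of strings x and y as a pair of gapped rows."""
    if not x and not y:
        yield "", ""
        return
    if x:  # last pair is (x_n over a gap)
        for top, bottom in alignments(x[:-1], y):
            yield top + x[-1], bottom + "-"
    if y:  # last pair is (a gap over y_m)
        for top, bottom in alignments(x, y[:-1]):
            yield top + "-", bottom + y[-1]
    if x and y:  # last pair is (x_n over y_m)
        for top, bottom in alignments(x[:-1], y[:-1]):
            yield top + x[-1], bottom + y[-1]


def count_alignments(n, m):
    """Evaluate recurrence (1) with initial conditions (2), one row at a time."""
    row = [1] * (m + 1)  # f(0, j) = 1 for every j
    for _ in range(n):
        new_row = [1]  # f(i, 0) = 1
        for j in range(1, m + 1):
            # f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1)
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[m]


found = list(alignments("CGT", "ACTT"))
print(len(found), count_alignments(3, 4))  # both counts should agree
print(("-CGT-", "AC-TT") in found)         # the example alignment is among them
```

Enumeration is only feasible for very short sequences, whereas the row-by-row evaluation needs about $nm$ additions.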
EXACT FORMULA
It is obvious that Eqs. (1) and (2) have a unique solution, since Eq. (1) is an explicit recurrence relation with given initial conditions (2). To solve this problem we shall use some results from the theory of difference equations (Kelley and Peterson, 2001). One of the advantages of our method is that it scales better than direct evaluation of Eq. (1). For $n \in \{1, 2, \ldots\}$ and $k \in \{0, 1, \ldots\}$, denote the "falling factorial power" by $n^{\underline{0}} = 1$ and, if $k \ge 1$,
\[ n^{\underline{k}} = n(n-1)\cdots(n-k+1). \tag{3} \]
From the definition, one can verify the following property
\[ (n+1)^{\underline{k}} - n^{\underline{k}} = k\, n^{\underline{k-1}}, \quad k \ge 0. \tag{4} \]
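As a quick numerical illustration of definition (3) and property (4), the following Python sketch checks the identity on a small grid. The helper name `falling` is our own choice, and the check is restricted to $k \ge 1$, where both sides are defined directly by (3).

```python
def falling(n, k):
    """Falling factorial power n(n-1)...(n-k+1); the empty product is 1 for k = 0."""
    result = 1
    for j in range(k):
        result *= n - j
    return result


# Property (4): the forward difference of the k-th falling power of n
# equals k times the (k-1)-th falling power of n.
for n in range(1, 30):
    for k in range(1, 10):
        assert falling(n + 1, k) - falling(n, k) == k * falling(n, k - 1)

print(falling(5, 3), falling(6, 3) - falling(5, 3), 3 * falling(5, 2))
```

Property (4) is the discrete analogue of the derivative rule for powers, which is why sums of falling powers can be "integrated" term by term in the derivation.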
First, we assume that $n \ge m$. We look for a solution given by the following expression
\[ f(n,m) = \sum_{i=0}^{m+1} A_i(m)\, n^{\underline{i}}, \tag{5} \]
where
\[ A_0(m) \equiv 1 \quad \text{for all } m \in \{0, \ldots, n\}. \tag{6} \]
Note that if the solution $f$ is given by this expression then
\[ f(n+1,m+1) - f(n,m+1) = f(n+1,m) + f(n,m) = \sum_{i=0}^{m+1} A_i(m)\left((n+1)^{\underline{i}} + n^{\underline{i}}\right). \]
Using property (4) we deduce that this last expression is equal to
\[ \sum_{i=1}^{m+1} \left(2A_{i-1}(m) + iA_i(m)\right) n^{\underline{i-1}} + 2A_{m+1}(m)\, n^{\underline{m+1}}. \]
As a consequence of these equalities, from property (4) and the initial conditions (2), we conclude that
\[ f(n,m+1) = 1 + \sum_{i=1}^{m+1} \left(2A_{i-1}(m) + iA_i(m)\right) \frac{n^{\underline{i}}}{i} + 2A_{m+1}(m)\, \frac{n^{\underline{m+2}}}{m+2}. \]
From this last expression and the equality (5) applied at the point $(n, m+1)$, we have that the coefficients $A_i(m)$ satisfy the following recurrence system
\[ A_i(m+1) = A_i(m) + \frac{2}{i} A_{i-1}(m), \quad i \in \{1, 2, \ldots, m+1\}, \tag{7} \]
and
\[ A_{m+2}(m+1) = \frac{2}{m+2} A_{m+1}(m). \tag{8} \]
Now, taking into account that $1 = f(n,0) = 1 + A_1(0)\,n$, we arrive at $A_1(0) = 0$, and as a consequence of Eq. (8), we have that the following property holds
\[ A_i(i-1) = 0 \quad \text{for all } i \ge 1. \tag{9} \]
Now, we rewrite Eq. (7) with initial conditions (9) as
\[ A_i(m+1) = A_i(m) + \frac{2}{i} A_{i-1}(m), \quad m \ge i-1, \qquad A_i(i-1) = 0. \]
In the sequel, we verify that the unique solution of this problem is given by
\[ A_i(m) = \frac{2^i}{(i!)^2}\, m^{\underline{i}} \quad \text{for all } i \in \{0, 1, \ldots, m\}. \]
For $i = 1$, since $A_0(m) = 1$ for all $m \ge 1$ we have that $A_1(m) = 2m \equiv 2\,m^{\underline{1}}$.
Assume that $i \ge 2$. The initial condition follows trivially from the fact that $(i-1)^{\underline{i}} = 0$. Equation (7) follows directly from property (4).
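The verification can also be checked mechanically. The following Python sketch, with hypothetical helper names `falling` and `coefficients`, builds the coefficients $A_i(m)$ from conditions (6) to (8) starting at $A_0(0) = 1$, $A_1(0) = 0$, using exact rational arithmetic, and compares them with the closed form $2^i m^{\underline{i}}/(i!)^2$. It is a finite check on small indices, not a replacement for the argument above.

```python
from fractions import Fraction
from math import factorial


def falling(n, k):
    """Falling factorial power n(n-1)...(n-k+1), equal to 1 for k = 0."""
    result = 1
    for j in range(k):
        result *= n - j
    return result


def coefficients(max_m):
    """Return rows [A_0(m), ..., A_{m+1}(m)] for m = 0..max_m, built from (6)-(8)."""
    rows = [[Fraction(1), Fraction(0)]]  # A_0(0) = 1 and A_1(0) = 0
    for m in range(max_m):
        prev = rows[-1]
        row = [Fraction(1)]  # condition (6): A_0(m+1) = 1
        for i in range(1, m + 2):  # recurrence (7) for i = 1, ..., m+1
            row.append(prev[i] + Fraction(2, i) * prev[i - 1])
        row.append(Fraction(2, m + 2) * prev[m + 1])  # relation (8)
        rows.append(row)
    return rows


for m, row in enumerate(coefficients(12)):
    for i, value in enumerate(row):
        closed_form = Fraction(2**i * falling(m, i), factorial(i) ** 2)
        assert value == closed_form

print(coefficients(2)[2])  # A_0(2), A_1(2), A_2(2), A_3(2)
```

Working with `Fraction` avoids rounding, so equality with the closed form is tested exactly.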
When $n \le m$, we look for a solution that satisfies the following expression
\[ f(n,m) = \sum_{i=0}^{n+1} A_i(n)\, m^{\underline{i}}, \]
and conclude that $f(n,m) = f(m,n)$. Thus, using that if $0 \le k \le n$, then
\[ n^{\underline{k}} = k! \binom{n}{k}, \]
as a consequence of the development done in this section, we arrive at the following result.
Conclusion: The unique solution of problems (1) and (2) is given by the following expression
\[ f(n,m) = \sum_{k=0}^{\min\{n,m\}} 2^k \binom{m}{k} \binom{n}{k}. \tag{10} \]
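Formula (10) translates directly into code. The sketch below is an illustration with names of our own choosing (`f_closed`, `f_recursive`): it evaluates the sum with Python's `math.comb` and cross-checks it against a memoized evaluation of recurrence (1) with initial conditions (2) on a small grid.

```python
from functools import lru_cache
from math import comb


def f_closed(n, m):
    """Number of alignments according to formula (10)."""
    return sum(2**k * comb(m, k) * comb(n, k) for k in range(min(n, m) + 1))


@lru_cache(maxsize=None)
def f_recursive(n, m):
    """Number of alignments according to recurrence (1) and conditions (2)."""
    if n == 0 or m == 0:
        return 1
    return (f_recursive(n - 1, m)
            + f_recursive(n, m - 1)
            + f_recursive(n - 1, m - 1))


# The two definitions should agree wherever both are evaluated.
for n in range(25):
    for m in range(25):
        assert f_closed(n, m) == f_recursive(n, m)

print(f_closed(3, 4))
```

The closed form needs only $\min\{n,m\} + 1$ terms, while the recurrence fills a table with about $nm$ entries.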
We note that the formula allows for the situation where no residues are aligned between sequences, i.e. all the residues of one sequence are in the gaps of the other sequence. Of course, this is a non-biological result and there are a huge number of such alignments. In a forthcoming paper we will present the formula for the number of alignments with no overlapping residues, the number of optimal alignments (with respect to an optimality criterion), and the expression for the number of multiple alignments. Also, we will discuss explicitly some scaling questions in the context of biological sequences.
CONSEQUENCES
Equation (10) permits us to compute directly the number of alignments between two strings of DNA. Below we give some numerical values:
\[ f(2,1) = f(1,2) = 5, \]
\[ f(4,2) = f(2,4) = 41, \]
\[ f(8,4) = f(4,8) = 3649, \]
\[ f(16,8) = f(8,16) = 39490049. \]
Hence, we can see in this simple example that the number of alignments grows very fast. Equation (10) shows that the double sequence $\{f(n,m)\}$ is monotone nondecreasing in both variables, i.e. $f(n,m) \le f(n,m+1)$ and $f(n,m) \le f(n+1,m)$, for all $n, m \in \mathbb{N}$. In particular, $f(n,m) = f(m,n) \ge f(n,n)$ for all $n \le m$ and $f(n,m) = f(m,n) \le f(n,n)$ whenever $n \ge m$.
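The listed values and the symmetry and monotonicity properties can be reproduced with a few lines of Python. This is a sketch under our own naming (the function `f` and the bound `LIMIT` are illustrative choices), and the property check covers only a finite grid.

```python
from math import comb


def f(n, m):
    """Number of alignments from Eq. (10)."""
    return sum(2**k * comb(m, k) * comb(n, k) for k in range(min(n, m) + 1))


for n, m in [(2, 1), (4, 2), (8, 4), (16, 8)]:
    print(f"f({n},{m}) = f({m},{n}) = {f(n, m)}")

LIMIT = 30  # size of the grid on which the properties are checked
for n in range(LIMIT):
    for m in range(LIMIT):
        assert f(n, m) == f(m, n)      # symmetry
        assert f(n, m) <= f(n, m + 1)  # nondecreasing in m
        assert f(n, m) <= f(n + 1, m)  # nondecreasing in n
```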
For this reason, the behaviour of $f$ at the diagonal terms $(n,n)$ plays an important role in the study of the function. Clearly, $\{f(n,n)\}$ is monotone nondecreasing, and it grows very fast, as the following values show:
\[ f(10,10) = 8097453, \]
\[ f(50,50) \approx 15 \times 10^{36}, \]
\[ f(100,100) \approx 2 \times 10^{75}. \]
Of special interest are the inequalities
\[ f(106,106) < 10^{80} < f(107,107). \]
This last estimate is relevant since there are only $10^{80}$ protons in our universe (see Nowak and May, 2000, p. 84). A typical sequence contains about 200–500 base pairs (Lesk, 2002). Proteins are about 200–400 amino acids long, and hence their coding sequences are 600–1200 letters long.
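Since Python integers have arbitrary precision, the diagonal values can be computed exactly and the crossing of the $10^{80}$ threshold located by direct search. The following sketch uses hypothetical names (`diagonal`, `first_exceeding`) and is expected to reproduce the inequalities stated above.

```python
from math import comb


def diagonal(n):
    """f(n, n) from Eq. (10), computed with exact integers."""
    return sum(2**k * comb(n, k) ** 2 for k in range(n + 1))


def first_exceeding(bound):
    """Smallest n such that f(n, n) is strictly greater than bound."""
    n = 0
    while diagonal(n) <= bound:
        n += 1
    return n


print(diagonal(10))
print(len(str(diagonal(50))), len(str(diagonal(100))))  # numbers of decimal digits
print(first_exceeding(10**80))
```

Counting decimal digits with `len(str(...))` gives the order of magnitude without any floating-point rounding.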
Using Newton's binomial formula and classical inequalities, we arrive at the following estimates from above and below on the growth of $f(n,n)$:
\[ \sum_{k=0}^{n} 2^k \binom{n}{k} = 3^n \le f(n,n) \le \left(\sum_{k=0}^{n} \sqrt{2}^{\,k} \binom{n}{k}\right)\left(\sum_{k=0}^{n} \sqrt{2}^{\,k} \binom{n}{k}\right) = (1+\sqrt{2})^{2n}. \]
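Both bounds can be checked without floating-point arithmetic. The sketch below, with names of our own choosing, writes $(1+\sqrt{2})^{2n} = (3+2\sqrt{2})^n = a_n + b_n\sqrt{2}$ with integers $a_n, b_n$, so that the upper bound reduces to a comparison of integers. It verifies the inequalities only for a finite range of $n$.

```python
from math import comb


def diagonal(n):
    """f(n, n) from Eq. (10)."""
    return sum(2**k * comb(n, k) ** 2 for k in range(n + 1))


def upper_bound_parts(n):
    """Integers (a, b) with (1 + sqrt(2))**(2n) = a + b*sqrt(2)."""
    a, b = 1, 0
    for _ in range(n):
        # (a + b*sqrt(2)) * (3 + 2*sqrt(2)) = (3a + 4b) + (2a + 3b)*sqrt(2)
        a, b = 3 * a + 4 * b, 2 * a + 3 * b
    return a, b


def within_bounds(n):
    """Check 3**n <= f(n, n) <= (1 + sqrt(2))**(2n) using integers only."""
    value = diagonal(n)
    a, b = upper_bound_parts(n)
    lower_ok = 3**n <= value
    # value <= a + b*sqrt(2) holds if value <= a, or if (value - a)**2 <= 2*b**2.
    diff = value - a
    upper_ok = diff <= 0 or diff * diff <= 2 * b * b
    return lower_ok and upper_ok


print(all(within_bounds(n) for n in range(200)))
```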
These estimates are shown in Fig. 1 when both sequences are of the same length $n$.
FIGURE 1 Upper and lower estimation of the growth of the number of alignments $f(n,n)$.
Acknowledgements
The research of the second and third authors was partially supported by Ministerio de Ciencia y Tecnología and FEDER, project BFM2001-3884-C02-01, and by Xunta de Galicia and FEDER, project PGIDIT02PXIC20703PN. The authors thank the anonymous reviewers for their useful comments.
References
Altschul, S.F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) “The estimation of statistical parameters for local alignment score distributions”, Nucleic Acids Research 29, 351–361.
Forster, M., Heath, A. and Afzal, M. (1999) “Application of distance geometry to 3D visualization of sequence relationships”, Bioinformatics 15, 89–90.
Gusev, V.D., Nemytikova, L.A. and Chuzhanova, N.A. (1999) “On the complexity measures of genetic sequences”, Bioinformatics 15, 994–999.
Jamshidi, N., Edwards, J.S., Fahland, T., Church, G.M. and Palsson, B.O. (2001) “Dynamic simulation of the human red blood cell metabolic network”, Bioinformatics 17, 286–287.
Jiang, T., Lin, G., Ma, B. and Zhang, K. (2002) “A general edit distance between RNA structures”, Journal of Computational Biology 9, 371–388.
Kelley, W.G. and Peterson, A.C. (2001) Difference Equations. An Introduction with Applications (Academic Press, San Diego).
Lange, K. (2002) Mathematical and Statistical Methods for Genetic Analysis (Springer-Verlag, New York).
Lesk, A.M. (2002) Introduction to Bioinformatics (Oxford University Press, Oxford).
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P. and Zhang, H. (2001) “An information-based sequence distance and its application to whole mitochondrial phylogeny”, Bioinformatics 17, 149–154.
Liben-Nowell, D. (2001) “On the structure of syntenic distance”, Journal of Computational Biology 8, 53–67.
Morgenstern, B. (2002) “A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences”, Applied Mathematics Letters 15, 11–16.
Nieto, J.J., Torres, A. and Vazquez-Trasande, M.M. (2003) “A metric space to study differences between polynucleotides”, Applied Mathematics Letters, in press.
Nowak, M.A. and May, R.M. (2000) Virus Dynamics (Oxford University Press, Oxford).
Paun, Gh., Rozenberg, G. and Salomaa, A. (1998) DNA Computing: New Computing Paradigms (Springer-Verlag, Berlin).
Percus, J. (2002) Mathematics of Genome Analysis (Cambridge University Press, Cambridge).
Sadovsky, M.G. (2003) “The method to compare nucleotide sequences based on the minimum entropy principle”, Bulletin of Mathematical Biology 65, 309–322.
Tang, B. (2000) “Evaluation of some DNA cloning strategies”, Computers and Mathematics with Applications 39, 43–48.
Torres, A. and Nieto, J.J. (2003) “The fuzzy polynucleotide space: basic properties”, Bioinformatics 19, 587–592.