sequences: a maximum likelihoodapproach arora ...biitcomm/research/genomic signal...theperiodicities...
TRANSCRIPT
DETECTION OF PERIODICITIES IN GENE SEQUENCES: A MAXIMUMLIKELIHOOD APPROACH
Raman Arora and William A. Sethares
Department of Electrical and Computer Engineering, University of Wisconsin-Madison1415 Engineering Drive, Madison WI 53706, [email protected], [email protected]
ABSTRACTA novel approach is presented to the detection of ho-mological, eroded and latent periodicities in DNA se-quences. Each symbol in a DNA sequence is assumedto be generated from an information source with anunderlying probability mass function (pmf) in a cyclicmanner. The number of sources can be interpreted asthe periodicity of the sequence. The maximum likelihoodestimates are developed for the pmfs of the informationsources as well as the period of the DNA sequence.The statistical model can also be utilized for buildingprobabilistic representations of RNA families.
I. INTRODUCTION
The structural features of DNA sequences have bio-logical implications [1]. One such structural feature issymbolic periodicity. The periodicities in gene sequenceshave been linked with evolution of genome and proteinstrucuture [1], [2], [3]. The DNA sequences exhibit ho-mologous, eroded and latent periodicities. Homologousperiodicity occurs when short fragments of DNA arerepeated in tandem to give periodic sequences [4]. Mostcurrent approaches for finding periodicities transform thesymbolic DNA sequence to a numerical sequence [5], [6],[7]; these techniques are primarily aimed at the detectionof homological periodicities.Some researchers have also explored the detection of
imperfect or eroded periodicities which model a sequenceof similar units repeated but with some changes. There-fore, the homology between repeated units in an erodedsequence is not perfect [4]. The imperfect periodicitymay occur in strands of DNA due to changes or erosionof nucleotides.
The periodicity in DNA sequences may also be mod-eled as latent periodicity [4], for instance an observedperiod of nucleotides may be (A/C)(T/G)(T/A)(G/T)(CIGIA)(GIA), i.e. the first nucleotide of a period maybe A or C followed by a T or G and so on. Thehidden periodicities may not be found efficiently byalgorithms developed for finding homological and erodedperiodicities [2]. The latent periodicity detection wasstudied in [7], [8] and latent periodicities of some humangenes were reported.
This paper presents a novel approach to finding la-tent periodicities in DNA sequences that parallels theextraction of beat information from low level audio
features in [9]. Each symbol of the sequence is assumedto be generated by an information source with someunderlying probability mass function and the sequenceis generated by drawing symbols from these sources ina cyclic manner. The number of sources is equal to thelatent period in the sequence and the latent periodicityis same as the statistical periodicity or cyclostationarity.The paper presents maximum likelihood estimates of thepmfs and the period. The symbolic sequences are nottransformed into numerical sequences and the methodpresented here is capable of finding all three kinds ofperiodicities: homological, eroded and latent.
The problem of detecting latent periodicities in sym-bolic sequences is formulated mathematically in the nextsection. The maximum likelihood estimate of the periodis developed in section III and results are discussed insection IV. A short discussion on building probabilisticrepresentations for non-coding RNAs is presented insection V.
II. STATISTICAL PERIODICITY
The statistical periodicity model that is employed here todiscover possibly hidden periodicities in gene sequencesdoes not assume that the sequence itself is periodic.Instead it is assumed that there is a periodicity inunderlying statistical distributions which is locked to aknown periodic grid.A given DNA sequence D = [D1,.. ,DN] can be
denoted by the mapping D: N -> S, from the naturalnumbers to the alphabet S {A, G, C, T}. Assumethat the statistical periodicity of the sequence D is T.This implies there are IT information sources (or randomvariables) denoted as X1, . . ., XT. The random variableXi takes values on the alphabet S according to anassociated probability mass function Pi; it generates thejth symbol in S with probability Pi(j) = P(Xi = Sj)for j 1, ..., S where S is the cardinality of thealphabet (which is four for the DNA sequences).
The number of complete statistical periods in D areM LN/Ti. Define i = (i mod T). Then for1 < i < N, the symbol Di, i.e. the jth symbol in thesequence D, is generated by the random variable X,.The random variables Xi, i = 1, ....,T are assumed tobe independent. The structural parameters, Pi... . , PT,and the timing parameter IT are unknown. Define 9 =
[T, Pi, ... , PT]. The search space for parameter IT is
1-4244-0999-3/07/$25.00 ©2007 IEEE.
the set B = {1,...,No} for some No < N and forthe pmfs [Pi,... ,PT] the search space is the subsetQ C [0,11]I I< of column stochastic matrices (forP C nPji C [0: l] and E S11 Pji = I for i' = l1) . . .
Let p = B x Q denote the search space for the parameter9. Given the data, the maximum aposteriori (MAP)estimate of parameter 9 is
9 arg max2(9 lD).
By Bayes rule the posterior probability is
2(9D) =(D9)2(9)2P(D) (1)
where, by independence of Xi's,N
2(D9) Hf2(X, = Dil9) (2)i=l
is the likelihood. Note that the probability 2(D)f2°,(D 98)2F(9)d9 is a constant and thus, assuminga uniform prior on 9,
9 arg maxP(D 9). (3)
In words, the MIAP estimate is same as the maximumlikelihood estimate.
III. THE MAXIMUM LIKELIHOOD ESTIMATE
This section develops the maximum likelihood estimate(MLE) for the unknown parameter 9. The data-sequenceD = [D1,. . ,DN] is represented by a sequence ofvectors W = [w1,.. , WN] where each wi is an S x 1vector with wji = 1 Di = Si. So, if the jthsymbol in the sequence D is C, i.e. the third symbol ofthe alphabet S, then the jth vector wi in the sequenceW is [ 0 0 1 0 ]'. Also define a S x 7 stochastic matrixA with entries Aji = (Xi = Sj). The columns of thematrix A denote the pmfs of the information sources;the entry Aji denotes the probability that the jth sourcegenerates the jth symbol of the alphabet S. Write theunknown parameter 8 = [A, 7]. This notation simplifiesthe derivation of the MLE by noting that
-P(Xi~= DilA, =T) (Ajj=l
The likelihood can therefore be written as
2(AVA,T)N
W(X, = DiA,7)t i1
i=lj=
M wjI(AAi)w,
k i=l j=lN-MIr i
(k
l l(lAj, i )x
(4)
k i= j=1
where (k) = (k -1)T + i. Note that the first term onthe right hand side of (4) captures the observations in Mcomplete periods (given the period T) while the secondproduct captures the observation over the last incompletecycle. The log-likelihood is
logP(W A, )M f isi
SEEE Wji(k) log (Ai) +k= i= j=1
N-Mfr ISI
E E wji(M+1) log (Ail )i=l i=1
For a fixed 7, the MLE for A is
AT = arg max log2(W A,7). (6)
The log-likelihood in (5) is a concave function of vari-ables A Also, note that these variables satisfy theconstraint: s 1 for i 1,..., 7. Constrainedoptimization using Lagrange multipliers gives the (j, i)thelement of the matrix AT as
I 1 Ek=1 Wji(k), t N-:... MS
M k=1 Wji(k), t= N- M7, . ..(7)
for j 1,..., S . The estimates of the parameters Acan then be used to determine the MLE for the period7,
T* = arg max log2(WEAT, T).TeB
(8)
IV. RESULTS
The method in the previous section was applied to avariety of simulated symbolic sequences and to chro-mosome XVI of S. cerevisiae in order to detect peri-odicities. A homological symbolic sequence from the setS = {A, G, C, T} with period 7 = 7 was generated.The sequence was eroded by changing the symbols atrandomly chosen points in the sequence. The algorithmwas tested with various degrees of erosion. The plots inFig. 1 strongly support a statistical periodicity of 7 evenwith 85% erosion. The noise floor in the plots increases(i.e. the heights of the peaks decrease) with increasederosion. Note that a T-periodic sequence also shows k7-periodicity for any positive integer k.
Figure 2(a) shows the results with latent periodicityof simulated symbolic sequence where a single periodis (A/C)(T/G)(T/A)(G/T)(C/G/A)(G/A). This was gener-ated by six information sources, Xl.... ,X6 with X1generating A or C each with equal probability and X5generating A,G or C each with probability 1/3. Applyingthe technique of Sect. III to this sequence results in aplot showing strong six-periodic behaviour. In contrast,when a random sequence is used (i.e. when each sourcegenerates all symbols with equal frequency), Fig. 2(b)shows that no significant periodicities are detected.The algorithm was also tested with the protein coding
region of chromosome XVI of S. cerevisiae (GenBankaccession number NC 001 148). The 2160 base-pair(bp)
Honological Periodicity V. IDENTIFYING NON-CODING RNAS
._ _. ... ~~~~~~~~~~..........E
5 10 15 20 5 5 10 15 20 25
75% erosion 85% erosion0
0 o ............................. ....................
FE
...
5 10 15 20 25 5 10 15 20 25
Figure 1. Maximum log-likelihood of data plotted againstPeriod for a simulated symbolic sequence of length6400 symbols with period 7: (a) Homological periodicsequence (b) 50% eroded sequence (c) 75% erodedsequence (d) 85% eroded sequence.
(a) (A/l) (T/G) (TIA) (GIT) (CGA) (G1A)
.C ... A.. ... ..X.V .... . ..
._
E
E
5 \0 15 20
Period(c) Chromosome XVI of S- Cerevisiae
........\......f1.
It
I ..............'-' ''' '' '' '''I 'EE : g| tx) A :j\ \ 0\ \\ # \OCZ g g } g
0 5\\1 0 1 5 20
Period
(b) random se uence
A/
........... .......... .......... ...........
0 5 10 15 20
Period
(d) Spectrur of Chromosome XVI
200 400 600 800 1000
Frequency
Figure 2. (a) Maximum Log-likelihood of data plottedagainst period for a simulated symbolic sequence oflength 6400 symbols with latent periodicity 7, (b) maxi-mum log-likelihood versus period for completely randomsymbolic sequence (c) maximum log-likelihood plottedagainst period for protein coding region of chromosomeXVI of S. cerevisiae (d) the magnitude of DFT ofnumerical sequence derived from the sequence in part(c).
long sequence (from bp 85 - 2244) shows a latentperiodicity of period three as plotted in Fig. 2(c). Theperiod-3 behaviour of protein coding genes is expectedsince amino acids are coded by trinucleotide units calledcodons [7], [10]. For comparison, the symbolic sequence
is transformed into a numerical sequence as in [7] and themagnitude of the 2160-point DFT is plotted in Fig. 2. Thepeaks at fi = 720, f2 = 360 and f3 = 180 correspondto 3, 6 and 12-periodic behaviour respectively.
The central dogma of molecular biology states that DNAis transcribed to RNA and then translated to proteins. Thegenetic information therefore flows from DNA to proteinthrough the RNA. However, besides playing the role ofa passive intermediary messenger (mRNA), RNAs havebeen known to play important non-coding function in theprocess of translation (tRNA, rRNA) [11]. Since theseRNAs are not translated into proteins, these are callednon-coding RNAs (ncRNAs) or RNA genes. Originally,such RNA genes were considered rare but in the lastdecade many new RNA genes have been found andhave been shown to play diverse roles: chromosomereplication, protein degradation and translocation, regu-
lating gene expression and many more. Thus RNA genes
may play a much more significant role than previouslythought. The number of ncRNAs in human genomes isin the order of tens of thousands and considering the vastamount of genomic data there is a need for computationalmethods for identification of ncRNAs [10].
The statistical model presented in this paper forfinding periodicities in symbolic sequences can be uti-lized for building probabilistic representations of RNAfamilies. The RNA has the same primary structure as
DNA, consisting of a sugar-phosphate backbone withnucleotides attached to it. However, in RNA the nu-
cleotide thymine(T) is replaced by uracil (U) as the basecomplementary to adenine (A). So, RNA is representedby the string of bases: A, C, G and U. RNA existsas a single-stranded molecule since the replacement ofthymine by uracil makes RNA too bulky to form a
stable double helix. However, the complementary bases(A and U, G and C) can form a hydrogen bond andsuch consecutive base pairs cause the RNA to fold ontoitself resulting in 2-D and 3-D secondary and tertiarystructures. A typical secondary structure is a hairpinstructure as shown in Fig. 3(a); the consecutive base pairsthat bond together get stacked onto each other to form a
stem while the unpaired bases form a loop.Typical methods employed for identification of DNA
gene sequences and proteins do not perform as well inthe identification of ncRNAs because they are based on
finding structural features (like periodicities) in primarysequences whereas most functional ncRNAs preserve
their secondary structures more than they preserve theirprimary sequences [10]. Therefore, in the identificationof ncRNAs there is need for techniques that also evaluatesimilarity between secondary structures. Such techniqueshave been shown to be more effective in comparing anddiscriminating RNA sequences [12].
The RNA sequences preserve the secondary structurewhen undergoing erosion or mutation by compensatorymutation as shown in Fig. 4. This causes strong pairwisecorrelations between distant bases in the primary RNAsequence. Unlike the techniques employed for DNAidentification in earlier works, the approach presentedhere can describe such pairwise correlations. Consider a
sequence of ncRNA molecules, tandem repeats of which
-o00
-e
0
50°% erosion
loor) .i v As_--- 71, ,
G G RNAOto G h
s em
o G_k A - --
Structuraly S ructural&AJN similar mismatch <
RNA d RNA 2
Figure 3. (a) RNAO has hairpin secondary structure. (b)RNA1 is similar in structure to RNAO. It differs at twopositions in the primary sequence from RNAO. (c) RNA2structure is not hairpin, it has a structural mismatch withRNAO. RNA2 also differs at two position in the primarysequence from RNAO but it must be scored lower insimilarity to RNAO as compared to RNA1.
CompensatoryMutatioG
s , X, X3 X, X , s
Symbol N=(A/C) M=(U/C)Correlatiois N'=(U/G) M'=(NG)
Figure 4. A given base pair in a ncRNA moleculeundergoes compensatory mutation i.e if one of the nu-
cleotides in a base-pair mutates, the other nucleotidealso changes to complementary nucleotide. So there isa strong correlation between the base positions indicatedby N and N' and base positions indicated by M and M'.
have undergone random compensatory mutations as
shown in Fig. 4. According to the statistical periodicitymodel presented in this paper, the sources generatingthe symbols that do not bond (nucleotides in the loopof a hairpin ncRNA) have a point-mass pmf. On theother hand, the sources corresponding to a bondedbase pair have conjugate distributions (in Fig. 4 pmfsof N and N' agree on complementary bases). Theseconjugate distributions capture the distant base-paircorrelations. The sequence of sources correspondingto ncRNA molecules in Fig. 4 after identifyingthe bonded nucleotides with conjugate sources isX1, X2, X3, X4, X3, X2, X1. The statistical periodicitymodel presented here is therefore capable of describingprimary as well as secondary structural similarities.
VI. ACKNOWLEDGEMENTS
The authors would like to thank Charu Gera for varioushelpful discussions on molecular biology.
VII. REFERENCES
[1] C. M. Heame, S. Ghosh, and J. A. Todd, "Mi-crosatellites for linkage analysis of genetic traits,"Trends in Genetics, vol. 8, pp. 288, 1992.
[2] E. V. Korotkov and D. A. Phoenix, "Latent pe-riodicity of DNA sequences of many genes," inProceedings of Pacific Symposium on Biocomput-ing 97, R. B. Altman, A. K. Dunker, L. Hunter,and T. Klein, Eds., Singapore-New-Jersey-London,1997, pp. 222-229, Word Scientific Press.
[3] E. V. Korotkov and N. Kudryaschov, "Latentperiodicity of many genes," Genome Informatics,vol. 12, pp. 437 - 439, 2001.
[4] M. B. Chaley, E. V. Korotkov, and K. G. Skryabin,"Method revealing latent periodicity of the nu-cleotide sequences modified for a case of smallsamples," DNA Research, vol. 6, pp. 153-163, Feb.1999.
[5] V. R. Chechetkin, L. A. Knizhnikova, and A. Y.Turygin, "Three-quasiperiodicity, mutual correla-tuions, ordering and long modulations in genomicnucleotide sequences viruses," Journal ofbiomolec-ular structure and dynamics, vol. 12, pp. 271, 1994.
[6] E. A. Cheever, D. B. Searls, W. Karunaratne, andG. C. Overton, "Using signal processing techniquesfor dna sequence comparison," in Proc. of the1989 Fifteenth Annual Northeast BioengineeringConference, Boston, MA, Mar 1989, pp. 173 - 174.
[7] D. Anastassiou, "Genomic signal processing," IEEESignal Processing Magazine, vol. 18, pp. 8-20, Jul2001.
[8] E. V. Korotkov and M. A. Korotova, "Latentperiodicity of dna sequences of some human genes,"DNA Sequence, vol. 5, pp. 353, 1995.
[9] W. A. Sethares, R. D. Morris, and J. C. Sethares,"Beat tracking of audio signals using low levelaudio features," IEEE Transactions On Speechand Audio Processing, vol. 13, no. 2, pp. 275-285,March 2005.
[10] B. J. Yoon and R. P. Vaidyanathan, "Computationalidentification and analysis of noncoding mas - un-earthing the buried treasures in the genome," IEEESignal Processing Magazine, vol. 24, no. 1, pp. 64-74, Jan 2007.
[11] M. S. Waterman, Introduction to ComputationalBiology: Maps, sequences and genomes, Chapmanand Hall/CRC, first edition, 1995.
[12] S. R. Eddy, "Non-coding ma genes and the modemma world," Nature Reviews Genetics, vol. 2, no. 12,pp. 919-929, December 2001.