

DETECTION OF PERIODICITIES IN GENE SEQUENCES: A MAXIMUM LIKELIHOOD APPROACH

Raman Arora and William A. Sethares

Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engineering Drive, Madison WI 53706, [email protected], [email protected]

ABSTRACT

A novel approach is presented to the detection of homological, eroded and latent periodicities in DNA sequences. Each symbol in a DNA sequence is assumed to be generated from an information source with an underlying probability mass function (pmf) in a cyclic manner. The number of sources can be interpreted as the periodicity of the sequence. Maximum likelihood estimates are developed for the pmfs of the information sources as well as for the period of the DNA sequence. The statistical model can also be utilized for building probabilistic representations of RNA families.

I. INTRODUCTION

The structural features of DNA sequences have biological implications [1]. One such structural feature is symbolic periodicity. Periodicities in gene sequences have been linked with the evolution of the genome and with protein structure [1], [2], [3]. DNA sequences exhibit homological, eroded and latent periodicities. Homological periodicity occurs when short fragments of DNA are repeated in tandem to give periodic sequences [4]. Most current approaches for finding periodicities transform the symbolic DNA sequence to a numerical sequence [5], [6], [7]; these techniques are primarily aimed at the detection of homological periodicities.

Some researchers have also explored the detection of imperfect or eroded periodicities, which model a sequence of similar units that are repeated but with some changes. The homology between repeated units in an eroded sequence is therefore not perfect [4]. Imperfect periodicity may occur in strands of DNA due to changes or erosion of nucleotides.

The periodicity in DNA sequences may also be modeled as latent periodicity [4]; for instance, an observed period of nucleotides may be (A/C)(T/G)(T/A)(G/T)(C/G/A)(G/A), i.e. the first nucleotide of a period may be A or C, followed by a T or G, and so on. Such hidden periodicities may not be found efficiently by algorithms developed for finding homological and eroded periodicities [2]. Latent periodicity detection was studied in [7], [8], and latent periodicities of some human genes were reported.

This paper presents a novel approach to finding latent periodicities in DNA sequences that parallels the extraction of beat information from low level audio features in [9]. Each symbol of the sequence is assumed to be generated by an information source with some underlying probability mass function, and the sequence is generated by drawing symbols from these sources in a cyclic manner. The number of sources is equal to the latent period of the sequence, and the latent periodicity is the same as the statistical periodicity or cyclostationarity. The paper presents maximum likelihood estimates of the pmfs and the period. The symbolic sequences are not transformed into numerical sequences, and the method presented here is capable of finding all three kinds of periodicities: homological, eroded and latent.

The problem of detecting latent periodicities in symbolic sequences is formulated mathematically in the next section. The maximum likelihood estimate of the period is developed in section III and results are discussed in section IV. A short discussion on building probabilistic representations for non-coding RNAs is presented in section V.

II. STATISTICAL PERIODICITY

The statistical periodicity model that is employed here to discover possibly hidden periodicities in gene sequences does not assume that the sequence itself is periodic. Instead, it is assumed that there is a periodicity in the underlying statistical distributions, which is locked to a known periodic grid.

A given DNA sequence $D = [D_1, \ldots, D_N]$ can be denoted by the mapping $D: \mathbb{N} \to \mathcal{S}$, from the natural numbers to the alphabet $\mathcal{S} = \{A, G, C, T\}$. Assume that the statistical periodicity of the sequence $D$ is $\mathcal{T}$. This implies there are $\mathcal{T}$ information sources (or random variables) denoted as $X_1, \ldots, X_{\mathcal{T}}$. The random variable $X_i$ takes values on the alphabet $\mathcal{S}$ according to an associated probability mass function $P_i$; it generates the $j$th symbol $S_j$ of $\mathcal{S}$ with probability $P_i(j) = \mathbb{P}(X_i = S_j)$ for $j = 1, \ldots, |\mathcal{S}|$, where $|\mathcal{S}|$ is the cardinality of the alphabet (which is four for DNA sequences).

The number of complete statistical periods in $D$ is $M = \lfloor N/\mathcal{T} \rfloor$. Define $\bar{i} = (i \bmod \mathcal{T})$, with a remainder of zero identified with source $X_{\mathcal{T}}$ so that $\bar{i} \in \{1, \ldots, \mathcal{T}\}$. Then for $1 \le i \le N$, the symbol $D_i$, i.e. the $i$th symbol in the sequence $D$, is generated by the random variable $X_{\bar{i}}$. The random variables $X_i$, $i = 1, \ldots, \mathcal{T}$, are assumed to be independent. The structural parameters $P_1, \ldots, P_{\mathcal{T}}$ and the timing parameter $\mathcal{T}$ are unknown. Define $\Theta = [\mathcal{T}, P_1, \ldots, P_{\mathcal{T}}]$. The search space for the parameter $\mathcal{T}$ is the set $\mathcal{B} = \{1, \ldots, N_0\}$ for some $N_0 < N$, and for the pmfs $[P_1, \ldots, P_{\mathcal{T}}]$ the search space is the subset $\mathcal{Q} \subset [0, 1]^{|\mathcal{S}| \times \mathcal{T}}$ of column stochastic matrices (for $P \in \mathcal{Q}$, $P_{ji} \in [0, 1]$ and $\sum_{j=1}^{|\mathcal{S}|} P_{ji} = 1$ for $i = 1, \ldots, \mathcal{T}$). Let $\wp = \mathcal{B} \times \mathcal{Q}$ denote the search space for the parameter $\Theta$. Given the data, the maximum a posteriori (MAP) estimate of the parameter $\Theta$ is

$$\Theta^* = \arg\max_{\Theta \in \wp} \mathbb{P}(\Theta \mid D).$$

By Bayes rule the posterior probability is

$$\mathbb{P}(\Theta \mid D) = \frac{\mathbb{P}(D \mid \Theta)\, \mathbb{P}(\Theta)}{\mathbb{P}(D)} \tag{1}$$

where, by independence of the $X_i$'s,

$$\mathbb{P}(D \mid \Theta) = \prod_{i=1}^{N} \mathbb{P}(X_{\bar{i}} = D_i \mid \Theta) \tag{2}$$

is the likelihood. Note that the probability $\mathbb{P}(D) = \int_{\wp} \mathbb{P}(D \mid \Theta)\, \mathbb{P}(\Theta)\, d\Theta$ is a constant and thus, assuming a uniform prior on $\Theta$,

$$\Theta^* = \arg\max_{\Theta \in \wp} \mathbb{P}(D \mid \Theta). \tag{3}$$

In words, the MAP estimate is the same as the maximum likelihood estimate.

1-4244-0999-3/07/$25.00 ©2007 IEEE.
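The generative model of this section can be illustrated with a short sampling sketch. This is only an illustration under stated assumptions, not code from the paper: the function name `sample_cyclic_sequence`, the use of NumPy, the fixed seed and the choice of equal probabilities on each allowed symbol are all hypothetical. Columns of the matrix are the source pmfs, and the zero-based index `n % period` plays the role of the one-based source index used in the text.

```python
import numpy as np

ALPHABET = "AGCT"  # symbol order S_1..S_4, following the alphabet {A, G, C, T}


def sample_cyclic_sequence(pmfs, length, seed=0):
    """Draw `length` symbols, cycling through the columns of `pmfs`.

    pmfs has shape (|S|, T); column i is the pmf of one information source.
    """
    pmfs = np.asarray(pmfs, dtype=float)
    if not np.allclose(pmfs.sum(axis=0), 1.0):
        raise ValueError("each column of pmfs must sum to 1")
    rng = np.random.default_rng(seed)
    period = pmfs.shape[1]
    symbols = [
        ALPHABET[rng.choice(len(ALPHABET), p=pmfs[:, n % period])]
        for n in range(length)
    ]
    return "".join(symbols)


# Latent period (A/C)(T/G)(T/A)(G/T)(C/G/A)(G/A) from the introduction,
# assuming equal probability on each allowed symbol (rows: A, G, C, T).
latent = np.array([
    [1/2, 0,   1/2, 0,   1/3, 1/2],  # A
    [0,   1/2, 0,   1/2, 1/3, 1/2],  # G
    [1/2, 0,   0,   0,   1/3, 0],    # C
    [0,   1/2, 1/2, 1/2, 0,   0],    # T
])
seq = sample_cyclic_sequence(latent, 24)
```

The sampled string is not periodic as a sequence of symbols, yet every position respects the pmf of its source, which is the sense in which the periodicity is statistical.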

III. THE MAXIMUM LIKELIHOOD ESTIMATE

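This section develops the maximum likelihood estimate (MLE) for the unknown parameter $\Theta$. The data sequence $D = [D_1, \ldots, D_N]$ is represented by a sequence of vectors $W = [w_1, \ldots, w_N]$, where each $w_i$ is an $|\mathcal{S}| \times 1$ vector with $w_{ji} = 1 \iff D_i = S_j$ (and $w_{ji} = 0$ otherwise). So, if the $i$th symbol in the sequence $D$ is C, i.e. the third symbol of the alphabet $\mathcal{S}$, then the $i$th vector $w_i$ in the sequence $W$ is $[\,0\ 0\ 1\ 0\,]'$. Also define an $|\mathcal{S}| \times \mathcal{T}$ stochastic matrix $A$ with entries $A_{ji} = \mathbb{P}(X_i = S_j)$. The columns of the matrix $A$ denote the pmfs of the information sources; the entry $A_{ji}$ denotes the probability that the $i$th source generates the $j$th symbol of the alphabet $\mathcal{S}$. Write the unknown parameter as $\Theta = [A, \mathcal{T}]$. This notation simplifies the derivation of the MLE by noting that, with $\bar{i}$ the source index defined in section II,

$$\mathbb{P}(X_{\bar{i}} = D_i \mid A, \mathcal{T}) = \prod_{j=1}^{|\mathcal{S}|} (A_{j\bar{i}})^{w_{ji}}.$$

The likelihood can therefore be written as

$$\mathbb{P}(W \mid A, \mathcal{T}) = \prod_{i=1}^{N} \mathbb{P}(X_{\bar{i}} = D_i \mid A, \mathcal{T}) = \prod_{i=1}^{N} \prod_{j=1}^{|\mathcal{S}|} (A_{j\bar{i}})^{w_{ji}} = \prod_{k=1}^{M} \prod_{i=1}^{\mathcal{T}} \prod_{j=1}^{|\mathcal{S}|} (A_{ji})^{w_{j\,i(k)}} \times \prod_{i=1}^{N - M\mathcal{T}} \prod_{j=1}^{|\mathcal{S}|} (A_{ji})^{w_{j\,i(M+1)}} \tag{4}$$

where $i(k) = (k - 1)\mathcal{T} + i$. Note that the first term on the right hand side of (4) captures the observations in the $M$ complete periods (given the period $\mathcal{T}$), while the second product captures the observations over the last, incomplete cycle. The log-likelihood is

$$\log \mathbb{P}(W \mid A, \mathcal{T}) = \sum_{k=1}^{M} \sum_{i=1}^{\mathcal{T}} \sum_{j=1}^{|\mathcal{S}|} w_{j\,i(k)} \log (A_{ji}) + \sum_{i=1}^{N - M\mathcal{T}} \sum_{j=1}^{|\mathcal{S}|} w_{j\,i(M+1)} \log (A_{ji}). \tag{5}$$

For a fixed $\mathcal{T}$, the MLE for $A$ is

$$A^*_{\mathcal{T}} = \arg\max_{A \in \mathcal{Q}} \log \mathbb{P}(W \mid A, \mathcal{T}). \tag{6}$$

The log-likelihood in (5) is a concave function of the variables $A_{ji}$. Also, note that these variables satisfy the constraint $\sum_{j=1}^{|\mathcal{S}|} A_{ji} = 1$ for $i = 1, \ldots, \mathcal{T}$. Constrained optimization using Lagrange multipliers gives the $(j, i)$th element of the matrix $A^*_{\mathcal{T}}$ as

$$A^*_{ji} = \begin{cases} \dfrac{1}{M+1} \sum_{k=1}^{M+1} w_{j\,i(k)}, & i = 1, \ldots, N - M\mathcal{T}, \\[2ex] \dfrac{1}{M} \sum_{k=1}^{M} w_{j\,i(k)}, & i = N - M\mathcal{T} + 1, \ldots, \mathcal{T}, \end{cases} \tag{7}$$

for $j = 1, \ldots, |\mathcal{S}|$. The estimates of the parameters $A$ can then be used to determine the MLE for the period $\mathcal{T}$,

$$\mathcal{T}^* = \arg\max_{\mathcal{T} \in \mathcal{B}} \log \mathbb{P}(W \mid A^*_{\mathcal{T}}, \mathcal{T}). \tag{8}$$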
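The estimates in (7) and (8) amount to counting symbols per source and scanning over candidate periods. The following is a minimal sketch under that reading, not the authors' implementation; the function names `max_log_likelihood` and `estimate_period` are hypothetical. For a fixed period, the symbols at positions `phase, phase + period, ...` are those emitted by one source, the relative frequencies in that slice are the entries of (7), and substituting them into (5) gives the maximised log-likelihood.

```python
import math
from collections import Counter


def max_log_likelihood(seq, period):
    """Log-likelihood (5) evaluated at the closed-form estimate (7)."""
    total = 0.0
    for phase in range(period):
        column = seq[phase::period]  # symbols emitted by one source
        counts = Counter(column)
        for count in counts.values():
            # The estimated pmf entry is the relative frequency count / len(column).
            total += count * math.log(count / len(column))
    return total


def estimate_period(seq, max_period):
    """Return (period, profile); profile[T] is the maximised log-likelihood for T."""
    profile = {T: max_log_likelihood(seq, T) for T in range(1, max_period + 1)}
    best = max(profile, key=profile.get)  # the first maximiser wins ties
    return best, profile


best, profile = estimate_period("ACG" * 20, 10)
```

One caveat on the design: the maximised log-likelihood cannot decrease when a period is replaced by one of its multiples, because the larger model contains the smaller one. On noisy data a plain argmax may therefore land on a multiple of the underlying period, so it is usually more informative to inspect the whole profile for peaks.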

IV. RESULTS

The method in the previous section was applied to a variety of simulated symbolic sequences and to chromosome XVI of S. cerevisiae in order to detect periodicities. A homological symbolic sequence from the set $\mathcal{S} = \{A, G, C, T\}$ with period $\mathcal{T} = 7$ was generated. The sequence was eroded by changing the symbols at randomly chosen points in the sequence. The algorithm was tested with various degrees of erosion. The plots in Fig. 1 strongly support a statistical periodicity of 7 even with 85% erosion. The noise floor in the plots increases (i.e. the heights of the peaks decrease) with increased erosion. Note that a $\mathcal{T}$-periodic sequence also shows $k\mathcal{T}$-periodicity for any positive integer $k$.
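The erosion experiment can be reproduced in outline as follows. This is a hedged sketch rather than the setup used for Fig. 1: the repeat unit, the random seed, the helper names `erode` and `log_likelihood_profile`, and the erosion rule (each selected position is overwritten with a uniformly drawn symbol, which may coincide with the original) are all assumptions. Symbols are coded as the integers 0 to 3.

```python
import numpy as np


def erode(codes, fraction, rng, alphabet_size=4):
    """Overwrite a random `fraction` of positions with uniformly drawn symbols."""
    eroded = codes.copy()
    n_hits = int(round(fraction * len(codes)))
    hits = rng.choice(len(codes), size=n_hits, replace=False)
    eroded[hits] = rng.integers(alphabet_size, size=n_hits)
    return eroded


def log_likelihood_profile(codes, max_period, alphabet_size=4):
    """Maximised log-likelihood for each candidate period 1..max_period."""
    profile = np.empty(max_period)
    for period in range(1, max_period + 1):
        total = 0.0
        for phase in range(period):
            counts = np.bincount(codes[phase::period], minlength=alphabet_size)
            nonzero = counts[counts > 0]
            total += float(np.sum(nonzero * np.log(nonzero / nonzero.sum())))
        profile[period - 1] = total
    return profile


rng = np.random.default_rng(0)
unit = np.array([0, 1, 2, 3, 0, 2, 1])       # arbitrary 7-symbol repeat unit (assumed)
clean = np.tile(unit, 6400 // 7 + 1)[:6400]  # homological sequence, 6400 symbols
profile = log_likelihood_profile(erode(clean, 0.5, rng), 25)
```

In this sketch `profile[6]` is the value for period 7; plotting `profile` against the periods 1 to 25 gives a curve of the same kind as the panels of Fig. 1, with peaks at 7 and its multiples.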

Figure 2(a) shows the results for a simulated symbolic sequence with latent periodicity, where a single period is (A/C)(T/G)(T/A)(G/T)(C/G/A)(G/A). This was generated by six information sources, $X_1, \ldots, X_6$, with $X_1$ generating A or C each with equal probability and $X_5$ generating A, G or C each with probability 1/3. Applying the technique of Sect. III to this sequence results in a plot showing strong six-periodic behaviour. In contrast, when a random sequence is used (i.e. when each source generates all symbols with equal frequency), Fig. 2(b) shows that no significant periodicities are detected.

The algorithm was also tested with the protein coding region of chromosome XVI of S. cerevisiae (GenBank accession number NC_001148). The 2160 base-pair (bp) long sequence (from bp 85 to bp 2244) shows a latent periodicity of period three, as plotted in Fig. 2(c). The period-3 behaviour of protein coding genes is expected since amino acids are coded by trinucleotide units called codons [7], [10]. For comparison, the symbolic sequence is transformed into a numerical sequence as in [7] and the magnitude of the 2160-point DFT is plotted in Fig. 2(d). The peaks at $f_1 = 720$, $f_2 = 360$ and $f_3 = 180$ correspond to 3-, 6- and 12-periodic behaviour respectively.

[Figure 1: four panels titled "Homological Periodicity", "50% erosion", "75% erosion" and "85% erosion"; horizontal axis: period.]

Figure 1. Maximum log-likelihood of data plotted against period for a simulated symbolic sequence of length 6400 symbols with period 7: (a) homological periodic sequence (b) 50% eroded sequence (c) 75% eroded sequence (d) 85% eroded sequence.

[Figure 2: four panels titled "(a) (A/C)(T/G)(T/A)(G/T)(C/G/A)(G/A)", "(b) random sequence", "(c) Chromosome XVI of S. cerevisiae" and "(d) Spectrum of Chromosome XVI"; horizontal axis: period for (a)-(c), frequency for (d).]

Figure 2. (a) Maximum log-likelihood of data plotted against period for a simulated symbolic sequence of length 6400 symbols with latent periodicity 7, (b) maximum log-likelihood versus period for a completely random symbolic sequence, (c) maximum log-likelihood plotted against period for the protein coding region of chromosome XVI of S. cerevisiae, (d) the magnitude of the DFT of the numerical sequence derived from the sequence in part (c).

V. IDENTIFYING NON-CODING RNAS

The central dogma of molecular biology states that DNA is transcribed to RNA and then translated to proteins. The genetic information therefore flows from DNA to protein through the RNA. However, besides playing the role of a passive intermediary messenger (mRNA), RNAs have been known to play important non-coding functions in the process of translation (tRNA, rRNA) [11]. Since these RNAs are not translated into proteins, they are called non-coding RNAs (ncRNAs) or RNA genes. Originally, such RNA genes were considered rare, but in the last decade many new RNA genes have been found and have been shown to play diverse roles: chromosome replication, protein degradation and translocation, regulating gene expression and many more. Thus RNA genes may play a much more significant role than previously thought. The number of ncRNAs in the human genome is on the order of tens of thousands and, considering the vast amount of genomic data, there is a need for computational methods for the identification of ncRNAs [10].

The statistical model presented in this paper for finding periodicities in symbolic sequences can be utilized for building probabilistic representations of RNA families. RNA has the same primary structure as DNA, consisting of a sugar-phosphate backbone with nucleotides attached to it. However, in RNA the nucleotide thymine (T) is replaced by uracil (U) as the base complementary to adenine (A). So, RNA is represented by a string of the bases A, C, G and U. RNA exists as a single-stranded molecule since the replacement of thymine by uracil makes RNA too bulky to form a stable double helix. However, the complementary bases (A and U, G and C) can form a hydrogen bond, and such consecutive base pairs cause the RNA to fold onto itself, resulting in 2-D and 3-D secondary and tertiary structures. A typical secondary structure is a hairpin structure as shown in Fig. 3(a); the consecutive base pairs that bond together get stacked onto each other to form a stem while the unpaired bases form a loop.

Typical methods employed for the identification of DNA gene sequences and proteins do not perform as well in the identification of ncRNAs because they are based on finding structural features (like periodicities) in primary sequences, whereas most functional ncRNAs preserve their secondary structures more than they preserve their primary sequences [10]. Therefore, the identification of ncRNAs needs techniques that also evaluate the similarity between secondary structures. Such techniques have been shown to be more effective in comparing and discriminating RNA sequences [12].

RNA sequences that undergo erosion or mutation preserve their secondary structure through compensatory mutations, as shown in Fig. 4. This causes strong pairwise correlations between distant bases in the primary RNA sequence. Unlike the techniques employed for DNA identification in earlier works, the approach presented here can describe such pairwise correlations. Consider a sequence of ncRNA molecules, tandem repeats of which have undergone random compensatory mutations as shown in Fig. 4. According to the statistical periodicity model presented in this paper, the sources generating the symbols that do not bond (nucleotides in the loop of a hairpin ncRNA) have a point-mass pmf. On the other hand, the sources corresponding to a bonded base pair have conjugate distributions (in Fig. 4 the pmfs of N and N' agree on complementary bases). These conjugate distributions capture the distant base-pair correlations. The sequence of sources corresponding to the ncRNA molecules in Fig. 4, after identifying the bonded nucleotides with conjugate sources, is $X_1, X_2, X_3, X_4, X_3, X_2, X_1$. The statistical periodicity model presented here is therefore capable of describing primary as well as secondary structural similarities.

[Figure 3: three hairpin diagrams labelled RNA0, RNA1 ("structurally similar") and RNA2 ("structural mismatch"), with the stem and loop marked.]

Figure 3. (a) RNA0 has a hairpin secondary structure. (b) RNA1 is similar in structure to RNA0. It differs at two positions in the primary sequence from RNA0. (c) The RNA2 structure is not a hairpin; it has a structural mismatch with RNA0. RNA2 also differs at two positions in the primary sequence from RNA0, but it must be scored lower in similarity to RNA0 as compared to RNA1.

[Figure 4: compensatory mutation diagram with symbol correlations N = (A/C), N' = (U/G), M = (U/C), M' = (A/G).]

Figure 4. A given base pair in a ncRNA molecule undergoes compensatory mutation, i.e. if one of the nucleotides in a base pair mutates, the other nucleotide also changes to the complementary nucleotide. So there is a strong correlation between the base positions indicated by N and N' and between the base positions indicated by M and M'.

VI. ACKNOWLEDGEMENTS

The authors would like to thank Charu Gera for various helpful discussions on molecular biology.

VII. REFERENCES

[1] C. M. Hearne, S. Ghosh, and J. A. Todd, "Microsatellites for linkage analysis of genetic traits," Trends in Genetics, vol. 8, p. 288, 1992.

[2] E. V. Korotkov and D. A. Phoenix, "Latent periodicity of DNA sequences of many genes," in Proceedings of Pacific Symposium on Biocomputing 97, R. B. Altman, A. K. Dunker, L. Hunter, and T. Klein, Eds., Singapore-New-Jersey-London, 1997, pp. 222-229, World Scientific Press.

[3] E. V. Korotkov and N. Kudryaschov, "Latent periodicity of many genes," Genome Informatics, vol. 12, pp. 437-439, 2001.

[4] M. B. Chaley, E. V. Korotkov, and K. G. Skryabin, "Method revealing latent periodicity of the nucleotide sequences modified for a case of small samples," DNA Research, vol. 6, pp. 153-163, Feb. 1999.

[5] V. R. Chechetkin, L. A. Knizhnikova, and A. Y. Turygin, "Three-quasiperiodicity, mutual correlations, ordering and long modulations in genomic nucleotide sequences viruses," Journal of Biomolecular Structure and Dynamics, vol. 12, p. 271, 1994.

[6] E. A. Cheever, D. B. Searls, W. Karunaratne, and G. C. Overton, "Using signal processing techniques for DNA sequence comparison," in Proc. of the 1989 Fifteenth Annual Northeast Bioengineering Conference, Boston, MA, Mar. 1989, pp. 173-174.

[7] D. Anastassiou, "Genomic signal processing," IEEE Signal Processing Magazine, vol. 18, pp. 8-20, Jul. 2001.

[8] E. V. Korotkov and M. A. Korotova, "Latent periodicity of DNA sequences of some human genes," DNA Sequence, vol. 5, p. 353, 1995.

[9] W. A. Sethares, R. D. Morris, and J. C. Sethares, "Beat tracking of audio signals using low level audio features," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 275-285, Mar. 2005.

[10] B. J. Yoon and P. P. Vaidyanathan, "Computational identification and analysis of noncoding RNAs - unearthing the buried treasures in the genome," IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 64-74, Jan. 2007.

[11] M. S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman and Hall/CRC, first edition, 1995.

[12] S. R. Eddy, "Non-coding RNA genes and the modern RNA world," Nature Reviews Genetics, vol. 2, no. 12, pp. 919-929, Dec. 2001.