Chapter 6

On the Entropy of DNA: Algorithms and Measurements Based on Memory and Rapid Convergence*

Martin Farach, Dept. of CS, Rutgers Univ.
Michiel Noordewier, Dept. of CS, Rutgers Univ.
Serap Savari, Dept. of EE, MIT
Larry Shepp, AT&T Bell Labs
Abraham Wyner, Dept. of Stat., Stanford Univ.
Jacob Ziv, Dept. of EE, Technion

Abstract

We have applied the information theoretic notion of entropy to characterize DNA sequences. We consider a genetic sequence signal that is too small for asymptotic entropy estimates to be accurate, and for which similar approaches have previously failed. We prove that the match length entropy estimator has a relatively fast convergence rate and demonstrate experimentally that, by using this entropy estimator, we can indeed extract a meaningful signal from segments of DNA. Further, we derive a method for detecting certain signals within DNA, known as splice junctions, with significantly better performance than previously known methods.

1 Introduction

DNA carries the instructions for the operation of living organisms. The simple combinatorial structure of the DNA molecule has been an obvious lure to theorists interested in studying the way information is transmitted in living organisms. Most such attempts have had little or no success [12], to the point where some researchers have denied the utility of information theory, and more specifically information theoretic entropy, in the study of DNA [6].

The main result of this paper is that we find that the entropy of genetic material which is ultimately expressed in protein sequences is higher than that which is discarded. This is an unexpected result, since current biological theory holds that the discarded sequences ("introns") are capable of tolerating random changes to a greater degree than the retained sequences ("exons").

In this paper, we examine the utility of information theoretic tools in a classic setting, the intron/exon boundary problem (described below). Prediction of these boundaries in the sequence of DNA is an essential task if we are to predict the product of a gene. We give the first experimental verification of an entropic difference between introns and exons, based on novel entropy estimation methods with very fast convergence times. The convergence time is important since exons tend to be quite small; thus, the fact that methods exist which approximate entropy in the limit has proven to be useless for finding the entropy of genetic regions. Further, we give an entropy-based algorithm for intron/exon boundary detection. Our studies show that we achieve better false positive/false negative rates than previously known methods.

*Supported by DIMACS (Center for Discrete Mathematics and Theoretical Computer Science), a National Science Foundation Science and Technology Center under NSF contract STC-8809648.

Our work suggests that entropy is a useful tool in exploring DNA. We suggest some reasons why previous researchers failed to detect significant variance in the entropy of genomic regions, and offer a cautionary tale to those interested in using entropy-based tools in settings where asymptotic complexity breaks down. Finally, we provide a discussion of the role of a proper mathematical treatment of entropy in a biological setting.

1.1 Biology DNA and proteins are polymers, constructed of subunits known as nucleotides and amino acids, respectively. The sequence of each protein is a function of a DNA sequence which serves as the "gene" for that protein. The cellular expression of proteins proceeds by the creation of a "message" copy from the DNA template into a closely related molecule known as RNA (Figure 1). This RNA is then translated into a protein.

Figure 1: The flow of information in a eukaryotic cell (DNA → precursor mRNA → mRNA after splicing → protein → folded protein).

One of the most unexpected findings in molecular biology is that large pieces of the RNA are removed before it is translated further [2]. The majority of eukaryotic¹ genes display a complex structure in which sequences which code for protein are interrupted by intervening, non-coding sequences. Initial transcription of these genes results in a pre-message RNA molecule from which segments must be accurately removed to produce a translatable message.

The retained sequences (represented by boxes in Figure 1) are known as exons, while the removed sequences are known as introns.

¹Eukaryotic cells contain nuclei, unlike prokaryotic cells such as bacteria and viruses. See [16] for a general introduction to molecular biology.

Exons tend to be no more than 200 characters long, while introns can be many tens of thousands of characters long. Thus the majority of a typical eukaryotic gene will consist of intron regions. Since the discovery of such "split genes" over a decade ago, the nature of the splicing event has been the subject of intense research (for a recent study see [10, 14]). RNA precursors contain patterns similar to those in Figure 2. The points at which RNA is removed (the boundaries of the boxes in Figure 2) are known as splice-junctions. Evolutionary conservation of splice junctions is commonly observed by the construction of a "consensus" sequence, a process by which many sequences are aligned and a composite subsequence is created by taking the majority base at each position. Such a composite is used both to support biological inferences and as a discriminant tool for the recognition of intron/exon boundaries in raw sequence data now being produced by the various genome projects.

Figure 2: Splice junction consensus (exon-intron-exon boundaries).

One feature which will be important in our detection algorithm (Section 4) is the fact that introns almost always begin with a GT and end with an AG. However, numerous other locations can resemble these canonical patterns. As a result, these patterns do not by themselves reliably imply the presence of a splice-junction. Evidently, if junctions are to be recognized on the basis of sequence information alone, longer-range sequence information will have to be included in the decision-making criteria. A central problem is therefore to determine the extent to which sequences surrounding splice-junctions differ from sequences surrounding spurious analogues.

Finally, while mutations are thought to occur with approximately the same frequency in introns and exons, exons are subject to greater selective pressure than introns. Thus, it makes good intuitive sense that their entropy may be different, since they are subject to different random processes, and entropy is understood to be a measure of randomness.

However, little work has been done to examine differences between entropies of introns and exons. Konopka and Owens [8] estimated a value they termed "Local Compositional Complexity", which corresponds to the one-dimensional entropy of a DNA sequence. This parameter, based on frequency counts of nucleotides from a large number of different sources, showed a maximum value for introns.

1.2 Entropy Information entropy was introduced by Shannon [11]. We first provide some definitions. Let $(X_1, X_2, \ldots) = X_1^\infty$ be a stochastic process with probability law $P$. For every positive integer $l$, and every possible sequence of outcomes (from the alphabet $\mathcal{A}$) $x_1^l \in \mathcal{A}^l$, define the probability or likelihood function to be $P(x_1^l) = \Pr\{X_1^l = x_1^l\}$. Thus $P$ maps every sequence to its probability under $P$. The entropy, $H(P)$, is defined as the following limit:

$H(P) = \lim_{l \to \infty} -\frac{1}{l}\, E\!\left[\log_2 P(X_1^l)\right].$

The notion of entropy can also be expanded to account for memory. For example, let us consider the chance variables $U$ and $V$. The conditional entropy of $U$ given that $V = v$ is defined by

$H(U \mid v) = -\sum_{u \in \mathcal{U}} \Pr(u \mid v) \log_2 \Pr(u \mid v)$

and the conditional entropy of $U$ with respect to $V$ is defined by

$H(U \mid V) = E_V\!\left[H(U \mid v)\right],$

where $E_V$ denotes the operation of taking an expectation with respect to the elements of $V$. It is known that for any random variable $V$, $H(U \mid V) \le H(U)$, with equality if and only if $U$ and $V$ are statistically independent (see, for example, [5]).

Confusion about the notion of information theoretic entropy has led to incorrect conclusions in previous work. Hariri, et al. [6] examine the efficacy of several poor entropy estimators, and misapply the methods to individual DNA sequences. Furthermore, they draw an invalid conclusion from a non-existent correspondence between the chemical entropy and the information theoretic entropy of a DNA molecule. Cohen and Stewart [4] confuse entropy and information content. This is a fairly usual mistake that occurs when "context" is considered, and Shannon realized that we could only have a mathematical theory of information in a context-free universe. The appropriate approach is to consider, as Shannon did, sequences to be outcomes of stochastic processes and then to estimate the entropy of the distribution (or distributions) from which the observed data would be typical.

We have chosen to consider entropy because it is a natural measure of the following phenomena: complexity, compressibility, predictivity, and randomness. A common feature of all these qualities is that they are all properties of individual sequences, even those that are not thought of as outcomes of a random process. It was the genius of Shannon to consider "messages" to be random quantities, which allowed him to proceed with a general theory with far-reaching consequences, even though the theory was truly inapplicable in most situations. Accordingly, when we now consider the entropy of sequences, it becomes important to point out that entropy is a property of distributions, not, at least directly, a property of individual sequences.

1.3 Our Results Our results are two-fold. First, we consider the question of whether there is an information theoretic difference between introns and exons. Having noted the previous failure by researchers to find such a distinction, we nonetheless tried standard methods for estimating the entropies of sequences, i.e. compression, and character singleton and tuple distributions. Not surprisingly, we found no statistically significant difference between these measures for introns and exons. We noted, however, that exons are quite short, and our estimators may not have enough time to converge to the correct entropy.

• In Section 2, we give theorems which show that another entropy estimator, the match-length estimator, has a significantly faster convergence rate. Indeed, we show that a significant difference exists, both qualitatively and quantitatively, in the intron and exon entropies using this scheme. The experimental results are described in Section 3.

The results in Sections 2 and 3 do not, however, imply a method for finding intron/exon boundaries, and as noted above, the false positive rate of the consensus sequence method (as well as methods based on neural nets [3]) makes them virtually useless for this task. One of the main features of consensus sequences is that they are memoryless models of canonical sequences; that is, they fail to take into account correlations between positions surrounding the splice junction.

• In Section 4, we use conditional entropy as a tool for deciding which bits of information surrounding a splice junction are relevant, and we proceed to derive an algorithm for splice-junction detection. Our experiments show that our method is significantly superior, in terms of the sensitivity/specificity tradeoff, to previous methods for detecting intron/exon boundaries.

1.4 Data Sequence data were obtained from GenBank release 80.0, and included only human genes which were described as "complete coding sequences". Within these sequences were 1275147 bases, with 659 introns and 669 exons. The average exon size was 184 bases with a standard deviation of 96, and the average intron size was 867 bases with a standard deviation of 583. The median lengths, however, are much shorter: exons and introns have median lengths of 139 and 434, respectively. This fact is important in the methods described below with respect to choosing the window size ($N_w$). Since most exons are quite a bit shorter than the average, the window size must be kept small in order to make sure that data is not lost.

2 Methods of Entropy Estimation

There are several common methods of estimating the entropy of a random process; we will describe several of them. The most straightforward would be to attempt a direct computation of the expected log of the empirical distribution function. Using this approach, the entropy estimate would be only as accurate as the estimate of the probability of $n$-tuples for $n$ large. It should be clear that this technique is generally impractical, since in most cases the amount of data is insufficient to achieve a good estimate of all but the marginal (first order) distribution and perhaps the distribution of pairs. If, however, it is possible to determine a priori that the process is of small Markov order, then such a scheme may provide satisfactory results.

Another popular choice involves data compression schemes. Such a procedure would involve compressing the data, measuring the total compression, and thereby determining an upper bound for the entropy of the process. If the algorithm is universal, then the compression ratio will approach the entropy as the size of the dataset increases. This method provides a startling advantage over the previous one: since it is not based on a model, no specific underlying structure of the process need be assumed. However, the approach, although very efficient in compressing data, is seriously impaired by a slow rate of convergence when used to estimate entropies.

For example (and for context) we will explain how a version of the Lempel-Ziv data compression algorithm can be used to estimate entropy. Although there are many modifications of the original algorithm, they are all sufficiently alike in spirit to consider a representative. In fact, all versions reflect a common usage of string matching and pattern frequency, as we demonstrate in the following example:

EXAMPLE 1. (LZ ALGORITHM [19]) Consider a sequence of binary data, e.g. (0010111000101). We parse the sequence into unique phrases by placing commas after every contiguous substring that completes a "new" pattern. This pattern forms a phrase which becomes part of the "dictionary" of patterns, with each new phrase formed by searching left to right down the sequence to find the shortest contiguous substring that is not already in the dictionary. For example, the sequence above would be parsed into $\{0, 01, 011, 1, 00, 010, \ldots\}$. Notice that every phrase is unique, and that new phrases are formed by extending previous phrases by exactly one symbol. Let $C_n$ be the number of commas formed in an LZ parse of a sequence of length $n$. Ziv and Lempel [19] have shown that the quantity $\frac{C_n \log C_n}{n} \to H$ as $n \to \infty$, which makes it an obvious candidate for an entropy estimate.

Indeed, it is a popular choice since it is easy to implement and is universally applicable (empirically it performs very robustly in situations that are neither stationary nor ergodic, but we will return to this point later). Furthermore, the string matching concept that lies behind this approach is intuitively appealing as a complexity measure, since it quantitatively captures repetitive structure. The drawbacks to this scheme include a notoriously slow rate of convergence due to the large number of observations needed to build the dictionary of patterns. Furthermore, there are very few known distributional properties (necessary to any statistical application). It has been shown [1] that $C_n$ is asymptotically normal, but this was proved only for memoryless sources with $p = .5$ (this case appears later as a pathological example).
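
As a concrete illustration of the parsing rule in Example 1, the following is a minimal sketch (not the authors' implementation; the function names are ours) of the incremental LZ parse and the resulting estimate $C_n \log_2 C_n / n$:

```python
from math import log2

def lz_parse_count(seq: str) -> int:
    """C_n: number of phrases in an incremental (LZ78-style) parse of seq."""
    dictionary = set()
    phrase = ""
    count = 0
    for symbol in seq:
        phrase += symbol
        if phrase not in dictionary:   # shortest substring not yet in the dictionary
            dictionary.add(phrase)
            count += 1
            phrase = ""
    if phrase:                         # a trailing, already-seen phrase still ends the parse
        count += 1
    return count

def lz_entropy_estimate(seq: str) -> float:
    """Estimate the entropy rate (bits/symbol) as C_n * log2(C_n) / n."""
    c_n = lz_parse_count(seq)
    return c_n * log2(c_n) / len(seq)

print(lz_parse_count("0010111000101"))     # 7: phrases 0, 01, 011, 1, 00, 010, and a trailing 1
print(lz_entropy_estimate("0010111000101"))
```

On a string this short the estimate is of course far from the true entropy, which is exactly the slow-convergence problem discussed above.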

2.1 A Different Approach Another technique is based on the Fixed-Database LZ algorithm, a method that very closely resembles practical versions that are widely in use, and that is very similar in spirit to all other versions of the algorithm (through their common usage of string matching). Let $X_1^\infty$ be the data source whose entropy we wish to estimate (or which we wish to compress). We assume that we have a "database", $D_n$, of $n$ observations $X_{-n+1}^{0}$. We define the longest match into $D_n$ of the incoming sequence $X_1, X_2, \ldots$ by

$L = \inf\{k : X_1^{k+1} \not\subseteq D_n\}$

where $\subseteq$ means "is a contiguous substring of". For example, if

$D_n = \{0011010011000100\}$

with $n = 16$ and $X_1, X_2, \ldots = (0100100\ldots)$, then $L = 5$, since the string $(01001) \subseteq D_n$ but the extension $(010010) \not\subseteq D_n$. The following theorems (see [18] for proofs) provide insight as to how $L$, the longest match, can be used to estimate entropy (an alternative proof of Theorem 3 appears in [17]).

THEOREM 1. If $\{X_k\}$ is Uniform$[\mathcal{A}]$ i.i.d., then for any positive integers $l$ and $n$,

$\Pr\{L < l + \log n\} \approx \exp(-2^{-l}).$

A surprising result shows the uniform case to be a remarkable pathology:

THEOREM 2. Let $\{X_k\}$ be a stationary, ergodic source with finite memory that is not uniform i.i.d.; that is, $\Pr\{X_k = x_k \mid X_{-\infty}^{k-1} = x_{-\infty}^{k-1}\} = \Pr\{X_k = x_k \mid X_{k-m}^{k-1} = x_{k-m}^{k-1}\}$ for some finite $m$. In such a situation the asymptotic distribution of $L$ is normal with mean $\mu = \frac{\log n}{H}$ and variance $\frac{\sigma^2 \log n}{H^3}$, where $\sigma^2$ is the variance of $\log P(X)$, defined as

$\sigma^2 = \lim_{n \to \infty} \frac{\operatorname{Var}\!\left(-\log P(X_1^n)\right)}{n}.$

The last theorem provides us with the first moment of L as well as the relationship of L to the entropy H:

THEOREM 3. As $n \to \infty$, $\left| E[L] - \frac{\log n}{H} \right| = O(1)$.

In light of these theorems we form an estimate of the entropy of the process based on the length of repeated patterns. This scheme is potentially better than the data-compression scheme based on the length of repeated patterns when used as an entropy estimator, since we apply it repeatedly, once to each incoming letter.

2.2 A Sliding Window Entropy Estimate Choose a positive integer $N_w$. This parameter will be the size of a "window" of observations that will serve as the database into which we will reference incoming data to find the longest match. Given the sequence $X_1, X_2, \ldots$, define, for every index $i$, the longest match of the string of observations to the right of $i$ into the string of observations in a window of size $N_w$ to the left of $i$. Formally, let

$L_i = \min\{k : X_{i+1}^{i+k+1} \not\subseteq X_{i-N_w+1}^{i}\}$   (2.1)

This defines a sequence of random variables $\{L_i\}$. Theorem 3 suggests the following entropy estimator:

$\hat{H} = \frac{\log N_w}{\bar{L}}$   (2.2)

where $\bar{L}$ is the average of the $L_i$. Theorems 1 and 2 suggest that this convergence takes place with an error that is $O(\frac{1}{\log N_w})$. There are therefore two sources of statistical error in the estimate: (1) the standard error of $\bar{L}$ for fixed $N_w$ (ameliorated by large datasets), and (2) the bias term of $O(1/\log N_w)$ from Theorem 3.

This error comes from the fact that we do not know the length of the memory of the process. If the memory is greater than $N_w$, we are losing a source of predictivity. However, the error term is only a bound on how extensive that source of error can be, and indeed we have no prior reason to suspect bias at all.

There are no clear rules to follow in selecting the size of the window. For a data set of $n$ samples, there are roughly $n/N_w$ "independent" match lengths. This implies that a prudent choice would be to let $\log N_w$ be approximately equal to the standard error of $\bar{L}$. This can surely not be a hard-and-fast rule, since there is some reason to believe that the $O(1)$ bound in Theorem 3 may be, in many cases, quite small.
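
The following is a minimal sketch (not the authors' code; the function names, the brute-force match search, and the default window of 16 are our choices) of the sliding-window match-length estimator of Equations (2.1) and (2.2):

```python
from math import log2

def match_lengths(seq: str, window: int = 16) -> list[int]:
    """L_i: longest match of the text after position i into the previous `window` symbols."""
    lengths = []
    for i in range(window, len(seq)):
        db = seq[i - window:i]             # the N_w observations to the left of position i
        k = 0
        # grow the match while the next k+1 symbols still occur contiguously in the window
        while i + k < len(seq) and seq[i:i + k + 1] in db:
            k += 1
        lengths.append(k)
    return lengths

def sliding_window_entropy(seq: str, window: int = 16) -> float:
    """Equation (2.2): estimate the entropy (bits/symbol) as log2(N_w) / mean(L_i)."""
    ls = match_lengths(seq, window)
    return log2(window) / (sum(ls) / len(ls))

import random
random.seed(0)
iid = "".join(random.choice("ACGT") for _ in range(5000))
print(sliding_window_entropy(iid))          # roughly 2 bits for uniform i.i.d. letters
print(sliding_window_entropy("AC" * 2500))  # far lower for a highly repetitive sequence
```

For the experiments in Section 3 the window is 16 symbols and the input is pure intron or pure exon material.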

We point out several assumptions in our analysis:

• The entropy measure we are using only approximates an entropy measure, due to the fact that the memory of the "process" is longer than our $N_w$.

• DNA is not stationary. This is probably a reason why the entropy estimator does better: it is more robust to weaker conditions, and non-stationary processes cannot be characterized by entropy.

• DNA is not a random process. Mathematics serves only as a guide in looking for a useful statistic; we do not suggest that we have characterized the "entropy of DNA".

3 Results of Entropy Estimation

In our preliminary studies, the use of Lempel-Ziv compression suggested different entropy estimates for exons versus introns, and indicated that the variability of introns was higher than that of exons. However, because the new entropy estimation methods described below converge faster, we now have more reliable evidence. This follows from the fact that Lempel-Ziv carries out a string-matching operation only once per phrase, while the new method does it for every letter.

We will use the sliding window entropy estimate on the genetic material of introns and exons. To do this, we will consider DNA to be a stochastic source defined on sequences of letters from the alphabet $\{A, C, G, T\}$.

To this end, let $X_1, X_2, \ldots$ be a sequence of bases representing a pure sequence of either exon or intron material. We will apply the sliding window entropy estimator to our experimental set of exon or intron material. Recall that we define, for any choice of $N_w$, the match lengths $L_i$ as in Equation (2.1). This defines the longest match of the sequence at position $i$ with respect to the $N_w$ neighbors to the left. We let $\bar{L}$ be the average match length. It is hoped that the difference in the makeup of the exons and the introns will be reflected in the average sliding match length statistic, $\bar{L}$. The usefulness of this statistic is established by Theorem 3 and the Ergodic Theorem, which implies the convergence of $\bar{L}$ to $E[L]$. We must assume, however, that DNA has a fixed memory, so that Equation (2.2) will hold.

3.1 Entropy Estimate This first experiment presupposes knowledge of the boundaries. This information (which we assume to be accurate) allows us to create sequences of pure exon and pure intron material. We were then able to form the sliding window average match length $\bar{L}$ for each exon and intron along 66 different genes (see Section 1.4). This process produces the random variables $\bar{L}_{ij}^{\,type}$, where $type$ is either exon or intron, $i$ is the index of the gene, and $j$ is the index of occurrence within the gene; that is, $\bar{L}_{ij}^{\,type}$ is the average match length for the $j$th segment of type $type$ in gene $i$. For this experiment we let $N_w$, the window size, be 16.

We performed two tests on the data. The difference between the entropies is small (both were nearly maximal). We chose to perform a signed rank test, selecting, without loss of generality, the event of interest to be when the estimated entropy of the intron was larger than the estimated entropy of the exon.

To this end, let

$Y_{ij} = \operatorname{sgn}\!\left[\bar{L}_{ij}^{\,intron} - \bar{L}_{ij}^{\,exon}\right].$

Then, under the hypothesis that the entropy of the introns is identical to that of the exons (or, more strongly, that they are stochastically equivalent), $E[Y_{ij}] = 0$.

Using this data to perform the signed rank test on the paired comparisons of adjacent exon/intron sequences, we found that in 73% of the 303 comparisons the average match length was larger for the intron. Our conclusion is that the data does not support the equivalence hypothesis (with $P \approx 10^{-5}$).
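
A minimal sketch of the sign-test simplification of this paired comparison follows (not the authors' code; it assumes SciPy's binomtest is available, takes precomputed per-segment average match lengths as input, and the numbers in the example are invented for illustration):

```python
from scipy.stats import binomtest

def paired_sign_test(intron_means: list[float], exon_means: list[float]) -> float:
    """Sign test on paired (intron, exon) average match lengths from adjacent segments."""
    assert len(intron_means) == len(exon_means)
    intron_larger = sum(1 for a, b in zip(intron_means, exon_means) if a > b)
    # Under the null hypothesis that introns and exons are stochastically
    # equivalent, "intron average match length larger" is a fair coin flip.
    return binomtest(intron_larger, n=len(intron_means), p=0.5).pvalue

# Hypothetical example: 12 of 15 adjacent pairs favor the intron.
print(paired_sign_test([2.3] * 12 + [1.9] * 3, [2.0] * 15))
```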

Since $\bar{L}$ is a difficult quantity to pin down mathematically, we ran a control test on a randomly selected test sequence chosen with equal probability among the 4 characters {A, C, G, T}. As expected, the paired comparison test showed no significant difference between the groups.

3.2 Variability Measure We performed the identical signed rank test on the paired comparisons, using the variance of the match lengths instead of the mean. One would expect the match lengths to have a greater variability for the lower entropic sequences, and this was observed. Surprisingly, the variance measure proved to be an even more sensitive discriminator. We found the variability of the intron match lengths to be higher than that of the neighboring exon in 80% of the paired comparisons. In fact, in a large number of pairs with lower exon entropy, the intron variability was still greater.

4 Detecting Splice Junctions

4.1 Entropy Tests Recall from Section 1.1 that each intron begins with a GT and ends with an AG. We now consider the task of discriminating an arbitrary GT or AG from one that signals a splice junction. We will focus for now on the GT discriminator. Our data set provides 39439 strings which begin with GT; it contains 579 introns. Our method will be to use conditional entropies to find which positions flanking the GT pairs are significantly correlated with the occurrence of a splice junction.

Let $R_n$ denote the ensemble of $n$th letters following all GT's, and let $R_1^{n-1}$ denote the set of first $n-1$ letters following all GT's. From the properties of entropy we mentioned in the introduction, $H(R_n \mid R_1^{n-1}) \le H(R_n) \le 2$. Hence,

if we observe that $H(R_n \mid R_1^{n-1}) \approx 2$, we can make two statements about the $n$th letter in a GT string. The first is that the $n$th letter is chosen nearly equiprobably from the set of characters, and the second is that the $n$th letter in a GT string is essentially independent of the previous letters in the string. We will arbitrarily say that $H(R_n \mid R_1^{n-1}) \approx 2$ if $H(R_n \mid R_1^{n-1}) > 1.8$. We can compute a symmetric profile for characters to the left of each GT. These are shown on the left plot of Figure 3. For $n > 5$, we do not have enough data to make statistically significant estimates, since $4^n \ge 4^6 = 4096$. Because of the size of our data set, our heuristic is to estimate $H(R_n \mid R_1^{n-1})$ by $H(R_n \mid R_1^{5})$; this quantity measures the dependence of the $n$th letter of a string to the right of a GT on the first five letters of the string. With this approximation, we find that $H(R_n \mid R_1^{5}) \approx 2$ for all $n < 31$, and similarly to the left. Symmetric experiments were performed for the symbols to the left of the GT boundaries, with similar results.
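
A minimal sketch (not the authors' code; the function name and the list-of-strings input format are ours) of how such a conditional entropy can be estimated from sample counts, conditioning on the first $k$ letters as in the heuristic above:

```python
from collections import Counter
from math import log2

def conditional_entropy(strings: list[str], n: int, k: int = 5) -> float:
    """Estimate H(R_n | R_1^k) in bits from the strings following each GT (n is 1-based, n > k)."""
    joint = Counter()                  # counts of (first k letters, n-th letter)
    marginal = Counter()               # counts of the first k letters alone
    for s in strings:
        if len(s) >= n:
            ctx, sym = s[:k], s[n - 1]
            joint[(ctx, sym)] += 1
            marginal[ctx] += 1
    total = sum(joint.values())
    h = 0.0
    for (ctx, _sym), c in joint.items():
        p_joint = c / total            # empirical Pr(context, symbol)
        p_cond = c / marginal[ctx]     # empirical Pr(symbol | context)
        h -= p_joint * log2(p_cond)    # H = -sum Pr(ctx, sym) log2 Pr(sym | ctx)
    return h
```

Plotting this value over $n$ for the strings following every GT, and separately for only those GT's that begin introns, gives profiles like those in Figure 3.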

Now, we can restrict ourselves to the characters flanking the GT's at the beginning of introns. The right plot of Figure 3 gives these results. The $\tilde{R}$ variables are defined analogously to the $R$ variables of the preceding paragraph. Here, for $n \ge 4$, the size of the data set prevents the sample values of $H(\tilde{R}_n \mid \tilde{R}_1^{n-1})$ from being statistically significant. To be consistent with our earlier estimation technique, we resort to approximating $H(\tilde{R}_n \mid \tilde{R}_1^{n-1})$ with $H(\tilde{R}_n \mid \tilde{R}_1^{3})$. For $5 \le n \le 30$, we have that $H(\tilde{R}_n \mid \tilde{R}_1^{3}) \approx 2$, which suggests that we can ignore the 5th, 6th, ..., 30th letters in the intron in our study of the splicing mechanism. The currently existing theories about splicing suggest that we can also ignore letters that are even further away from the beginning of the intron. For $n = 4$, we have that $H(\tilde{R}_4 \mid \tilde{R}_1^{3}) \approx 1.74$, so the fourth letter of an intron may or may not be helpful in differentiating the intron from an arbitrary string which begins with GT. We have decided not to ignore it because the splice junction consensus indicates that it might be important. The $\tilde{E}$ variables are defined symmetrically to the $\tilde{R}$ variables (i.e., they look at the patterns immediately to the left of an intron beginning). For $n \le 3$, the fairly significant difference in $H(\tilde{E}_n \mid \tilde{E}_1^{n-1})$ from the value it previously assumed indicates that the $n$th letter preceding a GT pair is information that may be used by the splicing mechanism.

Figure 3: Estimates of conditional entropy profiles. Left: conditional entropy surrounding all GT pairs (positions -5 through +5). Right: conditional entropy surrounding the GT pairs at introns (positions -3 through +4).

To summarize our findings, we see that the beginning of an intron appears to be linked to certain patterns of the form {xxxGTxxxx}, where the GT marks the beginning of the intron. We look at these patterns in the following section. Similar estimates of entropy profiles have been attempted at the end of introns. Unfortunately, the results of the tests do not give clear indications of the type of pattern which differentiates the ends of introns from other AG pairs. This quantifies earlier observations that the ends of introns are more difficult to predict than the beginnings [9, 13].

4.2 Patterns We have reduced our search-space for good discriminators to the 7-tuples surrounding a GT. We implement a discriminator by applying the well-known statistical test called the Neyman-Pearson criterion [15, 7] to produce a useful rule to decide whether or not a GT pair should be classified as an intron beginning.

We let $x$ be the seven-letter pattern of interest surrounding a certain GT pair. Upon observing $x$, we wish to choose one of the following two hypotheses:

$H_0$: The GT pair does not mark the beginning of an intron.

$H_1$: The GT pair does mark the beginning of an intron.

If $\mathcal{U}^{(7)}$ denotes the set of $4^7 = 16384$ possible seven-letter patterns, then a decision rule will partition $\mathcal{U}^{(7)}$ into two sets $Z_0$ and $Z_1$ with the following properties:

1. Every element $x \in \mathcal{U}^{(7)}$ will be an element of exactly one of the sets $Z_0$ and $Z_1$.

2. If $x \in Z_0$, we select hypothesis $H_0$, and otherwise we select hypothesis $H_1$.

The performance of a decision rule can be measured by two criteria known as the detection probability and the false alarm probability, which are denoted by $P_D$ and $P_F$, respectively. $P_D$ is the probability that the decision rule correctly decides that a GT pair which begins an intron is an intron beginning; $P_F$ is the probability that the decision rule incorrectly decides that a GT pair which does not begin an intron is an intron beginning. In other words, $P_D = \Pr\{x \in Z_1 \mid H_1\}$ and $P_F = \Pr\{x \in Z_1 \mid H_0\}$.

In general, we would like to pick a decision rule which makes $P_D$ as large as possible and $P_F$ as small as possible. These objectives conflict, so there is a trade-off between these goals. However, we can consider the following problem: if we are given a constraint on the maximum acceptable value of $P_F$, we would like to select a decision rule which maximizes $P_D$ while satisfying the constraint on $P_F$. The solution to this problem is known as a Neyman-Pearson criterion. As usual, for any $x \in \mathcal{U}^{(7)}$, we approximate $\Pr(x \mid H_1)$ and $\Pr(x \mid H_0)$ by their sample values in our data set.

For the 143 genes we considered, there are 574 intron beginnings and 38839 non-intron-beginning GT pairs which are surrounded by the seven-letter patterns of interest. We have seen before that the letters surrounding an arbitrary GT pair are close, in an entropy sense, to being independent and equiprobably taking on values from the set {A, C, G, T}. Hence, we will assume that $\Pr(x_0 \mid H_0) = \Pr(x_1 \mid H_0)$ for all $x_0, x_1 \in \mathcal{U}^{(7)}$. Under this assumption, we choose our discriminating set by greedily including the $x$ which maximizes $\Pr(x \mid H_1)$. If there is more than one such $x$, we pick one which minimizes $\Pr(x \mid H_0)$. We proceed until we reach a false positive rate $P_F > \alpha$. Such a heuristic produces a decision rule which is sometimes optimal and always "good" in the Neyman-Pearson sense.
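
A minimal sketch of this greedy selection (not the authors' code; the function names and the exact stopping rule, which here stops just before the false-alarm budget alpha would be exceeded, are our choices):

```python
from collections import Counter

def greedy_decision_set(intron_patterns: list[str],
                        other_patterns: list[str],
                        alpha: float) -> tuple[set[str], float, float]:
    """Build Z_1 from sample frequencies; return (Z_1, P_D, P_F)."""
    p1 = Counter(intron_patterns)      # pattern counts under H_1 (true intron beginnings)
    p0 = Counter(other_patterns)       # pattern counts under H_0 (all other GT pairs)
    n1, n0 = len(intron_patterns), len(other_patterns)

    # Rank by highest sample Pr(x|H_1); break ties by lowest sample Pr(x|H_0).
    ranked = sorted(p1, key=lambda x: (-p1[x] / n1, p0[x] / n0))

    z1: set[str] = set()
    p_d = p_f = 0.0
    for x in ranked:
        if p_f + p0[x] / n0 > alpha:   # adding x would exceed the P_F budget
            break
        z1.add(x)
        p_d += p1[x] / n1
        p_f += p0[x] / n0
    return z1, p_d, p_f
```

Sweeping alpha over a range of values and recording the resulting (P_F, P_D) pairs traces out a curve like the one summarized in Figure 4.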

The currently existing consensus rules suggest that there are four patterns which differentiate the beginning of an intron from an arbitrary GT pair: AAG-AAGT, AAG-GAGT, CAG-AAGT, and CAG-GAGT, where the dash marks the position of the GT. According to our data set, if $Z_1$ is selected to be the set of these four patterns, then $P_D = 0.041812$ and $P_F = 0.000232$. Clearly, if this $Z_1$ were used in a parsing rule, the performance of the rule would be extremely poor, at least for this data set. One can do much better by taking advantage of the information collected by the method we describe, which is summarized in Figure 4. For example, we can find a $Z_1$ which will satisfy $P_D = 0.212544$ and $P_F = 0$, or, if we are willing to let $P_F$ be as large as 0.000232, we can use a decision rule with $P_D = 0.297909$ and $P_F = 0.000206$. For this data set, if we use all 265 selected patterns, we have $P_D = 1.0$ and $P_F = 0.023173$, and this value of $P_F$ may be acceptable for many parsing applications. A list of patterns to achieve any point on the curve is available upon request.

5 Conclusion

While entropy measures information, most attempts to provide a meaningful or useful entropy-based characterization of DNA have ended in failure. We suggest that this is because the wrong entropy estimators were used in previous studies. In particular, the most well known entropy estimators have notoriously slow convergence rates.

Figure 4: Neyman-Pearson criterion for splice junction detection (detection probability $P_D$ versus false alarm probability $P_F$).

The main result of this paper is that we find that the entropy of exons is higher than that of introns. This seems surprising in that introns are presumed to be the mechanism by which many random changes can accumulate without being subjected to restorative survival forces, and thereby produce an entirely new gene, without each small incremental change being more fit for survival.

The natural explanation which occurs for our observation, based on using a new estimator of entropy more suitable for estimating the entropy of a short string, is that (a large) part of the introns are also subject to restorative forces, and that some or all of these parts may serve to define and determine the splice junctions, i.e. the intron-exon boundaries. If this is the case, there must be rules which are coded into the introns and should be inferable, given enough data. If these regions are the only parts of the introns subject to restorative forces, then the rules must be complicated, which seems to be the case. Indeed, we attempted to find these (presumed) coding rules for the splice junctions in the introns, but unfortunately conclude that we will need much more data to infer the rules and solve this fascinating and important problem.

We believe that designers of algorithms which rely on some entropic property of a source must be careful not to misuse entropy estimators. In particular, in practical settings in which an algorithm relies on an estimation of the entropy of a source, one must be innovative in choosing an estimator that fits the asymptotics of the problem.

References

[1] D. Aldous and P. Shields. A diffusion limit for a class of randomly growing binary trees. Probability Theory, 79:509-542, 1988.

[2] R. Breathnach, C. Benoist, K. O'Hare, F. Gannon, and P. Chambon. Ovalbumin gene: Evidence for leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proceedings of the National Academy of Sciences, 75:4853-4857, 1978.

[3] S. Brunak, J. Engelbrecht, and S. Knudsen. Prediction of human mRNA donor and acceptor sites from the DNA sequence. Journal of Molecular Biology, 220:49, 1991.

[4] Jack Cohen and Ian Stewart. The information in your hand. The Mathematical Intelligencer, 13(3), 1991.

[5] R. G. Gallager. Information Theory and Reliable Communication. John Wiley & Sons, Inc., 1968.

[6] Ali Hariri, Bruce Weber, and John Olmstead. On the validity of Shannon-information calculations for molecular biological sequences. Journal of Theoretical Biology, 147:235-254, 1990.

[7] W. B. Davenport Jr. and W. L. Root. An Introduction to the Theory of Random Signals and Noise. McGraw-Hill, 1958.

[8] Andrzej Konopka and John Owens. Complexity charts can be used to map functional domains in DNA. Gene Anal. Techn., 6, 1989.

[9] S. M. Mount. A catalogue of splice-junction sequences. Nucleic Acids Research, 10:459-472, 1982.

[10] H. M. Seidel, D. L. Pompliano, and J. R. Knowles. Exons as microgenes? Science, 257, September 1992.

[11] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 623-656, 1948.

[12] Peter S. Shenkin, Batu Erman, and Lucy D. Mastrandrea. Information-theoretical entropy as a measure of sequence variability. Proteins, 11(4):297, 1991.

[13] R. Staden. Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucleic Acids Research, 12:551-567, 1984.

[14] J. A. Steitz. Snurps. Scientific American, 258(6), June 1988.

[15] H. van Trees. Detection, Estimation and Modulation Theory. Wiley, 1971.

[16] J. D. Watson, N. H. Hopkins, J. W. Roberts, J. Argetsinger Steitz, and A. M. Weiner. Molecular Biology of the Gene. Benjamin/Cummings, Menlo Park, CA, fourth edition, 1987.

[17] A. D. Wyner and A. J. Wyner. An improved version of the Lempel-Ziv algorithm. IEEE Transactions on Information Theory.

[18] A. J. Wyner. String Matching Theorems and Applications to Data Compression and Statistics. PhD thesis, Stanford University, 1993.

[19] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(3):337-343, 1977.