intron correlations evolution intron/exon

5
Proc. Natl. Acad. Sci. USA Vol. 92, pp. 12495-12499, December 1995 Evolution Intron phase correlations and the evolution of the intron/exon structure of genes (exon shuffling/ancient conserved regions) MANYUAN LONG*, CARL ROSENBERGt, AND WALTER GILBERT* *Department of Molecular and Cellular Biology, The Biological Laboratories, Harvard University, Cambridge, MA 02138; and tWhitehead Institute/ Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139 Contributed by Walter Gilbert, September 13, 1995 ABSTRACT Two issues in the evolution of the intron/ exon structure of genes are the role of exon shuffling and the origin of introns. Using a large data base of eukaryotic intron-containing genes, we have found that there are corre- lations between intron phases leading to an excess of sym- metric exons and symmetric exon sets. We interpret these excesses as manifestations of exon shuffling and make a conservative estimate that at least 19%6 of the exons in the data base were involved in exon shuffling, suggesting an important role for exon shuffling in evolution. Furthermore, these excesses of symmetric exons appear also in those regions of eukaryotic genes that are homologous to prokaryotic genes: the ancient conserved regions. This last fact cannot be ex- plained in terms of the insertional theory of introns but rather supports the concept that some of the introns were ancient, the exon theory of genes. How did the intron/exon structure of genes evolve? Broadly speaking, there are two alternative theories for the origin of introns. The insertional theory of introns, also called the "introns-late" theory, invokes recent events of intron insertion into eukaryotic genomes (1-3). On the other hand, the "in- trons-early" hypothesis (4-7) takes the position that introns were used in the progenote to assemble genes and that the evolutionary pattern since was one of loss down many lines, some retention, and, possibly, some gain. On this model one expects some correlation of exons with functional elements of the protein gene products. The introns-late hypothesis assumes that introns were added to preexisting genes at some time during evolution. One form would suggest that introns were added early in eukaryotic evolution and were used to shuffle parts of genes to account for the Cambrian explosion of metazoan species. Such models would expect exon shuffling to appear only in evolutionarily late gene products. A still more extreme model is that introns were added only very late in evolution and that no exon shuffling has occurred. In all introns-late models one expects no correlation between exons and elements of protein function for ancient genes, and observations of correlations between three-dimensional structure and exon structures (8-10), of correlations between intron positions in plants and animal genes (11-15), or of correlations between intron positions in genes that diverged in the progenote (16-19) are all viewed as statistical coincidences (20-22). This study began with a search for different testable pre- dictions derived from these two competing theories. Intron phase [the position of the intron within a codon, phase 0, 1, or 2 lying before the first base, after the first base, or after the second base, respectively (23)], is an important evolutionary character for whose distribution the two theories generate different predictions. Since the insertion of introns into pre- The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact. viously uninterrupted genes, as the introns-late view advo- cates, does not affect the structure of the gene product, introns in each phase should have similar chances to survive. Thus, the simple form of the introns-late theory predicts a random phase distribution for introns. On the contrary, the introns-early theory predicts a nonrandom phase distribution for introns as a consequence of exon shuffling events, since exon shuffling works better if introns are in the same phase. Furthermore, exon shuffling should produce correlations in intron phases, since symmetric exons shuffle more easily, while insertional models predict that intron phases are uncorrelated. The pre- diction of a random distribution of intron phases and random correlations allows us to detect exon shuffling and to test the insertional theory of intron origins. In this report, we show that this population analysis indicates both an important role for exon shuffling throughout evolution and also support for the hypothesis of the ancient origin of introns. METHODS Exon Data Bases. We used GenBank (24) entries released in 1994 [release 84, which also includes European Molecular Biology Laboratory (EMBL) and DNA Database of Japan (DDBJ) entries] to construct a raw data base that contains information about the intron/exon structures, definitions of genes, locus names, and, finally, protein sequences. Then we calculated the intron positions and phases and the numbers and sizes of the exons for these 9276 sequences (45,095 exons) to create a processed data base. We discarded entries with obvious problems, such as wrong feature tables. In testing our extraction program, we identified all entries in GenBank that had the word exon in the definition line and sampled randomly about 10% of these entries to check how many include the "cds... join" and "cds... .complement join" command that we used to identify exon-containing genes. We found that about 3% of exons lacked this reference, and we excluded them. A further 5% lacked peptide sequence and were discarded. Therefore, we recovered about 97% of all annotated genes in the data base. To discard duplicates and partial sequences, we compared all the protein sequences with BLASTP (25) and purged all related entries but the longest, using a similarity score of 99% (99% match over the length of the shorter sequence), yielding 6611 genes and 33,150 exons. We then further distilled the data base, using FASTA (26) (similarity scores are calculated from the FASTA alignment as the match percentage times the length of the overlapping region divided by the length of the shorter sequence) and purging down to a criterion of a 20% match to the shorter sequence, keeping the sequence with more exons each time (the purging yielded 4681 genes at the 80% level, 3704 at 60%, 2942 at 40%, and 1925 at 20%). The plant genes are from FASTA purging of a complete data base of plant intron-containing genes to the 20% simi- Abbreviation: ACRs, ancient conserved regions. 12495

Upload: others

Post on 03-Feb-2022

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intron correlations evolution intron/exon

Proc. Natl. Acad. Sci. USAVol. 92, pp. 12495-12499, December 1995Evolution

Intron phase correlations and the evolution of the intron/exonstructure of genes

(exon shuffling/ancient conserved regions)

MANYUAN LONG*, CARL ROSENBERGt, AND WALTER GILBERT**Department of Molecular and Cellular Biology, The Biological Laboratories, Harvard University, Cambridge, MA 02138; and tWhitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139

Contributed by Walter Gilbert, September 13, 1995

ABSTRACT Two issues in the evolution of the intron/exon structure of genes are the role of exon shuffling and theorigin of introns. Using a large data base of eukaryoticintron-containing genes, we have found that there are corre-lations between intron phases leading to an excess of sym-metric exons and symmetric exon sets. We interpret theseexcesses as manifestations of exon shuffling and make aconservative estimate that at least 19%6 of the exons in the database were involved in exon shuffling, suggesting an importantrole for exon shuffling in evolution. Furthermore, theseexcesses of symmetric exons appear also in those regions ofeukaryotic genes that are homologous to prokaryotic genes:the ancient conserved regions. This last fact cannot be ex-plained in terms ofthe insertional theory of introns but rathersupports the concept that some of the introns were ancient, theexon theory of genes.

How did the intron/exon structure of genes evolve? Broadlyspeaking, there are two alternative theories for the origin ofintrons. The insertional theory of introns, also called the"introns-late" theory, invokes recent events of intron insertioninto eukaryotic genomes (1-3). On the other hand, the "in-trons-early" hypothesis (4-7) takes the position that intronswere used in the progenote to assemble genes and that theevolutionary pattern since was one of loss down many lines,some retention, and, possibly, some gain. On this model oneexpects some correlation of exons with functional elements ofthe protein gene products.The introns-late hypothesis assumes that introns were added

to preexisting genes at some time during evolution. One formwould suggest that introns were added early in eukaryoticevolution and were used to shuffle parts of genes to accountfor the Cambrian explosion of metazoan species. Such modelswould expect exon shuffling to appear only in evolutionarilylate gene products. A still more extreme model is that intronswere added only very late in evolution and that no exonshuffling has occurred. In all introns-late models one expectsno correlation between exons and elements of protein functionfor ancient genes, and observations of correlations betweenthree-dimensional structure and exon structures (8-10), ofcorrelations between intron positions in plants and animalgenes (11-15), or of correlations between intron positions ingenes that diverged in the progenote (16-19) are all viewed asstatistical coincidences (20-22).

This study began with a search for different testable pre-dictions derived from these two competing theories. Intronphase [the position of the intron within a codon, phase 0, 1, or2 lying before the first base, after the first base, or after thesecond base, respectively (23)], is an important evolutionarycharacter for whose distribution the two theories generatedifferent predictions. Since the insertion of introns into pre-

The publication costs of this article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertisement" inaccordance with 18 U.S.C. §1734 solely to indicate this fact.

viously uninterrupted genes, as the introns-late view advo-cates, does not affect the structure of the gene product, intronsin each phase should have similar chances to survive. Thus, thesimple form of the introns-late theory predicts a random phasedistribution for introns. On the contrary, the introns-earlytheory predicts a nonrandom phase distribution for introns asa consequence of exon shuffling events, since exon shufflingworks better if introns are in the same phase. Furthermore,exon shuffling should produce correlations in intron phases,since symmetric exons shuffle more easily, while insertionalmodels predict that intron phases are uncorrelated. The pre-diction of a random distribution of intron phases and randomcorrelations allows us to detect exon shuffling and to test theinsertional theory of intron origins. In this report, we show thatthis population analysis indicates both an important role forexon shuffling throughout evolution and also support for thehypothesis of the ancient origin of introns.

METHODSExon Data Bases. We used GenBank (24) entries released

in 1994 [release 84, which also includes European MolecularBiology Laboratory (EMBL) and DNA Database of Japan(DDBJ) entries] to construct a raw data base that containsinformation about the intron/exon structures, definitions ofgenes, locus names, and, finally, protein sequences. Then wecalculated the intron positions and phases and the numbersand sizes of the exons for these 9276 sequences (45,095 exons)to create a processed data base. We discarded entries withobvious problems, such as wrong feature tables. In testing ourextraction program, we identified all entries in GenBank thathad the word exon in the definition line and sampled randomlyabout 10% of these entries to check how many include the"cds... join" and "cds... .complement join" command that weused to identify exon-containing genes. We found that about3% of exons lacked this reference, and we excluded them. Afurther 5% lacked peptide sequence and were discarded.Therefore, we recovered about 97% of all annotated genes inthe data base. To discard duplicates and partial sequences, wecompared all the protein sequences with BLASTP (25) andpurged all related entries but the longest, using a similarityscore of 99% (99% match over the length of the shortersequence), yielding 6611 genes and 33,150 exons. We thenfurther distilled the data base, using FASTA (26) (similarityscores are calculated from the FASTA alignment as the matchpercentage times the length of the overlapping region dividedby the length of the shorter sequence) and purging down to acriterion of a 20% match to the shorter sequence, keeping thesequence with more exons each time (the purging yielded 4681genes at the 80% level, 3704 at 60%, 2942 at 40%, and 1925 at20%). The plant genes are from FASTA purging of a completedata base of plant intron-containing genes to the 20% simi-

Abbreviation: ACRs, ancient conserved regions.

12495

Page 2: Intron correlations evolution intron/exon

Proc. Natl. Acad. Sci. USA 92 (1995)

5()

,.

I0x~

,,

(

11

I I

I -II length(i i IExoni lengtlh (amino atcid residiues)

.L"I II2l.I1l

FIG. 1. Distribution of exon lengths. For the internal exons in the data base purged to 20% similarity, the histogram shows the numbers of exonsat each length: 94% of all the exons are of length less than 150 amino acids, and there is a scattering of longer exons, the longest having 3205 aminoacid residues.

larity level. The animal genes are drawn from the final database.

Expected Frequencies of Intron Phase Combinations. Theexpected frequencies of sets of intron phase combinations,Bayesian expectations, are calculated from the observed fre-quencies of their component members. For example, givenfi(i, j), the observed frequency of single exons whose boundingphases are i andj (5' and 3' respectively), the probabilitiesfor the sets of length 2, P2 (i, j) = Emfi(, m)fl(m, j)/7,E1jkf1(i, j)-f'(j, k). In general, the probability for a set oflength n, Pn(i, j), is related to the product of n f1s, Pn,(i, j)(E..E)n-lfl ...f )+ I fl ..fl.Exon Number Involved in Shuffling. For sets of singles,

doubles, and triples, we counted only the symmetric exons, thedoubles that have phase combinations: 010, 020, 101, 121, 202,and 212, and triples: 0120, 0210, 1021, 1201, 2012, and 2102.Their expectations were calculated as Bayesian expectations.

Ancient Conserved Regions (ACRs). Following the methodof Green et al. (27), we identified genes in the final data basethat match prokaryotic proteins (GenBank release 85) withBLAST scores above 75. We purged false alignments due tohighly biased amino acid composition. Then we aligned eachof the prokaryotic matches to the eukaryotic gene by using a

Smith-Waterman alignment (the program used was sent gen-erously by P. Green) and defined a new data base thatcontained only those introns that lie in the region of theeukaryotic protein that matches the prokaryotic one. Thisrepresents a class of genes (regions) that have introns butwhich arose as prokaryotic genes. There are 296 genes with1496 introns and 1200 internal exons in this data base ofhomologs of prokaryotic genes.

RESULTS

Construction ofExon Data Base. To survey the intron phasedistribution, we first built a data base of eukaryotic intron-containing genes extracted from GenBank release 84. Todiscard duplicates and partial sequences in this data base, wepurged all related sequences by using BLASTP and FASTA. Thefinal data base, purged to 20%, contains 1925 independent orquasi-independent protein sequences containing 13,042 exonsand 11,117 introns (we use the prefix quasi here because onecannot infer with certainty that the sequences are truly inde-pendent; these low criteria for similarity still miss a few relatedgenes). We found 1600 genes that have more than one intron,

and there are 9192 internal exons. Fig. 1 shows a histogram ofinternal exon lengths: a distribution peaked at 35 residues.

Distribution of Intron Phases. The three classes of intronphases are far from evenly distributed. Table 1 shows that thereis an excess of phase zero introns, as was found in a previoussurvey of a small sample (28). While this result from a largesample is consistent with the prediction of the exon theory ofgenes and contradicts the simple form of the insertionaltheory, there are alternative hypotheses for the insertionaltheory which cannot be rejected. To disfavor the appearanceof phase one or phase two introns, one might assume that theinsertion was into long nucleotide sequences which might becorrelated with the local amino acid sequence of the geneproduct, such as has been suggested for an AGIGT sequenceat the exon boundaries (28). However, we can distinguishbetween the two theories by studying intron phase correla-tions.

Intron Phase Correlations. The association of two adjacentintrons in eukaryotic genes can be in any of nine differentintron phase combinations, which can be classified into twogroups-symmetric exons (0, 0), (1, 1), and (2, 2) and asym-metric exons (0, 1), (0, 2), (1, 2), (1, 0), (2, 0), and (2, 1) (thefirst number is the phase of the 5' intron and the secondnumber is the phase of the 3' intron). Symmetric exons willspread more easily by exon shuffling (29) because the additionof a symmetric exon into an intron of the same phase does notdisturb the reading frame. On the contrary, the intron inser-tional theory predicts a random correlation of intron phasesacross exons.The simplest null hypothesis again is the random expectation

that intron phases should have equal frequencies, 1/3, and thateach pair of introns, each exon combination, should beequiprobable at 1/9. Compared to this, the intron phasecorrelation is very far from random, and there is a 125% excessof symmetric (0, 0) exons over the null expectation. Thisdramatic excess of symmetric exons could be taken as supportfor exon shuffling, with the corollary that the excess of phasezero introns would then be a result of the shuffling ofsymmetric (0, 0) exons. However, let us consider the more

Table 1. Proportions of the three intron phases

No. (%) of introns

Phase zero Phase one Phase two Total

5263 (48%) 3372 (30%) 2482 (22%) 11,117

12496 Evolution: Long et al.

I

Page 3: Intron correlations evolution intron/exon

Proc. Natl. Acad. Sci. USA 92 (1995) 12497

Table 2. Observed and expected symmetric intron phase associations

No. of No. ofexons (0, 0) (1, 1) (2, 2) exons x2

1 2325/2075(12%) 1085/832(30%) 519/461 (13%) 9192 200.42 1935/1829 (6%) 930/662(40%) 421/373 (13%) 7556 177.23 1593/1492 (7%) 715/518 (38%) 330/305 (8%) 6202 132.84 1357/1230(10%) 602/424(41%) 311/252 (23%) 5116 150.55 1060/1015 (4%) 461/350 (32%) 209/208 (0%) 4224 53.66 899/842 (7 %) 390/290(34%) 183/172 (6%) 3503 60.47 736/696 (6 %) 318/240(33%) 133/142(-6%) 2893 42.6

The observed and expected intron associations are expressed as the observed number/the expectationand excess of observed is in parentheses. For one exon, the expectation is PjPjN, where Pi is the proportionof introns of phase i and N is the total number of exons. For sets of exons, the expectation is the Bayesianprobability based on the observed proportions of internal exons (15). The x2 values are calculated on thetotal distribution (nine numbers). The asymmetric exon sets are all less than expectation.

conservative hypothesis, that the phase zero bias might havesome cause other than exon shuffling, and compare theobserved frequencies of exons with the expected frequenciesbased on a random association of introns [the expectedfrequency of exon (i, j) being then the product of the frequen-cies of introns of phase i and phase j].The x2 test shows that the deviation of the 9192 exons from

a random association of phase is very significant (X2 = 200, aP value less than 10-40). Thus intron phases are not combinedin a random fashion. Furthermore, symmetric exons are sig-nificantly more frequent than expected and asymmetric exonsare fewer. Table 2 shows that there is a 12% excess of (0, 0)exons over expectation and a 30% excess of (1, 1) exons. Thisis strong evidence for the exon shuffling hypothesis, because,even if there were some hitherto undiscovered rule aboutphase zero introns and correlations of phase zero introns, onewould not expect a significant excess of (1, 1) exons; symmetryper se seems to be favored. One source of nonrandom bias ishuman error in the data base. We estimate that the featuretable is wrong in a few percent of the entries. The mostcommon error is to assign introns to phase zero rather thanusing ...IGT.AGI. rules. Such a bias would not lead to anexcess of (1, 1) correlations.One further property of exon shuffling is that groups of

exons are likely to shuffle together. Table 2 also surveys thedistributions of the outside intron phases for groups of two toseven exons. The observed frequencies of each group arecompared with Bayesian conditional probabilities based on thefrequencies of the exons making up the group. Since thenumbers of exons in the sets of exons falls as the sets get longer(next-to-last column in Table 2), the statistical significance ofthe deviation from expectation (last column in Table 2) fallsas the length of the set of exons increases. However, thestriking fact is that the fractional excess over expectation ismaintained for each type of symmetrical set; (0, 0) sets areabout 7% over expectation, while (1, 1) sets are all about 30%over expectation. This behavior of sets of exons is a furtherargument for exon shuffling.Could this excess of symmetric exons and exon sets be due

to alternative splicing as distinct from exon shuffling? Alter-native splicing is dependent solely on the biological properties

of individual genes (30, 31) rather than the distances betweenthe introns. Alternative splicing frequently adds alternativesets of exons to the beginning or ends of genes. Such events donot restrict the phases or the phase combinations of exons.Alternative splicing events that add an additional internal exonto a previously functioning structure do require that the addedexon be symmetric, but, of course, also require that theexon/intron structure be related to the structure and functionof the gene product.

Role of Exon Shuffling in Evolution. How much of the database would have to be involved in exon shuffling events toproduce these observed deviations from randomness? We takethe difference between the observed and the expected numberof exons in symmetric sets as an estimate of the number ofexons which had to be involved in exon shuffling. Since theremay be an overlap in the use of exons between the exon setsof different length, we calculate only the excess symmetricexons over expectation, and the specific excesses of pairs andtriples that do not contain any symmetric exons within each set.The differences between observations and expectations (d =

O - E) are singles, d = 561; doubles, d = 400; and triples, d= 124. Then the number of exons involved in shuffling is 1 x561 + 2 x 400 + 3 x 124 = 1733, which is 19% of the purgeddata base. This is a lower bound for the true fraction involvedin shuffling, since we considered the excesses only for singles,certain pairs, and certain triples of exons to avoid the possi-bility of counting twice, and other factors like intron driftwould weaken the evidence for symmetric exon patterns. (If allof the deviation from the 1/3 expectation of intron phaseswere to be due to exon shuffling, then at least 28% of the database had to be involved.) Thus, quantitatively, exon shufflingis very important in the evolution of the intron/exon structureof genes.

This analysis has shown that the intron correlations ineukaryotic genes are not random, and it suggests a dominantrole for exon shuffling. However, several authors (32-34) haveargued that exon shuffling is a recent phenomenon, limited tovertebrate genes. Further analysis indicates that this may notbe the case.We analyzed the intron phase correlations for both plant and

animal genes separately. Table 3 shows that the nonrandom

Table 3. Intron correlations in data base subsets

No. ofSubset (0, 0) (1, 1) (2, 2) exons X2

Plant 579/515 (12%) 118/88 (34%) 76/71 (7%) 1642 33.5All animal 1869/1678 (11%) 992/774 (28%) 464/409 (13%) 7923 160.0Animal without

Caenorhabditis elegans 1189/1011 (11%) 778/604 (29%) 235/208 (13%) 5010 163.1

The proportions of phase zero, one, and two introns for the internal exons of plant genes are 56%, 23%,and 21% (total 2790 introns); the proportions for animal genes are 46%, 31%, and 23% (total 10,999introns). The expectations for the symmetric exons are calculated on the basis of these proportions.

Evolution: Long et al.

Page 4: Intron correlations evolution intron/exon

Proc. Natl. Acad. Sci. USA 92 (1995)

Table 4. Intron correlations in ACRs

No. of Six-way test Two-way testexons (0,0) (1, 1) (2,2) X2 P (0,0) + (1,1) + (2,2) x2 P

1 407/389 (7%) 88/76 (16%) 67/59 (14%) 9.1 0.110 562/515 (9%) 7.1 0.00802 340/305 (11%) 62/51 (22%) 37/44 (-15%) 13.1 0.026 439/400 (10%) 6.5 0.01103 275/236 (17%) 43/39 (12%) 30/34 (-10%) 13.2 0.027 348/309 (13%) 8.4 0.00354 217/182 (19%) 30/30 (0%) 20/26 (23%) 11.5 0.042 267/238 (12%) 6.0 0.0140

Results are expressed as observed/expected (percent difference). The expected values are calculated on the basis of the prior expectation offrequencies for the composite exon sets. The critical x2 value in a test of three symmetric and three asymmetric summed combinations, or six-waytest, is 11.07 (df = 5) for the 0.05 level; the critical x2 value in a two-way test at the 0.05 level is 3.84 (df = 1).

correlations of introns and the excess of symmetric exons holdsnot only for animal genes but also for plant genes. Thus, exonshuffling is not limited to vertebrate or animal genomes. [Thisstatistical argument is also supported by a recent report of exonshuffling in plant genes (35).] The animal genes in the database include a large number of genes from the C. elegansproject whose intron positions were determined by a computerprogram, GENEFINDER (36). One might worry that these"hypothetical" genes might perturb the calculation. Table 3also shows that all the C. elegans genes can be dropped fromthe data base without affecting the phase correlation excesses.

Intron Phase Correlations in ACRs. More importantly, wehave investigated intron phase correlations in a subset of 296genes from the final data base that are homologous to pro-karyotic genes (27), using only those introns that lie in theregion of match between the eukaryotic proteins and theprokaryotic ones. These ACRs represent gene structures thatwere in existence in the last common ancestor of the pro-karyotes and the eukaryotes. Since these genes have no intronsin the prokaryotes, the introns-late theory holds that all theintrons in this portion of the data base are insertional. Noshuffling could have occurred after the prokaryotic-eukaryoticdivergence, since these gene regions are colinear. However, justas for the overall data base, the distribution of intron phases andphase combinations in these ACRs are not random. Among the1496 introns, phase zero introns constitute 55%; phase oneintrons, 24%; and phase two introns, 21%. Furthermore, thereare statistically significant excesses of symmetric exon sets. Table4 shows the distribution of individual symmetric exon types forsets of one to four exons. The distributions for individual patternsof pairs, triples, and quadruples of exons reach statistical signif-icance. However, in all cases, a two-way test shows clearly thatsymmetrical exons are in excess at the 1% probability level. Thusthe intron/exon structure of these ACRs suggests that they wereconstructed by exon shuffling, which would have to have occurredin the progenote.

DISCUSSIONHow robust is this analysis? One might argue that those introncorrelations are due to some pattern of repeated genes in the

A

-{1.|(T...-1i( 117)

v1flK+%

data base: that the purging had not produced a set of inde-pendent entries. To test this possibility we repeated thecalculation for data bases created at different purging levels.Fig. 2 shows that the intron phase frequencies and the excessof symmetric exons (only pairs are shown) are stable across thepurging levels. Thus, the deviations from randomness are nota property of redundancy in the data base. The few quasi-independent sequences in the data base which may come fromsame-gene families do not bias the statistical analysis. Even ifwe purge the data base to the 10% level, yielding only 689genes and 7115 exons (657 genes with 5737 internal exons), thephase correlation excesses still stand.To give an insertional theory a best shot, we have assumed

that the nonrandom intron phase distribution might be a resultof introns inserted into specific sequences of bases which couldbe correlated with local amino acid sequence. One example ofthis is the suggestion that an AGIGT sequence in the exons atthe junctions is a remnant of a preinsertion target. However,Stephens and Schneider (37) find no significant AGIGT se-quence within the exons in a study of 1800 human introns, andonly a slight excess of AGIG. [An alternative interpretation ofthis AGIG bias would be that such sequences were selected for,after the fact, by evolutionary pressure from the splicingmechanism, such as better matching to the small RNAs (38).]On such insertional models, one might argue that the basecomposition of the DNA would bias the third positions ofcodons and thus that variation in the GC bias would drive such"entry" sequences into different phase relations in differentgenes. To test if the intron phase correlations are the result ofsome genes being rich in phase zero introns while other genesare rich in phase one introns, a feature that would createcorrelations between the phases, we examined the distributionof introns within genes in the data base. Histograms of thedistribution of introns within genes show a pattern broadlypeaked around the average positions, except for spikes at 100%for the three pure phases. We have dropped those sets of exons(68 genes) and repeated the calculation of symmetric exonexcesses to find again a 9% excess of (0, 0) exons and a 24%excess of (1, 1) exons.

B

-~-

J,

7-

(1. 2()

iflilai-M., ICOi. t:l,,2lE fo I.--il-4 1

T---ilri-cr rsc __ ____Zi

(1. 11.

I-(

( i. I ) exons

( . 2) Cx()I1S(* -)::2.

( ()h 0)) xo()1s---.s--- --e

20 ( 40 5 (d)

SimkiritvrcvScore ISC or pLII'(i1ic:

FIG. 2. Stability of the deviations from expectation at various purging levels. (A) Deviations of the intron phases from the 1/3 expectation. (B)Difference between the frequencies of symmetric pairs of exons and the expectation based on the observed exon frequencies.

12498 Evolution: Long et al.

.I '.s(MIo!' c )11

-

.i ). - 0.40) ---_

_9 -

Page 5: Intron correlations evolution intron/exon

Proc. Natl. Acad. Sci. USA 92 (1995) 12499

This study of intron phase correlations has revealed a signalwhich we interpret as a mark of exon shuffling. This statisticalapproach has the advantage that such a signal is only weak-ened, not abolished, by the drift or loss of introns overevolutionary time or by the addition of a few novel introns. Theargument is that the intron correlations in the eukaryoticversions of "ancient" genes, the excess of symmetric sets ofexons, shows that these exon sets were once used in exonshuffling which had to occur in the progenote. Hence theseexons were elements that were used to compose the originalgenes. These intron phase correlations are an independentsupport for the exon theory of genes. The argument fromphase correlations thus adds to the previous arguments basedon the correlation of intron positions with three-dimensionalstructure for triosephosphate isomerase (8-10), the coinci-dence of plant and animal introns (11-15), or the identificationof ancient introns on the basis of the patterns of descent(16-19) to support the idea that some introns are extremelyancient. Unlike the previous arguments, the intron phasecorrelation calculation detects a clear, statistically significantsignal for the existence of early introns against a backgroundof loss and gain.

We thank Richard C. Lewontin, Nathan Goodman, Charles H.Langley, Stephen M. Mount, Daniel Weinreich, Andrew Berry, LeslieGotlieb, and all the members of the Gilbert Lab, especially KeithRobison and Zhiping Liu, for useful discussions and helpful computingadvice. We thank Phillip Green for the computer program for theSmith-Waterman algorithm and Eric Lander for allowing us access tohis computing facility. We thank the National Institutes of Health fortheir support, Grant GM37997 to W.G. and Grant HG0098 to EricLander.

1. Cavalier-Smith, T. (1991) Trends Genet. 7, 145-148.2. Rogers, J. (1985) Nature (London) 315, 458-459.3. Rogers, J. (1989) Trends Genet. 5, 213-216.4. Blake, C. C. F. (1978) Nature (London) 273, 267.5. Doolittle, W. F. (1978) Nature (London) 272, 581-582.6. Gilbert, W. (1987) Cold Spring Harbor Symp. Quant. Biol. 52,

901-905.7. Gilbert, W. (1978) Nature (London) 271, 501.8. Gilbert, W. & Glynias, M. (1993) Gene 135, 137-144.9. Gilbert, W., Marchionni, M. & McKnight, G. (1986) Cell 46,

151-153.10. Go, M. (1981) Nature (London) 291, 90-93.

11. Marchionni, M. & Gilbert, W. (1986) Cell 46,133-141.12. Nawrath, C., Schell, J. & Koncz, C. (1990) Mol. Gen. Genet. 223,

65-75.13. Pardo, J. M. & Serrano, R. (1989) J. Biol. Chem. 264, 8557-8562.14. Shah, D. M., Hightower, R. C. & Meagher, R. B. J. (1983) Mo.

Appl. Genet. 2, 111-126.15. Imajuku, Y., Hirayama, T., Endoh, H. & Oka, A. (1992) FEBS

Lett. 304, 73-77.16. Iwabe, N., Kuma, K., Kishino, H., Hasegawa, M. & Miyata, T.

(1990) J. Mol. Evol. 31, 205-210.17. Kersanach, R., Brinkmann, H., Liaud, M.-F., Zhang, D.-X.,

Martin, W. & Cerff, R. (1994) Nature (London) 367, 387-389.18. Obaru, K., Tsuzuki, T., Setoyama, C. & Shimada, K. (1988) J.

Mol. Biol. 200, 13-22.19. Setoyama, C., Joh, T., Tsuzuk, T. & Shimada, K. (1988) J. Mol.

Biol. 202, 355-364.20. Logsdon, J. M., Jr., & Palmer, J. D. (1994) Nature (London) 369,

526.21. Stoltzfus, A. (1994) Nature (London) 369, 526-527.22. Stoltzfus, A., Spencer, D. F., Zuker, M., Logsdon, J. M., Jr., &

Doolittle, W. F. (1994) Science 265, 202-207.23. Sharp, P. A. (1981) Cell 23, 643-646.24. Burks, C., Cinkosky, M. J., Gilna, P., Hayden, J. E., Abe, Y.,

Atencio, E. J., Barnhouse, S., Benton, D., Buenafe, C. A., Cu-mella, K. & Burks, E. (1990) Methods Enzymol. 183, 3-22.

25. Altschul, S. F., Gish, N. M., Miller, W., Myers, E. W. & Lipman,D. J. (1990) J. Mol. Biol. 215, 403-410.

26. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA85, 2444-2448.

27. Green, P., Lipman, D., Hillier, L., Waterson, R., States, D. &Claverie, J.-M. (1993) Science 259, 1711-1716.

28. Fedorov, A., Fedorov, A., Suboch, G., Bujakov, M. & Fedorova,L. (1992) Nucleic Acids Res. 20, 2553-2557.

29. Patthy, L. (1987) FEBS Lett. 214, 1-7.30. Hodges, D. & Bernstein, S. I. (1994) Adv. Genet. 31, 207-228.31. McKeown, M. (1992) Annu. Rev. Cell Biol. 8, 133-155.32. Palmer, J. D. & Logsdon, J. M., Jr. (1991) Curr. Opin. Genet. Dev.

1, 470-477.33. Patthy, L. (1991) BioEssays 13, 187-192.34. Stoltzfus, A., Spencer, D. F., Zuker, M., Logsdon, J. M., Jr., &

Doolittle, W. F. (1994) Science 265, 202-207.35. Domon, C. & Steinmetz, A. (1994) Mol. Gen. Genet. 244,

312-317.36. Wilson, R. (1994) Nature (London) 368, 32-38.37. Stephens, R. M. & Schneider, T. D. (1992) J. Mol. Biol. 228,

1124-1136.38. Horowitz, D. S. & Krainer, A. R. (1994) Trends Genet. 10,

100-106.

Evolution: Long et al.