Mol Genet Genomics (2007) 277:441–455

DOI 10.1007/s00438-007-0210-8

ORIGINAL PAPER

Non-random genomic divergence in repetitive sequences of human and chimpanzee in genes of different functional categories

Ravi Shankar · Amit Chaurasia · Biswaroop Ghosh · Dmitry Chekmenev · Evgeny Cheremushkin · Alexander Kel · Mitali Mukerji

Received: 18 July 2006 / Accepted: 13 January 2007 / Published online: 9 March 2007
© Springer-Verlag 2007

Abstract Sequencing of the human and chimpanzee genomes has revealed ~99% similarity in the coding sequence between the two species, which in no way parallels the observable phenotypic differences. The contribution of the non-coding sequences, which comprise the bulk of the genome, to functional divergence between human and chimpanzee is largely understudied. In this context, we have compared the extent of divergence in the non-coding repetitive DNA in a data set of well-classified neuronal and housekeeping genes between human and chimpanzee. The coding regions of these genes have earlier been extensively compared between the two species, which revealed that the neurodevelopmental genes show accelerated evolution compared to neurophysiology and housekeeping genes in human. In this study, comparative analysis in terms of repeat spectrum, divergence in dinucleotide content density, JC divergence and its partitioning in repeats versus unique regions, and transcription factor binding sites indicates different extents of functional constraint associated with the non-coding repeat regions. The constraints are also different when the upstream and downstream genic regions are compared across the functional categories. The neurodevelopmental genes seem to diverge more in the genic regions, whereas the neurophysiology genes show higher divergence in the upstream 2 kb region. Most of the divergence observed in the housekeeping genes is contributed by repeats. We also observe an accumulation of function-specific transcription factor profiles in the human lineage. Interestingly, a major fraction of the regulatory sites in these regions is differently partitioned in the repetitive sequences, which in turn depends on the relative distribution of the repeats across the functional categories. Thus, differential distribution of repeats across the various functional categories could have substantial effects on genome wide regulation and structure. The insights obtained from this study add a new facet to the contribution of non-coding factors, especially repeats, to the divergence of human and chimpanzee.

Communicated by P. Ruiz.

Electronic supplementary material The online version of this article (doi:10.1007/s00438-007-0210-8) contains supplementary material, which is available to authorized users.

R. Shankar · A. Chaurasia · B. Ghosh · M. Mukerji (✉)
Functional Genomics Unit, Institute of Genomics and Integrative Biology, CSIR, Mall Road, Delhi 110007, India
e-mail: [email protected]

R. Shankar
e-mail: [email protected]

A. Chaurasia
e-mail: [email protected]

B. Ghosh
e-mail: [email protected]

D. Chekmenev · A. Kel
BIOBASE GmbH, Halchtersche Strasse 33, 38304 Wolfenbuettel, Germany
e-mail: [email protected]

A. Kel
e-mail: [email protected]

E. Cheremushkin
A.P. Ershov’s Institute of Informatics Systems, 6, Lavrentiev ave., 630090 Novosibirsk, Russia
e-mail: [email protected]


Keywords Evolution · Repeats · Divergence · Transcription · Genome · Genes · Genetics · Human · Chimpanzee · Neuronal

List of abbreviations
JC distance            Jukes–Cantor distance
DC density distance    Dinucleotide content density distance
TF                     Transcription factor
LCR                    Low complexity repeat
LINE                   Long interspersed elements
SR                     Simple repeat

Introduction

Genome sequence analysis of chimpanzee and human has revealed an average divergence of about 1.2% between the two species (Chen and Li 2001; Chen et al. 2001; Lander et al. 2001). This observation is difficult to reconcile with the differences in morphological, cognitive or behavioral skills between the two species. The vast quantity of variability and diversity both within and between human and chimpanzees (Britten 2002; Caceres et al. 2003; Chen and Li 2001; Chen et al. 2001; Cheng et al. 2005; Dorus et al. 2004; Enard et al. 2002; Gilad et al. 2003; Gu and Gu 2003; Krubitzer and Kahn 2003; Watanabe et al. 2004) makes it even more challenging to distinguish variations which mark speciation events from those which determine intra-species variability. Comparative analysis of the recently available sequence of the entire chimpanzee genome with human has revealed many regional variations in the genomic landscape (Cheng et al. 2005). Although the average sequence divergence still remains nearly the same, genome wide rearrangements mediated by transposable elements/segmental duplications as well as insertions/deletions (Britten 2002; Chen and Li 2001; Chen et al. 2001) seem to have contributed a great deal to such obvious phenotype differences. Whole genome comparisons have revealed evidence of positive selection in many genes involved in sensory perception, immune defense, tumor suppression, apoptosis, spermatogenesis, olfaction and amino acid catabolism in the human lineage (Nielsen et al. 2005). However, in these genome wide studies, genes with their maximal expression in brain did not show an excess tendency towards positive selection (Nielsen et al. 2005). This is most surprising, since the most drastic changes have been observed in the evolution of a complex brain architecture, development and function in human. A majority of the observable changes in the hominid lineages has been attributed either to neutral mutations or to weaker negative selection on deleterious mutations due to a smaller effective population size compared to murid genomes (Chen and Li 2001; Chen et al. 2001; Gilad et al. 2003). That a neutral model of evolution can predict the main features of transcriptome evolution in brains of primates and mice has been further corroborated by comparing gene expression and sequence divergence in brains of primates and mice (Khaitovich et al. 2004b). However, in a number of studies, comparison of expression profiles between human and chimpanzee has revealed substantial gene expression differences in the brain compared to liver (Preuss et al. 2004; Uddin et al. 2004). This has recently been corroborated by the observation of rates of evolution of neuronal and sensory perception genes which are suggestive of an ongoing adaptive evolutionary process (Caceres et al. 2003; Dorus et al. 2004; Gilad et al. 2003; Gu and Gu 2003; Pavlicek and Jurka 2006) in these higher order functions. Most of the sequence analysis so far has focused on the protein coding regions, and the contribution of functional variations in the non-coding regions, as well as the possible role of repetitive sequences, is still largely understudied. King and Wilson in 1975 had stressed the need to look beyond the coding regions to understand this difference. The non-coding regions show substantial differences at the whole genome level and are rich in repeats. Besides, repeats have been shown to be distributed non-randomly and also to harbor transcription factor binding sites which have been demonstrated to be functionally active (Thornburg et al. 2006; Grover et al. 2005; Shankar et al. 2004). It is increasingly recognized that repeats and variations within them could have global effects on both gene structure and expression. Kouprina et al. (2004) have shown that the high density of repeats associated with ASPM gene regions is associated with indels, recombinations and lineage-specific insertions in genic regions. Stankiewicz et al. (2004) have discussed the role of transposable repeats in the evolution of the primate genome through low-copy repeats and observed some in genes expressed in brain.

Recently, Dorus et al. (2004), in an extensive study of the coding sequences of neuronal and housekeeping genes, observed accelerated rates of evolution of neurodevelopmental genes in the human lineage. Since the genes were already well identified, classified and thoroughly studied, we selected these sets of genes to investigate whether non-coding repetitive regions contribute to the evolution of higher order neuronal functions in human. We attempted to estimate the differences through analysis of these sequences in genic and 2 kb upstream promoter regions. We also retained the same set of housekeeping genes for comparison. The three categories of genes considered have different functional roles to play. For instance, the developmental genes are both spatially and temporally regulated in specific regions, whereas the physiological genes would, in addition, respond to environmental, hormonal and physiological cues. The housekeeping genes would be generally conserved and expressed in all the tissues, although the absolute amount may vary. Hence we hypothesized that each category might be subject to a different extent of functional constraint and that repeats might be influential in regulating this constraint. This study, we felt, would help us address a few questions. Firstly, do different functional categories show similar extents of divergence in the non-coding repetitive regions? Secondly, are patterns of substitutions observed in the coding regions also mirrored in the non-coding repetitive regions? Thirdly, could repeat distribution also influence the regulatory repertoire and consequently the function of a gene?

Methods

Collection of sequences

We retrieved sequences for the genic as well as 2 kb upstream regions of the genes studied by Dorus et al. from Ensembl (http://www.ensembl.org). Current releases of the sequencing draft were used to retrieve the genomic sequences of Human (July 2004), Chimpanzee (November 2003), Rat (December 2004) and Mouse (May 2005). We excluded all those sequences which were not well sequenced in either of these two regions or did not show satisfactory homology when we compared the sequences using BLAT (Kent 2002) at UCSC. Prior to any analysis, the completeness of sequences was re-verified manually and cross checked. All sorting of sequences for specific analyses was carried out programmatically.

Dinucleotide content (DC) density distance measurement

The dinucleotide combination captures the sequence composition by considering both the neighboring nucleotide effect on substitution and variations due to indels. The densities of the 16 dinucleotides were calculated for each of the categories of genes in human and chimpanzee. The distance between the various functional classes on the basis of their dinucleotide content was computed using the Euclidean distance measure. Each functional class gives a characteristic DC density profile, and the distance between classes was calculated by treating these profiles as points in a space of 16 dimensions (density scales for all 16 dinucleotides). Similarly, for each functional class the DC density distance was measured for each of the gene pairs from chimpanzee and human. The analysis was normalized for the non-sequenced regions. Through this method we were able to estimate the compositional difference satisfactorily.
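The distance described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, overlapping dinucleotides are counted, and windows containing non-ACGT characters (such as N for non-sequenced bases) are simply skipped, which is only one possible reading of the normalization for non-sequenced regions.

```python
from itertools import product
from math import sqrt

# The 16 dinucleotides, in a fixed order, define the 16 dimensions.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dc_density(seq):
    """Return the density of each of the 16 dinucleotides in seq.

    Assumption: overlapping windows are counted, and any window that
    contains a character other than A, C, G or T is ignored.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


def dc_distance(seq_a, seq_b):
    """Euclidean distance between two 16-dimensional DC density profiles."""
    profile_a, profile_b = dc_density(seq_a), dc_density(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))
```

For a class-level profile, the sequences of all genes in a functional class could be pooled before calling `dc_density`; for the gene-pair comparison, `dc_distance` would be applied to each human and chimpanzee homolog pair.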

Repeats analysis

Repeat analysis was performed using locally installed RepeatMasker (http://www.repeatmasker.org/). This program uses the repeat annotations and information from Repbase at GIRI (Jurka et al. 2005a). Repeats were sorted as required using locally written shell scripts. The repeats were first collectively analyzed for their distribution across the various regions of the selected genes. The Alu repeats were analyzed separately for their subfamily distribution. Repeat density was estimated for the 2 kb upstream region as well as the genic region of each gene, and the ratio of the densities between human and chimpanzee was calculated. The means considered here were weighted means instead of simple means, since the lengths of the sequences were variable. All sequences were normalized for the non-sequenced regions and screening was performed. In our analysis we have considered long interspersed elements, short interspersed elements, simple repeats and low complexity repeats (LCR).
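As an illustration of the length-weighted mean and the human/chimpanzee density ratio, a minimal Python sketch follows. The function names and the numbers are hypothetical and not taken from the study; the sketch assumes that repeat lengths per gene have already been extracted from the RepeatMasker output and that sequence lengths exclude non-sequenced stretches.

```python
def weighted_mean_density(genes):
    """Length-weighted mean repeat density over a set of genes.

    genes: list of (repeat_bp, sequenced_bp) tuples, one per gene.
    Weighting each per-gene density by its sequence length is the same
    as dividing the total repeat bases by the total sequenced bases.
    """
    genes = list(genes)
    total_len = sum(length for _, length in genes)
    if total_len == 0:
        raise ValueError("no sequenced bases")
    return sum(rep for rep, _ in genes) / total_len


def density_ratio(human_genes, chimp_genes):
    """Human/chimpanzee ratio of weighted mean repeat densities.

    Raises ZeroDivisionError if the chimpanzee set has no repeat bases.
    """
    return weighted_mean_density(human_genes) / weighted_mean_density(chimp_genes)


# Hypothetical numbers for illustration only (not data from the paper).
human = [(400, 2000), (100, 1000)]
chimp = [(200, 2000), (50, 1000)]
ratio = density_ratio(human, chimp)
```

A simple (unweighted) mean of the per-gene densities would give short and long sequences equal influence, which is why a weighted mean is the natural choice when sequence lengths vary.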

Alignment analysis

Two alignment programs, Stretcher and Needle (http://www.emboss.sourceforge.net/download), were installed locally. Final alignment was performed with Stretcher, since it globally aligns long genomic sequences satisfactorily at default parameters. All the pairs of genes from human and chimpanzee were aligned for their genomic sequences of genic as well as 2 kb upstream regions. The whole process was automated for a batch using Java code. All alignments were verified manually.

Discrepancies in the alignments, such as wrong indels and stretches of non-sequenced regions in either of the species, were not considered during calculation of the actual values of identity and indels or during alignments. This was done using a script written in Perl.

A program was written in C++ to retrieve a detailed description of the substitutions and indels, including the length of these stretches, their coordinates in the genes, their position in repeats and the details of the alignments. Besides this, alignments were also collected from UCSC.


The alignments and repeat maps were superimposed to determine the nature of substitutions and indels within and outside repeats. The Jukes–Cantor percentage of divergence was calculated for substitutions in these alignments. It was further normalized for estimating the extent of variation in the repeats.
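The partitioning of Jukes–Cantor divergence between repeat and unique regions can be sketched in Python as follows. This is an illustrative sketch, not the C++ or Perl code used in the study: it uses the standard Jukes–Cantor correction d = -(3/4) ln(1 - 4p/3), where p is the proportion of mismatched sites, and it assumes a hypothetical per-column boolean mask marking which alignment columns fall inside repeats. Columns with gaps or ambiguous bases are skipped.

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed mismatch proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC correction")
    return -0.75 * log(1 - 4 * p / 3)


def partitioned_jc(aln_a, aln_b, in_repeat):
    """JC distance computed separately for repeat and unique columns.

    aln_a, aln_b: aligned sequences of equal length ('-' for gaps).
    in_repeat: one boolean per alignment column (hypothetical mask
    derived from superimposing the repeat map on the alignment).
    Returns None for a partition with no comparable sites.
    """
    stats = {True: [0, 0], False: [0, 0]}  # [mismatches, compared sites]
    for a, b, rep in zip(aln_a.upper(), aln_b.upper(), in_repeat):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and non-sequenced or ambiguous bases
        stats[bool(rep)][1] += 1
        if a != b:
            stats[bool(rep)][0] += 1
    result = {}
    for rep, (mismatches, sites) in stats.items():
        key = "repeat" if rep else "unique"
        result[key] = jukes_cantor(mismatches / sites) if sites else None
    return result
```

Multiplying the returned distances by 100 gives a percentage divergence comparable in form to the values discussed in the Results.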

The complete alignment analysis result is available in detailed format in the supplementary information (Supplementary Table 1).

Upstream region analysis

Using the golden path and table browser at UCSC, we retrieved the aligned 2 kb upstream sequences of the genes considered from Human, Chimpanzee, Mouse and Rat. In these 2 kb upstream sequences we also identified conserved non-coding regions (CNCN) and regulatory sites using Mulan and multiTF at Dcode.org (http://www.dcode.org), at optimum threshold values. The conserved regions in these aligned sequences were considered for estimation and comparison of divergence between functional categories within and between both the lineages. Furthermore, we analyzed overall divergence within each lineage, wherein we also included regions which did not align between human and murid genomes.

Using a previously developed phylogenetic anti-footprinting program, we identified TF binding sites gained and lost in human and chimpanzee. Position relative to the transcription start site was considered an important factor and evaluated accordingly. The analysis of site gain and association with signal pathways was done using the TRANSPATH database and the associated tool ArrayAnalyzer (Krull et al. 2003). The detailed result file is attached with the supplementary files. The data obtained were subjected to a multiple binomial test (Table 1) to test the hypothesis of enrichment of the TF sites.
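The idea behind a binomial test for site gain can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure implemented in the anti-footprinting program: it assumes that, under the null hypothesis, a lineage-specific site for a given TF is equally likely to be found in the human or the chimpanzee promoter (p = 0.5), and it computes a one-sided tail probability. The function name and the counts are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p).

    Here k would be the number of lineage-specific sites seen in human
    and n the total number of lineage-specific sites (human + chimp).
    """
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts for illustration only: 8 of 10 lineage-specific
# sites for some TF matrix are found in the human promoters.
p_value = binomial_tail(8, 10)
```

Because one such test is run per TF matrix and per functional class, some correction for multiple testing would normally be applied to the resulting P values; the specific correction used in the study is not stated in this section.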

In all the analyses reported here, false positive results were removed to minimize noise.

Statistical analysis

All statistical analysis was performed using the publicly available statistical language package R (http://www.r-project.org/), installed locally on a Red Hat Linux machine. We used Fisher's exact test to assess repeat density bias between the various functional categories. Wilcoxon tests (Mann–Whitney test) were performed to identify differences in distribution across different functional categories as well as across the species. We also verified our findings with Kolmogorov–Smirnov tests. A Kruskal–Wallis test was done to test for differences in distance across the functional classes. In order to test for differences and increases in divergence across the functional classes as well as lineages, we performed binomial tests. The same was done to test the enrichment of the TF sites during anti-footprinting analysis as well as to test the differential partitioning of mutations in repeats.
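The study ran these tests in R. Purely as an illustration of what Fisher's exact test computes on a 2 × 2 table (for example, genes with and without repeats in the upstream region for two functional categories), a small standard-library Python sketch is given below. The function name and the example counts are hypothetical; the two-sided P value is taken as the sum of the probabilities of all tables that are no more likely than the observed one, which is the usual convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Rows could be two functional categories; columns could be genes
    with and without repeats in the 2 kb upstream region.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts for illustration only (not data from the paper).
p_value = fisher_exact_two_sided(3, 0, 0, 3)
```

In R the equivalent call would be `fisher.test` on a 2 × 2 matrix.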

Results

At the outset, we retrieved homologs of all the genes of different functional categories (earlier studied by Dorus et al.) from both the species and confirmed them through alignments and BLAT (http://www.genome.ucsc.edu/cgi-bin/hgBlat), since the earlier study was restricted to cDNA sequences of these genes. All those genes that did not have complete sequences or had a large number of non-sequenced or ambiguous stretches were filtered out prior to this analysis. Eventually the analysis was carried out on the genomic sequences of 47 neurodevelopmental, 76 neurophysiological and 62 housekeeping genes. We compared the overall differences between chimpanzee and human in the 2 kb upstream region as well as in the entire downstream genic region comprising the coding regions, 5′ UTR, 3′ UTR and introns of the genes. We also examined whether there were differences across different functional categories and whether these correlated with the differences observed in coding region sequences. Homologous genes from the same functional categories in murids were taken as a control for the comparative study as well as to test whether the findings are specific to hominids. We observed that in the majority of cases the genes were syntenically conserved in chimpanzee and human and always present in homologous chromosomes. However, in a few cases inversions were also observed (Supplementary Table 2).

Table 1 Result obtained from antifootprinting analysis of site gains for transcription factor binding sites 2 kb upstream of the genes of the three functional classes

We show the P values of obtaining the observed difference in number of sites between human and chimp promoters, calculated using the binomial distribution test

Transcription factors   Developmental   Physiological   Housekeeping
HNF1_Q6                 1.93E-37        1.56E-73        4.43E-74
FOXD3_01                9.40481E-07     5.51057E-05     1.1285E-11
FOX_Q2                  9.15772E-07     0.002791334     4.97E-11
EVI1_04                 0.000110982     3.39547E-05     1.75127E-05
GATA4_Q3                0.002275692     0.000140678     9.00E-11
HNF3ALPHA_Q6            0.000773584     1               4.05924E-06
SP1_01                  0.00025413      1               8.03E-04
IRF7_01                 1               1               0.007316649
HNF3B_01                1               1               0.008766618
CDPCR1_01               0.009605408     1.06279E-06     1
CDX2_Q5                 0.007616101     0.00355765      1
PAX6_01                 0.008239117     0.000222845     1
MAZR_01                 7.24793E-05     1               1
KROX_Q6                 0.00533692      1               1
MAZ_Q6                  0.001828883     1               1
MEF2_Q6_01              1               0.004841425     1
NFKB_Q6_01              1               0.005836361     1
PAX_Q6                  1               1               0.000266527
E2F_Q6                  0.002162002     1               0.003828026
E2F_Q4                  1               1               0.001443624

Differential accumulation of repeat classes

A conserved and nearly identical repeat profile, with very few exceptions, was observed in all the homologs in the hominid lineage (Figs. 1, 2). A significant number of developmental genes were observed to be entirely devoid of repeats in the upstream 2 kb region, whereas repeats were significantly enriched in housekeeping genes (P value <0.05 for Fisher's exact test). The majority of genes of the physiological category also harbored repeats.

Analysis of different kinds of repeats revealed substantial variation in terms of numbers as well as density between the upstream and the genic regions. Amongst the repeats, Alu was observed to be most abundant, followed by LINEs and simple repeats, and the repeat density was also observed to be higher in the 2 kb upstream regions compared to genic regions. Alu density had a negative correlation with LINE density in all the cases, and unlike other repeats Alu density was also high in the genic region of housekeeping genes. Almost no increment in the density of Alu repeats was observed in either region (Fig. 1a, b; Supplementary Table 3) in hominids. The density of simple repeats was observed to have increased in human in all the categories of genes, both in the upstream and downstream regions, and this was especially significant in the promoter regions of the housekeeping genes (Fig. 1a, b). An earlier comprehensive analysis of human transposable element insertions has shown that LINE and SINE elements are strongly biased towards regions of different GC content (Lander et al. 2001; Grover et al. 2004). Taking this into account, we compared the GC content and the repeat density for the three functional categories (Tables 2, 3). In corroboration with earlier observations this was not found to be significant (Supplementary Table 4).

Fig. 1 a Collective repeat density ratio (human/chimpanzee) in 2 kb upstream regions of genes of different functional categories. Simple repeats, Alu repeats and long interspersed repeats were considered for their distribution and density in the 2 kb upstream regions of the three functional classes of genes. The simple repeats have increased by approximately twofold in housekeeping genes, whereas the Alu repeats and LINEs remain almost static in most of the cases. The housekeeping genes have the highest density of repeats, whereas neurodevelopmental genes have the least. b Repeat density ratio (human/chimpanzee) in the genic region of genes of different functional categories. Simple repeats, Alu repeats and long interspersed repeats were considered for their distribution and density in the genic regions of the three functional classes of genes. No major increment in the repeat content was observed in human

In the murid genome, overall repeat density was observed to be lower than in the hominid genome. Between the functional categories no significant difference in accumulation of repeats was observed in the murid lineage. However, it is worth mentioning that the apparent deficit of retrotransposon-derived sequences could also be due to a higher substitution rate, which makes it difficult to recognize ancient repeat sequences in the murid lineage (Waterston et al. 2002; Jurka et al. 2005b; Gibbs et al. 2004). Besides, it has been reported that the ability of the RepeatMasker program to detect repeats falls off rapidly for divergence above 37%. Since Alus are absent in the murid genome, a direct comparison of Alu profiles could not be made between the lineages.

Dinucleotide composition divergence and repeat influence

We next used dinucleotide content (DC) density as a measure to estimate whether DNA sequences evolve alike in both species in all the three categories of genes as well as between each gene pair (human and chimpanzee) for the various functional classes. By measuring compositional drift, we might detect gross sequence alterations in the homologous genes during the course of evolution. The distance measure using dinucleotide content has been described in the Methods section. We observe differences in the DC density between the upstream 2 kb and the genic regions in all the categories of genes, and the trends are similar between human and chimp. However, when we compare the human/chimp ratio of DC density in the 2 kb upstream and genic regions, we observe significant differences with respect to the different categories of genes. While the developmental genes do not show much variation, there is a wide fluctuation in the DC density ratio for both the physiological and the housekeeping genes (Supplementary Table 5). A wide difference in DC density could be due either to deletions/insertions or to accumulation of extraneous sequences, which could be simple sequence repeats or interspersed transposable elements. Hence, analysis of the DC density was also carried out after masking the simple as well as complex repeats and also LINEs and SINEs. We observe that masking of the repeats alters the dinucleotide characteristics in the upstream 2 kb region but not in the genic region of the neurodevelopmental and housekeeping genes. However, this effect is not observed in either the upstream 2 kb or the genic region of the physiological genes.

We also compared the extent of DC density distance between the functional categories, i.e. (1) developmental and housekeeping, (2) physiological and housekeeping and (3) developmental and physiological gene sequences in chimp and human. We observe increased distance between developmental and physiological genes and between developmental and housekeeping genes in the 2 kb upstream region. This is due to divergence in both the housekeeping and physiological genes. This is substantiated by the analysis on individual genes for the DC density distance between human and chimpanzee (Table 4).

Fig. 2 a Alu subfamily density ratio (human/chimpanzee) in 2 kb upstream region. The housekeeping genes are richest in Alu repeats. b Alu subfamily density ratio (human/chimpanzee) in genic region

Table 2 GC percentage and repeat density (Alu and LINE) for the 2 kb upstream region

No correlation can be found between GC content and repeat density

Class                 GC      Density (Alu)   GC      Density (LINE)
2 kb upstream
NeuroDevelopmental    49.80   0.22            52.10   0.13
NeuroPhysiological    48.62   0.20            48.12   0.12
Housekeeping          48.55   0.25            47.02   0.17

Table 3 GC percentage and repeat density (Alu and LINE) for the genic region

No correlation can be found between GC content and repeat density

Class                 GC      Density (Alu)   GC      Density (LINE)
Genic
NeuroDevelopmental    44.26   0.10            44.32   0.10
NeuroPhysiological    43.84   0.09            43.91   0.11
Housekeeping          44.45   0.20            43.82   0.09

Analysis of dinucleotide distance after masking the repeats reveals that a majority of the difference in the neuronal genes could be ascribed to the accumulation of simple repeats. Interestingly, upon masking Alu repeats, the overall distance between all the classes described earlier increases in both chimp and human when compared to unmasked sequences. Strikingly, this difference widens more between developmental and housekeeping genes in human than in chimp. This suggests that the presence of Alu repeats in the upstream regions contributes substantially to dinucleotide content similarity or sequence homogenization (Fig. 3a–c). When we look at the unmasked sequences, we observe that the distances between all the classes have undergone some sort of leveling effect in human (Fig. 3a).

Table 4 The H test was done to test the differences in dinucleotide composition observed in the 2 kb upstream region of the genes of various functional categories between human and chimpanzee

NDG neurodevelopmental genes, HKG housekeeping genes, NPG neurophysiological genes

Compared functional categories   H test values   Critical region   Significant
NDG–NPG–HKG                      0.71            ≥0.6745           Yes
NDG–HKG                          0.96            ≥0.0026           Yes
NPG–HKG                          0.4             ≥0.6985           No

Fig. 3 a Distance between different functional categories based on dinucleotide content density in 2 kb upstream region. Compositionally the 2 kb upstream regions of neurophysiological and housekeeping genes have undergone higher changes in human. b Distance between different functional categories based on dinucleotide content density in 2 kb upstream region after masking the simple and LCR repeats. We observe a decrease in distance between the functional classes in 2 kb upstream region. c Distance between different functional categories based on dinucleotide content density in 2 kb upstream region obtained after masking Alu repeats. The distance increases once Alus are removed, indicating that Alu might be playing a homogenizing role in the 2 kb upstream region and could be functionally important. d Distance between different functional categories based on dinucleotide content density in genic region without masking repeats. No major difference is found between human and chimpanzee. Compositionally neuronal genes are closest to each other. e Distance between different functional categories based on dinucleotide content density in genic region obtained after masking the simple and low complexity repeats. No major difference was observed, indicating that simple repeats do not play a major role in divergence between the three categories of genes. f Distance between different functional categories based on dinucleotide content density in genic region obtained after masking Alu repeats

Strikingly, no major difference in DC density distance is observed between human and chimpanzee in the genic region in any of the categories, irrespective of the presence or absence of repeats. However, unlike the upstream 2 kb region, the distance between housekeeping and developmental/physiological genes is much greater in the genic region in both the lineages. This could be ascribed to a higher accumulation of repetitive sequences in the housekeeping genes compared to the other classes, as described in the earlier section (Fig. 3d–f). Compositionally the highest distance was observed between physiological and housekeeping genes.

Partitioning of divergence between non-repetitive and repetitive regions

We further analyzed the sequence divergence in the 2 kb upstream and genic regions by performing genomic-level alignments of the homologous sequences from human and chimpanzee. The aligned genic regions comprised 6,710,828 bp of neurophysiological, 5,412,364 bp of neurodevelopmental and 1,721,284 bp of housekeeping genes. Similarly, the aligned 2 kb upstream regions of the three classes comprised 156,214 bp of physiological, 94,127 bp of developmental and 127,754 bp of housekeeping genes.

Following alignment, we estimated the Jukes–Cantor (JC) divergence in all three functional classes of genes in both the lineages. We also mapped the aligned regions containing repeats in the respective sequences, which enabled us to estimate the divergence due to substitutions in the repeat regions. We observed different extents of divergence in the genic and upstream regions. Strikingly, the developmental genes show a higher degree of divergence in the genic regions than in the upstream 2 kb regions, while the reverse holds for physiological and housekeeping genes.
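For reference, the Jukes–Cantor distance is obtained from the proportion p of mismatched aligned sites as d = -(3/4) ln(1 - 4p/3). The sketch below is a minimal illustration of that formula with an optional repeat mask. How gaps and repeat boundaries were handled in the actual analysis is not stated in this section, so skipping gapped, ambiguous and masked columns is an assumption, and the function name is hypothetical.

```python
from math import log


def jukes_cantor_distance(seq_a, seq_b, repeat_mask=None):
    """JC distance between two aligned sequences of equal length.

    Columns with a gap or ambiguous base are skipped. If repeat_mask is
    given (one boolean per column, True = column lies in a repeat), the
    masked columns are skipped as well.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if repeat_mask is None:
        repeat_mask = [False] * len(seq_a)
    if len(repeat_mask) != len(seq_a):
        raise ValueError("repeat_mask must have one entry per column")

    compared = mismatches = 0
    for a, b, masked in zip(seq_a.upper(), seq_b.upper(), repeat_mask):
        if masked or a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable columns")

    p = mismatches / compared
    if p >= 0.75:
        raise ValueError("JC distance is undefined for p >= 0.75")
    return -0.75 * log(1 - 4 * p / 3)
```

Calling the function once without a mask and once with the repeat columns masked yields the two quantities whose ratio is tabulated as "without repeats/with repeats".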

Interestingly, even though the trend of divergence remains the same, the extent of divergence is reduced if the JC distances are computed after removing the repetitive sequences. The decline is most drastic in the housekeeping genes, followed by neurophysiological genes and developmental genes, in both genic and upstream regions (Tables 5, 6). This suggests that the housekeeping genes have accumulated more substitutions in the repeat regions both upstream and within the genic region. For the genic regions, nearly 42% of the substitutions in the housekeeping genes fall within repeats, compared to only 26% in the neurodevelopmental genes (Supplementary Table 6). Besides, in the genic regions, this divergence is extremely reduced in physiological and housekeeping genes. Once the repeats are removed, both the categories show similar divergence (P value = 0.03 using Mann–Whitney test). However, the developmental genes show higher divergence even after masking repeats in the genic regions, with a significantly increased difference from the housekeeping genes (P value <0.005 using Mann–Whitney test).

We next analyzed whether the trend for partitioning of substitutions within repeats in different functional categories was comparable between hominid and murid lineages. The significant differences in the partitioning of substitutions in the repeat regions amongst the functional categories in the hominid lineages (P value <<0.01 for binomial test) were not observed for murid lineages (P value >>0.05 for binomial test). Besides, the Mann–Whitney test described earlier did not give any significant result when performed in the murids. It is possible that the lower density of repeats in the murids contributes to this difference in the partitioning of substitutions when compared to the human lineage (Table 6).

Table 5 JC divergence in the upstream 2 kb region. For each gene the Jukes–Cantor distance was calculated separately for the 2 kb upstream region from the alignment data

Functional class   JC distance             JC distance                Ratio of JC distance
                   (with repeats, %)       (without repeats, %)       (without repeats/with repeats)
Developmental      1.20                    1.05                       0.87
Physiological      2.15                    1.91                       0.88
Housekeeping       2.33                    1.40                       0.60

Table 6 Ratio of JC divergence in the genic region in the absence and presence of repeats. The Jukes–Cantor distance calculated between human and chimpanzee genes for genic regions suggests a role of repeats in divergence

Functional class   JC distance             JC distance                Ratio of JC distance              Murid
                   (with repeats, %)       (without repeats, %)       (without repeats/with repeats)    (without repeats/with repeats)
Developmental      2.60                    1.90                       0.73077                           0.952381
Physiological      1.33                    0.89                       0.66917                           0.952381
Housekeeping       1.55                    0.89                       0.57419                           0.961538
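The ratio columns of Tables 5 and 6 can be recomputed directly from the two JC distance columns. The short sketch below does this from the tabulated values; the variable and function names are hypothetical. Small differences in the last digit relative to Table 5 are to be expected, since the published values are given to two decimals.

```python
# (JC distance with repeats %, JC distance without repeats %),
# copied from Table 5 (2 kb upstream) and Table 6 (genic region).
TABLE_5_UPSTREAM = {
    "Developmental": (1.20, 1.05),
    "Physiological": (2.15, 1.91),
    "Housekeeping": (2.33, 1.40),
}
TABLE_6_GENIC = {
    "Developmental": (2.60, 1.90),
    "Physiological": (1.33, 0.89),
    "Housekeeping": (1.55, 0.89),
}


def repeat_ratio(with_repeats, without_repeats):
    """Ratio of JC distance without repeats to JC distance with repeats."""
    return without_repeats / with_repeats


def ratios(table):
    """Map each functional class to its without/with-repeats ratio."""
    return {name: repeat_ratio(w, wo) for name, (w, wo) in table.items()}


if __name__ == "__main__":
    for label, table in (("upstream", TABLE_5_UPSTREAM), ("genic", TABLE_6_GENIC)):
        for name, r in ratios(table).items():
            print(f"{label:8s} {name:14s} {r:.3f}")
```

A lower ratio means that a larger share of the measured divergence disappears when repeats are masked, which is the pattern reported for the housekeeping genes in both regions.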

Trends of divergence in the proximal promoter regions and role of repeats

In order to estimate functional correlates of sequence divergence, we focused our analysis on the 2 kb upstream region of all three functional classes of genes. This is because the regulatory sites in non-coding regions are heterogeneous and the proximal promoter regions are the best characterized. However, we would like to emphasize that changes far upstream of the 2 kb regions could also influence expression. In both lineages, we compared divergence in the entire upstream region as well as in the conserved non-coding nucleotides (CNCN) within these regions across all four genomes. CNCN sites are expected to be conserved due to functional constraints and hence might show different divergence compared to the entire upstream region.

Comparison of divergence between upstream and CNCN regions across the murids and hominids reveals significantly reduced divergence in the CNCN regions of housekeeping genes in hominids.

A Wilcoxon test was performed between the various functional classes for murids and hominids to compare the distribution of divergence in the CNCN regions. The differences between the functional classes are more significant in murids (Table 7). In hominids there seems to be a leveling effect, which could be due to a decrease in divergence in the CNCN regions of housekeeping genes.
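The comparison of two functional classes by a rank-based test can be sketched as follows. This is a minimal illustration of the Wilcoxon rank-sum statistic (equivalent to the Mann–Whitney U) with a two-sided normal approximation. The absence of a continuity or tie correction is a simplifying assumption, the function name is hypothetical, and the software actually used for the tests is not stated in this section.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Return (U statistic for x, two-sided P value) for two samples.

    Ties receive average ranks. The P value uses the normal approximation
    without continuity or tie correction (a simplification).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the observations, remembering which sample each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))
```

In this setting, x and y would hold the per-gene CNCN divergence values of two functional classes within one lineage.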

In order to determine in which specific functional category the divergence is more drastic within the lineages, we carried out a binomial test. We analyzed the fraction of genes that have undergone high divergence in hominids and murids in each functional category.

We observe a sharp decline in divergence in the CNCN of housekeeping genes in the human lineage (P value 0.009 for binomial test) and a slight increase in developmental genes in hominids (P value 0.02). No significant difference was found for neurophysiological genes across murids and hominids. Overall, however, more regulatory regions in the CNCN are conserved in developmental genes than in the other two classes. In human, the increased divergence seen when the entire 2 kb upstream regions are considered could be due to the global degradation reported earlier by Keightley et al. (2005). However, if we dissect the genes on the basis of functional category, this degradation is comparatively lower in the 2 kb upstream regions of neurodevelopmental genes. This could be due to regulatory and functional constraints, as they are part of complex adaptive processes (Lee et al. 2005).
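The binomial test on the fraction of highly diverged genes can be illustrated with an exact two-sided test. The precise set-up of the test is not given in this section, so the sketch below assumes that the fraction observed in one lineage serves as the null proportion for the count observed in the other. The function name is hypothetical and no counts from the study are used.

```python
from math import comb


def binomial_test_two_sided(successes, trials, p_null):
    """Exact two-sided binomial P value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null proportion p_null.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 0 <= p_null <= 1:
        raise ValueError("p_null must be a probability")

    pmf = [
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so that equally likely outcomes are included.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, total)
```

For example, with hypothetical numbers, `binomial_test_two_sided(8, 10, 0.5)` asks how surprising 8 highly diverged genes out of 10 would be if the expected fraction were one half.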

To further delineate subtle variations between human and chimpanzee in the conserved regions of the 2 kb upstream region, we carried out phylogenetic anti-footprinting. This method compares and estimates the extent of transcription factor (TF) binding site gain and loss between human and chimpanzee. Analysis of the TF binding sites using this method revealed an interesting distribution of various transcription factors whose sites were gained in human promoters versus chimpanzee promoters (Table 1). First of all, we observed significant enrichment of sites for several ubiquitous transcription factors, such as GATA, C/EBP, SP-1 and some of the FOX factors, in human promoters compared to chimp promoters. This trend was observed in all three groups of promoters, although it was much more pronounced in the promoters of housekeeping genes. Interestingly, such a high difference in site gain between human and chimp promoters cannot be explained just by the difference in the background frequencies of these sites in the promoters (Table 7). It was intriguing to observe that sites for tissue-specific factors, such as HNF-1 and HNF-3, are much more frequently gained in human promoters than in chimp promoters. This can partially be explained by the AT richness of the HNF-1 site consensus and the gain of short AT-rich repeats in human promoters (Supplementary Table 7).
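The idea of counting site gains between aligned promoters can be illustrated with a deliberately simplified sketch. It reports alignment positions where a consensus string matches the human sequence but not the chimpanzee sequence. This is not the anti-footprinting procedure used in the study, whose scoring details are not given in this section: exact matching against an IUPAC consensus, the ungapped-window treatment and all names below are assumptions made for illustration.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(window, consensus):
    """True if every base in window is allowed by the consensus code."""
    return len(window) == len(consensus) and all(
        base in IUPAC[code] for base, code in zip(window, consensus)
    )


def site_gains(human, chimp, consensus):
    """Start positions where the consensus matches human but not chimp.

    Both sequences must come from the same alignment and have equal
    length. Windows containing gaps never match, because '-' is not an
    allowed base for any code.
    """
    if len(human) != len(chimp):
        raise ValueError("aligned sequences must have equal length")
    human, chimp, consensus = human.upper(), chimp.upper(), consensus.upper()
    k = len(consensus)
    return [
        i
        for i in range(len(human) - k + 1)
        if matches(human[i:i + k], consensus)
        and not matches(chimp[i:i + k], consensus)
    ]
```

Swapping the first two arguments gives the corresponding losses in human relative to chimpanzee. A realistic analysis would score sites with position weight matrices rather than exact consensus matches.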

One of the most important findings is that we observed a significant gain of new sites for factors such as PAX-6, CDX-2 and MAZR, which are known to be involved in the early processes of brain development. Overrepresentation of these sites was observed exclusively in the two groups of brain-specific promoters. Moreover, the PAX sites were clearly underrepresented in the housekeeping promoters.

Table 7 Wilcoxon test of the significance of divergence variation across the functional categories for conserved non-coding nucleotide regions (CNCN) in the 2 kb upstream region. The test shows significant variation across the functional categories in murids for CNCN regions, while the variation is insignificant in hominids, which appears to be a consequence of the leveling effect observed in hominids

Functional categories compared   P value for Wilcoxon test
NDG–HKG                          0.004
NDG–NPG                          0.01
NPG–HKG                          >0.05


It is also interesting to observe that gains of sites for NF-kappaB and MEF-2 are overrepresented in the human promoters of the physiological genes. These factors are known to be involved in the regulation of many physiological processes. Some signal transduction pathways that are involved in the concerted regulation of TFs may show differences associated with site gain in the promoters (Supplementary Fig. 1a, b).

Next, we estimated how much of this gain of sites is associated with the repeat regions in human. For this purpose we mapped the repetitive regions and the regulatory sites in the proximal promoter regions. We found that approximately one-third of the differences in TF sites between the two species are associated with repeats. Fisher's exact test was carried out to estimate whether this gain of sites was common to all the functional categories. We found that repeats contributed a significant fraction of regulatory sites in the housekeeping genes and the least in the neurodevelopmental genes. This is in concordance with our earlier results (Table 8). It strongly suggests that the distribution of repeats influences site gain/loss in a functional category-specific manner.
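A pairwise comparison of two functional categories by Fisher's exact test operates on a 2 x 2 table of gene counts (genes with and without repeat-associated site gain in each category). The sketch below is a minimal two-sided implementation based on the hypergeometric distribution. The function name is hypothetical, and because the underlying gene counts are not given in this section, only hypothetical counts should be used with it.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]].

    Rows could be two functional categories, columns the number of genes
    with and without repeat-associated site gain.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    if n == 0:
        raise ValueError("table is empty")

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum all tables that are no more likely than the observed one.
    total = sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, total)
```

Running the test for each pair of categories would produce a symmetric matrix of P values with the layout of Table 8.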

Discussion

The non-coding upstream 2 kb region follows a distinct pattern of sequence evolution compared to the non-coding genic regions. This is corroborated by the observations of dinucleotide patterns, repeat density profiles, JC divergence and transcription factor profile analysis. This is intuitively expected, since the two regions could be subject to different degrees of functional constraint. Interestingly, even though the trends are similar, there also seem to be different degrees of constraint depending on the functional category of genes. The classification of genes into neurodevelopmental, neurophysiological and housekeeping, based on an earlier study by Dorus et al. (2004), allowed us to observe these differences.

When we focus further on the gene structure, the upstream regions are overall more GC rich than the genic regions in all the categories, which revalidates earlier observations (Webster et al. 2003). There was no significant difference in GC content between genes of different functional categories. However, we observe different degrees of constraint on the accumulation of repeats between these categories. This further substantiates an earlier observation of a non-random, functional category-dependent distribution of repeats in the human genome (Grover et al. 2003). The housekeeping genes, which are considered to be the most conserved, have accumulated the most repeats in both regions, with an exceptionally high number of simple repeats in the upstream 2 kb regions. This enhanced accumulation of repetitive elements in the non-coding regions of human housekeeping genes has also been reported (Vinogradov 2006).

Does the presence of these elements confer any functional advantage, or are they selectively disadvantageous? Their low numbers in developmental genes suggest the latter. However, repetitive sequences earlier considered junk are increasingly being shown to play many regulatory roles, both structurally and functionally (Conrad et al. 1986; Deininger and Batzer 1999; Grover et al. 2005; Versteeg et al. 2003). Therefore, their differential accumulation could lead to altered gene regulation in many different ways. For instance, simple repeats, through formation of alternate secondary structures in response to physiological cues, could alter nucleosome positioning and affect transcription site accessibility (Brahmachari et al. 1995; Englander and Howard 1995; Englander et al. 1993). It has recently been shown that the chromatin architecture with respect to S/MAR binding, nucleosome positioning potential and repetitive sequences is significantly different in human housekeeping and tissue-specific genes (Ganapathi et al. 2005). On the other hand, the Alu elements have been shown to accumulate a large number of RNA pol II regulatory sites, so that their accumulation in the upstream regions could lead to novel regulatory networks (Britten 2002; Englander and Howard 1995; Englander et al. 1993; Ludwig et al. 2005; Shankar et al. 2004). The observation that the Alu repeat density is constant between the two species while the Alu subfamily density varies remarkably between them points to their possible dynamic behavior in the 2 kb upstream regions. Due to variations at some specific sites, the Alu repeats are classified into different subfamilies. We observe these subfamily differences between human and chimp in the upstream regions (Fig. 2). The proportion of the oldest Alu subfamilies in human compared to chimp is lower in all functional categories except for housekeeping genes, while middle-aged Alus, which are rich in regulatory elements, are at equilibrium with no major shift across the classes. This suggests that the middle-aged subfamilies, through acquisition of regulatory sites by mutation, may get fixed in the human lineage and hence not convert to older subfamilies. Further, conversion of newer and younger subfamilies could lead to similar overall maintenance of Alu density. This is marked by the observation that all the functional categories have increased younger Alus in the 2 kb upstream regions of human genes. This demonstrates the continuous cycle of conversion from one subfamily to another due to mutations in the 2 kb upstream region, which might have an important influence on the regulatory structure of these regions.

Table 8 Fisher's exact test shows significant impact of repeat distribution on the regulatory site gain in the 2 kb upstream regions of genes in human. It is highest for housekeeping genes (HKG) followed by neurophysiological genes (NPG) and neurodevelopmental genes (NDG). The relative site gains were compared between the categories, and the difference is highest between neurodevelopmental and housekeeping genes

Functional class   Percentage of genes       P value (HKG)   P value (NPG)   P value (NDG)
                   influenced by repeats
HKG                54.8                      –               0.056           0.0006
NPG                43.4                      0.056           –               0.012
NDG                23                        0.0006          0.012           –

LINE elements, besides harboring regulatory elements, also have internal promoters which could lead to novel transcription initiation sites. The negative correlation between Alu and LINE repeats is not a novel observation, nor is their contrasting presence in the upstream and genic regions, given that Alus tend to be preferentially associated with GC-rich regions and LINEs with AT-rich regions (Wichman et al. 1992). What is surprising, however, is their abundant presence in the housekeeping genes. This is intriguing since it is well known that accumulation of repetitive sequences in promoters could also be deleterious, as they are frequently involved in non-homologous recombination leading to gene duplication, rearrangements and deletions (Bailey et al. 2003; Callinan et al. 2005). It is also possible that, since both the housekeeping genes and Alu elements tend to prefer GC-rich regions, they tend to cluster together. Since repetitive sequences are also known to position nucleosomes, their presence in the housekeeping genes might ensure an open structure for transcription factor binding and constitutive expression (Britten 2002; Englander and Howard 1995; Englander et al. 1993).

We observe that, compared to developmental genes, the housekeeping and physiological genes have diverged significantly in the hominid lineage in the upstream 2 kb regions. This seems to be in conformity with the earlier reported observation of widespread degradation of gene control regions in the hominid genome (Keightley et al. 2005). However, our studies suggest that repetitive sequences contribute significantly to divergence in regions associated with genes of different functional categories. Notably, neurodevelopmental genes are significantly devoid of repeats in the upstream regions. This could be due to stringent selection both at the level of repeat integration and in the removal of possibly deleterious integrations. It has been observed that genes with tissue-specific expression have a more compact chromatin structure, which may prevent integration of transposable elements. Besides, much of the apparent homogeneity in the 2 kb upstream region between physiological and housekeeping genes is lost once the dinucleotide distances are estimated in the absence of Alu repeats, which suggests that the structure of the physiological genes is distinct from that of the housekeeping genes and that the presence of Alu repeats masks this difference. The dinucleotide divergence described earlier gives a qualitative estimate of compositional differences in sequences. How much of this difference is contextual? To address this issue we performed genomic sequence alignments and compared the JC distance between the genes of the same categories in the presence and absence of the repeats. As observed with respect to dinucleotide patterns, the higher divergence observed in the housekeeping and physiological genes in upstream 2 kb regions is to a great extent reduced by removing the repeat sequences. Despite higher rates of substitution in the murid lineage, such a partitioning of divergence between repeats and unique regions was not observed in the murid lineages in any of the functional categories. As described earlier, this could be due either to our inability to detect repeats in murids or to higher substitution rates.

Strikingly, in the genic region the developmental genes show higher divergence even though their upstream 2 kb region seems to be nearly conserved. This is just the reverse of what is observed for the physiological genes. The unique regions (sequences devoid of repeats) seem to be well conserved in the physiology and housekeeping genes compared to neurodevelopmental genes, which mirrors to a great extent the observations of Dorus et al. (2004) from Ka/Ks estimations. In the housekeeping genes, as in the upstream region, the repetitive sequences have accumulated most of the changes compared to the unique regions, and there is extreme conservation in these regions once the repeats are removed. It is worth reiterating that genes of different functional categories do not reside in regions with different mutation rates. This is substantiated by earlier observations (Dorus et al. 2004) on the same set of genes. In order to compute the average Ka/Ks and Ks values for each category of genes we used the values reported by Dorus et al. (Table 9). We observe that the Ks values for the categories of genes are similar and that, due to different Ka values, the Ka/Ks values are different (maximum for neurodevelopmental and minimum for housekeeping genes; Supplementary Table 8). It has recently been demonstrated through genome-level human–mouse comparisons that the fraction of conserved sequence and its absolute length were higher in introns of tissue-specific genes than in those of housekeeping genes (Vinogradov 2006). Vinogradov proposed that a considerable length of the non-coding genic region might be under selection in tissue-specific genes due to its possible involvement in higher-order functional complexity mediated by chromatin. This has been proposed as a "genomic design model", which postulates that the length of genomic elements is determined by their function. In contrast, a recent work on genes expressed in pollen of Arabidopsis thaliana showed that selection for efficiency, rather than genomic design or regional mutational bias, plays a major role in shaping intron content (Seoighe et al. 2005). Neither of these studies analyzed the contribution of repetitive sequences in shaping intron content. Our analysis in the hominid lineage suggests that in genes involved in housekeeping and neurophysiology there is extreme conservation in the genic non-repetitive regions, whereas in the neurodevelopmental genes considerable divergence is observed even after removing repeats. The higher level of divergence in genic regions of neurodevelopmental genes to some extent mirrors the observations of non-neutral rates of mutation in the coding regions (Dorus et al. 2004). This work suggests that the genic content is under different extents of functional constraint.

The significance of these repeats in human genome evolution is also marked by the differences observed between murid and hominid genomes in patterns of divergence, where the difference in repeat density and distribution seems to be a major influencing factor. We observe that one-third of the difference in transcription factor site profile in the upstream region is contributed by the repeat elements. Therefore, accumulation of novel sites through these elements in the housekeeping genes may further increase their scope of expression in heterogeneous tissues. Since there has been an increase in the transcription factor binding sites in the human lineage, gradual accumulation and fixation of mutations in the repeats could lead to such events. As the physiological genes are subject to differential regulation in response to cellular cues, this could also hold true for this category of genes. However, it should be emphasized that only a fraction of the variations in the repeats may be functionally relevant. Since the developmental genes need to be much more stringently regulated, this might explain the absence of repeats through negative selection. As a follow-up, we assessed the effect of repeats on this gain of transcription factor binding sites in human. The differential accumulation of repeats in the three functional categories seems to influence the site gain/loss phenomenon directly. Housekeeping genes, which are rich in these repeats, the majority being Alu repeats, have the highest number of such sites within genes, while the number is lowest in developmental genes, wherein repeats are also sparse. This observation suggests that the distribution of repeats could guide and influence the regulatory repertoire in a functional category-specific manner.

Notably, analysis of the CNCN regions also suggests a greater degree of conservation in the CNCN regions of the developmental genes compared to neurophysiological and housekeeping genes. This is in concurrence with the observations of Khaitovich et al. (2005), who reported that ubiquitously expressed genes have extremely low divergence. Further, a recent study comparing the conservation of promoters upstream of genes classified in different functional categories ranked genes involved in development amongst the highest (Lee et al. 2005). In this part of the study too, the differences between murids and hominids in the pattern of CNCN region evolution could be influenced by the repeats. While murids exhibit sharp differences across the functional categories for such regions, these are reduced in hominids. The 2 kb upstream region is devoid of repeats in murids, while in hominids these regions are rich in repeats, a majority of which are SINEs and LINEs. Since these repeats are well conserved, they could contribute to the overall reduced divergence in these regions.

Anti-footprinting profiles suggest that the higher divergence in the human lineage compared to chimpanzee could be functionally relevant. We observe that even though transcriptional sites have accumulated overall in the human lineage, the kinds of sites that have increased in each category are more specific and befit the functional requirements of each category of genes.

Table 9 Average Ks values for each functional category of gene and JC distance for 2 kb upstream and genic region. It has been shown here that variation in divergence across functional groups is non-randomly distributed

Functional class   Ks      JC distance (with repeats)   JC distance (with repeats)
                           upstream (%)                 genic (%)
Developmental      0.061   1.20                         2.60
Physiological      0.071   2.15                         1.33
Housekeeping       0.061   2.33                         1.55


This would lead not only to quantitative but also to qualitative changes in gene expression during evolution. However, we would like to emphasize that even though this hypothesis is testable by measuring promoter activities of homologs in human and chimp in cell lines, it is too early to predict physiologically relevant gene-expression differences (Heissig et al. 2005).

Dorus et al. had identified a few primate-specific outliers which demonstrated accelerated evolution in the human lineage. We observe that these changes in the coding region are also mirrored in the non-coding regions in the majority of cases. Surprisingly, we also observe a large number of genes which revealed substantial divergence in the upstream promoter regions despite striking conservation in the coding regions (Supplementary Table 9). Notably, a large number of genes which show divergence in the upstream 2 kb as well as in the coding regions are well known for their involvement in neocortical organization. Although we have not carried out functional analysis of differences in expression of genes which show higher divergence in human, the list includes important genes which merit further study. These genes are involved in regulating or determining the size of the cortical sheet, changes in cortical domains and cortical field specification, and activity-dependent intracellular mechanisms that regulate the structure and function of neurons during development (Krubitzer and Kahn 2003). Examples include caspases and β-catenins, which could modulate cell kinetics and survival; transcription factors which are involved in neurogenesis, like EN1, OTX1, LHX1, ID2, GAS7; a host of neurotrophins like NTF3; cell adhesion as well as signaling molecules like FGF2, SHH, ADCYAP1; and proteins involved in synaptic transmission (Krubitzer and Kahn 2003). Together they could contribute to system-level changes in human brain organization, morphology and behavior. Besides, accumulation of global transcription factor sites in physiological genes could modulate the activities of neurotrophins as well as different signaling molecules, which could in turn modulate the structure, function and connectivity of neurons. Mutations, knockout experiments and ectopic expression of many of the genes described earlier have been shown to alter the fate of cortical development (Krubitzer and Kahn 2003; Simeone 1998). Besides, dysfunctions of many of these genes have also been shown to be associated with disorders related to cognition and neurodegeneration.

Our analysis of divergence in non-coding regions shows distinctive patterns between hominids and murids and hence cannot be generalized to all mammals. What King and Wilson postulated three decades ago, that the evolution of regulatory circuits could be critical in the comparative study of human evolution with respect to chimpanzee (King and Wilson 1975), is also supported by our study. Given the dynamic nature of cellular networks in the brain, as well as intra-individual regional variations in gene expression in different regions of the brain, delineation of the key regulatory changes which could lead to different developmental fates in human and chimpanzee seems insurmountable (Khaitovich et al. 2004a). This study, though limited to a dataset of neuronal and housekeeping genes in human and chimpanzee, suggests that divergence in the non-coding repetitive regions is distinctly different in these functional categories, both in the 2 kb upstream and in the genic regions. Whereas the neurodevelopmental genes are more diverged in the genic regions and less in the upstream region, the opposite holds for the neurophysiology genes. Housekeeping genes seem to have diverged to a significant extent both in the upstream and in the downstream region. However, in this category the divergence is significantly reduced once the repeats are removed from the analysis. That repeats have an active role in imparting functional divergence is substantiated by the observation that nearly one-third of the differences in transcription factor binding sites between the two species are contributed by these repeat elements in the upstream 2 kb regions. The observation of higher divergence in the genic region of neurodevelopmental genes mirrors what has been observed for the coding regions. This suggests that the non-coding genic regions may have evolved regulatory/structural functions involved in neurodevelopment. On the other hand, high conservation in the upstream region hints at stringent regulation. Besides, we also uncover certain neuronal genes whose upstream regions seem to have diverged significantly although there are no differences in the coding region. Notably, the differential distribution of repeats across the species as well as across the functional classes of genes seems to participate in the process of evolution across the lineages in a functional category-specific manner. The insights obtained from this study may be a step towards understanding the contribution of genomic repeats to functional and phenotypic evolution in species. Experimental validation of these variations, as well as study of variability in these regions in primate and human populations, might give us insights into their role in evolving functions.

Acknowledgments We thank Prof. Samir K. Brahmachari for providing intellectual support during the course of this investigation. We are grateful to Dr. Jerzy Jurka, GIRI, Mountain View, CA, USA and the reviewers for their valuable suggestions. We would also like to thank the system administrators at IGIB for technical support. Financial support from CSIR project (CMM0017) and G.N. Ramachandran Knowledge Centre, IGIB, Delhi for fellowship support is duly acknowledged. Parts of the work were funded by a grant from the German Ministry of Education and Research (BMBF) together with BioRegioN GmbH "BioProfil", grant no. 0313092; EU grants "TRANSISTOR" and "COMBIO"; and INTAS grant no. 03-51-5218.

References

Bailey JA, Liu G, Eichler EE (2003) An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet 73:823–834

Brahmachari SK et al (1995) Simple repetitive sequences in the genome: structure and functional significance. Electrophoresis 16:1705–1714

Britten RJ (2002) Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc Natl Acad Sci USA 99:13633–13635

Caceres M et al (2003) Elevated gene expression levels distinguish human from non-human primate brains. Proc Natl Acad Sci USA 100:13030–13035

Callinan PA, Wang J, Herke SW, Garber RK, Liang P, Batzer MA (2005) Alu retrotransposition-mediated deletion. J Mol Biol 348:791–800

Chen FC, Li WH (2001) Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet 68:444–456

Chen FC, Vallender EJ, Wang H, Tzeng CS, Li WH (2001) Genomic divergence between human and chimpanzee estimated from large-scale alignments of genomic sequences. J Hered 92:481–489

Cheng Z et al (2005) A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature 437:88–93

Conrad M, Brahmachari SK, Sasisekharan V (1986) DNA struc-tural variability as a factor in gene expression and evolution.Biosystems 19:123–126

Deininger PL, Batzer MA (1999) Alu repeats and human disease. Mol Genet Metab 67:183–193

Dorus S et al (2004) Accelerated evolution of nervous system genes in the origin of Homo sapiens. Cell 119:1027–1040

Enard W et al (2002) Intra- and interspecific variation in primate gene expression patterns. Science 296:340–343

Englander EW, Howard BH (1995) Nucleosome positioning by human Alu elements in chromatin. J Biol Chem 270:10091–10096

Englander EW, Wolffe AP, Howard BH (1993) Nucleosome interactions with a human Alu element. Transcriptional repression and effects of template methylation. J Biol Chem 268:19565–19573

Ganapathi M et al (2005) Comparative analysis of chromatin landscape in regulatory regions of human housekeeping and tissue specific genes. BMC Bioinformatics 6:126

Gibbs RA et al (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493–521

Gilad Y, Bustamante CD, Lancet D, Paabo S (2003) Natural selection on the olfactory receptor gene family in humans and chimpanzees. Am J Hum Genet 73:489–501

Grover D, Majumder PP, Rao B, Brahmachari SK, Mukerji M (2003) Nonrandom distribution of Alu elements in genes of various functional categories: insight from analysis of human chromosomes 21 and 22. Mol Biol Evol 20:1420–1424

Grover D, Mukerji M, Bhatnagar P, Kannan K, Brahmachari SK (2004) Alu repeat analysis in the complete human genome: trends and variations with respect to genomic composition. Bioinformatics 20:813–817

Grover D, Kannan K, Brahmachari SK, Mukerji M (2005) ALU-ring elements in the primate genomes. Genetica 124:273–289

Gu J, Gu X (2003) Induced gene expression in human brain after the split from chimpanzee. Trends Genet 19:63–65

Heissig F, Krause J, Bryk J, Khaitovich P, Enard W, Paabo S (2005) Functional analysis of human and chimpanzee promoters. Genome Biol 6:R57

Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J (2005a) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462–467

Jurka J, Kohany O, Pavlicek A, Kapitonov VV, Jurka MV (2005b) Clustering, duplication and chromosomal distribution of mouse SINE retrotransposons. Cytogenet Genome Res 110:117–123

Keightley PD, Lercher MJ, Eyre-Walker A (2005) Evidence for widespread degradation of gene control regions in hominid genomes. PLoS Biol 3:e42

Kent WJ (2002) BLAT—the BLAST-like alignment tool. Genome Res 12:656–664

Khaitovich P et al (2004a) Regional patterns of gene expression in human and chimpanzee brains. Genome Res 14:1462–1473

Khaitovich P et al (2004b) A neutral model of transcriptome evolution. PLoS Biol 2:E132

Khaitovich P et al (2005) Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Science 309:1850–1854

King MC, Wilson AC (1975) Evolution at two levels in humans and chimpanzees. Science 188:107–116

Kouprina N et al (2004) Accelerated evolution of the ASPM gene controlling brain size begins prior to human brain expansion. PLoS Biol 2:E126

Krubitzer L, Kahn DM (2003) Nature versus nurture revisited: an old idea with a new twist. Prog Neurobiol 70:33–52

Krull M, Voss N, Choi C, Pistor S, Potapov A, Wingender E (2003) TRANSPATH: an integrated database on signal transduction and a tool for array analysis. Nucleic Acids Res 31:97–100

Lander ES et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921

Lee S, Kohane I, Kasif S (2005) Genes involved in complex adaptive processes tend to have highly conserved upstream regions in mammalian genomes. BMC Genomics 6:168

Ludwig A, Rozhdestvensky TS, Kuryshev VY, Schmitz J, Brosius J (2005) An unusual primate locus that attracted two independent Alu insertions and facilitates their transcription. J Mol Biol 350:200–214

Nielsen R et al (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol 3:e170

Pavlicek A, Jurka J (2006) Positive selection on the nonhomologous end-joining factor Cernunnos-XLF in the human lineage. Biol Direct 1:15

Preuss TM, Caceres M, Oldham MC, Geschwind DH (2004) Human brain evolution: insights from microarrays. Nat Rev Genet 5:850–860

Seoighe C, Gehring C, Hurst LD (2005) Gametophytic selection in Arabidopsis thaliana supports the selective model of intron length reduction. PLoS Genet 1:e13

Shankar R, Grover D, Brahmachari SK, Mukerji M (2004) Evolution and distribution of RNA polymerase II regulatory sites from RNA polymerase III dependant mobile Alu elements. BMC Evol Biol 4:37

Simeone A (1998) Otx1 and Otx2 in the development and evolution of the mammalian brain. EMBO J 17:6790–6798

Stankiewicz P, Shaw CJ, Withers M, Inoue K, Lupski JR (2004) Serial segmental duplications during primate evolution result in complex human genome architecture. Genome Res 14:2209–2220

Thornburg BG, Gotea V, Makalowski W (2006) Transposable elements as a significant source of transcription regulating signals. Gene 365:104–110

Uddin M et al (2004) Sister grouping of chimpanzees and humans as revealed by genome-wide phylogenetic analysis of brain gene expression profiles. Proc Natl Acad Sci USA 101:2957–2962

Versteeg R et al (2003) The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res 13:1998–2004

Vinogradov AE (2006) “Genome design” model: evidence from conserved intronic sequence in human–mouse comparison. Genome Res 16:347–354

Watanabe H et al (2004) DNA sequence and comparative analysis of chimpanzee chromosome 22. Nature 429:382–388

Waterston RH et al (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562

Webster MT, Smith NG, Ellegren H (2003) Compositional evolution of noncoding DNA in the human and chimpanzee genomes. Mol Biol Evol 20:278–286

Wichman HA, Van den Bussche RA, Hamilton MJ, Baker RJ (1992) Transposable elements and the evolution of genome organization in mammals. Genetica 86:287–293
