

    Vol 437 | 1 September 2005 | doi:10.1038/nature04072

    ARTICLES

    Initial sequence of the chimpanzee genome and comparison with the human genome

    The Chimpanzee Sequencing and Analysis Consortium*

    Here we present a draft genome sequence of the common chimpanzee (Pan troglodytes). Through comparison with the human genome, we have generated a largely complete catalogue of the genetic differences that have accumulated since the human and chimpanzee species diverged from our common ancestor, constituting approximately thirty-five million single-nucleotide changes, five million insertion/deletion events, and various chromosomal rearrangements. We use this catalogue to explore the magnitude and regional variation of mutational forces shaping these two genomes, and the strength of positive and negative selection acting on their genes. In particular, we find that the patterns of evolution in human and chimpanzee protein-coding genes are highly correlated and dominated by the fixation of neutral and slightly deleterious alleles. We also use the chimpanzee genome as an outgroup to investigate human population genetics and identify signatures of selective sweeps in recent human evolution.

    More than a century ago Darwin1 and Huxley2 posited that humans share recent common ancestors with the African great apes. Modern molecular studies have spectacularly confirmed this prediction and have refined the relationships, showing that the common chimpanzee (Pan troglodytes) and bonobo (Pan paniscus or pygmy chimpanzee) are our closest living evolutionary relatives3. Chimpanzees are thus especially suited to teach us about ourselves, both in terms of their similarities and differences with human. For example, Goodall's pioneering studies on the common chimpanzee revealed startling behavioural similarities such as tool use and group aggression4,5. By contrast, other features are obviously specific to humans, including habitual bipedality, a greatly enlarged brain and complex language5. Important similarities and differences have also been noted for the incidence and severity of several major human diseases6.

    Genome comparisons of human and chimpanzee can help to reveal the molecular basis for these traits as well as the evolutionary forces that have moulded our species, including underlying mutational processes and selective constraints. Early studies sought to draw inferences from sets of a few dozen genes7–9, whereas recent studies have examined larger data sets such as protein-coding exons10, random genomic sequences11,12 and an entire chimpanzee chromosome13.

    Here we report a draft sequence of the genome of the common chimpanzee, and undertake comparative analyses with the human genome. This comparison differs fundamentally from recent comparative genomic studies of mouse, rat, chicken and fish14–17. Because these species have diverged substantially from the human lineage, the focus in such studies is on accurate alignment of the genomes and recognition of regions of unusually high evolutionary conservation to pinpoint functional elements. Because the chimpanzee lies at such a short evolutionary distance with respect to human, nearly all of the bases are identical by descent and sequences can be readily aligned except in recently derived, large repetitive regions. The focus thus turns to differences rather than similarities. An observed difference at a site nearly always represents a single event, not multiple independent changes over time. Most of the differences reflect random genetic drift, and thus they hold extensive information about mutational processes and negative selection that can be readily mined with current analytical techniques. Hidden among the differences is a minority of functionally important changes that underlie the phenotypic differences between the two species. Our ability to distinguish such sites is currently quite limited, but the catalogue of human–chimpanzee differences opens this issue to systematic investigation for the first time. We would also hope that, in elaborating the few differences that separate the two species, we will increase pressure to save chimpanzees and other great apes in the wild.

    Our results confirm many earlier observations, but notably challenge some previous claims based on more limited data. The genome-wide data also allow some questions to be addressed for the first time. (Here and throughout, we refer to chimpanzee–human comparison as representing hominids and mouse–rat comparison as representing murids; of course, each pair covers only a subset of the clade.) The main findings include:

    • Single-nucleotide substitutions occur at a mean rate of 1.23% between copies of the human and chimpanzee genome, with 1.06% or less corresponding to fixed divergence between the species.
    • Regional variation in nucleotide substitution rates is conserved between the hominid and murid genomes, but rates in subtelomeric regions are disproportionately elevated in the hominids.
    • Substitutions at CpG dinucleotides, which constitute one-quarter of all observed substitutions, occur at more similar rates in male and female germ lines than non-CpG substitutions.
    • Insertion and deletion (indel) events are fewer in number than single-nucleotide substitutions, but result in ~1.5% of the euchromatic sequence in each species being lineage-specific.
    • There are notable differences in the rate of transposable element insertions: short interspersed elements (SINEs) have been threefold more active in humans, whereas chimpanzees have acquired two new families of retroviral elements.
    • Orthologous proteins in human and chimpanzee are extremely similar, with ~29% being identical and the typical orthologue differing by only two amino acids, one per lineage.
    • The normalized rates of amino-acid-altering substitutions in the hominid lineages are elevated relative to the murid lineages, but close to that seen for common human polymorphisms, implying that positive selection during hominid evolution accounts for a smaller fraction of protein divergence than suggested in some previous reports.
    • The substitution rate at silent sites in exons is lower than the rate at nearby intronic sites, consistent with weak purifying selection on silent sites in mammals.
    • Analysis of the pattern of human diversity relative to hominid divergence identifies several loci as potential candidates for strong selective sweeps in recent human history.

    *Lists of participants and affiliations appear at the end of the paper.

    © 2005 Nature Publishing Group

    In this paper, we begin with information about the generation, assembly and evaluation of the draft genome sequence. We then explore overall genome evolution, with the aim of understanding mutational processes at work in the human genome. We next focus on the evolution of protein-coding genes, with the aim of characterizing the nature of selection. Finally, we briefly discuss initial insights into human population genetics.

    In recognition of its strong community support, we will refer to chimpanzee chromosomes using the orthologous numbering nomenclature proposed by ref. 18, which renumbers the chromosomes of the great apes from the International System for Human Cytogenetic Nomenclature (ISCN; 1978) standard to directly correspond to their human orthologues, using the terms 2A and 2B for the two ape chromosomes corresponding to human chromosome 2.

    Genome sequencing and assembly

    We sequenced the genome of a single male chimpanzee (Clint; Yerkes pedigree number C0471; Supplementary Table S1), a captive-born descendant of chimpanzees from the West Africa subspecies Pan troglodytes verus, using a whole-genome shotgun (WGS) approach19,20. The data were assembled using both the PCAP and ARACHNE programs21,22 (see Supplementary Information 'Genome sequencing and assembly' and Supplementary Tables S2–S6). The former was a de novo assembly, whereas the latter made limited use of human genome sequence (NCBI build 34)23,24 to facilitate and confirm contig linking. The ARACHNE assembly has slightly greater continuity (Table 1) and was used for analysis in this paper. The draft genome assembly, generated from ~3.6-fold sequence redundancy of the autosomes and ~1.8-fold redundancy of both sex chromosomes, covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases. A total of 50% of the sequence (N50) is contained in contigs of length greater than 15.7 kilobases (kb) and supercontigs of length greater than 8.6 megabases (Mb). The assembly represents a consensus of two haplotypes, with one allele from each heterozygous position arbitrarily represented in the sequence.

    Assessment of quality and coverage. The chimpanzee genome assembly was subjected to rigorous quality assessment, based on comparison to finished chimpanzee bacterial artificial chromosomes (BACs) and to the human genome (see Supplementary Information 'Genome sequencing and assembly' and Supplementary Tables S7–S16).

    Table 1 | Chimpanzee assembly statistics

    Assembler                               PCAP        ARACHNE
    Major contigs*                          400,289     361,782
    Contig length (kb; N50)                 13.3        15.7
    Supercontigs                            67,734      37,846
    Supercontig length (Mb; N50)            2.3         8.6
    Sequence redundancy: all bases (Q20)    5.0 (3.6)   4.3 (3.6)
    Physical redundancy                     20.7        19.8
    Consensus bases (Gb)                    2.7         2.7

    *Contigs >1 kb. N50 length is the size x such that 50% of the assembly is in units of length at least x.
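The N50 definition given with Table 1 can be made concrete with a short sketch. This is only an illustration of the definition, not the procedure used by PCAP or ARACHNE; the function name `n50` and the toy contig lengths are assumptions introduced here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths.

    N50 is the size x such that units of length at least x together
    contain at least 50% of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest units first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb (illustrative only, not assembly data).
toy_contigs_kb = [2, 3, 4, 5, 6, 10, 20]
toy_n50 = n50(toy_contigs_kb)
```

Sorting from the longest unit downward and stopping once half of the total length is covered is what makes N50 insensitive to a long tail of very short contigs.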

    Nucleotide-level accuracy is high by several measures. About 98% of the chimpanzee genome sequence has quality scores25 of at least 40 (Q40), corresponding to an error rate of ≤10^-4. Comparison of the WGS sequence to 1.3 Mb of finished BACs from the sequenced individual is consistent with this estimate, giving a high-quality discrepancy rate of 3 × 10^-4 substitutions and 2 × 10^-4 indels, which is no more than expected given the heterozygosity rate (see below), as 50% of the polymorphic alleles in the WGS sequence will differ from the single-haplotype BACs. Comparison of protein-coding regions aligned between the WGS sequence, the recently published sequence of chimpanzee chromosome 21 (ref. 13; formerly chromosome 22 (ref. 18)) and the human genome also revealed no excess of substitutions in the WGS sequence (see Supplementary Information 'Genome sequencing and assembly'). Thus, by restricting our analysis to high-quality bases, the nucleotide-level accuracy of the WGS assembly is essentially equal to that of finished sequence.

    Structural accuracy is also high based on comparison with finished BACs from the primary donor and other chimpanzees, although the relatively low level of sequence redundancy limits local contiguity. On the basis of comparisons with the primary donor, some small supercontigs (most <5 kb) have not been positioned within large supercontigs (~1 event per 100 kb); these are not strictly errors but nonetheless affect the utility of the assembly. There are also small, undetected overlaps (all <1 kb) between consecutive contigs (~1.2 events per 100 kb) and occasional local misordering of small contigs (~0.2 events per 100 kb). No misoriented contigs were found. Comparison with the finished chromosome 21 sequence yielded similar discrepancy rates (see Supplementary Information 'Genome sequencing and assembly').

    The most problematic regions are those containing recent segmental duplications. Analysis of BAC clones from duplicated (n = 75) and unique (n = 28) regions showed that the former tend to be fragmented into more contigs (1.6-fold) and more supercontigs (3.2-fold). Discrepancies in contig order are also more frequent in duplicated than unique regions (~0.4 versus ~0.1 events per 100 kb). The rate is twofold higher in duplicated regions with the highest sequence identity (>98%). If we restrict the analysis to older duplications (≤98% identity) we find fewer assembly problems: 72% of those that can be mapped to the human genome are shared as duplications in both species. These results are consistent with the described limitations of current WGS assembly for regions of segmental duplication26. Detailed analysis of these rapidly changing regions of the genome is being performed with more directed approaches27.

    Chimpanzee polymorphisms. The draft sequence of the chimpanzee genome also facilitates genome-wide studies of genetic diversity among chimpanzees, extending recent work28–31. We sequenced and analysed sequence reads from the primary donor, four other West African and three central African chimpanzees (Pan troglodytes troglodytes) to discover polymorphic positions within and between these individuals (Supplementary Table S17). A total of 1.66 million high-quality single-nucleotide polymorphisms (SNPs) were identified, of which 1.01 million are heterozygous within the primary donor, Clint. Heterozygosity rates were estimated to be 9.5 × 10^-4 for Clint, 8.0 × 10^-4 among West African chimpanzees and 17.6 × 10^-4 among central African chimpanzees, with the variation between West and central African chimpanzees being 19.0 × 10^-4. The diversity in West African chimpanzees is similar to that seen for human populations32, whereas the level for central African chimpanzees is roughly twice as high.

    The observed heterozygosity in Clint is broadly consistent with West African origin, although there are a small number of regions of distinctly higher heterozygosity. These may reflect a small amount of central African ancestry, but more likely reflect undetected regions of segmental duplications present only in chimpanzees.


    Genome evolution

    We set out to study the mutational events that have shaped the human and chimpanzee genomes since their last common ancestor. We explored changes at the level of single nucleotides, small insertions and deletions, interspersed repeats and chromosomal rearrangements. The analysis is nearly definitive for the smallest changes, but is more limited for larger changes, particularly lineage-specific segmental duplications, owing to the draft nature of the genome sequence.

    Nucleotide divergence. Best reciprocal nucleotide-level alignments of the chimpanzee and human genomes cover ~2.4 gigabases (Gb) of high-quality sequence, including 89 Mb from chromosome X and 7.5 Mb from chromosome Y.

    Genome-wide rates. We calculate the genome-wide nucleotide divergence between human and chimpanzee to be 1.23%, confirming recent results from more limited studies12,33,34. The differences between one copy of the human genome and one copy of the chimpanzee genome include both the sites of fixed divergence between the species and some polymorphic sites within each species. By correcting for the estimated coalescence times in the human and chimpanzee populations (see Supplementary Information 'Genome evolution'), we estimate that polymorphism accounts for 14–22% of the observed divergence rate and thus that the fixed divergence is ~1.06% or less.
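The step from observed to fixed divergence is simple arithmetic, sketched below. The helper name `fixed_divergence` is an assumption made for illustration, and the sketch only removes the stated polymorphic share from the observed rate; it does not reproduce the coalescence-time correction described in the Supplementary Information.

```python
def fixed_divergence(observed_percent, polymorphism_fraction):
    """Remove the share of observed divergence attributed to polymorphism."""
    if not 0.0 <= polymorphism_fraction <= 1.0:
        raise ValueError("polymorphism_fraction must be between 0 and 1")
    return observed_percent * (1.0 - polymorphism_fraction)


OBSERVED = 1.23  # genome-wide divergence in percent, as quoted in the text

# Polymorphism is estimated to account for 14-22% of the observed rate.
upper = fixed_divergence(OBSERVED, 0.14)  # smallest correction
lower = fixed_divergence(OBSERVED, 0.22)  # largest correction
```

With the smallest correction (14%) the result rounds to the 1.06% quoted in the text, and larger corrections give lower values, which is why the estimate is stated as "1.06% or less".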

    Figure 1 | Human–chimpanzee divergence in 1-Mb segments across the genome. a, Distribution of divergence of the autosomes (blue), the X chromosome (red) and the Y chromosome (green). b, Distribution of variation by chromosome, shown as a box plot. The edges of the box correspond to quartiles; the notches to the standard error of the median; and the vertical bars to the range. The X and Y chromosomes are clear outliers, but there is also high local variation within each of the autosomes.

    Nucleotide divergence rates are not constant across the genome, as has been seen in comparisons of the human and murid genomes16,17,24,35,36. The average divergence in 1-Mb segments fluctuates with a standard deviation of 0.25% (coefficient of variation = 0.20), which is much greater than the 0.02% expected assuming a uniform divergence rate (Fig. 1a; see also Supplementary Fig. S1).
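The coefficient of variation quoted above is the standard deviation divided by the mean. A minimal sketch of that summary for per-window divergence values follows; the function name `divergence_summary` and the toy window values are hypothetical and are not data from the paper.

```python
import statistics


def divergence_summary(window_divergence):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mean = statistics.mean(window_divergence)
    sd = statistics.stdev(window_divergence)  # sample SD, n - 1 denominator
    return mean, sd, sd / mean


# Hypothetical divergence values (percent) for three 1-Mb windows.
toy_windows = [1.0, 1.2, 1.4]
toy_mean, toy_sd, toy_cv = divergence_summary(toy_windows)

# Consistency check on the figures quoted in the text: an SD of 0.25%
# around a mean of 1.23% corresponds to a coefficient of variation near 0.20.
reported_cv = 0.25 / 1.23
```

Whether the original analysis used a sample or a population standard deviation is not stated here; with thousands of 1-Mb windows the two would be nearly identical.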

    Regional variation in divergence could reflect local variation in either mutation rate or other evolutionary forces. Among the latter, one important force is genetic drift, which can cause substantial differences in divergence time across loci when comparing closely related species, as the divergence time for orthologues is the sum of two terms: t1, the time since speciation, and t2, the coalescence time for orthologues within the common ancestral population37. Whereas t1 is constant across loci (~6–7 million years38), t2 is a random variable that fluctuates across loci (with a mean that depends on population size and here may be on the order of 1–2 million years39). However, because of historical recombination, the characteristic scale of such fluctuations will be on the order of tens of kilobases, which is too small to account for the variation observed for 1-Mb regions40 (see Supplementary Information 'Genome evolution'). Other potential evolutionary forces are positive or negative selection. Although it is more difficult to quantify the expected contributions of selection in the ancestral population41–43, it is clear that the effects would have to be very strong to explain the large-scale variation observed across mammalian genomes16,44. There is tentative evidence from in-depth analysis of divergence and diversity that natural selection is not the major contributor to the large-scale patterns of genetic variability in humans45–47. For these reasons, we suggest that the large-scale variation in the human–chimpanzee divergence rate primarily reflects regional variation in mutation rate.

    Chromosomal variation in divergence rate. Variation in divergence rate is evident even at the level of whole chromosomes (Fig. 1b). The most striking outliers are the sex chromosomes, with a mean divergence of 1.9% for chromosome Y and 0.94% for chromosome X. The likely explanation is a higher mutation rate in the male compared with female germ line48. Indeed, the ratio of the male/female mutation rates (denoted α) can be estimated by comparing the divergence rates among the sex chromosomes and the autosomes and correcting for ancestral polymorphism as a function of population size of the most recent common ancestor (MRCA; see Supplementary Information 'Genome evolution'). Estimates for α range from 3 to 6, depending on the chromosomes compared and the assumed ancestral population size (Supplementary Table S18). This is significantly higher than recent estimates of α for the murids (~1.9) (ref. 17) and resolves a recent controversy based on smaller data sets12,24,49,50.
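The way α follows from chromosome-level divergence can be sketched with the classical expectations for how much time each chromosome spends in the male and female germ lines: an X chromosome spends one-third of its time in males, an autosome one-half and a Y chromosome all of it. This gives X/A = 2(2 + α) / (3(1 + α)) and Y/A = 2α / (1 + α). The sketch below is an assumption-laden simplification: it ignores the correction for ancestral polymorphism that the text describes, it uses the genome-wide 1.23% as a stand-in for the autosomal mean, and the function names are hypothetical.

```python
def alpha_from_x_over_a(ratio):
    """Solve X/A = 2(2 + a) / (3(1 + a)) for the male/female ratio a."""
    if not 2.0 / 3.0 < ratio < 4.0 / 3.0:
        raise ValueError("X/A ratio must lie strictly between 2/3 and 4/3")
    return (4.0 - 3.0 * ratio) / (3.0 * ratio - 2.0)


def alpha_from_y_over_a(ratio):
    """Solve Y/A = 2a / (1 + a) for the male/female ratio a."""
    if not 0.0 < ratio < 2.0:
        raise ValueError("Y/A ratio must lie strictly between 0 and 2")
    return ratio / (2.0 - ratio)


# Divergence values in percent, as quoted in the text. Using 1.23% for the
# autosomes is an assumption (it is the genome-wide figure).
AUTOSOME, CHR_X, CHR_Y = 1.23, 0.94, 1.9

alpha_x = alpha_from_x_over_a(CHR_X / AUTOSOME)
alpha_y = alpha_from_y_over_a(CHR_Y / AUTOSOME)
```

Under these simplifications both uncorrected estimates land inside the range of 3 to 6 reported in the text, with the X-based value higher than the Y-based one; the published estimates additionally depend on the assumed ancestral population size.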

    The higher mutation rate in the male germ line is generally attributed to the 5–6-fold higher number of cell divisions undergone by male germ cells48. We reasoned that this would affect mutations resulting from DNA replication errors (the rate should scale with the number of cell divisions) but not mutations resulting from DNA damage such as deamination of methyl CpG to TpG (the rate should scale with time). Accordingly, we calculated α separately for CpG sites, obtaining a value of ~2 from the comparison of rates between autosomes and chromosome X. This intermediate value is a composite of the rates of CpG loss and gain, and is consistent with roughly equal rates of CpG to TpG transitions in the male and female germ line51,52.

    Significant variation in divergence rates is also seen among autosomes (Fig. 1b; P < 3 × 10^-15, Kruskal–Wallis test over 1-Mb windows), confirming earlier observations based on low-coverage WGS sampling12. Additional factors thus influence the rate of divergence between chimpanzee and human chromosomes. These factors are likely to act at length scales significantly shorter than a chromosome, because the standard deviation across autosomes (0.21%) is comparable to the standard deviation seen in 1-Mb windows across the genome (0.13–0.35%). We therefore sought to understand local factors that contribute to variation in divergence rate.

    Contribution of CpG dinucleotides. Sites containing CpG dinucleotides in either species show a substantially elevated divergence rate of 15.2% per base; they account for 25.2% of all substitutions while constituting only 2.1% of all aligned bases. The divergence at CpG sites represents both the loss of ancestral CpGs and the creation of new CpGs. The former process is known to occur at a rapid rate per base due to frequent methylation of cytosines in a CpG context and their frequent deamination53,54, whereas the latter process probably proceeds at a rate more typical of other nucleotide substitutions. Assuming that loss and creation of CpG sites are close to equilibrium, the mutation rate for bases in a CpG dinucleotide must be 10–12-fold higher than for other bases (see Supplementary Information 'Genome evolution' and ref. 51).

    Because of the high rate of CpG substitutions, regional divergence rates would be expected to correlate with regional CpG density. CpG density indeed varies across 1-Mb windows (mean = 2.1%, coefficient of variation = 0.44 compared with 0.0093 expected under a Poisson distribution), but only explains 4% of the divergence rate variance. In fact, regional CpG and non-CpG divergence is highly correlated (r = 0.88; Supplementary Fig. S2), suggesting that higher-order effects modulate the rates of two very different mutation processes (see also ref. 47).

    Increased divergence in distal regions. The most striking regional pattern is a consistent increase in divergence towards the ends of most chromosomes (Fig. 2). The terminal 10 Mb of chromosomes (including distal regions and proximal regions of acrocentric chromosomes) averages 15% higher divergence than the rest of the genome (Mann–Whitney U-test; P < 10^-30), with a sharp increase towards the telomeres. The phenomenon correlates better with physical distance than relative position along the chromosomes and may partially explain why smaller chromosomes tend to have higher divergence (Supplementary Fig. S3; see also ref. 15). These observations suggest that large-scale chromosomal structure, directly or indirectly, influences regional divergence patterns. The cause of this effect is unclear, but these regions (~15% of the genome) are notable in having high local recombination rate, high gene density and high G+C content.

    Correlation with chromosome banding. Another interesting pattern is that divergence increases with the intensity of Giemsa staining in cytogenetically defined chromosome bands, with the regions corresponding to Giemsa dark bands (G bands) showing 10% higher divergence than the genome-wide average (Mann–Whitney U-test; P < 10^-14) (see Fig. 2). In contrast to terminal regions, these regions (17% of the genome) tend to be gene poor, (G+C)-poor and low in recombination55,56. The elevated divergence seen in two such different types of regions suggests that multiple mechanisms are at work, and that no single known factor, such as G+C content or recombination rate, is an adequate predictor of regional variation in the mammalian genome by itself (Fig. 3). Elucidation of the relative contributions of these and other mechanisms will be important for formulating accurate models for population genetics, natural selection, divergence times and the evolution of genome-wide sequence composition57.

    Correlation with regional variation in the murid genome. Given that sequence divergence shows regional variation in both hominids (human–chimpanzee) and murids (mouse–rat), we asked whether the regional rates are positively correlated between orthologous regions. Such a correlation would suggest that the divergence rate is driven, in part, by factors that have been conserved over the ~75 million years since rodents, humans and apes shared a common ancestor. Comparative analysis of the human and murid genomes has suggested such a correlation58–60, but the chimpanzee sequence provides a direct opportunity to compare independent evolutionary processes between two mammalian clades.

    We compared the local divergence rates in hominids and murids across major orthologous segments in the respective genomes (Fig. 4). For orthologous segments that are non-distal in both hominids and murids, there is a strong correlation between the divergence rates (r = 0.5, P < 10^-11). In contrast, orthologous segments that are centred within 10 Mb of a hominid telomere have disproportionately high divergence rates and G+C content relative to the murids (Mann–Whitney U-test; P < 10^-11 and P < 10^-4), implying that the elevation in these regions is, at least partially, lineage specific. The same general effect is observed (albeit less pronounced) if CpG dinucleotides are excluded (Supplementary Fig. S4). Increased divergence and G+C content might be explained by biased gene conversion61 due to the high hominid recombination rates in these distal regions. Segments that are distal in murids do not show elevated divergence rates, which is consistent with this model, because the recombination rates of distal regions are not as elevated in mouse and rat62.

    Taken together, these observations suggest that sequence divergence rate is influenced by both conserved factors (stable across mammalian evolution) and lineage-specific factors (such as proximity to the telomere or recombination rate, which may change with chromosomal rearrangements).

    Figure 2 | Regional variation in divergence rates. Human–chimpanzee divergence (blue), G+C content (green) and human recombination rates173 (red) in sliding 1-Mb windows for human and chimpanzee chromosome 1. Divergence and G+C content are noticeably elevated near the 1p telomere, a trend that holds for most subtelomeric regions (see text). Internally on the chromosome, regions of low G+C content and high divergence often correspond to the dark G bands.

    Insertions and deletions. We next studied the indel events that have occurred in the human and chimpanzee lineages by aligning the genome sequences to identify length differences. We will refer below to all events as insertions relative to the other genome, although they may represent insertions or deletions relative to the genome of the common ancestor.

    The observable insertions fall into two classes: (1) 'completely covered' insertions, occurring within continuous sequence in both species; and (2) 'incompletely covered' insertions, occurring within sequence containing one or more gaps in the chimpanzee, but revealed by a clear discrepancy between the species in sequence length. Different methods are needed for reliable identification of modest-sized insertions (1 base to 15 kb) and large insertions (>15 kb), with the latter only being reliably identifiable in the human genome (see Supplementary Information 'Genome evolution').

    The analysis of modest-sized insertions reveals ~32 Mb of human-specific sequence and ~35 Mb of chimpanzee-specific sequence, contained in ~5 million events in each species (Supplementary Information 'Genome evolution' and Supplementary Fig. S5). Nearly all of the human insertions are completely covered, whereas only half of the chimpanzee insertions are completely covered. Analysis of the completely covered insertions shows that the vast majority are small (45% of events cover only 1 base pair (bp), 96% are <20 bp and 98.6% are <80 bp), but that the largest few contain most of the sequence (with the ~70,000 indels larger than 80 bp comprising 73% of the affected base pairs) (Fig. 5). The latter indels >80 bp fall into three categories: (1) about one-quarter are newly inserted transposable elements; (2) more than one-third are due to microsatellite and satellite sequences; (3) and the remainder are assumed to be mostly deletions in the other genome.

    The analysis of larger insertions (>15 kb) identified 163 human regions containing 8.3 Mb of human-specific sequence in total (Fig. 6). These cases include 34 regions that involve exons from known genes, which are discussed in a subsequent section. Although we have no direct measure of large insertions in the chimpanzee genome, it appears likely that the situation is similar.

    On the basis of this analysis, we estimate that the human and chimpanzee genomes each contain 40–45 Mb of species-specific euchromatic sequence, and the indel differences between the genomes thus total ~90 Mb. This difference corresponds to ~3% of both genomes and dwarfs the 1.23% difference resulting from nucleotide substitutions; this confirms and extends several recent studies63–67. Of course, the number of indel events is far fewer than the number of substitution events (~5 million compared with ~35 million, respectively).

    Transposable element insertions. We next used the catalogue of lineage-specific transposable element copies to compare the activity of transposons in the human and chimpanzee lineages (Table 2).

    Endogenous retroviruses. Endogenous retroviruses (ERVs) have become all but extinct in the human lineage, with only a single retrovirus (human endogenous retrovirus K (HERV-K)) still active24. HERV-K was found to be active in both lineages, with at least 73 human-specific insertions (7 full length and 66 solo long terminal repeats (LTRs)) and at least 45 chimpanzee-specific insertions (1 full length and 44 solo LTRs). A few other ERV classes persisted in the human genome beyond the human–chimpanzee split, leaving ~9 human-specific insertions (all solo LTRs, including five HERV9 elements) before dying out.
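As a rough check on the indel totals quoted earlier in this section (40–45 Mb of species-specific sequence per genome, about 90 Mb in total, about 3% of the genome), the arithmetic can be sketched as follows. The nominal genome size of 3,000 Mb is an assumption made here for illustration; it is not a figure taken from the text, and the function name is hypothetical.

```python
ASSUMED_GENOME_MB = 3000.0  # assumption: nominal genome size in Mb


def indel_difference(species_specific_mb_each, genome_mb=ASSUMED_GENOME_MB):
    """Return (total indel difference in Mb, that total as percent of genome).

    Each genome is taken to carry the same amount of species-specific
    sequence, so the difference between the two genomes is twice that amount.
    """
    total_mb = 2.0 * species_specific_mb_each
    return total_mb, 100.0 * total_mb / genome_mb


low_total, low_percent = indel_difference(40.0)
high_total, high_percent = indel_difference(45.0)
```

Under this assumption the upper end of the range gives 90 Mb, or 3% of the genome, which is the comparison the text draws against the 1.23% difference from nucleotide substitutions.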

    Against this background, it was surprising to find that the chimpanzee genome has two active retroviral elements (PtERV1 and PtERV2) that are unlike any older elements in either genome; these must have been introduced by infection of the chimpanzee germ line. The smaller family (PtERV2) has only a few dozen copies, which nonetheless represent multiple (~5–8) invasions, because the sequence differences among reconstructed subfamilies are too great (~8%) to have arisen by mutation since divergence from human. It is closely related to a baboon endogenous retrovirus (BaEV, 88% ORF2 product identity) and a feline endogenous virus (ECE-1, 86% ORF2 product identity). The larger family (PtERV1) is more homogeneous and has over 200 copies. Whereas older ERVs, like HERV-K, are primarily represented by solo LTRs resulting from LTR–LTR recombination, more than half of the PtERV1 copies are still full length, probably reflecting the young age of the elements. PtERV1-like elements are present in the rhesus monkey, olive baboon and African great apes but not in human, orang-utan or gibbon, suggesting separate germline invasions in these species68.

    Figure 3 | Divergence rates versus G+C content for 1-Mb segments across the autosomes. Conditional on recombination rate, the relationship between divergence and G+C content varies. In regions with recombination rates less than 0.8 cM Mb^-1 (blue), there is an inverse relationship, where high-divergence regions tend to be (G+C)-poor and low-divergence regions tend to be (G+C)-rich. In regions with recombination rates greater than 2.0 cM Mb^-1, whether within 10 Mb (red) or proximal (green) of chromosome ends, both divergence and G+C content are uniformly high.

    Figure 4 | Disproportionately elevated divergence and G+C content near hominid telomeres. Scatter plot of the ratio of human–chimpanzee divergence over mouse–rat divergence versus the ratio of human G+C content over mouse G+C content across 199 syntenic blocks for which more than 1 Mb of sequence could be aligned between all four species. Blocks for which the centre is within 10 Mb of a telomere in hominids only (green) or in hominids and murids (magenta), but not in murids only (light blue), show a significant trend towards higher ratios than internal blocks (dark blue). Blocks on the X chromosome (red) tend to show a lower divergence ratio than autosomal blocks, consistent with a smaller difference between autosomal and X divergence in murids than in hominids (lower α).

    Higher Alu activity in humans. SINE (Alu) elements have been threefold more active in humans than chimpanzee (~7,000 compared with ~2,300 lineage-specific copies in the aligned portion), refining the rather broad range (2–7-fold) estimated in smaller studies13,67,69. Most chimpanzee-specific elements belong to a subfamily (AluYc1) that is very similar to the source gene in the common ancestor. By contrast, most human-specific Alu elements belong to two new subfamilies (AluYa5 and AluYb8) that have evolved since the chimpanzee–human divergence and differ substantially from the ancestral source gene69. It seems likely that the resurgence of Alu elements in humans is due to these potent new source genes. However, based on an examination of available finished sequence, the baboon shows a 1.6-fold higher Alu activity relative to human new insertions, suggesting that there may also have been a general decline in activity in the chimpanzee67.

    Some of the human-specific Alu elements are highly diverged (92 with >5% divergence), which would seem to suggest that they are much older than the human–chimpanzee split. Possible explanations include: gene conversion by nearby older elements; processed pseudogenes arising from a spurious transcription of an older element; precise excision from the chimpanzee genome; or high local mutation rate. In any case, the presence of such anomalies suggests that caution is warranted in the use of single-repeat elements as homoplasy-free phylogenetic markers.

    New Alu elements target (A+T)-rich DNA in human and chimpanzee genomes. Older SINE elements are preferentially found in gene-rich, (G+C)-rich regions, whereas younger SINE elements are found in gene-poor, (A+T)-rich regions where long interspersed element (LINE)-1 (L1) copies also accumulate24,70. The latter distribution is consistent with the fact that Alu retrotransposition is mediated by L1 (ref. 71). Murid genomes revealed no change in SINE distribution with age17.

    The human pattern might reflect either preferential retention of SINEs in (G+C)-rich regions, due to selection or mutation bias, or a recent change in Alu insertion preferences. With the availability of the chimpanzee genome, it is possible to classify the youngest Alu copies more accurately and thus begin to distinguish these possibilities.

    Analysis shows that lineage-specific SINEs in both human and chimpanzee are biased towards (A+T)-rich regions, as opposed to even the most recent copies in the MRCA (Fig. 7). This indicates that SINEs are indeed preferentially retained in (G+C)-rich DNA, but comparison with a more distant primate is required to formally rule out the possibility that the insertion bias of SINEs changed just before speciation.

    Equal activity of L1 in both species. The human and chimpanzee genomes both show ~2,000 lineage-specific L1 elements, contrary to previous estimates based on small samples that L1 activity is 2–3-fold higher in chimpanzee72.

    Transcription from L1 source genes can sometimes continue into 3′ flanking regions, which can then be co-transposed73,74. Human–chimpanzee comparison revealed that ~15% of the species-specific insertions appear to have carried with them at least 50 bp of flanking sequence (followed by a poly(A) tail and a target site duplication). In principle, incomplete reverse transcription could result in insertions of the flanking sequence only (without any L1 sequence), mobilizing gene elements such as exons, but we found no evidence of this.

    Retrotransposed gene copies. The L1 machinery also mediates retrotransposition of host messenger RNAs, resulting in many intronless (processed) pseudogenes in the human genome75–77. We identified 163 lineage-specific retrotransposed gene copies in human and 246 in chimpanzee (Supplementary Table S19). Correcting for incomplete sequence coverage of the chimpanzee genome, we estimate that there are ~200 and ~300 processed gene copies in human and chimpanzee, respectively. Processed genes thus appear to have arisen at a rate of ~50 per million years since the divergence of human and chimpanzee; this is lower than the estimated rate for early primate evolution75, perhaps reflecting the overall decrease in L1 activity. As expected78, ribosomal protein genes constitute the largest class in both species. The second largest class in chimpanzee corresponds to zinc finger C2H2 genes, which are not a major class in the human genome.

    Figure 5 | Length distribution of small indel events, as determined using bounded sequence gaps. Sequences present in chimpanzee but not in human (blue) or present in human but not in chimpanzee (red) are shown. The prominent spike around 300 nucleotides corresponds to SINE insertion events. Most of the indels are smaller than 20 bp, but larger indels account for the bulk of lineage-specific sequence in the two genomes.

    Figure 6 | Length distribution of large indel events (>15 kb), as determined using paired-end sequences from chimpanzee mapped against the human genome. Both the total number of candidate human insertions/chimpanzee deletions (blue) and the number of bases altered (red) are shown.


    The retrotransposon SVA and distribution of CpG islands by transposable elements. The third most active element since speciation has been SVA, which created about 1,000 copies in each lineage. SVA is a composite element (~1.5–2.5 kb) consisting of two Alu fragments, a tandem repeat and a region apparently derived from the 3′ end of a HERV-K transcript; it is probably mobilized by L1 (refs 79, 80). This element is of particular interest because each copy carries a sequence that satisfies the definition of a CpG island81 and contains potential transcription factor binding sites; the dispersion of 1,000 SVA copies could therefore be a source of regulatory differences between chimpanzee and human (Supplementary Table S20). At least three human genes contain SVA insertions near their promoters (Supplementary Table S21), one of which has been found to be differentially expressed between the two species82,83, but additional investigations will be required to determine whether the SVA insertion directly caused this difference.

    Homologous recombination between interspersed repeats. Human–chimpanzee comparison also makes it possible to study homologous recombination between nearby repeat elements as a source of genomic deletions. We found 612 deletions (totalling 2 Mb) in the human genome that appear to have resulted from recombination between two nearby Alu elements present in the common ancestor; there are 914 such events in the chimpanzee genome. (The events are not biased to (A+T)-rich DNA and thus would not explain the preferential loss of Alu elements in such regions discussed above.) Similarly, we found 26 and 48 instances involving adjacent L1 copies and 8 and 22 instances involving retroviral LTRs in human and chimpanzee, respectively. None of the repeat-mediated deletions removed an orthologous exon of a known human gene in chimpanzee.

    The genome comparison allows one to estimate the dependency of homologous recombination on divergence and distance. Homologous recombination seems to occur between quite (>25%) diverged copies (Fig. 8), whereas the number of recombination events (n) varies inversely with the distance (d, in bases) between the copies (as $n \propto d^{-1.7}$).
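The inverse dependence on distance described above can be illustrated numerically. The following is a minimal sketch that assumes only the exponent (−1.7) quoted in the text; the function name `relative_recombination_frequency` and the 300-base reference distance are illustrative choices rather than part of the original analysis, and no proportionality constant is assumed.

```python
def relative_recombination_frequency(d, d_ref=300.0, exponent=-1.7):
    """Expected number of recombination events at distance d (bases),
    relative to the number at d_ref, assuming n is proportional to d**exponent.

    d_ref and the function name are illustrative assumptions; only the
    exponent comes from the text.
    """
    if d <= 0 or d_ref <= 0:
        raise ValueError("distances must be positive")
    return (d / d_ref) ** exponent


if __name__ == "__main__":
    # Show how quickly the expected frequency falls off with distance.
    for distance in (300, 600, 3000, 30000):
        ratio = relative_recombination_frequency(distance)
        print(f"d = {distance:>6} bases: relative frequency = {ratio:.4f}")
```

Under this power law, doubling the distance between two copies reduces the expected number of events by a factor of 2^1.7, which is a little more than threefold.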


    mutation rate varies across the genome, it is crucial to normalize KA for comparisons between genes. A striking illustration of this variation is the fact that the mean KA is 37% higher in the rapidly diverging distal 10 Mb of chromosomes than in the more proximal regions. Classically, the background rate is estimated by KS, the synonymous substitution rate (coding base substitutions that, because of codon redundancy, do not result in amino acid change). Because a typical gene has only a few synonymous changes between humans and chimpanzees, and not infrequently is zero, we exploited the genome sequence to estimate the local intergenic/intronic substitution rate, KI, where appropriate. KA and KS were also estimated for each lineage separately using mouse and rat as outgroups (Fig. 9).

    The KA/KS ratio is a classical measure of the overall evolutionary constraint on a gene, where KA/KS ≪ 1 indicates that a substantial proportion of amino acid changes must have been eliminated by purifying selection. Under the assumption that synonymous substitutions are neutral, KA/KS > 1 implies, but is not a necessary condition for, adaptive or positive selection. The KA/KI ratio has the same interpretation. The ratios will sometimes be denoted below by ω with an appropriate subscript (for example, ω_human) to indicate the branch of the evolutionary tree under study.

    Evolutionary constraint on amino acid sites within the hominid lineage. Overall, human and chimpanzee genes are extremely similar, with the encoded proteins identical in the two species in 29% of cases. The median numbers of non-synonymous and synonymous substitutions per gene are two and three, respectively. About 5% of the proteins show in-frame indels, but these tend to be small (median = 1 codon) and to occur in regions of repeated sequence. The close similarity of human and chimpanzee genes necessarily limits the ability to make strong inferences about individual genes, but there is abundant data to study important sets of genes.

    The KA/KS ratio for the human–chimpanzee lineage (ω_hominid) is 0.23. The value is much lower than some recent estimates based on limited sequence data (ranging as high as 0.63 (ref. 7)), but is consistent with an estimate (0.22) from random expressed-sequence tag (EST) sequencing45. Similarly, KA/KI was also estimated as 0.23.

    Under the assumption that synonymous mutations are selectively neutral, the results imply that 77% of amino acid alterations in hominid genes are sufficiently deleterious as to be eliminated by natural selection. Because synonymous mutations are not entirely neutral (see below), the actual proportion of amino acid alterations with deleterious consequences may be higher. Consistent with previous studies8, we find that KA/KS of human polymorphisms with frequencies up to 15% is significantly higher than that of human–chimpanzee differences and more common polymorphisms (Table 3), implying that at least 25% of the deleterious amino acid alterations may often attain readily detectable frequencies and thus contribute significantly to the human genetic load.

    Evolutionary constraint on synonymous sites within hominid lineage. We next explored the evolutionary constraints on synonymous sites, specifically fourfold degenerate sites. Because such sites have no effect on the encoded protein, they are often considered to be selectively neutral in mammals.
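The step from a KA/KS of 0.23 to "77% of amino acid alterations eliminated" is simple arithmetic: if synonymous changes are neutral, the fraction of amino acid changes removed by purifying selection is roughly 1 − ω. The sketch below makes that explicit. The function names `ka_ks` and `fraction_eliminated` are hypothetical helpers introduced here for illustration, and the handling of KS = 0 simply reflects the point made earlier in the text that a gene may have no synonymous changes at all.

```python
def ka_ks(ka, ks):
    """Return the KA/KS ratio, or None when KS is zero.

    A gene with no observed synonymous changes gives an undefined ratio,
    which is one reason a local intergenic/intronic rate (KI) can be a
    more practical denominator.
    """
    if ka < 0 or ks < 0:
        raise ValueError("substitution rates must be non-negative")
    if ks == 0:
        return None
    return ka / ks


def fraction_eliminated(omega):
    """Approximate fraction of amino acid changes removed by purifying
    selection, assuming synonymous sites are neutral (1 - omega, floored at 0)."""
    if omega < 0:
        raise ValueError("omega must be non-negative")
    return max(0.0, 1.0 - omega)


if __name__ == "__main__":
    omega_hominid = 0.23  # value reported in the text
    print(f"{fraction_eliminated(omega_hominid):.0%}")  # prints 77%
```

Because the estimate depends on the neutrality assumption, any constraint on synonymous sites would make 1 − ω an underestimate, as the text notes.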

    We re-examined this assumption by comparing the divergence at fourfold degenerate sites with the divergence at nearby intronic sites. Although overall divergence rates are very similar at fourfold degenerate and intronic sites, direct comparison is misleading because the former have a higher frequency of the highly mutable CpG dinucleotides (9% compared with 2%). When CpG and non-CpG sites are considered separately, we find that both CpG sites and non-CpG sites show markedly lower divergence in exonic synonymous sites than in introns (~50% and ~30% lower, respectively). This result resolves recent conflicting reports based on limited data sets45,89 by showing that such sites are indeed under constraint.
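A small numerical sketch may help show why the pooled comparison is misleading. The CpG fractions (9% and 2%) and the approximate reductions (50% and 30%) are taken from the text; the absolute intronic divergence rates used below are hypothetical placeholders chosen only so that CpG sites are much more mutable than non-CpG sites, and the helper name `pooled_divergence` is likewise an assumption.

```python
def pooled_divergence(cpg_fraction, cpg_rate, non_cpg_rate):
    """Overall divergence of a site class that mixes CpG and non-CpG sites."""
    return cpg_fraction * cpg_rate + (1.0 - cpg_fraction) * non_cpg_rate


# Hypothetical intronic divergence rates (illustrative values only).
INTRON_CPG_RATE = 0.10
INTRON_NON_CPG_RATE = 0.010

# Reductions at fourfold degenerate sites, as quoted in the text.
fourfold_cpg_rate = INTRON_CPG_RATE * (1.0 - 0.50)
fourfold_non_cpg_rate = INTRON_NON_CPG_RATE * (1.0 - 0.30)

# CpG content of each site class, as quoted in the text.
intron_overall = pooled_divergence(0.02, INTRON_CPG_RATE, INTRON_NON_CPG_RATE)
fourfold_overall = pooled_divergence(0.09, fourfold_cpg_rate, fourfold_non_cpg_rate)

if __name__ == "__main__":
    print(f"intronic, pooled:  {intron_overall:.4f}")
    print(f"fourfold, pooled:  {fourfold_overall:.4f}")
    print(f"pooled ratio:      {fourfold_overall / intron_overall:.2f}")
```

With these placeholder rates, the pooled fourfold divergence comes out within about 10% of the pooled intronic value even though each stratum is 30–50% lower, because the higher CpG content of fourfold sites masks the constraint.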

    The constraint does not seem to result from selection on the usage of preferred codons, which has been detected in lower organisms90 such as bacteria91, yeast92 and flies93. In fact, divergence at fourfold

    Figure 8 | Dependency of homologous recombination between Alu elements on divergence and distance. a, Whereas homologous recombination occurs between quite divergent (Smith–Waterman score < 1,000), closely spaced copies, more distant recombination seems to favour a better match between the recombining repeats. b, The frequency of Alu–Alu-mediated recombination falls markedly as a function of distance between the recombining copies. The first three points (magenta) involve recombination between left or right arms of one Alu inserted into another. The high number of occurrences at a distance of 300–400 nucleotides is due to the preference of integration in the A-rich tail; exclusion of this point does not change the parameters of the equation.

    Figure 9 | Human–chimpanzee–mouse–rat tree with branch-specific KA/KS (ω) values. a, Evolutionary tree. The branch lengths are proportional to the absolute rates of amino acid divergence. b, Maximum-likelihood estimates of the rates of evolution in protein-coding genes for humans, chimpanzees, mice and rats. In the text, ω_hominid is the KA/KS of the combined human and chimpanzee branches and ω_murid of the combined mouse and rat branches. The slight difference between ω_human and ω_chimpanzee is not statistically significant; masking of some heterozygous bases in the chimpanzee sequence may contribute to the observed difference (see Supplementary Information 'Gene evolution').


    degenerate sites increases slightly with codon usage bias (Kendall's τ = 0.097, P < 10^−14). Alternatively, the observed constraint at synonymous sites might reflect 'background selection', that is, the indirect effect of purifying selection at amino acid sites causing reduced diversity and thereby reduced divergence at closely linked sites42. Given the low rate of recombination in hominid genomes (a 1 kb region experiences only ~1 crossover per 100,000 generations or 2 million years), such background selection should extend beyond exons to include nearby intronic sites94. However, when the divergence rate is plotted relative to exon–intron boundaries, we find that the rate jumps sharply within a short region of ~7 bp at the boundary (Fig. 10). This pattern strongly suggests that the action of purifying selection at synonymous sites is direct rather than indirect, and that other signals, for example those involved in splice site selection, may be embedded in the coding sequence and therefore constrain synonymous sites.

    Comparison with murids. An accurate estimate of KA/KS makes it possible to study how evolutionary constraint varies across clades. It was predicted more than 30 years ago95 that selection against deleterious mutations would depend on population size, with mutations being strongly selected only if they reduce fitness by s ≫ 1/4N (where N is effective population size). This would predict that genes would be under stronger purifying selection in murids than hominids, owing to their presumed larger population size. Initial analyses (involving fewer than 50 genes96) suggested a strong effect, but the wide variation in estimates of KA/KS in hominids7,8,97 and murids98 has complicated this analysis45.
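The population-size argument above can be made concrete with a short sketch of the s ≫ 1/4N criterion. Everything numeric here is an assumption for illustration: the effective population sizes (10,000 and 100,000) are placeholders rather than estimates from the paper, and "much greater than" is arbitrarily interpreted as a tenfold margin. The function names are hypothetical.

```python
def selection_threshold(effective_population_size):
    """Return 1/(4N), the scale below which selection is ineffective."""
    if effective_population_size <= 0:
        raise ValueError("effective population size must be positive")
    return 1.0 / (4.0 * effective_population_size)


def is_strongly_selected(s, effective_population_size, margin=10.0):
    """True if the fitness reduction s exceeds 1/(4N) by the given margin.

    The margin stands in for 'much greater than' and is an arbitrary choice.
    """
    return s > margin * selection_threshold(effective_population_size)


if __name__ == "__main__":
    s = 1e-4  # hypothetical fitness cost of a slightly deleterious mutation
    for label, n_e in (("smaller population", 1e4), ("larger population", 1e5)):
        print(label, is_strongly_selected(s, n_e))
```

With these placeholder values, the same mutation escapes strong selection in the smaller population but not in the larger one, which is the sense in which a larger effective population size is expected to lower KA/KS.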

    Using the large collection of 7,043 orthologous quartets, we calculated mean KA/KS values for the various branches of the four-species evolutionary tree (human, chimpanzee, mouse and rat; Fig. 9). The KA/KS ratio for hominids is 0.20. (This is slightly lower than the value of 0.23 obtained with all human–chimpanzee orthologues, probably reflecting slightly greater constraint on the class of proteins with clear orthologues across hominids and murids.)

    The KA/KS ratio is markedly lower for murids than for hominids (ω_murid


    follow-up studies on candidates from this list, one may be able to draw conclusions about positive selection on other individual genes. In subsequent sections, we examine the rate of divergence for sets of related genes with the aim of detecting subtler signals of accelerated evolution.

    Variation in evolutionary rate across physically linked genes. We explored how the rate of evolution varies regionally across the genome. Several studies of mammalian gene evolution have noted that the rate of amino acid substitution shows local clustering, with proteins encoded by nearby genes evolving at correlated rates16,105–107.

    Variation across chromosomes. On the basis of an analysis of ~100 genes108, it was recently reported that the normalized rate of protein evolution is greater on the nine chromosomes that underwent major structural rearrangement during human evolution (chromosomes 1, 2, 5, 9, 12, 15, 16, 17 and 18); it was suggested that such rearrangements led to reduced gene flow and accelerated adaptive evolution. A subsequent study of a collection of chimpanzee ESTs gave contradictory results109,110. With our larger data set, we re-examined this issue and found no evidence of accelerated evolution on chromosomes with major rearrangements, even if we considered each rearrangement separately (Supplementary Table S25).

    Among all hominid chromosomes, the most extreme outlier is chromosome X with a mean KA/KI of 0.32. The higher mean seems to reflect a skewed distribution at both high and low values, with the median value (0.17) being more in line with other chromosomes (0.15). The excess of low values may reflect greater purifying selection at some genes, owing to the hemizygosity of chromosome X in males. The excess of high values may reflect increased adaptive selection also resulting from hemizygosity, if a considerable proportion of advantageous alleles are recessive111. Interestingly, the higher KA/KI value on the X chromosome versus autosomes is largely restricted to genes expressed in testis83.

    Variation in local gene clusters. We next searched for genomic neighbourhoods with an unusually high density of rapidly evolving genes. Specifically, we calculated the median KA/KI for sliding windows of ten orthologues and identified extreme outliers (P < 0.001 compared to random ordering of genes; see Supplementary Information 'Gene evolution'). A total of 16 such neighbourhoods were found, which greatly exceeds random expectation (Table 4).

    Repeating the analysis with larger windows (25, 50 and 100 orthologues) did not identify additional rapidly diverging regions.
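The sliding-window scan described above can be sketched as follows. This is a simplified illustration, not the procedure from the Supplementary Information: it assumes per-gene KA/KI values are already available in chromosomal order, it tests only the single highest window median against random gene orderings, and the function names, the synthetic data and the permutation count are all assumptions.

```python
import random
from statistics import median


def sliding_window_medians(values, window=10):
    """Median of each run of `window` consecutive genes, in genomic order."""
    return [median(values[i:i + window]) for i in range(len(values) - window + 1)]


def max_window_median_pvalue(values, window=10, n_permutations=200, seed=0):
    """Permutation P value for the highest window median.

    Gene order is shuffled to mimic 'random ordering of genes'; the P value
    is the share of shuffles whose best window is at least as extreme.
    """
    rng = random.Random(seed)
    observed = max(sliding_window_medians(values, window))
    shuffled = list(values)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if max(sliding_window_medians(shuffled, window)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)


# Synthetic example: 200 background genes plus one cluster of ten fast genes.
_rng = random.Random(1)
background = [_rng.uniform(0.0, 0.4) for _ in range(200)]
values = background[:100] + [1.0] * 10 + background[100:]
observed, p_value = max_window_median_pvalue(values)

if __name__ == "__main__":
    print(f"highest window median: {observed}, permutation P: {p_value:.4f}")
```

Using the median rather than the mean makes each window robust to a single extreme gene, so a window only stands out when most of its genes are rapidly evolving.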

    Figure 10 | Purifying selection on synonymous sites. Mean divergence around exon boundaries at non-CpG, exonic, fourfold degenerate sites and intronic sites, relative to the closest mRNA splice junction. The divergence rate at exonic, fourfold degenerate sites is significantly lower than at nearby intronic sites (Mann–Whitney U-test; P < 10^−27), suggesting that purifying selection limits the rate of synonymous codon substitutions.

    In nearly all cases, the regions contain local clusters of phylogenetically and functionally related genes. The rapid diversification of gene families, postulated by ref. 112, can thus be readily discerned even at the relatively close distance of human–chimpanzee divergence. Most of the clusters are associated with functional categories such as host defence and chemosensation (see below). Examples include the epidermal differentiation complex encoding proteins that help form the cornified layer of the skin barrier (Supplementary Fig. S8), the WAP-domain cluster encoding secreted protease inhibitors with antibacterial activity, and the Siglec cluster encoding CD33-related genes. Rapid evolution in these clusters does not seem to be unique to either human or chimpanzee113,114.

    Variation in evolutionary rate across functionally related genes. We next studied variation in the evolutionary rate of functional categories of genes, based on the Gene Ontology (GO) classification115.

    Rapidly and slowly evolving categories within the hominid lineage. We started by searching for sets of functionally related genes with exceptionally high or low constraint in humans and chimpanzees. For each of the 809 categories with at least 20 genes, KA/KS was calculated by concatenating the gene sequences. The category-specific ratios were compared to the average across all orthologues to identify extreme outliers using a metric based on the binomial test (Supplementary Information 'Gene evolution' and Supplementary Tables S26–S29). The numbers of observed outliers below a specific threshold (test statistic < 0.001) were then compared to the expected distribution of outliers given randomly permuted annotations.
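One way to picture a binomial-style category test is sketched below. The exact metric is defined in the Supplementary Information and is not reproduced here; this sketch assumes a simplified version in which non-synonymous and synonymous substitution counts are pooled over all genes in a category (a stand-in for concatenating their sequences) and the pooled non-synonymous share is compared with a genome-wide share. The function names, the example counts and the genome-wide share of 0.4 (loosely motivated by the medians of two non-synonymous and three synonymous changes per gene) are all illustrative assumptions.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def category_acceleration_statistic(gene_counts, genome_nonsyn_share):
    """One-sided test statistic for an excess of non-synonymous changes.

    gene_counts is a list of (non_synonymous, synonymous) substitution
    counts, one pair per gene; counts are pooled across the category.
    """
    nonsyn = sum(a for a, _ in gene_counts)
    total = sum(a + s for a, s in gene_counts)
    return binomial_upper_tail(nonsyn, total, genome_nonsyn_share)


# Hypothetical categories of 20 genes each.
fast_category = [(4, 2)] * 20      # pooled: 80 non-synonymous of 120 changes
typical_category = [(2, 3)] * 20   # pooled: 40 non-synonymous of 100 changes

fast_stat = category_acceleration_statistic(fast_category, 0.4)
typical_stat = category_acceleration_statistic(typical_category, 0.4)

if __name__ == "__main__":
    print(f"fast category statistic:    {fast_stat:.3g}")
    print(f"typical category statistic: {typical_stat:.3g}")
```

Pooling counts across genes is what gives the category-level test its power, since individual human–chimpanzee genes carry too few substitutions to be tested one at a time; the expected number of outliers would then be obtained by repeating the calculation with gene-to-category annotations randomly permuted.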

    A total of 98 categories showed elevated KA/KS ratios at the specified threshold (Table 5). Only 30 would be expected by chance, indicating that most (but not all) of these categories undergo significantly accelerated evolution relative to the genome-wide average (P < 10^−4). The rapidly evolving categories within the hominid lineage are primarily related to immunity and host defence, reproduction, and olfaction, which are the same categories known to be undergoing rapid evolution within the broader mammalian lineage, as well as more distantly related species15,16,116. Hominids thus seem to be typical of mammals in this respect (but see below).

    A total of 251 categories showed significantly low KA/KS ratios (compared with ~32 expected by chance; P < 10^−4). These include a wide range of processes including intracellular signalling, metabolism, neurogenesis and synaptic transmission, which are evidently under stronger-than-average purifying selection. More generally, genes expressed in the brain show significantly stronger average constraint than genes expressed in other tissues83.

    Differences between hominid and murid lineages. Having found gene categories that show substantial variation in absolute evolutionary rate within hominids, we next examined variation in relative rates

    Table 4 | Rapidly diverging gene clusters in human and chimpanzee

    Location (human)  Cluster                                               Median KA/KI*
    1q21              Epidermal differentiation complex                     1.46
    6p22              Olfactory receptors and HLA-A                         0.96
    20p11             Cystatins                                             0.94
    19q13             Pregnancy-specific glycoproteins                      0.94
    17q21             Hair keratins and keratin-associated proteins         0.93
    19q13             CD33-related Siglecs                                  0.90
    20q13             WAP domain protease inhibitors                        0.90
    22q11             Immunoglobulin-λ/breakpoint critical region           0.85
    12p13             Taste receptors, type 2                               0.81
    17q12             Chemokine (C-C motif) ligands                         0.81
    19q13             Leukocyte-associated immunoglobulin-like receptors    0.80
    5q31              Protocadherin-β                                       0.77
    1q32              Complement component 4-binding proteins               0.76
    21q22             Keratin-associated proteins and uncharacterized ORFs  0.76
    1q23              CD1 antigens                                          0.72
    4q13              Chemokine (C-X-C motif) ligands                       0.70

    *Maximum median KA/KI if the cluster stretched over more than one window of ten genes.


    Table 5 | GO categories with the highest divergence rates in hominids

    GO categories within 'biological process'               Number of orthologues  Amino acid divergence  KA/KS
    GO:0007606 sensory perception of chemical stimulus      59                     0.018                  0.590
    GO:0007608 perception of smell                          41                     0.018                  0.521
    GO:0006805 xenobiotic metabolism                        40                     0.013                  0.432
    GO:0006956 complement activation                        22                     0.013                  0.428
    GO:0042035 regulation of cytokine biosynthesis          20                     0.011                  0.402
    GO:0007565 pregnancy                                    34                     0.014                  0.384
    GO:0007338 fertilization                                24                     0.010                  0.371
    GO:0008632 apoptotic programme                          36                     0.010                  0.358
    GO:0007283 spermatogenesis                              80                     0.008                  0.354
    GO:0000075 cell cycle checkpoint                        27                     0.006                  0.354

    Listed are the ten categories in the taxonomy 'biological process' with the highest KA/KS ratios, which are not significant solely due to significant subcategories.

    between murids and hominids. The KA/KS values of the GO categories are highly correlated between the hominid and murid orthologue pairs, suggesting that the selective pressures acting on particular functional categories have been largely proportional in recent hominid and recent murid evolution (Fig. 11). However, there are several categories with significantly accelerated non-synonymous divergence on each of the lineages, which might represent functions that have undergone lineage-specific positive selection or a lineage-specific relaxation of constraint (Supplementary Information 'Gene evolution' and Supplementary Tables S30–S39).

    A total of 59 categories (compared with 11 expected at random, P < 0.0003) show evidence of accelerated non-synonymous divergence in the murid lineage. These are dominated by functions and processes related to host defence, such as immune response and lymphocyte activation. Examples include genes encoding interleukins and various T-cell surface antigens (Cd4, Cd8, Cd80). Combined with the recent observation that genes involved in host defence have undergone gene family expansion in murids16,17, this suggests that the immune system has undergone extensive lineage-specific innovation in murids. Additional categories that also show relative acceleration in murids include chromatin-associated proteins and proteins involved in DNA repair. These categories may have similarly undergone stronger adaptive evolution in murids or, alternatively, they may contain fewer sites for mutations with slightly deleterious effects (with the result that the KA/KS ratios are less affected by the differences in population size96,117).

    Another 58 categories (versus 14 expected at random, P < 0.0005) show evidence of accelerated evolution in hominids, with the set dominated by genes encoding proteins involved in transport (for example, ion transport), synaptic transmission, spermatogenesis and perception of sound (Table 6). Notably, some outliers include genes with brain-related functions, compatible with a recent finding118. Potential positive selection on spermatogenesis genes in the hominids was also recently noted119. However, as above, it is possible that these categories could have more sites for slightly deleterious mutations and thus be more affected by population size differences. Sequence information from more species and from individuals within species will be necessary to distinguish between the possible explanations.

    Differences between the human and chimpanzee lineage. One of the most interesting questions is perhaps whether certain categories have undergone accelerated evolution in humans relative to chimpanzees, because such genes might underlie unique aspects of human evolution.

    As was done for hominids and murids above, we compared non-synonymous divergence for each category to search for relative acceleration in either lineage (Fig. 12). Seven categories show signs of accelerated evolution on the human lineage relative to chimpanzee, but this is only slightly more than the four expected at random (P < 0.22). Intriguingly, the single strongest outlier is 'transcription factor activity', with the 348 human genes studied having accumulated 47% more amino acid changes than their chimpanzee orthologues. Genes with accelerated divergence in human include homeotic, forkhead and other transcription factors that have key roles in early development. However, given the small number of changes involved, additional data will be required to confirm this trend. There was no excess of accelerated categories on the chimpanzee lineage.

    We also compared human genes with and without disease associations, including mental retardation, for differences in mutation rate when compared to chimpanzee. Briefly, no significant differences were observed in either the background mutation rate or in the ratio of human-specific changes to chimpanzee-specific amino acid changes (see Supplementary Information 'Gene evolution' and Supplementary Tables S40 and S41).

    We thus find minimal evidence of acceleration unique to either the human or chimpanzee lineage across broad functional categories. This is not simply due to general lack of power resulting from the small number of changes since the divergence of human and chimpanzee, because one can detect acceleration of categories in either hominid relative to either murid. For example, 29 accelerated categories versus 9 expected at random (P < 0.02) can be detected on the human lineage, and 40 categories versus 11 expected at random (P < 0.007) on the chimpanzee lineage, relative to mouse. But the

    Table 6 | GO categories with accelerated divergence rates in hominids relative to murids

    GO categories within 'biological process'           Number of     Amino acid divergence   KA/KS in   KA/KS in
                                                        orthologues   hominids     murids     hominids   murids
    GO:0007283 spermatogenesis                          43            0.0075       0.054      0.323      0.188
    GO:0006869 lipid transport                          22            0.0081       0.051      0.306      0.120
    GO:0006865 amino acid transport                     24            0.0058       0.033      0.218      0.084
    GO:0015698 inorganic anion transport                29            0.0061       0.027      0.195      0.072
    GO:0006486 protein amino acid glycosylation         50            0.0056       0.040      0.166      0.100
    GO:0019932 second-messenger-mediated signalling     58            0.0049       0.036      0.159      0.083
    GO:0007605 perception of sound                      28            0.0052       0.033      0.158      0.085
    GO:0016051 carbohydrate biosynthesis                27            0.0047       0.028      0.147      0.067
    GO:0007268 synaptic transmission                    93            0.0040       0.025      0.126      0.069
    GO:0006813 potassium ion transport                  65            0.0035       0.022      0.113      0.056

    Listed are the ten categories in the taxonomy 'biological process' with the strongest evidence for accelerated evolution in hominids relative to murids, which are not significant solely due to significant subcategories.


    outliers are largely the same for both human and chimpanzee, indicating that the fraction of amino acid mutations that have contributed to human- and chimpanzee-specific patterns of evolution must be small relative to the fraction that have contributed to a common hominid and, to a large extent, mammalian pattern of evolution.

    It was recently reported10 that several functional categories are enriched for genes with evidence of positive selection in the human lineage or the chimpanzee lineage, and that these categories are largely different between the two lineages. These results and ours differ in ways that will require further investigation. With the potential exception of some developmental regulators, the categories that ref. 10 reported as showing the strongest enrichment of positive selection in one lineage (including cell adhesion, ion transport and perception of sound) are among those that we show as having accelerated divergence in both human and chimpanzee. This suggests that positive selection and relaxation of constraints may be correlated, or alternatively, that the results of ref. 10 may be enriched for false positives in categories that have experienced particularly strong relaxation of constraints in the hominids. Data from additional primates, as well as advances in analytical methods, will be necessary to distinguish between these alternatives. At present, strong evidence of positive selection unique to the human lineage is thus limited to a handful of genes120.

    Our analysis above largely omitted genes belonging to large gene families, because gene family expansion makes it difficult to define 1:1:1:1 orthologues across hominids and murids. One of the largest such families, the olfactory receptors, is known to be undergoing rapid divergence in primates. Directed study of these genes in the draft assembly has suggested that more than 100 functional human olfactory receptors are likely to be under no evolutionary constraint121. Our analysis also omitted the majority of very recently duplicated genes owing to their lower coverage in the current chimpanzee assembly. However, recent human-specific duplications can be readily identified from the finished human genome sequence, and have previously been shown to be highly enriched for the same categories found to have high absolute rates of evolution in 1:1 orthologues here; that is, olfaction, immunity and reproduction23.

    Gene disruptions in human and chimpanzee. Whereas most genes have undergone only subtle substitutions in their amino acid sequence, a few dozen have suffered more marked changes. We found a total of 53 known or predicted human genes that are either deleted entirely (36) or partially (17) in chimpanzee (Supplementary Table S42). We have so far tested and confirmed 15 of these cases by polymerase chain reaction (PCR) or Southern blotting. An additional eight genes have sustained large deletions (>15 kb) entirely within an intron. Some genes may have been missed in this count owing to limitations of the draft genome sequence. In addition, some genes may have suffered chain termination mutations or altered reading frames in chimpanzee, but accurate identification of these will require higher-quality sequence. The sensitivity of the reciprocal analysis of genes disrupted in human is currently limited by the small number of independently predicted gene models for the chimpanzee. Some of the gene disruptions may be related to interesting biological differences between the species, as discussed below.

    Genetic basis for human- and chimpanzee-specific biology. Given the substantial number of neutral mutations, only a small subset of the observed gene differences is likely to be responsible for the key phenotypic changes in morphology, physiology and behavioural complexity between humans and chimpanzees. Determining which differences are in this evolutionarily important subset and inferring their functional consequences will require additional types of evidence, including information from clinical observations and model systems122. We describe some novel examples of genetic changes for which plausible functional or physiological consequences can be suggested.

    Apoptosis. Mouse and human are known to differ with respect to an important mediator of apoptosis, caspase-12 (refs 123–125). The protein triggers apoptosis in response to perturbed calcium homeostasis in mice, but humans seem to lack this activity owing to several mutations in the orthologous gene that together affect the protein produced by all known splice forms; the mutations include a premature stop codon and a disruption of the SHG box required for enzymatic activity of caspases. By contrast, the chimpanzee gene encodes an intact open reading frame and SHG box, indicating that the functional loss occurred in the human lineage. Intriguingly, loss-of-function mutations in mice confer increased resistance to amyloid-induced neuronal apoptosis without causing obvious developmental or behavioural defects126. The loss of function in humans may contribute to the human-specific pathology of Alzheimer's disease, which involves amyloid-induced neurotoxicity and deranged calcium homeostasis.

    Inflammatory response. Human and chimpanzee show a notable difference with respect to important mediators of immune and inflammatory responses. Three genes (IL1F7, IL1F8 and ICEBERG)

    Figure 11 | Hominid and murid KA/KS (ω) in GO categories with more than 20 analysed genes. GO categories with putatively accelerated (test statistic < 0.001; see Methods) non-synonymous divergence on the hominid lineages (red) and on the murid lineages (orange) are highlighted. Owing to the hierarchical nature of GO, the categories do not all represent independent data points. A non-redundant list of significant categories is provided in Table 8 and a complete list in Supplementary Table S30.

    Figure 12 | Human and chimpanzee KA/KS (ω) in GO categories with more than 20 analysed genes. GO categories with putatively accelerated (test statistic < 0.001; see Methods) non-synonymous divergence on the human lineage (red) and on the chimpanzee lineage (orange) are highlighted. The variance of these estimates is larger than that seen in the hominid-murid comparison owing to the small number of lineage-specific substitutions. Owing to the hierarchical nature of the GO ontology, the categories do not all represent independent data points. A complete list of categories is provided in Supplementary Table S30.


    that act in a common pathway involving the caspase-1 gene all appear to be deleted in chimpanzee. ICEBERG is thought to repress caspase-1-mediated generation of pro-inflammatory IL1 cytokines, and its absence in chimpanzee may point to species-specific modulation of the interferon-γ- and lipopolysaccharide-induced inflammatory response (ref. 127).

    Parasite resistance. Similarly, we found that two members of the primate-specific APOL gene cluster (APOL1 and APOL4) have been deleted from the chimpanzee genome. The APOL1 protein is associated with the high-density lipoprotein fraction in serum and has recently been proposed to be the lytic factor responsible for resistance to certain subspecies of Trypanosoma brucei, the parasite that causes human sleeping sickness and the veterinary disease nagana (ref. 128). The loss of the APOL1 gene in chimpanzees could thus explain the observation that human, gorilla and baboon possess the trypanosome lytic factor, whereas the chimpanzee does not (ref. 129).

    Sialic acid biology related proteins. Sialic acids are cell-surface sugars that mediate many biological functions (ref. 130). Of 54 genes involved in sialic acid biology, 47 were suitable for analysis. We confirmed and extended findings on several that have undergone human-specific changes, including disruptions, deletions and domain-specific functional changes (refs 113, 131, 132). Human- and chimpanzee-specific changes were also found in otherwise evolutionarily conserved sialyl motifs in four sialyltransferases (ST6GAL1, ST6GALNAC3, ST6GALNAC4 and ST8SIA2), suggesting changes in donor and/or acceptor binding (ref. 130). Lineage-specific changes were found in a complement factor H (HF1) sialic acid binding domain associated with human disease (ref. 133). Human SIGLEC11 has undergone gene conversion with a nearby pseudogene, correlating with acquisition of human-specific brain expression and altered binding properties (ref. 134).

    Human disease alleles. We next sought to identify putative functional differences between the species by searching for instances in which a human disease-causing allele appears to be the wild-type allele in the chimpanzee. Starting from 12,164 catalogued disease variants in 1,384 human genes, we identified 16 cases in which the altered sequence in a disease allele matched the chimpanzee sequence, and had plausible support in the literature (Table 7; see also Supplementary Table S43). Upon re-sequencing in seven chimpanzees, 15 cases were confirmed homozygous in all individuals, whereas one (PON1 I102V) appears to be a shared polymorphism (Supplementary Table S44).

    Six cases represent de novo human mutations associated with simple mendelian disorders. Similar cases have also been found in comparisons of more distantly related mammals (ref. 135), as well as

    Table 7 | Candidate human disease variants found in chimpanzee

    Gene    Variant*  Ref.  Disease association                              Ancestral   Frequency
    AIRE    P252L     159   Autoimmune syndrome                              Unresolved  0
    MKKS    R518H     160   Bardet-Biedl syndrome                            Wild type   0
    MLH1    A441T     161   Colorectal cancer                                Wild type   0
    MYOC    Q48H      162   Glaucoma                                         Wild type   0
    OTC     T125M     163   Hyperammonaemia                                  Wild type   0
    PRSS1   N29T      137   Pancreatitis                                     Disease     0
    ABCA1   I883M     164   Coronary artery disease                          Unresolved  0.136
    APOE    C130R     165   Coronary artery disease and Alzheimer's disease  Disease     0.15
    DIO2    T92A      166   Insulin resistance                               Disease     0.35
    ENPP1   K121Q     167   Insulin resistance                               Disease     0.17
    GSTP1   I105V     168   Oral cancer                                      Disease     0.348
    PON1    I102V     169   Prostate cancer                                  Wild type   0.016
    PON1    Q192R     170   Coronary artery disease                          Disease     0.3
    PPARG   A12P      139   Type 2 diabetes                                  Disease     0.85
    SLC2A2  T110I     171   Type 2 diabetes                                  Disease     0.12
    UCP1    A64T      172   Waist-to-hip ratio                               Disease     0.12

    * This takes the following format: benign variant, codon number, disease/chimpanzee variant.
    Ancestral: ancestral variant as inferred from closest available primate outgroups (Supplementary Information).
    Frequency: frequency of the disease allele in human study population.
    A further table footnote reads "Polymorphic in chimpanzee".

    between insects (ref. 136), and have been interpreted as a consequence of a relatively high rate of compensatory mutations. If compensatory mutations are more likely to be fixed by positive selection than by neutral drift (ref. 136), then the variants identified here might point towards adaptive differences between humans and chimpanzees. For example, the ancestral Thr29 allele of cationic trypsinogen (PRSS1) causes autosomal dominant pancreatitis in humans (ref. 137), suggesting that the human-specific Asn29 allele may represent a digestion-related molecular adaptation (ref. 138).

    The remaining ten cases represent common human polymorphisms that have been reported to be associated with complex traits, including coronary artery disease and diabetes mellitus. In all of these cases we confirmed that the disease-associated allele in humans is indeed the ancestral allele by showing that it is carried not only by chimpanzee but also by outgroups such as the macaque. These ancestral alleles may thus have become human-specific risk factors due to changes in human physiology or environment, and the polymorphisms may represent ongoing adaptations. For example, PPARG Pro12 is the wild-type allele in chimpanzee but has been clearly associated with increased risk of type 2 diabetes in human (ref. 139). It is tempting to speculate that this allele may represent an ancestral thrifty genotype (ref. 140).

    The current results must be interpreted with caution, because few complex disease associations have been firmly established. The fact that the human disease allele is the wild-type allele in chimpanzee may actually indicate that some of the putative associations are spurious and not causal. However, this approach can be expected to become increasingly fruitful as the quality and completeness of the disease mutation databases improve.

    Human population genetics

    The chimpanzee has a special role in informing studies of human population genetics, a field that is undergoing rapid expansion and acquiring new relevance to human medical genetics (ref. 141). The chimpanzee sequence allows recognition of those human alleles that represent the ancestral state and the derived state. It also allows estimates of local mutation rates, which serve as an important baseline in searching for signs of natural selection.

    Ancestral and derived alleles. Of ~7.2 million SNPs mapped to the human genome in the current public database, we could assign the alleles as ancestral or derived in 80% of the cases according to which allele agrees with the chimpanzee genome sequence (ref. 142) (see Supplementary Information 'Human population genetics'). For the remaining cases, no assignment could be made because of the following: the orthologous chimpanzee base differed from both human alleles (1.2%); was polymorphic in the chimpanzee sequences obtained (0.4%); or could not be reliably identified with the current draft sequence of the chimpanzee (18.8%), with many of these occurring in repeated or segmentally duplicated sequence. The first two cases arise presumably because a second mutation occurred in the chimpanzee lineage. It should be possible to resolve most of these cases by examining a close outgroup such as gorilla or orang-utan.
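The assignment rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the consortium's pipeline: the function name `classify_snp` and its inputs are hypothetical, and it assumes that the orthologous chimpanzee base or bases have already been extracted from the genome alignment.

```python
def classify_snp(human_alleles, chimp_bases):
    """Label the alleles of a biallelic human SNP using chimpanzee as outgroup.

    human_alleles: tuple of the two human alleles, e.g. ("A", "G")
    chimp_bases:   set of bases seen at the orthologous chimpanzee position
                   (empty if the position could not be reliably aligned)
    """
    if not chimp_bases:
        return ("unassigned", "no reliable chimpanzee orthologue")
    if len(chimp_bases) > 1:
        return ("unassigned", "polymorphic in chimpanzee")
    (chimp,) = chimp_bases  # exactly one base at this point
    if chimp not in human_alleles:
        return ("unassigned", "chimpanzee differs from both human alleles")
    # The allele shared with chimpanzee is taken as ancestral, the other as derived.
    derived = human_alleles[0] if human_alleles[1] == chimp else human_alleles[1]
    return ("assigned", {"ancestral": chimp, "derived": derived})


result = classify_snp(("A", "G"), {"G"})
print(result)
```

The three "unassigned" branches correspond to the three categories listed in the text (no reliable orthologous base, polymorphic in chimpanzee, chimpanzee base differing from both human alleles); resolving them with a second outgroup is left out of the sketch.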

    Mutations in the chimpanzee may also lead to the erroneous assignment of human alleles as derived alleles. This error rate can be estimated as the probability of a second mutation resulting in the chimpanzee sequence matching the derived allele (see Supplementary Information 'Human population genetics'). The estimated error rate for typical SNPs is 0.5%, owing to the low nucleotide substitution rate. The exceptions are those SNPs for which the human alleles are CpG and TpG and the chimpanzee sequence is TpG. For these, a non-negligible fraction may have arisen by two independent deamination events within an ancestral CpG dinucleotide, which are well-known mutational hotspots (ref. 51; also see above). Human SNPs in a CpG context for which the orthologous chimpanzee sequence is TpG account for 12% of the total, and have an estimated error rate of 9.8%. Across all SNPs, the average error rate, ε, is thus estimated to be ~1.6%.
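The genome-wide figure follows from a weighted average over the two SNP classes. The short calculation below reproduces it, assuming (as a simplification not spelled out in the text) that every SNP outside the CpG/TpG class has the typical 0.5% error rate; the variable names are illustrative.

```python
# Fractions and error rates quoted in the text.
cpg_fraction = 0.12    # share of SNPs in the CpG/TpG class
cpg_error = 0.098      # estimated mis-assignment rate for that class
typical_error = 0.005  # estimated mis-assignment rate for all other SNPs (assumption)

# Weighted average of the two classes.
average_error = cpg_fraction * cpg_error + (1 - cpg_fraction) * typical_error
print(f"{average_error:.4f}")  # 0.0162, i.e. about 1.6%
```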

    We compared the distribution of allele frequencies for ancestral and derived alleles using a database of allele frequencies for ~120,000 SNPs (see Supplementary Information 'Human population genetics'). As expected, ancestral alleles tend to have much higher frequencies than derived alleles (Supplementary Fig. S9). Nonetheless, a significant proportion of derived alleles have high frequencies: 9.1% of derived alleles have frequency ≥80%.

    An elegant result in population genetics states that, for a randomly interbreeding population of constant size, the probability that an allele is ancestral is equal to its frequency (ref. 143). We explored the extent to which this simple theoretical expectation fits the human population. We tabulated the proportion p_a(x) of ancestral alleles for various frequencies x and compared this with the prediction p_a(x) = x (Fig. 13).
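To see how mis-assigned alleles would distort this comparison, consider a minimal sketch in Python. It assumes that the idealized relationship p_a(x) = x holds exactly and that a fraction eps of ancestral/derived labels is swapped independently of frequency; the function names and the ordinary least-squares fit over 1% bins are illustrative choices, not the authors' procedure. Under these assumptions the observed proportion becomes (1 - eps)·x + eps·(1 - x), a straight line with slope 1 - 2·eps.

```python
def observed_ancestral_fraction(x, eps):
    """p_a(x) seen after a fraction eps of ancestral/derived labels is swapped.

    An allele at frequency x is labelled ancestral if it is truly ancestral
    (probability x) and labelled correctly, or truly derived and mislabelled.
    """
    return (1 - eps) * x + eps * (1 - x)


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


bins = [(i + 0.5) / 100 for i in range(100)]  # midpoints of 1% frequency bins
eps = 0.016                                   # average error rate quoted above
ys = [observed_ancestral_fraction(x, eps) for x in bins]
slope = least_squares_slope(bins, ys)
print(round(slope, 3))  # 0.968
```

In this idealized setting a 1.6% error rate lowers the slope only from 1 to about 0.97.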

    The data lie near the predicted line, but the observed slope (0.83) is substantially less than 1. One explanation for this deviation is that some ancestral alleles are incorrectly assigned (an error rate of ε would artificially decrease the slope by a factor of 1 − 2ε). However, with ε estimated to be only 1.6%, errors can only explain a small part of the deviation. The most likely explanation is the presence of bottlenecks during human history, which tend to flatten the distribution of allele frequencies. Theoretical calculations indicate that a recent bottleneck would decrease the slope by a factor of (1 − b), where b is the inbreeding coefficient induced by the bottleneck (see Supplementary Information 'Human population genetics' and Supplementary Fig. S10). This suggests that measurements of the slope in different human groups may shed light on population-specific bottlenecks. Consistent with this, preliminary analyses of allele frequencies in several regions for SNPs obtained by systematic uniform sampling indicate that the slope is significantly lower than 1 in European and Asian samples and close to 1 in an African sample (see Supplementary Information 'Human population genetics' and Supplementary Fig. S11).

    Signatures of strong selective sweeps in recent human history. The pattern of human genetic variation holds substantial information about selection events that have shaped our species. Strong positive selection creates the distinctive signature of a selective sweep, whereby a rare allele rapidly rises to fixation and carries the haplotype on which it occurs to high frequency (the hitchhiking effect). The surrounding region should show two distinctive signatures: a significant reduction of overall diversity, and an excess of derived alleles with high frequency in the population owing to hitchhiking of

    Figure 13 | The observed fraction of ancestral alleles in 1% bins of observed frequency. The solid line shows the regression (b = 0.83). The dotted line shows the theoretical relationship p_a(x) = x. Note that because each variant yields a derived and an ancestral allele, the data are necessarily symmetrical about 0.5.

    derived alleles on the selected haplotype (see Supplementary Information 'Human population genetics'). The pattern might be detectable for up to 250,000 years after a selective sweep has ended (ref. 144). Notably, the chimpanzee genome provides crucial baseline information required for accurate assessment of both signatures.

    The size of the interval affected by a selective sweep is expected to scale roughly with s, the selective advantage due to the mutation. Simulations can be used to study the distribution of the interval size (see Supplementary Information 'Human population genetics'). With s = 1%, the interval over which heterozygosity falls by 50% has a modal size of 600 kb and a probability of greater than 10% of exceeding 1 Mb.

    We undertook an initial scan for large regions (>1 Mb) with the two signatures suggestive of strong selective sweeps in recent human history. We began by identifying regions in which the observed human diversity rate was much lower than the expectation based on the observed divergence rate with chimpanzee. The human diversity rate was measured as the number of occurrences from a database of 1.92 million SNPs identified by shotgun sequencing in a panel of African-American individuals (see Supplementary Information 'Genome sequencing and assembly'). The comparison with the chimpanzee eliminates regions in which low diversity simply reflects a low mutation rate in the region. Regions were identified based on a simple statistical procedure (see Supplementary Information 'Human population genetics'). Six genomic regions stand out as clear outliers that show significantly reduced diversity relative to divergence (Table 8; see also Supplementary Fig. S12).
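The idea of scoring diversity relative to divergence can be illustrated with a toy calculation. The statistical procedure actually used is described only in the Supplementary Information, so the sketch below is a stand-in under stated assumptions: SNP counts per window are modelled as Poisson with a mean proportional to the number of human-chimpanzee divergent sites in the window, and each window is scored by the negative log10 probability of seeing so few SNPs. The function names and the window data are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def low_diversity_scores(windows):
    """Score windows for reduced diversity relative to divergence.

    windows: list of (name, snp_count, divergent_sites) tuples.
    Returns {name: -log10 P(SNP count <= observed)} under the Poisson model.
    """
    total_snps = sum(w[1] for w in windows)
    total_div = sum(w[2] for w in windows)
    rate = total_snps / total_div  # overall SNPs per divergent site
    scores = {}
    for name, snps, div in windows:
        expected = rate * div  # expectation scaled by local divergence
        p = max(poisson_cdf(snps, expected), 1e-300)  # guard against log(0)
        scores[name] = -math.log10(p)
    return scores


# Made-up windows: w3 has far fewer SNPs than its divergence would predict.
windows = [("w1", 60, 1200), ("w2", 58, 1150), ("w3", 12, 1250), ("w4", 63, 1240)]
scores = low_diversity_scores(windows)
outlier = max(scores, key=scores.get)
print(outlier)
```

Scaling the expectation by local divergence is what keeps a window with a low mutation rate (few SNPs and few divergent sites) from being flagged, which is the role the text assigns to the chimpanzee comparison.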

    We next tested whether these six regions show a high proportion of SNPs with high-frequency derived alleles (defined here as alleles with frequency ≥80%). Within each region, we focused on the 1-Mb interval with the greatest discrepancy between diversity and divergence and compared it to 1-Mb regions throughout the genome. For the database of 120,000 SNPs with allele frequencies discussed above, the typical 1-Mb region in the human genome contains ~40 SNPs, and the proportion p_h of SNPs with high-frequency derived alleles is ~9.1%. All six regions identified by our scan for reduced diversity have a higher than average fraction of high-frequency derived alleles; all six fall within the top 10% genome-wide and three fall within the top 1%. Although this is not definitive evidence for any particular region, the joint probability of all six regions randomly scoring in the top 10% is 10^-6. The results indicate that the six regions are candidates for strong selective sweeps during the past 250,000 years (ref. 144). The regions differ notably with respect to gene content, ranging from one containing 57 annotated genes (chromosome 22) to another with no annotated genes whatsoever (chromosome 4). We have no evidence to implicate any individual functional element as a target of recent selection at this point, but the regions contain a number of interesting candidates for follow-up studies. Intriguingly, the chromosome 4 gene desert, which flanks a protocadherin gene and is conserved across vertebrates (ref. 15), has been implicated in two independent studies as being associated with obesity (refs 145, 146).
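The quoted joint probability corresponds to treating the six regions as independent draws, each with a 10% chance of landing in the top decile by chance. The arithmetic, with independence as the stated assumption, is:

```python
n_regions = 6
p_top_decile = 0.10  # chance that a random 1-Mb region scores in the top 10%

# Probability that all six regions land in the top decile by chance,
# assuming the regions behave independently.
joint = p_top_decile ** n_regions
print(f"{joint:.0e}")  # 1e-06
```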

    In addition to the six regions, one further genomic region deserves mention: an interval of 7.6 Mb on chromosome 7q (see Supplementary Information 'Human population genetics'). The interval contains several regions with high scores in the diversity-divergence analysis (including the seventh highest score overall) as well as in the proportion of high-frequency derived alleles. The region contains the FOXP2 and CFTR genes. The former has been the subject of much interest as a possible target for selection during human evolution (ref. 147) and the latter as a target of selection in European populations (ref. 148).

    Convincing proof of past selection will require careful analysis of the precise pattern of genetic variation in the region and the identification of a likely target of selection. Nonetheless, our findings suggest that the approach outlined here may help to unlock some of the secrets of recent human evolution through a combination of within-species and cross-species comparison.


    Table 8 | Human regions with strongest signal of selection based on diversity relative to divergence

    Chromosome  Start (Mb)  End (Mb)  Regression log-score  Skew P-value  Genes
    1           48.58       52.58     103.3                 0.071         Fourteen known genes from ELAVL4 to GPX7
    2           144.35      148.47    84.8                  0.074         ARHGAP15 (partial), GTDC1 and ZFHX1B
    22          36.15       40.22     81.8                  0.00022       Fifty-seven known genes from CARD10 to PMM1
    12          84.69       89.01     80.9                  0.031         Ten known genes from PAMCI to ATP2B1
    8           34.91       37.54     76.9                  0.00032       UNC5D and FKSG2
    4           32.42       35.62     55.9                  0.00067       No known genes or Ensembl predictions

    Discussion

    Our knowledge of the human genome is greatly advanced by the availability of a second hominid genome. Some questions can be directly answered by comparing the human and chimpanzee sequences, including estimates of regional mutation rates and average selective constraints on gene classes. Other questions can be addressed in conjunction with other large datasets, such as issues in human population genetics for which the chimpanzee genome provides crucial controls. For still other questions, the chimpanzee genome simply provides a starting point for further investigation.

    The hardest such question is: what makes us human? The challenge lies in the fact that most evolutionary change is due to neutral drift. Adaptive changes comprise only a small minority of the total genetic variation between two species. As a result, the extent of phenotypic variation between organisms is not strictly related to the degree of sequence variation. For example, gross phenotypic variation between human and chimpanzee is much greater than between the mouse species Mus musculus and Mus spretus, although the sequence difference in the two cases is similar. On the other hand, dogs show considerable phenotypic variation despite having little overall sequence variation (~0.15%). Genomic comparison markedly narrows the search for the functionally important differences between species, but specific biological insights will be needed to sift the still-large list of candidates to separate adaptive changes from neutral background.

    Our comparative analysis suggests that the patterns of molecular evolution in the hominids are typical of a broader class of mammals in many ways, but distinctive in certain respects. As with the murids, the most rapidly evolving gene families are those involved in reproduction and host defence. In contrast to the murids, however, hominids appear to experience substantially weaker negative selection; this probably reflects their smaller population size. Consequently, hominids accumulate deleterious mutations that would be eliminated by purifying selection in murids. This may be both an advantage and a disadvantage. Although decreased purifying selection may tend to erode overall fitness, it may also allow hominids to explore larger regions of the fitness landscape and thereby achieve evolutionary adaptations that can only be reached by passing through intermediate states of inferior fitness (refs 149, 150).

    Although the analyses presented here focus on protein-coding sequences, the chimpanzee genome sequence also allows systematic analysis of the recent evolution of gene regulatory elements for the first time. Initial analysis of both gene expression patterns and promoter regions suggests that their overall patterns of evolution closely mirror those of protein-coding regions. In an accompanying paper (ref. 83), we show that the rates of change in gene expression among different tissues in human and chimpanzee correlate with the nucleotide divergence in the putative proximal promoters and, even more interestingly, with the average level of constraint on proteins in the same tissues. Another study (ref. 151) has similarly used the chimpanzee sequence described here to show that gene promoter regions are also evolving under markedly less constraint in hominids than in murids.

    The draft chimpanzee sequence here is sufficient for initial analyses, but it is still imperfect and incomplete. Definitive studies of gene and genome evolution (including pseudogene formation, gene family expansion and segmental duplication) will require high-quality finished sequence. In this regard, we note that efforts are already underway to construct a BAC-based physical map and to increase the shotgun sequence coverage to approximately sixfold redundancy. The added coverage alone will not affect the analysis greatly, but plans are in place to produce finished sequence for difficult-to-sequence and important segments of the genome.

    Our close biological relatedness to chimpanzees not only allows unique insights into human biology, it also creates ethical obligations. Although the genome sequence was acquired without harm to chimpanzees, the availability of the sequence may increase pressure to use chimpanzees in experimentation. We strongly oppose reducing the protection of chimpanzees and instead advocate the policy positions suggested by an accompanying paper (ref. 152). Furthermore, the existence of chimpanzees and other great apes in their native habitats is increasingly threatened by human civilization. More effective policies are urgently needed to protect them in the wild. We hope that elaborating how few differences separate our species will broaden recognition of our duty to these extraordinary primates that stand as our siblings in the family of life.

    METHODS

    Sequencing and assembly. Approximately 22.5 million sequence reads were derived from both ends of inserts (paired end reads) from 4-, 10-, 40- and 180-kb clones, all prepared from primary blood lymphocyte DNA. Genomic resources available from the source animal include a lymphoid cell line (S006006) and genomic DNA (NS06006) at Coriell Cell Repositories (http://locus.umdnj.edu/ccr/), as well as a BAC library (CHORI-251) (ref. 153) (see also Supplementary Information 'Genome sequencing and assembly').

    Genome alignment. BLASTZ (ref. 154) was used to align non-repetitive chimpanzee regions against repeat-masked human sequence. BLAT (ref. 155) was subsequently used to align the more repetitive regions. The combined alignments were chained (ref. 156) and only best reciprocal alignments were retained for further analysis.

    Insertions and deletions. Small insertion/deletion (indel) events (<15 kb) were parsed directly from the BLASTZ genome alignment by counting the number and size of alignment gaps between bases within the same contig. Sites of large-scale indels (>15 kb) were detected from discordant placements of paired sequence reads against the human assembly. Size thresholds were obtained from both human fosmid alignments on human sequence (40 ± 2.58 kb) and chimpanzee plasmid alignments against human chromosome 21 (4.5 ± 1.84 kb). Indels were inferred by two or more pairs surpassing these thresholds by more than two standard deviations and the absence of sequence data within the discordancy.

    Gene annotation. A total of 19,277 human RefSeq transcripts (ref. 157), representing 16,045 distinct genes, were indirectly aligned to the chimpanzee sequence via the genome alignment. After removing low-quality sequences and likely alignment artefacts, an initial catalogue containing 13,454 distinct 1:1 human-chimpanzee orthologues was created for the analyses described here. A subset of 7,043 of these genes with unambiguous mouse and rat orthologues were realigned using ClustalW (ref. 158) for the lineage-specific analyses. Updated gene catalogues can be obtained from http://www.ensembl.org.

    Rates of divergence. Nucleotide divergence rates were estimated using baseml with the REV model. Non-CpG rates were estimated from all sites that did not overlap a CG dinucleotide in either human or chimpanzee. KA and KS were estimated jointly for each orthologue using codeml with the F3x4 codon frequency model and no additional constraints, except for the comparison of divergent and polymorphic substitutions where KA/KS for both was estimated as (DA/NA)/(DS/NS), with NS/NA, the ratio of synonymous to non-synonymous sites, estimated as 0.36 from the orthologue alignments. Unless otherwise specified, KA/KS for a set of genes was calculated by summing the number of substitutions and the number of sites to obtain KA and KS for the concatenated set before taking the ratio.
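The two ratio calculations described under 'Rates of divergence' can be sketched as follows. This is a simplified illustration rather than the codeml-based estimation used in the paper: it works on raw counts with no correction for multiple substitutions, and the function names, dictionary keys and example counts are hypothetical.

```python
def ka_ks_from_counts(d_a, d_s, ns_over_na=0.36):
    """(DA/NA)/(DS/NS), rewritten as (DA/DS) * (NS/NA).

    d_a, d_s:   counts of non-synonymous and synonymous changes
    ns_over_na: ratio of synonymous to non-synonymous sites (0.36 in the text)
    """
    if d_s == 0:
        raise ValueError("no synonymous changes: ratio undefined")
    return (d_a / d_s) * ns_over_na


def pooled_ka_ks(genes):
    """KA/KS for a set of genes: sum substitutions and sites, then take one ratio.

    genes: list of dicts with keys 'a_subs', 'a_sites', 's_subs', 's_sites'.
    """
    a_subs = sum(g["a_subs"] for g in genes)
    a_sites = sum(g["a_sites"] for g in genes)
    s_subs = sum(g["s_subs"] for g in genes)
    s_sites = sum(g["s_sites"] for g in genes)
    ka = a_subs / a_sites  # non-synonymous substitutions per non-synonymous site
    ks = s_subs / s_sites  # synonymous substitutions per synonymous site
    return ka / ks


# Made-up counts; the second gene has no substitutions at all.
genes = [
    {"a_subs": 2, "a_sites": 700, "s_subs": 4, "s_sites": 250},
    {"a_subs": 0, "a_sites": 900, "s_subs": 0, "s_sites": 330},
    {"a_subs": 1, "a_sites": 400, "s_subs": 3, "s_sites": 150},
]
print(pooled_ka_ks(genes))
print(ka_ks_from_counts(10, 20))
```

One practical consequence of pooling, visible in the example, is that a gene with no synonymous substitutions still contributes its sites to the set-level estimate, whereas its individual ratio would be undefined.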
