Proc. Natl. Acad. Sci. USA
Vol. 91, pp. 6515-6519, July 1994
Evolution

Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms

J. L. MOUNTAIN AND L. L. CAVALLI-SFORZA
Department of Genetics, Stanford University Medical Center, Stanford, CA 94305-5120

Contributed by L. L. Cavalli-Sforza, March 11, 1994

ABSTRACT Testing of nuclear DNA polymorphisms in human populations has been extended to closely related primates. For many polymorphisms, one allele is shared by two or more species: such shared alleles are likely to be ancestral and provide insight not only into the relationships among the primates but also into the evolutionary history of modern humans. Humans from among eight worldwide populations share an allele with chimpanzees for 62 out of 79 polymorphisms examined. Frequencies of these ancestral alleles strengthen the conclusion that the earliest major separation of modern humans was between Africans and non-Africans. The average time since mutation of the ancestral alleles producing the current set of polymorphisms is estimated to be 700,000 years. While differences among ancestral allele frequencies in human populations suggest that natural selection may have played a role in the evolution of a subset of these polymorphisms, simulations indicate that a European bias in the ascertainment of polymorphisms may be at least partially responsible for observed differences. Simulations also suggest that observed heterozygosity levels in African populations, for classical polymorphisms and restriction fragment length polymorphisms, are artificially low due to the same bias. Observed patterns of mean heterozygosity and mean ancestral allele frequency provide support for the hypothesis that Europeans and northeast Asians are closely related. This work suggests that polymorphisms should be selected by testing a random sample of extant humans.
A critical step toward inferring any phylogenetic tree is the placement of the root: this often depends upon a set of assumptions that are difficult to test. Identification of ancestral states may provide a cladistic solution to this problem (1). For any given polymorphism found in one species, an allele shared with a closely related species is likely to be ancestral if mutation rates are low. Here we describe analyses of 79 human restriction fragment length polymorphisms (RFLPs) tested in eight human populations from four continents. We examined these RFLPs in nonhuman primates and, thereby, identified likely ancestral alleles for 62 of these polymorphisms.

The intraspecific analysis described below is an adaptation of cladistic analysis to the study of gene frequencies; by incorporating data on ancestral states, we were able to root a least squares tree relating eight human populations with greater confidence than we could without such information. Identification of ancestral states also enabled us to address other issues, such as the age of more recent nonancestral alleles and the distributions of ancestral allele frequencies, and their implications for human evolutionary history.

RFLPs in Eight Human Populations. The 79 human RFLPs considered here contain about equal numbers of random DNA fragments and known genes. Initially detected among Europeans, they have now been tested on samples from eight human populations: five were from cultures of transformed B lymphocytes (Central African Republic and Zaire pygmies, Chinese, Japanese, and Melanesians), one (Europeans) was from opportunistic samples or literature data, and two (Australians and New Guineans) were from DNA obtained from the late Allan C. Wilson (2, 3). Samples, RFLPs, and allele frequencies are discussed in detail elsewhere (4-6).

Trees relating the eight populations have been inferred according to two methods (Fig. 1). These two trees are topologically similar except for the position of the European sample. The average linkage tree (Fig. 1A) assumes a constant evolutionary rate, is therefore rooted, and groups the Europeans with the Japanese and Chinese samples. The neighbor-joining tree (Fig. 1B) allows a variable evolutionary rate, is unrooted, and places the European sample close to the center of the tree with an extremely short branch. The degree to which a population is central in the tree relative to other populations can be measured by a quantity, $S_i$ (Table 1). For the 79 polymorphisms, the value of $S_i$ for Europeans, $S_{eur}$ (Table 1), is significantly different from zero, indicating that the European branch is significantly short relative to all other branches in the tree. A similar tree was obtained with classical polymorphisms (12, 13).

As discussed (11), the centrality of Europeans in the neighbor-joining trees is a puzzle, unlikely to be resolved by concluding that Europeans have not evolved since their origin. Such centrality is more likely due to the ascertainment of most polymorphisms (both RFLPs and classical markers: blood groups, HLA, protein, and enzymes) through testing of Europeans, the origin of Europeans as an ancient admixture, or both (11). We approach this puzzle through simulations described below.

Identification of Ancestral Alleles. Seventy-nine loci polymorphic among humans were tested in three nonhuman primate samples (chimpanzees, gorillas, and orangutans). Humans share exactly one allele with chimpanzees for 62 loci (6), with gorillas for 42 loci, and with orangutans for 41 loci. Of the 62 loci where humans and chimpanzees share an allele, 56 are biallelic. For one additional locus, a polymorphism appears to be shared by humans and chimpanzees. It seems likely that almost all human polymorphisms arose in the human lineage after the separation from chimpanzees; here, we classify as ancestral the human allele observed in chimpanzees.
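As an illustration only, the classification rule just stated can be sketched in Python. The function name and the allele labels (`A1`, `A2`, `A3`) are hypothetical and do not come from the study; the sketch assigns an ancestral allele only when humans share exactly one allele with chimpanzees, and treats a shared polymorphism as unclassified.

```python
def ancestral_allele(human_alleles, chimp_alleles):
    """Return the single human allele also seen in chimpanzees, else None."""
    shared = set(human_alleles) & set(chimp_alleles)
    if len(shared) == 1:
        return next(iter(shared))
    return None  # nothing shared, or more than one allele shared


# Hypothetical loci, for illustration only.
print(ancestral_allele({"A1", "A2"}, {"A1"}))        # one shared allele
print(ancestral_allele({"A1", "A2"}, {"A3"}))        # nothing shared
print(ancestral_allele({"A1", "A2"}, {"A1", "A2"}))  # shared polymorphism
```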
Cladistic Tree Inference and Average Age of New Alleles. Identification of ancestral alleles provides a cladistic criterion for the inference of evolutionary trees. By assuming a roughly constant evolutionary rate, we can infer the position of the root of a tree with greater confidence than without identification of ancestral states. We exemplify with the simple case of three populations (Fig. 3). One can estimate, on a relative scale, all four unknown branch length parameters, thereby obtaining the position of the root, R. We define a virtual origin, V, as a population with the frequency of the ancestral allele at each locus equal to the frequency, $p_0$, of the

Abbreviation: RFLP, restriction fragment length polymorphism.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.




FIG. 1. Trees inferred from upper triangle of distance matrix of Fig. 2 according to the average linkage (Unweighted Pair Group Method with Arithmetic Mean, or UPGMA) algorithm (7) (A) and the neighbor-joining algorithm (8) (B) (for bootstrap analyses, see ref. 6). CAR, Central African Republic.

ancestral allele when the first mutant arose [$p_0 = (2N_e - 1)/(2N_e)$, where $N_e$ is the effective population size]. While V does not correspond to a single time, it represents an average point of origin of nonancestral alleles. We can estimate a genetic distance, based on all polymorphisms, between this virtual ancestral population and each extant population (Fig. 2).

From the matrix of genetic distances including such a virtual ancestor (Fig. 2, below the diagonal), we inferred the human portion of the tree in Fig. 4 (from V to the present day). The topology of this tree is identical to that of Fig. 1A. The root clearly separates African and non-African populations, strengthening the conclusion, reached previously without information regarding ancestral states (11), that modern humans arose in or near Africa.

Having inferred distances between extant populations and the virtual origin, we can also estimate a parameter, e, the average time of initial mutation from the ancestral state leading to the current polymorphism (Fig. 3). Small arrows in Fig. 4 suggest the origin of a number of alleles, each arising at an individual locus and surviving in human populations to the present day, along with an ancestral allele. The mean time of origin is represented in Fig. 3 by the branch from the virtual origin (V) to the tips of the tree. The length of the time from V to the first split in the tree of Fig. 4 is 6 times greater than the interval between the earliest split in the human tree and

FIG. 2. Genetic distances. ZAI, Zaire Pygmy; CAR, Central African Republic Pygmy; MEL, Melanesian; NGN, New Guinean; AUS, Australian; JPN, Japanese; CHI, Chinese; EUR, European. Above diagonal, distances among eight human populations estimated from allele frequencies for 79 RFLPs (6) according to ref. 9; below diagonal, unweighted $F_{ST}$ distances among eight extant human populations and a virtual population representing the origin of nonancestral alleles, estimated from ancestral allele frequencies for 62 loci (6). Alleles shared with chimpanzees were assumed to be ancestral. $F_{ST}$ distance between two populations is defined as $(x_1 - x_2)^2/[2p_0(1 - p_0)]$, where $x_i$ is the frequency of the ancestral allele in population i and $p_0$ is the initial frequency. Frequencies for the virtual ancestral population (V) are $p_0 = (2N_e - 1)/(2N_e)$ [in practice they were set to 1 and the quantity $p_0(1 - p_0)$, identical for all comparisons between population pairs, was omitted]. Boxes show blocks that include distances between clusters (see Table 1).
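The $F_{ST}$ distance defined in the legend of Fig. 2 can be illustrated with a short Python sketch. This is not the authors' program: the function name is hypothetical, the example frequencies are invented, and averaging the per-locus terms over loci is an assumption, since the legend defines only the single-locus quantity. Passing `p0=None` mimics the practice described in the legend of omitting the common factor $2p_0(1 - p_0)$.

```python
def fst_distance(freqs_a, freqs_b, p0=None):
    """Mean over loci of (x1 - x2)^2, optionally divided by 2*p0*(1 - p0)."""
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need one frequency per locus")
    if p0 is None:
        scale = 1.0  # common factor omitted, as when V is set to frequency 1
    elif 0.0 < p0 < 1.0:
        scale = 2.0 * p0 * (1.0 - p0)
    else:
        raise ValueError("p0 must lie strictly between 0 and 1")
    total = sum((xa - xb) ** 2 for xa, xb in zip(freqs_a, freqs_b))
    return total / (scale * len(freqs_a))


# Hypothetical ancestral allele frequencies at two loci.
population = [0.6, 0.5]
virtual_origin = [1.0, 1.0]  # V: ancestral allele set to frequency 1
print(fst_distance(population, virtual_origin))
```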

Table 1. Values of $S_i$, measuring shortness of a branch relative to other branches in a tree, for i = eur, the European branch

                                           Simulation
Block 1   Block 2   $S_{eur}$ (obs)    $S_{eur}$ (no bias)   $S_{eur}$ (bias)
a1        a2        0.055 ± 0.003      -0.002 ± 0.003        0.013 ± 0.002
b1        b2        0.124 ± 0.003      -0.003 ± 0.003        0.032 ± 0.003
Overall             0.090 ± 0.002      -0.002 ± 0.002        0.022 ± 0.002

Clusters are those suggested by the tree in Fig. 1A; blocks include distances between clusters (labeled similarly in Fig. 2, above the diagonal). $S_i$ is associated with "treeness" (10) in that for the appropriate tree topology, it has an expectation of zero for all populations, if treeness is perfect in all branches (see also table 1 of ref. 11). SEM values for observed values were estimated from 100 bootstraps; those for simulation values were estimated from the 100 sets of 100 loci described in Fig. 6. In general, $S_i$ = mean distance within block 2 - mean distance within block 1 = $\sum_{j \neq i} \sum_k d_{jk}/[(N_m - 1)N_n] - \sum_k d_{ik}/N_n$, where i is the population in question; j represents the $N_m$ populations in cluster m (of which population i is a member); k represents the $N_n$ populations in another cluster, n; $d_{ij}$ is the genetic distance from population i to population j. The overall value is the average over all possible clusters containing population i in a given rooted tree.
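To make the definition of $S_i$ concrete, the following Python sketch evaluates it for one pair of clusters. The population labels and distances are hypothetical, chosen so that population `X` lies closer to the other cluster than its cluster mates do; the function names are likewise illustrative, and the sketch covers a single pair of clusters rather than the overall average.

```python
def distance(dist, a, b):
    """Look up a symmetric genetic distance stored under an unordered pair."""
    return dist[frozenset((a, b))]


def branch_shortness(dist, i, cluster_m, cluster_n):
    """S_i = mean distance in block 2 minus mean distance in block 1."""
    others = [j for j in cluster_m if j != i]  # the N_m - 1 cluster mates of i
    block2 = sum(distance(dist, j, k) for j in others for k in cluster_n)
    block1 = sum(distance(dist, i, k) for k in cluster_n)
    return block2 / (len(others) * len(cluster_n)) - block1 / len(cluster_n)


# Hypothetical distances between cluster m = {X, Y, Z} and cluster n = {P, Q}.
cluster_m, cluster_n = ["X", "Y", "Z"], ["P", "Q"]
dist = {frozenset((j, k)): 0.24 for j in ("Y", "Z") for k in cluster_n}
dist.update({frozenset(("X", k)): 0.14 for k in cluster_n})

s_x = branch_shortness(dist, "X", cluster_m, cluster_n)
print(round(s_x, 3))
```

With perfect treeness all between-cluster distances would be equal and the function would return zero, which matches the expectation stated in the table note.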

the present day. If the latter interval is 100,000 years (15-19), then the average time since initial mutation to the current polymorphic state is estimated to be 700,000 years.
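The arithmetic behind this estimate can be written out explicitly. The sketch below assumes only the two quantities given in the text, the sixfold ratio and the 100,000-year interval; the variable names are illustrative.

```python
# Interval from the earliest split in the human tree to the present day.
split_to_present_years = 100_000

# The time from V to the first split is 6 times that interval.
v_to_split_years = 6 * split_to_present_years

# Average time since initial mutation: from V all the way to the present.
mean_allele_age_years = v_to_split_years + split_to_present_years
print(mean_allele_age_years)
```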

Distributions of Ancestral Allele Frequencies. Average frequencies of the 62 ancestral alleles vary significantly across the eight populations (Table 2). Distributions of ancestral allele frequencies were obtained for each of the eight human populations; the best fit of a polynomial to each distribution is given in Fig. 5. The expected distribution of mutant allele frequencies under flux equilibrium for a single gene in independent populations or for independent genes in one population was given by Wright (21). This distribution is highly skewed: mutant alleles are generally found at lower frequencies. We are interested in the expected distribution of frequencies for the ancestral nonmutant allele given in Fig. 5 Inset.

The two Pygmy populations, along with Melanesians and New Guineans, tend to have high frequencies of ancestral alleles for more loci. These populations, particularly the African samples, resemble the expected distribution most closely. For both Melanesians and New Guineans, however, the distributions tend to be more u-shaped than j-shaped; for a few loci mutant alleles are found at higher frequencies (and ancestral alleles are at lower frequencies) than expected. The Australian sample, which is very small and is possibly genetically mixed (≈25%) with Europeans (6), has a slightly flatter distribution, as do the Japanese and Chinese samples.

The convex distribution observed for the European sample (Fig. 5) differs most sharply from the expected distribution (Fig. 5 Inset) and may be an artifact due to the ascertainment of polymorphisms almost exclusively in Europeans. While the definition of "polymorphic" varies, a locus with at least

obs.   exp.
AB     a + b + d
AC     a + c + d
BC     b + c
AV     a + e
BV     b + d + e
CV     c + d + e

FIG. 3. Diagram illustrating an approach to estimating branch lengths when ancestral alleles have been identified. A, B, and C are extant populations; R represents root of the tree; V is a virtual population, representing the time of origin of new alleles for all polymorphisms (Fig. 2). obs., Observed distances between populations; exp., expected distances between populations. Estimates can be made using least squares or maximum likelihood.
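The least squares approach named in the legend of Fig. 3 can be sketched as follows. This is a minimal illustration, not the KITSCH procedure used for Fig. 4: it fits the five branch lengths a to e by ordinary unweighted least squares without imposing a constant rate, the observed distances are hypothetical, and the function names are invented for the example. Each observed distance is modeled as the sum of branch lengths listed under exp. in Fig. 3.

```python
# Expected distance for each pair as a sum of branch lengths (a, b, c, d, e).
DESIGN = {
    "AB": (1, 1, 0, 1, 0),
    "AC": (1, 0, 1, 1, 0),
    "BC": (0, 1, 1, 0, 0),
    "AV": (1, 0, 0, 0, 1),
    "BV": (0, 1, 0, 1, 1),
    "CV": (0, 0, 1, 1, 1),
}


def solve_linear(matrix, rhs):
    """Solve a square system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(x) for x in row] + [float(v)] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares_branches(observed):
    """Fit a..e by solving the normal equations (X^T X) beta = X^T y."""
    pairs = sorted(DESIGN)
    rows = [DESIGN[p] for p in pairs]
    y = [observed[p] for p in pairs]
    k = 5
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return dict(zip("abcde", solve_linear(xtx, xty)))


# Hypothetical observed distances, generated from a=3, b=1, c=1, d=2, e=6.
observed = {"AB": 6, "AC": 6, "BC": 2, "AV": 9, "BV": 9, "CV": 9}
estimates = least_squares_branches(observed)
print({name: round(value, 6) for name, value in estimates.items()})
```

Because six distances constrain five branch lengths, the fit has one residual degree of freedom; with the exactly additive distances used here the generating values are recovered.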


FIG. 4. Human evolutionary tree and cartoon indicating a possible relationship between humans and chimpanzees. Details of the human tree are inferred from distances below the diagonal in Fig. 2; least squares estimates were obtained as suggested in Fig. 3 by using KITSCH of PHYLIP (14). The rate of change of allele frequencies was assumed constant. The inferred topology of the human tree is the same as that of Fig. 1A. Small arrows are the hypothetical distribution of the times of origin of new alleles along the human lineage for polymorphisms surviving to the present day in human populations. Surviving polymorphisms tend to be those that arose recently; earlier polymorphisms were more likely to be lost due to drift. The virtual population (V) represents the mean time of origin of nonancestral alleles for 62 loci. V also represents an outgroup for the human tree.

two alleles with frequencies >1% has often been considered polymorphic (22). In searching for RFLPs, however, there were attempts to choose polymorphisms with frequencies near 50% if biallelic, the optimal condition for use in linkage tests. In fact, many research workers searched for polymorphisms by testing very few (sometimes two to five) individuals generally of European origin. Thus, many RFLPs were probably originally chosen, more or less intentionally, because of high heterozygosity in Europeans.

As noted in Table 2, the highest estimated mean heterozygosity is for Europeans, followed by Chinese and Japanese. Although such factors as the age and demographic history of each population have contributed to variation among levels of heterozygosity, the ascertainment of polymorphisms in Europeans may also be an important factor.
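The link between allele frequencies near 50% and high heterozygosity can be seen from the standard expectation for a biallelic locus under random mating, 2p(1 - p). The Python sketch below uses that textbook expression; it is an assumption that it approximates the heterozygosity values reported here, and the frequencies shown are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus, random mating assumed."""
    return 2.0 * p * (1.0 - p)


# Hypothetical allele frequencies: the value peaks at p = 0.5.
for p in (0.5, 0.7, 0.9, 0.99):
    print(p, round(expected_heterozygosity(p), 4))
```

A marker chosen because its alleles are near 50% in one population therefore sits near the maximum of 0.5 in that population, while drift since divergence tends to move frequencies elsewhere away from that peak.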

Simulation. To explore the effect of such ascertainment bias on mean heterozygosities, ancestral allele frequency distributions, and genetic distances, we carried out simulations. For each set of parameters, we simulated as many biallelic loci as were necessary to obtain 100 loci polymorphic to a specified extent. We chose 100 loci to match, roughly, the size of current RFLP data sets. The evolution at each locus was modeled in two phases. The first phase of 250,000 generations corresponded to a long branch leading up to the first split among modern humans, as seen in Fig. 4. The second phase involved specifying a tree of populations, initializing the root population with the frequency reached in phase I, and allowing that frequency to change along each branch through drift and mutation for 5000 generations. A set of "present day" population frequencies was thus obtained.
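A single branch of such a simulation can be sketched with a Wright-Fisher style update. The sketch below is an assumption about the mechanics, not the authors' program: mutation is taken to be one-way (ancestral to nonancestral) and applied before binomial sampling of 2N_e gene copies, and the function names are invented. The population size and number of generations in the example are scaled far down from the values used in the paper so that the example runs quickly.

```python
import random


def next_generation(p, n_e, mu, rng):
    """One generation: mutation away from the ancestral state, then drift."""
    p_mutated = p * (1.0 - mu)  # assumed one-way mutation, ancestral -> nonancestral
    copies = 2 * n_e            # gene copies in a diploid population of size n_e
    ancestral = sum(rng.random() < p_mutated for _ in range(copies))
    return ancestral / copies


def simulate_branch(p_start, generations, n_e, mu, rng):
    """Follow the ancestral allele frequency along one branch of the tree."""
    p = p_start
    for _ in range(generations):
        p = next_generation(p, n_e, mu, rng)
    return p


# Scaled-down illustration (not the paper's parameters).
rng = random.Random(1)
p_root = simulate_branch(1.0, 200, 50, 1e-3, rng)   # phase I analogue
p_tip = simulate_branch(p_root, 20, 50, 1e-3, rng)  # one phase II branch
print(p_root, p_tip)
```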

Table 2. Ancestral allele frequency and observed heterozygosity for the 79 RFLPs in each population

Population   Ancestral allele frequency   Heterozygosity
ZAI          0.653 ± 0.037                0.290 ± 0.020
CAR          0.624 ± 0.037                0.304 ± 0.021
AUS          0.560 ± 0.041                0.299 ± 0.020
NGN          0.565 ± 0.044                0.260 ± 0.019
MEL          0.572 ± 0.045                0.267 ± 0.021
JPN          0.537 ± 0.041                0.317 ± 0.021
CHI          0.527 ± 0.038                0.341 ± 0.019
EUR          0.559 ± 0.034                0.379 ± 0.015

Data for 79 RFLPs (4, 5) are the mean ± SEM. Analyses of variance indicate that variation among populations is significant for both mean heterozygosity (P < 0.001) (6) and mean frequency of ancestral alleles (P < 0.01) (20) (see Fig. 2 for abbreviations).

FIG. 5. Best fit of a polynomial (of degree 2) to distributions of frequencies of ancestral alleles. Z/ZAI, Zaire Pygmy; P/CAR, Central African Republic Pygmy; M/MEL, Melanesian; N/NGN, New Guinean; A/AUS, Australian; J/JPN, Japanese; C/CHI, Chinese; E/EUR, European. Frequencies were first sorted into 15 bins between 0.0 and 1.0 (see ref. 20 for individual distributions of 60 ancestral alleles in these populations). (Inset) Expected distribution of ancestral allele frequencies, derived from Wright (21), for $N_e\mu = 0.01$. The frequency of mutant alleles with frequency x, f(x), is given by $4N_e\mu/x$. The frequency of ancestral alleles with frequency y = 1 - x, f(y), therefore, is $4N_e\mu/(1 - y)$, where $N_e$ is the effective population size and $\mu$ is the mutation rate per locus per generation. General shape of this distribution is independent of $N_e\mu$.

By assuming that the phase II tree topology would not influence results dramatically, we simulated according to the tree of Fig. 1A. This topology presumably incorporates some of the complexities of the underlying population history and is most relevant given that we are interested in these particular populations. General conclusions, however, are not dependent on this tree reflecting the true evolutionary history of these populations.

To examine the consequences of a bias in the ascertainment of polymorphisms, we chose loci in two ways: (i) no bias: a locus was considered polymorphic if both alleles were found at frequencies between specified minimum and maximum frequencies in any of the simulated populations; and (ii) bias: a locus was considered polymorphic only if both alleles were found at frequencies between the specified minimum and maximum frequencies in one particular population, the simulated European population.

Simulation Results. Most simulated loci did not meet the requirements for either of the two cases just described: loci were generally not polymorphic in any population at the end of the two-phase process. Thousands of nonpolymorphic loci were therefore simulated for each set of 100 polymorphic loci retained. For each set of 100 loci, we calculated mean heterozygosities and mean ancestral allele frequencies, determined histograms of ancestral allele frequencies, estimated genetic distances among populations, and inferred a tree.

The effect of the European bias in polymorphism ascertainment on mean ancestral allele frequencies and on mean heterozygosities is demonstrated in Fig. 6 A and B, respectively. For both measures, the no bias data sets show essentially no variation between populations. In the bias data sets, the simulated European population had the lowest mean ancestral allele frequency, followed by the simulated Chinese and Japanese populations (Fig. 6A). As for the observed data, the simulated European population had the highest heterozygosity followed by Chinese and Japanese (Fig. 6B). The patterns of simulated values mimic the tree used in the simulation in the sense that populations more closely related to the source of the bias (Europeans) are more affected by the bias. Fig. 6 suggests that the polymorphism ascertainment


FIG. 6. Comparison of observed and simulated values in eight populations. (A) Mean ancestral allele frequencies. (B) Mean population heterozygosities. SEM values for observed values were estimated from 100 bootstraps; SEM values for simulation values were estimated from 100 simulated sets of 100 loci. Simulation: two-phase process simulated 10,000 times for a mutation rate of $10^{-7}$; effective population size, 10,000; minimum frequency of 10% and maximum frequency of 90%. One hundred sets of frequencies for 100 loci were thereby generated. Phase I corresponded to a branch leading to the first split among modern humans (see Fig. 4): the ancestral allele began at a frequency of 100% and then passed through 250,000 generations of random genetic drift and mutation from ancestral state. Phase II involved specifying tree of populations, initializing the root population with the frequency reached in phase I, and allowing the frequency to change along each branch through drift and mutation over 5000 generations. Topology was that of Fig. 1A. {[(Australian, 723; New Guinean, 723), 1806; Melanesian, 2529], 1261; [European, 2044; (Chinese, 826; Japanese, 826), 1218], 1746}, 1210; (CAR Pygmy, 867; Zaire Pygmy, 867), 4133. Values are branch lengths in generations (14). Bifurcations were represented by an instantaneous doubling and splitting of a population. All mutants at a locus were grouped together as a "nonancestral" class for the purpose of calculating a frequency. Mutation rates between $10^{-5}$ and $10^{-8}$ were considered and led to qualitatively similar results. Loci with mutation rates as high as $10^{-5}$, however, were less influenced by the bias in polymorphism ascertainment.
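The branch lengths listed in the legend of Fig. 6 can be checked for consistency with the stated 5000 generations of phase II. The Python sketch below encodes the bracketed expression as nested groups, reading the opening group as three nested brackets around the Australian and New Guinean pair (an assumption about the garbled source), and sums the branch lengths from the root to each tip. The data structure and function name are illustrative.

```python
# Each entry is (child, branch_length); a child is a name or a nested group.
TREE = [
    ([  # non-African populations
        ([
            ([("Australian", 723), ("New Guinean", 723)], 1806),
            ("Melanesian", 2529),
        ], 1261),
        ([
            ("European", 2044),
            ([("Chinese", 826), ("Japanese", 826)], 1218),
        ], 1746),
    ], 1210),
    ([("CAR Pygmy", 867), ("Zaire Pygmy", 867)], 4133),
]


def tip_depths(children, depth=0):
    """Return the root-to-tip path length, in generations, for every tip."""
    depths = {}
    for child, length in children:
        if isinstance(child, str):
            depths[child] = depth + length
        else:
            depths.update(tip_depths(child, depth + length))
    return depths


depths = tip_depths(TREE)
for name, generations in sorted(depths.items()):
    print(name, generations)
```

Under this reading every population lies exactly 5000 generations from the root, as expected for a tree inferred with a constant rate.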

bias is at least partially responsible for the observed variation in ancestral allele frequencies and heterozygosities. Observed heterozygosities do not follow precisely the pattern of the simulated values in that the African samples have relatively high observed heterozygosities.
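The two ways of choosing loci described under Simulation, no bias and bias, amount to a filter on simulated frequencies. The following Python sketch is illustrative only: the function names are invented, the default thresholds of 10% and 90% are those given in the legend of Fig. 6, and the example locus is hypothetical.

```python
def both_alleles_in_range(p, lo, hi):
    """True if both alleles of a biallelic locus have frequencies in [lo, hi]."""
    return lo <= p <= hi and lo <= 1.0 - p <= hi


def is_polymorphic(freqs, lo=0.10, hi=0.90, ascertained_in=None):
    """No bias: any population qualifies. Bias: only the named population counts."""
    if ascertained_in is None:
        return any(both_alleles_in_range(p, lo, hi) for p in freqs.values())
    return both_alleles_in_range(freqs[ascertained_in], lo, hi)


# Hypothetical ancestral allele frequencies at one simulated locus.
locus = {"EUR": 0.97, "CHI": 0.95, "ZAI": 0.60}
print(is_polymorphic(locus))                        # no bias
print(is_polymorphic(locus, ascertained_in="EUR"))  # European bias
```

A locus like this one, variable in an African population but nearly fixed in Europeans, is retained without bias and discarded under the European criterion, which is the mechanism by which the bias lowers apparent African heterozygosity.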

Genetic distances and, hence, inferred trees are also affected by the bias in polymorphism choice. When no bias is imposed, genetic distances between African populations and all non-African populations are essentially identical, as expected (data not shown). When a European bias is imposed, however, the distances between Europeans and Africans, for example, are significantly shorter than those between Africans and other non-African populations. $S_{eur}$, indicating shortness of the European branch relative to other branches, is given for the simulated data sets in Table 1. For the no bias data set, the mean $S_{eur}$ value does not differ significantly from 0, as expected. For the bias data set, however, the mean is significantly different from 0.

We also examined the distribution of times of origin of the nonancestral alleles, comparing simulated times with times estimated using the tree method (Fig. 3). The tree method tends to overestimate the time by ≈25% in the bias case and 50% in the no bias case.

Origin: 100,000 or 1 Million Years? To address the question of the time of the origin of modern humans, simulations were also carried out assuming that the first split among modern humans took place 50,000 generations (~1 million years) ago. At the end of these simulations, populations were very often fixed for one allele: the mean heterozygosity for the bias case was 0.066 ± 0.043, significantly lower than the observed values (Table 2) and those simulated values discussed earlier (see Fig. 6B). The simulated (bias) mean ancestral allele frequency, 0.88 ± 0.037, was much higher than the observed (Table 2). Simulations placing the initial split at 5000 generations ago lead to greater correspondence with observed values.
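The correspondence between generations and years used in this comparison can be made explicit. The sketch below derives a generation time from the equivalence stated in the text (50,000 generations for about 1 million years) and applies it to the other simulated durations; treating that ratio as exact is an assumption made for illustration.

```python
# Equivalence stated in the text: 50,000 generations ~ 1 million years.
YEARS_PER_GENERATION = 1_000_000 / 50_000


def generations_to_years(generations):
    """Convert a number of generations to years at the implied generation time."""
    return generations * YEARS_PER_GENERATION


print(YEARS_PER_GENERATION)           # implied generation time in years
print(generations_to_years(5_000))    # phase II, first split to present
print(generations_to_years(250_000))  # phase I, long branch before the split
```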

Discussion. The skewness of the observed ancestral allele frequency distributions for the eight human populations supports the classification of alleles shared with chimpanzees as ancestral; a random sample of alleles from biallelic loci is not expected to be skewed, and a sample from multiallelic loci is expected to be skewed in the other direction. African and Oceanic populations have the highest average ancestral allele frequency; in these populations the distribution of allele frequencies is markedly asymmetrical, whereas east Asians have a significantly more horizontal distribution and Europeans have a convex distribution. Much of the variation is likely to be due to the ascertainment of RFLPs almost exclusively in Europeans. Natural selection may have also played a role. As Wright discusses (21), several types of selection theoretically lead to more u-shaped distributions similar to those of the Melanesians and New Guineans. If modern humans originated in Africa, those populations migrating furthest may have encountered environments favorable to rare or emerging nonancestral alleles at a few loci.

The polymorphism ascertainment bias is likely to be only a partial explanation for the short branch of the Europeans in trees allowing variable evolutionary rates. The observed S_Eur of 0.0896, indicating an abnormally low apparent evolutionary rate for Europeans, is about four times higher than the S_Eur values calculated for the simulated data (Table 1). This suggests that the bias accounts for only a fraction (≈25%) of the shortening of the branch length for Europeans. Admixture of Europeans is also likely to be partially responsible for this phenomenon (11).

Neighbor-joining trees recently inferred from classical polymorphisms suggested, in contrast with average-linkage trees, that Europeans arose as a second major split, separate from all other non-Africans (13). Simulations, however, provide strong support for the hypothesis that Europeans and northeast Asians are directly and closely related, as in Fig. 1A, and suggest that mean heterozygosity and mean ancestral allele frequency values for the Chinese and Japanese samples were influenced by their relatedness to the source of the bias, the European sample.

Nuclear DNA polymorphisms have been used to demonstrate that the greatest genetic difference is between Africans and non-Africans (11, 13, 23-26). These data are insufficient, however, for rooting a tree. The cladistic approach described here has strengthened the conclusion that the root of the evolutionary tree for anatomically modern humans (Homo sapiens sapiens) falls along the branch connecting Africans and non-Africans.

We also used a cladistic approach to infer the mean time of origin of the nonancestral alleles. Simulations indicate that this estimate is inflated but roughly approximates the mean time of origin. The distribution of these times is known theoretically as that of the "age of existing mutants" (27-30). For an allele with a frequency of 0.5, the average age is expected to be 2.8N_e, where N_e represents effective population size (≈28,000 generations if N_e is on the order of 10,000) (27, 30). Frequencies for the eight populations suggest that the nonancestral alleles for these 62 loci arose ~700,000 years or 35,000 generations (≈3.5N_e, for N_e = 10,000) ago, by assuming that the initial separation among modern humans took place roughly 100,000 years ago. This is remarkably similar to the theoretical expectation mentioned earlier (2.8N_e) and is, therefore, consistent with the hypothesis that the split between Africans and non-Africans was on the order of 100,000 years ago, rather than 1 million, as suggested by supporters of the multiregional hypothesis of human evolution (31).
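The arithmetic in this paragraph can be checked with a short calculation. The sketch below assumes the standard diffusion result for the expected age of a neutral allele currently at frequency x, namely -4N [x/(1 - x)] ln x generations for effective size N, which gives about 2.8N at x = 0.5. That this is the exact expression behind the cited figure is an assumption, as are the function name and the generation time of 20 years (the value implied by equating 50,000 generations with roughly 1 million years).

```python
import math


def expected_age_generations(freq, n_e):
    """Expected age, in generations, of a neutral allele now at frequency `freq`.

    Uses the diffusion approximation -4 * n_e * freq / (1 - freq) * ln(freq);
    treating this as the formula behind the 2.8 figure is an assumption.
    """
    if not 0.0 < freq < 1.0:
        raise ValueError("freq must lie strictly between 0 and 1")
    return -4.0 * n_e * freq / (1.0 - freq) * math.log(freq)


N_E = 10_000               # effective population size quoted in the text
YEARS_PER_GENERATION = 20  # assumption implied by 50,000 generations ~ 1 million years

theoretical = expected_age_generations(0.5, N_E)
estimated = 35_000         # generations, the estimate quoted in the text

print(f"theoretical age: {theoretical:,.0f} generations ({theoretical / N_E:.2f} N_e)")
print(f"estimated age:   {estimated:,} generations ({estimated / N_E:.2f} N_e), "
      f"about {estimated * YEARS_PER_GENERATION:,} years")
```

With these inputs the theoretical expectation is 4 ln 2 ≈ 2.77 in units of the effective population size, and the estimate of 35,000 generations corresponds to 700,000 years.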

Simulations also suggest that the first split among modern humans is likely to have taken place much more recently than 1 million years ago. A greater correspondence was seen between observed and simulated estimates of mean heterozygosity and mean ancestral allele frequency when assuming an origin of modern humans 100,000 years ago than when assuming a much earlier origin. Simulations did not, however, incorporate gene flow.

Strong support for the hypothesis that the origin of modern humans took place recently and in Africa originated with the analysis of mitochondrial DNA (32). The conclusion that the phylogenetic tree of mitochondrial D-loop sequences has its root among Africans has, however, been questioned (33-35). The impossibility of testing all potential trees and the inevitably minimal differences between the great number of trees with essentially the same goodness of fit greatly limit the power of this analysis. Additional evidence, however, is derived from the greater genetic diversity observed among Africans. Higher diversity is expected if the common ancestors of extant humans originated in Africa and all non-African populations arose subsequently from an African subpopulation. African populations are more genetically diverse than populations of other continents according to both mitochondrial DNA and simple sequence repeats (microsatellites) (36). Both have higher mutation rates than RFLPs and are therefore less subject to the ascertainment bias, in agreement with simulation results.

In contrast, RFLPs and classical polymorphisms suggest that Africans have lower levels of heterozygosity than Europeans (37). Our simulations indicate that this is an artifact of their ascertainment almost exclusively in Europeans. While observed heterozygosity values for African populations appear low, they are high relative to expectations under the bias simulation conditions (Fig. 6B). Ideally, to avoid such a bias one would select polymorphisms by testing loci on a random sample of the world's population, representing to the greatest extent possible the genetic diversity among extant humans.
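How ascertainment in one population can by itself depress apparent heterozygosity in another can be seen in a toy model. The sketch below is a deliberately simplified stand-in and not the simulation design of this study: two populations, labeled A and B, drift independently from a common starting frequency of 0.5, and a locus is retained only if a small sample of gene copies from A is polymorphic. The population labels, all parameter values, and the function names are hypothetical choices made for illustration.

```python
import random


def drift(p, copies, generations, rng):
    """Wright-Fisher drift: binomially resample `copies` gene copies each generation."""
    for _ in range(generations):
        if p == 0.0 or p == 1.0:
            break  # allele lost or fixed; no further change without mutation
        p = sum(rng.random() < p for _ in range(copies)) / copies
    return p


def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def simulate(loci=300, copies=50, generations=50, sample=10, seed=1):
    """Return mean heterozygosity per population for (all loci, ascertained loci)."""
    rng = random.Random(seed)
    all_h = {"A": [], "B": []}
    asc_h = {"A": [], "B": []}
    for _ in range(loci):
        p_a = drift(0.5, copies, generations, rng)
        p_b = drift(0.5, copies, generations, rng)
        all_h["A"].append(heterozygosity(p_a))
        all_h["B"].append(heterozygosity(p_b))
        # Ascertainment: keep the locus only if a small sample from A is polymorphic.
        k = sum(rng.random() < p_a for _ in range(sample))
        if 0 < k < sample:
            asc_h["A"].append(heterozygosity(p_a))
            asc_h["B"].append(heterozygosity(p_b))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return ({pop: mean(v) for pop, v in all_h.items()},
            {pop: mean(v) for pop, v in asc_h.items()})


all_means, ascertained_means = simulate()
print("all loci:        ", all_means)
print("ascertained in A:", ascertained_means)
```

In this toy model the two populations are expected to have similar mean heterozygosity when all loci are used, whereas restricting attention to loci ascertained in A raises the mean for A relative to B, even though both populations experienced identical drift.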

We thank M. Nei for comments on an earlier version of this manuscript, M. Feldman for his comments, and E. Minch for assistance with simulation code. We thank the Yerkes Primate Research Center, Emory University, and the Southwest Foundation for Biomedical Research, San Antonio, for providing samples from nonhuman primates. This research was supported in part by grants from the National Institutes of Health to J.L.M. (EY07106) and to L.L.C.-S. (GM20467).

1. Hennig, W. (1966) Phylogenetic Systematics (Univ. of Illinois Press, Urbana).

2. Stoneking, M., Jorde, L. B., Bhatia, K. & Wilson, A. C. (1990) Genetics 124, 717-733.

3. Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K. & Wilson, A. C. (1991) Science 253, 1503-1507.

4. Bowcock, A. M., Bucci, C., Hebert, J. M., Kidd, J. R., Kidd, K. K., Friedlaender, J. S. & Cavalli-Sforza, L. L. (1987) Gene Geogr. 1, 47-64.

5. Bowcock, A. M., Hebert, J. M., Mountain, J. L., Kidd, J. R., Rogers, J., Kidd, K. K. & Cavalli-Sforza, L. L. (1991) Gene Geogr. 5, 151-173.

6. Lin, A. A., Hebert, J. M., Mountain, J. L. & Cavalli-Sforza, L. L. (1994) Gene Geogr., in press.

7. Sokal, R. R. & Michener, C. D. (1958) Univ. Kansas Sci. Bull. 38, 1409-1437.

8. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406-425.

9. Reynolds, J., Weir, B. S. & Cockerham, C. C. (1983) Genetics 105, 767-779.

10. Cavalli-Sforza, L. L. & Piazza, A. (1975) Theor. Popul. Biol. 8, 127-165.

11. Bowcock, A. M., Kidd, J. R., Mountain, J. L., Hebert, J. M., Carotenuto, L., Kidd, K. K. & Cavalli-Sforza, L. L. (1991) Proc. Natl. Acad. Sci. USA 88, 839-843.

12. Edwards, A. W. F. & Cavalli-Sforza, L. L. (1964) Syst. Assoc. Publ. 6, 67-76.

13. Nei, M. & Roychoudhury, A. K. (1993) Mol. Biol. Evol. 10, 927-943.

14. Felsenstein, J. (1989) Cladistics 5, 164-166.

15. Valladas, H., Reyss, J. L., Joron, J. L., Valladas, G., Bar-Yosef, O. & Vandermeersch, B. (1988) Nature (London) 331, 614-616.

16. Klein, R. G. (1989) in The Human Revolution: Behavioural and Biological Perspectives on the Origins of Modern Humans, eds. Mellars, P. & Stringer, C. (Edinburgh Univ. Press, Edinburgh), pp. 529-546.

17. Brauer, G. (1989) in The Human Revolution: Behavioural and Biological Perspectives on the Origins of Modern Humans, eds. Mellars, P. & Stringer, C. (Edinburgh Univ. Press, Edinburgh), pp. 123-154.

18. Clark, J. D. (1989) in The Human Revolution: Behavioural and Biological Perspectives on the Origins of Modern Humans, eds. Mellars, P. & Stringer, C. (Edinburgh Univ. Press, Edinburgh), pp. 565-588.

19. Stringer, C. B. (1990) Sci. Am., 98-104.

20. Mountain, J. L., Lin, A. A., Bowcock, A. M. & Cavalli-Sforza, L. L. (1992) Philos. Trans. R. Soc. London B 337, 159-165.

21. Wright, S. (1968) Evolution and the Genetics of Populations (Univ. of Chicago Press, Chicago).

22. Cavalli-Sforza, L. L. & Bodmer, W. F. (1971) The Genetics of Human Populations (Freeman, San Francisco).

23. Wainscoat, J. S., Hill, A. V. S., Boyce, A. L., Flint, J., Hernandez, M., Thein, S. L., Old, J. M., Lynch, J. R., Falusi, A. G., Weatherall, D. J. & Clegg, J. B. (1986) Nature (London) 319, 491-493.

24. Cavalli-Sforza, L. L., Piazza, A., Menozzi, P. & Mountain, J. (1988) Proc. Natl. Acad. Sci. USA 85, 6002-6006.

25. Nei, M. & Livshits, G. (1989) Hum. Hered. 39, 276-281.

26. Nei, M. & Roychoudhury, S. (1982) Evol. Biol. 14, 1-59.

27. Maruyama, T. (1974) Genet. Res. 23, 137-143.

28. Watterson, G. A. (1976) Theor. Popul. Biol. 10, 239-253.

29. Watterson, G. A. (1977) Theor. Popul. Biol. 12, 179-196.

30. Kimura, M. (1983) The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, Cambridge, U.K.).

31. Wolpoff, M. H., Zhi, W. X. & Thorne, A. G. (1984) in Origins of Modern Humans, eds. Smith, F. H. & Spencer, F. (Liss, New York), pp. 411-483.

32. Stringer, C. B. & Andrews, P. (1988) Science 239, 1263-1268.

33. Templeton, A. R. (1992) Science 255, 737.

34. Hedges, S. B., Kumar, S., Tamura, K. & Stoneking, M. (1992) Science 255, 737-739.

35. Stoneking, M. (1993) Evol. Anthropol. 2, 60-73.

36. Bowcock, A. M., Ruiz-Linares, A., Tomfohrde, J., Minch, E., Kidd, J. R. & Cavalli-Sforza, L. L. (1994) Nature (London) 368, 455-457.

37. Cavalli-Sforza, L. L., Piazza, A. & Menozzi, P. (1994) History and Geography of Human Genes (Princeton Univ. Press, Princeton), in press.

