
Supplementary Information Table of Contents

SUPPLEMENTARY DISCUSSIONS

Evidence for the loss of Pp1-Y1 in D. mojavensis and of Ppr-Y in D. grimshawi.
Comparison between the gene movements in the Y with the other chromosomes.
Possible explanations for the gene gains in the Y chromosome.

SUPPLEMENTARY METHODS

1. Analytical treatment of the ascertainment bias in the ratio gene gain / gene loss.
1.1. Basic data and assumptions.
1.2. Bias in the loss rate caused by the outgroup-specific genes.
1.3. Bias caused by unknown D. melanogaster Y-linked genes.
1.4. General analytical model for bias correction.
1.5. Statistical tests for the difference between the rates of gene gain and loss.
2. Computer simulations and approximate Bayesian estimates of gene gain and gene loss.

SUPPLEMENTARY FIGURES AND LEGENDS

Supplementary Figure 1. Assembly problems of Y-linked genes.
Supplementary Figure 2. Gene orthology confirmation by phylogeny.
Supplementary Figure 3. PCR test of Y-linkage of the ARY gene.
Supplementary Figure 4. Synteny analysis of the kl-5 gene (all species).
Supplementary Figure 5. Synteny analysis of the WDY gene.
Supplementary Figure 6. Synteny analysis of the Pp1-Y1 and Pp1-Y2 genes.
Supplementary Figure 7. Synteny analysis of the ARY gene.
Supplementary Figure 8. Synteny analysis of the CCY gene.
Supplementary Figure 9. Estimating gene gain and loss in the Y chromosome.
Supplementary Figure 10. Experimental confirmation of the loss of the Ppr-Y gene in D. grimshawi.
Supplementary Figure 11. Results of 1,000 computer simulations of gene gain and loss.

SUPPLEMENTARY TABLES

Supplementary Table 1. Accession numbers of the genes used in this study.
Supplementary Table 2. Ka/Ks ratios for the Y-linked genes.
Supplementary Table 3. Original chromosomal location of the 7 gained genes.
Supplementary Table 4. Quantities used to estimate the unbiased ratio of gene gain to gene loss.
Supplementary Table 5. FlyBase gene names.

SUPPLEMENTARY NOTES

SUPPLEMENTARY INFORMATION

doi: 10.1038/nature07463

www.nature.com/nature


Supplementary Discussions

Evidence for the loss of Pp1-Y1 in D. mojavensis and of Ppr-Y in D. grimshawi.

We could not find these genes in the assembled genomes. BLAST searches detected

similar sequences that, after phylogenetic analysis, proved to be paralogous genes in

both cases (not shown). We also searched for these genes in the raw traces, but not a

single trace was found (i.e., all traces we found belong to paralogs). Thus, either they

were lost in the corresponding lineages, or the entire sequence of both genes fell in

sequence gaps. In the case of the D. mojavensis Pp1-Y1 a sequence gap is very unlikely

because the gene is located in a conserved autosomal position in all species (except in

the melanogaster group, where it is Y-linked; Supplementary Fig. 6), and there is no gap

in this region in the assembled D. mojavensis genome. Regarding Ppr-Y in D.

grimshawi, we experimentally confirmed its loss with degenerate PCR (Supplementary

Fig. 10). Interestingly, data from D. melanogaster show that Ppr-Y is not an essential

gene (ref. 20), so its loss can probably be tolerated.

Comparison between the gene movements in the Y with the other chromosomes.

The Y chromosome seems to be a very inhospitable environment for genes due to

its heterochromatic state, sex-limited expression and inheritance, lack of recombination,

and smaller effective population size, and hence one might expect increased gene losses

and reduced gene gains when compared to other chromosomes. Three recent Drosophila

studies provide particularly interesting comparisons with our data.

Bachtrog and coworkers (ref. 10) studied part of the neo-Y chromosome of D. miranda

and found that 55 out of 118 genes became pseudogenes in ~1 Myr, implying a nominal

rate of gene loss of 0.47 genes / gene / Myr. Such massive gene losses were also

observed in the mammalian Y (refs 16, 31) and seem to be characteristic of the "standard

pathway" for the origin of Y chromosomes, in which an ordinary chromosome becomes

male-restricted and loses recombination (refs 8, 9). The rate of gene loss we measured in the

Drosophila Y (0.001026 genes / gene / Myr) is more than 450-fold smaller. This huge

difference certainly reflects the different evolutionary histories of the genes: the

Drosophila Y-linked genes were first acquired from autosomes, and hence were already

"adapted" (and perhaps suited) to the harsh environment of the Y-chromosome, whereas

the neo-Y genes of D. miranda were a more or less random sample of genes, suddenly

caught in this environment by a Y-autosome fusion.
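
The comparison above is simple arithmetic and can be checked directly. The following Python sketch recomputes the nominal neo-Y loss rate from the counts quoted above (55 of 118 genes in about 1 Myr) and divides it by the per-gene loss rate reported here for the Drosophila Y. The variable names are ours and purely illustrative.

```python
# Nominal per-gene loss rate on the D. miranda neo-Y (values quoted in the text).
neo_y_lost = 55
neo_y_total = 118
neo_y_time_myr = 1.0  # approximate time span (~1 Myr), as stated in the text

neo_y_loss_rate = neo_y_lost / neo_y_total / neo_y_time_myr  # genes / gene / Myr

# Per-gene loss rate measured for the Drosophila Y in this study.
y_loss_rate = 0.001026  # genes / gene / Myr

fold_difference = neo_y_loss_rate / y_loss_rate
print(f"neo-Y loss rate: {neo_y_loss_rate:.2f} genes / gene / Myr")
print(f"fold difference: {fold_difference:.0f}")
```

The two rates differ by several hundred-fold, which is the contrast discussed in the paragraph above.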

The rate of gene gain between D. melanogaster and D. yakuba was measured by

Zhou and collaborators (ref. 32) and averaged ~8 genes / genome / Myr, which translates to

~1.6 genes / large arm / Myr. Bhutkar and collaborators (ref. 6) found a high-confidence set of

514 genes that moved to different locations in the 12 sequenced species, either leaving a

copy at the original location or not ("duplicative transpositions" and "conservative

transpositions", respectively; ref. 33). The actual number of relocated genes could be

higher due to the fact that only phylogenetically consistent cases were considered.

Though it is difficult to measure gene loss from this data (which is further complicated

by the possibility that a missing gene is actually an assembly gap), these 514

"positionally relocated genes" certainly imply 514 gene gains, which happened in a

divergence time of ~375 Myr (refs 6, 24). The average gain rate of ~1.37 genes / large arm /

Myr is very close to the estimate of Zhou and collaborators (ref. 32), and both are one order of

magnitude higher than our estimate for the Y of 0.12 genes / Myr (P < 10^-5; two-tailed

exact test for the ratio of two Poisson means; refs 34, 35).

Although the Drosophila Y is a large chromosome in most species (41 Mbp in D.

melanogaster, whereas the large chromosome arms have ~ 25-40 Mbp), its smaller rate

of gene gain is not surprising, given the following factors: (i) The Y is entirely

heterochromatic, and euchromatic genes are frequently "silenced" when inserted in

heterochromatic regions. Thus, all else being equal, the expected numbers of gene gains

in similarly sized blocks of heterochromatin and euchromatin are not equal (the former

being smaller); (ii) 80% of the Y chromosome is composed of satellite DNA that cannot

harbor functional genes (ref. 36), so the "effective size" of the Y is much smaller than its

physical size; (iii) the Y chromosome has a smaller population size (less than 1/4 that of

autosomes), which results in a smaller absolute number of mutations (gene insertions in

this case) available for fixation; (iv) female-related genes, as well as genes required in

both sexes, cannot move to the Y.

Regarding the mechanism, retrotranspositions accounted for 10% to 24% of all

gene gains in the euchromatin (refs 6, 25, 32). We found seven gene gains in the Y, but given that

two genes are intronless in their original locations (Pp1-Y1 and Pp1-Y2), in only five

events could retrotranspositions have been detected. In all five of these gains intron

positions were fully conserved, ruling out retrotransposition. However, the sample is too

small to allow any conclusion regarding the prevalence of retrotranspositions in the Y

chromosome.

Possible explanations for the gene gains in the Y chromosome.

There are two interconnected aspects of this problem. First, why does the Y

chromosome gain genes in the first place, given its "restrictive" characteristics

(heterochromatic state, lack of recombination, sex-limited expression and inheritance,

etc.)? Second, why are most or all these genes male-related?

Regarding the first question, the empirical evidence shows that Y chromosomes do

acquire genes, in a diverse range of organisms such as Drosophila, mammals (ref. 16) and

plants (ref. 37). However, the rate of gene gain in the Drosophila Y is one order of magnitude

lower than that of the other chromosomes of comparable size (Supplementary Discussion). So it

seems that the restrictive conditions mentioned above are not strong enough to totally

suppress gene gains, but they do reduce their rate.

Regarding the second question, it is widely assumed that the concentration of male

genes on the Y results from natural selection, a view that traces back to R. A. Fisher (ref. 38).

The rationale is that male-female antagonistic effects of genes may hamper the evolution

of male-related traits, unless they are located in a male-specific region of the genome.

Hence there would be positive selection for Y-linkage of male-related genes. In recent

years M. Lynch and co-workers (ref. 39) suggested that chance events (random drift) play a

large role in the fate of duplicated genes; i.e., natural selection may not be the main

evolutionary force driving genome organization. Extension of these ideas to the Y

chromosome leads to a quite different explanation of the concentration of male-genes on

the Y. Suppose that after a gene duplication to the Y chromosome there is, say, an 80%

chance of degeneration of the Y-linked copy, and a 20% chance of degeneration of the

autosomal copy. Given this, a male-specific gene would be "transferred" to the Y in 20%

of the duplications (because the autosomal copy would be lost). However, in the case of

a Y-duplication of a female-specific or house-keeping gene, there will be selection for

maintaining the original (autosomal) copy, because females need the gene. The most

probable result is the loss of the Y copy (or, in some cases, its specialization to a new

function). The net effect is the accumulation of male-related genes in the Y (as we

indeed observe), but not resulting from positive selection for Y-linkage. Our data do

not provide any clue as to which force (natural selection or genetic drift) plays the major

role. However, as we briefly mentioned in the main text, the Y-linked Suppressor of

Stellate [ Su(Ste) ] locus may provide an example of positive selection. This multi-copy

gene was acquired by the Y in the D. melanogaster lineage, after the split from D.

simulans (ref. 40). The sole known function of Su(Ste) is to repress (via RNAi; ref. 41) the X-linked

gene Stellate. It has been suggested that Stellate distorts the X-Y segregation in favor of

the X (i.e., it is a meiotic driver gene), and that Su(Ste) evolved as a response, in a sort of

evolutionary arms race (ref. 30; but see ref. 42). Meiotic drive creates a strong "evolutionary

prize" for suppressors, particularly for those located on the targeted chromosome (ref. 43). If

Su(Ste) is indeed a suppressor of X-Y meiotic drive, then its Y-linkage almost certainly

resulted from selection.

Supplementary Methods

1. Analytical treatment of the ascertainment bias in the ratio gene gain / gene loss.

In order to investigate the trend in the number of Y-linked genes (is it increasing?

decreasing? steady-state equilibrium between gain and loss ?), we need an unbiased

estimate of either the difference or the ratio of two quantities, gain rate (7 gains in 62.9

Myr ; raw value: 0.1113 genes / Myr ) and loss rate (2 losses in 275.2 Myr; raw value:

0.00727 genes / Myr; raw gain/loss ratio: 15.3) . As explained below, this raw estimate

of the gain/loss ratio is biased because of the way the gain and loss events were

ascertained. The main bias is in the loss rate, and a minor one affects the gain rate. Here

we show how to correct them. After defining the nomenclature we used (below), in

section 1.1. we specify three quantities and one assumption used to derive the unbiased

ratio of gain to loss. In section 1.2. we present the correction of the bias, and in section

1.3. we show that unknown D. melanogaster Y-linked genes do not cause a bias in the

ratio of gain to loss. Finally, in section 1.4. we derive a general model for bias

correction, needed for the formal statistical test of equality of gain and loss rates. This

bias correction will no longer be necessary when the knowledge about the Y

chromosome of the other Drosophila species becomes equivalent to that for D. melanogaster.

We are carrying out such direct searches for Y-linked genes in the other Drosophila

species, but we should note that heterochromatic regions (including the Y) are

notoriously refractory to genomic studies (refs 11, 12), so the task requires considerable effort and

time.

A simplified scheme of our "melanogaster-centric" data is shown below, where

MS is the number of melanogaster-specific Y-linked genes, MA (melanogaster-ancestral)

is the number of genes acquired before the split between the melanogaster lineage and

the outgroup, and MT (melanogaster total) is equal to MS plus MA. Since we do not

know the full gene set of D. melanogaster, we use the subscripts K ("known"), U

("unknown") and R ("real"). Of course, MS_K + MS_U = MS_R. Finally, OS is "outgroup

specific", i.e., the number of genes that are Y-linked in the outgroup, and not Y-linked in

D. melanogaster. The outgroup can be any non-melanogaster species (e.g., D. willistoni

or D. virilis) and the "outgroup-specific" gene may either have been acquired in the

outgroup lineage, or it may have been lost in the D. melanogaster lineage. In our data

(Fig. 2), when we use D. virilis as the outgroup, MS_K = 7, MA_K = 5, MT_K = 12; with D.

ananassae as the outgroup, MS_K = 1, MA_K = 11, MT_K = 12.

1.1. Basic data and assumptions. We begin with the initial assumption that the

gain rate measured in the D. melanogaster lineage (the "red branches" in Supplementary

Fig. 9) and the loss rate measured in the other lineages (the "blue branches") are

homogeneous across the entire phylogeny. The unbiased ratio of gene gain / gene loss

comes from the three estimates detailed below.

1.1.1. The loss rate per gene (expressed in "genes lost per gene per Myr") is

unbiased due to the very nature of the data, where many instances allow high confidence

that the direct ancestral species possessed the Y-linked gene that is now missing in the

target species. For example, we observed one gene loss (Ppr-Y) in the D. grimshawi

branch (42.9 Myr), among five Y-linked genes (Fig. 2). Hence, the loss rate per gene in

this branch is (1 / 5) / 42.9 = 0.0047 genes lost / gene / Myr . The average value for all

branches is 0.001026 genes lost / gene / Myr (see section 1.4 and Supplementary Table

4) . Note that the loss rate per gene is different from the chromosome-wide loss rate,

which is expressed in "genes lost per Myr". For the sake of simplicity we will refer to

the latter simply as the "loss rate".
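
As a worked illustration of the per-gene loss rate, the sketch below recomputes the D. grimshawi example from the numbers given above. The pooled value assumes, as our reading of section 1.4 rather than a statement from the authors, that the average rate is the total number of observed losses (2) divided by the total observed exposure of 1948.6 gene-Myr. The function name is hypothetical.

```python
def loss_rate_per_gene(losses, genes_at_risk, branch_length_myr):
    """Losses per gene per Myr on one branch: (losses / genes) / branch length."""
    return losses / genes_at_risk / branch_length_myr

# D. grimshawi branch: one loss (Ppr-Y) among five Y-linked genes over 42.9 Myr.
grimshawi_rate = loss_rate_per_gene(1, 5, 42.9)
print(f"{grimshawi_rate:.4f} genes lost / gene / Myr")

# Pooled over all branches (assumption: total losses / total exposure,
# with the exposure of 1948.6 gene-Myr quoted in section 1.4).
pooled_rate = 2 / 1948.6
print(f"{pooled_rate:.6f} genes lost / gene / Myr")
```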

Note that we ignored the kl-5 gene in the Drosophila subgenus for the

computations of gene gain and loss rates, because its discovery depended on the gain of

the same gene in two branches, which may bring in unknown bias. However its inclusion

did not change any conclusion (not shown).

1.1.2. The ratio of melanogaster-specific to melanogaster-ancestral genes for a

given outgroup is unbiased because the discovery of all Y-linked genes of D.

melanogaster did not use any information from the other species (the other genomes

were not even available at that time). Hence the observed ratio of 7 / 5 with D. virilis

(and 1/11 with D. ananassae) is expected to hold for the full gene set of D.

melanogaster, apart from sampling variance. More formally,

$$E\left(\frac{MS_R}{MA_R}\right) = \frac{MS_K}{MA_K} \qquad \text{(eqn. 1)}$$

where E is the expected value, and the other symbols were defined before.

1.1.3. Similarly, the gene gain rate (7 genes / 62.9 Myr = 0.1113 genes / Myr) is unbiased in the

sense that the discovery of the D. melanogaster Y-linked genes was done without

information from the other species, and so was not influenced by their condition of being

"ancestral" or "melanogaster-specific". The gene gain rate has a trivial bias caused by

unknown D. melanogaster Y-linked genes (which is dealt with in section 1.3), and

needs a minor correction, as follows.

The 7 gains in 62.9 Myr we observed is the net gain rate, which does not take into

account the genes that were acquired and subsequently lost in the D. melanogaster

lineage. This problem can be corrected with a simple birth-death model with gene

“births” (arrival of genes) being independent of population size (number of Y-linked

genes), and the number of "deaths" (gene losses) being proportional to population size

(number of Y-linked genes). The differential equation is:

dN/dt = -λ N + ν

where N is the number of genes, λ is the gene loss rate per gene (in genes lost /

gene / Myr), and ν is the gene gain rate (in genes / Myr). Its solution is

$$N_t = N_0 e^{-\lambda t} - \frac{\nu}{\lambda}\left(e^{-\lambda t} - 1\right) \qquad \text{(eqn. 2)}$$

where N_t is the number of genes at time t, N_0 is the number of genes at time zero

(i.e., the beginning of the interval, t Myr before the present), and the other symbols were defined before. We can obtain the

corrected estimate of the gain rate ν by setting λ to 0.001026 , Nt to 12 genes , N0 to 5

genes, t to 62.9 Myr, and solving the equation for ν. The corrected value of the gain rate

is 0.12 genes / Myr (the raw value is 0.1113).
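
The correction can be reproduced numerically. The sketch below rearranges the solution of dN/dt = -λN + ν for ν and inserts the values given above; the function name is ours, and the code is an illustration rather than the authors' own implementation.

```python
import math

def corrected_gain_rate(n_t, n_0, loss_rate, t_myr):
    """Solve N_t = N_0*exp(-l*t) + (nu/l)*(1 - exp(-l*t)) for the gain rate nu."""
    decay = math.exp(-loss_rate * t_myr)
    return loss_rate * (n_t - n_0 * decay) / (1.0 - decay)

# Values from the text: 12 genes now, 5 genes at time zero, 62.9 Myr apart.
nu = corrected_gain_rate(n_t=12, n_0=5, loss_rate=0.001026, t_myr=62.9)
raw = 7 / 62.9

print(f"raw gain rate:       {raw:.4f} genes / Myr")
print(f"corrected gain rate: {nu:.4f} genes / Myr")
```

Because the loss rate per gene is small, the corrected value is only slightly higher than the raw one.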

1.2. Bias in the loss rate caused by the outgroup-specific genes. Our current

estimates of the gain rate (0.12 genes / Myr) and loss rate (2 losses in 275.2 Myr; raw

value: 0.00727 genes / Myr) are downward biased because we do not know the full gene set of

the D. melanogaster Y chromosome, but these biases cancel out when we consider the

ratio of gain to loss, and hence the ratio itself is unbiased (see section 1.3.). Let us focus

here on the more relevant bias: the loss rate is also downward biased because we do not

take into account the Y-linked genes that are present in the outgroup, but not in D.

melanogaster ( i.e., we have not taken into account the outgroup-specific genes), and

some of them are expected to have been lost. Putting it more formally, our current count

of losses in the outgroup (and the loss rate) is conditional on the existence of the same

gene in the Y chromosome of the D. melanogaster lineage, whereas an unbiased

estimate of the loss rate requires the inclusion of the outgroup-specific genes. At this

moment the knowledge about outgroup-specific genes (we know two in D. virilis) is too

incomplete to allow more detailed conclusions, other than that they really exist.

However, under the assumption that gain rates are homogeneous across the phylogeny,

the expected number of outgroup-specific Y-linked genes is:

E (OS) = gain rate × outgroup branch length.

For example, in the Y of D. willistoni we expect 0.12 genes/ Myr × 62.2 Myr =

7.5 genes , which would not be present in the Y of D. melanogaster. Among these genes,

we expect 7.5 × 0.001026 genes / gene / Myr × 62.2 Myr × 0.5 = 0.24 losses. The last

"0.5" factor stems from the fact that the acquisition of these new genes is expected to

occur on average at ½ of the branch length, so the chance of being lost is halved. In the

real phylogeny (Supplementary Fig. 8, panel B) we must calculate these "inferred

losses" for each outgroup branch, as shown in Supplementary Table 4. The sum of

expected additional losses across all outgroup branches is 1.025 . Note that the observed

value was 2 losses, so the unbiased number of losses is 3.025 and the unbiased loss rate

is 3.025 / 275.2 = 0.01099 genes / Myr .

The unbiased gain / loss ratio , 0.12 / 0.01099 , is 10.9 . The statistical testing of

this ratio (i.e., whether or not it is significantly different from 1) is presented in section

1.5.
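
The arithmetic of this section can be followed step by step in the sketch below. It recomputes the D. willistoni example and then the unbiased loss rate and gain / loss ratio. The sum of inferred losses (1.025) is taken as quoted in the text, not recomputed from Supplementary Table 4, and the function name is hypothetical.

```python
gain_rate = 0.12               # genes / Myr (corrected value, section 1.1.3)
loss_rate_per_gene = 0.001026  # genes lost / gene / Myr (section 1.1.1)

def expected_hidden_losses(branch_length_myr):
    """Expected losses among outgroup-specific genes gained along one branch."""
    expected_os = gain_rate * branch_length_myr  # E(OS) for this branch
    # Genes arrive on average halfway along the branch, hence the factor 0.5.
    return expected_os * loss_rate_per_gene * branch_length_myr * 0.5

willistoni = expected_hidden_losses(62.2)
print(f"D. willistoni branch: {willistoni:.2f} expected additional losses")

observed_losses = 2
inferred_losses = 1.025      # sum over all outgroup branches, as quoted in the text
total_branch_length = 275.2  # Myr

unbiased_loss_rate = (observed_losses + inferred_losses) / total_branch_length
ratio = gain_rate / unbiased_loss_rate
print(f"unbiased loss rate: {unbiased_loss_rate:.5f} genes / Myr")
print(f"unbiased gain / loss ratio: {ratio:.1f}")
```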

1.3. Bias caused by unknown D. melanogaster Y-linked genes. As we

mentioned above, ignoring these genes causes a downward bias that is expected to affect

equally the gain and loss rates, and hence the effects are expected to cancel out in the

ratio gain / loss. The argument follows.

1.3.1. Gain rate. Our current raw estimate of the gain rate is 7 genes / 62.9 Myr =

0.1113 genes / Myr . This is an under-estimate because we do not know all

melanogaster-specific genes, and the more we find, the higher will be the gain rate. So if

we find 7 additional melanogaster-specific genes, the raw rate will be 14 / 62.9 Myr =

0.2226 genes / Myr (the corrected value described in section 1.1.3. will also double,

from 0.12 to 0.24).

More generally,

$$\text{unbiased Gain Rate} = \text{Gain Rate} \times \frac{MS_R}{MS_K} \qquad \text{(eqn. 3)}$$

1.3.2. Loss rate. Our current raw estimate is 2 genes / 275.2 Myr = 0.00727 genes / Myr. It was

estimated in the non-melanogaster branches (the "blue branches" shown in Supplementary

Fig. 9), by counting the number of losses among the known Y-linked genes of these

branches, and dividing by the total branch length. The loss rate is underestimated

because we do not know all Y-linked genes of the outgroup, and if this number, say, is

30% higher, the expected rate of loss will also be 30% higher. A simple analogy may

help: if we observed 10 deaths per year in a random sample of 250 animals, we would

expect to observe 20 deaths per year in a sample of 500 animals.

Among these unknown Y-linked genes, some would have been acquired in the

outgroup branches; these are the "outgroup specific" genes mentioned and accounted for

in section 1.2. Note that the bias correction described in section 1.2. is not affected by

additional melanogaster-specific genes because if we found that the gain rate is, say,

0.24 (twice the current rate 0.12 ), the expected number of OS (and the expected number

of losses among them) also doubles, and the effect cancels out in the ratio gain / loss.

The rest of the unknown Y-linked genes of the outgroup are the "melanogaster-

ancestral". If there are, say, 10 such genes (instead of five) we expect that the number of

losses among them will also double. More generally,

$$\text{unbiased Loss Rate} = \text{Loss Rate} \times \frac{MA_R}{MA_K} \qquad \text{(eqn. 4)}$$

It follows from eqn. 1 that

$$E\left(\frac{MA_R}{MA_K}\right) = E\left(\frac{MS_R}{MS_K}\right) \qquad \text{(eqn. 5)}$$

Note that the left term of eqn. 5 is the bias of gene loss (eqn. 4), and that the right

term is the bias of gene gain (eqn. 3). Thus, the biases due to incomplete knowledge of

the D. melanogaster Y-linked genes are expected to cancel out in the ratio gain rate /

loss rate , as stated in the beginning of section 1.3.
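
A small numerical check may make the cancellation concrete. In the sketch below, both raw rates are multiplied by the same hypothetical factor k, standing for the case in which the real gene set is k times larger than the known one with an unchanged MS / MA ratio. The value k = 2 is an arbitrary illustration, not an estimate.

```python
import math

def scaled_ratio(gain_rate, loss_rate, ms_scale, ma_scale):
    """Gain / loss ratio after scaling each rate by its bias factor (eqns 3 and 4)."""
    return (gain_rate * ms_scale) / (loss_rate * ma_scale)

raw_gain = 7 / 62.9   # genes / Myr
raw_loss = 2 / 275.2  # genes / Myr

# Hypothetical case: MS_R / MS_K = MA_R / MA_K = k.
k = 2.0
before = scaled_ratio(raw_gain, raw_loss, 1.0, 1.0)
after = scaled_ratio(raw_gain, raw_loss, k, k)

print(f"ratio with known genes only:   {before:.2f}")
print(f"ratio with both rates doubled: {after:.2f}")
```

Both rates change, but the ratio does not, which is the point of eqn. 5.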

1.4. General analytical model for bias correction. In the previous sections we

calculated the bias in the gain / loss ratio for the specific data we have. Here we analyze

a more general model for this bias, which is needed to specify more clearly the null

hypothesis for testing the ratio of gain to loss. The data used for the estimation of the

loss rate consist of the follow-up of a set of genes known to be Y-linked in the ancestor,

along the phylogenetic branches (Supplementary Table 4, and Fig. 2) . The product

"number of genes × branch length" is called exposure in the context of Poisson

regression, and the higher it is, the larger the expected number of losses. It transpires

from the reasoning presented in section 1.2 that the bias discussed there originated from

not considering the exposure due to the outgroup-specific Y-linked genes, and that the

bias correction simply is its inclusion. Namely,

$$\text{unbiased loss rate} = \text{loss rate} \times \left(1 + \frac{\sum \text{inf\_exposure}}{\sum \text{obs\_exposure}}\right)$$

where obs_exposure is the observed exposure (from the ancestral Y-linked genes

present in the outgroup), and inf_exposure is the inferred exposure (from the outgroup-

specific Y-linked genes). Specifically,

obs_exposure = observed number of genes × branch length

inf_exposure = inferred effective number of genes × branch length ,

where "inferred effective number of genes" is the number of genes gained in the

same branch divided by 2 (to account for the fact that they were gained on average in the

middle of the branch) plus the number of genes acquired in previous branches. In the full

data set the bias in the loss rate is (1 + 998.9 / 1948.6 ) = 1.513 . Remember that the

biased loss rate is Σ obs_losses / Σ branch length = 2 / 275.2 = 0.00727 genes lost / Myr.

The unbiased loss rate equals 0.00727 × 1.513 = 0.01099 genes lost / Myr , which is

the same value calculated with the less general approach used in section 1.2. The value

of this bias depends on the topology of the tree (Fig. 2) and on the specific points where

the genes were acquired in the D. melanogaster lineage, because these factors change

the observed and inferred exposures. For example, if we assume that the CCY gene was

acquired at the basal branch of the melanogaster / obscura groups instead of in the basal

Sophophora branch, the bias in the loss rate changes from 1.513 to 1.531 , and if we had

used only Sophophora species, the bias would be 1.274 . We included in the

Supplementary Information an Excel spreadsheet (Supplementary Data file analytical.xls)

that implements the bias estimation.
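
For readers without access to the spreadsheet, the exposure-based correction for the full data set reduces to a few lines. The sketch below uses the two exposure totals quoted above; it does not reproduce the branch-by-branch table, and the helper function is a hypothetical restatement of the definition of the "inferred effective number of genes".

```python
def inferred_effective_genes(gained_in_branch, gained_in_previous_branches):
    """Genes gained in the branch count half; earlier gains count in full."""
    return gained_in_branch / 2 + gained_in_previous_branches

observed_exposure = 1948.6  # sum of observed genes x branch length (gene-Myr)
inferred_exposure = 998.9   # sum of inferred effective genes x branch length

bias = 1 + inferred_exposure / observed_exposure
biased_loss_rate = 2 / 275.2  # observed losses / total branch length
unbiased_loss_rate = biased_loss_rate * bias

print(f"bias in the loss rate: {bias:.3f}")
print(f"unbiased loss rate:    {unbiased_loss_rate:.5f} genes lost / Myr")
```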

The calculated bias provides the null hypothesis for testing the gain / loss ratio.

The loss rate has a bias of 1.513 . Remember that the gain rate has a minor bias (section

1.1.3), which amounts to 0.12 / 0.1113 = 1.078 . Hence, an unbiased gain / loss ratio of

1 implies an observed ratio of 1.513 / 1.078 = 1.403 . This 1.403 ratio is the null

hypothesis to be tested in the Poisson regression (and in the two Poisson means test).

I.e., we should test whether the ratio of 7 gains in 62.9 Myr divided by 2 losses in 275.2

Myr differs significantly from 1.403 . The answer is "yes" ( P = 0.003 , Poisson

regression; see section 1.5 ). The same qualitative result was obtained with direct

computer simulations (section 2).
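
The null value can be derived from the two bias factors in a few lines. The sketch below is only a restatement of the arithmetic in this paragraph; the variable names are ours.

```python
import math

raw_gain = 7 / 62.9    # genes / Myr
raw_loss = 2 / 275.2   # genes / Myr
corrected_gain = 0.12  # genes / Myr (section 1.1.3)

gain_bias = corrected_gain / raw_gain  # minor bias in the gain rate
loss_bias = 1.513                      # bias in the loss rate (full data set)

# Observed ratio expected when the unbiased gain / loss ratio equals 1.
null_ratio = loss_bias / gain_bias
observed_ratio = raw_gain / raw_loss
beta1_null = math.log(null_ratio)      # null value on the log scale

print(f"null observed ratio: {null_ratio:.3f}")
print(f"raw observed ratio:  {observed_ratio:.1f}")
print(f"ln(null ratio):      {beta1_null:.4f}")
```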

1.5. Statistical tests for the difference between the rates of gene gain and loss.

The data consist of inferred gains and losses on each branch, using synteny and

parsimony to infer ancestral states (Fig. 2). As detailed in Supplementary Fig. 6 and

Supplementary Fig. 8, there are two uncertainties in the data, involving the Pp1-Y1 /

Pp1-Y2 and the CCY genes. We first assumed the scenario shown in Fig. 2 (CCY was

gained in the basal Sophophora branch, instead of in the basal branch of the

melanogaster / obscura groups; the gains of the Pp1-Y1 and Pp1-Y2 genes are two

independent events, instead of one). None of the alternative scenarios change the

conclusion (below). Finally, as commented in the Supplementary Discussion, we

excluded from our analysis the Y-linked gene Suppressor of Stellate [ Su(Ste) ] because

it is multi-copy and RNA-encoding. The gene was acquired in the D. melanogaster

lineage, after the split from D. simulans. Its inclusion did not change any qualitative

conclusion (below).

We assume that genes are gained and lost along each branch of the phylogeny

according to a homogeneous Poisson process. Here the “exposure” to gene gain or loss

is the length of the respective branch, and the model is:


$$E(X) = f(\beta_0 + \beta_1 I_{GainLoss})$$

where $X$ is the count of gene movements (both gains and losses), $f$ is the Poisson

link function (ref. 44), $\beta_0$ is the intercept, adjusting for the branch lengths, and $I_{GainLoss}$ is a

binary indicator variable denoting whether the gene movement is a gain or a loss (1 for a

gain and 0 for a loss). Testing the null hypothesis that the ratio of gain to loss is one (i.e.,

that they are equal) amounts to testing $\beta_1 = 0$. Given the ascertainment bias discussed in

section 1.4, the appropriate null hypothesis is that the observed ratio of gain to loss is

1.403, i.e., that $\beta_1 = \ln(1.403) = 0.3386$. The residual deviances provided an

assessment of goodness-of-fit of the model to the data. The whole procedure was done

with the glm function in the R statistical package (ref. 45; setting “family = poisson”), and is

implemented in the Supplementary Data files "gains_losses_script.R" and

"gains_losses_data.txt".

The Poisson regression model indicated that the rate of gene gain is significantly

larger than the rate of gene loss (P = 0.003). The nominal gain / loss ratio (after the bias

correction) is 10.9 (95% confidence interval: 2.3 - 52.5). Similar results are obtained if

we assume that CCY was gained in the basal branch of the melanogaster / obscura

groups (P = 0.003), if the gains of the Pp1-Y1 and Pp1-Y2 genes are counted as one

event (P = 0.005), or if the gain of the Su(Ste) gene is included (P = 0.001). Regarding

the goodness-of-fit of the model to the data, there is an indication of overdispersion in

the assumed scenario, or if we include the Su(Ste) gene (P = 0.035 and P = 0.022,

respectively), but not in the remaining scenarios (CCY alternative scenario: P > 0.20;

Pp1-Y1 / Pp1-Y2 alternative scenario: P = 0.06).
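The main regression result can be approximated by hand. The Python sketch below is not the authors' R script: it assumes a log link with branch length as the exposure offset, in which case a Poisson regression with a single binary covariate has a closed-form solution (the estimate of β1 is the log ratio of the pooled rates, and its Wald standard error is the square root of 1/gains + 1/losses). Under these assumptions it gives a corrected ratio near 10.9, an interval of roughly 2.3 to 52.5 and a P value close to 0.003, in line with the values reported above.

```python
import math

# Pooled counts and exposures (branch lengths in Myr) from the text.
gains, gain_myr = 7, 62.9
losses, loss_myr = 2, 275.2
null_ratio = 1.403  # observed gain / loss ratio expected under equal true rates

# Closed-form estimate of beta_1 and its Wald standard error.
log_ratio = math.log((gains / gain_myr) / (losses / loss_myr))
std_error = math.sqrt(1 / gains + 1 / losses)

# Wald test of beta_1 = ln(null_ratio), two-sided normal approximation.
z = (log_ratio - math.log(null_ratio)) / std_error
p_value = math.erfc(abs(z) / math.sqrt(2))

# Bias-corrected ratio with a 95% Wald confidence interval.
log_corrected = log_ratio - math.log(null_ratio)
corrected_ratio = math.exp(log_corrected)
ci_low = math.exp(log_corrected - 1.96 * std_error)
ci_high = math.exp(log_corrected + 1.96 * std_error)

print(f"z = {z:.2f}, P = {p_value:.4f}")
print(f"corrected ratio = {corrected_ratio:.1f} ({ci_low:.1f} - {ci_high:.1f})")
```

The goodness-of-fit (overdispersion) assessment depends on the per-branch counts and is therefore not covered by this pooled sketch.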

The same conclusion that gain rate largely exceeds loss rate is obtained with a

simple two-tailed exact test for the ratio of two Poisson means (refs 34, 35), by comparing 7 gains

in 62.9 Myr with 2 losses in 275.2 Myr, under the null hypothesis of a gain / loss ratio of


1.403 (P = 0.002; the test was done with the StatCalc 2.0 program, available at

http://www.ucs.louisiana.edu/~kxk4695/StatCalc.htm). Similar results were obtained

under the three alternative scenarios mentioned above.
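The exact test for the ratio of two Poisson means rests on a conditioning argument: given the total number of events, the number of gains is binomial with a success probability fixed by the null ratio and the two exposures. The Python sketch below illustrates this for the main scenario. The function name is hypothetical, and the two-tailed convention (doubling the smaller tail) is an assumption, since the convention used by StatCalc is not stated here; with it, the sketch returns a P value of roughly 0.002.

```python
from math import comb

def poisson_ratio_exact_p(x1, t1, x2, t2, null_ratio):
    """Two-tailed conditional (binomial) test of rate1 / rate2 == null_ratio.

    x1, x2 are event counts and t1, t2 the corresponding exposures.
    The two-tailed P value doubles the smaller tail (an assumed convention).
    """
    n = x1 + x2
    # Probability that an event belongs to group 1 under the null hypothesis.
    p = null_ratio * t1 / (null_ratio * t1 + t2)
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    upper_tail = sum(pmf[x1:])        # P(X >= x1)
    lower_tail = sum(pmf[: x1 + 1])   # P(X <= x1)
    return min(1.0, 2 * min(upper_tail, lower_tail))

# 7 gains in 62.9 Myr versus 2 losses in 275.2 Myr, null ratio 1.403.
p_value = poisson_ratio_exact_p(7, 62.9, 2, 275.2, 1.403)
print(f"P = {p_value:.4f}")
```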

Finally, the same conclusion is obtained if we use only Sophophora species (7

gains in 62.9 Myr vs. 0 losses in 136.9 Myr; P = 0.002; two-tailed exact test for the ratio

of two Poisson means). Hence the conclusion that the gene content of the Y is increasing

does not seem to be an artifact caused by estimating gains and losses in different and

rather distant lineages such as D. melanogaster and D. virilis, although it is formally

possible that the increase is occurring in D. melanogaster and related species, but not in

species from the Drosophila subgenus.

In this section and in the previous one we approached analytically the

ascertainment bias on gene gains and losses, and statistically tested their equality with a

Poisson regression. In the next section we use computer simulations and an

approximate Bayesian procedure to tackle these questions.

2. Computer simulations and approximate Bayesian estimates of gene gain and

gene loss.

In order to more fully explore the consequences of the ascertainment bias of gene

content, simulations of a Poisson process of gene gain and loss were run. The computer

code was written in the statistical language R (ref. 45), and is available as Supplementary

Material (file indelsim_free.R). The simulations employed the observed phylogeny and

branch lengths, and inferences of losses were conditional on observing genes in D.

melanogaster (identical to the true ascertainment). After drawing random rates of gene

gain and loss per gene from a uniform distribution and collecting 1,000 runs that

satisfied the rejection criteria (7 net gains on the D. melanogaster lineage and 2 losses of


known genes on the other branches), approximate Bayesian estimates (refs 46-48) of the

posterior densities of the gain rate, the loss rate per gene, and the net gene gain (gains

minus losses) were obtained (Figure 3 and Supplementary Fig. 11).
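The rejection step described above can be illustrated with a deliberately simplified Python sketch. It is not a port of indelsim_free.R: the phylogeny is collapsed into a single gain exposure (62.9 Myr) and a single loss exposure, taken here as 1948.6 gene-Myr (the denominator quoted in section 1.4, used purely as a stand-in), the prior bounds are arbitrary, and only 200 draws are collected to keep the run short. All function and parameter names are hypothetical.

```python
import math
import random

def poisson_draw(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def abc_rejection(n_accept, rng, gain_myr=62.9, loss_exposure=1948.6,
                  target_gains=7, target_losses=2,
                  max_gain_rate=0.5, max_loss_rate=0.005):
    """Keep (gain rate, per-gene loss rate) draws that reproduce the data."""
    accepted = []
    while len(accepted) < n_accept:
        # Uniform priors on both rates (bounds are assumptions).
        gain_rate = rng.uniform(0, max_gain_rate)
        loss_rate = rng.uniform(0, max_loss_rate)
        gains = poisson_draw(gain_rate * gain_myr, rng)
        losses = poisson_draw(loss_rate * loss_exposure, rng)
        # Rejection criterion: exact match to the observed counts.
        if gains == target_gains and losses == target_losses:
            accepted.append((gain_rate, loss_rate))
    return accepted

rng = random.Random(1)
posterior = abc_rejection(200, rng)
mean_gain = sum(g for g, _ in posterior) / len(posterior)
mean_loss = sum(l for _, l in posterior) / len(posterior)
print(f"posterior mean gain rate: {mean_gain:.3f} genes / Myr")
print(f"posterior mean loss rate: {mean_loss:.5f} genes lost / gene / Myr")
```

Because the accepted draws are exact matches to the observed counts, they are samples from the posterior implied by this toy model; the full simulation differs in conditioning on the observed tree and on the ascertainment through D. melanogaster.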

In all 1,000 simulations the gains outnumber the losses (Fig. 3 and Supplementary

Fig. 11A), which strongly suggests that the Y is gaining genes on average. Note that the

simulations required just a total of 7 gains in the red branches, irrespective of where they

happened. This is important because there is some uncertainty in where the genes were

gained (Supplementary Fig. 6 and 8); the simulations are free of assumptions in this

respect, which increases the robustness of the conclusions drawn from them. The gain

rate and the loss rate per gene are the ultimate factors governing gene number dynamics;

their joint posterior distributions are shown in Supplementary Figure 11B. Analogously

to what we did in the analytical approach, we also ran the simulations under two

alternative scenarios, to allow for the counting of the Pp1-Y1 and Pp1-Y2 gains as a

single event (6 gene gains, instead of 7), and to include the gain of the Suppressor of

Stellate gene (8 gene gains, instead of 7). In both cases we got the same result as before

(namely, in all 1,000 simulations the gains outnumber the losses; data not shown).

It is likely that as gene number increases the number of gene losses will increase

until an equilibrium between gains and losses is attained. Under the simple model

outlined in section 1.1.3 and in equation 2, the equilibrium gene number is ν / λ , where

ν is the gene gain rate (in genes / Myr), and λ is the gene loss rate per gene (in genes lost

/ gene / Myr). The simulations allow us to look at the posterior distribution of the

predicted equilibrium gene number (Supplementary Fig. 11C). As expected given the

previous result that the gains outnumber the losses (Fig. 3), nearly all (997 out of 1,000)

of the values of equilibrium gene number are above the present Y-linked gene number in

D. melanogaster (12 genes). The average is 89 genes. However, the equilibrium gene

number does not have much biological significance because it is expected to take a very


long time to be achieved. Using the nominal gain rate of 0.12 genes / Myr and the

nominal loss rate of 0.001026 genes lost / gene / Myr , the predicted equilibrium is 117

genes, but it would take over 400 Myr for an increase from 12 genes to 50.
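The two figures quoted here can be reproduced with a small Python sketch. Equation 2 is not repeated in this section, so the sketch assumes the simplest dynamics whose equilibrium is ν / λ, namely dN/dt = ν − λN, for which the time to grow from N0 to N is ln[(N* − N0) / (N* − N)] / λ with N* = ν / λ. The function name is hypothetical.

```python
import math

gain_rate = 0.12               # nu, genes / Myr
loss_rate_per_gene = 0.001026  # lambda, genes lost / gene / Myr

# Equilibrium gene number under dN/dt = nu - lambda * N (assumed model).
equilibrium = gain_rate / loss_rate_per_gene

def myr_to_reach(target, start, nu, lam):
    """Time (Myr) for the gene number to grow from start to target."""
    n_eq = nu / lam
    if not start < target < n_eq:
        raise ValueError("target must lie between start and the equilibrium")
    return math.log((n_eq - start) / (n_eq - target)) / lam

time_12_to_50 = myr_to_reach(50, 12, gain_rate, loss_rate_per_gene)
print(f"equilibrium: {equilibrium:.0f} genes")
print(f"time from 12 to 50 genes: {time_12_to_50:.0f} Myr")
```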

The parameters and values estimated by the simulations agree quite well with the

analytical solution. For example, the average ratio of gain rate to loss rate in the

simulations is 8.3 (Supplementary Fig. 11A), whereas the analytical value is 10.9

(section 1.2). Perfect agreement is not expected because some assumptions are

different. In particular, the simulations allowed variation among samples in the

phylogenetic pattern of gains (i.e. the rejection criterion focused on counts of gains, not

on which branches had gains), and this changes the exposure to losses (see section 1.4.).

If we re-run the simulation with additional constraints such that the gene gains fall on

the phylogeny as in Fig. 2, then its estimates of parameters and values match more

closely the analytical solution (not shown).


Supplementary Figure 1. Assembly problems of Y-linked genes. The figure

shows a BlastN search of the full cDNA of D. virilis kl-2 against the assembled

D. virilis genome. This and the cDNAs from all other genes were obtained after

gaps and frame-shifts were corrected, as described in the Methods Summary

section. Note that the gene is fragmented in several scaffolds and that there are

many gaps due to the low coverage of the Y. The numbers are the abridged

FlyBase scaffold identifiers (e.g. scaffold_9735 was abridged to 9735). Many

fragments were absent from the assembled genome and were sequenced de

novo using RT-PCR, RACE 5' and RACE 3'. The final coding sequences of the

orthologs were obtained with NAP (ref. 49) and GeneWise2 (ref. 50)

(http://www.ebi.ac.uk/Wise2/advanced.html). We also used Apollo (ref. 51) and SGP2 (ref. 52)

to help the annotation in the more difficult (i.e., less conserved) genes.


Supplementary Figure 2. Gene orthology confirmation by phylogeny. The

Y-linked genes originated by duplication of autosomal genes (refs 18-22). We used

phylogenetic analysis to avoid the error of mixing the parental autosomal genes

(i.e., the paralogs) with the correct orthologs. The protein sequences were

aligned with ClustalW (ref. 53), and a NJ tree with Poisson correction and complete

deletion was constructed with the program MEGA (ref. 54). For the sake of simplicity, in

the figures we labelled the genes in the other species according to their names in

D. melanogaster (e.g., the D. erecta gene Dere_GLEANR_12165 is the ortholog

of the D. melanogaster gene CG3339, and was labelled as "ere CG3339";

see Supplementary Table 5 for the remaining genes). Each panel shows the

phylogenetic analysis of one gene, indicated at the top of the corresponding

panel. In most cases we included only the closest paralog, but in case of doubt

(Pp1-Y1 and Pp1-Y2; and ARY) we used several sequences returned by the

TblastN search. In a few cases we could not find any paralog (WDY) or only

very distant ones (PRY; the two proteins shown in the figure have less than

30% identity), so there is no doubt about the orthology. The CCY gene has a

shorter paralog (CG31161) present only in the melanogaster group; their

relationship is discussed in Supplementary Fig. 8.
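The distance measure named in this legend can be illustrated with a minimal Python sketch. It shows complete deletion (dropping every alignment column that contains a gap in any sequence) followed by the Poisson-corrected distance d = −ln(1 − p), where p is the proportion of differing sites. The three sequences are invented toy data and the function names are hypothetical; the sketch makes no claim about how MEGA implements these steps.

```python
import math

def complete_deletion(alignment):
    """Drop every column in which at least one sequence has a gap."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] != "-" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]

def poisson_distance(seq_a, seq_b):
    """Poisson-corrected amino acid distance, d = -ln(1 - p)."""
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    p = differences / len(seq_a)
    if p >= 1:
        raise ValueError("distance is undefined when all sites differ")
    return -math.log(1 - p)

# Toy alignment (not real data); the third sequence has one gap.
toy_alignment = ["MKTAYIAK", "MKTAYLAK", "MRT-YLSK"]
clean = complete_deletion(toy_alignment)

for i in range(len(clean)):
    for j in range(i + 1, len(clean)):
        print(i, j, round(poisson_distance(clean[i], clean[j]), 4))
```

A matrix of such pairwise distances is the input that a neighbour-joining algorithm would then turn into a tree.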

Supplementary Fig. 2: Panel A. kl-2 phylogeny. [Tree image not reproduced. Sequences shown: kl-2 from mel, ere, yak, ana, pse, wil, moj, vir and gri; CG9068 from the same nine species; Anopheles XP310137, Aedes EAT40361, Apis XP396228.3, Tribolium XP967358 and Chlamydomonas 1-beta dynein CAB99316. Scale bar: 0.1.]

Supplementary Fig. 2: Panel B. kl-3 phylogeny. [Tree image not reproduced. Sequences shown: kl-3 from mel, ere, yak, ana, pse, wil, moj, vir and gri, with Anopheles XP308196, Aedes EAT41089 and Tribolium XP966797; CG9492 from the same nine species, with Anopheles XP307780, Aedes EAT38201 and Tribolium XP967934; Chlamydomonas gamma dynein Q39575. Scale bar: 0.1.]

Supplementary Fig. 2: Panel C. kl-5 phylogeny. [Tree image not reproduced. Sequences shown: kl-5 from mel, ere, yak, ana, pse, wil, moj, vir and gri, with Anopheles XP321424 and Aedes EAT45561; CG3339 from the same nine species; Dhc93AB from the same nine species, with Anopheles XP559011 and Aedes EAT39332; Chlamydomonas beta dynein Q39565. Scale bar: 0.05.]

Supplementary Fig. 2: Panel D. ORY phylogeny. [Tree image not reproduced. Sequences shown: ORY from mel, ere, yak, ana, pse, wil, vir, moj and gri; CG6059 from the same nine species; Anopheles XP 313461.1, Aedes EAT43311.1 and Tribolium XP 974953.1. Scale bar: 0.2.]

Supplementary Fig. 2: Panel E. PRY phylogeny. [Tree image not reproduced. Sequences shown: PRY from mel, yak, ere, ana, wil, pse, moj, vir and gri; CG12636 from mel, yak, ere and wil; CG30048 from mel, ere, yak, ana, moj, vir and gri. Scale bar: 0.1.]

Supplementary Fig. 2: Panel F. Ppr-Y phylogeny. [Tree image not reproduced. Sequences shown: PPrY from mel, yak, ere, ana, pse, wil, vir and moj; CG13125 from mel, yak, ere, ana, pse, wil, moj, vir and gri; Aedes XP 001649719.1. Scale bar: 0.1.]

Supplementary Fig. 2: Panel G. CCY phylogeny. [Tree image not reproduced. Sequences shown: CCY from mel, ere, yak, ana and pse; CG31161 from mel, ere, yak and ana; wil CCY-1 and wil CCY-2; CCY from moj, vir and gri. Scale bar: 0.05.]

Supplementary Fig. 2: Panel H. ARY phylogeny. [Tree image not reproduced. Sequences shown: ARY from mel, ere, yak, ana, wil, pse, moj, vir and gri; mel CG10638-PB, ere_GLEANR_14052, ere_GLEANR_14054, yak CG10638-PA, wil_GLEANR_12919, ana CG10638-PB, ana_GLEANR_8771, moj_GLEANR_13025, moj_GLEANR_13026, vir_GLEANR_13680, vir_GLEANR_13681, mel CG10638-PA, ere_GLEANR_14053, ere_GLEANR_14055, wil_GLEANR_12920, moj CG10638, vir CG10638 and pse CG10638; Aedes AAEL004095-PA. Scale bar: 0.2.]

Supplementary Fig. 2: Panel I. WDY phylogeny. [Tree image not reproduced. Sequences shown: WDY from mel, ere, yak, ana, pse, wil, vir, moj and gri; Aedes EAT46343.1 and Tribolium XP 970543.1. Scale bar: 0.1.]

Supplementary Fig. 2: Panel J. Pp1-Y1 and Pp1-Y2 phylogeny. [Tree image not reproduced. Sequences shown: mel Pp1-87B, mel Pp1-96A, mel Pp1-13C, moj PP1-13C and mel Pp1-9C; Pp1-Y2 from mel, yak, ana, ere, pse, wil, moj, vir and gri; mel Pp1-D5, mel PpY-55A, mel PpN58A and moj PpN58A; Pp1-Y1 from mel, ere, yak, ana, pse, wil, vir and gri; wil GLEANR 15902 and vir GLEANR 7085; PpD6 from mel, wil, vir and moj; mel Pp4-19C. Scale bar: 0.1.]


Supplementary Figure 3. PCR test of Y-linkage of the ARY gene. The gene

is Y-linked only in species from the D. melanogaster group and in D. willistoni.

Unabridged species names (in the order of appearance) are: D. melanogaster,

D. erecta, D. yakuba, D. ananassae, D. pseudoobscura, D. willistoni, D.

mojavensis, D. virilis, and D. grimshawi. A similar test was done for all genes,

across all 12 species.

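[Supplementary Figure 4 image not reproduced: synteny maps of the kl-5 region (neighbouring genes CG3348, CG3339, CG3330, CG14264, CG6599, CG13980 and side) on Dana scaffold 13340, Dyak chromosome 3R, Dere scaffold 4820, Dmel chromosome 3R, Dmoj scaffold_6540, Dvir scaffold_13047, Dwil scaffold_2_1100000004902, Dgri scaffold_15074, Dpse chromosome 2 and A. gambiae chromosome 2R.]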

Supplementary Figure 4. Synteny analysis of the kl-5 gene (all species).

Individual genes usually move through gene duplication, sometimes followed by

loss of the original copy (ref. 33). The linkage changes shown in Table 1 can be caused

either by Y-to-autosome movement or vice versa. Synteny analysis is very

helpful to solve this, but could only be applied to the genes that are autosomal in

two or more species, because the Y chromosome assembly is too fragmented


(e.g., Supplementary Fig. 1) and one scaffold seldom contains more than one

gene. These genes are kl-5 (Fig. 1 and Supplementary Fig. 4), WDY

(Supplementary Fig. 5), Pp1-Y1 / Pp1-Y2 (Supplementary Fig. 6), ARY

(Supplementary Fig. 7), and CCY (Supplementary Fig. 8). However, the

remaining genes (kl-2, kl-3, PRY and Ppr-Y) are Y-linked in nearly all species

(Table 1), and the Y-linkage clearly is the ancestral state. The kl-5 gene is Y-

linked in all sequenced species, except D. willistoni and D. pseudoobscura / D.

persimilis, which might suggest a Y-to-autosome transfer in the D. willistoni

lineage. However, as the figure shows, there is synteny in this region between

D. willistoni and A. gambiae (and also with D. pseudoobscura / D. persimilis).

Hence, the former hypothesis would imply that in D. willistoni the kl-5 gene

moved from the Y to exactly its location in A. gambiae, which is nearly

impossible. The most likely explanation is that kl-5 moved twice to the Y-

chromosome: one transfer happened within the Drosophila subgenus (before

the split of D. virilis, D. grimshawi and D. mojavensis ), and the other transfer in

the basal branch of the melanogaster group. The phylogenetic pattern we

observed (Supplementary Fig. 2, panel C) rules out the hypothesis that there

was one duplication from the ancestral autosomal locus to the Y prior to the split

of all sequenced species, followed by retention of the Y or autosomal copies in

different lineages. Note that if we ignore synteny information from D.

pseudoobscura / D. persimilis, the second transfer might have happened as well

in the basal branch of the obscura and melanogaster groups, but the synteny

information rules out this hypothesis. The same reasoning positioned the

transfers of WDY (Supplementary Fig. 5) and Pp1-Y1 / Pp1-Y2 (Supplementary

Fig. 6) in the basal branch of the melanogaster group, instead of in the basal

branch of the melanogaster and obscura groups. Hence there is useful synteny

information in D. pseudoobscura / D. persimilis, but it should be used bearing in


mind that the ancestral Y became part of an autosome in this lineage (ref. 23). In

particular, none of the D. melanogaster Y-linked genes is Y-linked in these

species (Table 1), but this is not due to individual Y-to-autosome transfer; the

lack of Y-linkage in D. pseudoobscura / D. persimilis is due either to the Y-

autosome fusion (e.g., kl-2) or to the fact that these genes were transferred to

the Y in the melanogaster lineage after its split from the pseudoobscura lineage

(e.g., kl-5 and WDY). Note that the whole kl-5 region is conserved across all

sequenced species, except for the absence of kl-5 in the species in which it is Y-

linked. The most likely explanation for this absence is that after the duplication to

the Y the autosomal copy of kl-5 degenerated. Supplementary Figures 4 to 8

were modified from FlyBase GBrowse (available at

http://flybase.bio.indiana.edu/) and from VectorBase (available at

http://www.vectorbase.org/index.php). For the sake of simplicity, in the figures

we labelled the genes in the other species according to the names of their

orthologs in D. melanogaster (Supplementary Table 5). Orthology information

came from Drosophila and Anopheles databases (http://species.flybase.net/cgi-

bin/gbrowse/dmel/; http://agambiae.vectorbase.org/index.php; genes painted in

green) and from the present work (Supplementary Fig. 2; genes painted in blue).

Genes in yellow do not have clear orthologs.

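[Supplementary Figure 5 image not reproduced: synteny maps of the WDY region on Dmoj scaffold_6500, Dvir scaffold_12963, Dwil scaffold_2_1100000004851, Dana scaffold 12943, Dyak chromosome 2L, Dere scaffold 4929, Dmel chromosome 2L, Dgri scaffold_15252 and Dpse chromosome 4-group2.]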

Supplementary Figure 5. Synteny analysis of the WDY gene. The WDY

gene is autosomal in all species of the Drosophila subgenus and in the obscura and

willistoni groups, and Y-linked in the melanogaster group. When autosomal,

WDY is located in the same position in all species, which shows that the

autosomal position is ancestral. Note that the alternative hypothesis of ancestral


Y-linkage of WDY would imply three independent movements to exactly the

same location in an autosome (in the ancestors of the obscura group, of the

willistoni group, and of the subgenus Drosophila), which is nearly impossible.

The WDY region is conserved in the D. melanogaster group, except for the

absence of WDY, which strongly suggests that after the duplication to the Y the

autosomal copy of the gene degenerated. Interestingly, in D. melanogaster (and

also in D. erecta and D. yakuba) a small gene (CG34164; 106 amino acids) with

62% amino acid identity with the C-terminus of WDY is present at exactly the

location of WDY. Thus, CG34164 is a relic of the full WDY gene, which has ~

1000 amino acids.

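[Supplementary Figure 6 image not reproduced: synteny maps of the Pp1-Y1 / Pp1-Y2 region on Dmoj scaffold_6496, Dvir scaffold_12875, Dgri scaffold_15245, Dwil scaffold_2_1100000004514, Dpse chromosome 3, Dana scaffold_13266, Dyak chromosome 2R, Dere scaffold_4845 and Dmel chromosome 2R.]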

Supplementary Figure 6. Synteny analysis of the Pp1-Y1 and Pp1-Y2

genes. These genes are Y-linked only in the melanogaster group, and are

autosomal and syntenic in the other species, as happens with WDY. Therefore

the same arguments and conclusion are valid for them: the autosomal position is


ancestral. Note that when autosomal they are located very close to each other,

that they are Y-linked in the same set of species (Table 1), and that in D.

melanogaster at least they are located in the same gross region of the Y-

chromosome (ref. 20). These observations strongly suggest that their duplication to the

Y was a single mutational event. This may have consequences for the

estimation of the rate of gene gain (Supplementary Methods, section 1.5), but

we should note also that the duplication is only the first step of a gene gain by

the Y. The other step is the survival of the Y-linked copy as a functional gene (ref. 39),

and this most likely was an independent process for Pp1-Y1 and Pp1-Y2

because the PpD6 gene probably was co-duplicated to the Y with them and yet

its only surviving copy is autosomal. Interestingly, PpD6 is located in a

completely different region in the melanogaster group, which hints that the

process of gene duplication, survival and degeneration was quite complex in this

case. Perhaps the gain of Pp1-Y1 and Pp1-Y2 may be considered partially a

single event (in the duplication step) and partially two independent events (in the

survival step). Whatever the case, this is the only uncertainty; the ancestral state

(autosomal) of these genes is very well supported.

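[Supplementary Figure 7 image not reproduced: synteny maps of the ARY region on Dmoj scaffold_6680, Dvir scaffold_13049, Dwil scaffold_2_1100000004729, Dana scaffold 13337, Dyak chromosome 3L, Dere scaffold 4784, Dmel chromosome 3R, Dgri scaffold_15110 and Dpse chromosome XR_group6.]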

Supplementary Figure 7. Synteny analysis of the ARY gene. ARY is Y-linked

in all Sophophora (except, of course, in D. pseudoobscura and D. persimilis),

and autosomal (and syntenic) in the species from the Drosophila subgenus.

Thus, there is no outgroup among the 12 sequenced species that would help to

establish the ancestral state (and also there is no conserved synteny with any of

the three sequenced mosquitoes; data not shown). Consequently, the linkage

pattern of the gene can be explained by two equally parsimonious hypotheses, a

Y-to-autosome transfer in the basal branch of the Drosophila subgenus, or an

autosome-to-Y transfer in the basal branch of the Sophophora subgenus. The ancestral

location of ARY could be inferred because the gene is autosomal and located


in a cluster of two to four related genes in D. virilis and D. mojavensis (the

related genes include ARY and CG10638; see Supplementary Fig. 2, panel H).

Since these clusters usually are formed by tandem duplications, it is fairly safe

to conclude that this autosomal region is the ancestral location of ARY (the

alternative hypothesis is that in the Drosophila subgenus ARY moved from the Y

precisely to the autosomal location of its related genes). The cluster is

conserved also in many species of the Sophophora subgenus.

[Supplementary Figure 8 image not reproduced: Panel A, synteny maps of the CCY region on Dmoj scaffold_6540, Dvir scaffold_12855, Dwil scaffold_2_1100000004902, Dana scaffold 13340, Dyak chromosome 3R, Dere scaffold 4820, Dmel chromosome 3R, Dgri scaffold_15074 and Dpse chromosome 2, with CG31161 labelled; Panels B and C, the two scenarios for the transfer of CCY to the Y that are discussed in the legend.]

Supplementary Figure 8. Synteny analysis of the CCY gene. CCY is Y-linked

in all Sophophora, and autosomal (and syntenic) in the species from the

Drosophila subgenus, as happens with ARY. However, in the CCY case the

ascertainment of the ancestral state is simpler: We know that the autosomal

location present in the Drosophila subgenus is ancestral because as shown in

panel A and Supplementary Fig. 2 (panel G), at the same position in most

Sophophora species there is a shorter gene (CG31161 in D. melanogaster) with

a high identity to the N-terminus of CCY (i.e., a relic gene). CCY has 1200 to

1600 amino acids, and CG31161 has ~ 450. Although the ancestral state of the

autosomal copy is clear, it is difficult to reconcile the protein phylogeny data

(Supplementary Fig. 2, panel G) with a simple scenario of a single transfer of

CCY to the Y in the basal branch of the Sophophora subgenus (panel B). The

main problem is the position of the D. willistoni CCY, which would be expected

to group with the other Sophophora CCY, and not to branch before the

CG31161-CCY split (Supplementary Fig. 2, panel G). The observed pattern

suggests two independent transfers of CCY, one within the D. willistoni branch,

and the other in the basal branch of the melanogaster and obscura groups, as

shown in panel C. A less important incongruence is that in D. willistoni there are

two copies of CCY-like genes in the Y (one full length and one that seems to be

short like CG31161), that group together. This pattern suggests that the two Y-

linked copies arose from a duplication inside the Y chromosome. Given these

uncertainties and the possibility that the protein phylogeny is being affected by

gene conversion, mutational bias (which is different in the Y, due to its

heterochromatic state), and other confounding factors, we conservatively

assumed the simplest scenario of one transfer of CCY to the Y chromosome

(panel B), as shown in Fig. 2.

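[Supplementary Figure 9 image not reproduced: Panel A, simplified phylogeny of D. melanogaster, species A, species B and the outgroup species C; Panel B, phylogeny of the sequenced species with branch lengths and nodes A-H.]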

Supplementary Figure 9. Estimating gene gain and loss in the Y

chromosome. Our experimental approach identified Y-linked genes of the "non-

melanogaster" species using the known Y-linked genes of D. melanogaster.

Hence we cannot detect genes that were lost in the D. melanogaster lineage, or

that were acquired in phylogenetic branches that are not part of this lineage. (A)

Estimates of gene gain and loss can be obtained as follows. Consider a


simplified phylogeny with three species (D. melanogaster and species A and B),

plus an outgroup (species C) to allow the determination of the ancestral states.

Gene gains can only be detected in the branch that leads to D. melanogaster

(shown in red), but in principle occur at the same rate in the branches that lead

to species A and B (shown in blue). Exactly the opposite happens with gene

losses. Hence, the rate of gene gain can be obtained by counting the gains in

the D. melanogaster branch, and dividing this number by this branch length

(red). In the same way, an estimate of the rate of gene loss can be obtained by

counting the losses in the blue branches, and dividing them by the

corresponding branch length. Using only parsimony, gene movements cannot be

unambiguously detected in the branch labelled in black because we cannot

distinguish a gain in the D. melanogaster lineage from a loss in the species C

lineage. (B) In the real data there were 7 gene gains in the Y chromosome in the

D. melanogaster lineage (CCY, ARY, WDY, kl-5, Pp1-Y1, Pp1-Y2, FDY), in a

total branch length of 62.9 Myr, and 2 gene losses (Ppr-Y and PRY) in a total

branch length of 275.2 Myr. We did not consider the Pp1-Y1 loss in D.

mojavensis (because it happened in an autosome), the kl-5 gain in the

Drosophila subgenus (because it happened outside the branches where

estimates are feasible), and linkage data from D. pseudoobscura / D. persimilis

(due to the Y-autosome fusion that happened in this lineage23). Branch lengths

are shown in the figure, and were obtained from Tamura et al.4, except D.

simulans / D. sechellia and virilis group / repleta group55. The nodes A-H are

also labelled. The raw rate of gene gain is 0.1113 genes / Myr and the raw rate

of gene loss is 0.0073 genes / Myr (see Supplementary Methods for bias

corrections).


Supplementary Figure 10

Supplementary Figure 10. Experimental confirmation of the loss of the

Ppr-Y gene in D. grimshawi. Degenerate PCR with the primers FVEH and

MHGE specifically amplified the Ppr-Y gene in a diverse set of species, but did

not recover any product with D. grimshawi. This result confirms that the gene is

absent from the genome of this species. The tested species are (in the order of

the figure): D. prosaltans (saltans group), D. bifasciata (obscura group), D.

fumipennis (willistoni group), D. arawakana (cardini group), D. bromeliae

(bromeliae group), D. tripunctata (tripunctata group), D. robusta (robusta group),

D. virilis (virilis group), and D. grimshawi. The first three species belong to the

Sophophora subgenus, and the remaining six to the Drosophila subgenus. The

primer sequences are: FVEH 5' GCCTAGCTTCAAGTTTYGTVGANCA 3';

MHGE 5' CAGGTGTATCWTCATCNTCNCCRTGCAT 3'. They were designed

with a modified version of the CodeHop procedure56. We used a hot-start

enzyme (AmpliTaq Gold) with 1 µM of each primer, and the following cycling

conditions: one cycle of 10 min at 94 °C for initial DNA denaturation / activation

of Taq; plus 40 cycles of 50 sec at 94 °C, 2 min at 55 °C, and 1 min at 72 °C;

plus one final cycle of 7 min at 72 °C. We also tested other annealing

temperatures (between 53 °C and 57 °C) and never got bands of the expected

size (380 bp) in D. grimshawi.
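
Because both primers contain IUPAC ambiguity codes, each one is in practice a pool of distinct oligonucleotides. The sketch below counts the size of that pool for the two primer sequences given in the legend. The lookup table encodes the standard IUPAC nucleotide codes; the function name is illustrative and the calculation is not part of the original analysis.

```python
# Number of bases each IUPAC nucleotide code can stand for.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

# Primer sequences as printed in the legend of Supplementary Figure 10 (5' to 3').
PRIMERS = {
    "FVEH": "GCCTAGCTTCAAGTTTYGTVGANCA",
    "MHGE": "CAGGTGTATCWTCATCNTCNCCRTGCAT",
}


def degeneracy(primer):
    """Return how many distinct sequences a degenerate primer represents."""
    total = 1
    for base in primer.upper():
        total *= IUPAC_DEGENERACY[base]
    return total


for name, sequence in PRIMERS.items():
    print(name, len(sequence), "nt, degeneracy", degeneracy(sequence))
```

The degenerate positions sit towards the 3' end of each primer, which is consistent with the consensus-degenerate hybrid design that CODEHOP (ref. 56) is named after.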

Lane labels in the figure (one male ♂ and one female ♀ lane per species): D. bif, D. pro, D. fum, D. ara, D. bro, D. tri, D. rob, D. vir, D. gri.


Supplementary Figure 11


Supplementary Figure 11. Results of 1,000 computer simulations of gene

gain and loss. A) Posterior distribution of the ratio of the rate of gene gain

(genes/Myr) to the rate of gene loss (genes/Myr). The average value is 8.3

(range: 1.3 to 34.5; 95% credibility interval: 1.7 - 22.0). B) Joint posterior

distribution of gain rate and loss rate per gene. The average values are 0.1703

genes / Myr and 0.0034 genes / gene / Myr, respectively. The uniform

distributions used as priors for both parameters had maxima well above the

highest accepted values (prior for gain: 0 - 1.0 genes / Myr; prior for loss: 0 -

0.05 genes / gene / Myr). C) Posterior distribution of the predicted equilibrium

gene number (note the logarithmic scale of the abscissa). The average value is

89 genes (range: 8 to 859). Three out of 1,000 simulations had a predicted

equilibrium gene number below 12 (the present gene number of the D.

melanogaster Y).
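
One way to read the "predicted equilibrium gene number" is sketched below. It assumes (this legend does not state the formula) that genes are gained at a constant rate g and that each gene is lost independently at a per-gene rate l, so that dN/dt = g - l N and the equilibrium is N* = g / l. The function name is illustrative.

```python
def equilibrium_gene_number(gain_rate, loss_rate_per_gene):
    """Equilibrium N* = g / l of dN/dt = g - l * N (assumed model, see lead-in)."""
    if loss_rate_per_gene <= 0:
        raise ValueError("the per-gene loss rate must be positive")
    return gain_rate / loss_rate_per_gene


# Posterior averages quoted in the legend of Supplementary Figure 11.
mean_gain_rate = 0.1703   # genes / Myr
mean_loss_rate = 0.0034   # genes / gene / Myr

n_eq_from_means = equilibrium_gene_number(mean_gain_rate, mean_loss_rate)
print(f"equilibrium gene number from the average rates: {n_eq_from_means:.1f}")
```

Plugging in the two posterior averages gives a value different from the reported average of 89 genes. Under the assumed model this is expected, because the average of a ratio taken over the accepted simulations is not the ratio of the averages.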


Supplementary Tables

Supplementary Table 1. Accession numbers of the genes used in this

study.

| Genes | D. mel | D. ere | D. yak | D. ana | D. pse | D. wil | D. moj | D. vir | D. gri |
|---|---|---|---|---|---|---|---|---|---|
| kl-2 | EU685283 | EU595396 | EU595398 | EU595399 | EU595397 | EU595400 | EU595403 | EU595402 | EU595401 |
| kl-3 | AAG29546 | EU514469 | EU514472 | EU514468 | BK005626 | EU514467 | EU514471 | EU514470 | EU514466 |
| kl-5 | NP001015499 | EU417450 | EU417452 | EU417447 | BK005628 | NA | EU417444 | EU417438 | EU417437 |
| ORY | NP001015498 | BK006456 | BK006457 | BK006455 | AAW23319 | BK006454 | BK006453 | BK006452 | BK006451 |
| PRY | BK006442 | EU362867 | BK006441 | BK006440 | BK006439 | BK006438 | BK006437 | EU362864 | BK006436 |
| Ppr-Y | NP001015502 | BK006434 | BK006435 | BK006433 | AAW23326 | BK006432 | BK006431 | BK006430 | 0 |
| CCY | EU685282 | EU685280 | EU685281 | EU685279 | EU685278 | EU685277 | EU685276 | EU685275 | EU685274 |
| ARY | BK006427 | BK006421 | BK006426 | BK006429 | BK006425 | BK006428 | BK006423 | BK006424 | BK006422 |
| WDY | BK006449 | BK006448 | BK006450 | EU362855 | BK006447 | BK006446 | BK006444 | BK006445 | BK006443 |
| Pp1-Y1 | AAL25117 | BK006412 | BK006413 | BK006411 | NA | BK006410 | 0 | BK006409 | BK006408 |
| Pp1-Y2 | NP001015497 | BK006419 | BK006420 | BK006418 | NA | BK006417 | BK006416 | BK006415 | BK006414 |
| FDY | NA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Sequence data used in this paper are available in DDBJ/EMBL/GenBank as

original sequences and in the Third Party Annotation Section of the

DDBJ/EMBL/GenBank databases under the accession numbers shown in the

Table (81 sequences were first reported in this study). Genes absent from the

genome (Table 1) were labeled with a "0". Unabridged species names (in the

order of appearance) are: D. melanogaster, D. erecta, D. yakuba, D. ananassae,

D. pseudoobscura, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi.

Experimental evidence for the ARY annotation (BK006421-BK006429) was

obtained from D. willistoni and D. ananassae by ARY mRNA sequencing

(GenBank accession numbers EU334136 - EU334138).


Supplementary Table 2. Ka/Ks ratios for the Y-linked genes. These ratios measure the selective constraint in protein-coding regions57. Ratios around 1 imply lack of selection (usually indicating pseudogenes). All ratios are well below 1, and within the range of the majority of Drosophila genes24, which strongly suggests that all genes are functional and are evolving under purifying selection. Calculations were performed at http://services.cbu.uib.no/tools/kaks (ref. 57). Blank cells correspond to absent genes.

| Branch name | kl-2 | kl-3 | kl-5 | ORY | PRY | Ppr-Y | ARY | CCY | WDY | Pp1-Y1 | Pp1-Y2 | FDY | Mean | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E_mel | 0.053 | 0.021 | 0.025 | 0.025 | 0.162 | 0.036 | 0.095 | 0.380 | 0.016 | 0.078 | 0.005 | 0.224 | 0.093 | 0.112 |
| F_yak | 0.055 | 0.042 | 0.041 | 0.344 | 0.238 | 0.158 | 0.248 | 0.548 | 0.110 | 0.014 | 0.043 | | 0.167 | 0.166 |
| F_ere | 0.069 | 0.032 | 0.056 | 0.065 | 0.309 | 0.119 | 0.348 | 0.447 | 0.023 | 0.075 | 0.035 | | 0.143 | 0.150 |
| D_F | 0.105 | 0.036 | 0.061 | 0.060 | 0.311 | 0.080 | 0.186 | 0.425 | 0.047 | 0.151 | 0.026 | | 0.135 | 0.127 |
| C_ana | 0.052 | 0.041 | 0.029 | 0.066 | 0.157 | 0.087 | 0.097 | 0.177 | 0.023 | 0.026 | 0.001 | | 0.069 | 0.056 |
| C_D | 0.093 | 0.045 | 0.044 | 0.078 | 0.221 | 0.144 | 0.176 | 0.322 | 0.029 | 0.064 | 0.013 | | 0.112 | 0.095 |
| obs_C | 0.043 | 0.031 | 0.042 | 0.062 | 0.196 | 0.085 | 0.086 | 0.334 | 0.038 | 0.169 | 0.050 | | 0.103 | 0.094 |
| obs_pse | 0.074 | 0.058 | 0.050 | 0.110 | 0.335 | 0.303 | 0.192 | 0.418 | 0.093 | 0.142 | 0.077 | | 0.168 | 0.127 |
| B_obs | 0.061 | 0.033 | 0.045 | 0.178 | 0.144 | 0.105 | 0.083 | 0.248 | 0.054 | 0.124 | 0.048 | | 0.093 | 0.062 |
| B_wil | 0.125 | 0.077 | 0.061 | 0.110 | 0.283 | 0.151 | 0.293 | 0.493 | 0.064 | 0.121 | 0.052 | | 0.166 | 0.136 |
| I_vir | 0.040 | 0.022 | 0.036 | 0.179 | 0.168 | 0.123 | 0.092 | 0.250 | 0.042 | 0.162 | 0.059 | | 0.107 | 0.075 |
| I_moj | 0.127 | 0.044 | 0.041 | 0.048 | 0.414 | 0.093 | 0.138 | 0.295 | 0.052 | | 0.101 | | 0.135 | 0.124 |
| H_I | 0.226 | 0.139 | 0.121 | 0.219 | 0.319 | 0.123 | 0.047 | 0.229 | 0.044 | | 0.069 | | 0.154 | 0.091 |
| A_H | 0.092 | 0.062 | 0.067 | 0.109 | 0.164 | | 0.048 | 0.251 | 0.092 | 0.147 | 0.120 | | 0.115 | 0.060 |
| H_gri | 0.060 | 0.029 | 0.042 | 0.058 | 0.144 | | 0.159 | 0.404 | 0.092 | 0.277 | 0.078 | | 0.134 | 0.124 |
| A_B | 0.094 | 0.061 | 0.067 | 0.105 | 0.188 | 0.124 | 0.062 | 0.275 | 0.098 | 0.136 | 0.092 | | 0.118 | 0.064 |
| Mean | 0.085 | 0.048 | 0.052 | 0.107 | 0.235 | 0.124 | 0.147 | 0.343 | 0.057 | 0.120 | 0.054 | 0.224 | 0.125 | 0.108 |
| SD | 0.046 | 0.029 | 0.022 | 0.080 | 0.083 | 0.061 | 0.089 | 0.105 | 0.031 | 0.067 | 0.034 | | 0.108 | |
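
The Mean and SD columns of Supplementary Table 2 can be checked from the per-gene values. The sketch below does this for the E_mel row only. It assumes that the SD is the sample standard deviation (n - 1 denominator), which matches the printed value for this row; the variable names are illustrative.

```python
from statistics import mean, stdev

# Ka/Ks values of the E_mel row, in the column order of Supplementary Table 2.
E_MEL_KA_KS = {
    "kl-2": 0.053, "kl-3": 0.021, "kl-5": 0.025, "ORY": 0.025,
    "PRY": 0.162, "Ppr-Y": 0.036, "ARY": 0.095, "CCY": 0.380,
    "WDY": 0.016, "Pp1-Y1": 0.078, "Pp1-Y2": 0.005, "FDY": 0.224,
}

values = list(E_MEL_KA_KS.values())
row_mean = mean(values)
row_sd = stdev(values)  # sample standard deviation (assumption, see lead-in)

print(f"E_mel mean Ka/Ks: {row_mean:.3f}")
print(f"E_mel SD of Ka/Ks: {row_sd:.3f}")
print("all ratios below 1:", all(v < 1 for v in values))
```

For rows with blank cells, only the genes that are present would be passed to `mean` and `stdev`.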


Supplementary Table 3. Original chromosomal location of the 7 gained

genes. Note that with the possible exception of Pp1-Y1 and Pp1-Y2, all genes

were acquired individually by the Y chromosome (as opposed to resulting from

large segmental duplications), since they are not adjacent to each other at their

original autosomal locations.

| Gene | Original chromosome | Syntenic band in D. melanogaster * | Time of gain (branch name) † |
|---|---|---|---|
| CCY | Muller E | 94B6 | A_B |
| ARY | Muller D | 69C4 | A_B |
| kl-5 | Muller E | 97F3 | obs_C |
| WDY | Muller B | 33C1 | obs_C |
| Pp1-Y1 | Muller C | 58A2 | obs_C |
| Pp1-Y2 | Muller C | 58A2 | obs_C |
| FDY | Muller E | 96C1 | E_mel |

* The 7 genes are Y-linked in D. melanogaster and autosomal in several other species. The

column shows the original autosomal locations in these other species, referenced to the D.

melanogaster map. See Supplementary Figures 4 - 8 for detailed synteny information.

† See Supplementary Figure 9 for branch names.


Supplementary Table 4. Quantities used to estimate the unbiased ratio of gene gain to gene loss. Branches were

named according to Supplementary Fig. 9 (e.g., the branch connecting nodes D and F is named "D_F"). See text and table

footnotes for details. The Supplementary Data file analytical.xls implements these calculations.

| Branch name | Branch length (Myr) | Number of genes | Observed losses | Observed exposure | Inferred new genes gained in the branch* | Inferred new genes from previous branches* | Inferred effective number of genes† | Inferred losses‡ | Inferred exposure§ |
|---|---|---|---|---|---|---|---|---|---|
| G_sim | 2 | 11 | 0 | 22 | 0.240 | 0.4073 | 0.5273 | 0.0011 | 1.055 |
| G_sec | 2 | 11 | 0 | 22 | 0.240 | 0.4073 | 0.5273 | 0.0011 | 1.055 |
| E_G | 3.4 | 11 | 0 | 37.4 | 0.408 | 0.0000 | 0.2040 | 0.0007 | 0.694 |
| F_yak | 10.4 | 11 | 0 | 114.4 | 1.248 | 0.2757 | 0.8997 | 0.0096 | 9.357 |
| F_ere | 10.4 | 11 | 0 | 114.4 | 1.248 | 0.2757 | 0.8997 | 0.0096 | 9.357 |
| D_F | 2.3 | 11 | 0 | 25.3 | 0.276 | 0.0000 | 0.1380 | 0.0003 | 0.317 |
| C_ana | 44.2 | 11 | 0 | 486.2 | 5.304 | 0.0000 | 2.6520 | 0.1203 | 117.218 |
| B_wil | 62.2 | 7 | 0 | 435.4 | 7.464 | 0.0000 | 3.7320 | 0.2382 | 232.130 |
| I_vir | 32.5 | 5 | 0 | 162.5 | 3.900 | 3.5914 | 5.5414 | 0.1848 | 180.095 |
| I_moj | 32.5 | 5 | 1 | 162.5 | 3.900 | 3.5914 | 5.5414 | 0.1848 | 180.095 |
| H_I | 10.4 | 5 | 0 | 52 | 1.248 | 2.3754 | 2.9994 | 0.0320 | 31.194 |
| A_H | 20 | 5 | 0 | 100 | 2.400 | 0.0000 | 1.2000 | 0.0246 | 24.000 |
| H_gri | 42.9 | 5 | 1 | 214.5 | 5.148 | 2.3754 | 4.9494 | 0.2178 | 212.328 |
| TOTAL | 275.2 | | 2 | 1948.6 | 33.024 | | | 1.0249 | 998.893 |

* Estimated as gain rate × branch length, from the current (column 6) or previous branches (column 7).

† Estimated as 0.5 × column 6 + column 7.

‡ Estimated as loss rate per gene × branch length × column 8.

§ Estimated as branch length × column 8.
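
The footnotes of Supplementary Table 4 define each inferred column in terms of the others, so a single row can be recomputed as a check. The sketch below does this for two rows. Two inputs are not given in the table and are back-calculated here as assumptions: a gain rate of 0.12 genes / Myr (column 6 divided by branch length, which is the same for every row) and a loss rate per gene taken as the TOTAL of column 9 divided by the TOTAL of column 10. The values actually used by the authors are defined in the Supplementary Methods and in analytical.xls. Function and key names are illustrative.

```python
# Back-calculated inputs (assumptions, see lead-in).
GAIN_RATE = 0.12                      # genes / Myr, column 6 / branch length
LOSS_RATE_PER_GENE = 1.0249 / 998.893 # TOTAL inferred losses / TOTAL inferred exposure


def table4_row(branch_length, n_genes, new_from_previous):
    """Recompute the derived columns of one row of Supplementary Table 4."""
    observed_exposure = branch_length * n_genes                 # column 5
    new_in_branch = GAIN_RATE * branch_length                   # column 6
    effective_genes = 0.5 * new_in_branch + new_from_previous   # column 8
    inferred_exposure = branch_length * effective_genes         # column 10
    inferred_losses = LOSS_RATE_PER_GENE * inferred_exposure    # column 9
    return {
        "observed_exposure": observed_exposure,
        "new_in_branch": new_in_branch,
        "effective_genes": effective_genes,
        "inferred_losses": inferred_losses,
        "inferred_exposure": inferred_exposure,
    }


# Column 7 (new genes from previous branches) is taken from the table as an input.
c_ana = table4_row(branch_length=44.2, n_genes=11, new_from_previous=0.0)
i_vir = table4_row(branch_length=32.5, n_genes=5, new_from_previous=3.5914)

for name, row in (("C_ana", c_ana), ("I_vir", i_vir)):
    print(name, {key: round(value, 4) for key, value in row.items()})
```

The factor 0.5 in column 8 treats genes gained within a branch as being present, on average, for half of that branch.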


Supplementary Table 5. FlyBase gene names. For the sake of simplicity, in Fig. 2, Supplementary Fig. 2 and Supplementary

Figures 4 to 8, we labelled the genes in the other species (columns 2 to 9) using the names of their D. melanogaster orthologs (column

1). The table shows the official names of these genes. A blank cell means that the gene was not cited in this paper. The official names of

Anopheles genes (Fig. 2) are: kl-5, AGAP001672; CG6599, AGAP001673; side, AGAP001674. The Anopheles ortholog of CG13980 was

not found in the corresponding genome and was annotated with GeneWise using the WGS sequence AAAB01008987.1 (from positions

8391834 to 8381834).

| D. mel | D. ere | D. yak | D. ana | D. pse | D. wil | D. moj | D. vir | D. gri |
|---|---|---|---|---|---|---|---|---|
| vih | Dere_GLEANR_15642 | Dyak_GLEANR_5629 | Dana_GLEANR_10384 | GA10491 | Dwil_GLEANR_12927 | Dmoj_GLEANR_12283 | Dvir_GLEANR_11523 | Dgri_GLEANR_14567 |
| Tom70 | Dere_GLEANR_10223 | | | | | | | |
| sti | Dere_GLEANR_14049 | Dyak_GLEANR_3961 | Dana_GLEANR_8768 | | Dwil_GLEANR_12916 | | Dvir_GLEANR_13675 | Dgri_GLEANR_16504 |
| side | Dere_GLEANR_12163 | Dyak_GLEANR_10477 | Dana_GLEANR_19966 | GA15977 | Dwil_GELANR_11556 | Dmoj_GLEANR_8197 | Dvir_GLEANR_8015 | Dgri_GLEANR_839 |
| rab3-GAP | Dere_GLEANR_10217 | Dyak_GLEANR_1271 | Dana_GLEANR_826 | GA20070 | Dwil_GLEANR_3923 | Dmoj_GLEANR_1564 | Dvir_GLEANR_931 | Dgri_GLEANR_10153 |
| prd | Dere_GLEANR_10215 | | | | | | | |
| PpN58A | | | | | | Dmoj_GLEANR_5675 | | |
| PpD6 | | | | GA25002 | Dwil_GLEANR_19931 | Dmoj_GLEANR_16157 | Dvir_GLEANR_7086 | Dgri_GLEANR_5320 |
| Pp13C | | | | | | Dmoj_GLEANR_3672 | | |
| Or67a | | | Dana_GLEANR_825 | | | | | |
| mRpS2 | | | Dana_GLEANR_498 | | | | | |
| mRpL45 | Dere_GLEANR_11278 | Dyak_GLEANR_10254 | Dana_GLEANR_20122 | GA11976 | Dwil_GLEANR_11528 | Dmoj_GLEANR_7481 | Dvir_GLEANR_10749 | Dgri_GLEANR_674 |
| lox2 | Dere_GLEANR_5517 | Dyak_GLEANR_13866 | Dana_GLEANR_11663 | | | | | |
| loco | Dere_GLEANR_12514 | Dyak_GLEANR_7755 | Dana_GLEANR_17412 | GA18761 | Dwil_GLEANR_11960 | Dmoj_GLEANR_10439 | Dvir_GLEANR_10249 | Dgri_GLEANR_243 |
| JhI-21 | Dere_GLEANR_10218 | Dyak_GLEANR_1272 | Dana_GLEANR_827 | GA11552 | Dwil_GLEANR_3893 | Dmoj_GLEANR_1565 | Dvir_GLEANR_932 | Dgri_GLEANR_10154 |
| Dlc90F | | | | | | | Dvir_GLEANR_10253 | |
| Dhc93AB | Dere_GLEANR_905 | Dyak_GLEANR_8668 | Dana_GLEANR_19783 | GA17641 | Dwil_GLEANR_12312 | Dmoj_GLEANR_10249 | Dvir_GLEANR_10108 | Dgri_GLEANR_751 |
| CycC | | | | | | | Dvir_GLEANR_10744 | |
| CG9492 | Dere_GLEANR_2240 | Dyak_GLEANR_8418 | Dana_GLEANR_19657 | GA21828 | Dwil_GLEANR_12148 | Dmoj_GLEANR_8658 | Dvir_GLEANR_10212 | Dgri_GLEANR_3612 |
| CG9284 | | Dyak_GLEANR_12495 | | | | | | |
| CG9068 | Dere_GLEANR_10385 | Dyak_GLEANR_1452 | Dana_GLEANR_403 | GA27740 | Dwil_GLEANR_8422 | Dmoj_GLEANR_1489 | Dvir_GLEANR_16700 | Dgri_GLEANR_10062 |
| CG7265 | | | | | | | Dvir_GLEANR_10743 | |
| CG7126 | | | | | | | Dvir_GLEANR_10741 | |
| CG6792 | Dere_GLEANR_10291 | Dyak_GLEANR_1273 | Dana_GLEANR_828 | GA19865 | | Dmoj_GLEANR_1566 | Dvir_GLEANR_933 | Dgri_GLEANR_9575 |
| CG6785 | Dere_GLEANR_10221 | Dyak_GLEANR_1275 | Dana_GLEANR_16259 | GA19861 | | Dmoj_GLEANR_1568 | Dvir_GLEANR_935 | |
| CG6770 | Dere_GLEANR_10220 | Dyak_GLEANR_1274 | Dana_GLEANR_829 | GA19852 | | Dmoj_GLEANR_1567 | | |
| CG6766 | Dere_GLEANR_10222 | Dyak_GLEANR_1276 | Dana_GLEANR_16260 | GA19848 | | Dmoj_GLEANR_1569 | | |
| CG6599 | Dere_GLEANR_11633 | Dyak_GLEANR_7495 | Dana_GLEANR_17557 | GA19712 | Dwil_GELANR_11932 | Dmoj_GLEANR_8195 | Dvir_GLEANR_9885 | Dgri_GLEANR_90 |
| CG6059 | Dere_GLEANR_10496 | Dyak_GLEANR_12184 | Dana_GLEANR_7487 | GA19330 | Dwil_GLEANR_11984 | Dmoj_GLEANR_9454 | Dvir_GLEANR_8437 | Dgri_GLEANR_2986 |
| CG5317 | Dere_GLEANR_8552 | Dyak_GLEANR_2353 | Dana_GLEANR_500 | GA18800 | Dwil_GLEANR_3891 | Dmoj_GLEANR_1865 | Dvir_GLEANR_1913 | Dgri_GLEANR_11449 |


Supplementary Table 5 (continuation).

| D. mel | D. ere | D. yak | D. ana | D. pse | D. wil | D. moj | D. vir | D. gri |
|---|---|---|---|---|---|---|---|---|
| CG5284 | | | | | | Dmoj_GLEANR_13027 | Dvir_GLEANR_13682 | Dgri_GLEANR_16509 |
| CG5241 | | | | | | Dmoj_GLEANR_12281 | Dvir_GLEANR_11521 | Dgri_GLEANR_14565 |
| CG4386 | Dere_GLEANR_5515 | Dyak_GLEANR_13864 | Dana_GLEANR_11660 | GA18150 | Dwil_GLEANR_15903 | Dmoj_GLEANR_5682 | Dvir_GLEANR_6022 | Dgri_GLEANR_4997 |
| CG4377 | | Dyak_GLEANR_13860 | | | | | | |
| CG4363 | Dere_GLEANR_5512 | Dyak_GLEANR_13861 | | | | | | |
| CG3731 | | | | | | | Dvir_GLEANR_10252 | |
| CG34040 | Dere_GLEANR_5513 | Dyak_GLEANR_13862 | Dana_GLEANR_11658 | | Dwil_GLEANR_19918 | Dmoj_GLEANR_5680 | Dvir_GLEANR_6020 | Dgri_GLEANR_4995 |
| CG34029 | Dere_GLEANR_5506 | Dyak_GLEANR_13855 | Dana_GLEANR_11662 | | Dwil_GLEANR_15892 | Dmoj_GLEANR_5670 | Dvir_GLEANR_6012 | Dgri_GLEANR_4988 |
| CG3348 | Dere_GLEANR_12166 | Dyak_GLEANR_10480 | Dana_GLEANR_19969 | GA17395 | Dwil_GLEANR_11560 | Dmoj_GLEANR_9706 | Dvir_GLEANR_8018 | Dgri_GLEANR_842 |
| CG3339 | Dere_GLEANR_12165 | Dyak_GLEANR_10479 | Dana_GLEANR_19968 | GA17389 | Dwil_GLEANR_11559 | Dmoj_GLEANR_9705 | Dvir_GLEANR_8017 | Dgri_GLEANR_841 |
| CG33332 | | | | | | | Dvir_GLEANR_10251 | |
| CG33331 | | | | | | | Dvir_GLEANR_10250 | |
| CG33120 | | | | | | Dmoj_GLEANR_1863 | Dvir_GLEANR_1911 | Dgri_GLEANR_11447 |
| CG3300 | Dere_GLEANR_12164 | Dyak_GLEANR_14478 | Dana_GLEANR_19967 | GA17383 | Dwil_GELANR_11557 | Dmoj_GLEANR_9704 | Dvir_GLEANR_8016 | Dgri_GLEANR_840 |
| CG31161 | Dere_GLEANR_11277 | Dyak_GLEANR_10253 | Dana_GLEANR_20121 | | | | | |
| CG31159 | Dere_GLEANR_11276 | Dyak_GLEANR_10120 | Dana_GLEANR_20120 | GA16055 | Dwil_GLEANR_11527 | | Dvir_GLEANR_10746 | Dgri_GLEANR_672 |
| CG31158 | Dere_GLEANR_11274 | Dyak_GLEANR_10250 | Dana_GLEANR_20118 | GA16054 | Dwil_GLEANR_11526 | Dmoj_GLEANR_7478 | | Dgri_GLEANR_669 |
| CG31156 | Dere_GLEANR_12515 | Dyak_GLEANR_7756 | Dana_GLEANR_17414 | GA16052 | Dwil_GLEANR_11962 | Dmoj_GLEANR_10440 | | Dgri_GLEANR_244 |
| CG30048 | Dere_GLEANR_5102 | Dyak_GLEANR_12705 | Dana_GLEANR_13612 | | | Dmoj_GLEANR_2385 | Dvir_GLEANR_2496 | Dgri_GLEANR_10969 |
| CG18735 | Dere_GLEANR_5516 | Dyak_GLEANR_13865 | Dana_GLEANR_11661 | GA15058 | | Dmoj_GLEANR_5683 | Dvir_GLEANR_6023 | Dgri_GLEANR_4998 |
| CG18600 | | | | | | | Dvir_GLEANR_10642 | |
| CG17078 | | | | | Dwil_GLEANR_3894 | | | |
| CG14947 | Dere_GLEANR_8854 | Dyak_GLEANR_2355 | Dana_GLEANR_501 | GA13374 | | Dmoj_GLEANR_1867 | Dvir_GLEANR_1915 | Dgri_GLEANR_11450 |
| CG14946 | Dere_GLEANR_10216 | Dyak_GLEANR_1270 | Dana_GLEANR_824 | GA13373 | Dwil_GLEANR_3890 | Dmoj_GLEANR_1562 | Dvir_GLEANR_930 | Dgri_GLEANR_10152 |
| CG14945 | Dere_GLEANR_8551 | Dyak_GLEANR_2352 | Dana_GLEANR_499 | GA13372 | Dwil_GLEANR_3922 | Dmoj_GLEANR_1864 | Dvir_GLEANR_1912 | Dgri_GLEANR_11448 |
| CG14264 | Dere_GLEANR_11634 | Dyak_GLEANR_7494 | Dana_GLEANR_22280 | GA12867 | | | | |
| CG13980 | Dere_GLEANR_11635 | Dyak_GLEANR_7496 | Dana_GLEANR_17556 | GA12671 | Dwil_GELANR_11933 | Dmoj_GLEANR_8196 | Dvir_GLEANR_9886 | Dgri_GLEANR_91 |
| CG13843 | Dere_GLEANR_11274 | Dyak_GLEANR_10251 | Dana_GLEANR_20119 | GA12565 | Dwil_GLEANR_20016 | | | Dgri_GLEANR_670 |
| CG13492 | Dere_GLEANR_5514 | Dyak_GLEANR_13863 | Dana_GLEANR_11659 | GA12325 | Dwil_GLEANR_15900 | Dmoj_GLEANR_5681 | Dvir_GLEANR_6021 | Dgri_GLEANR_4996 |
| CG13125 | Dere_GLEANR_8766 | Dyak_GLEANR_1013 | Dana_GLEANR_14911 | GA12063 | Dwil_GLEANR_8916 | Dmoj_GLEANR_948 | Dvir_GLEANR_693 | Dgri_GLEANR_10378 |
| CG13075 | | | | | | | | Dgri_GLEANR_16508 |
| CG11927 | | | Dana_GLEANR_830 | | | | | |
| CG11926 | | | Dana_GLEANR_831 | | | | | |
| CG10681 | Dere_GLEANR_15641 | Dyak_GLEANR_5628 | Dana_GLEANR_10383 | GA10490 | Dwil_GLEANR_12826 | Dmoj_GLEANR_12282 | Dvir_GLEANR_11522 | Dgri_GLEANR_14566 |
| CG10660 | | | Dana_GLEANR_8774 | GA10475 | Dwil_GLEANR_12922 | | | |
| CG10657 | Dere_GLEANR_14056 | Dyak_GLEANR_3965 | Dana_GLEANR_8773 | GA10472 | Dwil_GLEANR_12921 | | | |
| CG10654 | Dere_GLEANR_14050 | Dyak_GLEANR_3962 | Dana_GLEANR_8769 | | Dwil_GLEANR_12917 | Dmoj_GLEANR_13021 | Dvir_GLEANR_13676 | Dgri_GLEANR_16505 |
| CG10646 | Dere_GLEANR_14051 | Dyak_GLEANR_3963 | Dana_GLEANR_8770 | GA10465 | Dwil_GLEANR_12918 | Dmoj_GLEANR_13022 | Dvir_GLEANR_13677 | Dgri_GLEANR_16506 |
| CG10638 | | Dyak_GLEANR_3964 | Dana_GLEANR_8772 | GA10458 | | Dmoj_GLEANR_13023 | Dvir_GLEANR_13678 | |
| CG12636 | Dere_GLEANR_8992 | Dyak_GLEANR_852 | | | Dwil_GLEANR_18646 | | | |


Supplementary Notes

31. Aitken, R. J. & Graves, J. A. M., Human spermatozoa: The future of sex. Nature 415, (2002).

32. Zhou, Q. et al., On the origin of new genes in Drosophila. Genome Res 18, 1446-1455 (2008).

33. Gonzalez, J., Casals, F. & Ruiz, A., Duplicative and conservative transpositions of larval serum protein 1 genes in the genus Drosophila. Genetics 168, 253-264 (2004).

34. Przyborowski, J. & Wilenski, H., Homogeneity of results in testing samples from Poisson series: with an application to testing clover seed for dodder. Biometrika 31, 313-323 (1940).

35. Krishnamoorthy, K. & Thomson, J., A more powerful test for comparing two Poisson means. J. Stat. Plan. Inf. 119, 23-35 (2004).

36. Bonaccorsi, S. & Lohe, A., Fine mapping of satellite DNA sequences along the Y chromosome of Drosophila melanogaster: relationships between satellite sequences and fertility factors. Genetics 129, 177-189 (1991).

37. Matsunaga, S. et al., Duplicative transfer of a MADS box gene to a plant Y chromosome. Mol Biol Evol 20, 1062-1069 (2003).

38. Fisher, R. A., The evolution of dominance. Biological Reviews 6, 1 (1931).

39. Lynch, M. & Katju, V., The altered evolutionary trajectories of gene duplicates. Trends in Genetics 20, 544-549 (2004).

40. Usakin, L. A., Kogan, G. L., Kalmykova, A. I. & Gvozdev, V. A., An alien promoter capture as a primary step of the evolution of testes-expressed repeats in the Drosophila melanogaster genome. Mol Biol Evol 22, 1555-1560 (2005).

41. Aravin, A. A. et al., Double-stranded RNA-mediated silencing of genomic tandem repeats and transposable elements in the D. melanogaster germline. Curr Biol 11, 1017-1027 (2001).

42. Belloni, M., Tritto, P., Bozzetti, M. P., Palumbo, G. & Robbins, L. G., Does Stellate cause meiotic drive in Drosophila melanogaster? Genetics 161, 1551-1559 (2002).

43. Hamilton, W. D., Extraordinary sex-ratios. Science 156, 477-488 (1967).

44. McCullagh, P. & Nelder, J. A., Generalized linear models, 2nd ed. (Chapman and Hall, London; New York, 1989).

45. R Development Core Team, R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, 2007).

46. Tavare, S., Balding, D. J., Griffiths, R. C. & Donnelly, P., Inferring coalescence times from DNA sequence data. Genetics 145, 505-518 (1997).


47. Beaumont, M. A., Zhang, W. & Balding, D. J., Approximate Bayesian computation in population genetics. Genetics 162, 2025-2035 (2002).

48. Przeworski, M., Estimating the time since the fixation of a beneficial allele. Genetics 164, 1667-1676 (2003).

49. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R., A tool for analyzing and annotating genomic sequences. Genomics 46, 37-45 (1997).

50. Birney, E. & Durbin, R., Using GeneWise in the Drosophila annotation experiment. Genome Res 10, 547-548 (2000).

51. Lewis, S. E. et al., Apollo: a sequence annotation editor. Genome Biol. 3, R82 (2002).

52. Parra, G. et al., Comparative gene prediction in human and mouse. Genome Res 13, 108-117 (2003).

53. Thompson, J. D., Higgins, D. G. & Gibson, T. J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680 (1994).

54. Tamura, K., Dudley, J., Nei, M. & Kumar, S., MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24, 1596-1599 (2007).

55. Sudhir Kumar, personal communication.

56. Rose, T. M., Henikoff, J. G. & Henikoff, S., CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primer) PCR primer design. Nucleic Acids Res 31, 3763-3766 (2003).

57. Liberles, D. A., Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Mol Biol Evol 18, 2040-2047 (2001).


Supplementary Data

analytical.xls This MS Excel file implements the analytical treatment of the

ascertainment bias, and estimates the unbiased ratio of the gain rate to the loss rate, as

described in the Supplementary Methods (section 1).

indelsim_free.R This is a program written in the R language that implements the computer

simulations of gain and loss of genes in the Y chromosome (across the 12 species), as

described in the Supplementary Methods (section 2). It produces approximate Bayesian

estimates of the posterior densities of the rates of gene gain and loss. The run time is

approximately two days on a 2 GHz dual-core computer.

gains_losses_script.R This is a program written in the R language that implements the

Poisson regression that tests the statistical significance of the gene gain / gene loss ratio.

gains_losses_data.txt This is the data file used by the gains_losses_script.R program.
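
The supplied script tests the gain / loss ratio with a Poisson regression in R. As a rough, independent check, the sketch below applies a different and simpler technique to the raw counts only: the exact conditional test for two Poisson rates, usually attributed to Przyborowski & Wilenski (ref. 34). Given the total number of events, the number of gains is binomial under the null hypothesis of equal rates, with success probability equal to the gain share of the total branch length. This is not the authors' script; it ignores the bias corrections, and the function name is illustrative.

```python
from math import comb


def conditional_poisson_test(k1, t1, k2, t2):
    """One-sided exact conditional test that rate 1 exceeds rate 2.

    k1, k2: event counts; t1, t2: exposures (here, branch lengths in Myr).
    Under equal rates, k1 given k1 + k2 is Binomial(k1 + k2, t1 / (t1 + t2)).
    Returns P(X >= k1) under that binomial distribution.
    """
    n = k1 + k2
    p0 = t1 / (t1 + t2)
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(k1, n + 1))


# Raw counts and branch lengths from the legend of Supplementary Figure 9.
p_value = conditional_poisson_test(k1=7, t1=62.9, k2=2, t2=275.2)
print(f"one-sided p-value (raw counts): {p_value:.2e}")
```

With the raw counts the p-value is well below 0.001, but the bias-corrected tests in the Supplementary Methods (section 1.5) are the relevant ones for the paper's conclusions.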
