Supplementary Information Table of Contents
SUPPLEMENTARY DISCUSSIONS
Evidence for the loss of Pp1-Y1 in D. mojavensis and of Ppr-Y in D. grimshawi.
Comparison between the gene movements in the Y with the other chromosomes.
Possible explanations for the gene gains in the Y chromosome.
SUPPLEMENTARY METHODS
1. Analytical treatment of the ascertainment bias in the ratio gene gain / gene loss.
1.1. Basic data and assumptions.
1.2. Bias in the loss rate caused by the outgroup-specific genes.
1.3. Bias caused by unknown D. melanogaster Y-linked genes.
1.4. General analytical model for bias correction.
1.5. Statistical tests for the difference between the rates of gene gain and loss.
2. Computer simulations and approximate Bayesian estimates of gene gain and gene loss.
SUPPLEMENTARY FIGURES AND LEGENDS
Supplementary Figure 1. Assembly problems of Y-linked genes.
Supplementary Figure 2. Gene orthology confirmation by phylogeny.
Supplementary Figure 3. PCR test of Y-linkage of the ARY gene.
Supplementary Figure 4. Synteny analysis of the kl-5 gene (all species).
Supplementary Figure 5. Synteny analysis of the WDY gene.
Supplementary Figure 6. Synteny analysis of the Pp1-Y1 and Pp1-Y2 genes.
Supplementary Figure 7. Synteny analysis of the ARY gene.
Supplementary Figure 8. Synteny analysis of the CCY gene.
Supplementary Figure 9. Estimating gene gain and loss in the Y chromosome.
Supplementary Figure 10. Experimental confirmation of the loss of the Ppr-Y gene in D. grimshawi.
Supplementary Figure 11. Results of 1,000 computer simulations of gene gain and loss.
SUPPLEMENTARY TABLES
Supplementary Table 1. Accession numbers of the genes used in this study.
Supplementary Table 2. Ka/Ks ratios for the Y-linked genes.
Supplementary Table 3. Original chromosomal location of the 7 gained genes.
Supplementary Table 4. Quantities used to estimate the unbiased ratio of gene gain to gene loss.
Supplementary Table 5. FlyBase gene names.
SUPPLEMENTARY NOTES
SUPPLEMENTARY INFORMATION
doi: 10.1038/nature07463
www.nature.com/nature
Supplementary Discussions
Evidence for the loss of Pp1-Y1 in D. mojavensis and of Ppr-Y in D. grimshawi.
We could not find these genes in the assembled genomes. Blast searches detected
similar sequences that, after phylogenetic analysis, proved to be paralogous genes in
both cases (not shown). We also searched for these genes in the raw traces, but not a
single trace was found (i.e., all traces we found belong to paralogs). Thus, either they
were lost in the corresponding lineages, or the entire sequence of both genes fell in
sequence gaps. In the case of the D. mojavensis Pp1-Y1 a sequence gap is very unlikely
because the gene is located in a conserved autosomal position in all species (except in
the melanogaster group, where it is Y-linked; Supplementary Fig. 6), and there is no gap
in this region in the assembled D. mojavensis genome. Regarding Ppr-Y in D.
grimshawi, we experimentally confirmed its loss with degenerate PCR (Supplementary
Fig. 10). Interestingly, data from D. melanogaster show that Ppr-Y is not an essential
gene20, so its loss can probably be tolerated.
Comparison between the gene movements in the Y with the other chromosomes.
The Y chromosome seems to be a very inhospitable environment for genes due to
its heterochromatic state, sex-limited expression and inheritance, lack of recombination,
and smaller effective population size, and hence one might expect increased gene losses
and reduced gene gains when compared to other chromosomes. Three recent Drosophila
studies provide particularly interesting comparisons with our data.
Bachtrog and coworkers10 studied part of the neo-Y chromosome of D. miranda
and found that 55 out of 118 genes became pseudogenes in ~1 Myr, implying a nominal
rate of gene loss of 0.47 genes / gene / Myr. Such massive gene losses were also
observed in the mammalian Y16,31 and seem to be characteristic of the "standard
pathway" for the origin of Y chromosomes, when an ordinary chromosome becomes
male-restricted and loses recombination8,9. The rate of gene loss we measured in the
Drosophila Y (0.001026 genes / gene / Myr) is nearly 500-fold smaller. This huge
difference certainly reflects the different evolutionary histories of the genes: the
Drosophila Y-linked genes were first acquired from autosomes, and hence were already
"adapted" (and perhaps suited) to the harsh environment of the Y-chromosome, whereas
the neo-Y genes of D. miranda were a more or less random sample of genes, suddenly
caught in this environment by a Y-autosome fusion.
The rate of gene gain between D. melanogaster and D. yakuba was measured by
Zhou and collaborators32 and averaged ~ 8 genes / genome / Myr, which translates to
~1.6 genes / large arm / Myr. Bhutkar and collaborators6 found a high confidence set of
514 genes that moved to different locations in the 12 sequenced species, leaving or not a
copy at the original place (being called "duplicative transpositions" and "conservative
transpositions", respectively; ref 33). The actual number of relocated genes could be
higher due to the fact that only phylogenetically consistent cases were considered.
Though it is difficult to measure gene loss from these data (which is further complicated
by the possibility that a missing gene is actually an assembly gap), these 514
"positionally relocated genes" certainly imply 514 gene gains, which happened in a
divergence time of ~ 375 Myr6,24. The average gain rate of ~ 1.37 genes / large arm /
Myr is very close to Zhou and collaborators' estimate32, and both are one order of
magnitude higher than our estimate for the Y of 0.12 genes / Myr (P < 10^-5; two-tailed
exact test for the ratio of two Poisson means34,35).
Although the Drosophila Y is a large chromosome in most species (41 Mbp in D.
melanogaster, whereas the large chromosome arms have ~ 25-40 Mbp), its smaller rate
of gene gain is not surprising, given the following factors: (i) The Y is entirely
heterochromatic, and euchromatic genes frequently are "silenced" when inserted in
heterochromatic regions. Thus, all else being equal, the expected numbers of gene gains
in similarly sized blocks of heterochromatin and euchromatin are not equal (the former
being smaller); (ii) 80% of the Y chromosome is composed of satellite DNA that cannot
harbor functional genes36, so, again, the "effective size" of the Y is much smaller than its
physical size; (iii) the Y chromosome has a smaller population size (less than 1/4 that of
autosomes), which results in a smaller absolute number of mutations (gene insertions in
this case) available for fixation; (iv) female-related genes, as well as genes required in
both sexes, cannot move to the Y.
Regarding the mechanism, retrotranspositions accounted for 10% to 24% of all
gene gains in the euchromatin6,25,32 . We found seven gene gains in the Y, but given that
two genes are intronless in their original locations (Pp1-Y1 and Pp1-Y2), in only five
events could retrotranspositions have been detected. In all five of these gains intron
positions were fully conserved, ruling out retrotransposition. However, the sample is too
small to allow any conclusion regarding the prevalence of retrotranspositions in the Y
chromosome.
Possible explanations for the gene gains in the Y chromosome.
There are two interconnected aspects of this problem. First, why does the Y
chromosome gain genes in the first place, given its "restrictive" characteristics
(heterochromatic state, lack of recombination, sex-limited expression and inheritance,
etc.)? Second, why are most or all these genes male-related?
Regarding the first question, the empirical evidence shows that Y chromosomes do
acquire genes, in a diverse range of organisms such as Drosophila, mammals16 and
plants37. However, the rate of gene gain in the Drosophila Y is one order of magnitude
lower than that of the other chromosomes of comparable size (Supplementary Discussion). So it
seems that the restrictive conditions mentioned above are not strong enough to totally
suppress gene gains, but they do reduce their rate.
Regarding the second question, it is widely assumed that the concentration of male
genes on the Y results from natural selection, a view that traces back to R. A. Fisher38.
The rationale is that male-female antagonistic effects of genes may hamper the evolution
of male-related traits, unless they are located in a male-specific region of the genome.
Hence there would be positive selection for Y-linkage of male-related genes. In recent
years M. Lynch and co-workers39 suggested that chance events (random drift) play a
large role in the fate of duplicated genes; i.e., natural selection may not be the main
evolutionary force driving genome organization. Extension of these ideas to the Y
chromosome leads to a quite different explanation of the concentration of male-genes on
the Y. Suppose that after a gene duplication to the Y chromosome there is, say, an 80%
chance of degeneration of the Y-linked copy, and a 20% chance of degeneration of the
autosomal copy. Given this, a male-specific gene would be "transferred" to the Y in 20%
of the duplications (because the autosomal copy would be lost). However, in the case of
a Y-duplication of a female-specific or house-keeping gene, there will be selection for
maintaining the original (autosomal) copy, because females need the gene. The most
probable result is the loss of the Y copy (or, in some cases, its specialization to a new
function). The net effect is the accumulation of male-related genes in the Y (as we
indeed observe), but not resulting from positive selection for Y-linkage. Our data do
not provide any clue as to which force (natural selection or genetic drift) plays the major
role. However, as we briefly mentioned in the main text, the Y-linked Suppressor of
Stellate [ Su(Ste) ] locus may provide an example of positive selection. This multi-copy
gene was acquired by the Y in the D. melanogaster lineage, after the split from D.
simulans40. The sole known function of Su(Ste) is to repress (via RNAi41) the X-linked
gene Stellate. It has been suggested that Stellate distorts the X-Y segregation in favor of
the X (i.e., it is a meiotic driver gene), and that Su(Ste) evolved as a response, in a sort of
evolutionary arms race30 (but see ref 42). Meiotic drive creates a strong "evolutionary
prize" for suppressors, particularly for those located on the targeted chromosome43. If
Su(Ste) is indeed a suppressor of X-Y meiotic drive, then its Y-linkage almost certainly
resulted from selection.
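The drift scenario described above (80% chance that the Y-linked copy degenerates, 20% chance that the autosomal copy does) can be illustrated with a toy simulation. This is only a sketch of the verbal argument: the function name, the number of duplications and the fraction of male-specific genes among them are hypothetical values chosen for illustration and do not come from the data.

```python
import random

def simulate_transfers(n_duplications=10_000, frac_male_specific=0.3,
                       p_autosomal_loss=0.2, seed=0):
    """Count genes that end up Y-only after a duplication to the Y (toy model).

    frac_male_specific is a hypothetical share of male-specific genes among
    the duplications; p_autosomal_loss is the 20% figure used in the text.
    """
    rng = random.Random(seed)
    transferred = {"male-specific": 0, "other": 0}
    for _ in range(n_duplications):
        if rng.random() < frac_male_specific:
            # Either copy may degenerate; with probability 0.2 the autosomal
            # copy is the one lost, so the gene becomes Y-linked only.
            if rng.random() < p_autosomal_loss:
                transferred["male-specific"] += 1
        # Female-specific or house-keeping gene: selection in females keeps
        # the autosomal copy, so the Y copy is lost and nothing is transferred.
    return transferred

result = simulate_transfers()
print(result)
```

Under these assumptions every gene that ends up exclusively on the Y is male-specific, although no selection for Y-linkage was modelled.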
Supplementary Methods
1. Analytical treatment of the ascertainment bias in the ratio gene gain / gene loss.
In order to investigate the trend in the number of Y-linked genes (is it increasing,
decreasing, or at a steady-state equilibrium between gain and loss?), we need an unbiased
estimate of either the difference or the ratio of two quantities, gain rate (7 gains in 62.9
Myr; raw value: 0.1113 genes / Myr) and loss rate (2 losses in 275.2 Myr; raw value:
0.00727 genes / Myr; raw gain/loss ratio: 15.3). As explained below, this raw estimate
of the gain/loss ratio is biased because of the way the gain and loss events were
ascertained. The main bias is in the loss rate, and a minor one affects the gain rate. Here
we show how to correct them. After defining the nomenclature we used (below), in
section 1.1. we specify three quantities and one assumption used to derive the unbiased
ratio of gain to loss. In section 1.2. we present the correction of the bias, and in section
1.3. we show that unknown D. melanogaster Y-linked genes do not cause a bias in the
ratio of gain to loss. Finally, in section 1.4. we derive a general model for bias
correction, needed for the formal statistical test of equality of gain and loss rates. This
bias correction will no longer be necessary when the knowledge about the Y
chromosome of the other Drosophila species becomes equivalent to D. melanogaster.
We are carrying out such direct searches for Y-linked genes in the other Drosophila
species, but we should note that heterochromatic regions (including the Y) are
notoriously refractory to genomic studies11,12, so the task requires considerable effort and
time.
A simplified scheme of our "melanogaster-centric" data is shown below, where
MS is the number of melanogaster-specific Y-linked genes, MA (melanogaster-ancestral)
is the number of genes acquired before the split between the melanogaster lineage and
the outgroup, and MT (melanogaster total) is equal to MS plus MA. Since we do not
know the full gene set of D. melanogaster, we use the subscripts K ("known"), U
("unknown") and R ("real"). Of course, MS_K + MS_U = MS_R. Finally, OS is "outgroup
specific", i.e., the number of genes that are Y-linked in the outgroup, and not Y-linked in
D. melanogaster. The outgroup can be any non-melanogaster species (e.g., D. willistoni
or D. virilis) and the "outgroup-specific" gene may either have been acquired in the
outgroup lineage, or it may have been lost in the D. melanogaster lineage. In our data
(Fig. 2), when we use D. virilis as the outgroup, MS_K = 7, MA_K = 5, MT_K = 12; with D.
ananassae as the outgroup, MS_K = 1, MA_K = 11, MT_K = 12.
1.1. Basic data and assumptions. We begin with the initial assumption that the
gain rate measured in the D. melanogaster lineage (the "red branches" in Supplementary
Fig. 9) and the loss rate measured in the other lineages (the "blue branches") are
homogeneous across the entire phylogeny. The unbiased ratio of gene gain / gene loss
comes from the three estimates detailed below.
1.1.1. The loss rate per gene (expressed in "genes lost per gene per Myr"), is
unbiased due to the very nature of the data, where many instances allow high confidence
that the direct ancestral species possessed the Y-linked gene that is now missing in the
target species. For example, we observed one gene loss (Ppr-Y) in the D. grimshawi
branch (42.9 Myr), among five Y-linked genes (Fig. 2). Hence, the loss rate per gene in
this branch is (1 / 5) / 42.9 = 0.0047 genes lost / gene / Myr . The average value for all
branches is 0.001026 genes lost / gene / Myr (see section 1.4 and Supplementary Table
4) . Note that the loss rate per gene is different from the chromosome-wide loss rate,
which is expressed in "genes lost per Myr". For the sake of simplicity we will refer to
the latter simply as the "loss rate".
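The distinction between the loss rate per gene and the chromosome-wide loss rate can be checked with a few lines of arithmetic. This is a minimal sketch using only numbers quoted in this document; the helper name is hypothetical, and the total of 1948.6 gene × Myr is the summed observed exposure reported in section 1.4, which is consistent with the stated average of 0.001026.

```python
def loss_rate_per_gene(losses, genes_at_risk, branch_myr):
    """Genes lost per gene per Myr on a single branch."""
    return (losses / genes_at_risk) / branch_myr

# D. grimshawi branch: 1 loss (Ppr-Y) among 5 Y-linked genes in 42.9 Myr.
grimshawi = loss_rate_per_gene(1, 5, 42.9)

# All branches pooled: 2 losses over the summed exposure (genes x Myr).
pooled_per_gene = 2 / 1948.6

# Chromosome-wide loss rate: 2 losses over the summed branch length (Myr).
chromosome_wide = 2 / 275.2

print(f"{grimshawi:.4f} {pooled_per_gene:.6f} {chromosome_wide:.5f}")
```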
Note that we ignored the kl-5 gene in the Drosophila subgenus for the
computations of gene gain and loss rates, because its discovery depended on the gain of
the same gene in two branches, which may bring in unknown bias. However its inclusion
did not change any conclusion (not shown).
1.1.2. The ratio of melanogaster-specific to melanogaster-ancestral genes for a
given outgroup is unbiased because the discovery of the Y-linked genes of D.
melanogaster did not use any information from the other species (the other genomes
were not even available at that time). Hence the observed ratio of 7 / 5 with D. virilis
(and 1/11 with D. ananassae) is expected to hold for the full gene set of D.
melanogaster, apart from sampling variance. More formally,
$$E\left(\frac{MS_R}{MA_R}\right) = \frac{MS_K}{MA_K} \qquad \text{(eqn. 1)}$$
where E is the expected value, and the other symbols were defined before.
1.1.3. Similarly, the gene gain rate (7 genes / 62.9 Myr = 0.1113 genes / Myr) is unbiased in the
sense that the discovery of the D. melanogaster Y-linked genes was done without
information from the other species, and so was not influenced by their condition of being
"ancestral" or "melanogaster-specific". The gene gain rate has a trivial bias caused by
unknown D. melanogaster Y-linked genes (which is dealt with in section 1.3), and
needs a minor correction, as follows.
The 7 gains in 62.9 Myr we observed is the net gain rate, which does not take into
account the genes that were acquired and subsequently lost in the D. melanogaster
lineage. This problem can be corrected with a simple birth-death model with gene
“births” (arrival of genes) being independent of population size (number of Y-linked
genes), and the number of "deaths" (gene losses) being proportional to population size
(number of Y-linked genes). The differential equation is:
dN/dt = -λ N + ν
where N is the number of genes, λ is the gene loss rate per gene (in genes lost /
gene / Myr), and ν is the gene gain rate (in genes / Myr). Its solution is
$$N_t = N_0 e^{-\lambda t} - \frac{\nu}{\lambda}\left(e^{-\lambda t} - 1\right) \qquad \text{(eqn. 2)}$$
where N_t is the number of genes at time t, N_0 is the number of genes at time zero
(i.e., at the start of the D. melanogaster lineage, when only the ancestral genes were
present), and the other symbols were defined before. We can obtain the
corrected estimate of the gain rate ν by setting λ to 0.001026, N_t to 12 genes, N_0 to 5
genes, t to 62.9 Myr, and solving the equation for ν. The corrected value of the gain rate
is 0.12 genes / Myr (the raw value is 0.1113).
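Solving eqn. 2 for ν gives ν = λ (N_t − N_0 e^(−λt)) / (1 − e^(−λt)). The short sketch below evaluates this rearrangement with the values quoted above; the function name is hypothetical and the code is an illustration, not the spreadsheet or script used for the published analysis.

```python
import math

def corrected_gain_rate(n_t, n_0, loss_per_gene, t_myr):
    """Solve the birth-death solution (eqn. 2) for the gain rate nu."""
    decay = math.exp(-loss_per_gene * t_myr)
    return loss_per_gene * (n_t - n_0 * decay) / (1.0 - decay)

raw_rate = 7 / 62.9                                   # net gains per Myr
nu = corrected_gain_rate(n_t=12, n_0=5, loss_per_gene=0.001026, t_myr=62.9)
print(f"raw = {raw_rate:.4f}, corrected = {nu:.4f} genes / Myr")
```

The correction is small because, at a loss rate of about 0.001 per gene per Myr, few genes gained during 62.9 Myr are expected to have been lost again.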
1.2. Bias in the loss rate caused by the outgroup-specific genes. Our current
estimates of the gain rate (0.12 genes / Myr) and loss rate (2 losses in 275.2 Myr; raw
value: 0.00727 genes / Myr) are downward biased because we do not know the full gene set of
the D. melanogaster Y chromosome, but these biases cancel out when we consider the
ratio of gain to loss, and hence the ratio itself is unbiased (see section 1.3.). Let us focus
here on the more relevant bias: the loss rate is also downward biased because we do not
take into account the Y-linked genes that are present in the outgroup, but not in D.
melanogaster ( i.e., we have not taken into account the outgroup-specific genes), and
some of them are expected to have been lost. Putting it more formally, our current count
of losses in the outgroup (and the loss rate) is conditional on the existence of the same
gene in the Y chromosome of the D. melanogaster lineage, whereas an unbiased
estimate of the loss rate requires the inclusion of the outgroup-specific genes. At this
moment the knowledge about outgroup-specific genes (we know two in D. virilis) is too
incomplete to allow more detailed conclusions, other than that they really exist.
However, under the assumption that gain rates are homogeneous across the phylogeny,
the expected number of outgroup-specific Y-linked genes is:
E (OS) = gain rate × outgroup branch length.
For example, in the Y of D. willistoni we expect 0.12 genes/ Myr × 62.2 Myr =
7.5 genes , which would not be present in the Y of D. melanogaster. Among these genes,
we expect 7.5 × 0.001026 genes / gene / Myr × 62.2 Myr × 0.5 = 0.24 losses. The last
"0.5" factor stems from the fact that the acquisition of these new genes is expected to
occur on average at ½ of the branch length, so the chance of being lost is halved. In the
real phylogeny (Supplementary Fig. 8, panel B) we must calculate these "inferred
losses" for each outgroup branch, as shown in Supplementary Table 4. The sum of
expected additional losses across all outgroup branches is 1.025 . Note that the observed
value was 2 losses, so the unbiased number of losses is 3.025 and the unbiased loss rate
is 3.025 / 275.2 = 0.01099 genes / Myr .
The unbiased gain / loss ratio , 0.12 / 0.01099 , is 10.9 . The statistical testing of
this ratio (i.e., whether or not it is significantly different from 1) is presented in section
1.5.
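The correction in section 1.2 can be retraced numerically for the D. willistoni example and for the final ratio. This is a sketch with hypothetical function names; the per-branch sum of 1.025 inferred losses is taken from the text (Supplementary Table 4) and is not recomputed here.

```python
GAIN_RATE = 0.12          # genes / Myr (corrected value, section 1.1.3)
LOSS_PER_GENE = 0.001026  # genes lost / gene / Myr

def expected_outgroup_specific(branch_myr):
    """E(OS) = gain rate x outgroup branch length."""
    return GAIN_RATE * branch_myr

def expected_hidden_losses(branch_myr):
    """Expected losses among outgroup-specific genes on one branch.

    The factor 0.5 reflects that genes are gained, on average, halfway
    along the branch, so they are exposed to loss for half its length.
    """
    return expected_outgroup_specific(branch_myr) * LOSS_PER_GENE * branch_myr * 0.5

willistoni_os = expected_outgroup_specific(62.2)
willistoni_losses = expected_hidden_losses(62.2)

unbiased_losses = 2 + 1.025                 # observed + inferred (from the text)
unbiased_loss_rate = unbiased_losses / 275.2
gain_loss_ratio = GAIN_RATE / unbiased_loss_rate
print(willistoni_os, willistoni_losses, unbiased_loss_rate, gain_loss_ratio)
```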
1.3. Bias caused by unknown D. melanogaster Y-linked genes. As we
mentioned above, ignoring these genes causes a downward bias that is expected to affect
equally the gain and loss rates, and hence the effects are expected to cancel out in the
ratio gain / loss. The argument follows.
1.3.1. Gain rate. Our current raw estimate of the gain rate is 7 genes / 62.9 Myr =
0.1113 genes / Myr . This is an under-estimate because we do not know all
melanogaster-specific genes, and the more we find, the higher will be the gain rate. So if
we find 7 additional melanogaster-specific genes, the raw rate will be 14 / 62.9 Myr =
0.2226 genes / Myr (the corrected value described in section 1.1.3. will also double,
from 0.12 to 0.24).
More generally,
$$\text{unbiased gain rate} = \text{gain rate} \times \frac{MS_R}{MS_K} \qquad \text{(eqn. 3)}$$
1.3.2. Loss rate. Our current raw estimate is 2 genes / 275.2 Myr = 0.00727 genes / Myr. It was
estimated in the non-melanogaster branch (the "blue branches" shown in Supplementary
Fig. 9), by counting the number of losses among the known Y-linked genes of these
branches, and dividing by the total branch length. The loss rate is underestimated
because we do not know all Y-linked genes of the outgroup, and if this number, say, is
30% higher, the expected rate of loss will also be 30% higher. A simple analogy may
help: if we observed 10 deaths per year in a random sample of 250 animals, we would
expect to observe 20 deaths per year in a sample of 500 animals.
Among these unknown Y-linked genes, some would have been acquired in the
outgroup branches; these are the "outgroup specific" genes mentioned and accounted for
in section 1.2. Note that the bias correction described in section 1.2. is not affected by
additional melanogaster-specific genes because if we found that the gain rate is, say,
0.24 (twice the current rate 0.12 ), the expected number of OS (and the expected number
of losses among them) also doubles, and the effect cancels out in the ratio gain / loss.
The rest of the unknown Y-linked genes of the outgroup are the "melanogaster-
ancestral". If there are, say, 10 such genes (instead of five) we expect that the number of
losses among them will also double. More generally,
$$\text{unbiased loss rate} = \text{loss rate} \times \frac{MA_R}{MA_K} \qquad \text{(eqn. 4)}$$
It follows from eqn. 1 that
$$E\left(\frac{MA_R}{MA_K}\right) = E\left(\frac{MS_R}{MS_K}\right) \qquad \text{(eqn. 5)}$$
Note that the left term of eqn. 5 is the bias of gene loss (eqn. 4), and that the right
term is the bias of gene gain (eqn. 3). Thus, the biases due to incomplete knowledge of
the D. melanogaster Y-linked genes are expected to cancel out in the ratio gain rate /
loss rate , as stated in the beginning of section 1.3.
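The cancellation can be written out in one line. The following is a sketch of the argument that treats the expectations as if they were the ratios themselves (i.e., it ignores sampling variance); the last step uses eqn. 1.

```latex
\frac{\text{unbiased gain rate}}{\text{unbiased loss rate}}
  = \frac{\text{gain rate} \times MS_R / MS_K}{\text{loss rate} \times MA_R / MA_K}
  = \frac{\text{gain rate}}{\text{loss rate}} \times \frac{MS_R / MA_R}{MS_K / MA_K}
  \approx \frac{\text{gain rate}}{\text{loss rate}}
```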
1.4. General analytical model for bias correction. In the previous sections we
calculated the bias in the gain / loss ratio for the specific data we have. Here we analyze
a more general model for this bias, which is needed to specify more clearly the null
hypothesis for testing the ratio of gain to loss. The data used for the estimation of the
loss rate consist of following a set of genes known to be Y-linked in the ancestor
along the phylogenetic branches (Supplementary Table 4, and Fig. 2). The product
"number of genes × branch length" is called exposure in the context of Poisson
regression, and the higher it is, the larger the expected number of losses. It transpires
from the reasoning presented in section 1.2 that the bias discussed there originated from
not considering the exposure due to the outgroup-specific Y-linked genes, and that the
bias correction simply is its inclusion. Namely,
$$\text{unbiased loss rate} = \text{loss rate} \times \left(1 + \frac{\sum \text{inf\_exposure}}{\sum \text{obs\_exposure}}\right)$$
where obs_exposure is the observed exposure (from the ancestral Y-linked genes
present in the outgroup), and inf_exposure is the inferred exposure (from the outgroup-
specific Y-linked genes). Specifically,
obs_exposure = observed number of genes × branch length
inf_exposure = inferred effective number of genes × branch length ,
where "inferred effective number of genes" is the number of genes gained in the
same branch divided by 2 (to account for the fact that they were gained on average in the
middle of the branch) plus the number of genes acquired in previous branches. In the full
data set the bias in the loss rate is (1 + 998.9 / 1948.6 ) = 1.513 . Remember that the
biased loss rate is Σ obs_losses / Σ branch length = 2 / 275.2 = 0.00727 genes lost / Myr.
The unbiased loss rate equals 0.00727 × 1.513 = 0.01099 genes lost / Myr, which is
the same value calculated with the less general approach used in section 1.2. The value
of this bias depends on the topology of the tree (Fig. 2) and on the specific points where
the genes were acquired in the D. melanogaster lineage, because these factors change
the observed and inferred exposures. For example, if we assume that the CCY gene was
acquired at the basal branch of the melanogaster / obscura groups instead of in the basal
Sophophora branch, the bias in the loss rate changes from 1.513 to 1.531 , and if we had
used only Sophophora species, the bias would be 1.274 . We included in the
Supplementary Information an Excel spreadsheet (Supplementary Data file analytical.xls)
that implements the bias estimation.
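The exposure-based correction reduces to a single ratio once the exposures have been summed over branches. The sketch below uses the two totals quoted above (998.9 and 1948.6 gene × Myr); it does not reproduce the per-branch bookkeeping of the spreadsheet, and the function name is hypothetical.

```python
def loss_rate_bias(total_inferred_exposure, total_observed_exposure):
    """Bias factor 1 + sum(inf_exposure) / sum(obs_exposure)."""
    return 1.0 + total_inferred_exposure / total_observed_exposure

bias = loss_rate_bias(998.9, 1948.6)
raw_loss_rate = 2 / 275.2               # observed losses / total branch length
unbiased_loss_rate = raw_loss_rate * bias
print(f"bias = {bias:.3f}, unbiased loss rate = {unbiased_loss_rate:.5f} genes / Myr")
```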
The calculated bias provides the null hypothesis for testing the gain / loss ratio.
The loss rate has a bias of 1.513. Remember that the gain rate has a minor bias (section
1.1.3), which amounts to 0.12 / 0.1113 = 1.078. Hence, an unbiased gain / loss ratio of
1 implies an observed ratio of 1.513 / 1.078 = 1.403 . This 1.403 ratio is the null
hypothesis to be tested in the Poisson regression (and in the two Poisson means test).
I.e., we should test whether the ratio of 7 gains in 62.9 Myr divided by 2 losses in 275.2
Myr differs significantly from 1.403. The answer is "yes" (P = 0.003, Poisson
regression; see section 1.5 ). The same qualitative result was obtained with direct
computer simulations (section 2).
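The null value follows directly from the two bias factors. As a quick check, the sketch below recomputes it together with its natural logarithm (the scale on which a Poisson regression coefficient is expressed) and the raw observed ratio it is compared against; the variable names are illustrative only.

```python
import math

loss_bias = 1.513                 # section 1.4
gain_bias = 0.12 / 0.1113         # corrected / raw gain rate

null_ratio = loss_bias / gain_bias          # observed ratio expected if gain = loss
null_log_ratio = math.log(null_ratio)       # same null on the log scale

observed_ratio = (7 / 62.9) / (2 / 275.2)   # raw gain / loss ratio
print(f"null = {null_ratio:.3f} (ln = {null_log_ratio:.4f}), observed = {observed_ratio:.1f}")
```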
1.5. Statistical tests for the difference between the rates of gene gain and loss.
The data consist of inferred gains and losses on each branch, using synteny and
parsimony to infer ancestral states (Fig. 2). As detailed in Supplementary Fig. 6 and
Supplementary Fig. 8, there are two uncertainties in the data, involving the Pp1-Y1 /
Pp1-Y2 and the CCY genes. We first assumed the scenario shown on Fig. 2 (CCY was
gained in the basal Sophophora branch, instead of in the basal branch of the
melanogaster / obscura groups; the gains of the Pp1-Y1 and Pp1-Y2 genes are two
independent events, instead of one). None of the alternative scenarios change the
conclusion (below). Finally, as commented in the Supplementary Discussion, we
excluded from our analysis the Y-linked gene Suppressor of Stellate [ Su(Ste) ] because
it is multi-copy and RNA-encoding. The gene was acquired in the D. melanogaster
lineage, after the split from D. simulans. Its inclusion did not change any qualitative
conclusion (below).
We assume that genes are gained and lost along each branch of the phylogeny
according to a homogeneous Poisson process. Here the “exposure” to gene gain or loss
is the length of the respective branch, and the model is:
E(X) = f(β0 + β1 I_GainLoss)
where X is the count of gene movements (both gains and losses), f is the Poisson
link function44, β0 is the intercept, adjusting for the branch lengths, and I_GainLoss is a
binary indicator variable denoting whether the gene movement is a gain or loss (1 for a
gain and 0 for a loss). Testing the null hypothesis that the ratio of gain to loss is one (i.e.,
that they are equal) amounts to testing β1 = 0. Given the ascertainment bias discussed in
section 1.4, the appropriate null hypothesis is that the observed ratio of gain to loss is
1.403 , i.e., that β1 = ln ( 1.403) = 0.3386 . The residual deviances provided an
assessment of goodness-of-fit of the model to the data. The whole procedure was done
with the glm function in the R statistical package (setting “family = poisson”)45, and is
implemented in the Supplementary Data files "gains_losses_script.R" and
"gains_losses_data.txt".
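For orientation, one common way to write a Poisson regression with branch lengths as exposure is with a log link and a log-exposure offset. The block below is a sketch under that assumption; the exact parameterisation used in gains_losses_script.R is not shown in this document. With the pooled counts (7 gains in 62.9 Myr, 2 losses in 275.2 Myr) the point estimate of β1 is simply the log of the raw rate ratio, whereas the published fit uses per-branch data.

```latex
\log E(X_i) = \log T_i + \beta_0 + \beta_1 I_{GainLoss,i}
\qquad (T_i = \text{length of branch } i)

\hat{\beta}_1 = \ln\frac{7 / 62.9}{2 / 275.2} \approx \ln 15.3 \approx 2.73,
\qquad
H_0:\ \beta_1 = \ln 1.403 \approx 0.3386
```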
The Poisson regression model indicated that the rate of gene gain is significantly
larger than the rate of gene loss (P = 0.003). The nominal gain / loss ratio (after the bias
correction) is 10.9 ( 95% confidence interval: 2.3 - 52.5). Similar results are obtained if
we assume that CCY was gained in the basal branch of the melanogaster / obscura
groups (P = 0.003), if the gains of the Pp1-Y1 and Pp1-Y2 genes are counted as one
event (P = 0.005), or if the gain of the Su(Ste) gene is included (P = 0.001). Regarding
the goodness-of-fit of the model to the data, there is an indication of overdispersion in
the assumed scenario, or if we include the Su(Ste) gene (P = 0.035 and P = 0.022 ,
respectively), but not in the remaining scenarios (CCY alternative scenario: P > 0.20 ;
Pp1-Y1 / Pp1-Y2 alternative scenario: P = 0.06).
The same conclusion that gain rate largely exceeds loss rate is obtained with a
simple two-tailed exact test for the ratio of two Poisson means34,35, by comparing 7 gains
in 62.9 Myr with 2 losses in 275.2 Myr, under the null hypothesis of a gain / loss ratio of
1.403 (P = 0.002; the test was done with the StatCalc 2.0 program, available at
http://www.ucs.louisiana.edu/~kxk4695/StatCalc.htm ). Similar results were obtained
under the three alternative scenarios mentioned above.
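A test of this kind can be sketched with the classical conditional approach: given the total number of events, the number of gains is binomial with a success probability fixed by the null ratio and the two exposures. The code below is an illustration only. The function names are hypothetical, the two-sided P value is obtained by doubling the smaller tail (one of several conventions), and the algorithm implemented in StatCalc may differ, so the result need not match the reported value exactly.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_ratio_exact_test(x1, t1, x2, t2, null_ratio=1.0):
    """Two-sided conditional exact test of H0: (rate1 / rate2) = null_ratio.

    x1, x2 are event counts; t1, t2 are the corresponding exposures.
    Conditional on n = x1 + x2, x1 ~ Binomial(n, p0) under H0.
    """
    n = x1 + x2
    p0 = null_ratio * t1 / (null_ratio * t1 + t2)
    upper = sum(binom_pmf(k, n, p0) for k in range(x1, n + 1))
    lower = sum(binom_pmf(k, n, p0) for k in range(0, x1 + 1))
    return min(1.0, 2.0 * min(upper, lower))

# 7 gains in 62.9 Myr vs. 2 losses in 275.2 Myr, null gain/loss ratio 1.403.
p_value = poisson_ratio_exact_test(7, 62.9, 2, 275.2, null_ratio=1.403)
print(f"P = {p_value:.4f}")
```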
Finally, the same conclusion is obtained if we use only Sophophora species ( 7
gains in 62.9 Myr vs. 0 losses in 136.9 Myr; P = 0.002; two-tailed exact test for the ratio
of two Poisson means). Hence the conclusion that the gene content of the Y is increasing
does not seem to be an artifact caused by estimating gains and losses in different and
rather distant lineages such as D. melanogaster and D. virilis, although it is formally
possible that the increase is occurring in D. melanogaster and related species, but not in
species from the Drosophila subgenus.
In this section and in the previous one we approached analytically the
ascertainment bias on gene gains and losses, and statistically tested their equality with a
Poisson regression. In the next section we use computer simulations and an
approximate Bayesian procedure to tackle these questions.
2. Computer simulations and approximate Bayesian estimates of gene gain and
gene loss.
In order to explore more fully the consequences of the ascertainment bias of gene
content, we ran simulations of a Poisson process of gene gain and loss. The computer
code was written in the statistical language R (ref 45), and is available as
Supplementary Material (file indelsim_free.R). The simulations employed the observed phylogeny and
branch lengths, and inferences of losses were conditional on observing genes in D.
melanogaster (identical to the true ascertainment). After drawing random rates of gene
gain and loss per gene from a uniform distribution and collecting 1,000 runs that
satisfied the rejection criteria (7 net gains on the D. melanogaster lineage and 2 losses of
known genes on the other branches), approximate Bayesian estimates (refs 46-48) of the
posterior densities of the gain rate, the loss rate per gene, and the net gene gain (gains
minus losses) were obtained (Fig. 3 and Supplementary Fig. 11).
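The rejection scheme described above can be sketched in a few lines of Python. This is a toy version, not the authors' R code (indelsim_free.R): it ignores the phylogeny, does not model losses on the D. melanogaster lineage, and replaces the branch-by-branch exposure to losses with a single hypothetical value (5 ancestral genes exposed over the 275.2 Myr of the other branches). The function names and the seed are likewise assumptions; the prior ranges are those given in the legend of Supplementary Fig. 11.

```python
import random

GAIN_TIME = 62.9             # Myr on the D. melanogaster lineage (from the text)
LOSS_EXPOSURE = 5 * 275.2    # gene x Myr; ASSUMPTION: 5 genes exposed on the other branches

def count_events(rate, exposure, cap, rng):
    """Number of Poisson-process events within `exposure`; stops early once it exceeds `cap`."""
    if rate <= 0.0:
        return 0
    n, t = 0, rng.expovariate(rate)
    while t <= exposure and n <= cap:
        n += 1
        t += rng.expovariate(rate)
    return n

def abc_rejection(n_accept, gains_obs=7, losses_obs=2, seed=1):
    """Keep (gain rate, loss rate) pairs whose simulated counts equal the observed counts."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        gain_rate = rng.uniform(0.0, 1.0)     # prior, genes / Myr
        loss_rate = rng.uniform(0.0, 0.05)    # prior, genes lost / gene / Myr
        if count_events(gain_rate, GAIN_TIME, gains_obs, rng) != gains_obs:
            continue
        if count_events(loss_rate, LOSS_EXPOSURE, losses_obs, rng) != losses_obs:
            continue
        accepted.append((gain_rate, loss_rate))
    return accepted

posterior = abc_rejection(100)
mean_gain = sum(g for g, _ in posterior) / len(posterior)
mean_loss = sum(l for _, l in posterior) / len(posterior)
print(f"posterior mean gain rate: {mean_gain:.3f} genes / Myr")
print(f"posterior mean loss rate: {mean_loss:.4f} genes lost / gene / Myr")
```

Because of these simplifications the posterior means of the toy are not expected to reproduce the values reported in Supplementary Fig. 11; the sketch only shows how accepted parameter draws approximate a posterior distribution.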
In all 1,000 simulations the gains outnumber the losses (Fig. 3 and Supplementary
Fig. 11A), which strongly suggests that the Y is gaining genes on average. Note that the
simulations required just a total of 7 gains in the red branches, irrespective of where they
happened. This is important because there is some uncertainty about where the genes were
gained (Supplementary Figs. 6 and 8); the simulations are free of assumptions in this
respect, which increases the robustness of the conclusions drawn from them. The gain
rate and the loss rate per gene are the ultimate factors governing gene number dynamics;
their joint posterior distributions are shown in Supplementary Figure 11B. Analogously
to what we did in the analytical approach, we also ran the simulations under two
alternative scenarios, to allow for the counting of the Pp1-Y1 and Pp1-Y2 gains as a
single event (6 gene gains, instead of 7), and to include the gain of the Suppressor of
Stellate gene (8 gene gains, instead of 7). In both cases we got the same result as before
(namely, in all 1,000 simulations the gains outnumber the losses; data not shown).
It is likely that as gene number increases the number of gene losses will increase
until an equilibrium between gains and losses is attained. Under the simple model
outlined in section 1.1.3 and in equation 2, the equilibrium gene number is ν / λ , where
ν is the gene gain rate (in genes / Myr), and λ is the gene loss rate per gene (in genes lost
/ gene / Myr). The simulations allow us to look at the posterior distribution of the
predicted equilibrium gene number (Supplementary Fig. 11C). As expected given the
previous result that the gains outnumber the losses (Fig. 3), nearly all (997 out of 1,000)
of the values of equilibrium gene number are above the present Y-linked gene number in
D. melanogaster (12 genes). The average is 89 genes. However, the equilibrium gene
number does not have much biological significance because it is expected to take a very
long time to be achieved. Using the nominal gain rate of 0.12 genes / Myr and the
nominal loss rate of 0.001026 genes lost / gene / Myr, the predicted equilibrium is 117
genes, but it would take over 400 Myr for an increase from 12 genes to 50.
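The two figures just quoted can be checked with a short Python sketch. It assumes that the simple model has the linear form dn/dt = ν − λn, which is the form that gives an equilibrium of ν / λ; equation 2 itself is not reproduced here, and the function names are illustrative.

```python
from math import log

def equilibrium_gene_number(gain_rate, loss_rate_per_gene):
    """Equilibrium of dn/dt = gain_rate - loss_rate_per_gene * n."""
    return gain_rate / loss_rate_per_gene

def time_to_reach(n_start, n_target, gain_rate, loss_rate_per_gene):
    """Time (Myr) for the gene number to grow from n_start to n_target.

    Uses the solution n(t) = n_eq + (n_start - n_eq) * exp(-loss_rate_per_gene * t).
    """
    n_eq = equilibrium_gene_number(gain_rate, loss_rate_per_gene)
    if not (n_start < n_target < n_eq):
        raise ValueError("target must lie between the start value and the equilibrium")
    return log((n_eq - n_start) / (n_eq - n_target)) / loss_rate_per_gene

n_eq = equilibrium_gene_number(0.12, 0.001026)
t_50 = time_to_reach(12, 50, 0.12, 0.001026)
print(f"equilibrium: {n_eq:.0f} genes; time from 12 to 50 genes: {t_50:.0f} Myr")
```

Under this assumed model the nominal rates give an equilibrium of about 117 genes and a waiting time of roughly 440 Myr, consistent with the statement that the increase would take over 400 Myr.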
The parameters and values estimated by the simulations agree quite well with the
analytical solution. For example, the average ratio of gain rate to loss rate in the
simulations is 8.3 (Supplementary Fig. 11A), whereas the analytical value is 10.9
(section 1.2). Perfect agreement is not expected because some assumptions are
different. In particular, the simulations allowed variation among samples in the
phylogenetic pattern of gains (i.e. the rejection criterion focused on counts of gains, not
on which branches had gains), and this changes the exposure to losses (see section 1.4).
If we re-run the simulation with additional constraints such that the gene gains fall on
the phylogeny as in Fig. 2, then its estimates of parameters and values match the
analytical solution more closely (not shown).
Supplementary Figure 1. Assembly problems of Y-linked genes. The figure
shows a BlastN search of the full cDNA of D. virilis kl-2 against the assembled
D. virilis genome. This and the cDNAs from all other genes were obtained after
gaps and frame-shifts were corrected, as described in the Methods Summary
section. Note that the gene is fragmented in several scaffolds and that there are
many gaps due to the low coverage of the Y. The numbers are the abridged
FlyBase scaffold identifiers (e.g. scaffold_9735 was abridged to 9735). Many
fragments were absent from the assembled genome and were sequenced de
novo using RT-PCR and 5' and 3' RACE. The final coding sequences of the
orthologs were obtained with NAP (ref 49) and GeneWise2 (ref 50)
(http://www.ebi.ac.uk/Wise2/advanced.html). We also used Apollo (ref 51) and SGP2 (ref 52)
to help the annotation of the more difficult (i.e., less conserved) genes.
Supplementary Figure 2. Gene orthology confirmation by phylogeny. The
Y-linked genes originated by duplication of autosomal genes (refs 18-22). We used
phylogenetic analysis to avoid the error of mixing the parental autosomal genes
(i.e., the paralogs) with the correct orthologs. The protein sequences were
aligned with ClustalW (ref 53), and a NJ tree with Poisson correction and complete
deletion was constructed with the program MEGA (ref 54). For the sake of simplicity, in
the figures we labelled the genes in the other species according to their names in
D. melanogaster (e.g., the D. erecta gene Dere_GLEANR_12165 is the ortholog
of the D. melanogaster gene CG3339, and was labelled "ere CG3339";
see Supplementary Table 5 for the remaining genes). Each panel shows the
phylogenetic analysis of one gene, indicated at the top of the corresponding
panel. In most cases we included only the closest paralog, but in case of doubt
(Pp1-Y1 and Pp1-Y2; and ARY) we used several sequences returned by the
TblastN search. In a few cases we could not find any paralog (WDY) or found only
very distant ones (PRY; the two proteins shown in the figure have less than
30% identity), so there is no doubt about the orthology. The CCY gene has a
shorter paralog (CG31161) present only in the melanogaster group; their
relationship is discussed in Supplementary Fig. 8.
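For readers unfamiliar with the distance settings mentioned in this legend, the Python sketch below shows what "Poisson correction" and "complete deletion" mean for a pair of aligned protein sequences: columns containing a gap in any sequence are removed, and the proportion p of differing sites is converted to d = −ln(1 − p). It is only an illustration, not the MEGA implementation; the function names and the toy alignment are invented.

```python
from math import log

def complete_deletion(aligned, gap_chars="-?"):
    """Drop every alignment column in which any sequence has a gap or missing residue."""
    keep = [i for i in range(len(aligned[0]))
            if all(seq[i] not in gap_chars for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]

def poisson_distance(a, b):
    """Poisson-corrected distance d = -ln(1 - p), where p is the proportion of differing sites."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 1.0:
        raise ValueError("distance is undefined when all sites differ")
    return -log(1.0 - p)

toy = ["MKV-LSA", "MKVTLSG", "MRV-LTA"]   # invented toy alignment, not real data
clean = complete_deletion(toy)
d01 = poisson_distance(clean[0], clean[1])
print(clean, round(d01, 4))
```

A matrix of such pairwise distances is the input to the neighbour-joining step.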
Supplementary Fig. 2: Panel A. kl-2 phylogeny.
[Tree figure. Sequences shown: kl-2 and CG9068 from mel, ere, yak, ana, pse, wil, moj, vir and gri; Anopheles XP310137; Aedes EAT40361; Apis XP396228.3; Tribolium XP967358; Chlamydomonas 1-beta dynein CAB99316. Scale bar: 0.1.]
Supplementary Fig. 2: Panel B. kl-3 phylogeny.
[Tree figure. Sequences shown: kl-3 and CG9492 from mel, ere, yak, ana, pse, wil, moj, vir and gri; Anopheles XP308196 and XP307780; Aedes EAT41089 and EAT38201; Tribolium XP966797 and XP967934; Chlamydomonas gamma dynein Q39575. Scale bar: 0.1.]
Supplementary Fig. 2: Panel C. kl-5 phylogeny.
[Tree figure. Sequences shown: kl-5, CG3339 and Dhc93AB from mel, ere, yak, ana, pse, wil, moj, vir and gri; Anopheles XP321424 and XP559011; Aedes EAT45561 and EAT39332; Chlamydomonas beta dynein Q39565. Scale bar: 0.05.]
Supplementary Fig. 2: Panel D. ORY phylogeny.
[Tree figure. Sequences shown: ORY and CG6059 from mel, ere, yak, ana, pse, wil, moj, vir and gri; Anopheles XP 313461.1; Aedes EAT43311.1; Tribolium XP 974953.1. Scale bar: 0.2.]
Supplementary Fig. 2: Panel E. PRY phylogeny.
[Tree figure. Sequences shown: PRY from mel, yak, ere, ana, wil, pse, moj, vir and gri; CG12636 from mel, yak, ere and wil; CG30048 from mel, ere, yak, ana, moj, vir and gri. Scale bar: 0.1.]
Supplementary Fig. 2: Panel F. Ppr-Y phylogeny.
[Tree figure. Sequences shown: Ppr-Y from mel, yak, ere, ana, pse, wil, vir and moj; CG13125 from mel, yak, ere, ana, pse, wil, moj, vir and gri; Aedes XP 001649719.1. Scale bar: 0.1.]
Supplementary Fig. 2: Panel G. CCY phylogeny.
[Tree figure. Sequences shown: CCY from mel, ere, yak, ana, pse, moj, vir and gri; wil CCY-1 and wil CCY-2; CG31161 from mel, ere, yak and ana. Scale bar: 0.05.]
Supplementary Fig. 2: Panel H. ARY phylogeny.
[Tree figure. Sequences shown: ARY from mel, ere, yak, ana, wil, pse, moj, vir and gri; mel CG10638-PA and CG10638-PB; ere_GLEANR_14052 to ere_GLEANR_14055; yak CG10638-PA; ana CG10638-PB and ana_GLEANR_8771; wil_GLEANR_12919 and wil_GLEANR_12920; moj_GLEANR_13025, moj_GLEANR_13026 and moj CG10638; vir_GLEANR_13680, vir_GLEANR_13681 and vir CG10638; pse CG10638; Aedes AAEL004095-PA. Scale bar: 0.2.]
Supplementary Fig. 2: Panel I. WDY phylogeny.
[Tree figure. Sequences shown: WDY from mel, ere, yak, ana, pse, wil, vir, moj and gri; Aedes EAT46343.1; Tribolium XP 970543.1. Scale bar: 0.1.]
Supplementary Fig. 2: Panel J. Pp1-Y1 and Pp1-Y2 phylogeny.
[Tree figure. Sequences shown: Pp1-Y2 from mel, yak, ana, ere, pse, wil, moj, vir and gri; Pp1-Y1 from mel, ere, yak, ana, pse, wil, vir and gri; PpD6 from mel, wil, vir and moj; PpN58A and Pp1-13C from mel and moj; mel Pp1-87B, Pp1-96A, Pp1-9C, Pp1-D5, PpY-55A and Pp4-19C; wil GLEANR 15902; vir GLEANR 7085. Scale bar: 0.1.]
Supplementary Figure 3. PCR test of Y-linkage of the ARY gene. The gene
is Y-linked only in species from the D. melanogaster group and in D. willistoni.
Unabridged species names (in the order of appearance) are: D. melanogaster,
D. erecta, D. yakuba, D. ananassae, D. pseudoobscura, D. willistoni, D.
mojavensis, D. virilis, and D. grimshawi. A similar test was done for all genes,
across all 12 species.
[Figure: synteny maps of the kl-5 region in Dmel (chromosome 3R), Dere (scaffold 4820), Dyak (chromosome 3R), Dana (scaffold 13340), Dpse (chromosome 2), Dwil (scaffold_2_1100000004902), Dmoj (scaffold_6540), Dvir (scaffold_13047), Dgri (scaffold_15074) and A. gambiae (chromosome 2R).]
Supplementary Figure 4. Synteny analysis of the kl-5 gene (all species).
Individual genes usually move through gene duplication, sometimes followed by
loss of the original copy (ref 33). The linkage changes shown in Table 1 can be caused
either by Y-to-autosome movement or vice versa. Synteny analysis is very
helpful to solve this, but could only be applied to the genes that are autosomal in
two or more species, because the Y chromosome assembly is too fragmented
(e.g., Supplementary Fig. 1) and one scaffold seldom contains more than one
gene. These genes are kl-5 (Fig. 1 and Supplementary Fig. 4), WDY
(Supplementary Fig. 5), Pp1-Y1 / Pp1-Y2 (Supplementary Fig. 6), ARY
(Supplementary Fig. 7), and CCY (Supplementary Fig. 8). However, the
remaining genes (kl-2, kl-3, PRY and Ppr-Y) are Y-linked in nearly all species
(Table 1), and the Y-linkage clearly is the ancestral state. The kl-5 gene is Y-
linked in all sequenced species, except D. willistoni and D. pseudoobscura / D.
persimilis, which might suggest a Y-to-autosome transfer in the D. willistoni
lineage. However, as the figure shows, there is synteny in this region between
D. willistoni and A. gambiae (and also with D. pseudoobscura / D. persimilis).
Hence, the former hypothesis would imply that in D. willistoni the kl-5 gene
moved from the Y to exactly its location in A. gambiae, which is nearly
impossible. The most likely explanation is that kl-5 moved twice to the Y-
chromosome: one transfer happened within the Drosophila subgenus (before
the split of D. virilis, D. grimshawi and D. mojavensis ), and the other transfer in
the basal branch of the melanogaster group. The phylogenetic pattern we
observed (Supplementary Fig. 2, panel C) rules out the hypothesis that there
was one duplication from the ancestral autosomal locus to the Y prior to the split
of all sequenced species, followed by retention of the Y or autosomal copies in
different lineages. Note that if we ignore synteny information from D.
pseudoobscura / D. persimilis, the second transfer might have happened as well
in the basal branch of the obscura and melanogaster groups, but the synteny
information rules out this hypothesis. The same reasoning positioned the
transfers of WDY (Supplementary Fig. 5) and Pp1-Y1 / Pp1-Y2 (Supplementary
Fig. 6) in the basal branch of the melanogaster group, instead of in the basal
branch of the melanogaster and obscura groups. Hence there is useful synteny
information in D. pseudoobscura / D. persimilis, but it should be used bearing in
mind that the ancestral Y became part of an autosome in this lineage (ref 23). In
particular, none of the D. melanogaster Y-linked genes is Y-linked in these
species (Table 1), but this is not due to individual Y-to-autosome transfer; the
lack of Y-linkage in D. pseudoobscura / D. persimilis is due either to the Y-
autosome fusion (e.g., kl-2) or to the fact that these genes were transferred to
the Y in the melanogaster lineage after its split from the pseudoobscura lineage
(e.g., kl-5 and WDY). Note that the whole kl-5 region is conserved across all
sequenced species, except for the absence of kl-5 in the species in which it is Y-
linked. The most likely explanation for this absence is that after the duplication to
the Y the autosomal copy of kl-5 degenerated. Supplementary Figures 4 to 8
were modified from FlyBase GBrowse (available at
http://flybase.bio.indiana.edu/) and from VectorBase (available at
http://www.vectorbase.org/index.php). For the sake of simplicity, in the figures
we labelled the genes in the other species according to the names of their
orthologs in D. melanogaster (Supplementary Table 5). Orthology information
came from Drosophila and Anopheles databases (http://species.flybase.net/cgi-
bin/gbrowse/dmel/; http://agambiae.vectorbase.org/index.php; genes painted in
green) and from the present work (Supplementary Fig. 2; genes painted in blue).
Genes in yellow do not have clear orthologs.
[Figure: synteny maps of the WDY region in Dmel (chromosome 2L), Dere (scaffold 4929), Dyak (chromosome 2L), Dana (scaffold 12943), Dpse (chromosome 4-group2), Dwil (scaffold_2_1100000004851), Dmoj (scaffold_6500), Dvir (scaffold_12963) and Dgri (scaffold_15252).]
Supplementary Figure 5. Synteny analysis of the WDY gene. The WDY
gene is autosomal in all species of the Drosophila subgenus and in the obscura and
willistoni groups, and Y-linked in the melanogaster group. When autosomal,
WDY is located in the same position in all species, which shows that the
autosomal position is ancestral. Note that the alternative hypothesis of ancestral
Y-linkage of WDY would imply three independent movements to exactly the
same location in an autosome (in the ancestors of the obscura group, of the
willistoni group, and of the subgenus Drosophila), which is nearly impossible.
The WDY region is conserved in the D. melanogaster group, except for the
absence of WDY, which strongly suggests that after the duplication to the Y the
autosomal copy of the gene degenerated. Interestingly, in D. melanogaster (and
also in D. erecta and D. yakuba) a small gene (CG34164; 106 amino acids) with
62% amino acid identity with the C-terminus of WDY is present at exactly the
location of WDY. Thus, CG34164 is a relic of the full WDY gene, which has
~1000 amino acids.
[Figure: synteny maps of the Pp1-Y1 / Pp1-Y2 region in Dmel (chromosome 2R), Dere (scaffold_4845), Dyak (chromosome 2R), Dana (scaffold_13266), Dpse (chromosome 3), Dwil (scaffold_2_1100000004514), Dgri (scaffold_15245), Dvir (scaffold_12875) and Dmoj (scaffold_6496).]
Supplementary Figure 6. Synteny analysis of the Pp1-Y1 and Pp1-Y2
genes. These genes are Y-linked only in the melanogaster group, and are
autosomal and syntenic in the other species, as happens with WDY. Therefore
the same arguments and conclusion are valid for them: the autosomal position is
ancestral. Note that when autosomal they are located very close to each other,
that they are Y-linked in the same set of species (Table 1), and that in D.
melanogaster at least they are located in the same gross region of the Y
chromosome (ref 20). These observations strongly suggest that their duplication to the
Y was a single mutational event. This may have consequences for the
estimation of the rate of gene gain (Supplementary Methods, section 1.5), but
we should note also that the duplication is only the first step of a gene gain by
the Y. The other step is the survival of the Y-linked copy as a functional gene (ref 39),
and this most likely was an independent process for Pp1-Y1 and Pp1-Y2
because the PpD6 gene probably was co-duplicated to the Y with them and yet
its only surviving copy is autosomal. Interestingly, PpD6 is located in a
completely different region in the melanogaster group, which hints that the
process of gene duplication, survival and degeneration was quite complex in this
case. Perhaps the gain of Pp1-Y1 and Pp1-Y2 may be considered partially a
single event (in the duplication step) and partially two independent events (in the
survival step). Whatever the case, this is the only uncertainty; the ancestral state
(autosomal) of these genes is very well supported.
[Figure: synteny maps of the ARY region in Dmel (chromosome 3R), Dere (scaffold 4784), Dyak (chromosome 3L), Dana (scaffold 13337), Dpse (chromosome XR_group6), Dwil (scaffold_2_1100000004729), Dmoj (scaffold_6680), Dvir (scaffold_13049) and Dgri (scaffold_15110).]
Supplementary Figure 7. Synteny analysis of the ARY gene. ARY is Y-linked
in all Sophophora (except, of course, in D. pseudoobscura and D. persimilis),
and autosomal (and syntenic) in the species from the Drosophila subgenus.
Thus, there is no outgroup among the 12 sequenced species that would help to
establish the ancestral state (and also there is no conserved synteny with any of
the three sequenced mosquitoes; data not shown). Consequently, the linkage
pattern of the gene can be explained by two equally parsimonious hypotheses: a
Y-to-autosome transfer in the basal branch of the Drosophila subgenus, or an
autosome-to-Y transfer in the basal branch of the Sophophora subgenus. The
ancestral location of ARY could be inferred because the gene is autosomal and located
in a cluster of two to four related genes in D. virilis and D. mojavensis (the
related genes include ARY and CG10638; see Supplementary Fig. 2, panel H).
Since these clusters usually are formed by tandem duplications, it is fairly safe
to conclude that this autosomal region is the ancestral location of ARY (the
alternative hypothesis is that in the Drosophila subgenus ARY moved from the Y
precisely to the autosomal location of its related genes). The cluster is
conserved also in many species of the Sophophora subgenus.
[Figure, panels A to C. Panel A: synteny maps of the CCY region in Dmel (chromosome 3R), Dere (scaffold 4820), Dyak (chromosome 3R), Dana (scaffold 13340), Dpse (chromosome 2), Dwil (scaffold_2_1100000004902), Dmoj (scaffold_6540), Dvir (scaffold_12855) and Dgri (scaffold_15074); CG31161 is labelled in the maps.]
Supplementary Figure 8. Synteny analysis of the CCY gene. CCY is Y-linked
in all Sophophora, and autosomal (and syntenic) in the species from the
Drosophila subgenus, as happens with ARY. However, in the CCY case the
ascertainment of the ancestral state is simpler: we know that the autosomal
location present in the Drosophila subgenus is ancestral because, as shown in
panel A and Supplementary Fig. 2 (panel G), at the same position in most
Sophophora species there is a shorter gene (CG31161 in D. melanogaster) with
a high identity to the N-terminus of CCY (i.e., a relic gene). CCY has 1200 to
1600 amino acids, and CG31161 has ~450. Although the ancestral state of the
autosomal copy is clear, it is difficult to reconcile the protein phylogeny data
(Supplementary Fig. 2, panel G) with a simple scenario of a single transfer of
CCY to the Y in the basal branch of the Sophophora subgenus (panel B). The
main problem is the position of the D. willistoni CCY, which would be expected
to group with the other Sophophora CCY, and not to branch before the
CG31161-CCY split (Supplementary Fig. 2, panel G). The observed pattern
suggests two independent transfers of CCY, one within the D. willistoni branch,
and the other in the basal branch of the melanogaster and obscura groups, as
shown in panel C. A less important incongruence is that in D. willistoni there are
two copies of CCY-like genes in the Y (one full length and one that seems to be
short like CG31161) that group together. This pattern suggests that the two
Y-linked copies arose from a duplication inside the Y chromosome. Given these
uncertainties and the possibility that the protein phylogeny is being affected by
gene conversion, mutational biases (which are different in the Y, due to its
heterochromatic state), and other confounding factors, we conservatively
assumed the simplest scenario of one transfer of CCY to the Y chromosome
(panel B), as shown in Fig. 2.
[Figure, panels A and B. Panel A: schematic phylogeny of D. melanogaster, species A, species B and species C.]
Supplementary Figure 9. Estimating gene gain and loss in the Y
chromosome. Our experimental approach identified Y-linked genes of the
"non-melanogaster" species using the known Y-linked genes of D. melanogaster.
Hence we cannot detect genes that were lost in the D. melanogaster lineage, or
that were acquired in phylogenetic branches that are not part of this lineage. (A)
Estimates of gene gain and loss can be obtained as follows. Consider a
simplified phylogeny with three species (D. melanogaster and species A and B),
plus an outgroup (species C) to allow the determination of the ancestral states.
Gene gains can only be detected in the branch that leads to D. melanogaster
(shown in red), but in principle occur at the same rate in the branches that lead
to species A and B (shown in blue). Exactly the opposite happens with gene
losses. Hence, the rate of gene gain can be obtained by counting the gains in
the D. melanogaster branch, and dividing this number by this branch length
(red). In the same way, an estimate of the rate of gene loss can be obtained by
counting the losses in the blue branches, and dividing them by the
corresponding branch length. Using parsimony alone, gene movements cannot be
unambiguously detected in the branch labelled in black because we cannot
distinguish a gain in the D. melanogaster lineage from a loss in the species C
lineage. (B) In the real data there were 7 gene gains in the Y chromosome in the
D. melanogaster lineage (CCY, ARY, WDY, kl-5, Pp1-Y1, Pp1-Y2, FDY), in a
total branch length of 62.9 Myr, and 2 gene losses (Ppr-Y and PRY) in a total
branch length of 275.2 Myr. We did not consider the Pp1-Y1 loss in D.
mojavensis (because it happened in an autosome), the kl-5 gain in the
Drosophila subgenus (because it happened outside the branches where
estimates are feasible), and linkage data from D. pseudoobscura / D. persimilis
(due to the Y-autosome fusion that happened in this lineage; ref 23). Branch lengths
are shown in the figure, and were obtained from Tamura et al. (ref 4), except D.
simulans / D. sechellia and virilis group / repleta group (ref 55). The nodes A-H are
also labelled. The raw rate of gene gain is 0.1113 genes / Myr and the raw rate
of gene loss is 0.0073 genes / Myr (see Supplementary Methods for bias
corrections).
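The raw rates in this legend follow directly from the counts and branch lengths, as the short Python sketch below shows. The last step is an interpretation made here rather than a statement from the legend: it divides the raw gain / loss ratio by the null ratio of 1.403 quoted in section 1.5, on the assumption that this value is the ratio expected from ascertainment bias alone. Variable names are illustrative.

```python
gains, gain_branch_myr = 7, 62.9        # gains on the D. melanogaster lineage
losses, loss_branch_myr = 2, 275.2      # losses on the other branches

raw_gain_rate = gains / gain_branch_myr      # genes / Myr
raw_loss_rate = losses / loss_branch_myr     # genes / Myr
raw_ratio = raw_gain_rate / raw_loss_rate

null_ratio = 1.403                           # from section 1.5; ASSUMED to be the bias-only ratio
corrected_ratio = raw_ratio / null_ratio

print(f"raw gain rate: {raw_gain_rate:.4f} genes / Myr")
print(f"raw loss rate: {raw_loss_rate:.4f} genes / Myr")
print(f"raw ratio: {raw_ratio:.1f}; divided by the null ratio: {corrected_ratio:.1f}")
```

The result of the last step (about 10.9) agrees with the nominal bias-corrected gain / loss ratio reported in section 1.5.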
Supplementary Figure 10. Experimental confirmation of the loss of the
Ppr-Y gene in D. grimshawi. Degenerate PCR with the primers FVEH and
MHGE specifically amplified the Ppr-Y gene in a diverse set of species, but did
not recover any product from D. grimshawi. This result confirms that the gene is
absent from the genome of this species. The tested species are (in the order of
the figure): D. prosaltans (saltans group), D. bifasciata (obscura group), D.
fumipennis (willistoni group), D. arawakana (cardini group), D. bromeliae
(bromeliae group), D. tripunctata (tripunctata group), D. robusta (robusta group),
D. virilis (virilis group), and D. grimshawi. The first three species belong to the
Sophophora subgenus, and the remaining six to the Drosophila subgenus. The
primer sequences are: FVEH 5' GCCTAGCTTCAAGTTTYGTVGANCA 3';
MHGE 5' CAGGTGTATCWTCATCNTCNCCRTGCAT 3'. They were designed
with a modified version of the CodeHop procedure (ref 56). We used a hot-start
enzyme (AmpliTaq Gold) with 1 µM of each primer, and the following cycling
conditions: one cycle of 10 min at 94 °C for initial DNA denaturation / activation
of Taq; 40 cycles of 50 sec at 94 °C, 2 min at 55 °C, and 1 min at 72 °C;
and one final cycle of 7 min at 72 °C. We also tested other annealing
temperatures (between 53 °C and 57 °C) and never got bands of the expected
size (380 bp) in D. grimshawi.
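The degeneracy of the two primers, that is, the number of distinct oligonucleotides each sequence represents, can be counted from the standard IUPAC ambiguity codes. The Python sketch below is an illustration added here; the function name is hypothetical and the calculation is not part of the CodeHop procedure.

```python
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    n = 1
    for base in primer.upper():
        n *= len(IUPAC[base])
    return n

primers = {
    "FVEH": "GCCTAGCTTCAAGTTTYGTVGANCA",
    "MHGE": "CAGGTGTATCWTCATCNTCNCCRTGCAT",
}
for name, seq in primers.items():
    print(name, len(seq), "nt, degeneracy", degeneracy(seq))
```

FVEH contains the ambiguity codes Y, V and N (2 x 3 x 4 = 24 sequences) and MHGE contains W, N, N and R (2 x 4 x 4 x 2 = 64 sequences).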
[Gel image: male (♂) and female (♀) lanes for each of the nine species.]
Supplementary Figure 11. Results of 1,000 computer simulations of gene
gain and loss. A) Posterior distribution of the ratio of the rate of gene gain
(genes/Myr) to the rate of gene loss (genes/Myr). The average value is 8.3
(range: 1.3 to 34.5; 95% credibility interval: 1.7 - 22.0). B) Joint posterior
distribution of gain rate and loss rate per gene. The average values are 0.1703
genes / Myr and 0.0034 genes / gene / Myr, respectively. The uniform
distributions used as priors for both parameters had maxima well above the
highest accepted values (prior for gain: 0 - 1.0 genes / Myr; prior for loss: 0 -
0.05 genes / gene / Myr). C) Posterior distribution of the predicted equilibrium
gene number (note the logarithmic scale of the abscissa). The average value is
89 genes (range: 8 to 859). Three out of 1,000 simulations had a predicted
equilibrium gene number below 12 (the present gene number of the D.
melanogaster Y).
Supplementary Tables
Supplementary Table 1. Accession numbers of the genes used in this
study.
Genes D. mel D. ere D. yak D. ana D. pse D. wil D. moj D. vir D. gri
kl-2 EU685283 EU595396 EU595398 EU595399 EU595397 EU595400 EU595403 EU595402 EU595401
kl-3 AAG29546 EU514469 EU514472 EU514468 BK005626 EU514467 EU514471 EU514470 EU514466
kl-5 NP001015499 EU417450 EU417452 EU417447 BK005628 NA EU417444 EU417438 EU417437
ORY NP001015498 BK006456 BK006457 BK006455 AAW23319 BK006454 BK006453 BK006452 BK006451
PRY BK006442 EU362867 BK006441 BK006440 BK006439 BK006438 BK006437 EU362864 BK006436
Ppr-Y NP001015502 BK006434 BK006435 BK006433 AAW23326 BK006432 BK006431 BK006430 0
CCY EU685282 EU685280 EU685281 EU685279 EU685278 EU685277 EU685276 EU685275 EU685274
ARY BK006427 BK006421 BK006426 BK006429 BK006425 BK006428 BK006423 BK006424 BK006422
WDY BK006449 BK006448 BK006450 EU362855 BK006447 BK006446 BK006444 BK006445 BK006443
Pp1Y-1 AAL25117 BK006412 BK006413 BK006411 NA BK006410 0 BK006409 BK006408
Pp1Y-2 NP001015497 BK006419 BK006420 BK006418 NA BK006417 BK006416 BK006415 BK006414
FDY NA 0 0 0 0 0 0 0 0 Sequence data used in this paper are available in DDBJ/EMBL/GenBank as
original sequences and in the Third Party Annotation Section of the
DDBJ/EMBL/GenBank databases under the accession numbers shown in the
Table (81 sequences were first reported in this study). Genes absent from the
genome (Table 1) were labeled with a "0". Unabridged species names (in the
order of appearance) are: D. melanogaster, D. erecta, D. yakuba, D. ananassae,
D. pseudoobscura, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi.
Experimental evidence for the ARY annotation (BK006421-BK006429) was
obtained from D. willistoni and D. ananassae by ARY mRNA sequencing
(GenBank accession numbers EU334136 - EU334138).
Supplementary Table 2. Ka/Ks ratios for the Y-linked genes. These ratios measure the selective constraint in protein-coding
regions (ref. 57). Ratios around 1 imply lack of selection (usually indicating pseudogenes). All ratios are well below 1, and
within the range of the majority of Drosophila genes (ref. 24), which strongly suggests that all genes are functional and are evolving
under purifying selection. Calculations were performed at http://services.cbu.uib.no/tools/kaks (ref. 57). Dashes mark missing
values (absent genes).

Branch   kl-2   kl-3   kl-5   ORY    PRY    Ppr-Y  ARY    CCY    WDY    Pp1-Y1 Pp1-Y2 FDY    Mean   SD
E_mel    0.053  0.021  0.025  0.025  0.162  0.036  0.095  0.380  0.016  0.078  0.005  0.224  0.093  0.112
F_yak    0.055  0.042  0.041  0.344  0.238  0.158  0.248  0.548  0.110  0.014  0.043  -      0.167  0.166
F_ere    0.069  0.032  0.056  0.065  0.309  0.119  0.348  0.447  0.023  0.075  0.035  -      0.143  0.150
D_F      0.105  0.036  0.061  0.060  0.311  0.080  0.186  0.425  0.047  0.151  0.026  -      0.135  0.127
C_ana    0.052  0.041  0.029  0.066  0.157  0.087  0.097  0.177  0.023  0.026  0.001  -      0.069  0.056
C_D      0.093  0.045  0.044  0.078  0.221  0.144  0.176  0.322  0.029  0.064  0.013  -      0.112  0.095
obs_C    0.043  0.031  0.042  0.062  0.196  0.085  0.086  0.334  0.038  0.169  0.050  -      0.103  0.094
obs_pse  0.074  0.058  0.050  0.110  0.335  0.303  0.192  0.418  0.093  0.142  0.077  -      0.168  0.127
B_obs    0.061  0.033  0.045  0.178  0.144  0.105  0.083  0.248  0.054  0.124  0.048  -      0.093  0.062
B_wil    0.125  0.077  0.061  0.110  0.283  0.151  0.293  0.493  0.064  0.121  0.052  -      0.166  0.136
I_vir    0.040  0.022  0.036  0.179  0.168  0.123  0.092  0.250  0.042  0.162  0.059  -      0.107  0.075
I_moj    0.127  0.044  0.041  0.048  0.414  0.093  0.138  0.295  0.052  -      0.101  -      0.135  0.124
H_I      0.226  0.139  0.121  0.219  0.319  0.123  0.047  0.229  0.044  -      0.069  -      0.154  0.091
A_H      0.092  0.062  0.067  0.109  0.164  -      0.048  0.251  0.092  0.147  0.120  -      0.115  0.060
H_gri    0.060  0.029  0.042  0.058  0.144  -      0.159  0.404  0.092  0.277  0.078  -      0.134  0.124
A_B      0.094  0.061  0.067  0.105  0.188  0.124  0.062  0.275  0.098  0.136  0.092  -      0.118  0.064
Mean     0.085  0.048  0.052  0.107  0.235  0.124  0.147  0.343  0.057  0.120  0.054  0.224  0.125  0.108
SD       0.046  0.029  0.022  0.080  0.083  0.061  0.089  0.105  0.031  0.067  0.034  -      0.108  -
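The Mean and SD columns of Supplementary Table 2 are consistent with the arithmetic mean and the sample standard deviation (n - 1 denominator) of the Ka/Ks values in each row. This is inferred from the numbers themselves, not stated in the text. A minimal Python sketch for the E_mel row, which also checks the "well below 1" criterion (variable names are illustrative):

```python
from statistics import mean, stdev

# Ka/Ks values of the E_mel branch for the 12 genes, in table order
# (kl-2, kl-3, kl-5, ORY, PRY, Ppr-Y, ARY, CCY, WDY, Pp1-Y1, Pp1-Y2, FDY).
e_mel = [0.053, 0.021, 0.025, 0.025, 0.162, 0.036,
         0.095, 0.380, 0.016, 0.078, 0.005, 0.224]

row_mean = mean(e_mel)    # arithmetic mean of the row
row_sd = stdev(e_mel)     # sample standard deviation (n - 1)

# Ratios near 1 would suggest relaxed selection; here all are below 1.
all_below_one = all(value < 1.0 for value in e_mel)

print(f"mean={row_mean:.3f} sd={row_sd:.3f} all<1: {all_below_one}")
```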
Supplementary Table 3. Original chromosomal location of the 7 gained
genes. Note that with the possible exception of Pp1-Y1 and Pp1-Y2, all genes
were acquired individually by the Y chromosome (as opposed to resulting from
large segmental duplications), since they are not adjacent to each other at their
original autosomal locations.
Gene     Original chromosome   Syntenic band in D. melanogaster *   Time of gain (branch name) †
CCY      Muller E              94B6                                 A_B
ARY      Muller D              69C4                                 A_B
kl-5     Muller E              97F3                                 obs_C
WDY      Muller B              33C1                                 obs_C
Pp1-Y1   Muller C              58A2                                 obs_C
Pp1-Y2   Muller C              58A2                                 obs_C
FDY      Muller E              96C1                                 E_mel
* The 7 genes are Y-linked in D. melanogaster and autosomal in several other species. The
column shows the original autosomal locations in these other species, referenced to the D.
melanogaster map. See Supplementary Figures 4 - 8 for detailed synteny information.
† See Supplementary Figure 9 for branch names.
Supplementary Table 4. Quantities used to estimate the unbiased ratio of gene gain to gene loss. Branches were
named according to Supplementary Fig. 9 (e.g., the branch connecting nodes D and F is named "D_F"). See text and table
footnotes for details. The Supplementary Data file analytical.xls implements these calculations.
Columns: (1) Branch name; (2) Branch length; (3) Number of genes; (4) Observed losses; (5) Observed exposure;
(6) Inferred new genes gained in the branch*; (7) Inferred new genes from previous branches*;
(8) Inferred effective number of genes†; (9) Inferred losses‡; (10) Inferred exposure§.

(1)     (2)    (3)  (4)  (5)     (6)     (7)     (8)     (9)     (10)
G_sim   2      11   0    22      0.240   0.4073  0.5273  0.0011  1.055
G_sec   2      11   0    22      0.240   0.4073  0.5273  0.0011  1.055
E_G     3.4    11   0    37.4    0.408   0.0000  0.2040  0.0007  0.694
F_yak   10.4   11   0    114.4   1.248   0.2757  0.8997  0.0096  9.357
F_ere   10.4   11   0    114.4   1.248   0.2757  0.8997  0.0096  9.357
D_F     2.3    11   0    25.3    0.276   0.0000  0.1380  0.0003  0.317
C_ana   44.2   11   0    486.2   5.304   0.0000  2.6520  0.1203  117.218
B_wil   62.2   7    0    435.4   7.464   0.0000  3.7320  0.2382  232.130
I_vir   32.5   5    0    162.5   3.900   3.5914  5.5414  0.1848  180.095
I_moj   32.5   5    1    162.5   3.900   3.5914  5.5414  0.1848  180.095
H_I     10.4   5    0    52      1.248   2.3754  2.9994  0.0320  31.194
A_H     20     5    0    100     2.400   0.0000  1.2000  0.0246  24.000
H_gri   42.9   5    1    214.5   5.148   2.3754  4.9494  0.2178  212.328
TOTAL   275.2  -    2    1948.6  33.024  -       -       1.0249  998.893

* Estimated as gain rate × branch length, from the current (column 6) or previous branches (column 7).
† Estimated as 0.5 × column 6 + column 7.
‡ Estimated as loss rate per gene × branch length × column 8.
§ Estimated as branch length × column 8.
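The footnote formulas of Supplementary Table 4 can be sketched in Python as follows. The two rates are assumptions read off the table rather than quoted from the text: the gain rate is taken as column 6 divided by branch length (0.240 / 2 = 0.12), and the loss rate per gene as total observed losses divided by total observed exposure (2 / 1948.6). With these values the sketch reproduces columns 6, 8 and 10 and comes close to column 9; the function name is illustrative, and analytical.xls remains the reference implementation.

```python
# Assumed rates, derived from the table (see lead-in), not quoted from the text.
GAIN_RATE = 0.12          # genes per unit branch length: column 6 / column 2
LOSS_RATE = 2 / 1948.6    # losses per gene per unit branch length

def inferred_quantities(branch_length, genes_from_previous,
                        gain_rate=GAIN_RATE, loss_rate=LOSS_RATE):
    """Return (column 6, column 8, column 10, column 9) for one branch."""
    gained = gain_rate * branch_length                 # column 6 (footnote *)
    effective = 0.5 * gained + genes_from_previous     # column 8 (footnote †)
    exposure = branch_length * effective               # column 10 (footnote §)
    losses = loss_rate * exposure                      # column 9 (footnote ‡)
    return gained, effective, exposure, losses

# Branch length and column 7 for three branches, copied from the table.
branches = {"G_sim": (2.0, 0.4073), "I_vir": (32.5, 3.5914), "H_gri": (42.9, 2.3754)}
for name, (length, previous) in branches.items():
    gained, effective, exposure, losses = inferred_quantities(length, previous)
    print(f"{name}: gained={gained:.3f} effective={effective:.4f} "
          f"exposure={exposure:.3f} losses={losses:.4f}")
```

Column 7 is taken as given here, since its derivation from earlier branches is described in the Supplementary Methods rather than in the table footnotes.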
Supplementary Table 5. FlyBase gene names. For the sake of simplicity, in Fig. 2, Supplementary Fig. 2 and Supplementary
Figures 4 to 8, we labelled the genes in the other species (columns 2 to 9) using the names of their D. melanogaster orthologs (column
1). The table shows the official names of these genes. A dash means that the gene was not cited in this paper. The official names of
Anopheles genes (Fig. 2) are: kl-5, AGAP001672; CG6599, AGAP001673; side, AGAP001674. The Anopheles ortholog of CG13980 was
not found in the corresponding genome and was annotated with GeneWise using the WGS sequence AAAB01008987.1 (from positions
8391834 to 8381834).

D. mel  D. ere  D. yak  D. ana  D. pse  D. wil  D. moj  D. vir  D. gri
vih  Dere_GLEANR_15642  Dyak_GLEANR_5629  Dana_GLEANR_10384  GA10491  Dwil_GLEANR_12927  Dmoj_GLEANR_12283  Dvir_GLEANR_11523  Dgri_GLEANR_14567
Tom70  Dere_GLEANR_10223  -  -  -  -  -  -  -
sti  Dere_GLEANR_14049  Dyak_GLEANR_3961  Dana_GLEANR_8768  -  Dwil_GLEANR_12916  -  Dvir_GLEANR_13675  Dgri_GLEANR_16504
side  Dere_GLEANR_12163  Dyak_GLEANR_10477  Dana_GLEANR_19966  GA15977  Dwil_GELANR_11556  Dmoj_GLEANR_8197  Dvir_GLEANR_8015  Dgri_GLEANR_839
rab3-GAP  Dere_GLEANR_10217  Dyak_GLEANR_1271  Dana_GLEANR_826  GA20070  Dwil_GLEANR_3923  Dmoj_GLEANR_1564  Dvir_GLEANR_931  Dgri_GLEANR_10153
prd  Dere_GLEANR_10215  -  -  -  -  -  -  -
PpN58A  -  -  -  -  -  Dmoj_GLEANR_5675  -  -
PpD6  -  -  -  GA25002  Dwil_GLEANR_19931  Dmoj_GLEANR_16157  Dvir_GLEANR_7086  Dgri_GLEANR_5320
Pp13C  -  -  -  -  -  Dmoj_GLEANR_3672  -  -
Or67a  -  -  Dana_GLEANR_825  -  -  -  -  -
mRpS2  -  -  Dana_GLEANR_498  -  -  -  -  -
mRpL45  Dere_GLEANR_11278  Dyak_GLEANR_10254  Dana_GLEANR_20122  GA11976  Dwil_GLEANR_11528  Dmoj_GLEANR_7481  Dvir_GLEANR_10749  Dgri_GLEANR_674
lox2  Dere_GLEANR_5517  Dyak_GLEANR_13866  Dana_GLEANR_11663  -  -  -  -  -
loco  Dere_GLEANR_12514  Dyak_GLEANR_7755  Dana_GLEANR_17412  GA18761  Dwil_GLEANR_11960  Dmoj_GLEANR_10439  Dvir_GLEANR_10249  Dgri_GLEANR_243
JhI-21  Dere_GLEANR_10218  Dyak_GLEANR_1272  Dana_GLEANR_827  GA11552  Dwil_GLEANR_3893  Dmoj_GLEANR_1565  Dvir_GLEANR_932  Dgri_GLEANR_10154
Dlc90F  -  -  -  -  -  -  Dvir_GLEANR_10253  -
Dhc93AB  Dere_GLEANR_905  Dyak_GLEANR_8668  Dana_GLEANR_19783  GA17641  Dwil_GLEANR_12312  Dmoj_GLEANR_10249  Dvir_GLEANR_10108  Dgri_GLEANR_751
CycC  -  -  -  -  -  -  Dvir_GLEANR_10744  -
CG9492  Dere_GLEANR_2240  Dyak_GLEANR_8418  Dana_GLEANR_19657  GA21828  Dwil_GLEANR_12148  Dmoj_GLEANR_8658  Dvir_GLEANR_10212  Dgri_GLEANR_3612
CG9284  -  Dyak_GLEANR_12495  -  -  -  -  -  -
CG9068  Dere_GLEANR_10385  Dyak_GLEANR_1452  Dana_GLEANR_403  GA27740  Dwil_GLEANR_8422  Dmoj_GLEANR_1489  Dvir_GLEANR_16700  Dgri_GLEANR_10062
CG7265  -  -  -  -  -  -  Dvir_GLEANR_10743  -
CG7126  -  -  -  -  -  -  Dvir_GLEANR_10741  -
CG6792  Dere_GLEANR_10291  Dyak_GLEANR_1273  Dana_GLEANR_828  GA19865  -  Dmoj_GLEANR_1566  Dvir_GLEANR_933  Dgri_GLEANR_9575
CG6785  Dere_GLEANR_10221  Dyak_GLEANR_1275  Dana_GLEANR_16259  GA19861  -  Dmoj_GLEANR_1568  Dvir_GLEANR_935  -
CG6770  Dere_GLEANR_10220  Dyak_GLEANR_1274  Dana_GLEANR_829  GA19852  -  Dmoj_GLEANR_1567  -  -
CG6766  Dere_GLEANR_10222  Dyak_GLEANR_1276  Dana_GLEANR_16260  GA19848  -  Dmoj_GLEANR_1569  -  -
CG6599  Dere_GLEANR_11633  Dyak_GLEANR_7495  Dana_GLEANR_17557  GA19712  Dwil_GELANR_11932  Dmoj_GLEANR_8195  Dvir_GLEANR_9885  Dgri_GLEANR_90
CG6059  Dere_GLEANR_10496  Dyak_GLEANR_12184  Dana_GLEANR_7487  GA19330  Dwil_GLEANR_11984  Dmoj_GLEANR_9454  Dvir_GLEANR_8437  Dgri_GLEANR_2986
CG5317  Dere_GLEANR_8552  Dyak_GLEANR_2353  Dana_GLEANR_500  GA18800  Dwil_GLEANR_3891  Dmoj_GLEANR_1865  Dvir_GLEANR_1913  Dgri_GLEANR_11449
CG5284  -  -  -  -  -  Dmoj_GLEANR_13027  Dvir_GLEANR_13682  Dgri_GLEANR_16509
CG5241  -  -  -  -  -  Dmoj_GLEANR_12281  Dvir_GLEANR_11521  Dgri_GLEANR_14565
CG4386  Dere_GLEANR_5515  Dyak_GLEANR_13864  Dana_GLEANR_11660  GA18150  Dwil_GLEANR_15903  Dmoj_GLEANR_5682  Dvir_GLEANR_6022  Dgri_GLEANR_4997
CG4377  -  Dyak_GLEANR_13860  -  -  -  -  -  -
CG4363  Dere_GLEANR_5512  Dyak_GLEANR_13861  -  -  -  -  -  -
CG3731  -  -  -  -  -  -  Dvir_GLEANR_10252  -
CG34040  Dere_GLEANR_5513  Dyak_GLEANR_13862  Dana_GLEANR_11658  -  Dwil_GLEANR_19918  Dmoj_GLEANR_5680  Dvir_GLEANR_6020  Dgri_GLEANR_4995
CG34029  Dere_GLEANR_5506  Dyak_GLEANR_13855  Dana_GLEANR_11662  -  Dwil_GLEANR_15892  Dmoj_GLEANR_5670  Dvir_GLEANR_6012  Dgri_GLEANR_4988
CG3348  Dere_GLEANR_12166  Dyak_GLEANR_10480  Dana_GLEANR_19969  GA17395  Dwil_GLEANR_11560  Dmoj_GLEANR_9706  Dvir_GLEANR_8018  Dgri_GLEANR_842
CG3339  Dere_GLEANR_12165  Dyak_GLEANR_10479  Dana_GLEANR_19968  GA17389  Dwil_GLEANR_11559  Dmoj_GLEANR_9705  Dvir_GLEANR_8017  Dgri_GLEANR_841
CG33332  -  -  -  -  -  -  Dvir_GLEANR_10251  -
CG33331  -  -  -  -  -  -  Dvir_GLEANR_10250  -
CG33120  -  -  -  -  -  Dmoj_GLEANR_1863  Dvir_GLEANR_1911  Dgri_GLEANR_11447
CG3300  Dere_GLEANR_12164  Dyak_GLEANR_14478  Dana_GLEANR_19967  GA17383  Dwil_GELANR_11557  Dmoj_GLEANR_9704  Dvir_GLEANR_8016  Dgri_GLEANR_840
CG31161  Dere_GLEANR_11277  Dyak_GLEANR_10253  Dana_GLEANR_20121  -  -  -  -  -
CG31159  Dere_GLEANR_11276  Dyak_GLEANR_10120  Dana_GLEANR_20120  GA16055  Dwil_GLEANR_11527  -  Dvir_GLEANR_10746  Dgri_GLEANR_672
CG31158  Dere_GLEANR_11274  Dyak_GLEANR_10250  Dana_GLEANR_20118  GA16054  Dwil_GLEANR_11526  Dmoj_GLEANR_7478  -  Dgri_GLEANR_669
CG31156  Dere_GLEANR_12515  Dyak_GLEANR_7756  Dana_GLEANR_17414  GA16052  Dwil_GLEANR_11962  Dmoj_GLEANR_10440  -  Dgri_GLEANR_244
CG30048  Dere_GLEANR_5102  Dyak_GLEANR_12705  Dana_GLEANR_13612  -  -  Dmoj_GLEANR_2385  Dvir_GLEANR_2496  Dgri_GLEANR_10969
CG18735  Dere_GLEANR_5516  Dyak_GLEANR_13865  Dana_GLEANR_11661  GA15058  -  Dmoj_GLEANR_5683  Dvir_GLEANR_6023  Dgri_GLEANR_4998
CG18600  -  -  -  -  -  -  Dvir_GLEANR_10642  -
CG17078  -  -  -  -  Dwil_GLEANR_3894  -  -  -
CG14947  Dere_GLEANR_8854  Dyak_GLEANR_2355  Dana_GLEANR_501  GA13374  -  Dmoj_GLEANR_1867  Dvir_GLEANR_1915  Dgri_GLEANR_11450
CG14946  Dere_GLEANR_10216  Dyak_GLEANR_1270  Dana_GLEANR_824  GA13373  Dwil_GLEANR_3890  Dmoj_GLEANR_1562  Dvir_GLEANR_930  Dgri_GLEANR_10152
CG14945  Dere_GLEANR_8551  Dyak_GLEANR_2352  Dana_GLEANR_499  GA13372  Dwil_GLEANR_3922  Dmoj_GLEANR_1864  Dvir_GLEANR_1912  Dgri_GLEANR_11448
CG14264  Dere_GLEANR_11634  Dyak_GLEANR_7494  Dana_GLEANR_22280  GA12867  -  -  -  -
CG13980  Dere_GLEANR_11635  Dyak_GLEANR_7496  Dana_GLEANR_17556  GA12671  Dwil_GELANR_11933  Dmoj_GLEANR_8196  Dvir_GLEANR_9886  Dgri_GLEANR_91
CG13843  Dere_GLEANR_11274  Dyak_GLEANR_10251  Dana_GLEANR_20119  GA12565  Dwil_GLEANR_20016  -  -  Dgri_GLEANR_670
CG13492  Dere_GLEANR_5514  Dyak_GLEANR_13863  Dana_GLEANR_11659  GA12325  Dwil_GLEANR_15900  Dmoj_GLEANR_5681  Dvir_GLEANR_6021  Dgri_GLEANR_4996
CG13125  Dere_GLEANR_8766  Dyak_GLEANR_1013  Dana_GLEANR_14911  GA12063  Dwil_GLEANR_8916  Dmoj_GLEANR_948  Dvir_GLEANR_693  Dgri_GLEANR_10378
CG13075  -  -  -  -  -  -  -  Dgri_GLEANR_16508
CG11927  -  -  Dana_GLEANR_830  -  -  -  -  -
CG11926  -  -  Dana_GLEANR_831  -  -  -  -  -
CG10681  Dere_GLEANR_15641  Dyak_GLEANR_5628  Dana_GLEANR_10383  GA10490  Dwil_GLEANR_12826  Dmoj_GLEANR_12282  Dvir_GLEANR_11522  Dgri_GLEANR_14566
CG10660  -  -  Dana_GLEANR_8774  GA10475  Dwil_GLEANR_12922  -  -  -
CG10657  Dere_GLEANR_14056  Dyak_GLEANR_3965  Dana_GLEANR_8773  GA10472  Dwil_GLEANR_12921  -  -  -
CG10654  Dere_GLEANR_14050  Dyak_GLEANR_3962  Dana_GLEANR_8769  -  Dwil_GLEANR_12917  Dmoj_GLEANR_13021  Dvir_GLEANR_13676  Dgri_GLEANR_16505
CG10646  Dere_GLEANR_14051  Dyak_GLEANR_3963  Dana_GLEANR_8770  GA10465  Dwil_GLEANR_12918  Dmoj_GLEANR_13022  Dvir_GLEANR_13677  Dgri_GLEANR_16506
CG10638  -  Dyak_GLEANR_3964  Dana_GLEANR_8772  GA10458  -  Dmoj_GLEANR_13023  Dvir_GLEANR_13678  -
CG12636  Dere_GLEANR_8992  Dyak_GLEANR_852  -  -  Dwil_GLEANR_18646  -  -  -
Supplementary Notes
31. Aitken, R. J. & Graves, J. A. M., Human spermatozoa: The future of sex. Nature 415, (2002).
32. Zhou, Q. et al., On the origin of new genes in Drosophila. Genome Res 18, 1446-1455 (2008).
33. Gonzalez, J., Casals, F. & Ruiz, A., Duplicative and conservative transpositions of larval serum protein 1 genes in the genus Drosophila. Genetics 168, 253-264 (2004).
34. Przyborowski, J. & Wilenski, H., Homogeneity of results in testing samples from Poisson series: with an application to testing clover seed for dodder. Biometrika 31, 313-323 (1940).
35. Krishnamoorthy, K. & Thomson, J., A more powerful test for comparing two Poisson means. J. Stat. Plan. Inf. 119, 23-35 (2004).
36. Bonaccorsi, S. & Lohe, A., Fine mapping of satellite DNA sequences along the Y chromosome of Drosophila melanogaster: relationships between satellite sequences and fertility factors. Genetics 129, 177-189 (1991).
37. Matsunaga, S. et al., Duplicative transfer of a MADS box gene to a plant Y chromosome. Mol Biol Evol 20, 1062-1069 (2003).
38. Fisher, R. A., The evolution of dominance. Biological Reviews 6, 1 (1931).
39. Lynch, M. & Katju, V., The altered evolutionary trajectories of gene duplicates. Trends in Genetics 20, 544-549 (2004).
40. Usakin, L. A., Kogan, G. L., Kalmykova, A. I. & Gvozdev, V. A., An alien promoter capture as a primary step of the evolution of testes-expressed repeats in the Drosophila melanogaster genome. Mol Biol Evol 22, 1555-1560 (2005).
41. Aravin, A. A. et al., Double-stranded RNA-mediated silencing of genomic tandem repeats and transposable elements in the D. melanogaster germline. Curr Biol 11, 1017-1027 (2001).
42. Belloni, M., Tritto, P., Bozzetti, M. P., Palumbo, G. & Robbins, L. G., Does Stellate cause meiotic drive in Drosophila melanogaster? Genetics 161, 1551-1559 (2002).
43. Hamilton, W. D., Extraordinary sex ratios. Science 156, 477-488 (1967).
44. McCullagh, P. & Nelder, J. A., Generalized linear models, 2nd ed. (Chapman and Hall, London ; New York, 1989).
45. R Development Core Team, R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, 2007).
46. Tavare, S., Balding, D. J., Griffiths, R. C. & Donnelly, P., Inferring coalescence times from DNA sequence data. Genetics 145, 505-518 (1997).
47. Beaumont, M. A., Zhang, W. & Balding, D. J., Approximate Bayesian computation in population genetics. Genetics 162, 2025-2035 (2002).
48. Przeworski, M., Estimating the time since the fixation of a beneficial allele. Genetics 164, 1667-1676 (2003).
49. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R., A tool for analyzing and annotating genomic sequences. Genomics 46, 37-45 (1997).
50. Birney, E. & Durbin, R., Using GeneWise in the Drosophila annotation experiment. Genome Res 10, 547-548 (2000).
51. Lewis, S. E. et al., Apollo: a sequence annotation editor. Genome Biol. 3, R82 (2002).
52. Parra, G. et al., Comparative gene prediction in human and mouse. Genome Res 13, 108-117 (2003).
53. Thompson, J. D., Higgins, D. G. & Gibson, T. J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680 (1994).
54. Tamura, K., Dudley, J., Nei, M. & Kumar, S., MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24, 1596-1599 (2007).
55. Sudhir Kumar, personal communication.
56. Rose, T. M., Henikoff, J. G. & Henikoff, S., CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primer) PCR primer design. Nucleic Acids Res 31, 3763-3766 (2003).
57. Liberles, D. A., Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Mol Biol Evol 18, 2040-2047 (2001).
Supplementary Data
analytical.xls This MS Excel file implements the analytical treatment of the
ascertainment bias and estimates the unbiased ratio of the gain rate to the loss rate, as
described in the Supplementary Methods (section 1).
indelsim_free.R This R program implements the computer
simulations of gene gain and loss in the Y chromosome (across the 12 species), as
described in the Supplementary Methods (section 2). It produces approximate Bayesian
estimates of the posterior densities of the rates of gene gain and loss. The run time is
approximately two days on a 2 GHz dual-core computer.
gains_losses_script.R This R program implements the
Poisson regression that tests the statistical significance of the gene gain / gene loss ratio.
gains_losses_data.txt This is the data file used by the gains_losses_script.R program.