


The Genomics of Speciation in Drosophila athabasca

by

Karen Masae Wong Miller

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Integrative Biology

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Doris Bachtrog, Co-chair

Professor Michael B. Eisen, Co-chair

Professor Rasmus Nielsen

Professor Rosemary Gillespie

Fall 2013


Abstract

The Genomics of Speciation in Drosophila athabasca

by

Karen Masae Wong Miller

Doctor of Philosophy in Integrative Biology

University of California, Berkeley

Professor Doris Bachtrog, Co-Chair

Professor Michael B. Eisen, Co-Chair

Understanding the genetic basis underlying the process of speciation is one of the primary goals in the field of evolutionary biology. However, despite recent and exciting progress in the field of speciation genetics, particularly in the area of postzygotic isolating mechanisms, surprisingly little is still known about the genetic basis and evolutionary forces that are important early on in speciation. Notably, molecular mechanisms relating to the evolution of prezygotic isolating barriers are particularly poorly understood. While studies on the genetics of postzygotic isolating barriers are critical to our understanding of how species boundaries are maintained post-speciation, such factors potentially may not have been involved in driving the actual speciation event, and instead may have evolved secondarily. By studying recently diverged populations, we increase the chances that the differences in the genome that we detect are actually directly responsible for driving reproductive isolation and thus speciation. The widespread availability of whole genome sequencing techniques opens up the opportunity to examine speciation at the genomic level in non-model species that may be more applicable to the study of early speciation. This research takes advantage of whole genome sequencing techniques and a very young semispecies system, Drosophila athabasca. The D. athabasca species complex, which is composed of three overlapping semispecies – Western-Northern, Eastern-A, and Eastern-B, provides a unique system in which to study incipient speciation using population genomics. The three semispecies of D. athabasca are estimated to have diverged less than 25,000 years ago and are morphologically indistinguishable. Individuals will hybridize in the laboratory, but their geographic ranges and distinct courtship songs, which result in prezygotic isolation, differentiate the populations sufficiently for them to be designated as semispecies. 
This very young divergence time and unique population structure within D. athabasca make it an ideal system to study the genetics of prezygotic isolation and incipient speciation.


I first generated a de novo reference genome assembly for D. athabasca by sequencing the genome at 30X coverage using Illumina next-generation sequencing technologies, and annotated this reference genome using a combination of de novo, comparative, and mRNAseq gene finding methods. In order to examine the genome of D. athabasca at a population genomic level, I established 404 iso-female lines of D. athabasca collected from across the species range, including the previously identified ranges of all three semispecies. I characterized courtship songs from a subset of these lines and sequenced the genomes of 28 individuals, roughly equally distributed geographically and between semispecies, each at 10X coverage. Using these genome-wide population data, I quantified levels of genome-wide diversity and differentiation within and between semispecies. Despite relatively low levels of divergence within the complex, principal component and phylogenetic analyses using the genomic data clearly separate individuals into distinct genetic groups corresponding to the three behaviorally defined semispecies. Furthermore, phylogenetic analysis places Eastern-A and Eastern-B as sister taxa, confirming previous research indicating that Eastern-A and Eastern-B are the more closely related semispecies, with Western-Northern being the most anciently diverged of the three. To infer the speciation history of the D. athabasca complex, I fit demographic models to the data and estimated divergence times under a model of isolation with low levels of migration. This model estimates a divergence time of only 6,000 years ago for the Eastern-A/Eastern-B split and 16,000 years ago for the Western/Eastern split, consistent with a previous hypothesis of population expansion and colonization of North America following the last glacial maximum.

Overall divergence within the semispecies is low, with approximately 2 million sites variable within D. athabasca, and only 1% of these variable sites being private and fixed within semispecies. Furthermore, I find that divergence is not evenly distributed across the genome, with the X-chromosome exhibiting increased levels of divergence compared with the autosomes. Most interestingly, despite the low levels of overall divergence, genome-wide scans identify a single large spike of differentiation between the two youngest semispecies, which have an estimated divergence time of only 6,000 years. Scans for selection also show strong signatures of a selective sweep within semispecies at this same locus, indicating that divergence in this particular region has likely been driven by selection. Further analysis of this region reveals that it harbors a gene previously identified to be involved in courtship song in other species within the Drosophila genus, suggesting that it may play an important role in the evolution of prezygotic reproductive barriers between the D. athabasca semispecies.

This study provides one of the first genome-wide population genetic investigations of the molecular changes and population parameters important during incipient speciation, contributing important new information to our view of the genetics of speciation. With some of the youngest divergence times used to study speciation genetics on a genome-wide level thus far, examining the patterns of divergence within and between the semispecies of D. athabasca has allowed us to identify a candidate gene that may play a role in the evolution of prezygotic isolation, and thus reproductive isolation, at the earliest stages of speciation.


DEDICATION

To Daniel.


TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
    Background
    Drosophila athabasca as a new model system for the study of speciation
    References
    Figures

CHAPTER 2: ESTABLISHING GENOMIC RESOURCES FOR DROSOPHILA ATHABASCA
    Reference genome assembly & annotation
    Collection of population samples
    Courtship song assays
    Survey of mitotic karyotype variation
    Whole genome re-sequencing & variant calling
    Acknowledgements
    References
    Tables

CHAPTER 3: PATTERNS OF GENOME-WIDE DIVERSITY & POPULATION STRUCTURE IN DROSOPHILA ATHABASCA
    Abstract
    Introduction
    Methods
    Results
    Discussion
    References
    Figures
    Tables

CHAPTER 4: THE GENOMIC LANDSCAPE OF INCIPIENT SPECIATION REVEALS A CANDIDATE SPECIATION GENE IN DROSOPHILA ATHABASCA
    Abstract
    Letter
    References
    Figures
    Supplementary Information

APPENDIX A: COURTSHIP SONG WAVEFORMS OF SEQUENCED LINES

APPENDIX B: MITOTIC KARYOTYPES


CHAPTER 1: INTRODUCTION

Background

Although Darwin laid out the conceptual foundation for understanding speciation in “On the Origin of Species” over 150 years ago, exactly how and why speciation occurs remain some of the most debated questions in evolutionary biology. The nature and definition of a species itself is a broad, multifaceted topic. However, the widely accepted Biological Species Concept (BSC), which defines species as “groups of actually or potentially interbreeding natural populations, which are reproductively isolated from other such groups” (Mayr 1942), allows speciation research to essentially reduce to understanding the evolution and genetics of reproductive isolating mechanisms between populations. On a genomic level, this means individuals within a species freely exchange genetic material until one or more mutations arise in the genome that act to severely reduce or eliminate gene flow between populations, resulting in a speciation event where a single species splits into two reproductively isolated lineages.

Although the BSC has created a more focused route for exploring the genetics of speciation, we still know surprisingly little about the changes that occur in the genome during a speciation event. Much of our lack of understanding can be attributed to the fact that once reproductive isolation evolves, other incompatibilities between species are expected to accumulate rapidly (the snowballing effect, Matute et al. 2010). Due to the snowballing nature of species divergence, a major problem that has continuously plagued our understanding of speciation genetics is distinguishing the differences responsible for initiating the actual speciation event from those that have evolved secondarily. This problem of secondary accumulation of mutations complicates all other aspects of speciation research. Below is a brief overview of a few of the most discussed areas of speciation genetics research and the progress that has been made in each (for more comprehensive reviews see Wu 2001, Coyne & Orr 2004, Kulathinal & Singh 2008, Presgraves 2010, Feder et al. 2012).

Importance of prezygotic vs. postzygotic isolating mechanisms?

Reproductive isolating barriers can come in many forms. Premating prezygotic isolation involves mechanisms that prevent two populations from mating, for example morphological differences that physically prevent mating (Sota & Kubota 1998, Gittenberger 1988), pollinator or host specificity (Bradshaw et al. 1995, Schemske & Bradshaw 1999, Bush 1992), temporal differences in mating seasons (Lloyd & Dybas 1966, Knowlton et al. 1997), and differences in courtship rituals and mate choice (Seehausen & van Alphen 1998, Grillet et al. 2006). Postmating prezygotic mechanisms allow for copulation but prevent zygote formation (Ludlow & Magurran 2006, Sweigart 2010). Finally, postzygotic mechanisms, which can be extrinsic or intrinsic, allow for copulation and zygote formation, but the offspring are inviable or sterile (Barbash et al. 2000, Presgraves et al. 2003), prohibiting further gene exchange between species. With so many types of barriers, one of the commonly asked questions in speciation research is whether prezygotic or postzygotic barriers are more important during the speciation process.

In nature, we observe a large number of species that show prezygotic isolation in the absence of postzygotic isolating barriers. It has therefore been thought that prezygotic isolation may commonly evolve first (Butlin et al. 2009) and thus be more important to the actual speciation process, with postzygotic isolating barriers potentially accumulating after the speciation event. In support of this view, a study in Drosophila examined the strength of prezygotic and postzygotic isolation in pairs of species at varying genetic distances and found that, for sympatric species, prezygotic isolation evolves earlier (Coyne & Orr 1989, 1997). However, although prezygotic isolation may be particularly important early on in the speciation process, we know surprisingly little about the genetics of these prezygotic mechanisms, perhaps because phenotypes associated with postzygotic isolation are easier to quantify in genetic mapping studies (Kulathinal & Singh 2008, Butlin et al. 2009).

Given the small number of species used to investigate the genetics of speciation, overall patterns across both sympatric and allopatric taxa are still very much unknown. Investigating many more species at varying evolutionary distances, and specifically during the earliest stages of divergence, is likely to give us a better indication of the timing of evolution of both prezygotic and postzygotic isolating barriers during speciation.

Can speciation occur in the face of gene flow?

Since gene flow acts as a homogenizing force opposing speciation, exactly how reproductive isolating barriers and genetic differentiation can evolve in sympatry between populations has been a consistent problem for speciation research. Overcoming gene flow is an issue both for speciation in sympatry and for situations where populations with incomplete reproductive isolating barriers experience secondary contact after an initial period of allopatry. Because of the complication of gene flow, a very popular past view was that speciation in sympatry must not occur, and that populations with incomplete reproductive barriers that meet during secondary contact are transient and will eventually fuse into one species (reviewed in Coyne & Orr 2004).

Although transiency of species can be difficult to rule out, recent work investigating the role of chromosomal rearrangements during speciation has shown that species can remain distinct despite ongoing gene flow (Noor et al. 2001, Nosil 2008, Michel et al. 2010, Jones et al. 2012, Martin et al. 2013), reviving the plausibility of speciation in sympatry. Theory predicts that rearrangements, such as inversions and translocations, may promote speciation with gene flow by suppressing recombination and allowing divergence to accumulate and be maintained in rearranged regions despite ongoing gene flow throughout the rest of the genome (Noor et al. 2001, Rieseberg 2001, Navarro & Barton 2003, Ayala & Coluzzi 2005, Kirkpatrick & Barton 2006, Noor et al. 2007, Kulathinal et al. 2009, Kirkpatrick 2010, McGaugh & Noor 2012). Comparisons have also shown that related sympatric species exhibit more inversions than closely related allopatric species (Noor et al. 2001, Ayala & Coluzzi 2005), supporting a role for chromosomal rearrangements in speciation with gene flow. Most compellingly, studies have mapped loci known to be involved in reproductive isolation and local adaptation to inverted regions in multiple species groups that have experienced recent introgression, including Drosophila (Noor et al. 2001, Williams et al. 2001, Khadem & Camacho 2011), monkeyflowers (Lowry & Willis 2010, Fishman et al. 2013), sunflowers (Kim & Rieseberg 1999), sticklebacks (Jones et al. 2012), and butterflies (Joron et al. 2011).

Research on the role that genome rearrangements play in speciation and maintaining species boundaries despite ongoing gene flow is still in its infancy. However, our ability to investigate this topic has benefitted substantially from the recent development of whole-genome sequencing technologies, allowing us to expand this line of research to include many more species comparisons. Again, focusing particularly on those species in the very earliest stages of divergence can provide valuable insight into the role that gene flow and genome rearrangements play during speciation.

Role of sex chromosomes in speciation?

Work investigating the genomic distribution of genes involved in causing reproductive isolation has produced an interesting pattern with regard to the importance of sex chromosomes in speciation. In multiple taxa, divergence between species accumulates disproportionately on the sex chromosomes compared to autosomes (Stump et al. 2005, Garrigan et al. 2012, Ellegren et al. 2012). Similarly, mapping of hybrid sterility and inviability factors has shown that these factors also accumulate disproportionately on the X-chromosome (Dobzhansky 1936, Wu & Beckenbach 1983, Coyne & Charlesworth 1989, Khadem & Krimbas 1991, True et al. 1996, Tao et al. 2003a, Masly & Presgraves 2007, Good et al. 2008, Janousek et al. 2012). The two major reasons cited for the sex chromosomes playing a special role during speciation are Haldane’s rule and the large-X effect. Haldane’s rule, the observation that when hybrids are sterile or inviable, the affected sex will be the heterogametic sex, has been observed across many taxa, including Drosophila, butterflies, birds, mammals, and some plants with sex chromosomes (Coyne 1992, Naisbit et al. 2002, Brothers & Delph 2010). There have been many proposed mechanisms explaining Haldane’s rule, including dominance, faster-male evolution, faster-X, meiotic drive elements, and endosymbiont-driven hypotheses. The large-X effect is based on evidence that the X-chromosome appears to have a larger effect on hybrid incompatibility than predicted based on its size or number of genes. Evidence for the large-X effect has been shown in Drosophila, where hybrid male sterility factors appear 2-4 times more dense on the X-chromosome compared to the autosomes (Tao et al. 2003a, Masly & Presgraves 2007), and in butterflies (large-Z; Naisbit et al. 2002). Although it is clear that sex chromosomes appear to be important locations for the evolution of postzygotic reproductive isolation mechanisms, exactly why this is the case is still very much debated, as evidenced by the number of competing hypotheses.

Since research on species in the very earliest stages of divergence is still lacking, it will be interesting to see whether these patterns hold up, or whether sex chromosomes may instead be hotspots for accumulating secondary differences between species. Also, although much less work has been carried out investigating prezygotic isolating mechanisms, some studies have suggested a potentially larger autosomal component for sexual isolation behaviors compared to the X-chromosome (Coyne 1989, Hollocher et al. 1997, Shaw et al. 2007). However, all of the reproductive isolation genes that have been clearly mapped at the gene level relate exclusively to hybrid incompatibility, so as we begin to locate genes responsible for prezygotic isolation, it will be interesting to see if they are also preferentially located on sex chromosomes, or whether they have a stronger autosomal component.

How many genes are involved in speciation?

The question of how many genes are involved in the evolution of reproductive isolation is still not well understood. The classic Dobzhansky-Muller incompatibility model proposes involvement of at least two loci; however, some studies have shown that changes in single genes can cause speciation. An example of single-gene speciation was shown in the coiling reversal of land snails, where a mutation reversing the coiling direction of the shell also resulted in reproductive isolation by causing genital mismatch between the left- and right-coiled morphs (Ueshima & Asami 2003). Although changes in a few genes of large effect may be the case in some systems, many QTL analyses and studies investigating the genetics of hybrid incompatibility reveal effects of large numbers of loci (Presgraves 2003, Tao et al. 2003a, 2003b, Shaw et al. 2007).

Some of the most exciting work in the past few decades of speciation research has been the genetic mapping of postzygotic isolating mechanisms, giving us valuable information about the number of loci involved and even pinpointing specific “speciation genes” contributing to hybrid incompatibility (Sawamura & Yamamoto 1997, Ting et al. 1998, Presgraves et al. 2003, Barbash et al. 2003, Brideau et al. 2006, Masly et al. 2006, Bomblies et al. 2007, Tang & Presgraves 2009, Phadnis & Orr 2009, Lee et al. 2008, Bikard et al. 2009, Mihola et al. 2009). This research is crucial to our understanding of species and species differences, particularly of how the genetic boundaries of species are maintained. However, since these studies have been conducted using species pairs with long divergence times, they may not necessarily be giving us any information about the genetics of the actual speciation event.

Moving forward

Over the past few decades, with the development of genetic tools and techniques employing “model species,” we have made exciting progress in the field of speciation genetics, even pinpointing a handful of genes responsible for reproductive isolation. However, the picture is far from complete. Notably, as discussed above, all the “speciation genes” that have been identified so far relate to hybrid incompatibility between anciently diverged species pairs. There is a good reason for this: the genetic tools and genomic resources widely available to researchers have exclusively been those of the classic model organisms, leaving these organisms as the only feasible systems to tackle these questions. However, this focus on model organisms with developed genetic resources has left us with a very incomplete picture of speciation genetics. Research relying on classic model organisms has centered around anciently diverged species pairs (Saccharomyces cerevisiae/bayanus: 14 mya; D. melanogaster/simulans: 2-3 mya; Human/chimp: 6 mya; Mouse/rat: 25 mya; Mus subspecies: 350 kya), leading to a possibly artificial focus on deciphering the genetics of postzygotic reproductive isolating mechanisms (hybrid incompatibility) in speciation research. And, as discussed above, although postzygotic reproductive isolating mechanisms are undoubtedly important for understanding the maintenance of species boundaries, it remains to be seen whether postzygotic isolation mechanisms directly play a role during the initial formation of new species.

Fortunately, the recent development of cost-effective whole-genome sequencing technologies has led to a new era of speciation genomics, which has opened up the opportunity to develop genomic resources for other species more applicable to the study of speciation. Additionally, these technologies give us important genome-wide information, allowing us to investigate patterns of differentiation across the entire genome. Thus, questions that could not be pursued further with previous technologies and models can now be revisited using new approaches and, importantly, more species. Already, a number of recent studies have demonstrated the usefulness of whole-genome sequencing technologies to examine speciation on a genomic level (Kulathinal et al. 2009, Lawniczak et al. 2010, Jones et al. 2012, Garrigan et al. 2012, Janousek et al. 2012, Ellegren et al. 2012, Martin et al. 2013, Nadeau et al. 2013, Zhang et al. 2013, Andrew & Rieseberg 2013). These recent studies show the promise that whole-genome technologies have for improving our understanding of the genetics of speciation. Since speciation most certainly does not proceed in the same fashion in every lineage or taxon, cost-effective whole-genome sequencing is crucial to giving us a fuller and more accurate view of the genetics of speciation by allowing us to develop genomic tools for non-model species.

Drosophila athabasca as a new model system for the study of speciation

Studies in Drosophila have greatly increased our understanding of the genetics of speciation, particularly relating to postzygotic reproductive isolation (Wu & Palopoli 1994, Ting et al. 1998, Barbash et al. 2000, Presgraves et al. 2003, Masly et al. 2006, Chang & Noor 2007, Phadnis & Orr 2009, Tang & Presgraves 2009, Lu et al. 2010). However, the difficulty in teasing apart factors that are important to the actual process of speciation from those that have accumulated secondarily has been widely noted. Thus, although understanding the genetics of postzygotic reproductive isolating barriers is important for understanding the maintenance of species boundaries, these barriers may not have been important early on during species divergence. Instead, as discussed above, prezygotic isolating barriers may predominate during the early stages of divergence. To better understand the genetics of the early stages of speciation we must examine very recently diverged systems, increasing the likelihood that the differences that we detect are those responsible for the actual speciation event.

Drosophila athabasca, a non-model Drosophila species, is a North American species complex within the obscura group and affinis subgroup of Drosophila. The affinis subgroup is a young radiation, with members of the subgroup having an average age of only 3.5 million years, and its oldest member, D. azteca, originating only 6 mya (Beckenbach et al. 1993). The D. athabasca complex itself is made up of three semispecies: Western-Northern, Eastern-A, and Eastern-B, named for their respective ranges (Figure 1). Previous research on this group estimated that the complex diverged very recently, with Western-Northern having split from the Eastern semispecies 23,000 years ago, and Eastern-A splitting from Eastern-B only 5,000 years ago (Ford & Aquadro 1996). These very young divergence times make D. athabasca among the youngest species to be used to study speciation genomics and an invaluable addition to helping us understand what changes occur along the genome during the earliest stages of speciation. Along with having a very recent divergence time, a number of additional features characteristic of the D. athabasca semispecies, discussed below, make them particularly ideal to use as a new model species for speciation genomics.

The D. athabasca species range is spread across the northern half of North America. Semispecies ranges are largely non-overlapping, although regions of sympatry exist between Western-Northern and Eastern-A, and between Eastern-A and Eastern-B (Miller & Westphal 1967, Johnson 1985, Ford et al. 1994, Figure 1). Regions of sympatry offer the opportunity to examine the interplay between gene flow and speciation in D. athabasca. Identifying regions of the genome resistant to introgression should help us to understand how the genetic boundaries of incipient species are maintained despite ongoing gene flow.

Phenotypically, D. athabasca is also an interesting group for studying the genetics of speciation. All three members of the complex appear morphologically identical, but they exhibit semispecies-specific male courtship song (Miller 1958; Miller et al. 1975). Additionally, mate choice experiments have shown strong corresponding female preference for the semispecies-specific courtship songs (Yoon 1991), consistent with a behavioral premating prezygotic reproductive isolating barrier. The semispecies of D. athabasca lack postzygotic isolating barriers, as hybrid crosses and no-choice experiments in a laboratory setting have shown that all semispecies crosses produce fertile offspring (Miller & Westphal 1967; Johnson 1978, Yoon 1991). Thus, D. athabasca is a promising system for investigating the genetic mechanism of a rapidly evolving prezygotic isolating barrier, in this case the genetic basis of semispecies-specific courtship song.

The large number of chromosomal rearrangements known within the D. athabasca complex (Novitski 1946, Johnson 1985) also makes it a strong candidate for investigating the potential role that structural rearrangements have played in the divergence of the semispecies, and ultimately in the evolution of prezygotic reproductive isolation. More than 70 inversions, each unique to a single semispecies, have been estimated in D. athabasca (Johnson 1985). In particular, the X chromosome is known to harbor inversions fixed between semispecies, with seven fixed inversions separating Western-Northern from the Eastern semispecies and three fixed inversions separating the two Eastern semispecies (Yoon & Aquadro 1994). Examining patterns of divergence in inverted vs. non-inverted regions of the genome, and identifying which genes lie within inverted regions, should indicate how chromosomal inversions may have contributed to the divergence of the D. athabasca semispecies.

In addition to fixed inversions, D. athabasca exhibits a polymorphic Y-autosome fusion: individuals within the Western-Northern semispecies lack the fusion, whereas the fusion is segregating within the Eastern-A semispecies and is fixed in Eastern-B (Miller & Roy 1964, Miller & Westphal 1967). This unusual polymorphic sex chromosome system could provide insights into the role that sex chromosomes play during speciation. Because sex chromosomes are hypothesized to play a special role in speciation, we can use D. athabasca to test directly whether different patterns or evolutionary forces act on the same chromosome in a sex-linked vs. an autosomal context.

The early, mostly descriptive, work on D. athabasca has revealed the considerable potential of this young species complex for addressing some of the major persistent questions of speciation genetics. Despite this promise for investigating the genetics of incipient speciation, the species complex has not been widely studied. Beyond a handful of genes, little is known about genome-wide patterns of molecular variation and gene flow within the species, which has limited the use of D. athabasca in genetic and evolutionary analyses. With the recent development of whole-genome sequencing technologies, however, we can now revisit this species complex to study the genetic basis of prezygotic isolation.


References

Andrew R. L., Rieseberg L. H., 2013. Divergence is focused on few genomic regions early in speciation: incipient speciation of sunflower ecotypes. Evolution 67: 2468–2482.
Ayala F. J., Coluzzi M., 2005. Chromosome speciation: humans, Drosophila, and mosquitoes. Proc. Natl. Acad. Sci. U.S.A. 102 Suppl 1: 6535–6542.
Barbash D. A., Roote J., Ashburner M., 2000. The Drosophila melanogaster hybrid male rescue gene causes inviability in male and female species hybrids. Genetics 154: 1747–1771.
Barbash D. A., Siino D. F., Tarone A. M., Roote J., 2003. A rapidly evolving MYB-related protein causes species isolation in Drosophila. Proc. Natl. Acad. Sci. U.S.A. 100: 5302–5307.
Beckenbach A. T., Wei Y. W., Liu H., 1993. Relationships in the Drosophila obscura species group, inferred from mitochondrial cytochrome oxidase II sequences. Mol. Biol. Evol. 10: 619–634.
Bikard D., Patel D., Le Metté C., Giorgi V., Camilleri C., Bennett M. J., Loudet O., 2009. Divergent evolution of duplicate genes leads to genetic incompatibilities within A. thaliana. Science 323: 623–626.
Bomblies K., Lempe J., Epple P., Warthmann N., Lanz C., Dangl J. L., Weigel D., 2007. Autoimmune response as a mechanism for a Dobzhansky-Muller-type incompatibility syndrome in plants. PLoS Biol 5: e236.
Bradshaw H. D. Jr, Wilbert M., Otto K. G., Schemske D. W., 1995. Genetic mapping of floral traits associated with reproductive isolation in monkeyflowers (Mimulus). Nature 376: 762–765.
Brideau N. J., Flores H. A., Wang J., Maheshwari S., Wang X., Barbash D. A., 2006. Two Dobzhansky-Muller genes interact to cause hybrid lethality in Drosophila. Science 314: 1292–1295.
Brothers A. N., Delph L. F., 2010. Haldane's rule is extended to plants with sex chromosomes. Evolution 64: 3643–3648.
Bush G. L., 1992. Host Race Formation and Sympatric Speciation in Rhagoletis Fruit Flies (Diptera: Tephritidae). Psyche: A Journal of Entomology 99: 335–357.
Butlin R., Debelle A., Kerth C., Snook R. R., Beukeboom L. W., Castillo Cajas R. F., Diao W., Maan M. E., Paolucci S., Weissing F. J., van de Zande L., Hoikkala A., Geuverink E., Jennings J., Kankare M., Knott K. E., Tyukmaeva V. I., Zoumadakis C., Ritchie M. G., Barker D., Immonen E., Kirkpatrick M., Noor M., Macias Garcia C., Schmitt T., Schilthuizen M., 2012. What do we need to know about speciation? Trends Ecol. Evol. (Amst.) 27: 27–39.


Chang A. S., Noor M. A. F., 2007. The genetics of hybrid male sterility between the allopatric species pair Drosophila persimilis and D. pseudoobscura bogotana: dominant sterility alleles in collinear autosomal regions. Genetics 176: 343–349.
Coyne J. A., 1989. Genetics of sexual isolation between two sibling species, Drosophila simulans and Drosophila mauritiana. Proceedings of the National Academy of Sciences 86: 5464–5468.
Coyne J., 1992. Genetics and speciation. Nature 355: 511–515.
Coyne J., Charlesworth B., 1989. Genetic analysis of X-linked sterility in hybrids between three sibling species of Drosophila. Heredity 62 (Pt 1): 97–106.
Coyne J., Orr H., 1989. Patterns of speciation in Drosophila. Evolution 43: 362–381.
Coyne J., Orr H., 1997. “Patterns of speciation in Drosophila” revisited. Evolution 51: 5–303.
Coyne J. A., Orr H. A., 2004. Speciation. Sinauer Associates, Inc., Sunderland, Massachusetts.
Dobzhansky T., 1936. Studies on Hybrid Sterility. II. Localization of Sterility Factors in Drosophila Pseudoobscura Hybrids. Genetics 21: 113–135.
Ellegren H., Smeds L., Burri R., Olason P. I., Backström N., Kawakami T., Künstner A., Mäkinen H., Nadachowska-Brzyska K., Qvarnström A., Uebbing S., Wolf J. B. W., 2012. The genomic landscape of species divergence in Ficedula flycatchers. Nature 491: 756–760.
Feder J. L., Egan S. P., Nosil P., 2012. The genomics of speciation-with-gene-flow. Trends in Genetics 28: 342–350.
Fishman L., Stathos A., Beardsley P. M., Williams C. F., Hill J. P., 2013. Chromosomal rearrangements and the genetics of reproductive barriers in Mimulus (monkeyflowers). Evolution: doi:10.1111/evo.12154.
Ford M. J., Aquadro C. F., 1996. Selection on X-linked genes during speciation in the Drosophila athabasca complex. Genetics 144: 689–703.
Ford M. J., Yoon C. K., Aquadro C. F., 1994. Molecular evolution of the period gene in Drosophila athabasca. Mol. Biol. Evol. 11: 169–182.
Garrigan D., Kingan S. B., Geneva A. J., Andolfatto P., Clark A. G., Thornton K. R., Presgraves D. C., 2012. Genome sequencing reveals complex speciation in the Drosophila simulans clade. Genome Res 22: 1499–1511.
Gittenberger E., 1988. Sympatric Speciation in Snails; a Largely Neglected Model. Evolution 42: 826–828.


Good J. M., Dean M. D., Nachman M. W., 2008. A complex genetic basis to X-linked hybrid male sterility between two species of house mice. Genetics 179: 2213–2228.
Grillet M., Dartevelle L., Ferveur J.-F., 2006. A Drosophila male pheromone affects female sexual receptivity. Proc. Biol. Sci. 273: 315–323.
Hollocher H., Ting C.-T., Wu M.-L., Wu C.-I., 1997. Incipient Speciation by Sexual Isolation in Drosophila melanogaster: Extensive Genetic Divergence Without Reinforcement. Genetics.
Janoušek V., Wang L., Luzynski K., Dufková P., Vyskočilová M. M., Nachman M. W., Munclinger P., Macholán M., Piálek J., Tucker P. K., 2012. Genome-wide architecture of reproductive isolation in a naturally occurring hybrid zone between Mus musculus musculus and M. m. domesticus. Mol. Ecol. 21: 3032–3047.
Johnson D., 1985. Genetic differentiation in the Drosophila athabasca complex. Evolution 39: 467–472.
Johnson D., 1978. Genetic differentiation in two members of the Drosophila athabasca complex. Evolution 32: 798–811.
Jones F. C., Grabherr M. G., Chan Y. F., Russell P., Mauceli E., Johnson J., Swofford R., Pirun M., Zody M. C., White S., Birney E., Searle S., Schmutz J., Grimwood J., Dickson M. C., Myers R. M., Miller C. T., Summers B. R., Knecht A. K., Brady S. D., Zhang H., Pollen A. A., Howes T., Amemiya C., Broad Institute Genome Sequencing Platform & Whole Genome Assembly Team, Baldwin J., Bloom T., Jaffe D. B., Nicol R., Wilkinson J., Lander E. S., Di Palma F., Lindblad-Toh K., Kingsley D. M., 2012. The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484: 55–61.
Joron M., Frezal L., Jones R. T., Chamberlain N. L., Lee S. F., Haag C. R., Whibley A., Becuwe M., Baxter S. W., Ferguson L., Wilkinson P. A., Salazar C., Davidson C., Clark R., Quail M. A., Beasley H., Glithero R., Lloyd C., Sims S., Jones M. C., Rogers J., Jiggins C. D., ffrench-Constant R. H., 2011. Chromosomal rearrangements maintain a polymorphic supergene controlling butterfly mimicry. Nature 477: 203–206.
Khadem M., Camacho R., 2011. Studies of the species barrier between Drosophila subobscura and D. madeirensis V: the importance of sex-linked inversion in preserving species identity. J Evol Biol 24: 1263–1273.
Khadem M., Krimbas C. B., 1991. Studies of the species barrier between Drosophila subobscura and D. madeirensis I. The genetics of male hybrid sterility. Heredity 67: 157–165.
Kim S.-C., Rieseberg L. H., 1999. Genetic Architecture of Species Differences in Annual Sunflowers: Implications for Adaptive Trait Introgression. Genetics 153: 965–977.
Kirkpatrick M., 2010. How and Why Chromosome Inversions Evolve. PLoS Biol 8: e1000501.


Kirkpatrick M., Barton N., 2006. Chromosome Inversions, Local Adaptation and Speciation. Genetics 173: 419–434.
Knowlton N., Mate J. L., Guzman H. M., Rowan R., Jara J., 1997. Direct evidence for reproductive isolation among the three species of the Montastraea annularis complex in Central America (Panama and Honduras). Marine Biology 127: 705–711.
Kulathinal R. J., Singh R. S., 2008. The molecular basis of speciation: from patterns to processes, rules to mechanisms. J. Genet. 87: 327–338.
Kulathinal R. J., Stevison L. S., Noor M. A. F., 2009. The Genomics of Speciation in Drosophila: Diversity, Divergence, and Introgression Estimated Using Low-Coverage Genome Sequencing. Plos Genet 5: e1000550.
Lawniczak M. K. N., Emrich S. J., Holloway A. K., Regier A. P., Olson M., White B., Redmond S., Fulton L., Appelbaum E., Godfrey J., Farmer C., Chinwalla A., Yang S.-P., Minx P., Nelson J., Kyung K., Walenz B. P., Garcia-Hernandez E., Aguiar M., Viswanathan L. D., Rogers Y.-H., Strausberg R. L., Saski C. A., Lawson D., Collins F. H., Kafatos F. C., Christophides G. K., Clifton S. W., Kirkness E. F., Besansky N. J., 2010. Widespread divergence between incipient Anopheles gambiae species revealed by whole genome sequences. Science 330: 512–514.
Lee H.-Y., Chou J.-Y., Cheong L., Chang N.-H., Yang S.-Y., Leu J.-Y., 2008. Incompatibility of nuclear and mitochondrial genomes causes hybrid sterility between two yeast species. Cell 135: 1065–1073.
Lloyd M., Dybas H. S., 1966. The Periodical Cicada Problem. II. Evolution. Evolution 20: 466–505.
Lowry D. B., Willis J. H., 2010. A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol 8: e1000500.
Lu X., Shapiro J. A., Ting C.-T., Li Y., Li C., Xu J., Huang H., Cheng Y.-J., Greenberg A. J., Li S.-H., Wu M.-L., Shen Y., Wu C.-I., 2010. Genome-wide misexpression of X-linked versus autosomal genes associated with hybrid male sterility. Genome Res 20: 1097–1102.
Ludlow A. M., Magurran A. E., 2006. Gametic isolation in guppies (Poecilia reticulata). Proc. Biol. Sci. 273: 2477–2482.
Martin S. H., Dasmahapatra K. K., Nadeau N. J., Salazar C., Walters J. R., Simpson F., Blaxter M., Manica A., Mallet J., Jiggins C. D., 2013. Genome-wide evidence for speciation with gene flow in Heliconius butterflies. Genome Res: doi:10.1101/gr.159426.113.
Masly J. P., Jones C. D., Noor M. A. F., Locke J., Orr H. A., 2006. Gene transposition as a cause of hybrid sterility in Drosophila. Science 313: 1448–1450.


Masly J. P., Presgraves D. C., 2007. High-resolution genome-wide dissection of the two rules of speciation in Drosophila. PLoS Biol 5: e243.
Matute D. R., Butler I. A., Turissini D. A., Coyne J. A., 2010. A Test of the Snowball Theory for the Rate of Evolution of Hybrid Incompatibilities. Science 329: 1518–1521.
Mayr E., 1942. Systematics and the Origin of Species. Columbia University Press, New York.
McGaugh S. E., Noor M. A. F., 2012. Genomic impacts of chromosomal inversions in parapatric Drosophila species. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 367: 422–429.
Michel A. P., Sim S., Powell T. H. Q., Taylor M. S., Nosil P., Feder J. L., 2010. Widespread genomic divergence during sympatric speciation. Proceedings of the National Academy of Sciences 107: 9724–9729.
Mihola O., Trachtulec Z., Vlcek C., Schimenti J. C., Forejt J., 2009. A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science 323: 373–375.
Miller D., 1958. Sexual Isolation and Variation in Mating-Behavior Within Drosophila-Athabasca. Evolution 12: 72–81.
Miller D., Goldstein R., Patty R., 1975. Semispecies of Drosophila athabasca distinguishable by male courtship sounds. Evolution 29: 531–544.
Miller D. D., Roy R., 1964. Further data on Y chromosome types in Drosophila athabasca. Can. J. Genet. Cytol. 259: 334–348.
Miller D., Westphal N., 1967. Further evidence on sexual isolation within Drosophila athabasca. Evolution 21: 479–492.
Nadeau N. J., Martin S. H., Kozak K. M., Salazar C., Dasmahapatra K. K., Davey J. W., Baxter S. W., Blaxter M. L., Mallet J., 2013. Genome-wide patterns of divergence and gene flow across a butterfly radiation. Mol. Ecol. 22: 814–826.
Naisbit R. E., Jiggins C. D., Linares M., Salazar C., Mallet J., 2002. Hybrid Sterility, Haldane's Rule and Speciation in Heliconius cydno and H. melpomene. Genetics 161: 1517–1526.
Navarro A., Barton N. H., 2003. Accumulating postzygotic isolation genes in parapatry: A new twist on chromosomal speciation. Evolution 57: 447–459.
Noor M. A. F., Garfield D. A., Schaeffer S. W., Machado C. A., 2007. Divergence between the Drosophila pseudoobscura and D. persimilis genome sequences in relation to chromosomal inversions. Genetics 177: 1417–1428.
Noor M. A., Grams K. L., Bertucci L. A., Reiland J., 2001. Chromosomal inversions and the reproductive isolation of species. Proc. Natl. Acad. Sci. U.S.A. 98: 12084–12088.


Nosil P., 2008. Speciation with gene flow could be common. Mol. Ecol. 17: 2103–2106.
Novitski E., 1946. Chromosome variation in Drosophila athabasca. Genetics 31: 508–524.
Phadnis N., Orr H. A., 2009. A Single Gene Causes Both Male Sterility and Segregation Distortion in Drosophila Hybrids. Science 323: 376–379.
Presgraves D. C., 2003. A fine-scale genetic analysis of hybrid incompatibilities in Drosophila. Genetics 163: 955–972.
Presgraves D. C., 2010. The molecular evolutionary basis of species formation. Nat Rev Genet 11: 175–180.
Presgraves D. C., Balagopalan L., Abmayr S. M., Orr H. A., 2003. Adaptive evolution drives divergence of a hybrid inviability gene between two species of Drosophila. Nature 423: 715–719.
Rieseberg L. H., 2001. Chromosomal rearrangements and speciation. Trends Ecol. Evol. (Amst.) 16: 351–358.
Sawamura K., Yamamoto M. T., 1997. Characterization of a reproductive isolation gene, zygotic hybrid rescue, of Drosophila melanogaster by using minichromosomes. Heredity 79: 97–103.
Schemske D. W., Bradshaw H. D., 1999. Pollinator preference and the evolution of floral traits in monkeyflowers (Mimulus). Proc. Natl. Acad. Sci. U.S.A. 96: 11910–11915.
Seehausen O., van Alphen J. J. M., 1998. The effect of male coloration on female mate choice in closely related Lake Victoria cichlids (Haplochromis nyererei complex). Behav Ecol Sociobiol 42: 1–8.
Shaw K. L., Parsons Y. M., Lesnick S. C., 2007. QTL analysis of a rapidly evolving speciation phenotype in the Hawaiian cricket Laupala. Mol. Ecol. 16: 2879–2892.
Sota T., Kubota K., 1998. Genital lock-and-key as a selective agent against hybridization. Evolution 52: 1507–1513.
Stump A. D., Shoener J. A., Costantini C., Sagnon N., Besansky N. J., 2005. Sex-Linked Differentiation Between Incipient Species of Anopheles gambiae. Genetics 169: 1509–1519.
Sweigart A. L., 2010. The genetics of postmating, prezygotic reproductive isolation between Drosophila virilis and D. americana. Genetics 184: 401–410.
Tang S., Presgraves D. C., 2009. Evolution of the Drosophila nuclear pore complex results in multiple hybrid incompatibilities. Science 323: 779–782.


Tao Y., Chen S., Hartl D. L., Laurie C. C., 2003a. Genetic Dissection of Hybrid Incompatibilities Between Drosophila simulans and D. mauritiana. I. Differential Accumulation of Hybrid Male Sterility Effects on the X and Autosomes. Genetics 164: 1383–1398.
Tao Y., Zeng Z.-B., Li J., Hartl D. L., Laurie C. C., 2003b. Genetic dissection of hybrid incompatibilities between Drosophila simulans and D. mauritiana. II. Mapping hybrid male sterility loci on the third chromosome. Genetics 164: 1399–1418.
Ting C.-T., Tsaur S.-C., Wu M.-L., Wu C.-I., 1998. A Rapidly Evolving Homeobox at the Site of a Hybrid Sterility Gene. Science 282: 1501–1504.
True J. R., Weir B. S., Laurie C. C., 1996. A genome-wide survey of hybrid incompatibility factors by the introgression of marked segments of Drosophila mauritiana chromosomes into Drosophila simulans. Genetics 142: 819–837.
Ueshima R., Asami T., 2003. Evolution: single-gene speciation by left-right reversal. Nature 425: 679.
Williams M. A., Blouin A. G., Noor M. A., 2001. Courtship songs of Drosophila pseudoobscura and D. persimilis. II. Genetics of species differences. Heredity 86: 68–77.
Wu C., 2001. The genic view of the process of speciation. J. Evol. Biol. 14: 851–865.
Wu C. I., Beckenbach A. T., 1983. Evidence for Extensive Genetic Differentiation between the Sex-Ratio and the Standard Arrangement of Drosophila pseudoobscura and D. persimilis and Identification of Hybrid Sterility Factors. Genetics 105: 71–86.
Wu C. I., Palopoli M. F., 1994. Genetics of postmating reproductive isolation in animals. Annu. Rev. Genet. 28: 283–308.
Yoon C. K., 1991. Molecular and behavioral evolution in the semi-species of Drosophila athabasca. Ph.D. Dissertation, Cornell University.
Yoon C. K., Aquadro C. F., 1994. Mitochondrial DNA variation among the Drosophila athabasca semispecies and Drosophila affinis. J. Hered. 85: 421–426.
Zhang W., Kunte K., Kronforst M. R., 2013. Genome-wide characterization of adaptation and speciation in tiger swallowtail butterflies using de novo transcriptome assemblies. Genome Biol Evol 5: 1233–1245.


Figures

Figure 1. The D. athabasca species range. Semispecies ranges are indicated in red, blue, and green. Abbreviations: WN = Western-Northern, EA = Eastern-A, EB = Eastern-B.


CHAPTER 2: ESTABLISHING GENOMIC RESOURCES FOR DROSOPHILA ATHABASCA

Reference genome assembly & annotation

To create a reference genome assembly, we extracted genomic DNA from a single strain of D. athabasca (iso-female strain ID-10, Western-Northern) using the Puregene DNA Extraction Kit (Qiagen). We prepared a total of four genomic libraries using standard Illumina protocols: two short-insert libraries with mean insert sizes of 28bp and 249bp from a genomic DNA extraction of 10 pooled females, and two mate-pair libraries with mean insert sizes of 1,854bp and 5,000bp from a genomic DNA extraction of 20 pooled females. Each genomic library was sequenced for 101bp from both ends on one lane of an Illumina Genome Analyzer II (GAII), resulting in a total of 54.0 million paired reads. Reads from the two long-insert mate-pair libraries were cropped to 36bp to reduce the chance of reading across library construction breakpoints, as suggested by the manufacturer. Reads were screened and cropped for adapter and bacterial contamination, leaving a total of 53.0 million paired reads amounting to 4.7 Gb of sequence used in the assembly, or approximately 30X coverage of the genome. We assembled the reads using SOAPdenovo (Li et al. 2010) with a k-mer size of 31, using the mate-pair libraries for scaffolding. The GapCloser program within SOAPdenovo was used to close gaps. To assign scaffolds to Muller elements and further screen out bacterial contaminants, scaffolds were searched with BLAST (Altschul et al. 1990; -e 10e-20) against the D. pseudoobscura genome (version 2.25), and any scaffold without a hit was discarded. The final genome assembly was 157.2 Mb in size and consisted of 6,651 scaffolds with an N50 of 83.5 kb (Table 1).

To aid in genome annotation, we made three mRNAseq libraries from the D. athabasca reference strain: one from a pool of ten 5-10 day old female flies, one from a pool of ten 5-10 day old male flies, and one from a pool of ten mixed-sex third-instar larvae. We extracted mRNA using the TRIzol extraction method (Life Technologies) followed by poly-A selection using Dynabeads (Life Technologies). Illumina mRNAseq libraries were prepared using standard protocols. We sequenced each library from both ends for 76bp on a lane of a GAII, resulting in 4.8 million paired female reads, 2.6 million paired male reads, and 3.9 million paired mixed-larvae reads. The genome was annotated using the MAKER pipeline (Holt & Yandell 2011), which combined the SNAP (Korf 2004) and AUGUSTUS (Stanke & Waack 2003) de novo gene prediction tools with BLAST homology searches against D. pseudoobscura proteins and with our mRNAseq experimental evidence, preprocessed with Tophat (Trapnell et al. 2009) and Cufflinks (Trapnell et al. 2010). Our final genome annotation contained 13,378 genes (Table 1). To assess the completeness of the genome, we used CEGMA (Parra et al. 2009) and found that 98.0% of core eukaryotic genes were present in our reference genome, with 94.8% of them being complete.

Collection of population samples

To collect population genomic data for D. athabasca, flies were collected over banana bait during the summers of 2009-2011. To avoid creating artificial population structure as a result of sampling artifacts, we collected flies at 19 different locations spread widely across the D. athabasca species range. Over 800 iso-female lines were established from these collection sites, and we used Sanger sequencing of a mitochondrial DNA fragment to confirm which lines belonged to the D. athabasca species complex (Cytochrome Oxidase II gene; Fwd primer: GTTTAAGAGACCAGTACTTG; Rev primer: ATGGCAGATTAGTGCAATGG). A total of 404 D. athabasca lines were established in the lab (see Table 2 for collection locations and number of lines).

Courtship song assays

Courtship songs were recorded for a subset of 28 of the D. athabasca lines (Appendix A). Flies were reared at 20°C on a 12L:12D light cycle. Male and female virgins were collected shortly after eclosion and aged in individual vials for 7-10 days under the same temperature and lighting conditions used for rearing. Songs were recorded by placing a single virgin male and a single virgin female in an Insectavox insect recording chamber (Gorczyca & Hall 1987). The Insectavox was connected to a RadioShack Mini Amplifier Speaker (Cat. No. 277-1008C) and a MacBook Pro, and songs were recorded using the RAVEN software (Bioacoustics Research Program 2011). All recordings were carried out at 21±1°C. Three separate mating pairs were recorded for each line, and the interpulse interval was calculated directly from song waveforms as an average of the three pairs (see Table 3 for averages by line). The average interpulse interval was 11.2±0.8 ms for Western-Northern lines, 29.0±2.6 ms for Eastern-A lines, and 13.4±1.0 ms for Eastern-B lines.

Survey of mitotic karyotype variation

Because D. athabasca is known to have a polymorphic Y-autosome fusion (Miller & Roy 1964, Miller & Westphal 1967), and because genomic analyses require knowing which portions of the genome are sex-linked, we surveyed a subset of 75 of our established D. athabasca lines for mitotic karyotype variation (Appendix B). Flies were reared at 20°C on a 12L:12D light cycle. Brains of male third-instar larvae were dissected in 0.09% NaCl solution and treated with colchicine to arrest cell division in metaphase. The cells were fixed in a 3:1 methanol-acetic acid solution and the chromosomes were stained using Giemsa. Karyotypes were produced for three replicates per line and confirmed previous research indicating that Western-Northern individuals do not carry the Y-autosome fusion, Eastern-A individuals are polymorphic for the fusion, and Eastern-B individuals are fixed for the fusion (Miller & Roy 1964, Miller & Westphal 1967). See Table 3 for specific karyotype information for the 28 sequenced lines.
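Table 3 also reports the per-line interpulse interval described under Courtship song assays. The following is a minimal sketch of how such a value could be computed, assuming pulse onset times (in ms) have already been read off the song waveforms. The function names and example values are hypothetical, and averaging the per-pair means (rather than pooling all intervals) is an assumption, since the text specifies only an average across the three recorded pairs.

```python
from statistics import mean


def interpulse_intervals(pulse_times_ms):
    """Return the gaps (ms) between consecutive pulse onsets."""
    times = sorted(pulse_times_ms)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def line_mean_ipi(pairs):
    """Average the per-pair mean interpulse interval across mating pairs.

    `pairs` is a list with one list of pulse onset times per recorded pair.
    """
    per_pair_means = [mean(interpulse_intervals(p)) for p in pairs]
    return mean(per_pair_means)


# Hypothetical pulse onset times (ms) for three mating pairs of one line;
# these are illustrative values, not data from the recordings.
example_pairs = [
    [0.0, 11.0, 22.5, 33.5],
    [0.0, 11.5, 22.5, 34.0],
    [0.0, 10.5, 21.5, 33.0],
]

ipi = line_mean_ipi(example_pairs)
print(f"Mean interpulse interval: {ipi:.1f} ms")
```

With real recordings, the per-line values produced this way would be the quantities compared against the semispecies averages reported above.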

Whole genome re-sequencing & variant calling

For polymorphism analyses, we used a total of 28 D. athabasca iso-female strains: 9 Western-Northern, 12 Eastern-A, and 7 Eastern-B. We classified strains into semispecies groups based on a combination of geographic location and courtship song interpulse interval (Table 3). Genomic DNA was extracted from a single female fly from each strain using the same method as above. Single-fly Illumina libraries were created and sequenced at the Beijing Genomics Institute according to the manufacturer’s instructions. We sequenced 90bp paired-end reads, generating 2 Gb of sequence for each strain.
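As a rough check on what 2 Gb of sequence per strain implies, the sketch below estimates fold coverage from the sequence yield and the size of the reference assembly (157.2 Mb, Table 1). It assumes the assembly size approximates the true genome size and, for the second figure, that the mean alignment rate reported for these strains (85.3%) applies uniformly. The function name is hypothetical.

```python
def fold_coverage(total_bases, genome_size_bp, aligned_fraction=1.0):
    """Expected average depth: usable bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    if not 0.0 <= aligned_fraction <= 1.0:
        raise ValueError("aligned_fraction must be between 0 and 1")
    return total_bases * aligned_fraction / genome_size_bp


ASSEMBLY_SIZE_BP = 157.2e6   # total assembly size from Table 1
BASES_PER_STRAIN = 2e9       # 2 Gb of sequence per strain
MEAN_ALIGNED = 0.853         # mean fraction of reads aligning per strain

raw_depth = fold_coverage(BASES_PER_STRAIN, ASSEMBLY_SIZE_BP)
aligned_depth = fold_coverage(BASES_PER_STRAIN, ASSEMBLY_SIZE_BP, MEAN_ALIGNED)

print(f"Raw depth:     {raw_depth:.1f}X")
print(f"Aligned depth: {aligned_depth:.1f}X")
```

Under these assumptions the aligned depth comes out a little above 10X, in line with the approximate per-strain coverage reported for the alignments; duplicate removal and uneven mapping would lower the realized value somewhat.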

We aligned the reads from each strain to our reference assembly using Bowtie2 (Langmead & Salzberg 2012, --very-sensitive), with a high percentage of reads aligning per strain (mean = 85.3%, SD = 1.9%). Genomic coverage per strain was ~10X. Variants for each strain were called using the GATK pipeline (version 1.5, DePristo et al. 2011). In brief, PCR duplicates were removed from each strain using Picard (http://picard.sourceforge.net) and strains were merged into a single file. Local realignment was performed on the merged file around indel regions to prevent erroneous variant calls due to alignment error. Variants from all strains were called simultaneously. Because validated SNPs are not available for D. athabasca, recalibration steps were omitted from the pipeline. Instead, only those variants that passed a strict filter were retained (clusterWindowSize 10; MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1); DP < 5; QUAL < 30.0; QUAL > 30.0 && QUAL < 50.0; QD < 1.5; SB > -10.0). As a method of validation, we repeated the variant calling pipeline as described above with the short-insert reads from the reference strain included. We then counted the number of sites at which the reference strain was called as homozygous for a variant allele, allowing us to estimate a false-positive rate of 0.009%.

Acknowledgements

We thank J. Jaenike, P. Andolfatto, D. Miller, T. Gurbich, Q. Zhou, and A. Hardin for help with fly collection, C. Aquadro for lending us his Insectavox, and E. Chan for assistance with the survey of mitotic karyotypes.
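The hard-filter expressions listed in the variant-calling section above can be read as exclusion rules: a variant matching any expression is dropped. The sketch below restates them as plain Python predicates over a dictionary of site annotations, purely to make the logic explicit. It is not the GATK implementation: the rule labels and the example records are hypothetical, the interpretation of each expression as an exclusion rule is an assumption, and the SNP-cluster setting (clusterWindowSize 10) is not modeled because it depends on neighboring sites.

```python
# Each rule returns True when a variant should be filtered OUT.
# Rule labels are illustrative; thresholds are copied from the text.
FILTERS = {
    "HardToValidate": lambda v: v["MQ0"] >= 4
    and v["DP"] > 0
    and (v["MQ0"] / (1.0 * v["DP"])) > 0.1,
    "LowCoverage": lambda v: v["DP"] < 5,
    "VeryLowQual": lambda v: v["QUAL"] < 30.0,
    "LowQual": lambda v: 30.0 < v["QUAL"] < 50.0,
    "LowQD": lambda v: v["QD"] < 1.5,
    "StrandBias": lambda v: v["SB"] > -10.0,
}


def failed_filters(variant):
    """Return the labels of every rule the variant matches."""
    return [name for name, rule in FILTERS.items() if rule(variant)]


def passes(variant):
    """A variant is retained only if it matches none of the rules."""
    return not failed_filters(variant)


# Hypothetical annotation records for two sites.
good_site = {"QUAL": 120.0, "DP": 40, "MQ0": 0, "QD": 12.0, "SB": -50.0}
weak_site = {"QUAL": 42.0, "DP": 4, "MQ0": 0, "QD": 1.0, "SB": -50.0}

print(passes(good_site), failed_filters(good_site))
print(passes(weak_site), failed_filters(weak_site))
```

In this reading, the weak site is rejected for low depth, borderline quality, and low quality-by-depth at once, which is the behavior a strict filter is meant to produce.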


References

Altschul S., Gish W., Miller W., Myers E., Lipman D., 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Bioacoustics Research Program, 2011. Raven Pro: Interactive Sound Analysis Software (Version 1.4) [Computer software]. Ithaca, NY: The Cornell Lab of Ornithology. Available from http://www.birds.cornell.edu/raven.
DePristo M. A., Banks E., Poplin R., Garimella K. V., Maguire J. R., Hartl C., Philippakis A. A., del Angel G., Rivas M. A., Hanna M., McKenna A., Fennell T. J., Kernytsky A. M., Sivachenko A. Y., Cibulskis K., Gabriel S. B., Altshuler D., Daly M. J., 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43: 491–498.
Gorczyca M., Hall J. C., 1987. The Insectavox, an integrated device for recording and amplifying courtship songs of Drosophila. Dros Inf Serv 66: 157–160.
Holt C., Yandell M., 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491.
Korf I., 2004. Gene finding in novel genomes. BMC Bioinformatics 5: 59.
Langmead B., Salzberg S. L., 2012. Fast gapped-read alignment with Bowtie 2. Nat Meth 9: 357–359.
Li R., Zhu H., Ruan J., Qian W., Fang X., Shi Z., Li Y., Li S., Shan G., Kristiansen K., Li S., Yang H., Wang J., Wang J., 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 22: 549–556.
Miller D., Roy R., 1964. Further data on Y chromosome types in Drosophila athabasca. Can. J. Genet. Cytol. 259: 334–348.
Miller D., Westphal N., 1967. Further evidence on sexual isolation within Drosophila athabasca. Evolution 21: 479–492.
Parra G., Bradnam K., Ning Z., Keane T., Korf I., 2009. Assessing the gene space in draft genomes. Nucleic Acids Research 37: 289–297.
Picard. http://picard.sourceforge.net
Stanke M., Waack S., 2003. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 Suppl 2: ii215–25.
Trapnell C., Pachter L., Salzberg S. L., 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105–1111.

Page 25: The Genomics of Speciation in Drosophila athabascaThe Genomics of Speciation in Drosophila athabasca by Karen Masae Wong Miller Doctor of Philosophy in Integrative Biology University

! 19

Trapnell C., Williams B. A., Pertea G., Mortazavi A., Kwan G., van Baren M. J., Salzberg S. L., Wold B. J., Pachter L., 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28: 511–515. !


Tables

Table 1. Summary of the D. athabasca genome assembly and annotation by Muller Element.

Muller Element   # Scaffolds   Average Scaffold Length (Kb)   Average Genes/Scaffold   # Genes   Total Size (Mb)
A                418           63.3                           5.3                      2,231     26.5
A/D              414           64.1                           5.8                      2,407     26.6
B                1,742         16.9                           1.5                      2,623     29.5
C                912           23.7                           2.6                      2,368     21.6
E                1,285         26.3                           2.4                      3,054     33.7
F                20            59.9                           3.9                      78        1.2
Unknown          1,860         9.7                            0.3                      616       18.1
Total            6,651         23.6                           2.0                      13,378    157.2


Table 2. Summary of collections and number of D. athabasca lines collected at each location.

Collection Location    Line Abbrev.   Coordinates                 Year   Collected By                           # lines
Orick, CA              CA             N 41 24.432 W 124 1.152     2009   K. Miller & Q. Zhou                    5
Washington DC          DC             unknown                     2010   D. Bachtrog                            2
Deary, ID              ID             unknown                     2009   J. Jaenike                             3
Roscoe, IL             IL             N 40 06.778 W 088 15.673    2010   A. Hardin                              1
South Harbor, ME       ME             N 44 14.473 W 068 18.402    2010   K. Miller & D. Miller                  43
Bar Harbor, ME         ME-BW          N 44 18.694 W 068 12.519    2010   K. Miller & D. Miller                  44
Black Creek, MI        MI-BC          N 43 42.565 W 084 23.778    2010   K. Miller & D. Miller                  33
Island Lake, MI        MI-IL          N 44 30.601 W 084 08.527    2010   K. Miller & D. Miller                  16
Lake Itasca, MN        MN             N 47 14.105 W 095 11.426    2010   K. Miller & D. Miller                  42
Cass Lake, MN          MN-CL          N 47 22.753 W 094 31.053    2010   K. Miller & D. Miller                  33
Princeton, NJ          NJ             N 40 21.233 W 074 43.116    2010   K. Miller, D. Miller & P. Andolfatto   15
Bulls Island, NJ       NJ-BI          N 40 24.800 W 075 02.542    2010   K. Miller & D. Miller                  8
Santa Fe, NM           NM             N 35 43.687 W 105 50.359    2010   K. Miller & D. Miller                  2
Cropseyville, NY       NY             N 42 46.179 W 073 30.961    2010   K. Miller & D. Miller                  12
North-South Lake, NY   NY-NS          N 42 12.129 W 074 02.294    2010   K. Miller & D. Miller                  35
Loleta, PA             PA             N 41 23.944 W 079 04.931    2010   K. Miller & D. Miller                  38
Black Moshannon, PA    PA-BM          N 40 55.033 W 078 03.956    2010   K. Miller & D. Miller                  38
Prince William, VA     VA-PW          N 38 33.656 W 077 20.864    2011   K. Miller & T. Gurbich                 6
Green Mountain, VT     VT             N 42 52.772 W 073 04.477    2010   K. Miller & D. Miller                  28


Table 3. D. athabasca lines used for population genomic analyses, along with collection information, semispecies assignment, interpulse interval, and karyotype information.

Line      Semispecies        Interpulse Interval (sec)   Y-autosome fusion?   Location               Coordinates                 Collection
CA-2      Western-Northern   0.0100*                     no                   Orick, CA              N 41 24.432 W 124 01.152    K. Miller & Q. Zhou 2009
CA-3      Western-Northern   0.0116                      no                   Orick, CA              N 41 24.432 W 124 01.152    K. Miller & Q. Zhou 2009
CA-4      Western-Northern   0.0117                      no                   Orick, CA              N 41 24.432 W 124 01.152    K. Miller & Q. Zhou 2009
ID-1      Western-Northern   0.0127                      no                   Deary, ID              unavailable                 J. Jaenike 2009
IL-2      Eastern-A          0.0245*                     no                   Roscoe, IL             N 42 25.584 W 89 00.921     A. Hardin 2010
ME-16     Western-Northern   0.0111                      no                   South Harbor, ME       N 44 14.473 W 068 18.402    K. Miller & D. Miller 2010
ME-43     Western-Northern   0.0105*                     no                   South Harbor, ME       N 44 14.473 W 068 18.402    K. Miller & D. Miller 2010
MEBW-13   Western-Northern   0.0119                      no                   Bar Harbor, ME         N 44 18.694 W 068 12.519    K. Miller & D. Miller 2010
MIBC-22   Eastern-A          0.0288                      yes                  Black Creek, MI        N 43 42.565 W 084 23.778    K. Miller & D. Miller 2010
MIBC-60   Eastern-A          0.0285                      no                   Black Creek, MI        N 43 42.565 W 084 23.778    K. Miller & D. Miller 2010
MN-47     Western-Northern   0.0105                      no                   Lake Itasca, MN        N 47 14.105 W 095 11.426    K. Miller & D. Miller 2010
MNCL-39   Eastern-A          0.0315*                     unknown              Cass Lake, MN          N 47 22.753 W 094 31.053    K. Miller & D. Miller 2010
MNCL-50   Eastern-A          0.0296                      unknown              Cass Lake, MN          N 47 22.753 W 094 31.053    K. Miller & D. Miller 2010
NJ-126    Eastern-B          0.0136                      yes                  Princeton, NJ          N 40 21.233 W 074 43.116    K. Miller, D. Miller & P. Andolfatto 2010
NJ-34     Eastern-B          0.0148                      yes                  Princeton, NJ          N 40 21.233 W 074 43.116    K. Miller, D. Miller & P. Andolfatto 2010
NJBI-12   Eastern-B          0.0138                      yes                  Bulls Island, NJ       N 40 24.800 W 075 02.542    K. Miller & D. Miller 2010
NJBI-9    Eastern-B          0.0124                      yes                  Bulls Island, NJ       N 40 24.800 W 075 02.542    K. Miller & D. Miller 2010
NM-28     Western-Northern   0.0118                      no                   Santa Fe, NM           N 35 43.687 W 105 50.359    K. Miller & D. Miller 2010
NYNS-11   Eastern-A          0.0261                      no                   North-South Lake, NY   N 42 12.129 W 074 02.294    K. Miller & D. Miller 2010
NYNS-15   Eastern-A          0.0290                      no                   North-South Lake, NY   N 42 12.129 W 074 02.294    K. Miller & D. Miller 2010
PA-60     Eastern-A          0.0271                      yes                  Loleta, PA             N 41 23.944 W 079 04.931    K. Miller & D. Miller 2010
PA-67     Eastern-A          0.0310                      yes                  Loleta, PA             N 41 23.944 W 079 04.931    K. Miller & D. Miller 2010
PABM-18   Eastern-A          0.0341                      no                   Black Moshannon, PA    N 40 55.033 W 078 03.956    K. Miller & D. Miller 2010
PABM-28   Eastern-A          0.0280*                     yes                  Black Moshannon, PA    N 40 55.033 W 078 03.956    K. Miller & D. Miller 2010
VAPW-54   Eastern-B          0.0142                      yes                  Prince William, VA     N 38 33.656 W 077 20.864    K. Miller & T. Gurbich 2011
VAPW-56   Eastern-B          0.0118                      yes                  Prince William, VA     N 38 33.656 W 077 20.864    K. Miller & T. Gurbich 2011
VAPW-99   Eastern-B          0.0135                      yes                  Prince William, VA     N 38 33.656 W 077 20.864    K. Miller & T. Gurbich 2011
VT-16     Eastern-A          0.0303                      no                   Green Mountain, VT     N 42 52.772 W 073 04.477    K. Miller & D. Miller 2010

* Only a single replicate was obtained for these lines.


CHAPTER 3: PATTERNS OF GENOME-WIDE DIVERSITY AND POPULATION STRUCTURE IN DROSOPHILA ATHABASCA

Abstract

The Drosophila athabasca species complex contains three recently diverged, prezygotically isolated semispecies (Western-Northern, Eastern-A, and Eastern-B) that are distributed across North America and share zones of sympatry. Inferences based on a handful of loci suggest that this complex might provide a valuable system for studying the genetics of incipient speciation and the evolution of prezygotic isolating mechanisms, but patterns of differentiation have not been characterized systematically. Here, we analyze whole-genome re-sequencing data for 28 D. athabasca individuals from across the species range to characterize genome-wide patterns of divergence among semispecies. Despite low levels of overall divergence, individuals exhibit distinct genetic clustering by semispecies. Levels of nucleotide variability within this species are relatively low for Drosophila, with the Eastern-B semispecies exhibiting a further genome-wide reduction in diversity, consistent with a founder effect. The data support a simple island model of isolation with migration within D. athabasca, with divergence times of less than 20,000 years. Notably, despite strong prezygotic isolation, we estimate low levels of migration between semispecies. The semispecies of D. athabasca represent some of the youngest incipient species within Drosophila, making them a very useful system for the study of speciation.

Introduction

Understanding the evolutionary forces and genetic patterns underlying the process of speciation is one of the primary aims in the field of evolutionary genetics. Studies utilizing Drosophila have contributed greatly to our understanding of speciation (Coyne & Orr 1989, 1997; Wu & Palopoli 1994), especially the mechanisms contributing to postzygotic reproductive incompatibility (Ting et al. 1998; Barbash et al. 2000; Presgraves et al. 2003; Masly et al. 2006; Phadnis & Orr 2009; Lu et al. 2010; Chang & Noor 2007; Wu & Palopoli 1994; Ortiz-Barrientos & Noor 2005; Tang & Presgraves 2009). However, we still know surprisingly little about the forces that act during the initial stages of speciation. While the evolution of hybrid incompatibility factors is critical to understanding how species boundaries are maintained post-speciation, such factors may not have been important early on during species divergence and may have evolved secondarily (Orr 1995; Noor & Feder 2006; Sobel et al. 2009). By studying recently diverged populations, we increase the chances that the differences we detect are directly responsible for reproductive isolation; studies of incipient species are therefore essential.

Of particular interest to speciation studies are young incipient species that lack postzygotic isolating mechanisms and share regions of sympatry (Via 2009; Nosil & Feder 2012). Since it has been suggested that prezygotic, rather than postzygotic, isolation factors may play a larger role early on in speciation (Ting et al. 2001; reviewed in Coyne & Orr 2004), we must study groups that have yet to develop postzygotic isolation. Furthermore, studying regions of sympatry in incipient species without postzygotic isolation offers the opportunity to examine gene flow between populations. Gene flow between recently diverged populations can result in natural “temporary genetic mosaics”, in which some regions of the genome are freely able to introgress, while regions of the genome important for species divergence remain distinct and diverge between populations (Via & West 2008; Nosil et al. 2009; Via 2009; Buerkle & Lexer 2008; Payseur 2010). Taking advantage of naturally occurring “mosaics” between incipient species should allow us to examine one of the most critical questions of speciation: how genetic boundaries of nascent species are maintained in the face of gene flow. Although this question is far from novel, the lack of genetic resources for relevant model systems has limited our progress in this fundamental area of molecular evolution.

In this paper, we examine patterns of genome-wide diversity and divergence within Drosophila athabasca (obscura group, affinis subgroup), a promising species in which to study incipient speciation (Ford & Aquadro 1996). D. athabasca is composed of three morphologically identical but prezygotically isolated semispecies: Western-Northern, Eastern-A, and Eastern-B. The semispecies are capable of interbreeding and producing fertile offspring in the lab (Miller & Westphal 1967; Yoon 1991), but exhibit divergent male courtship song (Miller 1958; Miller et al. 1975) and strong corresponding female preference (Yoon 1991). Individuals can be discretely categorized into one of the three semispecies based on their courtship song and geographic range (Miller et al. 1975). The D. athabasca species range is spread across the northern half of North America. Semispecies ranges are largely distinct, but regions of sympatry exist between Western-Northern and Eastern-A, and between Eastern-A and Eastern-B (Figure 1). A previous study using six genes estimated divergence times between 5,000 and 23,000 years (Ford & Aquadro 1996), making D. athabasca one of the most recently diverged groups used to examine speciation in Drosophila.

Despite the promise of D. athabasca for investigating incipient speciation, the species has not been widely studied. Beyond a few genes, little is known about the genome-wide patterns of molecular variation and gene flow within the species, limiting the use of D. athabasca in evolutionary analyses. Here, we conduct a whole-genome analysis of D. athabasca using polymorphism data from 28 individuals sampled from across the species range. We describe patterns of genome-wide molecular variation, differentiation, and population structure within D. athabasca. Population genetic patterns within the species unambiguously support the behavioral and geographic stratification of individuals into three semispecies, and show semispecies-specific, as well as X-chromosome-specific, differences in nucleotide diversity. Using these data, we model the demographic history within the species and reject a model of strictly allopatric divergence in favor of an isolation with migration model. Finally, we discuss the potential of D. athabasca as a powerful model system for studying the early stages of speciation, with this study developing key genomic resources for future analyses.

Methods

Polymorphism data

Using our previously collected population genomic data and SNP calls (Chapter 2), we took all sites that were biallelic and had data for all 28 lines, and assigned SNPs to site classes according to whether they were private or shared alleles between semispecies, and further into polymorphic or fixed sites within and between semispecies. To polarize the SNPs, ancestral states for each variant site were assigned by aligning the D. athabasca reference genome to the genomes of two closely related species, D. algonquin (D. athabasca-D. algonquin Dxy = 3.9%) and D. affinis (D. athabasca-D. affinis Dxy = 4.3%). Only those variant sites at which both D. algonquin and D. affinis were aligned and shared the same allele were polarized (68.1%).

The D. algonquin genome was sequenced from a single Illumina ~500bp short insert library. Genomic DNA was extracted from a pool of 10 female flies from a single strain obtained from New Hampshire (NH-2) during the summer of 2010. DNA extraction, library preparation, and Illumina sequencing protocols were identical to those for the D. athabasca reference genome. SOAPdenovo (Li et al. 2010, Kmer=29) was again used to assemble the reads (28.1 million paired-end, 101bp reads) into scaffolds, resulting in an assembly with 254,588 scaffolds and a total genome size of 165.0 Mb. The scaffold N50 for the D. algonquin assembly was 1.8 kb. The D. affinis genome was kindly provided by Nicola Palmieri. Outgroup genomes were aligned to the D. athabasca reference genome using the LASTZ pipeline (Harris 2007).

Measurements of genomic diversity

Mean pairwise nucleotide diversity (π) and population differentiation (FST) were estimated as an average across homozygous sites in all genes in the genome using the C++ library libsequence (Thornton 2003). Due to a polymorphic Muller C-Y fusion in D. athabasca (Miller & Roy 1964), Muller C was omitted from analyses involving X-chromosome vs. autosome comparisons. Additionally, because of differences in library construction (pooled extractions for the reference strain vs. single fly extractions for all other strains), the reference strain was excluded from polymorphism analyses. Sites with missing data were also excluded from the analysis.

Population structure analyses

We inferred the phylogenetic tree that best represents the relationships of the sampled strains to one another using concatenated alignments of all autosomal genes (Muller B, E, F) in the genome. All 28 samples, plus the reference strain and three outgroups (D. pseudoobscura, D. affinis, and D. algonquin), were used to build the phylogeny. Heterozygous SNPs were excluded from alignments, and the phylogeny was estimated using the neighbor-joining algorithm implemented in MEGA (version 5, Tamura et al. 2011) and rooted using D. pseudoobscura. To assess the reliability of the final inferred tree, we examined 1000 bootstrap replicates. We also examined clustering patterns of individuals within D. athabasca by using principal component analysis (PCA) to examine the allele frequency distributions of the samples. The PCA was implemented on all autosomal SNPs of all 28 samples using the program SMARTPCA (Patterson et al. 2006), correcting for the effects of linkage disequilibrium (nsnpldregress=2; killr2=YES; r2physlim=10000).
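As a rough illustration of the distance-based tree building described above, the following Python sketch computes a pairwise p-distance that skips missing calls and runs a textbook neighbor-joining pass on a distance table. It is not the MEGA implementation used here; the function names (`p_distance`, `neighbor_joining`), the use of "N" for excluded sites, and the toy four-taxon distances are assumptions made purely for illustration.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites marked 'N' in either sequence."""
    compared = diffs = 0
    for a, b in zip(seq_a, seq_b):
        if "N" in (a, b):
            continue
        compared += 1
        diffs += a != b
    return diffs / compared if compared else 0.0

def neighbor_joining(dist):
    """dist maps frozenset({x, y}) -> distance for every pair of taxa.

    Returns (joins, last_edge): joins is a list of (node_i, node_j, len_i, len_j)
    in the order pairs were merged; last_edge connects the final two nodes.
    """
    d = dict(dist)
    nodes = sorted({name for pair in d for name in pair})
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all other active nodes.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # Join the pair that minimizes the neighbor-joining Q criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dij - len_i
        new = f"({i},{j})"
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [new]
        joins.append((i, j, len_i, len_j))
    a, b = nodes
    return joins, (a, b, d[frozenset((a, b))])

# Toy additive distances for a hypothetical tree ((A:1,B:2):1,(C:1,D:3)).
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 3, ("A", "C"): 3, ("A", "D"): 5,
    ("B", "C"): 4, ("B", "D"): 6, ("C", "D"): 4}.items()}
joins, last_edge = neighbor_joining(toy)
```

Bootstrap support of the kind reported here would be obtained by resampling alignment columns with replacement and repeating the procedure; that step is omitted from the sketch.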

Isolation-by-distance analysis

To test for isolation-by-distance (IBD), we first calculated average pairwise divergence (Dxy) using homozygous SNPs for all autosomal genes and all pairwise sample comparisons. Geographic distances were measured point-to-point. We then tested for a correlation between geographic and genetic distance using Mantel tests with 1000 random permutations, as implemented in IBDWS (Jensen et al. 2005).
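The logic of a Mantel permutation test can be sketched in a few lines of Python. This is a minimal illustration and not the IBDWS implementation: the function names, the one-sided p-value convention, and the toy matrices are assumptions. It correlates the upper triangles of two symmetric distance matrices and builds a null distribution by permuting the sample labels of one matrix.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def upper_triangle(matrix, order):
    """Upper-triangle entries of a symmetric matrix after relabeling samples by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(geo, gen, n_perm=1000, seed=0):
    """Return (r_obs, p) where p is the one-sided permutation p-value for r >= r_obs."""
    rng = random.Random(seed)
    idx = list(range(len(geo)))
    gen_vec = upper_triangle(gen, idx)
    r_obs = pearson(upper_triangle(geo, idx), gen_vec)
    hits = 0
    for _ in range(n_perm):
        perm = idx[:]
        rng.shuffle(perm)  # permute sample labels of the geographic matrix
        if pearson(upper_triangle(geo, perm), gen_vec) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy example: five hypothetical samples along a line, genetic distance
# exactly proportional to geographic distance.
positions = [0, 1, 2, 3, 4]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[2 * abs(a - b) for b in positions] for a in positions]
r_obs, p_value = mantel(geo, gen)
```

The function returns the correlation r, so a value comparable to a reported r2 would be `r_obs ** 2`.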

Demographic model fitting


To infer demographic parameters in D. athabasca, we used the software package ∂a∂i (Gutenkunst et al. 2009) to analyze the joint site-frequency spectra (SFS) of the sequences, grouped by semispecies. ∂a∂i uses a Wright-Fisher diffusion approximation method to generate an expected joint SFS under a specified demographic model and compares it to the SFS from the experimental data using a composite likelihood function. Importantly, ∂a∂i allows for simultaneous demographic inference for up to three populations.
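To make the data summary concrete, the sketch below tallies a joint SFS from polarized allele calls and scores it against a table of expected counts with a Poisson composite log-likelihood, the general form of comparison used in diffusion-based SFS fitting. It does not call ∂a∂i; the function names, the input layout (one string of sampled alleles per population), and the toy sites are all assumptions for illustration.

```python
from collections import Counter
from math import lgamma, log

def derived_counts(alleles_by_pop, ancestral):
    """Count derived (non-ancestral) alleles in each population at one site."""
    return tuple(sum(a != ancestral for a in pop) for pop in alleles_by_pop)

def joint_sfs(sites, sample_sizes):
    """Tally sites by their tuple of per-population derived-allele counts.

    sites: iterable of (alleles_by_pop, ancestral_allele) pairs.
    sample_sizes: number of sampled chromosomes per population.
    Monomorphic corners (all ancestral or all derived) are skipped.
    """
    sfs = Counter()
    for alleles_by_pop, ancestral in sites:
        counts = derived_counts(alleles_by_pop, ancestral)
        if all(c == 0 for c in counts) or counts == tuple(sample_sizes):
            continue
        sfs[counts] += 1
    return sfs

def poisson_composite_ll(model, data):
    """Poisson composite log-likelihood of observed counts given expected counts.

    model: {count_tuple: expected number of sites (> 0)}.
    data:  {count_tuple: observed number of sites}; missing entries count as 0.
    """
    ll = 0.0
    for entry, expected in model.items():
        observed = data.get(entry, 0)
        ll += -expected + observed * log(expected) - lgamma(observed + 1)
    return ll

# Toy input: three hypothetical populations, two chromosomes sampled from each.
sites = [(("AG", "GG", "AA"), "A"),
         (("AA", "AA", "AA"), "A"),   # monomorphic, skipped
         (("AG", "GG", "AA"), "A")]
sfs = joint_sfs(sites, (2, 2, 2))
ll = poisson_composite_ll({(1, 2, 0): 2.0}, sfs)
```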

We used all autosomal biallelic four-fold synonymous sites as putatively neutral sites for this analysis (70.5Kb). As in the genomic diversity analyses, sites with missing data were omitted and ancestral states were assigned by polarizing SNPs using alignments to D. affinis and D. algonquin. Because we were most interested in determining whether the semispecies of D. athabasca diverged with or without gene flow, we tested the fit of our data to an isolation with no migration (allopatric divergence) model and an isolation with symmetric migration model, both under a three-population divergence scenario with splitting orders based on the results from clustering analyses. A likelihood ratio test was used to compare the fit of the models to the data. Additionally, we used the point estimates from ∂a∂i to generate 1000 simulated datasets with the coalescent simulator ms (Hudson 2002) and analyzed them with ∂a∂i to obtain standard deviation measurements for demographic parameter estimates of the best fitting model. We scaled the maximum likelihood parameter estimates assuming 10 generations per year and the neutral mutation rate estimated from Drosophila melanogaster mutation accumulation lines, μ = 5.8 × 10⁻⁹ (Haag-Liautard et al. 2007).

Results

Patterns of genome-wide variation

Our whole genome analysis of D. athabasca resulted in a total of 2.1 Mbp of sites that were variable within D. athabasca with no missing genotypes in any of our 28 strains. After screening out sites that were not biallelic or lacked ancestral state information, we were left with a total of 1.7 Mbp of variable sites. We assigned SNPs to site classes according to whether they were private or shared alleles between semispecies, and further into polymorphic or fixed sites within and between semispecies (Table 1). The vast majority of SNPs in D. athabasca are private and polymorphic within a semispecies (75.8%), followed by those that are shared and polymorphic between two or more semispecies (22.1%), those that are shared and fixed between two semispecies (1.1%), and finally those that are private and fixed within a semispecies (1.0%). When considering all SNPs shared between semispecies, those that are shared between Eastern-A and Eastern-B are the most common (46.5%), while those shared between Western-Northern and Eastern-B are the least common (5.7%). The Western-Northern semispecies of D. athabasca harbors the most private SNPs fixed within a semispecies (0.9%), while Eastern-A has the fewest (0.03%). For private SNPs that are still segregating within a semispecies, Eastern-A has the largest number (33.2%), which was also the most common site class when considering all variant sites together.
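The site classes above can be assigned mechanically once the derived-allele frequency in each semispecies is known. The Python sketch below shows one possible rule set. The exact definitions are our assumption for illustration (in particular, a site whose derived allele is fixed in one semispecies and segregating in another is treated as shared and polymorphic), and the function name and labels are hypothetical.

```python
from collections import Counter

def classify_site(freqs):
    """Classify a polarized SNP from its derived-allele frequency per semispecies.

    freqs: {semispecies: derived allele frequency in [0, 1]}.
    Returns one of: 'private_polymorphic', 'private_fixed',
    'shared_polymorphic', 'shared_fixed', or 'invariant'.
    """
    carriers = {pop: f for pop, f in freqs.items() if f > 0}
    all_fixed = all(f == 1 for f in carriers.values())
    # No derived allele anywhere, or derived allele fixed everywhere: not variable.
    if not carriers or (len(carriers) == len(freqs) and all_fixed):
        return "invariant"
    scope = "private" if len(carriers) == 1 else "shared"
    state = "fixed" if all_fixed else "polymorphic"
    return f"{scope}_{state}"

# Toy sites with hypothetical frequencies (WN = Western-Northern, EA/EB = Eastern-A/B).
toy_sites = [
    {"WN": 0.5, "EA": 0, "EB": 0},
    {"WN": 1, "EA": 0, "EB": 0},
    {"WN": 0, "EA": 0.2, "EB": 1},
    {"WN": 0, "EA": 1, "EB": 1},
]
class_counts = Counter(classify_site(site) for site in toy_sites)
```

Dividing each entry of `class_counts` by the total number of variable sites gives class percentages of the kind reported above.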

Species-wide and genome-wide mean pairwise diversity for D. athabasca is low (π = 0.0009). Table 2 shows estimates of π and FST for comparisons within D. athabasca, as well as for the species as a whole. When individuals were grouped into their respective semispecies categories, mean pairwise diversity varied between semispecies, with Eastern-B showing a genome-wide reduction in diversity compared to both Western-Northern and Eastern-A. Between-semispecies patterns of nucleotide variation show increased relative population differentiation between Western-Northern and both Eastern semispecies and less differentiation between the two Eastern semispecies, with mean genome-wide FST being lowest in comparisons between Eastern-A and Eastern-B. In all groups except Eastern-B, X-linked nucleotide diversity exceeds autosomal nucleotide diversity. Interestingly, Eastern-B shows the converse pattern, with X-linked diversity being reduced compared to that of the autosomes. Patterns of population differentiation between semispecies also differ between the X-chromosome and autosomes, with all pairwise comparisons showing an excess of differentiation on the X-chromosome relative to the autosomes.

Population structure

A neighbor-joining tree for D. athabasca was constructed using the whole-genome SNP data to examine the distance relationships between individuals within the species. Examining the topology of the inferred tree allows us to identify population structure within the species, independent of predefined classifications. A phylogeny concordant with behavioral semispecies classifications would provide genetic support for grouping individuals into semispecies. We found that all individuals corresponding to a single semispecies always clustered together in the tree, each forming a monophyletic group with 100% bootstrap support (Figure 4a). Principal component analysis (PCA) also revealed three distinct clusters corresponding to the three semispecies of D. athabasca (Figure 4b). PCA, phylogenetic analysis, and estimates of FST all show less differentiation between the Eastern-A and Eastern-B groups than between either Eastern group and Western-Northern.

Next, we were interested in resolving whether the semispecies exhibited patterns of isolation-by-distance or whether they formed groups with sharply defined boundaries. When the entire dataset was considered, there was a significant relationship between geographic and genetic distance (Figure 2a; Mantel test, r² = 0.29, p < 0.001); however, when controlling for population structure, patterns of IBD virtually disappeared (Figure 2b-d), suggesting that the species-wide patterns of IBD that we see in D. athabasca can be mostly attributed to between-population comparisons of the three semispecies. The only significant pattern that remained was weak, with geographic distance explaining only 6% of the variation within Western-Northern (Mantel test, r² = 0.06, p < 0.03), the semispecies with the largest geographic range.

Fitting a speciation model

To fit a demographic model to our data, we used a diffusion-based approach. Because of the strong support for the branching order within D. athabasca, we restricted our analysis to a three-population divergence model in which Eastern-A and Eastern-B are the more recently diverged populations, with Western-Northern splitting from the common ancestor first. The model that best described our data was the model that included gene flow (isolation with migration model; Figure 4). This model fit significantly better than the strictly allopatric model (Supplementary Figure 1; likelihood ratio test, χ² = 1.2 × 10⁵, p < 0.001). Maximum likelihood estimates of inferred demographic parameters, along with their confidence intervals, are shown in Table 3.
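The likelihood ratio comparison of nested models can be sketched as follows. The log-likelihood values and the number of extra parameters in the example are hypothetical placeholders, not the values from this analysis, and the function names are our own. The sketch evaluates the chi-square upper tail with the standard library only; note that likelihoods computed from linked sites are composite, so the chi-square approximation should be read as approximate.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1.

    Uses Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y) as a starting point and the
    recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1), with y = x / 2.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2:
        a, q = 0.5, erfc(sqrt(y))
    else:
        a, q = 1.0, exp(-y)
    while a < df / 2.0:
        q += y ** a * exp(-y) / gamma(a + 1)
        a += 1.0
    return q

def likelihood_ratio_test(ll_null, ll_alt, extra_params):
    """Return (statistic, p) for nested models; ll_alt is the richer model."""
    stat = 2.0 * (ll_alt - ll_null)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical log-likelihoods: no-migration (null) vs. migration (alternative),
# assuming one additional migration parameter.
stat, p = likelihood_ratio_test(ll_null=-100.0, ll_alt=-98.0, extra_params=1)
```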


Discussion

Population structure & diversity within D. athabasca

Individuals within D. athabasca have been historically classified into one of three semispecies based purely on courtship song and geographic range. Previous studies using allozyme (Johnson 1978, 1985), mtDNA (Yoon & Aquadro 1994), and a study of six nuclear genes have shown some genetic support for such groupings (Ford & Aquadro 1996). However, until now, it remained unclear how representative these previously examined regions were of the entire genome. Despite low levels of overall species diversity, individuals within each semispecies always clustered together in phylogenetic and principal component analyses, demonstrating strong support for the presence of three distinct genetic populations within the species.

We then looked to describe the spatial distribution of variation within the species. In contrast with a spatial population structure where genetic distance between individuals increases with geographic distance (individuals vary continuously across the species range), individuals within D. athabasca showed little to no isolation-by-distance when controlling for population structure, indicating three populations separated by sharp boundaries. This observation suggests the majority of the D. athabasca genome is diverging under an island model, as opposed to an isolation-by-distance model.

We also show that levels of population differentiation (FST) between semispecies are elevated on the X-chromosome, confirming previous work (Ford & Aquadro 1996) and suggesting the X-chromosome may play an important role in population divergence within this species complex. Examination of nucleotide variability within semispecies also reveals interesting semispecies-specific patterns of evolution. In particular, Eastern-B shows an overall reduction of diversity across the genome compared to the other semispecies. The small effective population size of Eastern-B suggests that the formation of the Eastern-B semispecies may have been the result of a founder event, a possibility that was also discussed briefly by Ford & Aquadro (1996). Additionally, Eastern-B shows a further reduction of diversity on its X-chromosome compared to autosomes, suggestive of either several selective sweeps on the X-chromosome in this semispecies, or a strong bottleneck reducing diversity disproportionately on the X (Pool & Nielsen 2007). Differences in patterns of variation along the X-chromosome are of particular interest, since sex chromosomes have been shown to play an important role in speciation in other Drosophila (see Presgraves 2008 for review) and since courtship song differences between semispecies have been shown to have a large X-linked component (Yoon 1991).

History of divergence & gene flow

Previous studies have suggested a splitting order for the semispecies of D. athabasca in which Eastern-A and Eastern-B are more recently diverged sister groups, with Western-Northern having diverged earlier in the genealogical history of the species (Yoon & Aquadro 1994; Ford et al. 1994, 1996). Our dataset also supports this relationship, with both phylogenetic and principal component analyses consistently clustering the Eastern groups together. Additionally, the low level of relative population differentiation between Eastern semispecies (measured by FST), combined with the observation that Eastern-A and Eastern-B have the highest number of shared SNPs among all semispecies comparisons, is indicative of recent shared ancestry. Although gene flow is also likely to contribute to these patterns, given the relatively short divergence time estimated for these groups and the lack of support for an isolation-by-distance model, the low divergence and relatively high number of shared SNPs between Eastern groups are consistent with a (Western-Northern, (Eastern-A, Eastern-B)) splitting model within D. athabasca.

Historical demography leaves characteristic signatures in the genome, and we use our population genomic data to conduct a detailed analysis of the population history of D. athabasca. In particular, we employ a likelihood method that allows us to infer specifics regarding current and ancestral population sizes, levels of gene flow, and the timing of population splits. Since the order of divergence between semispecies was well supported, we focused on determining whether a model of strict allopatric divergence (isolation with no migration) or a model allowing gene flow (isolation with migration) was more consistent with the patterns of genomic variation that we observe in our data. Although we expect neither of these simple models to capture the full history of the D. athabasca group, examining the goodness-of-fit of our data to these models will increase our understanding of demographic processes within the species group and provide an important framework for future evolutionary analyses in this species.
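Converting scaled estimates from such a fit into effective population sizes and years requires a mutation rate and a generation time, as described in the Methods. The sketch below assumes the usual diffusion-scaling conventions (θ = 4·Nref·μ·L, time in units of 2·Nref generations, migration as M = 2·Nref·m). The function name and all numeric inputs in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates from this study.

```python
def scale_parameters(theta, mu, n_sites, t_scaled, m_scaled, gens_per_year):
    """Convert scaled demographic estimates to natural units.

    theta:         scaled mutation rate, assumed to equal 4 * N_ref * mu * n_sites
    mu:            mutation rate per site per generation
    n_sites:       number of sites contributing to the frequency spectrum
    t_scaled:      divergence time in units of 2 * N_ref generations
    m_scaled:      migration rate expressed as 2 * N_ref * m
    gens_per_year: assumed number of generations per year
    """
    n_ref = theta / (4.0 * mu * n_sites)
    generations = 2.0 * n_ref * t_scaled
    return {
        "n_ref": n_ref,
        "years": generations / gens_per_year,
        "m_per_generation": m_scaled / (2.0 * n_ref),
    }

# Hypothetical inputs: theta = 4 with mu = 1e-8 over 1e5 sites gives N_ref = 1000.
scaled = scale_parameters(theta=4.0, mu=1e-8, n_sites=1e5,
                          t_scaled=0.5, m_scaled=2.0, gens_per_year=10)
```

With the values used in this chapter, `mu` would be 5.8e-9 and `gens_per_year` would be 10; the appropriate `n_sites` depends on how many sites were surveyed to build the spectrum.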

Our data reject a model of divergence under strict allopatry in favor of an isolation with migration model. Although we find little evidence for gene flow in our isolation-by-distance analyses, and previous studies exploring mtDNA haplotype sharing in D. athabasca have found no evidence for gene flow between semispecies (Yoon & Aquadro 1994), these observations do not rule out low levels of migration between semispecies. Estimates of current effective population sizes are consistent with expectations based on current observed ranges: Western-Northern has the largest effective population size, while Eastern-B has the smallest. Based on these results, we infer a divergence time of ~16,000 years for the Western-Eastern split and ~6,000 years for the split between Eastern-A and Eastern-B, similar to previous estimates by Ford & Aquadro (1996).

Conclusions & future prospects for D. athabasca

From an evolutionary perspective, D. athabasca is a compelling group in which to study incipient speciation. Semispecies share regions of sympatry, exhibit prezygotic isolation, and have very recent divergence times. Despite the existence of prezygotic isolation between semispecies, we detect low levels of gene flow between them, making the D. athabasca system an exciting arena in which to examine the interplay between the evolution of prezygotic isolation barriers and gene flow, a topic that deserves future in-depth analysis in this species. Theory predicts that speciation in the presence of gene flow should result in a characteristic pattern of introgression in which regions important for speciation remain distinct between populations, leaving other regions of the genome free to introgress (Via & West 2008; Nosil et al. 2009). An in-depth genome-wide scan for individual loci in the D. athabasca genome that violate the genome-wide patterns we show here would therefore provide valuable insight regarding the evolution of prezygotic isolating mechanisms and the permeability of genetic boundaries during speciation.

Previously, speciation studies were mostly limited to classic model organisms for which genomic resources have been well developed, and their closely related species. However, next-generation sequencing technologies have opened up the possibility of expanding and developing additional, more pertinent model systems for the study of speciation. The distinctive intraspecies groups have long made D. athabasca an attractive species for descriptive studies involving prezygotic isolation (Miller 1958, Miller & Westphal 1967; Yoon 1991) and population differentiation (Johnson 1978, 1985); however, until now, a lack of genomic resources has limited evolutionary investigations. This broad genomic survey of the patterns of variation and population structure within D. athabasca provides important genomic resources and a historical framework necessary for future evolutionary analyses in D. athabasca.

References

Barbash D. A., Roote J., Ashburner M., 2000. The Drosophila melanogaster hybrid male rescue gene causes inviability in male and female species hybrids. Genetics 154: 1747–1771.
Buerkle C., Lexer C., 2008. Admixture as the basis for genetic mapping. Trends Ecol Evol 23: 686–694.
Chang A. S., Noor M. A. F., 2007. The genetics of hybrid male sterility between the allopatric species pair Drosophila persimilis and D. pseudoobscura bogotana: dominant sterility alleles in collinear autosomal regions. Genetics 176: 343–349.
Coyne J., Orr H., 1989. Patterns of speciation in Drosophila. Evolution 43: 362–381.
Coyne J., Orr H., 1997. "Patterns of speciation in Drosophila" revisited. Evolution 51: 295–303.
Coyne J. A., Orr H. A., 2004. Speciation. Sinauer Associates, Inc.
Ford M. J., Aquadro C. F., 1996. Selection on X-linked genes during speciation in the Drosophila athabasca complex. Genetics 144: 689–703.
Ford M. J., Yoon C. K., Aquadro C. F., 1994. Molecular evolution of the period gene in Drosophila athabasca. Mol. Biol. Evol. 11: 169–182.
Gutenkunst R. N., Hernandez R. D., Williamson S. H., Bustamante C. D., 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. Plos Genet 5: e1000695.
Haag-Liautard C., Dorris M., Maside X., Macaskill S., Halligan D. L., Houle D., Charlesworth B., Keightley P. D., 2007. Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature 445: 82–85.
Hudson R. R., 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338.
Jensen J. L., Bohonak A. J., Kelley S. T., 2005. Isolation by distance, web service. BMC Genet. 6: 13.
Johnson D., 1978. Genetic differentiation in two members of the Drosophila athabasca complex. Evolution 32: 798–811.
Johnson D., 1985. Genetic differentiation in the Drosophila athabasca complex. Evolution 39: 467–472.

Li R., Zhu H., Ruan J., Qian W., Fang X., Shi Z., Li Y., Li S., Shan G., Kristiansen K., Li S., Yang H., Wang J., Wang J., 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 22: 549–556.
Lu X., Shapiro J. A., Ting C.-T., Li Y., Li C., Xu J., Huang H., Cheng Y.-J., Greenberg A. J., Li S.-H., Wu M.-L., Shen Y., Wu C.-I., 2010. Genome-wide misexpression of X-linked versus autosomal genes associated with hybrid male sterility. Genome Res 20: 1097–1102.
Masly J. P., Jones C. D., Noor M. A. F., Locke J., Orr H. A., 2006. Gene transposition as a cause of hybrid sterility in Drosophila. Science 313: 1448–1450.
Miller D., 1958. Sexual Isolation and Variation in Mating-Behavior Within Drosophila-Athabasca. Evolution 12: 72–81.
Miller D., Goldstein R., Patty R., 1975. Semispecies of Drosophila athabasca distinguishable by male courtship sounds. Evolution 29: 531–544.
Miller D., Roy R., 1964. Further data on Y chromosome types in Drosophila athabasca. Can. J. Genet. Cytol. 259: 334–348.
Miller D., Westphal N., 1967. Further evidence on sexual isolation within Drosophila athabasca. Evolution 21: 479–492.
Noor M. A. F., Feder J. L., 2006. Speciation genetics: evolving approaches. Nat Rev Genet 7: 851–861.
Nosil P., Feder J. L., 2012. Genomic divergence during speciation: causes and consequences. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 367: 332–342.
Nosil P., Funk D. J., Ortíz-Barrientos D., 2009. Divergent selection and heterogeneous genomic divergence. Mol. Ecol. 18: 375–402.
Orr H. A., 1995. The population genetics of speciation: the evolution of hybrid incompatibilities. Genetics 139: 1805–1813.
Ortíz-Barrientos D., Noor M. A. F., 2005. Evidence for a One-Allele Assortative Mating Locus. Science, New Series 310.
Patterson N., Price A. L., Reich D., 2006. Population Structure and Eigenanalysis. Plos Genet 2: e190.
Payseur B. A., 2010. Using differential introgression in hybrid zones to identify genomic regions involved in speciation. Molecular Ecology Resources 10: 806–820.
Phadnis N., Orr H. A., 2009. A Single Gene Causes Both Male Sterility and Segregation Distortion in Drosophila Hybrids. Science 323: 376–379.

Pool J. E., Nielsen R., 2007. Population size changes reshape genomic patterns of diversity. Evolution 61: 3001–3006.
Presgraves D. C., 2008. Sex chromosomes and speciation in Drosophila. Trends Genet. 24: 336–343.
Presgraves D. C., Balagopalan L., Abmayr S. M., Orr H. A., 2003. Adaptive evolution drives divergence of a hybrid inviability gene between two species of Drosophila. Nature 423: 715–719.
Sobel J., Chen G., Watt L., Schemske D., 2009. The biology of speciation. Evolution 64(2): 295–315.
Tang S., Presgraves D. C., 2009. Evolution of the Drosophila nuclear pore complex results in multiple hybrid incompatibilities. Science 323: 779–782.
Thornton K., 2003. libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19: 2325–2327.
Ting C. T., Takahashi A., Wu C. I., 2001. Incipient speciation by sexual isolation in Drosophila: concurrent evolution at multiple loci. Proc. Natl. Acad. Sci. U.S.A. 98: 6709–6713.
Ting C.-T., Tsaur S.-C., Wu M.-L., Wu C.-I., 1998. A Rapidly Evolving Homeobox at the Site of a Hybrid Sterility Gene. Science 282: 1501–1504.
Via S., 2009. Natural selection in action during speciation. Proceedings of the National Academy of Sciences 106 Suppl 1: 9939–9946.
Via S., West J., 2008. The genetic mosaic suggests a new role for hitchhiking in ecological speciation. Mol. Ecol. 17: 4334–4345.
Wu C. I., Palopoli M. F., 1994. Genetics of postmating reproductive isolation in animals. Annu. Rev. Genet. 28: 283–308.
Yoon C. K., 1991. Molecular and behavioral evolution in the semi-species of Drosophila athabasca. Thesis.
Yoon C. K., Aquadro C. F., 1994. Mitochondrial DNA variation among the Drosophila athabasca semispecies and Drosophila affinis. J. Hered. 85: 421–426.

Figures

Figure 1. D. athabasca species range based on personal collection and previous studies (Yoon 1991; Ford & Aquadro 1996). Semispecies ranges are indicated by different colors. Abbreviations used: WN = Western-Northern, EA = Eastern-A, and EB = Eastern-B. Note that the Western-Northern and Eastern-A semispecies share a region of sympatry near the Eastern United States-Canadian border, while the Eastern-A and Eastern-B semispecies share a region of sympatry in New Jersey.

Figure 2. Patterns of isolation-by-distance in D. athabasca. Species-wide, D. athabasca shows a significant pattern of IBD (a); however, controlling for population structure (b-d) shows weak IBD in Western-Northern and no IBD within Eastern-A and Eastern-B.

[Figure 2, panels a-d: genetic distance (Dxy) plotted against geographic distance (km). Panel statistics: ALL, r2=0.29, p<0.001; WN, r2=0.06, p<0.03; EA, r2=0.0005, p=0.47; EB, r2=0.019, p=0.94.]

Figure 3. Population structure in D. athabasca. (a) Neighbor-joining tree using all 28 polymorphism strains + reference strain, along with outgroups: D. pseudoobscura (dpse), D. affinis (daff), and D. algonquin (dalg). Tree was rooted using D. pseudoobscura and node labels are bootstrap supports from 1000 replicates. (b) Principal component analysis using autosomal SNP frequencies. Plot is of the first two principal components. The first principal component shows a clear separation of Western-Northern (red) from both Eastern semispecies but no differentiation between Eastern-A (blue) and Eastern-B (green). The second principal component shows separation of Eastern-A and Eastern-B.

[Figure 3, panel b: autosomes; axes are eigenvector 1 and eigenvector 2.]

Figure 4. Divergence model for D. athabasca demographic history.

[Figure 4 diagram labels: an ancestral population (ANC) splitting through time to the present into WN, EB, and EA; T(Western-Eastern) = ~16,000; T(EA-EB) = ~6,000; migration parameters M(WN-EB), M(WN-EA), and M(EA-EB).]

Tables

Table 1. Summary of variant sites within and between semispecies. Numbers in parentheses are percentages of SNPs in that category. Abbreviations are as in figure 2.

SNP category        Fixed              Polymorphic           Total
Private derived
  WN                15,664 (0.914)     436,607 (25.474)
  EA                496 (0.029)        569,523 (33.229)
  EB                1,319 (0.077)      292,875 (17.088)
  Subtotal                                                   1,316,484 (76.811)
Shared derived
  WN-EA             34 (0.002)         64,851 (3.784)
  WN-EB             26 (0.002)         22,498 (1.313)
  EA-EB             18,178 (1.061)     166,828 (9.734)
  ALL               ---                125,034 (7.295)
  Subtotal                                                   397,449 (23.189)
Total               35,717 (2.084)     1,678,216 (97.916)    1,713,933 (100)

Table 2. Patterns of sequence diversity within and between semispecies for all gene regions on the X-chromosome, autosomes, and whole-genome. Note that Muller C was omitted due to a polymorphic Muller C-Y fusion within D. athabasca.

Group             X-chromosome         Autosomes            Genome-wide
                  (Muller A & A/D)     (Muller B, E, F)
π
  Species-wide    0.00144              0.00059              0.00092
  WN              0.00126              0.00098              0.00101
  EA              0.00110              0.00071              0.00084
  EB              0.00033              0.00063              0.00045
FST
  WN-EA           0.360                0.107                0.224
  WN-EB           0.451                0.122                0.269
  EA-EB           0.088                0.024                0.049

Table 3. ∂a∂i maximum likelihood estimates of population demographic parameters under an isolation with symmetric migration model.

Parameter                 Estimate
Ancestral Ne              458,075
WN Ne                     781,886
EA Ne                     342,373
EB Ne                     85,520
T (Western-Eastern)       16,006
T (EasternA-EasternB)     6,116
M (WN<->EA)               2.05e-8
M (WN<->EB)               2.10e-8
M (EA<->EB)               1.90e-8

CHAPTER 4: THE GENOMIC LANDSCAPE OF INCIPIENT SPECIATION REVEALS A CANDIDATE SPECIATION GENE IN DROSOPHILA ATHABASCA

Abstract

Disentangling the causal mutations that contribute to reproductive isolation from those that have accumulated secondarily has been a major challenge in the field of speciation genetics (Coyne & Orr 2004, Butlin et al. 2012). Studying species in the earliest stages of divergence should provide us with key information about which genes are responsible for initiating a speciation event, since there has been much less time for secondary mutations to occur (Orr 1995). The Drosophila athabasca species complex consists of three very young semispecies that diverged from each other only 6,000 and 16,000 years ago, but exhibit highly distinct courtship songs resulting in behavioral premating isolation. Here we identify a candidate speciation gene in D. athabasca potentially involved in the evolution of prezygotic isolation within the species. We investigate the genomic landscape of divergence within the species complex by re-sequencing the genomes of 28 individuals from across the three semispecies. Divergence is generally elevated along the X chromosome, and approximately 10-30-fold higher in the oldest semispecies compared to the younger semispecies pair. We identify a single highly divergent region in the genome of the younger semispecies pair that also shows strong signatures of having undergone a selective sweep. This region contains a single gene, nonA, which has been previously shown to be involved in courtship song in other Drosophila species. Our data demonstrate that divergence increases non-uniformly between nascent species, both within and between chromosomes, as well as temporally during the speciation process, illustrating the role natural selection, and sexual selection in particular, plays in shaping genome evolution during the earliest stages of speciation.

Letter

Understanding the genetic basis underlying the process of speciation is one of the primary goals in the field of evolutionary biology. However, despite recent progress identifying genes contributing to postzygotic isolation (Presgraves 2010), little is known about the genetic basis and evolutionary forces that are important early on in speciation (Coyne & Orr 2004, Wolf et al. 2010). Notably, molecular mechanisms relating to the evolution of prezygotic isolating barriers are particularly poorly understood (Ortíz-Barrientos et al. 2009, Butlin et al. 2012, Safran et al. 2013). While studies on the genetics of postzygotic isolating barriers are critical to our understanding of how species boundaries are maintained post-speciation, such factors may not have been involved in driving the actual speciation event, and instead may have evolved secondarily (Wolf et al. 2010). By studying recently diverged populations, we increase the chances that the differences in the genome that we detect are directly responsible for driving reproductive isolation and thus speciation (Via 2009, Wolf et al. 2010, Safran et al. 2013).

Here we utilize whole genome sequencing techniques to study the genomics of speciation in a very young semispecies system, the Drosophila athabasca species complex, which is composed of three partially overlapping semispecies – Western-Northern, Eastern-A, and Eastern-B. The three semispecies of D. athabasca are estimated to have diverged less than 25,000 years ago (Chapter 3, Ford & Aquadro 1996) and are morphologically indistinguishable (Figure 1a). Drosophila males emit sounds by vibrating their wings during courtship, and differences in this courtship song between semispecies result in strong behavioral prezygotic isolation within D. athabasca (Miller 1958, Miller et al. 1975). Specifically, the interpulse interval (IPI) component of song, the time from the end of a pulse to the start of the next, varies among semispecies (Figure 1a), and laboratory crosses have revealed a high degree of sexual isolation between them (Miller 1958, Yoon 1991, Ford 1995). Crosses between semispecies produce fertile offspring (Miller & Westphal 1967, Johnson 1978, Yoon 1991), but their geographic ranges and strong behavioral prezygotic isolation differentiate the populations sufficiently for them to be designated as semispecies.
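The IPI definition above can be made concrete with a minimal sketch. It assumes that pulses have already been detected in a recording and are available as (start, end) times in milliseconds; the function names and the pulse times are hypothetical and are not part of the recording pipeline used in this work.

```python
def interpulse_intervals(pulses):
    """Return the interpulse intervals for a train of pulses.

    pulses: list of (start_ms, end_ms) tuples, sorted by start time.
    Each interval is measured from the end of one pulse to the start
    of the next, following the definition in the text.
    """
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(pulses, pulses[1:])
    ]


def mean_ipi(pulses):
    """Average interpulse interval of a pulse train, in milliseconds."""
    ipis = interpulse_intervals(pulses)
    if not ipis:
        raise ValueError("at least two pulses are needed to measure an IPI")
    return sum(ipis) / len(ipis)


# Hypothetical pulse train: four 3 ms pulses.
hypothetical_pulses = [(0.0, 3.0), (14.0, 17.0), (29.0, 32.0), (43.0, 46.0)]
print(interpulse_intervals(hypothetical_pulses))
print(round(mean_ipi(hypothetical_pulses), 2))
```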

To study the molecular and population genetic forces involved in the divergence within the D. athabasca complex, a single iso-female line from Deary, Idaho was sequenced with Illumina paired-end technology and used to assemble a draft of the D. athabasca genome, followed by whole genome re-sequencing of 28 individuals. The final assembly was 157.2 Mb in size with an N50 of 83.5 kb (Chapter 2). Using a related species with a sequenced genome, D. pseudoobscura, we assigned scaffolds to chromosomes based on conserved gene content and annotated the genome using a combination of comparative, de novo, and mRNAseq gene finding methods, resulting in a total of 13,378 predicted protein-coding genes (Chapter 2). We collected population samples from across the D. athabasca species range and measured the IPI from courtship song recordings. Combining IPI data with geographic range data, we were able to unambiguously assign individual iso-female lines to specific semispecies groups, and we re-sequenced 28 lines (9 from Western-Northern, 12 from Eastern-A, and 7 from Eastern-B), each at roughly 10X coverage (Chapter 2).

There are 1.7 million single nucleotide polymorphisms (SNPs) segregating within the D. athabasca complex, after applying strict coverage and quality filters (Chapter 3). The majority of SNPs are polymorphic within a single semispecies (75.8%), or shared across multiple semispecies (22.1%). Only 18,238 (1.1%) sites are fixed and shared between two semispecies, with the vast majority of these (99.6%) being shared between the Eastern-A and Eastern-B semispecies, reflecting their more recent common ancestry. Sites fixed and private within a semispecies account for only 1.0% of the total SNPs within D. athabasca (17,479 sites). Nucleotide diversity is low but varies by semispecies (πWN=0.00085, πEA=0.00071, πEB=0.00032), with Western-Northern exhibiting the highest diversity, consistent with it having the largest geographic range (Figure 1b), while Eastern-B shows severely reduced levels of nucleotide diversity compared with the other semispecies, suggesting that it may have undergone a recent population bottleneck. Although species-wide nucleotide diversity within D. athabasca (π=0.00099) is only slightly higher than within semispecies estimates, reflecting very recent divergence of the entire complex, population structure analyses show three distinct genetic populations within the species, corresponding to the three behaviorally defined semispecies (Chapter 3).
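The SNP categories used above (private or shared, fixed or polymorphic) can be sketched as a small classifier over derived allele frequencies. The exact rules applied in Chapter 3 are not restated here, so the rule below is an assumption: a site counts as "fixed" only when the derived allele is at frequency 1 in every semispecies that carries it. The function name is hypothetical.

```python
def classify_site(derived_freq):
    """Classify one site from its derived allele frequencies.

    derived_freq: dict mapping semispecies label -> derived allele
    frequency in [0, 1], e.g. {"WN": 1.0, "EA": 0.0, "EB": 0.0}.
    Returns (sharing, state, group), or None if the site is not variable
    within the complex. Assumed rule: "fixed" means frequency 1.0 in
    every semispecies carrying the derived allele.
    """
    carriers = [pop for pop, freq in derived_freq.items() if freq > 0.0]
    if not carriers:
        return None  # derived allele absent everywhere
    if all(freq == 1.0 for freq in derived_freq.values()):
        return None  # fixed in every semispecies, so not a SNP in the complex
    sharing = "private" if len(carriers) == 1 else "shared"
    fixed = all(derived_freq[pop] == 1.0 for pop in carriers)
    state = "fixed" if fixed else "polymorphic"
    group = "ALL" if len(carriers) == len(derived_freq) else "-".join(carriers)
    return (sharing, state, group)


print(classify_site({"WN": 1.0, "EA": 0.0, "EB": 0.0}))
print(classify_site({"WN": 0.0, "EA": 1.0, "EB": 1.0}))
print(classify_site({"WN": 0.2, "EA": 0.5, "EB": 1.0}))
```

Tallying the returned tuples over all sites would give counts in the same layout as the summary of variant sites in Chapter 3.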

We find 10-30x more private fixed SNPs in the oldest semispecies, Western-Northern (15,664 private fixed sites), which branched off approximately 16,000 years ago (Chapter 3), compared with the two more recently diverged semispecies that split only 6,000 years ago (496 private fixed sites in Eastern-A and 1,319 private fixed sites in Eastern-B). Investigating the distribution of divergence across the genome within the semispecies of D. athabasca using a sliding window analysis, we find divergence has accumulated broadly across the genome within the Western-Northern semispecies (WN mean=1.2 x 10^-4 fixed differences/bp, sd=0.0002). In contrast, the younger Eastern-A and Eastern-B semispecies show a more restricted distribution of fixed differences (EA mean=3.5 x 10^-6 fixed differences/bp, sd=3.0 x 10^-5; EB mean=1.0 x 10^-5 fixed differences/bp, sd=4.5 x 10^-5; Supplementary Figure 1). This demonstrates how differentiation can accumulate quickly even within very short timescales, consistent with the snowballing nature of species divergence, whereby additional differences between species are expected to rapidly accumulate after an initial reproductive isolating barrier evolves (Orr 1995).
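A sliding window count of this kind can be sketched as follows, using the 5 kb windows and 1 kb steps described in the Supplementary Information and keeping only full windows within a single scaffold. The function name, the 1-based coordinate convention, and the example positions are assumptions for illustration, not the scripts used for this analysis.

```python
from bisect import bisect_left, bisect_right


def sliding_window_counts(snp_positions, scaffold_length, window=5000, step=1000):
    """Count SNPs in full sliding windows along one scaffold.

    snp_positions: 1-based positions of (for example) private fixed SNPs.
    Returns a list of (start, end, count) tuples. Windows that would run
    past the end of the scaffold are skipped, so a scaffold shorter than
    the window size yields no windows.
    """
    positions = sorted(snp_positions)
    windows = []
    start = 1
    while start + window - 1 <= scaffold_length:
        end = start + window - 1
        # Number of positions p with start <= p <= end.
        count = bisect_right(positions, end) - bisect_left(positions, start)
        windows.append((start, end, count))
        start += step
    return windows


# Hypothetical scaffold of 7 kb with six fixed SNPs.
example = sliding_window_counts([10, 4999, 5000, 5001, 6500, 7000], 7000)
for start, end, count in example:
    print(start, end, count, count / 5000)  # count and per-bp density
```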

Patterns of divergence between the semispecies are not uniformly distributed across the genome, with the X chromosome exhibiting elevated levels of population differentiation (Fst) compared to autosomes across all semispecies comparisons (Chapter 3). Furthermore, the X chromosome shows an excess of fixed SNPs (between 73.2-91.7% of all fixed SNPs; Figure 2, Supplementary Table 1), far more than would be expected based on chromosome size. Elevated divergence at X-linked loci has been found in a number of other recent species comparisons (Stump et al. 2005, Garrigan et al. 2012, Ellegren et al. 2012), and is consistent with the view that sex chromosomes may play an important role during speciation and species divergence (Presgraves 2008, Kulathinal & Singh 2008), even at this very early stage.

Sliding window analyses of fixed divergence reveal a notable region in the genomes of the Eastern-A and Eastern-B semispecies that exhibits the highest level of fixed divergence in both semispecies (Figure 2). Pairwise divergence (Dxy) between Eastern-A and Eastern-B also peaks at this locus, and our annotation reveals the presence of a single gene, no-on-or-off-transient A (nonA), located beneath this peak (Figure 3a). Remarkably, the nonA gene has been previously identified to be involved in courtship song differences within other Drosophila species (Rendahl et al. 1992), and specifically affects the IPI phenotype between D. melanogaster and D. virilis (Campesan et al. 2001). This finding suggests that nonA may play a role in the evolution of prezygotic isolation and thus semispecies divergence within D. athabasca, consistent with the theory that regions exhibiting elevated levels of divergence between recently diverged populations may contain loci relevant to functional divergence (Wu 2001, Nosil et al. 2009, 2012). The peak of divergence between Eastern-A and Eastern-B at the nonA locus lies close (<3kb) to the end of a scaffold, perhaps suggesting a role for complex repeats in species divergence (Supplementary Figure 4), similar to a recent study in flycatchers (Ellegren et al. 2012). However, to rule out the possibility that the patterns of divergence we see are merely due to assembly artifacts, we confirmed the presence of private fixed SNPs at this locus using Sanger sequencing.
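For readers unfamiliar with the two summary statistics contrasted at this locus, the sketch below computes within-population nucleotide diversity (π) and between-population pairwise divergence (Dxy) for one window. It assumes aligned, equal-length haplotype strings with no missing data; the function names and toy sequences are hypothetical, and the analyses in this chapter used libsequence rather than this code.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def pi_within(seqs):
    """Mean pairwise differences per site among sequences of one population."""
    pairs = list(combinations(seqs, 2))
    length = len(seqs[0])
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * length)


def dxy(seqs_x, seqs_y):
    """Mean pairwise differences per site between two populations."""
    length = len(seqs_x[0])
    total = sum(pairwise_differences(a, b) for a in seqs_x for b in seqs_y)
    return total / (len(seqs_x) * len(seqs_y) * length)


# Toy 10 bp window: two fixed differences between groups, little variation within.
toy_ea = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAAAAAAT"]
toy_eb = ["CCAAAAAAAA", "CCAAAAAAAA", "CCAAAAAAAA"]
print(pi_within(toy_ea), pi_within(toy_eb), dxy(toy_ea, toy_eb))
```

A window with high Dxy but low π in both populations, as in this toy example, is the pattern described for the region containing nonA.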

To assess whether selection is responsible for driving the divergence between semispecies at the nonA locus, we examine the distribution of genomic windows comparing levels of semispecies divergence to within semispecies polymorphism. The window containing nonA is an extreme outlier compared to the rest of the genome, exhibiting elevated divergence between Eastern-A and Eastern-B along with suppressed levels of within semispecies polymorphism (Figure 3b), consistent with signatures of a recent selective sweep. Application of the Fst-based Population Branch Statistic to all annotated transcripts in D. athabasca shows evidence for accelerated evolution specifically along the Eastern-A branch at a nonA transcript, compared to genome-wide averages (Figure 3c). Remarkably, if the nonA locus encodes semispecies specific IPI information, similar to its function in D. melanogaster and D. virilis, the accelerated evolution that we observe along the Eastern-A branch at nonA could potentially account for the observed phenotypic difference in IPI seen in individuals of the Eastern-A semispecies (Figure 1a).
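The Population Branch Statistic is commonly defined from three pairwise Fst values by first converting each to a branch length, T = -ln(1 - Fst), and then solving for the length of the branch leading to the focal population. The sketch below follows that common definition; whether any additional filtering or scaling was applied in this analysis is not stated, and the Fst values used are hypothetical.

```python
import math


def branch_length(fst):
    """Convert a pairwise Fst (0 <= Fst < 1) to a divergence-time-like scale."""
    return -math.log(1.0 - fst)


def pbs(fst_focal_a, fst_focal_b, fst_a_b):
    """Population Branch Statistic for a focal population.

    fst_focal_a, fst_focal_b: Fst between the focal population and each of
    the other two populations; fst_a_b: Fst between those two populations.
    """
    return (
        branch_length(fst_focal_a)
        + branch_length(fst_focal_b)
        - branch_length(fst_a_b)
    ) / 2.0


# Hypothetical per-transcript Fst values (not taken from the data).
fst_ea_eb, fst_ea_wn, fst_eb_wn = 0.5, 0.6, 0.3
pbs_ea = pbs(fst_ea_eb, fst_ea_wn, fst_eb_wn)  # branch leading to Eastern-A
pbs_eb = pbs(fst_ea_eb, fst_eb_wn, fst_ea_wn)  # branch leading to Eastern-B
print(pbs_ea, pbs_eb)
```

A transcript whose focal-branch value greatly exceeds the genome-wide average is a candidate for lineage-specific accelerated evolution, which is the comparison drawn for nonA on the Eastern-A branch.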

To conclude, our study provides one of the first genome-wide population genetic investigations of the molecular changes and population parameters important during incipient speciation, contributing important new information to our view of the genetics of speciation. Our data suggest that divergence accumulates rapidly, very early on during the divergence process, with signals of divergence at specific loci already obscured at 16,000 years divergence. However, with some of the youngest divergence times used to study speciation genetics on a genome-wide level thus far, examining the patterns of divergence within and between the youngest semispecies of D. athabasca has allowed us to identify a candidate gene that may play a role in the evolution of prezygotic isolation, and thus, reproductive isolation during the earliest stages of speciation.

References

Butlin R., Debelle A., Kerth C., Snook R. R., Beukeboom L. W., Castillo Cajas R. F., Diao W., Maan M. E., Paolucci S., Weissing F. J., van de Zande L., Hoikkala A., Geuverink E., Jennings J., Kankare M., Knott K. E., Tyukmaeva V. I., Zoumadakis C., Ritchie M. G., Barker D., Immonen E., Kirkpatrick M., Noor M., Macias Garcia C., Schmitt T., Schilthuizen M., 2012. What do we need to know about speciation? Trends Ecol. Evol. (Amst.) 27: 27–39.
Campesan S., Dubrova Y., Hall J. C., Kyriacou C. P., 2001. The nonA gene in Drosophila conveys species-specific behavioral characteristics. Genetics 158: 1535–1543.
Coyne J. A., Orr H. A., 2004. Speciation. Sinauer Associates, Inc., Sunderland, Massachusetts.
Ellegren H., Smeds L., Burri R., Olason P. I., Backström N., Kawakami T., Künstner A., Mäkinen H., Nadachowska-Brzyska K., Qvarnström A., Uebbing S., Wolf J. B. W., 2012. The genomic landscape of species divergence in Ficedula flycatchers. Nature 491: 756–760.
Ford M. J., 1995. Selective sweeps during speciation: Theory and practice in Drosophila athabasca. Ph.D. Dissertation, Cornell University.
Ford M. J., Aquadro C. F., 1996. Selection on X-linked genes during speciation in the Drosophila athabasca complex. Genetics 144: 689–703.
Garrigan D., Kingan S. B., Geneva A. J., Andolfatto P., Clark A. G., Thornton K. R., Presgraves D. C., 2012. Genome sequencing reveals complex speciation in the Drosophila simulans clade. Genome Res 22: 1499–1511.
Johnson D., 1978. Genetic differentiation in two members of the Drosophila athabasca complex. Evolution 32: 798–811.
Kulathinal R. J., Singh R. S., 2008. The molecular basis of speciation: from patterns to processes, rules to mechanisms. J. Genet. 87: 327–338.
Miller D., 1958. Sexual Isolation and Variation in Mating-Behavior Within Drosophila-Athabasca. Evolution 12: 72–81.
Miller D., Goldstein R., Patty R., 1975. Semispecies of Drosophila athabasca distinguishable by male courtship sounds. Evolution 29: 531–544.
Miller D., Westphal N., 1967. Further evidence on sexual isolation within Drosophila athabasca. Evolution 21: 479–492.
Nosil P., Funk D. J., Ortíz-Barrientos D., 2009. Divergent selection and heterogeneous genomic divergence. Mol. Ecol. 18: 375–402.

Nosil P., Parchman T. L., Feder J. L., Gompert Z., 2012. Do highly divergent loci reside in genomic regions affecting reproductive isolation? A test using next-generation sequence data in Timema stick insects. BMC Evol. Biol. 12: 164.
Orr H. A., 1995. The population genetics of speciation: the evolution of hybrid incompatibilities. Genetics 139: 1805–1813.
Ortíz-Barrientos D., Grealy A., Nosil P., 2009. The genetics and ecology of reinforcement: implications for the evolution of prezygotic isolation in sympatry and beyond. Ann. N. Y. Acad. Sci. 1168: 156–182.
Presgraves D. C., 2008. Sex chromosomes and speciation in Drosophila. Trends Genet. 24: 336–343.
Presgraves D. C., 2010. The molecular evolutionary basis of species formation. Nat Rev Genet 11: 175–180.
Rendahl K. G., Jones K. R., Kulkarni S. J., Bagully S. H., Hall J. C., 1992. The dissonance mutation at the no-on-transient-A locus of D. melanogaster: genetic control of courtship song and visual behaviors by a protein with putative RNA-binding motifs. J. Neurosci. 12: 390–407.
Safran R. J., Scordato E. S. C., Symes L. B., Rodríguez R. L., Mendelson T. C., 2013. Contributions of natural and sexual selection to the evolution of premating reproductive isolation: a research agenda. Trends Ecol. Evol. (Amst.) 28: 643–650.
Stump A. D., Shoener J. A., Costantini C., Sagnon N., Besansky N. J., 2005. Sex-Linked Differentiation Between Incipient Species of Anopheles gambiae. Genetics 169: 1509–1519.
Via S., 2009. Natural selection in action during speciation. Proceedings of the National Academy of Sciences 106 Suppl 1: 9939–9946.
Wolf J. B. W., Lindell J., Backström N., 2010. Speciation genetics: current status and evolving approaches. Philos. Trans. R. Soc. Lond., B, Biol. Sci. 365: 1717–1733.
Wu C., 2001. The genic view of the process of speciation. Journal of Evolutionary Biology 14: 851–865.
Yoon C. K., 1991. Molecular and behavioral evolution in the semi-species of Drosophila athabasca. Ph.D. Dissertation, Cornell University.

Figures

Figure 1. Overview of the D. athabasca semispecies complex. (a) Semispecies are morphologically identical, but exhibit semispecies-specific courtship songs most easily quantified by differences in interpulse interval (IPI), or the time from the end of a pulse to the start of the next. The average IPI for each semispecies is indicated underneath a typical waveform. Western-Northern and Eastern-B exhibit similar IPIs; however, their ranges do not overlap in nature (b). Semispecies ranges are depicted by different colors: WN = red, EA = blue, EB = green.

Figure 2. Genomic landscape of divergence. Distribution of private and fixed SNPs in each of the semispecies using 5kb sliding windows. The arrow indicates the region with highest divergence in both Eastern-A and Eastern-B, which occurs at the same position of the genome in both semispecies.
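The 5kb sliding windows behind this figure (described in the Supplementary Information as starting at the first base of a scaffold, advancing in 1kb steps, and keeping only full windows) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `count_snps_in_windows`, the input format (a sorted list of 1-based SNP positions on one scaffold), and the example positions are all assumptions made for the sketch.

```python
from bisect import bisect_left, bisect_right

def count_snps_in_windows(snp_positions, scaffold_length, window=5000, step=1000):
    """Count SNPs in full windows along a single scaffold.

    snp_positions: sorted 1-based SNP positions (hypothetical input format).
    Returns (start, end, count) tuples with 1-based inclusive coordinates.
    Scaffolds shorter than the window yield no windows, and a trailing
    partial window is never reported.
    """
    results = []
    start = 1
    while start + window - 1 <= scaffold_length:
        end = start + window - 1
        lo = bisect_left(snp_positions, start)   # first SNP at or after start
        hi = bisect_right(snp_positions, end)    # one past the last SNP at or before end
        results.append((start, end, hi - lo))
        start += step
    return results

# Hypothetical SNP positions on a 12 kb scaffold.
positions = [120, 4800, 5200, 9100]
sliding = count_snps_in_windows(positions, scaffold_length=12000)
# Setting step equal to window gives tiled (non-overlapping) windows.
tiled = count_snps_in_windows(positions, scaffold_length=12000, window=5000, step=5000)
```

Because windows never span scaffolds, the function would be called once per scaffold; with tiled windows the last 2 kb of this example scaffold is dropped, matching the exclusion of scaffold ends noted in the methods.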

[Figure labels recovered from the images: panels a and b; average IPIs 11.2±0.8 ms, 29.0±2.6 ms, 13.4±1.0 ms; time labels 16 kya and 6 kya.]


Figure 3. Genome-wide scans identify nonA as a candidate speciation gene within D. athabasca. (a) Patterns of pairwise divergence using a 5kb sliding window between the two youngest semispecies, Eastern-A and Eastern-B, show elevated Dxy between semispecies (black line) and low within semispecies polymorphism (πwithin; blue=EA, green=EB) coinciding with the region of the genome containing the nonA gene. (b) Across the majority of the genome (10kb windows), levels of nucleotide diversity between Eastern-A and Eastern-B (πbetween) correlate positively with nucleotide diversity within semispecies (πwithin). The window containing the nonA gene, however, shows signatures of a recent selective sweep, with high levels of πbetween but greatly suppressed diversity within semispecies (darker point = more windows; using πwithin from Eastern-B shows the same pattern; Supplementary Figure 2). (c) Analysis of branch specific evolution at nonA shows accelerated evolution specifically in the Eastern-A lineage compared to the genome-wide average.

[Figure 3 labels recovered from the images: panels a, b, and c; tree labels "Genome-wide average" and "nonA".]


Supplementary Information

Sliding/tiled window analyses

Using the D. athabasca reference genome, population variant calls (Chapter 2), and SNP classifications (Chapter 3), we examined patterns of within semispecies diversity and between semispecies divergence across the genome by dividing the genome into either 5kb or 10kb windows. To compute genome-wide averages for nucleotide diversity within (πwithin) and between (πbetween) semispecies, we estimated population genetic parameters for each window using the C++ library libsequence and the compute program (Thornton 2003). Genome-wide averages were obtained by averaging across all windows. To examine the genome-wide distribution of divergence between semispecies, we counted the number of private fixed SNPs for each semispecies in 5kb sliding windows across the genome, starting at the first base of a scaffold and sliding across in 1kb steps. For visualization purposes, scaffolds were grouped together randomly by Muller element, but no cross-scaffold windows were considered. For all window analyses, we omitted sites that lacked data from all 28 individuals. Scaffolds with total lengths smaller than the window size were excluded from this analysis and only full windows were examined, which for tiled (non-overlapping) windows resulted in the exclusion of the very end of some scaffolds.

Experimental verification of fixed SNPs in the nonA region

In order to validate that the pattern of divergence we observe between semispecies at the nonA locus is not due to assembly artifacts, we used Sanger sequencing to verify semispecies specific SNPs within the nonA locus and nearby flanking regions. However, the scaffold containing nonA ends close to the start of the gene, and the 5’ region directly upstream of the gene is highly repetitive (Supplementary Figure 4a), which limited our ability to accurately amplify this region. We therefore used paired-end information from our Illumina reads to identify the scaffold that lies directly upstream. To do this, we mapped all the Illumina reads from our population re-sequencing analysis (90bp paired-end reads, 500bp insert size) to our reference assembly using Bowtie2 (Langmead & Salzberg 2012, --very-sensitive). Examining the paired-end mapping flags output from Bowtie2 (Supplementary Figure 4a), we retained all reads that mapped within one insert size (500bp) of the 5’ end of the scaffold and whose mates mapped within one insert size of the end of a different scaffold (618 pairs). The majority of these mates mapped to a single scaffold (465 pairs; Supplementary Figure 4b), allowing us to identify the scaffold that lies directly 5’ of the scaffold containing nonA. We then used this information to design primers to amplify nonA, along with its flanking regions, with the forward primer located in the upstream scaffold (Supplementary Table 2).
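The read-pair filter described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in the study: alignments are assumed to have been parsed already into simple `(scaffold1, pos1, scaffold2, pos2)` tuples, the 5’ end of the target scaffold is assumed to be its first base, and all scaffold names, lengths, and positions are hypothetical. No Bowtie2 or SAM parsing is shown.

```python
from collections import Counter

INSERT_SIZE = 500  # insert size of the paired-end library in bp, as stated in the text

def near_end(pos, scaffold_length, insert_size=INSERT_SIZE):
    """True if a 1-based position lies within one insert size of either scaffold end."""
    return pos <= insert_size or pos > scaffold_length - insert_size

def candidate_upstream_scaffolds(pairs, target, scaffold_lengths, insert_size=INSERT_SIZE):
    """Tally scaffolds linked by read pairs to the start of the target scaffold.

    pairs: iterable of (scaffold1, pos1, scaffold2, pos2), one per mapped pair
           (a hypothetical, pre-parsed representation of the alignments).
    """
    links = Counter()
    for s1, p1, s2, p2 in pairs:
        # Orient the pair so that the first read is the one on the target scaffold.
        if s2 == target and s1 != target:
            s1, p1, s2, p2 = s2, p2, s1, p1
        if s1 != target or s2 == target:
            continue  # need one read on the target and its mate on another scaffold
        if p1 <= insert_size and near_end(p2, scaffold_lengths[s2], insert_size):
            links[s2] += 1
    return links

# Hypothetical scaffolds and read pairs.
lengths = {"scaffold_nonA": 20000, "scaffold_X": 8000, "scaffold_Y": 9000}
pairs = [
    ("scaffold_nonA", 120, "scaffold_X", 7800),    # kept
    ("scaffold_X", 7900, "scaffold_nonA", 300),    # kept (reads in the other order)
    ("scaffold_nonA", 250, "scaffold_Y", 4000),    # mate not near a scaffold end
    ("scaffold_nonA", 5000, "scaffold_X", 7900),   # read not near the target's start
    ("scaffold_nonA", 100, "scaffold_nonA", 400),  # both reads on the target
    ("scaffold_nonA", 50, "scaffold_Y", 200),      # kept
]
links = candidate_upstream_scaffolds(pairs, "scaffold_nonA", lengths)
best_scaffold, support = links.most_common(1)[0]
```

The scaffold with the most supporting pairs is the candidate neighbor; in the analysis above this corresponds to the 465 of 618 pairs that pointed to a single scaffold.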

DNA was extracted from a single male individual from each semispecies using the Puregene DNA Extraction Kit (Qiagen) and PCR was carried out using LA Taq DNA polymerase (Takara) under the 2-step PCR cycle recommended by the manufacturer. PCR products were cleaned using Exonuclease I and Shrimp Alkaline Phosphatase, and sequenced with Big-Dye (Version 3.1; Applied Biosystems) using PCR primers and additional internal primers (Supplementary Table 2). Sequencing reactions were cleaned using Performa dye terminator removal plates (EdgeBio) and run on an Applied Biosystems 3730 capillary sequencer.


Detecting selection

In order to examine the genome for signatures of selection, we used the estimates of nucleotide diversity (π) both between and within semispecies in 10kb non-overlapping windows as calculated above. To look for outlier windows in the genome, we plotted the average πbetween against πwithin for each window using density scatterplots (hexbin package in R; Supplementary Figure 2), with windows showing signatures of a selective sweep expected to show high levels of between semispecies divergence (πbetween) and low levels of within semispecies polymorphism (πwithin).
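Outliers in this analysis were identified visually from the density scatterplots. As a complementary illustration only, the expectation stated above (high πbetween together with low πwithin) could be turned into a simple quantile filter. The cutoffs (top and bottom 5%), the function name, and the example values below are assumptions made for this sketch, not part of the method used in the study.

```python
def flag_sweep_candidates(windows, high_q=0.95, low_q=0.05):
    """Return ids of windows with high pi_between and low pi_within.

    windows: list of dicts with keys 'id', 'pi_between', 'pi_within'.
    The quantile cutoffs are illustrative assumptions.
    """
    def quantile(values, q):
        ordered = sorted(values)
        # Nearest-rank style quantile; adequate for a sketch.
        idx = min(len(ordered) - 1, max(0, int(round(q * (len(ordered) - 1)))))
        return ordered[idx]

    between_cut = quantile([w["pi_between"] for w in windows], high_q)
    within_cut = quantile([w["pi_within"] for w in windows], low_q)
    return [w["id"] for w in windows
            if w["pi_between"] >= between_cut and w["pi_within"] <= within_cut]

# Hypothetical data: 20 windows where pi_between tracks pi_within,
# plus one window with high divergence and very low diversity.
windows = [
    {"id": f"w{i}", "pi_within": 0.001 + 0.0001 * i,
     "pi_between": (0.001 + 0.0001 * i) * 1.5}
    for i in range(20)
]
windows.append({"id": "w_outlier", "pi_between": 0.006, "pi_within": 0.0002})

candidates = flag_sweep_candidates(windows)
```

A filter like this only ranks windows; it does not replace inspecting the joint distribution, which is why the scatterplots remain the primary evidence.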

Although the window-based approach described above gives us information about which regions in the genome may have undergone a recent selective sweep, it gives us limited information about lineage specific selection. To examine branch specific evolution in the D. athabasca complex, we used the Fst-based Population Branch Statistic developed by Yi et al. (2010). Using all transcripts from our genome annotation, including alternatively spliced transcripts, we calculated branch lengths using estimates of Fst (libsequence) for each pairwise semispecies comparison, correcting for negative branch lengths. We plotted the distribution of branch lengths for each semispecies lineage to look for transcripts exhibiting abnormal branch lengths (Supplementary Figure 3). To calculate the genome-wide average tree, we averaged the branch lengths for all transcripts in the genome. Notably, although there are two annotated transcripts for nonA, only the longer transcript (894 vs. 271 amino acids) shows evidence for branch specific selection.

References

Langmead B., Salzberg S. L., 2012. Fast gapped-read alignment with Bowtie 2. Nat Meth 9: 357–359.

Thornton K., 2003. libsequence: a C++ class library for evolutionary genetic analysis. Bioinformatics 19: 2325–2327.

Yi X., Liang Y., Huerta-Sanchez E., Jin X., Cuo Z. X. P., Pool J. E., Xu X., Jiang H., Vinckenbosch N., Korneliussen T. S., Zheng H., Liu T., He W., Li K., Luo R., Nie X., Wu H., Zhao M., Cao H., Zou J., Shan Y., Li S., Yang Q., Asan, Ni P., Tian G., Xu J., Liu X., Jiang T., Wu R., Zhou G., Tang M., Qin J., Wang T., Feng S., Li G., Huasang, Luosang J., Wang W., Chen F., Wang Y., Zheng X., Li Z., Bianba Z., Yang G., Wang X., Tang S., Gao G., Chen Y., Luo Z., Gusang L., Cao Z., Zhang Q., Ouyang W., Ren X., Liang H., Zheng H., Huang Y., Li J., Bolund L., Kristiansen K., Li Y., Zhang Y., Zhang X., Li R., Li S., Yang H., Nielsen R., Wang J., Wang J., 2010. Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude. Science 329: 75.


Supplementary Figures

Figure S1. Density of private fixed differences per base pair in non-overlapping 10kb windows for the three semispecies of D. athabasca. WN=Red, EA=Blue, EB=Green.

[Histogram panels recovered from the figure: table$fixedWN.bp and table$fixedEA.bp. x-axis: private fixed SNPs/bp; y-axis: frequency.]


Figure S2. Detecting windows showing signatures of selective sweeps for all comparisons. All density scatterplots involving Western-Northern show a much broader distribution (darker points = more windows). The outlier showing evidence of a selective sweep in the window containing nonA remains an outlier across all Eastern-A/Eastern-B comparisons, regardless of whether the πwithin used was from Eastern-A or Eastern-B.

[Density scatterplot panels. Axis labels recovered from the figure: between semispecies diversity (πWNEA, πWNEB, πEAEB) and within semispecies diversity (πWN, πEA, πEB); color scale: counts.]


Figure S3. Population Branch Statistic distribution for all annotated transcripts in the genome for each semispecies. The magenta dot indicates the largest nonA transcript.
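The Population Branch Statistic plotted here follows Yi et al. (2010): each pairwise Fst is transformed to a divergence time T = -log(1 - Fst), and the branch length for one population is half the sum of its two pairwise T values minus the T between the other two populations. A minimal sketch is given below. The function names and the Fst values are hypothetical, and setting negative branch lengths to zero is an assumption; the methods state only that negative branch lengths were corrected.

```python
import math

def branch_length(fst):
    """Transform a pairwise Fst (must be < 1) into T = -log(1 - Fst)."""
    return -math.log(1.0 - fst)

def population_branch_statistic(fst_ab, fst_ac, fst_bc):
    """PBS for population A from the three pairwise Fst values among A, B, and C."""
    t_ab, t_ac, t_bc = (branch_length(f) for f in (fst_ab, fst_ac, fst_bc))
    pbs = (t_ab + t_ac - t_bc) / 2.0
    # Assumption: negative branch lengths are set to zero.
    return max(pbs, 0.0)

# Hypothetical per-transcript Fst values for the three pairwise comparisons.
fst_ea_eb, fst_ea_wn, fst_eb_wn = 0.6, 0.7, 0.3

pbs_ea = population_branch_statistic(fst_ea_eb, fst_ea_wn, fst_eb_wn)
pbs_eb = population_branch_statistic(fst_ea_eb, fst_eb_wn, fst_ea_wn)
pbs_wn = population_branch_statistic(fst_ea_wn, fst_eb_wn, fst_ea_eb)
```

When no branch is clipped to zero, the branch lengths of two populations sum to the transformed divergence between them, which is a convenient sanity check on the arithmetic.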


Figure S4. Identifying the scaffold upstream from the scaffold that contains nonA. (a) The distribution of paired-end mapping flags from Bowtie2 along the scaffold that contains nonA (gene region at 2170-11020bp), along with the corresponding coverage plot for the scaffold showing extremely high coverage at the 5’ end, indicative of repetitive sequence in this region. (b) Paired-end mapping in all three semispecies shows evidence for the same upstream scaffold.


Supplementary Tables

Table S1. Number of sites with private and fixed SNPs per semispecies per Muller Element. Omitting Muller C (due to a polymorphic Y-Muller C fusion within D. athabasca), 86.8% of Western-Northern, 91.7% of Eastern-A, and 73.2% of Eastern-B private fixed SNPs are located on the X-chromosome.

Element    WN       EA     EB
A          6931     276    571
A/D        5116     178    378
B          563      12     334
C          1788     1      23
E          714      4      5
F          111      0      0
Unknown    441      25     8
Total      15664    496    1319
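The X-linked percentages quoted in the caption can be re-derived from the table counts. The sketch below assumes that the X-linked rows are Muller elements A and A/D, an assumption that reproduces the stated values, and it omits Muller C from the denominator as the caption describes. The function name is hypothetical.

```python
# Private fixed SNP counts per Muller element, copied from Table S1.
counts = {
    "WN": {"A": 6931, "A/D": 5116, "B": 563, "C": 1788, "E": 714, "F": 111, "Unknown": 441},
    "EA": {"A": 276, "A/D": 178, "B": 12, "C": 1, "E": 4, "F": 0, "Unknown": 25},
    "EB": {"A": 571, "A/D": 378, "B": 334, "C": 23, "E": 5, "F": 0, "Unknown": 8},
}

def percent_x_linked(by_element, x_elements=("A", "A/D"), omit=("C",)):
    """Percentage of SNPs on the assumed X-linked elements, excluding omitted elements."""
    total = sum(n for element, n in by_element.items() if element not in omit)
    x_total = sum(by_element[element] for element in x_elements)
    return 100.0 * x_total / total

percentages = {name: round(percent_x_linked(by_element), 1)
               for name, by_element in counts.items()}
print(percentages)
```

For Western-Northern, for example, this is (6931 + 5116) / (15664 - 1788) = 12047 / 13876, or 86.8%.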


Table S2. Primers used for validation of fixed SNPs in the nonA region by PCR and Sanger sequencing.

Number    Type        Sequence
1         PCR_Fwd     TCCAAGAGAAGCATCGACCT
2         PCR_Rev     AGTCAATGGTCGCATCATGT
3         internal    ACCATTCCAACGCTATGCTC
4         internal    GGATAGCCCAGACACTTCCA
5         internal    CCAGCAGCATGTTTGAGAAA
6         internal    CATGATAGGCACCGTTTGTG
7         internal    CCACCGACTGTTTCGAGAGT
8         internal    GACCTTGGGTTGCATTAGGA
9         internal    CGGATGAATCGTTTCCAGAT
10        internal    CCACTACGATTGAACCACCA
11        internal    CAAACACACTGCACCCAAAC
12        internal    TTTTATCCCGCATTTCCTTG
13        internal    TTTGGTGGCTCAAACAATCA
14        internal    AGAGTTTTCTCCTGCCCACA
15        internal    CGCTTTTGAATTTGGGGTTA
16        internal    ACGTTGTGGAGGAAATCCAA
17        internal    GGCGTTGGTTGTCAAAGAAT
18        internal    GCCTTGTTCCAAGTCTCTCG
19        internal    CATAGCAATGATCGCTCGAA
20        internal    GCGGTGGTGATGATTTCTTT
21        internal    AATTTCGGAGGCAATAGCAA
22        internal    ATATCCGGAGGGGTGGTCTA
23        internal    TCGGAACAAAGACACGATCA
24        internal    GCTCATTGTCGGTAATGTCG
25        internal    TTTTCACGCAACGAAGTGTT
26        internal    ATTTAGGGGTGTCTGGCAAA
27        internal    GCTTCGCTTCCTTGGTTATC
28        internal    CGCAGAATGATGAGCGTTTA
29        internal    GTGAATGACGACAACGATGG
30        internal    TTCCCATCCCATTTTAACCA
31        internal    GGCGAGTGGGAAAAGGTAAT
32        internal    ATTCCATTTTCTTGCGTTCG
33        internal    AGCCGCTGAAGTTGACTTGT


APPENDIX A: COURTSHIP SONG WAVEFORMS OF SEQUENCED LINES


APPENDIX B: MITOTIC KARYOTYPES

Representative Mitotic Karyotype for Each Line

ID-1 ME-6 ME-15 ME-16

ME-20 ME-43 ME-47 ME-60

ME_BW-42 ME-BW-13 ME-BW-14 ME-BW-26

ME-BW-28 ME-BW-41 ME-BW-58 ME-BW-61

MI-BC-3 MI-BC-13 MI-BC-22 MI-BC-60


MI-BC-67 MI-IL-11 MI-IL-12 MI-IL-36

MN-11 MN-13 MN-17 MN-20

MN-40 MN-46 MN-47 NH-2

NJ-28 NJ-34 NJ-38 NJ-68

NJ-126 NJ-BI-12 NJ-BI-9 NM-28

*


NY-NS-5 NY-NS-11 NY-NS-15 NY-NS-26

NY-NS-27 NY-NS-28 NY-NS-34 PA-20

PA-21 PA-22 PA-26 PA-30

PA-35 PA-45 PA-49 PA-56

PA-60 PA-63 PA_67 PA-72


* Karyotype for D. algonquin sequenced strain (outgroup)

PA-76 PA-BM-18 PA-BM-25 PA-BM-40

PA-BM-42 R-2(CA-2) R-3(CA-3) R-4(CA-4)

R-5(CA-5) RIL-2 VA-PW-54 VA-PW-56

VA-PW-72 VA-PW-99 VT-1 VT-16

CA-2 CA-3 CA-4

CA-5 IL-2