Thinking about rare alleles in flies and humans Kevin Thornton Ecology & Evolutionary Biology UC Irvine

Upload: kevin-thornton

Post on 17-Jul-2015

TRANSCRIPT

  • Thinking about rare alleles in flies and humans

    Kevin Thornton

    Ecology & Evolutionary Biology, UC Irvine

  • Schmidt et al. doi:10.1371/journal.pgen.1000998

  • http://www.illumina.com

    Clark, A. G., et al. doi:10.1038/nature06341

  • http://commons.wikimedia.org/wiki/Maps_of_Africa#/media/File:AfricaCIA-HiRes.jpg

    Map labels: DS; DS, DY; DY (DS = D. simulans, DY = D. yakuba)

  • Reference: A B C D E F G H I

    Sample:    A B C D E F D E F G H I

    Cridland, J. M., & Thornton, K. R. doi:10.1093/gbe/evq001
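The reference/sample diagram above shows blocks D-E-F tandemly duplicated in the sample. A minimal sketch, assuming a toy model in which each letter is one block of sequence and a read pair is two blocks a fixed distance apart, illustrates why pairs spanning the duplication junction map in divergent ("inside out") orientation on the reference. The function name and the `gap` parameter are hypothetical and are not taken from the cited papers.

```python
# Toy model of the slide's diagram: each letter is one block of sequence.
REFERENCE = "ABCDEFGHI"
SAMPLE = "ABCDEFDEFGHI"  # blocks D-E-F tandemly duplicated


def divergent_pairs(sample, reference, gap=2):
    """Return (forward_block, reverse_block) pairs that map 'inside out'.

    A pair is drawn from sample positions i and i + gap. The first read is
    treated as the forward-strand read and the second as its reverse-strand
    mate. In a collinear genome the forward read maps upstream of its mate;
    a pair spanning the duplication junction has its forward read mapping
    downstream of its mate on the reference.
    """
    ref_pos = {block: i for i, block in enumerate(reference)}
    flagged = []
    for i in range(len(sample) - gap):
        fwd, rev = sample[i], sample[i + gap]
        if ref_pos[fwd] > ref_pos[rev]:
            flagged.append((fwd, rev))
    return flagged


print(divergent_pairs(SAMPLE, REFERENCE))
```

Only the pairs that straddle the F-to-D junction are flagged; a sample identical to the reference produces no divergent pairs. Real callers work on mapped coordinates and strand flags rather than block letters, so this is an illustration of the signal only.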

  • Rogers, R. L. et al. doi:10.1093/molbev/msu124

  • differentially to adaptive changes. In D. melanogaster, the X chromosome contains greater repetitive content (Mackay et al. 2012), displays different gene density (Adams et al. 2000), has potentially smaller population sizes (Wright 1931; Andolfatto 2001), lower levels of background selection (Charlesworth 2012), and an excess of genes involved in female-specific expression (Ranz et al. 2003) in comparison to the autosomes. Furthermore, the X chromosome is hemizygous in males, exposing recessive mutations to the full effects of selection more often than comparable loci on the autosomes (Charlesworth et al. 1987). Hence, the incidence of duplications on the X and the types of genes affected may differ from the autosomes, and thereby produce different impacts on phenotypic evolution.

    Many copy number variants are thought to be nonneutral (Hu and Worton 1992; Emerson et al. 2008; Cardoso-Moreira et al. 2011), especially when they capture partial gene sequences or create chimeric gene structures (Rogers and Hartl 2012) or result in recruitment of noncoding sequences (Lee and Reinhardt 2012). Such modifications are likely to change gene regulatory profiles (Rogers and Hartl 2012), increasing the likelihood of nonneutral phenotypes. Surveys in D. melanogaster have identified large numbers of such variants (Emerson et al. 2008; Rogers et al. 2009; Cardoso-Moreira et al. 2011, 2012; Lee and Reinhardt 2012). Establishing profiles of partial gene duplication, whole gene duplication, chimera formation, and recruitment of noncoding sequence is essential to a complete understanding of the roles tandem duplicates play in beneficial and detrimental phenotypic changes across species.

    Here, we describe the number, types, and genomic locations of tandem duplications segregating in 20 strains of D. yakuba and in 20 strains of D. simulans and discuss differences across species and across chromosomes, as well as their potential to create novel gene constructs.

    Results

    We have sequenced the complete genomes of 20 isofemale lines of D. yakuba and 20 isofemale lines of D. simulans, each inbred in the lab for 9–12 generations to produce effectively haploid samples, as well as the reference genome stocks of each species (as a control for genome quality and false positives) (Drosophila Twelve Genomes Consortium 2007; Hu et al. 2013). Genomes are sequenced to high coverage of 50–150× for a total of 42 complete genomes (supplementary tables S1–S5, Supplementary Material online, see Materials and Methods). We have used mapping orientation of paired-end reads to identify recently derived, segregating duplications in these samples

  • divergent read pairs and confirmed by split read mapping in PacBio long molecule sequencing, whereas the deletion is supported by 20 long-spanning read pairs in line NY66-2 and gapped alignment in PacBio reads (fig. 6). The same duplication and deletion are independently confirmed with PacBio sequences in line CY21B3. The duplication spanning jgw is found at a frequency of 5/20 strains, whereas the deletion shown is observed only in CY21B3 and NY66-2, suggesting that the deletion is a secondary modification. A second independent duplication spans jgw in 4/20 strains and is confirmed in PacBio data, indicating that the region has been modified multiple times in different strains.

    Deletions are exceptionally common in Drosophila (Petrov et al. 1996), and several genetic mechanisms might offer means of excision in a short time frame after duplication. The large loop mismatch repair system can facilitate deletions of duplicated sequence to modify duplicated sequence as long as variants are polymorphic. The presence of unpaired duplicated DNA during meiosis or mitosis would commonly invoke the action of the large loop mismatch repair system, which if resolved imprecisely, could result in the construct observed (fig. 7). Deletions lying within a duplication have a median size of 3.6 kb in D. yakuba and 1.8 kb in D. simulans. Such large deletions are well outside the norm for genome wide large deletions in mutation accumulation lines of D. melanogaster, which show an average 409 bp and maximum of 2.6 kb (Schrider et al. 2013). Deletions of this size however are consistent with the size of excised fragments in large loop mismatch repair of several kilobases (Kearney et al. 2001). Deletion during nonhomologous end joining or homology-mediated replication slippage might produce deletions as well though it is unclear whether mutation rates are naturally high enough to operate in short time frames. Thus, we would expect modification of duplicated alleles to be extremely common, especially in deletion-biased Drosophila.

    Differences in Gene Duplications across Species

    Duplicated coding sequences can diverge to produce novel peptides, novel regulatory profiles, or specialized subfunctions (Conant and Wolfe 2008). In order to determine the extent to which genes are likely to be duplicated and whether particular categories of gene duplications are more likely to be favored, we identified coding sequences captured by tandem duplications. We find large numbers of segregating gene duplications in both D. yakuba and D. simulans including hundreds of

    FIG. 6. Read mapping patterns indicative of a modified duplication surrounding jingwei in Drosophila yakuba line NY66-2. Duplications are indicated with divergently oriented paired-end reads (blue) as well as with split read mapping of long molecule sequencing (purple). Deletions in one copy are suggested by gapped read mapping of long molecule reads (red) as well as multiple long-spanning read pairs at the tail of mapping distances in paired-end read sequencing (green) just upstream from jgw. Up to 20% of duplicates observed have long-spanning read pairs indicative of putative deletions in one or more alleles in the population.

    FIG. 7. Secondary deletion via large loop mismatch repair. A tandem duplication forms via ectopic recombination or replication slippage. At some point prior to fixation in the population the duplication pairs with an unduplicated chromatid in meiosis or mitosis, invoking the action of the large loop mismatch repair system. Imprecise excision results in a modified duplicate with partially deleted sequence. Large loop mismatch repair requires that duplications are polymorphic, and would therefore produce secondary modification over short timescales, resulting in rapid modification of tandem duplicates.

    Rogers et al. doi:10.1093/molbev/msu124 MBE

    Downloaded from http://mbe.oxfordjournals.org/ at University of California, Irvine on October 15, 2014

    Rogers, R. L. et al. doi:10.1093/molbev/msu124

  • yakuba and 76 in D. simulans where both breakpoints fall within nonoverlapping coding sequences. Some 11 of the 130 duplications in D. yakuba and 30 of 76 in D. simulans have both breakpoints in gene sequences that face one another and as such are not expected to create new open reading frames, as the constructs will lack promoters. Another 40 of the 130 duplications in D. yakuba and 8 of 76 in D. simulans have both breakpoints in gene sequences, and will have promoters that can potentially transcribe sequences from both strands of DNA (fig. 9, supplementary tables S20 and S21, Supplementary Material online). Only 78 chimeric coding sequences in D. yakuba (supplementary tables S22 and S23, Supplementary Material online) and 38 chimeric genes in D. simulans (supplementary table S24, Supplementary Material online) have parental genes in parallel orientation. Among the parental genes of these chimeras, cytochromes and genes involved in drug metabolism are overrepresented in D. yakuba. Other functional categories which are present but not overrepresented include endopeptidases, signaling glycopeptides, and sensory signal transduction peptides. Among parental genes in D. simulans, cytochromes and insecticide metabolism genes, sensory perception genes, and endopeptidase genes are overrepresented. Additional categories present include signal peptides, endocytosis genes, and oogenesis genes. Several such constructs are found at moderate frequencies above 10/20, suggesting that they are at least not detrimental. However, two chimeras in D. yakuba are found at high frequency. One formed from a combination of GE12441 and GE12442 is at a frequency of 16/20, and one formed from GE12353 and GE12354 is at a frequency of 19/20. In D. simulans one chimera, formed from CG11598 and CG11608, is at a frequency of 20/20. All of these genes are lipases or endopeptidases. These high-frequency variants are strong candidates for selective sweeps.

    Compared with the number of tandem duplications that capture coding sequences, the number of duplications which form chimeric genes indicates that chimeric constructs derived from parental genes in parallel orientation form as a result of 10.4% of tandem duplications that capture genes in D. yakuba and 9.5% of tandem duplications that capture coding sequences in D. simulans. These numbers are in general agreement with rates of chimeric gene formation estimated from a within-genome study of D. melanogaster of 16.0% compared with the rate of formation of duplicate genes (Rogers et al. 2009).

    Association with Transposable Elements and Direct Repeats

    Repetitive sequences are known to facilitate ectopic recombination events that commonly yield tandem duplications (Lim and Simmons 1994). In D. yakuba, 179 (12.7%) tandem duplications fall within 1 kb of a transposable element (TE) in at least one sample strain that has a duplication and 52 (3.7%) fall within 100 bp of a TE (supplementary table S25, Supplementary Material online). In D. simulans, 122 (12.5%) lie within 1 kb of a TE and 53 (5.4%) fall within 100 bp of a TE (supplementary table S25, Supplementary Material online). Additionally, 125 (8.8%) of duplications in D. yakuba have 100 bp or more of direct repeated sequence in the 500 bp up and downstream of duplication boundaries and 237 (16.7%) have 30 bp or more in the reference sequence as identified in a BLASTn comparison of regions flanking divergently oriented read spans at an E value ≤ 10⁻⁵ (supplementary table S25, Supplementary Material online). In D. simulans, 56 (5.7%) have 100 bp or more of direct repeated sequence in the 500 bp up and downstream of duplication boundaries in the reference and 150 (14.4%) have 30 bp or more of repeated sequence (supplementary table S25, Supplementary Material online). In total 371 duplications in D. yakuba and 243 duplications in D. simulans either lie within 1 kb of a TE in at least one strain or are flanked by 30 bp or more of direct repeated sequence. Hence, a maximum of 26.2% of duplications identified in D. yakuba and 24.9% of duplications identified in D. simulans may have been facilitated by ectopic recombination between large repeats, consistent with previous estimates from single genome studies of 30% in D. melanogaster but somewhat higher than those in D. yakuba of 12% (Zhou et al. 2008).

    In D. yakuba, 14.4% of duplications with 100 bp or more of repetitive sequence and 21.1% of duplications with 30 bp or more are located on the X. In contrast, 46.4% of duplications

    FIG. 8. Abnormal gene structures. Duplicated sequence is highlighted with bold colors and is framed by the dashed box. (A) The partial duplication of a coding sequence (blue) results in the recruitment of previously upstream noncoding sequence (dashed lines) to create a novel open reading frame (blue and turquoise). (B) Tandem duplication where both boundaries fall within coding sequences results in a chimeric gene.

    FIG. 9. Dual promoter genes. Duplicated sequence is highlighted with bold colors and is framed by the dashed box. Tandem duplication where both boundaries fall within coding sequences results in a chimeric gene which contains two promoters, one which facilitates transcription in one direction, the other facilitating transcription from the opposite strand. The chimera is capable of making partial antisense transcripts.


    Rogers, R. L. et al. doi:10.1093/molbev/msu124

                                D. yakuba   D. simulans
    Chimeric gene structures    78          38
    Recruited ncDNA             143         96
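The excerpt above reports each TE and direct-repeat category as a count with a percentage but does not restate the total number of duplications per species. A small consistency check, sketched here with only the numbers quoted in the excerpt, back-calculates the total implied by each count/percentage pair. The dictionary labels and the helper name are illustrative, and the implied totals are rough because the published percentages are rounded.

```python
# (count, percent) pairs exactly as quoted in the excerpt above.
YAKUBA = {
    "within 1 kb of a TE": (179, 12.7),
    "within 100 bp of a TE": (52, 3.7),
    ">=100 bp direct repeat": (125, 8.8),
    ">=30 bp direct repeat": (237, 16.7),
    "TE or >=30 bp repeat": (371, 26.2),
}
SIMULANS = {
    "within 1 kb of a TE": (122, 12.5),
    "within 100 bp of a TE": (53, 5.4),
    ">=100 bp direct repeat": (56, 5.7),
    ">=30 bp direct repeat": (150, 14.4),
    "TE or >=30 bp repeat": (243, 24.9),
}


def implied_total(count, percent):
    """Total number of duplications implied by one count/percentage pair."""
    return count / (percent / 100.0)


for species, table in (("D. yakuba", YAKUBA), ("D. simulans", SIMULANS)):
    for label, (count, percent) in table.items():
        print(f"{species:12s} {label:24s} {implied_total(count, percent):8.1f}")
```

If one entry implies a noticeably different total from the others for the same species, that can reflect rounding, a different denominator, or a transcription slip; the excerpt alone does not say which.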

  • [Figure panels: A. SFS for duplications in D. yakuba. B. SFS for duplications in D. simulans. C. SFS for X-linked mutations in D. simulans. Vertical axes run from 0.0 to 0.8.]

    Figure 1: SFS for tandem duplications in D. yakuba and D. simulans, corrected for ascertainment bias. A. Site frequency spectra on the autosomes (black) and on the X (grey) in D. yakuba. B. SFS on the autosomes (black) and on the X (grey) in D. simulans. C. SFS for X-linked intronic SNPs (black) and duplicates (white). The excess of high frequency variants on the X in D. simulans suggests widespread selection for tandem duplicates on the D. simulans X.

    Figure 5: A) Gene ontology classes overrepresented by species among singly duplicated genes or among multiply duplicated genes. B) Number of genes duplicated by species. Most variants are species specific, with small numbers of parallel duplication of orthologs across species.

    Rogers, R. L. et al. Submitted (1)
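The site frequency spectra in Figure 1 above summarize how many variants are carried by 1, 2, ..., n of the sampled strains. As a minimal sketch, assuming each duplication has already been genotyped as a count of carrier strains out of 20, the spectrum is just a normalized tally. The input counts below are made up for illustration, and the ascertainment-bias correction mentioned in the caption is not attempted here.

```python
from collections import Counter


def site_frequency_spectrum(carrier_counts, n_strains):
    """Proportion of variants observed in exactly k of n_strains strains.

    carrier_counts: one integer per variant, the number of strains carrying
    it. Counts of 0 (absent from the sample) are ignored.
    """
    tally = Counter(c for c in carrier_counts if 1 <= c <= n_strains)
    total = sum(tally.values())
    return {k: tally[k] / total for k in sorted(tally)}


# Hypothetical carrier counts for six duplications in 20 strains.
toy_counts = [1, 1, 1, 2, 5, 20]
print(site_frequency_spectrum(toy_counts, n_strains=20))
```

Splitting the input by chromosome (X versus autosomes) before calling the function gives the kind of side-by-side comparison shown in panels A and B.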

  • [Figure: tree of D. yakuba, D. simulans, and D. melanogaster, labeled 12 MY, annotated with rate estimates. Values recovered from the slide, in the order extracted: whole gene 1.17 × 10⁻⁹, 6.03 × 10⁻¹⁰; chimera and recruitment 3.46 × 10⁻¹⁰, 3.70 × 10⁻¹⁰, 2.42 × 10⁻¹⁰, 8.52 × 10⁻¹¹; Ne 1.21 × 10⁶, 5.93 × 10⁵.]

    Figure 6: Genomewide population mutation rates for all duplications (θ), population sizes (Ne), and per gene mutation rates (μ) for gene structures produced by whole gene duplication, recruitment of non-coding sequence, and chimeric genes by species. Low mutation rates and mutation limited evolution lead to low levels of parallel recruitment of tandem duplications.

    Rogers, R. L. et al. Submitted (1)

    Schrider, D. R. et al. doi:10.1534/genetics.115.174912

  • Table 1: Activated genes

    Chimeras
    Tissue           Upregulated   Total
    Female Carcass   5             76
    Female Ovary     11            76
    Male Carcass     10            76
    Male Testes      7             76
    All              24            76

    Whole Gene
    Tissue           Upregulated   Total
    Female Carcass   3             66
    Female Ovary     2             66
    Male Carcass     1             66
    Male Testes      0             66
    All              5             66

    Whole Gene and 100 bp Intergenic
    Tissue           Upregulated   Total
    Female Carcass   3             58
    Female Ovary     2             58
    Male Carcass     1             58
    Male Testes      0             58
    All              5             58

    Rogers, R. L. et al. doi:10.1534/g3.114.013532

    Rogers, R. L. et al. Submitted (2)

  • [Figure labels: GE18451, GE18452, GE18453; GE18452; Chimera]

    Figure 2: Chimeric gene structures result in novel expression patterns. A tandem duplication that does not respect gene boundaries unites the 5′ end of GE18453 with the 3′ end of GE18451 to produce a chimeric gene on chromosome 2L. Plot shows quantile normalized coverage in RNA seq data for sample (red) and reference (grey) with HMM output (blue) on chromosome 2L for female carcass. The chimera displays a change in transcript levels, while transcript levels for parental gene sequence are not altered. Sites with upregulated or downregulated sequence as defined by HMM output are shown in blue. The region spanned by the tandem duplication is shaded in grey. The region spanned by the chimeric gene shows high-level upregulation, while the whole gene duplication of GE18452 does not display a significant change in mRNA levels.

    Rogers, R. L. et al. Submitted (2)

  • Hu, X., & Worton, R. G. doi:10.1002/humu.1380010103

    GENE DUPLICATION IN HUMAN DISEASE

    TABLE 1. A Summary on Reported Cases of Partial Gene Duplication Associated With Human Diseases

    Columns: Genes; Number of independent duplications; Exon(s) duplicated (a); Translational reading frame (b); Disorders (c); References

    HPRT 1 LDL receptor 3

    Dystrophin 10

    1 13 1 2

    α-Galactosidase A 1 Factor VIII 1 LPL 1

    2.3 2-8 9-12

    13-15 8.9 3-11

    38-43 50-52 3, 4

    45-51 20-41 3,4 2-7

    22-27 ND ND 13-42 5-11

    17

    13 2-6

    6 (partial)

    In-frame In-frame Shift Shift Shift Shift Shift Shift In-frame In-frame Shift In-frame In-frame ND ND ND In-frame Shift Shift ND

    Lesch-Nyhan syndrome Familial hypercholesterolemia

    DMD DMD DMD DMD DMD DMD Intermediate Intermediate BMD BMD ND DMD/BMD BMD DMD DMD Fabry disease

    Yang et al., 1984, 1988 Lehrman et al., 1987a Top et al., 1990 Lelli et al., 1991 Hu et al., 1988, 1990

    Greenberg et al., 1988 Den Dunnen et al., 1989 Angelini et al., 1990 Roberts et al., 1991

    Bernstein et al., 1989 In-frame Hemophilia A Casula et al., 1990

    ND Lipoprotein lipase deficiency Devlin et al., 1990 Type II collagen 1 b bp In-frame Spondyloepiphyseal dysplasia Tiller et al., 1990 C1 inhibitor 1 4 In-frame Hereditary angioedema Stoppa-Lyonnet et al., 1990 β-Galactosidase 1 165 bp In-frame GM1-gangliosidosis Yoshida et al., 1991

    (within exon 48)

    (a) ND, not defined.
    (b) In the majority of cases, the reading frame status of the mRNA was not actually determined but was predicted based on the assumption that the exons contained in the duplicated segment were spliced correctly to the exons flanking the duplicated segment. ND, not determined.
    (c) DMD, Duchenne muscular dystrophy. BMD, Becker muscular dystrophy. Intermediate, intermediate phenotype of the muscular dystrophy.

    the original copy in a head-to-tail direct orientation. This is different from the process of transposition in which a copy of a gene is inserted at a new site in the chromosome and this sometimes involves retrotransposition of reverse-transcribed RNA (Finnegan, 1989; Dombroski et al., 1991). Unequal crossing over, either between homologous chromosomes or between sister chromatids of the same chromosome, has generally been accepted as the mechanism by which a tandem gene duplication arises, as this mechanism can most easily explain the tandem arrangement of a duplicated gene or genes (Fig. 1). This mechanism also predicts the formation of a deletion as a product of reciprocal exchange. Molecular and genetic analyses of duplications and deletions in Drosophila and in fungi provided the first evidence for these chromatid exchange events to occur (Bridges, 1936; Tartof, 1974; Petes, 1980; Szostak and Wu, 1980). The formation of the hemoglobin fusion genes offered an example of unequal crossing over in the human genome (Embury et al., 1980; Goossens et al., 1980; Vogel and Motulsky, 1986). Another mechanism that has been postulated to explain the tandem duplication of a limited number of nucleotides (up to a few hundred base pairs) is the intrastrand slipped-mispairing model (Roth et al., 1985). This model suggests that mispairing between slipped short homologies (usually a few nucleotides) upon breakage of single-stranded DNA followed by repair synthesis and an additional round of replication will generate a tandem duplication. A similar model had been proposed to explain the formation of deletions of the same size range (Efstratiadis et al., 1980).

    One approach that has been used extensively in the study of mechanisms of gene duplication and other genomic rearrangements in mammalian cells has been the detailed analysis of the nucleotide sequences surrounding the recombination joints (or junctions). The assumption has been that identification of sequences involved in these recombination events might allow one to deduce an enzymatic mechanism. Thus far, two general mechanisms leading to duplications and deletions have been proposed: homologous and nonhomologous recombination. In the former, sequence homology between the two parental DNA strands in the re-

  • Schmidt et al. doi:10.1371/journal.pgen.1000998

  • uncommon both in this data set and in previous studies (Kaminker et al. 2002; Lipatov et al. 2005).

    Differences in the strength of selection against closely related genes can also be illustrated by patterns of TE insertions. The paralogs derailed (fig. 4F) and derailed-2 (fig. 4E) showed very different patterns of TE insertions, though the same pattern was seen in both resources. derailed-2, located on chromosome 2R, is in a region of moderate recombination, 1.99 cM/Mb, whereas derailed is located towards the distal end of 2L in a region of low recombination, 0.44 cM/Mb. Given the context of the recombination rates, the expectation would be that derailed would have a higher TE load since deleterious alleles are removed more efficiently in regions of high recombination (Hudson and Kaplan 1995), but the opposite pattern was observed.

    While both genes play a role in Wnt5 signaling, mutations in derailed can cause major phenotypic changes in Drosophila nervous system development resulting in the loss of normal function (Yoshikawa et al. 2003). derailed-2, when mutated, produces only minor differences in neuron positioning

    [Figure panels: A klarsicht, B Notch, C Delta, D Cyp6a20, E derailed-2, F derailed. Tracks: DGRP and DSPR. Insertion frequencies are printed above each insertion; one insertion is marked "Insertion in an Exon"; asterisks mark the same insertion in both resources.]

    FIG. 4. Transposable element insertions in genes. The frequency above each insertion is the number of lines in which the element is present over the number of lines in which the element is validated as either present or absent. Gene images are from the UCSC genome browser (http://genome.ucsc.edu/, last accessed June 31, 2012).

    Cridland et al. doi:10.1093/molbev/mst129 MBE

    Downloaded from http://mbe.oxfordjournals.org/ at University of California, Irvine on September 17, 2013

    Cridland, J. M., et al. doi:10.1093/molbev/mst129

    typically unique to a single line though there were 3 insertions in more than one line in the DSPR and 14 insertions in more than one line in the DGRP.

    Some genes accumulated many TE insertions, such as klarsicht (fig. 4A, gene images from the UCSC genome browser, Meyer et al. 2013) which had 29 total insertions. This gene also had a hotspot of insertions with seven independent insertions, each at low frequency, located within 3.6 kb of each other. Only one of these insertions was present in any given line and these could be functionally equivalent though independently arising mutations. Other genes had only a few insertions such as Notch (fig. 4B) and Delta (fig. 4C). The insertions in these genes were also at low frequency, and the few present were located in intronic regions.

    While most insertions exist in only one panel some were found in both. Cyp6a20 (fig. 4D) contains a high frequency non-reference TE insertion present in both data sets, in 4 lines out of 121 lines where we were able to make a presence or absence call in the DGRP (hereafter shown as 4/121) and 3/14 lines in the DSPR. Both Cyp6a20 and klarsicht are examples of genes with TE insertions in exons, which was

    [Figure panels: DSPR and DGRP. Horizontal axis: derived allele count (1, 2, 3, 4, 5, 6, 7+). Vertical axis: 0.0 to 1.0. Series: Observed TEs, Observed SNPs, Expected.]

    FIG. 3. Derived allele count spectra for the DSPR lines and the DGRP25 lines where a positive presence or absence call was made for each insertion in each line, 6,613 insertions in the DSPR and 3,274 in the DGRP25. Count spectra for SNPs is from SNPs in introns ≤86 bp. χ² tests between observed and expected distributions result in P ≈ 0 for comparisons between TEs and the neutral model as well as between TEs and SNPs for both data sets.

    Table 5. TE Density for 15 Individual Families of Elements.

    Mean Density (TE/Mb)

    Element    Resource   X High   X Low   Auto High   Auto Low
    roo        DGRP25     0.39     0.37    0.32        0.33
               DSPR       0.80     0.88    0.65        0.68
    297        DGRP25     0.05     0.00    0.02        0.01
               DSPR       0.05     0.00    0.04        0.00
    412        DGRP25     0.03     0.01    0.03        0.07
               DSPR       0.06     0.08    0.09        0.12
    F          DGRP25     0.02     0.06    0.04        0.06
               DSPR       0.07     0.08    0.12        0.13
    17.6       DGRP25     0.01     0.00    0.01        0.02
               DSPR       0.02     0.00    0.02        0.06
    Bari1      DGRP25     0.01     0.00    0.04        0.01
               DSPR       0.02     0.02    0.06        0.02
    copia      DGRP25     0.01     0.00    0.02        0.03
               DSPR       0.09     0.02    0.14        0.19
    H          DGRP25     0.01     0.00    0.01        0.00
               DSPR       0.00     0.00    0.01        0.00
    hopper     DGRP25     0.10     0.01    0.00        0.00
               DSPR       0.10     0.04    0.02        0.02
    INE-1      DGRP25     0.42     1.88    0.03        0.48
               DSPR       0.45     2.03    0.04        0.55
    jockey     DGRP25     0.15     0.12    0.19        0.13
               DSPR       0.35     0.34    0.40        0.33
    mdg1       DGRP25     0.03     0.02    0.03        0.10
               DSPR       0.08     0.04    0.12        0.18
    pogo       DGRP25     0.06     0.04    0.07        0.05
               DSPR       0.13     0.21    0.15        0.13
    springer   DGRP25     0.00     0.01    0.01        0.03
               DSPR       0.02     0.00    0.01        0.05

    Table 4. TE Density in the X and Autosomes.

    TE/Mb

    Region                    DGRP25   DSPR   Both
    X, all                    3.82     6.38   4.83
    Autosomes, all            3.51     5.93   4.47
    X, high recombination     3.71     6.00   4.61
    X, low recombination      3.93     6.76   5.05
    2L, high recombination    2.27     4.73   3.24
    2L, low recombination     3.23     5.89   4.28
    2R, high recombination    2.82     4.89   3.64
    2R, low recombination     6.17     8.63   7.14
    3L, high recombination    2.96     5.23   3.86
    3L, low recombination     4.74     7.26   5.73
    3R, high recombination    2.81     4.68   3.55
    3R, low recombination     3.72     6.70   4.90

    Transposable Elements in Two Drosophila QTL Mapping Resources. doi:10.1093/molbev/mst129 MBE


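The Fig. 3 caption above compares observed derived-allele count spectra (binned as 1 to 6 and 7+) with an expected spectrum using χ² tests. A minimal sketch of that comparison follows, assuming the expectation is the standard neutral one in which the proportion of variants at count i is proportional to 1/i. The slide does not state how the paper's expectation was computed, so that model, the sample size of 15 lines, and the observed counts below are all illustrative assumptions.

```python
def neutral_expected_proportions(n_lines, max_bin=7):
    """Expected proportions for counts 1..max_bin-1 plus a pooled last bin.

    Assumes the standard neutral model: P(count = i) is proportional to 1/i
    for i = 1 .. n_lines - 1 (segregating variants only).
    """
    weights = [1.0 / i for i in range(1, n_lines)]
    total = sum(weights)
    props = [w / total for w in weights[: max_bin - 1]]
    props.append(1.0 - sum(props))  # pooled "7+" class
    return props


def chi_square_statistic(observed_counts, expected_props):
    """Pearson chi-square statistic for observed counts vs. expected proportions."""
    n = sum(observed_counts)
    return sum(
        (obs - n * p) ** 2 / (n * p)
        for obs, p in zip(observed_counts, expected_props)
    )


expected = neutral_expected_proportions(n_lines=15)  # hypothetical sample size
observed = [900, 50, 20, 10, 8, 6, 6]  # hypothetical TE counts per bin
print([round(p, 3) for p in expected])
print(chi_square_statistic(observed, expected))
```

With seven bins the statistic would be compared against a χ² distribution with six degrees of freedom. A spectrum heavily skewed toward singletons, as in the toy counts, gives a very large statistic, which is the qualitative pattern the caption reports for TEs.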

  • Figure 1 Normalized rank expression of transposable elements. Observed numbers of TE-containing lines per rank bin vs. 10,000 permutations. Red dots indicate the observed number of TE-containing lines; box plots show permutations. Box plot tails indicate the 2.5% and the 97.5% confidence intervals. Open circles above and below the box plots indicate the 0.5% and the 99.5% confidence interval. (A) TE is in an exon, (B) TE is in a 1st intron, (C) TE is in an intron ≤400 bp in length, (D) TE is not in 1st intron, (E) TE ≤500 bp from TSS, (F) TE ≤500 bp from TES, (G) TE ≤200 bp from a donor site, and (H) TE ≤200 bp from an acceptor site.

    Transposons and Expression Variation

    Cridland, J. M., et al. doi:10.1534/genetics.114.170837
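The Figure 1 caption above describes comparing the observed number of TE-containing lines in each expression-rank bin with a permutation distribution. One way such a null could be built is sketched below, assuming each gene has expression values for a set of lines and exactly one line carries the TE; the permutation reassigns the TE to a random line. The data layout, bin count, and function names are assumptions for illustration and may differ from the paper's procedure.

```python
import random


def rank_bin(values, index, n_bins):
    """Rank bin (0 = lowest expression) of line `index` among `values`."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return order.index(index) * n_bins // len(values)


def observed_and_null(genes, n_bins=4, n_perm=1000, seed=1):
    """genes: list of (expression_values, te_line_index) tuples."""
    rng = random.Random(seed)
    observed = [0] * n_bins
    for values, te_index in genes:
        observed[rank_bin(values, te_index, n_bins)] += 1
    null = []
    for _ in range(n_perm):
        counts = [0] * n_bins
        for values, _ in genes:
            # Permutation step: pretend a random line carries the TE.
            counts[rank_bin(values, rng.randrange(len(values)), n_bins)] += 1
        null.append(counts)
    return observed, null


def null_interval(null, bin_index, lower=0.025, upper=0.975):
    """Empirical lower/upper percentiles of the permuted counts for one bin."""
    column = sorted(counts[bin_index] for counts in null)
    last = len(column) - 1
    return column[int(lower * last)], column[int(upper * last)]


# Toy data: in all 8 genes the TE-carrying line has the lowest expression.
toy_genes = [([5.0, 6.0, 7.0, 8.0], 0)] * 8
obs, null = observed_and_null(toy_genes)
print(obs, null_interval(null, 0))
```

An observed count in the lowest bin that falls above the upper percentile of the permuted counts corresponds to a red dot sitting above the box plot tail in the figure.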

  • most extreme example of this was the gene nessy, where theaverage expression level of the line with the TE was 40standard deviations lower than TE free lines. This is likelyan effectively null mutation at this gene segregating in na-ture. Comparing the average standard deviation in meanexpression per DGRP line for TE-associated transcripts totranscripts with no TEs in or within 10 kb indicates thatTE-associated transcripts have a larger mean standard de-viation, 0.33 vs. 0.28, (t-test; P = 1.0e-22). Since the DGRPline with the TE insertion is not used in calculating thestandard deviation this suggests that transcripts with TEsmay be those that are more tolerant of variation in expres-sion level.

    Transposable elements vs. SNPs and other complex variants

    Massouras et al. (2012) have reanalyzed the same expression data used here (Ayroles et al. 2009) and attempted to associate complex variants (primarily insertion/deletion events and small duplications rather than TEs) with variation in gene expression levels. They identified 17,501 cis-eQTL associated with 2033 genes (at a 10% false discovery rate), with 499 of these genes having eQTL with similar effect in each sex. However, because of the way statistical testing was carried out in this earlier work, it is difficult to estimate how many independent cis-eQTL were discovered for any given gene. We compared our set of genes where the TE-containing transcript was in the lowest 5% of expression measures to the Massouras et al. (2012) set of genes identified as having a cis-eQTL in both sexes to identify genes common to both sets.

    We find that 145 of the 499 genes with previously reported cis-eQTL common to both sexes have TEs within 10 kb of the gene and, for 22 of these genes, a TE is present in the DGRP line with the lowest level of expression. These numbers suggest that 4.4% (22 of 499) of genes they identify as having a non-TE cis-eQTL also have a TE that is categorized as a causative mutation. While it is unclear how the TE and non-TE mutations are contributing to expression variation, and how they may interact, this observation highlights the importance of incorporating TE information into work on expression and phenotypic analyses. Given that our analysis has focused on insertions of large effect, this is a conservative estimate of the number of TEs that may contribute to expression differences. There are also important differences between these studies. First, we only looked at sex-averaged expression, which means that we are likely excluding a number of insertions that affect expression in a sex-specific manner. Second, this study specifically focused on rare variants, while the previous analysis of this dataset was restricted to variants found in at least three lines.

    Effect of TEs as a class of variation

    Several previous studies that found associations between TEs and variation have done so by examining TEs as a class of variation, where all TEs are expected to contribute to phenotypic variation in the same way (Mackay and Langley 1990; Long et al. 2000; Macdonald et al. 2005; Gruber et al. 2007). We examined transcripts where four or more DGRP lines had a TE insertion in a particular location category, though each TE insertion was only present in a single DGRP line. We find 96 such transcripts having four or more DGRP lines with a TE in a given location category. For each such transcript we performed a t-test for a difference in gene expression between TE-harboring DGRP lines vs. TE-free DGRP lines. We expect that lines containing TEs will have lower expression levels than lines that do not contain TEs. We then plotted the cumulative distribution of these P-values against their expectation, based on a single permutation, in Figure 2 (on a log-log scale). If TEs impact gene expression, we would expect to see more t-tests with significant P-values in the observed data when compared to the permuted data.
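    The per-transcript comparison above can be sketched as follows. To stay within the standard library, this sketch computes Welch's t statistic and obtains a one-sided P-value by shuffling line labels, which is a swapped-in technique rather than the parametric t-test the paper reports. Function names and data are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for mean(a) - mean(b)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

def permutation_p_lower(te, te_free, n_perm=2000, seed=1):
    """One-sided P-value for lower expression in TE-carrying lines.

    The null distribution of t is built by shuffling line labels.
    """
    rng = random.Random(seed)
    observed = welch_t(te, te_free)
    pooled = list(te) + list(te_free)
    k = len(te)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if welch_t(pooled[:k], pooled[k:]) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: four TE-carrying lines and twelve TE-free lines.
te_lines = [7.9, 8.1, 8.0, 7.7]
free_lines = [9.0, 9.2, 8.8, 9.1, 9.3, 8.9, 9.0, 9.4, 8.7, 9.1, 9.2, 9.0]
p = permutation_p_lower(te_lines, free_lines)
```

    Repeating this for every qualifying transcript gives the set of observed P-values that would be compared against a permuted set.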

    Table 3 Mean z-scores for transcripts with TEs

    Category                          Mean z-score   N
    Within exon                       −3.44          249
    Introns ≤400 bp                   −1.03          72
    Within 200 bp of acceptor site    −0.90          64
    Within 200 bp of donor site       −0.67          64
    Within first intron               −0.37          545
    Not within first intron           −0.11          852
    ≤500 bp of TSS                    −0.43          186
    501 bp to 2 kb of TSS             −0.01          418
    >2 kb of TSS                      −0.05          2121
    ≤500 bp of TES                    −0.52          213
    501 bp to 2 kb of TES             −0.04          347
    >2 kb of TES                      −0.02          1976

    Mean z-scores are calculated from the transcript/TE pairs for all transcripts with an insertion in each location category.

    Figure 2 Transposable elements as a class of variation. Probability–probability plot of observed and expected P-values from t-tests of all cases where four or more lines show an independent TE insertion in the same location category for the same transcript.

    Cridland, J. M., et al. doi:10.1534/genetics.114.170837

  • Summary, part 1

    Structural variants are typically rare

    Duplications have non-additive effects on gene expression

    Low chance of precise convergence at molecular level

    TEs are strong candidates for RALE in flies

    All of these variants are poorly tagged in current-generation association studies

  • from the analyses described above, and consideration of an expanded reference group, described below.

    Bipolar disorder (BD). Bipolar disorder (BD; manic depressive illness [26]) refers to an episodic recurrent pathological disturbance in mood (affect) ranging from extreme elation or mania to severe depression and usually accompanied by disturbances in thinking and behaviour: psychotic features (delusions and hallucinations) often occur. Pathogenesis is poorly understood but there is robust evidence for a substantial genetic contribution to risk [27,28]. The estimated sibling recurrence risk (λs) is 7–10 and heritability 80–90% [27,28]. The definition of BD phenotype is based solely on clinical features because, as yet, psychiatry lacks validating diagnostic tests such as those available for many physical illnesses. Indeed, a major goal of molecular genetics approaches to psychiatric illness is an improvement in diagnostic classification that will follow identification of the biological systems that underpin the clinical syndromes. The phenotype definition that we have used includes individuals that have suffered one or more episodes of pathologically elevated mood (see Methods), a criterion that captures the clinical spectrum of bipolar mood variation that shows familial aggregation [29].

    Several genomic regions have been implicated in linkage studies [30] and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with both disorders at DAOA (D-amino acid oxidase activator), DISC1 (disrupted in schizophrenia 1), NRG1 (neuregulin1) and DTNBP1 (dystrobrevin binding protein 1) [31].

    The strongest signal in BD was with rs420259 at chromosome 16p12 (genotypic test P = 6.3 × 10^−8; Table 3) and the best-fitting genetic model was recessive (Supplementary Table 8). Although recognizing that this signal was not additionally supported by the expanded reference group analysis (see below and Supplementary Table 9) and that independent replication is essential, we note that several genes at this locus could have pathological relevance to BD (Fig. 5). These include PALB2 (partner and localizer of BRCA2), which is involved in stability of key nuclear structures including chromatin and the nuclear matrix; NDUFAB1 (NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1), which encodes a subunit of complex I of the mitochondrial respiratory chain; and DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that is known to interact with the gene disrupted in schizophrenia 1 (DISC1) [32], the latter having been implicated in susceptibility to bipolar disorder as well as schizophrenia [33].

    Of the four regions showing association at P < 5 × 10^−7 in the expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10^−7) is KCNC2, which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias

    [Figure: seven genome-wide scan panels (Type 2 diabetes, Coronary artery disease, Crohn's disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, Bipolar disorder); x-axis: chromosome (1–22, X); y-axis: −log10(P), 0–15.]

    Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, −log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, are plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^−5 highlighted in green. All panels are truncated at −log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
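    The trend test named in the caption is commonly computed as a Cochran–Armitage test on a 2 × 3 table of genotype counts. The sketch below is a generic textbook version with additive weights (0, 1, 2); it is not taken from the WTCCC analysis pipeline, and the counts are hypothetical.

```python
import math

def trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases, controls: counts of genotypes carrying 0, 1 and 2 copies
    of an allele. Returns (chi-squared statistic with 1 d.f., P-value).
    """
    n_case, n_ctrl = sum(cases), sum(controls)
    n = n_case + n_ctrl
    cols = [a + b for a, b in zip(cases, controls)]
    # Weighted difference between case and control genotype counts.
    t = sum(w * (a * n_ctrl - b * n_case)
            for w, a, b in zip(weights, cases, controls))
    s1 = sum(w * w * c for w, c in zip(weights, cols))
    s2 = sum(w * c for w, c in zip(weights, cols))
    var = n_case * n_ctrl / n * (n * s1 - s2 * s2)
    chi2 = t * t / var
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-squared, 1 d.f.
    return chi2, p

# Hypothetical genotype counts.
chi2, p = trend_test(cases=(10, 20, 30), controls=(30, 20, 10))
```

    Each plotted point in a scan such as Figure 4 would then be −log10 of a P-value of this kind.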

    ARTICLES NATURE |Vol 447 |7 June 2007

    666Nature 2007 Publishing Group

    Burton, P. R., et al. doi:10.1038/nature05911

  • family studies, and can be expected to vary across environments. Narrow-sense heritability estimates in humans can be inflated if family resemblance is influenced by non-additive genetic effects (dominance and epistasis, or gene–gene interaction), shared familial environments, and by correlations or interactions among genotypes and environment [36,37]. However, heritabilities estimated from pedigree studies in animals agree well with heritability estimated from response to artificial selection, suggesting that estimates from family studies are not necessarily inflated.

    Teasing apart the contributions to heritability of environmental factors shared among relatives will soon be possible because the availability of genome-wide markers now provides empirical estimates of identity-by-descent (IBD) allele sharing between pairs of relatives. For example, full sibs share on average half their genetic complement, but this proportion can vary; in one large study it ranged from 0.37 to 0.62 (ref. 38). By relating phenotypic differences to the observed IBD sharing fraction among sib pairs, marker data were used to generate a heritability estimate of 0.8 for height [38]. This is remarkably consistent with estimates using traditional methods but free of their assumptions, suggesting that for height at least, heritability is not overestimated. Applying such estimation to distantly related or unrelated individuals is now feasible using dense genomic scans [39]; given the number of people with dense genotyping data, heritability estimates could be generated for a wide variety of traits free of potential confounding by unmeasured shared environment.

    Improving estimates of all contributors to heritability will facilitate determination of the proportion of genetic variance that has been explained. Despite imprecision in current estimates, it may still be possible to know that all the heritability has been explained by predicting phenotypes in a new set of individuals from trait-associated markers, and correlating the predicted phenotypes with the actual values. If the markers truly explain all the additive genetic variance, the squared correlation between predicted and actual phenotype will be equal to the heritability [40]. Population-based heritability estimates thus provide a valuable metric for completeness of available genetic risk information, but individualized disease prevention and treatment will ultimately require identifying the variants accounting for risk in a given individual rather than on a population basis.
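    The claim that the squared correlation between predicted and actual phenotype equals the heritability can be checked with a small simulation. This is a sketch under stated assumptions: a purely additive trait, independent markers, known marker effects, and arbitrary parameter values (sample size, marker count, seed) chosen only for illustration.

```python
import random
from statistics import mean, pvariance

def simulate_additive_trait(n=4000, n_markers=20, h2=0.5, seed=7):
    """Simulate additive genetic values and phenotypes with heritability h2."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
    betas = [rng.gauss(0.0, 1.0) for _ in range(n_markers)]
    genetic = []
    for _ in range(n):
        # Dosage (0, 1 or 2 copies) at each marker times its effect.
        g = sum(b * ((rng.random() < f) + (rng.random() < f))
                for f, b in zip(freqs, betas))
        genetic.append(g)
    # Choose the noise variance so that V_G / (V_G + V_E) = h2.
    sd_e = (pvariance(genetic) * (1.0 - h2) / h2) ** 0.5
    phenotypes = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return genetic, phenotypes

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# The genetic value plays the role of the marker-based prediction.
genetic, phenotypes = simulate_additive_trait()
r2 = r_squared(genetic, phenotypes)
```

    With the simulated heritability set to 0.5, r2 should land close to 0.5, up to sampling noise.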

    Rare variants and unexplained heritability

    Much of the speculation about missing heritability from GWAS has focused on the possible contribution of variants of low minor allele frequency (MAF), defined here as roughly 0.5% < MAF < 5%, or of rare variants (MAF < 0.5%). Such variants are not sufficiently frequent to be captured by current GWA genotyping arrays [14,41], nor do they carry sufficiently large effect sizes to be detected by classical linkage analysis in family studies (Fig. 1). Once MAF falls below 0.5%, detection of associations becomes unlikely unless effect sizes are very large, as in monogenic conditions. For modest effect sizes, association testing may require composite tests of overall mutational load, comparing frequencies of mutations of potentially similar functional effect in cases and controls.

    Low frequency variants could have substantial effect sizes (increasing disease risk two- to threefold) without demonstrating clear Mendelian segregation, and could contribute substantially to missing heritability [42]. For example, 20 variants with risk allele frequency of 1% and allelic odds ratio (or probability of an event occurring divided by the probability of it not occurring, compared in people with versus without the risk allele) of three would account for most familial aggregation of type 2 diabetes. There are relatively few examples of such variants contributing to complex traits, possibly owing to insufficiently large sample sizes or insufficiently comprehensive arrays.

    The primary technology for the detection of rare SNPs is sequencing, which may target regions of interest, or may examine the whole genome. Next-generation sequencing technologies, which process millions of sequence reads in parallel, provide monumental increases in speed and volume of generated data free of the cloning biases and arduous sample preparation characteristic of capillary sequencing [43]. Detection of associations with low frequency and rare variants will be facilitated by the comprehensive catalogue of variants with MAF ≥ 1% being generated by the 1,000 Genomes Project (http://www.1000genomes.org/page.php), which will also identify many variants at lower allele frequencies. The pilot effort of that program has already identified more than 11 million new SNPs in initially low-depth coverage of 172 individuals [44].

    Current mechanisms for using sequencing to identify rare variants underlying or co-located with GWA-defined associations include sequencing in genomic regions defined by strong and repeatedly replicated associations with common variants, and sequencing a larger fraction of the genome in people with extreme phenotypes. In the absence of GWA-defined signals, sequencing candidate genes in subjects at the extremes of a quantitative trait (such as lipid levels or the age at onset) can identify other associated variants, both common and rare [45,46]. An important finding from these studies is that much of the information is provided by people at the extremes of trait distributions, who seem to be more likely to carry loss-of-function alleles [47].

    Sample sizes used for the initial identification of DNA sequence variants have generally been modest, and sample size requirements increase essentially linearly with 1/MAF. Much larger samples are needed for the identification of associations with variants than those needed for the detection of the variants themselves. They also scale roughly linearly with 1/MAF given a fixed odds ratio and fixed degree of linkage disequilibrium with genotyped markers. Sample size for association detection also scales approximately quadratically with 1/|OR − 1|, and thus increases sharply as the odds ratio (OR) declines. Sample size is even more strongly affected by small odds ratios than by small MAF, so low frequency and rare variants will need to have higher odds ratios to be detected.
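    The scaling described in this paragraph can be written as a rough rule of thumb, n ∝ (1/MAF) × 1/(OR − 1)^2. The sketch below only compares sample sizes relative to a reference variant; it is not a power calculation, and the reference MAF and odds ratio are arbitrary assumptions.

```python
def relative_sample_size(maf, odds_ratio, ref_maf=0.05, ref_or=1.5):
    """Sample size relative to a reference variant.

    Uses the approximate scaling n ~ 1 / (MAF * (OR - 1)^2) from the text.
    The reference values are hypothetical.
    """
    if not 0 < maf <= 0.5:
        raise ValueError("maf must be in (0, 0.5]")
    if odds_ratio == 1 or ref_or == 1:
        raise ValueError("odds ratio must differ from 1")

    def scale(f, o):
        return 1.0 / (f * (o - 1.0) ** 2)

    return scale(maf, odds_ratio) / scale(ref_maf, ref_or)

# Tenfold lower MAF at the same odds ratio: about ten times the sample.
rarer = relative_sample_size(0.005, 1.5)
# Halving (OR - 1) at the same MAF: about four times the sample.
weaker = relative_sample_size(0.05, 1.25)
```

    The quadratic term is why a drop in odds ratio costs more than a comparable drop in allele frequency.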

    Complicating matters further, numerous rare variants may be detected in a gene or region but they may have disparate effects on phenotype. Common variants have typically been analysed individually [23,48], but with one or two carriers of each rare variant, pooling them using specific criteria becomes attractive [47,49,50]. Pooling variants of similar class increases the effective MAF of the class and reduces the number of tests performed, but raises several other questions (Box 1).
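    Pooling can be illustrated with a simple collapsing scheme: each individual is scored as a carrier if they have a rare allele at any pooled site, and carrier frequencies are compared between cases and controls. This is a generic sketch in the spirit of collapsing tests, not a specific published method; the data are hypothetical, and with counts this small an exact test would be preferable to the chi-squared approximation used here.

```python
import math

def collapse_carriers(genotypes):
    """Return 1 per individual carrying a rare allele at any pooled site.

    genotypes: one list of rare-allele counts (per site) per individual.
    """
    return [int(any(count > 0 for count in person)) for person in genotypes]

def carrier_chi2(case_genotypes, control_genotypes):
    """Pearson chi-squared (1 d.f.) on carrier status, cases vs. controls."""
    a = sum(collapse_carriers(case_genotypes))     # carrier cases
    b = len(case_genotypes) - a                    # non-carrier cases
    c = sum(collapse_carriers(control_genotypes))  # carrier controls
    d = len(control_genotypes) - c                 # non-carrier controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical data: four cases, each with a different singleton variant,
# and four controls carrying none of them.
cases = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
controls = [[0, 0, 0, 0]] * 4
chi2, p = carrier_chi2(cases, controls)
```

    No single site here has more than one carrier, yet the pooled class separates cases from controls.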

    Determining which of the multitude of variants carried by anindividual are responsible for a given phenotype represents a massivetask, especially if the causal alleles are relatively anonymous in termsof known functional consequences. Because only a small proportionwill have obvious functional consequences for the resultant protein,lesser evidence of association may suffice to implicate variants of thissort. The best approaches for combining functional credibility andstatistical support in the evaluation of such variants remain to be

    [Figure: effect size (y-axis: low, modest, intermediate, high; odds ratios 1.1, 1.5, 3.0, 50.0) against allele frequency (x-axis: very rare, rare, low frequency, common; 0.001, 0.005, 0.05). Labelled regions: rare alleles causing Mendelian disease; few examples of high-effect common variants influencing common disease; common variants implicated in common disease by GWA; rare variants of small effect very hard to identify by genetic means; low-frequency variants with intermediate effect.]

    Figure 1 | Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio). Most emphasis and interest lies in identifying associations with characteristics shown within diagonal dotted lines. Adapted from ref. 42.

    NATURE | Vol 461 | 8 October 2009 REVIEWS

    749 © 2009 Macmillan Publishers Limited. All rights reserved

    Manolio, T. A., et al. doi:10.1038/nature08494

  • Li & Leal

    Madsen & Browning

    c-Alpha

    SKAT

  • [Figure: two contour plots, gene-based vs. Risch-like; x-axis: causative mutations on paternal allele (0–10); y-axis: causative mutations on maternal allele (0–10); contour levels 0.05–0.4 (gene-based) and 0.2–1.4 (Risch-like).]

    gene-based: Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258

    vs.

    Risch-like: Risch, N. (1990). AJHG, 46(2), 222–228.

  • Locus   non-risk   risk
    1       A          a
    2       B          b
    3       C          c

    Locus 1: A vs. {a}

    a is an allelic series with variable effect sizes

  • [Figure: relative density (y-axis, 0.0–1.0) against value (x-axis, −4 to 4) for three curves: fitness, $\sigma_S^2$; Gaussian noise, $\sigma_e^2$; effect sizes, mean = $\lambda$.]

    $V_G = 4\mu_d\sigma_S^2$, $H^2 = \dfrac{V_G}{V_G + \sigma_e^2}$

    Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258

    Turelli, M. (1984). Theoretical Population Biology, 25(2), 138–193.

  • $H_1 = \sum_i e_i$, $H_2 = \sum_j e_j$

    $G = \sqrt{H_1 \times H_2}$

    $P = G + N(0, \sigma_e)$

    Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258
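    The gene-based model on this slide can be sketched directly from its three equations: each haplotype's effect is the sum of its causative effect sizes, the genetic value is the geometric mean of the two haplotype effects, and Gaussian noise is added. Function and argument names are hypothetical, and effect sizes are assumed to be non-negative.

```python
import math
import random

def genetic_value(hap1_effects, hap2_effects):
    """G = sqrt(H1 * H2), where each H is a sum of effect sizes."""
    h1 = sum(hap1_effects)
    h2 = sum(hap2_effects)
    return math.sqrt(h1 * h2)

def phenotype(hap1_effects, hap2_effects, sigma_e, rng):
    """P = G + Gaussian noise with standard deviation sigma_e."""
    return genetic_value(hap1_effects, hap2_effects) + rng.gauss(0.0, sigma_e)

# A causative mutation on only one haplotype contributes nothing to G:
# the gene behaves as recessive at the level of the whole gene.
one_hit = genetic_value([0.3], [])
# Mutations on both haplotypes fail to complement.
two_hits = genetic_value([0.2, 0.2], [0.1])

rng = random.Random(42)
p = phenotype([0.2, 0.2], [0.1], sigma_e=0.075, rng=rng)
```

    The value of sigma_e here is illustrative only.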

    VG 22s

    Slatkin, M. (1987). Genetical Research, 50(1), 53–62.

  • Kaul, R., et al. (1994). Journal of Inherited Metabolic Disease, 17(3), 356–358.

  • N = 20,000, 8N gens

    N = 10^6, 500 gens
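    One reading of this slide is a constant-size phase (N = 20,000 for 8N generations) followed by growth to N = 10^6 over 500 generations. The sketch below computes the growth-phase trajectory under the assumption that growth is exponential; the slide does not state the growth function, and the function name is hypothetical.

```python
import math

def growth_phase_sizes(n0=20_000, n1=1_000_000, growth_gens=500):
    """Population size in each generation of an exponential growth phase."""
    rate = math.log(n1 / n0) / growth_gens  # per-generation growth rate
    return [round(n0 * math.exp(rate * t)) for t in range(growth_gens + 1)]

sizes = growth_phase_sizes()
burn_in_gens = 8 * 20_000  # constant-size phase before growth begins
```

    Under this assumption the per-generation growth rate is ln(50)/500, a little under 1% per generation.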

  • Thornton, K. R. doi:10.1534/genetics.114.165019

    http://github.com/molpopgen/fwdpp http://github.com/molpopgen/foRward

    pos = 0.623, n = 1,s = 0

    pos = 0.113, n = 2, s = 0.001

    pos = 0.004, n = 1, s = -0.2

    I1 I2, n=1 I3, n = 1n=1 I2, n=1

    Ancestral chromosome

    Mutations

    2N Gametes

    N Diploids Diploid 1 Diploid 2
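    The layout on this slide (a list of mutations, gametes that point to mutations, and diploids made of two gametes) can be sketched with plain data classes. This is an illustration of the idea only, not fwdpp's actual C++ types or API. The assignment of mutations to four gametes is one reading of the garbled diagram, and the pairing of gametes into diploids is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Mutation:
    pos: float  # position
    n: int      # number of gametes carrying this mutation
    s: float    # selection coefficient (or effect size)

@dataclass
class Gamete:
    n: int  # number of copies of this gamete in the population
    mutations: List[Mutation] = field(default_factory=list)  # shared references

Diploid = Tuple[Gamete, Gamete]

# Values for the three mutations are taken from the slide.
mutations = [
    Mutation(pos=0.623, n=1, s=0.0),
    Mutation(pos=0.113, n=2, s=0.001),
    Mutation(pos=0.004, n=1, s=-0.2),
]
m1, m2, m3 = mutations
gametes = [
    Gamete(n=1, mutations=[m1, m2]),
    Gamete(n=1, mutations=[m3]),
    Gamete(n=1),                  # no mutations: the ancestral chromosome
    Gamete(n=1, mutations=[m2]),
]
diploids: List[Diploid] = [(gametes[0], gametes[1]), (gametes[2], gametes[3])]

def carrier_count(mutation, gametes):
    """Number of gamete copies that reference the given mutation object."""
    return sum(g.n for g in gametes
               if any(m is mutation for m in g.mutations))
```

    Storing references rather than copies means each mutation's data exists once, however many gametes carry it.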

  • [Figure: mean broad-sense heritability due to locus (y-axis, 0.025–0.075) against mean effect size, λ (x-axis, 0.0–0.5). Panels: V(E) = 0.075^2 and V(E) = 0.053^2; rapid growth and no growth. Models: additive, multiplicative, mult. recessive, gene-based.]

  • [Figure: −log10(p) (y-axis, 0–15) against position in kbp (x-axis, 0–100). Marker classes: common; common, causative; rare; rare, causative.]

    Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258

  • or most of the loci we find at this level are either already known or have now been confirmed by subsequent replication. Such replication studies are also the substrate for efforts to determine the range of associated phenotypes and to identify and characterize pathologically relevant variation.

    Second, failure to detect a prominent association signal in the present study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles) [25]; difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of significance appropriate for genome-wide analysis, variants with modest effect sizes (odds ratio (OR) < 1.2).

    Third, whereas the association signals detected can help to define regions of interest, they cannot provide unambiguous identification of the causal genes. Nevertheless, assessments on the basis of positional candidacy carry considerable weight, and, as we show, these already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies, will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved.

    We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results

    Table 2 | Evidence for signal of association at previously robustly replicated loci

    Collection  Gene      Chromosome  Reported SNP  WTCCC SNP   HapMap r2  Trend P value   Genotypic P value
    CAD         APOE      19q13       *             rs4420638   -          1.7 × 10^−01    1.7 × 10^−01
    CD          NOD2      16q12       rs2066844     rs17221417  0.23       9.4 × 10^−12    4.0 × 10^−11
    CD          IL23R     1p31        rs11209026    rs11805303  0.01       6.5 × 10^−13    5.9 × 10^−12
    RA          HLA-DRB1  6p21        *             rs615672    -          2.6 × 10^−27    7.5 × 10^−27
    RA          PTPN22    1p13        rs2476601     rs6679677   0.75       4.9 × 10^−26    5.6 × 10^−25
    T1D         HLA-DRB1  6p21        *             rs9270986   -          4.0 × 10^−116   2.3 × 10^−122
    T1D         INS       11p15       rs689         †           -          -               -
    T1D         CTLA4     2q33        rs3087243     rs3087243   1          2.5 × 10^−05    1.8 × 10^−05
    T1D         PTPN22    1p13        rs2476601     rs6679677   0.75       1.2 × 10^−26    5.4 × 10^−26
    T1D         IL2RA     10p15       rs706778      rs2104286   0.25       8.0 × 10^−06    4.3 × 10^−05
    T1D         IFIH1     2q24        rs1990760     rs3788964   0.26       1.9 × 10^−03    7.6 × 10^−03
    T2D         PPARG     3p25        rs1801282     rs1801282   1          1.3 × 10^−03    5.4 × 10^−03
    T2D         KCNJ11    11p15       rs5219        rs5215      0.9        1.3 × 10^−03    5.6 × 10^−03
    T2D         TCF7L2    10q25       rs7903146     rs4506565   0.92       5.7 × 10^−13    5.1 × 10^−12

    Where information on the strength of association at a particular SNP had been previously published and replicated we tabulated the P value of both the trend and genotype test at the same SNP (if in our study), or the best tag SNP (defined to be the SNP with highest r2 with the reported SNP, calculated in the CEU sample of the HapMap project). Positions are in NCBI build-35 coordinates. *Previous reports relate to haplotypes rather than single SNPs. †Not well tagged by SNPs that pass the quality control, see main text.

    [Figure: seven quantile-quantile panels (BD, CAD, CD, HT, RA, T1D, T2D); x-axis: expected chi-squared value (0–30); y-axis: observed test statistic (0–30).]

    Figure 3 | Quantile-quantile plots for seven genome-wide scans. For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency > 1% and missing data rate < 1%. SNPs that were visually inspected and revealed genotype calling problems were excluded. These filters were chosen to minimize the influence of genotype-calling artefacts. Each quantile-quantile plot shown in black involves around 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and for HT there are no such SNPs). The blue quantile-quantile plots show that departures in the extreme tail of the distribution of test statistics are due to regions with a strong signal for association.
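    The expected axis of a quantile-quantile plot like Figure 3 can be computed from the null distribution of the test statistic. The sketch below assumes a chi-squared distribution with 1 degree of freedom (appropriate for a trend test) and a common plotting-position convention, (i − 0.5)/n; neither choice is taken from the paper. It uses the identity that a 1 d.f. chi-squared quantile is the square of a standard normal quantile.

```python
from statistics import NormalDist

def expected_chi2_quantiles(n_tests):
    """Expected ordered values of n_tests chi-squared (1 d.f.) statistics."""
    nd = NormalDist()
    quantiles = []
    for i in range(1, n_tests + 1):
        p = (i - 0.5) / n_tests      # plotting position (assumption)
        z = nd.inv_cdf((1 + p) / 2)  # chi2_1 quantile = (normal quantile)^2
        quantiles.append(z * z)
    return quantiles

def qq_points(observed_stats):
    """Pair sorted observed statistics with their expected quantiles."""
    obs = sorted(observed_stats)
    return list(zip(expected_chi2_quantiles(len(obs)), obs))

# Hypothetical test statistics.
points = qq_points([0.1, 3.2, 0.7, 1.5, 0.02])
```

    Points falling above the diagonal in the upper tail correspond to the departures the caption attributes to regions with strong association signals.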


    Burton, P. R., et al. doi:10.1038/nature05911

  • [Figure: power at genome-wide significance threshold (y-axis, 0.00–1.00) against mean effect size, λ (x-axis, 0.0–0.5). Panels by model (additive, multiplicative, mult. recessive, gene-based) and demography (rapid growth, no growth). Tests: Logit, ESM, MBr, SKAT-O [Beta(1,25)], SKAT-O [linear].]

  • [Figure: histogram of the MAF of the most associated SNP (x-axis, 0.0–0.5); y-axis: frequency (0.00–0.10).]

    See also:

    Wray, N. R., et al. PLoS Biology, 9(1), e1000579.

    Gibson, G. doi:10.1038/nrg3118

  • See also:

    Wray, N. R., et al. PLoS Biology, 9(1), e1000579.

    Gibson, G. doi:10.1038/nrg3118

  • or most of the loci we find at this level are either already known or have now been confirmed by subsequent replication. Such replication studies are also the substrate for efforts to determine the range of associated phenotypes and to identify and characterize pathologically relevant variation.

    Second, failure to detect a prominent association signal in the present study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles) [25]; difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of significance appropriate for genome-wide analysis, variants with modest effect sizes (odds ratio (OR) < 1.2).

    Third, whereas the association signals detected can help to define regions of interest, they cannot provide unambiguous identification of the causal genes. Nevertheless, assessments on the basis of positional candidacy carry considerable weight, and, as we show, these already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies, will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved.

    We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results

    Table 2 | Evidence for signal of association at previously robustly replicated loci

    Collection  Gene      Chromosome  Reported SNP  WTCCC SNP   HapMap r2  Trend P value   Genotypic P value
    CAD         APOE      19q13       *             rs4420638   -          1.7 × 10^−01    1.7 × 10^−01
    CD          NOD2      16q12       rs2066844     rs17221417  0.23       9.4 × 10^−12    4.0 × 10^−11
    CD          IL23R     1p31        rs11209026    rs11805303  0.01       6.5 × 10^−13    5.9 × 10^−12
    RA          HLA-DRB1  6p21        *             rs615672    -          2.6 × 10^−27    7.5 × 10^−27
    RA          PTPN22    1p13        rs2476601     rs6679677   0.75       4.9 × 10^−26    5.6 × 10^−25
    T1D         HLA-DRB1  6p21        *             rs9270986   -          4.0 × 10^−116   2.3 × 10^−122
    T1D         INS       11p15       rs689         †           -          -               -
    T1D         CTLA4     2q33        rs3087243     rs3087243   1          2.5 × 10^−05    1.8 × 10^−05
    T1D         PTPN22    1p13        rs2476601     rs6679677   0.75       1.2 × 10^−26    5.4 × 10^−26
    T1D         IL2RA     10p15       rs706778      rs2104286   0.25       8.0 × 10^−06    4.3 × 10^−05
    T1D         IFIH1     2q24        rs1990760     rs3788964   0.26       1.9 × 10^−03    7.6 × 10^−03
    T2D         PPARG     3p25        rs1801282     rs1801282   1          1.3 × 10^−03    5.4 × 10^−03
    T2D         KCNJ11    11p15       rs5219        rs5215      0.9        1.3 × 10^−03    5.6 × 10^−03
    T2D         TCF7L2    10q25       rs7903146     rs4506565   0.92       5.7 × 10^−13    5.1 × 10^−12

    Where information on the strength of association at a particular SNP had been previously published and replicated we tabulated the P value of both the trend and genotype test at the same SNP (if in our study), or the best tag SNP (defined to be the SNP with highest r2 with the reported SNP, calculated in the CEU sample of the HapMap project). Positions are in NCBI build-35 coordinates. *Previous reports relate to haplotypes rather than single SNPs. †Not well tagged by SNPs that pass the quality control, see main text.

    [Figure 3 panels: BD, CAD, CD, HT, RA, T1D, T2D. x-axis: Expected chi-squared value; y-axis: Observed test statistic.]

    Figure 3 | Quantile-quantile plots for seven genome-wide scans. For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency >1% and missing data rate <1%. SNPs that were visually inspected and revealed genotype calling problems were excluded. These filters were chosen to minimize the influence of genotype-calling artefacts. Each quantile-quantile plot shown in black involves around 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and for HT there are no such SNPs). The blue quantile-quantile plots show that departures in the extreme tail of the distribution of test statistics are due to regions with a strong signal for association.
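    A quantile-quantile plot of this kind pairs each sorted observed test statistic with the value expected under the null. The sketch below is illustrative only: it assumes the trend test statistic follows a chi-squared distribution with 1 degree of freedom under the null, and it uses (i - 0.5)/n plotting positions, which is one common convention and not necessarily the one used for the figure. The function names are hypothetical.

```python
from statistics import NormalDist


def chi2_1df_quantile(q):
    """Quantile of a chi-squared(1 df) variable at probability q (0 < q < 1).

    A chi-squared(1) variable is the square of a standard normal, so its
    quantile is the squared normal quantile at (1 + q) / 2.
    """
    z = NormalDist().inv_cdf((1.0 + q) / 2.0)
    return z * z


def qq_points(observed_stats):
    """Return (expected, observed) pairs for a QQ plot.

    Assumption: plotting positions are (i - 0.5) / n for the i-th smallest
    observed statistic, i = 1..n.
    """
    n = len(observed_stats)
    ordered = sorted(observed_stats)
    return [
        (chi2_1df_quantile((i - 0.5) / n), obs)
        for i, obs in enumerate(ordered, start=1)
    ]


# Toy input: three made-up test statistics, not values from the study.
points = qq_points([5.0, 0.1, 1.0])
```

    Points that lie above the diagonal in the upper tail correspond to SNPs with larger test statistics than the null predicts.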

    NATURE | Vol 447 | 7 June 2007 | ARTICLES

    665 © 2007 Nature Publishing Group

    QQ plot from Burton, P. R., et al. doi:10.1038/nature05911

    Thornton, K. R., et al. doi:10.1371/journal.pgen.1003258

    $$\mathrm{ESM}_K = \sum_{i=1}^{K} \left( -\log_{10} P_i + \log_{10} \frac{i}{M} \right)$$
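    The ESM statistic sums, over the K smallest p-values, the gap between each observed -log10(P_i) and the value -log10(i/M) expected for the i-th smallest of M uniform p-values. The sketch below is a minimal illustration under that reading of the formula: it assumes P_i is the i-th smallest p-value and M is the total number of markers. The function name is hypothetical and this is not the authors' implementation.

```python
import math


def esm_statistic(p_values, k):
    """Sum over the k smallest p-values of -log10(P_i) + log10(i / M).

    Assumptions: P_i is the i-th smallest p-value (i = 1..k), M is the
    total number of markers, and every p-value is strictly positive.
    """
    m = len(p_values)
    if not 1 <= k <= m:
        raise ValueError("k must be between 1 and the number of markers")
    smallest = sorted(p_values)[:k]
    return sum(
        -math.log10(p) + math.log10(i / m)
        for i, p in enumerate(smallest, start=1)
    )


# Toy input: p-values sitting exactly at their uniform expectation i / M
# give a statistic of zero; an excess of small p-values makes it positive.
null_like = esm_statistic([0.25, 0.5, 0.75, 1.0], k=4)
enriched = esm_statistic([0.001, 0.5, 0.2, 0.9], k=2)
```

    Larger values indicate more small p-values in a region than expected by chance.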

  • Summary, part 2

    GWAS observations are consistent with a simple model of loss-of-function (recessive) mutations in genes

    The (genetic) model matters much more than the demographic assumptions!

    Standard models are rejected by the data.

    We need to get our hands on better GWAS data sets!

  • uncommon both in this data set and in previous studies (Kaminker et al. 2002; Lipatov et al. 2005).

    Differences in the strength of selection against closely related genes can also be illustrated by patterns of TE insertions. The paralogs derailed (fig. 4F) and derailed-2 (fig. 4E) showed very different patterns of TE insertions, though the same pattern was seen in both resources. derailed-2, located on chromosome 2R, is in a region of moderate recombination, 1.99 cM/Mb, whereas derailed is located towards the distal end of 2L in a region of low recombination, 0.44 cM/Mb. Given the context of the recombination rates, the expectation would be that derailed would have a higher TE load since deleterious alleles are removed more efficiently in regions of high recombination (Hudson and Kaplan 1995), but the opposite pattern was observed.

    While both genes play a role in Wnt5 signaling, mutations in derailed can cause major phenotypic changes in Drosophila nervous system development resulting in the loss of normal function (Yoshikawa et al. 2003). derailed-2, when mutated, produces only minor differences in neuron positioning

    [Figure 4 panels: A klarsicht, B Notch, C Delta, D Cyp6a20, E derailed-2, F derailed. Resources shown: DGRP and DSPR. A frequency is printed above each insertion; markers indicate an insertion in an exon and (*) the same insertion in both resources.]

    FIG. 4. Transposable element insertions in genes. The frequency above each insertion is the number of lines in which the element is present over the number of lines in which the element is validated as either present or absent. Gene images are from the UCSC genome browser (http://genome.ucsc.edu/, last accessed June 31, 2012).


    Cridland, J. M., et al. doi:10.1093/molbev/mst129

    King, E. G., et al. doi:10.1371/journal.pgen.1004322

    McClellan, J., & King, M.-C. doi:10.1016/j.cell.2010.03.032

    214 Cell 141, April 16, 2010 ©2010 Elsevier Inc.

    ary factors, including the impact of the illness on selection (Pritchard and Cox, 2002). In order to be maintained at polymorphic frequencies worldwide, common variants with even modest influence on disease must withstand selective pressure in every generation. Not surprisingly, therefore, the common alleles with the best documented relationship to disease are associated with disorders that arise later in life, i.e., Alzheimer's disease or age-related macular degeneration. For illnesses that impact reproductive fitness, balancing positive selection is often demonstrable. Illness in these cases may arise from interaction between genetic and environmental factors, such that an otherwise adaptive mechanism or trait is deleterious in certain individuals. For example, adaptive inflammatory responses can cause autoimmune disorders when turned against the host, or efficient storage of calories can lead to type II diabetes or to obesity in food-rich cultures.

    Both common and rare alleles may lead to the same disease. For example, multiple rare mutations in any one of several genes (e.g., APP, PS1, PS2, and UBQLN1) lead to early-onset Alzheimer's disease (Bird, 2005). Rare mutations in genes involved in immune response (e.g., DNASEI, TREX1) confer a very high risk for lupus (Moser et al., 2009). Each of these conditions is characterized by allele and locus heterogeneity. However, it should not be anticipated that all complex diseases will have substantial contributions from common risk variants. This is especially true for illnesses that reduce fertility, either biologically or through social selection against marriage with individuals with severe diseases. Alleles of significant effect must be enabled by evolutionary forces to persist at polymorphic frequencies.

    Genome-wide Association Studies

    It has become commonplace in the genetics community to aver that genome-wide association studies have led to the identification of hundreds of SNPs as risk variants for common diseases, while acknowledging that most of the heritability for these traits remains to be explained (Manolio et al., 2009). We agree with this conclusion: that most of the inherited basis of common traits remains to be explained (Goldstein, 2009). We further suggest that many GWAS findings stem from factors other than a true association with disease risk. The bases of our concern are both statistical and experimental.

    Cryptic Population Stratification

    A major limitation complicating genome-wide association studies is the potential for cryptic population stratification. Subtle differences in ancestry between cases and controls can produce spurious association solely due to sampling. GWAS study designs typically control for population structure in two ways: by taking into account differences among populations in average allele frequencies and by excluding individual subjects with extreme outlier genotypes (e.g., Price et al., 2006; Purcell et al., 2007). These approaches are appropriate and necessary but not sufficient. Neither adjustment addresses the problem posed by individual SNPs that are outliers with respect to variation in allele frequencies among healthy populations. A hypervariable SNP may be falsely identified as a risk variant if cases and controls are not perfectly matched. Stratifying cases and controls by self-reported ethnicity is not nearly sufficient to control for this problem (Serre et al., 2008).

    A recently published genome-wide association study of autism (Wang et al., 2009) provides a particularly dramatic example of the perils of cryptic population stratification. The SNP most significantly associated with the illness was rs4307059 on chromosome 5p14.1. The ancestral T allele of rs4307059 was identified as the risk variant, with allele frequencies in the discovery series of 0.65 among cases and 0.61 among controls (odds ratio = 1.19, p = 3.4 × 10^-8). All cases and controls were of European ancestry. The p value is compelling. However, the frequency of the proposed risk variant varies from 0.21 to 0.77 across European populations (Coop et al., 2009), a range 14-fold greater than the difference in allele frequencies between cases and controls. Even a subtle difference in the ancestries of cases and controls within Europe could explain the difference in allele frequencies that was attributed to the autism phenotype. An odds ratio of 3.0, or even of 2.0 depending on population allele frequencies, would be robust to such population stratification. However, odds ratios of the magnitude generally detected by GWAS (
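    The arithmetic in the rs4307059 example can be checked directly from the quoted allele frequencies. The sketch below assumes the reported odds ratio is the simple allelic odds ratio (odds of the risk allele in cases divided by its odds in controls); the function name is hypothetical.

```python
def allelic_odds_ratio(freq_cases, freq_controls):
    """Odds of the allele in cases divided by the odds in controls."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Frequencies quoted in the text for the T allele of rs4307059.
odds_ratio = allelic_odds_ratio(0.65, 0.61)

# Case-control difference versus the range across European populations.
case_control_difference = 0.65 - 0.61
european_range = 0.77 - 0.21
fold = european_range / case_control_difference
```

    Here `odds_ratio` rounds to 1.19 and `fold` is about 14, matching the figures in the passage: the between-population range in allele frequency is far larger than the case-control difference it is being compared with.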

  • Resources

    http://www.molpopgen.org/Data

    http://github.com/molpopgen

    http://github.com/ThorntonLab

  • Acknowledgements

    Julie Cridland, Andrew Foran, Rebekah Rogers, Jaleal Sanjak, Ling Shao

    Peter Andolfatto, Tony Long, Stuart MacDonald

    Joseph Farran & Harry Mangalam

    NIH, UCI Center for Complex Biological Systems