

A 4-gigabase physical map unlocks the structure and evolution of the complex genome of Aegilops tauschii, the wheat D-genome progenitor

Ming-Cheng Luo (a,1), Yong Q. Gu (b,1), Frank M. You (a,2), Karin R. Deal (a), Yaqin Ma (a,3), Yuqin Hu (a), Naxin Huo (a,b), Yi Wang (a,b), Jirui Wang (a,4), Shiyong Chen (a), Chad M. Jorgensen (a), Yong Zhang (a), Patrick E. McGuire (a), Shiran Pasternak (c), Joshua C. Stein (c), Doreen Ware (c,5), Melissa Kramer (c), W. Richard McCombie (c), Shahryar F. Kianian (d), Mihaela M. Martis (e), Klaus F. X. Mayer (e), Sunish K. Sehgal (f), Wanlong Li (f,6), Bikram S. Gill (f), Michael W. Bevan (g), Hana Šimková (h), Jaroslav Doležel (h), Song Weining (i), Gerard R. Lazo (b), Olin D. Anderson (b), and Jan Dvorak (a,7)

(a) Department of Plant Sciences, University of California, Davis, CA 95616; (b) Genomics and Gene Discovery Research Unit, Western Regional Research Center, US Department of Agriculture/Agricultural Research Service, Albany, CA 94710; (c) Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724; (d) Department of Plant Sciences, North Dakota State University, Fargo, ND 58108; (e) Institute of Bioinformatics and Systems Biology/Munich Information Center for Protein Sequences, Helmholtz Center Munich, 85764 Neuherberg, Germany; (f) Department of Plant Pathology, Kansas State University, Manhattan, KS 66506; (g) John Innes Centre, Norwich NR4 7UJ, United Kingdom; (h) Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, CZ-78371 Olomouc, Czech Republic; and (i) Northwest Agriculture and Forestry University, Yangling, Shaanxi 712100, China

Edited* by Jeffrey L. Bennetzen, University of Georgia, Athens, GA, and approved March 25, 2013 (received for review November 1, 2012)

The current limitations in genome sequencing technology require the construction of physical maps for high-quality draft sequences of large plant genomes, such as that of Aegilops tauschii, the wheat D-genome progenitor. To construct a physical map of the Ae. tauschii genome, we fingerprinted 461,706 bacterial artificial chromosome clones, assembled contigs, designed a 10K Ae. tauschii Infinium SNP array, constructed a 7,185-marker genetic map, and anchored on the map contigs totaling 4.03 Gb. Using whole genome shotgun reads, we extended the SNP marker sequences and found 17,093 genes and gene fragments. We showed that collinearity of the Ae. tauschii genes with Brachypodium distachyon, rice, and sorghum decreased with phylogenetic distance and that structural genome evolution rates have been high across all investigated lineages in subfamily Pooideae, including that of Brachypodieae. We obtained additional information about the evolution of the seven Triticeae chromosomes from 12 ancestral chromosomes and uncovered a pattern of centromere inactivation accompanying nested chromosome insertions in grasses. We showed that the density of noncollinear genes along the Ae. tauschii chromosomes positively correlates with recombination rates, suggested a cause, and showed that new genes, exemplified by disease resistance genes, are preferentially located in high-recombination chromosome regions.

single nucleotide polymorphism | synteny | gene density | Oryza | BAC contig coassembly

Author contributions: M.-C.L., Y.Q.G., P.E.M., D.W., B.S.G., J. Doležel, O.D.A., and J. Dvorak designed research; Y.Q.G., F.M.Y., K.R.D., Y.M., Y.H., N.H., Y.W., J.W., S.C., C.M.J., Y.Z., S.P., M.K., W.R.M., S.F.K., K.F.X.M., S.K.S., M.W.B., H.Š., S.W., G.R.L., O.D.A., and J. Dvorak performed research; S.K.S., W.L., H.Š., and S.W. contributed new reagents/analytic tools; M.-C.L., Y.Q.G., F.M.Y., K.R.D., N.H., Y.W., J.C.S., D.W., M.M.M., K.F.X.M., O.D.A., and J. Dvorak analyzed data; and M.-C.L., Y.Q.G., F.M.Y., K.R.D., Y.W., P.E.M., J.C.S., D.W., W.R.M., M.M.M., K.F.X.M., O.D.A., and J. Dvorak wrote the paper.

Conflict of interest statement: W.R.M. has participated in Illumina-sponsored meetings (past 4 years) and received travel reimbursement and honorarium for presenting at these events (Illumina had no role in decisions relating to this study and the decision to publish), has participated in Pacific Biosciences-sponsored meetings (past 3 years) and received travel reimbursement for presenting at these events, and is a founder and shared holder of Orion Genomics, which focuses on plant genomics and cancer genetics.

*This Direct Submission article had a prearranged editor.

Freely available online through the PNAS open access option.

Data deposition: The sequences reported in this paper have been deposited in the National Center for Biotechnology Information database (www.ncbi.nlm.nih.gov/) (accession nos. SRP012566, SRA052214.1, SRX129979, SRX125241, SRX125233, SRX124436, SRX116546, SRX037891, and SRX037088).

1 M.-C.L. and Y.Q.G. contributed equally to this work.

2 Present address: Cereal Research Centre, Agriculture and Agri-Food Canada, Winnipeg, Canada MB R3T 2M9.

3 Present address: Department of Botany and Plant Sciences, University of California, Riverside, CA 92521.

4 Home address: Triticeae Research Institute, Sichuan Agricultural University, Chengdu, Sichuan 611130, China.

5 Alternative address: US Department of Agriculture/Agricultural Research Service North Atlantic Area, Robert W. Holley Center for Agriculture and Health, Tower Road, Ithaca, NY 14853.

6 Present address: Department of Biology and Microbiology, South Dakota State University, Brookings, SD 57007-2142.

7 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1219082110/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1219082110 | PNAS Early Edition

PLANT BIOLOGY

Many plants have large genomes with vast amounts of repeated DNA. An example is Aegilops tauschii, the diploid progenitor of the D genome of hexaploid wheat (Triticum aestivum). The estimates of its genome size range from 4.02 (1) to 4.98 Gb (2), and 90% of its genome was estimated to be repetitive DNA (3). The Ae. tauschii genome and the D genome of hexaploid wheat are closely related due to the recent origin of hexaploid wheat (4). Ae. tauschii is therefore an important resource for wheat breeding, and its genome is an invaluable reference for wheat genomics, as illustrated by the utility of its sequences in the analysis of the wheat gene space (5). The utility of Ae. tauschii for wheat genetics and genomics would be further enhanced by a high-quality draft sequence of its genome. With current technology, the only approach to produce a high-quality de novo draft sequence for a genome of this size and complexity is the ordered-clone sequencing approach, which requires a physical map.

Physical map construction necessitates fingerprinting multiple genome equivalents of bacterial artificial chromosome (BAC) clones, assembling them into contigs, and anchoring the contigs on a genetic map (6–8). Great strides have been made in BAC fingerprinting techniques (7, 9–12) and software for fingerprint editing and contig assembly (13–16). It is now possible with these technological advances to fingerprint and assemble contigs from hundreds of thousands of BAC clones (7, 8, 17–19). In contrast, contig anchoring remains a weakness in physical mapping of large plant genomes because of their low gene density, extensive gene duplication, and abundance of repetitive DNA. BAC end sequences (BESs) are an effective means of contig anchoring in small genomes (11). In large genomes, however, hundreds of thousands of BESs are needed. DNA hybridization and PCR-based anchoring (6, 7, 20, 21) is laborious and often produces equivocal results. Contig anchoring with highly multiplexed Illumina GoldenGate SNP assays overcomes some of these limitations (22), but its relatively high cost limits the number of markers that can be used.

Here we report the construction of a physical map of the Ae. tauschii genome using the SNaPshot BAC fingerprinting technology (9, 11) and the Illumina Infinium SNP array technology in contig anchoring. To make the physical map comparative, we use SNPs in sequences greatly enriched for genes (23) in designing the array and then extend the mapped SNP markers with whole genome shotgun (WGS) reads. We use the Ae. tauschii comparative map to reassess collinearity between the Ae. tauschii genome and the Brachypodium distachyon, rice (Oryza sativa), and sorghum (Sorghum bicolor) genomes and report insights into several aspects of grass genome evolution.

Results

Physical Map Construction. We fingerprinted 461,706 BAC clones (Table S1) of Ae. tauschii accession AL8/78, edited them, and assembled 399,448 clones into 3,153 BAC contigs with Fingerprinted Contig (FPC) software (14) (SI Text; Table 1); 15,683 clones remained as singletons. To construct a genetic map for contig anchoring, we built a 10K Infinium iSelect SNP oligonucleotide array (SI Text) using 10,000 SNPs selected from 195,631 SNPs between Ae. tauschii accessions AL8/78 and AS75, the parents of our F2 mapping population, mostly located in genes (23). We genotyped 1,102 F2 plants of a single biparental mapping population (Fig. S1) and constructed a 7,185-marker genetic map (Table 2; SI Text; http://probes.pw.usda.gov/WheatDMarker/downloads/ComparativeMapData.xls).

To anchor BAC contigs, we constructed five-dimensional (5-D) BAC pools (22) (SI Text). To avoid nonspecific DNA amplification in negative pools, we added AS75 genomic DNA to all pools. The Infinium assays failed to produce clear-cut clustering of positive and negative genotypes in the GenomeStudio graphs (Fig. S1), which we overcame as described in SI Text. Genotyping data for 356 (4.96%) low-deconvolution confidence markers were deconvoluted manually and were either confirmed or eliminated from the physical map. Ultimately, we assigned 6,992 (97.3%) of the 7,185 SNP markers to contigs.

We then examined contigs for false joins by using (i) contig anchoring on the genetic map (http://probes.pw.usda.gov/WheatDMarker/downloads/ComparativeMapData.xls), (ii) analyses of FPC consensus band (CB) maps (Fig. S2), and (iii) coassembly of contigs using Ae. tauschii BAC clones and clones of wheat D-genome subgenomic BAC libraries (24) (Fig. S3; SI Text). The contig coassembly was made possible by the close phylogenetic proximity of the Ae. tauschii genome to the wheat D genome.

We disjoined 425 (13.5%) chimeric contigs and obtained a 4,030-Mb physical map (Fig. 1) consisting of 2,263 anchored contigs (Tables 1 and 2). Contig disjoining increased the number of contigs and decreased their average and N50 lengths (Table 1).

The 4,030-Mb physical map (http://probes.pw.usda.gov/WheatDMarker/) represented 84.2% of 4,792 Mb, the total length of the 3,578 BAC contigs (Table 1). The total contig length was within the range of published Ae. tauschii genome size estimates. However, 4,792 Mb must be an overestimate because undetected overlaps between contigs were necessarily counted twice. The remaining 15.8% of the total length were unanchored, mostly short contigs of average length = 578 kb (Table 1). The minimal tiling path (MTP) across all anchored and unanchored contigs consisted of 42,822 clones.
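The N50 lengths quoted for the contig assemblies can be illustrated with a short sketch. It assumes the common definition of N50 (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total). The function name and the toy contig lengths are hypothetical and are not taken from the paper's data.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Toy contig lengths in kb (hypothetical values, for illustration only).
toy_contigs_kb = [2500, 2000, 1500, 1000, 500, 300, 200]

print("N50 (kb):", n50(toy_contigs_kb))
print("Mean length (kb):", sum(toy_contigs_kb) / len(toy_contigs_kb))
```

Splitting a long chimeric contig into two shorter ones leaves the total length unchanged but moves length toward the short end of the sorted list, which is why disjoining lowers both the average and the N50.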

Marker Sequence Extension. We extended the sequences of the 7,185 mapped SNP markers with 3.1× and 50× genome equivalent of Roche GS FLX Titanium and Illumina HiSeq WGS reads, respectively (SI Text). The sequence associated with each marker was on average extended to 7,869 bp with an N50 of 10,830 bp (Table 3).

To assess the accuracy of sequence extension, we aligned extended sequences to 37 homologous AL8/78 BAC clone sequences in the National Center for Biotechnology Information (NCBI) database. Of the 37 extended sequences, the alignment of 22 and 27 exceeded 99% and 95% of the sequence length, respectively, and the alignment of all sequences exceeded 81% sequence length. A reduced alignment length of two sequences was caused by gaps in the BAC sequence scaffolds. Nucleotide identity, except for one BAC clone, exceeded 98% of aligned nucleotides.

The 7,185 extended sequences (http://probes.pw.usda.gov/WheatDMarker/) contained 17,093 genes and gene fragments (http://probes.pw.usda.gov/WheatDMarker/downloads/GeneList.xls). Of these, 9,716 (56.8%) were complete genes (defined in SI Text), and 7,377 (43.2%) were gene fragments based on alignment with wheat and barley expressed sequence tags (ESTs) (4,866) or, in the absence of EST evidence, with proteins (2,511).

Chromosome Structure. We previously hypothesized that the seven Ae. tauschii chromosomes originated from 12 ancestral chromosomes by five nested chromosome insertions (NCIs) (25) (Fig. 1). Our data showed that during the NCIs that produced Ae. tauschii chromosomes 1D, 2D, 5D, and 7D, a telomere of the inserted chromosome was inserted near the centromere, in a gene-containing region (Fig. 1). That centromere was lost, and the centromere of the inserted chromosome became the active centromere in each compound chromosome (Fig. 1).

Chromosome 5D originated by NCI of a chromosome corresponding to Os12 near the centromere of Os9, but we did not find evidence of the expected homology between the 5D short arm and the 1-Mb tip of the short arm of Os9. However, the small size of the region may have precluded its detection. In addition to this putative NCI, chromosome 5D acquired a segment corresponding to a distal portion of the short arm of Os3, which was attached to a segment corresponding to the long arm of Os9, making up the arm 5DL (Fig. 1). We detected the reciprocal product of this translocation. The tip of the long arm of Os9 forms the tip of the 4D short arm; hence, the translocation between chromosomes corresponding to Os3 and Os9 was reciprocal.

Table 1. Statistics of FPC contig assembly and editing (disjoining of chimeric contigs)

Contigs               No. of contigs   Total length (Mb)   Average length (kb)   N50 (kb)
After FPC assembly    3,153            4,756               1,509                 2,665
After editing         3,578            4,792               1,339                 2,092
Anchored contigs      2,263            4,030               1,778                 2,385
Unanchored contigs    1,315            757                 578                   945

Table 2. Maps of the Ae. tauschii chromosomes

              Genetic map                     Physical map
Chromosome    No. of markers   Length (cM)    No. of markers   No. of contigs anchored   Length (Mb)
1D            973              175.6          943              295                       520
2D            1,326            235.1          1,282            385                       672
3D            1,101            204.1          1,062            356                       633
4D            821              143.9          799              267                       520
5D            1,034            215.4          1,001            319                       577
6D            771              172.2          746              267                       466
7D            1,159            228.1          1,159            374                       642
Total         7,185            1,374.4        6,992            2,263                     4,030

Rate of Genome Evolution. We assigned inversions, translocations, and NCIs (http://probes.pw.usda.gov/WheatDMarker/downloads/ComparativeMapData.xls) to each branch of a grass phylogenetic tree (Fig. 2A) as described earlier (25). Using divergence time estimates (26), we estimated the rates (k) of structural genome evolution (Fig. 2A). We confirmed the highest rate of genome evolution in the Ae. tauschii branch (67 inversions, translocations, and NCIs) and the slowest rates in the rice and sorghum branches (respectively 16 and 25 inversions, translocations, and NCIs). The rate was also high in the B. distachyon branch (48 inversions, translocations, and NCIs) and internal branch 2 (18 inversions and translocations), showing that all investigated branches of subfamily Pooideae had elevated rates of genome evolution compared with those of Panicoideae and Ehrhartoideae (P = 0.01). The fast rate of genome evolution in Ae. tauschii is illustrated by dot plots (see SI Text for methodology). The Ae. tauschii–rice dot plot shows more chromosome rearrangements than the sorghum–rice dot plot (Fig. 2 B and C). The Ae. tauschii–B. distachyon dot plot (Fig. S4A) shows the largest number of rearrangements, illustrating the fast rates of genome evolution in both lineages.

The orthologous and paralogous relationships of the Ae. tauschii chromosomes relative to rice pseudomolecules are shown by a dot plot in Fig. S4B. The paralogous relationships reflect the pan-grass whole-genome duplication (27).
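The rate k described in this section is a count of rearrangements divided by the time over which a branch evolved. A minimal sketch of that arithmetic follows. The event counts are those quoted in the text; the branch length is a labeled placeholder, because the divergence times (26) are not reproduced here, and the function name is hypothetical.

```python
def rearrangement_rate(events, branch_length_my):
    """Return k: inversions, translocations, and NCIs per million years."""
    if branch_length_my <= 0:
        raise ValueError("branch length must be positive")
    return events / branch_length_my


# Event counts per terminal branch, as quoted in the text.
events = {"Ae. tauschii": 67, "B. distachyon": 48, "sorghum": 25, "rice": 16}

# PLACEHOLDER branch length in million years (hypothetical, NOT from the paper).
# One shared value is used only to show the calculation; real branches differ.
placeholder_branch_my = 35.0

for branch, count in events.items():
    k = rearrangement_rate(count, placeholder_branch_my)
    print(f"{branch}: k = {k:.2f} rearrangements per million years")
```

With real branch lengths substituted, the same function would give the per-branch rates that Fig. 2A reports.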

Recombination Rates and Gene Density. The average recombination rate was 0.32 cM/Mb and ranged from nearly zero in the proximal chromosome regions to about 1.5–2.0 cM/Mb in the distal regions (Fig. S5). Recombination rates dropped more precipitously in the short arms than in the long arms (Fig. 1; Fig. S5).

Genes were distributed across the entire spans of the physical maps of the Ae. tauschii chromosomes including the low-recombination pericentromeric regions (Fig. 1; Fig. S5). Gene density computed for nonoverlapping 30-Mb intervals along Ae. tauschii chromosomes based on the distribution of 17,093 genes and gene fragments fluctuated two- to threefold (Fig. S5). Gene density was correlated with recombination rate (r = 0.53, P < 0.0001). An insertion of a telomeric region into the centromeric region during NCI juxtaposes a high-gene-density terminal region and a low-gene-density centromeric region in the nascent compound chromosome. These gene-density juxtapositions have been observed in the B. distachyon genome (26). They appear to be absent from the Ae. tauschii genome (Fig. 1; Fig. S5).
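The binning and correlation described above can be sketched as follows. The sketch counts genes in nonoverlapping 30-Mb intervals and computes a Pearson correlation coefficient against per-interval recombination rates. All input values are hypothetical toy data, the function names are illustrative, and the paper's own procedure is given in its SI Text.

```python
from math import sqrt

BIN_MB = 30  # nonoverlapping interval size stated in the text


def bin_counts(positions_mb, chrom_len_mb, bin_mb=BIN_MB):
    """Count features in nonoverlapping bins along one chromosome."""
    n_bins = -(-chrom_len_mb // bin_mb)  # ceiling division for integer lengths
    counts = [0] * n_bins
    for pos in positions_mb:
        index = min(int(pos // bin_mb), n_bins - 1)
        counts[index] += 1
    return counts


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical gene positions (Mb) on a 120-Mb toy chromosome.
gene_positions = [2, 5, 9, 14, 21, 27, 40, 55, 70, 95, 101, 108, 112, 118]
counts = bin_counts(gene_positions, 120)
density = [c / BIN_MB for c in counts]  # genes per Mb in each bin

# Hypothetical recombination rates (cM/Mb) for the same four bins.
recombination = [1.6, 0.2, 0.1, 1.2]

print("genes per bin:", counts)
print("r =", round(pearson_r(density, recombination), 3))
```

In the toy data the gene-rich bins at both chromosome ends also carry the highest recombination rates, mimicking the distal enrichment reported in the text.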

Noncollinear Genes. We selected among the extended sequences of 7,185 SNP markers a nonredundant set of 5,901 Ae. tauschii genes or gene fragments with one or more homologs in at least one of the B. distachyon, rice, and sorghum genomes. If there was more than one homolog in a compared genome, we considered only the homolog with the lowest E value. We then counted the numbers of homologs in collinear locations in the B. distachyon, rice, and sorghum genomes and used their percentage as a measure of collinearity of genes between the Ae. tauschii genome and the three grass genomes (Table 4). Collinearity of Ae. tauschii genes with B. distachyon, rice, and sorghum genes decreased as the phylogenetic distance between the compared genomes increased (Table 4).

Of the 5,901 Ae. tauschii genes, 1,540 (26.1%) were complete genes having homologs in collinear locations in none of the three grass genomes (Table S2) and were therefore most likely genes transposed or translocated to new locations in the Ae. tauschii lineage after its divergence from the B. distachyon lineage. Three observations indicated that these genes were preferentially located in the distal, high-recombination regions of the Ae. tauschii chromosomes. First, the elevated gene density in the distal regions of the Ae. tauschii chromosomes (circle #3 in Fig. 1) largely disappeared when we excluded the noncollinear genes from the physical map (circle #4 in Fig. 1). Second, dot plots of individual Ae. tauschii chromosomes compared with the 12 rice chromosomes showed dense clouds of rice homologs in noncollinear locations in the distal, high-recombination regions of each of the Ae. tauschii chromosomes (Fig. S6). Finally, the density of Ae. tauschii noncollinear genes per megabase correlated with recombination rate (r = 0.464, P < 0.0001; Fig. 3).
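The selection rule described above (keep only the lowest-E-value homolog per gene and genome, then count collinear hits) can be sketched as below. The record layout, the function names, and the toy hits are all hypothetical. Whether a hit is "collinear" is taken here as a precomputed flag, because the paper's collinearity criterion is defined in its SI Text and is not reproduced in this section; the percentage denominator (genes with a homolog in that genome) is likewise an assumption.

```python
def best_hits(hits):
    """Keep, for each (gene, genome) pair, the hit with the lowest E value.

    hits: iterable of (gene, genome, evalue, is_collinear) tuples.
    Returns {(gene, genome): (evalue, is_collinear)}.
    """
    best = {}
    for gene, genome, evalue, is_collinear in hits:
        key = (gene, genome)
        if key not in best or evalue < best[key][0]:
            best[key] = (evalue, is_collinear)
    return best


def percent_collinear(best, genome):
    """Percentage of best hits in one genome that are in collinear locations."""
    flags = [col for (_, gen), (_, col) in best.items() if gen == genome]
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def noncollinear_everywhere(best):
    """Genes whose best hit is collinear in none of the compared genomes."""
    genes = {gene for gene, _ in best}
    collinear_somewhere = {gene for (gene, _), (_, col) in best.items() if col}
    return genes - collinear_somewhere


# Hypothetical toy hits: (gene, genome, E value, collinear flag).
toy_hits = [
    ("g1", "rice", 1e-50, True),
    ("g1", "rice", 1e-20, False),   # weaker paralog, ignored
    ("g1", "sorghum", 1e-40, True),
    ("g2", "rice", 1e-30, False),
    ("g2", "sorghum", 1e-10, False),
    ("g3", "rice", 1e-60, True),
]

best = best_hits(toy_hits)
print(percent_collinear(best, "rice"))
print(sorted(noncollinear_everywhere(best)))
```

Restricting the comparison to the single best hit keeps paralogs from inflating either the collinear or the noncollinear count.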

Table 3. Gene prediction in extended sequences anchored on the physical map

Category                                  Measure
Extended sequence contigs (number)        7,185
Mean extended length (bp)                 7,869
N50 (bp)                                  10,830
Genes/gene fragments (number)             17,093
Complete genes (number)                   9,716
Genes aligned to ESTs (percentage)        84.0
Genes aligned to proteins (percentage)    95.4
Average gene length (bp)                  2,772
Average exon number per gene              5.5
Median exon length (bp)                   245
Average exon length (bp)                  266
Median intron length (bp)                 151
Average intron length (bp)                156
Gene fragments (number)                   7,377

Fig. 1. Ae. tauschii genome circle maps. The inner circle (#1) contains the physical maps of the Ae. tauschii chromosomes each with its short arm tip at 0 Mb. Circle #2 contains heat maps of recombination rates, circle #3 contains gene density heat maps, circle #4 contains heat maps of only genes collinear with the B. distachyon, rice, or sorghum pseudomolecules, and circle #5 shows global synteny with the rice chromosomes symbolizing 12 ancestral chromosomes. In circle #5, the active Ae. tauschii centromeres are white, inferred extinct centromeres are black, and the locations of current and inferred ancient telomeres are diagrammed by thick bars. Thirty-megabase windows sliding by 1 Mb were used in the heat map construction. The heat map units for circle #2 are cM/Mb and for circles #3 and #4 are numbers of genes and gene fragments discovered in the extended sequences per megabase of the physical maps.

Recombination and Location of Actively Evolving Genes. We selected the average recombination rate as an arbitrary boundary to divide each Ae. tauschii chromosome arm into high- and low-recombination regions. We then selected 4,134 Ae. tauschii complete genes in 23 gene ontology (GO) categories and allocated them into four groups: collinear high recombination, collinear low recombination, noncollinear high recombination, and noncollinear low recombination (Table S3). The distribution of collinear genes in each of the 23 GO categories with respect to high- and low-recombination regions did not significantly differ (2 × 2 contingency tables). However, noncollinear genes in GO category "receptor activity" (GO: 0004872) were significantly (P = 0.002) overrepresented in the high-recombination group (Table S3). This overrepresentation pattern was also observed for Ae. tauschii disease resistance genes, which were identified among the 4,134 genes by homology search against the plant disease resistance (R) gene database (28) (Table 5) (noncollinear R genes and their GOs are listed in http://probes.pw.usda.gov/WheatDMarker/downloads/RGeneGOClass.xls). This overrepresentation is typified by the distribution of NB-LRR disease resistance protein genes (NB-ARC domain-containing proteins), which accounted for 4.7% of the noncollinear genes in the high-recombination regions but only for 0.7% of noncollinear genes in the low-recombination region.
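A 2 × 2 contingency table of the kind mentioned above (genes in or out of a category, in high- or low-recombination regions) can be tested as sketched below. The text does not state which test was applied to the tables, so the two-sided Fisher's exact test used here is an assumption, and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tolerance = 1 + 1e-9  # guard against floating-point ties
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )


# Hypothetical counts (NOT from Table S3):
# rows = in category / not in category, columns = high / low recombination.
in_category_high, in_category_low = 30, 5
other_high, other_low = 600, 650

p_value = fisher_exact_two_sided(
    in_category_high, in_category_low, other_high, other_low
)
print(f"P = {p_value:.2e}")
```

A chi-square test would be a reasonable alternative when every cell count is large; the exact test is used in this sketch because it stays valid for small cells.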

Discussion

Physical Mapping. The most significant technical advance we report here is contig anchoring with an Infinium SNP array. Prior physical mapping of other medium size or large grass genomes used multiple marker systems, some of them requiring intensive laboratory work, to anchor contigs (7, 8, 21). Contig anchoring with an Infinium SNP array opened the door to the use of essentially unlimited numbers of SNP assays in contig anchoring with minimal labor beyond that needed for the construction of 5-D BAC pools. Another important technical advance, which greatly reduced the ambiguity of contig anchoring, was a computer script that used the locations of positive BAC clones in contigs to discriminate between true-positive and false-positive BAC clones during the deconvolution of the 5-D BAC pools (29). These two advances made it possible to include unequivocally 6,992 of the 7,185 markers (97.3%) on the genetic map into BAC contigs for contig anchoring. The construction of the physical map also benefitted from the use of an Ae. tauschii-wheat D-genome contig coassembly as one of the criteria for detecting Ae. tauschii chimeric contigs.

The results of our contig assembly and contig anchoring compare favorably with other Triticeae physical mapping endeavors. We assembled 399,448 edited clones into 3,578 edited contigs with an N50 contig length of 2,092 kb. During the physical mapping of wheat chromosome 3B, 56,952 fingerprinted clones were assembled into 1,036 edited contigs of an average size of 783 kb (21). The likely reasons for the relatively greater number of contigs and shorter average contig length in physical mapping of chromosome 3B were the use of fewer BAC libraries, a shorter average insert length, and a smaller number of restriction fragments in fingerprints caused by the use of a shorter size standard during capillary electrophoresis (LIZ500 rather than LIZ1200). During the physical mapping of the barley genome, 517,202 edited BAC clones were assembled into 9,265 contigs with an N50 contig length of 904 kb (8). Because of the greater number of clones, the barley assembly should have generated a smaller number of contigs with greater N50 contig length than our assembly. However, the reverse was obtained. Because the two assemblies used similar stepwise strategies and similar stringencies, the reasons for the different outcomes likely lie in some undetermined technical factor.
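The idea of using contig membership to separate true-positive from false-positive clones during pool deconvolution can be sketched as below. This is an illustrative simplification, not the script of ref. 29: it assumes that each clone sits in exactly one pool per dimension, that a candidate is any clone whose pools are all positive for a marker, and that true positives are candidates sharing a contig with at least one other candidate. All names and data are hypothetical.

```python
from collections import defaultdict


def candidate_clones(clone_pools, positive_pools):
    """Clones whose pool is positive in every dimension.

    clone_pools: {clone: tuple of pool ids, one per dimension}
    positive_pools: list of sets, the marker-positive pool ids per dimension
    """
    return [
        clone
        for clone, pools in clone_pools.items()
        if all(pool in positive_pools[dim] for dim, pool in enumerate(pools))
    ]


def contig_supported(candidates, clone_contig, min_clones=2):
    """Keep contigs that hold at least min_clones candidate clones."""
    by_contig = defaultdict(list)
    for clone in candidates:
        contig = clone_contig.get(clone)
        if contig is not None:
            by_contig[contig].append(clone)
    return {
        contig: sorted(clones)
        for contig, clones in by_contig.items()
        if len(clones) >= min_clones
    }


# Hypothetical 5-D pool coordinates for four clones.
clone_pools = {
    "A": (0, 0, 0, 0, 0),
    "B": (1, 1, 1, 1, 1),
    "C": (0, 1, 0, 1, 0),  # positive only by combination of A's and B's pools
    "D": (2, 0, 0, 0, 0),
}
clone_contig = {"A": "ctg1", "B": "ctg1", "C": "ctg7", "D": "ctg1"}

# Pools 0 and 1 are positive in every dimension because A and B carry the marker.
positive_pools = [{0, 1} for _ in range(5)]

candidates = candidate_clones(clone_pools, positive_pools)
print(candidates)
print(contig_supported(candidates, clone_contig))
```

In this toy case clone C satisfies every pool but stands alone in its contig, so it is discarded, while A and B support each other in the same contig and anchor it to the marker.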

Gene Sequences. We encountered 17,093 gene and gene fragment sequences in the extended sequences of the 7,185 SNP markers, which is seemingly incongruous with genes representing as little as 2.5% of the Ae. tauschii genome (3). We attribute this apparent contradiction to two factors: most of the 7,185 SNP markers were already located in genes (23), and Ae. tauschii genes are clustered (30).

The structural characteristics of the 9,716 complete genes we found in the extended sequences, such as the average number of exons per gene (5.5), the average exon length (245 bp), and the average gene length (2.8 kb), were similar to those reported for B. distachyon genes (5.5, 268 bp, and 2.6 kb, respectively) and rice genes (4.8, 364 bp, and 2.5 kb, respectively) (26). For unknown reasons, our data agree less closely with those reported for barley (7.6, 454 bp, and 3.0 kb, respectively) (8).

Centromere Fate During NCI. In each Ae. tauschii chromosome that originated by NCI whereby a telomere of the inserted chromosome was inserted into the vicinity of the centromere of the recipient chromosome, the centromere of the recipient chromosome became extinct, whereas the centromere of the inserted chromosome remained active. We observed the same pattern in the seven NCIs in the B. distachyon genome (26) and two NCIs in the sorghum genome (25). Because the centromere of the recipient chromosome has a telomeric region in its neighborhood, we speculate that such NCIs impair centromere functioning and generate functionally monocentric nascent chromosomes. A corollary is that NCIs taking place far away from the centromere may generate functionally dicentric chromosomes that are lost, which could explain why all NCIs observed to date in grasses took place in the vicinity of the centromere.

Genome Evolution Rate and Genome Size. A previous study (25) that did not include B. distachyon suggested a possible relationship between a large genome size and a fast rate of genome evolution among grasses. Here we show that genomes in all investigated branches of the subfamily Pooideae have been evolving fast, including the Brachypodieae branch. Because B. distachyon has a small genome (26), the fast rate of evolution of its genome contradicts the previous study. Genera Melica, Glyceria, Nardus, and Lygeum that diverged from the Brachypodieae lineage before the divergence of the Brachypodieae and Triticeae lineages (31) have 1C genome sizes >1.5 Gb (The C-value database, Kew Royal Botanical Garden, http://data.kew.org/cvalues/). We therefore speculate that the B. distachyon genome evolved from a larger ancestral genome by size contraction to account for the relatively fast rate of Brachypodieae genome evolution.

Fig. 2. Rates of structural genome evolution in the grass family. (A) Rates (k; numbers of inversions, translocations, and NCIs per million years) during the evolution of grass subfamilies Panicoideae, Ehrhartoideae, and Pooideae. (B and C) Pairwise dot-plot comparisons of gene order along Ae. tauschii physical maps (B) and along sorghum pseudomolecules (C) relative to that along the rice pseudomolecules. Chromosomes and pseudomolecules on the y axis have the tips of the short arms at 0 Mb, respectively, at the top. Blue dots indicate parallel order and red dots indicate antiparallel order of chromosomes and pseudomolecules on the x and y axes. Only orthologous loci are shown for the sake of clarity.

Noncollinear Genes and Evolution of New Genes. Transposition or translocation of a gene into a new location in a genome creates polymorphism for a paralogous gene pair. If both paralogues are retained and are not in tandem they appear as dispersed duplicated genes (32). The ancestral gene in the paralogous pair is in a collinear location relative to related genomes, whereas the derived gene is in a noncollinear location. Ae. tauschii collinear genes were distributed more-or-less uniformly across Ae. tauschii chromosomes, but noncollinear genes were concentrated in the distal regions, and their density correlated with recombination rates along chromosomes.

We propose the following hypothesis that may at least partially account for the concentration of noncollinear genes in distal, high-recombination regions of Ae. tauschii chromosomes. Neutral SNP and restriction fragment length polymorphism (RFLP) were shown to have a greater rate of loss in low-recombination regions of Ae. tauschii chromosomes than in high-recombination regions (33, 34), presumably due to the effects of selection sweeps and background selection. Polymorphic noncollinear genes must be affected by the same processes as other neutral polymorphism and hence have a greater rate of loss in low-recombination regions compared with high-recombination regions. Noncollinear genes therefore have a greater chance of survival in distal high-recombination regions of Ae. tauschii chromosomes than in proximal low-recombination regions.

The population of noncollinear genes is an arena for evolution of new gene functions, and the high-recombination regions of Ae. tauschii chromosomes should therefore be enriched for new actively evolving genes, including R genes, which are rapidly evolving genes in plants (35). Ae. tauschii noncollinear genes in distal, high-recombination chromosome regions were indeed enriched for R genes, as predicted by the hypothesis. In agreement with enrichment for R genes, high-recombination regions were also enriched for signal transduction genes (GO class “receptor activity” GO:0004872), which play important roles in the innate immunity in plants (36). This observation supports the hypothesis that the evolution of new genes preferentially takes place in the high-recombination regions of Triticeae chromosomes, as previously hypothesized on the basis of other evidence (37, 38).

Materials and Methods

BAC Fingerprinting and Contig Assembly. We used nine large-insert libraries (Table S1, SI Text) and fingerprinted 461,706 clones of average length 120.5 kb (Table S1), as described earlier (9, 11). We edited fingerprints with FPMiner software (11) (SI Text) and assembled 399,448 edited clones into contigs with FPC (version 9.3; www.agcol.arizona.edu/software/fpc/). The assembly started at an initial Sulston cutoff of 1 × 10^−70, proceeded by stepwise reduction of stringency accompanied by contig joining allowing a maximum of 15% Q clones, and was terminated at a Sulston cutoff of 1 × 10^−22 (SI Text). We converted the FPC length metric, consensus band (CB) units, into kilobases by estimating the insert lengths in 100 BAC clones per library by pulse field electrophoresis and dividing the total length by the number of restriction fragments in the fingerprints. The conversion factor was 1.5 kb/CB unit.
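The CB-unit conversion described above is a simple ratio, sketched below. This is an illustration only: the function names and the insert and fragment values are hypothetical, and the sketch assumes a pooled ratio (total insert length over total fragment count) rather than a mean of per-clone ratios, which the text does not specify.

```python
def kb_per_cb_unit(insert_lengths_kb, fragment_counts):
    """Estimate the kb-per-CB-unit factor as total sized insert length
    divided by the total number of restriction fragments in the fingerprints.
    Assumption: a pooled ratio over all sized clones."""
    total_fragments = sum(fragment_counts)
    if total_fragments == 0:
        raise ValueError("no restriction fragments supplied")
    return sum(insert_lengths_kb) / total_fragments


def contig_length_kb(cb_units, factor_kb_per_cb=1.5):
    """Convert a contig length in CB units to kb (1.5 kb/CB unit is the
    factor reported in the text)."""
    return cb_units * factor_kb_per_cb


# Hypothetical sizing data for three clones (illustration only).
inserts_kb = [120, 130, 110]
fragments = [80, 87, 73]
print(kb_per_cb_unit(inserts_kb, fragments))
print(contig_length_kb(1000))
```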

Ae. tauschii 10K Infinium Array. We used an algorithm similar to that described previously (39) in selection of SNPs to maximize the likelihood of having a single SNP marker per gene. Based on the evaluation of 10,000 sequences containing SNPs with Illumina’s Assay Design Tool, we obtained 9,485 functional assays in the 10K Infinium array. They included 515 SNPs located in wheat ESTs (labeled by GenBank accession numbers) that have been previously mapped on an AL8/78 × AS75 map (25). The 10K Infinium database can be found at http://probes.pw.usda.gov/WheatDMarker/al878_gene_10000_snps_order_070410.csv.

Genetic and Physical Map Construction. F2 mapping population, DNA isolation, 10K Infinium genotyping of the F2 plants, and genetic map construction are described in SI Text. The genetic map database is available at http://probes.pw.usda.gov/WheatDMarker/downloads/ComparativeMapData.xls. The construction of the Ae. tauschii physical map consisted of the following steps: (i) 5D pooling of BAC clones, (ii) genotyping of the pools with the Ae. tauschii 10K SNP Infinium assay, (iii) deconvolution of the 5D BAC pool genotyping data to identify BAC clones bearing marker genes, (iv) assigning each positive BAC clone to a marker locus on the genetic map, and (v) manual editing of BAC contigs and disjoining of chimeric contigs. Steps i–v are described in SI Text.

Marker Sequence Extension. We constructed contigs from the 3.1× Roche 454 reads, extended them with 50× Illumina contigs, and generated a set of

Table 4. Nonredundant Ae. tauschii complete genes and gene fragments in collinear positions relative to homologous genes in B. distachyon, rice, and sorghum

Genome          Total (no.)   Collinear (no.)   Percentage
B. distachyon   5,901*        3,624‡            61.4
                5,272†        3,523§            66.6
Rice            5,901*        3,209‡            54.4
                5,272†        3,136§            59.5
Sorghum         5,901*        3,008‡            51.0
                5,272†        2,942§            55.4

*Complete genes plus gene fragments.
†Complete genes only.
‡,§Estimates followed by the same footnote symbol differ significantly from each other (χ² test with Yates correction, P < 0.001).
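For readers who wish to see how a χ² test with Yates correction works on counts like those in Table 4, the following sketch applies the standard 2 × 2 formula to the B. distachyon and rice rows for complete genes plus gene fragments. It assumes that the footnoted comparison is between genomes within the same gene set; the function name is hypothetical, and this is not the authors' script.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square statistic with Yates continuity correction for the
    2 x 2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * corrected ** 2 / denominator


# Counts from Table 4, complete genes plus gene fragments.
total = 5901
collinear_bd = 3624    # collinear with B. distachyon
collinear_rice = 3209  # collinear with rice

stat = yates_chi_square(collinear_bd, total - collinear_bd,
                        collinear_rice, total - collinear_rice)

# 10.828 is the chi-square critical value for P = 0.001 at 1 df.
print(stat > 10.828)
```

A statistic above the critical value corresponds to P < 0.001 for that pair of estimates.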

Fig. 3. Relationship between recombination rate and density of noncollinear genes in the Aegilops tauschii genome.

Table 5. Numbers of Ae. tauschii genes homologous to plant disease resistance genes in four indicated groups

Class          Recombination rate   Genes   Genes homologous to R genes   Percent of genes in the class
Collinear      High                 1,202   56                            3.7
Collinear      Low                  1,930   94                            3.8
Noncollinear   High                 620     43                            4.7
Noncollinear   Low                  382     8                             1.4
Total                               4,134   201

P < 0.0001 between noncollinear classes (Fisher exact test). P = 0.049 between collinear and noncollinear classes in high recombination regions (Fisher exact test).


gene predictions in the 7,185 mapped extended sequences as described in SI Text. We assumed that 9,716 of the genes found in the extended sequences that were without any gap in the coding sequence and aligned fully with ESTs or annotated proteins were complete genes. We also obtained sequences that were incomplete but showed partial alignment to wheat and barley ESTs or proteins if no EST evidence was obtained. The output of MAKER (http://gmod.org/wiki/MAKER) was used to create a .gff file for our Gbrowse web interface (available at http://probes.pw.usda.gov/WheatDMarker/). A matrix of 17,093 genes and gene fragments including name, location on the genetic map, locations of homologous genes in B. distachyon, rice, and sorghum, and GO is available at http://probes.pw.usda.gov/WheatDMarker/downloads/GeneList.xls.

Recombination Rate and Gene Density. We computed the cumulative lengths of contigs along the physical maps of the Ae. tauschii chromosomes. For Fig. 1, we used a 30-Mb window sliding by 1 Mb at a time to compute a rate. For Fig. S5 and statistics, we used a 30-Mb nonoverlapping window to compute a rate.
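The two windowing schemes can be illustrated with one function in which only the step size differs. This is a minimal sketch, not the authors' code: it assumes that the rate in a window is the genetic distance (cM) spanned by the first and last marker in the window divided by their physical distance (Mb), and the marker coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right


def window_rates(markers, window_mb=30.0, step_mb=1.0):
    """markers: list of (physical_mb, genetic_cm), sorted by physical_mb.
    Returns a list of (window_start_mb, rate_cm_per_mb); the rate is None
    when a window holds fewer than two distinct marker positions."""
    positions = [p for p, _ in markers]
    results = []
    if not markers:
        return results
    start = 0.0
    while start < positions[-1]:
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, start + window_mb)
        rate = None
        if hi - lo >= 2:
            (p0, g0), (p1, g1) = markers[lo], markers[hi - 1]
            if p1 > p0:
                rate = (g1 - g0) / (p1 - p0)
        results.append((start, rate))
        start += step_mb
    return results


# Hypothetical markers: (cumulative physical position in Mb, map position in cM).
demo = [(0.0, 0.0), (10.0, 5.0), (20.0, 10.0), (40.0, 12.0), (60.0, 13.0)]
sliding = window_rates(demo, window_mb=30.0, step_mb=1.0)       # Fig. 1 style
nonoverlap = window_rates(demo, window_mb=30.0, step_mb=30.0)   # Fig. S5 style
print(nonoverlap)
```

Setting the step equal to the window length gives the nonoverlapping windows, which avoids the autocorrelation between neighboring estimates that sliding windows introduce.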

Collinearity Between Ae. tauschii and B. distachyon, Rice, and Sorghum. We searched for homology between the nucleotide sequences of Ae. tauschii genes and gene fragments in the B. distachyon, rice, and sorghum pseudomolecules (builds Brachypodium 1.2 from www.brachypodium.org, Osativa_120 from www.phytozome.net/, and Sorghum 1.0 from http://genome.jgi-psf.org/Sorbi1/Sorbi1.info.html). If we detected more than one homologous gene in the B. distachyon, rice, and sorghum genome, we selected the gene with the lowest E value. We used the progressive increase or decrease of gene starts along the pseudomolecule as evidence of collinearity and changes in this progression as evidence of inversions or translocations. We color-coded each change in the progression along a pseudomolecule relative to the order of genes along the Ae. tauschii genetic map (http://probes.pw.usda.gov/WheatDMarker/downloads/ComparativeMapData.xls). Using maximum parsimony, we assigned each inversion or translocation to a lineage. Because we did not have an outgroup, we could not decide if a change in gene order took place in the sorghum lineage or in branch 1 of the phylogenetic tree (Fig. 2A). We arbitrarily assigned all changes in these two branches to the sorghum branch.
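The idea of reading collinearity from the progression of gene starts can be sketched as follows. This is a deliberately naive illustration, not the procedure used for the color-coded spreadsheet: it assumes one start coordinate per gene on a single pseudomolecule, listed in Ae. tauschii map order, and the function name and coordinates are hypothetical.

```python
def monotonic_blocks(starts):
    """Split gene start coordinates (listed in Ae. tauschii map order) into
    maximal runs that steadily increase (+1) or decrease (-1).
    Returns (first_index, last_index, direction) tuples; direction is 0 for
    a run whose orientation cannot be determined (for example a single gene)."""
    blocks = []
    if not starts:
        return blocks
    begin, direction = 0, 0
    for i in range(1, len(starts)):
        step = (starts[i] > starts[i - 1]) - (starts[i] < starts[i - 1])
        if step == 0:
            continue
        if direction == 0:
            direction = step
        elif step != direction:
            # The progression changed: close the current run at gene i - 1.
            blocks.append((begin, i - 1, direction))
            begin, direction = i, 0
    blocks.append((begin, len(starts) - 1, direction))
    return blocks


# Hypothetical gene starts (kb) on one pseudomolecule.
print(monotonic_blocks([10, 20, 30, 90, 80, 70, 100, 110]))
```

Each boundary between runs marks a candidate inversion or translocation breakpoint. The gene at the boundary is assigned greedily to the preceding run, so exact breakpoints still need manual inspection, and a single misplaced homolog will split a run.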

GO. We used gene sequences in a BLASTX search against the UniProt knowledgebase (UniProtKB) database (www.uniprot.org) to assign each gene a functional GO term (BLASTX cutoff E value < 1E−7). GOs were assigned on the basis of biological, functional, and molecular annotation available from GO (www.geneontology.org). We also used gene sequences in a BLAST search against the plant resistance gene database (PRGdb; http://prgdb.crg.eu/wiki/Main_Page).
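The hit selection rules stated in these sections (keep the homolog with the lowest E value; require E < 1E−7 for GO assignment) can be sketched as a small filter. The input format of (query, subject, E value) tuples and all identifiers below are hypothetical; real BLAST output would need to be parsed into this shape first.

```python
def best_hits(hits, max_evalue=1e-7):
    """hits: iterable of (query_id, subject_id, evalue).
    Keeps, for each query, the hit with the lowest E value, and discards
    hits whose E value is not below max_evalue."""
    best = {}
    for query, subject, evalue in hits:
        if evalue >= max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical hits (illustration only).
demo_hits = [
    ("gene1", "P00001", 1e-30),
    ("gene1", "P00002", 1e-50),
    ("gene2", "P00003", 1e-3),
]
print(best_hits(demo_hits))
```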

ACKNOWLEDGMENTS. We thank C. Soderlund, W. Nelson, J. Messing, and P. Langridge for their service as advisors to the DBI-0701916 project and E. Ghiban at Cold Spring Harbor Laboratory for assistance with Illumina sequencing for the IOS-1032105 project. This work was supported by National Science Foundation Plant Genome Research Program Grants DBI-0701916 (Principal Investigator, J. Dvorak), DBI-0822100 (Principal Investigator, S.F.K.), and IOS-1032105 (Principal Investigator, W.R.M.), US Agricultural Research Service Projects 5325-21000-014 and 1907-21000-030, and Czech Science Foundation Grant P501/12/2554 (Principal Investigator, H.Š.). This paper is a contribution to the International Wheat Genome Sequencing Consortium.

1. Arumuganathan K, Earle ED (1991) Nuclear DNA content of some important plant species. Plant Mol Biol Rep 9(3):208–218.

2. Rees H, Walters MR (1965) Nuclear DNA and evolution of wheat. Heredity 20(1):73–82.

3. Li W, Zhang P, Fellers JP, Friebe B, Gill BS (2004) Sequence composition, organization, and evolution of the core Triticeae genome. Plant J 40(4):500–511.

4. Nesbitt M, Samuel D (1996) From staple crop to extinction? The archaeology and history of hulled wheats. Hulled Wheats. Promoting the Conservation and Use of Underutilized and Neglected Crops. Proceedings of the First International Workshop on Hulled Wheats, eds Padulosi S, Hammer K, Keller J (International Plant Genetic Resources Institute, Rome, Italy), pp 41–100.

5. Brenchley R, et al. (2012) Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature 491(7426):705–710.

6. Klein PE, et al. (2000) A high-throughput AFLP-based method for constructing integrated genetic and physical maps: Progress toward a sorghum genome map. Genome Res 10(6):789–807.

7. Wei F, et al. (2007) Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet 3(7):e123.

8. International Barley Genome Sequencing Consortium (2012) A physical, genetic and functional sequence assembly of the barley genome. Nature 491(7426):711–716.

9. Luo MC, et al. (2003) High-throughput fingerprinting of bacterial artificial chromosomes using the snapshot labeling kit and sizing of restriction fragments by capillary electrophoresis. Genomics 82(3):378–389.

10. Nelson WM, et al. (2005) Whole-genome validation of high-information-content fingerprinting. Plant Physiol 139(1):27–38.

11. Gu YQ, et al. (2009) A BAC-based physical map of Brachypodium distachyon and its comparative analysis with rice and wheat. BMC Genomics 10:496.

12. van Oeveren J, et al. (2011) Sequence-based physical mapping of complex genomes by whole genome profiling. Genome Res 21(4):618–625.

13. Soderlund C, Longden I, Mott R (1997) FPC: A system for building contigs from restriction fingerprinted clones. Comput Appl Biosci 13(5):523–535.

14. Soderlund C, Humphray S, Dunham A, French L (2000) Contigs built with fingerprints, markers, and FPC V4.7. Genome Res 10(11):1772–1787.

15. You FM, et al. (2007) GenoProfiler: batch processing of high-throughput capillary fingerprinting data. Bioinformatics 23(2):240–242.

16. Frenkel Z, Paux E, Mester D, Feuillet C, Korol A (2010) LTC: A novel algorithm to improve the efficiency of contig assembly for physical mapping in complex genomes. BMC Bioinformatics 11:584.

17. Luo MC, et al. (2003) Construction of contigs of Ae. tauschii genomic DNA fragments cloned in BAC and BiBAC vectors. Proceedings of the Tenth International Wheat Genetics Symposium, eds Pogna NE, Romano M, Pogna EA, Galterio G (S.IM.I., Rome, Italy), pp 293–296.

18. Messing J, et al. (2004) Sequence composition and genome organization of maize. Proc Natl Acad Sci USA 101(40):14349–14354.

19. Ng SHS, et al. (2005) A physical map of the genome of Atlantic salmon, Salmo salar. Genomics 86(4):396–404.

20. Cai WW, Reneker J, Chow CW, Vaishnav M, Bradley A (1998) An anchored framework BAC map of mouse chromosome 11 assembled using multiplex oligonucleotide hybridization. Genomics 54(3):387–397.

21. Paux E, et al. (2008) A physical map of the 1-gigabase bread wheat chromosome 3B. Science 322(5898):101–104.

22. Luo MC, et al. (2009) A high-throughput strategy for screening of bacterial artificial chromosome libraries and anchoring of clones on a genetic map constructed with single nucleotide polymorphisms. BMC Genomics 10:28.

23. You FM, et al. (2011) Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59.

24. Luo MC, et al. (2010) Feasibility of physical map construction from fingerprinted bacterial artificial chromosome libraries of polyploid plant species. BMC Genomics 11:122.

25. Luo MC, et al. (2009) Genome comparisons reveal a dominant mechanism of chromosome number reduction in grasses and accelerated genome evolution in Triticeae. Proc Natl Acad Sci USA 106(37):15780–15785.

26. International Brachypodium Initiative (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463(7282):763–768.

27. Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101(26):9903–9908.

28. Sanseverino W, et al. (2013) PRGdb 2.0: Towards a community-based database model for the analysis of R-genes in plants. Nucleic Acids Res 41(Database issue):D1167–D1171.

29. You FM, et al. (2010) A new implementation of high-throughput five-dimensional clone pooling strategy for BAC library screening. BMC Genomics 11:692.

30. Gottlieb A, et al. (2013) Insular organization of gene space in grass genomes. PLoS ONE 8(1):e54101.

31. Aliscioni S, et al.; Grass Phylogeny Working Group II (2012) New grass phylogeny resolves deep evolutionary relationships and discovers C4 origins. New Phytol 193(2):304–312.

32. Akhunov ED, Akhunova AR, Dvorak J (2007) Mechanisms and rates of birth and death of dispersed duplicated genes during the evolution of a multigene family in diploid and tetraploid wheats. Mol Biol Evol 24(2):539–550.

33. Dvorák J, Luo M-C, Yang Z-L (1998) Restriction fragment length polymorphism and divergence in the genomic regions of high and low recombination in self-fertilizing and cross-fertilizing Aegilops species. Genetics 148(1):423–434.

34. Wang JR, et al. (2013) Aegilops tauschii single nucleotide polymorphisms shed light on the origins of wheat D-genome genetic diversity and pinpoint the geographic origin of hexaploid wheat. New Phytol 198(3):925–937.

35. Michelmore RW, Meyers BC (1998) Clusters of resistance genes in plants evolve by divergent selection and a birth-and-death process. Genome Res 8(11):1113–1130.

36. Vakhrusheva OA, Nedospasov SA (2011) System of innate immunity in plants. Mol Biol 45(1):16–23.

37. Dvorak J, Akhunov ED (2005) Tempos of deletions and duplications of gene loci in relation to recombination rate during diploid and polyploid evolution in the Aegilops-Triticum alliance. Genetics 171:323–332.

38. See DR, et al. (2006) Gene evolution at the ends of wheat chromosomes. Proc Natl Acad Sci USA 103(11):4162–4167.

39. Mammadov JA, et al. (2010) Development of highly polymorphic SNP markers from the complexity reduced portion of maize [Zea mays L.] genome for use in marker-assisted breeding. Theor Appl Genet 121(3):577–588.


Supporting Information
Luo et al. 10.1073/pnas.1219082110

SI Text

Genetic Map Construction. Aegilops tauschii ssp. strangulata accession AL8/78 was collected by V. Jaaska (Department of Botany, Institute of Zoology and Botany, Tartu, Estonia) in Yerevan, Armenia, near the Hrazdan River. Accession AS75 (Ae. tauschii ssp. typica) was collected in Xi’an, Shaanxi Province, China. The former accession was used for the construction of large-insert libraries, and both were used as the parents of a biparental mapping population for the construction of a genetic map.

Earlier, we discovered 195,631 genic SNPs between accessions AL8/78 and AS75 with the genomewide SNP discovery pipeline AGSNP (1). About 84% of the SNPs were real; the rest were sequencing errors (1). To construct a 10K Infinium iSelect SNP array, we selected the best SNPs present in sequence contigs reported by You et al. (1). The Infinium II type SNPs were then selected from this pool to maximize the number of SNP assays in the 10K Infinium and SNP genotyping performance.

We submitted the SNPs to Illumina for evaluation using Illumina’s Assay Design Tool (ADT). On the basis of the ADT design scores, we submitted 10,000 high-score SNPs for manufacturing and ended up with 9,485 functional assays in the 10K Infinium SNP array. This population of SNP assays included 515 SNPs located in wheat expressed sequence tags (ESTs) (labeled by GenBank accession numbers) (2) that have previously been used for an Illumina GoldenGate SNP assay array construction (3). We used these SNPs to align the Infinium Ae. tauschii map with the preceding Ae. tauschii GoldenGate map (3).

We grew 1,102 AL8/78 × AS75 F2 plants in the greenhouse and extracted DNA from isolated nuclei as described earlier (4). The UC Davis Genome Center used 3 μg of genomic DNA (300 ng/μL) per plant for performing Infinium SNP genotyping assays using protocols provided by the Infinium manufacturer (Illumina). We processed the 10K Infinium genotyping data with the GenomeStudio software (Illumina) and manually examined the graphs of genotyped DNAs of the F2 plants and the AL8/78 and AS75 parental controls for clustering. If an SNP assay performed well, we expected three well-separated clusters of genotype scores in the 1:2:1 codominant monohybrid ratio, as illustrated in Fig. S1. The clustering of 1,214 SNP markers (14.5%) was inadequate, and they were excluded. The remaining 7,185 SNP assays generated genotype clustering similar to that shown in the upper panel of Fig. S1 and were used for the construction of the genetic map.

We generated a genotype matrix for the 1,102 Ae. tauschii F2 plants from the GenomeStudio SNP genotyping score output and used it as an input in the MultiPoint mapping software (5) using the following settings: cluster threshold (recombination rate) of 0.1, Jackknife value 90, number of iterations 10, and Kosambi function. We obtained seven linkage groups, one for each of the seven chromosomes of the Ae. tauschii genome. We manually examined the marker order and rechecked the matrix data and GenomeStudio clustering for markers showing low confidence order on the maps. We executed three iterations of map construction and checked the matrix and GenomeStudio data each time. Some groups of markers showed no recombination within the groups, and we based the order of those markers on synteny comparisons with Brachypodium distachyon, rice (Oryza sativa), and sorghum (Sorghum bicolor) and their location in bacterial artificial chromosome (BAC) contigs.
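The Kosambi function named among the mapping settings converts a recombination fraction r into map distance. The sketch below shows the standard formula, d = 25 ln((1 + 2r)/(1 − 2r)) in centimorgans; it illustrates the mapping function itself and is not a reproduction of how MultiPoint applies it internally. The function name is hypothetical.

```python
import math


def kosambi_cm(r):
    """Kosambi map distance in centimorgans for a recombination
    fraction r, valid for 0 <= r < 0.5."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


# Map distance at the clustering threshold of r = 0.1 used in the text.
print(round(kosambi_cm(0.1), 2))
```

For small r the distance in cM is close to 100r, and it grows faster than that as r approaches 0.5, which reflects the function's allowance for crossover interference.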

BAC Libraries, BAC Clone Fingerprinting, Fingerprint Editing, and BAC Contig Assembly. We confirmed the identity of greenhouse-grown AL8/78 plants targeted for the construction of BAC libraries by sequencing the PCR amplicons of ESTs AJ603554, BE403345, BE488719, BE518031, and BG263347 that had been sequenced previously in a number of Ae. tauschii accessions including AL8/78 (3, 6). We collected leaves from the plants, immediately froze them in liquid nitrogen, placed them into plastic bags, and mailed them to Amplicon Express on dry ice for BAC library construction. Amplicon Express constructed the BamHI, EcoRI, HindIII, and MboI BAC libraries using the pCC1 BAC vector (EPICENTRE) and DH10B host cells. The total number of clones per library, average insert sizes, number of clones used for fingerprinting, and other characteristics of the libraries are summarized in Table S1 (first four rows). Inserts in 100 clones per library were sized with pulse-field electrophoresis.

A total of 406,944 BAC clones from these four libraries (Table S1) were fingerprinted with a SNaPshot high-information content fingerprinting (HICF) method described by Luo et al. (7) and modified by Gu et al. (8). We also randomly selected 22,810 clones from a HindIII BAC library of AL8/78 previously constructed and fingerprinted with a technique described by Luo et al. (7). Clones from that earlier library construction and assembly (9) will be called phase I clones to distinguish them from the BAC clones produced here (phase II clones; Table S1). We refingerprinted these 22,810 phase I clones with the modified fingerprinting method used here and included them into the present contig assembly for future alignment of phase I contigs with the phase II contigs. We also refingerprinted 31,805 BAC and binary BAC (BiBAC, vector pCLD04541) clones located near the ends of phase I contigs for the same purpose as the HindIII BAC clones (Table S1). Unfortunately, the identity of these clones was incorrect, and they could not be used for associating phase I contigs with phase II contigs. However, because they were already fingerprinted, we included them into the phase II contig assembly.

In total, we fingerprinted 461,706 clones of an average insert length of 120.5 kb (Table S1). The fingerprints were edited with the FPMiner software (8) using the default settings. During fingerprint editing, we retained restriction fragments only in the 70- to 1,000-bp size range, excluded vector fragments, clones failing fingerprinting or lacking inserts, and clones with fewer than 30 or more than 220 fragments, and removed cross-contaminated samples using a module in FPMiner. Cross-contamination was detected as clones residing in neighboring wells and sharing 30% or more of the mean number of fragments in their profiles. After editing, we were left with 399,448 fingerprints for contig assembly (Table S1). The average insert length of the edited clones is unknown but must be >120.5 kb because clones with short inserts were removed from the pool of fingerprinted clones during the fingerprint editing phase.

We assembled the fingerprinted clones into contigs with FPC software (version 9.3, www.agcol.arizona.edu/software/fpc/) using the following strategy. We set the tolerance at 5 (=0.5 bp) throughout the assembly. We performed the initial assembly at a Sulston cutoff of 1 × 10^−70, which was followed by several rounds of DQering, until all contigs contained <15% questionable (Q) clones. We then reduced Sulston cutoff stringency and performed end-to-end and singleton-to-end contig merges, requiring two or more clones per merge. Sulston cutoff stringency reduction and contig merging were repeated until a Sulston score of 1 × 10^−22 was reached, at which point the assembly was terminated. The assembly resulted in a total of 3,153 contigs and 15,683 singletons.
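The size and fragment-count criteria used during fingerprint editing, described earlier in this section, can be expressed as a short filter. This sketch is not FPMiner code: the function name is hypothetical, it assumes that the fragment-count limits are applied after the size filter, and it does not model vector-fragment removal or the cross-contamination check.

```python
def edit_fingerprint(fragment_sizes_bp, min_bp=70, max_bp=1000,
                     min_fragments=30, max_fragments=220):
    """Keep fragments within the 70- to 1,000-bp range; return None when the
    clone is then left with fewer than 30 or more than 220 fragments."""
    kept = [size for size in fragment_sizes_bp if min_bp <= size <= max_bp]
    if len(kept) < min_fragments or len(kept) > max_fragments:
        return None
    return kept


# Hypothetical clone with 50 fragments, 10 of them shorter than 70 bp.
clone = list(range(60, 110))
edited = edit_fingerprint(clone)
print(None if edited is None else len(edited))
```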


BAC Pool Construction, Genotyping, and Deconvolution. We used a 5-D pooling strategy for BAC contig anchoring (10, 11). We isolated BAC DNA with the R.E.A.L. kit (Qiagen) and used 5 μL each (20 μL total) for the column and row pools in a stack of 100 plates. We produced eight separate sets of row/column pools, 1× genome coverage each, to minimize the number of false-positive pool intersections. To generate plate pools, we inoculated 100 mL of LB medium in a 250-mL flask with all 384 clones in a single plate, grew them overnight, and isolated DNA with a standard alkaline-lysis protocol. We arranged DNAs of plate pools into an 18 × 18 grid and combined aliquots into column and row superpools. We generated four sets of superpools, each of ∼3× genome equivalents. We added 300 ng of AS75 genomic DNA to all BAC pools to preempt nonspecific PCR amplification in the absence of a target DNA. A total of 10 μg of DNA per pool was submitted to the UC Davis Genome Center for genotyping with the 10K Ae. tauschii Infinium array.

The term deconvolution means the identification of the positive BAC clone(s) among the clones forming the 5-D BAC pools. Because the deconvolution program and its algorithms have been published (11), we will describe here only the basic idea and focus on details by which we implemented BAC pool deconvolution. The deconvolution program identified intersections between positive BAC row pools, column pools, and plate pools from the 10K Infinium assay, resulting in some false-positive intersections. To eliminate the need for PCR discrimination between true-positive and false-positive intersections, we used the distribution of BAC clones in contigs to discriminate between true-positive and false-positive intersections (11). BAC clones that are true positives must be neighbors in a contig and overlap, whereas clones that are false positives are distributed randomly among contigs. We therefore accepted clones as true positives only if two or more overlapping BAC clones anchored the same contig at a locus on the genetic map.

Because of the variable amounts of BAC DNAs in the pools, 10K Infinium BAC pool genotyping data failed to produce the clear-cut clustering seen in the upper panel in Fig. S1. Instead, we obtained diffuse plots only vaguely resembling the expected clusters (lower panel in Fig. S1). The fact that at the same time the genomic DNAs of AL8/78 and AS75 produced tight clusters near the x and y axes convinced us that the Infinium assay performed well. To determine where in the plots the negative and positive BAC pools clustered, we manually deconvoluted the 10K Infinium BAC pool genotyping data for 34 markers on the 1D genetic map. Negative BAC pools (sharing the genotype with AS75) were located in a single tight cluster near one of the ordinates together with the AS75 genomic DNAs (green dots in the green oval in the lower panel of Fig. S1). The positive BAC pools were located in the diffused cloud of dots along the ordinate containing the genomic DNAs showing the AL8/78 nucleotide (red dots in the red oval). We empirically determined that a fluorescence of 1.5 times the background fluorescence of AS75 genomic DNAs (fluorescence A in the lower panel of Fig. S1) is a robust boundary separating the positive BAC pools from negative BAC pools. Thus, all pools within the blue rectangle in the lower panel of Fig. S1 are positive BAC pools and superpools.
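Two of the decisions described in this section lend themselves to a short illustration: calling a pool positive when its signal exceeds 1.5 times the AS75 background, and accepting candidate clones only when they overlap another candidate in the same contig. The sketch below is not the published deconvolution program (11); the function names, data layout, pool names, and coordinates are all hypothetical.

```python
from collections import defaultdict


def positive_pools(signals, background, factor=1.5):
    """Return the ids of pools whose AL8/78-allele fluorescence exceeds
    factor x the AS75 background fluorescence."""
    return {pool for pool, value in signals.items() if value > factor * background}


def supported_clones(candidates):
    """candidates: (clone_id, contig_id, start_cb, end_cb) tuples for one marker,
    taken from the positive pool intersections. A clone is accepted only if it
    overlaps at least one other candidate clone in the same contig."""
    by_contig = defaultdict(list)
    for clone in candidates:
        by_contig[clone[1]].append(clone)
    accepted = set()
    for clones in by_contig.values():
        for i, (id_a, _, start_a, end_a) in enumerate(clones):
            for id_b, _, start_b, end_b in clones[i + 1:]:
                if start_a <= end_b and start_b <= end_a:
                    accepted.update((id_a, id_b))
    return accepted


# Hypothetical pool signals and candidate clones (illustration only).
print(sorted(positive_pools({"row3": 0.9, "col7": 0.2, "plate12": 0.5}, background=0.3)))
demo_candidates = [
    ("A", "ctg1", 0, 50),
    ("B", "ctg1", 40, 90),    # overlaps A in ctg1
    ("C", "ctg7", 10, 60),    # lone candidate in ctg7
    ("D", "ctg1", 200, 250),  # same contig as A and B but no overlap
]
print(sorted(supported_clones(demo_candidates)))
```

The contig-based filter replaces a PCR confirmation step: a false-positive intersection is unlikely to land next to another positive clone by chance.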

Manual Contig Editing and Physical Map Construction. The purpose of manual contig editing was to detect chimeric contigs and dissociate them. We examined the genetic map location of markers integrated into each contig. If the markers were in two separate regions on the genetic map, the contig was deemed chimeric and was manually disjoined using FPC tools. As illustrated in Fig. S2, a false join caused by a chimeric BAC clone can be easily detected on the FPC's CB map of a contig. In addition to the diagnostic pattern, the clone causing a false join is almost always a Q clone (Fig. S2). We examined CB maps of all anchored contigs and of unanchored contigs >1 Mb for false joins, and clones causing false joins were removed, which separated each chimeric contig into two. We also coassembled Ae. tauschii contigs with BAC clones from subgenomic BAC libraries (Fig. S3). Luo et al. (12) showed that contig coassembly using subgenomic BAC libraries fingerprinted with the SNaPshot HICF technique is an effective strategy for detecting BAC clone relationships. Here we used this technique for detecting Ae. tauschii chimeric contigs consisting of clones from different Ae. tauschii chromosomes or chromosome arms. We coassembled Ae. tauschii contigs with fingerprinted clones from subgenomic BAC libraries constructed from DNA isolated from the following flow-sorted wheat cv Chinese Spring chromosomes or chromosome arms: 30,067 fingerprinted and edited clones from a 1D-4D-6D BAC library (13), 30,157 fingerprinted and edited clones from a 3DS BAC library (12), 39,852 fingerprinted and edited clones from a 7DS BAC library (14), and 43,492 fingerprinted clones from a 7DL BAC library (14). BAC libraries of wheat chromosomes 2D and 5D and chromosome arm 3DL fingerprinted with a technique identical to that used here were not available to us.

Combined information provided by the contig marker anchoring, contig CB map, and contig coassembly was sufficiently redundant to detect and disjoin most of the chimeric contigs. Disjoining of chimeric contigs increased the total number of contigs from 3,153 to 3,578 and decreased their average length from 1,509 to 1,339 kb.
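The marker-based screen for chimeric contigs can be sketched as follows. This is a simplified illustration, not the FPC workflow used here: the function name, the tuple layout, the marker positions, and the `max_span_cm` tolerance are all hypothetical. A contig is flagged when its anchored markers map to more than one chromosome or lie too far apart on the genetic map; flagged contigs would still need inspection of the CB map before being disjoined.

```python
from collections import defaultdict

def find_chimeric_contigs(marker_hits, max_span_cm=10.0):
    """Flag contigs whose anchored markers fall in separate genetic-map regions.

    marker_hits: iterable of (contig, chromosome, position_cM) tuples.
    max_span_cm: hypothetical tolerance for markers on one chromosome.
    """
    by_contig = defaultdict(list)
    for contig, chrom, cm in marker_hits:
        by_contig[contig].append((chrom, cm))

    chimeric = set()
    for contig, hits in by_contig.items():
        chroms = {chrom for chrom, _ in hits}
        positions = [cm for _, cm in hits]
        # Different chromosomes, or distant regions of the same chromosome
        if len(chroms) > 1 or max(positions) - min(positions) > max_span_cm:
            chimeric.add(contig)
    return chimeric

# Hypothetical marker positions; ctg756 mimics the 3D/6D case of Fig. S2
hits = [
    ("ctg756", "3D", 42.0), ("ctg756", "6D", 88.5),
    ("ctg100", "1D", 12.0), ("ctg100", "1D", 12.4),
]
flagged = find_chimeric_contigs(hits)
```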

Extension of Marker Sequences. We extended the sequences containing SNP markers on the genetic and physical maps with 3.1× genome equivalent of Roche 454 WGS sequences and assembled sequence contigs. The Roche 454 sequence contigs were then stepwise extended with 50× genome equivalent of short Illumina contigs.

Roche 454 genomic library construction and sequencing. We prepared and sequenced the 454 sequencing library according to the manufacturer's instructions (GS FLX Titanium General Library preparation kit/emPCR kit sequencing kit; Roche Diagnostics). Briefly, we sheared 10 μg of Ae. tauschii accession AL8/78 genomic DNA by nebulization and fractionated it with agarose gel electrophoresis to isolate 400- to 750-bp fragments and used the sized fragments to construct a single-stranded shotgun library. We quantified the library by fluorometry using the Quant-iT RiboGreen reagent and processed it by emulsion PCR amplification. We sequenced the library with GS FLX Titanium following the manufacturer's recommendations (Roche Diagnostics).

Illumina library barcoding and sequencing. We quantified Ae. tauschii genomic DNA using the Qubit fluorometer and used ∼2 μg of DNA for the construction of the standard 300-bp and overlapping 180-bp Illumina libraries. We sheared the DNA by adaptive focused acoustics (using the Covaris instrument) and end-repaired it using T4 DNA polymerase, Klenow fragment, and T4 polynucleotide kinase. To add a single 3′ deoxyA overhang, we treated fragments with Klenow fragment (3′–5′ exonuclease) and ligated them to standard paired-end Illumina adapters. Qiagen columns were used for purification between steps. For the 300-bp library, we size-selected the fragments in the range of 350–450 bp using agarose gel electrophoresis, and for the 180-bp overlapping library, we size-selected the fragments for an insert size in the range of 190–210 bp using the Caliper Labchip XT instrument.
We then PCR amplified each library using Phusion DNA polymerase in HF buffer for 12 cycles and quantified it using the Agilent BioAnalyzer.

To construct mate-pair libraries (2 and 5 kb), we sheared 10 μg of genomic DNA with Covaris, end-repaired it using T4 DNA polymerase, Klenow fragment, and T4 polynucleotide kinase, and added biotinylated bases using T4 DNA polymerase, Klenow fragment, and T4 polynucleotide kinase. Qiagen beads were used for purification between steps. We size-selected DNA fragments in the 2- or 5-kb range using agarose gel electrophoresis and quantified them with the Agilent BioAnalyzer. We circularized DNA fragments with ligase, digested linear DNA with exonuclease, fragmented the circular DNA with the Covaris to ∼400 bp, and selected biotinylated fragments by binding to streptavidin magnetic beads. We repaired biotinylated fragments as above, A-tailed them using Klenow, and ligated them to standard Illumina adapters. Each library was then PCR amplified using Phusion DNA polymerase in HF buffer for 18 cycles. We performed final size selection for 350- to 650-bp fragments by agarose gel electrophoresis and quantified the libraries using the Agilent BioAnalyzer. All libraries were normalized to 10 nM before loading on the Illumina sequencers.

Illumina sequencing. We sequenced the 300- and 180-bp standard Ae. tauschii libraries with 100-bp paired-end reads and the 2- and 5-kb mate-pair libraries with 50-bp paired-end reads on Illumina GAIIx or HiSeq2000 instruments with paired-end modules. A 1% phiX control library was spiked into each sample lane to aid in quality monitoring as the runs progressed. We used the most current versions of the Illumina instrument control software and the Illumina flowcell and reagent kits available at the time each library was sequenced.

Initial Illumina sequence analysis. We processed images generated on Illumina GAIIx or HiSeq2000 sequencers and performed base-calling on the fly using the Illumina Real Time Analysis (RTA) software. The files were then transferred to a secondary Linux server for further processing. The .bcl files produced by the RTA software on the instrument contained base-call and quality score information in binary format. The .bcl files were converted to FASTQ format by the CASAVA pipeline (v1.7/v1.8), which also provided run summary and quality information.
Illumina FASTQ files were then uploaded to the NCBI short read archive (SRA). See www.cshl.edu/genome/wheat for SRA accession information.

Roche and Illumina contig assembly. We used 3.1× Ae. tauschii genome equivalents of Roche 454 reads for de novo assembly of contigs with the Roche gsAssembler using default settings. The assembly generated 1,070,122 contigs of a total length of 584,671,146 bp and an N50 of 835 bp. To filter out repetitive DNA, we searched for homology between the 454 contigs and the TREP database (a curated database of repeat elements in the tribe Triticeae; http://wheat.pw.usda.gov/ITMI/Repeats/).

We performed a similar manipulation with the 50× Illumina reads. They too were filtered by homology search against the TREP database. By filtering out repeated sequences, we reduced the Illumina sequences from the original 2,566,522,820 to 1,518,407,964 bp (59.2% of the original, i.e., a 40.8% reduction). We assembled the filtered Illumina reads with Velvet, but because of limited computer memory, we were able to use fragment reads and 300-bp paired-end data only up to a 3.5× Ae. tauschii genome equivalent. We performed 17 such independent assemblies, which on average assembled 4.8 million sequences with an average N50 value of 221 bp. A total of 714 Mb of genomic DNA was assembled into these short contigs.

Contig extension. The 454 contigs constructed from the 3.1× Roche 454 reads were extended with 50× Illumina contigs. First we performed a blastN search against the Velvet Illumina read assembly dataset to identify the Illumina contigs corresponding to the 454 reads. We then stepwise extended the 454 contigs by using 100 bp from the end of each contig in a blast search against the Illumina sequences to extract reads or contigs at an E-value threshold of 1E−30. We repeated this step until an attempt to extend a contig failed due to the absence of reads matching the end sequence or due to the end sequence matching a repetitive sequence. The average length of the 7,185 contigs containing SNP markers was extended to 7,869 bp, with a total cumulative length of 61 Mb and an N50 of 10,830 bp. The extended marker sequence length ranged from 348 to 54,605 bp.

We used the genome annotation pipeline MAKER (http://gmod.org/wiki/MAKER) for sequence annotation to generate a set of ab initio gene predictions in the 7,185 extended contigs. MAKER identified repeats using the TREP database, aligned ESTs and protein sequences with contig sequences, produced ab initio gene predictions, and automatically synthesized these data into gene annotation classes with evidence-based quality indices. We aligned wheat and barley full-length ESTs and assembled wheat EST contigs (http://plantta.jcvi.org/cgi-bin/plantta_release.pl) with the extended sequence contigs in the MAKER pipeline. We also used plant protein datasets for homology comparison in gene prediction. In total we predicted 17,093 protein-encoding genes or gene fragments in the 7,185 extended sequence contigs. We considered 9,716 of these genes to be complete genes because they had no gap in the coding sequence and the predicted gene sequences were fully aligned with ESTs or annotated proteins. MAKER also provided a list of sequences that showed partial alignment to ESTs or proteins. In this case, sequences partially matching ESTs were named gene fragments_EST and those that did not match any wheat or barley EST but partially matched sequences in the nonredundant protein database were named gene fragments_protein. They could be pseudogenes or novel genes not present in databases. We also calculated the average gene, exon, and intron lengths for the 9,716 annotated genes.

The output of MAKER was used to create a GFF file, which was used to build our GBrowse web interface (http://probes.pw.usda.gov/cgi-bin/gb2/gbrowse/wheat_D_marker/). A spreadsheet of the 17,093 genes and gene fragments including name, location on the genetic map, locations of homologous genes in B. distachyon, rice, and sorghum, and gene ontology (GO) is at http://probes.pw.usda.gov/WheatDMarker/downloads/GeneList.xls.
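The stepwise contig extension described in this section can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the pipeline used here: exact substring matching stands in for the blast search, the function name and reads are hypothetical, and a 3-bp seed replaces the 100-bp end sequence so that the example stays small. Extension stops when no read matches the contig end or when the matches disagree, which loosely mimics an end sequence falling in a repeat.

```python
def extend_contig(contig, reads, seed_len=100, max_rounds=1000):
    """Greedy 3' extension; exact seed matching stands in for a BLAST search."""
    for _ in range(max_rounds):
        seed = contig[-seed_len:]
        # Reads that contain the seed and continue beyond it
        hits = [r for r in reads if seed in r and not r.endswith(seed)]
        # Candidate extensions: the sequence following the seed in each hit
        tails = {r[r.index(seed) + len(seed):] for r in hits}
        if len(tails) != 1:
            break  # no matching read, or conflicting (repeat-like) matches
        contig += tails.pop()
    return contig

# Hypothetical toy sequences
extended = extend_contig("ACGTAC", ["GTACGGA", "GGATTC"], seed_len=3)
stalled = extend_contig("GGTAC", ["TACAAA", "TACGGG"], seed_len=3)
```

In the first call the contig grows through two rounds and then stops for lack of a matching read; in the second call two reads propose different continuations, so the contig is left unchanged.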

Dot Plots. To make dot plots as shown in Fig. 2 B and C, and Fig. S4, we aligned Ae. tauschii marker fasta sequences to annotated proteins of reference grass species using NCBI BLASTX. To increase sensitivity, we predicted translated sequences of Ae. tauschii markers on the basis of FGENESH (PMID:10779491) and aligned them to annotated reference proteins by BLASTP. We determined the collinear relationships between best significantly aligned marker and reference genes (E-value ≤ 1E−10) using DAGchainer (PMID:15247098), and if necessary, filtered paralogous chromosomal relationships to leave just orthologous collinear gene pairs. We collapsed duplicate loci so that each marker or gene was represented exactly once. We graphed marker and reference gene loci on the basis of their rank position along chromosomes. For plots between rice, sorghum, and B. distachyon, collinear genes were detected among orthologous genes as classified in Gramene Release 32 (November 2010) on the basis of Compara phylogenetic trees (PMID: 19029536; PMID: 21076153). The following reference genome annotations were used: Oryza sativa, MSU6.1; Brachypodium distachyon, JGI Brachy1.2; and Sorghum bicolor, JGI Sbi1.4 (PMID: 17145706, PMID: 20148030, and PMID: 19189423, respectively).

To make dot plots shown in Fig. S6, we first performed BLASTX analysis of the 17,093 genes and gene fragments against the annotated rice genome. The top rice hit of each Ae. tauschii gene or gene fragment was recorded with its coordinate in the rice genome. The dot plots of individual Ae. tauschii chromosomes were graphed by plotting each marker locus along the physical map against the corresponding top hit in the rice genome.
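The selection of best hits and the conversion of map positions to rank order can be sketched as below. The sketch is illustrative: function names, identifiers, and coordinates are hypothetical, alignments are assumed to be available as (query, reference gene, E-value) tuples, and the DAGchainer collinearity step is not reproduced.

```python
def best_hits(alignments, max_evalue=1e-10):
    """Keep, for each query, the reference gene with the lowest E-value
    among alignments passing the cutoff (one locus per query)."""
    best = {}
    for query, ref, evalue in alignments:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (ref, evalue)
    return {query: ref for query, (ref, _) in best.items()}

def rank_along_chromosome(positions):
    """Map each locus to its 1-based rank by position."""
    ordered = sorted(positions, key=positions.get)
    return {locus: rank for rank, locus in enumerate(ordered, start=1)}

def dot_plot_points(hits, marker_pos, ref_pos):
    """Return (marker rank, reference gene rank) pairs for plotting."""
    marker_rank = rank_along_chromosome(marker_pos)
    ref_rank = rank_along_chromosome(ref_pos)
    return sorted((marker_rank[m], ref_rank[r]) for m, r in hits.items())

# Hypothetical alignments and coordinates
alignments = [("m1", "g2", 1e-50), ("m1", "g9", 1e-12),
              ("m2", "g1", 1e-30), ("m3", "g3", 1e-5)]
hits = best_hits(alignments)
points = dot_plot_points(
    hits,
    marker_pos={"m1": 10.0, "m2": 25.0, "m3": 40.0},
    ref_pos={"g2": 1_000, "g1": 5_000, "g3": 9_000, "g9": 12_000},
)
```

Marker `m3` is dropped because its only alignment fails the E-value cutoff; the two remaining points fall on the diagonal, as expected for collinear loci.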

1. You FM, et al. (2011) Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics 12:59.

2. Qi LL, et al. (2004) A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168(2):701–712.

3. Luo MC, et al. (2009) Genome comparisons reveal a dominant mechanism of chromosome number reduction in grasses and accelerated genome evolution in Triticeae. Proc Natl Acad Sci USA 106(37):15780–15785.

4. Dvorak J, McGuire PE, Cassidy B (1988) Apparent sources of the A genomes of wheats inferred from the polymorphism in abundance and restriction fragment length of repeated nucleotide sequences. Genome 30(5):680–689.

5. Mester DI, et al. (2006) Multilocus consensus genetic maps (MCGM): formulation, algorithms, and results. Comput Biol Chem 30(1):12–20.

6. Akhunov ED, et al. (2010) Nucleotide diversity maps reveal variation in diversity among wheat genomes and chromosomes. BMC Genomics 11:702.

7. Luo MC, et al. (2003) High-throughput fingerprinting of bacterial artificial chromosomes using the SNaPshot labeling kit and sizing of restriction fragments by capillary electrophoresis. Genomics 82(3):378–389.

8. Gu YQ, et al. (2009) A BAC-based physical map of Brachypodium distachyon and its comparative analysis with rice and wheat. BMC Genomics 10:496.

9. Luo MC, et al. (2003) Construction of contigs of Ae. tauschii genomic DNA fragments cloned in BAC and BiBAC vectors. Proceedings of the Tenth International Wheat Genetics Symposium, eds Pogna NE, Romano M, Pogna EA, Galterio G (S.IM.I., Rome, Italy), pp 293–296.

10. Luo MC, et al. (2009) A high-throughput strategy for screening of bacterial artificial chromosome libraries and anchoring of clones on a genetic map constructed with single nucleotide polymorphisms. BMC Genomics 10:28.

11. You FM, et al. (2010) A new implementation of high-throughput five-dimensional clone pooling strategy for BAC library screening. BMC Genomics 11:692.

12. Luo MC, et al. (2010) Feasibility of physical map construction from fingerprinted bacterial artificial chromosome libraries of polyploid plant species. BMC Genomics 11:122.

13. Janda J, et al. (2004) Construction of a subgenomic BAC library specific for chromosomes 1D, 4D and 6D of hexaploid wheat. Theor Appl Genet 109(7):1337–1345.

14. Šimková H, et al. (2011) BAC libraries from wheat chromosome 7D: Efficient tool for positional cloning of aphid resistance genes. J Biomed Biotechnol 2011:302543.

Fig. S1. Comparison of genotype clustering in Cartesian graphs in GenomeStudio outputs for F2 plants of the mapping population (Upper) and 5-D BAC pools (Lower). Upper shows 1,102 Ae. tauschii F2 plants from the cross AS75 × AL8/78 genotyped with an Infinium SNP assay for an SNP in EST locus BE406943. The red and blue clusters contain homozygous genotypes and the purple cluster contains heterozygous genotypes. Lower is a GenomeStudio Cartesian graph from the genotyping of the five-dimensional BAC pools of Ae. tauschii accession AL8/78 with the Infinium SNP assay BE398417Contig1ATwsnp1. We added a constant amount of Ae. tauschii AS75 genomic DNA to each BAC pool to preempt nonspecific PCR amplification in the assay. Each dot in the graph represents a DNA sample, either a BAC pool or genomic DNA of accession AS75 (green dots in the green oval) and accession AL8/78 (red dots in the red oval). The horizontal and vertical coordinates of a dot are quantified fluorescence A and B, which in this specific case measure the amount of AL8/78 nucleotide and AS75 nucleotide in a DNA sample, respectively. Note that only the AL8/78 and AS75 genomic DNAs generated clusters reminiscent of the clusters in the upper panel. We empirically determined that pools with <1.5 times the average fluorescence of the AS75 genomic DNAs in the AL8/78 nucleotide fluorescence spectrum (fluorescence A in this case) were likely negative. Those showing >1.5 times the average fluorescence of the AS75 genomic DNA (BAC pools and genomic DNAs in the blue box) were likely positive.

Fig. S2. A detail of the CB map for the chimeric contig ctg756 consisting of a 3D part (upper part of the contig) and 6D part (lower part of the contig), as indicated by integrated markers. BAC clones are depicted by vertical lines and restriction fragments (numbered at the left) making up BAC fingerprints are depicted by short horizontal lines across the vertical lines symbolizing BAC clones. The absence of a specific restriction fragment in a BAC fingerprint is shown by a small empty oval box on the vertical line depicting a clone. The total number of fragments not fitting clone overlaps is specified above each BAC clone. The join between the 3D and 6D portions of the contig is due to a single, questionable (Q) BAC clone (blue), which has 43 aberrant restriction fragments and is missing virtually all restriction fragments near the join. FPC labeled the clone as a Q clone. The clone is likely a chimera, as evidenced by the truncation of the overlaps of BAC clone fingerprints immediately to the right and immediately to the left of the Q clone. Manual removal of the BAC clone disjoined the contig into two contigs: one located on chromosome 3D and the other located on chromosome 6D.

Fig. S3. Ae. tauschii contig ctg421 illustrating the technique of detecting chimeric contigs by coassembly of Ae. tauschii BAC clones with wheat D-genome BAC clones from subgenomic BAC libraries. The left side of the contig was anchored on 3DS by marker AT3D2563_76. In agreement, Ae. tauschii BAC clones (gray) in that area coassembled with a large number of 3DS clones (dark blue). The portion of ctg421 to the right of chimeric BAC clone RI551G20 (light blue) was devoid of 3DS BAC clones, indicating that that portion of the contig came from another Ae. tauschii chromosome. Markers AT2D1576_102, AT2D1563_102, and AT2D1553_102 anchored that portion of the contig on Ae. tauschii chromosome 2D.

Fig. S4. Ae. tauschii synteny dot plots. (A) Pairwise comparative maps between Ae. tauschii and B. distachyon. To highlight large-scale structural rearrangements, we show only loci exhibiting collinear relationships. These loci are plotted by rank order with parallel order shown in blue and antiparallel order shown in red. Distances on the Ae. tauschii genetic map above the dot plots are such that each tick mark corresponds to 10 cM. Physical distances above the dot plots refer to Ae. tauschii, and each tick mark corresponds to 10 Mb. Physical distances on the right side of the dot plots refer to B. distachyon, and each tick mark corresponds to 1 Mb. (B) Synteny map between Ae. tauschii and rice showing both orthologous relationships (blue) and paralogous relationships (red). Syntenic paralogs are chiefly the result of the pan-grass whole-genome duplication that occurred before the divergence of the rice and Ae. tauschii lineages, although small segmental duplications may also be present. To better highlight structural rearrangements, we show only loci exhibiting collinear relationships, which are plotted by their numeric order along chromosomes, with short arm termini at the bottom left corner. We plot distances on the Ae. tauschii genetic map above the dot plots such that each tick mark corresponds to 10 cM. We plot physical distances above and on the right side of the dot plots. For Ae. tauschii, each tick mark corresponds to 10 Mb, and for rice, each tick mark corresponds to 1 Mb. To recover collinear paralogs, we selected up to five significantly aligned rice genes per marker query and omitted known orthologous relationships before mapping with DAGchainer (PMID:15247098).

Fig. S5. Gene density expressed as the number of genes per Mb (red line) and recombination rate in centiMorgans per megabase (blue line) along the seven Ae. tauschii chromosomes. The short arm terminus is to the left in each graph. Black circles depict centromeres. Intervals along the x axis are 30 Mb long. The arrows show the sites of insertions of the ancient telomeres (chromosomes 1D, 2D, 5D, and 7D) and centromeres (4D) due to NCIs. Gene density was computed from the locations of 17,093 genes and gene fragments.

Fig. S6. Dot plots of the best matches between genes on the physical maps of the 7 Ae. tauschii chromosomes and 12 rice pseudomolecules. Note the concentration of noncollinear genes in the distal regions of the Ae. tauschii chromosomes.

Table S1. BAC and BiBAC libraries and their SNaPshot fingerprinting

Library                         Name  Select. marker  Total clones  Insert size (kb)  Fingerprinted clones  Clones in contig assemblies
BamHI (phase II)                HI    Chl.            92,160        115               57,792                47,957
EcoRI (phase II)                RI    Chl.            172,800       120               155,904               132,715
HindIII (phase II)              HD    Chl.            172,800       125               158,880               143,256
MboI (phase II)                 MI    Chl.            92,160        115               34,368                30,083
HindIII (phase I)               HD    Chl.                          123               22,810                18,428
BAC (phase I)*                  TCM   Chl.                          116               20,233                17,927
BiBAC (phase I)*                TET   Tet.                          114               11,572                8,935
EcoRI (phase I)                 RI    Chl.                          118               50                    50
BamHI (phase I)                 HI    Chl.                          109               51                    51
BamHI BiBAC (phase I)           BB    Tet.                          103               12                    12
HindIII BiBAC (phase I)         HB    Tet.                          125               34                    34
Total                                                 529,929                         461,706               399,448
Weighted insert size mean (kb)                                      120.5

Chl., chloramphenicol; Tet., tetracycline.
*The library origin of the clones is unknown.
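The weighted mean insert size in Table S1 can be checked with a few lines of Python. The table does not state which column supplied the weights; the sketch below assumes that each library's insert size was weighted by its number of fingerprinted clones, an assumption that reproduces the reported 120.5 kb.

```python
# (insert size in kb, fingerprinted clones) for each library in Table S1
libraries = [
    (115, 57_792), (120, 155_904), (125, 158_880), (115, 34_368),
    (123, 22_810), (116, 20_233), (114, 11_572),
    (118, 50), (109, 51), (103, 12), (125, 34),
]

# Assumption: weights are the fingerprinted clone counts
total_fingerprinted = sum(n for _, n in libraries)
weighted_mean_kb = sum(size * n for size, n in libraries) / total_fingerprinted

print(total_fingerprinted, round(weighted_mean_kb, 1))  # 461706 120.5
```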

Table S2. Characterization of Ae. tauschii complete genes and gene fragments with respect to the presence or absence of an ortholog in at least one of the B. distachyon, rice, or sorghum genomes (collinear Ae. tauschii genes) or none (noncollinear Ae. tauschii genes)

Class of Ae. tauschii genes                       Number   Percent
Total number of mapped genes and gene fragments   5,901    100.0
Collinear genes and gene fragments                3,848    65.2
Noncollinear genes                                1,540    26.1
Noncollinear gene fragments                       513      8.7
Total noncollinear genes and gene fragments       2,053    34.8

Table S3. Gene ontology of 4,134 mapped genes allocated to the following four groups: Collinear genes in high-recombination regions (CH), collinear genes in low-recombination regions (CL), noncollinear genes in high-recombination regions (NH), and noncollinear genes in low-recombination regions (NL)

Class       Gene ontology                                       CH     CL     NH     NL     CH %   CL %   NH %   NL %
GO:0000166  Nucleotide binding                                  124    221    66     52     10.3   11.5   10.6   13.6
GO:0003676  Nucleic acid binding                                22     29     5      8      1.8    1.5    0.8    2.1
GO:0003677  DNA binding                                         28     50     10     14     2.3    2.6    1.6    3.7
GO:0003682  Chromatin binding                                   3      4      0      0      0.2    0.2    0.0    0.0
GO:0003700  Sequence-specific DNA binding transcription factor  103    151    38     27     8.6    7.8    6.1    7.1
GO:0003723  RNA binding                                         37     55     12     14     3.1    2.8    1.9    3.7
GO:0003774  Motor activity                                      11     32     2      3      0.9    1.7    0.3    0.8
GO:0003824  Catalytic activity                                  100    173    76     35     8.3    9.0    12.3   9.2
GO:0004518  Nuclease activity                                   5      16     6      5      0.4    0.8    1.0    1.3
GO:0004871  Signal transducer activity                          21     45     6      4      1.7    2.3    1.0    1.0
GO:0004872  Receptor activity                                   27     27     19*    1      2.2    1.4    3.1    0.3
GO:0005198  Structural molecule activity                        30     29     13     4      2.5    1.5    2.1    1.0
GO:0005215  Transporter activity                                66     130    38     22     5.5    6.7    6.1    5.8
GO:0005488  Binding                                             32     53     18     9      2.7    2.7    2.9    2.4
GO:0005515  Protein binding                                     146    223    78     38     12.1   11.6   12.6   9.9
GO:0008135  Translation factor activity, nucleic acid binding   4      14     2      4      0.3    0.7    0.3    1.0
GO:0008289  Lipid binding                                       19     23     2      3      1.6    1.2    0.3    0.8
GO:0016301  Kinase activity                                     125    175    59     31     10.4   9.1    9.5    8.1
GO:0016740  Transferase activity                                75     117    41     35     6.2    6.1    6.6    9.2
GO:0016787  Hydrolase activity                                  163    270    71     57     13.6   14.0   11.5   14.9
GO:0019825  Oxygen binding                                      16     25     19     6      1.3    1.3    3.1    1.6
GO:0030234  Enzyme regulator activity                           8      9      10     1      0.7    0.5    1.6    0.3
GO:0030246  Carbohydrate binding                                37     59     29     9      3.1    3.1    4.7    2.4
Total                                                           1,202  1,930  620    382    100    100    100    100

*GO category in which the proportion of genes in the high recombination region differed at P = 0.01 from that in the low recombination region (Fisher exact test).
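The comparison flagged in Table S3 can be illustrated with a two-sided Fisher exact test on a 2 × 2 table. The sketch below is an assumption about how the test was set up, not a record of the original analysis: it contrasts receptor-activity genes with all other genes in the NH (19 of 620) and NL (1 of 382) groups, and the function is a plain-Python implementation written for this example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Receptor activity (GO:0004872): NH 19 of 620 genes, NL 1 of 382 genes
p_value = fisher_exact_two_sided(19, 620 - 19, 1, 382 - 1)
print(p_value < 0.01)  # True
```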
