a high-definition view of functional genetic variation ... · article a high-definition view of...

17
Article A High-Definition View of Functional Genetic Variation from Natural Yeast Genomes Anders Bergstro ¨m, 1 Jared T. Simpson, 2 Francisco Salinas, 1 Benjamin Barre ´, 1 Leopold Parts, 2,3 Amin Zia, 4,5 Alex N. Nguyen Ba, 4 Alan M. Moses, 4 Edward J. Louis, 6 Ville Mustonen, 2 Jonas Warringer, 7 Richard Durbin, 2 and Gianni Liti* ,1 1 Institute for Research on Cancer and Ageing, Nice (IRCAN), University of Nice, Nice, France 2 The Wellcome Trust Sanger Institute, Cambridge, United Kingdom 3 Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada 4 Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada 5 Stanford Center for Genomics and Personalized Medicine, Stanford University School of Medicine 6 Centre of Genetic Architecture of Complex Traits, University of Leicester, Leicester, United Kingdom 7 Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden *Corresponding author: E-mail: [email protected]. Associate editor: Joshua Akey The sequence data from this study have been submitted to NCBI BioProject under accession no. PRJEB2299 and the SRA/ENA data- bases under study accession no. ERP000361. Abstract The question of how genetic variation in a population influences phenotypic variation and evolution is of major impor- tance in modern biology. Yet much is still unknown about the relative functional importance of different forms of genome variation and how they are shaped by evolutionary processes. Here we address these questions by population level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We find that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymor- phisms being vastly smaller within the former species. This genome content variation, as well as loss-of-function variation in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of variation are enriched for functions mediating interaction with the external environment (sugar transport and metab- olism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful resource for further evolutionary and population genomics studies. Key words: population genomics, functional variation, genome evolution, yeast, subtelomeres, loss-of-function variants. Introduction A central issue in biology is how genetic variation influences variation in organismal phenotypes, as well as how this variation is shaped by evolutionary processes. So far, the emphasis in population genomics studies has been on single-nucleotide polymorphisms (SNPs), which are the most abundant form of sequence variation and therefore the most informative about population history. However, in- creasing attention is being devoted to other forms of variation including structural, genome content, and copy number var- iation (CNV), which might be more likely to have large phe- notypic effects (Conrad et al. 2010; Sudmant et al. 2010). Little is known, however, about the relative biological importance of these various forms of variation in different species. The Baker’s yeast Saccharomyces cerevisiae has emerged as an attractive system in which to address these questions, having long served as an important model organism for mole- cular biology and genetics but also for comparative genomics and the study of genome evolution (Cliften et al. 2003; Kellis et al. 2003; Dujon et al. 2004; Dujon 2010; Hittinger 2013). Recent years have seen a growing interest in using S. cerevisiae and its closest relatives to study natural variation, ecology, and population level genetics and genomics (Replansky et al. 2008; Liti and Schacherer 2011; Hittinger 2013) as well to exploit natural variation in the study of the genetic architecture of traits (Liti and Louis 2012). In one of the first population genomics studies, we revealed the strong population struc- tures of S. cerevisiae and its closest relative S. paradoxus, with most variants being private to specific phylogenetic lineages, and demonstrated the influence of human activity and ß The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons. org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Open Access 872 Mol. Biol. Evol. 31(4):872–888 doi:10.1093/molbev/msu037 Advance Access publication January 14, 2014

Upload: buikiet

Post on 27-Apr-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

Article

A High-Definition View of Functional Genetic Variation fromNatural Yeast GenomesAnders Bergstrom1 Jared T Simpson2 Francisco Salinas1 Benjamin Barre1 Leopold Parts23 Amin Zia45

Alex N Nguyen Ba4 Alan M Moses4 Edward J Louis6 Ville Mustonen2 Jonas Warringer7

Richard Durbin2 and Gianni Liti1

1Institute for Research on Cancer and Ageing Nice (IRCAN) University of Nice Nice France2The Wellcome Trust Sanger Institute Cambridge United Kingdom3Donnelly Centre for Cellular and Biomolecular Research University of Toronto Toronto ON Canada4Department of Cell amp Systems Biology University of Toronto Toronto ON Canada5Stanford Center for Genomics and Personalized Medicine Stanford University School of Medicine6Centre of Genetic Architecture of Complex Traits University of Leicester Leicester United Kingdom7Department of Chemistry and Molecular Biology University of Gothenburg Gothenburg Sweden

Corresponding author E-mail giannilitiunicefr

Associate editor Joshua Akey

The sequence data from this study have been submitted to NCBI BioProject under accession no PRJEB2299 and the SRAENA data-

bases under study accession no ERP000361

Abstract

The question of how genetic variation in a population influences phenotypic variation and evolution is of major impor-tance in modern biology Yet much is still unknown about the relative functional importance of different forms ofgenome variation and how they are shaped by evolutionary processes Here we address these questions by populationlevel sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S paradoxus Wefind that genome content variation in the form of presence or absence as well as copy number of genetic material ishigher within S cerevisiae than within S paradoxus despite genetic distances as measured in single-nucleotide polymor-phisms being vastly smaller within the former species This genome content variation as well as loss-of-function variationin the form of premature stop codons and frameshifting indels is heavily enriched in the subtelomeres stronglyreinforcing the relevance of these regions to functional evolution Genes affected by these likely functional forms ofvariation are enriched for functions mediating interaction with the external environment (sugar transport and metab-olism flocculation metal transport and metabolism) Our results and analyses provide a comprehensive view of genomicdiversity in budding yeast and expose surprising and pronounced differences between the variation within S cerevisiaeand that within S paradoxus We also believe that the sequence data and de novo assemblies will constitute a usefulresource for further evolutionary and population genomics studies

Key words population genomics functional variation genome evolution yeast subtelomeres loss-of-function variants

IntroductionA central issue in biology is how genetic variation influencesvariation in organismal phenotypes as well as how thisvariation is shaped by evolutionary processes So far theemphasis in population genomics studies has been onsingle-nucleotide polymorphisms (SNPs) which are themost abundant form of sequence variation and thereforethe most informative about population history However in-creasing attention is being devoted to other forms of variationincluding structural genome content and copy number var-iation (CNV) which might be more likely to have large phe-notypic effects (Conrad et al 2010 Sudmant et al 2010) Littleis known however about the relative biological importanceof these various forms of variation in different species TheBakerrsquos yeast Saccharomyces cerevisiae has emerged as an

attractive system in which to address these questionshaving long served as an important model organism for mole-cular biology and genetics but also for comparative genomicsand the study of genome evolution (Cliften et al 2003 Kelliset al 2003 Dujon et al 2004 Dujon 2010 Hittinger 2013)Recent years have seen a growing interest in using S cerevisiaeand its closest relatives to study natural variation ecology andpopulation level genetics and genomics (Replansky et al 2008Liti and Schacherer 2011 Hittinger 2013) as well to exploitnatural variation in the study of the genetic architecture oftraits (Liti and Louis 2012) In one of the first populationgenomics studies we revealed the strong population struc-tures of S cerevisiae and its closest relative S paradoxus withmost variants being private to specific phylogenetic lineagesand demonstrated the influence of human activity and

The Author 2014 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby30) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited Open Access872 Mol Biol Evol 31(4)872ndash888 doi101093molbevmsu037 Advance Access publication January 14 2014

domestication on the evolution of the former but not thelatter species (Liti Carter et al 2009) A number of otherstudies have corroborated and extended these findings invarious directions (Doniger et al 2008 Schacherer et al2009 Zhang et al 2010 Hyma et al 2011 Warringer et al2011 Dunn et al 2012 Hyma and Fay 2013) such that acoherent framework for understanding the population struc-ture and evolutionary history of Bakerrsquos yeast is nowemerging

A question of primary interest in yeast population geno-mics has been the extent to which the highly stratified geneticstructure is driven by on the one hand geography and on theother hand by ecology and adaptation to different lifestylesand environmental niches So far the evidence points in favorof geography being the most important factor (Liti Carteret al 2009 Cromie et al 2013 Hittinger 2013) Saccharomycesparadoxus populations found in different parts of the worldare highly diverged at the sequence level do not seem to haverecently exchanged genetic material and in some cases dis-play partial reproductive isolation (Sniegowski et al 2002Koufopanou et al 2006 Liti Carter et al 2009) The geneticstructure of S cerevisiae is similarly characterized by stronggeographical stratification but is nuanced by a larger degree ofadmixture between populations and the presence of strainswith phylogenetically mosaic genomes (Liti Carter et al 2009Cromie et al 2013) This admixture has likely been facilitatedby the association of S cerevisiae with human fermentationpractices and deliberate or accidental dispersal of domesti-cated strains to different parts of the world (Hyma and Fay2013) Overall genetic divergence is much higher in S para-doxus than in S cerevisiae the most highly diverged strains inthe former species being separated by ~35 at the sequencelevel as compared to 05ndash08 in the latter (Liti Carter et al2009) Surprisingly however phenotypic diversity is substan-tially higher in S cerevisiae than in S paradoxus (Liti Carteret al 2009 Warringer et al 2011) This observation is highlyunexpected under a model where phenotypic variation isprimarily the product of gradual accumulation of SNPsthroughout the genome Therefore it raises the hypothesisthat other evolutionary processes potentially involving otherforms of genetic variation are responsible

Several advances have been made in the search for thegenetic variants that underlie phenotypic variation in yeastprimarily in S cerevisiae Due to the high degree of populationstratification genome-wide association studies are problem-atic to carry out in natural yeast populations (Connelly andAkey 2012) but studies utilizing recombinant populationsobtained by crossing diverged parental strains have provenfruitful (Liti and Louis 2012) Quantitative trait loci (QTLs)have been identified for a range of traits including the abilityto grow at high temperature (Steinmetz et al 2002 Sinha et al2008 Parts et al 2011) resistance to numerous chemicalcompounds (Ehrenreich et al 2010 2012) and enologicaltraits (Marullo et al 2007 Ambroset et al 2011 Salinaset al 2012) Although most genotypendashphenotype mappinghas been performed using the same small set of parent strainswith a focus on laboratory strains an increasing number ofstudies are utilizing additional strains that cover more of the

genetic diversity within the species (Wenger et al 2010Cubillos et al 2011 Parts et al 2011 Ehrenreich et al 2012)Less attention has been devoted to the genetic architecture oftraits in S paradoxus though QTLs for telomere length havebeen identified (Liti Haricharan et al 2009) Little is knownabout the evolutionary forces that shape the relationshipbetween genotype and phenotype in natural yeast popula-tions It has been proposed that the predominance of asexualgrowth and self-fertilization in the natural life history of yeast(Ruderfer et al 2006 Tsai et al 2008) makes genetic drift thedominant force driving phenotypic evolution such that del-eterious variants become fixed in specific lineages followingrepeated population bottlenecks (Warringer et al 2011 Zorgoet al 2012) However there are also studies reporting someevidence for a role of positive selection (Fraser et al 2010)

So far population genomics analyses in yeast have mostlyrelied on microarray technology (Schacherer et al 2007 2009)and low-coverage capillary sequencing (Liti Carter et al2009) A number of individual strain genomes have alsobeen sequenced for various purposes mostly using capillaryand pyrosequencing (Wei et al 2007 Borneman et al 2008Doniger et al 2008 Argueso et al 2009 Novo et al 2009Dowell et al 2010 Akao et al 2011 Borneman et al 2011Nijkamp et al 2012 Ralser et al 2012 Zheng et al 2012) Dueto rapid technological advances next-generation sequencingtechnologies generating short reads have become the mostcost-effective choice for population-level whole-genome se-quencing Here we apply high-coverage Illumina sequencingto 42 natural strains from S cerevisiae and S paradoxus Inboth species strains were selected from the set of strainspreviously surveyed using low-coverage capillary sequencing(Liti Carter et al 2009) and comprehensive phenotypic anal-yses (Warringer et al 2011) to be as informative as possibleabout overall genetic diversity We construct de novo assem-blies and perform analyses to advance our understanding ofgenomic diversity and evolution functional variation and thegenotypendashphenotype relationship In addition to SNPs thisdata enable analyses of other kinds of variants such that wecan begin to address directly the aforementioned questionsabout the nature and evolution of genomic variation and itspotential phenotypic relevance We furthermore hope thatthe sequence data and assemblies will constitute a usefuladdition to the growing set of resources available for popu-lation genomics and genotypendashphenotype studies in yeastData and results from the project are accessible at httpwwwmoseslabcsbutorontocasgrp (last accessed January14 2014)

Results and Discussion

De Novo Yeast Genome Assemblies from Short-ReadSequencing Data

We selected 42 haploid (or homozygous diploid) yeast strainsand sequenced their genomes to intermediate-to-high(10ndash60) coverage using paired-end Illumina technologyA subset of six strains were sequenced to very high (400ndash800) coverage (table 1) Genomes of 14 S cerevisiae and 13 Sparadoxus strains for which the coverage was higher than

873

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Tab

le1

Sequ

enci

ng

and

De

Nov

oA

ssem

bly

ofY

east

Stra

inG

enom

es

Stra

inSu

bpop

ula

tion

Sou

rce

Loca

tion

Cov

aN

um

ber

ofSc

affo

ldsb

Ass

emb

lySi

zeb

Con

tig

N50

bM

ax

Scaf

fold

Size

bSc

affo

ldN

50b

Sce

revi

siae

UW

OPS

87-2

421

Mos

aic

Opu

ntia

spp

Haw

aii

821

559

536

116

584

291

167

177

210

365

310

847

354

612

654

612

616

093

920

012

2

UW

OPS

83-7

873

Mos

aic

Opu

ntia

spp

Bah

amas

627

513

495

116

769

431

167

812

211

370

211

370

255

777

188

503

318

788

124

466

1

YPS

128

Nor

thA

mer

ican

Que

rcus

alba

USA

6485

579

111

741

720

11

768

379

948

959

934

145

129

451

896

010

955

518

762

3

SK1

Mos

aic

Soil

USA

561

315

111

911

704

669

11

753

398

491

495

407

232

107

256

961

067

234

267

272

L152

8W

ine

Euro

pean

Win

eC

hile

4995

180

811

565

290

11

594

025

361

083

957

424

618

146

398

655

077

101

460

W30

3M

osai

cLa

bora

tory

USA

481

391

121

911

630

510

11

666

829

345

373

798

620

922

328

454

752

228

122

824

DB

VPG

6765

Win

eEu

rope

anU

nkn

own

Un

know

n46

967

779

116

224

181

167

095

346

113

51

348

280

644

540

506

656

672

083

26

Y12

Sake

Sake

Japa

n44

172

11

582

116

100

641

164

974

725

846

26

998

113

242

229

213

369

446

014

8

DB

VPG

1106

Win

eEu

rope

anG

rap

esA

ust

ralia

431

280

116

311

539

734

11

580

105

266

132

724

617

296

418

308

036

291

47

371

Y55

Mos

aic

Gra

pes

Fran

ce38

151

61

219

116

354

231

170

113

924

282

25

808

263

008

418

695

421

571

418

98

DB

VPG

6044

Wes

tA

fric

anB

iliw

ine

Wes

tA

fric

a35

113

798

311

598

603

11

656

308

421

434

528

223

691

735

097

052

144

94

331

YJM

975

Win

eEu

rope

anC

linic

alIt

aly

333

069

242

911

406

731

11

688

902

570

15

849

334

645

940

37

414

150

38

UW

OPS

03-4

614

Mal

aysi

anB

ertr

amp

alm

Mal

aysi

a32

364

63

213

114

991

461

169

426

85

691

584

440

466

60

006

691

910

747

DB

VPG

1373

Win

eEu

rope

anSo

ilN

eth

erla

nds

321

232

970

115

594

541

162

650

421

249

22

464

185

406

368

851

355

429

852

6

YJM

978

Win

eEu

rope

anC

linic

alIt

aly

32mdash

mdashmdash

mdashmdash

DB

VPG

1788

Win

eEu

rope

anSo

ilFi

nla

nd

30mdash

mdashmdash

mdashmdash

L137

4W

ine

Euro

pean

Win

eC

hile

26mdash

mdashmdash

mdashmdash

BC

187

Win

eEu

rope

anW

ine

USA

23mdash

mdashmdash

mdashmdash

YJM

981

Win

eEu

rope

anC

linic

alIt

aly

16mdash

mdashmdash

mdashmdash

Sp

arad

oxus

Y8

5Eu

rop

ean

Que

rcus

spp

U

K50

243

911

623

026

11

133

254

762

020

411

1

Z1

1Eu

rop

ean

Que

rcus

spp

U

K42

342

641

211

616

631

11

624

232

113

350

115

435

555

399

555

399

238

384

289

391

Y9

6Eu

rop

ean

Que

rcus

spp

U

K40

874

311

615

713

67

180

27

189

283

447

Z1

Euro

pea

nQ

uerc

ussp

p

UK

367

685

116

865

90

102

641

357

366

128

755

Q59

1Eu

rop

ean

Que

rcus

spp

U

K54

931

803

117

144

391

175

763

767

680

73

926

362

301

496

642

710

431

509

95

N-4

4Fa

rEa

ster

nQ

uerc

ussp

p

Rus

sia

512

058

1488

115

761

821

173

337

611

054

12

242

662

541

561

2813

806

36

398

YPS

138

Am

eric

anQ

ve

lutin

aU

SA43

204

715

2911

613

204

11

740

006

110

251

180

169

807

117

759

147

973

328

5

S36

7Eu

rop

ean

Que

rcus

spp

U

K42

115

311

1211

658

851

11

672

986

465

804

688

017

463

818

728

750

727

56

021

Y6

5Eu

rop

ean

Que

rcus

spp

U

K41

113

498

111

655

419

11

704

717

470

305

210

719

704

435

741

450

244

78

826

Y7

Euro

pea

nQ

uerc

ussp

p

UK

391

164

976

116

539

441

172

135

543

703

46

679

163

882

230

460

458

728

876

2

Q95

3Eu

rop

ean

Que

rcus

spp

U

K35

122

91

002

116

604

421

173

403

642

716

46

351

190

059

307

994

482

611

113

11

UFR

J508

16A

mer

ican

Dro

soph

ilasp

p

Bra

zil

34mdash

mdashmdash

mdashmdash

T21

4Eu

rop

ean

Que

rcus

spp

U

K34

118

799

111

656

698

11

699

313

357

884

141

013

132

834

001

940

273

70

866

IFO

1804

Far

East

ern

Que

rcus

spp

R

ussi

a26

mdashmdash

mdashmdash

mdash

W7

Euro

pea

nQ

uerc

ussp

p

UK

232

070

116

476

94

129

27

946

15

145

11

Q74

4Eu

rop

ean

Que

rcus

spp

U

K23

mdashmdash

mdashmdash

mdash

Q89

8Eu

rop

ean

Que

rcus

spp

U

K22

mdashmdash

mdashmdash

mdash

Q62

5Eu

rop

ean

Que

rcus

spp

U

K20

mdashmdash

mdashmdash

mdash

Q69

8Eu

rop

ean

Que

rcus

spp

U

K18

mdashmdash

mdashmdash

mdash

Y8

1Eu

rop

ean

Que

rcus

spp

U

K16

mdashmdash

mdashmdash

mdash

KPN

3829

Euro

pea

nQ

uerc

ussp

p

Rus

sia

16mdash

mdashmdash

mdashmdash

Q32

3Eu

rop

ean

Que

rcus

spp

U

K12

mdashmdash

mdashmdash

mdash

NO

TEmdash

Stra

ins

wit

hout

valu

esw

ere

not

deno

voas

sem

bled

Add

itio

nal

info

rmat

ion

ofth

ese

quen

ced

stra

ins

was

repo

rted

inLi

tiC

arte

ret

al(

2009

)T

heS

cere

visi

aeY

12or

igin

ally

repo

rted

from

Ivor

yC

oast

Palm

win

esu

bseq

uent

lycl

arifi

ed(F

ayJ

pers

onal

com

mun

icat

ion)

that

the

stra

inse

nt

was

the

Sake

stra

inK

12fr

omJa

pan

a Ref

ers

tora

wco

vera

ge

bV

alue

sbe

fore

and

afte

rfo

rwar

dsl

ashe

sco

rres

pond

tobe

fore

and

afte

rad

diti

onal

scaf

fold

ing

wit

hlo

wco

vera

gep

aire

d-en

dSa

nge

rda

ta

All

valu

esin

unit

sof

base

-pai

rs

874

Bergstrom et al doi101093molbevmsu037 MBE

20 after quality filtering were de novo assembled The sixstrains with very high coverage were used to explore the effectof sequencing coverage on de novo assembly quality by as-sembling data subsets systematically downsampled to differ-ent coverage levels We find that assembly quality increaseswith increasing coverage up to but not above a level of about120 (supplementary fig S1 Supplementary Materialonline) As almost all strains were previously sequenced tolow coverage (1ndash4) using Sanger technology (Liti Carteret al 2009) and paired-end libraries with long inserts (meaninsert size 4480 bp) we could improve our assemblies byincluding this data Mapping Sanger reads onto the Illuminaassemblies and performing scaffolding based on paired-endinformation improved assembly connectivity substantiallywith an increase in average scaffold N50 from 648 to1181 kb (table 1)

The 27 de novo assemblies have total sizes between 1158and 1177 Mb suggesting little variability in yeast genome sizeThis is 32ndash48 smaller than the completed S cerevisiae ref-erence genome of 1216 Mb indicating a slight underestima-tion of true genome sizes presumably due to collapse in theassemblies of transposable elements and perhaps a few otherrepeat regions Underestimates are compatible with a re-ported Ty transposon content of 1ndash3 in these strains anda higher transposon content of 35 in the reference strainS288c (Liti Carter et al 2009) For the majority of strains thelargest scaffolds are longer than the smallest chromosomes ofthe reference genome (table 1) Some scaffolds in the highercoverage assemblies correspond to near full-length chromo-somes This illustrates that high-quality assembly of yeast ge-nomes is achievable using short-read sequencing data Thestructurally complex subtelomeric regions did not generallyassemble well however (with some exceptions see below) aknown problem in genome assembly

Assembly Scaffolding by Genetic Linkage RevealsExtensive Structural Conservation exceptin Subtelomeres

Four of the S cerevisiae strains sequenced the North Amer-ican YPS128 the West African DBVPG6044 the SakeJapa-nese Y12 and the WineEuropean DBVPG6765 have beenused as parents for advanced intercross lines from whichlarge numbers of highly recombined offspring genomeshave been sequenced (Cubillos et al 2013 Illingworth et al2013) The genetic linkage that manifests between nearby lociin these artificial populations was here put to use to furtherscaffold the de novo assemblies of these four strains Thisresulted in chromosome sized scaffolds for all 16 nuclearchromosomes in these strains containing 946ndash961 ofthe total assembly sequence and scaffold N50 values of852ndash890 kb

The linkage-assisted assemblies exposed the near-completecolinearity and structural conservation of the genomes offour of the major phylogenetic lineages in S cerevisiae(fig 1A) This is consistent with high rates of meiotic sporeviability observed in crosses between these strains (Cubilloset al 2011) as large-scale structural variation would impair

proper segregation of chromosomes as well as the highdegree of karyotype conservation within the S sensu strictospecies clade (Fischer et al 2000 Liti et al 2013) Interestinglythe high degree of structural conservation does not extendinto subtelomeric regions consistent with their rapid evolu-tion and high variability (Liti et al 2005 Brown et al 2010)Lower assembly quality in these regions makes analysis diffi-cult however the genetic linkage data allowed the identifica-tion of some cases of structural variation in the subtelomeresrelative to the S cerevisiae reference genome (fig 1B) We alsofind subtelomeric material that is present in some strains andabsent in others for instance a segment of approximately18 kb that localizes to the right subtelomere of chromosomeXIII and assembled well in the North American and WestAfrican S cerevisiae strains (fig 1C) This genomic region isabsent from the WineEuropean SakeJapanese as well as theS cerevisiae reference strain while being present in all S para-doxus strains constituting an example of a subtelomericregion that has recently been lost in certain S cerevisiaestrains These kind of structural differences have implicationsfor QTL mapping studies that rely on a reference genomebecause the causative sequence responsible for a phenotypicassociation might in fact be located in a different subtelomerein the mapping strains or simply be absent from the referencegenome thus risking that the search for candidates is di-rected to the incorrect genomic region (Cubillos et al 2011)

Genome and Gene Content Variation in S cerevisiaeExceeds That in S paradoxus

High quality de novo genome assemblies enable the system-atic identification of long sequence segments that are presentin only a subset of yeast strains such as the region exemplifiedin figure 1C We made pairwise comparisons between straingenomes and summed the total length of sequence regionslarger than 1 kb that are present in one strain but not theother and refer to this as genome content variation InS cerevisiae this sum is always larger than the number ofSNPs between a pair of strains for all possible strain compar-isons This is not the case for S paradoxus surprisingly as theamount of genome content variation within this species islower than within S cerevisiae despite genetic variation in theform of SNPs being almost an order of magnitude larger(fig 2A) In both species there is a positive correlation be-tween the SNP distance between strains and the amount ofgenome content difference but the correlation is muchweaker in S cerevisiae than in S paradoxus (r = 051 andr = 097 respectively) The highly variable relationship inS cerevisiae suggests that this kind of variation is the productof a different mode of evolution than the clocklike nature ofSNP accumulation Nonetheless clustering strains based ongenome content differences recapitulates the known popu-lation structures of both species (supplementary fig S2Supplementary Material online) To ensure that the detecteddifferences reflects true underlying genome variation we val-idated the presence and absence of 14 variable regions acrossstrains by polymerase chain reaction (PCR) confirming the denovo assembly predictions in 326327 cases as well as

875

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

compared one of our assemblies to an alternative assemblyfor the same strain and found no differences (see supplemen-tary fig S3 Supplementary Material online and Materials andMethods) The observation that genome content variation isrelatively larger within S cerevisiae than within S paradoxus ishighly unexpected under a neutral model of genome

evolution and implies pronounced differences in the evolu-tionary histories of these two species Although challenging toestablish at present we suggest that this unexpected excess ofgenome content variation in S cerevisiae is likely to be a majorcontributor to the equally surprising excess of phenotypicvariability within this species We furthermore note that

A

B C

890 892 894 896 898 900 902 904 906 908

Position on chromosomal scaffold XIII (kb)

FIG 1 Yeast genome structures revealed by de novo assemblies augmented by genetic linkage data (A) Scaffolding de novo assemblies using geneticlinkage information from advanced intercross lines dramatically improves assembly connectivity and reveals extensive structural conservation of thecore chromosomes in four of the major S cerevisiae lineages Displayed is a dot plot of sequence similarity between the assembly scaffolds of the strainYPS128 from the North American phylogenetic lineage and the 16 nuclear chromosomes of the S cerevisiae reference genome (strain S288c) before andafter the incorporation of the genetic linkage data into the scaffolding process After scaffolding by genetic linkage the majority of the assemblysequence is contained in 16 large scaffolds that are collinear with the chromosomes of the reference genome Results are highly similar for the otherthree strains for which genetic linkage data is available the West African strain DBVPG6044 the WineEuropean strain DBVPG6765 and the sakeJapanese strain Y12 (the recent sequencing of the sake strain Kyokai no 7 (Akao et al 2011) revealed two intrachromosomal inversions in chromosomesV and XIV in relation to the reference strain S288c however these are not shared by the sake strain Y12 sequenced here) Only scaffolds bigger than50 kb are displayed (B) Structural rearrangements relative to the chromosome organization of the S cerevisiae reference genome all localized to thesubtelomeric regions A directed arrow indicates that a sequence region is aligning to the part of the reference genome where the arrow starts but in thede novo assembly is located in the part of the genome corresponding to where the arrow ends (C) A subtelomeric 18-kb region that assembled well inseveral strains and could be localized by genetic linkage is displayed with coordinates corresponding to the YPS128 chromosome XIII scaffold Six geneswere found in this region by ab initio gene prediction (arrows indicate coding direction)

876

Bergstrom et al doi101093molbevmsu037 MBE

there are considerable amounts of genomic sequence that ispresent in one or more natural strains but absent in the Scerevisiae reference strain S288c (supplementary fig S4Supplementary Material online)

We annotated the de novo assemblies by homology andsynteny comparisons to the reference genome Out of 5774nondubious open reading frames (ORFs) in the S288cS cerevisiae reference genome we recovered a median of5417 ORFs (94) per strain with most failures to recover agene appearing to be caused by collapse in the assemblies ofvery close paralogs We also identified genes that are notpresent in the reference genome The number of nonrefer-ence genes present in a S cerevisiae strain genome rangesfrom four in the lab strain W303 which is very closely relatedto the reference strain S288c to 61 in the mosaic strainUWOPS87-2421 with a median of 31 genes per strain(fig 2B) Using the four S cerevisiae strains for which genetic

linkage data allowed assembly of chromosome-sized scaffoldswe estimate that at least 75 of these nonreference genes arelocated in the subtelomeric parts of the chromosomes Byexploiting distant sequence similarity to functionally anno-tated reference genome proteins we find that the set ofnonreference genes is enriched for gene ontology termsrelated to flocculation and sugarmdashparticularly maltosemdashtransport and metabolism These results are consistent withfindings on the evolutionary properties of subtelomeric genesacross larger evolutionary distances (Brown et al 2010)

CNV Is Extensive in Subtelomeres and Greater inS cerevisiae than in S paradoxus

Whereas the analysis described in the previous section is onlypowered to detect binary presence and absence due to thetendency of similar copies to collapse during de novo

A

B

FIG 2 Genome content variation within natural yeast populations (A) The relationship between genetic distance between strains as measured in SNPsand the amount of genomic material being presentabsent between strains All pairwise strain comparisons within each of the two species are included(B) The number of nonreference genes found in each strain genome Strain colors denote subpopulation origin (for S cerevisiae green = WineEuropeanred = West African cyan = Malaysian yellow = North American dark blue = SakeJapanese black = mosaic genome for S paradoxus orange =American brown = Far Eastern magenta = European) The strain trees are neighbor-joining trees based on genome-wide SNP distances and thescale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species)

877

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

domestication on the evolution of the former but not thelatter species (Liti Carter et al 2009) A number of otherstudies have corroborated and extended these findings invarious directions (Doniger et al 2008 Schacherer et al2009 Zhang et al 2010 Hyma et al 2011 Warringer et al2011 Dunn et al 2012 Hyma and Fay 2013) such that acoherent framework for understanding the population struc-ture and evolutionary history of Bakerrsquos yeast is nowemerging

A question of primary interest in yeast population geno-mics has been the extent to which the highly stratified geneticstructure is driven by on the one hand geography and on theother hand by ecology and adaptation to different lifestylesand environmental niches So far the evidence points in favorof geography being the most important factor (Liti Carteret al 2009 Cromie et al 2013 Hittinger 2013) Saccharomycesparadoxus populations found in different parts of the worldare highly diverged at the sequence level do not seem to haverecently exchanged genetic material and in some cases dis-play partial reproductive isolation (Sniegowski et al 2002Koufopanou et al 2006 Liti Carter et al 2009) The geneticstructure of S cerevisiae is similarly characterized by stronggeographical stratification but is nuanced by a larger degree ofadmixture between populations and the presence of strainswith phylogenetically mosaic genomes (Liti Carter et al 2009Cromie et al 2013) This admixture has likely been facilitatedby the association of S cerevisiae with human fermentationpractices and deliberate or accidental dispersal of domesti-cated strains to different parts of the world (Hyma and Fay2013) Overall genetic divergence is much higher in S para-doxus than in S cerevisiae the most highly diverged strains inthe former species being separated by ~35 at the sequencelevel as compared to 05ndash08 in the latter (Liti Carter et al2009) Surprisingly however phenotypic diversity is substan-tially higher in S cerevisiae than in S paradoxus (Liti Carteret al 2009 Warringer et al 2011) This observation is highlyunexpected under a model where phenotypic variation isprimarily the product of gradual accumulation of SNPsthroughout the genome Therefore it raises the hypothesisthat other evolutionary processes potentially involving otherforms of genetic variation are responsible

Several advances have been made in the search for thegenetic variants that underlie phenotypic variation in yeastprimarily in S cerevisiae Due to the high degree of populationstratification genome-wide association studies are problem-atic to carry out in natural yeast populations (Connelly andAkey 2012) but studies utilizing recombinant populationsobtained by crossing diverged parental strains have provenfruitful (Liti and Louis 2012) Quantitative trait loci (QTLs)have been identified for a range of traits including the abilityto grow at high temperature (Steinmetz et al 2002 Sinha et al2008 Parts et al 2011) resistance to numerous chemicalcompounds (Ehrenreich et al 2010 2012) and enologicaltraits (Marullo et al 2007 Ambroset et al 2011 Salinaset al 2012) Although most genotypendashphenotype mappinghas been performed using the same small set of parent strainswith a focus on laboratory strains an increasing number ofstudies are utilizing additional strains that cover more of the

genetic diversity within the species (Wenger et al 2010Cubillos et al 2011 Parts et al 2011 Ehrenreich et al 2012)Less attention has been devoted to the genetic architecture oftraits in S paradoxus though QTLs for telomere length havebeen identified (Liti Haricharan et al 2009) Little is knownabout the evolutionary forces that shape the relationshipbetween genotype and phenotype in natural yeast popula-tions It has been proposed that the predominance of asexualgrowth and self-fertilization in the natural life history of yeast(Ruderfer et al 2006 Tsai et al 2008) makes genetic drift thedominant force driving phenotypic evolution such that del-eterious variants become fixed in specific lineages followingrepeated population bottlenecks (Warringer et al 2011 Zorgoet al 2012) However there are also studies reporting someevidence for a role of positive selection (Fraser et al 2010)

So far population genomics analyses in yeast have mostlyrelied on microarray technology (Schacherer et al 2007 2009)and low-coverage capillary sequencing (Liti Carter et al2009) A number of individual strain genomes have alsobeen sequenced for various purposes mostly using capillaryand pyrosequencing (Wei et al 2007 Borneman et al 2008Doniger et al 2008 Argueso et al 2009 Novo et al 2009Dowell et al 2010 Akao et al 2011 Borneman et al 2011Nijkamp et al 2012 Ralser et al 2012 Zheng et al 2012) Dueto rapid technological advances next-generation sequencingtechnologies generating short reads have become the mostcost-effective choice for population-level whole-genome se-quencing Here we apply high-coverage Illumina sequencingto 42 natural strains from S cerevisiae and S paradoxus Inboth species strains were selected from the set of strainspreviously surveyed using low-coverage capillary sequencing(Liti Carter et al 2009) and comprehensive phenotypic anal-yses (Warringer et al 2011) to be as informative as possibleabout overall genetic diversity We construct de novo assem-blies and perform analyses to advance our understanding ofgenomic diversity and evolution functional variation and thegenotypendashphenotype relationship In addition to SNPs thisdata enable analyses of other kinds of variants such that wecan begin to address directly the aforementioned questionsabout the nature and evolution of genomic variation and itspotential phenotypic relevance We furthermore hope thatthe sequence data and assemblies will constitute a usefuladdition to the growing set of resources available for popu-lation genomics and genotypendashphenotype studies in yeastData and results from the project are accessible at httpwwwmoseslabcsbutorontocasgrp (last accessed January14 2014)

Results and Discussion

De Novo Yeast Genome Assemblies from Short-ReadSequencing Data

We selected 42 haploid (or homozygous diploid) yeast strainsand sequenced their genomes to intermediate-to-high(10ndash60) coverage using paired-end Illumina technologyA subset of six strains were sequenced to very high (400ndash800) coverage (table 1) Genomes of 14 S cerevisiae and 13 Sparadoxus strains for which the coverage was higher than

873

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Tab

le1

Sequ

enci

ng

and

De

Nov

oA

ssem

bly

ofY

east

Stra

inG

enom

es

Stra

inSu

bpop

ula

tion

Sou

rce

Loca

tion

Cov

aN

um

ber

ofSc

affo

ldsb

Ass

emb

lySi

zeb

Con

tig

N50

bM

ax

Scaf

fold

Size

bSc

affo

ldN

50b

Sce

revi

siae

UW

OPS

87-2

421

Mos

aic

Opu

ntia

spp

Haw

aii

821

559

536

116

584

291

167

177

210

365

310

847

354

612

654

612

616

093

920

012

2

UW

OPS

83-7

873

Mos

aic

Opu

ntia

spp

Bah

amas

627

513

495

116

769

431

167

812

211

370

211

370

255

777

188

503

318

788

124

466

1

YPS

128

Nor

thA

mer

ican

Que

rcus

alba

USA

6485

579

111

741

720

11

768

379

948

959

934

145

129

451

896

010

955

518

762

3

SK1

Mos

aic

Soil

USA

561

315

111

911

704

669

11

753

398

491

495

407

232

107

256

961

067

234

267

272

L152

8W

ine

Euro

pean

Win

eC

hile

4995

180

811

565

290

11

594

025

361

083

957

424

618

146

398

655

077

101

460

W30

3M

osai

cLa

bora

tory

USA

481

391

121

911

630

510

11

666

829

345

373

798

620

922

328

454

752

228

122

824

DB

VPG

6765

Win

eEu

rope

anU

nkn

own

Un

know

n46

967

779

116

224

181

167

095

346

113

51

348

280

644

540

506

656

672

083

26

Y12

Sake

Sake

Japa

n44

172

11

582

116

100

641

164

974

725

846

26

998

113

242

229

213

369

446

014

8

DB

VPG

1106

Win

eEu

rope

anG

rap

esA

ust

ralia

431

280

116

311

539

734

11

580

105

266

132

724

617

296

418

308

036

291

47

371

Y55

Mos

aic

Gra

pes

Fran

ce38

151

61

219

116

354

231

170

113

924

282

25

808

263

008

418

695

421

571

418

98

DB

VPG

6044

Wes

tA

fric

anB

iliw

ine

Wes

tA

fric

a35

113

798

311

598

603

11

656

308

421

434

528

223

691

735

097

052

144

94

331

YJM

975

Win

eEu

rope

anC

linic

alIt

aly

333

069

242

911

406

731

11

688

902

570

15

849

334

645

940

37

414

150

38

UW

OPS

03-4

614

Mal

aysi

anB

ertr

amp

alm

Mal

aysi

a32

364

63

213

114

991

461

169

426

85

691

584

440

466

60

006

691

910

747

DB

VPG

1373

Win

eEu

rope

anSo

ilN

eth

erla

nds

321

232

970

115

594

541

162

650

421

249

22

464

185

406

368

851

355

429

852

6

YJM

978

Win

eEu

rope

anC

linic

alIt

aly

32mdash

mdashmdash

mdashmdash

DB

VPG

1788

Win

eEu

rope

anSo

ilFi

nla

nd

30mdash

mdashmdash

mdashmdash

L137

4W

ine

Euro

pean

Win

eC

hile

26mdash

mdashmdash

mdashmdash

BC

187

Win

eEu

rope

anW

ine

USA

23mdash

mdashmdash

mdashmdash

YJM

981

Win

eEu

rope

anC

linic

alIt

aly

16mdash

mdashmdash

mdashmdash

Sp

arad

oxus

Y8

5Eu

rop

ean

Que

rcus

spp

U

K50

243

911

623

026

11

133

254

762

020

411

1

Z1

1Eu

rop

ean

Que

rcus

spp

U

K42

342

641

211

616

631

11

624

232

113

350

115

435

555

399

555

399

238

384

289

391

Y9

6Eu

rop

ean

Que

rcus

spp

U

K40

874

311

615

713

67

180

27

189

283

447

Z1

Euro

pea

nQ

uerc

ussp

p

UK

367

685

116

865

90

102

641

357

366

128

755

Q59

1Eu

rop

ean

Que

rcus

spp

U

K54

931

803

117

144

391

175

763

767

680

73

926

362

301

496

642

710

431

509

95

N-4

4Fa

rEa

ster

nQ

uerc

ussp

p

Rus

sia

512

058

1488

115

761

821

173

337

611

054

12

242

662

541

561

2813

806

36

398

YPS

138

Am

eric

anQ

ve

lutin

aU

SA43

204

715

2911

613

204

11

740

006

110

251

180

169

807

117

759

147

973

328

5

S36

7Eu

rop

ean

Que

rcus

spp

U

K42

115

311

1211

658

851

11

672

986

465

804

688

017

463

818

728

750

727

56

021

Y6

5Eu

rop

ean

Que

rcus

spp

U

K41

113

498

111

655

419

11

704

717

470

305

210

719

704

435

741

450

244

78

826

Y7

Euro

pea

nQ

uerc

ussp

p

UK

391

164

976

116

539

441

172

135

543

703

46

679

163

882

230

460

458

728

876

2

Q95

3Eu

rop

ean

Que

rcus

spp

U

K35

122

91

002

116

604

421

173

403

642

716

46

351

190

059

307

994

482

611

113

11

UFR

J508

16A

mer

ican

Dro

soph

ilasp

p

Bra

zil

34mdash

mdashmdash

mdashmdash

T21

4Eu

rop

ean

Que

rcus

spp

U

K34

118

799

111

656

698

11

699

313

357

884

141

013

132

834

001

940

273

70

866

IFO

1804

Far

East

ern

Que

rcus

spp

R

ussi

a26

mdashmdash

mdashmdash

mdash

W7

Euro

pea

nQ

uerc

ussp

p

UK

232

070

116

476

94

129

27

946

15

145

11

Q74

4Eu

rop

ean

Que

rcus

spp

U

K23

mdashmdash

mdashmdash

mdash

Q89

8Eu

rop

ean

Que

rcus

spp

U

K22

mdashmdash

mdashmdash

mdash

Q62

5Eu

rop

ean

Que

rcus

spp

U

K20

mdashmdash

mdashmdash

mdash

Q69

8Eu

rop

ean

Que

rcus

spp

U

K18

mdashmdash

mdashmdash

mdash

Y8

1Eu

rop

ean

Que

rcus

spp

U

K16

mdashmdash

mdashmdash

mdash

KPN

3829

Euro

pea

nQ

uerc

ussp

p

Rus

sia

16mdash

mdashmdash

mdashmdash

Q32

3Eu

rop

ean

Que

rcus

spp

U

K12

mdashmdash

mdashmdash

mdash

NO

TEmdash

Stra

ins

wit

hout

valu

esw

ere

not

deno

voas

sem

bled

Add

itio

nal

info

rmat

ion

ofth

ese

quen

ced

stra

ins

was

repo

rted

inLi

tiC

arte

ret

al(

2009

)T

heS

cere

visi

aeY

12or

igin

ally

repo

rted

from

Ivor

yC

oast

Palm

win

esu

bseq

uent

lycl

arifi

ed(F

ayJ

pers

onal

com

mun

icat

ion)

that

the

stra

inse

nt

was

the

Sake

stra

inK

12fr

omJa

pan

a Ref

ers

tora

wco

vera

ge

bV

alue

sbe

fore

and

afte

rfo

rwar

dsl

ashe

sco

rres

pond

tobe

fore

and

afte

rad

diti

onal

scaf

fold

ing

wit

hlo

wco

vera

gep

aire

d-en

dSa

nge

rda

ta

All

valu

esin

unit

sof

base

-pai

rs

874

Bergstrom et al doi101093molbevmsu037 MBE

20 after quality filtering were de novo assembled The sixstrains with very high coverage were used to explore the effectof sequencing coverage on de novo assembly quality by as-sembling data subsets systematically downsampled to differ-ent coverage levels We find that assembly quality increaseswith increasing coverage up to but not above a level of about120 (supplementary fig S1 Supplementary Materialonline) As almost all strains were previously sequenced tolow coverage (1ndash4) using Sanger technology (Liti Carteret al 2009) and paired-end libraries with long inserts (meaninsert size 4480 bp) we could improve our assemblies byincluding this data Mapping Sanger reads onto the Illuminaassemblies and performing scaffolding based on paired-endinformation improved assembly connectivity substantiallywith an increase in average scaffold N50 from 648 to1181 kb (table 1)

The 27 de novo assemblies have total sizes between 1158and 1177 Mb suggesting little variability in yeast genome sizeThis is 32ndash48 smaller than the completed S cerevisiae ref-erence genome of 1216 Mb indicating a slight underestima-tion of true genome sizes presumably due to collapse in theassemblies of transposable elements and perhaps a few otherrepeat regions Underestimates are compatible with a re-ported Ty transposon content of 1ndash3 in these strains anda higher transposon content of 35 in the reference strainS288c (Liti Carter et al 2009) For the majority of strains thelargest scaffolds are longer than the smallest chromosomes ofthe reference genome (table 1) Some scaffolds in the highercoverage assemblies correspond to near full-length chromo-somes This illustrates that high-quality assembly of yeast ge-nomes is achievable using short-read sequencing data Thestructurally complex subtelomeric regions did not generallyassemble well however (with some exceptions see below) aknown problem in genome assembly

Assembly Scaffolding by Genetic Linkage RevealsExtensive Structural Conservation exceptin Subtelomeres

Four of the S cerevisiae strains sequenced the North Amer-ican YPS128 the West African DBVPG6044 the SakeJapa-nese Y12 and the WineEuropean DBVPG6765 have beenused as parents for advanced intercross lines from whichlarge numbers of highly recombined offspring genomeshave been sequenced (Cubillos et al 2013 Illingworth et al2013) The genetic linkage that manifests between nearby lociin these artificial populations was here put to use to furtherscaffold the de novo assemblies of these four strains Thisresulted in chromosome sized scaffolds for all 16 nuclearchromosomes in these strains containing 946ndash961 ofthe total assembly sequence and scaffold N50 values of852ndash890 kb

The linkage-assisted assemblies exposed the near-completecolinearity and structural conservation of the genomes offour of the major phylogenetic lineages in S cerevisiae(fig 1A) This is consistent with high rates of meiotic sporeviability observed in crosses between these strains (Cubilloset al 2011) as large-scale structural variation would impair

proper segregation of chromosomes as well as the highdegree of karyotype conservation within the S sensu strictospecies clade (Fischer et al 2000 Liti et al 2013) Interestinglythe high degree of structural conservation does not extendinto subtelomeric regions consistent with their rapid evolu-tion and high variability (Liti et al 2005 Brown et al 2010)Lower assembly quality in these regions makes analysis diffi-cult however the genetic linkage data allowed the identifica-tion of some cases of structural variation in the subtelomeresrelative to the S cerevisiae reference genome (fig 1B) We alsofind subtelomeric material that is present in some strains andabsent in others for instance a segment of approximately18 kb that localizes to the right subtelomere of chromosomeXIII and assembled well in the North American and WestAfrican S cerevisiae strains (fig 1C) This genomic region isabsent from the WineEuropean SakeJapanese as well as theS cerevisiae reference strain while being present in all S para-doxus strains constituting an example of a subtelomericregion that has recently been lost in certain S cerevisiaestrains These kind of structural differences have implicationsfor QTL mapping studies that rely on a reference genomebecause the causative sequence responsible for a phenotypicassociation might in fact be located in a different subtelomerein the mapping strains or simply be absent from the referencegenome thus risking that the search for candidates is di-rected to the incorrect genomic region (Cubillos et al 2011)

Genome and Gene Content Variation in S cerevisiaeExceeds That in S paradoxus

High quality de novo genome assemblies enable the system-atic identification of long sequence segments that are presentin only a subset of yeast strains such as the region exemplifiedin figure 1C We made pairwise comparisons between straingenomes and summed the total length of sequence regionslarger than 1 kb that are present in one strain but not theother and refer to this as genome content variation InS cerevisiae this sum is always larger than the number ofSNPs between a pair of strains for all possible strain compar-isons This is not the case for S paradoxus surprisingly as theamount of genome content variation within this species islower than within S cerevisiae despite genetic variation in theform of SNPs being almost an order of magnitude larger(fig 2A) In both species there is a positive correlation be-tween the SNP distance between strains and the amount ofgenome content difference but the correlation is muchweaker in S cerevisiae than in S paradoxus (r = 051 andr = 097 respectively) The highly variable relationship inS cerevisiae suggests that this kind of variation is the productof a different mode of evolution than the clocklike nature ofSNP accumulation Nonetheless clustering strains based ongenome content differences recapitulates the known popu-lation structures of both species (supplementary fig S2Supplementary Material online) To ensure that the detecteddifferences reflects true underlying genome variation we val-idated the presence and absence of 14 variable regions acrossstrains by polymerase chain reaction (PCR) confirming the denovo assembly predictions in 326327 cases as well as

875

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

compared one of our assemblies to an alternative assemblyfor the same strain and found no differences (see supplemen-tary fig S3 Supplementary Material online and Materials andMethods) The observation that genome content variation isrelatively larger within S cerevisiae than within S paradoxus ishighly unexpected under a neutral model of genome

evolution and implies pronounced differences in the evolu-tionary histories of these two species Although challenging toestablish at present we suggest that this unexpected excess ofgenome content variation in S cerevisiae is likely to be a majorcontributor to the equally surprising excess of phenotypicvariability within this species We furthermore note that

A

B C

890 892 894 896 898 900 902 904 906 908

Position on chromosomal scaffold XIII (kb)

FIG 1 Yeast genome structures revealed by de novo assemblies augmented by genetic linkage data (A) Scaffolding de novo assemblies using geneticlinkage information from advanced intercross lines dramatically improves assembly connectivity and reveals extensive structural conservation of thecore chromosomes in four of the major S cerevisiae lineages Displayed is a dot plot of sequence similarity between the assembly scaffolds of the strainYPS128 from the North American phylogenetic lineage and the 16 nuclear chromosomes of the S cerevisiae reference genome (strain S288c) before andafter the incorporation of the genetic linkage data into the scaffolding process After scaffolding by genetic linkage the majority of the assemblysequence is contained in 16 large scaffolds that are collinear with the chromosomes of the reference genome Results are highly similar for the otherthree strains for which genetic linkage data is available the West African strain DBVPG6044 the WineEuropean strain DBVPG6765 and the sakeJapanese strain Y12 (the recent sequencing of the sake strain Kyokai no 7 (Akao et al 2011) revealed two intrachromosomal inversions in chromosomesV and XIV in relation to the reference strain S288c however these are not shared by the sake strain Y12 sequenced here) Only scaffolds bigger than50 kb are displayed (B) Structural rearrangements relative to the chromosome organization of the S cerevisiae reference genome all localized to thesubtelomeric regions A directed arrow indicates that a sequence region is aligning to the part of the reference genome where the arrow starts but in thede novo assembly is located in the part of the genome corresponding to where the arrow ends (C) A subtelomeric 18-kb region that assembled well inseveral strains and could be localized by genetic linkage is displayed with coordinates corresponding to the YPS128 chromosome XIII scaffold Six geneswere found in this region by ab initio gene prediction (arrows indicate coding direction)

876

Bergstrom et al doi101093molbevmsu037 MBE

there are considerable amounts of genomic sequence that ispresent in one or more natural strains but absent in the Scerevisiae reference strain S288c (supplementary fig S4Supplementary Material online)

We annotated the de novo assemblies by homology andsynteny comparisons to the reference genome Out of 5774nondubious open reading frames (ORFs) in the S288cS cerevisiae reference genome we recovered a median of5417 ORFs (94) per strain with most failures to recover agene appearing to be caused by collapse in the assemblies ofvery close paralogs We also identified genes that are notpresent in the reference genome The number of nonrefer-ence genes present in a S cerevisiae strain genome rangesfrom four in the lab strain W303 which is very closely relatedto the reference strain S288c to 61 in the mosaic strainUWOPS87-2421 with a median of 31 genes per strain(fig 2B) Using the four S cerevisiae strains for which genetic

linkage data allowed assembly of chromosome-sized scaffoldswe estimate that at least 75 of these nonreference genes arelocated in the subtelomeric parts of the chromosomes Byexploiting distant sequence similarity to functionally anno-tated reference genome proteins we find that the set ofnonreference genes is enriched for gene ontology termsrelated to flocculation and sugarmdashparticularly maltosemdashtransport and metabolism These results are consistent withfindings on the evolutionary properties of subtelomeric genesacross larger evolutionary distances (Brown et al 2010)

CNV Is Extensive in Subtelomeres and Greater inS cerevisiae than in S paradoxus

Whereas the analysis described in the previous section is onlypowered to detect binary presence and absence due to thetendency of similar copies to collapse during de novo

A

B

FIG 2 Genome content variation within natural yeast populations (A) The relationship between genetic distance between strains as measured in SNPsand the amount of genomic material being presentabsent between strains All pairwise strain comparisons within each of the two species are included(B) The number of nonreference genes found in each strain genome Strain colors denote subpopulation origin (for S cerevisiae green = WineEuropeanred = West African cyan = Malaysian yellow = North American dark blue = SakeJapanese black = mosaic genome for S paradoxus orange =American brown = Far Eastern magenta = European) The strain trees are neighbor-joining trees based on genome-wide SNP distances and thescale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species)

877

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

Tab

le1

Sequ

enci

ng

and

De

Nov

oA

ssem

bly

ofY

east

Stra

inG

enom

es

Stra

inSu

bpop

ula

tion

Sou

rce

Loca

tion

Cov

aN

um

ber

ofSc

affo

ldsb

Ass

emb

lySi

zeb

Con

tig

N50

bM

ax

Scaf

fold

Size

bSc

affo

ldN

50b

Sce

revi

siae

UW

OPS

87-2

421

Mos

aic

Opu

ntia

spp

Haw

aii

821

559

536

116

584

291

167

177

210

365

310

847

354

612

654

612

616

093

920

012

2

UW

OPS

83-7

873

Mos

aic

Opu

ntia

spp

Bah

amas

627

513

495

116

769

431

167

812

211

370

211

370

255

777

188

503

318

788

124

466

1

YPS

128

Nor

thA

mer

ican

Que

rcus

alba

USA

6485

579

111

741

720

11

768

379

948

959

934

145

129

451

896

010

955

518

762

3

SK1

Mos

aic

Soil

USA

561

315

111

911

704

669

11

753

398

491

495

407

232

107

256

961

067

234

267

272

L152

8W

ine

Euro

pean

Win

eC

hile

4995

180

811

565

290

11

594

025

361

083

957

424

618

146

398

655

077

101

460

W30

3M

osai

cLa

bora

tory

USA

481

391

121

911

630

510

11

666

829

345

373

798

620

922

328

454

752

228

122

824

DB

VPG

6765

Win

eEu

rope

anU

nkn

own

Un

know

n46

967

779

116

224

181

167

095

346

113

51

348

280

644

540

506

656

672

083

26

Y12

Sake

Sake

Japa

n44

172

11

582

116

100

641

164

974

725

846

26

998

113

242

229

213

369

446

014

8

DB

VPG

1106

Win

eEu

rope

anG

rap

esA

ust

ralia

431

280

116

311

539

734

11

580

105

266

132

724

617

296

418

308

036

291

47

371

Y55

Mos

aic

Gra

pes

Fran

ce38

151

61

219

116

354

231

170

113

924

282

25

808

263

008

418

695

421

571

418

98

DB

VPG

6044

Wes

tA

fric

anB

iliw

ine

Wes

tA

fric

a35

113

798

311

598

603

11

656

308

421

434

528

223

691

735

097

052

144

94

331

YJM

975

Win

eEu

rope

anC

linic

alIt

aly

333

069

242

911

406

731

11

688

902

570

15

849

334

645

940

37

414

150

38

UW

OPS

03-4

614

Mal

aysi

anB

ertr

amp

alm

Mal

aysi

a32

364

63

213

114

991

461

169

426

85

691

584

440

466

60

006

691

910

747

DB

VPG

1373

Win

eEu

rope

anSo

ilN

eth

erla

nds

321

232

970

115

594

541

162

650

421

249

22

464

185

406

368

851

355

429

852

6

YJM

978

Win

eEu

rope

anC

linic

alIt

aly

32mdash

mdashmdash

mdashmdash

DB

VPG

1788

Win

eEu

rope

anSo

ilFi

nla

nd

30mdash

mdashmdash

mdashmdash

L137

4W

ine

Euro

pean

Win

eC

hile

26mdash

mdashmdash

mdashmdash

BC

187

Win

eEu

rope

anW

ine

USA

23mdash

mdashmdash

mdashmdash

YJM

981

Win

eEu

rope

anC

linic

alIt

aly

16mdash

mdashmdash

mdashmdash

Sp

arad

oxus

Y8

5Eu

rop

ean

Que

rcus

spp

U

K50

243

911

623

026

11

133

254

762

020

411

1

Z1

1Eu

rop

ean

Que

rcus

spp

U

K42

342

641

211

616

631

11

624

232

113

350

115

435

555

399

555

399

238

384

289

391

Y9

6Eu

rop

ean

Que

rcus

spp

U

K40

874

311

615

713

67

180

27

189

283

447

Z1

Euro

pea

nQ

uerc

ussp

p

UK

367

685

116

865

90

102

641

357

366

128

755

Q59

1Eu

rop

ean

Que

rcus

spp

U

K54

931

803

117

144

391

175

763

767

680

73

926

362

301

496

642

710

431

509

95

N-4

4Fa

rEa

ster

nQ

uerc

ussp

p

Rus

sia

512

058

1488

115

761

821

173

337

611

054

12

242

662

541

561

2813

806

36

398

YPS

138

Am

eric

anQ

ve

lutin

aU

SA43

204

715

2911

613

204

11

740

006

110

251

180

169

807

117

759

147

973

328

5

S36

7Eu

rop

ean

Que

rcus

spp

U

K42

115

311

1211

658

851

11

672

986

465

804

688

017

463

818

728

750

727

56

021

Y6

5Eu

rop

ean

Que

rcus

spp

U

K41

113

498

111

655

419

11

704

717

470

305

210

719

704

435

741

450

244

78

826

Y7

Euro

pea

nQ

uerc

ussp

p

UK

391

164

976

116

539

441

172

135

543

703

46

679

163

882

230

460

458

728

876

2

Q95

3Eu

rop

ean

Que

rcus

spp

U

K35

122

91

002

116

604

421

173

403

642

716

46

351

190

059

307

994

482

611

113

11

UFR

J508

16A

mer

ican

Dro

soph

ilasp

p

Bra

zil

34mdash

mdashmdash

mdashmdash

T21

4Eu

rop

ean

Que

rcus

spp

U

K34

118

799

111

656

698

11

699

313

357

884

141

013

132

834

001

940

273

70

866

IFO

1804

Far

East

ern

Que

rcus

spp

R

ussi

a26

mdashmdash

mdashmdash

mdash

W7

Euro

pea

nQ

uerc

ussp

p

UK

232

070

116

476

94

129

27

946

15

145

11

Q74

4Eu

rop

ean

Que

rcus

spp

U

K23

mdashmdash

mdashmdash

mdash

Q89

8Eu

rop

ean

Que

rcus

spp

U

K22

mdashmdash

mdashmdash

mdash

Q62

5Eu

rop

ean

Que

rcus

spp

U

K20

mdashmdash

mdashmdash

mdash

Q69

8Eu

rop

ean

Que

rcus

spp

U

K18

mdashmdash

mdashmdash

mdash

Y8

1Eu

rop

ean

Que

rcus

spp

U

K16

mdashmdash

mdashmdash

mdash

KPN

3829

Euro

pea

nQ

uerc

ussp

p

Rus

sia

16mdash

mdashmdash

mdashmdash

Q32

3Eu

rop

ean

Que

rcus

spp

U

K12

mdashmdash

mdashmdash

mdash

NO

TEmdash

Stra

ins

wit

hout

valu

esw

ere

not

deno

voas

sem

bled

Add

itio

nal

info

rmat

ion

ofth

ese

quen

ced

stra

ins

was

repo

rted

inLi

tiC

arte

ret

al(

2009

)T

heS

cere

visi

aeY

12or

igin

ally

repo

rted

from

Ivor

yC

oast

Palm

win

esu

bseq

uent

lycl

arifi

ed(F

ayJ

pers

onal

com

mun

icat

ion)

that

the

stra

inse

nt

was

the

Sake

stra

inK

12fr

omJa

pan

a Ref

ers

tora

wco

vera

ge

bV

alue

sbe

fore

and

afte

rfo

rwar

dsl

ashe

sco

rres

pond

tobe

fore

and

afte

rad

diti

onal

scaf

fold

ing

wit

hlo

wco

vera

gep

aire

d-en

dSa

nge

rda

ta

All

valu

esin

unit

sof

base

-pai

rs

874

Bergstrom et al doi101093molbevmsu037 MBE

20 after quality filtering were de novo assembled The sixstrains with very high coverage were used to explore the effectof sequencing coverage on de novo assembly quality by as-sembling data subsets systematically downsampled to differ-ent coverage levels We find that assembly quality increaseswith increasing coverage up to but not above a level of about120 (supplementary fig S1 Supplementary Materialonline) As almost all strains were previously sequenced tolow coverage (1ndash4) using Sanger technology (Liti Carteret al 2009) and paired-end libraries with long inserts (meaninsert size 4480 bp) we could improve our assemblies byincluding this data Mapping Sanger reads onto the Illuminaassemblies and performing scaffolding based on paired-endinformation improved assembly connectivity substantiallywith an increase in average scaffold N50 from 648 to1181 kb (table 1)

The 27 de novo assemblies have total sizes between 1158and 1177 Mb suggesting little variability in yeast genome sizeThis is 32ndash48 smaller than the completed S cerevisiae ref-erence genome of 1216 Mb indicating a slight underestima-tion of true genome sizes presumably due to collapse in theassemblies of transposable elements and perhaps a few otherrepeat regions Underestimates are compatible with a re-ported Ty transposon content of 1ndash3 in these strains anda higher transposon content of 35 in the reference strainS288c (Liti Carter et al 2009) For the majority of strains thelargest scaffolds are longer than the smallest chromosomes ofthe reference genome (table 1) Some scaffolds in the highercoverage assemblies correspond to near full-length chromo-somes This illustrates that high-quality assembly of yeast ge-nomes is achievable using short-read sequencing data Thestructurally complex subtelomeric regions did not generallyassemble well however (with some exceptions see below) aknown problem in genome assembly

Assembly Scaffolding by Genetic Linkage RevealsExtensive Structural Conservation exceptin Subtelomeres

Four of the S cerevisiae strains sequenced the North Amer-ican YPS128 the West African DBVPG6044 the SakeJapa-nese Y12 and the WineEuropean DBVPG6765 have beenused as parents for advanced intercross lines from whichlarge numbers of highly recombined offspring genomeshave been sequenced (Cubillos et al 2013 Illingworth et al2013) The genetic linkage that manifests between nearby lociin these artificial populations was here put to use to furtherscaffold the de novo assemblies of these four strains Thisresulted in chromosome sized scaffolds for all 16 nuclearchromosomes in these strains containing 946ndash961 ofthe total assembly sequence and scaffold N50 values of852ndash890 kb

The linkage-assisted assemblies exposed the near-completecolinearity and structural conservation of the genomes offour of the major phylogenetic lineages in S cerevisiae(fig 1A) This is consistent with high rates of meiotic sporeviability observed in crosses between these strains (Cubilloset al 2011) as large-scale structural variation would impair

proper segregation of chromosomes as well as the highdegree of karyotype conservation within the S sensu strictospecies clade (Fischer et al 2000 Liti et al 2013) Interestinglythe high degree of structural conservation does not extendinto subtelomeric regions consistent with their rapid evolu-tion and high variability (Liti et al 2005 Brown et al 2010)Lower assembly quality in these regions makes analysis diffi-cult however the genetic linkage data allowed the identifica-tion of some cases of structural variation in the subtelomeresrelative to the S cerevisiae reference genome (fig 1B) We alsofind subtelomeric material that is present in some strains andabsent in others for instance a segment of approximately18 kb that localizes to the right subtelomere of chromosomeXIII and assembled well in the North American and WestAfrican S cerevisiae strains (fig 1C) This genomic region isabsent from the WineEuropean SakeJapanese as well as theS cerevisiae reference strain while being present in all S para-doxus strains constituting an example of a subtelomericregion that has recently been lost in certain S cerevisiaestrains These kind of structural differences have implicationsfor QTL mapping studies that rely on a reference genomebecause the causative sequence responsible for a phenotypicassociation might in fact be located in a different subtelomerein the mapping strains or simply be absent from the referencegenome thus risking that the search for candidates is di-rected to the incorrect genomic region (Cubillos et al 2011)

Genome and Gene Content Variation in S cerevisiaeExceeds That in S paradoxus

High quality de novo genome assemblies enable the system-atic identification of long sequence segments that are presentin only a subset of yeast strains such as the region exemplifiedin figure 1C We made pairwise comparisons between straingenomes and summed the total length of sequence regionslarger than 1 kb that are present in one strain but not theother and refer to this as genome content variation InS cerevisiae this sum is always larger than the number ofSNPs between a pair of strains for all possible strain compar-isons This is not the case for S paradoxus surprisingly as theamount of genome content variation within this species islower than within S cerevisiae despite genetic variation in theform of SNPs being almost an order of magnitude larger(fig 2A) In both species there is a positive correlation be-tween the SNP distance between strains and the amount ofgenome content difference but the correlation is muchweaker in S cerevisiae than in S paradoxus (r = 051 andr = 097 respectively) The highly variable relationship inS cerevisiae suggests that this kind of variation is the productof a different mode of evolution than the clocklike nature ofSNP accumulation Nonetheless clustering strains based ongenome content differences recapitulates the known popu-lation structures of both species (supplementary fig S2Supplementary Material online) To ensure that the detecteddifferences reflects true underlying genome variation we val-idated the presence and absence of 14 variable regions acrossstrains by polymerase chain reaction (PCR) confirming the denovo assembly predictions in 326327 cases as well as

875

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

compared one of our assemblies to an alternative assemblyfor the same strain and found no differences (see supplemen-tary fig S3 Supplementary Material online and Materials andMethods) The observation that genome content variation isrelatively larger within S cerevisiae than within S paradoxus ishighly unexpected under a neutral model of genome

evolution and implies pronounced differences in the evolu-tionary histories of these two species Although challenging toestablish at present we suggest that this unexpected excess ofgenome content variation in S cerevisiae is likely to be a majorcontributor to the equally surprising excess of phenotypicvariability within this species We furthermore note that

A

B C

890 892 894 896 898 900 902 904 906 908

Position on chromosomal scaffold XIII (kb)

FIG 1 Yeast genome structures revealed by de novo assemblies augmented by genetic linkage data (A) Scaffolding de novo assemblies using geneticlinkage information from advanced intercross lines dramatically improves assembly connectivity and reveals extensive structural conservation of thecore chromosomes in four of the major S cerevisiae lineages Displayed is a dot plot of sequence similarity between the assembly scaffolds of the strainYPS128 from the North American phylogenetic lineage and the 16 nuclear chromosomes of the S cerevisiae reference genome (strain S288c) before andafter the incorporation of the genetic linkage data into the scaffolding process After scaffolding by genetic linkage the majority of the assemblysequence is contained in 16 large scaffolds that are collinear with the chromosomes of the reference genome Results are highly similar for the otherthree strains for which genetic linkage data is available the West African strain DBVPG6044 the WineEuropean strain DBVPG6765 and the sakeJapanese strain Y12 (the recent sequencing of the sake strain Kyokai no 7 (Akao et al 2011) revealed two intrachromosomal inversions in chromosomesV and XIV in relation to the reference strain S288c however these are not shared by the sake strain Y12 sequenced here) Only scaffolds bigger than50 kb are displayed (B) Structural rearrangements relative to the chromosome organization of the S cerevisiae reference genome all localized to thesubtelomeric regions A directed arrow indicates that a sequence region is aligning to the part of the reference genome where the arrow starts but in thede novo assembly is located in the part of the genome corresponding to where the arrow ends (C) A subtelomeric 18-kb region that assembled well inseveral strains and could be localized by genetic linkage is displayed with coordinates corresponding to the YPS128 chromosome XIII scaffold Six geneswere found in this region by ab initio gene prediction (arrows indicate coding direction)

876

Bergstrom et al doi101093molbevmsu037 MBE

there are considerable amounts of genomic sequence that ispresent in one or more natural strains but absent in the Scerevisiae reference strain S288c (supplementary fig S4Supplementary Material online)

We annotated the de novo assemblies by homology andsynteny comparisons to the reference genome Out of 5774nondubious open reading frames (ORFs) in the S288cS cerevisiae reference genome we recovered a median of5417 ORFs (94) per strain with most failures to recover agene appearing to be caused by collapse in the assemblies ofvery close paralogs We also identified genes that are notpresent in the reference genome The number of nonrefer-ence genes present in a S cerevisiae strain genome rangesfrom four in the lab strain W303 which is very closely relatedto the reference strain S288c to 61 in the mosaic strainUWOPS87-2421 with a median of 31 genes per strain(fig 2B) Using the four S cerevisiae strains for which genetic

linkage data allowed assembly of chromosome-sized scaffoldswe estimate that at least 75 of these nonreference genes arelocated in the subtelomeric parts of the chromosomes Byexploiting distant sequence similarity to functionally anno-tated reference genome proteins we find that the set ofnonreference genes is enriched for gene ontology termsrelated to flocculation and sugarmdashparticularly maltosemdashtransport and metabolism These results are consistent withfindings on the evolutionary properties of subtelomeric genesacross larger evolutionary distances (Brown et al 2010)

CNV Is Extensive in Subtelomeres and Greater inS cerevisiae than in S paradoxus

Whereas the analysis described in the previous section is onlypowered to detect binary presence and absence due to thetendency of similar copies to collapse during de novo

A

B

FIG 2 Genome content variation within natural yeast populations (A) The relationship between genetic distance between strains as measured in SNPsand the amount of genomic material being presentabsent between strains All pairwise strain comparisons within each of the two species are included(B) The number of nonreference genes found in each strain genome Strain colors denote subpopulation origin (for S cerevisiae green = WineEuropeanred = West African cyan = Malaysian yellow = North American dark blue = SakeJapanese black = mosaic genome for S paradoxus orange =American brown = Far Eastern magenta = European) The strain trees are neighbor-joining trees based on genome-wide SNP distances and thescale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species)

877

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

20 after quality filtering were de novo assembled The sixstrains with very high coverage were used to explore the effectof sequencing coverage on de novo assembly quality by as-sembling data subsets systematically downsampled to differ-ent coverage levels We find that assembly quality increaseswith increasing coverage up to but not above a level of about120 (supplementary fig S1 Supplementary Materialonline) As almost all strains were previously sequenced tolow coverage (1ndash4) using Sanger technology (Liti Carteret al 2009) and paired-end libraries with long inserts (meaninsert size 4480 bp) we could improve our assemblies byincluding this data Mapping Sanger reads onto the Illuminaassemblies and performing scaffolding based on paired-endinformation improved assembly connectivity substantiallywith an increase in average scaffold N50 from 648 to1181 kb (table 1)

The 27 de novo assemblies have total sizes between 1158and 1177 Mb suggesting little variability in yeast genome sizeThis is 32ndash48 smaller than the completed S cerevisiae ref-erence genome of 1216 Mb indicating a slight underestima-tion of true genome sizes presumably due to collapse in theassemblies of transposable elements and perhaps a few otherrepeat regions Underestimates are compatible with a re-ported Ty transposon content of 1ndash3 in these strains anda higher transposon content of 35 in the reference strainS288c (Liti Carter et al 2009) For the majority of strains thelargest scaffolds are longer than the smallest chromosomes ofthe reference genome (table 1) Some scaffolds in the highercoverage assemblies correspond to near full-length chromo-somes This illustrates that high-quality assembly of yeast ge-nomes is achievable using short-read sequencing data Thestructurally complex subtelomeric regions did not generallyassemble well however (with some exceptions see below) aknown problem in genome assembly

Assembly Scaffolding by Genetic Linkage RevealsExtensive Structural Conservation exceptin Subtelomeres

Four of the S cerevisiae strains sequenced the North Amer-ican YPS128 the West African DBVPG6044 the SakeJapa-nese Y12 and the WineEuropean DBVPG6765 have beenused as parents for advanced intercross lines from whichlarge numbers of highly recombined offspring genomeshave been sequenced (Cubillos et al 2013 Illingworth et al2013) The genetic linkage that manifests between nearby lociin these artificial populations was here put to use to furtherscaffold the de novo assemblies of these four strains Thisresulted in chromosome sized scaffolds for all 16 nuclearchromosomes in these strains containing 946ndash961 ofthe total assembly sequence and scaffold N50 values of852ndash890 kb

The linkage-assisted assemblies exposed the near-completecolinearity and structural conservation of the genomes offour of the major phylogenetic lineages in S cerevisiae(fig 1A) This is consistent with high rates of meiotic sporeviability observed in crosses between these strains (Cubilloset al 2011) as large-scale structural variation would impair

proper segregation of chromosomes as well as the highdegree of karyotype conservation within the S sensu strictospecies clade (Fischer et al 2000 Liti et al 2013) Interestinglythe high degree of structural conservation does not extendinto subtelomeric regions consistent with their rapid evolu-tion and high variability (Liti et al 2005 Brown et al 2010)Lower assembly quality in these regions makes analysis diffi-cult however the genetic linkage data allowed the identifica-tion of some cases of structural variation in the subtelomeresrelative to the S cerevisiae reference genome (fig 1B) We alsofind subtelomeric material that is present in some strains andabsent in others for instance a segment of approximately18 kb that localizes to the right subtelomere of chromosomeXIII and assembled well in the North American and WestAfrican S cerevisiae strains (fig 1C) This genomic region isabsent from the WineEuropean SakeJapanese as well as theS cerevisiae reference strain while being present in all S para-doxus strains constituting an example of a subtelomericregion that has recently been lost in certain S cerevisiaestrains These kind of structural differences have implicationsfor QTL mapping studies that rely on a reference genomebecause the causative sequence responsible for a phenotypicassociation might in fact be located in a different subtelomerein the mapping strains or simply be absent from the referencegenome thus risking that the search for candidates is di-rected to the incorrect genomic region (Cubillos et al 2011)

Genome and Gene Content Variation in S cerevisiaeExceeds That in S paradoxus

High quality de novo genome assemblies enable the system-atic identification of long sequence segments that are presentin only a subset of yeast strains such as the region exemplifiedin figure 1C We made pairwise comparisons between straingenomes and summed the total length of sequence regionslarger than 1 kb that are present in one strain but not theother and refer to this as genome content variation InS cerevisiae this sum is always larger than the number ofSNPs between a pair of strains for all possible strain compar-isons This is not the case for S paradoxus surprisingly as theamount of genome content variation within this species islower than within S cerevisiae despite genetic variation in theform of SNPs being almost an order of magnitude larger(fig 2A) In both species there is a positive correlation be-tween the SNP distance between strains and the amount ofgenome content difference but the correlation is muchweaker in S cerevisiae than in S paradoxus (r = 051 andr = 097 respectively) The highly variable relationship inS cerevisiae suggests that this kind of variation is the productof a different mode of evolution than the clocklike nature ofSNP accumulation Nonetheless clustering strains based ongenome content differences recapitulates the known popu-lation structures of both species (supplementary fig S2Supplementary Material online) To ensure that the detecteddifferences reflects true underlying genome variation we val-idated the presence and absence of 14 variable regions acrossstrains by polymerase chain reaction (PCR) confirming the denovo assembly predictions in 326327 cases as well as

875

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

compared one of our assemblies to an alternative assemblyfor the same strain and found no differences (see supplemen-tary fig S3 Supplementary Material online and Materials andMethods) The observation that genome content variation isrelatively larger within S cerevisiae than within S paradoxus ishighly unexpected under a neutral model of genome

evolution and implies pronounced differences in the evolu-tionary histories of these two species Although challenging toestablish at present we suggest that this unexpected excess ofgenome content variation in S cerevisiae is likely to be a majorcontributor to the equally surprising excess of phenotypicvariability within this species We furthermore note that

A

B C

890 892 894 896 898 900 902 904 906 908

Position on chromosomal scaffold XIII (kb)

FIG 1 Yeast genome structures revealed by de novo assemblies augmented by genetic linkage data (A) Scaffolding de novo assemblies using geneticlinkage information from advanced intercross lines dramatically improves assembly connectivity and reveals extensive structural conservation of thecore chromosomes in four of the major S cerevisiae lineages Displayed is a dot plot of sequence similarity between the assembly scaffolds of the strainYPS128 from the North American phylogenetic lineage and the 16 nuclear chromosomes of the S cerevisiae reference genome (strain S288c) before andafter the incorporation of the genetic linkage data into the scaffolding process After scaffolding by genetic linkage the majority of the assemblysequence is contained in 16 large scaffolds that are collinear with the chromosomes of the reference genome Results are highly similar for the otherthree strains for which genetic linkage data is available the West African strain DBVPG6044 the WineEuropean strain DBVPG6765 and the sakeJapanese strain Y12 (the recent sequencing of the sake strain Kyokai no 7 (Akao et al 2011) revealed two intrachromosomal inversions in chromosomesV and XIV in relation to the reference strain S288c however these are not shared by the sake strain Y12 sequenced here) Only scaffolds bigger than50 kb are displayed (B) Structural rearrangements relative to the chromosome organization of the S cerevisiae reference genome all localized to thesubtelomeric regions A directed arrow indicates that a sequence region is aligning to the part of the reference genome where the arrow starts but in thede novo assembly is located in the part of the genome corresponding to where the arrow ends (C) A subtelomeric 18-kb region that assembled well inseveral strains and could be localized by genetic linkage is displayed with coordinates corresponding to the YPS128 chromosome XIII scaffold Six geneswere found in this region by ab initio gene prediction (arrows indicate coding direction)

876

Bergstrom et al doi101093molbevmsu037 MBE

there are considerable amounts of genomic sequence that ispresent in one or more natural strains but absent in the Scerevisiae reference strain S288c (supplementary fig S4Supplementary Material online)

We annotated the de novo assemblies by homology andsynteny comparisons to the reference genome Out of 5774nondubious open reading frames (ORFs) in the S288cS cerevisiae reference genome we recovered a median of5417 ORFs (94) per strain with most failures to recover agene appearing to be caused by collapse in the assemblies ofvery close paralogs We also identified genes that are notpresent in the reference genome The number of nonrefer-ence genes present in a S cerevisiae strain genome rangesfrom four in the lab strain W303 which is very closely relatedto the reference strain S288c to 61 in the mosaic strainUWOPS87-2421 with a median of 31 genes per strain(fig 2B) Using the four S cerevisiae strains for which genetic

linkage data allowed assembly of chromosome-sized scaffoldswe estimate that at least 75 of these nonreference genes arelocated in the subtelomeric parts of the chromosomes Byexploiting distant sequence similarity to functionally anno-tated reference genome proteins we find that the set ofnonreference genes is enriched for gene ontology termsrelated to flocculation and sugarmdashparticularly maltosemdashtransport and metabolism These results are consistent withfindings on the evolutionary properties of subtelomeric genesacross larger evolutionary distances (Brown et al 2010)

CNV Is Extensive in Subtelomeres and Greater inS cerevisiae than in S paradoxus

Whereas the analysis described in the previous section is onlypowered to detect binary presence and absence due to thetendency of similar copies to collapse during de novo

A

B

FIG 2 Genome content variation within natural yeast populations (A) The relationship between genetic distance between strains as measured in SNPsand the amount of genomic material being presentabsent between strains All pairwise strain comparisons within each of the two species are included(B) The number of nonreference genes found in each strain genome Strain colors denote subpopulation origin (for S cerevisiae green = WineEuropeanred = West African cyan = Malaysian yellow = North American dark blue = SakeJapanese black = mosaic genome for S paradoxus orange =American brown = Far Eastern magenta = European) The strain trees are neighbor-joining trees based on genome-wide SNP distances and thescale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species)

877

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

compared one of our assemblies to an alternative assemblyfor the same strain and found no differences (see supplemen-tary fig S3 Supplementary Material online and Materials andMethods) The observation that genome content variation isrelatively larger within S cerevisiae than within S paradoxus ishighly unexpected under a neutral model of genome

evolution and implies pronounced differences in the evolu-tionary histories of these two species Although challenging toestablish at present we suggest that this unexpected excess ofgenome content variation in S cerevisiae is likely to be a majorcontributor to the equally surprising excess of phenotypicvariability within this species We furthermore note that

A

B C

890 892 894 896 898 900 902 904 906 908

Position on chromosomal scaffold XIII (kb)

FIG 1 Yeast genome structures revealed by de novo assemblies augmented by genetic linkage data (A) Scaffolding de novo assemblies using geneticlinkage information from advanced intercross lines dramatically improves assembly connectivity and reveals extensive structural conservation of thecore chromosomes in four of the major S cerevisiae lineages Displayed is a dot plot of sequence similarity between the assembly scaffolds of the strainYPS128 from the North American phylogenetic lineage and the 16 nuclear chromosomes of the S cerevisiae reference genome (strain S288c) before andafter the incorporation of the genetic linkage data into the scaffolding process After scaffolding by genetic linkage the majority of the assemblysequence is contained in 16 large scaffolds that are collinear with the chromosomes of the reference genome Results are highly similar for the otherthree strains for which genetic linkage data is available the West African strain DBVPG6044 the WineEuropean strain DBVPG6765 and the sakeJapanese strain Y12 (the recent sequencing of the sake strain Kyokai no 7 (Akao et al 2011) revealed two intrachromosomal inversions in chromosomesV and XIV in relation to the reference strain S288c however these are not shared by the sake strain Y12 sequenced here) Only scaffolds bigger than50 kb are displayed (B) Structural rearrangements relative to the chromosome organization of the S cerevisiae reference genome all localized to thesubtelomeric regions A directed arrow indicates that a sequence region is aligning to the part of the reference genome where the arrow starts but in thede novo assembly is located in the part of the genome corresponding to where the arrow ends (C) A subtelomeric 18-kb region that assembled well inseveral strains and could be localized by genetic linkage is displayed with coordinates corresponding to the YPS128 chromosome XIII scaffold Six geneswere found in this region by ab initio gene prediction (arrows indicate coding direction)

876

Bergstrom et al doi101093molbevmsu037 MBE

there are considerable amounts of genomic sequence that ispresent in one or more natural strains but absent in the Scerevisiae reference strain S288c (supplementary fig S4Supplementary Material online)

We annotated the de novo assemblies by homology andsynteny comparisons to the reference genome Out of 5774nondubious open reading frames (ORFs) in the S288cS cerevisiae reference genome we recovered a median of5417 ORFs (94) per strain with most failures to recover agene appearing to be caused by collapse in the assemblies ofvery close paralogs We also identified genes that are notpresent in the reference genome The number of nonrefer-ence genes present in a S cerevisiae strain genome rangesfrom four in the lab strain W303 which is very closely relatedto the reference strain S288c to 61 in the mosaic strainUWOPS87-2421 with a median of 31 genes per strain(fig 2B) Using the four S cerevisiae strains for which genetic

linkage data allowed assembly of chromosome-sized scaffoldswe estimate that at least 75 of these nonreference genes arelocated in the subtelomeric parts of the chromosomes Byexploiting distant sequence similarity to functionally anno-tated reference genome proteins we find that the set ofnonreference genes is enriched for gene ontology termsrelated to flocculation and sugarmdashparticularly maltosemdashtransport and metabolism These results are consistent withfindings on the evolutionary properties of subtelomeric genesacross larger evolutionary distances (Brown et al 2010)

CNV Is Extensive in Subtelomeres and Greater inS cerevisiae than in S paradoxus

Whereas the analysis described in the previous section is onlypowered to detect binary presence and absence due to thetendency of similar copies to collapse during de novo

A

B

FIG 2 Genome content variation within natural yeast populations (A) The relationship between genetic distance between strains as measured in SNPsand the amount of genomic material being presentabsent between strains All pairwise strain comparisons within each of the two species are included(B) The number of nonreference genes found in each strain genome Strain colors denote subpopulation origin (for S cerevisiae green = WineEuropeanred = West African cyan = Malaysian yellow = North American dark blue = SakeJapanese black = mosaic genome for S paradoxus orange =American brown = Far Eastern magenta = European) The strain trees are neighbor-joining trees based on genome-wide SNP distances and thescale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species)

877

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

there are considerable amounts of genomic sequence that ispresent in one or more natural strains but absent in the Scerevisiae reference strain S288c (supplementary fig S4Supplementary Material online)

We annotated the de novo assemblies by homology andsynteny comparisons to the reference genome Out of 5774nondubious open reading frames (ORFs) in the S288cS cerevisiae reference genome we recovered a median of5417 ORFs (94) per strain with most failures to recover agene appearing to be caused by collapse in the assemblies ofvery close paralogs We also identified genes that are notpresent in the reference genome The number of nonrefer-ence genes present in a S cerevisiae strain genome rangesfrom four in the lab strain W303 which is very closely relatedto the reference strain S288c to 61 in the mosaic strainUWOPS87-2421 with a median of 31 genes per strain(fig 2B) Using the four S cerevisiae strains for which genetic

linkage data allowed assembly of chromosome-sized scaffoldswe estimate that at least 75 of these nonreference genes arelocated in the subtelomeric parts of the chromosomes Byexploiting distant sequence similarity to functionally anno-tated reference genome proteins we find that the set ofnonreference genes is enriched for gene ontology termsrelated to flocculation and sugarmdashparticularly maltosemdashtransport and metabolism These results are consistent withfindings on the evolutionary properties of subtelomeric genesacross larger evolutionary distances (Brown et al 2010)

CNV Is Extensive in Subtelomeres and Greater inS cerevisiae than in S paradoxus

Whereas the analysis described in the previous section is onlypowered to detect binary presence and absence due to thetendency of similar copies to collapse during de novo

A

B

FIG 2 Genome content variation within natural yeast populations (A) The relationship between genetic distance between strains as measured in SNPsand the amount of genomic material being presentabsent between strains All pairwise strain comparisons within each of the two species are included(B) The number of nonreference genes found in each strain genome Strain colors denote subpopulation origin (for S cerevisiae green = WineEuropeanred = West African cyan = Malaysian yellow = North American dark blue = SakeJapanese black = mosaic genome for S paradoxus orange =American brown = Far Eastern magenta = European) The strain trees are neighbor-joining trees based on genome-wide SNP distances and thescale bars indicate sequence distance in units of SNPs per basepair (distance scales differ between the species)

877

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

assembly by mapping reads to the reference genome we canalso assay differences between higher copy numbers Weidentified CNV across all strains with a mapped coverage of8 or higher (18 S cerevisiae strains and 19 S paradoxusstrains) The total size of genomic regions exhibiting CNVbetween strains is three times larger in S cerevisiae than inS paradoxus (423 kb vs 142 kb corresponding to 35 and12 of the total genome size respectively) despite lowerlevels of overall genetic divergence in the former speciesAlthough the true amount of CNV in S paradoxus couldbe slightly underestimated due to the quality of the referencegenome being lower than that of S cerevisiae (containingmultiple gaps where CNV detection is not possible) this isunlikely to explain the large difference observed (see Materialsand Methods) These results mirror the recent finding of dif-ferences in CNV rates between the great ape lineages(Sudmant et al 2013) CNV similarity between strains stronglycorrelates with SNP similarity within both S cerevisiae and Sparadoxus (r = 0843 and r = 0885 respectively) and cluster-ing strains based on CNV profiles recapitulates the broadphylogenetic structures of the two species (supplementaryfig S5 Supplementary Material online) In both S cerevisiaeand S paradoxus we find very limited CNV in nonsubtelo-meric regions and extensive variation in the subtelomericregions In S cerevisiae 320 of subtelomeric nucleotide po-sitions are affected by CNV compared with 07 in nonsub-telomeric regions (42-fold enrichment) and in S paradoxusthe corresponding numbers are 93 and 004 respectively(23-fold enrichment) In S cerevisiae genes contained in re-gions displaying CNV are enriched for gene ontology termsrelated to sugar transport and metabolism flocculation andion and metal transport and metabolism The same trends ofenrichment are observed in S paradoxus (although withfewer categories reaching statistical significance as a conse-quence of the overall lower number of genes affected) Othergenes with variable copy number between strains in bothspecies include the high copy number subtelomeric YRF1helicase PAU seripauperin and DUP gene families These re-sults imply that largely similar evolutionary forces are shapingthe landscapes of CNV in these two species By consideringaggregate sequencing depth across close paralogs in the ref-erence genome we could unveil further trends not necessarilyapparent when considering genes one by one This genefamily based analysis showed that one clinically isolatedstrain from the WineEuropean phylogenetic clusterYJM981 harbors an extremely large number of YRF1 genecopies on the order of 1000 copies compared with estimatesof 10ndash40 copies in other strains including the referencegenome This radical copy number increase is consistentwith an experimentally demonstrated phenomenon wherethe subtelomeric Y0-element which contains the YRF1genes is amplified to serve as an alternative mechanism fortelomere maintenance following failure of the conventionaltelomerase system (Lundblad and Blackburn 1993) This hasdramatic consequences on subtelomeric structure and overallgenome size and we confirmed by pulsed field gel electro-phoresis that the chromosomes of YJM981 are much largerthan normal (supplementary fig S6 Supplementary Material

online) Another closely related clinical isolate YJM978 alsoharbors an usually large number of YRF1 copies though muchfewer than YJM981 (~150 copies) While we have not identi-fied any obvious loss-of-function mutations affecting the keytelomere maintenance genes in these strains some of them(EST1 EST2 RIF2 TEL1 yKu80) appear to have very low ex-pression levels in an RNA-seq data set (Skelly et al 2013)These results raise the intriguing possibility that this alterna-tive mode of telomere elongation actually occurs in naturalstrains which to our knowledge has never beendemonstrated

Convergent Evolution of Subtelomeric CNV UnderliesNatural Arsenic Resistance in S cerevisiae andS paradoxus

One region of the genome exhibiting CNV within bothS cerevisiae and S paradoxus is the ARR gene clustercontaining three contiguous genes involved in the cellulardetoxification of arsenic and antimonyl compounds the tran-scription factor ARR1 the arsenate reductase ARR2 and theplasma membrane transporter ARR3 We hypothesized thatCNV in this gene cluster would impact growth phenotypes inmedia containing arsenic compounds and tested this usingpreviously collected phenotype data (Warringer et al 2011)In both S cerevisiae and S paradoxus we found a strongassociation between ARR cluster copy number and mitoticgrowth rate length of mitotic lag phase and mitotic growthefficiency in the presence of arsenite (fig 3A) In an additivemodel ARR copy number explains 50 (P = 15103) and92 (P = 12109) of the phenotypic variation in growthrate in S cerevisiae and S paradoxus respectively and 71(P = 25105) and 82 (P = 58107) respectively of thevariation in the length of lag time Interestingly for the growthefficiency phenotype additivity breaks down as there is nodifference between strains with one copy and strains with twocopies This result is biologically consistent with a modelwhere the rate at which the cell can expel arsenic compoundsincreases with the number of ARR copies but the energy costper arsenite molecule exported is constant and independentof copy number above one

In S paradoxus the distribution of the ARR gene clusterCNV tracks the population structure with all strains fromthe European population having two copies the NorthAmerican strain having one copy and the two Far Easternstrains missing the region The variant distribution within Scerevisiae shows a more complex pattern with evidence forboth convergent amplification and introgression betweenlineages (fig 3B) The region is found in one copy in moststrains in the species missing in the Malaysian strainUWOPS03-4614 and the predominantly West Africanmosaic strains SK1 and Y55 and in two copies in thesake strain Y12 and two strains from the WineEuropeancluster By computationally phasing the haplotypes of thetwo copies of the WineEuropean strain BC187 we foundthat one of these copies clusters phylogenetically with thealleles of the other WineEuropean strains as expectedwhereas the other clusters with the Y12 copies indicating

878

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

that this copy has been introgressed from the sake lineageinto parts of the WineEuropean population (fig 3C) Theother WineEuropean strain with a duplication DBVPG1373however harbors two copies with very low internal sequencedivergence and with no similarity to the sake copies implyingthat this is the result of an independent duplication eventwithin the WineEuropean population These findings dem-onstrate convergent evolution of ARR cluster duplication andloss both between different lineages within S cerevisiae andbetween S cerevisiae and S paradoxus It is tempting to spec-ulate that the ARR cluster CNV has been driven by differencesin environmental arsenic concentrations between the habi-tats of different yeast lineages

Derived Alleles in Yeast Tend to Be Deleterious andPrivate to a Single Population

To predict the effects of SNPs on protein function we em-ployed the Sorting Intolerant From Tolerant (SIFT) algorithmwhich estimates the probability of an amino acid substitutionbeing deleterious based on evolutionary conservation of thesite across a large number of species (Kumar et al 2009) on allnonsynonymous SNPs identified in the 18 S cerevisiae strainswith at least 8 coverage mapped to the reference genomeUsing the S paradoxus population as outgroup we also in-ferred the most likely ancestral state at each polymorphiclocus allowing the polarization of alleles into ancestral andderived Overall we found a strong tendency for derived

ARR cluster copy number

Rat

e (5

mM

NaA

s 2O3)

Lag

(5m

M N

aAs 2O

3)

Effi

cien

cy (

5mM

NaA

s 2O3)

ARR cluster copy number ARR cluster copy number

A

B C

FIG 3 Convergent evolution of ARR cluster copy number (A) Growth rate length of mitotic lag and mitotic growth efficiency in medium containing5 mM sodium arsenite oxide for strains with different ARR cluster copy number Units are on a log2 scale and relative to the S cerevisiae reference strainderivative BY4741 The strain data points are jittered along the horizontal dimension to increase visibility (B) Distribution of the ARR cluster copynumber variant within the populations of S cerevisiae and S paradoxus Strain colors denote subpopulation origin as in figure 2 The strain trees areneighbor-joining trees based on genome-wide SNP distances and the scale bars indicates sequence distance in units of SNPs per basepair (distancescales differ between the species) (C) The two copies of the ARR gene cluster in the WineEuropean strain BC187 were computationally phased and thesequences of the two copies were clustered with the corresponding sequences from the clean lineage strains of S cerevisiae using the neighbor-joiningalgorithm Although the JapaneseSake strain (Y12) carries two copies the haplotypes are very similar in sequence and are represented here by aconsensus version where the few positions that are polymorphic between the two haplotypes have been masked out The scale bar indicates sequencedistance in units of SNPs per basepair

879

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

alleles with higher functional potential to be shifted towardlow frequencies (fig 4A) Most derived alleles are present inonly one (37) or a few strains but among nonsynonymousalleles the bias is more pronounced with 46 occurring inonly a single strain Among nonsynonymous alleles predictedto be deleterious the bias toward lower frequencies is evenstronger with 64 occurring in a single strain These patternsreflect the effects of purifying selection and agree with obser-vations made in the human (Abecasis et al 2012) andArabidopsis thaliana (Cao et al 2011) populations Derivedand therefore recent alleles are four times more likely to beclassified as deleterious than ancestral alleles (fig 4B) consis-tent with elevated negative selection against them in thespecies as a whole We also specifically considered SNPswhere the derived allele is private to a single strain andfound that such alleles present in WineEuropean strainsare more frequently nonsynonymous and the nonsynony-mous SNPs are more frequently predicted deleterious thanSNPs in other strains (fig 4C) This suggests relaxed negativeselection in the WineEuropean population potentially asso-ciated with a recent population expansion into a new andbeneficial niche such as that introduced by humans with the

emergence and spread of wine production over the last 7000years (Borneman et al 2013) Running SIFT in the same fash-ion on variants identified in the S paradoxus population wefind that derived alleles in this species are on average slightlyless often deleterious than derived S cerevisiae alleles (178vs 215) also consistent with recently relaxed selection inS cerevisiae

Loss-of-Function Variants in the S cerevisiaePopulation

Certain classes of variants are expected to have dramaticconsequences on gene products and therefore constituteparticularly interesting candidates for contributing to pheno-typic variation Taking advantage of the well-annotated ref-erence genome of S cerevisiae we predicted highly probableloss-of-function variants in the form of prematurely intro-duced stop codons and frameshifting indels across all 19S cerevisiae strains Overall these variants are enriched to-ward the 30-end of ORFs (37-fold enrichment in the last 5 ofORFs Plt 1021) This reflects lower purifying selection pres-sures against mutations that only perturb translation of thevery C-terminal end of the protein and is in line with previous

A

B C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Number of strains

Fra

ctio

n0

00

10

20

30

40

50

60

7

NonminuscodingCodingSynonymousNonminussynonymousDeleterious (SIFT)

Derived Ancestral

Fra

ctio

n de

lete

rious

(S

IFT

)

000

005

010

015

020

025

FIG 4 Distribution of SNPs within the S cerevisiae population (A) The derived allele frequency spectrum for SNPs with different coding effects Theancestral state of each SNP was inferred by using S paradoxus as an outgroup (B) SNP alleles inferred to be derived are much more frequently predictedto be deleterious by SIFT than alleles predicted to be ancestral (215 vs 54 respectively) (C) The effect on gene sequences of derived alleles that arefound in only a single strain Strain colors denote subpopulation origin as in figure 2 The strain tree is a neighbor-joining tree based on genome-wideSNP distances and the scale bar indicates sequence distance in units of SNPs per basepair

880

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

observations (Liti Carter et al 2009 Jelier et al 2011) WithinORFs indels with lengths that are multiples of threeare highly enriched when compared with noncoding se-quence consistent with strong purifying selection againstframeshifts (fig 5A) To test the possibility that full-lengthproteins could be produced from frameshifted genes byprogrammed translational frameshifting we searched for se-quence signatures believed to mediate this process (Theiset al 2008) but found no evidence for this As a groupORFs classified as dubious do not display the strong signsof purifying selection described above implying that mostof these are unlikely to constitute functional genes

Excluding ORFs classified as dubious and variants posi-tioned in the very 30 end of genes (last 2 of the ORF) weidentified 242 genes harboring loss-of-function variants in at

least one S cerevisiae strain This set is enriched for genesclassified as functionally uncharacterized (31-fold enrich-ment Plt 1033) demonstrating that this group of geneson average are under lower purifying selection pressures(fig 5A) One explanation for this result could be that thereis a bias in the investigation and therefore also in the func-tional classification of yeast genes toward genes that are morephenotypically important and therefore also under strongerselection Genes with putative loss-of-function variants arealso less likely to be essential in the reference strain S288c(11-fold depletion Plt 1015) Only four essential genes withloss-of-function variation were found In all these cases thevariants were positioned either very late in the reading frame(PZF1) or very early with an alternative start codon nearby(CBF2 LTO1) or the affected gene overlapped an essential

A B

C D

Time (days)

RIM15 genotypeBoth alleles

presentDBVPG6765allele deleted

Other alleledeleted

DBVPG6765x

YPS128(North American)

Diploidhybrid

S288c

5313 bp

RIM15

L1528DBVPG6765DBVPG1106

YJM975W303

Y55DBVPG6044

SK1UWOPS87-2421UWOPS03-4614UWOPS83-7873

YPS128Y12

S paradoxus

DBVPG6765x

Y12(Sake)

DBVPG6765x

DBVPG6044(West African)

Time (days)

s

poru

late

d

spo

rula

ted

s

poru

late

d

Time (days)0

020406080

1000

20406080

1000

20406080

100

1 2 4 7 0 1 2 4 7 0 1 2 4 7

Genes overall

10

002

004

006

008

010

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Genes with loss-of-function variants

Number of paralogs

Fra

ctio

n of

gen

eslength not multiple of three

length multiple of three

25

2

15

1

05

0

2

3

4

1

0

Inde

ls p

er b

p (x

10-3)

Sto

p-ga

ins

per

bp (

x10-

4 )

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Ess

entia

l

Ver

ified

Unc

hara

cter

ized

Dub

ious

Non

-cod

ing

FIG 5 Loss-of-function variants in the S cerevisiae population (A) Frequencies of indels and stop-gain SNPs in different categories of genes Essentialgenes refer to genes for which the deletion in the BY reference strain background is not viable (B) The distribution of the number of paralogs for geneswith loss-of-function variants and for genes overall The number of paralogs for each protein coding gene in the S cerevisiae reference genome wasestimated as the number of other genes in the genome returning BlastP hits with an e-valuelt 1050 and with the alignment covering at least 80 ofthe query protein length We note that because of CNV the exact number of paralogs for a given gene will vary between strains The fraction of geneswith zero paralogs is omitted (C) A 2-bp insertion in the strain DBVPG6765 disrupts the translational reading frame of the gene RIM15 Sequences ofS cerevisiae strains and one S paradoxus strain (the reference strain CBS432) for a segment surrounding the insertion in RIM15 are displayed (D) Thephenotypic effect of the frameshifting insertion variant was tested by deleting the RIM15 gene in the DBVPG6765 strain and in three other strainsrepresenting major phylogenetic lineages within S cerevisiae Diploid hybrids were then constructed between DBVPG6765 and the other three strainscontaining alleles of RIM15 from both parental strains or only from one of them These diploid strains were tested for their ability to sporulate in KAcmedium by scoring the proportion of cells that have undergone sporulation at different time points In all of the three genetic backgrounds presence ofonly the DBVPG6765 RIM15 allele leads to dramatically lower sporulation efficiency

881

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

gene so that the essentiality classification is likely to be false(YJR012C Ben-Shitrit et al 2012)

Subtelomeric genes are overrepresented among genesharboring loss-of-function variants (35-fold enrichmentPlt 1017) consistent with the idea that the subtelomeresare evolutionary dynamic regions for which selection pres-sures vary over time and space Genes that are part of multi-copy gene families show a similar higher tendency to harborloss-of-function mutations (fig 5B) This observation has alsobeen made in humans (MacArthur et al 2012) and is consis-tent with a model where these mutations can sometimesescape the effects of negative selection due to functional re-dundancy with a paralog elsewhere in the genome Howeverit is also consistent with a model where the kinds of genesthat tend to have paralogs are simply under lower or morevariable selection pressures in general (Woods et al 2013)Genes with loss-of-function mutations are enriched for geneontology terms related to transmembrane transport of smallmolecules and ions as well as flocculation These genes alsohave on average a lower number of gene ontology terms an-notated to them (189 vs 25 for the whole genome for theldquobiological processrdquo domain P = 11105 excluding geneswith zero terms) indicating a lower degree of pleiotropy

We reasoned that if the predicted loss-of-function variantsare truly disrupting the function of the affected genes thenwe should see evidence of lower purifying selection acting onthese genes in general within S cerevisiae Indeed genesharboring such variants have higher ratios of nonsynonymousover synonymous polymorphisms (a=s) within the speciesthan genes without loss-of-function variants (0643 vs 0485respectively Plt 1015 Fisherrsquos exact test) To test whetherthis relaxed selection pressure is specific to S cerevisiae orreflects a trend that has been running over longer evolution-ary timescales we also evaluated the observed correlation inS paradoxus Overall S paradoxus orthologs of S cerevisiaegenes with predicted loss-of-function variants have highera=s ratios than other S paradoxus genes (0555 vs 0457respectively Plt 1013 Fisherrsquos exact test) Thus the selectionacting on these genes has been generally low at least over themore than 2 billion generations back to the most recentcommon ancestor of S cerevisiae and S paradoxus (Dujon2010) To test whether the emergence of loss-of-functionmutations in certain S cerevisiae strains is associated with afurther relaxation of selection specifically in these lineages wecompared a=s ratios between those strains carrying theloss-of-function variant of a gene and those strains with the in-tact ORF and found no difference (0642 vs 0639 P = 08774Fisherrsquos exact test) Together these results are largely consis-tent with a nearly neutral model of gene evolution where thestrength of purifying selection is to a large extent a conservedproperty of the particular gene

As a case study for the phenotypic implications of loss-of-function variation in natural yeast populations we focused onthe gene RIM15 in which we identified a frameshifting 2 bpinsertion private to the WineEuropean strain DBVPG6765(fig 5C) We also observed selection against the DBVPG6765allele in the genomic region containing RIM15 during the gen-eration of a four-parent advanced intercross line (Cubillos

et al 2013) RIM15 encodes a protein kinase believed to beinvolved in the regulation of cell division proliferation andsporulation in response to nutrient availability (Broach 2012)To test whether the identified frameshift variant impairsthese cellular processes we constructed diploid hybrids be-tween DBVPG6765 and three representatives of other majorS cerevisiae lineages We deleted either of the two RIM15copies to obtain reciprocal hemizygote strains that containeither only the DBVPG6765 allele or only an allele with anintact reading frame We find that the frameshiftedDBVPG6765 allele has a massive negative impact on the abil-ity of the cell to undergo meiosis and form spores in responseto nutrient starvation (fig 5D) Thus the strain DBVPG6765harbors a nonfunctional RIM15 allele despite its strongly det-rimental effects on traits considered to be highly related to fit-ness We also find that RIM15 has very low expression levelin this strain in a RNA-seq data set (Skelly et al 2013)Interestingly an independent loss-of-function variant inRIM15 has been described in sake yeasts where it is linkedto a reduced ability to enter quiescence and an associatedincreased rate of ethanol production (Watanabe et al 2012)Further work is needed to elucidate why this strain ofS cerevisiae carries this highly deleterious variant

ConclusionsWe found surprising and striking differences in the nature ofgenomic diversity within the two yeast species of S cerevisiaeand S paradoxus Despite lower levels of genetic divergencebetween strains S cerevisiae displayed greater diversity in thepresence and absence as well as copy number of geneticmaterial than did S paradoxus Our results strongly reinforcethat the subtelomeres are the major focal regions for func-tional evolution they are almost exclusively the sites for struc-tural gene content and CNV and are also highly enriched forloss-of-function variants The relevance of this subtelomericdiversity to phenotypic variation is underscored by the find-ing that a third of QTLs for ecologically relevant traits map tothe subtelomeric regions (Cubillos et al 2011) even thoughthey only constitute approximately 8 of the genome Thecomplex structure of the subtelomeres unfortunately alsomakes them the most problematic regions of the genometo assemble and analyze hampering nucleotide-level dissec-tions of this variation and its functional consequencesPromising avenues for overcoming this technical limitationof short-read sequencing include the subcloning of individualsubtelomeres allowing independent sequencing and assem-bly and the use of emerging sequencing technologies thatproduce much longer reads (Bashir et al 2012 Loomis et al2013) We find systematic trends in the types of genes thattend to be affected by certain types of potentially functionalvariation Genes displaying copy number and loss-of-functionvariation as well as genes not present in the reference genomeare enriched for functions related to interaction with theexternal environment for example sugar transport and me-tabolism flocculation and cell adhesion and metal transportand metabolism It is plausible that this reflects variation inthe environmental conditions of different strain habitatsleading to selective pressures for these cellular functions

882

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

that vary across time and space and resulting in either gain orloss of gene functions in different lineages The strong popu-lation structure of natural yeast strains has a critical influenceon virtually all aspects of genomic diversity and we stress theimportance of considering this in analysis and interpretationof results Finally our results demonstrate the high utility ofshort-read next-generation sequencing for yeast populationgenomics especially the value of de novo assembly to identifyvariation in genome content and we hope that the sequencedata and assemblies presented will be useful to the yeastcommunity as well as the broader evolutionary and popula-tion genetics and genomics communities

Materials and Methods

Genome Sequencing and De Novo Assembly

Strains were selected for whole-genome sequencing toencompass most of the genetic diversity of S cerevisiae andS paradoxus at least one strain representing each major phy-logenetic lineage as defined in Liti Carter et al (2009) (inS cerevisiae the North American SakeJapanese WestAfrican WineEuropean and Malaysian lineages in S para-doxus the far Eastern American and European lineages) aswell as a larger number of strains from the European popu-lations of both species Additionally in S cerevisiae weselected the mosaic strains W303 SK1 and Y55 which arepopular laboratory strains as well as two wild mosaic strainsisolated from the Bahamas and Hawaii that phylogeneticallycluster closer to the non-European lineages than most othermosaic strains

DNA was extracted using the phenol chloroform protocolSamples were sequenced using the Illumina GAII and HiSeqplatforms with 2 108 bp or 2 100 bp paired end librariesprepared as previously described (Parts et al 2011) SGA(Simpson and Durbin 2012) was used to perform de novoassembly of all strains that had sequencing coverage of 20or higher after quality filtering (by the SGA preprocess pro-gram) including scaffolding of contigs using Illumina paired-end information Reads were error corrected with a k-merlength of 41 and a minimum of five read pairs were requiredto link two contigs into a scaffold Contigsscaffolds smallerthan 200 bp were discarded Further scaffolding was thenperformed using low-coverage Sanger capillary data previ-ously produced for the same strains (Liti Carter et al2009) The paired-end Sanger reads were mapped onto theSGA scaffolds using SSAHA2 (Ning et al 2001) and scaffoldpairs connected by at least two read pairs in which both readshad a mapping quality of 254 were used as input for thestand-alone scaffolder SSPACE (Boetzer et al 2011) whichwas run without contig extension and a maximum alloweddeviation from the mean pair distance of 50 Mean pairdistances were estimated for each strain individually frompairs where both reads mapped within the same SGA scaffoldTo avoid incorrect scaffolding because of collapsed repeats inassemblies scaffold ends showing signs of collapse in the formof higher Illumina coverage were excluded from Sanger scaf-folding The Illumina reads were mapped to the SGA assem-blies using BWA 061 (Li and Durbin 2009) with the ldquoq 10rdquo

parameter and scaffold edges where the log2-ratio betweenthe average coverage of the outermost 5 kb and the genome-wide median coverage was higher than 05 were excludedAfter scaffolding with the Sanger reads the SGA gapfill pro-gram was run to fill in some of the resulting gaps using theerror-corrected Illumina reads For four of the S paradoxusstrains assembled (Y85 Y96 Z1 and W7) no Sanger readsare available and so further scaffolding could not be per-formed for these strains Genome sizes reported in the textdo not take the large ribosomal DNA (rDNA) tandem repeatarray on chromosome XII into account which in the S288creference genome is represented by two copies (Johnstonet al 1997) and which collapses into a single copy in our denovo assemblies We also note that the mitochondrial ge-nomes did not assemble well likely a result of the very highAT content

Contaminant sequences displaying high (~99) identity togenomes from the prokaryotic Staphylococcus genus wereidentified in the assembly of the strain DBVPG6044 All con-tigsscaffolds that had a BlastN match to a Staphylococcusspecies in the top five hits from the NCBI nr database wereremoved from the assembly No hybrid contigs or scaffoldscontaining both Saccharomyces and Staphylococcus werefound

Assembly Scaffolding Using Genetic Linkage Data

Four of the S cerevisiae strains have been used as parents foradvanced intercross lines as part of projects to study complextraits and recombination and the genomes of a large numberof segregants from these lines have been sequenced 192 F12segregants from a two-parent cross between YPS128 andDBVPG6044 (Illingworth et al 2013) and 192 F12 segregantsfrom a four-parent cross between YPS128 DBVPG6044 Y12and DBVPG6765 (Cubillos et al 2013) in both cases se-quenced by paired-end Illumina technology with 100-bpread lengths The patterns of linkage disequilibrium (LD)between SNPs in these artificial populations were used tofurther scaffold the de novo assemblies of the four parentalstrains First for each of the parental genome assemblies theIllumina reads for the other parental strains were mapped tothe assembly and SNPs were called (read mapping and SNPcalling performed as described in the section ldquoRead-mappingand variant callingrdquo) For the two strains of the two-parentcross (YPS128 and DBVPG6044) only data from this cross wasused and data from the four-parent cross was used only forthe remaining two strains (Y12 and DBVPG6765) The segre-gant individuals were then genotyped at the resulting lists ofSNPs using samtools mpileup 0118 (Li et al 2009) assigningthe corresponding allele if the log-likelihood ratio betweenthe homozygous states of the two alleles was bigger than 10or smaller than 10 assigning unknown genotype if in-between Segregant individuals showing signs of diploidy orDNA contamination as assessed by genome-wide patterns ofheterozygosity were excluded (20 out of 192 individuals in thetwo-parent cross and 15 out of 192 in the four-parent cross)Using the resulting genotype calls LD in units of r2 was thencomputed between all pairs of SNPs using PLINK (Purcell et al

883

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

2007) A model for the approximation of physical base-pairdistances from genetic distances of the form r2 = eld whered is physical distance between two SNPs in units of base pairsand l is a constant was fit by nonlinear least squares to theset of values from SNPs located within the same scaffold (onlyscaffolds of size 50 kb and bigger were used for fitting) LDbetween SNPs on different scaffolds was used to constructpairwise scaffoldndashscaffold links with relative orientations in-ferred by comparing the strength of LD in the four cornerelements of the matrix of LD values between all SNPs in thetwo scaffolds The pairwise scaffoldndashscaffold links then con-stitutes a directed graph through which each linkage groupshould be traceable as a simple path The paths were tracedpartly using the scaffolding algorithm of SGA and partlyby manual curation aided by visualizations in Cytoscape(Shannon et al 2003) Scaffolds that could be positionedbut not reliably oriented because of a lack of difference inLD profiles between the two ends or because they did nothave at least two SNPs were not included in the linkage groupscaffolds but left as unplaced

Genome Assembly Validation

Diagnostic PCR was used to confirm the absence and pres-ence of genomic regions identified as variable between strainsin the de novo assemblies Primers were designed to target 14different regions in all 14 strains of S cerevisiae with de novoassemblies plus the S288c reference strain and for 9 of theregions also in all of the 13 S paradoxus strains with de novoassemblies (supplementary fig S3 Supplementary Materialonline) Different primers were designed for the two differentspecies and placed in nonpolymorphic locations Primersare listed in supplementary table S1 SupplementaryMaterial online

As an additional validation the de novo genome assemblyof the S cerevisiae strain SK1 was compared with anotherrecently released assembly of this strain (van Overbeeket al httpcbiomskccorgpublicSK1_MvO [last accessedOctober 4 2013] Sasaki et al 2013) produced primarily from454 pyrosequencing reads and with additional error-correc-tion and gap closing efforts The assemblies were compared toidentify any false negatives in the sense of sequence incor-rectly left out of our assembly and false positives in the senseof sequence or assembly artifacts in our assembly that do notrepresent actual sequence present in the genome of thisstrain A single region larger than 1 kb was found to be presentin the van Overbeek et al assembly and absent in oursInspection revealed that this is a technical cloning vectorthat has not been introduced into the SK1 derivative se-quenced here and thus does not represent genuinely missingbiological sequence Thirty-one regions larger than 1 kb werepresent in our assembly and absent in the van Overbeek et alassembly To test whether these are assembly artifacts in ourassembly or sequence incorrectly left out of the van Overbeeket al assembly the 454 reads used to construct the vanOverbeek et al assembly were mapped back to our SK1 as-sembly using SSAHA2 (Ning et al 2001) and the depth ofcoverage was assayed in these 31 regions Excluding the

100 bp at the edge of contigs all positions in these regionswere covered by five or more reads with mapping qualities ofat least 75 This shows that these sequences are actually pre-sent in the strain SK1 but have for some technical reason beenleft out of the van Overbeek et al assembly No false positivesand no false negatives of this kind were thus identified in ourSK1 de novo assembly

Reference Genomes and Annotation Data Used

For S cerevisiae the S288c reference genome and correspond-ing annotation files (Release R64-1-1 downloaded from theSaccharomyces Genome Database [httpyeastgenomeorglast accessed January 28 2014] on February 5 2011) was usedfor reference-based analyses The sequence of the 2-mm circleplasmid was downloaded from NCBIrsquos GenBank (accessionNC_001398) For S paradoxus the CBS432 reference genomefrom Liti Carter et al (2009) was used with annotation beingobtained from Scannell et al (2011) A S paradoxus CBS432mitochondrial sequence was obtained from Prochazka et al(2012) (GenBank accession JQ8623351) For analyses assayingthe presence of genomic material in the S paradoxus refer-ence genome an alternative version of the CBS432 assemblythat includes some additional sequence that was incorrectlyleft out from the original assembly was used (available fromLiti Carter et al [2009]) We define the subtelomeric regionsas the outermost 33 kb of each reference chromosome fol-lowing Brown et al (2010) Annotation on the essentiality ofgenes was that of the S cerevisiae reference strain S288c

Identification of Genome Content Differences

To identify genomic material present in a given genome butabsent in another alignments were constructed betweengenome assemblies using BlastN without low-complexity fil-tering and a region in the query sequence was called as notpresent in the target sequence if no alignment of either length100 bp and sequence identity of 75 or length 50 bp andsequence identity 90 was found For the reported resultsonly such identified regions of a minimum length of 1000 bpwere considered

Genome Assembly Annotation

The protein sequences of all nuclear ORFs annotated as gen-uine genes in the reference genomes of S cerevisiae andS paradoxus respectively (in the former species all ORFsnot classified as ldquodubiousrdquo and in the latter all ORFs classifiedas ldquorealrdquo) were searched for in the de novo assemblies usingexonerate version 23112 (Slater and Birney 2005) with theldquoprotein2dnardquo alignment model For each reference proteinquery the most likely homologous gene in each straingenome was inferred by sequence similarity and synteny com-parisons from the set of all candidate alignments with morethan 90 sequence similarity to the query and with an exon-erate alignment score within 90 of the top scoring align-ment for the gene This was done by identifying and selectingin priority order top-scoring alignments supported by genesynteny to the reference genome on both sides top-scoringalignments supported by synteny on one side and being right

884

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

next to the edge of the scaffold on the other side and nontopscoring alignments supported by synteny on both sides Ifmore than one reference protein query had their inferredhomolog localized to the same part of the assembly (bothstart and end positions differing by less than 100 bp) only oneof them was kept (prioritizing perfect synteny support com-bined with being the top scoring alignment then higher align-ment score and then longer gene length)

Ab initio gene prediction for genes not present in thereference genome was performed using GeneMarkS version410d (Besemer et al 2001) in the self-training mode and withthe ldquoeukrdquo option for intronless eukaryotic gene predictionPredicted genes that did not overlap the coordinates of theinferred reference genome homologs and that additionally toaccount for potential reference homologs missed by theabove homology inference did not display high similarity toany reference genome gene (defined as the presence of aBlastN hit covering either 80 of the query length or200 bp and with a sequence similarity of at least 90) wereclassified as nonreference genes For the numbers reportedonly predicted genes with lengths of 300 bp or more wereincluded

Read-Mapping and Variant Calling

For purposes of identification of SNPs and short indels readscleaned from adapter contamination using cutadapt (httpcodegooglecompcutadapt last accessed January 11 2011)were mapped to reference genomes and de novo assembliesusing Stampy 1018 (Lunter and Goodson 2011) with theldquosensitiverdquo parameter in hybrid mode with BWA 059 andwith the ldquoq 10rdquo BWA parameter Nonprimary alignmentsand nonproperly paired reads were filtered out and duplicatereads were removed using Picard (httppicardsourceforgenet last accessed October 22 2012) Before SNP calling readswere locally realigned using SRMA 0115 (Homer and Nelson2010) and read base qualities were capped by their BaseAlignment Qualities (Li 2011) as computed by samtools0118 (Li et al 2009) SNPs were called on the read alignmentsusing FreeBayes 095 (httpbioinformaticsbcedumarthlabwikiindexphpFreeBayes last accessed May 2 2012) set forhaploid samples Short indels were identified using Dindel101 (Albers et al 2011) executed in pooled modeIndividual indel genotype calls for each haploid strain wereobtained by extracting the genotype likelihoods as computedassuming diploid samples and assigning the correspondingallele if the log-likelihood ratio between the homozygousstates of the two alleles was bigger than 53 or smaller than53 assigning unknown genotype if in-between Only indelsshorter than 60 bp were called Indels were not counted asaffecting the ORF of a gene if the equivalent indel region(Krawitz et al 2010) was not completely contained withinthe ORF

Copy Number Variation

CNV was identified by mapping the Illumina reads to theS cerevisiae and S paradoxus reference genomes The averagedepth of read coverage was computed in nonoverlapping

windows of size 500 bp and normalized by the genome-wide median coverage for each strain and the log2 valuesof these ratios were then plotted Regions of the genomeshowing coverage variation between strains were identifiedmanually by systematically inspecting the plots Regions an-notated with transposable element associated features weremasked out at the plotting level and excluded from the anal-ysis As the S paradoxus reference genome is not comprehen-sively annotated for transposable elements masking wasapplied to regions of the genome displaying high sequencesimilarity to any S cerevisiae transposon-associated featuresequences (BlastN e-value lt1025) One strain ofS cerevisiae (YJM981) and four strains of S paradoxus(KPN3829 Q314 Q323 UFRJ50816) were excluded fromcopy number analysis because they had a coverage of readsmapped to the reference genome of below 8 For the genefamily-based CNV analysis read depth was aggregated acrossall members of a gene family and normalized by the genome-wide median coverage as above The gene families used arethose defined in Christiaens et al (2012) Although the trueamount of CNV in S paradoxus is likely slightly underesti-mated due to the presence of gaps in the reference genomesthis underestimation is unlikely to explain the large differencein the amount of CNV observed between S cerevisiae andS paradoxus 9 of subtelomeric sequence (and 15 of thewhole genome) in the S paradoxus reference assembly con-sists of gapsmdashassuming very conservatively that all of thisgapped subtelomeric sequence harbors CNV the estimatefor the total size of genomic regions containing CNVswould increase from 142 to 232 kb (and to 319 kb extendingthe assumption to all gapped sequence in the wholegenome) which is still considerably smaller than the 423 kbin S cerevisiae

ARR Gene Cluster Analysis

The haplotypes of the two ARR gene copy clusters in thestrain BC187 were phased by first calling SNPs in a 6400-bpregion encompassing the cluster using samtools mpileup onmapped Illumina reads and then phasing the SNPs using theReadBackedPhasing program from the Genome AnalysisToolkit (McKenna et al 2010) with both Illumina andSanger paired-end reads as input For the strains Y12 andDBVPG1373 the two ARR cluster copies were not sufficientlydiverged for phasing to be possible Neighbor-joining treeswere constructed by Seaview (Gouy et al 2010) and forY12 a single consensus sequence for the two haplotypeswas constructed for phylogenetic analysis by excluding thepositions where the two copies differ in sequence (five posi-tions) Data from all the strains on the mitotic growth ratelength of mitotic lag and mitotic growth efficiency ina medium containing 5 mM sodium arsenite oxide wasobtained from Warringer et al (2011)

Gene Ontology Enrichment

Gene ontology enrichment analyses were performed atYeastMine (httpyeastmineyeastgenomeorg last accessedFebruary 6 2013) with the BenjaminindashHochberg correction

885

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

for multiple testing and a P-value threshold of 005 As geneontology annotation is not directly available for S paradoxusgenes were mapped to their S cerevisiae orthologs and anal-yses were performed on the resulting S cerevisiae gene setsOrtholog mappings were obtained from Scannell et al (2011)

RIM15 Phenotyping

RIM15 reciprocal hemizygosity strains were constructedby one-step PCR deletion with URA3 as a selectablemarker (Salinas et al 2012) The gene was deleted in thehaploid versions of the four parental strains (either Mat ahoHygMX ura3KanMX or Mat a hoNatMX ura3KanMX)and deletions were confirmed by PCR Strains of oppositemating type were crossed to generate the hemizygotichybrid diploid strains Sporulation efficiency was then mea-sured as follows Strains were grown in 50 ml of YPG (2peptone 1 yeast extract 3 glycerol and 01 glucose)overnight washed with water three times transferred to50 ml of 2 potassium acetate and incubated with shakingat 23 C Samples were taken at 1 2 4 and 7 days after thestart of incubation and in each sample the number of cellshaving formed asci was counted using an optical microscopeTwo-hundred cells were assayed for each sample

Supplementary MaterialSupplementary figures S1ndashS6 and tables S1 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

The authors thank all the members of the Sanger Institutesequencing production teams for generating the sequencedata The authors also thank Conrad Nieduszynski for con-structive feedback on the manuscript and Magnus AlmRosenblad for assistance with computing resources Thiswork was supported by grants from ATIP-Avenir (CNRSINSERM) ARC (grant number SFI20111203947) and FP7-PEOPLE-2012-CIG (grant number 322035) to GL theWellcome Trust (grant number WT077192Z05Z) to RDand Becas Chile to FS

ReferencesAbecasis GR Auton A Brooks LD DePristo MA Durbin RM Handsaker

RE Kang HM Marth GT McVean GA 1000 Genomes ProjectConsortium et al 2012 An integrated map of genetic variationfrom 1092 human genomes Nature 49156ndash65

Akao T Yashiro I Hosoyama A Kitagaki H Horikawa H Watanabe DAkada R Ando Y Harashima S Inoue T et al 2011 Whole-genomesequencing of sake yeast Saccharomyces cerevisiae Kyokai no 7DNA Res 18423ndash434

Albers CA Lunter G MacArthur DG McVean G Ouwehand WHDurbin R 2011 Dindel accurate indel calls from short-read dataGenome Res 21961ndash973

Ambroset C Petit M Brion C Sanchez I Delobel P Guerin C ChiapelloH Nicolas P Bigey F Dequin S et al 2011 Deciphering the molecularbasis of wine yeast fermentation traits using a combined genetic andgenomic approach G3 1263ndash281

Argueso JL Carazzolle MF Mieczkowski PA Duarte FM Netto OVMissawa SK Galzerani F Costa GG Vidal RO Noronha MF et al2009 Genome structure of a Saccharomyces cerevisiae strain widelyused in bioethanol production Genome Res 192258ndash2270

Bashir A Klammer AA Robins WP Chin CS Webster D Paxinos E HsuD Ashby M Wang S Peluso P et al 2012 A hybrid approach forthe automated finishing of bacterial genomes Nat Biotechnol 30701ndash707

Ben-Shitrit T Yosef N Shemesh K Sharan R Ruppin E Kupiec M 2012Systematic identification of gene annotation errors in the widelyused yeast mutation collections Nat Methods 9373ndash378

Besemer J Lomsadze A Borodovsky M 2001 GeneMarkS a self-trainingmethod for prediction of gene starts in microbial genomesImplications for finding sequence motifs in regulatory regionsNucleic Acids Res 292607ndash2618

Boetzer M Henkel CV Jansen HJ Butler D Pirovano W 2011 Scaffoldingpre-assembled contigs using SSPACE Bioinformatics 27578ndash579

Borneman AR Desany BA Riches D Affourtit JP Forgan AH PretoriusIS Egholm M Chambers PJ 2011 Whole-genome comparison re-veals novel genetic elements that characterize the genome of indus-trial strains of Saccharomyces cerevisiae PLoS Genet 7e1001287

Borneman AR Forgan AH Pretorius IS Chambers PJ 2008 Comparativegenome analysis of a Saccharomyces cerevisiae wine strain FEMSYeast Res 81185ndash1195

Borneman AR Schmidt SA Pretorius IS 2013 At the cutting-edge ofgrape and wine biotechnology Trends Genet 29263ndash271

Broach JR 2012 Nutritional control of growth and development inyeast Genetics 19273ndash105

Brown CA Murray AW Verstrepen KJ 2010 Rapid expansion andfunctional divergence of subtelomeric gene families in yeasts CurrBiol 20895ndash903

Cao J Schneeberger K Ossowski S Gunther T Bender S Fitz J Koenig DLanz C Stegle O Lippert C et al 2011 Whole-genome sequencing ofmultiple Arabidopsis thaliana populations Nat Genet 43956ndash963

Christiaens JF Van Mulders SE Duitama J Brown CA Ghequire MG DeMeester L Michiels J Wenseleers T Voordeckers K Verstrepen KJet al 2012 Functional divergence of gene duplicates through ectopicrecombination EMBO Rep 131145ndash1151

Cliften P Sudarsanam P Desikan A Fulton L Fulton B Majors JWaterston R Cohen BA Johnston M 2003 Finding functional fea-tures in Saccharomyces genomes by phylogenetic footprintingScience 30171ndash76

Connelly CF Akey JM 2012 On the prospects of whole-genome asso-ciation mapping in Saccharomyces cerevisiae Genetics 1911345ndash1353

Conrad DF Pinto D Redon R Feuk L Gokcumen O Zhang Y Aerts JAndrews TD Barnes C Campbell P et al 2010 Origins and func-tional impact of copy number variation in the human genomeNature 464704ndash712

Cromie GA Hyma KE Ludlow CL Garmendia-Torres C Gilbert TL MayP Huang AA Dudley AM Fay JC 2013 Genomic sequence diversityand population structure of Saccharomyces cerevisiae assessed byRAD-seq G3 32163ndash2171

Cubillos FA Billi E Zorgo E Parts L Fargier P Omholt S Blomberg AWarringer J Louis EJ Liti G et al 2011 Assessing the complex ar-chitecture of polygenic traits in diverged yeast populations Mol Ecol201401ndash1413

Cubillos FA Parts L Salinas F Bergstrom A Scovacricchi E Zia AIllingworth CJ Mustonen V Ibstedt S Warringer J et al 2013High resolution mapping of complex traits with a four-parent ad-vanced intercross yeast population Genetics 1951141ndash1155

Doniger SW Kim HS Swain D Corcuera D Williams M Yang SP Fay JC2008 A catalog of neutral and deleterious polymorphism in yeastPLoS Genet 4e1000183

Dowell RD Ryan O Jansen A Cheung D Agarwala S Danford TBernstein DA Rolfe PA Heisler LE Chin B et al 2010 Genotypeto phenotype a complex problem Science 328469

Dujon B 2010 Yeast evolutionary genomics Nat Rev Genet 11512ndash524Dujon B Sherman D Fischer G Durrens P Casaregola S Lafontaine I De

Montigny J Marck C Neuveglise C Talla E et al 2004 Genomeevolution in yeasts Nature 43035ndash44

Dunn B Richter C Kvitek DJ Pugh T Sherlock G 2012 Analysis of theSaccharomyces cerevisiae pan-genome reveals a pool of copy

886

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

number variants distributed in diverse yeast strains from differingindustrial environments Genome Res 22908ndash924

Ehrenreich IM Bloom J Torabi N Wang X Jia Y Kruglyak L 2012Genetic architecture of highly complex chemical resistance traitsacross four yeast strains PLoS Genet 8e1002570

Ehrenreich IM Torabi N Jia Y Kent J Martis S Shapiro JA Gresham DCaudy AA Kruglyak L 2010 Dissection of genetically complex traitswith extremely large pools of yeast segregants Nature 4641039ndash1042

Fischer G James SA Roberts IN Oliver SG Louis EJ 2000 Chromosomalevolution in Saccharomyces Nature 405451ndash454

Fraser HB Moses AM Schadt EE 2010 Evidence for widespread adaptiveevolution of gene expression in budding yeast Proc Natl Acad SciU S A 1072977ndash2982

Gouy M Guindon S Gascuel O 2010 SeaView version 4 a multiplat-form graphical user interface for sequence alignment and phyloge-netic tree building Mol Biol Evol 27221ndash224

Hittinger CT 2013 Saccharomyces diversity and evolution a buddingmodel genus Trends Genet 29309ndash317

Homer N Nelson SF 2010 Improved variant discovery through local re-alignment of short-read next-generation sequencing data usingSRMA Genome Biol 11R99

Hyma KE Fay JC 2013 Mixing of vineyard and oak-tree ecotypes ofSaccharomyces cerevisiae in North American vineyards Mol Ecol 222917ndash2930

Hyma KE Saerens SM Verstrepen KJ Fay JC 2011 Divergence in winecharacteristics produced by wild and domesticated strains ofSaccharomyces cerevisiae FEMS Yeast Res 11540ndash551

Illingworth CJ Parts L Bergstrom A Liti G Mustonen V 2013 Inferringgenome-wide recombination landscapes from advanced intercrosslines application to yeast crosses PLoS One 8e62266

Jelier R Semple JI Garcia-Verdugo R Lehner B 2011 Predicting pheno-typic variation in yeast from individual genome sequences NatGenet 431270ndash1274

Johnston M Hillier L Riles L Albermann K Andre B Ansorge W Benes VBruckner M Delius H Dubois E et al 1997 The nucleotide sequenceof Saccharomyces cerevisiae chromosome XII Nature 38787ndash90

Kellis M Patterson N Endrizzi M Birren B Lander ES 2003 Sequencingand comparison of yeast species to identify genes and regulatoryelements Nature 423241ndash254

Koufopanou V Hughes J Bell G Burt A 2006 The spatial scale ofgenetic differentiation in a model organism the wild yeastSaccharomyces paradoxus Philos Trans R Soc Lond B Biol Sci 3611941ndash1946

Krawitz P Rodelsperger C Jager M Jostins L Bauer S Robinson PN 2010Microindel detection in short-read sequence data Bioinformatics 26722ndash729

Kumar P Henikoff S Ng PC 2009 Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithmNat Protoc 41073ndash1081

Li H 2011 Improving SNP discovery by base alignment qualityBioinformatics 271157ndash1158

Li H Durbin R 2009 Fast and accurate short read alignment withBurrows-Wheeler transform Bioinformatics 251754ndash1760

Li H Handsaker B Wysoker A Fennell T Ruan J Homer N Marth GAbecasis G Durbin R 1000 Genome Project Data ProcessingSubgroup et al 2009 The sequence alignmentmap format andSAMtools Bioinformatics 252078ndash2079

Liti G Carter DM Moses AM Warringer J Parts L James SA Davey RPRoberts IN Burt A Koufopanou V et al 2009 Population genomicsof domestic and wild yeasts Nature 458337ndash341

Liti G Haricharan S Cubillos FA Tierney AL Sharp S Bertuch AA PartsL Bailes E Louis EJ 2009 Segregating YKU80 and TLC1 alleles un-derlying natural variation in telomere properties in wild yeast PLoSGenet 5e1000659

Liti G Louis EJ 2012 Advances in quantitative trait analysis in yeast PLoSGenet 8e1002912

Liti G Nguyen Ba AN Blythe M Muller CA Bergstrom A Cubillos FADafhnis-Calas F Khoshraftar S Malla S Mehta N et al 2013 High

quality de novo sequencing and assembly of the Saccharomycesarboricolus genome BMC Genomics 1469

Liti G Peruffo A James SA Roberts IN Louis EJ 2005 Inferencesof evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in theSaccharomyces sensu stricto complex Yeast 22177ndash192

Liti G Schacherer J 2011 The rise of yeast population genomics C R Biol334612ndash619

Loomis EW Eid JS Peluso P Yin J Hickey L Rank D McCalmon SHagerman RJ Tassone F Hagerman PJ et al 2013 Sequencing theunsequenceable expanded CGG-repeat alleles of the fragile X geneGenome Res 23121ndash128

Lundblad V Blackburn EH 1993 An alternative pathway foryeast telomere maintenance rescues est1-senescence Cell 73347ndash360

Lunter G Goodson M 2011 Stampy a statistical algorithm for sensitiveand fast mapping of Illumina sequence reads Genome Res 21936ndash939

MacArthur DG Balasubramanian S Frankish A Huang N Morris JWalter K Jostins L Habegger L Pickrell JK Montgomery SB et al2012 A systematic survey of loss-of-function variants in humanprotein-coding genes Science 335823ndash828

Marullo P Aigle M Bely M Masneuf-Pomarede I Durrens PDubourdieu D Yvert G 2007 Single QTL mapping and nucleo-tide-level resolution of a physiologic trait in wine Saccharomycescerevisiae strains FEMS Yeast Res 7941ndash952

McKenna A Hanna M Banks E Sivachenko A Cibulskis K Kernytsky AGarimella K Altshuler D Gabriel S Daly M et al 2010 TheGenome Analysis Toolkit a MapReduce framework for analyz-ing next-generation DNA sequencing data Genome Res 201297ndash1303

Nijkamp JF van den Broek M Datema E de Kok S Bosman L Luttik MADaran-Lapujade P Vongsangnak W Nielsen J Heijne WH et al 2012De novo sequencing assembly and analysis of the ge-nome of the laboratory strain Saccharomyces cerevisiaeCENPK113-7D a model for modern industrial biotechnologyMicrob Cell Fact 1136

Ning Z Cox AJ Mullikin JC 2001 SSAHA a fast search method for largeDNA databases Genome Res 111725ndash1729

Novo M Bigey F Beyne E Galeote V Gavory F Mallet S Cambon BLegras JL Wincker P Casaregola S et al 2009 Eukaryote-to-eukaryote gene transfer events revealed by the genome sequenceof the wine yeast Saccharomyces cerevisiae EC1118 Proc Natl AcadSci U S A 10616333ndash16338

Parts L Cubillos FA Warringer J Jain K Salinas F Bumpstead SJ MolinM Zia A Simpson JT Quail MA et al 2011 Revealing the geneticstructure of a trait by sequencing a population under selectionGenome Res 211131ndash1138

Prochazka E Franko F Polakova S Sulo P 2012 A complete sequence ofSaccharomyces paradoxus mitochondrial genome that restores therespiration in S cerevisiae FEMS Yeast Res 12819ndash830

Purcell S Neale B Todd-Brown K Thomas L Ferreira MA Bender DMaller J Sklar P de Bakker PI Daly MJ et al 2007 PLINK a tool setfor whole-genome association and population-based linkage analy-ses Am J Hum Genet 81559ndash575

Ralser M Kuhl H Ralser M Werber M Lehrach H Breitenbach MTimmermann B 2012 The Saccharomyces cerevisiae W303-K6001cross-platform genome sequence insights into ancestry and physi-ology of a laboratory mutt Open Biol 2120093

Replansky T Koufopanou V Greig D Bell G 2008 Saccharomyces sensustricto as a model system for evolution and ecology Trends Ecol Evol23494ndash501

Ruderfer DM Pratt SC Seidel HS Kruglyak L 2006 Population genomicanalysis of outcrossing and recombination in yeast Nat Genet 381077ndash1081

Salinas F Cubillos FA Soto D Garcia V Bergstrom A Warringer J GangaMA Louis EJ Liti G Martinez C et al 2012 The genetic basis ofnatural variation in oenological traits in Saccharomyces cerevisiaePLoS One 7e49640

887

A High-Definition View of Yeast Genome Variation doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE

Sasaki M Tischfield SE van Overbeek M Keeney S et al 2013 Meioticrecombination initiation in and around retrotransposable elementsin Saccharomyces cerevisiae PLoS Genet 9e1003732

Scannell DR Zill OA Rokas A Payen C Dunham MJ Eisen MB Rine JJohnston M Hittinger CT 2011 The awesome power of yeast evo-lutionary genetics new genome sequences and strain resources forthe Saccharomyces sensu stricto genus G3 111ndash25

Schacherer J Ruderfer DM Gresham D Dolinski K Botstein DKruglyak L 2007 Genome-wide analysis of nucleotide-level varia-tion in commonly used Saccharomyces cerevisiae strains PLoS One2e322

Schacherer J Shapiro JA Ruderfer DM Kruglyak L 2009 Comprehensivepolymorphism survey elucidates population structure ofSaccharomyces cerevisiae Nature 458342ndash345

Shannon P Markiel A Ozier O Baliga NS Wang JT Ramage D Amin NSchwikowski B Ideker T et al 2003 Cytoscape a software environ-ment for integrated models of biomolecular interaction networksGenome Res 132498ndash2504

Simpson JT Durbin R 2012 Efficient de novo assembly of large genomesusing compressed data structures Genome Res 22549ndash556

Sinha H David L Pascon RC Clauder-Munster S Krishnakumar SNguyen M Shi G Dean J Davis RW Oefner PJ et al 2008Sequential elimination of major-effect contributors identifies addi-tional quantitative trait loci conditioning high-temperature growthin yeast Genetics 1801661ndash1670

Skelly DA Merrihew GE Riffle M Connelly CF Kerr EO Johansson MJaschob D Graczyk B Shulman NJ Wakefield J et al 2013Integrative phenomics reveals insight into the structure of pheno-typic diversity in budding yeast Genome Res 231496ndash1504

Slater GS Birney E 2005 Automated generation of heuristics for bio-logical sequence comparison BMC Bioinformatics 631

Sniegowski PD Dombrowski PG Fingerman E 2002 Saccharomycescerevisiae and Saccharomyces paradoxus coexist in a natural wood-land site in North America and display different levels of reproduc-tive isolation from European conspecifics FEMS Yeast Res 1299ndash306

Steinmetz LM Sinha H Richards DR Spiegelman JI Oefner PJ McCuskerJH Davis RW 2002 Dissecting the architecture of a quantitative traitlocus in yeast Nature 416326ndash330

Sudmant PH Huddleston J Catacchio CR Malig M Hillier LW Baker CMohajeri K Kondova I Bontrop RE Persengiev S et al 2013

Evolution and diversity of copy number variation in the great apelineage Genome Res 231373ndash1382

Sudmant PH Kitzman JO Antonacci F Alkan C Malig M Tsalenko ASampas N Bruhn L Shendure J 1000 Genomes Project et al 2010Diversity of human copy number variation and multicopy genesScience 330641ndash646

Theis C Reeder J Giegerich R 2008 KnotInFrame prediction of -1ribosomal frameshift events Nucleic Acids Res 366013ndash6020

Tsai IJ Bensasson D Burt A Koufopanou V 2008 Population genomicsof the wild yeast Saccharomyces paradoxus quantifying the lifecycle Proc Natl Acad Sci U S A 1054957ndash4962

Warringer J Zorgo E Cubillos FA Zia A Gjuvsland A Simpson JTForsmark A Durbin R Omholt SW Louis EJ et al 2011 Trait var-iation in yeast is defined by population history PLoS Genet 7e1002111

Watanabe D Araki Y Zhou Y Maeya N Akao T Shimoi H 2012 Aloss-of-function mutation in the PAS kinase Rim15p is related todefective quiescence entry and high fermentation rates ofSaccharomyces cerevisiae sake yeast strains Appl EnvironMicrobiol 784008ndash4016

Wei W McCusker JH Hyman RW Jones T Ning Y Cao Z Gu Z BrunoD Miranda M Nguyen M et al 2007 Genome sequencing andcomparative analysis of Saccharomyces cerevisiae strain YJM789Proc Natl Acad Sci U S A 10412825ndash12830

Wenger JW Schwartz K Sherlock G 2010 Bulk segregant analysis byhigh-throughput sequencing reveals a novel xylose utilization genefrom Saccharomyces cerevisiae PLoS Genet 6e1000942

Woods S Coghlan A Rivers D Warnecke T Jeffries SJ Kwon T Rogers AHurst LD Ahringer J 2013 Duplication and retention biases of es-sential and non-essential genes revealed by systematic knockdownanalyses PLoS Genet 9e1003330

Zhang H Skelton A Gardner RC Goddard MR 2010 Saccharomycesparadoxus and Saccharomyces cerevisiae reside on oak trees in NewZealand evidence for migration from Europe and interspecies hy-brids FEMS Yeast Res 10941ndash947

Zheng DQ Wang PM Chen J Zhang K Liu TZ Wu XC Li YD Zhao YH2012 Genome sequencing and genetic breeding of a bioethanolSaccharomyces cerevisiae strain YJS329 BMC Genomics 13479

Zorgo E Gjuvsland A Cubillos FA Louis EJ Liti G Blomberg A OmholtSW Warringer J 2012 Life history shapes trait heredity by accumu-lation of loss-of-function alleles in yeast Mol Biol Evol 291781ndash1789

888

Bergstrom et al doi101093molbevmsu037 MBE