supplementary information · s6.2 smed_v6 transcriptome assembly: raw data, assembly parameters,...

Table of Contents

S1: Genomic DNA Isolation
  S1.1 Genomic DNA preparation
  S1.2 Mucus removal from planarian surface
  S1.3 HMW DNA isolation for PacBio sequencing and Chicago libraries
  S1.4 Post purification of DNA using CTAB
S2: PacBio Long Read Sequencing
  S2.1 Large insert libraries
S3: Assembly Pipeline
  3.1 MARVEL assembler
  3.2 MARVEL - Determining read quality
  3.3 MARVEL - Annotation of repeat regions
  3.4 MARVEL - Read correction
  3.5 MARVEL - Computational requirements
S4: Chicago/HiRise Scaffolding
  S4.1 Chicago library preparation and sequencing
  S4.2 HiRise scaffolding of PacBio-MARVEL assembly
  S4.3 Analysis of HiRise changes to PacBio-MARVEL assembly
  S4.4 Scaffold gap closure using PBJelly
S5: Assembly Polishing
  S5.1 Final Error Correction with Quiver
  S5.2 Redundancy filtering
  S5.3 Consensus polishing using Pilon
S6: RNAseq and Transcriptomes
  S6.1 RNA extraction and sequencing
  S6.2 Smed_v6 transcriptome assembly: Raw data, assembly parameters, filtering, annotations

WWW.NATURE.COM/NATURE | SUPPLEMENTARY INFORMATION doi:10.1038/nature25473



  S6.3 Smes_v1 transcriptome assembly: Raw data, assembly parameters, filtering, annotations
S7: Assembly Quality Control
  S7.1 Transcriptome back-mapping
  S7.2 Genome completeness analysis
  S7.3 Transcript-based genome assembly quality control
  S7.4 Contamination check
S8: UCSC Genome Browser
  S8.1 GEM-Mappability track
  S8.2 RepeatMasker – Repbase track
  S8.3 RepeatMasker – RepeatModeler track
  S8.4 Alternative Scaffolds track
  S8.5 Quiver track
  S8.6 PacBio coverage track
  S8.7 Transcripts track
  S8.8 Links to PlanMine
S9: Repetitive Element Prediction
  S9.1 De novo prediction of repetitive elements
S10: LTR Analysis
  S10.1 LTR consensus construction
  S10.2 Gypsy LTR element annotation
  S10.3 The Burro family
  S10.4 LTR element expression analysis
  S10.5 LTR element divergence analysis
  S10.6 Solo/paired LTR element abundance analysis
  S10.7 Scaffold end analysis
S11: AAT-Repeat Analysis
  S11.1 AAT-microsatellite prediction and read alignment break analysis
  S11.2 Microsatellite repeat length estimation by Circular Consensus Sequencing (CCS)
  S11.3 Overlap length variation in CCS and P6/C4 reads
  S11.4 Genomic context of AT-rich regions
S12: Assembly Heterozygosity
  S12.1 Drosophila assembly
  S12.2 Dot plot analysis
  S12.3 Coverage analysis alt vs. main contigs
S13: Gene Annotation
  S13.1 Gene annotation and gene number estimate
  S13.2 Gene structure
  S13.3 Orthology analysis
S14: Divergence Analysis
  S14.1 Divergence analysis
S15: Synteny Analysis
  S15.1 Synteny comparisons
S16: Planarian Specific Gene Sequences
  S16.1 Identification of planarian-specific genes
  S16.2 Domain predictions
  S16.3 Expression analysis
S17: Gene Loss in Planarians
  S17.1 Lost genes identification and verification
  S17.2 Essentiality assessment of lost genes
  S17.3 Mad1/Mad2 multiple sequence alignments
S18: Planarian SAC Function
  S18.1 Quantification of mitotic index in dissociated cell preparations
S19: Miscellaneous Experimental Methods
  S19.1 Animal husbandry
  S19.2 Transcript cloning
  S19.3 Whole mount in situ hybridization
  S19.4 Karyotyping
Supplementary Tables
References

S1: Genomic DNA Isolation

S1.1 Genomic DNA preparation

Our genomic DNA isolation protocol improves upon a previously developed DNA isolation method for planarians¹, making it compatible with single molecule real-time (SMRT) sequencing² on a PacBio RSII and the Chicago method³. Important modifications include planarian external mucus removal, the choice of salt and alcohol for DNA precipitation enabling pigment removal, and a post-purification step that removes likely remaining mucopolysaccharides, which tend to co-isolate with gDNA.

S1.2 Mucus removal from planarian surface

This step makes use of the mucolytic agent N-acetyl-L-cysteine (NAC) to strip the animals of surface mucus, which can be excreted in copious amounts by large animals. Since NAC is acidic in aqueous solution and the highest mucolytic activity is observed between pH 7 and 9, the solution was neutralized while the pH was monitored using the common pH indicator dye phenol red⁴.

The animals were washed in 10 ml freshly prepared NAC stripping solution (0.5 w/v N-acetyl-L-cysteine, 20 mM HEPES-NaOH, pH 7.25, 5 µl phenol-red solution (0.5 % w/v), pH to ~7 using 1 M NaOH) for 10-15 min at room temperature with strong agitation (e.g. rotator). The worms were rinsed briefly in distilled water prior to further processing.

S1.3 HMW DNA isolation for PacBio sequencing and Chicago libraries

All procedures were carried out at room temperature (RT) unless otherwise specified. For handling HMW DNA, only wide-bore tips were used and the sample was mixed only by careful inversion. For DNA isolation, 20 mucus-stripped animals (~1 cm; starved for 1-3 weeks) were transferred to a 50 ml tube and residual liquid was removed. The animals were lysed in 15 ml of cold GTC buffer (4 M guanidinium thiocyanate, 25 mM sodium citrate, 0.5 % (w/v) N-Lauroylsarcosine, 7 % v/v β-mercaptoethanol) for 30 min on ice. The tube was carefully inverted every 10 min to help dissociate the tissue.

Then, 1 volume of phenol/chloroform (Phenol/chloroform/isoamyl alcohol (25:24:1), 10 mM Tris (pH 8.0), 1 mM EDTA) was added to the lysate and mixed by carefully inverting the tube 10 to 15 times. Phases were separated by centrifugation for 20 min at 4,000 x g at 4 °C and the upper phase was carefully transferred to a fresh tube. The phenol/chloroform extraction was repeated 1-2 times, until the interphase disappeared. Residual phenol was removed by extracting the aqueous phase once with 1 volume of chloroform, mixing by inversion 10 to 15 times, followed by another round of centrifugation.

The upper aqueous phase was transferred to a fresh tube and 1 volume of ice-cold 5 M NaCl was added and mixed well by inverting the tube several times. The tube was placed on ice for 15 min to salt out additional contaminants, which were pelleted by centrifugation for 10 min at maximum speed at 4 °C. The supernatant containing the nucleic acids was carefully decanted or pipetted off to a fresh tube without disturbing the pellet. The nucleic acids were precipitated using 0.7-1 volumes of isopropanol and centrifugation at 2,000 x g for 30-45 min at RT. The pellet was carefully washed with 70 % ethanol, spun at 2,000 x g for 5 min at RT, briefly air-dried and re-suspended in 50 µl TE buffer overnight at 4 °C.

S1.4 Post purification of DNA using CTAB

All procedures were carried out at room temperature, as cetyltrimethylammonium bromide (CTAB) precipitates at temperatures lower than 15 °C. This step allowed the removal of contaminants (e.g. polysaccharides) from DNA preparations by complex formation through tuning the NaCl concentration⁵.

The isolated and re-suspended nucleic acids were treated with 1 µl of 4 mg/ml RNase A and incubated for 1 h at 37 °C to remove RNA. The NaCl concentration was adjusted by adding 0.3 volumes of 5 M NaCl and the volume was adjusted to 500 µl with 2 % CTAB (w/v) / 1.4 M NaCl solution. After careful and thorough mixing by inversion, 1 vol of chloroform was added and carefully mixed by inversion until the sample turned milky. The phases were separated by centrifugation for 15 min at 12,000–16,000 x g at RT. The upper clear phase was recovered without disturbing the white interphase using a wide-bore pipette tip and transferred to a new tube. The DNA was precipitated using 0.7-1 volumes of isopropanol and centrifugation at 2,000 x g for 30-45 min at RT. The pellet was carefully washed with 70 % ethanol, spun at 2,000 x g for 5 min at RT, briefly air-dried and re-suspended in 50 µl TE buffer overnight at 4 °C. Quality of the DNA was checked by agarose gel electrophoresis and quantified using the Qubit™ fluorometer (dsDNA BR assay).

S2: PacBio Long Read Sequencing

Quality control of the high molecular weight genomic DNA was performed by spectrophotometric determination of 260/280 and 260/230 ratios using the Nanodrop device (Thermo Scientific). The 260/280 ratios were in the range of 1.66 to 2.08 and the 260/230 ratios between 1.84 and 2.01. The concentration was determined in triplicate via a Qubit™ fluorometer with the high sensitivity standard. Integrity of the HMW gDNA was determined by pulsed-field gel electrophoresis run using the Pippin Pulse™ device (SAGE Science). Before starting any PacBio library preparation, planarian gDNA samples were purified with 1X AMPure XP beads. Note that the AMPure method cannot recover planarian DNA out of crude lysates, possibly due to resorptive competition of the abundant contaminants.

S2.1 Large insert libraries

To prepare large insert libraries, AMPure purified genomic DNA was fragmented to sizes of 12-30 kbp with the Megaruptor device (Diagenode). For shearing, either 20 kbp shearing settings for medium insert SMRT bell libraries or 35 kbp shearing settings for large insert SMRT bell libraries were used. PacBio SMRT bell libraries were prepared as described in the company-supplied “Procedure and Checklist – 20 kbp Template Preparation Using BluePippin™ Size-Selection System”. SMRT bell libraries were size selected with the BluePippin™ system (Sage Science) with a minimum fragment length cutoff of 7.5 and 10 kbp for medium insert libraries and at 15 and 17.5 kbp for large insert libraries.

PacBio RSII SMRT cells were loaded with SMRT bell libraries after primer annealing and polymerase binding using MagBead loading. A total of 197 SMRT cells loaded with the medium insert libraries were sequenced on the PacBio RSII instrument making use of the P4 polymerase and the C2 sequencing chemistry (P4/C2). In addition, a total of 38 SMRT cells loaded with large insert libraries were sequenced on the RSII instrument with the P6 polymerase and C4 sequencing chemistry (P6/C4). Movie times for all P4/C2 SMRT cells were 240 minutes. Movie times for P6/C4 SMRT cells were 240 min (20 SMRT cells) and 360 minutes (23 SMRT cells).


S3: Assembly Pipeline

3.1 MARVEL assembler

At the time of the initial sequencing of Smed with the P4/C2 chemistry, the standard workflow for assembling noisy long-reads from PacBio was to clean the reads, either with Illumina sequencing data or by reducing the PacBio native error rate of 12-15 % to under 1 % by high coverage sequencing (PacBioToCA). After this correction step, the original Celera assembler⁶ was used to assemble the long-reads. However, initial cleaning of the reads can have one of two disadvantages, especially in a genome with a high repeat content such as Smed. Either the repeats within the reads are not cleaned and the assembly of such regions by the Celera assembler is hindered due to an error rate higher than expected (< 1 %), or the repeats are incorrectly cleaned due to the ambiguous character of overlaps between repeats. Both cases lead to incorrect and fragmented assemblies with possible misjoins. Therefore, MARVEL was developed to deal with notoriously complex genomes that contain a high fraction of repetitive genomic information and are sequenced with noisy long-read technologies (Novojilov et al., in preparation). For comparisons with the Canu assembler⁷ (see Table 1, main text), we used Canu v1.3 with standard parameters and the same uncorrected PacBio reads as for the MARVEL assembly.

The MARVEL assembler currently consists of three major phases, namely the setup phase, the patch phase and the assembly phase. In the setup phase, depending on the sequencing technology, the reads are extracted from the raw sequencing data and saved in an internal database. The patch phase detects and corrects read artefacts, including untrimmed adapters, polymerase strand jumps, and ligation chimeras, that are the primary impediments to long contiguous assemblies; the patched reads are then used for the final assembly phase. The assembly phase stitches short alignment artefacts resulting from bad sequencing segments within overlapping read pairs. The default parameter is generally 50 bp in length, but for Smed 300 bp were used. This step is followed by repeat annotation and the generation of the overlap graph, which is subsequently toured in order to generate the final contigs.
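The stitching step described above can be pictured with a small sketch. This is an illustration only, not MARVEL's implementation: the function name `stitch_segments`, the tuple layout and the rule that the gap must be short in both reads are assumptions made for the example. Only the 50 bp default and the 300 bp value used for Smed come from the text.

```python
from typing import List, Tuple

# One local alignment between a fixed pair of reads:
# (a_begin, a_end, b_begin, b_end). Hypothetical layout for this sketch.
Segment = Tuple[int, int, int, int]


def stitch_segments(segments: List[Segment], max_gap: int = 50) -> List[Segment]:
    """Merge consecutive local alignments separated by a short unaligned gap.

    Two segments are joined when the gap between them is at most
    max_gap bases in both reads (an assumed criterion).
    """
    stitched: List[Segment] = []
    for seg in sorted(segments):
        if stitched:
            a_beg, a_end, b_beg, b_end = stitched[-1]
            gap_a = seg[0] - a_end
            gap_b = seg[2] - b_end
            if 0 <= gap_a <= max_gap and 0 <= gap_b <= max_gap:
                # Bridge the bad segment by extending the previous alignment.
                stitched[-1] = (a_beg, seg[1], b_beg, seg[3])
                continue
        stitched.append(seg)
    return stitched


# Two alignment pieces interrupted by roughly 200 bp of bad sequence.
pieces = [(0, 4000, 100, 4100), (4200, 9000, 4310, 9100)]
print(stitch_segments(pieces, max_gap=50))   # pieces stay separate
print(stitch_segments(pieces, max_gap=300))  # pieces are bridged into one
```

With the 50 bp setting the interruption splits the overlap in two, whereas the larger 300 bp setting bridges it, which is the behaviour the larger value is meant to achieve for noisy, microsatellite-rich reads.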


A genome with a skewed GC-content, such as the Smed genome, has a different k-mer distribution than those with a uniform distribution of the GC content. This can be used to weigh certain k-mers and their expected distribution within the genome. Therefore, the AT bias option (-b) was used in DALIGNER⁸ (the noisy read aligner used in MARVEL to find the overlaps between reads) as well as in DBdust, to avoid masking AT-rich regions as low complexity and thereby excluding them from the k-mer seeding.
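To see why base composition changes the expected k-mer distribution, consider a simple independent-base model. The sketch below is not how DALIGNER or DBdust weigh k-mers internally; the helper names and the A+T fraction of 0.7 are illustrative assumptions rather than values taken from this document.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_kmer_prob(kmer: str, at_fraction: float) -> float:
    """Probability of kmer under an independent-base model.

    at_fraction is the combined frequency of A and T; A and T are assumed
    equally frequent, as are G and C.
    """
    p_at = at_fraction / 2.0          # probability of A (and of T)
    p_gc = (1.0 - at_fraction) / 2.0  # probability of G (and of C)
    prob = 1.0
    for base in kmer:
        prob *= p_at if base in "AT" else p_gc
    return prob


# Illustrative A+T fraction of 0.7 versus a uniform composition of 0.5.
skewed = expected_kmer_prob("ATATAT", at_fraction=0.7)
uniform = expected_kmer_prob("ATATAT", at_fraction=0.5)
print(skewed / uniform)
print(kmer_counts("ATATAT", 2))
```

Under these assumptions an AT-only 6-mer is expected about 7.5 times more often than in a genome with uniform composition, so a seeding or masking step calibrated for uniform composition would treat such k-mers as suspiciously frequent.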

In addition, alignments were forced through variable AT microsatellites (< 300 bp) despite often significant length discrepancies and non-uniformity of their instances in the reads. This allowed for a more contiguous assembly. The criteria for repeat module identification in overlaps were relaxed, allowing for better recognition of repeat modules and therefore resulting in improved resolution of repetitive elements.

Since the Smed dataset combined sequencing data generated by two different chemistries (P4/C2 and P6/C4), each chemistry batch was patched individually prior to combining the datasets as input for the assembly phase. In the assembly phase, the overlap of the combined dataset was calculated and assembled. After all tours were calculated, the contigs were corrected and a consensus sequence was generated by remapping the patched reads.

3.2 MARVEL - Determining read quality

In order to determine the quality of a read, the consensus of the aligned supporting reads was used. MARVEL does not determine the quality of individual bases, but rather the quality of the subread within the trace points (usually consisting of 100 bp). The end result of the process is the determination of a consensus base supported by the aligned read overlaps between the trace points.
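As a rough illustration of segment-level rather than base-level quality, the sketch below averages the number of differences observed in each 100 bp trace-point window over all alignments that cover it. The input layout, the function name and the use of a plain mean are assumptions made for this example and are not taken from the MARVEL source.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TRACE_SPACING = 100  # bases per trace-point segment, the usual value quoted above


def segment_error_rates(alignments: List[List[Tuple[int, int]]]) -> Dict[int, float]:
    """Average per-segment error rate of one read over its supporting alignments.

    Each alignment is a list of (segment_index, differences) pairs, where
    segment_index counts TRACE_SPACING-sized windows along the read.
    This layout is hypothetical and chosen only for the sketch.
    """
    diffs_by_segment: Dict[int, List[int]] = defaultdict(list)
    for alignment in alignments:
        for segment_index, differences in alignment:
            diffs_by_segment[segment_index].append(differences)
    return {
        seg: sum(d) / (len(d) * TRACE_SPACING)
        for seg, d in sorted(diffs_by_segment.items())
    }


# Two supporting alignments; the second one extends one segment further.
support = [[(0, 12), (1, 30)], [(0, 14), (1, 34), (2, 10)]]
rates = segment_error_rates(support)
print(rates)
```

In this toy input the second window shows a much higher error rate than its neighbours, which is the kind of low-quality stretch that a segment-level quality track is meant to expose.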

3.3 MARVEL - Annotation of repeat regions

The repeat regions were initially annotated by the masking server and later during the scrubbing phase by DBrepeat. While both use the alignments to determine repeat regions, the masking server annotates the repeat region dynamically and once the repeat coverage threshold is exceeded (in most cases twice the sequencing coverage) the region is masked and not used for subsequent k-mer seeding.

DBrepeat runs sequentially over the reads to annotate the repeat regions. Once a base with a coverage that exceeds the repeat coverage is identified, a repeat region is opened. A repeat region is closed once a base at or below the non-repeat coverage is found. The minimum overlap length of 1 kbp results in complete lack of repeat annotation for the 1 kbp regions at the start and end of the reads. DBhomogenize can be used to transitively transfer an existing repeat annotation based on the computed alignments, thereby properly annotating the tips of the reads.
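The open/close rule described for DBrepeat amounts to a scan with two thresholds. The following is a minimal sketch of that rule, not the DBrepeat code itself; the function name, the parameter names `open_cov` and `close_cov`, and the example coverage values are hypothetical.

```python
from typing import List, Optional, Tuple


def annotate_repeats(coverage: List[int], open_cov: int, close_cov: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) repeat intervals along one read.

    A region opens at the first base whose coverage exceeds open_cov and
    closes at the first following base whose coverage is at or below close_cov.
    """
    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for pos, cov in enumerate(coverage):
        if start is None and cov > open_cov:
            start = pos
        elif start is not None and cov <= close_cov:
            regions.append((start, pos))
            start = None
    if start is not None:  # region still open at the end of the read
        regions.append((start, len(coverage)))
    return regions


# Per-base alignment coverage along a toy read.
cov = [20, 22, 90, 95, 60, 45, 30, 21, 100, 100]
print(annotate_repeats(cov, open_cov=80, close_cov=40))
```

Using a higher threshold to open a region than to close it keeps a repeat from being fragmented when coverage dips briefly inside it. In the toy input, positions with coverage 60 and 45 remain inside the first region because they are still above the closing threshold.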

3.4 MARVEL - Read correction

Despite the noisy nature of PacBio reads, MARVEL does not clean the reads to an error rate of less than 1 %, as is common practice in other assemblers when dealing with noisy long-reads. The reads are first patched after the initial overlap phase, removing the major sequencing-related insertions/deletions (indels), chimeras and additional artefacts, which is necessary for quality overlaps that will be used for the assembly. After a second round of overlaps, MARVEL improves the alignments that are corrupt due to short sequencing errors. Finally, Quiver can be used after the assembly phase in order to generate a high-quality consensus sequence, but this often results in long run times, especially with large genomes. This can be improved by circumventing the use of BLASR⁹ by transforming the previously calculated read overlaps into the position of the reads within the contigs.

3.5 MARVEL - Computational requirements

For anything beyond bacterial samples, which can be assembled rather quickly on a standard laptop, a cluster environment with sufficient CPU and storage is recommended. We recommend having at least 8 GB of RAM per CPU core for the cluster nodes and a fast interconnect to the cluster-attached storage. RAM per core demand can be lessened by further increasing the partitioning of the data into blocks, at the price of increased IO demand. Example CPU times can be found in Supplementary Table 1 and have been measured on our Sandy-Bridge and Haswell architecture based cluster. Storage requirements are highly dependent on coverage and the repetitiveness of the genome. Also, storage requirements for a single phase of MARVEL can be found in Supplementary Table 1. For peak storage requirements, this number needs to be multiplied by two.

S4: Chicago/HiRise Scaffolding

S4.1 Chicago library preparation and sequencing

A Chicago library was prepared as described previously³. Briefly, ~500 ng of HMW gDNA (> 150 kbp mean fragment size) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was then digested with DpnII, the 5’ overhangs were filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was sheared to a mean fragment size of ~350 bp, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were then isolated using streptavidin beads before PCR enrichment of the library. The libraries were sequenced on an Illumina HiSeq 2500 in rapid run mode to produce 286 million 2x100 bp read pairs, amounting to 93x physical coverage (1-50 kbp spanning pairs) of the Smed genome.
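Physical coverage differs from sequence coverage in that it counts the genomic span between the two mates of a pair rather than the sequenced bases. The sketch below shows one common way to compute it, restricted to pairs spanning 1-50 kbp as in the figure quoted above. The function name, its arguments and the toy numbers are assumptions for illustration and do not reproduce the calculation behind the reported 93x.

```python
from typing import Iterable


def physical_coverage(insert_spans_bp: Iterable[int], genome_size_bp: int,
                      min_span: int = 1_000, max_span: int = 50_000) -> float:
    """Sum the spans of read pairs within [min_span, max_span], divided by genome size."""
    total = sum(span for span in insert_spans_bp if min_span <= span <= max_span)
    return total / genome_size_bp


# Toy example: four mapped pairs on a 4 kbp "genome".
# The 500 bp and 60 kbp pairs fall outside the 1-50 kbp window and are ignored.
spans = [500, 2_000, 10_000, 60_000]
print(physical_coverage(spans, genome_size_bp=4_000))
```

Because each retained pair contributes its full span, a library with long-range pairs can reach a high physical coverage even when the sequenced bases themselves cover the genome far fewer times.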

S4.2 HiRise scaffolding of PacBio-MARVEL assembly

A draft genome assembly of Smed (PacBio-MARVEL, 782.1 Mbp with a contig N50 of 708.7 kbp), Illumina shotgun sequence data (Illumina HiSeq 2500 rapid run mode, 2x150 bp read pairs, SRR5408395), and Chicago library read pairs in FASTQ format were used as input data for HiRise, a software pipeline designed specifically for using Chicago data for error-correcting and scaffolding genome assemblies³. Shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu/). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify putative misjoins and score prospective joins. After scaffolding, shotgun sequences were used to close gaps between contigs.


S4.3 Analysis of HiRise changes to PacBio-MARVEL assembly

MARVEL raw contigs were mapped onto HiRise scaffolds with DALIGNER to find the exact break positions within the raw contigs. Then, each position was visually inspected using the layout of the toured overlap graph (similar to Figure 2e, left graph, main text), aided by the direct visualization of read overlaps with the MARVEL tool LAexplorer.

    S4.4 Scaffold gap closure using PBJelly

    Following HiRise scaffolding, we used PBJelly10 to fill or reduce as many scaffolding
    gaps as possible. All 632 HiRise scaffolds were used as input for PBJelly, as well as all
    patched P4/C2 and P6/C4 chemistry reads longer than 8 kbp. The 632 scaffolds
    contained 1,256 gaps of 100 bp (HiRise default gap size). PBJelly completely closed
    482 gaps, partially closed 176 gaps, and overfilled 176 gaps, meaning that the gap
    was actually larger than 100 bp. In addition, PBJelly was able to join
    32 scaffolds into bigger scaffolds, reducing the overall scaffold number to 600.
    Overlap analysis between the scaffolds using DALIGNER (see below) showed that 119
    PBJelly scaffolds were almost completely contained within other scaffolds and
    consisted of >95 % repeat bases. These were therefore moved into the alternative
    contig set (_alt, see below), leaving 481 scaffolds as the final haploid dd_Smed_g4
    assembly.
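
    The scaffold bookkeeping in this section can be retraced with a few lines of
    arithmetic. The sketch below uses only the counts quoted above; it assumes that
    the 32 joined scaffolds correspond to a net reduction of 32, which is what the
    reported total of 600 implies. The variable names are illustrative.

```python
# Scaffold counts quoted in S4.4.
hirise_scaffolds = 632
net_reduction_by_joining = 32      # assumption: joins reduce the count by 32 in total
contained_repeat_scaffolds = 119   # moved into the _alt contig set

after_pbjelly = hirise_scaffolds - net_reduction_by_joining
final_haploid = after_pbjelly - contained_repeat_scaffolds

print(after_pbjelly, final_haploid)
```

    Both intermediate totals agree with the numbers given in the text (600 scaffolds
    after PBJelly, 481 in the final haploid assembly).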

    S5: Assembly Polishing

    S5.1 Final Error Correction with Quiver

    The final consensus sequence was generated first by a round of the PacBio tool
    Quiver. Due to the runtime of BLASR8, which is used by Quiver, only reads that were
    part of the overlap graph of each contig (i.e. the true non-repeat induced overlaps)
    were used to polish the contig sequences. This restriction was mainly a consequence
    of the AT-bias and high repeat content of the Smed genome. In order to enrich the
    coverage and improve the final polished sequence, additional subreads were extracted
    from the Zero Mode Waveguides and used for polishing (if available). This brought the
    coverage to approximately 40x.


    S5.2 Redundancy filtering

    DALIGNER was used to identify the haplotype content of the Smed PacBio-MARVEL
    assembly, by mapping of the contigs against each other. If at least 70 % of the bases
    in the two contigs as well as 70 % of the reads used to build the contigs were part of
    both contigs, they were considered alternate haplotypes and separated from the
    main assembly into a second set of alternative contigs, designated by an _alt suffix to
    the contig name. For contigs shorter than 100 kbp that terminated in long repetitive
    sequences, these thresholds were reduced to 50 %. This removed 1,509 alternative
    contigs with a total length of 168.1 Mbp (note that this step was performed prior to
    scaffolding).
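
    The thresholds above amount to a simple decision rule. The following is a minimal
    sketch of that rule, not the actual MARVEL/DALIGNER implementation; the function
    name and its inputs (pre-computed shared fractions and a flag for repeat-terminated
    contigs) are hypothetical.

```python
def is_alternate_haplotype(shared_base_fraction, shared_read_fraction,
                           contig_length_bp, ends_in_long_repeat):
    """Apply the S5.2 thresholds to one contig pair.

    shared_base_fraction / shared_read_fraction: fractions (0-1) of bases and
    of underlying reads that are part of both contigs (hypothetical inputs).
    """
    threshold = 0.70
    # Relaxed threshold for short contigs that terminate in long repeats.
    if contig_length_bp < 100_000 and ends_in_long_repeat:
        threshold = 0.50
    return (shared_base_fraction >= threshold
            and shared_read_fraction >= threshold)


print(is_alternate_haplotype(0.75, 0.72, 250_000, False))
print(is_alternate_haplotype(0.55, 0.60, 80_000, True))
```

    Both criteria have to hold at once, so a contig pair that shares many bases but
    few reads stays in the main assembly.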

    S5.3 Consensus polishing using Pilon

    For fine-polishing of residual PacBio base pair inaccuracies, we used Pilon11. This tool
    uses the evidence from short-read sequencing data in order to correct single base
    differences, small indels, block substitution events, and gaps. Illumina 2x150 bp
    shotgun HiSeq 2500 paired-end sequencing data (SRR5408395; see S4.2 “HiRise
    scaffolding of PacBio-MARVEL assembly”) was aligned with bowtie212 against the
    scaffolds of dd_Smed_g4 and the corresponding alignments were provided as input
    to Pilon. Since Pilon was designed mainly for smaller prokaryotic genomes, we ran it
    over each of the assembled scaffolds separately. To assess the assembly
    improvements, variants were called from the scaffolding PacBio data provided by
    Dovetail Genomics with samtools/bcftools before and after Pilon correction.
    Correction was assessed by counting InDels and SNPs per kbp, and independent
    rates were calculated for exons and genic regions, comparing the polished
    (dd_Smed_g4) with the raw assembly (PacBio-MARVEL). After Pilon polishing there
    was a 2.24-fold reduction of indels in intergenic, a 3.25-fold reduction in exonic and
    a 1.94-fold reduction in genic regions. The changes in SNP counts were minor,
    indicating that the detected differences correspond mostly to true variants.
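
    The per-kbp rates and fold reductions reported above follow from simple ratios.
    The sketch below shows the calculation; the variant counts and region length in
    the example are invented placeholders, not values from this study.

```python
def rate_per_kbp(variant_count, region_length_bp):
    """Variants per 1,000 bp of the region class (e.g. exonic sequence)."""
    return variant_count / (region_length_bp / 1000)


def fold_reduction(rate_before, rate_after):
    """How many times lower the rate is after polishing."""
    return rate_before / rate_after


# Placeholder numbers, chosen only to illustrate the arithmetic.
before = rate_per_kbp(650, 1_000_000)
after = rate_per_kbp(200, 1_000_000)
print(round(fold_reduction(before, after), 2))
```

    Normalising by region length first is what makes the exonic, genic and intergenic
    figures comparable, since the three region classes differ greatly in total size.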


    S6: RNAseq and Transcriptomes

    S6.1 RNA extraction and sequencing

    For total RNA isolation, 30 sexual animals (~1 cm, 2 weeks starved) of the inbred
    genome sequencing Smed strain were homogenized in 20 ml Trizol reagent
    (Invitrogen) on ice using a Miccra D-8 homogenizer (Miccra) and processed following
    the Trizol reagent protocol. Following phase separation by chloroform addition and
    isopropanol precipitation, the RNA pellet was dissolved in 350 µl RLT buffer and
    purified according to the RNeasy Mini Kit protocol, including on-column DNase I
    digestion. The RNA was eluted in nuclease-free water and stored at -80 °C. The total
    RNA was quantified using a NanoDrop 1000 spectrophotometer and RNA integrity was
    checked by capillary electrophoresis on a Bioanalyzer using the RNA 6000 Nano Kit.
    The sample was poly-A selected and sequenced on an Illumina NextSeq 500
    sequencer.

    S6.2 Smed_v6 transcriptome assembly: Raw data, assembly parameters,
    filtering, annotations

    The transcriptome assembly of the asexual strain has been published before13.
    Briefly, we used Trinity14 to assemble RNA-Seq data from a mixed set of publicly
    available data sources (Dresden, Berlin, Muenster). Transcripts were filtered for
    contaminants, directionality-corrected and subsequently annotated and filtered by
    ORF-content, domains, and RefSeq-protein BLAST hits to define a likely Protein
    Coding Fraction (PCF). For many applications, we further sub-filter the longest
    isoform of a Trinity graph component. Such uses of the transcriptome are indicated
    by a “PCFL” suffix, e.g. “Smed_v6.PCFL”. Details about the dataflow and the
    companion website to access the data are documented in the PlanMine13
    publication.

    S6.3 Smes_v1 transcriptome assembly: Raw data, assembly parameters,
    filtering, annotations

    The transcriptome assembly Smes_v1 of the sexual strain of Smed was assembled
    from SRR955511 using the same pipeline as for Smed_v6, except that the strand
    correction step and ordering of transcript numbers according to expression level
    were omitted. ORF directionality is therefore randomized in the Smes_v1 version.
    The PCFL-suffix, e.g., “Smes_v1.PCFL”, designates the longest graph component
    subset of the protein coding fraction, as defined above.

    S7: Assembly Quality Control

    S7.1 Transcriptome back-mapping

    De novo assembled transcriptomes of the asexual (Smed_v6) and sexual strains
    (Smes_v1) available at the PlanMine13 resource were used to assess the
    completeness of the dd_Smed_g4 assembly. For this purpose, the longest isoform
    per Trinity graph component of the protein-coding fraction (PCFL) was selected and
    mapped to the dd_Smed_g4 assembly using the GMAP aligner15, with a minimum
    query coverage cut-off of 0.6 and a minimum identity cut-off of 0.6. The resulting
    mappings were then used to investigate uniquely mapping, multi-mapping, and
    non-mapping transcripts, along with putative chimeric transcripts.
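
    The categorisation described above can be sketched as a filter followed by a count
    of surviving alignments per transcript. This is an illustrative outline only: it
    assumes the alignments have already been parsed into (transcript, coverage,
    identity) tuples, it does not call GMAP, and it leaves out the detection of
    chimeric transcripts.

```python
from collections import Counter


def classify_transcripts(alignments, all_transcripts,
                         min_coverage=0.6, min_identity=0.6):
    """Label each transcript as unique, multi-mapping or non-mapping.

    alignments: iterable of (transcript_id, query_coverage, identity),
    one tuple per reported mapping location (hypothetical input format).
    """
    passing = Counter(
        tid for tid, cov, ident in alignments
        if cov >= min_coverage and ident >= min_identity
    )
    labels = {}
    for tid in all_transcripts:
        n = passing.get(tid, 0)
        if n == 0:
            labels[tid] = "non-mapping"
        elif n == 1:
            labels[tid] = "unique"
        else:
            labels[tid] = "multi-mapping"
    return labels


toy_alignments = [
    ("t1", 0.95, 0.99),
    ("t2", 0.90, 0.98), ("t2", 0.88, 0.97),
    ("t3", 0.40, 0.99),   # fails the coverage cut-off
]
labels = classify_transcripts(toy_alignments, ["t1", "t2", "t3", "t4"])
print(labels)
```

    Raising both cut-offs to 0.9, as done for the high confidence subset in S7.3, only
    requires changing the two keyword arguments.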

    S7.2 Genome completeness analysis

    The two back-mapped transcriptomes were used to estimate dd_Smed_g4 assembly
    completeness. The fact that 98.7 % of transcripts of the sexual (Smes_v1) and 97.3 %
    of the asexual (Smed_v6) transcriptome mapped to the dd_Smed_g4 assembly under
    our mapping criteria demonstrates that the genome is largely complete. The
    remaining non-mapping transcripts (538 and 667 in the Smes_v1 and Smed_v6
    transcriptomes, respectively) could represent true positives, i.e. genes
    genuinely missing from the dd_Smed_g4 assembly. Alternatively, they could
    represent transcriptome contaminants (e.g., from commensal organisms or other
    sources of sequencing noise) and thus false positives. To distinguish between these
    two possibilities, we BLAST searched (blastp) all non-mapping Smes_v1 and Smed_v6
    transcript ORFs against NCBI RefSeq and recorded the species from which the best
    blastp hit of each transcript originated. We then checked if any species was
    specifically over-represented for non-mapping transcripts using the enricher function
    from the ClusterProfiler R package16. In this regard we found over-representation of
    the protist species Tetrahymena thermophila, Ichthyophthirius multifiliis, and
    Paramecium tetraurelia in Smes_v1, whilst Caenorhabditis elegans and Bos taurus
    were over-represented in Smed_v6. After the annotation of these transcripts as
    likely contaminants, we further classified the remaining genes as likely missing genes
    if they uniquely mapped to the SmedSxl v4.017 assembly and were annotated in
    PlanMine13 as having orthologues in at least 5 other planarian species. The remaining
    transcripts that did not fit the contaminant or likely missing gene criteria were
    marked as “unknown” and are likely enriched for assembly noise.

    Altogether, our analysis identified 46 out of a total of 31,966 Smed transcripts as
    genuinely missing from the dd_Smed_g4 assembly, amounting to a completeness of
    99.9985 %. Further, all missing genes could be identified on small contigs that had
    been removed by our initial assembly filtering. We therefore conclude that based on
    the mapping of previous transcriptome assemblies, the genome is of high quality. In
    contrast, if we consider Smes_v1 transcripts that map to our assembly
    (dd_Smed_g4), but do not map to the published Sanger genome assembly (SmedSxl
    v4.0)17, then we identify 2,193 such transcripts, of which 1,229 were also annotated in
    PlanMine as having homologues in at least 5 other planarian species. Overall, this
    analysis therefore demonstrates conclusively that the dd_Smed_g4 assembly is the
    most complete Smed genome assembly to date.

    S7.3 Transcript-based genome assembly quality control

    We also used transcriptome backmapping to further probe the quality of the
    genome assembly. Since an entire de novo assembled transcriptome cannot provide
    an error-free “ground truth” data set, we first sub-selected a high confidence set of
    Smed transcripts from our Smes_v1 de novo transcriptome assembly on PlanMine13.
    Specifically, we aligned primary ORFs (i.e. the longest non-overlapping open reading
    frame of each transcript) against the 7 other in-house assembled planarian
    proteomes currently available in PlanMine. The search was performed with blastp
    using the BLOSUM45 similarity matrix and an e-value cut-off of 1E-3. Blastp results
    were filtered for subject and query coverage of at least 90 % in at least 5 other
    transcriptomes. Altogether, this analysis identified 1,509 Smed ORFs that had clear
    full-length homologues in multiple other planarian species and therefore
    represented high confidence cDNAs.

    To stringently probe the quality of the dd_Smed_g4 assembly using the high
    confidence cDNA (HC-cDNA) subset, we further sub-filtered the previous GMAP
    transcriptome mapping results (see section S7.1) for a minimum query coverage of
    0.9 and a minimum identity cut-off of 0.9 and categorized the HC-cDNA as uniquely
    mapping, non-mapping, or multi-mapping. As shown in Extended Data Fig. 2a, 1,432
    out of the 1,509 (94.90 %) transcripts mapped uniquely to the dd_Smed_g4
    assembly, despite the high stringency and coverage filter criteria. Therefore, this
    provides a strong indication that the underlying gene sequences were correctly
    assembled. Of the 10 non-mapping transcripts (0.66 %), only 1 represented a
    genuine assembly mistake located at a contig end, whereby 4 of the 5’-exons were
    inverted and appended to the 3’-end of the gene (Extended Data Fig. 2b, c). Also, 2
    of the non-mapping transcripts represented GMAP false positives (both were
    BLAT-mappable with >0.9 query coverage and identity) and the other 7 were transcripts
    already identified as “missing gene (4)”, “unknown (2)” or “putative contaminant
    (1)”. Together, the high confidence gene analysis therefore confirms both the high
    assembly quality and completeness of the dd_Smed_g4 assembly.

    Furthermore, 67 out of the 1,509 HC-cDNA (4.44 %) were classified as
    multi-mapping. Manual inspection demonstrated that all of these genes indeed had
    multiple (usually 2) “perfect” map locations in the dd_Smed_g4 assembly. Some, e.g.
    dd_Smes_v1_25458 (Extended Data Fig. 2d), mapped on opposite strands within
    non-repetitive and high-quality regions of the genome, thus representing likely
    biological gene duplications. However, other multi-mapping locations, like that of
    dd_Smes_v1_28816, were situated in low-confidence regions close to assembly gaps
    (identified by bedtools closest18), indicated by lack of a respective Quiver track in
    that region (Extended Data Fig. 2e). Indeed, we found that the 67 multi-mapping
    transcripts mapped closer to contig ends than uniquely mapping high
    confidence genes (Extended Data Fig. 2f, g) and the size of the duplicated genomic
    region was below 5 kbp in all cases. The size of the duplicated regions was
    determined by mapping transcripts containing intronic sequences to the
    dd_Smed_g4 assembly using GMAP with parameters --chimera-margin=200,
    --min-trimmed-coverage=0.9 and --min-identity=0.9. If the transcript with intronic
    sequences still mapped to more than one locus in the genome, a flanking window of
    2 kbp (1 kbp up- and 1 kbp downstream) was added to the transcript sequence and
    GMAP was run again. This process was repeated until only one location was found
    for a given transcript.
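
    The iterative extension used to size the duplicated regions can be outlined as a
    loop that widens the query by 1 kbp on each side until it maps to a single locus.
    In the sketch below, `count_loci` is a hypothetical callback standing in for one
    GMAP run with the parameters given above; the round limit and the clipping at
    contig boundaries are assumptions added so that the loop always terminates.

```python
def extend_until_unique(start, end, contig_length, count_loci,
                        flank_bp=1000, max_rounds=50):
    """Widen [start, end) by flank_bp per side until it maps to one locus.

    count_loci(start, end) -> number of mapping locations of the extended
    sequence (placeholder for an external aligner call).
    Returns the final interval and the number of extension rounds.
    """
    rounds = 0
    while count_loci(start, end) > 1 and rounds < max_rounds:
        new_start = max(0, start - flank_bp)
        new_end = min(contig_length, end + flank_bp)
        if (new_start, new_end) == (start, end):
            break  # whole contig reached, cannot extend further
        start, end = new_start, new_end
        rounds += 1
    return start, end, rounds


# Toy stand-in: the sequence becomes unique once it spans at least 5 kbp.
def toy_count(s, e):
    return 1 if (e - s) >= 5000 else 2


print(extend_until_unique(10_000, 11_000, 100_000, toy_count))
```

    The length of the final interval at which the mapping becomes unique gives an
    upper estimate of the size of the duplicated region.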

    Altogether, this indicates that the majority of multi-mapping transcripts represent
    over-assembly artefacts in the dd_Smed_g4 assembly that are too small to be
    detected by the Chicago/HiRise scaffolding method. Likely, these contig end mistakes
    arise as a consequence of duplicated assembly of haplotypes and represent a known
    problem in assembling highly heterozygous genomes19.

    However, their low overall proportion only minimally impacts the quality of the
    dd_Smed_g4 assembly, and the various quality control tracks provided in the
    PlanMine genome browser (e.g., gap locations, mappability, quivered intervals)
    provide tools for the interpretation of dubious cases.

    Inter- and intra-contig similarity in Extended Data Figure 2c-e was visualized using
    miropeats20 with the described thresholds.

    S7.4 Contamination check

    Although all animals used for genome sequencing were exposed to a stringent
    antibiotic treatment prior to DNA extraction, the sequencing of whole animal
    extracts always comes at the risk of co-sequencing parasitic or commensal
    organisms. To assay for the putative presence of non-Smed contigs in our
    dd_Smed_g4 assembly, we screened all contigs for GC-content and coverage (this
    method was recently used to identify contamination in the original Hypsibius
    dujardini Tardigrade genome assembly)21. Coverage of dd_Smed_g4 contigs was
    computed by first aligning reads from SRR954965 and SRR849810 to dd_Smed_g4
    using bowtie212, and estimating coverage per scaffold with genomeCoverageBed18.
    GC content was determined using infoseq from the EMBOSS tools suite22. We could
    not detect major contig clusters that deviated from the Smed GC-content or
    haploid/diploid read coverage, thus indicating lack of bacterial or other
    contamination in the assembly (not shown).
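
    A GC-content and coverage screen of this kind can be sketched in a few lines. The
    example below is not the infoseq/genomeCoverageBed workflow used here; it computes
    GC content directly from a sequence string and flags contigs outside user-supplied
    ranges. The function names and the example thresholds are hypothetical.

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_outlier_contigs(stats, gc_range, coverage_range):
    """Return contig names whose GC content or coverage falls outside the ranges.

    stats: {contig_name: (gc_fraction, mean_coverage)}
    """
    flagged = []
    for name, (gc, cov) in stats.items():
        gc_ok = gc_range[0] <= gc <= gc_range[1]
        cov_ok = coverage_range[0] <= cov <= coverage_range[1]
        if not (gc_ok and cov_ok):
            flagged.append(name)
    return flagged


toy_stats = {
    "contig_a": (gc_content("ATATATGCAT"), 40.0),
    "contig_b": (gc_content("GCGCGCGCAT"), 3.0),
}
# Placeholder ranges, for illustration only.
print(flag_outlier_contigs(toy_stats, gc_range=(0.1, 0.5), coverage_range=(10, 100)))
```

    In practice the ranges would be chosen from the observed distribution of the bulk
    of the assembly rather than fixed in advance.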


    S8: UCSC Genome Browser

    To facilitate access to the dd_Smed_g4 assembly, we built a custom version of the
    “UCSC Genome Browser”. Further, we integrated the browser into PlanMine, the
    planarian transcriptome platform maintained by our group13. The genome
    enhancement thus further improves the utility of PlanMine towards its mission
    objective as a community platform for the planarian model system.

    The UCSC Genome Browser23 source code was downloaded from GitHub
    (https://github.com/ucscGenomeBrowser/kent) and several tracks were made
    available, which are briefly described in the following.

    S8.1 GEM-Mappability track

    Overall genome mappability was determined by the GEM mappability program24
    (GEM-binaries-Linux-x86_64-core_i3-20130406-045632). Briefly, using the fast
    mapping-based algorithm, it produces mappability scores for different lengths of
    NGS reads on the genome. A higher GEM-score at a given position indicates a higher
    chance of uniquely mapping a read of the respective length to that position of the
    genome. Regions with lower scores are less unique and will likely produce ambiguous
    mappings for the given read length. Generally, duplicated and repetitive regions
    generate lower mappability scores. One exception is repetitive regions in the
    assembly that were not error-corrected by the PacBio Quiver tool (see S8.5). These
    regions appear more unique due to the higher error rate and tend to have high
    mappability scores, but can easily be spotted by the absence of a Quiver track.

    To generate the mappability track, standard parameters along with a read length of
    200 were applied. After that, the output files were converted to bigwig using
    standard parameters. The “mappability_200” track is available under “Mapping and
    Sequencing Tracks” at the custom UCSC Genome Browser.

    S8.2 RepeatMasker - Repbase track

    In order to provide information about known repeats (i.e. RepBase), a RepeatMasker
    run using parameters -species “schmidtea mediterranea” and -s (for sensitive) was
    performed on the dd_Smed_g4 assembly. Also, RepeatMasker25 was configured
    using rmblastn (v2.2.28+) and TRF (v4.09)26. The output information is available at
    the custom UCSC Genome Browser under the “Variation and Repeats” section, in the
    track named “RepeatMasker Repbase”.

    S8.3 RepeatMasker - RepeatModeler track

    In order to also provide de novo repeat prediction, RepeatModeler (v1.0.8)27 was run
    on the dd_Smed_g4 assembly. Standard parameters were used and the tool was
    configured with the RepeatMasker patched version of RECON (v1.08)28, RepeatScout
    (v1.0.5)29, TRF (v4.09)26, and rmblastn (v2.2.28+). The resulting track is available in
    the “Variation and Repeats” section under “RepeatMasker - de novo”.

    S8.4 Alternative Scaffolds track

    The alternative scaffolds track highlights genomic regions for which an alternative
    haplotype exists. Each alternative haplotype (indicated with “_alt” suffix) is available
    in the browser (described in detail in section S12.3).

    S8.5 Quiver track

    Quiver status, described in detail in section S5.1, is represented genome-wide on
    this track. Lack of Quiver coverage indicates that a region was not error corrected
    and that the underlying sequence is to be interpreted with caution.

    S8.6 PacBio coverage track

    PacBio raw data (P4/C2 and P6/C4) were mapped to dd_Smed_g4 using bwa-mem30
    with the -x pacbio option. Using samtools31 and BEDTools genomeCoverage18, coverage
    information was generated and displayed on this track.

    S8.7 Transcripts track

    PlanMine13 Smed assembled transcripts (Smes_v1 and Smed_v6) were mapped to the
    dd_Smed_g4 assembly using GMAP (described in section S7.1).

    Coverage information was obtained by mapping the raw data used for the PlanMine
    transcriptome assemblies to dd_Smed_g4 using STAR32 with default parameters;
    samtools31 and BEDTools genomeCoverage18 were used to build the track.


    S8.8 Links to PlanMine

    The custom UCSC Genome Browser version is embedded in PlanMine, available at
    http://planmine.mpi-cbg.de. Also, each Smed transcript is linked between PlanMine
    and the UCSC Genome Browser.

    S9: Repetitive Element Prediction

    S9.1 De novo prediction of repetitive elements

    An initial prediction of repetitive element classes was performed using
    RepeatModeler (v1.0.8)27 and the resulting library was classified using RepeatMasker
    (v4.0.6)25 using libraries from RepBase (20160829 snapshot)33. The RepeatModeler
    library contained large numbers of models that could not be classified (classification
    ‘Unknown’) and which, together, masked 25.29 % of the assembly. Upon manual
    inspection, high proportions of the models were chimeric or contained elements on
    varying strands. Removal of these models from the library increased masking
    ascribed to other repeat classes, also suggesting conflict between the models
    produced by RepeatModeler.

    A library was manually created to ameliorate these issues by compiling repeat
    element classes from different sources, or from RepBase where no differences were
    noted between masking with RepBase entries in comparison to generated models.
    Simple repeats and DNA elements were taken directly from RepBase using the
    RepeatMasker “queryRepeatsDatabase.pl” script. Masking of LINE elements with the
    RepBase entries was poor in comparison to using models produced by
    RepeatModeler (4.39 % versus 10.98 % genome masking, respectively) and the latter
    were included in the library. LTR retroelement models were produced through a
    separate pipeline (see S10.1 “LTR element consensus construction”).

    A hard-masked assembly was generated using this library and this assembly was
    again used as input into RepeatModeler. Masking from models classified as
    ‘Unknown’ was reduced to 14.39 %. Again, manual inspection revealed these
    contained high proportions of fragmentary and chimeric models (1,502 of 1,768
    models) and these were excluded from downstream analysis.


    For all masking, RepeatMasker was configured with rmblastn (v2.2.27+) and TRF
    (v4.09)26 and was run in sensitive (-s) mode.

    S10: LTR Analysis

    S10.1 LTR consensus construction

    LTR elements were identified through an approach combining outputs from 4
    prediction programs: LTR_FINDER (v1.0.6)34 (run with and without inbuilt masking of
    highly repetitive sequences), LTRharvest (GenomeTools v1.5.9)35, MGEScan-LTR36,
    and Red37. Where programs expose the options, maximum element size constraints
    were set to 50 kbp and minimum LTR sizes to 150 bp. As this was
    program-dependent, however, all identified features were further filtered on the
    presence of >150 bp LTRs and to be between 2 and 50 kbp in length. Passing features
    overlapping >=50 % were merged using BEDOPS (v2.4.20)38 to give the locations of
    final predictions.
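
    The size filter applied to the pooled predictions can be written as a single
    predicate. The sketch below is illustrative: the function name and the input
    representation are hypothetical, and it assumes that the 2 to 50 kbp bounds are
    inclusive, which the text does not specify. The subsequent merging of overlapping
    features (done with BEDOPS) is not reproduced.

```python
def passes_ltr_filter(element_start, element_end, ltr_lengths,
                      min_ltr_bp=150, min_len_bp=2_000, max_len_bp=50_000):
    """Keep predictions with >150 bp LTRs and a total length of 2-50 kbp.

    ltr_lengths: lengths of the two terminal repeats of the prediction.
    """
    length = element_end - element_start
    ltrs_ok = all(ltr > min_ltr_bp for ltr in ltr_lengths)
    # Inclusive bounds are an assumption made for this sketch.
    return ltrs_ok and min_len_bp <= length <= max_len_bp


print(passes_ltr_filter(10_000, 35_000, (5_000, 4_900)))
print(passes_ltr_filter(10_000, 11_500, (300, 310)))
```

    Applying one uniform filter after pooling compensates for the fact that not every
    prediction program exposes the same size options.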

    Predicted elements were translated in all frames and scanned using hmmsearch
    (HMMER v3.1b2)39 for the 'rve' (PF00665.21), 'RVP' (PF00077.15), and 'RVT_1'
    (PF00078.22) Pfam HMMs40, representing integrase (IN), protease (PR), and reverse
    transcriptase (RT), respectively. Matching regions were extracted for all elements
    and separate alignments built for IN, PR, and RT using MUSCLE (v3.8.31)41. For
    elements containing all three HMM features, alignments were concatenated and a
    neighbour-joining tree built with MUSCLE. Clusters were generated using
    tree2clusters (USEARCH v9.0.2132)42 using a minimum identity threshold of 80 %.
    Not all elements contained all three HMM features and were represented in the
    tree, thus clusters were manually refined based on the individual trees to maximise
    inclusion and to exclude elements with sizes >=2 SDs from the mean.

    Nucleotide sequences for all elements of each cluster were extracted from the
    assembly and aligned with MAFFT (v7.271)43. Neighbour-joining trees were built
    using MUSCLE and any sub-clusters identified using USEARCH were separated.
    Levitsky consensus sequences were calculated and annotated using an automated
    pipeline within UGENE (v1.24.2)44. Concatenated IN, PR, and RT sequences from the
    consensuses were added to the previous alignment and the tree inspected to ensure
    representation of all elements. Finally, this was confirmed by using the consensus
    sequences as a custom library to mask the assembly using RepeatMasker and by
    comparison to the initial predictions.

    To streamline further analysis, the LTRs were replaced with consensus sequences
    formed for each of the (sub-)clusters individually. To match the methods and
    nomenclature used in RepBase, the LTR and internal sequences were separated for
    the RepeatMasker library.

    S10.2 Gypsy LTR element annotation

    Consensus elements were annotated with the locations associated with Pfam models
    used in their discovery ('rve', 'RVP', and 'RVT_1') and additionally with 'RNase_H'
    (PF00075.19). tRNA models predicted by tRNAscan-SE (v1.4)45 were scanned against
    the consensus elements using blastn (v2.2.30+)46 and the best matching region
    annotated. HMMs compiled from the GyDB 2.0 collection47 were also tested,
    alongside the entire PfamA and PfamB HMM sets. CCHC-type Zinc finger
    annotations, known to be associated with retroviral gag nucleocapsid (NC) proteins,
    could be identified; however, further hits were of low confidence and were omitted.
    Additional putative domains were predicted using ScanProsite48 and CD-search49.

    S10.3 The Burro family

    We detected 14 classes of gypsy LTR elements in the Smed genome, all of which
    share the central protease (PR), reverse transcriptase (RT), ribonuclease H (RNAse)
    and integrase (INT) module of Ty3-gypsy-like retroelements. However, a
    phylogenetic analysis of the conserved PR/RT/Int/RNase module could not assign
    any of the 14 classes to known retroelement classes. The most striking structural
    feature of the Smed retroelements is the massive sequence expansion of groups 4
    (Burro-1), 5 (Burro-2) and 13 (Burro-3), which results in LTR elements of 25 kbp
    flanked by 5 kbp LTRs (compared to the 5-7 kbp of typical vertebrate LTR elements).
    The Ogre elements found in plants so far provide the only example of similarly-sized
    LTRs, but priming by tRNA-Pro instead of tRNA-Arg in OGREs and the lack of an open
    reading frame upstream of the likely nucleocapsid protein (zinc knuckle domains)
    indicate that the Smed elements are likely not related to OGRE elements50. Further,
    the presence of putative nuclease and BIR-domains that are not part of the typical
    domain complement of Metaviridae indicates that the giant Smed gypsy LTR
    elements likely represent novel branches of the Metaviridae, the unusual complexity
    of which harbours the potential for complex life cycles or interactions with the host
    transposon defence machinery.

    S10.4 LTR element expression analysis

    LTR element expression was estimated by TETranscripts51. Briefly, TETranscripts
    estimates the expression of both transposable elements (TEs) and gene transcripts,
    taking as inputs TE and gene annotations and alignment files (i.e. BAM files). RNA-Seq
    reads (SRR5408394) were mapped to the genome using STAR (v2.5.2a_modified)32
    with --winAnchorMultimapNmax 100, --outFilterMultimapNmax 100.

    A custom GTF containing coordinates of Smed_v6 transcripts that uniquely mapped
    to the genome by GMAP15 (--chimera-margin=200, --min-trimmed-coverage=0.6 and
    --min-identity=0.6), in addition to the genome, was used as input to build the STAR
    index. After the read alignment, a table with raw counts was generated using
    TETranscripts (default parameters).

    S10.5 LTR element divergence analysis

    Kimura substitution levels of masked regions against their corresponding model are
    automatically calculated by RepeatMasker and output by default. These were
    compiled for each (sub-)cluster and plotted as a means of assessing their
    divergence. LTR element age was calculated using mutation rates estimated for C.
    elegans52 (per generation: 2.70E-09 / per year: 2.46375E-07). Assuming C. elegans
    mutation rates, the LTR element invasions were calculated to be ~10 to 60 million
    years old (Supplementary Table 1).
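
    The two mutation rates quoted above are related by a fixed number of generations
    per year. The short calculation below recovers that factor by division; the
    resulting value and the implied generation time are inferred here from the two
    rates and are not stated in the text.

```python
# Mutation rates as quoted in S10.5.
rate_per_generation = 2.70e-9
rate_per_year = 2.46375e-7

# Inferred conversion factor (an inference, not a value given in the text).
generations_per_year = rate_per_year / rate_per_generation
days_per_generation = 365 / generations_per_year

print(round(generations_per_year, 2), round(days_per_generation, 2))
```

    The per-year rate therefore corresponds to roughly 91 generations per year, i.e.
    a generation time of about four days.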

    S10.6 Solo/paired LTR element abundance analysis

    We have estimated the proportion of solo LTRs and complete or nearly complete
    LTR/Gypsy elements. Solo LTRs were defined as events that lack any trace of internal
    region and have a length of at least 60 % of the specific consensus sequence. Complete
    or nearly complete repeat elements were defined as events that have LTRs flanking an
    internal region and an overall length (LTRs plus internal region) of at least 60 % of
    the consensus sequence.
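
    These two definitions translate into a small classification rule. The following
    sketch is one possible reading of them; the function and argument names are
    hypothetical, and the caller is assumed to pass the LTR consensus length for
    candidate solo LTRs and the full-element consensus length otherwise.

```python
def classify_ltr_event(internal_bp, n_flanking_ltrs, event_length_bp,
                       consensus_length_bp, min_fraction=0.6):
    """Classify a masked event as 'solo', 'complete' or 'other'.

    internal_bp: bases of internal region annotated within the event.
    n_flanking_ltrs: number of LTRs flanking the internal region (0-2).
    """
    long_enough = event_length_bp >= min_fraction * consensus_length_bp
    if internal_bp == 0 and long_enough:
        return "solo"
    if internal_bp > 0 and n_flanking_ltrs == 2 and long_enough:
        return "complete"
    return "other"


print(classify_ltr_event(0, 0, 3_500, 5_000))
print(classify_ltr_event(15_000, 2, 20_000, 25_000))
```

    Events that meet neither definition, such as short LTR fragments, fall into the
    "other" class and do not enter the solo/complete proportion.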


    S10.7 Scaffold end analysis

    To examine the effect of different repeat classes on the existing assembly, their
    occurrence-frequency at scaffold ends was compared to the expected chance
    occurrence of the class. For each scaffold in the assembly, the terminal 1 kbp
    ends were examined for overlap with repeat annotations using BEDTools18 and
    scored as repetitive if more than 600 bp were covered by repeat annotations. For
    calculating the sum of repetitive sequence length, only the actual bases
    overlapping the repeat annotation were counted. Chance occurrence was
    determined by examining a set of 1,000 random genomic regions (1 kbp each)
    for repeat overlap annotation as above. Scaffold termini were excluded from the
    random genomic regions set.
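
    The scoring of a scaffold end can be illustrated without BEDTools by clipping the
    repeat intervals to the terminal window and counting each covered base once. This
    is a sketch under the assumption of 0-based, half-open coordinates; the function
    names are hypothetical.

```python
def covered_bases(window_start, window_end, repeats):
    """Bases of [window_start, window_end) covered by at least one repeat."""
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in repeats
        if s < window_end and e > window_start
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)  # merge overlapping annotations
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def scaffold_end_is_repetitive(scaffold_length, repeats, end="left",
                               window_bp=1000, min_covered_bp=600):
    """Score one terminal window as repetitive if >600 bp are covered."""
    if end == "left":
        start, stop = 0, min(window_bp, scaffold_length)
    else:
        start, stop = max(0, scaffold_length - window_bp), scaffold_length
    return covered_bases(start, stop, repeats) > min_covered_bp


print(scaffold_end_is_repetitive(5000, [(0, 400), (300, 700)], end="left"))
print(scaffold_end_is_repetitive(5000, [(4200, 5000)], end="right"))
```

    The same two functions can be applied to the 1,000 random 1 kbp regions to obtain
    the chance expectation against which the scaffold ends are compared.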

    S11: AAT-Repeat Analysis

    S11.1 AAT-microsatellite prediction and read alignment break analysis

    The analysis of microsatellites was based on a tandem repeat finder annotation of
    the dd_Smed_g4 assembly. All variations of AAT* patterns were converted into a
    MARVEL interval track and incorporated into a MARVEL dd_Smed_g4 contig database.
    Patched P6/C4 reads (coverage 20x) were mapped with DALIGNER on the
    dd_Smed_g4 assembly. For each of the AAT* regions, the alignments of P6/C4
    reads were analyzed and the fraction breaking within the AAT* region versus all
    reads spanning 500 bases left and right of the AAT* region was determined. Breaking
    and spanning reads were binned (bin size: 180 bp) according to length of the AAT*
    repeat within the interval [100, 1000]. Bin 1: 100-280 bp, avg 166 bp; bin 2: 280-460
    bp, avg 351 bp; bin 3: 460-640 bp, avg 536 bp; bin 4: 640-820 bp, avg 719 bp; bin 5:
    820-1000 bp, avg 900 bp.
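
    The binning scheme above (five bins of 180 bp between 100 and 1,000 bp) can be
    expressed as an integer division. The sketch below assumes that shared bin edges
    such as 280 bp belong to the upper bin and that the last bin includes 1,000 bp;
    the text does not say how edge values were assigned.

```python
def aat_length_bin(repeat_length_bp, lower=100, upper=1000, bin_size=180):
    """Return the 1-based bin index for an AAT* repeat length, or None.

    Lengths outside [lower, upper] are not binned.
    """
    if not lower <= repeat_length_bp <= upper:
        return None
    n_bins = (upper - lower) // bin_size
    index = (repeat_length_bp - lower) // bin_size + 1
    return min(index, n_bins)  # place the upper edge in the last bin


for length in (100, 279, 280, 536, 1000, 1200):
    print(length, aat_length_bin(length))
```

    Per bin, the break fraction is then the number of breaking reads divided by the
    number of breaking plus spanning reads.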

    S11.2 Microsatellite repeat length estimation by Circular Consensus
    Sequencing (CCS)

    In order to determine the origin of the length variations causing the read alignment
    breaks over AT-regions, circular consensus sequencing (CCS)53 reads with a minimal
    genome coverage (<1x) were generated. For the 2 kbp CCS library, 2 µg of AMPure XP
    bead purified high molecular weight genomic DNA was sheared to 2 kbp fragments
    using clear miniTUBEs on the Covaris S2 instrument according to the manufacturer's
    conditions. The majority of fragments were about 3.9 kbp in size. Of this, 750 ng of
    sheared genomic DNA went into PacBio SMRTbell library preparation as described in
    “Procedure and Checklist – 2 kbp template preparation and sequencing” with the
    exception that the blunt sequencing adapter was used in a final concentration of 5
    µM. Primer dimers and short fragments were removed after library preparation by
    2x 0.6X AMPure XP bead washing steps. SMRTbell libraries were loaded onto PacBio
    RSII SMRT cells after primer annealing and polymerase binding, applying the
    MagBead loading protocol. Sequencing was done on the RSII instrument with P6
    polymerase and C4 sequencing chemistry and 360 min movies to generate
    polymerase reads that lead to circular consensus sequences (CCS). The sequences
    were submitted to the SRA under accession SRR5408393.

    S11.3 Overlap length variation in CCS and P6/C4 reads

    CCS raw reads and patched P6/C4 reads were both mapped to the dd_Smed_g4
    assembly with DALIGNER. The CCS consensus reads were generated by pbccs with
    the use of the default parameters and were only used to detect unique mapping
    locations for CCS reads, while ambiguously mapped P6/C4 reads were excluded from
    the analysis.

    Fromthis,AATrepeatswerepredictedasfollows:

    1. AAT-regionswere selected,by choosing regions that canbe spanned (~100

    bp anchor right and left of AAT region) by uniquely mapping (see S11.3)

    circularconsensussequences(CCS).Thisresulted in1,141uniqueregions in

    thedd_Smed_g4assembly,withlengthbinsrangingfrom300–1000bp.

    2. To eliminate sequencing coverage differences between different regions of

    thegenome,5randomCCSand5randomP6/C4readswereselectedforeach

    ofthose1,141dd_Smed_g4regions.

    3. BoxplotsforthefractionofCCSoverlaplengthreadsvsoverlaplengthinCCS

    consensusreadandfractionofP6/C4overlaplengthvsdd_Smed_g4overlap

    lengthwerecreated.

    WWW.NATURE.COM/NATURE | 25

    SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473

    4. The plot includes only length bins in the range of 400–800, because
       a. There are only a small number of events in bins of 900–1,000.
       b. For bin 300 the difference between unique and AAT regions is very
          small, as AAT regions can have up to 200 anchor bases.
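    The equal-depth subsampling in step 2 can be sketched as follows. This is a
    minimal illustration, not the authors' code: the function name, the input
    layout (a mapping from region ID to a list of read IDs) and the choice to drop
    regions with fewer than n reads are all assumptions.

```python
import random

def subsample_reads(reads_by_region, n=5, seed=0):
    """Pick n random reads per region so every region contributes equally.

    Assumption: regions with fewer than n reads are dropped (the source
    does not say how such regions were handled).
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    sampled = {}
    for region, reads in reads_by_region.items():
        if len(reads) >= n:
            sampled[region] = rng.sample(reads, n)
    return sampled
```

    The same function would be called once for the CCS reads and once for the
    P6/C4 reads of each region.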

    The variance between the CCS reads and unique segments of the genome, compared
    to the variance between standard P6/C4 reads and unique segments of the genome,
    was below the error rate of standard P6/C4 PacBio reads. The same was the case
    for the variance between the CCS reads and AAT* regions compared to the variance
    between standard P6/C4 reads and AAT* regions. However, if the AAT* loci within the
    dd_Smed_g4 assembly did vary, this variance would be expected to fall outside
    the expected error rate of PacBio reads. Single regions were found that did have a
    high variance. However, at this point we cannot exclude that these were due to
    haplotype differences. Therefore, a biological source for the variance within
    AAT* loci could not be found, and it can only be concluded that the variance is of
    technical origin, resulting from pairwise alignment of repeat regions.

    S11.4 Genomic context of AT-rich regions

    To characterise the genomic location of AT-rich microsatellite repeats, genomic
    features were defined on the basis of the mapping results of the Smes_v1
    transcriptome¹³ against the dd_Smed_g4 assembly. Then, locations of the previously
    defined AT-rich microsatellite repeats were extracted using IntersectBed¹⁸. An
    overlap of at least 60 % of the query (>100 bp events) to a given genomic feature
    was required to be scored as either intergenic, intronic or exonic.
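    The 60 % overlap rule can be illustrated with a short Python sketch. It is not
    the IntersectBed invocation used in the study. The function names, the
    half-open interval convention and the "first matching feature wins" order are
    assumptions made for illustration.

```python
def overlap_fraction(query, feature):
    """Fraction of the query interval (half-open, start < end) covered by feature."""
    q_start, q_end = query
    f_start, f_end = feature
    overlap = max(0, min(q_end, f_end) - max(q_start, f_start))
    return overlap / (q_end - q_start)

def classify_repeat(query, features, min_fraction=0.6, min_length=100):
    """Return the label of the first feature covering >= min_fraction of the query.

    Queries of min_length bp or shorter are skipped (the text considers
    only >100 bp events); None means no feature class was assigned.
    """
    if query[1] - query[0] <= min_length:
        return None
    for label, interval in features:
        if overlap_fraction(query, interval) >= min_fraction:
            return label
    return None
```

    For example, a 200 bp repeat at (0, 200) tested against an intron at (0, 150)
    overlaps by 75 % and would be scored as intronic.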

    S12: Assembly Heterozygosity

    S12.1 Drosophila assembly

    For the comparison between the dd_Smed_g4 assembly and a D. melanogaster
    assembly, we used 50x coverage of the publicly available SRA dataset (SRX4999318)
    with a read length cut-off of 14 kbp. The resulting standard MARVEL assembly had a
    contig N50 of 20 Mbp and the longest contig was 28 Mbp. The assembly displayed
    almost no alternative paths (bubbles) in the overlap graph, indicating a high level of
    homozygosity within the dataset. A representative 1.7 Mbp portion of the longest
    contig was used to illustrate the differences between D. melanogaster and Smed in
    Fig. 2e.
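    For readers unfamiliar with the contig N50 statistic quoted above, a minimal
    sketch of the standard definition follows. The function name and the example
    lengths in the usage note are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")
```

    With contig lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs sum to
    11, which exceeds half of the total, so the N50 is 5.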

    S12.2 Dot plot analysis

    The analysis and corresponding dot plot in Fig. 2f were generated with MUMmer
    (v3.2.3)⁵⁴.

    S12.3 Coverage analysis alt vs. main contigs

    To quantify genome-wide coverage differences between alternative and main
    contigs, we only considered indels as regions of unambiguous difference between alt
    and main contigs. For each indel region, the fraction of breaking read alignments
    versus the coverage of spanning reads was quantified. Only positions of non-AT-rich
    indels of length > 99 bases were considered. The coverage of breaking and spanning
    alignments was limited to [5, 25] and indel lengths had to be consistent at each
    position. The analysis was based on MARVEL assembled contigs (no correction, no
    scaffolding), and on the trimmed P6/C4 reads. 1,509 alternative contigs (bubbles)
    that mapped to 1,495 unique locations on 695 contigs in dd_Smed_g4 were
    considered (14 alternative regions (~1 %) represented non-uniform structures such
    as stacked alternative contigs). Alternative contigs had an average of 13 break events
    (including AAT* patterns and inserts of any size) with a median spacing of 6,676 bp.
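    The indel selection criteria above could be expressed as a simple filter. The
    following is a hedged sketch only. The function names are hypothetical. It
    assumes the [5, 25] coverage window applies to breaking and spanning
    alignments separately, and it assumes the quantified fraction is
    breaking / (breaking + spanning). The text does not state either point
    explicitly.

```python
def keep_indel(length, breaking, spanning, at_rich, lo=5, hi=25):
    """True if an indel position passes the filters described in the text.

    Assumption: the [lo, hi] coverage window applies to the breaking and
    the spanning alignment counts individually.
    """
    return (
        not at_rich
        and length > 99
        and lo <= breaking <= hi
        and lo <= spanning <= hi
    )

def breaking_fraction(breaking, spanning):
    """Share of alignments that break at the indel (assumed definition)."""
    return breaking / (breaking + spanning)
```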

    S13: Gene Annotation

    S13.1 Gene annotation and gene number estimate

    The transcriptome of the sexual strain (Smes_v1, PCFL) was mapped to the
    dd_Smed_g4 assembly using GMAP¹⁵ with a minimum query coverage of 0.6 and
    minimum identity of 0.6. Our transcriptome contains ~31,000 non-redundant and
    likely protein-coding transcripts. This number corresponds well to a previous
    estimate of gene numbers in the planarian genome⁵⁴. However, the actual number
    of Smed genes is likely substantially lower due to fragmented transcripts or
    transcribed repeat elements.

    S13.2 Gene structure

    Gene structure statistics were calculated for a set of high-confidence transcripts,
    identified on the basis of i) unique mapping location in the dd_Smed_g4 assembly and ii)
    having a homologue in at least 4 other species. In total, 12,413 transcripts fulfilled all
    criteria. The median / mean transcript size was 1,545 / 1,875 bp and mapped length
    1,534 / 1,866 bp, respectively. Median / mean gene size was 7,473 / 15,395 bp,
    intron number 4 / 5.76, and intron length 1,329 / 2,694 bp, respectively.

    S13.3 Orthology analysis

    To establish an orthology model of Smed genes, we performed a pair-wise BLAST of
    (a) 20 selected Ensembl reference proteomes and (b) the proteome of Smed. To
    define the latter, we used the Smes_v1.PCFL transcriptome of the sexual strain and
    extracted just the primary Open Reading Frame (ORF) (i.e. longest non-overlapping
    ORF) of each transcript. The BLAST search was performed with blastp using a BLOSUM45
    similarity matrix and an e-value cutoff of 1E-3. BLAST results were pre-filtered for
    subject and query coverage of at least 10 % and post-filtered for reciprocal best hits.
    Resulting data were clustered into orthology groups (i.e. graph components) using
    the igraph package in R⁵⁶. Orthology groups were assessed with respect to
    presence/absence of a given species and/or combination of species.
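    The reciprocal-best-hit filtering and graph-component clustering described
    above can be sketched in plain Python. The study used the igraph package in R,
    so this is a stand-in and not the original code. The "species|protein" ID
    format, the use of a score column to rank hits, and all function names are
    assumptions for illustration.

```python
from collections import defaultdict

def species_of(protein_id):
    """Hypothetical ID convention: 'species|protein'."""
    return protein_id.split("|", 1)[0]

def reciprocal_best_hits(hits):
    """hits: iterable of (query, subject, score). Returns a set of RBH pairs."""
    best = {}  # (query, subject species) -> (subject, score)
    for q, s, score in hits:
        if species_of(q) == species_of(s):
            continue  # only between-species comparisons are considered
        key = (q, species_of(s))
        if key not in best or score > best[key][1]:
            best[key] = (s, score)
    pairs = set()
    for (q, _), (s, _) in best.items():
        back = best.get((s, species_of(q)))
        if back is not None and back[0] == q:
            pairs.add(frozenset((q, s)))
    return pairs

def orthology_groups(pairs):
    """Connected components of the graph whose edges are the RBH pairs."""
    adjacency = defaultdict(set)
    for pair in pairs:
        a, b = tuple(pair)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for node in adjacency:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        groups.append(component)
    return groups
```

    The coverage pre-filter (at least 10 % of subject and query) would be applied
    to the hit list before calling `reciprocal_best_hits`.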

    S14: Divergence Analysis

    S14.1 Divergence analysis

    Orthologue groups of 52 highly conserved proteins across 22 species were generated
    using the OrthoMCL pipeline (for details see S16.1) (Supplementary Table 2). Briefly,
    orthogroups that contained all species and where only one or two species had more
    than 1 orthologue were selected. Ambiguous amino acid residues (letter “X”) were
    first removed from each of the sequences. Each orthologue set was then aligned
    using PRANK⁵⁷ (v10603, parameters: 10 iterations, +F) and trimmed using trimAl⁵⁸
    (v1.4rev15, parameters: gappyout). The orthologue sets were then concatenated to
    form a supermatrix using a Perl script. Maximum likelihood analysis was performed
    using RAxML (v8.2.9)⁵⁹ with the PROTCATLG model. The topology of non-
    Platyhelminthes species was constrained using a guide tree (based on the phylogeny
    by Cannon et al. 2016)⁶⁰.
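    Two of the bookkeeping steps above, removal of ambiguous residues and
    concatenation into a supermatrix, can be sketched in Python. The authors used
    a Perl script that is not shown in the text, so the code below is an
    illustrative stand-in with hypothetical function names. Padding a missing
    species with gaps is an assumption. The selected orthogroups contained all
    species, so that branch should not be needed in practice.

```python
def strip_ambiguous(sequence):
    """Remove ambiguous amino acid residues (letter X) before alignment."""
    return sequence.replace("X", "").replace("x", "")

def concatenate(alignments, species):
    """Join per-gene alignments into one supermatrix row per species.

    alignments: list of dicts mapping species -> aligned sequence, where all
    sequences within one alignment have equal length.
    """
    rows = {sp: [] for sp in species}
    for alignment in alignments:
        width = len(next(iter(alignment.values())))
        for sp in species:
            # Assumption: a species absent from a gene is padded with gaps.
            rows[sp].append(alignment.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in rows.items()}
```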

    S15: Synteny Analysis

    S15.1 Synteny comparisons

    We used the dd_Smed_g4 assembly as the reference genome and computed
    pairwise genome alignments to other Platyhelminthes (query) genomes using lastz⁶¹
    (v1.03.54; parameters K=2400 L=3000 Y=3400 H=2000, HoxD55 scoring matrix) and
    the chain/net pipeline⁶² (parameters chainMinScore 1000, chainLinearGap loose). In
    addition, we used highly-sensitive local alignments with lastz parameters K=1500
    L=2500 and W=5 to find co-linear alignments in the un-aligning regions that are
    spanned by local alignments (gaps in the chains)⁶³. The same procedure was applied
    to align the M. lignano genome⁶⁴ to other Platyhelminthes including Smed. For
    comparison, we also aligned the human genome against frog (xenTro7 assembly),
    zebrafish (danRer10) and lamprey (petMar2).

    S16: Planarian Specific Gene Sequences

    S16.1 Identification of planarian-specific genes

    To search for putative novel or planarian-specific genes in Smed, we made use of
    OrthoMCL, which groups putative orthologues and paralogues using a Markov Chain
    algorithm⁶⁵. Here, we applied OrthoMCL (v2.0.9) to the analysis of a set of 27 species
    (Supplementary Table 3) representing the taxonomic groups Annelida (n=1),
    Arthropoda (n=4), Chordata (n=3), Cnidaria (n=1), Ctenophora (n=1), Mollusca (n=2),
    Nematoda (n=2) and Platyhelminthes (n=13, which included the transcriptomes of
    n=6 planarian species). To generate the input similarity matrix for OrthoMCL, we
    carried out an all-against-all reciprocal blastp analysis amongst the predicted protein
    sets of the 27 species with -matrix “BLOSUM45” and -evalue 1e-5 parameters.
    OrthoMCL was run with a parameter of “-I 1.5”. Only OrthoMCL
    orthologue/paralogue groups comprised exclusively of flatworm entries were
    considered for further analysis.

    Next, the Smed sequences (derived from Smes_v1.PCFL ORFs) were extracted from
    flatworm-specific orthogroups and further sub-filtered according to the following
    criteria:

    1. ORF length equal to or greater than 160 amino acids.

    2. Similarity (blastp -evalue lower or equal to 1e-15) to at least one other
       distant planarian species (Dlac, Pten, Pnig or Ptor).

    3. No RefSeq BLAST hit (-evalue lower or equal to 1e-3) in non-Platyhelminthes
       species (RefSeq BLAST database snapshot from May 24th 2016).

    4. Mapping location in the dd_Smed_g4 assembly has low overlap with repeats
       (less than 60 % of ORF length).

    5. Mapping location in the dd_Smed_g4 assembly has no other transcripts
       mapped within 50 bp upstream and 150 bp downstream in order to exclude
       fragmented transcripts. Manual inspection confirmed that the latter two
       filtering criteria greatly reduced the presence of transcript fragments
       representing poorly conserved protein regions, yet the presence of some
       such sequences in the final dataset is possible.
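    The five criteria can be read as a single predicate over per-transcript
    summary values. The sketch below is illustrative only. The `Candidate` record
    and its field names are hypothetical, and it assumes the upstream steps
    (BLAST searches, repeat overlap, neighbour detection) have already been
    reduced to these values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    orf_length_aa: int
    best_distant_planarian_evalue: Optional[float]  # None = no hit
    best_outgroup_refseq_evalue: Optional[float]    # None = no hit
    repeat_overlap_fraction: float                  # fraction of ORF length
    has_neighbouring_transcript: bool               # within 50 bp up / 150 bp down

def passes_filters(c):
    """Apply criteria 1-5 from the list above to one candidate."""
    return (
        c.orf_length_aa >= 160                                   # criterion 1
        and c.best_distant_planarian_evalue is not None
        and c.best_distant_planarian_evalue <= 1e-15             # criterion 2
        and (c.best_outgroup_refseq_evalue is None
             or c.best_outgroup_refseq_evalue > 1e-3)            # criterion 3
        and c.repeat_overlap_fraction < 0.6                      # criterion 4
        and not c.has_neighbouring_transcript                    # criterion 5
    )
```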

    Altogether, 1,165 transcripts passed the filtering and we therefore refer to this data
    set as flatworm-specific genes (Supplementary Table 4). Given the focus on
    transcribed sequences and the additional application of ORF length criteria, these
    largely represent actively transcribed, protein-coding genes that have no detectable
    sequence similarity outside the phylum by standard BLAST criteria (step 3). Our
    analysis underestimates the content of “new” genes in the Smed genome, since
    recently repeat-derived sequences are excluded by filtering (step 4) and the
    requirement for similarity in distant planarian species (step 2) removes recently
    generated or rapidly evolving gene sequences. Given the significant sequence
    divergence between planarian species at the transcriptome level (not shown), it is in
    fact very likely that many such genes exist and that planarians therefore constitute
    an interesting model system for studying the evolution of new genes.


    S16.2 Domain predictions

    Domains of flatworm-specific genes were predicted using SMART⁶⁶ (“Normal mode”
    with the checked options “Outlier homologues and homologues of known structure”,
    “PFAM domains”, “signal peptides” and “internal repeats”) and InterProScan⁶⁷ with
    SUPERFAMILY and Pfam applications.

    S16.3 Expression analysis

    To check the expression of flatworm-specific genes in Smed, transcript expression
    data sets for embryogenesis (SRA accession No.: SRR3629913 - SRR3629944),
    regeneration (SRR2051616 - SRR2051633) and stem cells (SRR2051616 -
    SRR2051633) were fetched from the NCBI–SRA database.

    All data were mapped to the dd_Smed_g4 assembly using STAR³² (standard
    parameters) and a custom GTF file with genomic coordinates of mapped ORFs.
    Differential expression analysis was carried out using the DESeq2⁶⁸ package for
    datasets with replicates (stem cells and embryogenesis). A log fold change greater
    than 0.5 or lower than -0.5 was used as the cut-off for the regeneration dataset.

    S17: Gene Loss in Planarians

    S17.1 Lost genes identification and verification

    The detection of high confidence gene loss events in Smed was performed using an
    in-house custom pipeline. Briefly, we performed a pair-wise reciprocal BLAST search
    against (a) Ensembl reference proteomes (see Supplementary Table 3 for database
    versions) and (b) 7 proteomes predicted from assembled planarian transcriptomes
    (including both Smed_v6 and Smes_v1). The inclusion of other planarian
    transcriptomes provided an important guard against false positive loss events due to
    accidental absence from one transcriptome. Orthology group clusters lacking any
    planarian species were then further filtered and validated using the dd_Smed_g4
    genome assembly, as described in detail below:

    Step 1 – Pairwise reciprocal BLAST and coverage filtering

    For the pairwise reciprocal BLAST search, only the primary ORF (i.e. longest non-
    overlapping open reading frame) of the longest transcript of each Trinity graph
    component was used as input. BLAST searches were performed using blastp with a
    BLOSUM45 substitution matrix and an e-value cutoff of 1e-3.

    Step 2 - Orthology group construction

    BLAST results were pre-filtered for subject and query coverage of at least 10 % and
    post-filtered for reciprocal best hits. Resulting data were clustered into orthology
    groups (i.e. graph components) using the igraph package in R⁵⁶.

    Orthology groups were assessed with respect to species presence/absence, and
    groups containing at least 9 non-flatworm species, no Smed and no other planarian
    species were considered to be putatively lost in Smed.
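    The presence/absence rule for putative losses can be written as a small
    predicate. This is a sketch under stated assumptions. The function name and
    the species labels in the usage note are hypothetical, and the species sets
    would come from the tables referenced above.

```python
def putatively_lost(group_species, planarians, flatworms, min_outgroup=9):
    """True if an orthology group has >= min_outgroup non-flatworm species
    and contains no planarian species (Smed included in `planarians`)."""
    present = set(group_species)
    if present & set(planarians):
        return False
    return len(present - set(flatworms)) >= min_outgroup
```

    Here `flatworms` is expected to be a superset of `planarians`, so that
    non-planarian flatworms are neither counted towards the threshold nor
    treated as evidence of presence in planarians.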

    Step 3 - Genomic loss verification using exonerate

    To independently verify the loss of such genes in the Smed genome, all other protein
    members of planarian-deficient orthology groups were mapped onto the
    dd_Smed_g4 assembly using exonerate (--showvulgar no --showalignment no
    --model protein2genome -s 40 --showtargetgff yes --percent 2). Genes were
    considered as genuinely lost if no point in the genome was covered by overlapping
    exonerate alignments of more than 2 different non-planarian species. This threshold
    was necessary in order to filter out spurious alignments; residual 2-hit positions
    invariably had neither RNAseq coverage nor coding potential, as evidenced by a lack
    of blastx hits of those intervals against the NCBI nr database.
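    The "no point covered by more than 2 different species" criterion amounts to a
    depth calculation over distinct species. A minimal sweep-line sketch follows.
    It is not part of the original pipeline. The tuple layout, the half-open
    coordinates and the function names are assumptions, and parsing of the
    exonerate GFF output is omitted.

```python
from collections import Counter, defaultdict

def max_species_depth(alignments):
    """alignments: iterable of (contig, start, end, species), half-open.

    Returns the largest number of distinct species whose alignments
    overlap at any single genomic position.
    """
    events = defaultdict(list)
    for contig, start, end, species in alignments:
        events[contig].append((start, 1, species))
        events[contig].append((end, -1, species))
    deepest = 0
    for contig_events in events.values():
        active = Counter()
        # Ends sort before starts at the same coordinate (half-open intervals).
        for _, delta, species in sorted(contig_events, key=lambda e: (e[0], e[1])):
            active[species] += delta
            if active[species] == 0:
                del active[species]
            deepest = max(deepest, len(active))
    return deepest

def genuinely_lost(alignments, max_species=2):
    """Lost if no position is covered by more than max_species species."""
    return max_species_depth(alignments) <= max_species
```

    Several overlapping alignments from the same species count once, since only
    the number of distinct species at a position matters.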

    Overall, our pipeline bases the definition of gene loss on the absence of detectable
    sequence similarity both in 6 independently assembled planarian transcriptomes and
    in the Smed genome. Further, the limitation to orthogroups with >9 non-flatworm
    members generally restricts the analysis to genes that are highly conserved at the
    sequence level. Although false positives are consequently highly unlikely, we cannot
    rule out the presence of homologues that, only in planarians, have diverged beyond
    the point of detectable homology by conventional sequence comparisons.

    Step 4 - Validation of lost gene candidates

    The protein translations of putatively lost genes were additionally manually searched
    against Annelid, Molluscan and Platyhelminth transcriptomes using tblastn
    (parameters: -wordsize 2 -matrix BLOSUM45 -soft_masking TRUE -evalue 100). Hits
    were inspected manually to verify absence in the respective species (Supplementary
    Table 5). The fact that we did not recover the previously reported loss of the
    centrosome proteins⁶⁹ indicates that our focus on deeply conserved genes with a
    high degree of sequence conservation likely underestimates the extent of gene loss
    in Smed.

    S17.2 Essentiality assessment of lost genes

    The human and/or mouse homologues of genes lost in Smed were used to query the
    Database of Essential Genes (DEG v13.3)⁷⁰. Positive hits were annotated as
    “essential” in Supplementary Table 5.

    S17.3 Mad1/Mad2 multiple sequence alignments

    Multiple sequence alignments for Mad1 and Mad2 were generated using the
    Constraint-based Multiple Alignment Tool (COBALT)⁷¹ (standard parameters) and
    resulting alignments were rendered using the BoxShade website with standard
    parameters (http://www.ch.embnet.org/software/BOX_form.html). Furthermore,
    the COBALT alignments were imported into Geneious⁷² (v9.1.5 Build 2016-06-11
    21:21, http://www.geneious.com) and sequence similarity matrices between species
    were computed based on the BLOSUM62 similarity matrix (threshold 0). The similarity
    matrices were rendered using matrix2png⁷³.

    S18: Planarian SAC Function

    S18.1 Quantification of mitotic index in dissociated cell preparations

    RNAi and nocodazole treatment

    RNA interference was performed as previously described⁷⁴ with some minor
    modifications. In vitro synthesized dsRNA was diluted 1:20 in H2O and quantified
    against a column purified standard eGFP control dsRNA (1 µg/µl) on a 0.7 % TAE
    agarose gel using ImageStudio Lite (v5.2.5, LI-COR Biosciences). The quantified
    dsRNA was then added to liver paste to a final concentration of 1 µg/µl and frozen at
    -80 °C in small aliquots. The addition of red food dye as tracer was omitted, because
    it generated strong fluorescent background in the rhodamine channel (see “Imaging
    and data analysis”).

    For each experimental replicate, 20 animals (~0.5 mm, asexual CIW4 strain) were fed
    3x with dsRNA food every third day. On the eighth day, worms were split into 2
    batches of 10 animals and either transferred to dishes with 50 ng/ml nocodazole + 1
    % DMSO (v/v) in planarian water (+gentamycin) or 1 % DMSO (v/v) in planarian
    water (+gentamycin) as a vehicle control. The worms were treated for 24 hours and
    then macerated.

    Animal maceration, proteinase K treatment and formaldehyde fixation

    Animals were passively macerated in 4 ml of maceration solution (glycerol:glacial
    acetic acid:H2O in a ratio of 1:1:13 (w/v/v))⁷⁵ for 10 min on the bench at RT, followed
    by 10 min on a vertical rotator at RT until the tissue was completely dissociated. The
    lysate was filtered through CellTrics® 50 µm filters (Sysmex, Cat. No.: 04-0042-2317).
    Depending on the size of the animals, the cell suspension was diluted 1:4 (small, 7-8
    mm) to 1:10 (large, 16-18 mm) using maceration solution and 120 µl of the diluted
    cell suspension was transferred into each well of a black 96-well glass bottom plate
    (Greiner Bio-One, Cat. No.: 655090). For each condition, five or six separate wells
    were used as technical replicates. The cells were pelleted for 5 min at 2,000 x g at RT.
    Then 44 µl of 16 % formaldehyde (PFA) in 1X PBS were added and the cells were
    fixed for 20-25 min at RT and rinsed once with 200 µl 1X PBS, followed by a wash in
    1X PBS for 5 min. After fixation, the cells were treated with 2 µg/ml Proteinase K in
    1X PBS for 10 min at RT, rinsed once with 1X PBS and post-fixed with 4 % PFA in 1X
    PBS for 10 min at RT. Then, the cells were rinsed once in 1X PBS and washed in
    PBSTx0.1 (1X PBS; 0.1 % (v/v) Triton X-100) for 5 min.
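    As a worked check of the fixation step, the final formaldehyde concentration
    in the well can be estimated from the stated volumes. The calculation assumes
    that the full 120 µl of cell suspension was still in the well when the 44 µl
    of 16 % stock was added. The text does not state this explicitly, and the
    function name is illustrative.

```python
def final_concentration(stock_percent, stock_volume_ul, sample_volume_ul):
    """Concentration after adding stock to a sample (simple dilution)."""
    total_volume = stock_volume_ul + sample_volume_ul
    return stock_percent * stock_volume_ul / total_volume

# 44 µl of 16 % stock added to an assumed 120 µl of suspension
print(round(final_concentration(16, 44, 120), 2))  # 4.29
```

    Under that assumption the cells are fixed in roughly 4.3 % formaldehyde, close
    to the 4 % used for the post-fixation step.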

    Riboprobe hybridization, washing

    The riboprobe hybridization buffers were prepared according to Pearson et al.⁷⁶.
    The cells were incubated 5 min in 1:1 (v/v) PBSTx0.1:WashHyb and subsequently
    blocked in PreHyb for 45-60 min at 56 °C. PreHyb was removed and replaced with
    pre-warmed Hyb containing DIG-labeled riboprobe against smedwi-1 and hybridized
    over 16 h at 56 °C. Hyb was removed and the cells were briefly rinsed with pre-
    warmed WashHyb. The cells were then successively washed once 5 min with pre-
    warmed 1:1 SSC/WashHyb, followed by 2x 15 min 2X SSC + 0.1 % (v/v) Triton X-100,
    2x 15 min 0.2X SSC + 0.1 % (v/v) Triton X-100. Then, the cells were brought to RT and
    rinsed and washed once with PBSTx0.1 for 10 min. Endogenous peroxidase activity
    was inactivated with 100 mM sodium azide in PBSTx0.1 for 15 min at RT and rinsed
    once and washed twice with PBSTx0.1 for 10 min each.

    Tyramide signal amplification

    The cells were blocked 1 h with 5 % (v/v) horse serum + 0.5 % (v/v) Roche Western
    Blocking Reagent (RWBR; Roche, Cat No.: 11921673001) in PBSTx0.1 at RT and then
    incubated with anti-DIG-POD Fab fragments (Roche, Cat No.: 11207733910) in 1 %
    (v/v) horse serum / 0.1 % (v/v) RWBR in PBSTx0.1 for 2 h at RT. After washing 3x 20
    min with PBSTx0.1 the signal was developed with rhodamine-tyramide in TSA buffer
    (2 M NaCl, 0.1 M boric acid, pH 8.5) containing 0.006 % H2O2 for 10 min.

    Antibody staining

    Antibody staining was essentially performed as previously described with some
    modifications⁷⁷. After removing the TSA buffer and washing 3x for 10 min with
    PBSTx0.1, primary antibody (1:1,000 anti-Histone H3 antibody (rabbit, phospho S10 +
    T11) [E173]; Abcam, Cat. No.: ab32107) in 1 % goat serum + 0.1 % RWBR was added
    and incubated 16 h at 4 °C. After washing 4x 10 min with PBSTx0.1, the solution was
    replaced with secondary antibody in 1 % (v/v) goat serum + 0.1 % (v/v) RWBR
    (1:1,000 goat anti-rabbit IgG (H+L), Alexa Fluor 647; ThermoFisher Scientific, Cat.
    No.: A-21245) and DAPI (1 µg/ml) and incubated 2-3 h at RT in the dark. The
    secondary antibody was washed off for 3x 20 min with PBSTx0.1 and the plates were
    stored in 0.02 % (w/v) sodium azide in 1X PBS at 4 °C in the dark.

    Imaging and data analysis

    The plates were imaged on an Operetta High-Content Imaging System (PerkinElmer)
    using a 20x lens and the following excitation/emission filters: DAPI (Ex. 360-400 / Em.
    410-480 nm), Rhodamine (520-550 / 560-630 nm) and far red (620-640 / 650-760 nm).
    For each well, five separate locations were imaged. Image analysis was performed
    using the integrated Harmony High Content Imaging and Analysis Software and FIJI⁷⁸.

    Statistical analysis

    To compare group means (± S.D.) for the effect of RNAi of SAC
    components on the ability of the planarian neoblasts to elicit a SAC response upon
    nocodazole treatment, a two-way analysis of variance (ANOVA) was performed and,
    in cases of significance, followed by Dunnett’s test.

    Two-way ANOVA revealed a significant effect of SAC component RNAi on the
    activation of the SAC by nocodazole treatment (F (4, 24) = 124.7, P < 0.0001). For the
    tested RNAi targets, rod, zwilch and zw10 yielded significant reductions of anti-H3ser10P-
    positive cells (M-phase marker) upon nocodazole treatment (P < 0.0001), while
    bub3 did not differ from control RNAi (egfp) (P = 0.78). RNAi itself did not
    significantly alter the baseline number of cells in M-phase in the DMSO controls.

    All statistical analyses were performed using GraphPad Prism version 7.0c for Mac
    OS X, GraphPad Software, La Jolla California USA, www.graphpad.com.

    S19: Miscellaneous Experimental Methods

    S19.1 Animal husbandry

    Sexual (S2) and asexual (CIW4) strains of Smed were kept in plastic containers in 1X
    Montjuïc salt water with 25 mg/L gentamycin sulfate. The animals were fed
    homogenized organic calf liver paste.

    S19.2 Transcript cloning

    Target cDNA fragments were PCR-amplified from Smed cDNA using KAPA HiFi
    HotStart DNA Polymerase (Kapa Biosystems), TAE agarose gel purified and cloned
    into pPR-T4P⁷⁹. The cloned insert was used for preparation of riboprobe templates
    for in situ hybridization and dsRNA production for RNA interference. The PlanMine
    transcript IDs and gene specific forward and reverse primers used for cloning are
    provided in Supplementary Table 7.

    S19.3 Whole mount in situ hybridization

    Whole mount in situ hybridization (WISH) was essentially performed as previously
    described⁷⁶,⁸⁰.

    S19.4 Karyotyping

    For karyotyping, animals (~5 mm) of sexual Smed were fed three times every third
    day using calf liver paste. The third day after the last feed, the planarian water was
    supplemented with 50 ng/ml nocodazole (Sigma Cat # M1404) and 1 % (v/v) DMSO
    (Sigma Cat # 276855) for a final 24 h incubation period. Subsequently, 15 animals
    were macerated in 5 ml 6.6 % (v/v) glacial acetic acid supplemented with 5 µg/ml
    Hoechst 34580 (ThermoFisher Scientific Cat # H21486). Single cell dispersion was
    achieved by carefully rotating the maceration solution for the last 10 min of the 20
    min total incubation period. Subsequently, 50 µl of the cell suspension was pipetted
    as a single drop on a 0.17 mm thick / No. 1.5 borosilicate coverslip (Menzel). To assist
    chromosome spreading, 100 µl of 0.05 % (v/v) Triton X-100 solution was added drop-
    wise to the slide. Chromosome spreads were aged at 37 °C for 60 min until the liquid
    completely evaporated. Coverslips were mounted in 80 % (v/v) glycerol prior to
    imaging. Chromosomes were imaged on an Andor Revolution WD Borealis confocal
    spinning disc system: The Olympus IX83 stand was equipped with an Andor iXon
    Ultra 888 EMCCD camera and an Olympus 150x U Apochromat NA 1.45 Oil
    immersion objective. Hoechst 34580 stained chromosomes were excited with a 405
    nm laser diode and a 452/45 bandpass filter was used to detect the emission light.
    Image stacks were 3D deconvolved using ImageJ/Fiji plugins⁷⁸,⁸¹ and finally maximum
    projected for better visualization.

    Supplementary Tables

    Supplementary Table 1 - Calculated Kimura distances and age for Smed LTR
    elements

    Supplementary Table 2 - MARVEL computational requirements & assembly stats

    Supplementary Table 3 - Conserved genes used for divergence analysis

    Supplementary Table 4 - Species and database versions used for orthology analysis

    Supplementary Table 5 - Planarian specific genes

    Supplementary Table 6 - Lost genes & essentiality analysis

    Supplementary Table 7 - PlanMine identifiers for transcripts and primers used for
    cloning in this study

    References

    1. Garcia-Fernàndez, J., Baguñà, J. & Saló, E. Genomic organization and
       expression of the planarian homeobox genes Dth-1 and Dth-2. Development
       118, 241–253 (1993).

    2. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules.
       Science 323, 133–8 (2009).

    3. Putnam, N. H. et al. Chromosome-scale shotgun assembly using an in vitro
       method for long-range linkage. Genome Res 26, 342–50 (2016).

    4. Sheffner, A. L. The reduction in vitro in viscosity of mucoprotein solutions by a
       new mucolytic agent, N-acetyl-L-cysteine. Ann N Y Acad Sci 106, 298–310
       (1963).

    5. Murray, M. G. & Thompson, W. F. Rapid isolation of high molecular weight
       plant DNA. Nucleic Acids Res 8, 4321–5 (1980).

    6. Adams, M. D. et al. The genome sequence of Drosophila melanogaster.
       Science 287, 2185–95 (2000).

    7. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-
       mer weighting and repeat separation. Genome Res 27, 722–736 (2017).

    8. Myers, G. Efficient Local Alignment Discovery amongst Noisy Long Reads.
       Algorithms bioinformatics WABI 2014 Lect notes Comput Sci 8701, 52–67
       (2014).

    9. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using
       basic local alignment with successive refinement (BLASR): application and
       theory. BMC Bioinformatics 13, 238 (2012).

    10. English, A. C. et al. Mind the gap: upgrading genomes with Pacific Biosciences
        RS long-read sequencing technology. PLoS One 7, e47768 (2012).

    11. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial
        variant detection and genome assembly improvement. PLoS One 9, e112963
        (2014).

    12. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat
        Methods 9, 357–9 (2012).

    13. Brandl, H. et al. PlanMine--a mineable resource of planarian biology and
        biodiversity. Nucleic Acids Res 44, D764-73 (2016).

    14. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data
        without a reference genome. Nat Biotechnol 29, 644–52 (2011).

    15. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment
        program for mRNA and EST sequences. Bioinformatics 21, 1859–75 (2005).

    16. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R package for
        comparing biological themes among gene clusters. OMICS 16, 284–7 (2012).

    17. Robb, S. M. C., Gotting, K., Ross, E. & Sánchez Alvarado, A. SmedGD 2.0: The
        Schmidtea mediterranea genome database. Genesis 53, 535–546 (2015).

    18. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing
        genomic features. Bioinformatics 26, 841–2 (2010).

    19. Kelley, D. R. & Salzberg, S. L. Detection and correction of false segmental
        duplications caused by genome mis-assembly. Genome Biol 11, R28 (2010).

    20. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput Appl
        Biosci 11, 615–619 (1995).

    21. Koutsovoulos, G. et al. No evidence for extensive horizontal gene transfer in
        the genome of the tardigrade Hypsibius dujardini. Proc Natl Acad Sci 113,