
Key words: ETR, Holothuria scabra, microsatellites, STR, transposable elements (TEs)

Preliminary Discovery of Repetitive Elements in the Genome of the Sea Cucumber Holothuria scabra Jaeger, 1833

Delbert Almerick T. Boncan1,2, Iris Diana C. Uy1, Crimson C. Tayco1, and Arturo O. Lluisma1,3*

*Corresponding author: [email protected]

Philippine Journal of Science 145 (4): 339-355, December 2016
ISSN 0031 - 7683
Date Received: ?? Feb 20??

1Marine Science Institute, College of Science, University of the Philippines Diliman, Quezon City 1101 Philippines

2National Institute of Molecular Biology and Biotechnology, College of Science, University of the Philippines Diliman, Quezon City 1101 Philippines

3Philippine Genome Center, University of the Philippines Diliman, Quezon City 1101 Philippines

Various classes of repetitive elements exist in the genomes of organisms. Characterizing these genomic elements is important not only because of the potential insights into the biology and evolution of their hosts' genomes but also because of the potential practical applications that such information might yield. So far, little is known about the types of repetitive elements in the genomes of holothurids. In this study, we generated a partial sequence of the genome of the sea cucumber Holothuria scabra and searched for tandem and interspersed repetitive elements using various approaches. We conducted the same search on another sea cucumber, Parastichopus parvimensis, using its publicly available genome sequence. The perfect microsatellite profiles of both sea cucumbers show similarities to some known patterns in eukaryotes. The combined perfect and imperfect microsatellite data sets also highlight fundamental microsatellite profile dissimilarities between the two holothurids. This study demonstrates that as much as half of the microsatellites in a holothurid genome remain unidentified in perfect repeat scans, and highlights the importance of imperfect repeat-inclusive searches. This study also demonstrates that partial genome sequencing may be used as a cheaper and more efficient alternative to the traditional methods of developing microsatellite markers for H. scabra. In addition, a combined approach of sequence similarity-based and de novo searches for interspersed repeats reveals diverse subclasses/superfamilies of transposable elements in the genomes of H. scabra and P. parvimensis. The two species exhibit similar patterns of repeat profiles notwithstanding the disparity in the number of predicted transposable elements. Notably, the major subclasses/superfamilies identified in the two genomes include DNA/hAT-Blackjack, DNA/hAT-Tip100, DNA/Maverick, RC/Helitron, LINE/L2, LTR/Gypsy, SINE/MIR and SINE/tRNA. The interspersed repeats identified in this study represent the first attempt to survey the transposable elements in the genomes of these two holothurids.

INTRODUCTION

It has long been known that repetitive elements can account for a sizeable fraction of many eukaryotic genomes (Britten and Kohne 1968). Depending on the species, this proportion can vary from a few percent (3% in Saccharomyces cerevisiae, Kim et al. 1998) to a large majority of the genome (e.g. > 80% in maize, Schnable et al. 2009). These repeats are classified as tandem or interspersed repeats based on their sequence characteristics and the mechanism of their generation and replication in the genome. Tandem repeats comprise microsatellites and minisatellites. Microsatellites can exhibit high levels of intraspecific polymorphism and have thus emerged as popular genetic markers for a wide range of applications in population genetics, conservation biology and evolutionary biology (Goldstein and Schlotterer 1999). Interspersed repeats, on the other hand, consist mainly of transposable elements (TEs), which were initially considered selfish or junk genetic elements. Studies have since shown that these elements have a significant role in the evolution of their hosts' genomes, contributing to the creation of novel genes and the modification of regulatory networks (Muotri et al. 2007), for example by providing alternative splice sites or polyadenylation signals, or by modifying gene expression by serving as promoter or enhancer elements (Krull et al. 2007; Lerat 2010). Identifying and characterizing these repetitive elements can thus provide insights into the structural and functional evolution of the host's genome. The information can also provide practical benefits, including facilitating the analysis of genome sequences (e.g., gene annotation and genome assembly) (Lerat 2010).

The abundance and types of repetitive elements have been well characterized in several species, particularly chordates (Kondo and Akasaka 2012). Echinoderms, however, have not been explored as thoroughly, and to date the sea urchin Strongylocentrotus purpuratus is the only fully sequenced echinoderm (Sea Urchin Sequencing Consortium 2006). Repeats in the Holothuria genome were first investigated by Sainz et al. (1992), who scanned randomly selected regions of the genome for known genomic repeats.

This study provides preliminary insights into the repeat composition of Holothuria genomes. Holothuroidea (holothurians) is a class of echinoderms with more than 1,250 species worldwide, inhabiting habitats that range from shallow offshore waters to abyssal depths, where they often make up the majority of the animal biomass (Kang et al. 2011). This report presents initial findings on repetitive elements (microsatellites and interspersed repeats, particularly transposable elements) in the partial genome sequence of the holothurian H. scabra.

MATERIALS AND METHODS

DNA Extraction

One H. scabra specimen, MS-050 (Figures 1 and 2; see Supplementary Material 1), was used as the source of genomic DNA for generating the partial genome sequence. The DNA was extracted from 50 mg of the specimen's body wall (homogenized in liquid nitrogen) using a commercial column-based DNA extraction kit (QIAGEN DNeasy Blood and Tissue Mini Kit), following the manufacturer's protocol for animal tissue.

Genome Sequencing

The H. scabra genome was sequenced by pyrosequencing (Margulies et al. 2005) on the Roche 454 GS Junior sequencing platform. All steps were performed following the manufacturer's protocol, with one minor modification: 1.5 copies per bead (cpb) were used in the emulsion PCR because the recommended copy-per-bead number was suboptimal for H. scabra. Two sequencing workflows were carried out, and the reads were merged into a single data set that was assembled using Newbler (Table 1). Newbler version 2.5 was used for generating

Figure 1. A. Holothuria scabra specimen sources. H. scabra specimens were collected from four sites in the Philippines (green circles; see S1). These sites represent northern (Luzon) and central (Visayas) Philippines, as well as both the western and eastern seaboards. B. Holothuria scabra specimen MS-050, the source of the partial genome assembly used in this study, dorsal (1) and ventral (2) sides (bar = 2 cm).

Lluisma AO et al.: Preliminary Discovery of Repetitive Elements in the Genome of Sea Cucumber

Philippine Journal of Science Vol. 145 No. 4, December 2016


Figure 2. The trend of descriptive STR metrics over STR period. The color scheme of lines is as follows: ETR (blue) = the iETR data set, or the data set with perfect or exact tandem repeats (ETRs) from MISA only; EAS (green) = the ETR+ATR (shorter) data set, or the data set with both predicted ETRs and ATRs, in which priority in STR typing was given to shorter periods; EAL (pink) = the ETR+ATR (longer) data set, or the data set with both predicted ETRs and ATRs, in which priority in STR typing was given to longer periods. Panels A & C plot STR frequency or locus density (#Loci/Mb genome) over STR period. Panels B, D, & E plot STR genome coverage (bp STR/Mb genome) over STR period. Panels A through D plot this study's STR scans (A & B for H. scabra, C & D for Parastichopus parvimensis), while panel E plots H. scabra perfect repeat (ETR) genome coverage from another study (Meglécz et al. 2012). The iETR data sets of both sea cucumbers follow a general decreasing trend over period or motif length. The H. scabra ETR genome coverage of this study (Panel B) and that of another study with a bigger genome assembly (Panel E) show similar patterns, although the predominance of dinucleotides is more pronounced in the latter. The dereplicated data sets, on the other hand, deviate from the decreasing trend and also differ between the two sea cucumber species. Notably, far lower STR counts and coverage are observed when imperfect or approximate repeats (ATRs) are excluded.

Table 1. Sequencing and Assembly Metrics.

Genome                            For microsatellite marker development   For genome-wide STR profiling
Expected sequencing coverage*     0.0350x                                 0.0350x
Run Metrics
  Total Number of Reads           248,644                                 164,120
  Total Number of Bases           91,013,835                              76,654,741
All Contig Metrics
  Number of Contigs               9,680                                   4,663
  Number of Bases                 3,851,152                               2,024,804
Large Contig Metrics
  Number of Contigs (% of all)    3,175 (32.80%)                          1,414 (30.29%)
  Number of Bases (% of all)      2,077,562 (53.95%)                      1,162,915 (57.43%)
  Average Contig Size             654                                     822
  N50 Contig Size                 586                                     815
  Largest Contig Size             6,137                                   4,824
  Number of >= Q40 Bases (%)      1,886,236 (90.79%)                      1,075,485 (92.48%)
% of Genome Assembled             0.1926%                                 0.1012%

*Coverage was calculated using an approximate genome size of 2,000 Mb, based on calculations from the C-value of a congener, Holothuria floridiana, as listed in the Animal Genome Size Database (http://www.genomesize.com/). Calculations were made with the assumption that 1 bp has an average molecular weight of 650 g/mol.
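The conversion described in the Table 1 footnote, and the "% of Genome Assembled" row, can be illustrated with a short Python sketch. This is not the authors' script; the function names are hypothetical, and the only inputs taken from the paper are the 650 g/mol assumption, the approximate 2,000 Mb genome size, and the contig base counts in Table 1. No C-value is filled in here because the footnote does not state one.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MOLAR_MASS = 650.0      # g/mol per base pair, as assumed in the Table 1 footnote


def c_value_to_bp(c_value_pg):
    """Convert a DNA mass in picograms to an approximate number of base pairs."""
    grams = c_value_pg * 1e-12
    return grams * AVOGADRO / BP_MOLAR_MASS


def percent_assembled(assembled_bases, genome_size_bp):
    """Percentage of the estimated genome represented by assembled contig bases."""
    return 100.0 * assembled_bases / genome_size_bp


GENOME_SIZE_BP = 2_000e6   # approximate genome size used in Table 1

# Contig base counts from Table 1 (unmodified and reduced assemblies).
print(round(percent_assembled(3_851_152, GENOME_SIZE_BP), 4))
print(round(percent_assembled(2_024_804, GENOME_SIZE_BP), 4))
```

Under the 650 g/mol assumption, one picogram of DNA corresponds to roughly 926 Mb, so the helper can be applied to any C-value taken from the Animal Genome Size Database.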


an assembly that was subsequently used for microsatellite marker prediction (henceforth, “unmodified” assembly). For genome-wide STR and transposable element (TE) profiling, a reduced assembly was used. This data set was generated by discarding, prior to de novo assembly, potential mitochondrial sequence reads identified by a BLASTx search (e-value threshold of 1E-5) against the GenBank non-redundant database. The remaining reads were assembled using Newbler version 2.7 (henceforth, “reduced” assembly). Both assemblies were run in complex genome mode, with all other parameters set to default. Metrics of the partial genome sequencing and assemblies are listed in Table 1. Data processing and analyses were carried out using custom-made Python scripts and R unless stated otherwise.

Microsatellite Search and Profiling

The two assemblies (the reduced assembly for profiling and the separate, unmodified assembly for marker development) were scanned for STRs using MISA (Thiel et al. 2003; Thiel 2004) and mreps (Kolpakov et al. 2003). The search parameters used are listed in Supplementary Material 2 and were determined by two different operational definitions of microsatellites, one for each search objective. The results of MISA and mreps were consolidated and dereplicated.

Microsatellite profiles were generated for each dataset. One dataset is from the independent exact tandem repeat (ETR) search (henceforth “iETR”), which was obtained from MISA only. The two other data sets were the product of dereplication of the combined results of MISA and mreps

Supplementary Material 1. Holothuria scabra specimen sources. The specimens are summarized according to source municipality. The sample ID nomenclature is as follows: 'MS-XX-YYY-##-###', where 'MS' pertains to the Microsatellites Project, 'XX' is a code for the species (HS for H. scabra), 'YYY' is a code for the site from which the specimen was collected, '##' is the year of collection, and '###' is the unique specimen code. In the text, the sample ID is usually shortened to just 'MS-###', since each specimen has a unique code. The specimens’ life stage and type of source are also indicated.

Group #   Code Prefix     Sample code       Source                  Notes
1         MS-HS-PNG-11-   001 to 008 (8)    Bolinao, Pangasinan     juveniles; hatchery
2         MS-HS-PNG-11-   049 to 052 (4)    Bolinao, Pangasinan     adults; hatchery
3         MS-HS-ZBL-12-   127 to 141 (15)   Masinloc, Zambales      adults; field
4         MS-HS-SAM-12-   142 to 151 (10)   Guiuan, Eastern Samar   adults; field
5         MS-HS-CAG-12-   152 to 161 (10)   Santa Ana, Cagayan      adults; field

Total = 47

Supplementary Material 2. STR search parameter settings. MISA and mreps were used to search for repeats for both marker development and microsatellite profiling, albeit with different parameter settings. These settings were determined by the operational definition of a microsatellite, which is different for each search objective.

Repeat-finding software / Parameter           Settings                                                 Note

For microsatellite marker development
MISA                                          Set definitions in misa.ini                              Combine all results
  Single search                               1-10 2-6 3-5 4-5 5-5 6-5 (default)
mreps                                         -res   -minperiod   -maxperiod   -minsize   -exp
  Set A (mononucleotide repeats)              0      0            1            default    default
  Set B (di- to trinucleotide repeats)        1      2            3            default    default
  Set C (tetranucleotide repeats)             1      4            4            default    default
  Set D (penta- to hexanucleotide repeats)    2      5            6            default    default

For genome-wide STR profiling
MISA                                          Set definitions in misa.ini                              Combine all results
  Single search                               1-10 2-5 3-4 4-3 5-3 6-3 7-3 8-3
mreps                                         -res   -minperiod   -maxperiod   -minsize   -exp
  Set A (mono- to octanucleotide repeats)     0      1            8            10         3
  Set B (tri- to octanucleotide repeats)      1      3            8            10         3
  Set C (hexa- to octanucleotide repeats)     2      6            8            10         3


searches for both ETRs and approximate tandem repeats or ATRs (henceforth “ETR+ATR”). One data set was generated by a dereplication process set to favor shorter over longer repeat periods when deciding the STR type of a complex overlap of motifs (henceforth “ETR+ATR (shorter)”), while the other came from a dereplication set to favor longer over shorter periods (henceforth “ETR+ATR (longer)”). Two metrics were calculated for each data set: frequency or locus density (the number of STR loci per megabase (Mb) of the genome) and coverage (the number of bases covered by STRs per megabase of the genome). Each STR's exponent (the number of repeats of the motif) and GC content were also determined. Because the nature of the dereplication process complicates the evaluation of exponent and GC content, these two metrics were evaluated only on the iETR data set.
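The two per-period metrics defined above (locus density and coverage) reduce to simple counts normalized by assembly size. The following Python sketch shows one way they might be computed; it is an illustration rather than the authors' script, and the function name, the input tuple layout, and the toy coordinates are assumptions made for the example.

```python
from collections import defaultdict


def str_profile(loci, assembly_bp):
    """Summarize STR loci by period.

    loci: iterable of (period, start, end) with 1-based inclusive coordinates
          (a hypothetical layout chosen for this sketch).
    assembly_bp: total number of bases in the scanned assembly.
    """
    megabases = assembly_bp / 1e6
    count = defaultdict(int)
    bases = defaultdict(int)
    for period, start, end in loci:
        count[period] += 1
        bases[period] += end - start + 1
    return {
        period: {
            "density": count[period] / megabases,        # loci per Mb
            "coverage": bases[period] / megabases,       # STR bases per Mb
            "mean_length": bases[period] / count[period],
        }
        for period in sorted(count)
    }


# Toy input: two dinucleotide loci and one pentanucleotide locus in a 2 Mb assembly.
toy_loci = [(2, 1, 20), (2, 101, 130), (5, 200, 214)]
for period, metrics in str_profile(toy_loci, 2_000_000).items():
    print(period, metrics)
```

The average STR length reported in Table 2 follows the same logic, since it is the ratio of covered bases to locus count for each period.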

The microsatellite profile of another holothurid was also evaluated and compared with that of H. scabra. The whole genome scaffold sequences of Parastichopus parvimensis (NCBI Project JXUT00000000.1) were downloaded (S1 summary), stripped of long Ns (>3 consecutive Ns), and scanned for microsatellites using the method described above.
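The handling of long N runs before scanning can be sketched in Python as follows. The text states only that runs of more than three consecutive Ns were stripped; whether the flanking sequences were joined or kept apart is not specified, so this sketch assumes they are kept as separate segments to avoid creating artificial junctions. The function name is hypothetical.

```python
import re

# Runs of 4 or more Ns, i.e. more than 3 consecutive Ns.
LONG_N_RUN = re.compile(r"N{4,}", re.IGNORECASE)


def split_at_long_n_runs(sequence):
    """Split a scaffold at long N runs and return the non-empty segments.

    Assumption for this sketch: flanking segments are kept separate rather
    than concatenated. Short N runs (3 or fewer) are left in place.
    """
    return [segment for segment in LONG_N_RUN.split(sequence) if segment]


print(split_at_long_n_runs("ACGTNNNNNACGNNNTT"))
```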

Search for Interspersed Repeats – Transposable Elements

Interspersed repeats in the draft genome (reduced assembly) of H. scabra were predicted using RepeatMasker version 4.0.6 (Smit et al. 2010) with Phylum Echinodermata as the reference database. In addition, a de novo repeat search was performed using RepeatModeler 1.0.8, which built a custom library with the de novo search tools RepeatScout (Price et al. 2005) and RECON (Bao and Eddy 2002). The independent results from RepeatMasker and RepeatModeler were consolidated and dereplicated by matching de novo repeat sequences with library-based predictions using reciprocal BLASTn, with thresholds of 1E-25 for e-value, 85% for identity and 80% for query coverage. Overlapping matches were resolved by selecting the match with the lowest e-value as the best hit. Transposable elements were likewise searched in the genome scaffold sequences of P. parvimensis (NCBI Project JXUT00000000.1) using the same approach employed for H. scabra. The repeat profiles of the two holothurids were analyzed from the consolidated and dereplicated data, taking note of the number and length of each repeat subclass/superfamily.
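The reciprocal BLASTn matching step can be illustrated with a minimal Python sketch that applies the stated thresholds (e-value 1E-25, 85% identity, 80% query coverage) and keeps the lowest e-value hit per query. It does not parse real BLAST output; the dictionary keys, sequence names, and toy hits are hypothetical, and the authors' actual script may differ.

```python
def passes(hit, max_evalue=1e-25, min_identity=85.0, min_query_cov=80.0):
    """Apply the e-value, identity and query-coverage thresholds to one hit."""
    query_cov = 100.0 * (hit["qend"] - hit["qstart"] + 1) / hit["qlen"]
    return (hit["evalue"] <= max_evalue
            and hit["pident"] >= min_identity
            and query_cov >= min_query_cov)


def best_hits(hits):
    """Map each query to the subject of its lowest e-value passing hit."""
    best = {}
    for hit in filter(passes, hits):
        current = best.get(hit["query"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["query"]] = hit
    return {query: hit["subject"] for query, hit in best.items()}


def reciprocal_pairs(forward_hits, reverse_hits):
    """Return (de novo, library) pairs that are each other's best hit."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)


# Toy hits with hypothetical names; fields mimic tabular BLAST columns plus query length.
forward = [
    {"query": "denovo1", "subject": "lib7", "pident": 92.0, "qstart": 1, "qend": 180, "qlen": 200, "evalue": 1e-60},
    {"query": "denovo1", "subject": "lib9", "pident": 88.0, "qstart": 1, "qend": 170, "qlen": 200, "evalue": 1e-40},
    {"query": "denovo2", "subject": "lib9", "pident": 80.0, "qstart": 1, "qend": 190, "qlen": 200, "evalue": 1e-50},
]
reverse = [
    {"query": "lib7", "subject": "denovo1", "pident": 92.0, "qstart": 1, "qend": 180, "qlen": 210, "evalue": 1e-60},
    {"query": "lib9", "subject": "denovo1", "pident": 88.0, "qstart": 1, "qend": 170, "qlen": 175, "evalue": 1e-40},
]
pairs = reciprocal_pairs(forward, reverse)
print(pairs)
```

Query coverage is computed here from the aligned query span and the query length, which is one common convention; other definitions would change which hits pass.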

Similarly, transposable elements from model organisms Caenorhabditis elegans, Strongylocentrotus purpuratus and Drosophila melanogaster were obtained from RepeatMasker/REPBASE Libraries (20150807). The repeat profiles of these model organisms were compared to those of H. scabra and P. parvimensis.

RESULTS AND DISCUSSION

Microsatellite Search

The two software tools used to survey STRs in the Holothuria partial genome employ different search algorithms. MISA (Thiel et al. 2003; Thiel 2004) performs an exhaustive search based on a specific set of construction rules (Sharma et al. 2007). Mreps, on the other hand, screens STRs in two phases: a tentative list of STRs is generated in the first phase, and statistical screening narrows down the list in the second (Kolpakov et al. 2003; Sharma et al. 2007). Each tool has its own strengths and weaknesses, and the two are unlikely to yield exactly the same set of results even with the same or equivalent parameter settings (Sharma et al. 2007). More specifically, the search strategy of MISA allows an exhaustive search and the identification of even out-of-phase compound repeats. However, it can only detect perfect or exact tandem repeats (ETRs), and is known to occasionally identify motifs unsatisfactorily (Sharma et al. 2007). Mreps, on the other hand, offers more control over how STRs are identified, and is able to detect ambiguous or approximate tandem repeats (ATRs). It is, however, unable to detect compound repeats. Using both MISA and mreps draws on the strengths of both tools.

Although useful, running two repeat-finding tools also generates two overlapping sets of STR predictions. For instance, a single STR may be detected both as an ETR by MISA and as a flank-extended ATR of higher exponent by mreps. In some cases, different motifs are assigned to the same STR. Dereplication was an attempt to correct these cases and similar potential sources of profile errors.
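To make the idea of dereplication concrete, the Python sketch below merges overlapping predictions into single loci and assigns each merged locus the shortest or the longest period among its members, mirroring the "shorter" and "longer" priorities described in the Methods. The exact rules used in the study are not given in the text, so this is only one plausible reading; the function name and input layout are hypothetical.

```python
def dereplicate(predictions, prefer="shorter"):
    """Merge overlapping STR predictions and pick one period per merged locus.

    predictions: iterable of (start, end, period), 1-based inclusive coordinates.
    prefer: "shorter" keeps the smallest period in an overlap, "longer" the largest.
    This is an assumed rule set, not the authors' documented procedure.
    """
    if prefer not in ("shorter", "longer"):
        raise ValueError("prefer must be 'shorter' or 'longer'")
    pick = min if prefer == "shorter" else max

    merged = []
    for start, end, period in sorted(predictions):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous locus: extend it and remember the period.
            last_start, last_end, periods = merged[-1]
            merged[-1] = (last_start, max(last_end, end), periods + [period])
        else:
            merged.append((start, end, [period]))
    return [(start, end, pick(periods)) for start, end, periods in merged]


# Toy input: a perfect dinucleotide repeat nested in a longer approximate
# tetranucleotide prediction, plus an unrelated pentanucleotide locus.
toy = [(10, 29, 2), (8, 35, 4), (100, 120, 5)]
print(dereplicate(toy, prefer="shorter"))
print(dereplicate(toy, prefer="longer"))
```

In this sketch the number of merged loci is the same under either priority; only the period assigned to each locus changes.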

Genome-wide STR Profiling

There is no consensus on the exact definition of a ‘microsatellite’. In the literature, microsatellites are typically defined as short tandem repeats (STRs) with motif lengths of 1 to 6 (Zane et al. 2002; Selkoe and Toonen 2006; Sharma et al. 2007; Tang et al. 2008) and an exponent of three (3) or more (Richard et al. 2008). The operational definition used here for the genome-wide STR survey (period <= 8, exponent >= 3) is based on the functional definition of microsatellites proposed by Richard et al. (2008).
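A perfect-repeat scan under an operational definition of this kind can be sketched in Python with regular expressions. The minimum repeat counts below are taken from the MISA profiling settings in Supplementary Material 2 (1-10 2-5 3-4 4-3 5-3 6-3 7-3 8-3); everything else, including the function names and the rule that skips motifs made of a shorter repeated unit, is an assumption for illustration and is not MISA's actual algorithm.

```python
import re

# Minimum number of repeats per period, from the MISA profiling settings.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k)
                   for k in range(1, n))


def find_perfect_strs(sequence, min_repeats=MIN_REPEATS):
    """Return (start, end, period, motif, exponent) for perfect tandem repeats.

    Coordinates are 1-based and inclusive. Each period is scanned
    independently, so loci of different periods may overlap.
    """
    sequence = sequence.upper()
    hits = []
    for period, minimum in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, minimum - 1))
        position = 0
        while position < len(sequence):
            match = pattern.match(sequence, position)
            if match and is_primitive(match.group(1)):
                length = match.end() - match.start()
                hits.append((match.start() + 1, match.end(), period,
                             match.group(1), length // period))
                position = match.end()
            else:
                position += 1
    return sorted(hits)


toy_sequence = "GG" + "AT" * 6 + "CCC" + "A" * 12 + "G" + "ACGTT" * 3
for locus in find_perfect_strs(toy_sequence):
    print(locus)
```

Because this scan reports only exact repeats, it corresponds to the ETR side of the analysis; approximate repeats would need a tool such as mreps.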

In this study, separate analyses were conducted on each of the three data sets (two dereplicated data sets and one independently searched perfect repeat data set). The dereplicated data sets were designed to be as inclusive of all microsatellites as possible, to gain the least biased insights into the microsatellite make-up of the H. scabra genome. The iETR data set, on the other hand, was intended to be a subset of all operationally accepted STRs that is comparable to the STR profiles of other studies. The analyses on this set were geared toward more robust comparisons with the repeat profiles of other organisms. This is because perfect repeats (or ETRs) are simpler than imperfect repeats (or ATRs) and may be defined by fewer variables. In theory, ETR searches can yield results with less artefactual variation than ATR-inclusive searches, given a single genomic data set. The use of independently searched ETRs is therefore advantageous in comparative microsatellite profile analysis, and performing ETR and ATR searches separately is recommended (Richard et al. 2008). Even so, the total ETR count can vary greatly between different ETR search algorithms (Mreps, Sputnik, TRF, RepeatMasker, and STAR) (Delgrange and Rivals 2004; Richard et al. 2008).

The results of the iETR search using MISA, and the dereplication results of the combined ETR+ATR searches using MISA and mreps, are tabulated in Table 2. STR locus densities are shown in Figure 2. H. scabra ETRs identified by MISA in the iETR search total 608 (independently counting each repeat that is classified as part of a compound repeat). The total is reduced to 433 when compound repeats and their component repeats are not counted. As expected, fewer ETRs are identified in the ETR+ATR searches than in the iETR search. After dereplication, only 390 of the STRs are identified as ETRs. This is because MISA detects only the ETR regions of STRs, some of which may in fact be ATRs when flanking regions are extended. This highlights the importance of dereplication. Taking only independently searched ETRs (iETRs), the STR locus frequency of H. scabra (301/Mb) (Table 2) is higher than the average for plants (281/Mb) (Morgante et al. 2002), budding yeast (145/Mb) (Malpertuy et al. 2003), and humans (87/Mb) (Lander et al. 2001). It is, however, far lower than that of another holothurid, P. parvimensis (811/Mb) (Table 2), and of teleostean fishes (more than 1000/Mb) (Richard et al. 2008).

Dinucleotides are the most frequent STR type in H. scabra's iETR data set (Figure 2A). This is consistent with a study that profiled the perfect repeats of 154 partial eukaryote genomes, of which 136 were shown to have dinucleotides as the most frequent STRs (Meglécz et al. 2012). However, pentanucleotide STRs far exceed all other STRs in frequency when imperfect repeats are taken into account (Figure 2A). The same is not true for P. parvimensis: by a wide margin, its most frequent STRs are mononucleotides in both the iETR and dereplicated data sets (Figure 2C). This is consistent with an earlier study which profiled ETRs using MISA in 26 complete eukaryotic genomes (Sharma et al. 2007). In H. scabra, there is a relative under-representation of trinucleotide STRs, but this is not evident in P. parvimensis.

Generally, the frequency of STRs varies inversely with motif length (Temnykh et al. 2001; Grover et al. 2007; Sharma et al. 2007). This can be observed as a general trend in the iETR data sets of both H. scabra and P. parvimensis (Figure 2A, C). In the dereplicated data sets of both sea cucumbers, however, it is evident only in STRs of higher periods (5 to 8). The consistency of the known inverse trend with that of the iETR data set, and its inconsistency with those of the dereplicated data sets, may reflect the fact that the majority of past STR profiling studies have focused only on perfect repeats. It has been hypothesized that longer motifs are less likely to form proto-microsatellites than shorter ones, and that this is the mechanism underlying the inverse relationship between STR frequency and motif length (Meglécz et al. 2012). Although the trends in the iETR and dereplicated data sets support this hypothesis at longer periods, they also suggest that other factors may be influencing microsatellite formation at shorter periods. This can be further evaluated by comparing ATR-inclusive microsatellite profiles of whole genomes from across the tree of life.

Several patterns are similar between STR coverage and locus density (Figure 2). Firstly, in both H. scabra and P. parvimensis, an overall decreasing trend in genome coverage with respect to motif length is still evident among iETRs. In the H. scabra iETR data set specifically, dinucleotide repeats are the most predominant STRs. The same microsatellite coverage pattern was observed in H. scabra in a separate partial genome ETR profiling (Meglécz et al. 2012) (Figure 2E). Notably, in all three data sets, the STRs with the highest genome coverage in H. scabra are dinucleotide STRs. Considering the abundance of pentanucleotide STRs in the dereplicated data sets, it appears that dinucleotide STRs are longer on average than pentanucleotide STRs. In fact, each dinucleotide STR locus is longer on average than almost any other STR type in any of the searches (Table 2; Figure 3A). Additionally, dinucleotide STRs in the iETR search have greater coverage than all other motifs combined. The same was observed in 124 of 154 eukaryote genomes, including H. scabra, in another STR profiling study of partial genomes (Meglécz et al. 2012).

In the dereplicated data sets, the genome coverage of STR periods 1 to 4 shows patterns that are dissimilar between the two sea cucumber species (Figure 2B, D). For instance, dinucleotide STRs are predominant in H. scabra, but are relatively under-represented in P. parvimensis, a rarity among eukaryotes. Mono- and trinucleotide STRs are relatively under-represented in H. scabra, but are the two most predominant STRs in P. parvimensis. In either species, the relative under-representation of an STR class is a prominent feature. Systematic algorithm bias

Table 2. A summary of the STR profile of the H. scabra genome. Raw count, frequency or locus density, total STR length, and genome coverage are shown by STR period for each of the data sets.

STR Period  H. scabra                              P. parvimensis
            iETR         ETR+ATR [s]  ETR+ATR [l]  iETR         ETR+ATR [s]  ETR+ATR [l]

Frequency or Locus Density (locus/Mbp)
1           64.85        65.34        62.87        379.08       378.86       359.99
2           100.98       99.99        95.04        143.84       140.01       136.03
3           30.20        56.93        51.98        97.49        160.66       152.87
4           59.90        84.15        79.20        132.71       163.68       162.11
5           32.67        139.60       138.61       39.25        167.08       167.86
6           10.40        74.75        81.18        13.83        93.32        112.70
7           1.49         31.19        37.13        2.48         46.21        52.06
8           0.50         23.27        29.21        2.75         23.31        29.51
Total       300.97       575.22       575.22       811.42       1173.13      1173.13

Coverage (bp/Mbp)
1           743.52       777.19       722.24       4582.86      4871.50      4353.76
2           2988.95      3151.81      2651.84      1762.47      1946.10      1760.39
3           475.22       1244.98      1004.40      1601.55      4229.12      3510.51
4           865.30       1709.31      1504.87      1966.48      3455.92      3293.50
5           519.77       2468.68      2447.39      656.51       3206.40      3210.54
6           258.40       1790.00      2218.20      298.05       2250.31      3246.53
7           31.19        788.08       1131.62      59.95        1151.83      1377.72
8           19.80        668.78       918.27       87.37        689.94       1048.15
Total       5902.15      12598.82     12598.82     11015.24     21801.11     21801.11

Average STR Length
1           11.47        11.89        11.49        12.09        12.86        12.09
2           29.60        31.52        27.90        12.25        13.90        12.94
3           15.74        21.87        19.32        16.43        26.32        22.96
4           14.45        20.31        19.00        14.82        21.11        20.32
5           15.91        17.68        17.66        16.73        19.19        19.13
6           24.86        23.95        27.32        21.56        24.11        28.81
7           21.00        25.27        30.48        24.18        24.93        26.46
8           40.00        28.74        31.44        31.72        29.60        35.52
Total       19.61        21.90        21.90        13.58        18.58        18.58

With Exp > 50
1           0            –            –            107          –            –
2           13           –            –            8            –            –
3           0            –            –            27           –            –
4           1            –            –            5            –            –
5           0            –            –            0            –            –
6           0            –            –            0            –            –
7           0            –            –            0            –            –
8           0            –            –            0            –            –
Total       14           –            –            147          –            –

GC Content (%)
1           29.43        –            –            61.65        –            –
2           43.51        –            –            26.04        –            –
3           22.92        –            –            26.00        –            –
4           27.57        –            –            31.34        –            –
5           30.38        –            –            35.10        –            –
6           36.59        –            –            43.92        –            –
7           52.38        –            –            41.98        –            –
8           37.50        –            –            42.43        –            –
Total       36.31        –            –            43.04        –            –

Lluisma AO et al.: Preliminary Discovery of Repetitive Elements in the Genome of Sea Cucumber

Philippine Journal of Science Vol. 145 No. 4, December 2016


is an unlikely cause of these STR depletions, as the same dereplication method was applied to both genomes and they exhibit different STR profile patterns. In H. scabra, the observed under-representation is unlikely to be an artefact of small sample size, because the perfect repeat patterns are consistent between this study (H. scabra iETR set) and another that sampled a larger portion of the genome (Meglécz et al. 2012). Therefore, the STR profiles of the two sea cucumbers likely reflect real patterns in their respective genomes. Although such a marked difference in a fundamental genome feature between closely related organisms has been reported before (Meglécz et al. 2012), this is the first such report between same-order species in the phylum Echinodermata. Microsatellites are known drivers of evolution; fundamental differences in STR profiles may therefore imply important differences in genome dynamics that drive evolution.

Significant discrepancies are evident between the iETR and the dereplicated data sets, whether comparing STR locus density or genome coverage. In both H. scabra and P. parvimensis, the coverage of each dereplicated data set is about twice that of the corresponding iETR data set (Table 2). This means that about half of all possible true STRs are not identified when imperfect repeats (ATRs) are excluded. Studies that compare perfect repeat profiles across different taxa may therefore be disadvantaged by the loss of potentially informative signals in imperfect repeats, or misled by patterns seen in perfect repeats alone. This study highlights the importance of complementing profiling studies with STR scans that are inclusive of imperfect repeats.
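The size of this gap can be checked directly from the Total coverage rows of Table 2. The following is a minimal Python sketch and not part of the original analysis; the dictionary layout and the function name `missed_fraction` are illustrative assumptions, and the numbers are copied from Table 2.

```python
# Total STR coverage (bp/Mbp) copied from Table 2.
# "ETR+ATR" is the dereplicated set (perfect plus imperfect repeats).
COVERAGE = {
    "H. scabra": {"iETR": 5902.15, "ETR+ATR": 12598.82},
    "P. parvimensis": {"iETR": 11015.24, "ETR+ATR": 21801.11},
}


def missed_fraction(perfect_only, perfect_plus_imperfect):
    """Fraction of STR coverage absent from a perfect-repeat-only scan."""
    return 1.0 - perfect_only / perfect_plus_imperfect


for species, cov in COVERAGE.items():
    ratio = cov["ETR+ATR"] / cov["iETR"]
    missed = missed_fraction(cov["iETR"], cov["ETR+ATR"])
    print(f"{species}: dereplicated/iETR = {ratio:.2f}, missed = {missed:.1%}")
```

Under these inputs the dereplicated coverage is roughly twice the iETR coverage in both species, which is the basis for the "about half" statement above.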

Figure 3. ETR length distribution across periods. Shown are distributions of length of independently-searched perfect repeats or exact tandem repeats (ETRs) across different periods, (A) in the partial H. scabra genome from this study and (B) in the P. parvimensis genome sequence downloaded from Genbank. The x-axis (length values) is in logarithmic scale.

Figure 4. ETR exponent distribution across periods. Shown are distributions of exponents of independently-searched perfect repeats or exact tandem repeats (ETRs) across different periods, (A) in the partial H. scabra genome from this study and (B) in the P. parvimensis genome sequence downloaded from Genbank. The x-axis (exponent values) is in logarithmic scale.

Microsatellites that exceed 50 repetitions are rare (Garza et al. 1995; Stefanini and Feldman 2000; Sibly et al. 2003; Whittaker et al. 2003; Sainudiin et al. 2004; Buschiazzo and Gemmell 2006; Ustinova et al. 2006); the occurrence of several expanded repeats in a genome is therefore exceptional. Nevertheless, repeat expansions have been observed in eukaryotic genomes (Estoup et al. 1993; Primmer et al. 1996; Pearson et al. 2005; Buschiazzo and Gemmell 2006; Clark et al. 2006) in both coding and non-coding regions (Richard et al. 2008). In H. scabra and P. parvimensis, several STRs with exponents greater than 50 were observed (Table 2; Figure 4), and they make up a higher proportion of the total STR count in H. scabra (2.3%) than in P. parvimensis (0.024%). The most expanded repeat in P. parvimensis is the perfect tetranucleotide STR (GATT)398. This unusually expanded repeat is expected to form large-scale secondary structures during replication, repair, or recombination, and may therefore affect the dynamics of neighboring elements, which makes it an interesting locus to study. The majority of expansions known in humans, most of which are implicated in diseases, are trinucleotide STRs (Richard et al. 2008; Zhao and Usdin 2015). Their relative abundance is due to the propensity of trinucleotide repeats to form secondary structures. Several expanded trinucleotide STRs were identified in P. parvimensis. Only one expanded trinucleotide STR was identified in H. scabra; it has an exponent of 52 and is about a hundred bases from a region highly similar to a reverse transcriptase.
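To make the notion of an "exponent greater than 50" concrete, the sketch below scans a sequence for perfect tandem repeats and reports those whose copy number exceeds a threshold. It is a simplified regular-expression illustration and not the STR search software used in this study; the function names, the default period range of 1 to 8, and the minimum exponent of 3 are assumptions chosen for the example.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. not 'AA')."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_perfect_strs(seq, max_period=8, min_exponent=3):
    """Return (start, motif, exponent) for each perfect tandem repeat found.

    For every period, a backreference pattern matches a unit followed by at
    least (min_exponent - 1) exact copies of itself.
    """
    hits = []
    for period in range(1, max_period + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_exponent - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // period))
    return hits


def expanded(hits, threshold=50):
    """Keep only repeats whose exponent (copy number) exceeds the threshold."""
    return [hit for hit in hits if hit[2] > threshold]


# Toy sequence: a (GATT)60 tract flanked by arbitrary bases.
toy = "CCG" + "GATT" * 60 + "ACGTAC"
print(expanded(find_perfect_strs(toy)))
```

The primitive-motif check prevents a mononucleotide run from being reported again as a dinucleotide repeat, and a tetranucleotide tract from being reported again as an octanucleotide one.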

The whole H. scabra assembly has a GC content of 39.31%, which is consistent with the 39.1% reported for the partial genome sequence of H. scabra by Meglécz et al. (2012). The overall GC content of STRs in the iETR data set is 36.31%, slightly lower than the GC content of the assembly (Table 2). This agrees with what has been repeatedly observed in most eukaryotes (Meglécz et al. 2012; Sharma et al. 2007). P. parvimensis, however, deviates from this general observation. The P. parvimensis assembly is 37.15% GC, less than that of H. scabra, yet the collective GC content of its repeats is 43.04% (Table 2). Broken down by period, the GC content of H. scabra repeats falls below or near the overall GC content of the assembly, except for dinucleotide and heptanucleotide ETRs (Figure 5). In P. parvimensis, STR classes with longer periods generally have higher combined GC content. The exception is the mononucleotide STR class, which has an extremely high GC content (61.65%). Interestingly, the GC repeat motif was not found in H. scabra, and only a few were identified in P. parvimensis (~0.08% of the total dinucleotide STR loci). This is consistent with the noted rarity of the GC motif among eukaryotes (Meglécz et al. 2012), potentially due to the propensity of CpG arrays outside of CpG islands to gain methylated cytosines that are easily converted to thymines (Pelizzola and Ecker 2011).
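As an illustration of how a combined GC content for a set of repeats can be obtained, the sketch below pools all repeat tracts and divides the number of G and C bases by the total length. Whether the values in Table 2 were computed by pooling in exactly this way is an assumption; the function names and the toy tracts are hypothetical.

```python
def gc_content(seq):
    """GC percentage of a single sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def pooled_gc(tracts):
    """GC percentage over all repeat tracts taken together (length-weighted)."""
    total_length = sum(len(tract) for tract in tracts)
    gc_bases = sum(tract.upper().count("G") + tract.upper().count("C") for tract in tracts)
    return 100.0 * gc_bases / total_length


# Toy repeat tracts: an AT-rich dinucleotide, a GATT tetranucleotide,
# and a C mononucleotide run.
tracts = ["AT" * 10, "GATT" * 5, "C" * 12]
print(f"{pooled_gc(tracts):.2f}% GC")
```

Pooling weights long tracts more heavily than short ones, so a few long G or C mononucleotide runs can raise the combined value for a whole STR class.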

Generation of Primers for Microsatellite Marker Development

The function-based definition of microsatellites, which was used in STR profiling, encompasses all STR types. However, this definition, while ideal for genome-wide surveys, is not suitable as an operational definition for identifying candidate microsatellite markers because not all types of STRs are suitable genetic markers. In the search for potential markers, mononucleotide repeats were omitted, and the search was limited to STRs with periods of two (2) to six (6). The upper bound of the period was chosen based on the most common operational definition of microsatellites.

Figure 5. ETR GC content distribution across periods. Shown are distributions of GC content of independently-searched perfect repeats or exact tandem repeats (ETRs) across different periods, (A) in the partial H. scabra genome from this study and (B) in the P. parvimensis genome sequence downloaded from Genbank.

Figure 6. Amplicon sizes of candidate microsatellite loci. Each candidate microsatellite locus of H. scabra is represented here as a bar, the height of which corresponds to the length of its amplified product. All products are between 100 and 400 bp, with an even and almost linear distribution across the size range. This is advantageous for selecting loci for testing and development in multiplex fragment analyses.

Supplementary Material 3. A summary of parameter settings used in running the Primer-BLAST software.

| Parameter | Min | Opt | Max |
| --- | --- | --- | --- |
| Product Size | 100 nt | -- | 400 nt |
| Primer Tm | 50°C | 55°C | 60°C |

Database: Non-redundant nucleotide database

*All other parameters are set at default values

Microsatellite allele lengths usually vary by quanta due to unit expansion/contraction, a proposed mechanism for which is described by the DNA replication slippage model (Levinson and Gutman 1987). The rapid and cheap method of microsatellite genotyping by locus size determination takes advantage of this property. However, the core assumption of the method, that each allele is distinct in size, is likely not met by many loci (Estoup et al. 2002). Some microsatellite alleles of different descent can have the same length, with only a portion containing sequence divergence that allows them to be detected. This phenomenon, known as size homoplasy, can lead to underestimation of true diversity and overestimation of gene flow when the mutation rate is high (Selkoe and Toonen 2006). To increase the chances of using microsatellite loci that follow the stepwise mutation model assumed by population studies, it is recommended to choose microsatellites with perfect repeat motifs (Estoup et al. 2001; Guichoux et al. 2011). For this reason, a filter for perfect repeats was applied in choosing candidate microsatellite markers.
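Size homoplasy can be illustrated with two toy alleles that yield amplicons of identical length but differ in sequence. The sketch below is hypothetical: the flanking sequences and alleles are invented for the example, and an interrupted repeat is used only as one simple way to obtain equal-length alleles with different sequences.

```python
# Invented flanking sequences standing in for the primer-binding regions.
FLANK_5 = "ACGTTGCA"
FLANK_3 = "TTGACCGA"

# Allele A carries a perfect (CA)10 tract; allele B carries an interrupted
# tract of the same total length, with one CA unit replaced by CG.
allele_a = FLANK_5 + "CA" * 10 + FLANK_3
allele_b = FLANK_5 + "CA" * 4 + "CG" + "CA" * 5 + FLANK_3


def is_size_homoplasy(a, b):
    """True if two alleles have the same length but different sequences."""
    return len(a) == len(b) and a != b


print(len(allele_a), len(allele_b))           # fragment sizes seen by size-based genotyping
print(is_size_homoplasy(allele_a, allele_b))  # sequence comparison tells them apart
```

Fragment-size genotyping would score these two alleles as one, whereas sequencing distinguishes them.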

Although STRs are mostly found in gene-poor chromosomal regions, some persist near genes (O’Dushlaine et al. 2005; Thomas 2005; Verstrepen et al. 2005; Legendre et al. 2007). STRs that are associated with genes may be suboptimal as markers because their variability may be influenced by selection pressures, which may lead to underestimation of diversity. To avoid this, BLASTx hits of contigs from the assembly were used to cross-check and filter out STR loci that are associated with genes. Among the STRs identified in H. scabra, none were observed to be in the same contig as a BLASTx hit.
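The three selection criteria described in this section (a period of two to six, perfect repeats only, and no BLASTx hit on the same contig) can be expressed as a single filter. This is a minimal sketch under stated assumptions: the `StrLocus` record, its field names, and the contig identifiers are hypothetical and do not reflect the data structures actually used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrLocus:
    contig: str     # hypothetical contig identifier
    motif: str      # repeat unit, e.g. "CA"
    exponent: int   # number of copies of the unit
    perfect: bool   # True for an exact tandem repeat


def candidate_markers(loci, contigs_with_blastx_hits, min_period=2, max_period=6):
    """Keep perfect STRs with period 2 to 6 that sit on contigs without BLASTx hits."""
    return [
        locus
        for locus in loci
        if locus.perfect
        and min_period <= len(locus.motif) <= max_period
        and locus.contig not in contigs_with_blastx_hits
    ]


loci = [
    StrLocus("contig_1", "A", 14, True),      # mononucleotide: excluded
    StrLocus("contig_2", "CA", 9, True),      # kept
    StrLocus("contig_3", "GATT", 8, False),   # imperfect: excluded
    StrLocus("contig_4", "CAA", 8, True),     # on a contig with a BLASTx hit: excluded
]
print(candidate_markers(loci, contigs_with_blastx_hits={"contig_4"}))
```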

PCR primer pairs were designed for the amplification of a total of 48 candidate microsatellite markers (Figure 6 and Supplementary Material 4). Evaluation by gel electrophoresis (AGE and PAGE) and fragment sequencing revealed potential high-resolution genetic markers for H. scabra; of the 11 loci tested, 5 were found to be polymorphic (data not shown), indicating that screening the other candidate markers (Supplementary Material 4) may also lead to the identification of polymorphic loci.

Transposable Elements

Currently, identified repeats for the Class Holothuroidea are scarce. While a number of transposable elements (TEs) have been identified from the genome of H. scabra using an Echinodermata repeat library as reference (Table 3), a greater number of TEs has been predicted using an ab initio approach. From the consolidated and dereplicated data sets (Table 4, Supplementary Material 6) from RepeatMasker and RepeatModeler, 3,223 transposable elements have been predicted

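Table 3. Independent count of transposable elements from the genomes of H. scabra and P. parvimensis. RS – Repeats searched by RepeatMasker; RM – Repeats predicted de novo by RepeatModeler. The shaded boxes indicate high abundance repeats (Top 3 or Top 10).

| Repeat class/superfamily | H. scabra RS | H. scabra RM | P. parvimensis RS | P. parvimensis RM |
| --- | --- | --- | --- | --- |
| DNA | 1 | 0 | 729 | 101 |
| DNA/Academ | 1 | 0 | 564 | 378 |
| DNA/CMC-EnSpm | 0 | 0 | 0 | 6475 |
| DNA/CMC-Transib | 0 | 0 | 6 | 0 |
| DNA/Crypton | 0 | 0 | 4 | 0 |
| DNA/hAT | 1 | 0 | 16 | 0 |
| DNA/hAT-Ac | 0 | 0 | 4 | 0 |
| DNA/hAT-Blackjack | 6 | 0 | 810 | 494 |
| DNA/hAT-Charlie | 0 | 0 | 2 | 281 |
| DNA/hAT-hAT6 | 0 | 0 | 1 | 0 |
| DNA/hAT-hATw | 0 | 0 | 94 | 0 |
| DNA/hAT-Tip100 | 8 | 0 | 595 | 2269 |
| DNA/hAT-Tip100? | 0 | 0 | 14 | 0 |
| DNA/Kolobok-Hydra | 0 | 0 | 19 | 0 |
| DNA/Kolobok-T2 | 0 | 0 | 0 | 261 |
| DNA/Maverick | 14 | 0 | 668 | 1592 |
| DNA/MULE-MuDR | 1 | 0 | 198 | 11603 |
| DNA/PIF-Harbinger | 4 | 0 | 102 | 217 |
| DNA/PIF-ISL2EU | 0 | 0 | 0 | 478 |
| DNA/PiggyBac | 0 | 0 | 16 | 0 |
| DNA/Sola | 2 | 0 | 558 | 292 |
| DNA/TcMar | 0 | 0 | 1 | 0 |
| DNA/TcMar-Fot1 | 0 | 0 | 18 | 0 |
| DNA/TcMar-ISRm11 | 0 | 0 | 14 | 0 |
| DNA/TcMar-Tc1 | 0 | 0 | 1 | 0 |
| DNA/TcMar-Tc2 | 0 | 0 | 3 | 0 |
| LINE | 0 | 0 | 0 | 910 |
| LINE/CR1 | 16 | 56 | 357 | 4333 |
| LINE/I | 0 | 0 | 6 | 0 |
| LINE/Jockey | 0 | 0 | 17 | 837 |
| LINE/L1 | 1 | 0 | 36 | 598 |
| LINE/L1-Tx1 | 16 | 38 | 310 | 4087 |
| LINE/L2 | 47 | 25 | 3387 | 19609 |
| LINE/Penelope | 0 | 0 | 0 | 7210 |
| LINE/R2-Hero | 0 | 0 | 4 | 0 |
| LINE/Rex-Babar | 0 | 0 | 0 | 3115 |
| LINE/RTE | 0 | 0 | 0 | 171 |
| LINE/RTE-BovB | 0 | 0 | 418 | 1355 |
| LINE/RTE-RTE | 0 | 0 | 112 | 0 |
| LINE/RTE-X | 3 | 0 | 69 | 0 |
| LTR | 0 | 0 | 0 | 7046 |
| LTR/Copia | 0 | 0 | 0 | 4222 |
| LTR/DIRS | 4 | 0 | 92 | 1641 |
| LTR/Gypsy | 7 | 0 | 1845 | 204 |
| LTR/Ngaro | 0 | 0 | 6 | 0 |
| LTR/Pao | 2 | 0 | 194 | 129 |
| RC/Helitron | 65 | 0 | 2074 | 1654 |
| SINE/MIR | 4 | 0 | 8645 | 0 |
| SINE/tRNA | 32 | 0 | 577 | 0 |
| SINE/tRNA-Core | 0 | 0 | 1189 | 0 |
| SINE/tRNA-Deu-CR1 | 0 | 0 | 43 | 0 |
| SINE/tRNA-Deu-L2 | 0 | 0 | 4 | 0 |
| Unknown | 2 | 2942 | 931 | 795549 |
| Total | 237 | 3061 | 24753 | 877010 |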


Supplementary Material 4. A summary of primer pairs designed and chosen for the amplification of candidate microsatellite loci in H. scabra. There are a total of 48 loci suitable for testing. Highlighted rows indicate the 11 primer pairs that were synthesized to evaluate the polymorphism of corresponding microsatellite loci.

| Locus | F Primer Seq (5'-3') | R Primer Seq (5'-3') | Predicted repeat | Prod Len in Ref (bp) | Blastx Hits |
| --- | --- | --- | --- | --- | --- |
| hs.p2.0005 | TGATGGACTTTATCCGTACTCA | TGCTGTCACTTAATTGTGTTAGAC | (TC)9 | 400 | -- |
| hs.p2.0007 | AATCAAAGTCACAGCCAAAGACA | GGTGGTGGTCTGCTGGTTTC | (TA)9 | 164 | -- |
| hs.p2.0023 | GCTCCCTTCACAAGGTTGAGC | CAGCACTGGCCTTAATCGGG | (TA)11 | 342 | -- |
| hs.p2.0034 | ATTTTCTCCCCGTGGGCATC | ACACTTCCTGTTTGCTGGCAT | (CT)9 | 250 | -- |
| hs.p2.0054 | ACTGAGACCTTGCTTAAGACTGGA | CCTGGTGACCCCAAATGACC | (GT)12 | 398 | -- |
| hs.p4.0056 | ATCAGGGAATGTGTTACTGTGCC | GTTGCCCCTCTAAGCCCGTA | (AAAC)9 | 378 | -- |
| hs.p2.0068 | CGCCCTTGTCCGTTTACGTATT | ACAGTTGACAAAGCCGCCAG | (AT)8 | 264 | -- |
| hs.p2.0082 | AAACCCCTAACCGGACTGGA | TTTTCATTCTTCCCATGTCCCT | (CT)7 | 222 | -- |
| hs.p2.0083 | ACCTCCAGAAACCAGTCTTACAT | AAGCCATTTTCGGTGTTTCGC | (TG)8 | 341 | -- |
| hs.p4.0086 | GCCCCGTTCTTCATTTGCCT | AAGTTTCCCATGCACCGACA | (AATA)10 | 332 | -- |
| hs.p2.0095 | TTTGGCAGCACCAAGCTTCT | GCTGACCATTTCAGCTACAACC | (TG)7 | 154 | -- |
| hs.p2.0099 | ACCTCCAACAAGCAAGCACA | TTGGGGTGAGAGGATCAGGC | (AC)7 | 324 | -- |
| hs.p2.0106 | GGGCACCAAAAACAATAGGGC | CAAGTTGGGAAGCTGGGAGG | (AC)12 | 372 | -- |
| hs.p2.0112 | TTGGCAGTCAGACATGTGGT | TGCGGGGCAAAGACTTGATT | (AG)8 | 216 | -- |
| hs.p5.0116 | AGCTAAAAGCCTGGCACGAA | GGGCATCTTATGGGGCCTTG | (CATAA)29 | 318 | -- |
| hs.p3.0118 | GCCCAAGGTCAAGGAGGAGT | GCGGTTACACTTACTGCATGGT | (CAA)8 | 107 | -- |
| hs.p2.0129 | GACCTCCGCCAGAAACAACA | GCTGGACAGGACAAGGCTG | (AC)7 | 265 | -- |
| hs.p2.0130 | TTTGCAGACCCTGGACACCT | GCAGGCATGTGCACTTTGGA | (CT)7 | 167 | -- |
| hs.p2.0133 | AAGCGACGTCCTATCTCCCA | GCCTTTGGCCAGTGTTTTCC | (TA)10 | 290 | -- |
| hs.p2.0135 | ACTCACAGAGTGGTATCCTTTGA | ATGCAGCAAGGAAAGCGAGG | (AT)7 | 202 | -- |
| hs.p2.0143 | AGCAGTGAAACAGCAGGAGT | TTCTTCCTTTCAGCATGGGGT | (TG)9 | 282 | -- |
| hs.p2.0153 | ACCCTGGTGACCCCAAATGA | ACAGTTCTGTCGTACCCCGT | (AC)8 | 212 | -- |
| hs.p2.0158 | AAGGCTTTTGGTTTCGGTGA | CCGATCCGACCTTGTTTGGT | (GT)7 | 347 | -- |
| hs.p2.0160 | GCGTTGGTAGGGCCAATTCT | CCCCCATGGTGATGGACTGA | (TC)7 | 384 | -- |
| hs.p2.0163 | TCGTGTTTACAAGCAAGTGTC | GGCTTGTCTTGGGTTGATGG | (AC)11 | 294 | -- |
| hs.p2.0169 | GATGGAAACGCCAGGGAAGC | ATCGCACCACATACCTCGCA | (GT)9 | 238 | -- |
| hs.p3.0171 | GGTGTTGTAAAGCGCTACTGA | CGAAAGACGGAGGAGAACGC | (TAA)7 | 359 | -- |
| hs.p2.0172 | CTTCCTGTCCCTGTAGCCGA | AGGCCGGTCTGTCCAAAATG | (CT)10 | 290 | -- |
| hs.p2.0183 | TGAATAGGCTCCAGGGTGGT | TACTGCCTGCTGTAAGCCTGT | (AG)9 | 280 | -- |
| hs.p2.0191 | TCGGCCCAAGTAGGTTCAGT | AGTCCGCTCTTGTCAGCTCT | (TC)9 | 135 | -- |
| hs.p2.0193 | AACCATCCCAAACAAATACGCC | CGCCAACGGCAAAAGGTATG | (CA)9 | 100 | -- |
| hs.p2.0196 | ACCTAAGGTAGGATCCAGCG | GATCAGACCCTCCCGGTCAA | (TA)9 | 327 | -- |
| hs.p2.0208 | GGTGCTGAGGTACCGTATGAG | TGTGCGGTACCGATAGGTTG | (AC)8 | 124 | -- |
| hs.p3.0221 | GAGAGTCTTATTGCGGGGGC | GCCATGTGGGATTTAAGTGGG | (GAA)7 | 260 | -- |
| hs.p2.0224 | TTATCGCCGGTTCACGGATG | CGGGCAGCAAGCCTATGTAT | (AT)9 | 320 | -- |
| hs.p2.0236 | GGCACACAGCAGGATACACA | CGATAGGGCACACCTTCTCC | (GT)13 | 221 | -- |
| hs.p2.0238 | TGCATTTCTGGTTGTTTTTGGTT | CCCCATGACCTTTGACCTCC | (TG)20 | 244 | -- |
| hs.p2.0247 | GGGGTGACTGAAGGTCCAAG | GGCCCAGCCGTTTACAAGATA | (GT)9 | 129 | -- |
| hs.p2.0249 | CCCATGCCCTCATCATCTCT | TTTTCCTAACACACGCAATCT | (TA)9 | 139 | -- |
| hs.p2.0268 | GATGAAAGAGAAACCTATGCATTC | TTCAGACTTTGACCCCCGTT | (GT)7 | 233 | -- |
| hs.p4.0271 | GGCGGACCTCAAACCACATC | GCACGCAGTTGCAGGGTATT | (TCCG)7 | 169 | -- |
| hs.p2.0287 | CCATCCAACCTTCCCTTGCT | TTTTGGTTCCTTGCGGTAGC | (CA)15 | 123 | -- |
| hs.p2.0290 | TGGGGCATCTATGTGCTAAGT | ATGGGGTCCTCAGGTTGTGT | (AC)10 | 145 | -- |
| hs.p2.0312 | TCGTGTAAACAAGCAAAGTGT | TGTTTCGGCTTGTACACTGA | (AC)13 | 165 | -- |
| hs.p2.0332 | ACTTCCCCCTAAGTGCCGA | ACCGCACTGGTCTACTTATTC | (AT)17 | 110 | -- |
| hs.p2.0333 | GCAGGCAAGGAATCTATGCGA | TGGGCAATCTACATGCCAAG | (TG)7 | 128 | -- |
| hs.p2.0335 | ATCGTGCTAACAAGCAAAGTGG | TGCTTACGGGGAAGTTCGGT | (AC)7 | 105 | -- |
| hs.p2.0338 | AGTCACACTTACCTTACTTGAGA | CACCATGATACACAGGTCCTCA | (AC)8 | 107 | -- |


Supplementary Material 5. A per-locus summary of conclusions drawn from the two microsatellite locus evaluation methods.

| Locus MS-050 | Prod Len (bp) | Visual (AGE/PAGE) | Sequencing |
| --- | --- | --- | --- |
| hs.p2.0153 | 212 | Stutters | -- |
| hs.p2.0249 | 139 | None (not robust) | -- |
| hs.p3.0118 | 107 | Polymorphic & heterozygous | No definitive proof of polymorphism or heterozygosity |
| hs.p3.0171 | 359 | No variability observed | At least 2 alleles; Heterozygous |
| hs.p3.0221 | 260 | No variability observed | No variability observed |
| hs.p4.0056 | 378 | No variability observed | No definitive proof of polymorphism or heterozygosity; Cannot be used (mononucleotide repeat observed) |
| hs.p4.0086 | 332 | Polymorphic | At least 4 alleles; Heterozygous; Size anomaly observed |
| hs.p4.0271 | 169 | Stutters | -- |
| hs.p5.0116 | 318 | Not optimized | -- |
| hs.p2.0129 | 256 | Polymorphic | Polymorphic |
| hs.p2.0183 | 280 | No variability observed | Heterozygous |

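Table 4. Summary of transposable elements predicted from H. scabra and P. parvimensis genome data using RepeatMasker (RS) and RepeatModeler (RM).

| | H. scabra DNA transposon | H. scabra Retrotransposon | H. scabra Unknown | P. parvimensis DNA transposon | P. parvimensis Retrotransposon | P. parvimensis Unknown |
| --- | --- | --- | --- | --- | --- | --- |
| RS | 103 | 132 | 2 | 6511 | 17311 | 931 |
| RM | 0 | 119 | 2942 | 25994 | 55467 | 795549 |

| | H. scabra | P. parvimensis |
| --- | --- | --- |
| Total (RS and RM) | 3298 | 901763 |
| Total consolidated and dereplicated TE | 3223 | 899685 |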

from the genome of H. scabra, including both retrotransposons (Class I) and DNA transposons (Class II). Of these, 279 (8.66%) match the reference database, while a larger fraction remains unknown. The genome of P. parvimensis, a closely related holothurid, was analyzed using the same approach. The number of TEs predicted from the genome of P. parvimensis reaches 899,685. Of these, 103,205 (11.47%) match the reference database, leaving a larger fraction unidentified. Remarkably, this count far exceeds that of H. scabra. The large disparity may be attributed to the scale of sequencing, considering that only a portion of the H. scabra genome was sequenced and analyzed. Notwithstanding this disparity, the most abundant repeats in the two genomes follow a similar pattern (Table 3): DNA/hAT-Blackjack, DNA/hAT-Tip100, DNA/Maverick, RC/Helitron, LINE/L2, LTR/Gypsy, SINE/MIR and SINE/tRNA appear to be among the most abundant classes of TEs in both the H. scabra and P. parvimensis genomes. Variation in the relative abundance of DNA transposons and retrotransposons is known to exist among species regardless of their sheer number (Feschotte and Pritham 2007). In both genomes, the respective numbers per subclass/superfamily (Table 4) of TEs show that retrotransposons are more abundant than DNA transposons. Considering the two repeat-finding programs used in this study, the coverage and density (Table 5) of TEs in H. scabra are in the range of 30.24-240.19 kb repeats/Mb genome and 117-1515 TEs/Mb genome, respectively. On the other hand, the coverage and density of TEs in P. parvimensis

Table 5. Coverage, density statistics and genome proportion of transposable elements from the genomes of H. scabra and P. parvimensis.

| | H. scabra (2.02 Mb) RS | H. scabra (2.02 Mb) RM | P. parvimensis (846.5 Mb) RS | P. parvimensis (846.5 Mb) RM |
| --- | --- | --- | --- | --- |
| Coverage (kb repeats/Mb assembly) | 30.24 | 240.19 | 5.27 | 211.14 |
| Density (count per Mb) | 117 | 1515 | 29 | 1036 |
| Genome Proportion | 3.02% | 24.02% | 0.53% | 21.11% |

*The size (Mb) indicated in parentheses after the species name refers to the size of the assembly/scaffold sequences used in the study. RS – RepeatMasker; RM – RepeatModeler.


Figure 7. Heatmap of different genomes of model organisms (S. purpuratus, C. elegans and D. melanogaster) and non-model organisms (H. scabra and P. parvimensis) based on transposable elements. The left panel shows the count of repeats from REPBASE (model organism) and those searched by REPEATMASKER (non-model organisms). Right panel shows the top 10 repeats for each organism (indicated by a solid box). Unknown/unclassified and ambiguous transposable elements (denoted by ‘?’ after the name of a superfamily) are excluded from the graph.

Supplementary Material 6. Comparison of the transposable elements searched by RepeatMasker (RS) and RepeatModeler (RM) from the genomes of H. scabra and P. parvimensis. UID_RM-ID_RS – Unidentified transposable elements from RepeatModeler that have been identified by RepeatMasker; ID_RM==ID_RS – Identified transposable elements from RepeatModeler that have also been identified by RepeatMasker; ID_RM!=ID_RS – Transposable elements that have different identities based on RepeatModeler and RepeatMasker.

| | H. scabra | P. parvimensis |
| --- | --- | --- |
| UID_RM-ID_RS | 2 | 16 |
| DNA/Academ | 0 | 2 |
| DNA/hAT-Blackjack | 0 | 25 |
| DNA/hAT-Tip100 | 0 | 110 |
| DNA/hAT-Tip100? | 0 | 1 |
| DNA/Maverick | 0 | 2 |
| DNA/MULE-MuDR | 0 | 2 |
| DNA/PIF-Harbinger | 0 | 2 |
| LINE/L1-Tx1 | 0 | 9 |
| LINE/L2 | 0 | 30 |
| LTR/DIRS | 0 | 3 |
| RC/Helitron | 45 | 586 |
| SINE/MIR | 0 | 726 |
| SINE/tRNA | 12 | 9 |
| SINE/tRNA-Core | 0 | 36 |
| SINE/tRNA-Deu-CR1 | 0 | 2 |
| ID_RM==ID_RS | 3 | 8 |
| DNA/Academ | 0 | 24 |
| DNA/hAT-Blackjack | 0 | 9 |
| DNA/Sola | 0 | 4 |
| LINE/CR1 | 2 | 22 |
| LINE/L1-Tx1 | 2 | 5 |
| LINE/L2 | 14 | 440 |
| LTR/Gypsy | 0 | 2 |
| LTR/Pao | 0 | 27 |
| ID_RM!=ID_RS | 0 | 5 |
| DNA/Maverick | 0 | 1 |
| DNA/MULE-MuDR | 0 | 7 |
| LINE/L2 | 0 | 10 |
| LTR/Gypsy | 0 | 4 |
| SINE/MIR | 0 | 324 |


are 5.27-211.14 kb repeats/Mb genome and 29-1036 TEs/Mb genome, respectively. Moreover, based on the sizes of the predicted retrotransposons and DNA transposons, TEs are estimated to occupy 3.02-24.02% and 0.53-21.11% (Table 5) of the genomic space of H. scabra and P. parvimensis, respectively. Finally, comparative analysis of transposable elements (Figure 7) from the genomes of holothurids (H. scabra and P. parvimensis) and model organisms including S. purpuratus, C. elegans and D. melanogaster illustrates general trends of similarities and dissimilarities in TE composition between closely and distantly related species.
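The density and genome-proportion figures in Table 5 follow from simple ratios, which the sketch below reproduces from the totals in Table 3 and the assembly sizes in Table 5. It is a hedged illustration rather than the authors' script; the function names are hypothetical, and the conversion assumes 1 Mb = 1000 kb.

```python
# Assembly sizes (Mb) from Table 5 and total TE counts from Table 3.
ASSEMBLY_MB = {"H. scabra": 2.02, "P. parvimensis": 846.5}
TE_COUNTS = {
    ("H. scabra", "RS"): 237,
    ("H. scabra", "RM"): 3061,
    ("P. parvimensis", "RS"): 24753,
    ("P. parvimensis", "RM"): 877010,
}


def density(count, assembly_mb):
    """TE loci per Mb of assembly."""
    return count / assembly_mb


def genome_percent(coverage_kb_per_mb):
    """Convert coverage in kb of repeats per Mb of assembly to a percentage."""
    return coverage_kb_per_mb / 1000.0 * 100.0


def percent_of(part, whole):
    """Share of 'part' in 'whole', as a percentage."""
    return 100.0 * part / whole


for (species, tool), count in TE_COUNTS.items():
    print(f"{species} {tool}: {density(count, ASSEMBLY_MB[species]):.0f} TEs/Mb")

print(f"{genome_percent(240.19):.2f}% of the H. scabra assembly (RM)")
print(f"{percent_of(279, 3223):.2f}% of H. scabra TEs match the reference database")
```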

CONCLUSIONS

The iETR profiles of H. scabra and P. parvimensis in both microsatellite locus density and coverage exhibit patterns that are typical of eukaryotes based on comparable studies of perfect repeat profiles. The dereplicated data sets, which depict the microsatellite composition of genomes more completely than their counterpart iETR data sets, also bring to light fundamental dissimilarities between the microsatellite profiles of the two sea cucumbers. For instance, the dereplicated data sets exhibit different predominant STR classes and different prominently under-represented STR classes between the two sea cucumber genomes. These fundamental STR profile differences have never been observed previously in species of the same order in the phylum Echinodermata, and may reflect important differences in genome dynamics that drive evolution. Additionally, comparisons between corresponding dereplicated and iETR data sets show that as much as half of all true STRs may be missed when imperfect repeats are excluded from an STR scan. This highlights the importance of ATR-inclusive STR searches. In identifying and testing candidate microsatellite markers, this study has also demonstrated that partial genome sequencing may be used as a cheaper and more efficient alternative to the traditional methods of developing microsatellite markers for H. scabra.

This study has presented for the first time a list of interspersed repeats, particularly transposable elements, in the genomes of the non-model organisms H. scabra and P. parvimensis. Despite a large disparity in the number of predicted transposable elements in the two genomes, a number of similar patterns in the repeat profiles were observed between H. scabra and P. parvimensis.

AUTHORS’ CONTRIBUTION

DATB, IDCU and CCT contributed equally to the implementation of the experiments. DATB and IDCU contributed equally to the analysis of the data and preparation of the manuscript. AOL conceptualized the project, and helped in the analysis of the data and in the preparation of the manuscript.

ACKNOWLEDGMENTS

The authors are deeply grateful to Dr. Marie Antonette Juino-Menez and Dr. Rachel Ravago-Gotanco for the H. scabra samples. This study was funded by a research grant from the Philippine Department of Agriculture-Biotechnology Program Implementation Unit (DA-Biotech) to AOL.

CONFLICTS OF INTEREST

The authors declare no conflict of interest.

REFERENCES

BAO Z, EDDY SR. 2002. Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes. Genome Res. 12:1269–1276.

BRITTEN R, KOHNE D. 1968. Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science 161:529–540.

BUSCHIAZZO E, GEMMELL NJ. 2006. The rise, fall and renaissance of microsatellites in eukaryotic genomes. BioEssays 28:1040–50.

CLARK RM, BHASKAR SS, MIYAHARA M, DALGLIESH GL, BIDICHANDANI SI. 2006. Expansion of GAA trinucleotide repeats in mammals. Genomics 87:57–67.

DELGRANGE O, RIVALS E. 2004. STAR: an algorithm to Search for Tandem Approximate Repeats. Bioinformatics 20:2812–2820.

ESTOUP A, WILSON IJ, SULLIVAN C, CORNUET JM, MORITZ C. 2001. Inferring population history from microsatellite and enzyme data in serially introduced cane toads, Bufo marinus. Genetics 159:1671–1687.


ESTOUP A, ARNE P, CORNUET J-M. 2002. Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis. Mol. Ecol. 11:1591–1604.

ESTOUP A, SOLIGNAC M, HARRY M, CORNUET JM. 1993. Characterization of (GT)n and (CT)n microsatellites in two insect species: Apis mellifera and Bombus terrestris. Nucleic Acids Res. 21:1427–1431.

FESCHOTTE C, KESWANI U, RANGANATHAN N, GUIBOTSY ML, LEVINE D. 2009. Exploring repetitive DNA landscapes using REPCLASS, a tool that automates the classification of transposable elements in eukaryotic genomes. Genome Biol. Evol. 1:205–220.

FESCHOTTE C, PRITHAM EJ. 2007. DNA transposons and the evolution of eukaryotic genomes. Annu. Rev. Genet. 41:331–368.

GARZA JC, SLATKIN M, FREIMER NB. 1995. Microsatellite allele frequencies in humans and chimpanzees, with implications for constraints on allele size. Mol. Biol. Evol. 12:594–603.

GOLDSTEIN D, SCHLOTTERER C. 1999. Microsatellites: Evolution and Applications. Oxford Univ. Press.

GROVER A, AISHWARYA V, SHARMA PC. 2007. Biased distribution of microsatellite motifs in the rice genome. Mol. Genet. Genomics 277:469–480.

GUICHOUX E, LAGACHE L, WAGNER S, CHAUMEIL P, LÉGER P, LEPAIS O, LEPOITTEVIN C, MALAUSA T, REVARDEL E, SALIN F, ET AL. 2011. Current trends in microsatellite genotyping. Mol. Ecol. Notes 11:591–611.

KANG JH, KIM YK, KIM MJ, PARK JY, AN CM, KIM BS, JUN JC, KIM SK. 2011. Genetic differentiation among populations and color variants of sea cucumbers (Stichopus japonicus) from Korea and China. Int. J. Biol. Sci. 7:323–332.

KIM J, VANGURI S, BOEKOE J, GABRIEL A, VOYTAS D. 1998. Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence. Genome Res. 8:464–478.

KOLPAKOV R, BANA G, KUCHEROV G. 2003. mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31:3672–3678.

KONDO M, AKASAKA K. 2012. Current Status of Echinoderm Genome Analysis - What do we Know? Curr. Genomics 13:134–143.

KRULL M, PETRUSMA M, MAKALOWSKI W, BROSIUS J, SCHMITZ J. 2007. Functional persistence of exonized mammalian-wide interspersed repeat elements (MIRs). Genome Res. 17:1139–1145.

LANDER ES, LINTON LM, BIRREN B, NUSBAUM C, ZODY MC, BALDWIN J, DEVON K, DEWAR K, DOYLE M, FITZHUGH W, ET AL. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

LEGENDRE M, POCHET N, PAK T, VERSTREPEN KJ. 2007. Sequence-based estimation of minisatellite and microsatellite repeat variability. Genome Res.:1787–1796.

LERAT E. 2010. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb). 104:520–533.

LEVINSON G, GUTMAN GA. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4:203–221.

MALPERTUY A, DUJON B, RICHARD GF. 2003. Analysis of microsatellites in 13 hemiascomycetous yeast species: mechanisms involved in genome dynamics. J. Mol. Evol. 56:730–741.

MARGULIES M, EGHOLM M, ALTMAN WE, ATTIYA S, BADER JS, BEMBEN LA, BERKA J, BRAVERMAN MS, CHEN Y-J, CHEN Z, ET AL. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380.

MEGLÉCZ E, NÈVE G, BIFFIN E, GARDNER MG. 2012. Breakdown of phylogenetic signal: a survey of microsatellite densities in 454 shotgun sequences from 154 non model eukaryote species. PLoS One 7:e40861.

MORGANTE M, HANAFEY M, POWELL W. 2002. Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat. Genet. 30:194–200.

MUOTRI AR, MARCHETTO MCN, COUFAL NG, GAGE FH. 2007. The necessary junk: new functions for transposable elements. Hum. Mol. Genet. 16 Spec No:R159–167.

O’DUSHLAINE CT, EDWARDS RJ, PARK SD, SHIELDS DC. 2005. Tandem repeat copy-number variation in protein-coding regions of human genes. Genome Biol. 6:R69.

PEARSON CE, NICHOL EDAMURA K, CLEARY JD. 2005. Repeat instability: mechanisms of dynamic mutations. Nat. Rev. Genet. 6:729–742.

PELIZZOLA M, ECKER JR. 2011. The DNA methylome. FEBS Lett. 585:1994–2000.

PRICE AL, JONES NC, PEVZNER PA. 2005. De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1:i351–358.

PRIMMER CR, SAINO N, MØLLER AP, ELLEGREN H. 1996. Directional evolution in germline microsatellite mutations. Nat. Genet. 13:391–393.

RICHARD G-F, KERREST A, DUJON B. 2008. Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiol. Mol. Biol. Rev. 72:686–727.

SAINUDIIN R, DURRETT RT, AQUADRO CF, NIELSEN R. 2004. Microsatellite mutation models: insights from a comparison of humans and chimpanzees. Genetics 168:383–395.

SAINZ J, PRATS E, RUIZ S, CORNUDELLA L. 1992. Organization of repetitive DNA sequences in the genome of the echinoderm Holothuria tubulosa. Biochimie 74:1067–1074.

SEA URCHIN SEQUENCING CONSORTIUM. 2006. The genome of the sea urchin Strongylocentrotus purpuratus. Science 314:941–952.

SELKOE KA, TOONEN RJ. 2006. Microsatellites for ecologists: a practical guide to using and evaluating microsatellite markers. Ecol. Lett. 9:615–629.

SHARMA PC, GROVER A, KAHL G. 2007. Mining microsatellites in eukaryotic genomes. Trends Biotechnol. 25:490–498.

SIBLY RM, MEADE A, BOXALL N, WILKINSON MJ, CORNE DW, WHITTAKER JC. 2003. The structure of interrupted human AC microsatellites. Mol. Biol. Evol. 20:453–459.

SMIT A, HUBLEY R, GREEN P. 2010. RepeatMasker Open-3.0.

STEFANINI FM, FELDMAN MW. 2000. Bayesian estimation of range for microsatellite loci. Genet. Res. 75:167–177.

TANG J, BALDWIN SJ, JACOBS JM, VAN DER LINDEN CG, VOORRIPS RE, LEUNISSEN JA, VAN ECK H, VOSMAN B. 2008. Large-scale identification of polymorphic microsatellites using an in silico approach. BMC Bioinformatics 9:374.

TEMNYKH S, DECLERCK G, LUKASHOVA A, LIPOVICH L, CARTINHOUR S, MCCOUCH S. 2001. Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res. 11:1441–1452.

THIEL T, MICHALEK W, VARSHNEY RK, GRANER A. 2003. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor. Appl. Genet. 106:411–422.

THIEL T. 2004. MISA - MIcroSAtellite identification tool (update, September 2010).

THOMAS EE. 2005. Short, local duplications in eukaryotic genomes. Curr. Opin. Genet. Dev. 15:640–644.

THOMPSON JM, SALIPANTE SJ. 2009. PeakSeeker: a program for interpreting genotypes of mononucleotide repeats. BMC Res. Notes 2:17.

USTINOVA J, ACHMANN R, CREMER S, MAYER F. 2006. Long repeats in a huge genome: microsatellite loci in the grasshopper Chorthippus biguttulus. J. Mol. Evol. 62:158–167.

VERSTREPEN KJ, JANSEN A, LEWITTER F, FINK GR. 2005. Intragenic tandem repeats generate functional variability. Nat. Genet. 37:986–990.

WHITTAKER JC, HARBORD RM, BOXALL N, MACKAY I, DAWSON G, SIBLY RM. 2003. Likelihood-based estimation of microsatellite mutation rates. Genetics 164:781–787.

ZANE L, BARGELLONI L, PATARNELLO T. 2002. Strategies for microsatellite isolation: a review. Mol. Ecol. 11:1–16.

ZHAO X-N, USDIN K. 2015. The Repeat Expansion Diseases: The dark side of DNA repair. DNA Repair (Amst). 32:96–105.
