WHOLE GENOME COMPARISON OF SIX CROCOSPHAERA WATSONII STRAINS WITH DIFFERING PHENOTYPES¹

Shellie R. Bench,² Philip Heller, Ildiko Frank, Martha Arciniega, Irina N. Shilova, and Jonathan P. Zehr³

Ocean Sciences Department, University of California Santa Cruz, 1156 High Street, Earth & Marine Sciences Bldg., Santa Cruz, CA 95064, USA

Abstract

Crocosphaera watsonii, a unicellular nitrogen-fixing cyanobacterium found in oligotrophic oceans, is important in marine carbon and nitrogen cycles. Isolates of C. watsonii can be separated into at least two phenotypes with environmentally important differences, indicating possibly distinct ecological roles and niches. To better understand the evolutionary history and variation in metabolic capabilities among strains and phenotypes, this study compared the genomes of six C. watsonii strains, three from each phenotypic group, which had been isolated over several decades from multiple ocean basins. While a substantial portion of each genome was nearly identical to sequences in the other strains, a few regions were identified as specific to each strain and phenotype, some of which help explain observed phenotypic features. Overall, the small-cell type strains had smaller genomes and a relative loss of genetic capabilities, while the large-cell type strains were characterized by larger genomes, some genetic redundancy, and potentially increased adaptations to iron and phosphorus limitation. As such, strains with shared phenotypes were evolutionarily more closely related than those with the opposite phenotype, regardless of isolation location or date. Unexpectedly, the genome of the type-strain for the species, C. watsonii WH8501, was quite unusual even among strains with a shared phenotype, indicating it may not be an ideal representative of the species. The genome sequences and analyses reported in this study will be important for future investigations of the proposed differences in adaptation of the two phenotypes to nutrient limitation, and to identify phenotype-specific distributions in natural Crocosphaera populations.

Key index words: exopolysaccharide biosynthesis; genome comparison; genome evolution; marine cyanobacteria; nitrogen fixation

Crocosphaera watsonii is a unicellular nitrogen (N2)-fixing cyanobacterium that is widely distributed throughout tropical and subtropical oligotrophic oceans. In those regions, the low level of bioavailable nitrogen (N) often limits primary production, and N2-fixing (i.e., diazotrophic) phytoplankton can be an important source of N for the phytoplankton community. A variety of studies have demonstrated that unicellular diazotrophic cyanobacteria, especially Crocosphaera and UCYN-A, are abundant and contribute substantial amounts of N in many oligotrophic regions (Zehr et al. 2001, Falcon et al. 2004, Montoya et al. 2004, Church et al. 2005, 2008, Langlois et al. 2008, Kitajima et al. 2009, Moisander et al. 2010). Crocosphaera strains, all of which are the species C. watsonii, have been successfully isolated from multiple ocean basins and maintained in culture for many years. Although these strains exhibit phenotypic differences, genetic comparisons have found the vast majority of DNA sequences to be essentially identical among cultivated strains and environmental sequences (Zehr et al. 2007, Bench et al. 2011). In the context of such high levels of sequence conservation, C. watsonii strains appear to diverge and maintain genetic diversity through genetic rearrangements and by incorporating strain-specific sequences (Zehr et al. 2007, Bench et al. 2011). Large numbers of mobile genetic elements (i.e., transposase genes) in the C. watsonii WH8501 genome provide a possible mechanism for such genetic insertions, deletions, and rearrangements (Bench et al. 2011). Crocosphaera are distinguished by these characteristics from sympatric non-N2-fixing marine cyanobacteria, such as Synechococcus and Prochlorococcus, which generally lack transposase genes and have a high degree of genomic sequence diversity in cultured strains and environmental sequences (Coleman et al. 2006, Rusch et al. 2007, Zhao and Qin 2007, Dufresne et al. 2008, Scanlan et al. 2009, Partensky and Garczarek 2010).

¹ Received 4 November 2012. Accepted 14 May 2013.
² Present address: Environmental Earth System Science, Stanford University, 473 Via Ortega, Rm 140, Stanford, CA 94305, USA.
³ Author for correspondence: e-mail [email protected].
Editorial Responsibility: D. Lindell (Associate Editor)

J. Phycol. 49, 786–801 (2013). © 2013 Phycological Society of America. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made. DOI: 10.1111/jpy.12090



Physiological studies of cultivated and natural populations of C. watsonii have identified a number of genetic strategies which appear to be adaptations to the oligotrophic environment. These include regulation of gene expression, nitrogen fixation rates, and cellular protein content in response to changes in nutrient (e.g., iron and phosphorus) levels and other environmental variables (Webb et al. 2001, Tuit et al. 2004, Falcon et al. 2005, Dyhrman and Haley 2006, Fu et al. 2008, Hewson et al. 2009, Compaoré and Stal 2010, Shi et al. 2010, Saito et al. 2011). Currently, cultivated C. watsonii strains can be divided into two broad phenotypic categories: (i) those that produce large amounts of exopolysaccharide (EPS) and have larger cell diameters (over 4 µm), and (ii) those that do not produce noticeable EPS, and have cell diameters less than 4 µm (Webb et al. 2009, Sohm et al. 2011). The most apparent difference between the two phenotypes in culture is that the large-cell strains produce over 10 times the amount of EPS as the small-cell strains (Sohm et al. 2011). While it is not well understood why Crocosphaera sp. produce EPS, studies on other species have shown that EPS production can have cell protective properties (Pereira et al. 2009), and can also enhance carbon export from surface waters in the form of marine snow (Passow et al. 2001, Sohm et al. 2011). A recent genomic comparison of two Crocosphaera strains, one of each phenotype, identified a region in the large-cell type genome that is likely to play an important role in EPS production (Bench et al. 2011). This region contained 25 genes, many of which encoded functions related to EPS biosynthesis, and all of which were absent from the small-cell type genome. The two phenotype groups have additional, ecologically relevant differences in phosphorus scavenging gene content, growth temperature optima, per-cell nitrogen fixation rates, and photosynthetic efficiency (Dyhrman and Haley 2006, Webb et al. 2009, Sohm et al. 2011).

To better understand the genetic basis of the C. watsonii phenotypes, this study compared the genomes of six strains, three in each phenotypic group, isolated over decades from multiple ocean basins. This comparison included examining evolutionary relationships among strains and identifying genomic features and metabolic capabilities that are unique to strains and phenotypes.

MATERIALS AND METHODS

Strain growth and genomic DNA isolation and sequencing. The phenotypes, isolation locations, and genome GenBank accession numbers for the C. watsonii strains described in this study are listed in Table 1, and all strains have been previously described (Waterbury et al. 1986, 1988, Webb et al. 2009). All strains were grown in nitrogen-free SO medium (Waterbury et al. 1986, 1988) in polycarbonate tissue culture flasks with a 0.2 µm pore-size vent cap (Corning Inc., Corning, NY, USA) at 26°C under a 12:12 h light/dark cycle. The genome of the WH8501 strain was sequenced by the Joint Genome Institute, and the resulting publicly available sequence (accession number in Table 1) was used for comparisons in this study. The genome of strain WH0003 was sequenced prior to this study, with detailed methods described in Bench et al. (2011). Briefly, a non-axenic culture was subjected to bead-beating to detach cells from their extracellular matrix (ECM), and subsequently cells were sorted using fluorescence activated cell sorting. The genomic DNA from the sorted cells was amplified using the GenomiPhi V2 amplification kit (Amersham Biosciences, Piscataway, NJ, USA).

Genomic DNA for the four additional strains described in this study was obtained using the same methods as the WH0003 strain (Bench et al. 2011), with the following modifications: the WH8502 and WH0401 strains do not produce large amounts of ECM, so they were sorted without bead-beating, and instead of GenomiPhi, the amplification kit used for all four strains was the REPLI-g Midi kit (Qiagen, Valencia, CA, USA). The REPLI-g amplification method was based on the protocol provided by Qiagen for “small numbers of cells or single cells” (http://www.qiagen.com/products/genomicdnastabilizationpurification/replig/repligminimidikits.aspx#Tabs=t2). Specifically, sorted cells (5,000–10,000 cells for each strain) were spun at 14,000 rpm (20,800 g) for 40 min and the supernatant was discarded. Pelleted cells were resuspended in 2.5 µL of PBS followed by 3.5 µL of Buffer D2 (see Qiagen protocol above), and lysed in a 65°C water bath for 5 min. The lysed cells were placed on ice, and lysis was terminated by adding 3.5 µL of Stop Solution (provided with kit). Amplification was immediately carried out in 50 µL reactions, which contained the cell-lysis mix plus 1 µL of REPLI-g Midi DNA Polymerase, 29 µL of REPLI-g Midi reaction buffer, and 10 µL of RT-PCR grade H2O.

Prior to 454 sequencing, amplified genomic DNA was quantified using Pico Green (Invitrogen Corporation, Carlsbad, CA, USA). Using sorted cell amplified DNA, shotgun libraries for each strain were constructed and sequenced by the UCSC Genome Sequencing Center (http://biomedical.ucsc.edu/GenomeSequencing.html) on the Genome Sequencer FLX instrument using Titanium Series protocols according to the manufacturer’s specifications (454 Life Sciences, Branford, CT, USA).

TABLE 1. Crocosphaera watsonii phenotypes and strain origins.

| Strain | Phenotype | Year isolated | Ocean basin where isolated | Location where isolated | Genome accessions |
|---|---|---|---|---|---|
| WH8501 | Small-cell | 1984 | S. Atlantic | 28°S, 48°W | AADV00000000.2 |
| WH8502 | Small-cell | 1984 | S. Atlantic | 26°S, 42°W | CAQK01000001 - CAQK01000869 |
| WH0401 | Small-cell | 2002 | N. equatorial Atlantic | 6°N, 49°W | CAQM01000001 - CAQM01000918 |
| WH0003 | Large-cell | 2000 | N. Pacific (St. ALOHA) | 22°N, 158°W | AESD01000001 - AESD01001126 |
| WH0005 | Large-cell | 2000 | N. Pacific (St. ALOHA) | 22°N, 158°W | CAQL01000001 - CAQL01001266 |
| WH0402 | Large-cell | 2002 | S. equatorial Atlantic | 11°S, 32°W | CAQN01000001 - CAQN01001343 |

Genome assembly and annotation. For the four strains sequenced during this study, an average of 363,200 reads were generated per genome, with an average read length of 374 bp and ~135,900 kb of sequence data for each genome. The genome-wide average depth of coverage ranged from 23× to 30× for each strain, and while some variation in read depth was noted among contigs, the vast majority (86%–98%) of contigs in each genome had an average coverage of over 10 reads deep. The reads for each strain were assembled separately using Version 2.0.00 of the Newbler GS De Novo Assembler program (454 Life Sciences). The assembly was run via command line interface using the “-nrm”, “-consed”, and “-large” flags. All other parameters used were the default values, as described in the manufacturer’s publication, “Genome Sequencer Data Analysis Software Manual”.
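As a rough cross-check on the depth-of-coverage figures, total sequenced bases can be divided by assembled genome length. The sketch below does this in Python using the average read count and read length from the text and the genome lengths from Table 2. Per-strain read counts are not reported, so applying the average to every strain is an assumption, and the function name is illustrative only.

```python
def mean_coverage(n_reads, mean_read_len_bp, genome_len_bp):
    """Approximate genome-wide depth: total sequenced bases / genome length."""
    return n_reads * mean_read_len_bp / genome_len_bp


# Averages reported in the text. Per-strain read counts are not given,
# so using the same average for every strain is an assumption.
AVG_READS = 363_200
AVG_READ_LEN_BP = 374

# Total genome lengths (bp) of the four newly sequenced strains (Table 2).
GENOME_LENGTHS_BP = {
    "WH8502": 4_683_052,
    "WH0401": 4_551_017,
    "WH0005": 5_975_524,
    "WH0402": 5_880_358,
}

total_bp = AVG_READS * AVG_READ_LEN_BP
print(f"sequence data per genome: ~{total_bp / 1000:,.0f} kb")

for strain, length in GENOME_LENGTHS_BP.items():
    depth = mean_coverage(AVG_READS, AVG_READ_LEN_BP, length)
    print(f"{strain}: ~{depth:.0f}x")
```

Because the smaller genomes receive the same amount of sequence data in this approximation, they come out with the deeper coverage.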

Open reading frames (ORFs) in each of the contig sequences for all four draft genomes were identified and annotated using RAST (Aziz et al. 2008). A small number of contigs (two contigs, ~1.3 kb, from the WH0401 genome, and 18 contigs, ~14 kb, from the WH0005 genome) were removed from further analysis based on a lack of recognizable coding sequence and/or their lack of any homology to known cyanobacterial sequences. It should be noted that there is an unavoidable difficulty in ORF identification and annotation using multi-contig genomes. As such, there may be a small fraction of coding sequences in these genomes that were not properly identified, particularly at the ends of contigs. In addition to annotated ORFs, each genome contained a single rRNA operon and 39 tRNAs. The final number of bases and contigs in each genome, as well as the %G+C and number of annotated ORFs, are listed in Table 2. The genome sequences and annotations deposited at DDBJ/EMBL/GenBank are publicly available at http://www.ncbi.nlm.nih.gov/ using the accession numbers listed in Table 1.

Transposase genes were identified and assigned to IS families using the BLAST tool on the ISfinder website (http://www-is.biotoul.fr/is.html) with default parameters (Siguier et al. 2006). ORFs with protein BLAST (BLASTp) E-values of <10⁻³ were annotated as transposases, and assigned to the IS family of the top BLAST hit. Also, a small number of ORFs (10–18 per genome) were annotated with the transposase function by RAST, but did not have qualifying BLAST hits to the ISfinder database. These were included in all of the transposase analyses, such as genome counts and IS family tallies. ORFs without a qualifying ISfinder hit and with a RAST annotation lacking IS family information were listed as “unknown” in the IS family counts. A similar process was used to identify transposase genes in WH0003 and WH8501, with one additional pre-analysis step for the WH8501 genome which involved identification and grouping of highly duplicated sequences (see methods in Bench et al. 2011).
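The transposase assignment rules above amount to a small decision procedure, which might be sketched as follows. This is only an illustration of the logic as described, not the software used in the study: the function name, its arguments, and the substring check on the RAST function text are all assumptions.

```python
E_VALUE_CUTOFF = 1e-3  # BLASTp hits to ISfinder must fall below this E-value


def classify_transposase(top_hit_evalue, top_hit_family, rast_function,
                         rast_family=None):
    """Return an IS family name, "unknown", or None (not a transposase).

    top_hit_evalue / top_hit_family: best ISfinder BLASTp hit, or None if
    the ORF had no hit at all (hypothetical inputs).
    rast_function: free-text RAST annotation for the ORF.
    rast_family: IS family named in the RAST annotation, if any.
    """
    # Rule 1: a qualifying ISfinder hit decides both status and family.
    if top_hit_evalue is not None and top_hit_evalue < E_VALUE_CUTOFF:
        return top_hit_family
    # Rule 2: otherwise fall back on the RAST annotation; the family is
    # "unknown" when RAST gives no IS family information.
    if "transposase" in rast_function.lower():
        return rast_family or "unknown"
    return None


# Hypothetical ORFs, for illustration only.
print(classify_transposase(1e-40, "IS200/IS605", "hypothetical protein"))
print(classify_transposase(0.5, "IS4", "Transposase"))
print(classify_transposase(None, None, "DNA polymerase III"))
```

Keeping the two evidence sources in a fixed order makes it explicit that the ISfinder hit, when it qualifies, takes precedence over the RAST annotation.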

Genome comparisons. Comparisons between all six genomes were based on nucleotide BLAST of ORFs. The ORFs from each genome were used as queries in BLAST comparisons against the other five genomes, and the criterion used to classify an ORF as shared between genomes was >95% nucleotide identity over at least 70% of the ORF length. This criterion was based on the observation that the nucleotide sequences for shared ORFs were generally >99% identical, and fell off rapidly below 98% (Figure S1 in the Supporting Information). The same similarity criterion was used to cluster ORFs within a single genome into repeated sequence groups using the CD-HIT web server (Li and Godzik 2006, Huang et al. 2010). To assess similarities across all six genomes, six tables of BLAST results (one for each genome versus the other five genomes) were merged using custom software according to sequence similarity and binned by the genomes in which the sequence was present. From the original 34,455 ORFs in the six genomes, this process produced a non-redundant set of 11,635 sequences that represented all ORFs in the six genomes.
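The shared-ORF criterion can be illustrated with a short filter over tabular BLAST output. The sketch below is not the custom software mentioned above; it assumes tab-separated hits whose first four columns are query ID, subject ID, percent identity, and alignment length, and it uses alignment length divided by ORF length as a simple proxy for the fraction of the ORF covered. All ORF names and hit values are hypothetical.

```python
import csv
import io

MIN_IDENTITY = 95.0  # percent nucleotide identity; hits must exceed this
MIN_COVERAGE = 0.70  # minimum fraction of the query ORF covered by the hit


def is_shared(pident, aln_len, orf_len):
    """Apply the >95% identity over at least 70% of ORF length rule."""
    return pident > MIN_IDENTITY and aln_len / orf_len >= MIN_COVERAGE


def shared_orfs(blast_tsv, orf_lengths):
    """Return the query ORF IDs that have at least one qualifying hit.

    blast_tsv: file-like object of tab-separated hits (assumed layout:
    query ID, subject ID, percent identity, alignment length, ...).
    orf_lengths: dict mapping query ORF ID -> ORF length in bp.
    """
    shared = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        query, pident, aln_len = row[0], float(row[2]), int(row[3])
        if is_shared(pident, aln_len, orf_lengths[query]):
            shared.add(query)
    return shared


# Hypothetical hits and ORF lengths, for illustration only.
demo_hits = io.StringIO(
    "orfA\tcontig1\t99.8\t950\n"   # 95% of the ORF at 99.8% identity
    "orfB\tcontig7\t99.9\t300\n"   # covers only 30% of the ORF
    "orfC\tcontig2\t93.0\t1000\n"  # full length, but identity too low
)
demo_lengths = {"orfA": 1000, "orfB": 1000, "orfC": 1000}
print(shared_orfs(demo_hits, demo_lengths))
```

Running the same filter once per query genome against each of the other five would give the six result tables that the merging step starts from.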

Those 11,635 ORFs were grouped into 63 unique sharing patterns based on their presence–absence observed in the six genomes. The 63 patterns were further aggregated into categories according to the number of genomes where the ORF was present: there are six patterns for ORFs present in only one genome, 15 possible patterns for the category of ORFs present in two genomes, etc. Equal similarity among the six genomes would result in random sharing and the expectation of equal ORF counts for each pattern within each category. For example, there were 1,727 sequences in the 15 patterns where ORFs were present in two genomes and, in the case of random sharing, the expected ORF count would be 115.1 (i.e., 1,727/15) in each. A chi-squared goodness-of-fit test was used to assess the statistical significance of differences between the observed and expected ORF counts in each pattern. A similar analysis was performed on ORF counts for patterns grouped according to strain phenotypes. For example, the 1,727 sequences shared by exactly two genomes were aggregated into three groups, with two groups where the two genomes share a phenotype (both are large cell or both are small cell), and one group of ORFs found in a genome of each type. The random ORF sharing hypothesis was also tested for the grouped patterns using the chi-squared goodness-of-fit test.
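The pattern counts and the goodness-of-fit statistic can be reproduced with a few lines of Python. This is a sketch, not the authors' analysis code. It uses the two-genome counts reported in the Results (751 ORFs found only in pairs of large-cell strains, 502 only in pairs of small-cell strains, 474 in mixed pairs), and weighting the expected counts by the number of patterns in each group (3, 3, and 9) is an assumption about how the grouped test was set up.

```python
from math import comb

N_GENOMES = 6

# Number of presence-absence patterns for ORFs found in exactly k genomes.
patterns_by_category = {k: comb(N_GENOMES, k) for k in range(1, N_GENOMES + 1)}
total_patterns = sum(patterns_by_category.values())  # equals 2**6 - 1


def chi_square_gof(observed, weights=None):
    """Chi-squared goodness-of-fit statistic and degrees of freedom.

    Expected counts are the observed total split in proportion to
    `weights` (equal weights when omitted, i.e. random sharing).
    """
    if weights is None:
        weights = [1] * len(observed)
    total = sum(observed)
    weight_sum = sum(weights)
    expected = [total * w / weight_sum for w in weights]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Expected count per two-genome pattern under random sharing.
expected_per_pattern = 1727 / patterns_by_category[2]

# Two-genome ORFs grouped by phenotype: large/large, small/small, mixed.
observed = [751, 502, 474]
patterns_in_group = [comb(3, 2), comb(3, 2), 3 * 3]
stat, dof = chi_square_gof(observed, patterns_in_group)
print(total_patterns, round(expected_per_pattern, 1), round(stat, 1), dof)
```

Under these assumptions the statistic comes out at roughly 850 with two degrees of freedom, far above the 5.99 critical value at the 0.05 level, so random sharing is rejected for the grouped two-genome patterns.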

TABLE 2. Genome sizes and gene content statistics for six Crocosphaera watsonii strains.

| Phenotype | Strain | Total genome length (bp) | Average %G+C | Number of ORFs | Number of transposase genes | Number (and %) of strain-specific ORFs (a) | Number of strain-specific transposase genes | Number of contigs |
|---|---|---|---|---|---|---|---|---|
| Small-cell | WH8501 | 6,238,156 | 37.1 | 5,958 | 1,211 | 229 (3.8%) | 71 | 320 |
| Small-cell | WH8502 | 4,683,052 | 37.6 | 4,965 | 165 | 104 (2.1%) | 4 | 869 |
| Small-cell | WH0401 | 4,551,017 | 37.7 | 4,997 | 166 | 132 (2.6%) | 4 | 918 |
| Large-cell | WH0003 | 5,892,658 | 37.7 | 6,145 | 223 | 223 (3.6%) | 9 | 1,130 |
| Large-cell | WH0005 | 5,975,524 | 37.6 | 5,919 | 204 | 167 (2.8%) | 10 | 1,266 |
| Large-cell | WH0402 | 5,880,358 | 37.7 | 6,471 | 216 | 315 (4.9%) | 19 | 1,343 |

(a) An ORF was considered strain-specific if it had no BLAST similarity to ORFs in any other genomes at 95% ID over 70% of the ORF length.

Analysis and PCR of specific groups of functional genes. In order to investigate the phylogenetic relationships of the six C. watsonii strains, nucleotide sequences from 25 ORFs were concatenated and aligned to construct a phylogenetic tree and distance matrix. The 25 genes were chosen using the following criteria: (i) they were present in all six strains, (ii) they had some variation between strains (i.e., the vast majority of 100% identical ORF sequences could not be used), and (iii) they had homologs in the two Cyanothece species used as the outgroup (sp. 51142 and CCY0110, which are the two most closely related genomes available, based on 16S rRNA similarity). Because the C. watsonii strains are very closely related, nucleotide sequences were compared, rather than translated amino acid sequences. This allowed the analysis to take into account all possible sequence variation, including synonymous third position changes. The sequence IDs for the original 200 sequences (25 ORFs from each of eight genomes) that were concatenated are listed in Table S1 in the Supporting Information. Eight of the 150 Crocosphaera ORFs were split into two sequences, either because they were on two contigs or by an internal stop codon, which probably arose from sequencing error. For these sequences, the two sequences are listed together in a single cell, in the order in which they were aligned. The sequences were manually concatenated into a single sequence for each genome, and the resulting eight sequences were initially aligned using ClustalX v2.0.11 (Thompson et al. 1994, 1997, Larkin et al. 2007), followed by some manual curation and phylogenetic tree construction in MEGA4 (Tamura et al. 2007). The Neighbor-Joining method, with 1,000 bootstrap replicates, was used to construct the phylogeny (Felsenstein 1985, Saitou and Nei 1987). Evolutionary distances for the tree and distance matrix were calculated based on the same alignment, using the Jukes-Cantor method in MEGA4 (Jukes and Cantor 1969, Tamura et al. 2007). For both the phylogeny and distance matrix, all codon positions were included (1st, 2nd, 3rd, and noncoding), and positions containing alignment gaps and missing data were eliminated only in pair-wise sequence comparisons (pair-wise deletion option). There were 22,611 positions in the final dataset.
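For readers unfamiliar with the Jukes-Cantor correction, the distance between two aligned nucleotide sequences is d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ. The sketch below is a stand-alone illustration rather than MEGA4's implementation. It applies pair-wise deletion by skipping any position where either sequence has a gap or an ambiguous base, and the example sequences are hypothetical.

```python
import math

VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, using pair-wise deletion.

    Positions where either sequence has a gap or a non-ACGT character
    are skipped for this pair only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in VALID_BASES and b in VALID_BASES
    ]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned fragments, for illustration only.
a = "ACGT-ACGTA"
b = "ACGA-ACGTN"
p = p_distance(a, b)  # gap column and the N column are skipped
print(round(p, 3), round(jukes_cantor(p), 4))
```

For strains as similar as these, p is small and the corrected distance is only slightly larger than p itself, because few sites are expected to have changed more than once.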

Because of prior observations of differences between strains in photosynthetic efficiency (Sohm et al. 2011), per-cell N2-fixation rates (Webb et al. 2009), and phosphorus scavenging genes (Dyhrman and Haley 2006), ORFs with functions related to these processes were compared. All ORFs in each of the six C. watsonii genomes were compared to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and given KEGG orthology (KO) assignments using the single-directional best hit method to assign orthologs via the web interface of the KEGG Automatic Annotation Server (KAAS, http://www.genome.jp/tools/kaas/; Moriya et al. 2007). Using the KO description and/or the RAST annotated function, genes with roles in the structure or function of the two photosystems, N2-fixation, and phosphorus transport/metabolism were identified. The number of variants of each gene was totaled for each C. watsonii genome using the same criteria used to create the table of 11,635 sequences described above (>95% ID over >70% of the ORF).

The observation of at least one phenotype-specific variant of the isiA gene led to a more detailed analysis of those ORFs. The C. watsonii psbC and isiA genes were used as protein BLAST query sequences against the NCBI nr protein database, and the ten most similar sequences to each were retrieved, followed by removing redundant sequences. The resulting set of protein sequences were aligned along with the C. watsonii ORFs (58 sequences total) using the online Multiple Sequence Comparison by Log-Expectation (MUSCLE, http://www.ebi.ac.uk/Tools/msa/muscle/) tool with default parameters. A phylogenetic tree was generated from the resulting alignment using the UPGMA method with 500 bootstrap replicates in MEGA5 (Sneath and Sokal 1973, Felsenstein 1985, Tamura et al. 2011). Evolutionary distances, as the number of amino acid substitutions per site, were computed using the Poisson correction method (Zuckerkandl and Pauling 1965). All ambiguous positions were removed for each sequence pair, and there were a total of 776 positions in the final dataset. In the WH0402 genome, two of the isiA genes were split into two adjacent ORFs by a stop codon, with adjacent ORFs homologous to adjacent regions of full-length isiA sequences, suggesting that the stop codons may have arisen from sequencing errors. In both cases, the adjacent ORFs were merged prior to alignment, and are so noted on the phylogenetic tree. It should also be noted that WH0401 is not shown in Clade 1 because the ORF was truncated and could not be properly aligned. As such, it is possible the full-length gene is not present in the WH0401 genome, but given the location of the ORF at the end of a contig, it is more likely missing due to the multi-contig, draft status of the genome. The psbC clade was identified by examining the sequence alignment for the ~114 amino acid region between the 5th and 6th transmembrane domains of the protein that is known to be absent from IsiA proteins (Laudenbach and Straus 1988, Bricker 1990). The synteny of the genomic regions was examined using BLAST comparisons against WH0005 contig 0012, which was the longest contiguous sequence containing all isiA genes.

Prior to the sequencing of any Crocosphaera genome other than WH8501, fosmid libraries were constructed for multiple C. watsonii strains. The whole genome sequences, which were determined shortly afterward, made detailed analysis of those libraries redundant, so they are not included in this study. However, initial analyses of fosmid end sequences identified a gene in the library from a large-cell strain that was not present in the WH8501 genome. Three different variants of this gene, peptidoglycan-binding LysM:Peptidase M23B (referred to hereafter as lysM), were present in the WH8501 genome, and the fosmid end sequence was a fourth variant with <90% similarity to the other three variants. Based on that finding, a PCR assay was developed which provided support for the genomic comparisons. One reverse PCR primer was designed to a conserved region of the lysM gene (complementary to all four gene variants), and individual forward primers were designed for each variant using Primer 3 (Rozen and Skaletsky 1999). The sequences for all primers and PCR product sizes are given in Table S2 in the Supporting Information. PCR reactions were carried out in 50 µL reactions, containing 2 µL of template DNA from eight cultures (four large-cell types and four small-cell types). Final reaction concentrations of reagents were as follows: 1× PCR buffer, 2% DMSO, 0.2 mM each of dNTPs, 0.4 µM of each primer, and two units of Platinum Taq (Invitrogen, Life Technologies, Grand Island, NY, USA). Reactions underwent an initial heating step of 94°C for 90 s, then 30 cycles of 94°C for 30 s, 56°C for 60 s, and 72°C for 150 s, followed by a final extension step of 70°C for 5 min, and holding at 4°C. PCR reactions with bands of the expected size on an agarose gel were verified by subcloning and sequencing in the pGEM-T vector system (Promega, Madison, WI, USA) or by direct sequencing following reaction clean-up with QIAquick PCR Purification kits (Qiagen). Sanger sequencing reactions and electrophoresis were completed at the UC Berkeley sequencing center using a 3730 DNA Analyzer (Applied Biosystems, Foster City, CA, USA) according to manufacturer’s protocols. The PCR results are summarized in Table S3 in the Supporting Information, and for the six strains which now have genome sequence data, corresponding ORF IDs have been added for those with a positive PCR result.

RESULTS

Genome characteristics and sequence duplication within genomes. Genome sequence statistics were tabulated for all six strains, and are summarized in Table 2. The large-cell strains had genome sizes of nearly 6 Mb, while two of the small-cell strains had genomes closer to 4.5 Mb. WH8501 was the exception to this pattern, with the largest genome (6.2 Mb) of the six strains. The total number of ORFs per genome correlated closely with genome size, indicating similar average ORF sizes and similar coding percentages for all strains. The %G+C was nearly identical (37.6%–37.7%) for five of the strains, and the sixth, WH8501, was just below that at 37.1%. All six genomes contained a single rRNA operon, and these operons were nearly identical among the strains. Within the 869 bp region examined in a previous study (Webb et al. 2009), there were four positions with single nucleotide differences among the six genomes. Three of these were identified by Webb et al. (2009) at alignment positions 179, 324, and 794, and the fourth was at alignment position 514. The difference observed at position 514 was a change from an adenine in five of the genomes to a guanine in strain WH8502 (a strain which was not included in the Webb et al. (2009) study). The WH8501 genome contained a much larger number of transposase genes than any of the other five strains. Aside from WH8501, the large-cell strains had slightly higher genomic abundance of transposase genes than the small-cell strains. Two of the small-cell strains (WH8502 and WH0401) had the fewest (~100) strain-specific ORFs, while WH0402 had the most (over 300), and the remaining three strains (WH8501 and two large-cell strains) had ~200 strain-specific ORFs each.

The level of ORF duplication within genomes was assessed by clustering identical sequences into groups of repeated genes. For all genomes aside from that of WH8501, highly repeated sequences were not found, and most ORFs grouped with only one or two other sequences (Table 3). WH8501 was the only strain with repeat-groups of greater than 10 sequences, which ranged from 14 to 277. Twelve groups had more than 30 copies and six were very large groups of over 100 copies of a sequence in the genome. Most of the repeated sequences (10 of the 12 groups) were annotated as transposase genes, and two (the 124 copy and the 84 copy clusters) had unknown functions, but are likely to be some type of mobile element based on their very high copy number. In addition to more transposase genes and much more genomic sequence duplication, the transposase genes in the WH8501 genome also had a very different composition when assigned to IS families (Figure S2 in the Supporting Information). Transposase genes in the other five genomes mostly fell into the IS200/605 and IS607 families, with smaller numbers observed in the IS4 and IS91 families. In contrast, the most abundant IS families in the WH8501 genome were IS5, IS66, IS630, IS1380, and IS1634. The other five strains contained sequences in these families as well, but in very small numbers.

Shared and unique ORFs among the six genomes. The ORFs of each strain were compared to the other five strains using nucleotide BLAST. If a query sequence was >95% identical over at least 70% of the length of the ORF, it was considered to be present in the reference strain. A high percentage (78%–89%) of ORFs in each large-cell strain was found in the genomes of the other two large-cell strains, while a smaller percentage (62%–70%) was found in the genomes of the three small-cell strains (Fig. 1 and Figure S1). The WH8501 genome shared the most ORFs (79%) with the WH0401 genome, and fewer (67%–73%) with the other four strains. In contrast, a large percentage (78%–87%) of ORFs in the genomes of the other two small-cell strains (WH8502 and WH0401) was shared with all five other strains. Reciprocal genome comparisons did not yield the same percentages because of differences in the numbers of ORFs in each genome (Fig. 1 and Figure S1). For example, when the WH8502 ORFs were queried against the WH0003 genome, 80% of them were found. However, only 64% of the WH0003 ORFs were found in the WH8502 genome. This is not surprising given that

TABLE 3. Counts of repeated sequences in each Crocosphaera watsonii genome.

Genome   Number of repeats   Number of groups   Total
         in each group       in genome          sequences
WH0003   6                   1                  6
         5                   1                  5
         3                   8                  24
         2                   66                 132
WH0005   4                   2                  8
         2                   49                 98
WH0401   6                   1                  6
         5                   2                  10
         3                   3                  9
         2                   8                  16
WH0402   3                   1                  3
         2                   45                 90
WH8502   10                  1                  10
         3                   2                  6
         2                   21                 42
WH8501   277                 1                  277
         150                 1                  150
         139                 1                  139
         129                 1                  129
         124                 1                  124
         82                  1                  82
         68                  1                  68
         64                  1                  64
         50                  1                  50
         46                  1                  46
         32                  1                  32
         31                  1                  31
         17                  1                  17
         16                  1                  16
         15                  1                  15
         14                  1                  14
         10                  1                  10
         8                   1                  8
         7                   2                  14
         6                   2                  12
         4                   6                  24
         3                   6                  18
         2                   49                 98
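The three columns of Table 3 can be derived from a list of ORF sequences by counting exact duplicates. The following is a minimal sketch of that bookkeeping, assuming that "identical" means exact string identity; the function name and the toy sequences are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter

def repeat_group_table(orf_sequences):
    """Summarize exact-duplicate ORFs as rows of
    (repeats in each group, number of groups, total sequences)."""
    # Map each distinct sequence to its copy number in the genome.
    copies_per_sequence = Counter(orf_sequences)
    # Keep only sequences that occur more than once, then count
    # how many groups exist for each group size.
    groups_by_size = Counter(n for n in copies_per_sequence.values() if n > 1)
    return [(size, n_groups, size * n_groups)
            for size, n_groups in sorted(groups_by_size.items(), reverse=True)]

# Toy input (hypothetical sequences, not from any Crocosphaera genome).
toy_orfs = ["ATGAAA"] * 4 + ["ATGCCC"] * 2 + ["ATGGGG"] * 2 + ["ATGTTT"]
rows = repeat_group_table(toy_orfs)
for size, n_groups, total in rows:
    print(size, n_groups, total)
```

Single-copy ORFs are excluded from the summary, which mirrors the layout of Table 3, where the smallest group size listed is two.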


the WH0003 genome contains 6,145 ORFs, while the WH8502 genome contains only 4,965 ORFs (Table 2).
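The presence rule and the asymmetry of reciprocal comparisons can be made concrete with a short sketch. The function below encodes the stated thresholds (>95% identity over at least 70% of the ORF length); its name and arguments are hypothetical, and the ORF counts are back-calculated from the rounded percentages in the text, so they are approximate.

```python
def orf_is_present(percent_identity, aligned_length, orf_length,
                   min_identity=95.0, min_coverage=0.70):
    """Presence rule from the text: >95% identity over at least 70% of the ORF."""
    coverage = aligned_length / orf_length
    return percent_identity > min_identity and coverage >= min_coverage

# ORF totals from the text (Table 2) and the reported reciprocal percentages.
orfs_wh8502, orfs_wh0003 = 4965, 6145
found_8502_in_0003 = round(0.80 * orfs_wh8502)  # WH8502 ORFs found in WH0003
found_0003_in_8502 = round(0.64 * orfs_wh0003)  # WH0003 ORFs found in WH8502

print(found_8502_in_0003, found_0003_in_8502)
```

Under these assumptions the two directions recover a similar absolute number of shared ORFs (roughly 3,900 to 4,000), and the percentages differ mainly because the denominators differ.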

Based on the BLAST results of each genome against the other five, a single set of sequences was established that represented every ORF found in any of the six genomes. Using the same criteria described above (95% identity over at least 70% of the length of the ORF), the 34,455 ORFs from all six genomes were grouped into 11,635 sequences. The presence or absence of each of those sequences in the six genomes is shown in Figure 2. A total of 3,825 sequences were present in all six genomes, which represented approximately 60% of ORFs in the largest genomes, and up to 80% of the smallest genomes. The nucleotide percent identity for these 3,825 sequences averaged 99.8%. Each genome also contained between ~100 and ~300 sequences that were strain-specific (i.e., absent from all other strains). See Table S4 in the Supporting Information for ORF IDs and functions of all strain-specific ORFs. For sequences found in exactly three strains, the largest category contained sequences that were present in only the three large-cell type strains (781 sequences), and the second largest was sequences found in the three small-cell strains (153 sequences; Fig. 2B). The remaining 909 sequences present in three strains fell into 18 different categories with 10–93 sequences in each (see Table S5 in the Supporting Information). The categories with sequences present in exactly two strains showed a similarly skewed pattern, with the largest numbers of sequences in categories segregated by phenotype (Fig. 2C and Table S5). There were 15 two-strain categories containing a total of 1,727 sequences. Of those, the six phenotype-specific categories contained over 1,250 sequences (751 in only large-cell strains, and 502 in only small-cell strains, or an average of over 200 sequences per category), with only 474 sequences total in the remaining nine categories (average of 53 sequences per category). Overall, 3,825 sequences were shared among all six strains, 2,237 sequences were found exclusively in large-cell strains (in one, two, or three genomes), and 1,121 were found only in small-cell strains, with the remaining 4,452 present in at least one strain of each phenotype.

The observed tendency of sequences to segregate by phenotype was tested for statistical significance using the chi-squared goodness-of-fit test. Sequences that were shared between at least two genomes, but not present in all genomes, were included in the analysis. After binning by the number of genomes in which the sequence was found, the observed number of sequences in each pattern-group (i.e., category) was compared to the value expected if all categories were equally populated, and the differences between the observed and expected values were used in the statistical tests (Fig. 3). In all cases, the deviation from expected values was statistically significant (P < 10⁻¹⁵). The largest excesses of observed over expected values were generally in categories of a single phenotype, particularly in sequences found in the three large-cell genomes. Among sequences present in four genomes, there were four categories with many more sequences than the expected values (Fig. 3). Two of these were categories where a sequence was present in all of the small-cell genomes, plus one large-cell genome, and two categories were equally split between phenotypes (i.e., two large cell and two small cell). The categories in the five-genome bin showed the least deviation from expected values. Among those, the smallest category contained sequences present in the five genomes not including WH0003, which had an observed value well below the expected value. In categories where the observed values were far below the expected values, there was no consistent pattern of genomes or phenotypes that were included or excluded (Fig. 3). A similar

FIG. 1. Percentage of ORFs shared between Crocosphaera watsonii strains. ORFs for each genome were used as query sequences against the other five genomes in nucleotide BLAST searches. Alignments with >95% identity over at least 70% of the ORF were totaled and plotted as a percent of the total number of ORFs in the query genome. Small-cell strains are represented by red shades, and large-cell strains by shades of blue.


statistical analysis was conducted on counts of sequences in categories that were further binned based on strain phenotypes. In that analysis, the expected value for each bin was the expected value for a single category (as described above) multiplied by the number of categories in the bin. For sequences found in two or three genomes, all bins that included only a single phenotype had observed values much above the expected values, and in mixed phenotype bins the observed values were well below expected values (Figure S3 in the Supporting Information). As would be expected from differences in observed and expected values, the highest contribution to the chi-squared statistic was from the bin of sequences found only in the three large-cell genomes, followed by sequences found in only two large-cell genomes.

In a previous study of the WH0003 genome, a 25 kb region was identified as likely to be critical to EPS production because it contained a number of EPS-biosynthesis genes and was present in the WH0003 genome, but absent from the WH8501 genome (Bench et al. 2011). Not surprisingly, most of the ORFs in this region were also absent from the other two small-cell strains, and were present in all large-cell strains (Table 4). Furthermore, in the large-cell

[Figure 2 image. Panel A: presence/absence map with one column per strain (large-cell strains WH0003, WH0005, WH0402; small-cell strains WH0401, WH8501, WH8502) and rows grouped by the number of strains (6, 5, 4, 3, 2, or 1) in which each sequence is present; 3,825 sequences are present in all six strains. Panel B, sequences present in three strains (1,843): large-cell strains only, 781; small-cell strains only, 153; both large- and small-cell strains, 909 (average 50 per group). Panel C, sequences present in two strains (1,727): large-cell strains only, 751; small-cell strains only, 502; large- and small-cell strains, 474.]

FIG. 2. Presence/absence of all ORFs in six Crocosphaera watsonii genomes. Presence (gray) or absence (black) of 11,635 sequences that represent all ORFs in the genomes of six strains (A). Each strain is represented by the column above the strain names, and each row represents one sequence. Rows are grouped by the number of strains in which the sequence is found, and the total number of sequences in each category is listed on the left. The dendrogram above the columns is based on the presence/absence pattern for all 11,635 rows. Zoomed-in views of the sequences found in three strains (B) and two strains (C) are shown with totals for subcategories listed on the left.
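The category counts quoted for the two- and three-strain groups follow from simple combinatorics, which can be checked with a short sketch. The strain-to-phenotype assignment below is taken from the text; the helper name and the labels "large-only", "small-only", and "mixed" are hypothetical conveniences.

```python
from itertools import combinations
from collections import Counter

LARGE = ("WH0003", "WH0005", "WH0402")
SMALL = ("WH0401", "WH8501", "WH8502")
STRAINS = LARGE + SMALL

def phenotype_class(strain_set):
    """Label a presence pattern by the phenotype(s) of the strains carrying the sequence."""
    has_large = any(s in LARGE for s in strain_set)
    has_small = any(s in SMALL for s in strain_set)
    if has_large and has_small:
        return "mixed"
    return "large-only" if has_large else "small-only"

# Number of distinct presence patterns (categories) for sequences
# found in exactly k of the six strains, split by phenotype class.
categories = {k: Counter(phenotype_class(c) for c in combinations(STRAINS, k))
              for k in range(1, 7)}

# Categories for sequences shared by 2 to 5 genomes (the set used in Fig. 3).
n_tested_categories = sum(sum(categories[k].values()) for k in range(2, 6))

# Sequence totals quoted in the text for the two- and three-strain groups.
two_strain_total = 751 + 502 + 474
three_strain_total = 781 + 153 + 909
print(categories[2], categories[3], n_tested_categories)
```

The enumeration gives 15 two-strain categories (six phenotype-specific, nine mixed) and 20 three-strain categories (two phenotype-specific, 18 mixed), in agreement with the text. The 56 categories spanning two to five genomes are also consistent with the 55 degrees of freedom reported in the Figure 3 caption.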


strains, all of the genes were 100% identical at the nucleotide level over the full lengths of the ORFs.

Phylogenetic analysis and comparison of metabolic capabilities. The evolutionary relationships of the six C. watsonii strains and two closely related Cyanothece species were examined using an alignment of 25 functionally unrelated genes and the corresponding phylogenetic tree and distance matrix. As expected, the two Cyanothece species clustered together as an outgroup to the six Crocosphaera strains (Fig. 4). The six Crocosphaera strains clustered into two subclades with the three large-cell strains in one clade, and the three small-cell strains in the second clade. Over the entire 22 kb alignment, the distances between the Crocosphaera strains were very small (Table S6 in the Supporting Information). Within each of the phenotype subclades, distances ranged from 0 to 0.009 substitutions per site, and between the two clades distances were between 0.024 and 0.028. The distance between Crocosphaera strains and Cyanothece sp. was ~0.16 substitutions per site.
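To illustrate what a distance in substitutions per site measures, the sketch below computes the uncorrected proportion of differing sites (p-distance) between aligned sequences. This is an assumption made for illustration only: the substitution model behind the distance matrix is not stated in this section, and the function name and toy sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences,
    skipping alignment columns that contain a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

# Toy 20-site alignment (hypothetical sequences).
toy_alignment = {
    "strain_1": "ATGGCTAAAGGTCTGACCGA",
    "strain_2": "ATGGCTAAAGGTCTGACCGA",  # identical to strain_1
    "strain_3": "ATGGCTAAGGGTCTGACCGA",  # one substitution relative to strain_1
}
d_12 = p_distance(toy_alignment["strain_1"], toy_alignment["strain_2"])
d_13 = p_distance(toy_alignment["strain_1"], toy_alignment["strain_3"])
print(d_12, d_13)
```

On a multi-kilobase alignment, within-clade values of 0 to 0.009 would correspond to very few differing sites per kilobase, which is the sense in which the strains sharing a phenotype are nearly indistinguishable at these loci.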

In addition to the genes coding for EPS biosynthesis described above, the six C. watsonii genomes were explored for the presence and variants of genes involved in N2-fixation, iron and phosphorus scavenging and metabolism, and photosynthesis. All genes related to N2-fixation that were examined (nifB, D, E, H, K, N, T, U, V, V, W, X, and Z, and glnB) were present in a single copy in all six genomes. Phosphorus scavenging and metabolism genes were less uniformly found in the genomes (Table 5). The pst genes were often present in multiple copies, with a range of copy numbers per genome that did not appear to correlate with phenotype, except for the presence of more pstS copies in the large-cell strains. There was a distinction between the phenotypes in the various forms of alkaline phosphatase, with phoD present in only large-cell strains, and phoA found exclusively in small-cell strains (Table 5). A single gene copy was found in all genomes for most of the other phosphorus-related genes. Finally, the total number of phosphorus-related genes examined was higher in the large-cell strains (30–32) than in the small-cell strains (19–25).

Among iron-related genes, many did not vary in copy number among genomes (e.g., fur, tonB, and exbB/D), while others showed different patterns of copy numbers among strains (Table S7 in the Supporting Information). Some were variable, but copy numbers did not correlate with phenotype or strain origins, while others had higher copy numbers in the large-cell strains, including isiA and feoB. Similar to the phosphorus-related genes, but with a smaller difference, the total number of iron-related genes was higher in large-cell strains (29–33) than in small-cell strains (25–28). Photosystem genes showed less variability among strains than phosphorus- and iron-related genes. With very few exceptions, photosystem I (PSI) genes were present as a single copy in all six genomes, although a few ORFs were split by a stop codon (likely sequencing error) or across two contigs (Table S8 in the Supporting Information). Similarly, most photosystem

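[Figure 3 image: bar chart of the number of sequences (y-axis, 0 to 900) in each presence/absence category, with observed counts shown as bars and expected counts as lines. Panels: present in 2 genomes, present in 3 genomes, present in 4 genomes, and in 5 genomes. Boxes below the bars give the presence/absence pattern across the small-cell strains (WH8501, WH8502, WH0401) and the large-cell strains (WH0003, WH0005, WH0402).]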

FIG. 3. Counts of ORFs found in genomes of 2, 3, 4, and 5 Crocosphaera watsonii strains. Observed counts of ORFs for each category, indicated by bars, were compared to the expected count for each category, indicated by black lines. Categories were binned by the number of genomes in which a sequence was present, and expected counts were calculated by assuming all categories were equally likely in each bin. The 6-genome presence/absence pattern for each category is indicated by the boxes below the bars (black = sequence is absent for that genome, colored = present). The chi-squared goodness-of-fit test indicated a statistically significant difference between observed and expected counts for all categories (n = 6640, df = 55, χ² = 8469, P < 10⁻¹⁵).
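The goodness-of-fit calculation can be sketched for the phenotype-binned version of the two-genome group, using only totals quoted in the text (751, 502, and 474 sequences across 15 categories). This is an illustration under the stated equal-likelihood assumption, not a reproduction of the reported statistic, which was computed over all 56 categories; the variable names are hypothetical.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Sequences present in exactly two genomes (totals from the text).
total_two_strain = 1727
n_categories = 15
expected_per_category = total_two_strain / n_categories

# Phenotype bins: (observed count, number of categories pooled into the bin).
bins = {
    "large-cell only": (751, 3),
    "small-cell only": (502, 3),
    "mixed": (474, 9),
}
observed = [obs for obs, _ in bins.values()]
# Expected value for a bin = single-category expectation x categories in the bin.
expected = [expected_per_category * n for _, n in bins.values()]
statistic = chi_squared(observed, expected)

for name, o, e in zip(bins, observed, expected):
    print(f"{name}: observed={o}, expected={e:.1f}")
print(f"chi-squared = {statistic:.1f} with {len(bins) - 1} degrees of freedom")
```

Both single-phenotype bins exceed their expected counts and the mixed bin falls well below it, which is the pattern described for Figure S3.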


II (PSII) genes were also present in a single copy in the six genomes. Exceptions included psb28, which was present in two variants in all six genomes; psbD, for which a second variant was found in two genomes; and psbA, for which the number of variants differed among genomes. A sequence- and alignment-based examination of the psbA ORFs revealed the presence of two isoforms in the genomes. Most of the Crocosphaera predicted PsbA proteins were the D1:1 isoform, based on the presence of a glutamine in the D1:1/D1:2 determinant position (Garczarek et al. 2008). There was one complete copy of the D1:1 gene in the WH8501 strain, two complete copies in WH8502, one complete and one partial copy in WH0005, two partial copies that appear to be one gene (based on start/stop) in WH0401, and multiple partial genes in the WH0003 genome. Because of the multi-contig nature of the genome and the observation that most partial ORF sequences are at the ends of contigs, it is difficult to know how many full-length genes are present in the WH0003 genome. The D1:1 isoform was not found in the WH0402 genome; however, all six genomes contained exactly one copy of a more divergent isoform with a leucine at the D1:1/D1:2 determinant position along with a variety of other amino acid changes throughout the protein.

Phylogenetic analysis of the C. watsonii isiA genes and a closely related psbC gene revealed four variants of isiA in the C. watsonii genomes, three of which correlated strongly with phenotype. The psbC

FIG. 4. Phylogenetic relationship of six Crocosphaera watsonii strains and two Cyanothece species, based on 25 genes. Evolutionary relationships were inferred from a 25 kb alignment of 25 concatenated genes using the Neighbor-Joining method. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1,000 replicates) is shown next to the branches. The optimal tree, with the sum of branch length = 0.24091980, is drawn to scale, with branch lengths in units of base substitutions per site.

TABLE 4. Presence/absence of genes in putative EPS-critical region.

Global %ID
WH0003  WH0005  WH0402  WH0401  WH8501  WH8502  WH0003 ORF ID      Annotated function
1       1       1       0       0.94    1       CWATWH0003_3496    Hypothetical protein, similar to glycosyl transferase
1       1       1       0       0       1       CWATWH0003_3497    Short-chain dehydrogenase/reductase SDR
1       1       1       0       0       0.98    CWATWH0003_3498    Hypothetical protein
1       1       1       0       0       1       CWATWH0003_3499    Sugar transferase involved in lipopolysaccharide synthesis
1       1       1       0       0       1       CWATWH0003_3500    Pyruvate dehydrogenase (lipoamide)
1       1       1       0       0       0       CWATWH0003_3501    Pyruvate dehydrogenase (lipoamide)
1       1       1       0       0       0       CWATWH0003_3502    Putative aldo/keto reductase
1       1       1       0       0       0       CWATWH0003_3503    Macrocin-O-methyltransferase
1       1       1       0       0       0       CWATWH0003_3504    Glycosyl transferase, group 1
1       1       1       0       0       0       CWATWH0003_3505    WblG protein
1       1       1       0       0       0       CWATWH0003_3506    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3507    O-antigen translocase (like Wzx)
1       1       1       0       0       0       CWATWH0003_3508    DegT/DnrJ/EryC1/StrS aminotransferase family protein
1       1       1       0       0       0       CWATWH0003_3509    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3510    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3511    Hypothetical protein
1       1       1       0       0       0       CWATWH0003_3512    Acetyltransferase, putative
1       1       1       0       0       0       CWATWH0003_3513    Oxidoreductase domain protein
1       1       1       0       0       0       CWATWH0003_3514    Putative UDP-N-acetyl-D-mannosamine 6-dehydrogenase
1       1       1       0       0       0       CWATWH0003_3515    Polysaccharide biosynthesis protein CapD
1       1       1       0       0       0       CWATWH0003_3516    Polysaccharide export protein (like Wza)
1       1       1       0       0       0       CWATWH0003_3517    Uncharacterized protein involved in exopolysaccharide biosynthesis (like Wzc)
1       1       1       0       0       0       CWATWH0003_3518    Hypothetical protein
1       1       1       1       0.99    0       CWATWH0003_3519    Animal hem peroxidase
1       1       1       0       0       0       CWATWH0003_3520a1  Transposase, IS200/IS605 family


gene and one of the isiA genes (Clade 1 in Figure S4 in the Supporting Information) were present in all six strains, and both of those genes were most similar to those of closely related cyanobacteria (e.g., Cyanothece). The other three isiA variants (Clades 3, 4, and 5 in Figure S4) were most similar to Trichodesmium erythraeum sequences, one of which (Clade 4) had no homologs in any other organism. The sequences in Clade 5a were divergent enough from the 5b sequences that they did not form a single clade in the phylogenetic tree. However, manual inspection of the sequence alignment revealed that both were substantially longer than ORFs in the other four clades due to the presence of a PsaL domain at the C-terminus of the amino acid sequence. As such, they were labeled with the same clade number to indicate their apparent functional similarity. In addition, the three Trichodesmium-like isiA variants were found almost exclusively in the large-cell strains, and were immediately adjacent to each other in those genomes, with a flavodoxin gene immediately upstream (Fig. 5). Aligning the C. watsonii genomic contigs with the T. erythraeum genome illustrated that synteny for the three isiA genes was conserved, but the T. erythraeum flavodoxin gene was in a distant genomic region. Additionally, the genomes of

FIG. 5. Illustration of alignment of genome regions containing iron-related genes. Genomic regions were identified and aligned using BLAST comparisons to the longest contiguous sequence containing all isiA genes (WH0005 contig0012). Ends of contigs are shown by straight edges, and wavy edges indicate the contig continues but without similarity to the aligned region shown. If no contig is shown for a portion of the alignment, there are no similar sequences to that region anywhere in the genome of that strain. Contig IDs are listed in white text for all Crocosphaera species and the GenBank locus tags are shown in gray text for the Trichodesmium genomic regions.

TABLE 5. Counts of phosphorus-related genes in each Crocosphaera watsonii genome.

Gene   KEGG Orthology  Function                                      WH0003  WH0005  WH0402  WH0401  WH8501  WH8502
       number
phnD   K02044          Phosphonate binding                           2       2       2       1       2       2
phoB   K07657          Phosphate regulon transcriptional regulator   1       1       1       1       1       1
phoH   K06217          Phosphate starvation-inducible protein        1       1       1       1       1       1
phoR   K07636          Phosphate regulon sensor                      1       1       1       1       1       2
phoU   K02039          Phosphate transport system regulator          1       1       1       1       1       1
pstA   K02038          Phosphate transport system permease           3       3       3       1       3       1
pstB   K02036          ATP-binding phosphate transport               3       3       4       3       3       3
pstC   K02037          Periplasmic phosphate-binding ABC-transporter 2       2       2       1       1       2
pstS   K02040          High-affinity phosphate-binding               6       5       7       3       3       4
phoD   K01113          Phosphodiesterase/alkaline phosphatase D      2       2       2       0       0       0
phoA                   Alkaline phosphatase                          0       0       0       1       1       1
                       Alkaline phosphatase (non-phoD & non-phoA)    2       4       2       0       0       1
dedA                   Alkaline phosphatase-like                     1       1       1       1       1       1
pitA   K03306          Inorganic phosphate transporter               1       1       1       1       1       1
ppa    K01507          Inorganic pyrophosphatase                     2       1       2       1       1       2
ppk    K00937          Polyphosphate kinase                          1       1       1       1       1       1
ppx    K01524          Exopolyphosphatase                            1       1       1       1       1       1
Total                                                                30      30      32      19      22      25
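The column totals in Table 5, and the pst copy-number pattern discussed in the text, can be re-derived from the per-gene counts. The sketch below transcribes the table into a dictionary; the key "other AP" is a hypothetical label for the unnamed alkaline phosphatase row (non-phoD and non-phoA).

```python
STRAINS = ["WH0003", "WH0005", "WH0402", "WH0401", "WH8501", "WH8502"]

# Gene copy numbers per strain, in the column order of Table 5.
COUNTS = {
    "phnD": [2, 2, 2, 1, 2, 2],
    "phoB": [1, 1, 1, 1, 1, 1],
    "phoH": [1, 1, 1, 1, 1, 1],
    "phoR": [1, 1, 1, 1, 1, 2],
    "phoU": [1, 1, 1, 1, 1, 1],
    "pstA": [3, 3, 3, 1, 3, 1],
    "pstB": [3, 3, 4, 3, 3, 3],
    "pstC": [2, 2, 2, 1, 1, 2],
    "pstS": [6, 5, 7, 3, 3, 4],
    "phoD": [2, 2, 2, 0, 0, 0],
    "phoA": [0, 0, 0, 1, 1, 1],
    "other AP": [2, 4, 2, 0, 0, 1],  # hypothetical label for the unnamed row
    "dedA": [1, 1, 1, 1, 1, 1],
    "pitA": [1, 1, 1, 1, 1, 1],
    "ppa": [2, 1, 2, 1, 1, 2],
    "ppk": [1, 1, 1, 1, 1, 1],
    "ppx": [1, 1, 1, 1, 1, 1],
}

# Column sums across all genes, and across the pst transport genes only.
totals = [sum(col) for col in zip(*COUNTS.values())]
pst_totals = [sum(col) for col in zip(*(COUNTS[g] for g in ("pstA", "pstB", "pstC", "pstS")))]

for strain, total, pst in zip(STRAINS, totals, pst_totals):
    print(f"{strain}: {total} phosphorus-related genes, {pst} pst genes")
```

The recomputed totals match the final row of the table, and the pst subtotal is higher in each large-cell strain (first three columns) than in any small-cell strain.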


the small-cell strains contained only fragments of this region, and in the WH8501 genome, these fragments were located on much larger contigs adjacent to sequences that were not similar to the aligned region (Fig. 5).

DISCUSSION

Broad genome observations and transposase abundances. The C. watsonii WH8501 genome appears to be unusual among this group of Crocosphaera genomes in a number of respects, including genome size and transposase abundance (Table 2), IS family distribution (Figure S2), and repeated genomic sequences (Table 3). There is a possibility that some differences stem from the fact that the genomic DNA preparation and sequencing methods for WH8501 were different from the other five strains (i.e., no cell sorting, and the use of Sanger sequencing rather than pyrosequencing). However, it is not clear how such methodological differences would have led to the observed genomic differences. PCR experiments targeting 16 separate loci have been designed from the WH8501 genomic sequence, and none has produced unexpected results. Four of those loci are described in this study (Tables S2 and S3), four were developed for another project by the authors of this study (data not shown), three were described in Dyhrman and Haley (2006), and five were described in Zehr et al. (2007). Furthermore, a whole genome microarray was designed based on the WH8501 genome sequence, and subsequent experiments using cultured cells have not shown any systematic problems that might call into question the validity of the genome sequence (e.g., Shi et al. 2010). Transposase content (total length of 1,211 transposase ORFs = 919,337 bp) accounts for most of the extra ~1.5 Mb in the WH8501 genome compared to the other two small-cell strains. Of the six draft genomes, the WH0402 genome has the largest number of contigs, the smallest average contig length, and the highest ratio of ORFs to genome size. As such, ORFs were more likely to be split between two contigs, and if both fragments were annotated, the gene function would appear duplicated, with a “copy” on each contig. This was seen in the genome counts of iron-related and photosystem genes (Tables S7 and S8), where WH0402 had multiple genes that were separated into two ORFs, both shorter than expected, and annotated with the same function. This type of ORF-splitting may partly explain why the WH0402 genome has the highest number and percentage of strain-specific ORFs (Table 2).
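The claim that transposase content accounts for most of the extra sequence in WH8501 can be checked with a few lines of arithmetic. The inputs are the figures quoted above; the ~1.5 Mb value is approximate in the source, so the resulting fraction is only indicative.

```python
# Figures quoted in the text for the WH8501 genome.
n_transposase_orfs = 1211
transposase_bp = 919_337
extra_genome_bp = 1_500_000  # "~1.5 Mb" relative to the other small-cell strains (approximate)

# Average transposase ORF length and the share of the extra sequence it covers.
mean_orf_bp = transposase_bp / n_transposase_orfs
fraction_of_extra = transposase_bp / extra_genome_bp

print(f"mean transposase ORF length: {mean_orf_bp:.0f} bp")
print(f"share of the extra sequence: {fraction_of_extra:.0%}")
```

Under these figures the transposase ORFs average roughly 760 bp and make up about three-fifths of the additional sequence, which is consistent with "most of" but leaves a remainder that transposase ORFs alone do not cover.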

The extremely large numbers of transposase genes in the WH8501 genome are mostly shared with at least one other strain, with only 71 being strain-specific (Table 2). It is not clear why the high level of gene duplication observed in WH8501 transposase genes was not observed in any of the other genomes. The relative abundances of IS families among the WH8501 strain-specific transposases are not proportional to abundances in the whole genome (Figure S2). Three of the four IS families that are most abundant in the non-WH8501 strains have the unusual property of lacking associated inverted repeat sequences, while the most abundant families in WH8501 are more typical insertion sequences with associated inverted repeats (Siguier et al. 2006). The lack of inverted repeats may partly explain why the insertion sequences in five of the genomes are not highly replicated. IS elements are known to confer adaptive advantages such as acquisition of new metabolic capabilities and increased genomic plasticity via genomic insertions, deletions, and homologous recombination between multi-copy elements (Chandler and Mahillon 2002, Lysnyansky et al. 2009). As such, the larger number of IS elements in the WH8501 genome, particularly those present in many identical copies, could provide that strain with an increased ability to adapt to environmental changes. Future work could test this idea through competitive growth experiments of multiple strains conducted under changing physical and/or chemical conditions.

Sequence conservation and shared versus specific genes among genomes. The very high nucleotide percent identity (99.8%) for sequences shared among all six Crocosphaera strains illustrates that there has been very little mutation accumulation since strain divergence. This level of identity is much higher than what has been observed in sympatric cyanobacteria, even among very closely related strains (Rocap et al. 2003, Rusch et al. 2007, Dufresne et al. 2008, Scanlan et al. 2009). There are a few single nucleotide differences among C. watsonii 16S rRNA sequences, yet their genomic nucleotide identity is higher than the ~97% average genomic nucleotide identity in Baltic Sea heterotrophic Shewanella species with identical 16S rRNA sequences (Caro-Quintero et al. 2011). The closest example to the high level of identity in Crocosphaera is a study of two populations of marine Vibrio species with >99% average identity in amino acid sequences (Shapiro et al. 2012). However, those Vibrio populations were recently diverged, and also had an entire second genomic chromosome that differed between populations. Species from non-marine environments also appear to accumulate sequence mutations more rapidly than Crocosphaera. For example, Leptospirillum that evolved in an acid mine drainage system over a period of only 9 years accumulated genome-wide sequence differences of 6% between strains (Denef and Banfield 2012). The mechanism by which Crocosphaera prevents the accumulation of DNA mutations remains unclear. No notable genes or unusual sequences were found during examinations of known DNA replication and repair proteins in the genomes of the six strains examined in this study. Future work is clearly needed to address this


question. Directed mutation studies would be ideal to help identify relevant genes, but no strains of Crocosphaera are known to be genetically transformable. A less efficient strategy could be to use chemical mutagens to generate strains with higher mutation rates, whose genomes could then be examined for altered or missing genes related to DNA replication or repair.

Despite sequence conservation in shared sequences, genome size and reciprocal genomic comparisons show that the larger genomes of the large-cell strains contain functions that are missing from the small-cell strains (Table 2; Fig. 1 and Figure S1). Some evidence suggests that the large-cell specific functions may have been lost from the small-cell strains since divergence (Fig. 5). However, it is also possible that some functions have been acquired after phenotypic divergence. There are roughly 1,000 more ORFs in the large-cell genomes than in two of the small-cell strains, and the large-cell genomes are more similar to each other than to the small-cell genomes (Figs. 1 and 4). Because the large-cell specific ORFs have no similarity to sequences in the small-cell strains, gene duplication cannot explain their larger genome sizes. Furthermore, replicated sequences could partly explain the larger genome of WH8501, but cannot explain the larger genomes of the large-cell strains that do not have genomic sequence duplication (Table 3).

The presence/absence patterns of the ORFs in the six genomes further show that the large-cell strains harbor functions that are missing from the other three strains. There were nearly 800 sequences shared among all three large-cell strains, and absent from all small-cell strains (Fig. 2B). Of those sequences, 65% (501) were hypothetical or unknown, and only 3% (25) were transposases. The remaining 246 sequences were annotated with a wide variety of functions, including a number with functions related to DNA metabolism and modification, such as single-stranded DNA binding proteins, and DNA polymerases and primases (functions listed in Table S9 in the Supporting Information). In addition, large-cell-specific sequences also include the EPS-biosynthesis pathway genes identified in the previous genome comparison of two strains (Bench et al. 2011). In fact, 17 of the 23 genes identified in a large deletion from the WH8501 genome are also absent from the other two small-cell strains, but present in all three strains characterized by abundant EPS production (Table 4).

Genomic and metabolic differences between phenotypes. The dendrogram based on the presence/absence patterns of all 11,635 sequences (top of Fig. 2A), and the fraction of shared genes between strains (Fig. 1), both demonstrate that the C. watsonii strains which share a phenotype are more closely related than those that share a similar cultivation origin (ocean basin, or year of isolation, see Table 1). The three large-cell strains share many more ORFs among their group than they do with any of the small-cell strains (Fig. 1 and Figure S1). In addition, the categories with sequences found exclusively in large-cell strains were the most over-represented and contributed most to the statistical significance of differences between observed and expected counts of sequences (Fig. 3 and Figure S3), even though those strains were isolated from different ocean basins (Table 1). Similarly, the three small-cell strains cluster together, yet WH0401 was isolated from the North Atlantic 20 years after WH8501 and WH8502 were isolated from the South Atlantic. The clustering of the two strains isolated in 2000 (WH0003 and WH0005), and in 1984 (WH8501 and WH8502), is likely a result of their shared phenotype, rather than shared isolation history. While additional genomic sequence for a co-isolated strain of the opposite phenotype would be required to completely verify that theory, a PCR-based analysis of four genes showed consistent genetic differences between phenotypes even in co-isolated strains. Specifically, a small-cell strain (WH0004) isolated in the same year, season, and region as WH0003 and WH0005 (two large-cell strains) showed the expected pattern for small-cell strains for the LysM genes (Table S3). Further evidence of phenotypic clustering is provided by the phylogenetic tree in Figure 4, where C. watsonii strains cluster by phenotype with 100% support, and by the resulting distance matrix in Table S6, which shows almost zero evolutionary distance among strains of the same phenotype, and larger distances between strains of opposite phenotypes.

While many genes and functions were redundant within the genomes of the six strains, no redundancy was observed in genes critical for N2-fixation or most of the photosystem genes (Table S8). In contrast, a number of iron- and phosphorus-related genes were present in multiple copies in the genomes, with some copies specific to a phenotype (Table 5 and Table S7). For example, the observation that large-cell genomes contained more copies of genes in the high-affinity phosphate transport system (i.e., the pstSCAB operon, which is known to be upregulated during phosphorus starvation (Dyhrman and Haley 2006)) indicates that large-cell strains may have larger phosphorus requirements and/or may be better adapted to low phosphate conditions. It is also interesting that the genomes of the two phenotypes contain different variants of alkaline phosphatase (Table 5), which may differentiate the phosphorylated substrates utilized by each phenotype. The higher number of iron metabolism genes in the large-cell strains may signify that those strains also are more capable of thriving under low iron conditions. Studies that have directly examined the response of cultivated C. watsonii to changes in Fe and P have observed dramatic diel recycling of iron metalloproteins (e.g., photosynthesis and


N2-fixation proteins) as well as changes in growth,gene expression, and nitrogen fixation rates (Webbet al. 2001, Tuit et al. 2004, Falcon et al. 2005, Dy-hrman and Haley 2006, Fu et al. 2008, Compaor�eand Stal 2010, Shi et al. 2010, Saito et al. 2011).However, those studies were carried out almostexclusively on the C. watsonii WH8501 strain, sofuture experiments with additional strains will beneeded to verify whether the phenotypes areadapted differently to low or changing nutrient lev-els. In addition, metatranscriptomic or metaproteo-mic studies could be used to examine theexpression of phenotype-specific genes in naturalpopulations.Evidence of genome evolution in photosynthesis gene

clusters. Phylogenetic analysis of isiA and psbC genesproduced three distinct groups of sequences. ThepsbC sequences formed a clade (Figure S4, Clade 2in gray box), with 100% bootstrap support andshorter branch lengths than the isiA clades, indicat-ing less sequence divergence in psbC. This is notsurprising because PsbC, a chlorophyll binding pro-tein, is a critical component of PSII under consider-able selective pressure (Chisholm and Williams1988, Ananyev et al. 2005). The iron starvation-induced chlorophyll binding protein (IsiA), a clo-sely related homolog of PsbC, has at least threeknown functions: (i) chlorophyll storage duringiron-limited conditions, (ii) dissipation of light-exci-tation energy, and (iii) a light antennae in PSI(Sandstr€om et al. 2001, Singh and Sherman 2007,Chauhan et al. 2011). There are intriguing differ-ences in the presence of isiA genes in the sixC. watsonii genomes, with only one of the isiA vari-ants being found in all six strains (Clade 1 in Fig-ure S4). All of the C. watsonii (and TrichodesmiumIMS101) isiA genes in Clade 1, including the trun-cated WH0401 sequence, are adjacent to a down-stream flavodoxin (fldA or isiB) ORF that isidentical among the six strains (100% nucleotideidentity). Flavodoxin is an iron-free replacement forferredoxin, the iron–sulfur electron transfer proteinimportant in N2-fixation and CO2 fixation, and as inthe C. watsonii genomes, isiB is commonly found ina single operon with isiA (Singh and Sherman 2007,Chauhan et al. 2011). As such, this cluster of genesis likely to be important for Crocosphaera duringperiods of iron limitation, which are common inoligotrophic habitats.

The isiA genes in Clades 3, 4, and 5 appear to have a different evolutionary history than the Clade 1 genes (Figure S4). This is illustrated by the relatively long branch length between Clade 1 and the rest of the isiA genes, as well as the observation that only Clade 1 isiA ORFs are found in the species most closely related to C. watsonii by nifH and 16S rRNA phylogenies (e.g., Cyanothece spp.), while the other three variants are only found in Trichodesmium and more distantly related cyanobacteria. Furthermore, the Clade 1 isiA genes were found in a different genomic location than the three other variants (referred to hereafter using the clade designations from Figures 5 and S4; C3, C4, and C5), which are directly adjacent to each other in the genomes of the three large-cell strains and immediately downstream (Fig. 5) of a second flavodoxin gene (distinct from the isiB genes found adjacent to the Clade 1 isiA genes). Additionally, despite amino acid sequence divergence of over 20% between the two species, Trichodesmium has conserved synteny for the three adjacent isiA ORFs, illustrating that gene order has been maintained since they diverged from their last common ancestor. The genome of the more distantly related cyanobacterium Anabaena PCC7120 also contains two adjacent genes (locus tags all4002 and all4003) with similarity to two of the adjacent Crocosphaera isiA-like ORFs (C5 and C3, respectively). Synteny does not extend beyond those two genes, but it is interesting to note that in PCC7120 these ORFs are part of a cluster of four adjacent isiA-like ORFs, although neither of the other two ORFs is more similar to the Crocosphaera C4 isiA ORFs. The PCC7120 genome also contains an ORF similar to the Crocosphaera isiB ORF shown in Figure 5, but that ORF is located in a separate genomic location, as was observed in the Trichodesmium genome. Given the apparent ancient evolutionary origin of the Crocosphaera isiA gene cluster, it is surprising that the small-cell strains are missing most of these genes. However, the pattern of remaining genomic fragments in the small-cell genomes (Fig. 5) demonstrates a generalized loss of genetic material suggested by the smaller genome sizes of the small-cell strains and by the significant number of genes shared among large-cell strains, but absent from small-cell strains.
Evidence for a mechanism leading to such genetic loss is provided by the WH8501 genome, where the isiA C4 gene is present, but the isiA C3 ORF is truncated by the presence of a transposase gene and the resulting contig sequence continues for some length without any sequence similarity to isiA C3 or isiA C5 (Fig. 5). This is the type of gene loss that would result from genetic rearrangements, which would be expected in a genome containing abundant transposase genes. Finally, because IsiA plays an important role in photosynthesis during iron-limited conditions (Chauhan et al. 2011), it is possible that the additional copies of the isiA gene in the genomes of the large-cell strains could make them better able to continue photosynthesis in low iron environments. This may also explain the observations of higher photosynthetic efficiency (Fv/Fm) in the large-cell phenotype strains (Sohm et al. 2011).

The apparent genome degradation observed in the small-cell Crocosphaera strains relative to the large-cell strains provides insight into the evolutionary history of the species. Bacterial genomes are known to degrade over time, with non-critical genes more likely to be lost. An example of this is the genome reduction observed in pathogenic bacteria during adaptation to host environments (Rau et al. 2012). Further, in small effective populations, such as pathogens recently adapted to a host, there is often a higher incidence of mobile genetic elements and a higher likelihood that deleterious mutations in important genes will become fixed within the population (Ochman and Davalos 2006). Evidence for these evolutionary forces is apparent in the Crocosphaera genomes, where critical genes, such as the nif operon, are present with 100% sequence identity in all of the strains, while regions containing redundant genes, such as the isiA genes (Fig. 5), show evidence of degradation, partly through the activity of mobile genetic elements. A number of studies have demonstrated that very closely related bacterial species can diverge from each other and adapt to their respective environments by acquiring a relatively small number of specific genetic capabilities that provide a metabolic advantage, and such adaptation is often facilitated by a high level of lateral gene transfer. This process has been observed in a number of species across different habitats, including marine Vibrio species (Shapiro et al. 2012), Escherichia coli from soil and freshwater (Luo et al. 2011), Shewanella from the Baltic Sea (Caro-Quintero et al. 2011), and Leptospirillum in an acid mine drainage system (Denef and Banfield 2012). In the marine environment, there is evidence that the most abundant cyanobacterial genera, Prochlorococcus and Synechococcus, have evolved into distinct ecotypes through gain and loss of hypervariable genomic islands that appear to be laterally transferred. Most genes in those islands have unknown functions, but others have clearly adaptive functions including cell wall proteins (affecting grazing), DNA mobility, and genes that are differentially expressed during light stress or nutrient limitation (Coleman et al. 2006, Dufresne et al. 2008, Scanlan et al. 2009). Furthermore, it has been proposed that the higher abundance of genetic regulatory systems in coastal Synechococcus strains (versus open ocean strains) makes them better adapted to deal with the environmental fluctuations that occur more commonly in coastal ecosystems (Dufresne et al. 2008). As such, the genomic differences described in this study make it likely that the two Crocosphaera phenotypes are adapted to different marine environments or niches. Although the strains have been isolated from marine habitats with similar chemical and physical characteristics, there may be subtle environmental differences that have yet to be identified.

CONCLUSIONS

The vast majority of genes in each of the six Crocosphaera genomes were shared with at least one other strain, many with multiple strains, and a large fraction were shared among all six strains at >99% nucleotide identity, which was not surprising in light of previous studies that have found a high degree of genetic sequence conservation in the species. The genome of WH8501, which has been the type-strain for the species for decades, was found to be surprisingly distinct within the small-cell phenotype and among this group of isolates in a number of respects, such as a larger genome, much more abundant transposase genes, and much higher levels of gene duplication. This calls into question whether WH8501 should continue to be used as the type-strain for the species in future studies. The various genomic and statistical analyses described here show that C. watsonii strains with the same phenotype cluster together, while similar clustering was not observed in strains with temporal or spatial proximity of isolation. Despite substantial genetic similarity among the genomes of the six strains, the strain-specific and phenotype-specific genes identified in this comparison apparently provide enough differences to result in phenotypic divergence. The resulting phenotypes are genetically characterized by smaller genomes and apparent gene loss in the small-cell strains, and by larger genomes and more redundancy in genetic and metabolic capabilities in the large-cell strains. Finally, there is some evidence that among the redundant genes are capabilities which may make the large-cell strains better adapted to iron- and phosphorus-limited environments. The genome sequences analyzed in this study provide important data that can be applied in future studies to test such hypotheses in isolated Crocosphaera strains as well as natural populations.

The authors acknowledge funding from NSF grant EF0424599 for the Center for Microbial Oceanography: Research and Education (C-MORE) and from the Gordon and Betty Moore Foundation Marine Microbiology Initiative (to J. P. Zehr). We are also grateful to John Waterbury for providing Crocosphaera isolates and to Brandon Carter and the MEGAMER facility for assistance with flow cytometry and cell sorting. We thank Eric Webb and Jack Meeks for insight and discussion, and Mary Hogan, Kendra Turk, Jim Tripp, Jason Hilton, Julie Robidart, Anne Thompson, and Deniz Bombar for technical and scientific input. Finally, we thank two anonymous reviewers for their astute observations and suggestions that helped improve the manuscript.

Ananyev, G., Nguyen, T., Putnam-Evans, C. & Dismukes, G. C. 2005. Mutagenesis of CP43-arginine-357 to serine reveals new evidence for (bi)carbonate functioning in the water oxidizing complex of Photosystem II. Photochem. Photobiol. Sci. 4:991–8.

Aziz, R., Bartels, D., Best, A., DeJongh, M., Disz, T., Edwards, R., Formsma, K. et al. 2008. The RAST server: rapid annotations using subsystems technology. BMC Genomics 9:75.

Bench, S. R., Ilikchyan, I. N., Tripp, H. J. & Zehr, J. P. 2011. Two strains of Crocosphaera watsonii with highly conserved genomes are distinguished by strain-specific features. Front. Microbio. 2:1–13, doi: 10.3389/fmicb.2011.00261.

Bricker, T. 1990. The structure and function of CPa-1 and CPa-2 in Photosystem II. Photosynthesis Res. 24:1–13.


Caro-Quintero, A., Deng, J., Auchtung, J., Brettar, I., Hofle, M. G., Klappenbach, J. & Konstantinidis, K. T. 2011. Unprecedented levels of horizontal gene transfer among spatially co-occurring Shewanella bacteria from the Baltic Sea. ISME J. 5:131–40.

Chandler, M. & Mahillon, J. 2002. Insertion sequences revisited. In Craig, N. L. [Ed.] Mobile DNA II, 2nd ed. ASM Press, Washington, DC, pp. 305–66.

Chauhan, D., Folea, I. M., Jolley, C. C., Kouril, R., Lubner, C. E., Lin, S., Kolber, D. et al. 2011. A novel photosynthetic strategy for adaptation to low-iron aquatic environments. Biochemistry 50:686–92.

Chisholm, D. & Williams, J. G. K. 1988. Nucleotide sequence of psbC, the gene encoding the CP-43 chlorophyll a-binding protein of Photosystem II, in the cyanobacterium Synechocystis 6803. Plant Mol. Biol. 10:293–301.

Church, M. J., Bjorkman, K. M., Karl, D. M., Saito, M. A. & Zehr, J. P. 2008. Regional distributions of nitrogen-fixing bacteria in the Pacific Ocean. Limnol. Oceanogr. 53:63–77.

Church, M. J., Jenkins, B. D., Karl, D. M. & Zehr, J. P. 2005. Vertical distributions of nitrogen-fixing phylotypes at Stn ALOHA in the oligotrophic North Pacific Ocean. Aquat. Microb. Ecol. 38:3–14.

Coleman, M. L., Sullivan, M. B., Martiny, A. C., Steglich, C., Barry, K., DeLong, E. F. & Chisholm, S. W. 2006. Genomic islands and the ecology and evolution of Prochlorococcus. Science 311:1768–70.

Compaoré, J. & Stal, L. J. 2010. Oxygen and the light–dark cycle of nitrogenase activity in two unicellular cyanobacteria. Environ. Microbiol. 12:54–62.

Denef, V. J. & Banfield, J. F. 2012. In situ evolutionary rate measurements show ecological success of recently emerged bacterial hybrids. Science 336:462–6.

Dufresne, A., Ostrowski, M., Scanlan, D., Garczarek, L., Mazard, S., Palenik, B., Paulsen, I. et al. 2008. Unraveling the genomic mosaic of a ubiquitous genus of marine cyanobacteria. Genome Biol. 9:R90.

Dyhrman, S. T. & Haley, S. T. 2006. Phosphorus scavenging in the unicellular marine diazotroph Crocosphaera watsonii. Appl. Environ. Microbiol. 72:1452–8.

Falcon, L. I., Carpenter, E. J., Cipriano, F., Bergman, B. & Capone, D. G. 2004. N2 Fixation by unicellular bacterioplankton from the Atlantic and Pacific Oceans: phylogeny and in situ rates. Appl. Environ. Microbiol. 70:765–70.

Falcon, L. I., Pluvinage, S. & Carpenter, E. J. 2005. Growth kinetics of marine unicellular N2-fixing cyanobacterial isolates in continuous culture in relation to phosphorus and temperature. Mar. Ecol. Prog. Ser. 285:3–9.

Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–91.

Fu, F.-X., Mulholland, M. R., Garcia, N. S., Aaron, B., Bernhardt, P. W., Warner, M. E., Sanudo-Wilhelmy, S. A. et al. 2008. Interactions between changing pCO2, N2 fixation, and Fe limitation in the marine unicellular cyanobacterium Crocosphaera. Limnol. Oceanogr. 53:2472–84.

Garczarek, L., Dufresne, A., Blot, N., Cockshutt, A. M., Peyrat, A., Campbell, D. A., Joubin, L. et al. 2008. Function and evolution of the psbA gene family in marine Synechococcus: Synechococcus sp. WH7803 as a case study. ISME J. 2:937–53.

Hewson, I., Poretsky, R. S., Beinart, R. A., White, A. E., Shi, T., Bench, S. R., Moisander, P. H. et al. 2009. In situ transcriptomic analysis of the globally important keystone N2-fixing taxon Crocosphaera watsonii. ISME J. 3:618–31.

Huang, Y., Niu, B., Gao, Y., Fu, L. & Li, W. 2010. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26:680–2.

Jukes, T. H. & Cantor, C. R. 1969. Evolution of protein molecules. In Munro, H. N. [Ed.] Mammalian Protein Metabolism III. Academic Press, New York, pp. 21–132.

Kitajima, S., Furuya, K., Hashihama, F., Takeda, S. & Kanda, J. 2009. Latitudinal distribution of diazotrophs and their nitrogen fixation in the tropical and subtropical western North Pacific. Limnol. Oceanogr. 54:537–47.

Langlois, R. J., Hummer, D. & LaRoche, J. 2008. Abundances and distributions of the dominant nifH phylotypes in the Northern Atlantic Ocean. Appl. Environ. Microbiol. 74:1922–31.

Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F. et al. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–8.

Laudenbach, D. E. & Straus, N. A. 1988. Characterization of a cyanobacterial iron stress-induced gene similar to psbC. J. Bacteriol. 170:5018–26.

Li, W. & Godzik, A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–9.

Luo, C., Walk, S. T., Gordon, D. M., Feldgarden, M., Tiedje, J. M. & Konstantinidis, K. T. 2011. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc. Natl. Acad. Sci. U. S. A. 108:7200–5.

Lysnyansky, I., Calcutt, M. J., Ben-Barak, I., Ron, Y., Levisohn, S., Methé, B. A. & Yogev, D. 2009. Molecular characterization of newly identified IS3, IS4 and IS30 insertion sequence-like elements in Mycoplasma bovis and their possible roles in genome plasticity. FEMS Microbiol. Lett. 294:172–82.

Moisander, P. H., Beinart, R. A., Hewson, I., White, A. E., Johnson, K. S., Carlson, C. A., Montoya, J. P. et al. 2010. Unicellular cyanobacterial distributions broaden the oceanic N2 fixation domain. Science 327:1512–4.

Montoya, J. P., Holl, C. M., Zehr, J. P., Hansen, A., Villareal, T. A. & Capone, D. G. 2004. High rates of N2 fixation by unicellular diazotrophs in the oligotrophic Pacific Ocean. Nature 430:1027–31.

Moriya, Y., Itoh, M., Okuda, S., Yoshizawa, A. C. & Kanehisa, M. 2007. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 35:W182–5.

Ochman, H. & Davalos, L. M. 2006. The nature and dynamics of bacterial genomes. Science 311:1730–3.

Partensky, F. & Garczarek, L. 2010. Prochlorococcus: advantages and limits of minimalism. Ann. Rev. Mar. Sci. 2:305–31.

Passow, U., Shipe, R. F., Murray, A., Pak, D. K., Brzezinski, M. A. & Alldredge, A. L. 2001. The origin of transparent exopolymer particles (TEP) and their role in the sedimentation of particulate matter. Cont. Shelf Res. 21:327–46.

Pereira, S., Zille, A., Micheletti, E., Moradas-Ferreira, P., Philippis, R. D. & Tamagnini, P. 2009. Complexity of cyanobacterial exopolysaccharides: composition, structures, inducing factors and putative genes involved in their biosynthesis and assembly. FEMS Microbiol. Rev. 33:917–41.

Rau, M. H., Marvig, R. L., Ehrlich, G. D., Molin, S. & Jelsbak, L. 2012. Deletion and acquisition of genomic content during early stage adaptation of Pseudomonas aeruginosa to a human host environment. Environ. Microbiol. 14:2200–11.

Rocap, G., Larimer, F. W., Lamerdin, J., Malfatti, S., Chain, P., Ahlgren, N. A., Arellano, A. et al. 2003. Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature 424:1042–7.

Rozen, S. & Skaletsky, H. 1999. Primer3 on the WWW for general users and for biologist programmers. In Misener, S. & Krawetz, S. A. [Eds.] Bioinformatics Methods and Protocols. Humana Press, Totowa, NJ, pp. 365–86.

Rusch, D. B., Halpern, A. L., Sutton, G., Heidelberg, K. B., Williamson, S., Yooseph, S., Wu, D. et al. 2007. The Sorcerer II Global Ocean Sampling Expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 5:e77.

Saito, M. A., Bertrand, E. M., Dutkiewicz, S., Bulygin, V. V., Moran, D. M., Monteiro, F. M., Follows, M. J. et al. 2011. Iron conservation by reduction of metalloenzyme inventories in the marine diazotroph Crocosphaera watsonii. Proc. Natl. Acad. Sci. U. S. A. 108:2184–9.

Saitou, N. & Nei, M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–25.

Sandström, S., Park, Y.-I., Öquist, G. & Gustafsson, P. 2001. CP43′, the isiA gene product, functions as an excitation energy dissipator in the cyanobacterium Synechococcus sp. PCC 7942. Photochem. Photobiol. 74:431–7.

Scanlan, D. J., Ostrowski, M., Mazard, S., Dufresne, A., Garczarek, L., Hess, W. R., Post, A. F. et al. 2009. Ecological genomics of marine picocyanobacteria. Microbiol. Mol. Biol. Rev. 73:249–99.

Shapiro, B. J., Friedman, J., Cordero, O. X., Preheim, S. P., Timberlake, S. C., Szabo, G., Polz, M. F. et al. 2012. Population genomics of early events in the ecological differentiation of bacteria. Science 336:48–51.

Shi, T., Ilikchyan, I., Rabouille, S. & Zehr, J. P. 2010. Genome-wide analysis of diel gene expression in the unicellular N2-fixing cyanobacterium Crocosphaera watsonii WH 8501. ISME J. 4:621–32.

Siguier, P., Perochon, J., Lestrade, L., Mahillon, J. & Chandler, M. 2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34:D32–6.

Singh, A. & Sherman, L. 2007. Reflections on the function of IsiA, a cyanobacterial stress-inducible, Chl-binding protein. Photosynth. Res. 93:17–25.

Sneath, P. H. A. & Sokal, R. R. 1973. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman, San Francisco, CA, 573 pp.

Sohm, J. A., Edwards, B. R., Wilson, B. G. & Webb, E. A. 2011. Constitutive extracellular polysaccharide (EPS) production by specific isolates of Crocosphaera watsonii. Front. Microbio. 2:1–9, doi: 10.3389/fmicb.2011.00229.

Tamura, K., Dudley, J., Nei, M. & Kumar, S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24:1596–9.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. & Kumar, S. 2011. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28:2731–9.

Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876–82.

Thompson, J. D., Higgins, D. G. & Gibson, T. J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–80.

Tuit, C., Waterbury, J. & Ravizza, G. 2004. Diel variation of molybdenum and iron in marine diazotrophic cyanobacteria. Limnol. Oceanogr. 49:978–90.

Waterbury, J. B., Watson, S. W., Valois, F. W. & Franks, D. G. 1986. Biological and ecological characterization of the marine unicellular cyanobacterium Synechococcus. Can. Bull. Fish. Aquat. Sci. 214:71–120.

Waterbury, J. B., Willey, J. M., Packer, L. & Alexander, N. G. 1988. Isolation and growth of marine planktonic cyanobacteria. Methods Enzymol. 167:100–5.

Webb, E. A., Ehrenreich, I. M., Brown, S. L., Valois, F. W. & Waterbury, J. B. 2009. Phenotypic and genotypic characterization of multiple strains of the diazotrophic cyanobacterium, Crocosphaera watsonii, isolated from the open ocean. Environ. Microbiol. 11:338–48.

Webb, E. A., Moffett, J. W. & Waterbury, J. B. 2001. Iron stress in open-ocean cyanobacteria (Synechococcus, Trichodesmium, and Crocosphaera spp.): Identification of the IdiA protein. Appl. Environ. Microbiol. 67:5444–52.

Zehr, J. P., Bench, S. R., Mondragon, E. A., McCarren, J. & DeLong, E. F. 2007. Low genomic diversity in tropical oceanic N2-fixing cyanobacteria. Proc. Natl. Acad. Sci. U. S. A. 104:17807–12.

Zehr, J. P., Waterbury, J. B., Turner, P. J., Montoya, J. P., Omoregie, E., Steward, G. F., Hansen, A. et al. 2001. Unicellular cyanobacteria fix N2 in the subtropical North Pacific Ocean. Nature 412:635–8.

Zhao, F. & Qin, S. 2007. Comparative molecular population genetics of phycoerythrin locus in Prochlorococcus. Genetica 129:291–9.

Zuckerkandl, E. & Pauling, L. 1965. Evolutionary divergence and convergence in proteins. In Bryson, V. & Vogel, H. J. [Eds.] Evolving Genes and Proteins. Academic Press, New York, pp. 97–166.

Supporting Information

Additional Supporting Information may be found in the online version of this article at the publisher’s web site:

Figure S1. Percent identity bins of all ORFs in each genome versus other five genomes.

Figure S2. Abundance and distribution of IS families in each of the six Crocosphaera watsonii genomes.

Figure S3. Combined counts of ORFs found in genomes of 2, 3, 4 and 5 Crocosphaera watsonii strains.

Figure S4. Evolutionary relationships of Crocosphaera watsonii psbC and isiA genes.

Table S1. ORF IDs for sequences included in alignment of 25 genes for phylogenetic analysis.

Table S2. LysM gene PCR primer sequences and product sizes.

Table S3. ORF IDs and lengths for the four LysM gene forms in six Crocosphaera watsonii genomes and PCR results [(+) or (−) amplification] for two additional strains.

Table S4. (a) Strain-specific ORFs from six Crocosphaera watsonii genomes with assigned functions. (b) Strain-specific ORFs from six Crocosphaera watsonii genomes: unknown functions.

Table S5. Counts of sequences in each category, based on genomes in which the sequences were found.

Table S6. Estimates of evolutionary divergence among six Crocosphaera watsonii strains and two Cyanothece species, based on 25 genes.

Table S7. Counts of iron-related genes in each Crocosphaera watsonii genome (probable split ORFs are marked with *).

Table S8. Counts of photosystem I and II genes in the genome of each Crocosphaera watsonii strain (probable split ORFs are marked with *).

Table S9. (a) Large-cell phenotype-specific ORF IDs and functions from Crocosphaera watsonii genomes. (b) Small-cell phenotype-specific ORF IDs and functions from Crocosphaera watsonii genomes.
