

Gene 396 (2007) 116–124

www.elsevier.com/locate/gene

Homologs of eukaryotic Ras superfamily proteins in prokaryotes and their novel phylogenetic correlation with their eukaryotic analogs

Jiu-Hong Dong a,b, Jian-Fan Wen a,⁎, Hai-Feng Tian a,b

a Key Laboratory of Cellular and Molecular Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, Yunnan Province, China

b Graduate School of the Chinese Academy of Sciences, Beijing 100039, China

Received 27 November 2006; received in revised form 2 March 2007; accepted 3 March 2007

Received by M. Di Giulio

Available online 14 March 2007

Abstract

Ras superfamily proteins are key regulators in a wide variety of cellular processes. Previously, they were considered to be specific to eukaryotes, and MglA proteins, a group of obviously different prokaryotic proteins, were recognized as their only prokaryotic analogs or even ancestors. Here, taking advantage of the recent accumulation of prokaryotic genomic databases, we have investigated the existence and taxonomic distribution of Ras superfamily protein homologs in a much wider prokaryotic range, and analyzed their phylogenetic correlation with their eukaryotic analogs. Thirteen unambiguous prokaryotic homologs, which possess the GDP/GTP-binding domain with all five characteristic motifs of their eukaryotic analogs, were identified in 12 eubacteria and one archaebacterium. In some other archaebacteria, including four methanogenic archaebacteria and three Thermoplasmales, homologs were also found, but with GDP/GTP-binding domains not containing all five characteristic motifs. Many more MglA orthologs were identified than in previous studies, mainly in delta-proteobacteria, and all were shown to have common unique features distinct from the Ras superfamily proteins. Our phylogenetic analysis indicated that the eukaryotic Rab, Ran, Ras, and Rho families have the closest phylogenetic correlation with the 13 unambiguous prokaryotic homologs, whereas the other three eukaryotic protein families (SRbeta, Sar1, and Arf) branch separately from them, but have a relatively close relationship with the methanogenic archaebacterial homologs and MglA. Although homologs were identified in a relative minority of prokaryotes with genomic databases, their presence in a relatively wide variety of lineages, their unique sequence characters distinct from those of eukaryotic analogs, and the topology of our phylogenetic tree together do not support their origin from eukaryotes as a result of lateral gene transfer. Therefore, we argue that Ras superfamily proteins might have already emerged at least in some prokaryotic lineages, and that the seven eukaryotic protein families of the Ras superfamily may have two independent prokaryotic origins, probably reflecting the ‘fusion’ evolutionary history of the eukaryotic cell.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Small GTPase; Origin; MglA; Genome survey

Abbreviations: Ras, Rat sarcoma protein; MglA, gliding motility protein A; Rab, Ras-like protein in brain; Rho, Ras homolog; SRbeta, the beta subunit of signal recognition particle receptor; Sar1, Secretion-associated and ras-superfamily-related protein; Arf, ADP-ribosylation factor; EF-Tu, elongation factor Tu; Era, Escherichia coli Ras-like protein; Obg, spoOB associated GTP-binding protein; PCR, polymerase chain reaction; RT–PCR, reverse transcription polymerase chain reaction; LRR, Leucine-rich repeats.

⁎ Corresponding author. Tel./fax: +86 871 5198682.

E-mail address: [email protected] (J.-F. Wen).

0378-1119/$ - see front matter © 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.gene.2007.03.001

1. Introduction

Ras superfamily proteins (also known as small GTPases, small G proteins, and small GTP-binding proteins) are monomeric molecules with masses ranging from 20 to 40 kDa, within the larger class of regulatory GTP hydrolases. They are ubiquitous in eukaryotes and have diverse roles in cellular processes (Paduch et al., 2001). More than 150 Ras superfamily proteins have already been identified in eukaryotes from yeast to metazoans. According to their structures and cellular functions, they are classified into seven families: SRbeta, Sar1, Arf, Ran, Rab, Ras, and Rho (Jekely, 2003). SRbeta is a component of the signal recognition complex, having a role in eukaryotic ribosome docking to the endoplasmic reticulum (Schwartz and Blobel, 2003); Sar1 and Arf regulate coated vesicle budding from the endoplasmic reticulum and the Golgi complex, respectively; Rab proteins are central to the regulation of vesicle transport and fusion; Ran proteins are central regulators of nuclear function; Ras and Rho proteins are involved in gene expression and transduction of extracellular signals, and Rho is also a central regulator of cytoskeleton reorganization (Takai et al., 2001). Their biological functions often depend on posttranslational modifications. Many Rab, Rho and Ras proteins have cysteine-rich sequences (CAAX (C, cysteine; A, aliphatic amino acid; X, any amino acid), CAA(L/F), CXC, or CC) at their C-termini that undergo posttranslational modifications with lipids (Paduch et al., 2001). Arf proteins have an N-terminal glycine residue that is modified with myristic acid (Moss and Vaughan, 1995). Ran proteins do not have such sequences to direct posttranslational modifications, but have a specific carboxyl-terminal region, DEDDL, which is required for their role in cell cycle regulation (Ren et al., 1995).

Numerous comparisons of amino acid sequences of Ras superfamily proteins from various species have revealed that they are conserved in primary structures at a level of 30–55% homology, possessing a consensus sequence responsible for GDP/GTP-binding and GTP hydrolytic activity (Drivas et al., 1991; Valencia et al., 1991). Crystallographic and NMR analyses have indicated that their GDP/GTP-binding domains share a common topology: five alpha helices, six beta sheets and five polypeptide loops, and five conserved regions (G1–G5 motifs) of the polypeptide chain associate with the loops (Bourne et al., 1990).

It was reported that these eukaryotic Ras superfamily proteins were more closely related to each other than to any prokaryotic GTPase. Therefore, they have been considered to be specific to eukaryotes and show a spectacular expansion in eukaryotes (Jekely, 2003). In view of the high conservation of the GDP/GTP-binding domain, it has been postulated that members of the superfamily evolved by gene duplication and divergence (Drivas et al., 1991). Based on their universality and high conservation, it has also been deduced that the latest common ancestor of eukaryotes already possessed small GTPases performing cellular activities similar to those performed by present-day family members, and the major functional diversification of the families occurred after the origin of eukaryotes from prokaryotic ancestors, but before the diversification of present-day eukaryotic groups (Jekely, 2003).

GTP-binding proteins, including EF-Tu, Era, and Obg, have been identified in prokaryotes, but they are large in size and possess extra domains not found in the Ras superfamily proteins (Valencia et al., 1991; Mittenhuber, 2001). To date, only the gliding motility proteins, MglA, first found in Myxococcus xanthus (Stephens et al., 1989), are recognized as prokaryotic analogs of the eukaryotic Ras superfamily proteins (Hartzell, 1997; Leipe et al., 2002) and even as the ancestors of the latter (Moreira and Lopez-Garcia, 1998; Jekely, 2003). But the apparent differences between MglA and eukaryotic Ras superfamily proteins seem not to support such a direct evolutionary relationship. Based on prokaryotic genomic databases, two groups surveyed the prokaryotic Ras superfamily protein homologs in 1999 and 2003, respectively (Ponting et al., 1999; Pandit and Srinivasan, 2003). But due to the limited number of prokaryotic genomic databases at that time, only a few homologs (nine and seven, respectively) were found. Up until now, little has been known about the prokaryotic homologs. Therefore, when and how the Ras superfamily proteins arose during eukaryotic evolution from prokaryotes, and whether their ancestor arose or even diverged in prokaryotes, remain open questions. These questions are important to the origin of eukaryotic cells from prokaryotic cells, because the origin of eukaryotic Ras superfamily proteins is closely related to the origin of compartmentalization, especially the formation of the nucleus, due to their key functions in the compartmentalized eukaryotic cells mentioned above.

Many prokaryotic genomic databases with annotated predicted protein sequences are now available. Taking advantage of this, in the present work we survey a much wider range of prokaryotes than before to identify the prokaryotic homologs of Ras superfamily proteins and to investigate their taxonomic distribution and phylogenetic correlation with their eukaryotic analogs.

2. Materials and methods

2.1. Search and identification of prokaryotic homologs of Ras superfamily proteins

The Ras superfamily protein sequences of the yeast Schizosaccharomyces pombe (GenBank accession numbers: SRbeta CAB11661; Sar1 Q01475; Arf1 P36579; Spi1 (Ran protein) P28748; ypt1 (Rab protein) CAA36139; Rho1 BAA07377; Ras1 CAB11218) and the prokaryote M. xanthus MglA protein sequence (GenBank accession number: B32048) were used as queries to search against all available prokaryotic genomic databases whose predicted protein sequences had been annotated in GenBank (621 species available at 2006-09-25) using the blastp algorithm. Blastp outputs were parsed using a cut-off of E-value ≤ 0.1. The results of the blasts were merged. To further confirm the obtained sequences, the Pfam program (http://www.sanger.ac.uk/Pfam/search.shtml) was used to analyze whether the sequences contained the characteristic GDP/GTP-binding domain of Ras superfamily proteins. Only those possessing the GDP/GTP-binding domain were collected and regarded as putative prokaryotic homologs. In order to avoid missing prokaryotic homologs with high divergence, these putative prokaryotic homologs were also used as queries to search against all available prokaryotic genomic databases to find more homologs, if any. The blastp hits were parsed using a cut-off of E-value ≤ 1e-03. Domain analyses were also performed using the Pfam program to identify homologs. The conserved motifs of proteins were identified using the FingerPRINTScan program (http://bioinf.man.ac.uk/fingerPRINTScan/).
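The two rounds of E-value filtering and the merging of hits described above can be sketched as follows. This is only an illustration, assuming the blastp results are available as tab-separated text with 12 columns per hit (subject ID in column 2, E-value in column 11); the function names and the sample rows are hypothetical and are not taken from the study.

```python
def parse_hits(tabular_text, max_evalue):
    """Return the set of subject IDs whose E-value is at or below max_evalue.

    Assumes 12 tab-separated columns per line, with the subject ID in
    column 2 and the E-value in column 11 (an assumed input layout).
    """
    kept = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            kept.add(subject_id)
    return kept


def merge_hits(per_query_results, max_evalue):
    """Merge the filtered hits from several query searches into one set."""
    merged = set()
    for text in per_query_results:
        merged |= parse_hits(text, max_evalue)
    return merged


# Hypothetical rows for illustration only (not real search output).
ras1_search = (
    "Ras1\thitA\t28.0\t160\t0\t0\t1\t160\t1\t160\t2e-05\t45.0\n"
    "Ras1\thitB\t22.0\t150\t0\t0\t1\t150\t1\t150\t0.5\t30.0\n"
)
mgla_search = (
    "MglA\thitC\t40.0\t190\t0\t0\t1\t190\t1\t190\t1e-30\t120\n"
    "MglA\thitD\t25.0\t160\t0\t0\t1\t160\t1\t160\t0.05\t33.0\n"
)

# First round: permissive cut-off (E-value <= 0.1), results merged across queries.
first_round = merge_hits([ras1_search, mgla_search], max_evalue=0.1)
# Second round: stricter cut-off (E-value <= 1e-03).
second_round = merge_hits([ras1_search, mgla_search], max_evalue=1e-3)
```

In the study, the surviving hits were additionally checked with Pfam for the GDP/GTP-binding domain; that step is not modeled here.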

2.2. PCR and RT–PCR examination of the transcription of the identified prokaryotic homolog genes

Nostoc punctiforme (FACHB-252) was purchased from the Algal Culture Collection of the Institute of Hydrobiology, CAS, and was cultured with Endo medium (Lu et al., 2001) in sunlight at 20–25 °C. Genomic DNA was extracted from N. punctiforme using a modified phenol–chloroform method (Lu et al., 2001). Total RNA was isolated from N. punctiforme using the UNIQ-10 total RNA minipreps classic kit (Shanghai Sangon Biological Engineering and Technology and Service Co., Ltd., Shanghai, China) according to the manufacturer's instructions. To perform PCR and RT–PCR, the gene-specific primers were designed according to the sequence of our identified Ras superfamily protein homolog of N. punctiforme (NospuRas) from GenBank, and their sequences are as follows: NpRasP1, 7–26: 5′-GGTGCATTTGCTACAGGTA-3′; NpRasP2, 350–331: 5′-TCCCATTCATCCGTTATATC-3′. The reverse transcription reaction (with NpRasP2 as the RT primer) was carried out at 50 °C for 30 min using the TaKaRa RNA PCR Kit (Avian Myeloblastosis Virus (AMV)) V.2.1 (TaKaRa Biotechnology (Dalian) Co., Ltd., Dalian, China) according to the manufacturer's instructions. A control without AMV Reverse Transcriptase XL was run simultaneously. Subsequent PCR, with the primer pair NpRasP1/NpRasP2, was carried out using the same kit for 35 cycles at 94 °C for 30 s, 48 °C for 30 s and 72 °C for 45 s. For comparison with the RT–PCR, a PCR with the same primer pair was performed using N. punctiforme genomic DNA as the template. Both the RT–PCR and PCR products were analyzed on the same 1.5% agarose gel.
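The expected amplicon length follows directly from the primer coordinates given above (forward primer starting at position 7, reverse primer starting at position 350 on the opposite strand). A minimal sketch, assuming 1-based inclusive coordinates on the gene sequence; the function name is hypothetical.

```python
def amplicon_length(forward_start, reverse_start):
    """Length of a PCR product spanning forward_start..reverse_start inclusive.

    Coordinates are assumed to be 1-based; reverse_start is the outermost
    (5'-most) base of the reverse primer, given on the gene's coordinates.
    """
    if reverse_start < forward_start:
        raise ValueError("reverse primer must lie downstream of the forward primer")
    return reverse_start - forward_start + 1


# NpRasP1 starts at position 7 and NpRasP2 at position 350.
expected_bp = amplicon_length(7, 350)
```

With these coordinates the calculation gives 344 bp, the calculated length quoted in the Results.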

2.3. Sequence alignment and phylogenetic analysis

Sequence alignment was performed using the programs MUSCLE (Edgar, 2004) and ClustalX (v 1.83) (Thompson et al., 1997) with default parameters. The alignment was then manually refined, and only unambiguously aligned regions (126 amino acid sites) were used for maximum likelihood phylogenetic analysis, performed using the PHYML program (Guindon and Gascuel, 2003). The ProtTest program (Abascal et al., 2005) was used to select the model of protein evolution that best fits our dataset. The options invoked in the maximum likelihood analysis with the PHYML program were 100 bootstrap replications, the WAG substitution matrix, and the gamma distribution model (1 invariable site + 8 gamma rate categories) for estimation of rate heterogeneity.
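The idea behind the 100 bootstrap replications can be illustrated with a short sketch of nonparametric column resampling. This is not PHYML's own code; it only shows how one pseudo-replicate alignment is drawn. The function name, the toy sequences, and the fixed random seed are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw one bootstrap replicate by resampling columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    # The same sampled column indices are applied to every sequence.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences, far shorter than the 126 sites used).
toy_alignment = {
    "seqA": "GDGGVGKS",
    "seqB": "GDSGVGKT",
    "seqC": "GPPGSGKT",
}

rng = random.Random(0)
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
```

Each replicate would then be analyzed with the same model, and the percentage of replicate trees containing a given clade gives the bootstrap proportion for that node.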

3. Results

3.1. Identification and characterization of prokaryotic homologs of Ras superfamily proteins

By searching all 621 available prokaryotic genomic databases (58.7% of them are completed) in GenBank, whose predicted protein sequences had already been annotated, and then by analyzing whether the obtained sequences have the characteristic GDP/GTP-binding domain of Ras superfamily proteins, 51 of the 621 species were found to have homologs. The homologs are present in a minority of the 621 species but in a relatively wide variety of lineages (see Fig. 1). The 51 species comprise 10 archaebacteria and 41 eubacteria (5 cyanobacteria; 20 proteobacteria, including 2 alpha-proteobacteria, 1 beta-proteobacterium, 3 gamma-proteobacteria, 13 delta-proteobacteria, and 1 other; 7 Bacteroidetes/Chlorobi; and 9 other bacteria, including 1 Aquificae, 1 Acidobacteria, 4 Deinococcus–Thermus, and 3 Chloroflexi) (see Fig. 1 and supplementary material Table S1). No homologs were found in the other 570 species (22 archaebacteria; 20 Bacteroidetes/Chlorobi; 20 cyanobacteria; 320 proteobacteria; all 41 Actinobacteria; all 10 Chlamydiae; all 123 Firmicutes; all 7 Spirochaetales; and 9 other bacteria). From the 51 species, 74 homologs possessing the characteristic GDP/GTP-binding domain of Ras superfamily proteins were identified (see Fig. 1 and supplementary material Table S1). Seventeen of the 74 sequences are relatively long (ranging from 393 aa to 1185 aa), containing extra domains besides the GDP/GTP-binding domain. Among the 17 long sequences, 16 (631–1185 aa) contain a common peculiar C-terminal region next to the GDP/GTP-binding domain and other domains such as LRRs. These are the features of the recently characterized Roco family (Bosgraaf and Van Haastert, 2003), and therefore these sequences may belong to the Roco family and were excluded from our further analyses; the remaining one, NP_275737 from the archaebacterium Methanothermobacter thermautotrophicus, does not have the peculiar C-terminal region, but has an extra KaiC domain at the N-terminus which has been mentioned in previous studies (Koonin and Aravind, 2000; Pandit and Srinivasan, 2003), and it had already been recognized as a small GTPase (Pandit and Srinivasan, 2003); hence it was included in our further analyses. The other 57 of the 74 sequences are relatively short (152–289 aa), resembling eukaryotic Ras superfamily proteins in size. Among them, 37 sequences are predicted to be MglA proteins, as they have more similarity to M. xanthus MglA, with identities of 23–55%. The other 20 of the 57 sequences show more similarity to eukaryotic Ras superfamily proteins, and thus we recognized them as Ras-like proteins.

According to the above analyses, we termed the 74 different homologs, except the NP_275737 mentioned above, XxRoco, XxMglA or XxRas (Ras-like), respectively (X represents the first three letters of the genus name, and x represents the first two letters of the species name). Any given species can have 1–3 Roco or MglA proteins, but at most 1 Ras-like protein. Apart from 3 species (N. punctiforme, Nostoc sp., Anabaena variabilis) that have both Roco and Ras-like proteins, and 1 Chloroflexi species, Chloroflexus aurantiacus, that has both MglA and Ras-like proteins, each species has only one kind of homolog: Roco, MglA or Ras-like.
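The naming rule stated above (first three letters of the genus, first two letters of the species, then the protein kind) can be written out as a small helper. This is a sketch for illustration; the function name is hypothetical, and the examples use species and labels that appear in this paper.

```python
def homolog_name(genus, species, kind):
    """Build a label such as 'NospuRas' from genus, species and protein kind.

    kind must be one of 'Roco', 'MglA' or 'Ras' (the three classes in the text).
    """
    if kind not in {"Roco", "MglA", "Ras"}:
        raise ValueError("kind must be 'Roco', 'MglA' or 'Ras'")
    return genus[:3].capitalize() + species[:2].lower() + kind


names = [
    homolog_name("Nostoc", "punctiforme", "Ras"),
    homolog_name("Nostoc", "sp.", "Ras"),
    homolog_name("Anabaena", "variabilis", "Ras"),
    homolog_name("Chloroflexus", "aurantiacus", "Ras"),
    homolog_name("Thermofilum", "pendens", "Ras"),
]
```

The five labels generated here are NospuRas, NosspRas, AnavaRas, ChlauRas and ThepeRas, the forms used for these species elsewhere in the text.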

The alignment of all 57 short sequences and NP_275737 with the eukaryotic Ras superfamily proteins shows that the prokaryotic homologs have neither the posttranslational modification directive regions found in eukaryotic Rab, Ras, Rho and Arf, nor the functional region found in eukaryotic Ran. The alignment also shows a salient divergence among the prokaryotic homologs (see Fig. 2; for the full alignment see supplementary material Fig. S1).

Fig. 1. The taxonomic distribution of the identified Ras superfamily protein homologs in prokaryotes. The tree for the representative groups and species in this figure was drawn according to the highly resolved tree of life from Ciccarelli et al. (2006).

Fig. 2. The partial alignment of the identified prokaryotic homologs with some eukaryotic Ras superfamily proteins. The five conserved regions that contribute to the GDP/GTP-binding domain of Ras superfamily proteins are indicated by horizontal lines and text. Different prokaryotic homologs are indicated by vertical lines and text.

Firstly, the 37 MglA proteins among the 57 short sequences show common unique features distinct from the eukaryotic Ras superfamily proteins: all MglA proteins possess an insertion between G1 and G2, and an amino acid substitution from aspartic acid to threonine in the G3 motif (DXXG→TXXG) (see Fig. 2 and supplementary material Fig. S1). Moreover, except for a few, most MglA proteins share a common sequence ((D/E)RT(L/I)FFD(F/L)LP) near the G2 motif (see Fig. 2). These MglA proteins are mainly present in delta-proteobacteria (nearly all delta-proteobacterial genomic databases contain MglA genes), besides a few species of other kinds of bacteria, such as beta-proteobacteria, gamma-proteobacteria, Chloroflexi, Aquificae, and Deinococcus–Thermus.
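The two MglA-specific sequence features described in this paragraph can be expressed as simple pattern checks. The sketch below is illustrative only: it assumes the four-residue G3 segment has already been located from an alignment, and the function names and test strings are hypothetical rather than real protein sequences.

```python
import re

# (D/E)RT(L/I)FFD(F/L)LP written as a regular expression.
MGLA_SIGNATURE = re.compile(r"[DE]RT[LI]FFD[FL]LP")


def classify_g3(segment):
    """Classify a four-residue G3 segment as 'DXXG', 'TXXG' or 'other'."""
    if len(segment) != 4 or segment[3] != "G":
        return "other"
    if segment[0] == "D":
        return "DXXG"  # conventional eukaryotic form
    if segment[0] == "T":
        return "TXXG"  # substitution reported for MglA proteins
    return "other"


def has_mgla_signature(sequence):
    """True if the sequence contains the MglA signature near the G2 motif."""
    return MGLA_SIGNATURE.search(sequence) is not None


# Hypothetical segments used only to exercise the checks.
g3_calls = [classify_g3(s) for s in ("DTAG", "TVPG", "GKTG", "DTA")]
signature_calls = [has_mgla_signature(s) for s in ("AAERTLFFDFLPKK", "AADRTIFFDLLP", "AAERTLFFEFLP")]
```

A pattern check of this kind does not locate the motif by itself; in the study the motifs were identified from the alignment and with FingerPRINTScan.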

Secondly, the other 20 Ras-like proteins among the 57 short sequences fall into two types. One type includes 13 sequences which are most similar to the eukaryotic proteins, with identities of 23–29% to yeast Ras1, having the GDP/GTP-binding domain with all five characteristic motifs of eukaryotic Ras superfamily proteins (see G1–G5 in Fig. 2). They are from 13 different prokaryotes and are the most unambiguous prokaryotic Ras superfamily protein homologs. Except for one archaebacterial sequence (ThepeRas) from Thermofilum pendens, the other 12 sequences are all from eubacteria: 5 from Bacteroidetes/Chlorobi bacteria (FlasbaRas, FlabaRas, CroatRas, CelspRas, TenspRas), 3 from Nostocales of Cyanobacteria (NospuRas, NosspRas, AnavaRas), 2 from gamma-proteobacteria (PseatRas, AltmaRas), 1 from an alpha-proteobacterium (MetloRas) and 1 from a Chloroflexi bacterium (ChlauRas). Our PCR and RT–PCR with the same primer pair on NospuRas from N. punctiforme both produced an amplicon of the expected size (∼350 bp; the calculated length is 344 bp) (see supplementary material Fig. S2, Lanes 1 and 2), and the negative result from the RT–PCR control verified that the products of RT–PCR were derived from RNA and not from DNA contamination (see supplementary material Fig. S2, Lane 3). This shows that the gene of our identified NospuRas is actively transcribed. The other type includes the remaining seven Ras-like proteins, which show relatively high divergence from the eukaryotic analogs and do not possess all five eukaryotic characteristic motifs of the GDP/GTP-binding domain. The seven sequences are all from archaebacteria: 3 (TheacRas, ThevoRas, and FeracRas) from Thermoplasmales and 4 (MetmaRas, MetthRas, MetjaRas and MetkaRas) from methanogens. The 3 Thermoplasmales homologs all have only three motifs of the GDP/GTP-binding domain: G1, G4, and G5 (see Fig. 2), whereas all 4 methanogenic archaebacterial sequences lack the G5 motif. MetthRas and MetkaRas each contain only the G1 and G4 motifs, while MetmaRas and MetjaRas possess three motifs, G1, G3 and G4 (see Fig. 2). In addition, their G3 motif is not the conventional eukaryotic DXXG sequence, but GXXG. The long sequence mentioned above, NP_275737 from M. thermautotrophicus (this species also has a Ras-like protein, MetthRas), is very similar to the 4 methanogenic homologs in the sequence of its GDP/GTP-binding domain, possessing three motifs, G1, G3 and G4 (see Fig. 2).
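The motif content reported in this paragraph for the seven divergent archaebacterial homologs can be tabulated and queried as follows. The motif assignments are copied from the text above; the data structure and function name are hypothetical and serve only to make the pattern of missing motifs explicit.

```python
ALL_MOTIFS = ("G1", "G2", "G3", "G4", "G5")

# Motifs present in each divergent Ras-like homolog, as stated in the text.
MOTIFS_PRESENT = {
    "TheacRas": {"G1", "G4", "G5"},  # Thermoplasmales
    "ThevoRas": {"G1", "G4", "G5"},
    "FeracRas": {"G1", "G4", "G5"},
    "MetthRas": {"G1", "G4"},        # methanogens
    "MetkaRas": {"G1", "G4"},
    "MetmaRas": {"G1", "G3", "G4"},
    "MetjaRas": {"G1", "G3", "G4"},
}


def missing_motifs(name):
    """List the G motifs absent from the named homolog, in G1..G5 order."""
    present = MOTIFS_PRESENT[name]
    return [motif for motif in ALL_MOTIFS if motif not in present]


missing = {name: missing_motifs(name) for name in MOTIFS_PRESENT}
methanogens = ("MetthRas", "MetkaRas", "MetmaRas", "MetjaRas")
all_methanogens_lack_g5 = all("G5" in missing[name] for name in methanogens)
```

Laid out this way, G2 is absent from all seven sequences, and G5 is absent from the four methanogenic sequences only.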

Surprisingly, when the methanogenic archaebacterial homologs were used as queries to search against the prokaryotic genomic databases, some other sequences with GXXG conserved regions in the position corresponding to the G3 motif of eukaryotic analogs were also found. They were present in most Actinobacteria, 1 Aquificae (Aquifex aeolicus), 2 Chloroflexi (Roseiflexus sp. and C. aurantiacus), and some cyanobacteria and proteobacteria (see supplementary material Table S3). Although the ATP_bind_1 domain is identified in these sequences by the Pfam program, the origin of the definition and annotation of this domain is unclear from the corresponding Pfam entry. In fact, besides the GXXG region, these sequences all have the G1 motif, and the G4 motif, which recognizes the guanine of GDP/GTP, is also present in some of them (see Fig. 2). Koonin and Aravind had already mentioned these sequences as members of a GTPase family (COG2229) appearing to be sister proteins to the MglA family (Koonin and Aravind, 2000). Among them, the sequences of the actinobacteria Streptomyces coelicolor and Streptomyces griseus had already been identified as putative GTP-binding proteins and termed CvnD and RarD, respectively (Bentley et al., 2002; Komatsu et al., 2003); and AAC07666 from A. aeolicus had already been identified as a Ras superfamily protein homolog (Pandit and Srinivasan, 2003). Combining the presence of some characteristic motifs of the GDP/GTP-binding domain in these sequences with the previous reports mentioned above, these proteins perhaps act not as ATPases, as implied by the Pfam analysis, but as GTPases. Their real relationship with the Ras superfamily proteins remains unknown. Here, the name RarD was used for these sequences in the further phylogenetic analysis.

Fig. 3. Phylogenetic tree constructed using the maximum likelihood method. The PHYML program was used. The invoked options were 100 bootstrap replications, the WAG substitution matrix, and the gamma distribution model (1 invariable site + 8 gamma rate categories) for estimation of rate heterogeneity. Bootstrap proportions are labeled as percentages near the major nodes. Group 1, the most unambiguous homologs in 12 eubacteria and 1 archaebacterium; Group 2, homologs in 3 Thermoplasmales archaebacteria; Group 3, homologs in 4 methanogenic archaebacteria; Group 4, MglA proteins in eubacteria; Group 5, proteins of eukaryotic Ran, Rab, Rho and Ras families; Group 6, proteins of eukaryotic SRbeta, Sar1, and Arf families; Group 7, RarD sequences. For the GenBank accession numbers of prokaryotic homologs and eukaryotic Ras superfamily proteins see supplementary materials (Table S1 and S2). For the GenBank accession numbers of the RarD sequences see supplementary material Table S3.

3.2. Phylogenetic analysis of Ras superfamily proteins

The alignment of the short prokaryotic homolog sequences (only one species' MglA proteins per genus were included), NP_275737 from M. thermautotrophicus, and 9 representative prokaryotic RarD sequences was complemented with representative members of the seven families of eukaryotic Ras superfamily proteins from Amoebozoa (Dictyostelium discoideum), Opisthokonta (Schizosaccharomyces pombe and Drosophila melanogaster), Archaeplastida (Arabidopsis thaliana), Chromalveolata (Plasmodium falciparum and Paramecium tetraurelia), and Excavata (Giardia lamblia and Trypanosoma cruzi). The result was manually refined, and 126 unambiguously aligned positions (see supplementary material Fig. S3) were finally used for maximum likelihood analysis.

In the unrooted tree (Fig. 3), the identified prokaryotic homologs form four groups (Groups 1–4): Group 1 comprises all 13 most unambiguous prokaryotic homologs with high similarity to eukaryotic analogs; Group 2 includes the archaebacterial Thermoplasmales sequences (TheacRas, ThevoRas, and FeracRas); the methanogenic archaebacterial homologs (MetmaRas, MetjaRas, MetkaRas, MetthRas, and NP_275737) make up Group 3; and Group 4 comprises all the MglA proteins. Four of the seven eukaryotic protein families, Rab, Ran, Ras, and Rho, cluster together (Group 5) and then form a large clade with Group 1, whereas, interestingly, the other three eukaryotic families, SRbeta, Arf, and Sar1, form another separate group (Group 6), and then cluster with Group 3 and Group 4. The RarD sequences cluster together (Group 7), and then form a sister group with the methanogenic archaebacterial homologs (Group 3).

In Group 1, the 13 most unambiguous prokaryotic homologs of eukaryotic Ras superfamily proteins show a very distinct branching order, reflecting the evolutionary relationships of these lineages. The sequences from the same lineage cluster together, and those from lineages with close evolutionary relationships form sister clades. For example, the sequences from proteobacteria cluster together with those of cyanobacteria; the five Bacteroidetes/Chlorobi sequences cluster together separately; and the only archaebacterial sequence branches outside all the eubacterial ones.

4. Discussion

Our survey of 621 prokaryotic genomic databases and domain analyses identified 74 homologs having the characteristic GDP/GTP-binding domain of eukaryotic Ras superfamily proteins from 51 prokaryotes. Excluding the relatively long sequences (the Roco proteins, and NP_275737 from M. thermautotrophicus with an extra KaiC domain at the N-terminus), we finally obtained 57 putative prokaryotic homologs of eukaryotic Ras superfamily proteins. The results of domain analyses, sequence alignment, and phylogenetic analysis indicated that these proteins have salient divergences. Among them, 13 homologs have the highest similarity to the eukaryotic Ras superfamily proteins, with all five characteristic motifs of the GDP/GTP-binding domain. They should be the most unambiguous homologs of the eukaryotic analogs. Therefore, canonical Ras superfamily protein homologs have been identified in prokaryotes, disproving the previous opinion that Ras superfamily proteins are specific to eukaryotes (Jekely, 2003). Our RT–PCR results showed that a representative gene (NospuRas) of the 13 homologs was transcribed. This suggests that NospuRas probably acts as a functional protein in N. punctiforme.

Considering that the RarD sequences contain some of the characteristic motifs of the Ras superfamily protein GDP/GTP-binding domain and had been identified as GTP-binding proteins previously (Koonin and Aravind, 2000; Pandit and Srinivasan, 2003), nine RarD sequences were also included in our phylogenetic analysis. In the unrooted tree, these RarD sequences and the identified methanogenic archaebacterial homologs form sister groups, suggesting that they have a common ancestor and that RarD probably acts as a GTPase.

NP_275737 from M. thermautotrophicus shows high similarity to the other four methanogenic homologs in the GDP/GTP-binding domain, and clusters together with the latter in the phylogenetic tree, suggesting that the NP_275737 gene probably derived from a recent gene fusion of KaiC and a Ras-like protein gene. In fact, it has been proposed that the GDP/GTP-binding domain of NP_275737 may act as a switch between the active and inactive states of this protein, allowing fine-tuning of the function of KaiC in the regulation of cell division, nitrogen fixation and photosynthesis (Pandit and Srinivasan, 2003).

More importantly, our phylogenetic analysis revealed the following interesting phylogenetic ramifications. Firstly, the existence of four independent groups of the prokaryotic homologs suggests that the Ras superfamily proteins had not only emerged but also diverged among different prokaryotes. Secondly, the clustering together of the 13 most unambiguous prokaryotic homologs with four of the seven eukaryotic families (Ran, Rab, Ras, Rho), and the separate branching of the other three eukaryotic families (SRbeta, Sar1, Arf), imply that the eukaryotic Ras superfamily proteins might have two independent origins.

J.-H. Dong et al. / Gene 396 (2007) 116–124

However, in view of the very limited distribution of these homologs in both eubacteria and archaebacteria, we consider the possibility that the prokaryotic homologs may have been acquired through lateral gene transfer (LGT) from eukaryotes. In that case, the existence of two independent clades of the eukaryotic Ras superfamily proteins could be due to the inclusion of these prokaryotic homologs transferred from eukaryotes in the phylogenetic tree, meaning that all eukaryotic Ras superfamily proteins should share a common ancestor. However, when investigated carefully, this explanation seems unreasonable. Firstly, since the MglA proteins and the members of two archaebacterial homolog groups (Group 3 and Group 4) have their own unique sequence characters that are very different from those of the eukaryotic Ras superfamily proteins, and branch separately from the latter, they cannot have been acquired through LGT from eukaryotes. Secondly, although the 13 most unambiguous prokaryotic homologs are the most likely candidates for an LGT origin because of their very limited distribution and high similarity to the eukaryotic Ras superfamily proteins, there are two complications: 1) In the phylogenetic tree, they branch together as a whole outside all of the eukaryotic protein families, not grouping with any one of the latter. Therefore no single eukaryotic protein family can be recognized as their LGT donor; 2) Within Group 4, the 13 prokaryotic homologs branch in an orderly way, reflecting the evolutionary relationships among these species. This orderly pattern is inconsistent with the usually messy branching seen in the phylogeny of proteins that have undergone recent LGT. Could the LGT then have occurred before the divergence of the superfamily in eukaryotes, that is, was the common ancestral gene of the eukaryotic Ras superfamily proteins transferred into some ancient prokaryotes at a very early time?
If so, according to our phylogenetic tree, all 13 of the prokaryotic homologs must have originated from the common ancestor of the eukaryotic Rho, Ras, Rab, and Ran families. However, it is difficult to imagine that such an ancient LGT happened many times in different ancient prokaryotic lineages (including archaebacteria and different eubacteria). Therefore, it is very unlikely that these prokaryotic homologs were acquired through LGT from eukaryotes.

In addition, the lack of some motifs in the three Thermoplasmales sequences and four methanogenic archaebacterial sequences, and the absence of eukaryotic posttranslational modification directive regions or functional regions in all prokaryotic homologs, might be recognized as ancestral characters, which also implies an earlier presence of the prokaryotic homologs than that of the eukaryotic proteins.

Therefore, according to the above analyses, the prokaryotic homologs we identified were probably not acquired by LGT from eukaryotes, but are intrinsic prokaryotic proteins. Their limited distribution may be due to their origin in only some of the prokaryotic lineages, and their absence in many species of these lineages may be attributed to gene loss or to their being undetectable because of high divergence. Hence, the seven eukaryotic families of Ras superfamily proteins probably have two different origins. Three of them (SRbeta, Sar1, and Arf) may have derived from the common ancestor of MglA proteins and the homologs in methanogenic archaebacteria, and the other four (Ran, Rab, Ras, and Rho) from the common ancestor of the 13 most unambiguous prokaryotic homologs. The question that remains is when and how the Ras superfamily proteins arose. According to the ‘fusion’ scenario for the origin of eukaryotic cells, the simplest explanation is that the two independent ancestors were brought into the proto-eukaryote by the two prokaryotic fusion partners respectively. As for the details of the fusion partners, there are two main alternatives according to our phylogenetic tree. One alternative is a methanogenic archaebacterium with a proteobacterium, the methanogenic archaebacterium donating the ancestor of the SRbeta, Sar1, and Arf families, and the proteobacterium providing the ancestor of Ran, Rab, Ras, and Rho. This scenario is consistent with those mitochondrion-driven hypotheses for the first eukaryote, which proposed an endosymbiosis of an alpha-proteobacterium in an archaebacterium (Martin and Muller, 1998; Vellai et al., 1998; Searcy, 2003). The other alternative is that a delta-proteobacterium fused with an archaebacterium of Thermoproteales (e.g. Thermofilum pendens), in which case the MglA of the delta-proteobacterium might be the ancestor of the SRbeta, Sar1, and Arf families, and the Ras-like protein of Thermoproteales might be the ancestor of the Ran, Rab, Ras, and Rho families. This scenario is compatible with the ‘eocyte hypothesis’, which proposed a symbiosis between an eocyte (e.g. Sulfolobales and Thermoproteales) and a proteobacterium (Lake, 1988; Lake and Rivera, 1994; Rivera and Lake, 2004). Our phylogenetic analysis cannot tell which of the two alternatives is more likely. However, considering that the methanogenic archaebacterial homologs are more similar to the eukaryotic Arf and Sar1 proteins (with identities of 19–28% and 18–24% respectively to yeast Arf1 and Sar1) than MglA proteins are (with identities of 16–23% to yeast Arf1 and Sar1), and especially that MglA proteins have their own unique sequence characters (e.g. insertions and conserved regions other than the five Ras signature motifs), we favor the former explanation and think that MglA is unlikely to be the ancestor of eukaryotic Ras superfamily proteins.
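The identity percentages quoted above are pairwise sequence identities. As a minimal sketch of how such a value can be computed from two aligned sequences, the function below counts identical residues over the alignment columns in which neither sequence has a gap. The choice of denominator is an assumption (the text does not state which convention was used), and the two aligned strings are hypothetical toy data.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue.

    Assumption: gapped columns are excluded from the denominator; other
    conventions (e.g. dividing by alignment length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical toy alignment: five comparable columns, four identical.
toy_identity = percent_identity("MK-LVG", "MKALIG")
```

Because the result depends on the alignment and on the denominator convention, values computed this way are only comparable when the same procedure is applied to every pair.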

In conclusion, our survey provides a comprehensive view of the distribution of Ras superfamily protein homologs in prokaryotes. The identification of prokaryotic homologs, especially of the 13 most unambiguous ones, shows that Ras superfamily proteins are not specific to eukaryotes but have emerged in at least some prokaryotic lineages, and that MglA proteins are clearly not the only prokaryotic analogs and are unlikely to be the ancestor of eukaryotic Ras superfamily proteins. The seven eukaryotic protein families may have two different prokaryotic origins, which probably reflects the ‘fusion’ evolutionary history of the eukaryotic cell. With the accumulation of new prokaryotic genomic databases, more prokaryotic homologs of Ras superfamily proteins will probably be found, which will help to reveal their phylogenetic correlation with the eukaryotic analogs more clearly and definitely.

Although the Ras superfamily protein gene homologs have been identified, and their transcription demonstrated (at least for NospuRas), the functions of these homologs in prokaryotic cells are not yet clear. Since the eukaryotic Ras superfamily proteins have diverse roles related to eukaryote-specific cellular processes (e.g. vesicle transport, regulation of nuclear function and cytoskeleton reorganization) and eukaryote-specific cellular compartmentalization, studies of the functions of these homologs in non-compartmentalized prokaryotic cells will be interesting and of significance to understanding the evolution of the structures and functions of the eukaryotic cell.

Without any unambiguous prokaryotic homologs of eukaryotic Ras superfamily proteins available, Jekely's phylogenetic analysis indicated that eukaryotic SRbeta/Sar1/Arf split off first from all other Ras superfamily proteins, and on this basis he further inferred that eukaryotic endomembranes were of secretory origin (Jekely, 2003). Our results cannot definitively distinguish which families diverged early within the Ras superfamily, but they emphasize the two independent origins of these families.

Acknowledgments

We are grateful to Prof. Zhang Chuan-Mao (Beijing University) for providing us with some references. Thanks also to Miss Claire Maries (Australia) for her advice on the English writing. This work was supported by grants (90408016; 30021004; 30623007) from the National Natural Science Foundation of China.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.gene.2007.03.001.

References

Abascal, F., Zardoya, R., Posada, D., 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104–2105.

Bentley, S.D., et al., 2002. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417, 141–147.

Bosgraaf, L., Van Haastert, P.J., 2003. Roc, a Ras/GTPase domain in complex proteins. Biochim. Biophys. Acta 1643, 5–10.

Bourne, H.R., Sanders, D.A., McCormick, F., 1990. The GTPase superfamily: a conserved switch for diverse cell functions. Nature 348, 125–132.

Ciccarelli, F.D., Doerks, T., von Mering, C., Creevey, C.J., Snel, B., Bork, P., 2006. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287.

Drivas, G.T., Palmieri, S., D'Eustachio, P., Rush, M.G., 1991. Evolutionary grouping of the RAS-protein family. Biochem. Biophys. Res. Commun. 176, 1130–1135.

Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797.

Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704.

Hartzell, P.L., 1997. Complementation of sporulation and motility defects in a prokaryote by a eukaryotic GTPase. Proc. Natl. Acad. Sci. U. S. A. 94, 9881–9886.

Jekely, G., 2003. Small GTPases and the evolution of the eukaryotic cell. Bioessays 25, 1129–1138.

Komatsu, M., Kuwahara, Y., Hiroishi, A., Hosono, K., Beppu, T., Ueda, K., 2003. Cloning of the conserved regulatory operon by its aerial mycelium-inducing activity in an amfR mutant of Streptomyces griseus. Gene 306, 79–89.

Koonin, E.V., Aravind, L., 2000. Dynein light chains of the Roadblock/LC7 group belong to an ancient protein superfamily implicated in NTPase regulation. Curr. Biol. 10, R774–R776.

Lake, J.A., 1988. Origin of the eukaryotic nucleus determined by rate-invariant analysis of rRNA sequences. Nature 331, 184–186.

Lake, J.A., Rivera, M.C., 1994. Was the nucleus the first endosymbiont? Proc. Natl. Acad. Sci. U. S. A. 91, 2880–2881.

Leipe, D.D., Wolf, Y.I., Koonin, E.V., Aravind, L., 2002. Classification and evolution of P-loop GTPases and related ATPases. J. Mol. Biol. 317, 41–72.

Lu, Y., Wen, J.F., Lü, T.W., 2001. Isolation, pure cultivation and total DNA extraction of Microcystis aeruginosa kutz in Dianchi Lake. J. Lake Sci. 13, 285–288. In Chinese.

Martin, W., Muller, M., 1998. The hydrogen hypothesis for the first eukaryote. Nature 392, 37–41.

Mittenhuber, G., 2001. Comparative genomics of prokaryotic GTP-binding proteins (the Era, Obg, EngA, ThdF (TrmE), YchF and YihA families) and their relationship to eukaryotic GTP-binding proteins (the DRG, ARF, RAB, RAN, RAS and RHO families). J. Mol. Microbiol. Biotechnol. 3, 21–35.

Moreira, D., Lopez-Garcia, P., 1998. Symbiosis between methanogenic archaea and delta-proteobacteria as the origin of eukaryotes: the syntrophic hypothesis. J. Mol. Evol. 47, 517–530.

Moss, J., Vaughan, M., 1995. Structure and function of ARF proteins: activators of cholera toxin and critical components of intracellular vesicular transport processes. J. Biol. Chem. 270, 12327–12330.

Paduch, M., Jelen, F., Otlewski, J., 2001. Structure of small G proteins and their regulators. Acta Biochim. Pol. 48, 829–850.

Pandit, S.B., Srinivasan, N., 2003. Survey for G proteins in the prokaryotic genomes: prediction of functional roles based on classification. Proteins 52, 585–597.

Ponting, C.P., Aravind, L., Schultz, J., Bork, P., Koonin, E.V., 1999. Eukaryotic signalling domain homologs in archaea and bacteria. Ancient ancestry and horizontal gene transfer. J. Mol. Biol. 289, 729–745.

Ren, M., et al., 1995. Separate domains of the Ran GTPase interact with different factors to regulate nuclear protein import and RNA processing. Mol. Cell Biol. 15, 2117–2124.

Rivera, M.C., Lake, J.A., 2004. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature 431, 152–155.

Schwartz, T., Blobel, G., 2003. Structural basis for the function of the beta subunit of the eukaryotic signal recognition particle receptor. Cell 112, 793–803.

Searcy, D.G., 2003. Metabolic integration during the evolutionary origin of mitochondria. Cell Res. 13, 229–238.

Stephens, K., Hartzell, P., Kaiser, D., 1989. Gliding motility in Myxococcus xanthus: mgl locus, RNA, and predicted protein products. J. Bacteriol. 171, 819–830.

Takai, Y., Sasaki, T., Matozaki, T., 2001. Small GTP-binding proteins. Physiol. Rev. 81, 153–208.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., Higgins, D.G., 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876–4882.

Valencia, A., Chardin, P., Wittinghofer, A., Sander, C., 1991. The ras protein family: evolutionary tree and role of conserved amino acids. Biochemistry 30, 4637–4648.

Vellai, T., Takacs, K., Vida, G., 1998. A new aspect to the origin and evolution of eukaryotes. J. Mol. Evol. 46, 499–507.