
ISSN 1022-7954, Russian Journal of Genetics, 2011, Vol. 47, No. 2, pp. 129–138. © Pleiades Publishing, Inc., 2011. Original Russian Text © T.V. Kapelinskaya, A.S. Kagramanova, A.L. Korolev, D.V. Mukha, 2011, published in Genetika, 2011, Vol. 47, No. 2, pp. 149–158.


INTRODUCTION

Retrotransposons without long terminal repeats (retroposons) are classified as autonomous and nonautonomous. Autonomous retroposons encode all proteins required for their transposition in the genome. Reverse transcriptase (RT) is the most conserved domain of all autonomous retroposons and is used as the main criterion for identification, classification, and phylogeny of these mobile elements [1, 2]. Retroposons studied so far are grouped into more than 15 clades on the basis of phylogeny of the RT domain, which contains eleven identified highly conserved regions [3]. There are also unclassified clades, including the recently discovered L2, NeSL-1, and Rex1 retroposons [3, 4]. Six families of retroposons (R1, RT, TRAS, SART, Waldo, and Mino) belong to the R1 clade [3]. All of them have specific integration sites in the genome: R1 and RT are inserted into specific sites of 28S rDNA in most insects and arthropods; TRAS and SART are found in different sites of (TTAGG)n telomeric repeats located in opposite orientations in the terminal regions of insect chromosomes [5]. Waldo is inserted into ACAY repeats throughout the length of chromosomes, and its derivatives R6 and R7 are inserted into specific sites of 28S rDNA and 18S rDNA, respectively. Mino is integrated into AC repeats [3].

R1 elements of insects have two open reading frames (ORF1 and ORF2). The second reading frame (ORF2) contains the reverse transcriptase domain in the middle of the sequence, an APE-like endonuclease at the N-terminus, and a C-terminal conserved domain of unknown function [4]. The first open reading frame (ORF1) of these transposons is less conserved, and its domain organization is under study in many species. The protein encoded by ORF1 (ORF1p) is known to belong to the superfamily of retroviral nucleocapsid proteins (Gag proteins) containing zinc finger domains with a characteristic CX2CX4HX4C (CCHC) amino acid motif, but the main function of the protein encoded by ORF1 of autonomous retroposons in numerous eukaryotic species remains obscure. It is also known that ORF1p binds single-stranded RNA and DNA with high specificity and is necessary for retrotransposition of its transcript [6]. A noncanonical RRM domain, a domain type most frequently found in RNA-binding eukaryotic proteins, was recently identified in ORF1p of mammalian L1 retrotransposons [7]. This domain was discovered in proteins encoded by ORF1 of two types of retroposons: mammalian L1 retroposons and some retroposons of invertebrates [7]. This domain was identified only by means of analysis of the secondary structure of ORF1p, since no homology with the canonical RRM

First Open Reading Frame Protein (ORF1p) of the Blattella germanica R1 Retroposon and Phylogenetically Close GAG-Like Proteins of Insects and Fungi Contain RRM Domains

T. V. Kapelinskaya, A. S. Kagramanova, A. L. Korolev, and D. V. Mukha

Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991 Russia; e-mail: [email protected], [email protected]

Received March 17, 2010

Abstract: The rDNA locus of insects and other arthropods contains non-LTR retrotransposons (retroposons) that are specifically inserted into 28S rRNA genes. The most frequent retroposons are R1 and R2, but the mechanism of insertion and the functions of these mobile elements have not been studied in detail. A clone containing a full-length R1 retroposon copy was isolated from the cosmid library of Blattella germanica genes and sequenced. The amino acid sequences encoded by ORF1 of the R1 retroposon were subjected to bioinformatic analysis. It was found that ORF1 of this mobile element encodes a protein (ORF1p) that belongs to the superfamily of zinc finger (CCHC) retroviral nucleocapsid proteins and contains two conserved RRM domains (RNA-recognizing motifs) identified on the basis of analysis of the secondary structure of this protein. The discovery of RRM domains in ORF1p of R1 retroposons can contribute to the understanding of the mechanisms of their retrotransposition. We revealed a coiled-coil motif in the N-terminal region of R1 ORF1p, which is similar to the coiled-coil domain involved in homo- or heteromultimerization of proteins and in protein–protein interactions. The domain organization of homologous Gag-like proteins of retroposons in some insects and fungi was found to be similar to the structure established for R1 ORF1p of B. germanica.

DOI: 10.1134/S1022795410121038

MOLECULAR GENETICS


RUSSIAN JOURNAL OF GENETICS Vol. 47 No. 2 2011

KAPELINSKAYA et al.

domain was found at the level of amino acid sequences. The crystal structure was determined for ORF1p of L1 retroposons, and an unusual tertiary structure of the RRM domain was demonstrated, which explains why it was not discovered in retroposons earlier [7]. Next, a canonical RRM domain was found in ORF1 proteins of L1 retroposons in diverse plant genomes in which domains containing the zinc finger motif were absent [8]. It is known that RRM domains are present in approximately 2% of human proteins and often in multiple copies (from one to eight within one protein) in combination with other domains [9]. The most widely represented among the latter are zinc finger domains of the CCHC and CCCH types (21%) and the C-terminal domain of poly(A)-binding proteins (PABP, 10%), which is involved in protein–protein interactions [9, 10]. Thus, being combined with different protein domains, RRM domains can modulate their own RNA-binding activity and specificity and fulfill diverse biological functions. Indeed, eukaryotic RRM proteins participate in all posttranscriptional events: pre-mRNA processing, splicing, alternative splicing, maintenance of mRNA stability, RNA editing, RNA export, formation of a pre-rRNA complex, regulation of translation, and degradation. Moreover, RRM domains were identified in bacteria, viruses, and mitochondria, but the function of these domains in lower organisms remains to be studied [11]. The tertiary structure of the RRM domain shows a characteristic fold of four β-sheets and two α-helices that are arranged in the β-α-β-β-α-β order and form an RNA-binding motif [10].

The conserved organization of RRM domains in plant and animal proteins suggests their ancient origin. The frequent presence of these domains in combination with zinc finger domains among cell proteins of eukaryotes indicates that proteins of insect retroposons may also contain these domains. Making use of the characteristic secondary structure of the RRM domain, we carried out a bioinformatic analysis of the amino acid sequence of ORF1 of the cloned B. germanica R1 retroposon and other closely related insect retroposons by means of HHpred [12], a sensitive method for detection of protein homology based on comparison of secondary structures. As a result, we identified conserved secondary structures characteristic of RRM domains in ORF1p of the German cockroach R1 retroposon and several other retroelements of insects and fungi. In addition, we found a coiled-coil domain in the N-terminal region of R1 ORF1p, which may suggest its interaction with other cell proteins during transfer of R1 transcripts into the nucleus and during retrotransposition. We also demonstrated that insect and fungus Gag-like proteins homologous to R1 ORF1p contain RRM domains and are phylogenetically related.

MATERIALS AND METHODS

The library of German cockroach (Blattella germanica) genes was constructed using the cosmid vector SuperCos I (Stratagene) and high-molecular-weight DNA from B. germanica strain P6 oothecae. It was screened to find full-length copies of several subfamilies of R1 retroposons. Screening of the library was carried out using probes with truncated copies of R1 retroposons obtained by us earlier [13]. Clones with the most extended stretches of the 5'-regions obtained after screening of the library were sequenced as described previously [13]. The amino acid sequences of R1 ORF1 were determined after analysis of the primary nucleotide sequence using the Chromas-Pro program.

Homologous sequences for R1 ORF1p were identified using the PSI-BLAST program after three search iterations. Multiple alignment of the amino acid sequences of homologous proteins for various purposes was carried out using the Clustal W, MUSCLE, and T-Coffee algorithms (http://www.ebi.ac.uk/Tools/sequence.html and http://toolkit.tuebingen.mpg.de/sections/alignment). Screening of the amino acid sequences for detection of conserved motifs was performed with the use of the programs MOTIF SCAN (http://myhits.isb-isb.ch/cgi-bin/motif_scan), COIL (http://www.isrec.isb-sib.ch/webmarcoil/webmarcoilC1.html), PSORT (http://psort.ims.u-tokyo.ac.jp/), and the HomoloGene database (http://www.ncbi.nlm.nih.gov/homologene). The phylogenetic analysis and construction of trees were carried out with the programs ClustalW (http://www.ch.embnet.org/software/ClustalW.html) and MEGA version 4 [14] using the neighbor-joining method. The significance of different phylogenetic branches was estimated by bootstrap analysis of 1000 replicates.
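The bootstrap step mentioned above can be illustrated with a minimal sketch. It is not the MEGA implementation; the function names, the toy alignment, and the use of a simple p-distance are assumptions made for illustration only. Each replicate resamples alignment columns with replacement; a tree would then be rebuilt from each replicate, and the support for a branch would be the fraction of replicates in which that branch is recovered.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def p_distance(a, b):
    """Fraction of differing positions, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs) if pairs else 0.0

# Hypothetical toy alignment (not data from the paper).
toy = {"s1": "MKRC-NCG", "s2": "MKKCYNCG", "s3": "MRRCYDCA"}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(1000)]
```

Tree construction itself (neighbor joining on the distance matrix of each replicate) is left out of the sketch.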

Computational identification of RRM domains and construction of 3D models. The amino acid sequences of proteins encoded by ORF1 of homologous insect and fungus retrotransposons were analyzed individually or by multiple alignments (MUSCLE) for similarity to known domains on the basis of comparison of the secondary structures in HHpred (http://www.eb.tuebingen.mpg.de/departments/1-protein-evolution/department-1-protein-evolution) [12]. The identified conserved regions were used to construct 3D models (sites http://www.cbs.dtu.dk/services/index.php and http://zhanglab.ccmb.med.umich.edu/I-TASSER/) [15].

The 3D models were visualized and illustrated in the Chimera program (http://www.chimera.org). The obtained models in the form of PDB files were compared with known 3D structures with the use of the VAST (Vector Alignment Search Tool) search (http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html).

Accession numbers of GenBank sequences used in the work: R1ORF1Bm (non-LTR retrotransposon R1mks ORF1 protein, BAD82945.1, Bombyx mori), ORF1Bm (ORF1, BAA07466.1, Bombyx mori), Tras3Bm (TRAS3, BAB21511.1, Bombyx mori), R1Dm (Y1R1 DROME, P16424.1, Drosophila melanogaster), R1/R2Nv (putative chimeric R1/R2 retrotransposon, AAC34928.1, Nasonia vitripennis), hpTcas (hypothetical protein TcasGA2_TC006972, EFA13356.1, Tribolium castaneum), hpTc507 (hypothetical protein TcasGA2_TC011554, EFA13507.1, Tribolium castaneum), hpTc36 (hypothetical protein TcasGA2_TC001894, EFA11936.1, Tribolium castaneum), tras3Api (similar to TRAS3, XP_001942622.1, Acyrthosiphon pisum), glApi21 (similar to gag-like protein, XP_001947721.1, Acyrthosiphon pisum), gagDv (gag protein, AAQ75091.1, Drosophila virilis), gkDw34 (GK24896, EDW77234.1, Drosophila willistoni), pgNv (similar to gag-like protein, XP_001600233.1, Nasonia vitripennis), pgAgm13 (gag-like protein, BAC57913.1, Anopheles gambiae), R1gAgam (gag-like protein, BAC57909.1, Anopheles gambiae), glpAg99 (gag-like protein, BAC57899.1, Anopheles gambiae), gAga60 (gag-like protein, BAC57903.1, Anopheles gambiae), gagAae (gag-like protein, AAF20018.1, Aedes aegypti), glNc (gag-like protein, AAA21780.1, Neurospora crassa), gagAal (gag-like protein, BAI44771.1, Alternaria alternata), 3Cg (hypothetical protein CHGG_03578, XP_001230094.1, Chaetomium globosum CBS 148.51), 9Cg (predicted protein, XP_001219431.1, Chaetomium globosum CBS 148.51), hpCg76 (hypothetical protein CHGG_10776, EAQ82958.1, Chaetomium globosum CBS 148.51), hpCg86 (hypothetical protein CHGG_08786, EAQ84772.1, Chaetomium globosum CBS 148.51), ppLbi1 (predicted protein, EDR06718.1, Laccaria bicolor S238N-H82), ppLbi2 (predicted protein, EDR12593.1, Laccaria bicolor S238N-H82), ppCim (predicted protein, XP_001244535.1, Coccidioides immitis RS), glAAlt (gag-like protein, BAI44753.1, Alternaria alternata), ppAcap1 (predicted protein, EDN06463.1, Ajellomyces capsulatus NAm1), ppAcap87 (predicted protein, EDN10987.1, Ajellomyces capsulatus NAm1), 4trfTst (transposon I factor, putative, XP_001230094.1, Talaromyces stipitatus ATCC 10500).

RESULTS AND DISCUSSION

Analysis of the Amino Acid Sequences of the First Open Reading Frame (ORF1) of the R1 Retrotransposon and Its Homologs

The primary nucleotide sequence of the R1 retrotransposon was obtained by sequencing the clone that was selected by screening the cosmid library and contained a full-length copy of this mobile element. According to the analysis of the nucleotide sequence, the R1 retrotransposon is 6810 bp in size and terminates at the 3'-end with an oligo(dT) sequence containing about 20 thymidine residues, as shown earlier when cloning truncated R1 copies [13]. Retroposon R1 contains two open reading frames, ORF1 and ORF2. The first one encodes 480 amino acid residues and is the subject of this study. For comparative analysis of proteins encoded by ORF1 of retroposons, BLASTp searches were carried out to identify sequences most highly homologous to R1 ORF1p. As a result, several conserved proteins of insects were selected. To identify remote homologs, a PSI-BLAST search was additionally performed, which revealed several other proteins of unknown function belonging to retroposons and Gag-like proteins not only in insects, but also in fungi. The search for homology revealed Gag proteins encoded not only by R1, but also by telomeric TRAS retroposons of insects, as well as numerous hypothetical Gag-like protein sequences of insects. The analysis of the primary amino acid sequences encoded by ORF1 of six insect retroposons permitted us to identify only conserved regions containing from one to three zinc finger domains (Zinc knuckles) at the N-ends of the proteins under study, which belong to the CCHC (CX2CX4HX4C) type of retroviral Gag proteins (Fig. 1), and to detect nuclear localization signals for some of them. R1 ORF1 proved to encode two such domains and three nuclear localization signals located at the beginning, in the middle, and at the end of the sequence (Fig. 1).
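The CX2CX4HX4C spacing of the CCHC zinc knuckle lends itself to a simple pattern scan. The following is a minimal sketch and not the MOTIF SCAN procedure used in the work; the function name and the toy sequence are hypothetical. A lookahead is used so that overlapping candidates are also reported.

```python
import re

# CX2CX4HX4C: Cys, any 2 residues, Cys, any 4 residues, His, any 4 residues, Cys.
# The pattern is wrapped in a lookahead so overlapping matches are not skipped.
CCHC = re.compile(r"(?=(C.{2}C.{4}H.{4}C))")

def find_cchc(protein):
    """Return (1-based start, matched segment) for each CCHC candidate."""
    return [(m.start() + 1, m.group(1)) for m in CCHC.finditer(protein.upper())]

# Hypothetical toy sequence, not taken from the paper.
toy = "MKRCYNCGEKGHIARDCPQ"
print(find_cchc(toy))  # [(4, 'CYNCGEKGHIARDC')]
```

A pattern hit only marks a candidate; whether the segment folds as a zinc knuckle depends on context that a regular expression cannot assess.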

However, individual BLASTp searches for homology with each of the proteins showed that the Zinc knuckle domain of some of them included a part of the conserved AIR1 domain of a protein that interacts with arginine methyltransferase and contains the RING Zn finger domain. Proteins containing the AIR1 domain are implicated in posttranslational modification; they act as chaperones and mediate intracellular trafficking and secretion [16]. In all these proteins, the AIR1 domain is present in combination with RNA-binding domains, zinc finger domains, and other domains involved in protein–protein interactions (according to the HomoloGene database). Parts of the AIR1 domain were found in R1 ORF1p of Blattella germanica, R1 ORF1p of Drosophila melanogaster (R1Dm), in the hypothetical protein of Tribolium castaneum (hpTcas) (Fig. 1), as well as in insect and fungus Gag-like proteins homologous to R1 ORF1p (hpTcas507, hpTcas36, hpTcas10, pgNv, pGAm13, 9Cg, 4trfTst, glNc, and others). It is probable that the identification of the AIR1 domain in ORF1p of several Gag proteins of insects and fungi is due only to its partial homology to the Zinc knuckle domain. However, it cannot be ruled out that the homology between the Zinc knuckle domains involved in nucleic acid binding and the RING Zn finger domains can also be explained by their involvement in protein–protein interactions, as in proteins having AIR1 domains.

Our analysis did not reveal homology of R1 ORF1p to canonical RRM domains of RNA-binding proteins on the amino acid level. However, the analysis of the domain organization of proteins containing zinc finger domains and AIR1 domains showed that many of them had RRM domains as well, and we have attempted to find homology with the RRM domain in R1 ORF1p by studying the secondary structure of the protein.

[Fig. 1, alignment image. Rows: ORF1R1Bg, R1ORF1Bm, R1Dm, hpTcas, ORF1Bm, tras3Bm, R1/R2Nv, Consensus. Annotated features: coiled-coil; β1, α1, β2, β3, α2, β4 for each of two RRM domains; 1-3 Zn knuckles.]

ORF1 Proteins of Retroposon R1 and Other Insect and Fungus Retroposons Contain RRM Domains Immediately Upstream of Zinc Finger Domains

To search for structural similarity between the R1 ORF1 amino acid sequences and RRM domains, several software programs (SMART, PSORT, HHpred, and others) were used, but only HHpred [12] enabled the detection of homology in the secondary structure of R1 ORF1p with RNA-binding motifs of different splicing factors and heterogeneous nuclear proteins. The RRM domain is folded into the αβ sandwich structure with the β1α1β2β3α2β4 topology. In this structure, two α-helices are packed against four antiparallel β-sheets. The average length of the RRM domain is about 90 amino acid residues. Most of these amino acid residues are hydrophobic, while RNA binding involves mainly aromatic amino acids located in sheets β3 and β1 and forming two conserved motifs, RNP1 and RNP2, respectively [10]. When studying the secondary structure of the R1 ORF1 amino acid sequence with the help of the Quick2D subprogram and the HHpred server, we found an RRM-specific alternation of β- and α-chains located upstream of the zinc finger domains. Moreover, as shown in Fig. 1, the RRM motif in R1 ORF1p is repeated twice. A similar structure was also identified for other homologs of R1 ORF1 in insects, and after multiple alignment of these sequences the RRM motifs appeared to be in the conserved C-terminal domain of these proteins (Fig. 1). At the next step of the study, 3D models of the conserved region of R1 ORF1 extending between amino acid residues 231 and 400 were constructed with the use of the programs CBS Prediction and I-TASSER. The six constructed models were obtained on the templates of proteins having two RRM domains. Most of the models displayed significant homology with the following proteins: the splicing factor U2AF(65), which guides an early choice of the splicing site by recognizing the polypyrimidine tract near the 3'-terminal splicing site (PDB code 2G4B_A) [17]; a polypyrimidine tract-binding protein [18] involved in pre-mRNA splicing (PDB code 1QM9_A); nucleolin [19] involved in 28S rRNA processing (PDB code 1FJE_B); S. cerevisiae splicing factor Prp24 (PDB code 2GO9_A) [20]; and the human ELAV-like RNA-binding protein HuD [21], a known mRNA-stabilizing factor (PDB code 1FXL_A). Using the PDB files of these models to find homology with known 3D structures in the VAST database [22], we selected two proteins most highly homologous in structure to the RRM domains of R1 ORF1p (2GO9_A and 1FXL_A). Comparison of the amino acid sequences of the RRM domains of these proteins with the homologous region of R1 ORF1p is presented in Fig. 2a. The final 3D model version for the R1 ORF1 RRM domains was constructed in the MODELLER (HHpred) program using the two above templates (Fig. 2b). As shown by analysis of the 3D structure in the Chimera program, the RRM domains of R1 ORF1 have a hydrophobic surface like other canonical RRM domains of RNA-binding proteins, but the β-sheets have conserved aromatic amino acids characteristic of the RNP1 and RNP2 motifs only in the second RRM domain of R1 ORF1. It is known, however, that only about 70% of RNA-binding proteins have conserved aromatic amino acids in sheets β1 and β3; in some cases they can be replaced by aliphatic ones. For RRM domains binding long transcripts, interdomain linkers and loops connecting the α- and β-chains are also important [9]. As seen from Fig. 2b, the two RRM domains are connected by a long flexible linker, which is a distinctive feature of RNA-binding proteins interacting not only with a specific RNA structure, but also with the polyuridine tracts and AU-rich unstable elements (AREs) located in the 3'-untranslated regions (3'-UTR) of mRNA [21]. Such a structure of the RRM domains of R1 ORF1p is possibly required to bind and maintain the stability of its own RNA transcript, which has a polyuridine tract at the 3'-end, and it is also consistent with the function of the protein as an RNA chaperone [23].

3D models of the conserved C-terminal domain were also constructed for other Gag proteins presented in Fig. 1. All of them have at least one RRM domain, and Tras1 and Tras3 of B. mori have two domains like R1 ORF1p. Besides, using the PSI-BLAST program and the conserved region of R1 ORF1p containing RRM domains, we discovered additional Gag proteins of several other retroposons of insects and fungi homologous to R1 ORF1p, including retroposons RT, R6, and Waldo of mosquitoes, Het-A of Drosophila, and Tad of Neurospora crassa (the numbers are given in the section Materials and Methods).

Fig. 1. Multiple alignment of the amino acid sequences of Gag proteins of insect retroposons: ORF1RBg (ORF1 protein, non-LTR retrotransposon R1, Blattella germanica), R1ORF1Bm (non-LTR retrotransposon R1mks ORF1 protein, BAD82945.1, Bombyx mori), R1Dm (Y1R1 DROME, P16424.1, Drosophila melanogaster), hpTcas (hypothetical protein TcasGA2_TC006972, EFA13356.1, Tribolium castaneum), ORF1Bm (ORF1, BAA07466.1, Bombyx mori), Tras3Bm (TRAS3, BAB21511.1, Bombyx mori), R1/R2Nv (putative chimeric R1/R2 retrotransposon, AAC34928.1, Nasonia vitripennis). Sequences forming β-sheets are shown in black and indicated with black arrows; α-helices of the two RRM domains of all proteins are shaded in grey and indicated with grey cylinders. The rounded rectangle encloses Zinc knuckle (CCHC) domains. Asterisks indicate the first two nuclear localization signals, and the third nuclear localization signal of ORF1p R1 Bg (within the Zinc knuckle domains) is shown in bold italics. Coiled-coil motifs of all proteins are underlined.


All of them displayed homology with the secondary structure of RRM domains, especially Gag proteins encoded by fungus retroposons, and in some of them two RRM domains were also identified. One of them, ppAcap87 (predicted protein, EDN10987.1, Ajellomyces capsulatus NAm1), has nearly 40% homology with R1 ORF1p even in the N-terminal domain showing the highest variability in all homologous proteins.

N-Terminal Sequences of ORF1p Are Involved in Protein–Protein Interactions

The N-terminal sequences of ORF1 proteins of retroposons are the most variable, and many of them contain regions of low complexity. For instance, at the very beginning of the R1 ORF1p sequence there is a region rich in proline and glutamine (amino acid residues 1–130). The role of these regions is not clear so far. It is known only that there is a group of proteins having regions of low complexity. These are special splicing factors forming a three-component RES complex required for pre-mRNA retention in the nucleus during splicing [24]. The initial part of R1 ORF1p has homology with one such splicing factor from the human louse Pediculus humanus corporis (GenBank: EEB11155.1), which additionally contains a coiled-coil domain (cc) and a conserved domain of unknown function (DUF566). The analysis of the secondary structure of the N-terminal sequences corresponding to amino acid residues 130–210 of R1 ORF1p and its homologs clearly reveals supercoiled regions, consisting of α-helices, that are characteristic of cc domains and take part in homo- and heteromultimerization of proteins and in interactions with DNA and other proteins [25]. It is known that ORF1 proteins of mammalian LINE-1 retrotransposons contain cc domains required for trimerization of these proteins, which provides their correct functioning [7, 23]. Extended cc domains resembling leucine zipper motifs were also found in Gag proteins of the telomeric HeT-A and TART retroposons of Drosophila [26]. However, the extraordinary variability of cc domains, which consist of repetitive sequences of 7–35 amino acids, makes it difficult to conclude that they perform a common function in different types of proteins. Similarly, in ORF1p of insects and fungi we also identified cc domains characteristic of fibrinogen, myosin, tropomyosin, actinin, vinculin, and spectrin, as well as leucine zipper repeats and other α-helix motifs [27–29]. The protein-binding domain of R1 ORF1p is structurally homologous to the cc domain of the geminin protein [30]. Geminin binds to Cdt1 (a replication factor mediating a single act of replication), a component of the replication complex, and prevents reinitiation of replication in a region where it has already stopped. The inhibitory effect of geminin consists in preventing interactions between Cdt1 and MCM helicase [31]. Thus, the structure of the cc domain determines which cell components the protein interacts with and in which compartments, and this eventually determines its function. In Fig. 1, the amino acid sequences corresponding to the cc domains of R1 ORF1p and its homologs are underlined.

[Fig. 2 image. Alignment panels: RRM R1 (Model CBS) with 2GO9_A; RRM R1 (Model 5, I-TASSER) with 1FXL_A. 3D model panel with RRM1 and RRM2 labeled.]

Fig. 2. The structure of two RRM domains of ORF1p R1 Bg. (a) Comparison of the amino acid sequences of two 3D models of ORF1p R1 Bg with the RRM domains of the S. cerevisiae splicing factor Prp24 (PDB code 2GO9_A) and the human ELAV-like RNA-binding protein HuD (PDB code 1FXL_A). Capital letters designate homologous amino acid residues, and bold letters designate identical amino acid residues. (b) 3D model of two RRM domains of ORF1p R1 Bg constructed using the templates of the two proteins shown in (a). Black arrows indicate β-sheets, α-helices are shown in grey, and the backbone of the RRM domains is shown in white.

It should be noted that the arrangement of domains is not analogous to that in R1 ORF1p in all Gag proteins of insect retroposons. Some of them contain RRM domains in the N-terminal regions, zinc finger domains in the middle, and cc domains in the C-terminal regions. Such an arrangement is characteristic of Tras3Api (similar to TRAS3, XP_001942622.1, Acyrthosiphon pisum) and pgNv (similar to gag-like protein, XP_001600233.1, Nasonia vitripennis) and supports the hypothesis of the modular origin of retroposons [8, 32].

Evolutionary Relationships of Gag Proteins of Insect and Fungus Retroposons

The methods of classifying retroposons on the basis of amino acid differences in the conserved domains of reverse transcriptase and endonuclease, located either in the second reading frame or in the middle or at the end of one common frame for all domains, are well developed [2, 3]. In contrast, the domains of Gag proteins are poorly studied, and Gag proteins themselves are very diverse and lack sufficient homology even among closely related species [6, 23, 33]. Therefore, practically no phylogenetic studies of ORF1 proteins of retroposons have been carried out. As mentioned above, the PSI-BLAST search for homology with R1 ORF1p revealed, in addition to a great number of homologous Gag proteins of insects (especially Tribolium castaneum), many hypothetical protein sequences of fungi that also have significant homology with ORF1p. In this work, we attempted to analyze the phylogenetic relations and construct a tree of Gag proteins of insects and fungi on the basis of the RRM domain of R1 ORF1p described above. For each homologous sequence, we analyzed the secondary structure of the region homologous to the RRM domain of R1 ORF1p and selected those having the β1α1β2β3α2β4 topology.
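The selection step described above can be illustrated with a minimal sketch. It assumes that the secondary structure of each candidate region is available as a per-residue string in which H denotes α-helix, E denotes β-strand, and any other character denotes coil. This input format, the function names, and the minimum run length of three residues are assumptions made for illustration and are not taken from the analysis reported here.

```python
import re

# Expected order of elements in an RRM fold: β1 α1 β2 β3 α2 β4
# ("b" = β-strand, "a" = α-helix).
RRM_TOPOLOGY = ["b", "a", "b", "b", "a", "b"]


def element_order(ss, min_len=3):
    """Collapse a per-residue secondary-structure string into element order.

    ss uses 'H' for helix and 'E' for strand; any other character is coil
    (assumed format). Runs shorter than min_len residues are ignored.
    """
    order = []
    for match in re.finditer(r"H+|E+", ss):
        run = match.group()
        if len(run) >= min_len:
            order.append("a" if run[0] == "H" else "b")
    return order


def has_rrm_topology(ss, min_len=3):
    """True if the elements occur exactly in the β1α1β2β3α2β4 order."""
    return element_order(ss, min_len) == RRM_TOPOLOGY
```

Applied to a set of predicted structures, `has_rrm_topology` keeps only the regions whose helices and strands follow the RRM order; regions with extra or missing elements are rejected, so the threshold `min_len` would need tuning against known RRM domains before any real use.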

Sequences aligned in the ClustalW program and corresponding to the RRM domains of the homologs found were used for phylogenetic analysis. Trees were constructed by the neighbor-joining method separately for insects and fungi. When using the conserved region containing RRM domains of insects, we selected the part of the common tree that formed a clear-cut cluster with R1 ORF1p. The cluster included only a few Gag proteins of insects. This is probably because we searched for homology only with the protein of the German cockroach, whereas insects close to this species are absent from the databases. A statistically reliable cladogram was constructed for selected sequences of fungi belonging to the ascomycetes (Talaromyces stipitatus, Coccidioides immitis, Chaetomium globosum, Alternaria alternata, Neurospora crassa, and Ajellomyces capsulatus) and basidiomycetes (Laccaria bicolor). The cladogram consists of three clusters that all include the Tad retroposon of Neurospora crassa (glNc*). Using the cluster of the most similar sequences of Gag proteins of insects and the cladogram of fungi as templates, a common phylogenetic tree was constructed after alignment (Fig. 3). All branches of the tree have statistically significant bootstrap support, and the insect sequences form a cluster with the three most closely related Gag proteins of fungi, from which they seem to originate.
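For readers unfamiliar with the neighbor-joining method, the following is a toy sketch of its core loop, not the software used to build the trees reported here. It assumes that a symmetric matrix of pairwise distances between the aligned RRM regions is already available as a nested dictionary; the function names and the input format are hypothetical, branch lengths are not reported, and bootstrap resampling is omitted.

```python
from itertools import combinations


def neighbor_joining(dist):
    """Build an unrooted tree topology from pairwise distances.

    dist: {taxon: {other_taxon: distance}}, symmetric (assumed input format).
    Returns nested tuples such as (("a", "b"), ("c", "d")).
    """
    d = {a: dict(row) for a, row in dist.items()}  # working copy
    nodes = list(d)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of every node from all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        new = (i, j)
        d[new] = {}
        for k in nodes:
            if k != i and k != j:
                # Distance from the new internal node to each remaining node.
                d[new][k] = d[k][new] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k != i and k != j] + [new]
    return tuple(nodes)


def clades(tree):
    """Return the leaf groups (as frozensets) defined by a nested-tuple tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found
```

Passing the result of `neighbor_joining` to `clades` lists the groupings that a cladogram displays. Assessing how reliable each grouping is would additionally require repeating the procedure on bootstrap resamples of the alignment columns and counting how often each group recurs.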

It should be noted that, in addition to the R1 and TRAS retroelements, the cluster of Gag sequences of insects includes two Gag proteins that do not belong to the R1 clade of retroposons. The gagDv protein (gag protein, AAQ75091.1, Drosophila virilis) belongs to the telomeric HeT-A retroposons of Drosophila [26, 34], and the existence of the gkDw34 protein (GK24896, EDW77234.1, Drosophila willistoni) is predicted from the open reading frame [35, 36]. Drosophila HeT-A retroposons are known to have one reading frame encoding only a Gag protein that lacks the reverse transcriptase domain and plays the role of a chaperone for transcripts encoded by another telomeric retroposon, TAHRE [37]. The gkDw34 protein is also located in a region having no open reading frames encoding reverse transcriptase in the immediate neighborhood (according to the FlyBase database), but it forms a cluster with R1 ORF1p of Drosophila. The data obtained suggest that ORF1 proteins of the R1 and TRAS retroposons originated independently of their second reading frame encoding the RT domain, but the frames were brought together in another evolutionary event.

The methods of computer analysis used in this work can be applied to search for numerous novel, structurally related Gag-like proteins of insects and fungi for subsequent elucidation of their functions.

ACKNOWLEDGMENTS

The work was supported by the Russian Foundation for Basic Research (grant no. 08-04-01402-a) and the Program of the Russian Academy of Sciences “Biological Diversity” (subprogram “Gene Pools and Genetic Diversity”).

REFERENCES

1. Eickbush, T.H. and Jamburuthugoda, V.K., The Diversity of Retrotransposons and the Properties of Their Reverse Transcriptases, Virus Res., 2008, vol. 134, nos. 1–2, pp. 221–234.

2. Kapitonov, V.V., Tempel, S., and Jurka, J., Simple and Fast Classification of Non-LTR Retrotransposons Based on Phylogeny of Their RT Domain Protein Sequences, Gene, 2009, vol. 448, no. 2, pp. 207–213.

3. Kojima, K.K. and Fujiwara, H., Cross-Genome Screening of Novel Sequence-Specific Non-LTR Retrotransposons: Various Multicopy RNA Genes and Satellites Are Selected as Targets, Mol. Biol. Evol., 2004, vol. 21, no. 2, pp. 207–217.

[Tree image not reproduced. Leaf labels: R1Dm, gkDw34, ORF1R1Bg, tras3Bm, ORF1Bm, R1ORF1Bm, gagDv, 4transpfT*, 3Cg*, 9Cg*, hpCg76*, hpCg86*, ppLbi1*, ppLbi2*, ppAcap87*, ppAcap1*, ppCim*, gagAal*, glAalt*, glNc*. Bootstrap values at the nodes range from 52 to 100.]

Fig. 3. Phylogenetic relations of Gag proteins of insect and fungus retroposons. The tree was constructed by the neighbor-joining method using conserved regions of two RRM domains of the proteins under study. Asterisks indicate Gag proteins of fungi. The numerals at the nodes designate the bootstrap indices for 1000 replicates. Full names and GenBank numbers of the sequences are given in the section Materials and Methods.

4. Kojima, K.K. and Fujiwara, H., Evolution of Target Specificity in R1 Clade Non-LTR Retrotransposons, Mol. Biol. Evol., 2003, vol. 20, no. 3, pp. 351–361.

5. Takahashi, H., Okazaki, S., and Fujiwara, H., A New Family of Site-Specific Retrotransposons, SART1, Is Inserted into Telomeric Repeats of the Silkworm, Bombyx mori, Nucl. Acids Res., 1997, vol. 25, no. 8, pp. 1578–1584.

6. Matsumoto, T., Hamada, M., Osanai, M., and Fujiwara, H., Essential Domains for Ribonucleoprotein Complex Formation Required for Retrotransposition of Telomere-Specific Non-Long Terminal Repeat Retrotransposon SART1, Mol. Cell. Biol., 2006, vol. 26, no. 13, pp. 5168–5179.

7. Khazina, E. and Weichenrieder, O., Non-LTR Retrotransposons Encode Noncanonical RRM Domains in Their First Open Reading Frame, Proc. Natl. Acad. Sci. USA, 2009, vol. 106, no. 3, pp. 731–736.

8. Heitkam, T. and Schmidt, T., BNR - a LINE Family from Beta vulgaris - Contains a RRM Domain in Open Reading Frame 1 and Defines a L1 Sub-Clade Present in Diverse Plant Genomes, Plant J., 2009, vol. 59, no. 6, pp. 872–882.

9. Maris, C., Dominguez, C., and Allain, F.H.T., The RNA Recognition Motif, a Plastic RNA-Binding Platform to Regulate Post-Transcriptional Gene Expression, FEBS J., 2005, vol. 272, no. 9, pp. 2118–2131.

10. Lunde, B.M., Moore, C., and Varani, G., RNA-Binding Proteins: Modular Design for Efficient Function, Nat. Rev. Mol. Cell Biol., 2007, vol. 8, no. 6, pp. 479–490.

11. Chen, Y. and Varani, G., Protein Families and RNA Recognition, FEBS J., 2005, vol. 272, no. 9, pp. 2088–2097.

12. Söding, J., Biegert, A., and Lupas, A.N., The HHpred Interactive Server for Protein Homology Detection and Structure Prediction, Nucl. Acids Res., 2005, vol. 33, pp. W244–W248.

13. Kagramanova, A.S., Kapelinskaya, T.V., Korolev, A.L., and Mukha, D.V., R1 and R2 Retrotransposons of German Cockroach Blattella germanica: A Comparative Study of 5'-Truncated Copies Integrated into the Genome, Mol. Biol. (Moscow), 2007, vol. 41, no. 4, pp. 546–553.

14. Tamura, K., Dudley, J., Nei, M., and Kumar, S., MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0, Mol. Biol. Evol., 2007, vol. 24, no. 8, pp. 1596–1599.

15. Zhang, Y., I-TASSER Server for Protein 3D Structure Prediction, BMC Bioinform., 2008, vol. 9, p. 40.

16. Inoue, K., Mizuno, T., Wada, K., and Hagiwara, M., Novel RING Finger Proteins, Air1p and Air2p, Interact with Hmt1p and Inhibit the Arginine Methylation of Npl3p, J. Biol. Chem., 2000, vol. 275, no. 42, pp. 32793–32799.

17. Sickmier, E.A., Frato, K.E., Shen, H., et al., Structural Basis for Polypyrimidine Tract Recognition by the Essential Pre-mRNA Splicing Factor U2AF65, Mol. Cell, 2006, vol. 23, no. 1, pp. 49–59.

18. Sawicka, K., Bushell, M., Spriggs, K.A., and Willis, A.E., Polypyrimidine-Tract-Binding Protein: A Multifunctional RNA-Binding Protein, Biochem. Soc. Trans., 2008, vol. 36, no. 4, pp. 641–647.

19. Allain, F.H., Bouvet, P., Dieckmann, T., and Feigon, J., Molecular Basis of Sequence-Specific Recognition of Pre-Ribosomal RNA by Nucleolin, EMBO J., 2000, vol. 19, no. 24, pp. 6870–6881.

20. Bae, E., Reiter, N.J., Bingman, C.A., et al., Structure and Interactions of the First Three RNA Recognition Motifs of Splicing Factor prp24, J. Mol. Biol., 2007, vol. 367, no. 5, pp. 1447–1458.

21. Bolognani, F., Contente-Cuomo, T., and Perrone-Bizzozero, N.I., Novel Recognition Motifs and Biological Functions of the RNA-Binding Protein HuD Revealed by Genome-Wide Identification of Its Targets, Nucl. Acids Res., 2010, vol. 38, no. 1, pp. 117–130.

22. Gibrat, J.F., Madej, T., and Bryant, S.H., Surprising Similarities in Structure Comparison, Curr. Opin. Struct. Biol., 1996, vol. 6, no. 3, pp. 377–385.

23. Martin, S.L., The ORF1 Protein Encoded by LINE-1: Structure and Function during L1 Retrotransposition, J. Biomed. Biotechnol., 2006, vol. 2006, pp. 45621–45626.

24. Dziembowski, A., Ventura, A.-P., Rutz, B., et al., Proteomic Analysis Identifies a New Complex Required for Nuclear Pre-mRNA Retention and Splicing, EMBO J., 2004, vol. 23, no. 24, pp. 4847–4856.

25. Grigoryan, G. and Keating, A.E., Structural Specificity in Coiled-Coil Interactions, Curr. Opin. Struct. Biol., 2008, vol. 18, no. 4, pp. 477–483.

26. Casacuberta, E. and Pardue, M.-L., HeT-A Elements in Drosophila virilis: Retrotransposon Telomeres Are Conserved across the Drosophila Genus, Proc. Natl. Acad. Sci. USA, 2003, vol. 100, no. 24, pp. 14091–14096.

27. Preker, P.J. and Keller, W., The HAT Helix, a Repetitive Motif Implicated in RNA Processing, Trends Biochem. Sci., 1998, vol. 23, no. 1, pp. 15–16.

28. Blatch, G.L. and Lässle, M., The Tetratricopeptide Repeat: A Structural Motif Mediating Protein–Protein Interactions, BioEssays, 1999, vol. 21, no. 11, pp. 932–939.

29. Kobe, B. and Deisenhofer, J., The Leucine-Rich Repeat: A Versatile Binding Motif, Trends Biochem. Sci., 1994, vol. 19, no. 10, pp. 415–421.

30. Lee, C., Hong, B.S., Choi, J.M., et al., Structural Basis for Inhibition of the Replication Licensing Factor Cdt1 by Geminin, Nature, 2004, vol. 430, no. 7002, pp. 913–917.

31. Maiorano, D., Rul, W., and Méchali, M., Cell Cycle Regulation of the Licensing Activity of Cdt1 in Xenopus laevis, Exp. Cell Res., 2004, vol. 295, no. 1, pp. 138–149.

32. Nefedova, L.N. and Kim, A.I., Molecular Evolution of Mobile Elements of the gypsy Group: A Homolog of the gag Gene in Drosophila, Russ. J. Genet., 2009, vol. 45, no. 1, pp. 23–29.

33. Kroutter, E.N., Belancio, V.P., Wagstaff, B.J., and Roy-Engel, A.M., The RNA Polymerase Dictates ORF1 Requirement and Timing of LINE and SINE Retrotransposition, PLoS Genet., 2009, vol. 5, no. 4, p. e1000458.

34. Rashkova, S., Athanasiadis, A., and Pardue, M.-L., Intracellular Targeting of Gag Proteins of the Drosophila Telomeric Retrotransposons, J. Virol., 2003, vol. 77, no. 11, pp. 6376–6384.

35. Zimin, A.V., Smith, D.R., Sutton, G., and Yorke, J.A., Assembly Reconciliation, Bioinformatics, 2008, vol. 24, no. 1, pp. 42–45.

36. Clark, A.G., Eisen, M.B., et al., Evolution of Genes and Genomes on the Drosophila Phylogeny, Nature, 2007, vol. 450, no. 7167, pp. 203–218.

37. Villasante, A., Abad, J.P., Planell, R., et al., Drosophila Telomeric Retrotransposons Derived from an Ancestral Element That Was Recruited to Replace Telomerase, Genome Res., 2007, vol. 17, no. 12, pp. 1909–1918.