

ORIGINAL ARTICLE

Analysis of base and codon usage by rubella virus

Yumei Zhou • Xianfeng Chen •

Hiroshi Ushijima • Teryl K. Frey

Received: 17 November 2011 / Accepted: 24 December 2011 / Published online: 10 February 2012

© Springer-Verlag 2012

Abstract Rubella virus (RUBV), a small, plus-strand RNA virus that is an important human pathogen, has the unique feature that the GC content of its genome (70%) is the highest (by 20%) among RNA viruses. To determine the effect of this GC content on genomic evolution, base and codon usage were analyzed across viruses from eight diverse genotypes of RUBV. Despite differences in frequency of codon use, the favored codons in the RUBV genome matched those in the human genome for 18 of the 20 amino acids, indicating adaptation to the host. Although usage patterns were conserved in corresponding genes in the diverse genotypes, within-genome comparison revealed that both base and codon usages varied regionally, particularly in the hypervariable region (HVR) of the P150 replicase gene. While directional mutation pressure was predominant in determining base and codon usage within most of the genome (with the strongest tendency being towards C’s at third codon positions), natural selection was predominant in the HVR region. The GC content of this region was the highest in the genome (>80%), and it was not clear if selection at the nucleotide level accompanied selection at the amino acid level. Dinucleotide frequency analysis of the RUBV genome revealed that TpA usage was lower than expected, similar to mammalian genes; however, CpG usage was not suppressed, and TpG usage was not enhanced, as is the case in mammalian genes.

Introduction

Due to the redundancy of the genetic code, the nucleotide composition of a genome can significantly influence its evolution, resulting in, among other things, unique utilization patterns of synonymous codons [18]. In contrast to bacteria, C. elegans, yeast, and Drosophila, extensive studies of base composition and codon usage in viral genomes have not been conducted, although the pattern of base and codon usage is of significance in viral pathogenesis [11], vaccine design [20] and phylogenetic reconstruction [36]. Base composition and codon usage among both DNA and RNA viruses have been found to be non-random and species-specific. While the factors shaping these biases vary among viruses, nucleotide composition constraint, or directional mutation pressure, has been identified as the major factor in governing codon preferences [25, 45]. For example, retroviral RNA genomes have a biased nucleotide composition depending on the genus. The RNA of human immunodeficiency virus (HIV) is enriched with adenine, while the genome of human T cell leukemia virus (HTLV) is cytosine-rich [6], and these biased nucleotide compositions constrain codon usage in the genomes of these viruses [7]. It has been suggested that nucleotide composition bias is caused by directional mutation pressure associated with the specific viral reverse transcriptases [37, 39]. Codon usage bias determined by directional mutation pressure has also been indicated in studies of human DNA viruses [45], RNA viruses [19, 25] and plant viruses [1].

Electronic supplementary material: The online version of this article (doi:10.1007/s00705-012-1243-9) contains supplementary material, which is available to authorized users.

Y. Zhou · X. Chen · T. K. Frey (✉)
Department of Biology, Georgia State University, Atlanta, GA 30303, USA
e-mail: [email protected]

Present Address: X. Chen
Gastroenteritis and Respiratory Virus Laboratory Branch, Division of Viral Diseases, Centers for Disease Control and Prevention (CDC), Atlanta, GA 30333, USA

H. Ushijima
Department of Developmental Medical Sciences, Institute of International Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan

Present Address: H. Ushijima
Division of Microbiology, Department of Pathology and Microbiology, School of Medicine, Nihon University, 30-1 Oyaguchi Kamicho, Itabashi-ku, Tokyo 173-8610, Japan

Arch Virol (2012) 157:889–899
DOI 10.1007/s00705-012-1243-9

On the other hand, translational efficiency may also play an important role in codon choice among virus genomes [3, 50]. For example, codon usage in the envelope protein gene of HIV does not match with codon usage among highly expressed human genes, resulting in poor expression of the envelope protein. Optimizing the envelope gene codon usage to the codon preference of human genes resulted in an increase in envelope gene expression [20]. Similar tendencies have been documented in human papillomaviruses (HPV) [5] and poxviruses [4]. Conversely, deoptimizing the native viral codon preferences in poliovirus reduced gene expression and attenuated virulence (and has thus been suggested as a strategy for developing attenuated vaccines) [13, 41]. While matching codons with the availability of cellular tRNA allows for a virus to achieve the highest levels of gene expression, using non-preferred codons could result in less gene expression but help the virus establish persistent infection.

Rubella virus (RUBV), an enveloped, single-stranded, positive-sense RNA virus, is the sole member of the genus Rubivirus in the family Togaviridae. The ~10-kb genome contains two long open reading frames (ORFs). The 5′ ORF or nonstructural protein ORF (NS-ORF) encodes two proteins, P150 and P90, that function in RNA replication, and the 3′ ORF or structural protein ORF (SP-ORF) encodes the three proteins that comprise the virus particle: the capsid protein, CP, and two envelope glycoproteins, E1 and E2 [17]. The genome of RUBV maintains the highest genomic GC content (70%) among RNA viruses, the next highest GC content of the genome of an RNA virus being ~50% [17]. In one region of the RUBV genome, the hypervariable region (HVR) in the middle of P150 gene, the GC content is greater than 80% [61]. Based on partial sequences of the E1 gene, thirteen genotypes of RUBV have been distinguished [54]. Despite the importance of base composition and codon usage as indicators of evolutionary forces playing on a virus, little is known about these parameters across genotypes of a single virus. RUBV only has a single serotype, and humans are the sole host, making it an ideal model for studying the effect of intrinsic evolutionary factors on base and codon usage in the absence of antigenic diversity or adaptation to infection of multiple hosts. In this study, we have analyzed the base and codon usage across 19 genomes of RUBV representing eight genotypes, examining genes and domains individually and in comparison with human genes.

Materials and methods

Genome sequences

The 19 RUBV genome sequences that are available were used in this study. These sequences represented eight out of 13 genotypes [61]. Sequences from 11 genomic coding regions (P150; P150: MT; P150: HVR; P150: XD; P150: NP; P90; P90: HEL; P90: RdRp; C; E2; E1; see Fig. 1) were extracted using the Pileup program in the GCG package (Genetics Computer Group, version 11.0, Accelrys Inc.). Selected coding sequences were edited and aligned using ClustalW software version 1.8 [21].

Base and codon usage analysis

Two parameters, relative synonymous codon usage (RSCU) and effective number of codons (Nc), were used to measure the codon usage biases of all genomic coding regions. RSCU is defined as the ratio of the observed frequency of a codon to its expected frequency assuming equal use of synonymous codons [46, 56, 57]. The RSCU value of the jth codon for the ith amino acid is estimated using the equation

$$\mathrm{RSCU}_{ij} = \frac{\mathrm{obs}_{ij}}{\dfrac{1}{n_i}\sum_{j=1}^{n_i}\mathrm{obs}_{ij}}$$

where $\mathrm{obs}_{ij}$ is the observed number of the jth codon for the ith amino acid with $n_i$-fold synonymous codons.
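As an illustration of the RSCU definition above, the following is a minimal Python sketch, not the DAMBE implementation used in the study. It assumes the standard genetic code and an in-frame coding sequence; the names `CODON_TABLE` and `rscu` are hypothetical and introduced only for this example.

```python
from collections import Counter

# Standard genetic code in TCAG order (assumed; not specified in the text).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def rscu(seq):
    """Return {codon: RSCU} for an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the sequence
        expected = total / len(family)  # equal use of synonymous codons
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Three GCC codons and one GCG codon (all Ala, a four-codon family).
example = rscu("GCCGCCGCCGCG")
print(example["GCC"], example["GCG"], example["GCA"])
```

Within one amino acid family the RSCU values sum to the number of synonymous codons, so a value above 1 marks a codon used more often than expected under equal usage.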

Fig. 1 Genomic coding regions analyzed. A schematic diagram of the RUBV genome is shown with the two ORFs as boxes and untranslated regions as lines. Within each ORF, the boundaries of the individual genes are shown, and within the P150 and P90 genes, the relative positions of domains are indicated. Specific coordinates (in nt): P150, 41-3943; MT (methyl/guanylyl-transferase), 230-439; HVR (hypervariable region), 2120-2440; XD (X domain; ADP-ribose binding domain), 2492-2998; NP (nonstructural protease), 3041-3940; P90, 3944-6391; HEL (helicase), 4043-4798; RDRP (RNA-dependent RNA polymerase), 4826-6388; C, 6512-7411; E2, 7412-8257; E1, 8258-9703
[Figure labels: NS-ORF: P150 (MT, HVR, XD, NP), P90 (HEL, RDRP); SP-ORF: C, E2, E1]

Nc is a parameter that directly reflects the absolute synonymous codon usage bias of a given gene [14]. The Nc value ranges from 20, when only one codon is used for each amino acid (highly biased), to 61, when all codons are used equally (unbiased). The Nc value is calculated using the equation

$$N_c = 2 + \frac{9}{F_2} + \frac{1}{F_3} + \frac{5}{F_4} + \frac{3}{F_6}$$

where $F_i$ is the average homozygosity of the i-fold amino acids (i = 2, 3, 4 and 6).
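The Nc equation can be sketched in Python as follows. This is a hedged illustration, not the CodonW code used in the study. Two things are assumptions not stated in the text: the per-amino-acid homozygosity is estimated as F = (n Σp² − 1)/(n − 1), the estimator from Wright's original description of Nc, and values above 61 are capped at 61. The function and table names are hypothetical.

```python
from collections import Counter, defaultdict

# Standard genetic code in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def effective_number_of_codons(seq):
    """Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6 for an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    families = defaultdict(list)  # amino acid -> synonymous codons
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    f_by_class = defaultdict(list)  # degeneracy class -> F values
    for family in families.values():
        k = len(family)
        n = sum(counts[c] for c in family)
        if k == 1 or n < 2:
            continue  # Met/Trp, or too few observations to estimate F
        sum_p2 = sum((counts[c] / n) ** 2 for c in family)
        f_by_class[k].append((n * sum_p2 - 1) / (n - 1))

    mean_f = {}
    for k in (2, 3, 4, 6):
        values = f_by_class.get(k, [])
        if not values or sum(values) <= 0:
            raise ValueError(f"cannot estimate F{k} from this sequence")
        mean_f[k] = sum(values) / len(values)

    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(nc, 61.0)  # cap at the theoretical maximum (assumption)
```

The coefficients 9, 1, 5 and 3 are the numbers of amino acids with two, three, four and six synonymous codons in the standard code, and the leading 2 accounts for Met and Trp. Short regions such as the 70-codon MT domain may lack enough observations per amino acid, in which case this sketch raises an error rather than guessing.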

The expected value of Nc under the null model, in which the GC bias at silent sites is entirely due to mutation bias, is given by the equation

$$N_c = 2 + s + \frac{29}{s^2 + (1 - s)^2}$$

where s is the frequency of GC at the synonymous third position of the codon (GC3s) [55].
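The null-model curve is simple enough to evaluate directly. The short Python sketch below implements the expected-Nc equation as written; the function name `expected_nc` is a hypothetical label chosen for this example.

```python
def expected_nc(gc3s):
    """Expected Nc when codon bias is due to GC3s (mutation bias) alone."""
    s = gc3s
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


# Evaluate the curve at a few GC3s values, including one in the RUBV range.
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"GC3s = {s:.1f}  expected Nc = {expected_nc(s):.1f}")
```

At GC3s = 0.5 the expectation is 60.5, and it falls to about 45.4 at GC3s = 0.8, so an observed Nc well below the curve at a given GC3s points to bias beyond what composition alone would produce.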

GC3s, the GC content at the third codon position of synonymous codons, was used to indicate the base composition bias of a gene [25]. To quantify the usage of each nucleotide at synonymous third codon positions, four indices, A3s, T3s, C3s and G3s, were calculated. Each index was defined as a proportion of the maximum usage the individual nucleotide could have without altering the amino acid composition (the ratio of the observed codon frequency of the nucleotide at a synonymous third codon position to the maximum potential codon frequency of the nucleotide at a synonymous third codon position). It should be pointed out here that these indices are not directly comparable with GC3s, although they are correlated with GC3s.
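A minimal Python sketch of GC3s follows. It assumes the usual convention that codons for Met and Trp (no synonymous alternatives) and stop codons are excluded; the text does not spell this out, and CodonW's own implementation may differ in detail. The names `CODON_TABLE` and `gc3s` are hypothetical.

```python
# Standard genetic code in TCAG order (assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def gc3s(seq):
    """Fraction of synonymous codons carrying G or C at the third position."""
    seq = seq.upper().replace("U", "T")
    total = gc = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None or aa in "*MW":
            continue  # ambiguous, stop, Met or Trp codon: not synonymous
        total += 1
        if codon[2] in "GC":
            gc += 1
    if total == 0:
        raise ValueError("no synonymous codons in sequence")
    return gc / total


# GCC, GCG and GCA count (two end in G/C); ATG and TGG are skipped.
print(gc3s("GCCGCGGCAATGTGG"))
```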

Two software programs, DAMBE [56, 57] and CodonW (win3.2, contributed by John Peden, available at http://www.molbiol.ox.ac.uk/cu), were used to estimate these parameters. Codon frequencies and RSCU were calculated using DAMBE software. A3s, T3s, C3s, G3s, GC3s and Nc were tabulated using CodonW software. Base composition at individual codon positions and over all codon positions was calculated using the program TREE-PUZZLE version 5.2 [44]. To examine the extent to which base mutation bias accounts for the observed codon usage bias, Nc was plotted against GC3s, G3s and C3s, and the correlations between GC3s and GC content at the first (GC1), second (GC2) and overall (GC) codon positions of all genomic coding regions were assessed by linear regression analysis using Microsoft Excel in Windows XP.
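For readers who want to reproduce the regression step outside a spreadsheet, the coefficient of determination of a simple linear regression can be computed as below. This is a generic sketch and not the authors' Excel workflow; the function name `r_squared` and the sample values are made up for illustration only.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line y = a + b*x (squared Pearson r)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with >= 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a series with zero variance has no defined R^2")
    return sxy ** 2 / (sxx * syy)


# Illustrative values only (not data from the study).
print(r_squared([0, 1, 2, 3], [0, 1, 0, 1]))
```

For a single predictor, R² equals the square of the correlation coefficient, so the same function serves for the Nc vs GC3s plots and for the GC vs GC3s comparisons.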

Dinucleotide frequency is another important factor that can affect codon usage. In this study, the relative abundance values of 16 dinucleotides in each genomic coding region were estimated using the method described by Karlin and Burge [27]. The average relative abundance value of each dinucleotide was determined from the odds ratio over three codon site combinations (p12, p23, p31) using the equation

$$\rho_{xy} = \frac{3\left[f_{xy}(1,2) + f_{xy}(2,3) + f_{xy}(3,1)\right]}{\left[f_x(1) + f_x(2) + f_x(3)\right]\left[f_y(1) + f_y(2) + f_y(3)\right]}$$

where $f_{xy}$ denotes the frequency of the dinucleotide XY in the sequence under study, and $f_x$ and $f_y$ denote the frequency of nucleotide X and Y, respectively. From data simulations and statistical theory, $\rho_{xy}$ ≤ 0.78 and ≥ 1.23 represent extreme underrepresentation and extreme overrepresentation, respectively, for sufficiently long (>20 kb) random sequences with a probability of 0.001 at most for virtually any base composition. The observed nucleotide and dinucleotide frequencies of a given gene were determined using DAMBE software.
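The codon-site odds ratio can be sketched in Python as follows. This is an illustration under stated assumptions rather than the DAMBE procedure: f_xy(1,2) and f_xy(2,3) are taken as the fraction of codons with XY at those positions, f_xy(3,1) as the fraction of adjacent codon pairs with X at position 3 and Y at position 1 of the next codon, and f_x(k) as the fraction of codons with X at position k. The function name is hypothetical.

```python
def dinucleotide_odds_ratio(seq, x, y):
    """Relative abundance of dinucleotide XY over codon sites 1-2, 2-3, 3-1."""
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    n = len(codons)
    if n < 2:
        raise ValueError("need at least two codons")

    f12 = sum(c[0] == x and c[1] == y for c in codons) / n
    f23 = sum(c[1] == x and c[2] == y for c in codons) / n
    # Position 3 of one codon followed by position 1 of the next codon.
    f31 = sum(a[2] == x and b[0] == y
              for a, b in zip(codons, codons[1:])) / (n - 1)

    # f_x(1) + f_x(2) + f_x(3), and the same sum for y.
    fx = sum(c[k] == x for c in codons for k in range(3)) / n
    fy = sum(c[k] == y for c in codons for k in range(3)) / n
    if fx == 0 or fy == 0:
        raise ValueError("nucleotide absent from sequence")

    return 3 * (f12 + f23 + f31) / (fx * fy)


# Toy sequence in which C is always followed by G.
print(dinucleotide_odds_ratio("ACGTACGTACGT", "C", "G"))
```

Against the thresholds quoted above, a value at or below 0.78 would flag a dinucleotide as strongly underrepresented and a value at or above 1.23 as strongly overrepresented, although those cut-offs were derived for much longer sequences than individual RUBV genes.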

Results

Codon usage pattern and bias

The RUBV genome was subdivided into eleven genomic coding regions ([61], see Fig. 1) for analysis. A similar codon usage pattern was found in each genomic coding region among the 19 virus genomes analyzed, indicating no significant codon usage differences among diverse genotypes of rubella viruses (data not shown). Therefore, this study focused on examining the variations of codon usage among these regions within the RUBV genome.

The codon frequencies and relative synonymous codon usage (RSCU) values of the 11 regions calculated from the 19 genomic sequences are summarized in Supplementary Table S1. Codon bias was evident, as roughly 40% of the codons in each region had an RSCU value >1 and predictably reflected the high GC content of the RUBV genome. For example, among Ala codons (GCN), which were the most frequently used across the genome, GCC was the most preferred (RSCU = 1.43-2.31). Among the six synonymous codons for Arg, which were frequent in the NSP-ORF and C gene, 70% were CGC, and the RSCU value of this codon was the highest among all codons, ranging from 2.98 to 5.02 (average = 4.04) among the genomic coding regions. In contrast, AAN codons (Lys and Asn) were the most infrequent (frequencies <2.9% of the total codons).

The codon preference within each genomic coding region is listed in Table 1. With four exceptions (MT and HVR, both in P150, C and E2), the preferred codon for each amino acid was the same in all of the genomic regions. All of the preferred codons ended in C or G, with 83.3% (15/18) of them ending in C. However, the base usage in the first and second positions of the preferred codons was not GC-prone, especially in the second codon position, where the usage of A (39%, 7/18) was more frequent than that of C (17%, 3/18) or G (22%, 4/18). In order to compare the codon preferences of RUBV with those of its host, the codon preferences for a compilation of human genes are also presented in Table 1. Surprisingly, despite the high GC content of the rubella virus genome, 15 out of 18 preferred codons were the same between the virus and its host, although the frequency of usage of each codon in toto varied widely between the viral and human genes.
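The position tallies quoted above can be checked against the "Genomic regions" column of Table 1. The short Python sketch below transcribes that column and counts the bases at the second and third codon positions; the dictionary name `PREFERRED` is a hypothetical label for this example.

```python
from collections import Counter

# Preferred codons from the "Genomic regions" column of Table 1.
PREFERRED = {
    "Ala": "GCC", "Cys": "UGC", "Asp": "GAC", "Glu": "GAG", "Phe": "UUC",
    "Gly": "GGC", "His": "CAC", "Ile": "AUC", "Lys": "AAG", "Leu": "CUC",
    "Asn": "AAC", "Pro": "CCC", "Gln": "CAG", "Arg": "CGC", "Ser": "AGC",
    "Thr": "ACC", "Val": "GUC", "Tyr": "UAC",
}

second = Counter(codon[1] for codon in PREFERRED.values())
third = Counter(codon[2] for codon in PREFERRED.values())
total = len(PREFERRED)

for base in "ACGU":
    print(f"{base}: 2nd position {second[base]}/{total} "
          f"({100 * second[base] / total:.1f}%), "
          f"3rd position {third[base]}/{total} "
          f"({100 * third[base] / total:.1f}%)")
```

The counts give 15 of 18 preferred codons ending in C and the remaining three in G, with A at the second position in 7 of 18, which agrees with the percentages in the text.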


Base composition at codon positions

The base compositions at the three codon positions across the 11 genomic coding regions are summarized in Table 2. Except for the HVR, the GC content ranged from 66.2% (E1) to 74.2% (XD), with the frequency of C (36.4-41.7%) being greater than that of G (28.1-33.4%). The frequencies of A (11.5-18.2%) and T (12.7-18%) were roughly equal. With an exceptionally high frequency of C (49%) and low frequencies of A (11%) and T (8%), the HVR had a GC content of 81.1%. Except for the HVR, at the third codon position, the base usage was highly biased toward C (48.7-60.9%), resulting in a GC content at this position with a range of 74.9-85.1%. At the second codon position, unlike the extreme bias toward C at the third codon position, the usage of the four nucleotides was more equally distributed, with higher relative frequencies of both A (16.8-26.3%) and T (15.6-25.7%) and an overall GC content of 48.4-62.6%. At the first codon position, instead of C, G was the preferred nucleotide, with a frequency range of 33.1-46.8% and an overall GC content of 63.9-77%. Overall, outside of the HVR, the relative nucleotide percentages in the regions of the rubella virus genome were GC3 > GC1 > GC2 (C3 > C1 = C2 and G1 > G3 = G2). Interestingly, although the nucleotide frequencies differed markedly between RUBV and the compilation of human genes, these tendencies were similar to those of the human genes, namely GC3 > GC1 > GC2 (C3 > C1 = C2 and G1 > G3 > G2). In contrast to the rest of the RUBV genome, within the HVR, the nucleotide frequency was similar among the three codon positions, with GC1 (84.6%) > GC3 (81.4%) > GC2 (77.4%), for an overall GC content of 81.1%. C and G were roughly equal at the first codon position (43.2% and 41.4%, respectively), but C was greater at both the second (56.9% vs 20.5%) and third codon positions (47% vs 34.4%). Thus, within the HVR, the usage tendencies of GC were GC1 > GC3 > GC2, C2 > C3 > C1 and G1 > G3 > G2, at variance with other regions of the genome.
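Position-specific base composition of the kind reported here can be computed with a few lines of Python. The sketch below is a generic illustration, not the TREE-PUZZLE routine used in the study; the function names are hypothetical and the sequence is a toy example.

```python
def base_composition_by_position(seq):
    """Return [pos1, pos2, pos3], each a dict of base fractions (A, C, G, T)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    if usable == 0:
        raise ValueError("sequence shorter than one codon")
    composition = []
    for k in range(3):
        column = seq[k:usable:3]  # every base at codon position k + 1
        composition.append({b: column.count(b) / len(column) for b in "ACGT"})
    return composition


def gc_content(fractions):
    """GC fraction from a dict of base fractions."""
    return fractions["G"] + fractions["C"]


# Toy example with codons GCC, GCG and ATC.
comp = base_composition_by_position("GCCGCGATC")
print([round(gc_content(pos), 3) for pos in comp])
```

Comparing the three GC values returned for a region gives orderings such as GC3 > GC1 > GC2, and the per-base fractions allow the same comparison for C and G individually.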

Correlation between codon usage bias and GC content

To address codon usage bias quantitatively within the coding regions of the genome, a number of parameters were calculated for each region of the 19 genomes analyzed in the study. As summarized in Table 3, the N3s index (the ratio of the actual usage of a nucleotide at the third codon position to its maximal potential usage without changing the amino acid sequence) indicated a bias gradient of C3s (0.525-0.722) > G3s (0.236-0.373) > T3s (0.068-0.203) > A3s (0.049-0.172), which was reflective of the nucleotide frequencies for the third codon position shown in Table 2. In the HVR, C3s (0.471-0.524) and G3s (0.362-0.417) were more similar, as were T3s (0.06-0.118) and A3s (0.084-0.128), than in other regions of the genome. Calculations of Nc, the effective number of codons, revealed a somewhat greater bias in regions of the NSP-ORF (29-40) than regions of the SP-ORF (35-45). There was a large range in Nc values in the HVR, 31 to 51 with a standard deviation of 4.95, although this was not due to GC3s, the GC content at the third codon position of synonymous codons, which was 0.78-0.85 compared with 0.69-0.88 in other regions of the genome (the range of total GC content was 0.79-0.83 in the HVR vs 0.65-0.75 among other genomic regions). As shown in Figure 2, linear regression analysis revealed a positive correlation between GC and GC3s (R² = 0.302) for all regions of the genome except for the HVR, for which no correlation was found (R² = 0.0503). Additionally, GC3s was also shown to correlate with GC1 (R² = 0.2856) but not with GC2 (R² = 0.0029) (data not shown). Figure 3 displays the relationship between GC3s and Nc. With the exception of the HVR, the Nc of all of the genomic regions exhibited a significant negative correlation with GC3s (R² = 0.5928), while the GC3s of the HVR did not correlate with its Nc (R² = 0.1307). In order to examine whether the individual frequencies of C and G were related to the codon usage bias of the RUBV genome, the correlations of Nc with C3s and G3s were determined. As shown in Figure 4, Nc strongly negatively correlated with C3s (R² = 0.3536), but not with G3s (R² = 0.0775). In the HVR, there was no significant negative correlation between Nc and either C3s (R² = 0.1648) or G3s (R² = 0.0446).

Table 1 Codon preferences within RUBV genomic coding regions and human genes

AA   Genomic regions*  P150: MT*  P150: HVR*  SP: C*  SP: E2*   Human**
Ala  GCC               ***        GCG         ***     ***       ***
Cys  UGC               ***        ***         ***     ***       ***
Asp  GAC               ***/GAU†   ***         ***     ***       ***
Glu  GAG               ***        ***         ***     ***       ***
Phe  UUC               ***        ***         ***     ***       ***
Gly  GGC               ***        ***         ***     ***       ***
His  CAC               ***        ***         ***     ***       ***
Ile  AUC               ***        AUU         ***     ***/AUA†  ***
Lys  AAG               AAA        ***         ***     ***       ***
Leu  CUC               ***        CUG         ***     CUG       CUG
Asn  AAC               ***        ***         ***     ***       ***
Pro  CCC               CCA        ***/CCG†    CCG     ***       ***
Gln  CAG               ***        ***         ***     ***       ***
Arg  CGC               ***        ***         ***     ***       ***/AGG†
Ser  AGC               ***        UCG         UCC     ***       ***
Thr  ACC               ***/ACG†   ***         ***     ***       ***
Val  GUC               ***        ***         ***     ***       GUG
Tyr  UAC               UAU        ***         ***     ***       ***

*The preferred codon for each amino acid was calculated for each of the coding regions of the RUBV genome using the 19 sequences. These were the same for P150, XD, NP, P90, HEL, RdRp, and E1 and have been included together under the heading "Genomic regions"
**Preferred codons for human genes as compiled by Frey [17]. The same usage pattern was reported in references [34] and [58]
***Same preferred codons as in genomic regions
†Used equally for the two codons

When comparing the actual plot of Nc vs GC3s with the plot expected under the null model in which codon usage bias is entirely due to mutation pressure, it was evident that, with the exception of the HVR, the plot of Nc vs GC3s for genomic regions lay just below and paralleled the expected curve, indicating that the codon usage bias in these regions was mainly influenced by the genomic GC composition, i.e., mutation pressure. However, the plot of Nc vs GC3s of the HVR intersected the expected curve. Significantly, the finding of no correlation between Nc and GC3s within the HVR indicated that this region of the genome was under different selective pressure.

Dinucleotide frequencies

Dinucleotide usage is another important factor that shapes codon usage in genes. The relative dinucleotide abundance value over three codon positions (p12, p23 and p31) of 16 dinucleotides was computed. Well-known tendencies among mammalian genes are underrepresentation of TpA and CpG usage and overuse of TpG. Therefore, the values of CpG, TpG and TpA in all genomic regions were examined first. As displayed in Table 4, CpG and TpG frequencies over all genomic regions were not consistent with those of the host, as the values of CpG over all regions ranged between 0.7955 and 1.2428, while TpG was overrepresented in P150, E1, and E2 (values = 1.2620-1.4649) but not enhanced in the P90 and C genes (values = 1.1210-1.1638). In contrast, similar to the tendencies in human genes, TpA was underrepresented in all genomic regions, with a low value range of 0.3765-0.7970 (a slightly higher value [0.8458] was observed in HEL). Additionally, it was noted that 11 out of 16 dinucleotides were underrepresented in the MT domain of P150. Analysis of the relative abundance of the remainder of the dinucleotides did not reveal any tendencies of interest.

Table 2 Base composition (%) at codon positions of RUBV genomic coding regions

Codon position  Gene              GC         A          C          G          T
1st             Genomic regions*  63.9–77    12.3–19.8  27.1–39.4  33.1–46.8  8.9–18.6
1st             HVR*              84.6       8.7        43.2       41.4       6.7
1st             Human**           54         28         23         31         18
2nd             Genomic regions   48.4–62.6  16.8–26.3  26.9–35.6  20.7–31.3  15.6–25.7
2nd             HVR*              77.4       15.1       56.9       20.5       7.6
2nd             Human**           40         32         22         18         28
3rd             Genomic regions*  74.9–85.1  4.8–13.7   48.7–60.9  21.4–30.3  9.4–14.3
3rd             HVR*              81.4       9.1        47         34.4       9.6
3rd             Human**           63         17         34         29         20
All             Genomic regions*  66.2–74.2  11.5–18.2  36.4–41.7  28.1–33.4  12.7–18
All             HVR*              81.1       11         49         32.1       8
All             Human**           52         25         26         26         22

*The base composition at each codon position in each of the genomic coding regions was calculated from the 19 genomic sequences. The parameters for 10 of these regions (P150, MT, XD, NP, P90, HEL, RdRp, C, E2, and E1) are grouped together and presented as ranges, while the parameters for the HVR are presented separately
**Base usages in human genes as compiled by Frey [17]. The same usage pattern was reported in references [34] and [58]

Discussion

In this study, base composition and codon usage among and within RUBV genomes were analyzed. RUBV is an interesting model for such studies both because it comprises a single serotype and infects a single host (humans), freeing it from evolutionary pressures faced by other viruses with multiple serotypes and hosts, and because it possesses a GC content that is uniquely high among RNA viruses. While such a GC content would seemingly be at odds with that of its host, we found that despite marked differences in proportional codon usage, the preferred RUBV codon matched the preferred human codon for 18 out of the 20 amino acids, indicating adaptation to its host.

We found that codon usage was similar in homologous genes among the genotypes analyzed, corresponding to findings in studies on other viruses, such as phylogenetically close nucleopolyhedroviruses [35], human and non-human AT-rich papillomaviruses [58], and influenza A viruses [60]. While the RUBV genotypes are only recognizable by sequencing, they are in a dynamic state of flux worldwide for unknown reasons [53]. While part of the ebb and flow of genotypes is due to vaccination programs, the spread of other genotypes is not explained by vaccine pressure.

Table 3 Codon usage indices calculated for genomic coding regions in the RUBV genome

Genomic regions*  T3s          C3s          A3s          G3s          Nc           GC3s         GC           L aa**
P150              0.108-0.127  0.583-0.605  0.085-0.109  0.330-0.351  38.29-40.33  0.795-0.827  0.711-0.720  1301
P150:MT           0.068-0.203  0.525-0.644  0.138-0.172  0.255-0.309  29.06-39.05  0.687-0.806  0.662-0.695  70
P150:HVR          0.060-0.118  0.471-0.524  0.084-0.128  0.362-0.417  31.67-50.94  0.776-0.850  0.794-0.829  107
P150:XD           0.077-0.144  0.595-0.641  0.050-0.087  0.316-0.358  34.06-39.05  0.818-0.880  0.728-0.753  169
P150:NP           0.083-0.130  0.605-0.652  0.065-0.121  0.286-0.333  36.21-40.22  0.783-0.860  0.711-0.736  300
P90               0.116-0.146  0.634-0.670  0.062-0.086  0.314-0.345  34.84-38.31  0.804-0.849  0.669-0.680  815
P90:HEL           0.101-0.141  0.686-0.722  0.074-0.126  0.236-0.287  33.46-36.74  0.792-0.852  0.660-0.680  252
P90:RdRp          0.122-0.162  0.606-0.651  0.049-0.099  0.326-0.373  34.89-39.89  0.786-0.850  0.661-0.678  521
SP:C              0.112-0.155  0.581-0.634  0.068-0.117  0.297-0.339  35.07-42.01  0.764-0.823  0.717-0.736  300
SP:E2             0.111-0.165  0.563-0.632  0.054-0.083  0.302-0.342  36.82-44.78  0.784-0.839  0.697-0.720  282
SP:E1             0.139-0.195  0.558-0.603  0.082-0.113  0.314-0.353  39.97-44.09  0.752-0.810  0.653-0.670  481

*Parameters were calculated from the indicated genomic regions using the 19 genomic sequences
**Length of the genomic region in amino acids

Fig. 2 Correlation between overall GC content (GC) and that at synonymous third codon positions (GC3s) within a genomic region. GC vs GC3s is plotted for each genomic region of the 19 genomic sequences employed in this study. The HVR points are shown as pink squares, while the points from the other 10 genomic regions are shown as blue diamonds (color figure online)

Fig. 3 The effective number of codons (Nc) as a function of the synonymous third codon GC content (GC3s) within a genomic region. Nc vs GC3s is plotted for each genomic region of the 19 genomic sequences employed in this study. The continuous curve indicates the expected Nc vs GC3s plot under the null model in which the codon usage is completely constrained by GC content. The HVR points are shown as pink squares, while the points from the other 10 genomic regions are shown as blue diamonds (color figure online)

Interesting differences were revealed in the intra-genomic comparisons. Within coding regions, the GC content varied, with a range of 66.2-81.1%. With one exception, the HVR (the perpetual outlier), GC content over the three codon positions was GC3 > GC1 > GC2, and GC3s correlated with overall GC and GC1 positively and with Nc negatively. When comparing the actual plot of Nc against GC3s with the expected curve generated by the null model, the plot lay just below the expected curve, suggesting that directional mutation pressure was the major factor driving the codon choices for most of the genome. That mutation pressure had a greater effect on the codon usage than natural selection was further supported by the finding that NSP genes had greater codon usage bias than SP genes in the RUBV genome (Nc NSP vs SP = 29-40 vs 35-45). The expression level of NSP genes is much lower than that of SP genes during virus replication, and if natural selection were at play, the codon usage of highly expressed genes would usually be much more biased than that of less-expressed genes; the opposite was observed here. Directional mutation pressure as the main factor for codon usage choice has also been observed in many human RNA viruses [25] and DNA viruses [45] as well as in mammalian species [52]. The high mutation rates and large population sizes exhibited by RNA viruses possibly make the natural selection constraints on codon choice inefficient, with the result that RNA viruses have a low codon usage bias, and their codon usage reflects more of the underlying mutation pressures (an inverse relationship between nucleotide substitution rate and codon usage bias) [25, 43, 45]. Accordingly, with a high codon usage bias, RUBV is expected to have a low nucleotide substitution rate. In fact, a comprehensive study on substitution rates in 50 RNA viruses

Fig. 4 Correlation between the effective number of codons (Nc) and the G and C content at synonymous third codon positions (G3s/C3s) within a genomic region. Nc vs G3s or C3s is plotted for each genomic region of the nineteen genomic sequences employed in this study. The HVR points are shown as pink squares (G3s) or pink triangles (C3s), while the points from the other 10 genomic regions are shown as blue squares (G3s) or blue triangles (C3s) (color figure online)

Table 4 Relative abundance of the 16 dinucleotides (di-nt) in RUBV genomic coding regions

        Genomic region
di-nt*  P150      P150:MT   P150:HVR  P150:XD   P150:NP   P90       P90:HEL   P90:RdRp  SP:C      SP:E1     SP:E2
AA      0.9663    0.7039    1.2301    0.7890    0.9566    0.8298    0.8603    0.7868    1.1124    0.9251    0.8171
AC      1.0730    0.7069    1.1546    1.0596    1.0530    1.1063    1.1744    0.9897    1.0078    1.2429    1.2445
AG      0.9004    0.7648    0.6672    0.8156    0.9540    0.8656    0.7410    0.8951    0.9399    0.8522    0.5494
AU      1.0563    0.8003    1.0730    1.0556    0.9999    1.1731    1.1953    1.1146    0.9928    0.8084    1.0166
CA      1.0518    0.7758    0.8981    1.1126    1.0791    1.0566    1.0468    1.0312    1.0271    1.1793    1.1693
CC      0.9068    0.6563    0.8793    0.7385    0.9535    0.8383    0.7462    0.8421    0.9158    0.9559    0.8803
CG      1.0981    0.7955    1.2428    1.0951    1.0079    1.1286    1.2132    1.0268    1.0498    0.8968    0.9005
CU      0.9824    0.7712    0.8246    1.0652    1.0175    1.0770    1.1931    0.9404    1.1090    1.0902    1.0102
GA      1.0897    0.7413    1.2317    0.9453    1.0734    1.1776    1.1088    1.1349    1.0757    0.9368    0.8297
GC      1.1610    0.8203    1.1302    1.1485    1.1126    1.1575    1.2370    1.0749    1.1265    1.0515    1.0584
GG      0.7680    0.6283    0.6768    0.6769    0.7987    0.8243    0.7875    0.7763    0.9120    0.9713    0.9002
GU      0.9953    0.6615    1.1823    0.9688    1.0856    0.7698    0.6669    0.7836    0.7138    0.9748    0.8422
UA      0.6783    0.5335    0.3765    0.6095    0.6371    0.7141    0.8458    0.5781    0.5375    0.7970    0.6937
UC      0.8243    0.8168    1.0055    0.9516    0.8118    0.9695    1.0032    0.9044    0.9577    0.7783    0.6690
UG      1.3461    0.7678    1.1438    1.2620    1.4815    1.1638    1.1206    1.1268    1.1210    1.3619    1.4649
UU      1.0047    1.0744    1.2446    0.4750    0.7438    1.0680    0.8977    1.0934    1.3508    1.0169    0.9215

*The dinucleotide frequency at all three codon positions (p12, p23, p31) was calculated from the 19 genomic sequences and expressed as the proportion of the expected frequency given the GC content of the genomic region. The values that are deeply depressed (≤0.78) or overrepresented (≥1.23) are in boldface type
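A relative abundance of the kind tabulated in Table 4 can be sketched as an observed/expected ratio. The footnote does not spell out how the expected frequency was derived, so the code below assumes the common odds-ratio form, in which the expected frequency of a dinucleotide XY is the product of the frequencies of X and Y in the same region (the measure described by Karlin and Burge [27]). The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def dinucleotide_relative_abundance(seq: str) -> dict:
    """Observed/expected ratio for each of the 16 dinucleotides.

    Expected frequency of XY is assumed to be f(X) * f(Y), with base
    frequencies taken from the same sequence. Keys use U rather than T,
    matching the row labels of Table 4.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    mono = Counter(seq)
    # Overlapping dinucleotides, i.e. all codon position pairs (p12, p23, p31).
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x in "ACGU":
        for y in "ACGU":
            expected = (mono[x] / n) * (mono[y] / n)
            observed = di[x + y] / (n - 1)
            ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


# Hypothetical toy sequence; a real analysis would use a full genomic region.
rho = dinucleotide_relative_abundance("ACGTACGT")
depressed = sorted(k for k, v in rho.items() if v <= 0.78)
print(rho["UA"], depressed[:3])
```

With the thresholds from the footnote, values at or below 0.78 would be flagged as depressed and values at or above 1.23 as overrepresented.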


documented that RUBV did exhibit a lower substitution rate than other RNA viruses [24]. Although a significant pattern of low substitution rates along with decreased codon usage bias has been observed in members of the family Togaviridae (most of which are vector-borne viruses) [24], it was not surprising that RUBV, a non-arthropod-borne virus with a single host, combines a low substitution rate with a high codon usage bias. Also, increased bias in segmented and aerosol-transmitted RNA viruses (RUBV is an aerosol-transmitted virus) has been reported [25], providing evidence that structural features of the viral genome and the ecology of viruses contribute to the codon choice of viruses [24].

Within the HVR, however, the overall relationships among GC, GC1, GC3 and Nc suggested that this region is subjected primarily to natural selection rather than to mutation pressure. In this regard, previous studies on RUBV indicated that the HVR was subjected to positive evolutionary pressure [22, 61]. It is well established that natural selection can change genomic base content and play a role in altering the most intricate pattern of codon usage [2, 8, 15, 23, 38, 40, 48, 49]. Within the RUBV NS-ORF, the HVR resides within the recently defined Q domain. The function of the Q domain is unknown, and it was mapped through the curious observation that while deletions of this region render RUBV constructs nonviable, these deletions can be rescued by the capsid protein [51]. The region encoded by the HVR is high in proline and includes PxxP and PxxPxR motifs known to be important in protein-protein interaction [28]. Therefore, the selective pressure brought to bear on this region may relate to intracellular contacts the region mediates during the RUBV replication cycle. While it is difficult to imagine that these contacts would involve different cell proteins throughout the human population, it is possible that the plasticity of this region is necessary to accommodate the same set of proteins across different genetic backgrounds. Since the HVR exhibits the highest substitution rate in the RUBV genome [61], it would be expected to have the lowest codon usage bias, since there is usually a negative correlation between codon usage bias and mutation rate (i.e., because codon bias reduces the rate of synonymous substitutions) [10, 47]. Instead, the HVR exhibited a wide range of Nc values among diverse genotypes (Nc = 32-51). This finding does not contradict the expectation, because the HVR features substitutions at nonsynonymous sites that were apparently driven by natural selection that countered coding bias driven by mutation pressure. It should be pointed out that the HVR has the highest GC content in the whole genome (>80%), and it is possible that some attribute of such a high GC content (e.g., stability of secondary structure) may also be favored by natural selection, particularly in a region that tolerates plasticity in its amino acid content.

Regarding the basis of the directional mutation pressure, although the RUBV genome is rich in both C and G, the distribution of these two nucleotides over the three codon positions clearly differed. Specifically, G was higher in the first codon position (i.e., G1 > G3 > G2) in all of the genomic regions, including the HVR, while C predominated at the third codon position (i.e., C3 > C1 = C2) in all of the genomic regions except the HVR (C2 > C3 > C1). These data indicated first that irrespective of whether the genomic region was subjected to directional mutation pressure or natural selection, C was the primary base that influenced codon choice in the RUBV genome. Secondly, G base usage in the RUBV genome was governed to a greater extent by natural selection than was C base usage. The stronger negative correlation between Nc and C3s than between Nc and G3s confirmed these indications. The observation of natural selection influencing G base usage at the first two codon positions in a genome in which the codon usage was primarily governed by mutation pressure was also documented in bacteria with highly skewed (H. influenzae and M. tuberculosis) and unskewed (E. coli) base compositions [42]. In these studies, G was found to be preferred in the first codon position of highly expressed genes, irrespective of their overall GC content [42, 55]. This was not the case in the RUBV genome, as the more highly expressed ORF, the SP-ORF, exhibited the same tendencies as the less expressed ORF, the NS-ORF. Interestingly, an analysis performed on codon usage of 16,654 human genes and 129 human virus genomes showed that all C-ending and G-ending codons were preferred in highly expressed genes [31]. While the basis of the directional mutation pressure on the RUBV genome is not known, we infer that it is due to enzyme-induced hypermutations (i.e., characteristics of the RNA-dependent RNA polymerase encoded by the virus), although the formation of stable local RNA structures may also play a role, as has been suggested for retroviruses [29, 37]. Intriguingly, it was recently suggested that the RUBV genome RNA contains a microRNA precursor in the NS-ORF that suppresses the human APOBEC1 editing enzyme (which catalyzes G-to-A and C-to-U transitions), resulting in the GC content bias in RUBV genomes [30]. However, this remains to be confirmed experimentally.

Analysis of relative dinucleotide abundance in the RUBV genome revealed that CpG was used at the expected frequency (the HVR had an especially high relative CpG abundance, with a value of 1.2428), and TpG was enhanced in the P150, E1 and E2 genes but used as expected in the P90 and C genes. This pattern differs from that of host genes. These data are consistent with a previous analysis of eukaryotic viruses (RNA and DNA), which demonstrated that CpG was deeply suppressed and TpG was overrepresented in viruses with small genomes, with the exception of


four togaviruses, including RUBV [25, 26, 45]. The other three togaviruses were alphaviruses with a genomic GC content of ~53%, and thus the CpG usage by members of this family is independent of the genomic GC content. CpG is important in chromatin structure and immune signaling through TLR9 in mammalian species [32, 33, 59]; however, why either would be a factor in CpG usage by small RNA viruses is not clear. On the other hand, depressed usage of TpA was found in the RUBV genome, as has been reported for mammalian genes and both DNA and RNA viruses. Properties selecting against TpA usage include RNase susceptibility [9], low thermal stability [12, 16] and high TA content in stop codons [9].

Finally, the tendencies in base composition and codon usage in the RUBV genome can be summarized in an analysis of codon usage for Arg. Arginine is abundant in both the NS-ORF (12.8% in the MT domain) and the CP gene (13.4%) in the SP-ORF. The abundance of arginine is accompanied by an exclusion of lysine (AAG and AAA) (<2.9% of total amino acid usage). Furthermore, within the six synonymous codons for Arg, 70% were CGC. In fact, the RSCU value of this codon was the highest among all codons (2.98-5.02 across the genomic regions, average = 4.04). Concomitantly, the usage of the two AGR codons was deeply suppressed; e.g., eight out of eleven genomic regions scored an RSCU of zero for AGA. In contrast, in the genomes of large viruses with unsuppressed relative abundances of CpG, no bias in codon choice for Arg was present, while in the genomes of most small viruses (DNA and RNA) with a CpG deficiency, Arg usage was normal but biased towards AGR [26, 45]. Thus, the directional mutation pressure towards C, and to a lesser extent G in RUBV, appears to be driving amino acid as well as codon choice.
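The RSCU values quoted for Arg can be illustrated with a short calculation. RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so a codon taking 70% of a six-codon family scores 0.70 x 6 = 4.2, close to the reported average of 4.04. The codon counts below are hypothetical values chosen only to reproduce a 70% share for CGC; they are not data from the study, and the function name `rscu` is likewise an assumption.

```python
ARG_CODONS = ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"]


def rscu(codon_counts: dict, synonymous_codons: list) -> dict:
    """Relative synonymous codon usage for one amino acid family.

    RSCU = observed count / mean count over the synonymous codons.
    """
    total = sum(codon_counts.get(c, 0) for c in synonymous_codons)
    if total == 0:
        return {c: 0.0 for c in synonymous_codons}
    mean = total / len(synonymous_codons)
    return {c: codon_counts.get(c, 0) / mean for c in synonymous_codons}


# Hypothetical Arg codon counts for one genomic region (CGC = 70 of 100).
counts = {"CGU": 10, "CGC": 70, "CGA": 5, "CGG": 12, "AGA": 0, "AGG": 3}
values = rscu(counts, ARG_CODONS)
print(round(values["CGC"], 2), values["AGA"])
```

An RSCU of 1 indicates no preference, and the values within a family sum to the number of synonymous codons (six for Arg), so a single codon near 4 leaves little usage for the remaining five.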

Acknowledgments This research was supported by a grant from

NIH (AI21389).

References

1. Adams MJ, Antoniw JF (2004) Codon usage bias amongst plant

viruses. Arch Virol 149:113–135

2. Akashi H, Kliman RM, Eyre-Walker A (1998) Mutation pressure, natural selection, and the evolution of base composition in Drosophila. Genetica 102–103:49–60

3. Andre S, Seed B, Eberle J, Schraut W, Bultmann A, Haas J

(1998) Increased immune response elicited by DNA vaccination

with a synthetic gp120 sequence with optimized codon usage.

J Virol 72:1497–1503

4. Barrett JW, Sun Y, Nazarian SH, Belsito TA, Brunetti CR,

McFadden G (2006) Optimization of codon usage of poxvirus

genes allows for improved transient expression in mammalian

cells. Virus Genes 33:15–26

5. Baud D, Ponci F, Bobst M, De Grandi P, Nardelli-Haefliger D

(2004) Improved efficiency of a Salmonella-based vaccine

against human papillomavirus type 16 virus-like particles

achieved by using a codon-optimized version of L1. J Virol

78:12901–12909

6. Berkhout B, van Hemert FJ (1994) The unusual nucleotide con-

tent of the HIV RNA genome results in a biased amino acid

composition of HIV proteins. Nucleic Acids Res 22:1705–1711

7. Berkhout B, Grigoriev A, Bakker M, Lukashov VV (2002) Codon

and amino acid usage in retroviral genomes is consistent with

virus-specific nucleotide pressure. AIDS Res Hum Retroviruses

18:133–141

8. Bernardi G (1995) The human genome: organization and evolu-

tionary history. Annu Rev Genet 29:445–476

9. Beutler E, Gelbart T, Han JH, Koziol JA, Beutler B (1989)

Evolution of the genome and the genetic code: selection at the

dinucleotide level by methylation and polyribonucleotide cleav-

age. Proc Natl Acad Sci USA 86:192–196

10. Bierne N, Eyre-Walker A (2003) The problem of counting sites in

the estimation of the synonymous and nonsynonymous substitu-

tion rates: implications for the correlation between the synony-

mous substitution rate and codon usage bias. Genetics 165:1587–

1597

11. Bosch ML, Andeweg AC, Schipper R, Kenter M (1994) Insertion

of N-linked glycosylation sites in the variable regions of the

human immunodeficiency virus type 1 surface glycoprotein

through AAT triplet reiteration. J Virol 68:7566–7569

12. Breslauer KJ, Frank R, Blocker H, Marky LA (1986) Predicting

DNA duplex stability from the base sequence. Proc Natl Acad Sci

USA 83:3746–3750

13. Burns CC, Shaw J, Campagnoli R, Jorba J, Vincent A, Quay J,

Kew O (2006) Modulation of poliovirus replicative fitness in

HeLa cells by deoptimization of synonymous codon usage in the

capsid region. J Virol 80:3259–3272

14. Comeron JM, Aguade M (1998) An evaluation of measures of

synonymous codon usage bias. J Mol Evol 47:268–274

15. Comeron JM, Kreitman M, Aguade M (1999) Natural selection

on synonymous sites is correlated with gene length and recom-

bination in Drosophila. Genetics 151:239–249

16. Delcourt SG, Blake RD (1991) Stacking energies in DNA. J Biol

Chem 266:15160–15169

17. Frey TK (1994) Molecular biology of rubella virus. Adv Virus

Res 44:69–160

18. Grantham R, Gautier C, Gouy M, Mercier R, Pave A (1980)

Codon catalog usage and the genome hypothesis. Nucleic Acids

Res 8:r49–r62

19. Gu W, Zhou T, Ma J, Sun X, Lu Z (2004) Analysis of synony-

mous codon usage in SARS Coronavirus and other viruses in the

Nidovirales. Virus Res 101:155–161

20. Haas J, Park EC, Seed B (1996) Codon usage limitation in the

expression of HIV-1 envelope glycoprotein. Curr Biol 6:315–324

21. Henikoff S, Henikoff JG (1994) Position-based sequence weights.

J Mol Biol 243:574–578

22. Hofmann J, Renz M, Meyer S, von Haeseler A, Liebert UG

(2003) Phylogenetic analysis of rubella virus including new

genotype I isolates. Virus Res 96:123–128

23. Hughes S, Zelus D, Mouchiroud D (1999) Warm-blooded iso-

chore structure in Nile crocodile and turtle. Mol Biol Evol

16:1521–1527

24. Jenkins GM, Rambaut A, Pybus OG, Holmes EC (2002) Rates of

molecular evolution in RNA viruses: a quantitative phylogenetic

analysis. J Mol Evol 54:156–165

25. Jenkins GM, Holmes EC (2003) The extent of codon usage bias

in human RNA viruses and its evolutionary origin. Virus Res

92:1–7

26. Karlin S, Doerfler W, Cardon LR (1994) Why is CpG suppressed

in the genomes of virtually all small eukaryotic viruses but not in

those of large eukaryotic viruses? J Virol 68:2889–2897


27. Karlin S, Burge C (1995) Dinucleotide relative abundance

extremes: a genomic signature. Trends Genet 11:283–290

28. Kay BK, Williamson MP, Sudol M (2000) The importance of

being proline: the interaction of proline-rich motifs in signaling

proteins with their cognate domains. Faseb J 14:231–241

29. Keating CP, Hill MK, Hawkes DJ, Smyth RP, Isel C, Le SY,

Palmenberg AC, Marshall JA, Marquet R, Nabel GJ, Mak J

(2009) The A-rich RNA sequences of HIV-1 pol are important for

the synthesis of viral cDNA. Nucleic Acids Res 37:945–956

30. Khrustalev VV, Barkovsky EV (2011) Unusual nucleotide con-

tent of Rubella virus genome as a consequence of biased RNA-

editing: comparison with Alphaviruses. Int J Bioinform Res Appl

7:82–100

31. Kliman RM, Bernal CA (2005) Unusual usage of AGG and TTG

codons in humans and their viruses. Gene 352:92–99

32. Krug A, Luker GD, Barchet W, Leib DA, Akira S, Colonna M

(2004) Herpes simplex virus type 1 activates murine natural

interferon-producing cells through toll-like receptor 9. Blood

103:1433–1437

33. Kundu TK, Rao MR (1999) CpG islands in chromatin organi-

zation and gene expression. J Biochem 125:217–222

34. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC,

Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R,

Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J,

LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP,

Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos

R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N,

Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R,

Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A,

Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R,

French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt

A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S,

Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen

R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson

JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH,

Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL,

Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL,

Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P,

Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen

A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny

DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives

CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS,

Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M,

Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y,

Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F,

Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR,

Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM,

Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S,

Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen

L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP,

Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J,

Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Mino-

shima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F,

Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la

Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G,

Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S,

Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC,

Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler

EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y,

Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS,

Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P,

Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A,

Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP,

Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski

J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler

R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F,

Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos

A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H,

Choi S, Chen YJ (2001) Initial sequencing and analysis of the

human genome. Nature 409:860–921

35. Levin DB, Whittome B (2000) Codon usage in nucleopolyhed-

roviruses. J General Virol 81:2313–2325

36. Lockhart PJ, Steel MA, Hendy MD, Penny D (1994) Recovering

evolutionary trees under a more realistic model of sequence

evolution. Mol Biol Evol 11:605–612

37. Mansky LM, Temin HM (1995) Lower in vivo mutation rate of

human immunodeficiency virus type 1 than that predicted from

the fidelity of purified reverse transcriptase. J Virol 69:5087–5094

38. McEwan CE, Gatherer D, McEwan NR (1998) Nitrogen-fixing

aerobic bacteria have higher genomic GC content than non-fixing

species within the same genus. Hereditas 128:173–178

39. Menendez-Arias L (2002) Molecular basis of fidelity of DNA

synthesis and nucleotide specificity of retroviral reverse tran-

scriptases. Prog Nucleic Acid Res Mol Biol 71:91–147

40. Mooers AO, Holmes EC (2000) The evolution of base compo-

sition and phylogenetic inference. Trends Ecol Evol 15:365–369

41. Mueller S, Papamichail D, Coleman JR, Skiena S, Wimmer E

(2006) Reduction of the rate of poliovirus protein synthesis

through large-scale codon deoptimization causes attenuation

of viral virulence by lowering specific infectivity. J Virol 80:

9687–9696

42. Pan A, Dutta C, Das J (1998) Codon usage in highly expressed genes

of Haemophillus influenzae and Mycobacterium tuberculosis:

translational selection versus mutational bias. Gene 215:405–413

43. Powell JR, Moriyama EN (1997) Evolution of codon usage bias

in Drosophila. Proc Natl Acad Sci USA 94:7784–7790

44. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002)

TREE-PUZZLE: maximum likelihood phylogenetic analysis

using quartets and parallel computing. Bioinformatics (Oxford,

England) 18:502–504

45. Shackelton LA, Parrish CR, Holmes EC (2006) Evolutionary

basis of codon usage and nucleotide composition bias in verte-

brate DNA viruses. J Mol Evol 62:551–563

46. Sharp PM, Li WH (1986) An evolutionary perspective on syn-

onymous codon usage in unicellular organisms. J Mol Evol

24:28–38

47. Sharp PM, Li WH (1987) The rate of synonymous substitution in

enterobacterial genes is inversely related to codon usage bias.

Mol Biol Evol 4:222–230

48. Sharp PM, Matassi G (1994) Codon usage and genome evolution.

Curr Opin Genet Dev 4:851–860

49. Singer CE, Ames BN (1970) Sunlight ultraviolet and bacterial

DNA base ratios. Science 170:822–825

50. Smith DW (1996) Problems of translating heterologous genes

in expression systems: the role of tRNA. Biotechnol Prog 12:

417–422

51. Tzeng WP, Frey TK (2009) Functional replacement of a domain

in the rubella virus p150 replicase protein by the virus capsid

protein. J Virol 83:3549–3555

52. Urrutia AO, Hurst LD (2001) Codon usage bias covaries with

expression breadth and the rate of synonymous evolution in

humans, but this is not evidence for selection. Genetics

159:1191–1199

53. WHO (2007) Update of standard nomenclature for wild-type

rubella viruses, 2007. Wkly Epidemiol Rec 82:216–222

54. WHO (2007) Update of standard nomenclature for wild-type

rubella viruses, 2007. Releve epidemiologique hebdomadaire/

Section d’hygiene du Secretariat de la Societe des Nations =

Weekly epidemiological record/Health Section of the Secretariat

of the League of Nations 82:216–222

55. Wright F (1990) The ‘effective number of codons’ used in a gene.

Gene 87:23–29


56. Xia X (2000) Factors Affecting Codon Frequencies. Data Anal-

ysis in Molecular Biology and Evolution. Kluwer Academic

Publishers, pp 59–105

57. Xia X, Xie Z (2001) DAMBE: software package for data analysis

in molecular biology and evolution. J Hered 92:371–373

58. Zhao KN, Liu WJ, Frazer IH (2003) Codon usage bias and A + T content variation in human papillomavirus genomes. Virus Res 98:95–104

59. Zheng M, Klinman DM, Gierynska M, Rouse BT (2002) DNA

containing CpG motifs induces angiogenesis. Proc Natl Acad Sci

USA 99:8944–8949

60. Zhou T, Gu W, Ma J, Sun X, Lu Z (2005) Analysis of synony-

mous codon usage in H5N1 virus and other influenza A viruses.

Bio Syst 81:77–86

61. Zhou Y, Ushijima H, Frey TK (2007) Genomic analysis of

diverse rubella virus genotypes. J General Virol 88:932–941
