ORIGINAL ARTICLE
Analysis of base and codon usage by rubella virus
Yumei Zhou • Xianfeng Chen •
Hiroshi Ushijima • Teryl K. Frey
Received: 17 November 2011 / Accepted: 24 December 2011 / Published online: 10 February 2012
© Springer-Verlag 2012
Abstract Rubella virus (RUBV), a small, plus-strand
RNA virus that is an important human pathogen, has the
unique feature that the GC content of its genome (70%) is
the highest (by 20%) among RNA viruses. To determine
the effect of this GC content on genomic evolution, base
and codon usage were analyzed across viruses from eight
diverse genotypes of RUBV. Despite differences in fre-
quency of codon use, the favored codons in the RUBV
genome matched those in the human genome for 18 of the
20 amino acids, indicating adaptation to the host. Although
usage patterns were conserved in corresponding genes in
the diverse genotypes, within-genome comparison revealed
that both base and codon usages varied regionally,
particularly in the hypervariable region (HVR) of the P150
replicase gene. While directional mutation pressure was
predominant in determining base and codon usage within
most of the genome (with the strongest tendency being
towards C’s at third codon positions), natural selection was
predominant in the HVR region. The GC content of this
region was the highest in the genome (>80%), and it was
not clear if selection at the nucleotide level accompanied
selection at the amino acid level. Dinucleotide frequency
analysis of the RUBV genome revealed that TpA usage
was lower than expected, similar to mammalian genes;
however, CpG usage was not suppressed, and TpG usage
was not enhanced, as is the case in mammalian genes.
Introduction
Due to the redundancy of the genetic code, the nucleotide
composition of a genome can significantly influence its
evolution, resulting in, among other things, unique utili-
zation patterns of synonymous codons [18]. In contrast to
bacteria, C. elegans, yeast, and Drosophila, extensive
studies of base composition and codon usage in viral
genomes have not been conducted, although the pattern of
base and codon usage is of significance in viral patho-
genesis [11], vaccine design [20] and phylogenetic recon-
struction [36]. Base composition and codon usage among
both DNA and RNA viruses have been found to be non-
random and species-specific. While the factors shaping
these biases vary among viruses, nucleotide composition
constraint, or directional mutation pressure, has been
identified as the major factor in governing codon prefer-
ences [25, 45]. For example, retroviral RNA genomes have
a biased nucleotide composition depending on the genus.
The RNA of human immunodeficiency virus (HIV) is
enriched with adenine, while the genome of human T cell
leukemia virus (HTLV) is cytosine-rich [6], and these
biased nucleotide compositions constrain codon usage in
the genomes of these viruses [7]. It has been suggested that
nucleotide composition bias is caused by directional
mutation pressure associated with the specific viral reverse
transcriptases [37, 39]. Codon usage bias determined by
directional mutation pressure has also been indicated in
studies of human DNA viruses [45], RNA viruses [19, 25]
and plant viruses [1].

Electronic supplementary material: The online version of this article (doi:10.1007/s00705-012-1243-9) contains supplementary material, which is available to authorized users.

Y. Zhou · X. Chen · T. K. Frey (✉)
Department of Biology, Georgia State University,
Atlanta, GA 30303, USA
e-mail: [email protected]

Present Address: X. Chen
Gastroenteritis and Respiratory Virus Laboratory Branch,
Division of Viral Diseases, Centers for Disease Control
and Prevention (CDC), Atlanta, GA 30333, USA

H. Ushijima
Department of Developmental Medical Sciences,
Institute of International Health, Graduate School of Medicine,
The University of Tokyo, Tokyo, Japan

Present Address: H. Ushijima
Division of Microbiology, Department of Pathology
and Microbiology, School of Medicine, Nihon University,
30-1 Oyaguchi Kamicho, Itabashi-ku, Tokyo 173-8610, Japan

Arch Virol (2012) 157:889–899
DOI 10.1007/s00705-012-1243-9

On the other hand, translational efficiency may also play
an important role in codon choice among virus genomes [3,
50]. For example, codon usage in the envelope protein gene
of HIV does not match with codon usage among highly
expressed human genes, resulting in poor expression of the
envelope protein. Optimizing the envelope gene codon
usage to the codon preference of human genes resulted in
an increase in envelope gene expression [20]. Similar
tendencies have been documented in human papillomavi-
ruses (HPV) [5] and poxviruses [4]. Conversely, deoptim-
izing the native viral codon preferences in poliovirus
reduced gene expression and attenuated virulence (and has
thus been suggested as a strategy for developing attenuated
vaccines) [13, 41]. While matching codons with the
availability of cellular tRNA allows for a virus to achieve
the highest levels of gene expression, using non-preferred
codons could result in less gene expression but help the
virus establish persistent infection.
Rubella virus (RUBV), an enveloped, single-stranded,
positive-sense RNA virus, is the sole member of the genus
Rubivirus in the family Togaviridae. The ~10-kb genome
contains two long open reading frames (ORFs). The 5′
ORF or nonstructural protein ORF (NS-ORF) encodes two
proteins, P150 and P90, that function in RNA replication,
and the 3′ ORF or structural protein ORF (SP-ORF)
encodes the three proteins that comprise the virus particle:
the capsid protein, CP, and two envelope glycoproteins, E1
and E2 [17]. The genome of RUBV maintains the highest
genomic GC content (70%) among RNA viruses, the next
highest GC content of the genome of an RNA virus being
~50% [17]. In one region of the RUBV genome, the
hypervariable region (HVR) in the middle of the P150 gene,
the GC content is greater than 80% [61]. Based on partial
sequences of the E1 gene, thirteen genotypes of RUBV
have been distinguished [54]. Despite the importance of
base composition and codon usage as indicators of evolu-
tionary forces playing on a virus, little is known about
these parameters across genotypes of a single virus. RUBV
only has a single serotype, and humans are the sole host,
making it an ideal model for studying the effect of intrinsic
evolutionary factors on base and codon usage in the
absence of antigenic diversity or adaptation to infection of
multiple hosts. In this study, we have analyzed the base and
codon usage across 19 genomes of RUBV representing
eight genotypes, examining genes and domains individu-
ally and in comparison with human genes.
Materials and methods
Genome sequences
The 19 RUBV genome sequences that are available were
used in this study. These sequences represented eight out of
13 genotypes [61]. Sequences from 11 genomic coding
regions (P150; P150: MT; P150: HVR; P150: XD; P150:
NP; P90; P90: HEL; P90: RdRp; C; E2; E1; see Fig. 1)
were extracted using the Pileup program in the GCG
package (Genetics Computer Group, version 11.0, Accel-
rys Inc.). Selected coding sequences were edited and
aligned using ClustalW software version 1.8 [21].
Base and codon usage analysis
Two parameters, relative synonymous codon usage
(RSCU) and effective number of codons (Nc), were used to
measure the codon usage biases of all genomic coding
regions. RSCU is defined as the ratio of the observed fre-
quency of a codon to its expected frequency assuming
equal use of synonymous codons [46, 56, 57]. The RSCU
value of the jth codon for ith amino acid is estimated using
the equation
$$\mathrm{RSCU}_{ij} = \frac{\mathrm{obs}_{ij}}{\dfrac{1}{n_i}\sum_{j=1}^{n_i} \mathrm{obs}_{ij}}$$
where obsij is the observed number of the jth codon for the
ith amino acid with ni-fold synonymous codons.
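
The RSCU definition above can be illustrated with a minimal Python sketch. It is not the DAMBE implementation; the function name `rscu` and the toy sequence are assumptions made for illustration, and the standard genetic code is assumed.

```python
from collections import Counter

# Standard genetic code in the usual T, C, A, G ordering (assumed here).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def rscu(seq):
    """Return {codon: RSCU} for an in-frame coding sequence (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons into synonymous families, one family per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the sequence; RSCU undefined
        n_i = len(codons)
        for c in codons:
            # observed count divided by the mean count over the family
            result[c] = counts[c] * n_i / total
    return result


# Toy example: three GCC and one GCG codon for Ala (a four-codon family).
values = rscu("GCCGCCGCCGCG")
print(values["GCC"], values["GCG"])
```

In this toy case GCC accounts for three of four Ala codons in a four-fold family, so its RSCU is 3 × 4 / 4 = 3, while unused Ala codons have an RSCU of 0.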
Nc is a parameter that directly reflects the absolute
synonymous codon usage bias of a given gene [14]. The Nc
value ranges from 20, when only one codon is used for
each amino acid (highly biased), to 61, when all codons are
used equally (unbiased). The Nc value is calculated using
the equation
$$N_c = 2 + \frac{9}{F_2} + \frac{1}{F_3} + \frac{5}{F_4} + \frac{3}{F_6}$$
where Fi is the average homozygosity of the i-fold amino
acids (i = 2, 3, 4 and 6).

Fig. 1 Genomic coding regions analyzed. A schematic diagram of
the RUBV genome is shown with the two ORFs as boxes and
untranslated regions as lines. Within each ORF, the boundaries of the
individual genes are shown, and within the P150 and P90 genes, the
relative positions of domains are indicated. Specific coordinates (in
nt): P150, 41-3943; MT (methyl/guanylyl-transferase), 230-439; HVR
(hypervariable region), 2120-2440; XD (X domain; ADP-ribose
binding domain), 2492-2998; NP (nonstructural protease), 3041-3940;
P90, 3944-6391; HEL (helicase), 4043-4798; RDRP (RNA-dependent
RNA polymerase), 4826-6388; C, 6512-7411; E2, 7412-8257; E1,
8258-9703

The expected value of Nc under the null model, in which
the GC bias at silent sites is entirely due to mutation bias, is
given by the equation
$$N_c = 2 + s + \frac{29}{s^2 + (1 - s)^2}$$
where s is the frequency of GC at the synonymous third
position of the codon (GC3s) [55].
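
The two Nc formulas can be checked numerically with a short Python sketch. The function names are hypothetical and the inputs are illustrative values, not data from this study.

```python
def nc_from_homozygosity(f2, f3, f4, f6):
    """Observed Nc from the average homozygosities F2, F3, F4 and F6."""
    return 2 + 9 / f2 + 1 / f3 + 5 / f4 + 3 / f6


def expected_nc(gc3s):
    """Expected Nc under the null model, given s = GC3s (a fraction in 0..1)."""
    s = gc3s
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# Boundary cases: one codon per amino acid (all F = 1) gives Nc = 20,
# and equal use of all synonymous codons (F_i = 1/i) gives Nc = 61.
print(nc_from_homozygosity(1, 1, 1, 1))
print(nc_from_homozygosity(1 / 2, 1 / 3, 1 / 4, 1 / 6))

# Expected Nc along the null curve for a few GC3s values.
for s in (0.5, 0.7, 0.8, 0.9):
    print(s, round(expected_nc(s), 2))
```

The null curve peaks at s = 0.5 and falls as GC3s moves toward either extreme, which is why a GC-rich genome is expected to have a low Nc from composition alone.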
GC3s, the GC content at the third codon position of
synonymous codons, was used to indicate the base com-
position bias of a gene [25]. To quantify the usage of each
nucleotide at synonymous third codon positions, four
indexes, A3s, T3s, C3s, G3s, were calculated. Each index
was defined as a proportion of the maximum usage the
individual nucleotide could have without altering the
amino acid composition (the ratio of the observed codon
frequency of the nucleotide at a synonymous third codon
position to the maximum potential codon frequency of the
nucleotide at a synonymous third codon position). It should
be pointed out here that these indices are not directly comparable
with GC3s, although they are correlated with GC3s.
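
As an illustration of GC3s, the following Python sketch counts G or C at the third position of codons that have synonymous alternatives. It assumes that Met (ATG), Trp (TGG) and stop codons are excluded, which is how GC3s is commonly defined; it is not the CodonW code, and the function name `gc3s` is hypothetical.

```python
# Codons with no synonymous alternative, plus stop codons (assumed exclusions).
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(seq):
    """Fraction of G or C at the third position of synonymous codons."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Keep unambiguous codons that belong to a family of two or more codons.
    synonymous = [c for c in codons
                  if c not in NON_SYNONYMOUS and set(c) <= set("ACGT")]
    if not synonymous:
        raise ValueError("no synonymous codons in sequence")
    return sum(c[2] in "GC" for c in synonymous) / len(synonymous)


# Toy example: GCC and GCA are counted; ATG and TGG are ignored.
print(gc3s("GCCGCAATGTGG"))
```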
Two software programs, DAMBE [56, 57] and CodonW
(win3.2, contributed by John Peden, available at
http://www.molbiol.ox.ac.uk/cu) were used to estimate
these parameters. Codon frequencies and RSCU were cal-
culated using DAMBE software. A3s, T3s, C3s, G3s, GC3s
and Nc were tabulated using CodonW software. Base
composition at individual codon positions and over all
codon positions was calculated using the program TREE-PUZZLE
version 5.2 [44]. To examine the extent to which
base mutation bias accounts for the observed codon usage
bias, Nc was plotted against GC3s, G3s and C3s, and the
correlations between GC3s and the GC content at the first (GC1),
second (GC2) and overall (GC) codon positions of all
genomic coding regions were assessed by linear regression
analysis using Microsoft Excel in Windows XP.
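
For a simple linear regression of one index on another, the coefficient of determination equals the squared Pearson correlation. A minimal Python sketch of that calculation follows; the function name and the data points are hypothetical and are not values from this study.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Hypothetical GC3s and Nc values for a handful of regions.
gc3s_values = [0.76, 0.79, 0.81, 0.84, 0.86]
nc_values = [43.0, 41.5, 39.0, 37.2, 35.1]
print(round(r_squared(gc3s_values, nc_values), 3))
```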
Dinucleotide frequency is another important factor that
can affect codon usage. In this study, the relative abun-
dance values of 16 dinucleotides in each genomic coding
region were estimated using the method described by
Karlin and Burge [27]. The average relative abundance
value of each dinucleotide was determined from the odds
ratio over three codon site combinations (p12, p23, p31)
using the equation
$$\rho_{xy} = \frac{3\,[f_{xy}(1,2) + f_{xy}(2,3) + f_{xy}(3,1)]}{[f_x(1) + f_x(2) + f_x(3)]\,[f_y(1) + f_y(2) + f_y(3)]}$$
where fxy denotes the frequency of the dinucleotide XY in the
sequence under study, and fx and fy denote the frequency of
nucleotide X and Y, respectively. From data simulations
and statistical theory, ρxy ≤ 0.78 and ρxy ≥ 1.23 represent
extreme underrepresentation and extreme overrepresentation,
respectively, for sufficiently long (>20 kb) random
sequences with a probability of 0.001 at most for virtually
any base composition. The observed nucleotide and dinucleotide
frequencies of a given gene were determined using
DAMBE software.
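
The relative abundance over the three codon site combinations can be sketched in Python as follows. This is an illustration of the formula only, not the DAMBE procedure; the function name `dinucleotide_rho` is hypothetical, and the (3,1) pairs are assumed to span adjacent codons.

```python
def dinucleotide_rho(seq, x, y):
    """Relative abundance of dinucleotide XY averaged over codon sites
    (1,2), (2,3) and (3,1) of an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    seq = seq[:usable]
    n_codons = usable // 3
    if n_codons < 2:
        raise ValueError("need at least two codons")

    def site_freq(base, site):
        # frequency of `base` at codon site 0, 1 or 2
        return seq[site::3].count(base) / n_codons

    def pair_freq(first_site):
        # first_site 0 -> sites (1,2); 1 -> (2,3); 2 -> (3,1) across codons
        pairs = [seq[i:i + 2] for i in range(first_site, usable - 1, 3)]
        return pairs.count(x + y) / len(pairs)

    numerator = 3 * sum(pair_freq(s) for s in range(3))
    fx = sum(site_freq(x, s) for s in range(3))
    fy = sum(site_freq(y, s) for s in range(3))
    if fx == 0 or fy == 0:
        raise ValueError("nucleotide absent from sequence")
    return numerator / (fx * fy)


# A homopolymer has every dinucleotide at its expected frequency (rho = 1).
print(dinucleotide_rho("AAAAAA", "A", "A"))
```

Values well below 1 indicate that a dinucleotide occurs less often than the single-nucleotide frequencies would predict, and values well above 1 indicate the opposite.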
Results
Codon usage pattern and bias
The RUBV genome was subdivided into eleven genomic
coding regions ([61], see Fig. 1) for analysis. A similar
codon usage pattern was found in each genomic coding
region among the 19 virus genomes analyzed, indicating no
significant codon usage differences among diverse geno-
types of rubella viruses (data not shown). Therefore, this
study focused on examining the variations of codon usage
among these regions within the RUBV genome.
The codon frequencies and relative synonymous codon
usage (RSCU) values of the 11 regions calculated from the
19 genomic sequences are summarized in Supplementary
Table S1. Codon bias was evident, as roughly 40% of the
codons in each region had an RSCU value >1 and predictably
reflected the high GC content of the RUBV genome.
For example, among Ala codons (GCN), which were
the most frequently used across the genome, GCC was the
most preferred (RSCU = 1.43-2.31). Among the six syn-
onymous codons for Arg, which were frequent in the NSP-
ORF and C gene, 70% were CGC, and the RSCU value of
this codon was the highest among all codons, ranging from
2.98 to 5.02 (average = 4.04) among the genomic coding
regions. In contrast, AAN codons (Lys and Asn) were the
most infrequent (frequencies <2.9% of the total codons).
The codon preference within each genomic coding
region is listed in Table 1. With four exceptions (MT and
HVR, both in P150, C and E2), the preferred codon for
each amino acid was the same in all of the genomic regions.
All of the preferred codons ended in C or G, with 83.3%
(15/18) of them ending in C. However, the base usage in
the first and second positions of the preferred codons was
not GC-prone, especially in the second codon position,
where the usage of A (39%, 7/18) was more frequent than
that of C (17%, 3/18) or G (22%, 4/18). In order to com-
pare the codon preferences of RUBV with those of its host,
the codon preferences for a compilation of human genes
are also presented in Table 1. Surprisingly, despite the high
GC content of the rubella virus genome, 15 out of 18
preferred codons were the same between the virus and its
host, although the frequency of usage of each codon in toto
varied widely between the viral and human genes.
Base composition at codon positions
The base compositions at the three codon positions across
the 11 genomic coding regions are summarized in Table 2.
Except for the HVR, the GC content ranged from 66.2%
(E1) to 74.2% (XD), with the frequency of C (36.4-41.7%)
being greater than that of G (28.1-33.4%). The frequencies
of A (11.5-18.2%) and T (12.7-18%) were roughly equal. With an
exceptionally high frequency of C (49%) and low fre-
quencies of A (11%) and T (8%), the HVR had a GC
content of 81.1%. Except for the HVR, at the third codon
position, the base usage was highly biased toward C (48.7-
60.9%), resulting in a GC content at this position with a
range of 74.9-85.1%. At the second codon position, unlike
the extreme bias toward C at third codon position, the
usage of four nucleotides was more equally distributed,
with higher relative frequencies of both A (16.8-26.3%)
and T (15.6-25.7%) and an overall GC content of 48.4-
62.6%. At the first codon position, instead of C, G was the
preferred nucleotide, with a frequency range of 33.1-46.8%
and an overall GC content of 63.9-77%. Overall, outside of
the HVR, the relative nucleotide percentages in the regions
of the rubella virus genome were GC3 > GC1 > GC2
(C3 > C1 = C2 and G1 > G3 = G2). Interestingly,
although the nucleotide frequencies differed markedly
between RUBV and the compilation of human genes, these
tendencies were similar to those of the human genes,
namely GC3 > GC1 > GC2 (C3 > C1 = C2 and G1 > G3 > G2).
In contrast to the rest of the RUBV genome,
within the HVR, the nucleotide frequency was similar
among the three codon positions, with GC1 (84.6%) > GC3 (81.4%) >
GC2 (77.4%), for an overall GC content of
81.1%. C and G were roughly equal at the first codon
position (43.2% and 41.4%, respectively), but C was
greater at both the second (56.9% vs 20.5%) and third
codon positions (47% vs 34.4%). Thus, within the HVR,
the usage tendencies of GC were GC1 > GC3 > GC2,
C2 > C3 > C1 and G1 > G3 > G2, at variance with other
regions of the genome.
Correlation between codon usage bias and GC content
To address codon usage bias quantitatively within the
coding regions of the genome, a number of parameters
were calculated for each region of the 19 genomes ana-
lyzed in the study. As summarized in Table 3, the N3s
index (the ratio of the actual usage of a nucleotide at the
third codon position to its maximal potential usage without
changing the amino acid sequence) indicated a bias gradient
of C3s (0.525-0.722) > G3s (0.236-0.373) > T3s
(0.068-0.203) > A3s (0.049-0.172), which was reflective
of the nucleotide frequencies for the third codon position
shown in Table 2. In the HVR, C3s (0.471-0.524) and G3s
(0.362-0.417) were more similar, as were T3s (0.06-0.118)
and A3s (0.084-0.128), than in other regions of the gen-
ome. Calculations of Nc, the effective number of codons,
revealed a somewhat greater bias in regions of the NSP-
ORF (29-40) than regions of the SP-ORF (35-45). There
was a large range in Nc values in the HVR, 31 to 51 with a
standard deviation of 4.95, although this was not due to
GC3s, the GC content at the third codon position of synonymous
codons, which was 0.78-0.85 compared with
0.69-0.88 in other regions of the genome (the range of total
GC content was 0.79-0.83 in the HVR vs 0.65-0.75 among
other genomic regions). As shown in Figure 2, linear
regression analysis revealed a positive correlation between
GC and GC3s (R² = 0.302) for all regions of the genome
except for the HVR, for which no correlation was found
(R² = 0.0503). Additionally, GC3s was also shown to
correlate with GC1 (R² = 0.2856) but not with GC2
(R² = 0.0029) (data not shown). Figure 3 displays the
relationship between GC3s and Nc. With the exception of
the HVR, the Nc of all of the genomic regions exhibited a
significant negative correlation with GC3s (R² = 0.5928),
while the GC3s of the HVR did not correlate with its Nc
(R² = 0.1307). In order to examine whether the individual
frequencies of C and G were related to the codon usage
bias of the RUBV genome, the correlations of Nc with C3s
and G3s were determined. As shown in Figure 4, Nc
strongly negatively correlated with C3s (R² = 0.3536), but
not with G3s (R² = 0.0775). In the HVR, there was no
significant negative correlation between Nc and either C3s
(R² = 0.1648) or G3s (R² = 0.0446).

Table 1 Codon preferences within RUBV genomic coding regions and human genes

AA    Genomic regions*   P150: MT*   P150: HVR*   SP: C*   SP: E2*    Human**
Ala   GCC                ***         GCG          ***      ***        ***
Cys   UGC                ***         ***          ***      ***        ***
Asp   GAC                ***/GAU†    ***          ***      ***        ***
Glu   GAG                ***         ***          ***      ***        ***
Phe   UUC                ***         ***          ***      ***        ***
Gly   GGC                ***         ***          ***      ***        ***
His   CAC                ***         ***          ***      ***        ***
Ile   AUC                ***         AUU          ***      ***/AUA†   ***
Lys   AAG                AAA         ***          ***      ***        ***
Leu   CUC                ***         CUG          ***      CUG        CUG
Asn   AAC                ***         ***          ***      ***        ***
Pro   CCC                CCA         ***/CCG†     CCG      ***        ***
Gln   CAG                ***         ***          ***      ***        ***
Arg   CGC                ***         ***          ***      ***        ***/AGG†
Ser   AGC                ***         UCG          UCC      ***        ***
Thr   ACC                ***/ACG†    ***          ***      ***        ***
Val   GUC                ***         ***          ***      ***        GUG
Tyr   UAC                UAU         ***          ***      ***        ***

*The preferred codon for each amino acid was calculated for each of the coding regions of the RUBV genome using the 19 sequences. These were the same for P150, XD, NP, P90, HEL, RdRp, and E1 and have been included together under the heading "Genomic regions"
**Preferred codons for human genes as compiled by Frey [17]. The same usage pattern was reported in references [34] and [58]
***Same preferred codons as in genomic regions
†Used equally for the two codons

When comparing the actual plot of Nc vs GC3s with the
plot expected under the null model in which codon usage
bias is entirely due to mutation pressure, it was evident
that, with the exception of the HVR, the plot of Nc vs GC3s for
genomic regions lay just below and paralleled the expected
curve, indicating that the codon usage bias in these regions
was mainly influenced by the genomic GC composition,
i.e., mutation pressure. However, the plot of Nc vs GC3s of
the HVR intersected the expected curve. Significantly, the
finding of no correlation between Nc and GC3s within the
HVR indicated that this region of the genome was under
different selective pressure.
Dinucleotide frequencies
Dinucleotide usage is another important factor that shapes
codon usage in genes. The relative dinucleotide abundance
value over three codon positions (p12, p23 and p31) of 16
dinucleotides was computed. Well-known tendencies
among mammalian genes are underrepresentation of TpA
and CpG usage and overuse of TpG. Therefore, the values
of CpG, TpG and TpA in all genomic regions were
examined first. As displayed in Table 4, CpG and TpG
frequencies over all genomic regions were not consistent
with those of the host, as the values of CpG over all regions
ranged between 0.7955 and 1.2428, while TpG was over-
represented in P150, E1, and E2 (values = 1.2620-1.4649)
but not enhanced in the P90 and C genes (values = 1.1210-
1.1638). In contrast, similar to the tendencies in human
genes, TpA was underrepresented in all genomic regions,
with a low value range of 0.3765-0.7970 (a slightly higher
value [0.8458] was observed in HEL). Additionally, it was
noted that 11 out of 16 dinucleotides were underrepresented
in the MT domain of P150. Analysis of the relative
abundance of the remainder of the dinucleotides did not
reveal any tendencies of interest.

Table 2 Base composition (%) at codon positions of RUBV genomic coding regions

Codon position   Gene               GC          A           C           G           T
1st              Genomic regions*   63.9–77     12.3–19.8   27.1–39.4   33.1–46.8   8.9–18.6
                 HVR*               84.6        8.7         43.2        41.4        6.7
                 Human**            54          28          23          31          18
2nd              Genomic regions*   48.4–62.6   16.8–26.3   26.9–35.6   20.7–31.3   15.6–25.7
                 HVR*               77.4        15.1        56.9        20.5        7.6
                 Human**            40          32          22          18          28
3rd              Genomic regions*   74.9–85.1   4.8–13.7    48.7–60.9   21.4–30.3   9.4–14.3
                 HVR*               81.4        9.1         47          34.4        9.6
                 Human**            63          17          34          29          20
All              Genomic regions*   66.2–74.2   11.5–18.2   36.4–41.7   28.1–33.4   12.7–18
                 HVR*               81.1        11          49          32.1        8
                 Human**            52          25          26          26          22

*The base composition at each codon position in each of the genomic coding regions was calculated from the 19 genomic sequences. The parameters for 10 of these regions (P150, MT, XD, NP, P90, HEL, RdRp, C, E2, and E1) are grouped together and presented as ranges, while the parameters for the HVR are presented separately
**Base usages in human genes as compiled by Frey [17]. The same usage pattern was reported in references [34] and [58]

Discussion
In this study, base composition and codon usage among
and within RUBV genomes were analyzed. RUBV is an
interesting model for such studies both because it com-
prises a single serotype and infects a single host (humans),
freeing it from evolutionary pressures faced by other
viruses with multiple serotypes and hosts, and because it
possesses a GC content that is uniquely high among RNA
viruses. While such a GC content would seemingly be at
odds with that of its host, we found that despite marked
differences in proportional codon usage, the preferred
RUBV codon matched the preferred human codon for 18
out of the 20 amino acids, indicating adaptation to its host.
We found that codon usage was similar in homologous
genes among the genotypes analyzed, corresponding to
findings in studies on other viruses, such as phylogeneti-
cally close nucleopolyhedroviruses [35], human and non-
human AT-rich papillomaviruses [58], and influenza A
viruses [60]. While the RUBV genotypes are only recog-
nizable by sequencing, they are in a dynamic state of flux
worldwide for unknown reasons [53]. While part of the ebb
and flow of genotypes is due to vaccination programs, the
spread of other genotypes is not explained by vaccine
pressure.

Table 3 Codon usage indices calculated for genomic coding regions in the RUBV genome

Genomic regions*   T3s           C3s           A3s           G3s           Nc            GC3s          GC            L aa**
P150               0.108-0.127   0.583-0.605   0.085-0.109   0.330-0.351   38.29-40.33   0.795-0.827   0.711-0.720   1301
P150:MT            0.068-0.203   0.525-0.644   0.138-0.172   0.255-0.309   29.06-39.05   0.687-0.806   0.662-0.695   70
P150:HVR           0.060-0.118   0.471-0.524   0.084-0.128   0.362-0.417   31.67-50.94   0.776-0.850   0.794-0.829   107
P150:XD            0.077-0.144   0.595-0.641   0.050-0.087   0.316-0.358   34.06-39.05   0.818-0.880   0.728-0.753   169
P150:NP            0.083-0.130   0.605-0.652   0.065-0.121   0.286-0.333   36.21-40.22   0.783-0.860   0.711-0.736   300
P90                0.116-0.146   0.634-0.670   0.062-0.086   0.314-0.345   34.84-38.31   0.804-0.849   0.669-0.680   815
P90:HEL            0.101-0.141   0.686-0.722   0.074-0.126   0.236-0.287   33.46-36.74   0.792-0.852   0.660-0.680   252
P90:RdRp           0.122-0.162   0.606-0.651   0.049-0.099   0.326-0.373   34.89-39.89   0.786-0.850   0.661-0.678   521
SP:C               0.112-0.155   0.581-0.634   0.068-0.117   0.297-0.339   35.07-42.01   0.764-0.823   0.717-0.736   300
SP:E2              0.111-0.165   0.563-0.632   0.054-0.083   0.302-0.342   36.82-44.78   0.784-0.839   0.697-0.720   282
SP:E1              0.139-0.195   0.558-0.603   0.082-0.113   0.314-0.353   39.97-44.09   0.752-0.810   0.653-0.670   481

*Parameters were calculated from the indicated genomic regions using the 19 genomic sequences
**Length of the genomic region in amino acids

Fig. 2 Correlation between overall GC content (GC) and that at
synonymous third codon positions (GC3s) within a genomic
region. GC vs GC3s is plotted for each genomic region of the 19
genomic sequences employed in this study. The HVR points are
shown as pink squares, while the points from the other 10 genomic
regions are shown as blue diamonds (color figure online)

Fig. 3 The effective number of codons (Nc) as a function of the
synonymous third codon GC content (GC3s) within a genomic
region. Nc vs GC3s is plotted for each genomic region of the 19
genomic sequences employed in this study. The continuous curve
indicates the expected Nc vs GC3s plot under the null model in which
the codon usage is completely constrained by GC content. The HVR
points are shown as pink squares, while the points from the other 10
genomic regions are shown as blue diamonds (color figure online)

Interesting differences were revealed in the intra-geno-
mic comparisons. Within coding regions, the GC content
varied, with a range of 66.2-81.1%. With one exception, the
HVR (the perpetual outlier), GC content over the three
codon positions was GC3 > GC1 > GC2, and GC3s correlated
with overall GC and GC1 positively and with Nc
negatively. When comparing the actual plot of Nc against
GC3s with the expected curve generated by the null model,
the plot lay just below the expected curve, suggesting that
directional mutation pressure was the major factor for
driving the codon choices for most of the genome. That
mutation pressure had a greater effect on the codon usage
than natural selection was further supported by the finding
that NSP genes had greater codon usage bias than SP genes
in the RUBV genome (Nc NSP vs SP = 29-40 vs 35-45).
The expression level of NSP genes is much lower than
that of SP genes during virus replication, and if natural selection
were at play, the highly expressed SP genes would be expected
to show the stronger codon usage bias, which is the opposite of
what was observed.
Directional mutation pressure as the main factor for codon
usage choice has also been observed in many human RNA
viruses [25] and DNA viruses [45] as well as in mammalian
species [52]. The high mutation rates and large population
sizes exhibited by RNA viruses possibly make the natural
selection constraints on codon choice inefficient, with the
result that RNA viruses have a low codon usage bias, and
their codon usage reflects more of the underlying mutation
pressures (an inverse relationship between nucleotide sub-
stitution rate and codon usage bias) [25, 43, 45]. Accord-
ingly, with a high codon usage bias, RUBV is expected to
have a low nucleotide substitution rate. In fact, a compre-
hensive study on substitution rates in 50 RNA viruses
documented that RUBV did exhibit a lower substitution rate
than other RNA viruses [24]. Although a significant pattern
with low substitution rates along with decreased codon
usage bias has been observed in members of the family
Togaviridae (most of which are vector-borne viruses) [24],
as a non-arthropod-borne virus with a single host, it was not
surprising that RUBV uses a codon strategy with a low
substitution rate along with a high codon usage bias. Also,
increased bias in segmented and aerosol-transmitted RNA
virus (RUBV is an aerosol-transmitted virus) has been
reported [25], providing evidence that structural features of
the viral genome and ecology of viruses contribute to the
codon choice of viruses [24].

Fig. 4 Correlation between the effective number of codons (Nc)
and the G and C content at synonymous third codon positions
(G3s/C3s) within a genomic region. Nc vs G3s or C3s is plotted for
each genomic region of the nineteen genomic sequences employed in
this study. The HVR points are shown as pink squares (G3s) or pink
triangles (C3s), while the points from the other 10 genomic regions are
shown as blue squares (G3s) or blue triangles (C3s) (color figure online)

Table 4 Relative abundance of the 16 dinucleotides (di-nt) in RUBV genomic coding regions

di-nt*   P150     P150:MT   P150:HVR   P150:XD   P150:NP   P90      P90:HEL   P90:RdRp   SP:C     SP:E1    SP:E2
AA       0.9663   0.7039    1.2301     0.7890    0.9566    0.8298   0.8603    0.7868     1.1124   0.9251   0.8171
AC       1.0730   0.7069    1.1546     1.0596    1.0530    1.1063   1.1744    0.9897     1.0078   1.2429   1.2445
AG       0.9004   0.7648    0.6672     0.8156    0.9540    0.8656   0.7410    0.8951     0.9399   0.8522   0.5494
AU       1.0563   0.8003    1.0730     1.0556    0.9999    1.1731   1.1953    1.1146     0.9928   0.8084   1.0166
CA       1.0518   0.7758    0.8981     1.1126    1.0791    1.0566   1.0468    1.0312     1.0271   1.1793   1.1693
CC       0.9068   0.6563    0.8793     0.7385    0.9535    0.8383   0.7462    0.8421     0.9158   0.9559   0.8803
CG       1.0981   0.7955    1.2428     1.0951    1.0079    1.1286   1.2132    1.0268     1.0498   0.8968   0.9005
CU       0.9824   0.7712    0.8246     1.0652    1.0175    1.0770   1.1931    0.9404     1.1090   1.0902   1.0102
GA       1.0897   0.7413    1.2317     0.9453    1.0734    1.1776   1.1088    1.1349     1.0757   0.9368   0.8297
GC       1.1610   0.8203    1.1302     1.1485    1.1126    1.1575   1.2370    1.0749     1.1265   1.0515   1.0584
GG       0.7680   0.6283    0.6768     0.6769    0.7987    0.8243   0.7875    0.7763     0.9120   0.9713   0.9002
GU       0.9953   0.6615    1.1823     0.9688    1.0856    0.7698   0.6669    0.7836     0.7138   0.9748   0.8422
UA       0.6783   0.5335    0.3765     0.6095    0.6371    0.7141   0.8458    0.5781     0.5375   0.7970   0.6937
UC       0.8243   0.8168    1.0055     0.9516    0.8118    0.9695   1.0032    0.9044     0.9577   0.7783   0.6690
UG       1.3461   0.7678    1.1438     1.2620    1.4815    1.1638   1.1206    1.1268     1.1210   1.3619   1.4649
UU       1.0047   1.0744    1.2446     0.4750    0.7438    1.0680   0.8977    1.0934     1.3508   1.0169   0.9215

*The dinucleotide frequency at all three codon positions (p12, p23, p31) was calculated from the 19 genomic sequences and expressed as the
proportion of the expected frequency given the GC content of the genomic region. The values that are deeply depressed (≤0.78) or overrepresented
(≥1.23) are in boldface type

Within the HVR, however, the overall relationships
among GC, GC1, GC3 and Nc suggested that this region is
subjected primarily to natural selection rather than to
mutation pressure. In this regard, previous studies on
RUBV indicated that the HVR was subjected to positive
evolutionary pressure [22, 61]. It is well established that
natural selection can change genomic base content and play
a role in altering the most intricate pattern of codon usage
[2, 8, 15, 23, 38, 40, 48, 49]. Within the RUBV NS-ORF,
the HVR resides within the recently defined Q domain. The
function of the Q domain is unknown, and it was mapped
through the curious observation that while deletions of this
region render RUBV constructs nonviable, these deletions
can be rescued by the capsid protein [51]. The region
encoded by the HVR is high in proline and includes PxxP
and PxxPxR motifs known to be important in protein-
protein interaction [28]. Therefore, the selective pressure
brought to bear on this region may relate to intracellular
contacts the region mediates during the RUBV replication
cycle. While it is difficult to imagine that these contacts
would involve different cell proteins throughout the human
population, it is possible that the plasticity of this region is
necessary to accommodate the same set of proteins across
different genetic backgrounds. Because the HVR exhibits the
highest substitution rate in the RUBV genome [61], it
would be expected to have the lowest codon usage bias,
since there is usually a negative correlation between codon
usage bias and mutation rate (i.e., because codon bias
reduces the rate of synonymous substitutions) [10, 47].
However, the HVR exhibited a wide range of Nc values
among diverse genotypes (Nc = 32–51). This finding does not
contradict the expectation, because the HVR
features substitutions at nonsynonymous sites that were
apparently driven by natural selection that countered coding
bias driven by mutation pressure. It should be pointed
out that the HVR has the highest GC content in the whole
genome (>80%), and it is possible that some attribute of
such a high GC content (e.g., stability of secondary structure)
may also be favored by natural selection, particularly
in a region that tolerates plasticity in its amino acid content.
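The comparison of Nc with third-position GC content discussed above is commonly made against the Nc value expected if codon choice were determined by base composition alone. The following is a minimal Python sketch of that expectation, using the approximation given by Wright [55]. The function name and the GC3s values are illustrative assumptions and are not data from this study.

```python
def expected_nc(gc3s: float) -> float:
    """Expected effective number of codons (Nc) when codon choice reflects
    only G+C content at synonymous third positions (Wright's approximation).

    gc3s is a fraction between 0 and 1.
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    s = gc3s
    return 2.0 + s + 29.0 / (s * s + (1.0 - s) ** 2)


# Illustrative GC3s values only (assumed, not taken from the paper's tables).
for s in (0.5, 0.8, 0.9):
    print(f"GC3s = {s:.2f} -> expected Nc = {expected_nc(s):.2f}")
```

An observed Nc well below the expected value for the same GC3s is usually read as a sign that forces other than base composition, such as selection, contribute to codon choice.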
Regarding the basis of the directional mutation pressure,
although the RUBV genome is rich in both C and G, the
distribution of these two nucleotides over the three codon
positions obviously differed. Specifically, G was higher in
the first codon position (i.e., G1 > G3 > G2) in all of the
genomic regions, including the HVR, while C predominated
at the third codon position (i.e., C3 > C1 = C2) in
all of the genomic regions except the HVR
(C2 > C3 > C1). These data indicated first that irrespective
of whether the genomic region was subjected to
directional mutation pressure or natural selection, C was
the primary base that influenced codon choice in the RUBV
genome. Secondly, G base usage in the RUBV genome was
governed to a greater extent by natural selection than was C
base usage. The stronger negative correlation between Nc
and C3s than between Nc and G3s confirmed these indi-
cations. The observation of natural selection influencing G
base usage at the first two codon positions in a genome in
which the codon usage was primarily governed by muta-
tion pressure was also documented in bacteria with highly
skewed (H. influenzae and M. tuberculosis) and unskewed
(E. coli) base compositions [42]. In these studies, G was
found to be preferred in the first codon position of highly
expressed genes, irrespective of their overall GC content
[42, 55]. This was not the case in the RUBV genome, as the
more highly expressed ORF, the SP-ORF, exhibited the
same tendencies as the less expressed ORF, the NS-ORF.
Interestingly, an analysis performed on codon usage of
16,654 human genes and 129 human virus genomes
showed that all C-ending and G-ending codons were pre-
ferred in highly expressed genes [31]. While the basis of
the directional mutation pressure on the RUBV genome is
not known, we infer that it is due to enzyme-induced
hypermutations (i.e., characteristics of the RNA-dependent
RNA polymerase encoded by the virus), although the
formation of stable local RNA structures may also play a
role, as has been suggested for retroviruses [29, 37].
Intriguingly, it was recently suggested that the RUBV
genome RNA contains a microRNA precursor in NS-ORF
that suppresses the human APOBEC1 editing enzyme
(which catalyzes G-to-A and C-to-U transitions) resulting
in the GC content bias in RUBV genomes [30]. However,
this remains to be confirmed experimentally.
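Rankings such as G1 > G3 > G2 are obtained by tabulating base frequencies separately for each codon position. The following Python sketch shows one way to do this for a single in-frame coding sequence. The function name and the nine-nucleotide toy sequence are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter


def base_freq_by_codon_position(cds: str) -> dict:
    """Return {codon position: {base: fraction}} for positions 1, 2 and 3.

    The sequence is assumed to start in frame; trailing bases that do not
    complete a codon are ignored. U is treated as T.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    if usable == 0:
        raise ValueError("sequence must contain at least one complete codon")
    result = {}
    for offset in range(3):
        bases = seq[offset:usable:3]  # every third base from this offset
        counts = Counter(bases)
        result[offset + 1] = {b: counts[b] / len(bases) for b in "ACGT"}
    return result


# Toy sequence (assumed for illustration): codons GCC, GAC, CGC.
freqs = base_freq_by_codon_position("GCCGACCGC")
for position in (1, 2, 3):
    print(position, freqs[position])
```

In this toy case G is most frequent at the first position and C accounts for every third-position base; real rankings would require the aligned coding regions analyzed in the study.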
Analysis of relative dinucleotide abundance in the
RUBV genome revealed that CpG was used at the expected
frequency (the HVR had an especially high relative CpG
abundance, with a value of 1.2428), and TpG was enhanced
in the P150, E1 and E2 genes but used as expected in the
P90 and C genes. This pattern differs from that seen in host genes.
These data are consistent with a previous analysis of
eukaryotic viruses (RNA and DNA), which demonstrated
that CpG was deeply suppressed and TpG was overrepre-
sented in viruses with small genomes, with the exception of
four togaviruses, including RUBV [25, 26, 45]. The other
three togaviruses were alphaviruses with a genomic GC
content of ~53%, and thus the CpG usage by members of
this family is independent of the genomic GC content. CpG
is important in chromatin structure and immune signaling
through TLR9 in mammalian species [32, 33, 59]; how-
ever, why either would be a factor in CpG usage by small
RNA viruses is not clear. On the other hand, depressed
usage of TpA was found in the RUBV genome, as has been
reported for mammalian genes and both DNA and RNA
viruses. Properties selecting against TpA usage include
RNase susceptibility [9], low thermal stability [12, 16] and
high TA content in stop codons [9].
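Relative dinucleotide abundance is the observed frequency of a dinucleotide divided by the frequency expected from base composition. The Python sketch below assumes that the expected frequency is the product of the two mononucleotide frequencies, in the manner of the odds ratio of Karlin and Burge [27]. This may differ in detail from the procedure used for the table above, and the sketch does not separate the codon-position contexts (p12, p23, p31). The function name and test sequence are hypothetical.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq: str) -> dict:
    """Observed/expected ratio for each of the 16 dinucleotides.

    Expected frequency is taken as f(x) * f(y) (an assumption of this sketch).
    T is converted to U; any other non-ACGU symbols still count toward
    the sequence length.
    """
    s = seq.upper().replace("T", "U")
    n = len(s)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(s)
    di = Counter(s[i:i + 2] for i in range(n - 1))  # overlapping pairs
    ratios = {}
    for x in "ACGU":
        for y in "ACGU":
            expected = (mono[x] / n) * (mono[y] / n)
            observed = di[x + y] / (n - 1)
            ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


# Toy sequence (assumed for illustration only).
ratios = dinucleotide_odds_ratios("CGCGCG")
print(ratios["CG"], ratios["GC"], ratios["CC"])
```

Under the thresholds given in the table footnote, a ratio of 0.78 or less would be classed as depressed and a ratio of 1.23 or more as overrepresented.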
Finally, the tendencies in base composition and codon
usage in the RUBV genome can be summarized in an
analysis of codon usage for Arg. Arginine is abundant in
both the NS-ORF (12.8% in the MT domain) and CP gene
(13.4%) in SP-ORF. The abundance of arginine is
accompanied by an exclusion of lysine (AAG and AAA)
(<2.9% of total amino acid usage). Furthermore, within the
six synonymous codons for Arg, 70% were CGC. In fact,
the RSCU value of this codon was the highest among all
codons (2.98–5.02 across the genomic regions,
average = 4.04). Concomitantly, the usage of the two AGR
codons was deeply suppressed; e.g., eight out of eleven
genomic regions scored an RSCU of zero for AGA. In
contrast, in the genomes of large viruses with unsuppressed
relative abundances of CpG, no bias in codon choice for
Arg was present, while in the genomes of most small
viruses (DNA and RNA) with a CpG deficiency, Arg usage
was normal but biased towards AGR [26, 45]. Thus, the
directional mutation pressure towards C, and to a lesser
extent G in RUBV, appears to be driving amino acid as
well as codon choice.
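The RSCU values quoted for Arg follow from the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The Python sketch below illustrates the arithmetic. The codon counts are hypothetical, chosen only so that CGC makes up 70% of Arg codons; they are not counts from the RUBV sequences.

```python
def rscu(codon_counts: dict) -> dict:
    """Relative synonymous codon usage for one amino acid.

    codon_counts must list every synonymous codon for the amino acid,
    including codons with a count of zero.
    """
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("amino acid is not used in this sequence")
    k = len(codon_counts)  # number of synonymous codons (6 for Arg)
    return {codon: count * k / total for codon, count in codon_counts.items()}


# Hypothetical Arg codon counts (assumed for illustration only).
arg_counts = {"CGU": 10, "CGC": 70, "CGA": 5, "CGG": 15, "AGA": 0, "AGG": 0}
arg_rscu = rscu(arg_counts)
for codon, value in arg_rscu.items():
    print(codon, round(value, 2))
```

With six synonymous codons, a codon used 70% of the time has an RSCU of 0.70 × 6 = 4.2, and an unused codon such as AGA in this example scores zero.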
Acknowledgments This research was supported by a grant from
NIH (AI21389).
References
1. Adams MJ, Antoniw JF (2004) Codon usage bias amongst plant
viruses. Arch Virol 149:113–135
2. Akashi H, Kliman RM, Eyre-Walker A (1998) Mutation pressure,
natural selection, and the evolution of base composition in
Drosophila. Genetica 102–103:49–60
3. Andre S, Seed B, Eberle J, Schraut W, Bultmann A, Haas J
(1998) Increased immune response elicited by DNA vaccination
with a synthetic gp120 sequence with optimized codon usage.
J Virol 72:1497–1503
4. Barrett JW, Sun Y, Nazarian SH, Belsito TA, Brunetti CR,
McFadden G (2006) Optimization of codon usage of poxvirus
genes allows for improved transient expression in mammalian
cells. Virus Genes 33:15–26
5. Baud D, Ponci F, Bobst M, De Grandi P, Nardelli-Haefliger D
(2004) Improved efficiency of a Salmonella-based vaccine
against human papillomavirus type 16 virus-like particles
achieved by using a codon-optimized version of L1. J Virol
78:12901–12909
6. Berkhout B, van Hemert FJ (1994) The unusual nucleotide con-
tent of the HIV RNA genome results in a biased amino acid
composition of HIV proteins. Nucleic Acids Res 22:1705–1711
7. Berkhout B, Grigoriev A, Bakker M, Lukashov VV (2002) Codon
and amino acid usage in retroviral genomes is consistent with
virus-specific nucleotide pressure. AIDS Res Hum Retroviruses
18:133–141
8. Bernardi G (1995) The human genome: organization and evolu-
tionary history. Annu Rev Genet 29:445–476
9. Beutler E, Gelbart T, Han JH, Koziol JA, Beutler B (1989)
Evolution of the genome and the genetic code: selection at the
dinucleotide level by methylation and polyribonucleotide cleav-
age. Proc Natl Acad Sci USA 86:192–196
10. Bierne N, Eyre-Walker A (2003) The problem of counting sites in
the estimation of the synonymous and nonsynonymous substitu-
tion rates: implications for the correlation between the synony-
mous substitution rate and codon usage bias. Genetics 165:1587–
1597
11. Bosch ML, Andeweg AC, Schipper R, Kenter M (1994) Insertion
of N-linked glycosylation sites in the variable regions of the
human immunodeficiency virus type 1 surface glycoprotein
through AAT triplet reiteration. J Virol 68:7566–7569
12. Breslauer KJ, Frank R, Blocker H, Marky LA (1986) Predicting
DNA duplex stability from the base sequence. Proc Natl Acad Sci
USA 83:3746–3750
13. Burns CC, Shaw J, Campagnoli R, Jorba J, Vincent A, Quay J,
Kew O (2006) Modulation of poliovirus replicative fitness in
HeLa cells by deoptimization of synonymous codon usage in the
capsid region. J Virol 80:3259–3272
14. Comeron JM, Aguade M (1998) An evaluation of measures of
synonymous codon usage bias. J Mol Evol 47:268–274
15. Comeron JM, Kreitman M, Aguade M (1999) Natural selection
on synonymous sites is correlated with gene length and recom-
bination in Drosophila. Genetics 151:239–249
16. Delcourt SG, Blake RD (1991) Stacking energies in DNA. J Biol
Chem 266:15160–15169
17. Frey TK (1994) Molecular biology of rubella virus. Adv Virus
Res 44:69–160
18. Grantham R, Gautier C, Gouy M, Mercier R, Pave A (1980)
Codon catalog usage and the genome hypothesis. Nucleic Acids
Res 8:r49–r62
19. Gu W, Zhou T, Ma J, Sun X, Lu Z (2004) Analysis of synony-
mous codon usage in SARS Coronavirus and other viruses in the
Nidovirales. Virus Res 101:155–161
20. Haas J, Park EC, Seed B (1996) Codon usage limitation in the
expression of HIV-1 envelope glycoprotein. Curr Biol 6:315–324
21. Henikoff S, Henikoff JG (1994) Position-based sequence weights.
J Mol Biol 243:574–578
22. Hofmann J, Renz M, Meyer S, von Haeseler A, Liebert UG
(2003) Phylogenetic analysis of rubella virus including new
genotype I isolates. Virus Res 96:123–128
23. Hughes S, Zelus D, Mouchiroud D (1999) Warm-blooded iso-
chore structure in Nile crocodile and turtle. Mol Biol Evol
16:1521–1527
24. Jenkins GM, Rambaut A, Pybus OG, Holmes EC (2002) Rates of
molecular evolution in RNA viruses: a quantitative phylogenetic
analysis. J Mol Evol 54:156–165
25. Jenkins GM, Holmes EC (2003) The extent of codon usage bias
in human RNA viruses and its evolutionary origin. Virus Res
92:1–7
26. Karlin S, Doerfler W, Cardon LR (1994) Why is CpG suppressed
in the genomes of virtually all small eukaryotic viruses but not in
those of large eukaryotic viruses? J Virol 68:2889–2897
27. Karlin S, Burge C (1995) Dinucleotide relative abundance
extremes: a genomic signature. Trends Genet 11:283–290
28. Kay BK, Williamson MP, Sudol M (2000) The importance of
being proline: the interaction of proline-rich motifs in signaling
proteins with their cognate domains. FASEB J 14:231–241
29. Keating CP, Hill MK, Hawkes DJ, Smyth RP, Isel C, Le SY,
Palmenberg AC, Marshall JA, Marquet R, Nabel GJ, Mak J
(2009) The A-rich RNA sequences of HIV-1 pol are important for
the synthesis of viral cDNA. Nucleic Acids Res 37:945–956
30. Khrustalev VV, Barkovsky EV (2011) Unusual nucleotide con-
tent of Rubella virus genome as a consequence of biased RNA-
editing: comparison with Alphaviruses. Int J Bioinform Res Appl
7:82–100
31. Kliman RM, Bernal CA (2005) Unusual usage of AGG and TTG
codons in humans and their viruses. Gene 352:92–99
32. Krug A, Luker GD, Barchet W, Leib DA, Akira S, Colonna M
(2004) Herpes simplex virus type 1 activates murine natural
interferon-producing cells through toll-like receptor 9. Blood
103:1433–1437
33. Kundu TK, Rao MR (1999) CpG islands in chromatin organi-
zation and gene expression. J Biochem 125:217–222
34. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC,
Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R,
Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J,
LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP,
Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos
R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N,
Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R,
Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A,
Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R,
French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt
A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S,
Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen
R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson
JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH,
Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL,
Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL,
Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P,
Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen
A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny
DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives
CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS,
Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M,
Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y,
Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F,
Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR,
Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM,
Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S,
Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen
L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP,
Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J,
Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Mino-
shima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F,
Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la
Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G,
Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S,
Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC,
Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler
EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y,
Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS,
Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P,
Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A,
Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP,
Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski
J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler
R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F,
Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos
A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H,
Choi S, Chen YJ (2001) Initial sequencing and analysis of the
human genome. Nature 409:860–921
35. Levin DB, Whittome B (2000) Codon usage in nucleopolyhed-
roviruses. J General Virol 81:2313–2325
36. Lockhart PJ, Steel MA, Hendy MD, Penny D (1994) Recovering
evolutionary trees under a more realistic model of sequence
evolution. Mol Biol Evol 11:605–612
37. Mansky LM, Temin HM (1995) Lower in vivo mutation rate of
human immunodeficiency virus type 1 than that predicted from
the fidelity of purified reverse transcriptase. J Virol 69:5087–5094
38. McEwan CE, Gatherer D, McEwan NR (1998) Nitrogen-fixing
aerobic bacteria have higher genomic GC content than non-fixing
species within the same genus. Hereditas 128:173–178
39. Menendez-Arias L (2002) Molecular basis of fidelity of DNA
synthesis and nucleotide specificity of retroviral reverse tran-
scriptases. Prog Nucleic Acid Res Mol Biol 71:91–147
40. Mooers AO, Holmes EC (2000) The evolution of base compo-
sition and phylogenetic inference. Trends Ecol Evol 15:365–369
41. Mueller S, Papamichail D, Coleman JR, Skiena S, Wimmer E
(2006) Reduction of the rate of poliovirus protein synthesis
through large-scale codon deoptimization causes attenuation
of viral virulence by lowering specific infectivity. J Virol 80:
9687–9696
42. Pan A, Dutta C, Das J (1998) Codon usage in highly expressed genes
of Haemophillus influenzae and Mycobacterium tuberculosis:
translational selection versus mutational bias. Gene 215:405–413
43. Powell JR, Moriyama EN (1997) Evolution of codon usage bias
in Drosophila. Proc Natl Acad Sci USA 94:7784–7790
44. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002)
TREE-PUZZLE: maximum likelihood phylogenetic analysis
using quartets and parallel computing. Bioinformatics (Oxford,
England) 18:502–504
45. Shackelton LA, Parrish CR, Holmes EC (2006) Evolutionary
basis of codon usage and nucleotide composition bias in verte-
brate DNA viruses. J Mol Evol 62:551–563
46. Sharp PM, Li WH (1986) An evolutionary perspective on syn-
onymous codon usage in unicellular organisms. J Mol Evol
24:28–38
47. Sharp PM, Li WH (1987) The rate of synonymous substitution in
enterobacterial genes is inversely related to codon usage bias.
Mol Biol Evol 4:222–230
48. Sharp PM, Matassi G (1994) Codon usage and genome evolution.
Curr Opin Genet Dev 4:851–860
49. Singer CE, Ames BN (1970) Sunlight ultraviolet and bacterial
DNA base ratios. Science 170:822–825
50. Smith DW (1996) Problems of translating heterologous genes
in expression systems: the role of tRNA. Biotechnol Prog 12:
417–422
51. Tzeng WP, Frey TK (2009) Functional replacement of a domain
in the rubella virus p150 replicase protein by the virus capsid
protein. J Virol 83:3549–3555
52. Urrutia AO, Hurst LD (2001) Codon usage bias covaries with
expression breadth and the rate of synonymous evolution in
humans, but this is not evidence for selection. Genetics
159:1191–1199
53. WHO (2007) Update of standard nomenclature for wild-type
rubella viruses, 2007. Wkly Epidemiol Rec 82:216–222
54. WHO (2007) Update of standard nomenclature for wild-type
rubella viruses, 2007. Releve epidemiologique hebdomadaire/
Section d’hygiene du Secretariat de la Societe des Nations =
Weekly epidemiological record/Health Section of the Secretariat
of the League of Nations 82:216–222
55. Wright F (1990) The ‘effective number of codons’ used in a gene.
Gene 87:23–29
56. Xia X (2000) Factors Affecting Codon Frequencies. Data Anal-
ysis in Molecular Biology and Evolution. Kluwer Academic
Publishers, pp 59–105
57. Xia X, Xie Z (2001) DAMBE: software package for data analysis
in molecular biology and evolution. J Hered 92:371–373
58. Zhao KN, Liu WJ, Frazer IH (2003) Codon usage bias and A + T
content variation in human papillomavirus genomes. Virus Res
98:95–104
59. Zheng M, Klinman DM, Gierynska M, Rouse BT (2002) DNA
containing CpG motifs induces angiogenesis. Proc Natl Acad Sci
USA 99:8944–8949
60. Zhou T, Gu W, Ma J, Sun X, Lu Z (2005) Analysis of synony-
mous codon usage in H5N1 virus and other influenza A viruses.
Bio Syst 81:77–86
61. Zhou Y, Ushijima H, Frey TK (2007) Genomic analysis of
diverse rubella virus genotypes. J General Virol 88:932–941