Research Article
Evidence of Alu and B1 Expression in dbEST
Boris Umylny
Asia Pacific Bioinformatics
Research Institute, Honolulu, HI
Gernot Presting
Department of Molecular
Biosciences and Bioengineering,
University of Hawaii at Manoa,
Honolulu, HI
W. Steven Ward
Institute for Biogenesis Research,
University of Hawaii at Manoa,
Honolulu, HI
Alus and B1s are short interspersed repeat elements (SINEs) derived from
the 7SL RNA gene. Alus and B1s exist in the cytoplasm as non-coding
RNA indicating that they are actively transcribed, but their function, if any,
is unknown. Transcription of individual SINEs is a prerequisite for retroposi-
tion, but it is also possible that individual Alu and B1 elements have some
cellular functions. Previous studies suggest that transcription of Alu elements
depends on the presence of an RNA polymerase-III bipartite promoter and
the poly-A tail. Sequencing of small RNAs has demonstrated that the
members of the Y and S subfamily are expressed. We analyzed almost
one million Alu sequences longer than 200 nucleotides for the presence
of RNA polymerase-III bipartite promoter sequences. More than half con-
tained a promoter indicating some potential for expression. We searched
7.7 million human EST sequences in dbEST for the presence of Alu non-
coding RNAs and found evidence for the expression of 452. Analysis of
mouse spermatogenic dbEST libraries revealed an apparent relationship
between the level of differentiation and the level of B1-related sequences
in the EST library.
KEYWORDS Alu, B1, EST, polymerase III, repeat, SINE
INTRODUCTION
Human Alus [Deininger et al. 1981] and mouse B1s [Krayev et al. 1980] are
short interspersed elements (SINEs) derived from 7SL RNA. 7SL RNA is an
RNA component of the signal recognition particle (SRP), a ribonucleopro-
tein involved in translation of eukaryotic secreted proteins [Ullu and Tschudi
1984]. These repeats occupy a significant portion of the human (10.7%) and
mouse (2.7%) genomes [Lander et al. 2001; Waterston et al. 2002]. A single
Alu repeat consists of two similar elements linked by a short poly-A chain
yielding a total length of approximately 300 nucleotides (nt) [Jurka and
Milosavljevic 1991; Quentin 1992]. The mouse B1 is a monomer of approxi-
mately 140 nt in length, with an internal 29-nucleotide duplication [Labuda
et al. 1991]. The Alu 5′ monomer (FLAM) shares significant similarity with
certain proto-B1 (pB1) sequences [Quentin 1994]. It is believed that Alu
and B1 sequences are propagated by a reverse transcriptase encoded by
Received 12 February 2007; accepted 23 March 2007.
Address correspondence to W. Steven Ward, PhD, University of Hawaii at Manoa, Institute for Biogenesis Research, 1960 East–West Road, Honolulu, HI 96822, USA. E-mail: wward@hawaii.edu
Archives of Andrology: Journal of Reproductive Systems, 53:207–218, 2007. Copyright © Informa Healthcare USA, Inc. ISSN: 0148-5016 print/1521-0375 online. DOI: 10.1080/01485010701426422
the L1 family of long interspersed repeat elements
(LINEs) [Schmid 1998]. The propagation of specific
Alu subfamilies may be attributable to particular L1
subfamilies [Ohshima et al. 2003].
While there is no confirmed function for individ-
ual Alu or B1 elements, various stress conditions
increase both expression [Chu et al. 1998] and
L1-mediated retroposition [Hagan et al. 2003],
suggesting the possibility of function [Schmid 1998].
One possible function for some of the transcribed
Alu elements appears to be inhibiting PKR (a dou-
ble-stranded RNA binding protein) to increase
protein translation [Chu et al. 1998]. Alu sequences
also appear to play a role in exonization and A-I edit-
ing [Dagan et al. 2004; Hasler and Strub 2006]. Alu
and B1 sequences are preferred methylation targets
[Hellmann-Blumberg et al. 1993; Jeong and Lee
2005; Kochanek et al. 1993], accounting for 33% of
the total genomic methylation sites [Schmid 1998].
In the genome, Alu and B1 sequences appear to be
preferentially distributed in GC-rich regions [Lander
et al. 2001; Waterston et al. 2002]. The distribution
of the B1 element in the mouse genome exhibits a
greater correlation with the Alu content of the ortho-
logous areas of the human genome than with the
immediate GC-density [Waterston et al. 2002]. This
suggests that genomic features, which are correlated
with, but distinct from, GC-content, may determine
Alu/B1 distribution [Waterston et al. 2002].
Both full length Alu RNA (flAlu) as well as left
monomer-only Alu RNA can be detected. It has been
proposed that left monomer-only Alu RNA (scAlu) is
a more stable, processed product of the dimeric flAlu
[Maraia et al. 1993]. The flAlu RNA appears to bind
and inactivate PKR [Chu et al. 1998], an inhibitor of
protein translation [Clemens and Elia 1997]. It is
believed that a stress-induced increase in Alu
expression inhibits PKR, thereby allowing increased
protein synthesis [Chu et al. 1998].
Alu subfamilies are classified by age. The oldest,
Jo and Jb, were derived from a single ancestral
gene 81 million years ago [Kapitonov and Jurka
1996]. The intermediate S subfamilies (Sx, Sp, Sq,
and Sc) have an estimated age of approximately 35
to 48 million years [Jurka and Milosavljevic 1991].
The youngest group, previously known as Sb,
includes the Y subfamilies [Rowold and Herrera
2000]. Some Y Alus are estimated to be as young as
3 to 4 million years [Kapitonov and Jurka 1996] and
may still be undergoing active transposition [Jurka
et al. 2002].
The Alus are transcribed by RNA polymerase III
using a bipartite promoter similar to the tRNA pro-
moters, consisting of an A-Box and a B-Box [Perez-
Stable et al. 1984; Shaikh et al. 1997; Shankar et al.
2004]. It is believed that the B-Box plus the poly-A
tail are sufficient to initiate minimal transcription, if
G is at position 1 and T is at position 3 [Shaikh
et al. 1997; Shankar et al. 2004]. Increased stress
may stimulate transcription of Alus that have only the
B-Box [Shankar et al. 2004]. It is believed that the
A-Box contributes to the increased transcription rates
of Alus and to the stability of the transcripts [Shankar
et al. 2004]. The evidence that Alus exist in the cyto-
plasm as non-coding RNA is widely accepted [Chu
et al. 1998; Clemens and Elia 1997; Hagan et al. 2003;
Maraia et al. 1993; Schmid 1998; Vila et al. 2003].
Transcription of Alus is one of the requirements for
retroposition of Alus, so it might be expected that
the youngest Alus, the AluY’s, would be the pre-
dominant elements transcribed. However, it is also
possible that non-transposing Alus are transcribed
for reasons related to cellular function. In support
of this, sequencing of small RNAs has demonstrated
that the members of the S subfamily also are
expressed [Shaikh et al. 1997].
Both the human and mouse sequencing projects
reported large variation among individual Alus
and B1s, with divergence from the consensus
sequence ranging from 1% to 40% [Lander et al. 2001; Waterston et al.
2002]. The currently accepted method for classifying
Alus is to compare each sequence to one of the 34
consensus sequences for each identified subfamily
present in Repbase [Jurka et al. 2005]. These consen-
sus sequences were established by profiling nucleo-
tide frequencies at each position and selecting
specific diagnostic positions to classify a given Alu
sequence into a subfamily [Jurka and Smith 1988].
A recent classification of approximately half of all
human Alus, using frequencies of nucleotide-pairs
at diagnostic positions, resulted in the identification
of at least 213 subfamilies [Price et al. 2004].
The aim of this study was to begin to understand
Alu and B1 expression. We first analyzed the entire
population of Alu repeats 200 bp or greater in length
for the presence of the RNA polymerase III bipartite
promoter sequence. Then, using previously pub-
lished results indicating that Alus can be uniquely
identified using their sequence [Umylny et al. 2007],
we also searched for evidence of expressed Alus in
public domain EST databases. Having established
the technique, we used it to examine the level of
B1 expression in mouse spermatogenic EST libraries.
Murine B1 repeats were uniquely identified by
sequence [Umylny et al. 2007], although they were
far less prevalent than human Alu repeats in the gen-
ome [Lander et al. 2001; Waterston et al. 2002].
RESULTS
Distribution of Poly(A) Tails
in Human Alus
To detect the poly(A) tails, we chose the criteria
that the tail must begin with two A nucleotides at the 3′ end
and be terminated by two consecutive
non-A nucleotides (see Methods). Using this defi-
nition we found that over 75% of human Alus con-
tain a poly(A) tail greater than 8 nucleotides. This
proportion increases slightly, to 77%, if we consider
only those Alus that also contain a B-box and to 80%
in Alus that contain both an A and B-box (Fig. 1).
We used the following templates for promoter
boxes, A-box: [G|T]GGCNNRGTN[G|C] and B-box:
G[A|T]T[C|T]RANNC.
Distribution of Promoter Locations in Human Alus
The locations of the A and B promoter boxes
within all Alu sequences were detected and mapped.
While B-boxes may potentially be present at any
location within the Alu sequence, they show a clear
preference for the area between nucleotides 62 and
82 (Fig. 2). This is consistent with the expectation that
transcription begins at nucleotide −70 from the start
of the B-box [Perez-Stable et al. 1984]. Likewise, 66%
of all A-boxes occur between nucleotides −1 and 5.
Another 27% of the A-boxes are located between
nucleotides 135 and 145 (Fig. 2). However, both A
and B-boxes were observed at all positions within
an Alu sequence. Specifically, there were minor con-
centrations of B-box between nucleotides 194 and
228. This may correspond to the significant con-
centration of A-boxes between nucleotides 135 and
145 and suggests the possibility of right monomer tran-
scription. There is also a minor concentration of
B-boxes between nucleotides 30 and 58, which implies that
Alu transcription may begin upstream of the expected
5′ start (Fig. 3). The basic distribution of A and
B-boxes does not change when we consider the sub-
set of Alus that contain both the A and the B-box. The
same spikes in concentration are evident in the same
locations (Figs. 4 and 5).
Alus Capable of Transcription
It has been suggested that to be capable of tran-
scription by RNA polymerase III, an Alu must contain
a viable poly(A) tail and a B-box [Shankar et al.
2004]. This allows transcription to take place at a
FIGURE 1 Length of the poly-A tails. Of all Alus, 75% have
poly(A) tails greater than 10 nucleotides, 77% of Alus containing
B-boxes have poly(A) tails greater than 10 nucleotides and 80%
of the Alus with both B and A-boxes have poly(A) tails greater
than 10 nucleotides.
FIGURE 2 Distribution of locations of A and B-boxes. Vertical
axis represents the total number of Alus with an A or B-box at
the location indicated on the horizontal axis. Approximately
39% of all Alus contain A-boxes and 73% contain B-boxes. Of
those, 66% of A-boxes and 73% of the B-boxes are contained in
the large spikes between –1 to 5 and 62 to 82, respectively and
27% of the A-boxes are contained in the small spike between
135 and 145.
low, baseline level. The presence of the A-box is
believed to increase transcription rates and stabilize
the transcripts [Shankar et al. 2004]. It is possible that
transcription rates of Alus with only the B-box
might be upregulated during stress. The younger
Y Alus are more likely than the S and J Alus
to have both the A and B-boxes. However, Alus con-
taining only a B-box but no A-box, and therefore
capable of low levels of expression, are equally
likely in all subfamilies (Table 1). There does not
appear to be an obvious preference for the distri-
bution of transcription-capable Alus with respect to
genes, exons or introns (Table 1).
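The grouping used in Table 1 can be illustrated with a minimal sketch. It assumes that the promoter-box positions and poly(A) tail length of each Alu are already known; the function name `expression_class`, its arguments and the labels "none", "low" and "high" are hypothetical and are not taken from the original C application.

```python
def expression_class(b_box_positions, a_box_positions, poly_a_length):
    """Assign an Alu to the 'none', 'low' or 'high' group of Table 1.

    Positions are promoter-box start positions within the Alu, in the same
    coordinates as the text; poly_a_length is the tail length in nucleotides.
    """
    has_b_box = any(60 <= pos <= 90 for pos in b_box_positions)
    has_a_box = any(-1 <= pos <= 5 for pos in a_box_positions)
    if not has_b_box or poly_a_length < 8:
        return "none"   # no B-box between 60 and 90, or poly-A tail < 8
    if has_a_box:
        return "high"   # B-box, poly-A tail >= 8 and A-box between -1 and 5
    return "low"        # B-box and poly-A tail >= 8 only


# Group totals from Table 1 (none, low, high).
TABLE_1_TOTALS = {"none": 453_429, "low": 327_297, "high": 142_705}
capable = TABLE_1_TOTALS["low"] + TABLE_1_TOTALS["high"]
capable_share = capable / sum(TABLE_1_TOTALS.values())
```

With the Table 1 totals, the low and high groups together account for (327,297 + 142,705) / 923,431, or about 51% of the Alus analyzed.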
Alus Capable of Right Monomer
Expression
There appears to be a significant concentration of
A-boxes at the beginning of the right monomer
(Fig. 2). While the A-box by itself is not sufficient
to initiate transcription, all Alu sequences were
checked for the presence of A-boxes at the start of
the right monomer, B-Boxes at 60 to 80 nucleotides
past the A-box and a viable poly A tail. A small num-
ber of primarily old Alus appear to satisfy the
requirement for the significant transcription rate of
the right monomer (Table 2).
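The right monomer check described above can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the window of 135 to 145 for the start of the right monomer is taken from the A-box spike in Fig. 2, the minimum tail length of 8 from Methods, and the function name and parameters are hypothetical.

```python
def right_monomer_candidate(a_box_positions, b_box_positions, poly_a_length,
                            right_start=(135, 145), spacing=(60, 80),
                            min_tail=8):
    """Return True if an Alu has an A-box at the start of the right monomer,
    a B-box 60 to 80 nucleotides past that A-box, and a viable poly(A) tail."""
    if poly_a_length < min_tail:
        return False
    for a_pos in a_box_positions:
        # Assumed window for the start of the right monomer (Fig. 2 spike).
        if not right_start[0] <= a_pos <= right_start[1]:
            continue
        # Look for a B-box at the required distance past this A-box.
        if any(spacing[0] <= b_pos - a_pos <= spacing[1]
               for b_pos in b_box_positions):
            return True
    return False
```

For example, an A-box at position 140 with a B-box at position 205 and a tail of 10 nucleotides would qualify, whereas the same boxes with a tail of 5 nucleotides would not.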
Alus in dbEST
We next examined all of the over seven million
human ESTs in dbEST for evidence of non-coding
Alu expression. Using RepeatMasker, we selected
those ESTs that were (i) over 200 nucleotides in
length and (ii) composed of a single Alu sequence
over 90% of their total length. These criteria were
met by 655 ESTs. This set of sequences was aligned
against all known human Alus, resulting in 452 ESTs
that could be matched to at least one Alu at 80%
identity and 431 ESTs that could be matched to at least
one Alu at 90% identity. Of these 431, a total of 240
ESTs matched exactly one single copy Alu (Table 3).
A “single copy” Alu is a unique Alu variant that can
be identified based on its unique sequence [Umylny
et al. 2007]. A small number of ESTs are marked as
“putative full length reads” in the comment field
FIGURE 4 Distribution of A-boxes. Vertical axis indicates the
total number of Alus with an A-box at the location indicated on
the horizontal axis. Approximately 27% of all Alus with A-boxes
also contain a B-box and 42% of Alus with B-boxes also contain
an A-box. Of all A-boxes in the Alus that also contain B-boxes,
69% fall into the first spike between –1 and 5 and 25% are in the
second spike between 135 and 145.
FIGURE 5 Distribution of B-boxes. Vertical axis represents the
total number of Alus with a B-box at the location represented on
the horizontal axis. Approximately 27% of all Alus with B-boxes
also contain an A-box. The distribution of B-box locations
remains roughly the same, possibly even more pronounced.
The major spike between 62 and 82 contains 94% of all B-boxes
within Alus that also contain an A-box.
FIGURE 3 Distribution of other A and B-boxes. Vertical axis
represents the total number of Alus with an A or B-box at the
location indicated on the horizontal axis. Over 90% of A-boxes
and 73% of B-boxes are located in the preferred locations (−1 to
5 and 135 to 145 for A-boxes and 62 to 82 for B-boxes). This graph
shows the distribution of the A and B-boxes that fall outside those
major regions. It is possible that the small B-box spike in the
194 to 228 region corresponds to the major A-box spike in the
135 to 145 region.
of the dbEST record and three of these were present
in our data set. Of these three, two were found to
have come from deep within an intron and one from
the region outside of any known genes (Table 4). All
three have a poly-A tail greater than 8 nucleotides and
two have a well defined B-box promoter region that
matches consensus (see Methods). One has a B-box
promoter region that deviates in two places from the
consensus, but not in the two key locations (Table 5).
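The selection steps in this section can be sketched as two small filters. The sketch assumes that repeat annotation and alignment have already been run and parsed; the `MaskedEst` record, the `identities` mapping and both function names are hypothetical and do not correspond to RepeatMasker or BLAST output formats. The 90% coverage threshold is treated as inclusive, following Table 3.

```python
from dataclasses import dataclass, field


@dataclass
class MaskedEst:
    """Hypothetical record for one EST after repeat annotation."""
    accession: str
    length: int                                     # EST length in nucleotides
    alu_spans: list = field(default_factory=list)   # (start, end), end exclusive


def is_alu_est(est, min_length=200, min_coverage=0.90):
    """Criteria (i) and (ii): longer than 200 nt, and a single Alu
    sequence covering at least 90% of the EST."""
    if est.length <= min_length or len(est.alu_spans) != 1:
        return False
    start, end = est.alu_spans[0]
    return (end - start) / est.length >= min_coverage


def unique_match(identities, threshold=0.90):
    """Return the Alu ID if exactly one Alu matches at or above the
    identity threshold, otherwise None (ambiguous or unmatched)."""
    hits = [alu_id for alu_id, identity in identities.items()
            if identity >= threshold]
    return hits[0] if len(hits) == 1 else None
```

Requiring exactly one hit is what makes the 240 ESTs mappable to a single genomic location; an EST that matches several Alus at 90% identity is discarded by `unique_match` rather than assigned arbitrarily.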
ESTs Mapped to Alus Located Outside
of Known Genes
The 240 ESTs that each matched one single copy Alu can
be reliably mapped to a location within the genome,
and we were able to examine them with respect to
known genes (Table 4). We found that 22 originated
within exons, 154 within introns and 64 from areas
outside any known gene (Table 6). Of the 64 that
were found outside of genes, all have a poly(A) tail,
31 have a well-defined B-box and 6 also have an
A-box. Of the 27 that do not have a well-defined
B-box, 25 do have a recognizable B-box with G at
position 1 and T at position 3 and only one mutation
from the consensus sequence, the remaining 2
Alus also have G at position 1 and T at position 3
and deviate from the consensus by 2 mutations
(Table 4). The relative lack of the younger Alus in this
data set is due to our requirement of only one match
at 90% identity and that most Y elements have high
similarity to each other. Two Alus (15:6454 and
6:30226) were identified twice by the 64 ESTs
mapped outside of known genes. Both were found
within the same library – invasive ovarian tumor. All
4 accession numbers from this tissue (Table 7)
mapped to these Alus. Alu 6:30226 also has a rela-
tively short poly(A) of only 8 nucleotides. In general,
these non-genic Alus were identified in a wide variety
of tissues (Tables 7 and 8).
Percentage of ESTs Containing Near-
Full-Length B1 Repeats Drops as Cells Approach Terminal
Differentiation
Having established a method for analyzing 7SL
RNA SINEs using human data we adapted the
concept to analyzing mouse EST libraries for the
evidence of B1 expression. Because B1s are
significantly shorter, they are somewhat more diffi-
cult to analyze than Alus. However, they do offer
an advantage of making available libraries from cer-
tain tissues that are not easily obtained from human
tissue banks. In particular, we worked with a set of
EST libraries derived from mouse spermatogenic
cells constructed and donated to dbEST [Boguski
et al. 1993] by McCarrey, Ph.D. (Southwest Foun-
dation for Biomedical Research, Dept. of Genetics)
and sequenced by Eddy, Ph.D. (National Institutes
of Health, National Institute of Environmental Health
Sciences). The distribution of B1-containing ESTs
TABLE 1 Distribution of Alus Capable of No, Low and High Expression
             No B-box between 60 and    B-box between 60 and 90    B-box between 60 and 90, poly-A tail
             90 or poly-A tail <8       and poly-A tail ≥8         ≥8 and A-box between −1 and 5
Exons        2,000 (0.4)                1,445 (0.4)                722 (0.5)
Introns      198,846 (43.8)             145,329 (44.4)             63,650 (44.6)
Intergenic   253,203 (55.8)             180,523 (55.2)             78,339 (54.9)
J            138,378 (31)               56,418 (17)                9,913 (7)
S            264,818 (58)               226,451 (69)               90,710 (64)
Y            50,233 (11)                44,360 (14)                42,082 (29)
Total        453,429 (49)               327,297 (35)               142,705 (16)
The percentages are calculated within expression groups for genomic locations (introns, exons and intergenic) and subfamilies (J, S and Y). The percentages for “total” are calculated for the entire group (e.g., 49% of all Alus are not capable of expression; 58% of Alus not capable of expression are members of the S subfamily).
TABLE 2 Age-Based Distribution of Alus with A and B Boxes in the Right Monomer
Alu family   Have B-box and A-box in the right monomer and poly-A tail ≥8 (%)
J            195 (91)
S            13 (6)
Y            6 (3)
Total        214 (100)
TABLE 4 Probable Alu Expression Located within ESTs Marked as “Putative Full Length Reads”
Alu ID (Accession#)   Sub-family   Location   Tissue         Length (%)   Identity (%)   From 5′    From 3′    Poly(A)-tail
7:57047 (AA063511)    Sq           Intron     Pineal gland   100          98             2,878      6,017      Yes
11:22601 (H71552)     Sx           Intron     Nose           100          98             202,513    669,759    Yes
8:43896 (H64649)      Y            Out        Nose           100          97             NA         NA         Yes
TABLE 3 Alus in Human dbEST
Description                                                                             Number of ESTs
Total human ESTs                                                                        7.7 × 10^6
ESTs containing Alu ≥ 90% of EST length                                                 655
ESTs with at least one match to an Alu greater than 200 nucleotides at 80% identity     452
ESTs with at least one match to an Alu greater than 200 nucleotides at 90% identity     431
ESTs with at least 90% identity match to exactly one Alu longer than 200 nucleotides    240
A total of 7.7 million ESTs were processed; 655 were found to contain a recognizable Alu sequence taking up 90% or more of EST length. A blast of the 655 ESTs against all Alus > 200 nucleotides revealed that 452 match at least one known Alu at 80% identity or better. 431 ESTs matched at 90% or better and a total of 240 ESTs matched exactly one single-copy Alu at 90% identity or better.
TABLE 5 Polymerase III Consensus Sequences in Alus Detected within Probable Expressed Alus
Alu ID     A-Box Sequence    B-Box Sequence   Notes
7:57047    4 GGGCCCAGTGGC    76 GTTCAAGAC
11:22601   None              59 GATCATATG     Mutated in 2 positions: A(7) should be T and G(9) should be C, but G(1) and T(3) are in place
8:43896    None              75 GATCGAGAC
TABLE 6 Possible Expressed Alus in dbEST by Age
Genomic    Number    No B-box   Have B-box and     Have B-box, A-box and   J (%)     S (%)      Y (%)
location   of Alus              poly-A tail ≥8     poly-A tail ≥8
Exon       22        10 (46)    6 (27)             6 (27)                  9 (41)    13 (59)    0 (0)
Intron     154       63 (41)    66 (43)            25 (16)                 65 (42)   85 (55)    4 (3)
Outside    64        27 (42)    31 (48)            6 (10)                  19 (30)   41 (64)    4 (6)
Total      240       100 (42)   103 (43)           37 (15)                 93 (39)   139 (58)   8 (3)
A total of 240 ESTs matched a single Alu at 90% identity. Of this total, 22 matched within an exon, 154 within an intron and 64 outside any known gene.
was approximately equal between introns, exons
and areas outside of genes (Table 9). These libraries
were examined for the proportion of ESTs containing
near full-length B1 repeats. Figure 6 demonstrates
that if these libraries are arranged in the order of
cell-division/replication events, the percentage of
B1 associated elements in the expressed sequences
tends to decline toward the level of the Sertoli cell
reference as the cells approach terminal differentiation
(adjusted R² of 0.87 and slope p-value of 0.004).
DISCUSSION
A significant number of Alus have well-defined
promoters and a poly(A) tail that make them good
candidates for transcription (Table 1). This includes
a large number of J and S Alus that appear to be
capable of transcription. This is probably an under-
estimate of the total population of Alus capable of
transcription as non-coding RNAs. Our analysis of
dbEST provides evidence that Alus with a B-box that
deviate from consensus in as many as two places
could still be expressed (Tables 5 and 6). It is com-
monly accepted that only the small subset of the
AluY subfamily, namely the Ya5 and Yb8 subfamilies,
is currently actively reverse transcribed [Mamedov
et al. 2005]. We provide evidence indicating that
older Alu sequences are expressed as non-coding
RNAs (Tables 4 and 6).
This is supported by previously published experi-
mental observations that members of the AluS
subfamily are transcribed and can be detected in
the cytoplasm [Shaikh et al. 1997]. Since the J and S
Alus are not believed to be reverse transcribed, their
expression can be described as either without
function or an indication that Alu transcription as
non-coding RNA has a function beyond that of retro-
position. While older Alu elements are thought to be
retropositionally inactive [Johanning et al. 2003;
Ohshima et al. 2003], if AluJ and AluS elements are
transcribed and present in the cytoplasm, why would
they not be reverse-transcribed into DNA? It is poss-
ible that certain families of LINE reverse transcrip-
tases have affinity for only certain Alus [Ohshima
et al. 2003], and only LINEs with affinity for AluYs
are currently active. Alternatively, older J/S Alus may be
actively retroposed at this time [Johanning et al.
2003]. We have previously found that a small number
of J and S Alus are exact copies of each other and are
not part of duplicated DNA segments [Umylny et al.
2007]. Evidence that older Alus are also expressed
as non-coding RNA might provide a further indi-
cation that these are either actively being retroposed
or have an important function that protects them
from incidental mutations.
Our analysis of promoter boxes indicates that
AluYs are more likely to have both an A and B-box
and are therefore likely to be transcribed in greater
numbers. However, if Alus have a function, it should
be indicated by differential expression in various
TABLE 7 Distribution of the 64 Alus Matched at 90% Identity
and Located Outside Known Genes
Tissue Number
Undefined 4
Lymph node 2
Ovary 4
Breast 1
Head and neck 1
Stomach 1
Placenta normal 1
Bone marrow 1
Uterus 3
Kidney 1
Thyroid gland 1
Liver 1
Mixed 4
Pineal gland 1
Skin 1
Nervous tumor 1
Heart 1
Brain 1
Pineal body 1
Lung 5
Colon 3
Esophagus 1
Nose 3
Eye 1
Placenta 2
Genitourinary tract 1
NT2 precursor cells induced with Retinoic acid 1
Pooled retinal tissue 1
Fetal heart (8–10 weeks) 2
B cells 2
Liposarcoma 1
Invasive prostate tumor 1
Invasive ovarian tumor 4
Wilms’ tumor 1
Lung carcinoid 1
Myeloma cells 1
Fetal brain 1
Fetus 1
tissues and under different conditions. Therefore,
Alus that have only a B-box and a poly(A) tail and
are expressed at low levels, but could be upregulated
by transcription factors in the event of stress
[Shankar et al. 2004], represent a potentially more
important population. These appear to be repre-
sented by the older J and S subfamilies (Table 1).
There is also evidence that the B-box and a poly(A)
tail are sufficient to induce expression (Tables 5
and 6), which supports previously published results
TABLE 8 Distribution of the 240 Alus Matched at 90% Identity
Tissue                 Number   Tissue                                     Number
unknown                37       uterus                                     10
adrenal gland          2        Wilms’ tumor                               3
aorta                  2        HES lines                                  3
blood                  2        fetal material                             4
bone marrow            1        normal islets                              2
brain                  11       fetal thymus                               1
breast                 6        fetal heart                                5
cervix                 1        multiple myeloma bone marrow               1
colon                  9        NT2 + Retinoic acid + mitotic inhibitors   1
colon est              1        hNT neurons                                2
esophagus              1        retinal tissue                             2
eye                    6        malignant prostate cancer                  7
gall bladder           1        tonsillar cells                            8
genitourinary tract    1        prostatic intraepithelial neoplasia        3
head neck              2        liposarcoma                                2
heart                  1        invasive prostate tumor                    1
kidney                 3        invasive ovarian tumor                     9
kidney tumor           1        various tumors                             1
leg muscle             1        HeLa cells                                 1
liver                  1        invasive thyroid tumor                     1
liver and spleen       8        Ewing’s sarcoma                            1
lung                   9        germ cell tumor                            1
lymph node             2        pineal gland                               2
mixed                  3        placenta                                   3
muscle (skeletal)      1        placenta normal                            2
nervous normal         1        pooled                                     1
nervous tumor          1        prostate                                   2
nose                   5        prostate normal                            1
ovary                  7        retina                                     1
pancreas               7        skin                                       6
parathyroid gland      2        pineal body                                4
TABLE 9 Distribution of B1 Elements in ESTs Relative to
Known Genes
Genomic area B1 (%)
Outside known genes 31
Mapped to introns 24
Mapped to exons 45
FIGURE 6 B1 content in mouse testis libraries. The scale is per-
centage of B1-affiliated nucleotides in the dbEST libraries. The
libraries are arranged in order of cell-division/replication events.
Fitting this data (minus Sertoli cells) to a linear model gives an
adjusted R² of 0.87 and slope p-value of 0.004. pSGA = primitive
type A spermatogonia; SGA = type A spermatogonia; SGB = type
B spermatogonia; Meiotic = a combination of leptotene,
zygotene and preleptotene spermatogonia; 2°Sp = secondary
spermatocytes; RS = round spermatids; and Sertoli = Sertoli
cells.
[Shankar et al. 2004]. In addition to finding A and
B-boxes in the beginning of Alu sequences, A and
B-boxes have been located in the beginning of the
right monomer (Table 2 and Figs. 2 and 3), raising
the possibility that the right monomer could be
separately transcribed.
While it would be difficult to guarantee that a
particular EST is a non-coding Alu, we can set strin-
gent criteria that allow us to state that a particular
EST is likely to be a non-coding Alu. An interesting
case is presented by ESTs that were marked as “putative
full length reads” in the dbEST comment field. In
this way we found three ESTs that are marked as full
length reads, and each matched a different known
Alu over 100% of their length at 97% to 98% ident-
ity. A 3% error rate is expected in generating EST
libraries [Boguski et al. 1993]. All three have a signifi-
cant poly(A) tail and two out of three have a defined
B-box. In the one case where the B-box appears to
be mutated, and assuming that these “mutations”
are not sequencing errors, the B-box would still be
sufficient to trigger transcription [Shankar et al.
2004]. One of these ESTs came from an intergenic
portion of the DNA and two are located within a
known intron. We believe that it is probable that
these ESTs are non-coding Alus. Two of these Alus
are members of the AluS subfamily and one is a
member of AluY, but not Ya5 or Yb8. This would
suggest that their expression is either accidental or
has some function that is not associated with reverse
transcription. Alternatively this might suggest that
older Alu subfamilies are actively retroposed at this
time. This is not the only indication of possible
expression of older Alus. In 1997 Shaikh et al. gener-
ated and sequenced a library of small cytoplasmic
RNAs. Two of the ESTs they submitted to GenBank
appear to be members of the AluS subfamily [Shaikh
et al. 1997].
Since the lack of a “putative full length read” comment
is not an indication that the EST is not a full
length read, we examined all human ESTs in the
database. Even with our strict criteria, which include
a demand for non-ambiguous matches, we were able
to identify 240 ESTs that may represent non-coding
Alus. 154 of these Alus came from within introns.
While it is possible to explain their presence in the
EST libraries as intron contamination, we observed
that certain ESTs explicitly marked as “full length
reads” do represent Alus that originated from deep
within an intron (Table 4). The 64 ESTs that were
from areas outside of known genes, and therefore
cannot be explained as intron contamination, are
very likely to be evidence of possible Alu transcrip-
tion. As in the case of previous experiments, the
older Alu subfamilies are well represented in this
group (Table 6).
These data suggest that more than half of human
Alus contain the proper promoter sequences to be
transcribed by RNA polymerase III, at least by our
definition. We have also provided evidence that data
for the expression of Alus as non-coding RNA can be
found in existing EST databases. We have recently
demonstrated that most human Alu repeats are
unique at some part of their sequence [Umylny et al.
2007] suggesting that the expression of individual Alu
elements can be studied with the proper bioinfor-
matics tools. We are currently investigating whether
individual Alu repeats are expressed in a tissue spe-
cific manner as suggested by the expression of mouse
B1-elements during spermatogenesis.
Applying these methods to the more difficult
problem of analyzing B1 sequences in EST data also
produced unexpected results. Since B1s are significantly
shorter than Alus, we anticipated considerably
more difficulties in working with these sequences,
but we found the analysis of B1s to be just as clear.
By expanding our scope to mouse data we were also
able to obtain access to data on differentiating cells –
the type not easily available for humans. In particular
we examined seven Unigene/dbEST libraries generated
from spermatogenic cells. Focusing on these
particular libraries allowed us to compare “B1
expression” between different cell types. While some
B1s (those localized to exons) clearly were
expressed as part of larger sequences and others
(localized to introns) likely were expressed as a por-
tion of larger sequences, about one third were
mapped to areas of the genome outside known
genes (Table 9). In total there is a clear decline in
the expression of B1 sequences as a spermatogenic
cell approaches terminal differentiation (Figure 6).
In fact the expression of B1 elements in round sper-
matids is about the same as the expression of B1
elements in Sertoli cells (adult somatic tissues).
Further analysis of EST as well as microarray data
confirmed increased expression of B1 sequences in
cancerous and pre-implantation tissues compared
to healthy adult somatic tissues (unpublished data).
MATERIALS AND METHODS
Software
Bioperl was used with Ensembl’s release 31 Perl
API libraries [Hubbard et al. 2005], GNU bash version
3.00.00 and Perl version 5.8.5 for scripting, GNU gcc
version 3.3.4 for procedures that required enhanced
performance, and MySQL version 12.22 for storing
Ensembl data and as the primary data repository.
Analyses were performed on an Intel platform running
the SUSE distribution of Linux (distributed by Novell).
Identifying Promoter Boxes and Poly(A) Tails
All Alu sequences were extracted into a FASTA-
formatted file and a ‘C’-coded application was
developed to detect the promoter A-box and B-box and
the poly-A tail. The promoter boxes were detected
using regular expression search routines based on
published definitions [Perez-Stable et al. 1984; Shaikh et al. 1997;
Shankar et al. 2004]. We used the following templates
for promoter boxes:
• A-box: [G|T]GGCNNRGTN[G|C]
• B-box: G[A|T]T[C|T]RANNC
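The templates above can be turned into regular expressions by expanding the IUPAC ambiguity codes (R = A or G, N = any base). The following is a minimal Python sketch of that idea, not the ‘C’ application used in this study; the function names and the test sequence are hypothetical, and the two boxes are searched independently because the text does not specify a required spacing between them.

```python
import re

# IUPAC ambiguity codes that appear in the templates.
IUPAC = {"R": "[AG]", "N": "[ACGT]"}

def template_to_regex(template):
    """Compile a template such as 'G[A|T]T[C|T]RANNC' into a regex."""
    # '[G|T]' in the paper's notation means 'G or T', i.e. the class [GT].
    body = "".join(IUPAC.get(ch, ch) for ch in template.replace("|", ""))
    return re.compile(body)

A_BOX = template_to_regex("[G|T]GGCNNRGTN[G|C]")
B_BOX = template_to_regex("G[A|T]T[C|T]RANNC")

def find_boxes(seq):
    """Return 0-based start offsets (A-box, B-box); None where a box is absent."""
    seq = seq.upper()
    a = A_BOX.search(seq)
    b = B_BOX.search(seq)
    return (a.start() if a else None, b.start() if b else None)

if __name__ == "__main__":
    # Hypothetical sequence built to contain one instance of each box.
    example = "CC" + "TGGCGCAGTGG" + "TTTT" + "GTTCGAGAC" + "AAAA"
    print(find_boxes(example))
```

Each sequence in the FASTA file would be passed through `find_boxes`, and the presence or absence of each box recorded alongside the sequence identifier.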
The poly-A tail was detected using the following
algorithm. Starting at the 3′ end of the sequence,
the tail must begin with 2 consecutive ‘A’ nucleotides.
We then move towards the 5′ end, evaluating
every nucleotide encountered, and stop
when we encounter 2 consecutive non-A nucleotides.
At all times, the poly-A tail must consist of
at least 50% ‘A’s. Initial scanning
of the EST libraries identified multiple Alu EST
candidates originating from an identifiable genomic
locus. The shortest poly(A) tail among these Alus was
8 nucleotides long, and we subsequently used this length
as the minimum for a functional poly(A) tail.
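One possible reading of this procedure is sketched below in Python. It is an illustration under stated assumptions rather than the original ‘C’ code: the function names are hypothetical, and we assume that the reported tail ends at the most 5′ ‘A’ accepted before the walk stops. Under this reading, the rule about two consecutive non-A nucleotides already keeps the ‘A’ content above 50%, so the explicit check acts only as a safeguard.

```python
MIN_FUNCTIONAL_TAIL = 8  # minimum functional tail length stated in the text

def poly_a_tail_length(seq, min_a_fraction=0.5):
    """Length of the 3' poly-A tail; 0 if the sequence does not end in 'AA'."""
    seq = seq.upper()
    if not seq.endswith("AA"):
        return 0                      # tail must begin with two A's at the 3' end
    a_count = 0
    tail_len = 0                      # distance from the 3' end to the last accepted A
    prev_non_a = False
    # Walk from the 3' end towards the 5' end, one nucleotide at a time.
    for walked, base in enumerate(reversed(seq), start=1):
        is_a = base == "A"
        if not is_a and prev_non_a:
            break                     # two consecutive non-A bases end the walk
        a_count += 1 if is_a else 0
        if a_count / walked < min_a_fraction:
            break                     # A content fell below the required fraction
        if is_a:
            tail_len = walked         # assumption: tail is trimmed to end on an A
        prev_non_a = not is_a
    return tail_len

def has_functional_tail(seq):
    """True if the tail meets the 8-nucleotide minimum."""
    return poly_a_tail_length(seq) >= MIN_FUNCTIONAL_TAIL
```

For example, a sequence ending in `AAATAAAA` preceded by `CC` yields a tail of length 8, because the single ‘T’ is tolerated and the walk stops at the two consecutive ‘C’s.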
The information on promoter boxes and poly(A)
tail length for all Alus was stored in the MySQL data-
base for further evaluation.
dbEST Processing
The entire dbEST library was downloaded and
processed using ‘C’-coded executables. All ESTs for
species “Homo sapiens” were identified and
extracted. The sequences and the accession numbers
were placed into FASTA-formatted files and all metadata,
including accession numbers, insert length,
quality, comment, description and all other fields,
were stored in a MySQL database.
Identification of Alu and B1 Repeats
RepeatMasker [Smit et al. 1996–2004] version 3.1.0
was used to identify Alus and B1s in the FASTA-
formatted dbEST files. RepeatMasker requires two
external packages – WU-BLAST [Gish 1996–2004] to
compare sequences and Repbase libraries [Jurka
et al. 2005] that contain SINE repeat consensus
sequences. The Repbase library release of January
12, 2005 and WU-BLAST version 1.05 were used in
our analyses. The addresses and lengths of Alu and
B1 sequences were taken directly from
RepeatMasker’s “out” files. These lengths include the
poly-A tail at the ends of the sequences and the addresses
correspond to the offsets within the specific EST.
The repeats were identified and loaded into a
MySQL database using custom Perl and bash scripts
and ‘C’-coded executables. The reports were generated
from the database using custom Perl scripts.
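As an illustration of how repeat offsets and lengths can be read from a RepeatMasker “out” file, the Python sketch below parses the whitespace-separated annotation rows. It is not the Perl/bash/‘C’ pipeline used here; the function name, the dictionary keys and the sample row are hypothetical, and the column order assumed is the standard “out” layout (score, divergence, deletions, insertions, query name, query begin, query end, bases left, strand, repeat name, class/family, ...).

```python
def parse_repeatmasker_out(lines):
    """Extract query offsets and repeat lengths from RepeatMasker .out rows."""
    hits = []
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        hits.append({
            "query": fields[4],         # EST identifier
            "begin": begin,             # 1-based offset within the EST
            "end": end,
            "length": end - begin + 1,  # includes any annotated poly-A tail
            "strand": fields[8],        # '+' or 'C' (complement)
            "repeat": fields[9],
            "family": fields[10],
        })
    return hits

if __name__ == "__main__":
    # Hypothetical rows in the .out layout, for illustration only.
    sample = [
        "   SW  perc perc perc  query     position in query    matching  repeat",
        "score  div. del. ins.  sequence  begin  end  (left)   repeat    class/family",
        "",
        " 2100  10.5  0.3  0.0  EST0001  12  301  (45)  +  AluSx  SINE/Alu  1  290  (22)  1",
    ]
    for hit in parse_repeatmasker_out(sample):
        print(hit["query"], hit["repeat"], hit["begin"], hit["length"])
```

The resulting records could then be filtered by repeat name and loaded into a database table keyed on the EST accession number.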
Selecting Putative Expressed Alus from dbEST
The entire dbEST was scanned and all human ESTs
were extracted into FASTA-formatted files. RepeatMasker
was run on the human ESTs, and all ESTs that were
longer than 200 nucleotides and composed of
at least 90% Alu repeats were identified. A
total of 655 ESTs were selected. BLAST was used to
compare these ESTs to the database of all human
Alus, resulting in 452 ESTs that could be matched
to at least one Alu at 80% or greater identity. ESTs
that were marked “putative full length reads” were
selected if they matched a single known Alu at
a minimum identity of 90% over 90% of their length.
These 3 ESTs were marked as probable expressed Alus.
The rest were analyzed using the same criteria
but without requiring the “putative full
length read” comment.
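The selection thresholds can be summarized as a small decision function. The Python sketch below is illustrative only: the `EstHit` record, its field names and the returned labels are hypothetical, and it does not check that the 90%/90% match is to a single Alu, which the procedure above requires.

```python
from dataclasses import dataclass

@dataclass
class EstHit:
    est_length: int         # EST length in nucleotides
    alu_fraction: float     # fraction of the EST annotated as Alu (0-1)
    best_identity: float    # best BLAST identity to a known Alu (0-1)
    best_coverage: float    # fraction of the EST covered by that alignment
    full_length_read: bool  # EST carries the "putative full length read" comment

def classify(hit):
    """Apply the length, Alu-content, identity and coverage thresholds in order."""
    if hit.est_length <= 200 or hit.alu_fraction < 0.90:
        return "rejected"             # fails the initial RepeatMasker screen
    if hit.best_identity < 0.80:
        return "unmatched"            # no Alu matched at 80% identity or better
    if hit.best_identity >= 0.90 and hit.best_coverage >= 0.90:
        # Full-length reads were treated as the strongest evidence.
        return "probable" if hit.full_length_read else "passes_90_90"
    return "matched"
```

For instance, `classify(EstHit(250, 0.95, 0.93, 0.95, True))` returns `"probable"`, whereas the same hit without the full-length comment returns `"passes_90_90"`.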
Evaluating B-box Promoter Site
We evaluated intergenic Alus that were identified
by ESTs at 90% length and 90% identity and that
did not have the expected B-box promoter site.
We generated all 128 possible variations of the
9-nucleotide promoter sequence and aligned them against the 27
Alus that were matched by ESTs but classified as
not capable of expression, using the following BLAST
parameters: -E 2, -q 1 and -W 4. We then evaluated
the alignment of the Alus to the promoter sites
using custom Perl scripts.
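The count of 128 follows directly from the B-box template G[A|T]T[C|T]RANNC: three positions allow two bases each and two positions allow four, giving 2 × 2 × 2 × 4 × 4 = 128 sequences. A minimal Python sketch of the enumeration is shown below; it is an illustration, not the script used in this study, and the function name is hypothetical.

```python
import re
from itertools import product

# Bases allowed by each single-letter code used in the template.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "N": "ACGT"}

def expand_template(template):
    """List every concrete sequence matching a template like 'G[A|T]T[C|T]RANNC'."""
    positions = []
    # Each token is either a bracketed choice such as [A|T] or a single letter.
    for group, letter in re.findall(r"\[([^\]]+)\]|([A-Z])", template):
        positions.append(group.replace("|", "") if group else IUPAC[letter])
    return ["".join(bases) for bases in product(*positions)]

if __name__ == "__main__":
    variants = expand_template("G[A|T]T[C|T]RANNC")
    print(len(variants), variants[0])
```

Each variant can then be written out as a short query sequence for alignment against the candidate Alus.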
Processing Individual dbEST Libraries
Individual spermatogenic mouse libraries
(Lib.11283, Lib.11284, Lib.11285, Lib.11128,
Lib.6786, Lib.6787, Lib.6788 and Lib.6789), con-
structed and donated by J. McCarrey, Ph.D. (South-
west Foundation for Biomedical Research, Dept. of
Genetics) with sequencing done by E.M. Eddy,
Ph.D. (National Institutes of Health, National Institute
of Environmental Health Sciences), were identified
on the Unigene/dbEST web site [Boguski et al.
1993; Pontius et al. 2003] and downloaded in FASTA
format. RepeatMasker was used to identify all B1 elements,
and custom Perl and shell scripts were used to
select those over 100 nucleotides and to store the
data in a custom MySQL database. Reports were generated
using custom SQL, shell and Perl scripts.
Identifying the Location of ESTs within Human and Mouse Genomes
All B1-containing ESTs were aligned against the
mouse genome. The fact that most B1s are unique
sequences [Umylny et al. 2007] assisted in identifying
unambiguous matches. An unambiguous match
required a minimum of 90% identity over 90% of the
EST length, including the repeat, and no other match
better than 70% identity over 70% of the length.
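The unambiguous-match rule can be expressed compactly. The Python sketch below is a hypothetical illustration: each genomic match is represented as an (identity, aligned length) pair, and we assume that “better than 70%” means strictly greater than 70% on both identity and length.

```python
def is_unambiguous(matches, est_length):
    """True if exactly one match passes 90%/90% and no other exceeds 70%/70%.

    matches: list of (identity, aligned_length) tuples, identity in the range 0-1.
    """
    def at_least(match, threshold):
        identity, aligned = match
        return identity >= threshold and aligned / est_length >= threshold

    def better_than(match, threshold):
        identity, aligned = match
        return identity > threshold and aligned / est_length > threshold

    primary = [m for m in matches if at_least(m, 0.90)]
    rivals = [m for m in matches if better_than(m, 0.70)]
    # Any primary match is also counted among the rivals, so a single rival
    # means the primary match is the only one above the 70%/70% level.
    return len(primary) == 1 and len(rivals) == 1
```

For a 200-nucleotide EST, matches of (0.95, 190) and (0.60, 200) give an unambiguous placement, whereas (0.95, 190) together with (0.80, 160) does not.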
ACKNOWLEDGMENTS
This work was supported by the NIH, grant no.
HD28501 to W. S. W.
We gratefully acknowledge the support of the
University of Hawaii Dell Cluster system under the
management of the Department of Information and
Computer Sciences with funding from NIH-NCCR
P20RR016467 and NSF-EPS02–37065.
REFERENCES
Boguski, M. S., Lowe, T. M. and Tolstoshev, C. M. (1993) dbEST – database for “expressed sequence tags”. Nat Genet 4:332–333.
Chu, W. M., Ballard, R., Carpick, B. W., Williams, B. R. and Schmid, C. W. (1998) Potential Alu function: Regulation of the activity of double-stranded RNA-activated kinase PKR. Mol Cell Biol 18:58–68.
Clemens, M. J. and Elia, A. (1997) The double-stranded RNA-dependent protein kinase PKR: Structure and function. J Interferon Cytokine Res 17:503–524.
Dagan, T., Sorek, R., Sharon, E., Ast, G. and Graur, D. (2004) AluGene: a database of Alu elements incorporated within protein-coding genes. Nucleic Acids Res 32:D489–D492.
Deininger, P. L., Jolly, D. J., Rubin, C. M., Friedmann, T. and Schmid, C. W. (1981) Base sequence studies of 300 nucleotide renatured repeated human DNA clones. J Mol Biol 151:17–33.
Gish, W., WU-BLAST. 1996–2004. Website: http://blast.wustl.edu
Hagan, C. R., Sheffield, R. F. and Rudin, C. M. (2003) Human Alu element retrotransposition induced by genotoxic stress. Nat Genet 35:219–220.
Hasler, J. and Strub, K. (2006) Alu elements as regulators of gene expression. Nucleic Acids Res 34:5491–5497.
Hellmann-Blumberg, U., Hintz, M. F., Gatewood, J. M. and Schmid, C. W. (1993) Developmental differences in methylation of human Alu repeats. Mol Cell Biol 13:4523–4530.
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., et al. (2005) Ensembl 2005. Nucleic Acids Res 33:D447–D453.
Jeong, K. S. and Lee, S. (2005) Estimating the total mouse DNA methylation according to the B1 repetitive elements. Biochem Biophys Res Commun 335:1211–1216.
Johanning, K., Stevenson, C. A., Oyeniran, O. O., Gozal, Y. M., Roy-Engel, A. M., Jurka, J. and Deininger, P. L. (2003) Potential for retroposition by old Alu subfamilies. J Mol Evol 56:658–664.
Jurka, J. and Smith, T. (1988) A fundamental division in the Alu family of repeated sequences. Proc Natl Acad Sci U S A 85:4775–4778.
Jurka, J. and Milosavljevic, A. (1991) Reconstruction and analysis of human Alu genes. J Mol Evol 32:105–121.
Jurka, J., Krnjajic, M., Kapitonov, V. V., Stenger, J. E. and Kokhanyy, O. (2002) Active Alu elements are passed primarily through paternal germlines. Theor Popul Biol 61:519–530.
Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O. and Walichiewicz, J. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462–467.
Kapitonov, V. and Jurka, J. (1996) The age of Alu subfamilies. J Mol Evol 42:59–65.
Kochanek, S., Renz, D. and Doerfler, W. (1993) DNA methylation in the Alu sequences of diploid and haploid primary human cells. Embo J 12:1141–1151.
Krayev, A. S., Kramerov, D. A., Skryabin, K. G., Ryskov, A. P., Bayev, A. A. and Georgiev, G. P. (1980) The nucleotide sequence of the ubiquitous repetitive DNA sequence B1 complementary to the most abundant class of mouse fold-back RNA. Nucleic Acids Res 8:1201–1215.
Labuda, D., Sinnett, D., Richer, C., Deragon, J. M. and Striker, G. (1991) Evolution of mouse B1 repeats: 7SL RNA folding pattern conserved. J Mol Evol 32:405–414.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.
Mamedov, I. Z., Arzumanyan, E. S., Amosova, A. L., Lebedev, Y. B. and Sverdlov, E. D. (2005) Whole-genome experimental identification of insertion/deletion polymorphisms of interspersed repeats by a new general approach. Nucleic Acids Res 33:e16.
Maraia, R. J., Driscoll, C. T., Bilyeu, T., Hsu, K. and Darlington, G. J. (1993) Multiple dispersed loci produce small cytoplasmic Alu RNA. Mol Cell Biol 13:4233–4241.
Ohshima, K., Hattori, M., Yada, T., Gojobori, T., Sakaki, Y. and Okada, N. (2003) Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol 4:R74.
Perez-Stable, C., Ayres, T. M. and Shen, C. K. (1984) Distinctive sequence organization and functional programming of an Alu repeat promoter. Proc Natl Acad Sci U S A 81:5291–5295.
Pontius, J. U., Wagner, L. and Schuler, G. (2003) UniGene: A Unified View of the Transcriptome. In The NCBI Handbook ed. Information, N. C. f. B., NCBI, Bethesda (MD).
Price, A. L., Eskin, E. and Pevzner, P. A. (2004) Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 14:2245–2252.
Quentin, Y. (1994) A master sequence related to a free left Alu monomer (FLAM) at the origin of the B1 family in rodent genomes. Nucleic Acids Res 22:2222–2227.
RepeatMasker Open-3.0. Smit, A., Hubley, R. and Green, P. 1996–2004. Website: http://www.repeatmasker.org
Rowold, D. J. and Herrera, R. J. (2000) Alu elements and the human genome. Genetica 108:57–72.
Schmid, C. W. (1998) Does SINE evolution preclude Alu function? Nucleic Acids Res 26:4541–4550.
Shaikh, T. H., Roy, A. M., Kim, J., Batzer, M. A. and Deininger, P. L. (1997) cDNAs derived from primary and small cytoplasmic Alu (scAlu) transcripts. J Mol Biol 271:222–234.
Shankar, R., Grover, D., Brahmachari, S. K. and Mukerji, M. (2004) Evolution and distribution of RNA polymerase II regulatory sites from RNA polymerase III dependant mobile Alu elements. BMC Evol Biol 4:37.
Ullu, E. and Tschudi, C. (1984) Alu sequences are processed 7SL RNA genes. Nature 312:171–172.
Umylny, B., Presting, G., Effird, J., Klimovitsky, B. and Ward, W. (2007) Most human Alu and murine B1 repeats are unique. Journal of Cellular Biochemistry In press.
Vila, M. R., Gelpi, C., Nicolas, A., Morote, J., Schwartz, S., Jr., Schwartz, S. and Meseguer, A. (2003) Higher processing rates of Alu-containing sequences in kidney tumors and cell lines with overexpressed Alu-mRNAs. Oncol Rep 10:1903–1909.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.