
Downloaded By: [University of Hawaii] At: 01:28 12 October 2007

Research Article

Evidence of Alu and B1 Expression in dbEST

Boris Umylny
Asia Pacific Bioinformatics Research Institute, Honolulu, HI

Gernot Presting
Department of Molecular Biosciences and Bioengineering, University of Hawaii at Manoa, Honolulu, HI

W. Steven Ward
Institute for Biogenesis Research, University of Hawaii at Manoa, Honolulu, HI

Alus and B1s are short interspersed repeat elements (SINEs) derived from the 7SL RNA gene. Alus and B1s exist in the cytoplasm as non-coding RNA, indicating that they are actively transcribed, but their function, if any, is unknown. Transcription of individual SINEs is a prerequisite for retroposition, but it is also possible that individual Alu and B1 elements have some cellular functions. Previous studies suggest that transcription of Alu elements depends on the presence of an RNA polymerase III bipartite promoter and the poly-A tail. Sequencing of small RNAs has demonstrated that members of the Y and S subfamilies are expressed. We analyzed almost one million Alu sequences longer than 200 nucleotides for the presence of RNA polymerase III bipartite promoter sequences. More than half contained a promoter, indicating some potential for expression. We searched 7.7 million human EST sequences in dbEST for the presence of Alu non-coding RNAs and found evidence of expression in 452 ESTs. Analysis of mouse spermatogenic dbEST libraries revealed an apparent relationship between the level of differentiation and the level of B1-related sequences in the EST library.

KEYWORDS Alu, B1, EST, polymerase III, repeat, SINE

INTRODUCTION

Human Alus [Deininger et al. 1981] and mouse B1s [Krayev et al. 1980] are short interspersed elements (SINEs) derived from 7SL RNA. 7SL RNA is an RNA component of the signal recognition particle (SRP), a ribonucleoprotein involved in translation of eukaryotic secreted proteins [Ullu and Tschudi 1984]. These repeats occupy a significant portion of the human (10.7%) and mouse (2.7%) genomes [Lander et al. 2001; Waterston et al. 2002]. A single Alu repeat consists of two similar elements linked by a short poly-A chain, yielding a total length of approximately 300 nucleotides (nt) [Jurka and Milosavljevic 1991; Quentin 1992]. The mouse B1 is a monomer of approximately 140 nt in length, with an internal 29-nucleotide duplication [Labuda et al. 1991]. The Alu 5′ monomer (FLAN) shares significant similarity with certain proto-B1 (pB1) sequences [Quentin 1994]. It is believed that Alu and B1 sequences are propagated by a reverse transcriptase encoded by the L1 family of long interspersed repeat elements (LINEs) [Schmid 1998]. The propagation of specific Alu subfamilies may be attributable to particular L1 subfamilies [Ohshima et al. 2003].

Received 12 February 2007; accepted 23 March 2007.

Address correspondence to W. Steven Ward, PhD, University of Hawaii at Manoa, Institute for Biogenesis Research, 1960 East–West Road, Honolulu, HI 96822, USA. E-mail: [email protected]

Archives of Andrology: Journal of Reproductive Systems, 53:207–218, 2007. Copyright © Informa Healthcare USA, Inc. ISSN: 0148-5016 print/1521-0375 online. DOI: 10.1080/01485010701426422

While there is no confirmed function for individual Alu or B1 elements, various stress conditions increase both expression [Chu et al. 1998] and L1-mediated retroposition [Hagan et al. 2003], suggesting the possibility of function [Schmid 1998]. One possible function for some of the transcribed Alu elements appears to be inhibiting PKR (a double-stranded RNA binding protein) to increase protein translation [Chu et al. 1998]. Alu sequences also appear to play a role in exonization and A-to-I editing [Dagan et al. 2004; Hasler and Strub 2006]. Alu and B1 sequences are preferred methylation targets [Hellmann-Blumberg et al. 1993; Jeong and Lee 2005; Kochanek et al. 1993], accounting for 33% of the total genomic methylation sites [Schmid 1998]. In the genome, Alu and B1 sequences appear to be preferentially distributed in GC-rich regions [Lander et al. 2001; Waterston et al. 2002]. The distribution of the B1 element in the mouse genome exhibits a greater correlation with the Alu content of the orthologous areas of the human genome than with the immediate GC density [Waterston et al. 2002]. This suggests that genomic features that are correlated with, but distinct from, GC content may determine Alu/B1 distribution [Waterston et al. 2002].

Both full-length Alu RNA (flAlu) and left monomer-only Alu RNA (scAlu) can be detected. It has been proposed that scAlu is a more stable, processed product of the dimeric flAlu [Maraia et al. 1993]. The flAlu RNA appears to bind and inactivate PKR [Chu et al. 1998], an inhibitor of protein translation [Clemens and Elia 1997]. It is believed that a stress-induced increase in Alu expression inhibits PKR, thereby allowing increased protein synthesis [Chu et al. 1998].

Alu subfamilies are classified by age. The oldest, Jo and Jb, were derived from a single ancestral gene 81 million years ago [Kapitonov and Jurka 1996]. The intermediate S subfamilies (Sx, Sp, Sq, and Sc) have an estimated age of approximately 35 to 48 million years [Jurka and Milosavljevic 1991]. The youngest group, previously known as Sb, includes the Y subfamilies [Rowold and Herrera 2000]. Some Y Alus are estimated to be as young as 3 to 4 million years [Kapitonov and Jurka 1996] and may still be undergoing active transposition [Jurka et al. 2002].

The Alus are transcribed by RNA polymerase III using a bipartite promoter similar to the tRNA promoters, consisting of an A-box and a B-box [Perez-Stable et al. 1984; Shaikh et al. 1997; Shankar et al. 2004]. It is believed that the B-box plus the poly-A tail are sufficient to initiate minimal transcription, provided that G is at position 1 and T is at position 3 of the B-box [Shaikh et al. 1997; Shankar et al. 2004]. Increased stress may stimulate transcription of Alus that have only the B-box [Shankar et al. 2004]. It is believed that the A-box contributes to the increased transcription rates of Alus and to the stability of the transcripts [Shankar et al. 2004]. The evidence that Alus exist in the cytoplasm as non-coding RNA is widely accepted [Chu et al. 1998; Clemens and Elia 1997; Hagan et al. 2003; Maraia et al. 1993; Schmid 1998; Vila et al. 2003]. Transcription of Alus is one of the requirements for retroposition of Alus, so it might be expected that the youngest Alus, the AluYs, would be the predominant elements transcribed. However, it is also possible that non-transposing Alus are transcribed for reasons related to cellular function. In support of this, sequencing of small RNAs has demonstrated that members of the S subfamily are also expressed [Shaikh et al. 1997].

Both the human and mouse sequencing projects reported a large variation among individual Alus and B1s, with divergence from the consensus sequence ranging from 1% to 40% [Lander et al. 2001; Waterston et al. 2002]. The currently accepted method for classifying Alus is to compare each sequence against the 34 consensus sequences, one for each identified subfamily, present in Repbase [Jurka et al. 2005]. These consensus sequences were established by profiling nucleotide frequencies at each position and selecting specific diagnostic positions to classify a given Alu sequence into a subfamily [Jurka and Smith 1988]. A recent classification of approximately half of all human Alus, using frequencies of nucleotide pairs at diagnostic positions, resulted in the identification of at least 213 subfamilies [Price et al. 2004].

The aim of this study was to begin to understand Alu and B1 expression. We first analyzed the entire population of Alu repeats 200 bp or greater in length for the presence of the RNA polymerase III bipartite promoter sequence. Then, using previously published results indicating that Alus can be uniquely identified using their sequence [Umylny et al. 2007], we also searched for evidence of expressed Alus in public domain EST databases. Having established the technique, we used it to examine the level of B1 expression in mouse spermatogenic EST libraries. Murine B1 repeats were uniquely identified by sequence [Umylny et al. 2007], although they were far less prevalent than human Alu repeats in the genome [Lander et al. 2001; Waterston et al. 2002].

RESULTS

Distribution of Poly(A) Tails in Human Alus

To detect the poly(A) tails, we chose the criteria that the sequence must be initiated with two 3′ A nucleotides and terminated by two consecutive non-A nucleotides (see Methods). Using this definition we found that over 75% of human Alus contain a poly(A) tail greater than 8 nucleotides. This proportion increases slightly, to 77%, if we consider only those Alus that also contain a B-box and to 80% in Alus that contain both an A-box and a B-box (Fig. 1). We used the following templates for promoter boxes, A-box: [G|T]GGCNNRGTN[G|C] and B-box: G[A|T]T[C|T]RANNC.
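The poly(A) rule stated above can be sketched in a few lines of Python. This is only one reading of the rule, not the authors' implementation: the scan starts at the 3′ end, requires two terminal A nucleotides, tolerates isolated non-A bases, and stops at the first pair of consecutive non-A nucleotides. The function names and the default threshold of 8 (taken from Table 1) are assumptions made for illustration.

```python
def poly_a_tail_length(seq: str) -> int:
    """Length of the 3' poly(A) tail under one reading of the stated rule."""
    s = seq.upper()
    n = len(s)
    # The tail must begin (reading from the 3' end) with two A nucleotides.
    if n < 2 or s[-2:] != "AA":
        return 0
    start = n - 1  # index of the 5'-most A accepted into the tail so far
    i = n - 1
    while i >= 0:
        if s[i] == "A":
            start = i
        elif i == 0 or s[i - 1] != "A":
            # Two consecutive non-A nucleotides (or the 5' end) close the tail.
            break
        i -= 1
    return n - start


def has_viable_tail(seq: str, min_length: int = 8) -> bool:
    """Hypothetical helper: tail of at least min_length nucleotides."""
    return poly_a_tail_length(seq) >= min_length
```

A single interrupting base, as in `GGCCAAAATAAAA`, is counted as part of the tail in this sketch; whether the original analysis counted it the same way is not stated in the text.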

Distribution of Promoter Locations in Human Alus

The locations of the A and B promoter boxes within all Alu sequences were detected and mapped. While B-boxes may potentially be present at any location within the Alu sequence, they show a clear preference for the area between nucleotides 62 and 82 (Fig. 2). This is consistent with the expectation that transcription begins approximately 70 nucleotides upstream of the start of the B-box [Perez-Stable et al. 1984]. Likewise, 66% of all A-boxes occur between nucleotides −1 and 5. Another 27% of the A-boxes are located between nucleotides 135 and 145 (Fig. 2). However, both A and B-boxes were observed at all positions within an Alu sequence. Specifically, there were minor concentrations of B-boxes between nucleotides 194 and 228. This may correspond to the significant concentration of A-boxes between nucleotides 135 and 145 and suggests the possibility of right monomer transcription. There is also a minor concentration of B-boxes between nucleotides 30 and 58, which implies that Alu transcription may begin upstream of the expected 5′ start (Fig. 3). The basic distribution of A and B-boxes does not change when we consider the subset of Alus that contain both the A and the B-box. The same spikes in concentration are evident in the same locations (Figs. 4 and 5).
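As an illustration of how box locations of this kind could be mapped, the sketch below translates the A-box and B-box templates given in the Results into regular expressions, reading N as any nucleotide and R as A or G (the standard IUPAC meanings). The function name and the 1-based position convention are assumptions; the paper reports positions relative to the Alu sequence but does not describe its scanning code.

```python
import re

# Templates from the text: A-box [G|T]GGCNNRGTN[G|C], B-box G[A|T]T[C|T]RANNC.
# A lookahead is used so that overlapping occurrences are all reported.
A_BOX = re.compile(r"(?=([GT]GGC[ACGT]{2}[AG]GT[ACGT][GC]))")
B_BOX = re.compile(r"(?=(G[AT]T[CT][AG]A[ACGT]{2}C))")


def box_positions(seq: str, pattern: "re.Pattern[str]") -> list:
    """Return 1-based start positions of every template match in seq."""
    return [m.start() + 1 for m in pattern.finditer(seq.upper())]


# Toy sequence built from the A-box and B-box strings listed in Table 5.
toy = "AAA" + "GGGCCCAGTGG" + "TTTT" + "GTTCAAGAC"
a_hits = box_positions(toy, A_BOX)
b_hits = box_positions(toy, B_BOX)
```

Tallying such positions over all Alus would give histograms of the kind summarized in Figs. 2 to 5.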

Alus Capable of Transcription

It has been suggested that to be capable of transcription by RNA polymerase III, an Alu must contain a viable poly(A) tail and a B-box [Shankar et al. 2004]. This allows transcription to take place at a low, baseline level. The presence of the A-box is believed to increase transcription rates and stabilize transcription [Shankar et al. 2004]. It is possible that transcription rates of Alus with only the B-box might be upregulated during stress. The younger Y Alus are more likely than the S and J Alus to have both the A and B-boxes. However, Alus containing a B-box but no A-box, and therefore capable of low levels of expression, are equally likely in all subfamilies (Table 1). There does not appear to be an obvious preference for the distribution of transcription-capable Alus with respect to genes, exons or introns (Table 1).

FIGURE 1 Length of the poly-A tails. Of all Alus, 75% have poly(A) tails greater than 10 nucleotides; 77% of Alus containing B-boxes have poly(A) tails greater than 10 nucleotides; and 80% of the Alus with both B and A-boxes have poly(A) tails greater than 10 nucleotides.

FIGURE 2 Distribution of locations of A and B-boxes. The vertical axis represents the total number of Alus with an A or B-box at the location indicated on the horizontal axis. Approximately 39% of all Alus contain A-boxes and 73% contain B-boxes. Of those, 66% of A-boxes and 73% of the B-boxes are contained in the large spikes between −1 to 5 and 62 to 82, respectively, and 27% of the A-boxes are contained in the small spike between 135 and 145.
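The three expression classes tabulated in Table 1 (no, low and high expression) can be written as a small decision rule. The sketch below uses the windows given in the Table 1 column headings (B-box between 60 and 90, A-box between −1 and 5, poly-A tail of at least 8); the function name and the string labels are hypothetical, and the inputs are assumed to be box start positions and a tail length computed elsewhere.

```python
def expression_class(b_box_positions, a_box_positions, tail_length):
    """Classify one Alu using the column definitions of Table 1."""
    has_b_box = any(60 <= p <= 90 for p in b_box_positions)
    if not has_b_box or tail_length < 8:
        return "none"   # no suitably placed B-box, or poly-A tail too short
    if any(-1 <= p <= 5 for p in a_box_positions):
        return "high"   # B-box, poly-A tail and a 5' A-box
    return "low"        # B-box and poly-A tail only
```

For example, an Alu with a B-box at position 76, an A-box at position 4 and a 12-nucleotide tail would fall into the "high" class under this rule.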

Alus Capable of Right Monomer Expression

There appears to be a significant concentration of A-boxes at the beginning of the right monomer (Fig. 2). While the A-box by itself is not sufficient to initiate transcription, all Alu sequences were checked for the presence of A-boxes at the start of the right monomer, B-boxes 60 to 80 nucleotides past the A-box and a viable poly(A) tail. A small number of primarily old Alus appear to satisfy the requirements for a significant transcription rate of the right monomer (Table 2).
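The right-monomer check described above can be sketched in the same style. The window of 135 to 145 for "the start of the right monomer" is taken from the A-box spike reported for Fig. 2 and is an assumption here, as are the function name and the tail threshold of 8 (from the Table 2 heading).

```python
def right_monomer_candidate(a_box_positions, b_box_positions, tail_length,
                            min_tail=8):
    """True if an A-box at the right-monomer start has a B-box 60-80 nt past it."""
    if tail_length < min_tail:
        return False
    for a in a_box_positions:
        if 135 <= a <= 145:  # assumed window for the right-monomer start
            if any(60 <= b - a <= 80 for b in b_box_positions):
                return True
    return False
```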

Alus in dbEST

We next examined all of the over seven million human ESTs in dbEST for evidence of non-coding Alu expression. Using RepeatMasker, we selected those ESTs that were (i) over 200 nucleotides in length and (ii) composed of a single Alu sequence over 90% of their total length. These criteria were met by 655 ESTs. This set of sequences was aligned against all known human Alus, resulting in 452 ESTs that could be matched to at least one Alu at 80% identity and 431 ESTs that could be matched to at least one Alu at 90% identity. Of these 431, a total of 240 ESTs matched exactly one single copy Alu (Table 3). A “single copy” Alu is a unique Alu variant that can be identified based on its unique sequence [Umylny et al. 2007]. A small number of ESTs are marked as “putative full length reads” in the comment field of the dbEST record and three of these were present in our data set. Of these three, two were found to have come from deep within an intron and one from the region outside of any known genes (Table 4). All three have a poly-A tail greater than 8 nucleotides and two have a well-defined B-box promoter region that matches consensus (see Methods). One has a B-box promoter region that deviates in two places from the consensus, but not in the two key locations (Table 5).

FIGURE 3 Distribution of other A and B-boxes. The vertical axis represents the total number of Alus with an A or B-box at the location indicated on the horizontal axis. Over 90% of A-boxes and 73% of B-boxes are located in the preferred locations (−1 to 5 and 135 to 145 for A-boxes and 62 to 82 for B-boxes). This graph shows the distribution of the A and B-boxes that fall outside those major regions. It is possible that the small B-box spike in the 194 to 228 region corresponds to the major A-box spike in the 135 to 145 region.

FIGURE 4 Distribution of A-boxes. The vertical axis indicates the total number of Alus with an A-box at the location indicated on the horizontal axis. Approximately 27% of all Alus with A-boxes also contain a B-box and 42% of Alus with B-boxes also contain an A-box. Of all A-boxes in the Alus that also contain B-boxes, 69% fall into the first spike between −1 and 5 and 25% are in the second spike between 135 and 145.

FIGURE 5 Distribution of B-boxes. The vertical axis represents the total number of Alus with a B-box at the location represented on the horizontal axis. Approximately 27% of all Alus with B-boxes also contain an A-box. The distribution of B-box locations remains roughly the same, and is possibly even more pronounced. The major spike between 62 and 82 contains 94% of all B-boxes within Alus that also contain an A-box.
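The EST selection criteria described in this section can be summarized as a two-stage filter. The sketch below is a simplification under stated assumptions: the `EstRecord` fields stand in for values that would come from RepeatMasker output and from the alignment against known Alus, the 90% coverage cut-off is applied as "at least 90%" (as in Table 3), and "exactly one hit at 90% identity" is used as a stand-in for the paper's match to a single copy Alu. None of these names come from the original work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EstRecord:
    length: int                    # EST length in nucleotides
    alu_covered: int               # nucleotides covered by one Alu annotation
    identities: List[float] = field(default_factory=list)  # % identity per Alu hit


def passes_prefilter(est: EstRecord) -> bool:
    """Stage 1: longer than 200 nt and at least 90% covered by a single Alu."""
    return est.length > 200 and est.alu_covered / est.length >= 0.90


def match_category(est: EstRecord) -> str:
    """Stage 2: bin the EST by how its Alu alignments behave."""
    if not passes_prefilter(est):
        return "rejected"
    hits_90 = sum(1 for pct in est.identities if pct >= 90.0)
    if hits_90 == 1:
        return "unique_90"      # candidate for unambiguous genome mapping
    if hits_90 > 1:
        return "ambiguous_90"
    if any(pct >= 80.0 for pct in est.identities):
        return "match_80"
    return "no_match"
```

In the paper's counts, 655 ESTs pass the first stage, 452 have a match at 80% identity, 431 at 90% identity, and 240 match exactly one Alu.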

ESTs Mapped to Alus Located Outside of Known Genes

The 240 ESTs that each matched one single copy Alu can be reliably mapped to a location within the genome, and we were able to examine them with respect to known genes (Table 4). We found that 22 originated within exons, 154 within introns and 64 from areas outside any known gene (Table 6). Of the 64 that were found outside of genes, all have a poly(A) tail, 31 have a well-defined B-box and 6 also have an A-box. Of the 27 that do not have a well-defined B-box, 25 do have a recognizable B-box with G at position 1 and T at position 3 and only one mutation from the consensus sequence; the remaining 2 Alus also have G at position 1 and T at position 3 and deviate from the consensus by 2 mutations (Table 4). The relative lack of the younger Alus in this data set is due to our requirement of only one match at 90% identity and to the fact that most Y elements have high similarity to each other. Two Alus (15:6454 and 6:30226) were each identified twice among the 64 ESTs mapped outside of known genes. Both were found within the same library, invasive ovarian tumor. All 4 accession numbers from this tissue (Table 7) mapped to these Alus. Alu 6:30226 also has a relatively short poly(A) tail of only 8 nucleotides. In general, these non-genic Alus were identified in a wide variety of tissues (Tables 7 and 8).
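Counting deviations from the B-box consensus, and checking the two key positions (G at position 1 and T at position 3), can be sketched as follows. The template is the B-box pattern G[A|T]T[C|T]RANNC given in the Results, with R read as A or G and N as any base; the function names are hypothetical. Position numbers produced by this reading need not coincide with the numbering used in the Table 5 note, which may refer to a different consensus.

```python
# One set of allowed bases per position of the 9-nt B-box template.
B_BOX_TEMPLATE = ["G", "AT", "T", "CT", "AG", "A", "ACGT", "ACGT", "C"]


def b_box_deviations(candidate: str) -> list:
    """Return the 1-based positions at which candidate departs from the template."""
    s = candidate.upper()
    if len(s) != len(B_BOX_TEMPLATE):
        raise ValueError("candidate must be 9 nucleotides long")
    return [i + 1 for i, (base, allowed) in enumerate(zip(s, B_BOX_TEMPLATE))
            if base not in allowed]


def key_positions_intact(candidate: str) -> bool:
    """True if G is at position 1 and T is at position 3."""
    s = candidate.upper()
    return len(s) >= 3 and s[0] == "G" and s[2] == "T"
```

Applied to the three B-box sequences listed in Table 5, two match the template exactly and the third (GATCATATG) deviates in two places while keeping G(1) and T(3), in line with the "mutated in 2 positions" note.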

Percentage of ESTs Containing Near-Full-Length B1 Repeats Drops as Cells Approach Terminal Differentiation

Having established a method for analyzing 7SL RNA SINEs using human data, we adapted the concept to analyzing mouse EST libraries for evidence of B1 expression. Because B1s are significantly shorter, they are somewhat more difficult to analyze than Alus. However, they do offer the advantage of available libraries from certain tissues that are not easily obtained from human tissue banks. In particular, we worked with a set of EST libraries derived from mouse spermatogenic cells, constructed and donated to dbEST [Boguski et al. 1993] by McCarrey, Ph.D. (Southwest Foundation for Biomedical Research, Dept. of Genetics) and sequenced by Eddy, Ph.D. (National Institutes of Health, National Institute of Environmental Health Sciences). The distribution of B1-containing ESTs was approximately equal between introns, exons and areas outside of genes (Table 9). These libraries were examined for the proportion of ESTs containing near full-length B1 repeats. Figure 6 demonstrates that if these libraries are arranged in the order of cell-division/replication events, the percentage of B1-associated elements in the expressed sequences tends to decline as the cells approach terminal differentiation (R² of 0.87 and slope p-value of 0.004) to the Sertoli cell reference.

TABLE 1 Distribution of Alus Capable of No, Low and High Expression

Column key: (a) No B-box between 60 and 90 or poly-A tail <8; (b) B-box between 60 and 90 and poly-A tail ≥8; (c) B-box between 60 and 90, poly-A tail ≥8 and A-box between −1 and 5

             (a)              (b)              (c)
Exons        2,000 (0.4)      1,445 (0.4)      722 (0.5)
Introns      198,846 (43.8)   145,329 (44.4)   63,650 (44.6)
Intergenic   253,203 (55.8)   180,523 (55.2)   78,339 (54.9)
J            138,378 (31)     56,418 (17)      9,913 (7)
S            264,818 (58)     226,451 (69)     90,710 (64)
Y            50,233 (11)      44,360 (14)      42,082 (29)
Total        453,429 (49)     327,297 (35)     142,705 (16)

The percentages are calculated within expression groups for genomic locations (introns, exons and intergenic) and subfamilies (J, S and Y). The percentages for “total” are calculated for the entire group (e.g., 49% of all Alus are not capable of expression; 58% of Alus not capable of expression are members of the S subfamily).

TABLE 2 Age-Based Distribution of Alus with A and B Boxes in the Right Monomer

Alu family   Have B-box and A-box in the right monomer and poly-A tail ≥8 (%)
J            195 (91)
S            13 (6)
Y            6 (3)
Total        214 (100)

TABLE 3 Alus in Human dbEST

Description                                                                            Number of ESTs
Total human ESTs                                                                       7.7 × 10^6
ESTs containing Alu ≥ 90% of EST length                                                655
ESTs with at least one match to an Alu greater than 200 nucleotides at 80% identity    452
ESTs with at least one match to an Alu greater than 200 nucleotides at 90% identity    431
ESTs with at least 90% identity match to exactly one Alu longer than 200 nucleotides   240

A total of 7.7 million ESTs were processed; 655 were found to contain a recognizable Alu sequence taking up 90% or more of EST length. A BLAST of the 655 ESTs against all Alus > 200 nucleotides revealed that 452 match at least one known Alu at 80% identity or better. 431 ESTs matched at 90% or better and a total of 240 ESTs matched exactly one single-copy Alu at 90% identity or better.

TABLE 4 Probable Alu Expression Located within ESTs Marked as “Putative Full Length Reads”

Alu ID (Accession #)   Subfamily   Location   Tissue         Length (%)   Identity (%)   From 5′   From 3′   Poly(A) tail
7:57047 (AA063511)     Sq          Intron     Pineal gland   100          98             2,878     6,017     Yes
11:22601 (H71552)      Sx          Intron     Nose           100          98             202,513   669,759   Yes
8:43896 (H64649)       Y           Out        Nose           100          97             NA        NA        Yes

TABLE 5 Polymerase III Consensus Sequences in Alus Detected within Probable Expressed Alus

Alu ID     A-box sequence   B-box sequence   Notes
7:57047    4 GGGCCCAGTGGC   76 GTTCAAGAC
11:22601   None             59 GATCATATG     Mutated in 2 positions: A(7) should be T and G(9) should be C, but G(1) and T(3) are in place
8:43896    None             75 GATCGAGAC

TABLE 6 Possible Expressed Alus in dbEST by Age

Genomic location   Number of Alus   No B-box   Have B-box and poly-A tail ≥8   Have B-box, A-box and poly-A tail ≥8   J (%)     S (%)      Y (%)
Exon               22               10 (46)    6 (27)                          6 (27)                                 9 (41)    13 (59)    0 (0)
Intron             154              63 (41)    66 (43)                         25 (16)                                65 (42)   85 (55)    4 (3)
Outside            64               27 (42)    31 (48)                         6 (10)                                 19 (30)   41 (64)    4 (6)
Total              240              100 (42)   103 (43)                        37 (15)                                93 (39)   139 (58)   8 (3)

A total of 240 ESTs matched a single Alu at 90% identity. Of this total, 22 matched within an exon, 154 within an intron and 64 outside any known gene.
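The counts reported in Table 6 can be cross-checked with a few lines of arithmetic: within each genomic location, the three promoter categories and the three subfamilies should each add up to the number of Alus, and the location rows should add up to the totals row. The sketch below transcribes the counts (not the percentages) from Table 6; the dictionary layout and function names are illustrative only.

```python
# Counts from Table 6: n, (no B-box, B-box only, B-box + A-box), (J, S, Y).
TABLE_6 = {
    "Exon":    {"n": 22,  "promoter": (10, 6, 6),   "subfamily": (9, 13, 0)},
    "Intron":  {"n": 154, "promoter": (63, 66, 25), "subfamily": (65, 85, 4)},
    "Outside": {"n": 64,  "promoter": (27, 31, 6),  "subfamily": (19, 41, 4)},
}
TOTAL_ROW = {"n": 240, "promoter": (100, 103, 37), "subfamily": (93, 139, 8)}


def inconsistent_rows(table):
    """Names of rows whose category counts do not add up to the row total."""
    return [name for name, row in table.items()
            if sum(row["promoter"]) != row["n"]
            or sum(row["subfamily"]) != row["n"]]


def column_totals(table, key):
    """Sum one group of columns ('promoter' or 'subfamily') over all rows."""
    return tuple(sum(col) for col in zip(*(row[key] for row in table.values())))
```

Under this check every row of Table 6 is internally consistent, and the column sums reproduce the totals row.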

DISCUSSION

A significant number of Alus have well-defined promoters and a poly(A) tail that make them good candidates for transcription (Table 1). This includes a large number of J and S Alus that appear to be capable of transcription. This is probably an underestimate of the total population of Alus capable of transcription as non-coding RNAs. Our analysis of dbEST provides evidence that Alus with a B-box that deviates from consensus in as many as two places could still be expressed (Tables 5 and 6). It is commonly accepted that only a small subset of the AluY subfamily, namely the Ya5 and Yb8 subfamilies, is currently actively reverse transcribed [Mamedov et al. 2005]. We provide evidence indicating that older Alu sequences are expressed as non-coding RNAs (Tables 4 and 6).

This is supported by previously published experimental observations that members of the AluS subfamily are transcribed and can be detected in the cytoplasm [Shaikh et al. 1997]. Since the J and S Alus are not believed to be reverse transcribed, their expression can be described as either without function or an indication that Alu transcription as non-coding RNA has a function beyond that of retroposition. While older Alu elements are thought to be retropositionally inactive [Johanning et al. 2003; Ohshima et al. 2003], if AluJ and AluS elements are transcribed and present in the cytoplasm, why would they not be reverse-transcribed into DNA? It is possible that certain families of LINE reverse transcriptases have affinity for only certain Alus [Ohshima et al. 2003], and only LINEs with affinity for AluYs are currently active. Additionally, older J/S Alus are actively retroposed at this time [Johanning et al. 2003]. We have previously found that a small number of J and S Alus are exact copies of each other and are not part of duplicated DNA segments [Umylny et al. 2007]. Evidence that older Alus are also expressed as non-coding RNA might provide a further indication that these are either actively being retroposed or have an important function that protects them from incidental mutations.

Our analysis of promoter boxes indicates that AluYs are more likely to have both an A and B-box and are therefore likely to be transcribed in greater numbers. However, if Alus have a function, it should be indicated by differential expression in various tissues and under different conditions. Therefore, Alus that have only a B-box and a poly(A) tail and are expressed at low levels, but could be upregulated by transcription factors in the event of stress [Shankar et al. 2004], represent a potentially more important population. These appear to be represented by the older J and S subfamilies (Table 1). There is also evidence that the B-box and a poly(A) tail are sufficient to induce expression (Tables 5 and 6), which supports previously published results [Shankar et al. 2004]. In addition to finding A and B-boxes in the beginning of Alu sequences, A and B-boxes have been located in the beginning of the right monomer (Table 2 and Figs. 2 and 3), raising the possibility that the right monomer could be separately transcribed.

TABLE 7 Distribution of the 64 Alus Matched at 90% Identity and Located Outside Known Genes

Tissue                                           Number
Undefined                                        4
Lymph node                                       2
Ovary                                            4
Breast                                           1
Head and neck                                    1
Stomach                                          1
Placenta normal                                  1
Bone marrow                                      1
Uterus                                           3
Kidney                                           1
Thyroid gland                                    1
Liver                                            1
Mixed                                            4
Pineal gland                                     1
Skin                                             1
Nervous tumor                                    1
Heart                                            1
Brain                                            1
Pineal body                                      1
Lung                                             5
Colon                                            3
Esophagus                                        1
Nose                                             3
Eye                                              1
Placenta                                         2
Genitourinary tract                              1
NT2 precursor cells induced with Retinoic acid   1
Pooled retinal tissue                            1
Fetal heart (8–10 weeks)                         2
B cells                                          2
Liposarcoma                                      1
Invasive prostate tumor                          1
Invasive ovarian tumor                           4
Wilms’ tumor                                     1
Lung carcinoid                                   1
Myeloma cells                                    1
Fetal brain                                      1
Fetus                                            1

TABLE 8 Distribution of the 240 Alus Matched at 90% Identity

Tissue                Number   Tissue                                     Number
unknown               37       uterus                                     10
adrenal gland         2        Wilms’ tumor                               3
aorta                 2        HES lines                                  3
blood                 2        fetal material                             4
bone marrow           1        normal islets                              2
brain                 11       fetal thymus                               1
breast                6        fetal heart                                5
cervix                1        multiple myeloma bone marrow               1
colon                 9        NT2 + Retinoic acid + mitotic inhibitors   1
colon est             1        hNT neurons                                2
esophagus             1        retinal tissue                             2
eye                   6        malignant prostate cancer                  7
gall bladder          1        tonsillar cells                            8
genitourinary tract   1        prostatic intraepithelial neoplasia        3
head neck             2        liposarcoma                                2
heart                 1        invasive prostate tumor                    1
kidney                3        invasive ovarian tumor                     9
kidney tumor          1        various tumors                             1
leg muscle            1        HeLa cells                                 1
liver                 1        invasive thyroid tumor                     1
liver and spleen      8        Ewing’s sarcoma                            1
lung                  9        germ cell tumor                            1
lymph node            2        pineal gland                               2
mixed                 3        placenta                                   3
muscle (skeletal)     1        placenta normal                            2
nervous normal        1        pooled                                     1
nervous tumor         1        prostate                                   2
nose                  5        prostate normal                            1
ovary                 7        retina                                     1
pancreas              7        skin                                       6
parathyroid gland     2        pineal body                                4

TABLE 9 Distribution of B1 Elements in ESTs Relative to Known Genes

Genomic area          B1 (%)
Outside known genes   31
Mapped to introns     24
Mapped to exons       45

FIGURE 6 B1 content in mouse testis libraries. The scale is the percentage of B1-affiliated nucleotides in the dbEST libraries. The libraries are arranged in order of cell-division/replication events. Fitting these data (minus Sertoli cells) to a linear model gives an adjusted R² of 0.87 and slope p-value of 0.004. pSGA = primitive type A spermatogonia; SGA = type A spermatogonia; SGB = type B spermatogonia; Meiotic = a combination of leptotene, zygotene and preleptotene spermatogonia; 2° Sp = secondary spermatocytes; RS = round spermatids; and Sertoli = Sertoli cells.
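The linear model reported for Figure 6 is summarized by an adjusted R². As a reminder of how that statistic is obtained, the sketch below fits a straight line by ordinary least squares and applies the usual adjustment for one predictor. It is a generic illustration, not the authors' analysis: the per-library B1 percentages are not given in the text, so no data from the paper are reproduced, and the function name is hypothetical. The slope p-value would additionally require a t distribution and is not computed here.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared, adjusted_r_squared).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1 - ss_res / syy
    # Adjustment for a model with one predictor: n - 2 residual degrees of freedom.
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - 2)
    return slope, intercept, r_squared, adjusted
```

Here `x` would be the rank of each library in the cell-division ordering and `y` its percentage of B1-affiliated nucleotides, with the Sertoli library excluded as in the figure legend.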

While it would be difficult to guarantee that a particular EST is a non-coding Alu, we can set stringent criteria that allow us to state that a particular EST is likely to be a non-coding Alu. An interesting case is presented by ESTs that were marked as “putative full length reads” in the dbEST comment field. In this way we found three ESTs that are marked as full length reads, and each matched a different known Alu over 100% of its length at 97% to 98% identity. A 3% error rate is expected in generating EST libraries [Boguski et al. 1993]. All three have a significant poly(A) tail and two out of three have a defined B-box. In the one case where the B-box appears to be mutated, and assuming that these “mutations” are not sequencing errors, the B-box would still be sufficient to trigger transcription [Shankar et al. 2004]. One of these ESTs came from an intergenic portion of the DNA and two are located within a known intron. We believe that it is probable that these ESTs are non-coding Alus. Two of these Alus are members of the AluS subfamily and one is a member of AluY, but not Ya5 or Yb8. This would suggest that their expression is either accidental or has some function that is not associated with reverse transcription. Alternatively, this might suggest that older Alu subfamilies are actively retroposed at this time. This is not the only indication of possible expression of older Alus. In 1997 Shaikh et al. generated and sequenced a library of small cytoplasmic RNAs. Two of the ESTs they submitted to GenBank appear to be members of the AluS subfamily [Shaikh et al. 1997].

Since the lack of a “putative full length read” comment does not indicate that an EST is not a full length read, we examined all human ESTs in the database. Even with our strict criteria, which include a demand for non-ambiguous matches, we were able to identify 240 ESTs that may represent non-coding Alus. Of these, 154 came from within introns. While it is possible to explain their presence in the EST libraries as intron contamination, we observed that certain ESTs explicitly marked as “full length reads” do represent Alus that originated from deep within an intron (Table 4). The 64 ESTs that were from areas outside of known genes, and therefore cannot be explained as intron contamination, are very likely to be evidence of Alu transcription. As in previous experiments, the older Alu subfamilies are well represented in this group (Table 6).

These data suggest that more than half of human Alus contain the proper promoter sequences to be transcribed by RNA polymerase III, at least by our definition. We have also provided evidence that data for the expression of Alus as non-coding RNA can be found in existing EST databases. We have recently demonstrated that most human Alu repeats are unique at some part of their sequence [Umylny et al. 2007], suggesting that the expression of individual Alu elements can be studied with the proper bioinformatics tools. We are currently investigating whether individual Alu repeats are expressed in a tissue-specific manner, as suggested by the expression of mouse B1 elements during spermatogenesis.

Applying these methods to the more difficult problem of analyzing B1 sequences in EST data also produced unexpected results. Since B1s are significantly shorter than Alus, we anticipated considerably more difficulties in working with these sequences, but we found the analysis of B1s to be just as clear. By expanding our scope to mouse data we were also able to obtain access to data on differentiating cells, a type not easily available for humans. In particular, we examined seven Unigene/dbEST libraries generated from spermatogenic cells. Focusing on these particular libraries allowed us to compare “B1 expression” between different cell types. While some B1s (those localized to exons) clearly were expressed as part of larger sequences and others (localized to introns) likely were expressed as a portion of larger sequences, about one third were mapped to areas of the genome outside known genes (Table 9). In total there is a clear decline in the expression of B1 sequences as a spermatogenic cell approaches terminal differentiation (Figure 5). In fact, the expression of B1 elements in round spermatids is about the same as the expression of B1 elements in Sertoli cells (adult somatic tissues). Further analysis of EST as well as microarray data confirmed increased expression of B1 sequences in cancerous and pre-implantation tissues compared to healthy adult somatic tissues (unpublished data).


MATERIALS AND METHODS

Software

Bioperl was used with Ensembl’s release 31 Perl API libraries [Hubbard et al. 2005]. GNU bash version 3.00.00 and Perl version 5.8.5 were used for scripting, GNU gcc version 3.3.4 for procedures that required enhanced performance, and MySQL version 12.22 for storing Ensembl data as well as for the primary data repository. Analyses were performed on an Intel platform running the SUSE distribution of Linux (distributed by Novell).

Identifying Promoter Boxes and Poly(A) Tails

All Alu sequences were extracted into a FASTA-formatted file, and a C-coded application was developed to detect the promoter A-box and B-box and the poly-A tail. The promoter boxes were detected using regular expression search routines based on published definitions [Perez-Stable et al. 1984; Shaikh et al. 1997; Shankar et al. 2004]. We used the following templates for promoter boxes:

• A-box: [G|T]GGCNNRGTN[G|C]

• B-box: G[A|T]T[C|T]RANNC
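The two templates can be turned into ordinary regular expressions. The sketch below is an illustration in Python, not the C application used in this study. It assumes that a bracketed pair is a choice between two bases and that R and N carry their usual IUPAC meanings (A or G, and any nucleotide). The function names and the test sequence are hypothetical.

```python
import re

# IUPAC codes that occur in the two templates, mapped to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "N": "[ACGT]"}

def template_to_regex(template):
    """Translate a template such as 'G[A|T]T[C|T]RANNC' into a compiled regex."""
    parts = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "[":
            # A bracketed alternative like [A|T] becomes the class [AT].
            close = template.index("]", i)
            parts.append("[" + template[i + 1:close].replace("|", "") + "]")
            i = close + 1
        else:
            parts.append(IUPAC[ch])
            i += 1
    return re.compile("".join(parts))

A_BOX = template_to_regex("[G|T]GGCNNRGTN[G|C]")
B_BOX = template_to_regex("G[A|T]T[C|T]RANNC")

def find_boxes(seq):
    """Return the 0-based start of the first A-box and B-box hit (None if absent)."""
    seq = seq.upper()
    a = A_BOX.search(seq)
    b = B_BOX.search(seq)
    return (a.start() if a else None, b.start() if b else None)

# Hypothetical sequence built for illustration only.
example = "AAAA" + "GGGCAAAGTCG" + "TTTT" + "GTTCGAGAC"
print(find_boxes(example))
```

Only the first hit for each box is reported here; a scan that also looks for boxes at the start of the right monomer would need every hit, for example via `finditer`.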

The poly-A tail was detected using the following algorithm: starting at the 3′ end of the sequence, the tail must begin with 2 consecutive ‘A’ nucleotides; we then move towards the 5′ end, evaluating every nucleotide encountered. The algorithm stops when we encounter 2 consecutive non-A nucleotides. At all times, the poly-A tail must maintain a minimum concentration of 50% ‘A’s. Initial scanning of the EST libraries identified multiple Alu EST candidates originating from an identifiable genomic locus. The shortest poly(A) tail among these Alus has a length of 8, which we subsequently used as the minimum length defining a functional poly(A) tail.
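The scanning rule above leaves a few details open, so the following Python sketch is one possible reading rather than the original C routine. It assumes that the tail ends at the last ‘A’ seen before the stop condition and that the 50% rule is checked after every nucleotide. The function names are hypothetical.

```python
def poly_a_tail_length(seq, min_fraction=0.5):
    """Length of the poly-A tail at the 3' end of seq, or 0 if there is none.

    Assumed reading of the rule: the tail must start with 'AA' at the 3' end,
    the scan moves towards the 5' end, stops at two consecutive non-A bases,
    and the scanned stretch must stay at or above min_fraction 'A'.
    """
    seq = seq.upper()
    if len(seq) < 2 or seq[-2:] != "AA":
        return 0
    a_count = 0
    non_a_run = 0
    best = 0
    for length, base in enumerate(reversed(seq), start=1):
        if base == "A":
            a_count += 1
            non_a_run = 0
        else:
            non_a_run += 1
            if non_a_run == 2:
                break
        if a_count / length < min_fraction:
            break
        if base == "A":
            best = length  # the tail is taken to end on an 'A'
    return best

def has_functional_tail(seq, min_length=8):
    """Apply the length-8 threshold used in the text for a functional tail."""
    return poly_a_tail_length(seq) >= min_length

print(poly_a_tail_length("GATTACAGGC" + "A" * 8))
print(has_functional_tail("ACGTTA"))
```

Isolated non-A bases inside the tail are tolerated under this reading, which matches the stated stop condition of two consecutive non-A nucleotides.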

The information on promoter boxes and poly(A) tail length for all Alus was stored in the MySQL database for further evaluation.

dbEST Processing

The entire dbEST library was downloaded and processed using C-coded executables. All ESTs for the species “Homo sapiens” were identified and extracted. The sequences and accession numbers were placed into FASTA-formatted files, and all metadata, including accession number, insert length, quality, comment, description and all other fields, were stored in a MySQL database.

Identification of Alu and B1 Repeats

RepeatMasker [Smit et al. 1996–2004] version 3.1.0 was used to identify Alus and B1s in the FASTA-formatted dbEST files. RepeatMasker requires two external packages: WU-BLAST [Gish 1996–2004] to compare sequences, and the Repbase libraries [Jurka et al. 2005], which contain SINE repeat consensus sequences. The repeat library release of January 12, 2005 and WU-BLAST version 1.05 were used in our analyses. The addresses and lengths of Alu and B1 sequences were taken directly from RepeatMasker’s “out” files. These lengths include the poly-A tail at the ends of the sequences, and the addresses correspond to offsets within the specific EST.

The repeats were identified and loaded into a MySQL database using custom Perl and bash scripts and C-coded executables. The reports were generated from the database using custom Perl scripts.

Selecting Putative Expressed Alus from dbEST

The entire dbEST was scanned and all human ESTs were extracted into FASTA-formatted files. RepeatMasker was run on the human ESTs, and all ESTs composed of at least 90% Alu repeats and longer than 200 nucleotides were identified. A total of 655 ESTs were selected. BLAST was used to compare these ESTs to the database of all human Alus, resulting in 452 ESTs that could be matched to at least one Alu at 80% or greater identity. ESTs that were marked “putative full length reads” were selected if they matched a single known Alu at a minimum identity of 90% over 90% of their length. The 3 ESTs meeting these conditions were marked as probable Alu expressions. The rest were analyzed using the same criteria without requiring the “putative full length read” comment.
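The first filter (at least 90% Alu content and a length over 200 nucleotides) can be sketched directly from RepeatMasker annotation lines. The Python below is an illustration, not the Perl and C pipeline used here. It assumes the standard whitespace-separated RepeatMasker “out” layout, recovers the EST length as the query end plus the “(left)” value, and treats any repeat whose name starts with “Alu” as an Alu. The function names and sample rows are hypothetical.

```python
from collections import defaultdict

def alu_coverage(out_lines):
    """Return {est_id: (est_length, alu_bases)} from RepeatMasker .out lines."""
    intervals = defaultdict(list)
    lengths = {}
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        query, begin, end = fields[4], int(fields[5]), int(fields[6])
        left = int(fields[7].strip("()"))
        lengths[query] = end + left
        if fields[9].startswith("Alu"):  # assumption: Alu names begin with "Alu"
            intervals[query].append((begin, end))
    result = {}
    for query, length in lengths.items():
        covered, last_end = 0, 0
        for begin, end in sorted(intervals[query]):
            begin = max(begin, last_end + 1)  # do not double-count overlaps
            if end >= begin:
                covered += end - begin + 1
                last_end = end
        result[query] = (length, covered)
    return result

def select_candidates(coverage, min_len=200, min_fraction=0.90):
    """ESTs longer than min_len with at least min_fraction Alu content."""
    return sorted(q for q, (length, alu) in coverage.items()
                  if length > min_len and alu / length >= min_fraction)

# Hypothetical annotation rows, for illustration only.
sample = [
    "   SW  perc perc perc  query  position in query  matching  repeat",
    "score  div. del. ins.  sequence  begin  end  (left)  repeat  class/family",
    "",
    " 2100  3.0 0.0 0.0 EST_A  1 290  (10) + AluSx SINE/Alu     1  290  (22) 1",
    " 1500 10.0 0.0 0.0 EST_B  1 150 (150) + AluY  SINE/Alu     1  150 (161) 2",
    "  900 12.0 0.0 0.0 EST_C 11 250   (0) C L1PA4 LINE/L1  (100) 5000 4761 3",
]
coverage = alu_coverage(sample)
print(select_candidates(coverage))
```

The identity thresholds applied afterwards (80% for the BLAST comparison, 90% over 90% of the length for the full length reads) would operate on alignment results rather than on these annotation lines.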

Evaluating B-box Promoter Site

We evaluated intergenic Alus that were identified by ESTs at 90% length and 90% identity and that did not have the expected B-box promoter site.


We generated all 128 possible variations of the 9-nucleotide promoter sequence and aligned them against the 27 Alus that were matched by ESTs but classified as not capable of expression, using the following BLAST parameters: -E 2, -q 1 and -W 4. We then evaluated the alignment of the Alus to the promoter sites using custom Perl scripts.
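The count of 128 follows from the B-box template: three positions allow two bases each and two positions allow any of four bases, so 2 × 2 × 2 × 4 × 4 = 128. The Python sketch below enumerates these variants. As a stand-in for the BLAST alignment, it also scores every 9-nucleotide window of a sequence by its number of mismatches to the template. This ungapped scan is a simpler, swapped-in technique and not the procedure used in the study. All names and the example sequence are hypothetical.

```python
from itertools import product

# Allowed bases at each position of the B-box template G[A|T]T[C|T]RANNC.
B_BOX_POSITIONS = ["G", "AT", "T", "CT", "AG", "A", "ACGT", "ACGT", "C"]

def b_box_variants():
    """Every concrete 9-nucleotide sequence that satisfies the template."""
    return ["".join(p) for p in product(*B_BOX_POSITIONS)]

def closest_b_box(seq):
    """Return (mismatches, offset, window) for the best ungapped placement.

    The mismatch count of a window equals its Hamming distance to the
    nearest of the enumerated variants. Returns None if seq is too short.
    """
    seq = seq.upper()
    k = len(B_BOX_POSITIONS)
    best = None
    for offset in range(len(seq) - k + 1):
        window = seq[offset:offset + k]
        mismatches = sum(base not in allowed
                         for base, allowed in zip(window, B_BOX_POSITIONS))
        if best is None or mismatches < best[0]:
            best = (mismatches, offset, window)
    return best

variants = b_box_variants()
print(len(variants))
# Hypothetical sequence with one substitution in the sixth B-box position.
print(closest_b_box("AAAGTTCGCGACAAA"))
```

A window with one or two mismatches is the kind of near-miss site that the evaluation of “mutated” B-boxes is concerned with.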

Processing Individual dbEST Libraries

Individual spermatogenic mouse libraries (Lib.11283, Lib.11284, Lib.11285, Lib.11128, Lib.6786, Lib.6787, Lib.6788 and Lib.6789), constructed and donated by J. McCarrey, Ph.D. (Southwest Foundation for Biomedical Research, Dept. of Genetics) with sequencing done by E.M. Eddy, Ph.D. (National Institutes of Health, National Institute of Environmental Health Sciences), were identified on the Unigene/dbEST web site [Boguski et al. 1993; Pontius et al. 2003] and downloaded in FASTA format. RepeatMasker was used to identify all B1 elements, and custom Perl and shell scripts were used to select those over 100 nucleotides and to store the data in a custom MySQL database. Reports were generated using custom SQL, shell and Perl scripts.

Identifying the Location of ESTs within Human and Mouse Genomes

All B1-containing ESTs were aligned against the mouse genome. The fact that most B1s are unique sequences [Umylny et al. 2007] assisted in identifying unambiguous matches. An unambiguous match required a minimum of 90% identity over 90% of the EST length, including the repeat, and no other match better than 70% identity over 70% of the length.
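The unambiguous-match rule can be written as a small filter over the alignment hits of one EST. The Python sketch below is illustrative only. It assumes each hit carries a fractional identity and an aligned length, and it reads “better than 70% identity over 70% length” as a competing hit that reaches both thresholds. The dictionary keys and the function name are hypothetical.

```python
def unambiguous_hit(hits, est_length):
    """Return the single qualifying hit for an EST, or None if ambiguous.

    hits: list of dicts with 'locus', 'identity' (0..1) and 'aligned' (bases).
    Assumed rule: exactly one hit at >= 90% identity over >= 90% of the EST,
    and no other hit at >= 70% identity over >= 70% of the EST.
    """
    def passes(hit, min_identity, min_coverage):
        return (hit["identity"] >= min_identity
                and hit["aligned"] / est_length >= min_coverage)

    strong = [h for h in hits if passes(h, 0.90, 0.90)]
    if len(strong) != 1:
        return None
    competitors = [h for h in hits
                   if h is not strong[0] and passes(h, 0.70, 0.70)]
    return None if competitors else strong[0]

# Hypothetical hits for a 300-nucleotide EST.
clean = [
    {"locus": "chr1:100", "identity": 0.97, "aligned": 290},
    {"locus": "chr5:900", "identity": 0.65, "aligned": 280},
]
ambiguous = [
    {"locus": "chr1:100", "identity": 0.97, "aligned": 290},
    {"locus": "chr5:900", "identity": 0.85, "aligned": 250},
]
print(unambiguous_hit(clean, 300))
print(unambiguous_hit(ambiguous, 300))
```

Requiring the gap between the best hit and every other hit is what lets a repeat-containing EST be assigned to one genomic locus despite the high copy number of the repeat family.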

ACKNOWLEDGMENTS

This work was supported by the NIH, grant no.

HD28501 to W. S. W.

We gratefully acknowledge the support of the

University of Hawaii Dell Cluster system under the

management of the Department of Information and

Computer Sciences with funding from NIH-NCCR

P20RR016467 and NSF-EPS02–37065.

REFERENCES

Boguski, M. S., Lowe, T. M. and Tolstoshev, C. M. (1993) dbEST – database for “expressed sequence tags”. Nat Genet 4:332–333.

Chu, W. M., Ballard, R., Carpick, B. W., Williams, B. R. and Schmid, C. W. (1998) Potential Alu function: Regulation of the activity of double-stranded RNA-activated kinase PKR. Mol Cell Biol 18:58–68.

Clemens, M. J. and Elia, A. (1997) The double-stranded RNA-dependent protein kinase PKR: Structure and function. J Interferon Cytokine Res 17:503–524.

Dagan, T., Sorek, R., Sharon, E., Ast, G. and Graur, D. (2004) AluGene: a database of Alu elements incorporated within protein-coding genes. Nucleic Acids Res 32:D489–D492.

Deininger, P. L., Jolly, D. J., Rubin, C. M., Friedmann, T. and Schmid, C. W. (1981) Base sequence studies of 300 nucleotide renatured repeated human DNA clones. J Mol Biol 151:17–33.

Gish, W., WU-BLAST. 1996–2004. Website: http://blast.wustl.edu

Hagan, C. R., Sheffield, R. F. and Rudin, C. M. (2003) Human Alu element retrotransposition induced by genotoxic stress. Nat Genet 35:219–220.

Hasler, J. and Strub, K. (2006) Alu elements as regulators of gene expression. Nucleic Acids Res 34:5491–5497.

Hellmann-Blumberg, U., Hintz, M. F., Gatewood, J. M. and Schmid, C. W. (1993) Developmental differences in methylation of human Alu repeats. Mol Cell Biol 13:4523–4530.

Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., et al. (2005) Ensembl 2005. Nucleic Acids Res 33:D447–D453.

Jeong, K. S. and Lee, S. (2005) Estimating the total mouse DNA methylation according to the B1 repetitive elements. Biochem Biophys Res Commun 335:1211–1216.

Johanning, K., Stevenson, C. A., Oyeniran, O. O., Gozal, Y. M., Roy-Engel, A. M., Jurka, J. and Deininger, P. L. (2003) Potential for retroposition by old Alu subfamilies. J Mol Evol 56:658–664.

Jurka, J. and Smith, T. (1988) A fundamental division in the Alu family of repeated sequences. Proc Natl Acad Sci U S A 85:4775–4778.

Jurka, J. and Milosavljevic, A. (1991) Reconstruction and analysis of human Alu genes. J Mol Evol 32:105–121.

Jurka, J., Krnjajic, M., Kapitonov, V. V., Stenger, J. E. and Kokhanyy, O. (2002) Active Alu elements are passed primarily through paternal germlines. Theor Popul Biol 61:519–530.

Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O. and Walichiewicz, J. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462–467.

Kapitonov, V. and Jurka, J. (1996) The age of Alu subfamilies. J Mol Evol 42:59–65.

Kochanek, S., Renz, D. and Doerfler, W. (1993) DNA methylation in the Alu sequences of diploid and haploid primary human cells. Embo J 12:1141–1151.

Krayev, A. S., Kramerov, D. A., Skryabin, K. G., Ryskov, A. P., Bayev, A. A. and Georgiev, G. P. (1980) The nucleotide sequence of the ubiquitous repetitive DNA sequence B1 complementary to the most abundant class of mouse fold-back RNA. Nucleic Acids Res 8:1201–1215.

Labuda, D., Sinnett, D., Richer, C., Deragon, J. M. and Striker, G. (1991) Evolution of mouse B1 repeats: 7SL RNA folding pattern conserved. J Mol Evol 32:405–414.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.

Mamedov, I. Z., Arzumanyan, E. S., Amosova, A. L., Lebedev, Y. B. and Sverdlov, E. D. (2005) Whole-genome experimental identification of insertion/deletion polymorphisms of interspersed repeats by a new general approach. Nucleic Acids Res 33:e16.

Maraia, R. J., Driscoll, C. T., Bilyeu, T., Hsu, K. and Darlington, G. J. (1993) Multiple dispersed loci produce small cytoplasmic Alu RNA. Mol Cell Biol 13:4233–4241.

Ohshima, K., Hattori, M., Yada, T., Gojobori, T., Sakaki, Y. and Okada, N. (2003) Whole-genome screening indicates a possible burst of formation of processed pseudogenes and Alu repeats by particular L1 subfamilies in ancestral primates. Genome Biol 4:R74.

Perez-Stable, C., Ayres, T. M. and Shen, C. K. (1984) Distinctive sequence organization and functional programming of an Alu repeat promoter. Proc Natl Acad Sci U S A 81:5291–5295.


Pontius, J. U., Wagner, L. and Schuler, G. (2003) UniGene: A Unified View of the Transcriptome. In The NCBI Handbook ed. Information, N. C. f. B., NCBI, Bethesda (MD).

Price, A. L., Eskin, E. and Pevzner, P. A. (2004) Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 14:2245–2252.

Quentin, Y. (1994) A master sequence related to a free left Alu monomer (FLAM) at the origin of the B1 family in rodent genomes. Nucleic Acids Res 22:2222–2227.

RepeatMasker Open-3.0. Smit, A., Hubley, R. and Green, P. 1996–2004. Website: http://www.repeatmasker.org

Rowold, D. J. and Herrera, R. J. (2000) Alu elements and the human genome. Genetica 108:57–72.

Schmid, C. W. (1998) Does SINE evolution preclude Alu function? Nucleic Acids Res 26:4541–4550.

Shaikh, T. H., Roy, A. M., Kim, J., Batzer, M. A. and Deininger, P. L. (1997) cDNAs derived from primary and small cytoplasmic Alu (scAlu) transcripts. J Mol Biol 271:222–234.

Shankar, R., Grover, D., Brahmachari, S. K. and Mukerji, M. (2004) Evolution and distribution of RNA polymerase II regulatory sites from RNA polymerase III dependant mobile Alu elements. BMC Evol Biol 4:37.

Ullu, E. and Tschudi, C. (1984) Alu sequences are processed 7SL RNA genes. Nature 312:171–172.

Umylny, B., Presting, G., Effird, J., Klimovitsky, B. and Ward, W. (2007) Most human Alu and murine B1 repeats are unique. Journal of Cellular Biochemistry In press.

Vila, M. R., Gelpi, C., Nicolas, A., Morote, J., Schwartz, S., Jr., Schwartz, S. and Meseguer, A. (2003) Higher processing rates of Alu-containing sequences in kidney tumors and cell lines with overexpressed Alu-mRNAs. Oncol Rep 10:1903–1909.

Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562.
