palindromes in viral genomes - utep mathematics · 2007. 2. 15. · length 22 palindrome •...

Palindromes in Viral Genomes

Ming-Ying Leung

Department of Mathematical Sciences

The University of Texas at El Paso (UTEP)

Outline

• Palindromes
• Viral Genomes
• Roles of palindromes in DNA and RNA viruses

[Figures: Cytomegalovirus (CMV) particle; SARS coronavirus particle]

Palindromes in Letter Sequences

Odd Palindrome: “A nut for a jar of tuna”

ANUTFORA J AROFTUNA

(remove spaces and capitalize)

Even Palindrome: “Step on no pets”

STEPON NOPETS
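The two examples above can be checked mechanically. The following is a minimal Python sketch; the function names `normalize`, `is_letter_palindrome` and `palindrome_kind` are illustrative choices, not part of the original slides.

```python
def normalize(text: str) -> str:
    """Remove everything except letters and capitalize."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def is_letter_palindrome(text: str) -> bool:
    """True when the normalized text reads the same in both directions."""
    s = normalize(text)
    return s == s[::-1]


def palindrome_kind(text: str) -> str:
    """Classify by normalized length: 'odd' has a single middle letter."""
    return "odd" if len(normalize(text)) % 2 else "even"


if __name__ == "__main__":
    for phrase in ("A nut for a jar of tuna", "Step on no pets"):
        print(normalize(phrase), is_letter_palindrome(phrase), palindrome_kind(phrase))
```

An odd palindrome has one unpaired letter in the middle (the J in the first phrase); an even palindrome pairs every letter.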

DNA and RNA

• DNA is deoxyribonucleic acid, made up of 4 nucleotide bases: Adenine, Cytosine, Guanine, and Thymine.

• RNA is ribonucleic acid, made up of 4 nucleotide bases: Adenine, Cytosine, Guanine, and Uracil.

• The bases A and T form a complementary pair, as do C and G.
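Base complementarity is easy to express as a lookup table. The sketch below is illustrative only: the names `DNA_PAIRS`, `RNA_PAIRS` and `reverse_complement` are assumptions, and the RNA table relies on the standard fact (not stated on the slide) that U takes the place of T when pairing with A.

```python
# Complementary base pairs. The RNA table assumes A pairs with U,
# since Uracil replaces Thymine in RNA.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str, pairs: dict = DNA_PAIRS) -> str:
    """Complement every base and read the result in the opposite direction."""
    return "".join(pairs[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    print(reverse_complement("GCAAT"))             # DNA alphabet
    print(reverse_complement("GCAAU", RNA_PAIRS))  # RNA alphabet
```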

[Figure: nucleotide base letters A, C, G, T and U]

DNA/RNA Palindromes

Virus and Eye Diseases

[Figure: CMV particle]

CMV Retinitis
• inflammation of the retina
• triggered by CMV particles
• may lead to blindness

Genome size ~ 230 kbp

Palindromes and Replication Origins in Herpesviruses

• High concentration of palindromes exists around replication origins of herpesviruses

• Locating clusters of palindromes (above a minimal length) on herpesvirus genome sequences might reveal the likely locations of their replication origins.

Human Herpesvirus 6 (HHV-6) Genome

• Replication origin (oriLyt) in non-coding region
• Two binding sites for origin-binding protein (OBP)
• Two 200-bp DNA unwinding elements (DUE)

Computational Prediction of Replication Origins

• Poisson approximation for palindrome distribution in an i.i.d. random sequence model

• Criterion for identifying statistically significant palindrome clusters using scan statistics

• Improve prediction accuracy
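As a rough illustration of the scan-statistic idea, the sketch below computes the largest number of palindrome positions falling in any window of a fixed width. It is a sketch under stated assumptions: the function name `scan_statistic`, the half-open window convention and the example positions are all hypothetical, and the significance criterion derived from the Poisson approximation is not reproduced here.

```python
def scan_statistic(positions, window):
    """Largest number of positions inside any half-open window [s, s + window).

    It suffices to try windows that start at an observed position.
    Returns (max_count, start_of_best_window); start is None for empty input.
    """
    pts = sorted(positions)
    best_count, best_start = 0, None
    right = 0
    for left, start in enumerate(pts):
        # Advance 'right' to the first position outside the current window.
        while right < len(pts) and pts[right] < start + window:
            right += 1
        count = right - left
        if count > best_count:
            best_count, best_start = count, start
    return best_count, best_start


if __name__ == "__main__":
    # Hypothetical palindrome positions, for illustration only.
    print(scan_statistic([10, 500, 520, 530, 2000], window=100))
```

A window whose count exceeds what the random sequence model makes plausible would be flagged as a candidate cluster; the threshold itself has to come from the scan-statistic distribution.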

Poisson Process Approximation of Palindrome Distribution

Use of the Scan Statistic to Identify Clusters of Palindromes

Improve Prediction Accuracy

• Using a Markov random sequence model

• Taking the lengths of palindromes and base composition of the genomes into consideration when counting palindromes. [Chew et al. 2005, Nucleic Acids Research 33(15):e134]

• Using machine learning approaches to utilize knowledge about replication origin locations in closely related genomes.

SARS Viral Particles

[Figure labels: SARS virus particle; spike glycoprotein; host cell]

• Single-stranded RNA genome with ~30 kilobases
• Only about 7% of the genome size of Cytomegalovirus (CMV), which has double-stranded DNA

SARS Virus Genome Map

• Replicase (1a and 1b), spike glycoprotein (S), X1 and X2 occupy 87% of the genome

• Two pairs of overlapping ORFs, 1a & 1b and X1 & X2 (as designated by Rota et al. 2003), are predicted in this region

• 1a and 1b are standard in all coronaviruses; X1 and X2 are unique to SARS. Whether X1 and X2 code for proteins is still unconfirmed

[Genome map: ORF1a, ORF1b, S, X1, X2; marked positions 256, 13398, 21485, 25268]

A Long Palindrome in X1 and X2

TCTTTAACAAGCTTGTTAAAGA

Positions: 25962-25983 (22 bases)

• Found in SARS but not in the 6 other coronavirus genomes (Chew et al. 2004)

• The next longest palindrome in SARS is 14 bases long

• In the overlapping region of X1 and X2

Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A palindrome must be even in length, e.g. palindrome of length 10:

5’ ….. GCAATATTGC …..3’

Position:  j-L+1  …  j   j+1  …  j+L
Base:      b1 b2 … bL   bL+1 … b2L-1 b2L

We say that a palindrome of length 2L occurs at position j when the (j-i+1)st and the (j+i)th bases are complementary to each other for i = 1, …, L. In an i.i.d. sequence model this occurs with probability

\[ \left[ 2\,(p_A p_T + p_C p_G) \right]^{L}. \]
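The definition above translates directly into code. The following Python sketch is illustrative only: the names `is_palindrome`, `half_length_at` and `find_palindromes` are hypothetical, positions are 1-based to match the slide, and the short test sequence in the usage lines is made up.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindrome(seq: str) -> bool:
    """True if a non-empty, even-length seq equals its reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and len(seq) % 2 == 0 and all(
        COMPLEMENT.get(a) == b for a, b in zip(seq, reversed(seq))
    )


def half_length_at(seq: str, j: int) -> int:
    """Largest L such that bases j-i+1 and j+i (1-based) pair for i = 1..L."""
    seq = seq.upper()
    L = 0
    while j - L >= 1 and j + L + 1 <= len(seq):
        left = seq[j - L - 1]   # 1-based position j - L
        right = seq[j + L]      # 1-based position j + L + 1
        if COMPLEMENT.get(left) != right:
            break
        L += 1
    return L


def find_palindromes(seq: str, min_half_length: int):
    """List (center j, length 2L) for maximal palindromes with L >= min_half_length."""
    hits = []
    for j in range(1, len(seq)):
        L = half_length_at(seq, j)
        if L >= min_half_length:
            hits.append((j, 2 * L))
    return hits


if __name__ == "__main__":
    print(is_palindrome("GCAATATTGC"))
    print(is_palindrome("TCTTTAACAAGCTTGTTAAAGA"))
    print(find_palindromes("AAGCAATATTGCAA", min_half_length=4))
```

Expanding outward from each center keeps the search simple; a palindrome of length at least 2L at position j is reported once, with its maximal length.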

Probability of Observing a Length 22 Palindrome

• Approximate the palindrome distribution by a Poisson process with rate

\[ \lambda = np = 0.01008 \]

• Here n = genome length = 29727, and

\[ p = \left[ 2\,(\hat{p}_A \hat{p}_T + \hat{p}_C \hat{p}_G) \right]^{11} \]

• The probability of the occurrence of at least one length 22 palindrome in the genome is

\[ 1 - e^{-0.01008} \approx 0.01 \]
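The arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch only: the function names are hypothetical, and because the estimated base frequencies are not listed on the slide, the usage lines start from the reported rate λ = 0.01008 rather than from assumed frequencies.

```python
import math


def palindrome_probability(p_a, p_c, p_g, p_t, half_length):
    """i.i.d. model: probability [2(pA pT + pC pG)]^L of a length-2L palindrome at a position."""
    return (2.0 * (p_a * p_t + p_c * p_g)) ** half_length


def prob_at_least_one(n, p):
    """Poisson approximation: rate lambda = n p and P(at least one) = 1 - exp(-lambda)."""
    lam = n * p
    return lam, 1.0 - math.exp(-lam)


if __name__ == "__main__":
    n = 29727
    # Back out p from the rate reported on the slide (lambda = 0.01008).
    lam, prob = prob_at_least_one(n, 0.01008 / n)
    print(f"lambda = {lam:.5f}, P(at least one) = {prob:.4f}")
```

With equal base frequencies of 0.25 the per-position probability reduces to 0.25^L, which is a convenient sanity check for `palindrome_probability`.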

Expression of Overlapping Genes Requires Frameshifting in Reading

Frameshifting must have the following elements:
• Slippery Sequence - a mechanism that allows the “reader” (called the ribosome) to slip
• Stimulatory Element - a pseudoknot or stem-loop structure that blocks the ribosome

Slippery Sequences

[Figure labels: short binding sequence; pseudoknot; palindrome-like sequences; stem-loop or hairpin loop]

-1 Frameshifting

[Figure: ribosome starts reading the sequence CGGGTTT, with a pseudoknot downstream]

CGGGTTT

• Reading with default frame: GGG then TTT
• Reading after -1 frameshifting: CGG then GTT
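The two readings of CGGGTTT differ only in where the first triplet starts. A minimal Python sketch follows; the helper name `codons` and the 0-based `start` argument are illustrative assumptions.

```python
def codons(seq: str, start: int):
    """Split seq into complete triplets beginning at 0-based index start."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


if __name__ == "__main__":
    seq = "CGGGTTT"
    default_start = 1                       # default frame begins at the first G
    print(codons(seq, default_start))       # default frame
    print(codons(seq, default_start - 1))   # frame after slipping back one base
```

Slipping back by one base (start index reduced by 1) is what the "-1" in -1 frameshifting refers to.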

Heptanucleotide Slippery Sequences

• ORF1a and ORF1b (Overlap: 13398, 1 base only)

• X1 and X2 (Overlap: 25689-26089, 401 bases)

A string in the form of XXXYYYN, where X = A, T, or G; Y = A or T; and N = A, T, or C
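A pattern of this form can be searched for with a regular expression. The sketch below is hedged: the slide does not say whether the three X's (and the three Y's) must be the same base, so both readings are offered through a hypothetical `strict` flag, and the names `find_slippery`, `STRICT` and `LOOSE` are illustrative.

```python
import re

# Strict reading (assumption): XXX and YYY are each three identical bases.
STRICT = re.compile(r"(?=((?:AAA|TTT|GGG)(?:AAA|TTT)[ATC]))")
# Loose reading (assumption): each X is any of A/T/G and each Y is A or T.
LOOSE = re.compile(r"(?=([ATG]{3}[AT]{3}[ATC]))")


def find_slippery(seq: str, strict: bool = True):
    """Return (1-based start, heptamer) for every match, overlaps included."""
    pattern = STRICT if strict else LOOSE
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_slippery("CCTTTAAACGG"))
    print(find_slippery("TTGAAAA", strict=False))
```

The lookahead `(?=...)` lets overlapping candidates be reported, which matters when scanning a whole overlap region base by base.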

Locations of Slippery Sequences

[Genome map: ORF1a, ORF1b, S, X1, X2; marked positions 256, 13398, 21485, 25268]

Slippery Sequences: TTTAAAC   TTGAAAA   ?

• Immediately preceding the overlapping base between 1a and 1b, there is a slippery sequence followed by a pseudoknot (Thiel et al. 2003)

• Possible slippery sequences are detected in the overlapping region of X1 and X2. Is there any pseudoknot or stem-loop structure in close proximity downstream?

Pseudoknot Predicted by PknotsRG

∆G = -55.84 kcal/mol.

RNA Secondary Structure Prediction

• Predicting RNA secondary structures, including pseudoknots, is a computationally intensive problem.

• Use of heterogeneous grid computing to provide computing power.

• Use of palindrome content evaluation to improve prediction consistency.

Palindrome counts in random nucleotide sequences

Define the indicator random variable

\[ I_j = \begin{cases} 1 & \text{if a palindrome of length} \ge 2L \text{ occurs at base } j \\ 0 & \text{otherwise} \end{cases} \]

Then

\[ X_L = \sum_{j=L}^{n-L} I_j \]

is the total count of palindromes of length at least 2L in a sequence of length n.

Mean and variance of palindrome counts

\[ \mu_L = E(X_L) = \sum_{j=L}^{n-L} E(I_j) = (n - 2L + 1)\, E(I_L) \]

\[ \sigma_L^2 = \operatorname{var}(X_L) = \sum_{j=L}^{n-L} \operatorname{var}(I_j) + 2 \sum_{j=L}^{n-L-1} \sum_{k=j+1}^{n-L} \operatorname{cov}(I_j, I_k) \]

If we let

\[ \gamma(0) = P(I_j = 1) \quad \text{for } L \le j \le n - L \]

\[ \gamma(d) = P(I_j = 1,\ I_{j+d} = 1) \quad \text{for } 1 \le d \le n - 2L \]

then

\[ E(I_j) = \gamma(0) \]

\[ \operatorname{var}(I_j) = \gamma(0)\,(1 - \gamma(0)) \]

\[ \operatorname{cov}(I_j, I_{j+d}) = \gamma(d) - \gamma(0)^2 \]

Mean and variance of palindrome counts (cont’d)

\[ \mu_L = E(X_L) = (n - 2L + 1)\, \gamma(0) \]

\[ \sigma_L^2 = \operatorname{var}(X_L) = \sum_{j=L}^{n-L} \operatorname{var}(I_j) + 2 \sum_{j=L}^{n-L-1} \sum_{k=j+1}^{n-L} \operatorname{cov}(I_j, I_k) \]

\[ \phantom{\sigma_L^2} = (n - 2L + 1)\, \gamma(0)\,(1 - \gamma(0)) + 2 \sum_{d=1}^{n-2L} (n - 2L + 1 - d)\,\left( \gamma(d) - \gamma(0)^2 \right) \]
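Once γ(0) and γ(d) are available, the mean and variance of the palindrome count follow by direct summation: the mean is (n - 2L + 1)γ(0), and the variance adds twice the covariance γ(d) - γ(0)² for each of the (n - 2L + 1 - d) pairs of centers at distance d. The Python sketch below is illustrative only; the function name `palindrome_count_moments` and the callable argument `gamma` are assumptions, and the numbers in the usage lines are made up.

```python
def palindrome_count_moments(n, L, gamma0, gamma):
    """Mean and variance of the count X_L of palindromes of length >= 2L.

    gamma0 is P(I_j = 1); gamma(d) is P(I_j = 1, I_{j+d} = 1) for d >= 1.
    """
    m = n - 2 * L + 1                 # number of centers j = L, ..., n - L
    mean = m * gamma0
    var = m * gamma0 * (1.0 - gamma0)
    for d in range(1, m):             # d = 1, ..., n - 2L
        # (m - d) pairs of centers lie exactly d apart.
        var += 2.0 * (m - d) * (gamma(d) - gamma0 ** 2)
    return mean, var


if __name__ == "__main__":
    g0 = 0.01                         # made-up value for illustration
    # If indicators were independent, gamma(d) = gamma0^2 and the
    # covariance terms vanish.
    print(palindrome_count_moments(100, 3, g0, lambda d: g0 ** 2))
```

The loop is linear in n, so evaluating the variance is cheap once a formula for γ(d) is available.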

How to find the γ’s?

• Under a Markov sequence model, Chew et al. (INFORMS Journal on Computing, 16:331-340, 2004) have obtained computable formulas for the γ’s, expressed in terms of the transition and stationary probabilities of the Markov chain. These can be estimated from the observed base frequencies and dinucleotide frequencies.

• Let’s look at a special case, namely the i.i.d. random sequence model where the nucleotide bases are generated independently with probabilities pA, pC, pG, pT.
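Estimating the model parameters from a sequence amounts to counting bases and adjacent base pairs. The following Python sketch is one simple way to do it and is not taken from the cited paper; the names `base_frequencies` and `transition_matrix` are hypothetical, and characters outside A, C, G, T are simply skipped.

```python
from collections import Counter

BASES = "ACGT"


def base_frequencies(seq: str):
    """Observed frequency of each base (estimates pA, pC, pG, pT)."""
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counts[b] / total for b in BASES}


def transition_matrix(seq: str):
    """Estimate P(next = y | current = x) from dinucleotide counts."""
    seq = seq.upper()
    pair_counts = Counter(
        (x, y) for x, y in zip(seq, seq[1:]) if x in BASES and y in BASES
    )
    matrix = {}
    for x in BASES:
        row_total = sum(pair_counts[(x, y)] for y in BASES)
        matrix[x] = {
            y: (pair_counts[(x, y)] / row_total if row_total else 0.0)
            for y in BASES
        }
    return matrix


if __name__ == "__main__":
    example = "ACGTACGT"              # toy sequence for illustration
    print(base_frequencies(example))
    print(transition_matrix(example)["A"])
```

In the i.i.d. special case only `base_frequencies` is needed; the transition matrix matters for the first-order Markov model.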

Finding γ(0) for the i.i.d. sequence model

\[ \gamma(0) = P(I_j = 1) = \left[ 2\,(p_A p_T + p_C p_G) \right]^{L} \]

Position:  j-L+1  …  j   j+1  …  j+L
Base:      b1 b2 … bL   bL+1 … b2L-1 b2L
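The closed form for γ(0) can be sanity-checked by simulation. The sketch below draws one i.i.d. random sequence and compares the observed fraction of centers carrying a palindrome of length at least 2L with the formula. It is a hedged illustration: the function names, the seed, the sequence length and the uniform base probabilities in the usage lines are all assumptions.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def gamma0_exact(p, L):
    """Closed form [2(pA pT + pC pG)]^L for the i.i.d. model."""
    return (2.0 * (p["A"] * p["T"] + p["C"] * p["G"])) ** L


def gamma0_simulated(p, L, n, seed=0):
    """Fraction of centers j = L..n-L at which a palindrome of length >= 2L occurs."""
    rng = random.Random(seed)
    bases = list(p)
    seq = rng.choices(bases, weights=[p[b] for b in bases], k=n)
    centers = range(L, n - L + 1)          # 1-based centers
    hits = 0
    for j in centers:
        # 1-based positions j-i+1 and j+i are 0-based indices j-i and j+i-1.
        if all(COMPLEMENT[seq[j - i]] == seq[j + i - 1] for i in range(1, L + 1)):
            hits += 1
    return hits / len(centers)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(gamma0_exact(uniform, 2), gamma0_simulated(uniform, 2, n=200_000))
```

Neighbouring indicators are dependent, so the simulated value fluctuates somewhat more than independent sampling would suggest, but it should settle near the closed form for long sequences.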

Finding γ(d) for the i.i.d. sequence model

Case 1: d ≥ 2L

Case 2: L ≤ d < 2L

Case 3: 1 ≤ d < L

Palindromes, and More Palindromes…

Position:  i-L  …  i-2  i-1  i  i+1  i+2  i+3  …  i+L+1
Base:      b1 b2 … bL-1 bL   b’L b’L-1 … b’2 b’1

Position:  i-L  …  i-3  i-2  i-1  i  i+1  i+2  i+3  …  i+L+1
Base:      b1 b2 … bL-1 bL   b’L b’L-1 … b’2 b’1

Acknowledgements

Collaborators
• Louis H. Y. Chen (National University of Singapore)
• David Chew (National University of Singapore)
• Kwok Pui Choi (National University of Singapore)
• Raul Cruz-Cano (The University of Texas at El Paso)
• Hans Heidner (The University of Texas at San Antonio)
• Michela Taufer (The University of Texas at El Paso)
• Aihua Xia (University of Melbourne, Australia)

Funding Support
• NIH Grants S06GM08012-35, 2G12RR008124-11 and 3T34GM008048-20S1
• THECB Advanced Research Program 003661-0008-2006
