palindromes in viral genomes - utep mathematics · 2007. 2. 15. · length 22 palindrome •...

Palindromes in Viral Genomes Ming-Ying Leung Department of Mathematical Sciences The University of Texas at El Paso (UTEP)

  • Palindromes in Viral Genomes

    Ming-Ying Leung Department of Mathematical Sciences

    The University of Texas at El Paso (UTEP)

  • Outline

    • Palindromes

    • Viral Genomes

    • Roles of palindromes in DNA and RNA viruses

    [Figures: Cytomegalovirus (CMV) particle; SARS coronavirus particle]

  • Palindromes in Letter Sequences

    Odd Palindrome: “A nut for a jar of tuna”

    Remove spaces and capitalize: ANUTFORA J AROFTUNA

    Even Palindrome: “Step on no pets”

    Remove spaces and capitalize: STEPON NOPETS
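    The "remove spaces and capitalize" step on this slide can be sketched in a few lines. This is only an illustration; the helper names `normalize` and `is_text_palindrome` are made up for this example and do not come from the slides.

```python
def normalize(text):
    """Keep letters only and capitalize them (drops spaces and punctuation)."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def is_text_palindrome(text):
    """True if the normalized text reads the same in both directions."""
    s = normalize(text)
    return s == s[::-1]


for phrase in ("A nut for a jar of tuna", "Step on no pets"):
    s = normalize(phrase)
    kind = "odd" if len(s) % 2 else "even"
    # An odd palindrome has a single middle letter; an even one does not.
    print(s, len(s), kind, is_text_palindrome(phrase))
```

    The odd case has a middle letter (the J) that pairs with nothing, which is why DNA palindromes, defined later through complementary pairs, can only have even length.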

  • DNA and RNA

    • DNA is deoxyribonucleic acid, made up of 4 nucleotide bases: Adenine, Cytosine, Guanine, and Thymine.

    • RNA is ribonucleic acid, made up of 4 nucleotide bases: Adenine, Cytosine, Guanine, and Uracil.

    • The bases A and T form a complementary pair, as do C and G. In RNA, U takes the place of T and pairs with A.

    [Figure: base-pairing diagram with the bases A, C, G, T, and U]

  • DNA/RNA Palindromes

  • Virus and Eye Diseases

    CMV Retinitis

    • inflammation of the retina

    • triggered by CMV particles

    • may lead to blindness

    CMV genome size: ~230 kbp

    [Figure: CMV particle]

  • Palindromes and Replication Origins in Herpesviruses

    • High concentration of palindromes exists around replication origins of herpesviruses

    • Locating clusters of palindromes (above a minimal length) on herpesvirus genome sequences might reveal the likely locations of their replication origins.
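    As a rough sketch of the first step described here, the function below lists every reverse-complement palindrome whose length reaches a chosen minimum. It is not the authors' software: the function name, the `min_half` parameter, and the toy sequence are assumptions made for illustration, and the scan is a plain expand-from-center search.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def find_palindromes(seq, min_half):
    """Return (center, half_length) for each maximal palindrome of length >= 2 * min_half.

    center is the 1-based position j such that bases j-i+1 and j+i are
    complementary for i = 1..half_length.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for j in range(1, n):  # the center sits between base j and base j+1
        half = 0
        # 1-based base j-half is index j-1-half; 1-based base j+1+half is index j+half
        while (j - 1 - half >= 0 and j + half < n
               and COMPLEMENT.get(seq[j - 1 - half]) == seq[j + half]):
            half += 1
        if half >= min_half:
            hits.append((j, half))
    return hits


# Toy sequence (hypothetical): a length 10 palindrome flanked by A's.
toy = "AAAA" + "GCAATATTGC" + "AAAA"
print(find_palindromes(toy, 5))
```

    The list of centers returned by such a search is the raw material for cluster detection: regions where many centers fall close together are the candidate replication origins.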

  • Human Herpesvirus 6 (HHV-6) Genome

    • Replication origin (oriLyt) in non-coding region

    • Two binding sites for origin-binding protein (OBP)

    • Two 200-bp DNA unwinding elements (DUE)

  • Computational Prediction of Replication Origins

    • Poisson approximation for palindrome distribution in an i.i.d. random sequence model

    • Criterion for identifying statistically significant palindrome clusters using scan statistics

    • Improve prediction accuracy
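    To make the idea of a scan statistic concrete, here is a minimal sketch. The scan statistic is taken as the largest number of palindrome positions falling in any window of a fixed width. The slides refer to an analytical significance criterion that is not reproduced in this transcript, so the sketch swaps in a simple Monte Carlo estimate instead: it places the same number of points uniformly at random, which is how a Poisson process behaves once the number of points is fixed. All positions, the genome length, and the window width below are hypothetical.

```python
import random


def scan_statistic(positions, window):
    """Largest number of positions falling in any interval [x, x + window)."""
    pts = sorted(positions)
    best, left = 0, 0
    for right in range(len(pts)):
        # Shrink from the left until the points span less than one window.
        while pts[right] - pts[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


def monte_carlo_p_value(positions, genome_length, window, trials=2000, seed=0):
    """Estimate P(scan statistic >= observed) for uniformly scattered points."""
    rng = random.Random(seed)
    observed = scan_statistic(positions, window)
    exceed = 0
    for _ in range(trials):
        sim = [rng.uniform(0, genome_length) for _ in positions]
        if scan_statistic(sim, window) >= observed:
            exceed += 1
    return observed, exceed / trials


# Hypothetical palindrome centers on a hypothetical 25,000-base genome.
centers = [100, 5000, 5050, 5100, 5150, 5200, 9000, 15000, 20000]
observed, p_value = monte_carlo_p_value(centers, 25000, 300)
print(observed, p_value)
```

    A small estimated p-value flags the densest window as a statistically unusual cluster under the uniform model; a closed-form approximation would replace the simulation loop in practice.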

  • Poisson Process Approximation of Palindrome Distribution

  • Use of the Scan Statistic to Identify Clusters of Palindromes

  • Improve Prediction Accuracy

    • Using a Markov random sequence model

    • Taking the lengths of palindromes and base composition of the genomes into consideration when counting palindromes. [Chew et al. 2005, Nucleic Acids Research 33(15):e134]

    • Using machine learning approaches to utilize knowledge about replication origin locations in closely related genomes.

  • SARS Viral Particles

  • SARS Virus Particle

    [Figure labels: spike glycoprotein; host cell]

    • Single-stranded RNA genome with ~30 kilobases

    • Only about 7% of the genome size of Cytomegalovirus (CMV), which has double-stranded DNA

  • SARS Virus Genome Map

    • Replicase (1a and 1b), spike glycoprotein (S), X1 and X2 occupy 87% of the genome

    • Two pairs of overlapping ORFs, 1a & 1b and X1 & X2 (as designated by Rota et al. 2003), are predicted in this region

    • 1a and 1b are standard in all coronaviruses; X1 and X2 are unique to SARS. Whether X1 and X2 actually code for proteins is still unconfirmed

    [Genome map: ORF1a, ORF1b, S, X1, X2, with marked positions 256, 13398, 21485, 25268]

  • A Long Palindrome in X1 and X2

    TCTTTAACAAGCTTGTTAAAGA

    Positions: 25962-25983 (22 bases)

    • Found in SARS but not in 6 other coronavirus genomes (Chew et al. 2004)

    • The next longest palindrome in SARS is 14 bases long

    • In the overlapping region of X1 and X2
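    The claims on this slide are easy to check directly. The short sketch below confirms that the quoted string equals its own reverse complement, that the quoted positions span 22 bases, and that they lie inside the X1/X2 overlap (25689-26089) given later in the slides. The helper name `reverse_complement` is chosen for this example.

```python
def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[base] for base in reversed(seq.upper()))


sars_palindrome = "TCTTTAACAAGCTTGTTAAAGA"
start, end = 25962, 25983                   # positions quoted on the slide
overlap_start, overlap_end = 25689, 26089   # X1/X2 overlap quoted in the slides

print(len(sars_palindrome), end - start + 1)
print(reverse_complement(sars_palindrome) == sars_palindrome)
print(overlap_start <= start and end <= overlap_end)
```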

  • Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A palindrome must be even in length, e.g. a palindrome of length 10:

    5’ ….. GCAATATTGC ….. 3’

    Positions: j−L+1, …, j, j+1, …, j+L

    Bases: $b_1\, b_2 \ldots b_L\, b_{L+1} \ldots b_{2L-1}\, b_{2L}$

    We say that a palindrome of length 2L occurs at position j when the (j−i+1)st and the (j+i)th bases are complementary to each other for i = 1, …, L. In an i.i.d. sequence model this occurs with probability

    \[ \left[ 2\,(p_A p_T + p_C p_G) \right]^{L}. \]
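    The probability stated on this slide follows in two steps, written out here as a short derivation. It uses only the i.i.d. assumption named on the slide: a single pair of bases is complementary in four ways, and the L pairs of a palindrome involve 2L distinct bases, so they are independent. The uniform case in the final comment is an illustrative assumption, not a value from the slides.

```latex
\begin{align*}
P(\text{two given bases are complementary})
  &= P(A,T) + P(T,A) + P(C,G) + P(G,C) \\
  &= p_A p_T + p_T p_A + p_C p_G + p_G p_C \\
  &= 2\,(p_A p_T + p_C p_G), \\[4pt]
P(\text{palindrome of length } 2L \text{ at position } j)
  &= \prod_{i=1}^{L} P(\text{bases } j-i+1 \text{ and } j+i \text{ are complementary}) \\
  &= \bigl[\, 2\,(p_A p_T + p_C p_G) \,\bigr]^{L}.
\end{align*}
% Illustration (assumed uniform bases, p_A = p_C = p_G = p_T = 1/4):
% 2(1/16 + 1/16) = 1/4, so the probability is 4^{-L}.
```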

  • Probability of Observing a Length 22 Palindrome

    • Approximate the palindrome distribution by a Poisson process with rate $\lambda = np = 0.01008$.

    • Here n = genome length = 29727, and $p = \left[ 2\,(\hat{p}_A \hat{p}_T + \hat{p}_C \hat{p}_G) \right]^{11}$.

    • The probability of the occurrence of at least one length 22 palindrome in the genome is $1 - e^{-0.01008} \approx 0.01$.
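    The arithmetic on this slide can be reproduced with a few lines. The sketch below takes n = 29727 and λ = 0.01008 from the slide. The transcript does not give the estimated base frequencies, so the second part uses uniform frequencies purely as a stated assumption to show how λ is assembled from n and p; its result is not expected to match the slide's value. Function names are chosen for this example.

```python
import math


def palindrome_probability(p_a, p_c, p_g, p_t, half_length):
    """[2 (pA pT + pC pG)]^L for an i.i.d. sequence model."""
    if abs(p_a + p_c + p_g + p_t - 1.0) > 1e-9:
        raise ValueError("base probabilities must sum to 1")
    return (2.0 * (p_a * p_t + p_c * p_g)) ** half_length


def prob_at_least_one(n, p):
    """Poisson approximation: rate lambda = n * p, P(count >= 1) = 1 - exp(-lambda)."""
    lam = n * p
    return lam, 1.0 - math.exp(-lam)


n = 29727              # genome length from the slide
lam_slide = 0.01008    # rate quoted on the slide
print(1.0 - math.exp(-lam_slide))   # close to 0.01, as stated on the slide

# Assumption for illustration only: uniform base frequencies.
p_uniform = palindrome_probability(0.25, 0.25, 0.25, 0.25, 11)
lam_uniform, prob_uniform = prob_at_least_one(n, p_uniform)
print(p_uniform, lam_uniform, prob_uniform)
```

    Because λ is small, 1 − e^{−λ} is nearly equal to λ itself, which is why the slide's rate and probability both round to 0.01.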

  • Expression of Overlapping Genes Requires Frameshifting in Reading

    Frameshifting must have the following elements:

    • Slippery Sequence - a short sequence that allows the “reader” (the ribosome) to slip

    • Stimulatory Element - a pseudoknot or stem-loop structure that blocks the ribosome

    [Figure labels: Slippery sequences: short binding sequence. Pseudoknot. Stem-loop or hairpin loop: palindrome-like sequences.]

  • -1 Frameshifting

    [Figure: the ribosome starts reading the sequence CGGGTTT, which is followed by a pseudoknot]

    • Reading with default frame: GGG then TTT

    • Reading after -1 frameshifting: CGG then GTT
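    The two readings of CGGGTTT differ only in where the triplets start. A minimal sketch, assuming the default frame begins at the second base of the heptamer and the -1 frame begins one base earlier (the function name `read_codons` is made up for this example):

```python
def read_codons(seq, start):
    """Split seq into consecutive triplets from 0-based index start; drop any incomplete tail."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


heptamer = "CGGGTTT"
default_frame = read_codons(heptamer, 1)   # frame used before slipping
shifted_frame = read_codons(heptamer, 0)   # frame after moving back by one base

print(default_frame)
print(shifted_frame)
```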

  • Heptanucleotide Slippery Sequences

    A string in the form of XXXYYYN, where X = A, T, or G; Y = A or T; and N = A, T, or C

    • ORF1a and ORF1b (Overlap: 13398, 1 base only)

    • X1 and X2 (Overlap: 25689-26089, 401 bases)
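    A pattern of this form is straightforward to search for with a regular expression. The slide does not say whether the three X's (and the three Y's) must be the same base, so the sketch below offers both readings as an explicit assumption: a strict one with identical triplets and a relaxed one where each position only has to belong to the allowed set. Neither is claimed to be the rule used by the authors, and the function name and test strings are illustrative.

```python
import re

# Strict reading (assumption): XXX and YYY are runs of one repeated base.
STRICT = re.compile(r"(?=(([ATG])\2\2([AT])\3\3[ATC]))")
# Relaxed reading (assumption): each position only needs to be in its allowed set.
RELAXED = re.compile(r"(?=([ATG]{3}[AT]{3}[ATC]))")


def find_slippery(seq, strict=True):
    """Return (1-based start, heptamer) for every match, including overlapping ones."""
    pattern = STRICT if strict else RELAXED
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq.upper())]


print(find_slippery("CCTTTAAACGG"))             # strict reading
print(find_slippery("TTGAAAA", strict=False))   # relaxed reading
print(find_slippery("TTGAAAA", strict=True))    # strict reading rejects TTG
```

    The lookahead wrapper lets overlapping candidates be reported, which matters in A/T-rich stretches where several heptamers can start at adjacent positions.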

  • Locations of Slippery Sequences

    [Genome map: ORF1a, ORF1b, S, X1, X2, with marked positions 256, 13398, 21485, 25268. Slippery sequences marked on the map: TTTAAAC, TTGAAAA, ?]

    • Immediately preceding the overlapping base between 1a and 1b, there is a slippery sequence followed by a pseudoknot (Thiel et al. 2003)

    • Possible slippery sequences are detected in the overlapping region of X1 and X2. Is there any pseudoknot or stem-loop structure in close proximity downstream?

  • Pseudoknot Predicted by PknotsRG

    ∆G = -55.84 kcal/mol.

  • RNA Secondary Structure Prediction

    • Predicting RNA secondary structures, including pseudoknots, is a computationally intensive problem.

    • Use of heterogeneous grid computing to provide computing power.

    • Use of palindrome content evaluation to improve prediction consistency.

  • Palindrome counts in random nucleotide sequences

    Define the indicator random variable

    \[ I_j = \begin{cases} 1 & \text{if a palindrome of length} \ge 2L \text{ occurs at base } j \\ 0 & \text{otherwise} \end{cases} \]

    Then

    \[ X_L = \sum_{j=L}^{n-L} I_j \]

    is the total count of palindromes of length at least 2L in a sequence of length n.
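    The count defined on this slide translates directly into code. The sketch below evaluates the indicator at every admissible center j = L, …, n−L and sums the results. Positions are 1-based as on the slides; the function names and the toy sequence are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def indicator(seq, j, L):
    """I_j: 1 if bases j-i+1 and j+i (1-based) are complementary for i = 1..L, else 0."""
    for i in range(1, L + 1):
        left = seq[j - i]         # 1-based position j-i+1
        right = seq[j + i - 1]    # 1-based position j+i
        if COMPLEMENT.get(left) != right:
            return 0
    return 1


def palindrome_count(seq, L):
    """X_L: number of centers j in L..n-L carrying a palindrome of length at least 2L."""
    seq = seq.upper()
    n = len(seq)
    return sum(indicator(seq, j, L) for j in range(L, n - L + 1))


toy = "AAAA" + "GCAATATTGC" + "AAAA"   # hypothetical test sequence
for L in (1, 5, 6):
    print(L, palindrome_count(toy, L))
```

    Note that a long palindrome is counted once, at its center, for every L up to its half-length, so the count decreases as L grows.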

  • Mean and variance of palindrome counts

    \[ \mu_L = E(X_L) = \sum_{j=L}^{n-L} E(I_j) = (n - 2L + 1)\, E(I_L) \]

    \[ \sigma_L^2 = \operatorname{var}(X_L) = \sum_{j=L}^{n-L} \operatorname{var}(I_j) + 2 \sum_{j=L}^{n-L-1} \sum_{k=j+1}^{n-L} \operatorname{cov}(I_j, I_k) \]

    If we let

    \[ \gamma(0) = P(I_j = 1) \quad \text{for } L \le j \le n - L \]

    \[ \gamma(d) = P(I_j = 1,\ I_{j+d} = 1) \quad \text{for } 1 \le d \le n - 2L \]

    then

    \[ E(I_j) = \gamma(0) \]

    \[ \operatorname{var}(I_j) = \gamma(0)\,(1 - \gamma(0)) \]

    \[ \operatorname{cov}(I_j, I_{j+d}) = \gamma(d) - \gamma(0)^2 \]

  • Mean and variance of palindrome counts (cont’d)

    \[ \mu_L = E(X_L) = (n - 2L + 1)\, \gamma(0) \]

    \[ \sigma_L^2 = \operatorname{var}(X_L) = \sum_{j=L}^{n-L} \operatorname{var}(I_j) + 2 \sum_{j=L}^{n-L-1} \sum_{k=j+1}^{n-L} \operatorname{cov}(I_j, I_k) \]

    \[ \phantom{\sigma_L^2} = (n - 2L + 1)\, \gamma(0)\,(1 - \gamma(0)) + 2 \sum_{d=1}^{n-2L} (n - 2L + 1 - d)\,\bigl(\gamma(d) - \gamma(0)^2\bigr) \]

  • How to find the γ’s?

    • Under a Markov sequence model, Chew et al. (INFORMS Journal on Computing, 16:331-340, 2004) have obtained computable formulas for the γ’s, expressed in terms of the transition and stationary probabilities of the Markov chain. These can be estimated from the observed base frequencies and dinucleotide frequencies.

    • Let’s look at a special case, namely the i.i.d. random sequence model where the nucleotide bases are generated independently with probabilities pA, pC, pG, pT.

  • Finding γ(0) for the i.i.d. sequence model

    \[ \gamma(0) = P(I_j = 1) = \left[ 2\,(p_A p_T + p_C p_G) \right]^{L} \]

    Positions: j−L+1, …, j, j+1, …, j+L

    Bases: $b_1\, b_2 \ldots b_L\, b_{L+1} \ldots b_{2L-1}\, b_{2L}$

  • Finding γ(d) for the i.i.d. sequence model

    Case 1: d ≥ 2L

    Case 2: L ≤ d < 2L

    Case 3: 1 ≤ d < L

  • Palindromes, and More Palindromes…

    i - L i - 2 i - 1 i i + 1 i + 2 i + 3 i +L+1 b1 b2 … bL - 1 bL b’L b’L - 1 … b’2 b’1

    i - L i - 3 i - 2 i - 1 i i + 1 i + 2 i + 3 i +L+1 b1 b2 … bL - 1 bL b’L b’L - 1 … b’2 b’1

  • Acknowledgements

    Collaborators

    • Louis H. Y. Chen (National University of Singapore)

    • David Chew (National University of Singapore)

    • Kwok Pui Choi (National University of Singapore)

    • Raul Cruz-Cano (The University of Texas at El Paso)

    • Hans Heidner (The University of Texas at San Antonio)

    • Michela Taufer (The University of Texas at El Paso)

    • Aihua Xia (University of Melbourne, Australia)

    Funding Support

    • NIH Grants S06GM08012-35, 2G12RR008124-11 and 3T34GM008048-20S1

    • THECB Advanced Research Program 003661-0008-2006
