computational virology

Marcella A. McClure, Ph.D. Department of Microbiology and the Center for Computational Biology Montana State University, Bozeman MT [email protected] Computational Virology Lectures in Bioinformatic Studies on the Evolution Structure and Function of RNA-based Life Forms


Post on 25-Jan-2016

Page 1: Computational Virology

Marcella A. McClure, Ph.D.
Department of Microbiology and the Center for Computational Biology
Montana State University, Bozeman MT
[email protected]

Lectures in Computational Virology

Bioinformatic Studies on the Evolution, Structure and Function of RNA-based Life Forms

Page 2: Computational Virology

Summary Lecture II

1) Introduction to Retroid Agents
2) The Genome Parsing Suite
3) Retroid Agents in the Human Genome
4) Discovery-based Hypothesis Generation

Page 3: Computational Virology

[Figure: Flow of genetic information (McClure, 2000). DNA: replication by DNA-dependent DNA polymerase in all cellular systems and most DNA viruses. DNA to RNA: transcription. RNA to protein: translation (protein synthesis); other RNA products are snRNAs, ribozymes, tRNA and rRNA. RNA viruses (e.g., Ebola, rabies, influenza, polio): replication by RNA-dependent RNA polymerase. RNA to DNA: reverse transcriptase mediated replication or transposition by the Retroid Agents (retroviruses, retrotransposons, pararetroviruses, retroposons, retroplasmids, retrointrons, and retrons).]

Page 4: Computational Virology

Distribution of Retroid Agents among Eukaryotes and Eubacteria

[Table: presence (+) of each class of Retroid agent in each host group. Column labels: Eukaryotes (Human, Vertebrates, Invertebrates, Plants, Fungi, Slime Mold, Protists: Alga, Protozoa, Oomycetes), Plastids, Baculovirus Genome, Conjugative transposons, Eubacteria, Archaea. Row labels: Retroviruses; Pararetroviruses (Caulimoviruses, Badnaviruses, Hepadnaviruses); Transposons: Retrotransposons (Gypsy-, DIRS1-, Copia-), Retroposons, Retrointrons; Retroplasmids; Retrons; Retrophages.]

Page 5: Computational Virology

Variable features of Retroid genomes

[Table: Columns: Retroid agent; LTRs; PBS; DNA synthesis primer (host, self, protein); Integration specificity (self, other, site, regional, structural). Rows: Retroviruses; Pararetroviruses (Plant, Animal); Transposons: Retrotransposons Gypsy- (Gypsy, Tf1), DIRS-, Copia-, Retroposons, Retrointrons; Retroplasmids (Mitochondrial, Fungal); Retrons; Retrophages. Cell values are +, -, ?, NA, or the primer type (tRNA, DNA, RNA, RT); ITRs replace LTRs in one transposon row.]

Page 6: Computational Virology

Phylogenetic Tree based on 65 RT sequences, with Gene Maps (McClure, 2000)

[Figure: tree clades: retroviruses, hepadnaviruses, orphan class, gypsy-like retrotransposons, caulimoviruses, copia-like retrotransposons, retroposons, Group II introns, plasmids, retrons. Representative gene maps: HBV, HIV-1, DIRS-1, 17.6, CaMV, Copia, I-FAC, INGI, R2Bm, CIN4, LIN-H, INT-SC1, MAUP, MX65, TERT. Gene map scale: 1000 to 4000 nucleotides. Gene labels on the maps: MA, C, NC.]

RT = reverse transcriptase
RH = ribonuclease H
H-C/IN = integrase
PR = aspartic acid protease

Page 7: Computational Virology

[Figure: Conserved motifs of the Retroid enzymes.
Aspartic Acid Protease: motifs 1-3 (DTG, G, ILG).
Reverse Transcriptase (RNA-dependent DNA Polymerase): motifs 1-6 across the fingers, palm, thumb and connection subdomains.
Ribonuclease H: motifs 1-4.
Residue labels across the Reverse Transcriptase and Ribonuclease H panels: DK, P, DD, KG, D, E, D, NX3D.
Integrase: motifs 1-4 (Hx4H, CX2C, D, D, E) spanning the zinc-binding, core and DNA-binding regions.]

Page 8: Computational Virology

Roles of Retroid Agents:

1) Disease:
   a) retroviruses:
      1) exogenous infectious: HIV, HTLV
      2) endogenous associations: breast cancer, testicular tumors, insulin dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
   b) LINEs insertional mutagenesis:
      1) Hemophilia A
      2) muscular dystrophies; Duchenne and Fukuyama-congenital type
      3) X-linked disorders; Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease

2) Regulation of cellular genes and reproduction

3) Telomere maintenance

4) Repair of broken dsDNA

5) Exchange of genetic information among and between organisms

Page 9: Computational Virology

Possible function of HERV-W

[Figure labels: Endometrium, Trophoblast, Syncytiotrophoblast, HERV-W, Syncytin.]

Page 10: Computational Virology

What is the “host” genomic environment of active Retroid Agents?

[Figure labels: Real Chromosome, Real Contig, Predicted Retroid genome, Predicted functional RT; Disease, Reproduction, Development.]

Page 11: Computational Virology

Mapping Genomic Retroid Agents

What is the distribution of active Retroid Agents in the Human Genome?

Query Sequences: 22 RT sequences
Database: The Human Genome
Significant BLAST hits from 22 queries on 24 chromosomes
Data categories: By Chromosome; By Subgroup

Probable RT function determined by: E-value, OSM score and gene architecture

Probable active Retroid agents determined by:
1) genomic boundaries
2) genome architecture
3) identification of OSM in PR/RH and IN sequences
4) presence of non-enzymatic Retroid genes

Determine total versus potentially Active Retroid Agents in Human Genome

Map host gene environment of Retroid genome

Hypothesis Testing regarding the Function and Evolution of Retroid Agents

Page 12: Computational Virology

The seven major steps of GPS

1) BLAST using 22 RT consensus sequences
2) Remove duplicates and overlaps
3) Evaluate OSM of RT
4) Select RTs to annotate
5) Extract Genome based on RT type
6) Annotate using consensus library
7) Analyze the entire Retroid Agent
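
The "remove duplicates and overlaps" step can be pictured as interval merging over BLAST hit coordinates. The sketch below is only an illustration under assumptions the slide does not state: hits are hypothetical `(chromosome, start, end, e_value)` tuples, and among overlapping hits on one chromosome the hit with the lowest E-value is kept. The rules GPS actually applies may differ.

```python
from typing import List, Tuple

# Hypothetical hit record: (chromosome, start, end, e_value).
Hit = Tuple[str, int, int, float]

def remove_overlaps(hits: List[Hit]) -> List[Hit]:
    """Keep one hit per cluster of overlapping hits on each chromosome.

    Assumption: within a cluster, the hit with the lowest E-value wins.
    """
    kept: List[Hit] = []
    cluster_end = -1
    # Sort by chromosome, then coordinates, so overlapping hits are adjacent.
    for hit in sorted(hits, key=lambda h: (h[0], h[1], h[2])):
        chrom, start, end, e_value = hit
        if kept and kept[-1][0] == chrom and start <= cluster_end:
            # Overlaps the current cluster: keep the more significant hit.
            if e_value < kept[-1][3]:
                kept[-1] = hit
            cluster_end = max(cluster_end, end)
        else:
            # New chromosome or a gap: start a new cluster.
            kept.append(hit)
            cluster_end = end
    return kept

hits = [
    ("chr1", 100, 900, 1e-40),
    ("chr1", 100, 900, 1e-40),   # exact duplicate retrieved by a second query
    ("chr1", 850, 1600, 1e-55),  # overlaps the first hit, better E-value
    ("chr1", 5000, 5800, 1e-20),
    ("chr2", 100, 900, 1e-30),   # same coordinates, different chromosome
]
unique_hits = remove_overlaps(hits)
print(unique_hits)
```

Sorting first makes the pass linear after the sort, which matters when the raw hit list runs to hundreds of thousands of entries.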

Page 13: Computational Virology

[Figure: Conserved motifs of the Retroid enzymes.
Aspartic Acid Protease: motifs 1-3 (DTG, G, ILG).
Reverse Transcriptase (RNA-dependent DNA Polymerase): motifs 1-6 across the fingers, palm, thumb and connection subdomains.
Ribonuclease H: motifs 1-4.
Residue labels across the Reverse Transcriptase and Ribonuclease H panels: DK, P, DD, KG, D, E, D, NX3D.
Integrase: motifs 1-4 (Hx4H, CX2C, D, D, E) spanning the zinc-binding, core and DNA-binding regions.]

Page 14: Computational Virology

The score of a given motif is calculated by

$$\text{M score} = \frac{\text{M} + \text{M1} + \text{M2}}{\text{M length}}$$

M, M1 and M2 are based on the number of amino acids in a motif found in common between a known RT query sequence and the potential RT.

M is a count of amino acid identities.

M1 is a count of conservative substitutions within the groups (ILMV, AG, ST, DE, NQ, FY, RK).

M2 accounts for older substitutions within the groups (LIMV, AGST, DENQ, FYW, RKH).

The overall OSM score is calculated by

$$\text{OSM score} = \frac{\sum_{i=1}^{\text{T motifs}} \text{M score}_i}{\text{T motifs}}$$

T motifs is the number of motifs comprising the OSM.
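
The two scores can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the GPS implementation: the two motifs are taken to be already aligned and of equal length, M length is taken to be the motif length, and each position is counted once, in the most stringent category it satisfies (identity, then M1 group, then M2 group). The function names are hypothetical.

```python
from typing import List, Tuple

# Substitution groups as listed on the slide.
M1_GROUPS = ["ILMV", "AG", "ST", "DE", "NQ", "FY", "RK"]
M2_GROUPS = ["LIMV", "AGST", "DENQ", "FYW", "RKH"]

def same_group(a: str, b: str, groups: List[str]) -> bool:
    """True if residues a and b belong to a common substitution group."""
    return any(a in g and b in g for g in groups)

def m_score(query_motif: str, candidate_motif: str) -> float:
    """M score = (M + M1 + M2) / M length for one aligned motif pair."""
    if not query_motif or len(query_motif) != len(candidate_motif):
        raise ValueError("motifs must be non-empty and of equal length")
    m = m1 = m2 = 0
    for q, c in zip(query_motif.upper(), candidate_motif.upper()):
        if q == c and q.isalpha():
            m += 1            # amino acid identity
        elif same_group(q, c, M1_GROUPS):
            m1 += 1           # conservative substitution
        elif same_group(q, c, M2_GROUPS):
            m2 += 1           # substitution within the broader (older) groups
    return (m + m1 + m2) / len(query_motif)

def osm_score(motif_pairs: List[Tuple[str, str]]) -> float:
    """OSM score = mean of the M scores over all T motifs."""
    scores = [m_score(q, c) for q, c in motif_pairs]
    return sum(scores) / len(scores)

# Illustrative motif pairs (made up for the example, not taken from the slides).
print(m_score("YMDD", "YVDD"))   # M -> V is within the ILMV group
print(m_score("YMDD", "YADD"))   # M -> A is in no group
print(osm_score([("YMDD", "YMDD"), ("DD", "KK")]))
```

Because the formula on the slide is unweighted, a conservative substitution contributes as much as an identity under this reading.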

Page 15: Computational Virology

Status of the Human Genome Project

• 3,200,000 Kbp of the euchromatic portion of the human chromosomes are being sequenced

• Heterochromatic portion is not being done

• As of January 5, 2003:

– Non-redundant sequence only

– 98.8% of euchromatic portion has been done

– 3.0% is completed to the working draft level

– 95.8% has been completed to 99% accuracy

Page 16: Computational Virology

[Figure: panel A, Megabasepairs per chromosome (axis 0 to 250; chromosomes 1-22, X, Y); panel B, Unique RTs per chromosome (axis 0 to 8000; chromosomes 1-22, X, Y).]

Fluctuation in nucleotides per chromosome (A) and unique BLAST RT hits per chromosome (B) over the last four freezes. The bar codes are as follows: black, November 2002; right-hatched, June 2002; gray, April 2002; and left-hatched, December 2001.

Page 17: Computational Virology

Chr     Size (Mbp)  Raw hits  Unique  RTs w/6 motifs  Intact OSM  Full LINE  Perfect LINE
chr1    221.3       16450     6202    595             207         124        17
chr2    237.5       17573     6810    712             259         162        15
chr3    194.3       16858     6243    650             243         136        12
chr4    187.7       17479     6463    709             264         151        10
chr5    177.7       15793     5937    676             232         135        17
chr6    166.8       5611      2119    266             99          65         4
chr7    153.8       4465      1860    154             57          39         4
chr8    142.4       10488     4018    401             149         95         7
chr9    117         9063      3601    323             126         70         5
chr10   131         8396      3216    295             106         63         5
chr11   132.2       10465     3943    415             152         90         9
chr12   128.4       9598      3563    396             130         86         6
chr13   95.2        4446      1815    135             50          30         3
chr14   88.1        2842      1131    125             43          23         1
chr15   82.9        5742      2297    172             64          42         5
chr16   80.6        3892      1575    132             50          31         8
chr17   79.7        3554      1453    85              35          20         2
chr18   74.6        5852      2257    170             79          46         5
chr19   56.3        3368      1215    61              30          15         0
chr20   59.4        2061      851     59              21          11         2
chr21   33.9        1456      581     33              11          9          0
chr22   33.8        1397      602     27              12          7          2
chrX    147.3       22249     8148    921             336         186        13
chrY    22.7        4311      1230    132             40          20         1
Totals  2845        203409    77130   7644            2795        1656       153

Distribution of significant BLAST hits retrieved by 22 RT protein query sequences per chromosome. Chromosomal size from the Nov. 2002 HGD freeze is given in megabase pairs. Other column designations are described in the text. The significant raw and unique hits are from all 22 queries. The RTs with six motifs are significant hits retrieved by the LINE, HERV, MMLV and TERT queries. Intact OSMs are found only in LINEs, HERVs and the TERT. The last two columns report the full-length LINEs with all components and the perfect LINEs, respectively.
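
One quick way to read the table is to normalize the unique RT hits by chromosome size. The sketch below does this for five rows copied from the table; the helper name `hits_per_mbp` is hypothetical, and hit counts per Mbp are only a rough proxy for the length-based percentages quoted later in the lecture.

```python
# Rows copied from the table: chromosome -> (size in Mbp, unique RT hits).
table = {
    "chr1": (221.3, 6202),
    "chr4": (187.7, 6463),
    "chr20": (59.4, 851),
    "chrX": (147.3, 8148),
    "chrY": (22.7, 1230),
}

def hits_per_mbp(size_mbp: float, unique_hits: int) -> float:
    """Unique RT hits per megabase pair of chromosome sequence."""
    return unique_hits / size_mbp

density = {chrom: hits_per_mbp(*row) for chrom, row in table.items()}

# Print the selected chromosomes from densest to sparsest.
for chrom, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{chrom:>5}: {d:5.1f} unique RT hits per Mbp")
```

Among these rows, X and Y come out well above the autosomes, which is in line with the non-random distribution reported in the results slide.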

Page 18: Computational Virology

Classification of 1656 whole LINEs

No. in HG   Stop codons   Frame-shifts   Details
153         0             0              Perfect
86          1             0              43 in LZ / 15 in T
80          0             1              11 En / 2 intra ORF
1337        Multiple      Multiple       Many cases
1656 (total)

A total of 153 LINEs appear to be perfect, while 86 contain a single stop codon and 80 a single frame-shift.
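
The four categories in this classification amount to a simple decision rule on the number of stop codons and frame-shifts found in each LINE. The following is a minimal sketch of such a rule, assuming that anything other than a clean copy, a single stop codon, or a single frame-shift falls into the last category; the function name and category labels are hypothetical.

```python
from collections import Counter

def classify_line(stop_codons: int, frame_shifts: int) -> str:
    """Assign a whole LINE to one of the four categories in the table."""
    if stop_codons == 0 and frame_shifts == 0:
        return "perfect"
    if stop_codons == 1 and frame_shifts == 0:
        return "single stop codon"
    if stop_codons == 0 and frame_shifts == 1:
        return "single frame-shift"
    # Assumption: every other combination counts as "multiple".
    return "multiple defects"

# Synthetic input that reproduces the counts in the table; the (2, 3) pair
# is a hypothetical stand-in for "multiple" stop codons and frame-shifts.
lines = [(0, 0)] * 153 + [(1, 0)] * 86 + [(0, 1)] * 80 + [(2, 3)] * 1337
counts = Counter(classify_line(s, f) for s, f in lines)
print(counts)
```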

Page 19: Computational Virology

Distribution of significant BLAST hits per query sequence.

Query      Raw hits  Unique hits  RTs with 6 motifs  Perfect OSMs
HBV        58        31           0                  0
SPUMA      2369      17           0                  0
Snakehead  3109      39           0                  0
HTLV       3232      51           0                  0
FIV        1505      15           0                  0
HIV        903       8            0                  0
MPMV       3506      52           0                  0
MMLV       4559      2108         4                  0
HERV-L     8208      2910         208                12
HERV-K     2982      496          86                 22
H-LIN      170260    69692        7345               2760
Archaea    11        11           0                  0
R_TERT     26        21           0                  0
H_TERT     1857      1581         1                  1
RECO       9         9            0                  0
PMAUP      19        18           0                  0
IPAO       27        13           0                  0
DIRO       97        12           0                  0
Gypsy      334       14           0                  0
Copia      104       9            0                  0
CMV        174       11           0                  0
RTBV       60        12           0                  0

The 22 representative sequences used to query the HGD are listed with their raw hits, unique hits, RTs with 6 motifs and perfect OSMs. The sequences, excluding the HERVs and human TERT, are the representative mean sequences for over 600 RTs from eight different classes of Retroid Agents.

Page 20: Computational Virology

Distribution of the 482 Low Frequency Reverse Transcriptase hits with remnants of at least one motif. Each entry is the number of Low Frequency hits / the number of hits with a minimum of one recognizable motif.

Query columns, in order: RTBV, CMV, Gypsy, HBV, HTLV, IPAO, FIV, HIV, MPMV, DIRO, PMAUP, Snakehead, Spuma, TERT, Total. Chromosome rows list only their non-empty entries, in column order, followed by the row total.

Chr   Non-empty entries                                Total
1     1/0 1/0 1/0 1/1 1/0 3/2 4/2 29/0                 41/5
2     2/2 3/2 1/1 2/0 1/1 1/1 1/0 14/0                 25/7
3     1/0 1/0 5/4 1/1 1/1 3/3 13/0                     25/9
4     1/1 1/0 1/1 5/4 1/0 3/1 2/1 9/0                  23/8
5     1/0 3/1 2/1 2/0 1/1 1/0 3/3 1/0 19/6             33/12
6     1/0 1/0 2/0 1/0 1/0 1/0 1/0 9/1                  17/1
7     1/1 1/0 1/1 10/0                                 13/2
8     1/1 1/1 1/1 1/0 6/0                              10/3
9     1/0 1/0 1/1 1/1 15/0                             19/2
10    2/1 3/0 1/1 1/1 2/0 13/0                         22/3
11    5/2 1/0 1/1 2/1 10/0                             19/4
12    4/2 2/0 1/0 1/1 1/1 2/1 8/0                      19/5
13    2/1 2/1 1/0 1/1 1/1 4/0                          11/4
14    1/1 1/0 8/0                                      10/1
15    2/2 1/0 1/0 1/0 1/0 1/1 13/0                     20/3
16    2/2 1/0 3/2 24/0                                 30/4
17    1/1 3/0 1/1 21/0                                 26/2
18    1/0 1/0 1/0 2/1 1/0 1/1 1/0 1/1 8/0              17/3
19    1/0 1/1 2/1 4/3 1/1 2/0 24/0                     35/6
20    1/1 1/1 8/0                                      10/2
21    1/1 2/1 1/0 1/0 5/0                              10/2
22    2/0 1/0 1/1 1/1 11/0                             16/2
X     1/0 1/1 1/1 4/4 2/1 3/2 2/2 1/1 5/0              20/12
Y     4/2 2/2 2/2 2/0 1/0                              11/6

Column totals: RTBV 9/4; CMV 9/6; Gypsy 13/7; HBV 18/6; HTLV 40/22; IPAO 7/1; FIV 13/3; HIV 4/3; MPMV 30/23; DIRO 5/0; PMAUP 7/0; Snakehead 26/18; Spuma 15/8; TERT 286/7; Total 482/108.

Of the 482 hits, 108 have at least one recognizable RT motif. The remaining 374 hits have remnants of at least one motif and were conserved enough to be scored by GPS.

Page 21: Computational Virology

[Table: RT motif status of the low frequency hits, by chromosome, for the HIV, MPMV, Spuma and TERT queries. Under each query the columns are the six RT motifs K, D, QG, DD, G-K and LG. Rows: chromosomes 1-22, X, Y. Cell entries take the form of a count followed by C or R, e.g. 1C, (1)C, 29R.]

Page 22: Computational Virology

Looking at the environment of each Retroid Agent

[Figure: Chromosome 21 contig NT_029490, TPTE Gene. Truncated LINE inserted into Intron 6; truncated L1MB1 inserted into Intron 6; truncated L1PA5 inserted into Intron 8; truncated LINE inserted into Intron 18.]

Figure 3: Looking at the environment of each Retroid Genome. In this example, four truncated LINEs are found within three different introns of a putative Tyrosine Phosphatase gene (TPTE). Insertions of Retroid genomes into introns may have little effect on a gene, or may allow for gene shuffling. In this case none of the coding region of the gene was disrupted, which suggests that Retroid sequence information may be utilized to make introns, that selection favors insertions that do not disrupt coding capacity, or that introns provide the preferential target site for transposition. The black lines represent the exons of the TPTE gene.

Page 23: Computational Virology
Page 24: Computational Virology

Distribution of Retroid Agents on Human Chromosomes

(November, 2002 Freeze)

Query:

22 distinct reverse transcriptase sequences representing 18 subgroups were used to query NCBI's Human Genome Database.

Results:

1) Retroid Agents are not randomly distributed on Human Chromosomes.
2) Chromosomes X and Y have the highest percentage of Retroid Agent sequence.
3) Of the remaining chromosomes, Chromosome 4 has the highest percentage of Retroid Agent sequence and Chromosome 20 the lowest.

Only two chromosomes, 19 and 21, are without at least one intact and potentially active LINE. Based on the exact sequence length of each hit in each category of the data table, the November freeze of the human genome contains at least 1.01% unique RT sequences, 0.35% full-length LINEs and 0.032% active LINEs.
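
The LINE percentages can be sanity-checked with back-of-the-envelope arithmetic from the per-chromosome table on Page 17 (2845 Mbp in total, 1656 full-length LINEs, 153 perfect LINEs). The sketch below assumes a mean length of about 6 kb per full-length LINE, which is an outside assumption and not a figure from the slides; the lecture's own values were computed from exact hit lengths.

```python
GENOME_MBP = 2845        # total size from the per-chromosome table (Nov. 2002 freeze)
FULL_LINES = 1656        # full-length LINEs with all components
PERFECT_LINES = 153      # perfect, potentially active LINEs
ASSUMED_LINE_BP = 6000   # assumption: roughly 6 kb per full-length LINE

def percent_of_genome(count: int, mean_length_bp: float, genome_mbp: float) -> float:
    """Percentage of the genome covered by `count` elements of the given mean length."""
    return 100.0 * count * mean_length_bp / (genome_mbp * 1_000_000)

full_pct = percent_of_genome(FULL_LINES, ASSUMED_LINE_BP, GENOME_MBP)
perfect_pct = percent_of_genome(PERFECT_LINES, ASSUMED_LINE_BP, GENOME_MBP)
print(f"full-length LINEs: {full_pct:.2f}% of the genome")
print(f"perfect LINEs:     {perfect_pct:.3f}% of the genome")
```

Under that assumption the estimates round to the 0.35% and 0.032% reported above.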

Page 25: Computational Virology

New hypotheses from discovery-based research

1) Low frequency RT-like sequences (not from LINEs or ERVs) are discernible in the Human Genome.

2) Human low frequency RT-like sequences are remnants of ancient invasions.

3) Human low frequency RT-like sequences are remnants of failed invasions.

4) The pattern of low frequency RT-like sequences is unique in each organismal genome.

5) Both unique and trans-organismal patterns of low frequency RT-like sequences are found in Eukaryotes.

What mechanisms could be maintaining these signals?

1) Gene conversion, an event without a mechanism.
2) Transcriptional inactivation due to methylation of CpG regions.
3) Translational recoding.
4) Complementation.

Page 26: Computational Virology

Dr. Marcella McClure, P.I. (Marcie)

Eric Donaldson, B.S., Bioinformatician II

Dustin Lee, M.S., Bioinformatics Programmer

Aaron Juntunen, Undergraduate programmer

Crystal Hepp, Undergraduate

Kendal Harwood, Undergraduate