Computational Virology
Lectures in Bioinformatic Studies on the Evolution, Structure and Function of RNA-based Life Forms
Marcella A. McClure, Ph.D.
Department of Microbiology and the Center for Computational Biology
Montana State University, Bozeman MT
Summary Lecture II
1) Introduction to Retroid Agents
2) The Genome Parsing Suite
3) Retroid Agents in the Human Genome
4) Discovery-based Hypothesis Generation
Figure: Flow of genetic information (McClure, 2000)
DNA: replication by DNA-dependent DNA polymerase (all cellular systems and most DNA viruses)
DNA to RNA: transcription
RNA to protein: translation (protein synthesis); functional RNAs include snRNAs, ribozymes, tRNA and rRNA
RNA to DNA: reverse transcriptase mediated replication or transposition (Retroid agents: retroviruses, retrotransposons, pararetroviruses, retroposons, retroplasmids, retrointrons and retrons)
RNA: replication by RNA-dependent RNA polymerase (RNA viruses, e.g., Ebola, rabies, influenza, polio)
Distribution of Retroid Agents among Eukaryotes and Eubacteria
Table: presence (+) of each class of Retroid agent in each host group.
Host groups: Eukaryotes (Human, Vertebrates, Invertebrates, Plants, Fungi, Slime Mold, Protists: Alga, Protozoa, Oomycetes; Plastids; Baculovirus Genome); Eubacteria; Conjugative transposons; Archaea
Retroid agents: Retroviruses; Pararetroviruses (Caulimoviruses, Badnaviruses, Hepadnaviruses); Transposons: Retrotransposons (Gypsy-, DIRS1-, Copia-), Retroposons, Retrointrons; Retroplasmids; Retrons; Retrophages
Variable features of Retroid genomes
Table columns: Retroid agent; LTRs; PBS; DNA synthesis primer (host, self, protein); Integration specificity (self, other, site, regional, structural)
Table rows: Retroviruses; Pararetroviruses (Plant, Animal); Transposons: Retrotransposons (Gypsy-: Gypsy, Tf1; DIRS-; Copia-), Retroposons, Retrointrons; Retroplasmids (Mitochondrial, Fungal); Retrons; Retrophages
Phylogenetic Tree based on 65 RT sequences, with Gene Maps (McClure, 2000)
Figure: representative sequences on the tree: HBV, HIV-1, DIRS-1, 17.6, CaMV, Copia, I-FAC, INGI, R2Bm, CIN4, LIN-H, INT-SC1, MAUP, MX65, TERT
Groups: retroviruses, hepadnaviruses, orphan class, gypsy-like retrotransposons, caulimoviruses, copia-like retrotransposons, retroposons, Group II introns, plasmids, retrons
Gene map scale: nucleotides (1000 to 4000)
Gene map key: RT = reverse transcriptase; RH = ribonuclease H; H-C/IN = integrase; PR = aspartic acid protease; other genes are labeled MA, C and NC
RNA-dependent DNA Polymerase
Figure: conserved motifs of the Retroid enzymes
Aspartic Acid Protease: motifs 1 to 3 (DTG, G, ILG)
Reverse Transcriptase (fingers, palm, thumb and connection regions) and Ribonuclease H: motifs 1 to 6 and 1 to 4; conserved residues shown: D, K, P, DD, KG, D, E, D, NX3D
Integrase (zinc-binding, core and DNA-binding regions): motifs 1 to 4; conserved residues shown: HX4H, CX2C, D, D, E
Roles of Retroid Agents:
1) Disease:
a) Retroviruses:
1) exogenous infectious: HIV, HTLV
2) endogenous associations: breast cancer, testicular tumors, insulin dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
b) LINE insertional mutagenesis:
1) Hemophilia A
2) muscular dystrophies: Duchenne and Fukuyama-type congenital
3) X-linked disorders: Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease
2) Regulation of cellular genes and reproduction
3) Telomere maintenance
4) Repair of broken dsDNA
5) Exchange of genetic information among and between organisms
Possible function of HERV-W
Figure: HERV-W Syncytin; labels: endometrium, trophoblast, syncytiotrophoblast
What is the “host” genomic environment of active Retroid Agents?
Figure labels: real chromosome; real contig; predicted Retroid genome; predicted functional RT; disease, reproduction, development
Mapping Genomic Retroid Agents
What is the distribution of active Retroid Agents in the Human Genome?
Query sequences: 22 RT sequences
Database: The Human Genome
Significant BLAST hits from 22 queries on 24 chromosomes
Data categories: by chromosome; by subgroup
Determine total versus potentially active Retroid Agents in the Human Genome
Probable RT function determined by: E-value, OSM score and gene architecture
Probable active Retroid agents determined by:
1) genomic boundaries
2) genome architecture
3) identification of OSM in PR/RH and IN sequences
4) presence of non-enzymatic Retroid genes
Map host gene environment of Retroid genome
Hypothesis testing regarding the function and evolution of Retroid Agents
The seven major steps of GPS
1) BLAST using 22 RT consensus sequences
2) Remove duplicates and overlaps
3) Evaluate OSM of RT
4) Select RTs to annotate
5) Extract genome based on RT type
6) Annotate using consensus library
7) Analyze the entire Retroid Agent
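The slides do not say how GPS removes duplicates and overlaps. As a rough illustration of that step only, the sketch below groups hits whose coordinates overlap on the same chromosome and keeps the hit with the lowest E-value from each group. The `Hit` record, its field names, and the "lowest E-value wins" rule are assumptions made for this example, not the GPS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST hit (field names are assumptions)."""
    chrom: str
    start: int
    end: int
    evalue: float
    query: str


def remove_overlaps(hits):
    """Keep one hit (lowest E-value) per cluster of overlapping hits on a chromosome."""
    kept = []
    cluster_end = None  # right-most coordinate of the current cluster
    for hit in sorted(hits, key=lambda h: (h.chrom, h.start, h.end)):
        same_cluster = (
            kept
            and kept[-1].chrom == hit.chrom
            and hit.start <= cluster_end
        )
        if same_cluster:
            cluster_end = max(cluster_end, hit.end)
            if hit.evalue < kept[-1].evalue:
                kept[-1] = hit  # better hit replaces the cluster representative
        else:
            kept.append(hit)
            cluster_end = hit.end
    return kept


example_hits = [
    Hit("chr1", 100, 500, 1e-20, "H-LIN"),
    Hit("chr1", 100, 500, 1e-20, "H-LIN"),   # exact duplicate
    Hit("chr1", 400, 900, 1e-50, "HERV-K"),  # overlaps the hits above
    Hit("chr1", 2000, 2500, 1e-5, "MMLV"),
    Hit("chr2", 100, 500, 1e-10, "H-LIN"),
]
unique_hits = remove_overlaps(example_hits)
print([(h.chrom, h.query) for h in unique_hits])
```

Sorting first means each hit only has to be compared with the cluster immediately before it, so the pass over the hits is linear after the sort.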
The score of a given motif is calculated by
$$\text{M score} = \frac{M + M1 + M2}{\text{M length}}$$
M, M1 and M2 are based on the number of amino acids in a motif found in common between a known RT query sequence and the potential RT
M is a count of amino acid identities
M1 is a count of conservative substitutions (ILMV, AG, ST, DE, NQ, FY, RK)
M2 accounts for older substitutions (LIMV, AGST, DENQ, FYW, RKH)
The overall OSM score is calculated by
$$\text{OSM score} = \frac{\sum_{i=1}^{\text{T motifs}} \text{M score}_i}{\text{T motifs}}$$
T motifs is the number of motifs comprising the OSM
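A minimal sketch of the motif score and OSM score defined above. It assumes that the query and candidate motifs are already aligned, gap-free and of equal length, that "M length" is the number of residues in the motif, and that each position is counted once: as an identity (M), else as a conservative substitution (M1), else as an older substitution (M2). The function names and example motifs are hypothetical.

```python
# Substitution groups as listed on the slide.
CONSERVATIVE_GROUPS = ["ILMV", "AG", "ST", "DE", "NQ", "FY", "RK"]  # counted in M1
OLDER_GROUPS = ["LIMV", "AGST", "DENQ", "FYW", "RKH"]               # counted in M2


def _same_group(a, b, groups):
    """True if residues a and b belong to the same substitution group."""
    return any(a in group and b in group for group in groups)


def motif_score(query_motif, candidate_motif):
    """M score = (M + M1 + M2) / M length for one aligned motif pair."""
    if not query_motif or len(query_motif) != len(candidate_motif):
        raise ValueError("motifs must be non-empty and of equal length")
    m = m1 = m2 = 0
    for q, c in zip(query_motif, candidate_motif):
        if q == c:
            m += 1       # identity
        elif _same_group(q, c, CONSERVATIVE_GROUPS):
            m1 += 1      # conservative substitution
        elif _same_group(q, c, OLDER_GROUPS):
            m2 += 1      # older substitution
    return (m + m1 + m2) / len(query_motif)


def osm_score(motif_pairs):
    """OSM score = mean of the M scores over the T motifs of the OSM."""
    scores = [motif_score(q, c) for q, c in motif_pairs]
    return sum(scores) / len(scores)


# Hypothetical motif pairs: (known RT query motif, candidate motif).
pairs = [("YMDD", "YVDD"), ("LGK", "LWH")]
print(motif_score(*pairs[0]), motif_score(*pairs[1]), osm_score(pairs))
```

In the first pair, M to V is a conservative substitution, so every position scores. In the second pair, G to W scores nothing and K to H counts only as an older substitution.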
Status of the Human Genome Project
• 3,200,000 Kbp of the euchromatic portion of the human chromosomes are being sequenced
• The heterochromatic portion is not being sequenced
• As of January 5, 2003:
– Non-redundant sequence only
– 98.8% of the euchromatic portion has been sequenced
– 3.0% is completed to the working draft level
– 95.8% has been completed to 99% accuracy
Figure: Fluctuation in nucleotides per chromosome (A) and unique BLAST RT hits per chromosome (B) over the last four freezes. Panel A plots megabase pairs (0 to 250) for chromosomes 1 to 22, X and Y; panel B plots unique RTs (0 to 8000) for the same chromosomes. The bar codes are as follows: black, November 2002; right-hatched, June 2002; gray, April 2002; and left-hatched, December 2001.
Chr | Size (Mbp) | Raw hits | Unique hits | RTs w/6 motifs | Intact OSM | Full LINE | Perfect LINE
chr1 | 221.3 | 16450 | 6202 | 595 | 207 | 124 | 17
chr2 | 237.5 | 17573 | 6810 | 712 | 259 | 162 | 15
chr3 | 194.3 | 16858 | 6243 | 650 | 243 | 136 | 12
chr4 | 187.7 | 17479 | 6463 | 709 | 264 | 151 | 10
chr5 | 177.7 | 15793 | 5937 | 676 | 232 | 135 | 17
chr6 | 166.8 | 5611 | 2119 | 266 | 99 | 65 | 4
chr7 | 153.8 | 4465 | 1860 | 154 | 57 | 39 | 4
chr8 | 142.4 | 10488 | 4018 | 401 | 149 | 95 | 7
chr9 | 117 | 9063 | 3601 | 323 | 126 | 70 | 5
chr10 | 131 | 8396 | 3216 | 295 | 106 | 63 | 5
chr11 | 132.2 | 10465 | 3943 | 415 | 152 | 90 | 9
chr12 | 128.4 | 9598 | 3563 | 396 | 130 | 86 | 6
chr13 | 95.2 | 4446 | 1815 | 135 | 50 | 30 | 3
chr14 | 88.1 | 2842 | 1131 | 125 | 43 | 23 | 1
chr15 | 82.9 | 5742 | 2297 | 172 | 64 | 42 | 5
chr16 | 80.6 | 3892 | 1575 | 132 | 50 | 31 | 8
chr17 | 79.7 | 3554 | 1453 | 85 | 35 | 20 | 2
chr18 | 74.6 | 5852 | 2257 | 170 | 79 | 46 | 5
chr19 | 56.3 | 3368 | 1215 | 61 | 30 | 15 | 0
chr20 | 59.4 | 2061 | 851 | 59 | 21 | 11 | 2
chr21 | 33.9 | 1456 | 581 | 33 | 11 | 9 | 0
chr22 | 33.8 | 1397 | 602 | 27 | 12 | 7 | 2
chrX | 147.3 | 22249 | 8148 | 921 | 336 | 186 | 13
chrY | 22.7 | 4311 | 1230 | 132 | 40 | 20 | 1
Totals | 2845 | 203409 | 77130 | 7644 | 2795 | 1656 | 153
Distribution of significant BLAST hits retrieved by 22 RT protein query sequences per chromosome. Chromosomal size from the Nov. 2002 HGD freeze is given in megabase pairs. Other column designations are described in the text. The significant raw and unique hits are from all 22 queries. The RTs with six motifs are significant hits retrieved by LINEs, HERVs, MMLV and TERT queries. Intact OSMs are found only in LINEs, HERVs and the TERT. The last two columns report the full length LINEs with all components and perfect LINEs, respectively.
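As a worked piece of arithmetic on the Totals row of the table above, the sketch below expresses each category as a fraction of the unique hits and the perfect LINEs as a fraction of the full-length LINEs. The counts are copied from the table; the dictionary keys and the helper name are made up for this example, and these ratios of hit counts are not the percent-of-genome figures, which are based on exact sequence lengths.

```python
# Totals row of the per-chromosome table (November 2002 freeze).
totals = {
    "raw hits": 203409,
    "unique hits": 77130,
    "RTs with 6 motifs": 7644,
    "intact OSM": 2795,
    "full LINE": 1656,
    "perfect LINE": 153,
}


def fraction_of(reference, counts):
    """Express every count as a fraction of counts[reference]."""
    base = counts[reference]
    return {name: value / base for name, value in counts.items()}


per_unique = fraction_of("unique hits", totals)
for name in ("RTs with 6 motifs", "intact OSM", "full LINE", "perfect LINE"):
    print(f"{name}: {per_unique[name]:.2%} of unique hits")

unique_over_raw = totals["unique hits"] / totals["raw hits"]
perfect_over_full = totals["perfect LINE"] / totals["full LINE"]
print(f"unique/raw: {unique_over_raw:.1%}; perfect/full LINE: {perfect_over_full:.1%}")
```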
Classification of 1656 whole LINEs
No. in HG | Stop codons | Frame-shifts | Details
153 | 0 | 0 | Perfect
86 | 1 | 0 | 43 in LZ / 15 in T
80 | 0 | 1 | 11 En / 2 intra ORF
1337 | Multiple | Multiple | Many cases
1656 (total)
A total of 153 LINEs appear to be perfect, while 86 contain a single stop codon and 80 a single frame-shift.
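The four categories used to classify whole LINEs can be written as a small decision rule. This is only an illustrative sketch: the function name and labels are hypothetical, and it assumes that any LINE with more than one defect in total (for example one stop codon plus one frame-shift) falls in the "multiple" category, which the slide does not state explicitly.

```python
from collections import Counter


def classify_line(stop_codons, frame_shifts):
    """Assign a whole LINE to one of the four categories by its defect counts."""
    if stop_codons < 0 or frame_shifts < 0:
        raise ValueError("counts must be non-negative")
    if stop_codons == 0 and frame_shifts == 0:
        return "perfect"
    if stop_codons == 1 and frame_shifts == 0:
        return "single stop codon"
    if stop_codons == 0 and frame_shifts == 1:
        return "single frame-shift"
    return "multiple"  # assumption: every other combination


# Hypothetical (stop codons, frame-shifts) pairs for five LINEs.
observed = [(0, 0), (1, 0), (0, 1), (2, 3), (1, 1)]
tally = Counter(classify_line(s, f) for s, f in observed)
print(tally)
```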
Distribution of significant BLAST hits per query sequence.
Query | Raw hits | Unique hits | RTs with 6 motifs | Perfect OSMs
HBV | 58 | 31 | 0 | 0
SPUMA | 2369 | 17 | 0 | 0
Snakehead | 3109 | 39 | 0 | 0
HTLV | 3232 | 51 | 0 | 0
FIV | 1505 | 15 | 0 | 0
HIV | 903 | 8 | 0 | 0
MPMV | 3506 | 52 | 0 | 0
MMLV | 4559 | 2108 | 4 | 0
HERV-L | 8208 | 2910 | 208 | 12
HERV-K | 2982 | 496 | 86 | 22
H-LIN | 170260 | 69692 | 7345 | 2760
Archaea | 11 | 11 | 0 | 0
R_TERT | 26 | 21 | 0 | 0
H_TERT | 1857 | 1581 | 1 | 1
RECO | 9 | 9 | 0 | 0
PMAUP | 19 | 18 | 0 | 0
IPAO | 27 | 13 | 0 | 0
DIRO | 97 | 12 | 0 | 0
Gypsy | 334 | 14 | 0 | 0
Copia | 104 | 9 | 0 | 0
CMV | 174 | 11 | 0 | 0
RTBV | 60 | 12 | 0 | 0
Values indicate raw hits, unique hits, RTs with 6 motifs and perfect OSMs for each of the 22 representative sequences used to query the HGD. Sequences, excluding the HERVs and human TERT, are the representative mean sequences for over 600 RTs from eight different classes of Retroid Agents.
Distribution of the 482 Low Frequency Reverse Transcriptase hits with remnants of at least one motif. Each value is the number of Low Frequency hits/number of hits with a minimum of one recognizable motif. Of the 482 hits, 108 have at least one recognizable RT motif. The remaining 374 hits have remnants of at least one motif and were conserved enough to be scored by GPS.
Query columns: RTBV, CMV, Gypsy, HBV, HTLV, IPAO, FIV, HIV, MPMV, DIRO, PMAUP, Snakehead, Spuma, TERT, Total. For each chromosome, the non-TERT values are listed left to right in column order, with empty cells omitted.
Chromosome | Non-TERT queries | TERT | Total
1 | 1/0 1/0 1/0 1/1 1/0 3/2 4/2 | 29/0 | 41/5
2 | 2/2 3/2 1/1 2/0 1/1 1/1 1/0 | 14/0 | 25/7
3 | 1/0 1/0 5/4 1/1 1/1 3/3 | 13/0 | 25/9
4 | 1/1 1/0 1/1 5/4 1/0 3/1 2/1 | 9/0 | 23/8
5 | 1/0 3/1 2/1 2/0 1/1 1/0 3/3 1/0 | 19/6 | 33/12
6 | 1/0 1/0 2/0 1/0 1/0 1/0 1/0 | 9/1 | 17/1
7 | 1/1 1/0 1/1 | 10/0 | 13/2
8 | 1/1 1/1 1/1 1/0 | 6/0 | 10/3
9 | 1/0 1/0 1/1 1/1 | 15/0 | 19/2
10 | 2/1 3/0 1/1 1/1 2/0 | 13/0 | 22/3
11 | 5/2 1/0 1/1 2/1 | 10/0 | 19/4
12 | 4/2 2/0 1/0 1/1 1/1 2/1 | 8/0 | 19/5
13 | 2/1 2/1 1/0 1/1 1/1 | 4/0 | 11/4
14 | 1/1 1/0 | 8/0 | 10/1
15 | 2/2 1/0 1/0 1/0 1/0 1/1 | 13/0 | 20/3
16 | 2/2 1/0 3/2 | 24/0 | 30/4
17 | 1/1 3/0 1/1 | 21/0 | 26/2
18 | 1/0 1/0 1/0 2/1 1/0 1/1 1/0 1/1 | 8/0 | 17/3
19 | 1/0 1/1 2/1 4/3 1/1 2/0 | 24/0 | 35/6
20 | 1/1 1/1 | 8/0 | 10/2
21 | 1/1 2/1 1/0 1/0 | 5/0 | 10/2
22 | 2/0 1/0 1/1 1/1 | 11/0 | 16/2
X | 1/0 1/1 1/1 4/4 2/1 3/2 2/2 1/1 | 5/0 | 20/12
Y | 4/2 2/2 2/2 2/0 1/0 | | 11/6
Total by query: RTBV 9/4; CMV 9/6; Gypsy 13/7; HBV 18/6; HTLV 40/22; IPAO 7/1; FIV 13/3; HIV 4/3; MPMV 30/23; DIRO 5/0; PMAUP 7/0; Snakehead 26/18; Spuma 15/8; TERT 286/7; Total 482/108
RT motifs (K, D, QG, DD, G-K, LG) per chromosome for the HIV and MPMV queries:
1 1R (1)C C (1)C C 1R
2 1C
3 1C 1C
4 3C(1)C C 1R
5 1C
6
7
8
9 (1)R C
10 1C
11
12 1C
13 (1)C C1R
14
15 1R
16 1R(1)C R 1C
17
18 1C
19 1C 1R (2)C C C
20 1C
21
22 1C
X 1R (1)C R 1C
Y 1C(1)R C C C
RT motifs (K, D, QG, DD, G-K, LG) for the Spuma and TERT queries:
29R
1R 14R
13R 1R 1C 9R
1C 2C 1C12R 1C1R 1C 1C
8R 1C
10R
6R
15R
13R
1C 1R 10R
1R 1C 8R
1C 4R
8R
1C 12R 1R
24R
1C 21R
8R
2R 22R 2R
8R
5R(1)C C 10R 1R
1C 5R
Chromosomes
Looking at the environment of each Retroid Agent
Chromosome 21 contig NT_029490, TPTE gene
Truncated LINE inserted into intron 6
Truncated L1MB1 inserted into intron 6
Truncated L1PA5 inserted into intron 8
Truncated LINE inserted into intron 18
Figure 3: Looking at the environment of each Retroid genome. In this example, four truncated LINEs are found within three different introns of a putative tyrosine phosphatase gene (TPTE). Insertions of Retroid genomes into introns may have little effect on a gene, or may allow for gene shuffling. In this case none of the coding region of the gene was disrupted, which suggests that Retroid sequence information may be utilized to make introns, that selection favors insertions that do not disrupt coding capacity, or that introns provide the preferential target site for transposition. The black lines represent the exons of the TPTE gene.
(November, 2002 Freeze)
Distribution of Retroid Agents on Human Chromosomes
Query:
22 distinct reverse transcriptase sequences representing 18 subgroups were used to query the NCBI Human Genome Database.
Results:
1) Retroid Agents are not randomly distributed on human chromosomes.
2) Chromosomes X and Y have the highest percentage of Retroid Agent sequence.
3) Of the remaining chromosomes, Chromosome 4 has the highest and Chromosome 20 the lowest percentage of Retroid Agent sequence.
Only two chromosomes, 19 and 21, are without at least one intact and potentially active LINE. Using exact sequence lengths for each hit of each category indicated in the table of data, the November freeze of the human genome contains at least 1.01% unique RT sequences, 0.35% full-length LINEs and 0.032% active LINEs.
New hypotheses from discovery-based research
1) Low frequency RT-like sequences (not from LINEs or ERVs) are discernible in the Human Genome.
2) Human low frequency RT-like sequences are remnants of ancient invasions.
3) Human low frequency RT-like sequences are remnants of failed invasions.
4) The pattern of low frequency RT-like sequences is unique in each organismal genome.
5) Both unique and trans-organismal patterns of low frequency RT-like sequences are found in Eukaryotes.
What mechanisms could be maintaining these signals?
1) Gene conversion, an event without a mechanism.
2) Transcriptional inactivation due to methylation of CpG regions.
3) Translational recoding.
4) Complementation.
Dr. Marcella McClure, P.I. (Marcie)
Eric Donaldson, B.S., Bioinformatician II
Dustin Lee, M.S., Bioinformatics Programmer
Aaron Juntunen, Undergraduate programmer
Crystal Hepp, Undergraduate
Kendal Harwood, Undergraduate