Tools for Comparative Sequence Analysis
www.dcode.org
Ivan Ovcharenko
Lawrence Livermore National Laboratory
A set of problems: http://www.dcode.org/bioquest.php
1. Browsing genomes using synteny links
2. Aligning sequences to vertebrate genomes
3. Aligning sequences to identify evolutionarily conserved regions
4. Assigning function to regulatory elements
5. Decoding gene regulation using microarray data
zPicture: Dynamic Alignment of Megabase-long Sequences and Genomes
http://zpicture.dcode.org
Automated sequence extraction and gene annotation
I. Ovcharenko, G. Loots, R.C. Hardison, W. Miller, and Lisa Stubbs. Genome Research, 14(3), 472-477 (2004)
>hg16_dna range=chr16:55400000-55800000 Tataatggctacctatttggagtgcctaccatgtattagtcattgtgcta actgatgtataggcatctcatttacagttcaactcatttgaacctaaatg aagaatagttgtttgtcccttattttatttaacaaaatttaaaactattt ctaagtcgctcattaaatgacaaagcttaaaccaaattttgtctgattgt aaaggccatacttttAATCATTTATATAAAACAACGCAGCCATATTTAAC TTCTGCCATATATTTTCTTACCGATGAATGATATATATCAAATGTTGACT TAGTTTTTAAATGGAAGACAGAAGCGGTTTAGAATGGCCTATTTTCAGTC AGCCAAAAATGTCAAAACCTTCTGTGAGTAGTCCAGGTACTGGAAATCAG ACAATTTGAACTTCAGGATACTACAATAATTTTTTCCTTTGTGGGTAGTG GTGGAGCATGAATTCTCTACTTCTTATTGGTCCTTCTGCTATGATGGCCC TTTCAGTCACACCTCTGTTCTCAAAATAAGAATATAATCAATAAAGTAGA GTTTGAGGGAACGGAGGACTAAGTCAAAAGTGGGATACCTAGGACTTCAT TCTAGttactgtggaattatctcctttgcttttcttcctgtttgtgcttt ttctatcctgttaattctcctgccttatggaaagcacagtgattgtttca cagcataaaccagacatcacttttccagtttaattttttttcaaaggccc ccattgcattttggaaaaaattcaaaatattcaacatggcctacaaagcc ctgtcacccttaaatagtgtgttgagtctggctcctacccacagtctaaa tctcaactgtctccaatcttctccctcactaaactcctaccagcaaatct tttcttcaaactggctaatgccctattctagcctcagagttttgtgctgc tgttctcttaggtacagtgtttttccccaagatttttatctggctttctc ttcttcatttagacttttaaacaaacagcttcatgaattacttgagatgt aattaatatacatacaatttacccatttaaggtatacattttaatgtttt tattatattcacagagttgtacaaccatcacactctaatttcagaacgtt ttcatcttgattcagattttaaatcaaatgtcacatcatccagtaggaac tccagtcactaattagaaatacccattatgtttttacacacattctcaat cccactacctgtttgttattgcacttgaacttacatgaaactatttactt gtttatacatttattgtctGTTATTCCTAGCACATAGAAGGTATGTCTGG CACATAGCAAACACTCGATCTTTGATGAATGAATGAATAATGATAACATT AACTTTTTTGCTTATTCTGCCTTGTATTGTGTAAGATTAGAGACaatcct tacaacaaacttgaaaacccagacttaacgatctctaaaactcacatgta agttaaggctcagagaagtttcatcacttgctcagagttacgtaactggt gaataccgaggctagatttcaaacccaaggctgcccggctctaaaTGAGG GGATATTTGATTAGGCCAAAGTAACCTGAACCCTTAAAATAACcaggctt taacttccagaaacatgggaactagataacctaagaacctgctggccacg aaacccctagaatactgaacacaatatcacaaacatattttgaaatgcat agatgagcatgtaaaatactgagggaactcctcaatggccaaaagtggaa agcagatgaaaaccagaactgtgtaaaagcctgaaagttacagtcgtcct gcagacatttgtcaatctcagtaacaaagggacttagtattttttggcta 
tggaagacaaaaacaagctttttgtataaggtgggaatgttgaactgaga cctcatgggagaaaaagcagatgaagggttagaggctcagtaaaagaatg aactggaaaaatccatcttctgacaaagaaagacaatgaggaaacttttc tgtcttgggctgggtgCTTGGTTGGAGCAGGGGGAAAGAATCTCTGATTT
> 69149 115179 SLC6A2 69149 69197 UTR 69198 69471 exon 82066 82197 exon 84439 84676 exon 97643 97781 exon 104518 104652 exon 106610 106713 exon 107878 108002 exon 108825 108937 exon 110497 110625 exon 111069 111168 exon 112154 112254 exon 112739 112906 exon 114463 114534 exon 114923 114946 exon 114947 115179 UTR > 173279 186382 CESR 173279 173321 UTR 173322 173373 exon 177416 177623 exon 180095 180239 exon 182703 182836 exon 184865 185018 exon 185907 186077 exon 186078 186382 UTR > 173303 203537 CES1 173303 173321 UTR 173322 173373 exon 177419 177623 exon 180095 180239 exon 182703 182836 exon 184865 185018 exon 185907 186014 exon 186747 186851 exon 189424 189462 exon 193343 193483 exon 195380 195460 exon 195723 195870 exon 199927 200058 exon 202790 202862 exon 203159 203342 exon 203343 203537 UTR < 212212 242464 CES1 212212 212406 UTR 212407 212590 exon 212887 212959 exon 215691 215822 exon 219879 220026 exon 220289 220369 exon 222266 222406 exon 226287 226325 exon 228898 229002 exon 229735 229842 exon 230731 230884 exon 232913 233046 exon 235514 235658 exon 238133 238337 exon 242394 242445 exon 242446 242464 UTR < 229367 242488 CESR 229367 229671 UTR 229672 229842 exon 230731 230884 exon 232913 233046 exon 235514 235658 exon 238133 238340 exon 242394 242445 exon 242446 242488 UTR < 255598 284772 FLJ31547 255598 255832 UTR 255833 256064 exon 256150 256222 exon 262265 262412 exon 265761 265829 exon 268931 269071 exon 270794 270898 exon 272730 272834 exon 275344 275497 exon 279013 279146 exon 281027 281165 exon 283235 283439 exon
Automated sequence and gene annotation extraction http://zpicture.dcode.org/
chr16:55,400,000-…
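The gene annotation listing above follows a simple pattern: a strand marker (`>` or `<`) with the gene's start, end, and name, followed by `start end type` triples for UTRs and exons. A minimal Python sketch of a reader for that layout is given here for illustration. The names `Gene` and `parse_annotation` are hypothetical, and reading `>` as the forward strand and `<` as the reverse strand is an assumption. The sample is an abbreviated excerpt of the CESR records.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    strand: str   # "+" for ">" records, "-" for "<" records (assumed meaning)
    start: int
    end: int
    name: str
    features: list = field(default_factory=list)  # (start, end, kind) tuples


def parse_annotation(text):
    """Parse whitespace-separated annotation tokens into Gene records.

    Splitting on any whitespace means the parser accepts both one record
    per line and the flattened single-line form.
    """
    tokens = text.split()
    genes = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in (">", "<"):
            # Gene header: marker, start, end, name
            start, end, name = int(tokens[i + 1]), int(tokens[i + 2]), tokens[i + 3]
            genes.append(Gene("+" if tok == ">" else "-", start, end, name))
            i += 4
        else:
            # Feature triple belonging to the most recent gene header
            if not genes:
                raise ValueError("feature found before the first gene header")
            start, end, kind = int(tokens[i]), int(tokens[i + 1]), tokens[i + 2]
            genes[-1].features.append((start, end, kind))
            i += 3
    return genes


sample = """
> 173279 186382 CESR
173279 173321 UTR
173322 173373 exon
177416 177623 exon
< 229367 242488 CESR
229367 229671 UTR
229672 229842 exon
"""

genes = parse_annotation(sample)
for g in genes:
    exons = [f for f in g.features if f[2] == "exon"]
    print(g.name, g.strand, g.start, g.end, len(exons), "exon(s)")
```

Coordinates are kept exactly as listed, so they are relative to the extracted sequence rather than to the chromosome.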
zPicture: a dynamic, interactive alignment visualization tool. http://zpicture.dcode.org/
Dynamic rotation from pip plots to smooth plots
Interactive parameter changes
zPicture: dynamic annotation
zPicture: dynamic selection of conservation parameters
100 bp / 70%
500 bp / 85%
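The two settings above read naturally as a minimum length and a minimum percent identity for a conserved region; that reading is an assumption, since the slide gives only the numbers. Below is a minimal Python sketch of such a filter, using a sliding window over a gapped pairwise alignment. It is a simplification for illustration, not zPicture's actual procedure, and `find_ecrs` and its parameters are hypothetical names.

```python
def find_ecrs(aln_a, aln_b, min_len=100, min_identity_pct=70):
    """Return merged half-open (start, end) column ranges covered by windows
    of min_len columns in which at least min_identity_pct percent of the
    columns are identical and ungapped."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    if n < min_len:
        return []
    # 1 where both sequences carry the same base, 0 for mismatches and gaps
    match = [1 if a == b and a != "-" else 0
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    window = sum(match[:min_len])
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window one column to the right
            window += match[start + min_len - 1] - match[start - 1]
        # Integer arithmetic avoids floating-point threshold surprises
        if window * 100 >= min_identity_pct * min_len:
            end = start + min_len
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)   # extend previous region
            else:
                regions.append((start, end))
    return regions


# Toy alignment: ten identical columns followed by ten mismatches
a = "ACGTACGTAC" + "TTTTTTTTTT"
b = "ACGTACGTAC" + "GGGGGGGGGG"
print(find_ecrs(a, b, min_len=5, min_identity_pct=80))
print(find_ecrs(a, b, min_len=5, min_identity_pct=100))
```

Raising either threshold, as in the move from 100 bp / 70% to 500 bp / 85%, shrinks the set of reported regions to the longer and more strongly conserved ones.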
Mycobacterium leprae vs. Mycobacterium tuberculosis.
Conservation of genes:
Non-hypothetical genes: 97% are conserved
Hypothetical genes: ∼20% are conserved
zPicture: Aligning complete microbial genomes
rVista 2.0: Identification of Evolutionarily Conserved Transcription Factor Binding Sites
http://rvista.dcode.org
http://zpicture.dcode.org
http://ecrbrowser.dcode.org
http://globin.cse.psu.edu/gala
Human ACTTTCCTACATCTATCTATA
      |||||::|||||||:||||||
Mouse ACTTTGATACATCTCTCTATA

Human ACTTTGATACATCTATCTATA
      ||||||||||||||:||||||
Mouse ACTTTGATACATCTCTCTATA

Human -----GATACATCTATCTATA
           |||||
Mouse ACTTTGATAC-----------

Human ACTTTGATACATCTATCTATA
      |||||
Mouse ACTTT----------------
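The four human/mouse alignments above differ in how well a candidate site is preserved: a few mismatches, a single mismatch, or partial overlap with gaps. A minimal Python sketch of one way to score that is given here. The window (columns 5 to 15) and the 80% cutoff are hypothetical values chosen for illustration, the function names are invented, and the rule shown is a simplification rather than rVista's actual criterion.

```python
def column_identity(seq_a, seq_b):
    """Count identical, ungapped columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x == y and x != "-")


def site_is_conserved(seq_a, seq_b, start, end, min_identity_pct=80):
    """True if columns [start, end) are gap-free in both sequences and
    at least min_identity_pct percent of them are identical."""
    site_a, site_b = seq_a[start:end], seq_b[start:end]
    if "-" in site_a or "-" in site_b:
        return False
    return column_identity(site_a, site_b) * 100 >= min_identity_pct * len(site_a)


# The four alignments from the slide, as (human, mouse) pairs
pairs = [
    ("ACTTTCCTACATCTATCTATA", "ACTTTGATACATCTCTCTATA"),
    ("ACTTTGATACATCTATCTATA", "ACTTTGATACATCTCTCTATA"),
    ("-----GATACATCTATCTATA", "ACTTTGATAC" + "-" * 11),
    ("ACTTTGATACATCTATCTATA", "ACTTT" + "-" * 16),
]

identities = [column_identity(h, m) for h, m in pairs]
conserved = [site_is_conserved(h, m, 5, 15) for h, m in pairs]
for (h, m), ident, ok in zip(pairs, identities, conserved):
    print(h, m, ident, ok)
```

Under these assumed settings only the second alignment keeps the site: the first has too many mismatches inside the window, and the last two place gaps across it.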
[Figure 1, panels A-C: rVista 2.0 workflow. Input: Seq A and Seq B, as new or pre-computed alignments from (1) blastz, (2) zPicture, or (3) ECR Browser. Then select transcription factors and matrix similarity, using Biobase matrices or user-defined consensus sequences.]
zPicture-rVista 2.0 interconnection
zPicture
rVista 2.0
ECR Browser: Tool for Browsing Genome Conservation Profiles
http://ecrbrowser.dcode.org
Grab ECR :: direct access to a conserved element
Genome Alignment: Align your sequence to a vertebrate genome
Genome Alignment
AC146831
Genome alignment: Output page
ECR Browser contains rVista portal
[Figure 2, panels A-C: Cardiac Enhancer]

Human GGAATGTCATTAATGCGCTGGGGAGACGTCCATTGGAGACAGGCGGCGTTATCCG
      |||||||||||||||||| ||||||||||||||||||||||||| ||||||||||
Mouse GGAATGTCATTAATGCGCCGGGGAGACGTCCATTGGAGACAGGCAGCGTTATCCG

Two Smad sites are marked on the alignment.
Wild-type:     …AGACAGGCA…
Smad-mutation: …AGCCCGGGA…
eShadow: Phylogenetic Shadowing of Closely Related Species
http://eshadow.dcode.org
Phylogenetic shadowing on multiple (10-14) primate sequences
Apo-B
Plasminogen
LXR-alpha
CETP
Boffelli et al., Science, 2003
CREME: Using Microarray Data to Decode Genome Regulation
http://crem.dcode.org
TFBS in Promoter ECRs of RefSeq genes
~13k RefSeq loci
~8k conserved promoters
414 TRANSFAC PWMs
~3M predicted TFBS
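Predicting binding sites from position weight matrices (PWMs) typically means sliding each matrix along a promoter and keeping windows whose score passes a cutoff. A minimal Python sketch of that idea follows, under stated assumptions: `TOY_PWM` is an invented 4-column matrix (not a TRANSFAC entry), the log-odds score against a uniform background and the cutoff of 4.0 are illustrative choices, and only the forward strand is scanned.

```python
from math import log2

# Invented matrix of per-position base probabilities; consensus is AGAC.
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def pwm_score(pwm, site, background=0.25):
    """Log-odds score of a site against a uniform base background."""
    return sum(log2(column[base] / background)
               for column, base in zip(pwm, site))


def scan(pwm, sequence, min_score):
    """Return (position, site, score) for forward-strand windows that
    score at least min_score; windows with non-ACGT symbols are skipped."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        site = sequence[i:i + width]
        if any(base not in "ACGT" for base in site):
            continue
        score = pwm_score(pwm, site)
        if score >= min_score:
            hits.append((i, site, score))
    return hits


hits = scan(TOY_PWM, "GGAGACAGGC", min_score=4.0)
for position, site, score in hits:
    print(position, site, round(score, 2))
```

Restricting the scan to conserved promoter regions, as the slide describes, would amount to calling `scan` only on the ECR portions of each promoter.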
Testing Motif Abundances
• Identify enriched motifs in a gene set relative to a background set.
• Take into account length of promoters.
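One way to combine the two bullets above, offered as an assumption since the slides do not name the statistical test: treat each predicted site as landing in the gene set with probability equal to the set's share of total promoter length, then compute an upper-tail binomial p-value for the observed count. The function name and arguments in this Python sketch are hypothetical.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def length_aware_enrichment(hits_set, len_set, hits_bg, len_bg):
    """P-value for seeing at least hits_set motif hits in the gene set.

    hits_set, len_set: motif hits and total promoter length (bp) in the set
    hits_bg, len_bg:   the same for background promoters outside the set
    Null model: hits fall uniformly along all promoter sequence.
    """
    if len_set <= 0 or len_bg <= 0:
        raise ValueError("promoter lengths must be positive")
    total_hits = hits_set + hits_bg
    share = len_set / (len_set + len_bg)   # expected fraction of hits in the set
    p_value = sum(exp(log_binom_pmf(k, total_hits, share))
                  for k in range(hits_set, total_hits + 1))
    return min(1.0, p_value)


# Same total number of hits, different split between set and background
strong = length_aware_enrichment(hits_set=10, len_set=1000, hits_bg=10, len_bg=9000)
weak = length_aware_enrichment(hits_set=2, len_set=1000, hits_bg=18, len_bg=9000)
print(strong, weak)
```

Because the null probability depends on sequence length rather than gene count, a gene set with unusually long promoters is not reported as enriched merely for containing more sequence.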
Filtering Similar PWMs
• TRANSFAC contains many redundancies:
  – Different PWMs for the same TF.
  – Similar PWMs for TFs from the same family.
• Filtering strategy:
  – For two PWMs that tend to co-occur in a very small window (4 bp), remove the less enriched one.
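The filtering strategy above can be sketched in Python as a greedy pass over PWMs ordered by enrichment. Several details here are assumptions: "tend to co-occur" is modeled as at least half of one PWM's sites lying within the window of the other's, enrichment is represented by a p-value (smaller means more enriched), and the PWM names, positions, and p-values are invented for the example.

```python
def cooccurrence_fraction(hits_a, hits_b, window=4):
    """Fraction of sites in hits_a that have a hits_b site within
    `window` bp in the same promoter. Hits are (promoter, position) pairs."""
    if not hits_a:
        return 0.0
    positions_b = {}
    for promoter, pos in hits_b:
        positions_b.setdefault(promoter, []).append(pos)
    near = sum(
        1 for promoter, pos in hits_a
        if any(abs(pos - other) <= window
               for other in positions_b.get(promoter, ()))
    )
    return near / len(hits_a)


def filter_similar_pwms(hits, pvalues, window=4, min_fraction=0.5):
    """Keep PWMs from most to least enriched, dropping any PWM that
    co-occurs with an already kept (more enriched) PWM."""
    kept = []
    for pwm in sorted(hits, key=lambda name: pvalues[name]):
        redundant = any(
            max(cooccurrence_fraction(hits[pwm], hits[other], window),
                cooccurrence_fraction(hits[other], hits[pwm], window))
            >= min_fraction
            for other in kept
        )
        if not redundant:
            kept.append(pwm)
    return kept


# Invented example: two E2F-like matrices hit nearly the same positions
hits = {
    "E2F_A": [("g1", 100), ("g2", 250), ("g3", 40)],
    "E2F_B": [("g1", 102), ("g2", 251), ("g3", 300)],
    "NFY_A": [("g1", 400), ("g2", 30)],
}
pvalues = {"E2F_A": 1e-6, "E2F_B": 1e-4, "NFY_A": 1e-3}

kept = filter_similar_pwms(hits, pvalues)
print(kept)
```

Processing in order of enrichment guarantees that when two matrices overlap, the one removed is always the less enriched of the pair.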
Human Cell Cycle
16 enriched PWMs
1089 modules
336 genes, Whitfield et al. 02.
7 significant modules
5 coherently expressed
E2F, NFY, CREB…
Human Cell Cycle
DELTAEF1, EVI1, GR: 11 genes, p=0.01
Validation on a known module
• NFAT-AP1:
  – 10 known genes containing multiple regulatory elements. In all of them, NFAT is upstream of AP1.
  – CREME reported the correct module only (p=0.01).
  – CREME identified the correct orientation of the TFBS.
  – The module was identified even after adding 10 random promoters to the gene set.
Colleagues and collaborators
Lawrence Livermore National Laboratory
UC, Berkeley
Stanford
Lawrence Berkeley National Laboratory
Pennsylvania State University
www.dcode.org
Gaby Loots
Lisa Stubbs
Roded Sharan
Asa Ben-Hur
Ross Hardison
Webb Miller
Marcelo Nobrega
Dario Boffelli
Sha Hammond