Computational Analysis of Genome Sequences Steven Salzberg The Institute for Genomic Research (TIGR) and The Johns Hopkins University

Post on 31-Dec-2015


Computational Analysis of Genome Sequences

Steven Salzberg
The Institute for Genomic Research (TIGR) and The Johns Hopkins University

The Genomics Revolution

• 1995: 1st genome (H. influenzae, TIGR)
• 1996: 1st eukaryote (S. cerevisiae)
• 2000: 29 complete microbial genomes
  • 22 in progress at TIGR
  • 50+ in progress worldwide
• 3 complete eukaryotes
  • yeast, nematode, fruit fly
• 2 major projects in 2000:
  • Human (3.3 billion bp)
  • Arabidopsis thaliana (125 million bp)

Genomes Completed at TIGR

Organism (genome size)                 Reference
Haemophilus influenzae (1.83 Mb)       Fleischmann et al., Science 269, 496-512 (1995)
Mycoplasma genitalium (0.58 Mb)        Fraser et al., Science 270, 397-403 (1995)
Methanococcus jannaschii (1.7 Mb)      Bult et al., Science 273, 1058-73 (1996)
Helicobacter pylori (1.6 Mb)           Tomb et al., Nature 388, 539-47 (1997)
Archaeoglobus fulgidus (2.1 Mb)        Klenk et al., Nature 390, 364-70 (1997)
Borrelia burgdorferi (1.5 Mb)          Fraser et al., Nature 390, 580-6 (1997)
Treponema pallidum (1.1 Mb)            Fraser et al., Science 281, 375-88 (1998)
Plasmodium falciparum chr2 (1 Mb)      Gardner et al., Science 282, 1126-32 (1998)
Thermotoga maritima (1.8 Mb)           Nelson et al., Nature 399, 323-9 (1999)
Deinococcus radiodurans (3.3 Mb)       White et al., Science 286, 1571-7 (1999)
Arabidopsis thaliana chr2 (19 Mb)      Lin et al., Nature 402, 761-8 (1999)
Neisseria meningitidis (2.3 Mb)        Tettelin et al., Science 287, 1809-15 (2000)
Chlamydia pneumoniae (1.2 Mb)          Read et al., Nucleic Acids Res 28, 1397-406 (2000)
Chlamydia trachomatis (1.0 Mb)         Read et al., Nucleic Acids Res 28, 1397-406 (2000)
Vibrio cholerae (4.0 Mb)               Heidelberg et al., Nature, in press
Mycobacterium tuberculosis (4.4 Mb)    Fleischmann et al., manuscript in preparation
Streptococcus pneumoniae (2.2 Mb)      Tettelin et al., manuscript in preparation
Caulobacter crescentus (4.0 Mb)        Nierman et al., manuscript in preparation
Chlorobium tepidum (2.1 Mb)            Eisen et al., manuscript in preparation
Porphyromonas gingivalis (2.2 Mb)      Fleischmann et al., manuscript in preparation

Genomes in progress at TIGR

Organism (genome size)                     Funding source
Plasmodium falciparum chr 14 (3.4 Mb)      BWF/DoD
Plasmodium falciparum chr 10,11 (4 Mb)     NIAID/DoD
Trypanosoma brucei chr 2 (1 Mb)            NIAID
Enterococcus faecalis (3.0 Mb)             NIAID
Mycobacterium avium (4.4 Mb)               NIAID
Pseudomonas putida (6.2 Mb)                DOE
Shewanella putrefaciens (4.5 Mb)           DOE
Staphylococcus aureus (2.8 Mb)             NIAID, MGRI
Dehalococcoides ethenogenes (1.5 Mb)       DOE
Desulfovibrio vulgaris (3.2 Mb)            DOE
Thiobacillus ferrooxidans (2.9 Mb)         DOE
Chlamydia psittaci GPIC (1.2 Mb)           NIAID
Bacillus anthracis (5.0 Mb)                ONR/DOE/NIAID
Treponema denticola (3.0 Mb)               NIDR
C. hydrogenoformans (2.0 Mb)               DOE
Methylococcus capsulatus (4.6 Mb)          DOE
Geobacter sulfurreducens (4.0 Mb)          DOE
Wolbachia sp (Drosophila) (1.4 Mb)         NIH
Colwellia sp (1.0 Mb)                      DOE
Mycobacterium smegmatis (4.0 Mb)           NIAID
Staphylococcus epidermidis (2.5 Mb)        NIAID
Theileria parva (10 Mb)                    ILRI/TIGR

A Microbial Genome Sequencing Project

Random sequencing → Genome assembly → Annotation → Data release

• Random sequencing
  • Library construction
  • Colony picking
  • Template preparation
  • Sequencing reactions
  • Base calling
  • Sequence files
• Genome assembly
  • TIGR Assembler
  • Genome scaffold
  • Ordered contig set
  • Gap closure, sequence editing (combinatorial PCR, POMP)
  • Re-assembly
  • ONE ASSEMBLY!
• Annotation
  • Gene finding
  • Homology searches
  • Initial role assignments
  • Metabolic pathways
  • Gene families
  • Comparative genomics
  • Transcriptional/translational regulatory elements
  • Repetitive sequences
• Data release
  • Publication
  • www.tigr.org
• Sample tracking

Gene Finding

• Gene finding plays an ever-larger role in high-speed DNA sequencing projects
  • There’s no time for much else!
  • 1000’s of genes generated each month at a high-throughput sequencing facility
• Separate gene finders are needed for every organism
  • Training on organism X, finding genes on Y, generates inferior results
• Bootstrapping problem: training data is hard to find

Open Reading Frames: 6 possibilities

The identical sequence, read in the three forward frames (the other three frames are on the reverse strand):

TCG TAC GTA GCT AGC TAG CTA AGC ATG CAT CGA TCG ATC GAT

T CGT ACG TAG CTA GCT AGC TAA GCA TGC ATC GAT CGA TCG AT

TC GTA CGT AGC TAG CTA GCT AAG CAT GCA TCG ATC GAT CGA T
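
The six reading frames can be enumerated mechanically. The following is a minimal sketch, not code from GLIMMER: the function names `reverse_complement` and `six_frames` are hypothetical, and the input is assumed to be an uppercase string over A, C, G, T.

```python
# Translation table for complementing uppercase DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map (strand, offset) to the list of complete codons in that frame."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping any
            # incomplete codon at the end.
            frames[(strand, offset)] = [
                s[i:i + 3] for i in range(offset, len(s) - 2, 3)
            ]
    return frames


# The example sequence from the slide.
seq = "TCGTACGTAGCTAGCTAGCTAAGCATGCATCGATCGATCGAT"
frames = six_frames(seq)
for key in sorted(frames):
    print(key, " ".join(frames[key]))
```

The three "+" frames reproduce the three groupings shown on the slide; the three "-" frames are the same offsets applied to the reverse complement.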

GLIMMER: A Microbial Gene Finder

• GLIMMER 2.0: released late 1999
• > 200 site licenses worldwide
• Works on bacteria, archaea, viruses too
• Malaria (eukaryotic) version: GLIMMERM
• Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999
• Web site and code: http://www.tigr.org/

Uniform Markov Models

• Use conditional probability of a sequence position given previous k positions in the sequence.
• Fixed, kth-order model: bigger k’s yield better models (as long as data is sufficient).
• Probability (score) of sequence s1 s2 s3 … sn is:

$$\prod_{i=1}^{n} P(s_i \mid s_{i-k} \ldots s_{i-1})$$

Uniform Markov Models

Advantages:
• Easy to train. Count frequencies of (k+1)-mers in training data.
• Easy to assign a score to a sequence.

Disadvantages:
• (k+1)-mers can be undersampled; i.e., occur too infrequently in training data.
• Models sequence as fixed-length chunks, which may not be the best model of biology.
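
Training and scoring a fixed, kth-order model can be sketched in a few lines. This is an illustration under stated assumptions, not the GLIMMER implementation: the function names are hypothetical, scoring starts at the first position that has a full k-base context, and a pseudocount (an addition not mentioned in the slides) keeps unseen (k+1)-mers from getting zero probability.

```python
import math
from collections import Counter


def train_fixed_order(sequences, k):
    """Count (k+1)-mers and their k-base contexts in the training data."""
    kplus1 = Counter()
    context = Counter()
    for s in sequences:
        for i in range(k, len(s)):
            kplus1[s[i - k:i + 1]] += 1   # context followed by the next base
            context[s[i - k:i]] += 1      # the context alone
    return kplus1, context


def log_score(seq, k, kplus1, context, pseudocount=1.0, alphabet=4):
    """Sum of log P(s_i | previous k bases) over positions with a full context.

    The pseudocount is an assumption of this sketch.
    """
    total = 0.0
    for i in range(k, len(seq)):
        num = kplus1[seq[i - k:i + 1]] + pseudocount
        den = context[seq[i - k:i]] + pseudocount * alphabet
        total += math.log(num / den)
    return total


# Toy training set: a first-order model learned from one short sequence.
kplus1, context = train_fixed_order(["ACACACAC"], k=1)
print(log_score("ACAC", 1, kplus1, context))
print(log_score("AAAA", 1, kplus1, context))
```

The sum of logs is the logarithm of the product on the previous slide; working in log space avoids underflow on long sequences.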

Interpolated Markov Models

• Use a linear combination of Markov chains of orders 0 through 8; for example:
  c8 P(g|atcagtta) + c7 P(g|tcagtta) + … + c1 P(g|a) + c0 P(g)
  where c0 + c1 + … + c8 = 1
• Equivalent to interpolating the results of multiple Markov chains
• Score of a sequence is the product of interpolated probabilities of bases in the sequence
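
The linear combination can be written out directly. The sketch below is illustrative only: `interpolated_probability` and the layout of `prob_tables` are hypothetical, the toy example uses orders 0 through 2 rather than 0 through 8, and the probabilities and weights are made-up numbers chosen to sum cleanly.

```python
def interpolated_probability(base, context, prob_tables, weights):
    """Weighted sum over orders 0..len(weights)-1 of P(base | last `order` bases).

    prob_tables[order] maps (context_suffix, base) to a probability;
    weights[order] is the coefficient c_order. The weights must sum to 1 and
    the context must be at least as long as the highest order.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if len(context) < len(weights) - 1:
        raise ValueError("context is shorter than the highest order")
    total = 0.0
    for order, c in enumerate(weights):
        suffix = context[len(context) - order:]  # empty string for order 0
        total += c * prob_tables[order].get((suffix, base), 0.0)
    return total


# Toy tables for orders 0, 1 and 2 (illustrative numbers only).
prob_tables = [
    {("", "g"): 0.25},
    {("a", "g"): 0.5},
    {("ta", "g"): 0.9},
]
weights = [0.2, 0.3, 0.5]
p = interpolated_probability("g", "atcagtta", prob_tables, weights)
print(p)
```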

IMM’s vs. Fixed-Order Models

• Performance:
  • IMM should always do at least as well as fixed-order.
  • E.g., even if kth-order model is correct, it can be simulated by (k+1)st-order.
  • Our results support this.
• IMM result can be used as fixed-order model.
• IMM slightly harder to train and uses more memory.

IMM Training

Problem: How to determine the weights of all the thousands of k-mers?

Traditionally done with E-M algorithm using cross-validation (deleted estimation). Slow. Overtraining can be a problem.

GLIMMER IMM Training

Our approach assumes:
• Longer context is always better
• Only reason not to use it is undersampling in training data.
• If sequence occurs frequently enough in training data, use it, i.e., λ = 1
• Otherwise, use frequency and χ² significance to set λ.
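
One way to read the rule above is sketched below. The slides do not give the numbers, so the count threshold of 400, the confidence cut-off of 0.5, and the function names are assumptions of this sketch; the χ² confidence is taken as an input rather than computed, and the two-level `interpolate` helper is likewise illustrative.

```python
def glimmer_style_lambda(context_count, chi2_confidence,
                         threshold=400, min_confidence=0.5):
    """Weight for a context, following the rule on the slide.

    context_count: how often the context occurs in the training data.
    chi2_confidence: confidence (0..1) from a chi-square test that the
        longer-context predictions differ from the shorter-context ones.
    threshold and min_confidence are assumed values, not from the slides.
    """
    if context_count >= threshold:
        return 1.0                      # frequent enough: use it fully
    if chi2_confidence < min_confidence:
        return 0.0                      # not significant: fall back entirely
    # Otherwise scale by both significance and frequency.
    return chi2_confidence * context_count / threshold


def interpolate(lam, p_longer, p_shorter):
    """Blend the longer-context estimate with the shorter-context one."""
    return lam * p_longer + (1.0 - lam) * p_shorter


print(glimmer_style_lambda(500, 0.1))   # frequent context
print(glimmer_style_lambda(100, 0.3))   # rare and not significant
print(interpolate(glimmer_style_lambda(200, 0.8), 0.9, 0.25))
```

Because the weight is computed directly from counts, no iterative E-M pass is needed, which fits the slide's emphasis on training speed.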

How GLIMMER Works

Three separate programs:
• long-orfs: automatically extract long open reading frames that do not overlap other long orfs.
• IMM model builder. Takes any kind of sequence data.
• Gene predictor. Takes genome sequence and finds all the genes.

Gene Predictor

• Finds & scores entire ORF’s.
• Uses 7 competing models: 6 reading frames plus “random” model.
• Score for an ORF is the probability that the “right” model generated it.
• 3-periodic Markov model
• High-scoring ORF’s are then checked for overlaps.
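
Scoring an ORF against competing models can be illustrated as a normalization over per-model likelihoods. The slides do not spell out the exact formula, so this sketch assumes equal prior weight for the seven models; the function name, the model labels, and the log-likelihood values are all hypothetical.

```python
import math


def posterior_of_models(log_likelihoods):
    """Turn per-model log-likelihoods into probabilities (equal priors assumed)."""
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_likelihoods.values())
    unnormalized = {name: math.exp(ll - m) for name, ll in log_likelihoods.items()}
    z = sum(unnormalized.values())
    return {name: value / z for name, value in unnormalized.items()}


# Made-up log-likelihoods of one ORF under six frame models and a random model.
log_likelihoods = {
    "+1": -410.0, "+2": -432.5, "+3": -429.0,
    "-1": -436.2, "-2": -431.8, "-3": -434.0,
    "random": -428.4,
}
posterior = posterior_of_models(log_likelihoods)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 6))
```

Under this reading, an ORF whose own frame ("+1" here) takes nearly all of the probability mass would be a high-scoring candidate.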

Glimmer 2.0 IMM design

[Figure: context tree for the example sequence ATGCATGATCGAG with a 12 bp context. Each node selects a context position (Pos -1 at the root, then Pos -2, Pos -3, Pos -4, ...) and branches on the base a, c, g, or t; the tree is 8 levels deep.]

Better Overlap Resolution

GLIMMER 2.0’s Performance

Organism          Genes annotated   Genes found      Additional genes
H. influenzae     1738              1720 (99.0%)     250 (14%)
M. genitalium     483               480 (99.4%)      81 (17%)
M. jannaschii     1727              1721 (99.7%)     221 (13%)
H. pylori         1590              1550 (97.5%)     293 (18%)
E. coli           4269              4158 (97.4%)     824 (19%)
B. subtilis       4100              4030 (98.3%)     586 (14%)
A. fulgidus       2437              2404 (98.6%)     274 (11%)
B. burgdorferi    853               843 (99.3%)      62 (7%)
T. pallidum       1039              1014 (97.6%)     180 (17%)
T. maritima       1877              1854 (98.8%)     190 (10%)

GLIMMER 2.0 on known genes

Organism          Genes annotated   Known genes   Correct predictions
H. influenzae     1738              1501          1496 (99.7%)
M. genitalium     483               478           476 (99.6%)
M. jannaschii     1727              1259          1256 (99.8%)
H. pylori         1590              1092          1084 (99.3%)
E. coli           4269              2656          2632 (99.1%)
B. subtilis       4100              1249          1231 (98.6%)
A. fulgidus       2437              1799          1786 (99.3%)
B. burgdorferi    853               601           600 (99.8%)
T. pallidum       1039              755           747 (98.9%)
T. maritima       1877              1504          1493 (99.3%)
Average                                           (99.3%)

Speed
• Training for 2 Megabase genome: < 1 minute (on a Pentium-450)
• Find all genes in 2 Mb genome: < 1 minute

Impact: GLIMMER was used for:
• B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR)
• C. trachomatis (blindness, STD) (Berkeley/Stanford)
• C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
• T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR)
• X. fastidiosa (Brazilian consortium)
• Plasmodium falciparum (malaria) [GlimmerM]
• Arabidopsis thaliana (model plant) [GlimmerM]
• Others: viruses, simple eukaryotes, more bacteria

Self-Similarity Scans

• Idea: analyze a whole genome by counting 3-mers in all 6 frames
• Analyze small windows (2000 bp, 10000 bp) using the same statistic
• Algorithm:
  • Build model of entire sequence
  • Apply the χ² statistic to compare windows to the genome itself
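
The scan can be sketched as follows. This is a minimal illustration rather than the code used for the figures: the function names, the window step, and the toy genome are assumptions, and overlapping 3-mers are counted on both strands as one way of covering all six frames.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trimer_counts(seq):
    """Count overlapping 3-mers on both strands (all 6 frames)."""
    counts = Counter()
    for s in (seq, seq.translate(COMPLEMENT)[::-1]):
        for i in range(len(s) - 2):
            counts[s[i:i + 3]] += 1
    return counts


def chi_square(window_counts, genome_counts):
    """Chi-square of a window's 3-mer counts against genome-wide frequencies."""
    n_window = sum(window_counts.values())
    n_genome = sum(genome_counts.values())
    stat = 0.0
    for trimer, g in genome_counts.items():
        expected = n_window * g / n_genome
        observed = window_counts.get(trimer, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def scan(genome, window=2000, step=1000):
    """Return (start, chi-square) for each window along the genome."""
    genome_counts = trimer_counts(genome)
    return [
        (start, chi_square(trimer_counts(genome[start:start + window]), genome_counts))
        for start in range(0, len(genome) - window + 1, step)
    ]


# Toy genome: a uniform background with an atypical insert at position 400.
genome = "ACGT" * 100 + "G" * 40 + "ACGT" * 100
scores = scan(genome, window=40, step=20)
peak = max(scores, key=lambda item: item[1])
print(peak)
```

Windows whose composition differs from the genome as a whole produce peaks in the statistic, which is how atypical regions stand out in the plots that follow.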

Haemophilus influenzae (meningitis)

[Figure: self-similarity scan along the genome; plotted tracks: GC% and χ²]

Thermotoga maritima (hyperthermophile)

Vibrio cholerae (cholera)

On the other side of CTX prophage is a region encoding an RTX toxin (rtxA) and its activator (rtxC) and transporters (rtxBD). A third transporter gene has been identified that is a paralog of rtxB, and is transcribed in the same direction as rtxBD. Downstream of this gene are two genes encoding a sensor histidine kinase and response regulator. Trinucleotide composition analysis suggests that the RTX region was horizontally acquired along with the sensor histidine kinase/response regulator, suggesting these regulators effect expression of the closely linked RTX transcriptional units.

--Heidelberg et al., Nature, in press.

MUMmer

• Aligns 2 complete genomes
• Maximal Unique Matches
• Suffix trees
• Very fast alignment of very long DNA sequences
• Ref: Delcher et al., Nucl. Acids Res., 1999
• Software at: http://www.tigr.org/softlab

The Problem

• Efficiently compute alignments between long sequences to identify biologically interesting features.
  • E.g., two strains of M. tuberculosis, each ~4.4 Mb
  • E.g., two versions of a genome at different stages of closure
• Compute alignment in less than 2 minutes

Maximal Unique Sequences

Sequences in genomes A and B that:
• Occur exactly once in A and in B
• Are not contained in any larger such sequence

[Figure: a matching segment in A and B, annotated "Occurs only here" and "Mismatch at both ends"]
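
The definition can be checked directly on small inputs. The sketch below is a brute-force illustration of the definition only; it is not how MUMmer finds matches (MUMmer uses a suffix tree), and the function names and the `min_len` parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A match that can be extended to the left is not maximal here.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            match = a[i:i + length]
            # Keep it only if it occurs exactly once in each sequence.
            if (length >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, length))
    return mums


mums = find_mums("TTACGTACGGA", "CCACGTACGCC")
print(mums)
```

In the example, the shorter shared strings such as ACG and TACG are rejected because they occur twice in the first sequence, leaving only the 7-base match ACGTACG.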

Select the longest consistent set of MUMs

• Occur in the same order in A and B

[Figure: two diagrams of MUMs drawn between sequences A and B]
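
Selecting a consistent set is a variant of the longest-increasing-subsequence problem. The sketch below is a simple quadratic-time illustration, not MUMmer's implementation: it assumes "longest" means maximum total matched length, ignores overlaps between neighbouring MUMs, and uses a hypothetical function name and made-up MUM coordinates.

```python
def longest_consistent_chain(mums):
    """Pick MUMs that appear in the same order in A and B.

    mums: list of (pos_a, pos_b, length).
    Returns the chain with the greatest total length whose positions
    increase in both sequences.
    """
    ordered = sorted(mums)                    # sort by position in A
    if not ordered:
        return []
    best = [m[2] for m in ordered]            # best total length ending at i
    prev = [-1] * len(ordered)                # back-pointers for traceback
    for i, (a_i, b_i, len_i) in enumerate(ordered):
        for j in range(i):
            a_j, b_j, _ = ordered[j]
            if a_j < a_i and b_j < b_i and best[j] + len_i > best[i]:
                best[i] = best[j] + len_i
                prev[i] = j
    # Trace back from the best-scoring end point.
    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


mums = [(0, 0, 10), (20, 50, 5), (30, 20, 8), (60, 40, 12), (80, 80, 6)]
chain = longest_consistent_chain(mums)
print(chain)
```

The MUM at (20, 50) is dropped because keeping it would force out the longer matches at (30, 20) and (60, 40), which lie before position 50 in B.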

Suffix Trees

• A tree with edges labelled by strings
  • Labels of child edges of a node begin with distinct letters
  • Each leaf L represents a sequence: the labels on the path to L from the root
• Holds all suffixes of a set of sequences
  • A suffix is a subsequence that extends to the end of its sequence
• The suffix tree for sequences A and B:
  • Contains less than 2(|A| + |B|) nodes.
  • Can be constructed in O(|A| + |B|) time!
• Still need lots of RAM
  • All the analyses here were run on a desktop PC

Analyze the gaps between adjacent MUMs

• Small gaps can be aligned with Smith-Waterman algorithm
• Large gaps can be aligned recursively
• Large inserts can be searched for separately. Many will be inconsistent MUMs
• Overlapping MUMs indicate variation in copy number of small repeats
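
For the small gaps, a score-only Smith-Waterman pass is enough to show the idea. This is a minimal sketch with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions, not the parameters used in MUMmer.

```python
def smith_waterman_score(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between x and y (linear gap penalty)."""
    prev = [0] * (len(y) + 1)       # previous row of the DP matrix
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
print(smith_waterman_score("AAAA", "TTTT"))
```

The cost is proportional to the product of the two lengths, which is why this step is practical only for the short regions left between MUMs.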

M. tuberculosis CSU93 vs. H37Rv

       A     C     G     T
A      -    66   164     9
C     48     -    81   169
G    164    89     -    44
T     11   159    61     -

[Figure: alignment plot, with one segment labelled "a MUM"]

M. genitalium vs. M. pneumoniae

H. pylori 26695 vs. J99

V. cholerae (forward) vs. E. coli

[Figure: alignment plot, with a point labelled "Origin"]

V. cholerae (reverse) vs. E. coli

V. cholerae (both strands) vs. E. coli: a puzzle?

V. cholerae vs. itself

S. pyogenes vs. S. pneumoniae

S. pyogenes vs. itself

M. leprae vs. M. tuberculosis

[Figure: alignment plot; axes: M. leprae and M. tuberculosis]

X-alignments: how?

[Figure: series of diagrams showing segments numbered 1 to 6 arranged around the origin (Ori) in different orders]

Chr 2 vs. Chr 4 of Arabidopsis thaliana: discovery of a 4 Mb duplication

• 1100 genes
• 430 (39%) duplicated

Acknowledgements

• GLIMMER, GLIMMERM
  • Arthur Delcher, Simon Kasif, Owen White, Mihaela Pertea
• MUMmer
  • Arthur Delcher, Simon Kasif, Jeremy Peterson, Rob Fleischmann, Owen White
• Analyses
  • Numerous TIGR faculty and staff, including: Jonathan Eisen, Owen White, Rob Fleischmann, Hervé Tettelin, Tim Read, Maria Ermolaeva, John Heidelberg, Ian Paulsen, Malcolm Gardner, Claire Fraser, Clyde Hutchison, ...
• Supported by:
  • National Institutes of Health (NHGRI, NLM)
  • National Science Foundation (CISE, BIO)