Challenges for metagenomic data analysis and lessons from viral metagenomes
[What would you do if sequencing were free?]
Rob Edwards
http://phage.sdsu.edu/~rob
San Diego State University
Fellowship for Interpretation of Genomes
SGM Meeting, Warwick, April 2006
Outline
• The envy is not mine
• A tour around the world, thanks to phage
• People suck
• What is the most successful gene in evolution?
• Is there a Future?
This is all 454 sequence data
• 21 libraries
– 10 microbial, 11 phage
• 597,340,328 bp total
– 20% of the human genome
– 50% of all complete and partial microbial genomes
• 5,769,035 sequences
– Average 274,716 per library
• Average read length 103.5 bp
– Average read length has not increased in 7 months
• Cost 0.04¢ per bp
Sequencing is cheap and easy.
Bioinformatics is neither.
The Soudan Mine, Minnesota
Red Stuff: Oxidized
Black Stuff: Reduced
Red and Black Samples Are Different
Cloned and 454-sequenced 16S are indistinguishable
[Figure panels: Black stuff, Red, Cloned Red]
There are different amounts of metabolism in each environment
There are different amounts of substrates in each environment
[Figure panels: Black Stuff, Red Stuff]
But are the differences significant?
• Sample 10,000 proteins from site 1
• Count frequency of each “subsystem”
• Repeat 20,000 times
• Repeat for sample 2
• Combine both samples
• Sample 10,000 proteins 20,000 times
• Build 95% CI
• Compare medians from sites 1 and 2 with 95% CI
Rodriguez-Brito (2006). BMC Bioinformatics
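The resampling steps above can be sketched in Python. This is a minimal illustration, not the published implementation: sampling with replacement, the percentile-based interval, and the rule for flagging a difference (either site's median falling outside the pooled 95% interval) are assumptions, and the function names are hypothetical. Each site is represented as a list with one subsystem label per protein.

```python
import random
from collections import Counter
from statistics import median


def resample_counts(proteins, sample_size, n_reps, rng):
    """Draw n_reps samples (with replacement, an assumption) and record
    how often each subsystem label appears in every draw."""
    subsystems = sorted(set(proteins))
    counts = {s: [] for s in subsystems}
    for _ in range(n_reps):
        tally = Counter(rng.choices(proteins, k=sample_size))
        for s in subsystems:
            counts[s].append(tally.get(s, 0))
    return counts


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval: the values at the 2.5% and 97.5% ranks."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(lower * (n - 1))], ordered[int(upper * (n - 1))]


def compare_sites(site1, site2, sample_size=10_000, n_reps=20_000, seed=0):
    """Compare per-subsystem medians from each site against a 95% interval
    built from the pooled sample."""
    rng = random.Random(seed)
    counts1 = resample_counts(site1, sample_size, n_reps, rng)
    counts2 = resample_counts(site2, sample_size, n_reps, rng)
    pooled = resample_counts(site1 + site2, sample_size, n_reps, rng)

    results = {}
    for subsystem, values in pooled.items():
        low, high = percentile_interval(values)
        # A subsystem absent from one site has a median count of zero there.
        m1 = median(counts1.get(subsystem, [0]))
        m2 = median(counts2.get(subsystem, [0]))
        # Assumed decision rule: flag if either median leaves the pooled interval.
        differs = not (low <= m1 <= high) or not (low <= m2 <= high)
        results[subsystem] = {
            "median1": m1,
            "median2": m2,
            "ci": (low, high),
            "differs": differs,
        }
    return results
```

The defaults mirror the numbers on the slide (10,000 proteins, 20,000 repeats); smaller values of `sample_size` and `n_reps` are more practical for a quick trial.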
Subsystem differences & metabolism
Iron acquisition: Black Stuff
• Siderophore enterobactin biosynthesis
• ferric enterobactin transport
• ABC transporter ferrichrome
• ABC transporter heme
Black stuff: ferrous iron (Fe2+, ferroan [(Mg,Fe)6(Si,Al)4O10(OH)8])
Red stuff: ferric iron (goethite [FeO(OH)])
Nitrification differentiates the samples
Edwards (2006) BMC Genomics
The challenge is explaining the differences between samples
Red Sample
• Arg, Trp, His
• Ubiquinone
• FA oxidation
• Chemotaxis, Flagella
• Methylglyoxal metabolism
Black Sample
• Ile, Leu, Val
• Siderophores
• Glycerolipids
• NiFe hydrogenase
• Phenylpropionate degradation
We can cheaply compare the important biochemistry happening in different environments
We don’t care which organisms are doing the metabolism, but we know what organisms are there
Why Phages?
• Phages are viruses that infect bacteria
– 10:1 ratio of phages:bacteria
– 10^31 phages on the planet
• Specific interactions (probably)
– one virus : one host
• Small genome size
– Higher coverage
• Horizontal gene transfer
– 10^25-10^28 bp DNA per year in the oceans
• Can’t do fosmids
Phages In The World's Oceans
GOM: 41 samples, 13 sites, 5 years
SAR: 1 sample, 1 site, 1 year
BBC: 85 samples, 38 sites, 8 years
ARC: 56 samples, 16 sites, 1 year
LI: 4 sites, 1 year
Most Marine Phage Sequences are Novel
Thanks: Mya Breitbart
Phages are specific to environments
Phage Proteomic Tree v. 5 (Edwards, Rohwer)
ssDNA
-like
T7-like
T4-like
Marine Single-Stranded DNA Viruses
• 6% of SAR sequences are ssDNA phage (Chlamydia-like Microviridae)
• 40% of viral particles in SAR are ssDNA phage
• Several full-genome sequences were recovered via de novo assembly of these fragments
• Confirmed by PCR and sequencing
12,297 sequence fragments hit using TBLASTX over a ~4.5 kb genome
[Figure: SAR Aligned Against the Chlamydia phi 4 genome. Tracks: individual sequence reads, coverage, concatenated hits. Axis labels: 3890 bp, 4490 bp; 0, 1033.]
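A coverage track like the one in this figure can be computed from the hit coordinates alone. The sketch below assumes each hit has already been reduced to a (start, end) pair in 1-based, inclusive genome coordinates; parsing TBLASTX output is not shown, and the function name is hypothetical. It illustrates the general idea rather than the pipeline used for the talk.

```python
def coverage_depth(hits, genome_length):
    """Return per-base depth (index 0 = genome position 1).

    hits: iterable of (start, end) pairs, 1-based and inclusive.
    Reverse-strand hits may be given with start > end.
    """
    # Difference array: +1 where a hit begins, -1 just past where it ends.
    diff = [0] * (genome_length + 2)
    for start, end in hits:
        first, last = min(start, end), max(start, end)
        # Clip hits that run past either end of the genome.
        first = max(first, 1)
        last = min(last, genome_length)
        if first > last:
            continue
        diff[first] += 1
        diff[last + 1] -= 1

    depth = []
    running = 0
    for pos in range(1, genome_length + 1):
        running += diff[pos]
        depth.append(running)
    return depth
```

The difference-array approach touches each hit once and each base once, so it stays fast even with many thousands of short hits on a small genome.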
Phages, Reefs, and Human Disturbance
The Northern Line Islands Expedition, 2005
[Map labels: Kingman, Palmyra, Washington, Fanning, Christmas]
Christmas to Kingman Bias in No. Phage Hosts
Negative numbers mean relatively more phage hosts at Kingman
More pathogens at Christmas. More people at Christmas.
More photosynthesis at Kingman. No people at Kingman.
Phages enrich for important genes
Rios Mesquites Stromatolites
• No photosynthesis genes in phages
Pozas Azules Stromatolites
• 5 different photosynthesis genes in phages
RNR (ribonucleotide reductase) is the most successful reaction in evolution
Computational Challenges
• Sequence annotations and analysis
– What is there?
– What is it doing?
– How is it doing it?
• Gene predictions in unknowns
– Lutz Krause (Bielefeld)
• Sequence comparisons
– BLAST
– Other ways to rapidly compare short sequences
– What happens when everyone is using 454 sequencing?
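One family of alignment-free methods that could serve as an "other way" to compare short sequences is k-mer set comparison. The sketch below is an illustration only: the talk does not name a specific method, and the function names and the choice of Jaccard similarity are assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k=8):
    """Jaccard similarity of the two sequences' k-mer sets (0.0 to 1.0)."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    if not set_a and not set_b:
        # Both sequences are shorter than k: no k-mers to compare.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Set operations avoid alignment entirely, which is why approaches of this kind scale to millions of ~100 bp reads; the trade-off is lower sensitivity than BLAST for divergent sequences.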
Sequence data from 21 libraries
6 million sequences
600 million bp
• Each BLASTX search takes 1,000 CPU hours
• 21 libraries = 21,000 CPU hours or 2.4 CPU years
• Users want
– repeat runs
– TBLASTX
– more analysis
– more data
– more, more, more, more
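The CPU-time estimate can be checked with a few lines of arithmetic. The sketch assumes, as the slide's own total implies, that the 1,000 CPU hours apply to one library's BLASTX search, and that a CPU year is 365 days of continuous computation. The function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def cpu_years(n_libraries, hours_per_search=1_000):
    """Total CPU time, in years, for one search per library."""
    return n_libraries * hours_per_search / HOURS_PER_YEAR


# 21 libraries at 1,000 CPU hours each is 21,000 CPU hours.
print(round(cpu_years(21), 1))
```

Each repeat run or additional search type (for example TBLASTX) adds at least another multiple of this figure, which is the point of the "more, more, more" list.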
SDSU: Forest Rohwer, Beltran Rodriguez-Brito
USF: Mya Breitbart
Rohwer Lab: Linda Wegley, Florent Angly, Matt Haynes
Stromatolites: Janet Seifert (Rice University), Valeria Souza (UNAM, Mexico)
Math Guys@SDSU: Peter Salamon, Joe Mahaffy, James Nulton, Ben Felts, David Bangor, Steve Rayhawk, Jennifer Mueller
MIT: Ed DeLong
FIG: Veronika Vonstein, Ross Overbeek, Annotators
ANL: Rick Stevens, Bob Olsen, CI Support
Also at SDSU: Anca Segall, Stanley Maloy
UBC: Curtis Suttle, Amy Chan