greengenes.lbl.gov
16S rRNA gene database and workbench compatible with ARB
Todd DeSantis, Phil Hugenholtz, Niels Larson, Igor Dubosarskiy, Jordan Moberg, Yvette Piceno, Ingrid Zubieta, Eoin Brodie, Gary Andersen
LBL - JGI
Andersen Group Program Aims
• Creating a microarray for the simultaneous differentiation and quantification of closely related prokaryotes in complex samples.
The Biomarker
16S rDNA
rRNA (functional molecule)
LSU
SSU
16S rDNA - identify and classify organisms by gene sequence variations.
The Challenges
• 16S sequence deposit rate is increasing.
• Many are mis-annotated and/or chimeric.
• Sequence taxonomy updates lag years behind sequence availability (“Bacteria, Unclassified”).
• Difficult to create and manage MSAs of all 16S seq data (or even thousands) using Clustal/BioEdit/Arb.
• Probe quality is reliant on excellent MSAs and taxonomy.
• “Signatures” can erode as more sequences are discovered.
greengenes.lbl.gov
Stay current
[Figure: Cumulative NCBI 16S records by year, 1992 to 2005; y-axis from 0 to 200,000]
Source: http://www.ncbi.nlm.nih.gov/
Query: ‘16S NOT 1.16S NOT mitochondr* NOT 18S’
Fate of NCBI Records:
short FASTA file (9%)
short BLAST match length (8%)
BLAST match to 18S/Mito SSU (1%)
odd nt insertions (1%)
passed (81%)
greengenes.lbl.gov
Verify ‘16S-ness’
NAST align step 1: find template
• Hand curated MSA provided by Phil.
• Alignment "template" is top BLAST HSP
– q= -1, Favors long match
• Candidate trimmed of extra-16S seq data
– tRNA, intergenic spacer regions, and 23S rDNA
– based on HSP boundaries
• If HSP paired opposite strands, candidate is reverse complemented.
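The trimming and strand-correction rules in step 1 can be sketched as follows. This is a minimal sketch, not NAST's code: it assumes the top HSP has already been parsed into 1-based inclusive coordinates on the candidate plus a flag for opposite-strand pairing. The function name and its arguments are hypothetical.

```python
# Complement table for DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def trim_and_orient(candidate: str, hsp_start: int, hsp_end: int,
                    opposite_strand: bool) -> str:
    """Keep only the candidate region covered by the top HSP.

    hsp_start / hsp_end are assumed to be 1-based, inclusive positions
    on the candidate. Everything outside them (e.g. tRNA, spacer, 23S)
    is discarded. If the HSP paired opposite strands, the trimmed
    region is reverse complemented.
    """
    if not (1 <= hsp_start <= hsp_end <= len(candidate)):
        raise ValueError("HSP boundaries fall outside the candidate")
    trimmed = candidate[hsp_start - 1:hsp_end]
    if opposite_strand:
        trimmed = trimmed.translate(_COMPLEMENT)[::-1]
    return trimmed
```

Passing the HSP as plain integers keeps the sketch independent of any particular BLAST output parser.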
NAST align step 2: gap removal
Preserves global MSA positions (columns) by allowing local misalignments.
DEFINE
  St = post-Align0 template sequence.
  Sc = post-Align0 candidate sequence.
  Ht = alignment space (hyphen) inserted into St by Align0.
  Hc = alignment space (hyphen) inserted into Sc by Align0.
WHILE (St contains one or more Ht) DO
  LHt = character index of distal 5' Ht within St
  L5' = character index of Hc within Sc which is 5' proximal to Ht
  L3' = character index of Hc within Sc which is 3' proximal to Ht
  IF ((LHt - L5') > (L3' - LHt)) Delete Hc found at L3'
  ELSE Delete Hc found at L5'
  Delete template gap character.
END WHILE
Result: Largest MSA of full-length (>1250 nt) 16S rDNA genes.
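The gap removal loop above can be sketched in Python. This is a minimal sketch under stated assumptions, not the NAST implementation: every "-" in the template string is treated as an Ht, every "-" in the candidate as an Hc, and "distal 5' Ht" is read as the 5'-most template gap. The pseudocode does not say what happens when a candidate gap exists on only one side, so the sketch falls back to whichever side is available. The function name is hypothetical.

```python
def nast_gap_removal(template: str, candidate: str, gap: str = "-"):
    """Remove Align0-inserted template gaps while keeping both
    sequences the same length, at the cost of local misalignment."""
    if len(template) != len(candidate):
        raise ValueError("template and candidate must be the same length")
    st, sc = list(template), list(candidate)
    while gap in st:
        l_ht = st.index(gap)  # 5'-most template gap (assumed reading)
        # Nearest candidate gap on the 5' side and on the 3' side of Ht.
        l5 = next((i for i in range(l_ht - 1, -1, -1) if sc[i] == gap), None)
        l3 = next((i for i in range(l_ht + 1, len(sc)) if sc[i] == gap), None)
        if l5 is None and l3 is None:
            raise ValueError("no candidate gap left to absorb a template gap")
        if l5 is None:
            target = l3          # assumption: only a 3' gap is available
        elif l3 is None:
            target = l5          # assumption: only a 5' gap is available
        elif (l_ht - l5) > (l3 - l_ht):
            target = l3          # the 3' gap is closer
        else:
            target = l5          # the 5' gap is closer (or tied)
        del sc[target]           # delete Hc
        del st[l_ht]             # delete the template gap character
    return "".join(st), "".join(sc)
```

Each pass deletes one character from each sequence, so the two strings stay equal in length and the template returns to its original column count.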
[Flowchart: classifying the source of a GenBank record]
Input: GenBank record
Decision nodes (each with yes/no branches):
• Is sequence from whole genome record?
• Text glob (from “DEFINITION”, “source”, and “TITLE”) contains “clone” OR “uncultur”?
• Text glob contains “symbiont”?
• “Genus species” style name in DEFINITION or source>organism? (result: “Gs yes” / “Gs no”)
• Does a source>isolate field exist? Isolate tag present? (result: “Isolate tag yes” / “Isolate tag no”)
• Strain tag is present?
Outcomes:
• Record is from an isolate
• Record is from a clone
• Record is from a symbiont
• Record is from a isolate_str
• Record is from undecided
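Part of this decision flow can be sketched in code. This is a minimal sketch covering only the three branches whose outcome is legible in the flowchart (whole genome, "clone"/"uncultur", "symbiont"). The order of the checks is an assumption, and the "Genus species", isolate tag, and strain tag branches are left out because their routing cannot be recovered from the slide. The function name and arguments are hypothetical.

```python
from typing import Optional


def classify_source(text_glob: str, from_whole_genome: bool) -> Optional[str]:
    """Return 'isolate', 'clone', or 'symbiont' where the slide's rules
    give a direct answer, else None (remaining branches not modeled).

    text_glob is assumed to be the concatenated DEFINITION, source,
    and TITLE text of the GenBank record.
    """
    glob = text_glob.lower()
    if from_whole_genome:
        return "isolate"
    if "clone" in glob or "uncultur" in glob:
        return "clone"
    if "symbiont" in glob:
        return "symbiont"
    # Gs / isolate tag / strain tag logic would go here; not modeled.
    return None
```

Lower-casing the glob is a convenience choice so that "Uncultured" and "uncultured" match the same rule.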
greengenes.lbl.gov
Name generator
• NCBI annotations are non-standardized
– Determine if sequence is from an isolate, environmental amplicon/metagenome
– Concatenate useful terms
• Effort to guide future GenBank submitters in clear record descriptions – http://www.jgi.doe.gov/16s/
greengenes.lbl.gov
Chimera tracking
• Amplicons from complex gDNA can contain partial sequence from more than one genome.
• Up to 4% of sequences are deemed chimeric by Bellerophon2
– Flags are set to avoid using these questionable sequences in phylogeny assessments
greengenes.lbl.gov
Maintain Taxonomy
JGI taxonomy organized in ARB using maximum parsimony tree insertions.
Example: http://greengenes.lbl.gov/cgi-bin/User/show_one_record_v2.pl?prokMSA_id=82172
prokMSA_id: 82172 prokMSAname: termite gut clone Rs-050
GenBank ACCESSION: AB100461.1, GenBank GI: 28971862, RDP_id: S000122947, NCBI_tax_id: 203524, Study_id: 21358
G2_chip_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; Peptostreptococcaceae; sf_5; otu_2988
JGI_tax_string=Bacteria; Firmicutes (incl. basal lineag; Firmicutes; Peptostreptococcaceae; Mogibacterium
JGI_tax_string_format_2=Bacteria; Firmicutes (incl. basal lineag; Firmicutes; Peptostreptococcaceae; Mogibacterium; otu_415
Pace_tax_string=Bacteria; Firmicutes; Clostridium et al.; Peptostreptococcaceae; Clostridium acidiurici et al.; Clostridium difficile et al.; Clostridium aminobutyricum et
RDP_tax_string= Bacteria; Firmicutes; Clostridia; Clostridiales; unclassified_Clostridiales.
ncbi_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; environmental samples
greengenes.lbl.gov Tools
• BLAST
• SimRank
• Probe matcher
• Text search
• PCR primer design
• Private NAST aligner
greengenes.lbl.gov Compatible with ARB
• Entire database downloadable in ARB format.
• Can import new records into personal ARB database.
16S Sequence clustering
• Each sequence reduced to an array (list) of “probe-friendly” 25-mers which:
– Have high complexity
– Can be synthesized with 75 or fewer masks
– Adequate H-bond potential
  • G+C content over 48%
  • Or empirical bond stability found in test arrays
• Transitive clustering by fraction of 25-mers in common
– Cluster considered an Operational Taxonomic Unit (OTU)
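The clustering idea can be sketched as single-linkage (transitive) grouping over shared k-mers. This is a minimal sketch with several assumptions. Only the G+C filter is applied; the complexity and mask-count filters are omitted. The "fraction in common" is taken as the shared count divided by the smaller k-mer set. The 0.5 threshold is a placeholder, since the slides do not state the cutoff. All function names are hypothetical.

```python
from itertools import combinations


def probe_friendly_kmers(seq: str, k: int = 25, min_gc: float = 0.48) -> set:
    """All k-mers of seq whose G+C fraction exceeds min_gc."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {m for m in kmers
            if (m.count("G") + m.count("C")) / k > min_gc}


def cluster_by_shared_kmers(seqs: dict, k: int = 25,
                            min_shared: float = 0.5,
                            min_gc: float = 0.48) -> list:
    """Group sequence names transitively: A and C end up in the same
    cluster if A links to B and B links to C."""
    kmer_sets = {name: probe_friendly_kmers(s, k, min_gc)
                 for name, s in seqs.items()}
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        smaller = min(len(kmer_sets[a]), len(kmer_sets[b]))
        if smaller == 0:
            continue
        shared = len(kmer_sets[a] & kmer_sets[b]) / smaller
        if shared >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())
```

The all-pairs comparison is quadratic in the number of sequences, which is fine for illustration but would need an index over k-mers at the scale of tens of thousands of records.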
Extended Bergey’s Taxonomy
Bergey’s v0.9 with added nomenclature from Hugenholtz tree of environmental DNA
• Each OTU assigned to one of 455 families
• Families split into subfamilies where >15% sequence variation existed.
• Results: (considering both domains)
– 63 phyla
– 136 classes
– 262 orders
– 455 families
– 842 subfamilies (~94% identity)
– 8,989 OTUs (~99% identity)
– 30,627 sequences (each belong to only one OTU)
Bacteria; Proteobacteria; Deltaproteobacteria; Desulfovibrionales; Desulfovibrionaceae; sf_1; otu_10051
Desulfovibrio sp. str. DMB.
Desulfovibrio sp. 'Bendigo A'
Desulfovibrio vulgaris DSM 644
[Figure: Example of the Location of Probes Used for the Desulfovibrio vulgaris Probe Set. Legend: Regions unique to OTU; Regions not unique to OTU; Sequence discrepancies]
Probe Design
Locus Specific Prevalence Scoring
Example: proteobacteria
OTU composed of 26 sequences
22/22 25/25 20/25
Probe selection objectives for each OTU
• Find 11 or more 25-mers (targets)
– >90% prevalent in an OTU’s sequences
– dissimilar from sequences outside the OTU
– >48% G+C or empirically responsive
– >1 loci within 16S rDNA gene
• Presumed cross-hybridizing probes were those 25-mers that contained a central 17-mer matching sequences in more than one OTU (Urakawa, Stahl et al. 2002)
– avoiding probes that were unique solely due to a mismatch in one of the outer four bases.
• As each PM probe (Perfect Match to target) was chosen, it was paired with a control 25-mer (mismatching probe, MM), identical in all positions except the thirteenth base.
• The MM probe did not contain an internal 17-mer complementary to sequences in any OTU.
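The probe rules above can be illustrated with a few small helpers. This is a minimal sketch under stated assumptions. Targets and sequences are compared as same-strand substrings. The MM probe is built by swapping the thirteenth base for its complement, which is one possible substitution; the slide only says the thirteenth base differs. All function names are hypothetical.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def central_17mer(probe: str) -> str:
    """The 25-mer with its outer four bases removed from each end."""
    if len(probe) != 25:
        raise ValueError("expected a 25-mer")
    return probe[4:21]


def mismatch_probe(pm: str) -> str:
    """MM control: identical to the PM probe except at the 13th base
    (index 12), here replaced by its complement (an assumption)."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    return pm[:12] + _COMPLEMENT[pm[12]] + pm[13:]


def prevalence(probe: str, otu_seqs: list) -> float:
    """Fraction of an OTU's sequences that contain the 25-mer."""
    return sum(probe in s for s in otu_seqs) / len(otu_seqs)


def is_cross_hybridizing(probe: str, outside_seqs: list) -> bool:
    """True if the central 17-mer occurs in any sequence from outside
    the target OTU, regardless of the outer four bases on each side."""
    core = central_17mer(probe)
    return any(core in s for s in outside_seqs)
```

Checking only the central 17-mer is what rejects probes that look unique solely because of a mismatch in one of the outer four bases.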
Overview of Sample Preparation
• Extract Genomic DNA
• PCR Amplify DNA
• Fractionate DNA
• End-label with biotin
• Hybridize
[Figure: array features labeled 18 µ, with probe base labels AC, GG, TC, GA]
Image Capture and Data Reduction
• Over 500,000 data points
Columns: SUBGROUP code; description; one score per array file, in this order: 81_Los_Alamos.CEL, 81_Ultra_Soil.CEL, 84_Los_Alamos.CEL, 84_M_Miller.CEL, 90-Los_Alamos.CEL, 90-Power_Soil.CEL, Airport_1.CEL, Airport_2.CEL, Airport_6.CEL, Airport_7.CEL, Airport_A_1.CEL, Airport_A_2.CEL, Airport_B_1.CEL, Airport_B_2.CEL; COUNTIF pf==1
'022102 CHLOROPLASTS_AND_CYANELLES 1 1 1 1 1 1 1 1 1 1 1 1 1 1 14
'02280110 SPHINGOMONAS_GROUP 1 1 1 1 1 1 1 1 1 1 1 1 1 1 14
'0228030406 MCH.PURPURATUM_SUBGROUP 1 1 1 1 1 1 1 1 1 1 1 1 1 1 14
'021506 CY.AURANTIACA_GROUP 1 1 1 1 1 1 1 1 1 0.94 1 1 1 1 13
'0228010616 MSO.LOTI_SUBGROUP 1 1 1 1 1 1 1 1 1 0.94 1 1 1 1 13
'02280313 PSEUDOMONAS_AND_RELATIVES 0.96 1 1 1 1 1 1 1 1 1 1 1 1 1 13
'0230011302 CORYNEBACTERIUM_GROUP 1 1 0.94 1 1 1 1 1 1 1 1 1 1 1 13
'021304 ENVIRONMENTAL_CLONE_OPB5_GROUP 1 1 0.91 1 1 1 1 1 1 1 1 1 1 1 13
'02150402 FLX.SANCTI_SUBGROUP 1 1 1 1 1 1 1 1 1 0.94 1 1 1 1 13
'0228010608 BL.VIRIDIS_ASSEMBLAGE 1 1 1 1 1 1 1 1 1 0.91 1 1 1 1 13
'02280211 OXALOBACTER_GROUP 1 1 0.89 1 1 1 1 1 1 1 1 1 1 1 13
'02280308 XANTHOMONAS_GROUP 1 1 1 1 1 1 1 1 1 0.92 1 1 1 1 13
'0230010901 ARTHROBACTER_AND_RELATIVES 1 1 0.92 1 1 1 1 1 1 1 1 1 1 1 13
'02300110 PROPIONIBACTERIUM_GROUP 0.9 1 1 1 1 1 1 1 1 1 1 1 1 1 13
'0218 ENVIRONMENTAL_CLONE_WCHB1-31_GROUP 0.94 1 0.89 1 1 1 1 1 1 1 1 1 1 1 12
'02250306 ACBT.CAPSULATUM_GROUP 1 1 1 1 1 1 1 1 0.96 0.92 1 1 1 1 12
'0228010611 METHYLOBACTERIA_SUBGROUP 0.95 1 1 1 1 1 1 1 1 0.91 1 1 1 1 12
'022801061210 BDR.ELKANII_SUBGROUP 1 1 1 1 1 1 1 1 0.98 0.67 1 1 1 1 12
'022801061214 BLB.DENITRIFICANS_SUBGROUP 1 1 1 1 1 1 1 1 1 0.79 1 0.95 1 1 12
'022802090401 COM.TERRIGENA_SUBGROUP 0.95 1 0.7 1 1 1 1 1 1 1 1 1 1 1 12
'02300710 B.MEGATERIUM_GROUP 0.94 1 0.93 1 1 1 1 1 1 1 1 1 1 1 12
'02300901 C.LEPTUM_GROUP 0.89 1 0.94 1 1 1 1 1 1 1 1 1 1 1 12
'022801080102 PARACOCCUS_SUBGROUP 0.91 1 1 1 1 0.91 1 1 1 1 1 1 1 1 12
'0228040603 POL.CELLULOSUM_SUBGROUP 0.82 1 1 1 1 1 1 1 1 0.82 1 1 1 1 12
'02280108010101 ROS.DENITRIFICANS_SUBGROUP 0.95 1 1 1 1 0.95 1 1 1 0.95 1 1 1 1 11
'0228050301 AOB.CRYAEROPHILUS_SUBGROUP 0.95 1 0.95 1 1 1 1 1 1 1 1 0.95 1 1 11
'023001130101 MYB.TUBERCULOSIS_SUBGROUP 0.94 1 1 1 1 0.94 1 1 1 0.95 1 1 1 1 11
'0228010404 AZS.LIPOFERUM_SUBGROUP 0.96 1 0.96 1 1 1 1 1 1 0.93 1 1 1 1 11
'022801061201 AFIPIA.FELIS_SUBGROUP 1 1 1 1 1 1 1 1 0.98 0.89 1 0.95 1 1 11
'022801061204 NTB.WINOGRADSKYI_SUBGROUP 1 1 1 1 1 1 1 1 0.94 0.78 1 0.94 1 1 11
'022801061205 RPS.PALUSTRIS_SUBGROUP 1 1 1 1 1 1 1 1 0.97 0.87 1 0.96 1 1 11
'022801061208 BDR.LUPINI_SUBGROUP 1 1 1 1 1 1 1 1 0.96 0.71 1 0.96 1 1 11
'022801061212 BDR.LIAONINGENSIS_SUBGROUP 1 1 1 1 1 1 1 1 0.95 0.79 1 0.95 1 1 11
'0228020403 NSS.MULTIFORMIS_SUBGROUP 0.96 1 0.96 1 1 1 1 1 1 0.87 1 1 1 1 11
'021306 ENVIRONMENTAL_CLONE_III1-8_GROUP 0.93 1 1 1 1 1 1 1 0.97 0.85 1 1 1 1 11
'02200101 PIRELLULA_SCHLESNER_ISOLATES 0.92 1 0.92 1 1 1 1 1 1 0.94 1 1 1 1 11
'02280410 DESULFOBULBUS_ASSEMBLAGE 0.92 1 1 1 1 1 1 1 1 0.89 1 0.95 1 1 11
'02300711 B.SUBTILIS_GROUP 0.85 1 0.67 1 1 1 1 1 0.97 1 1 1 1 1 11
'021505 PERSICOBACTER_GROUP 1 1 1 1 1 1 1 1 0.86 0.82 1 1 1 0.92 11
'022804010401 DSV.HALOPHILUS_SUBGROUP 0.5 1 0.42 1 1 0.92 1 1 1 1 1 1 1 1 11
'0230011201 PSC.HALOPHOBICA_SUBGROUP 0.92 1 1 1 1 0.92 1 1 1 0.93 1 1 1 1 11
'0230040105 BTV.FIBRISOLVENS_SUBGROUP 0.68 1 0.78 1 1 0.84 1 1 1 1 1 1 1 1 11
'02280327 ENTERICS_AND_RELATIVES 0.91 1 0.86 1 1 1 1 1 0.99 0.98 1 1 1 1 10
'0230010602 A.FERROOXIDANS_SUBGROUP 0.96 1 0.96 1 1 0.94 1 1 1 0.91 1 1 1 1 10
'02250301 MOUNT_COOT-THA_ENVIRONMENTAL_CLONES_III 0.89 1 0.89 1 1 0.9 1 1 1 1 1 0.96 1 1 10
'0228010609 MSI.TRICHOSPORIUM_SUBGROUP 0.77 1 0.87 1 1 0.96 1 1 1 0.78 1 1 1 1 10
'0228020804 BRD.BRONCHISEPTICA_SUBGROUP 0.82 1 0.91 1 1 1 1 1 0.97 0.77 1 1 1 1 10
'02300111 MICROMONOSPORA_GROUP 0.83 1 0.83 1 1 0.83 1 1 1 0.94 1 1 1 1 10
'0230070903 B.ALCALOPHILUS_SUBGROUP 0.81 1 0.46 1 1 0.88 1 1 1 0.98 1 1 1 1 10
'0205 ENVIRONMENTAL_CLONE_OPB45_GROUP 0.73 1 0.91 1 1 1 1 1 0.91 0.6 1 1 1 1 10
'021312 ENVIRONMENTAL_CLONE_RB40_GROUP 0.82 1 0.81 1 1 0.8 1 1 1 0.74 1 1 1 1 10
'0215010204 CY.FERMENTANS_SUBGROUP 0.94 1 1 1 1 0.94 1 1 0.92 0.91 1 1 1 1 10
'02280108010105 OCT.ANTARCTICUS_SUBGROUP 0.71 1 0.75 1 1 0.73 1 1 1 0.64 1 1 1 1 10
'023001080110 THERMOPHILIC_STREPTOMYCES 0.75 1 0.8 1 1 0.88 1 1 1 0.88 1 1 1 1 10
'0230040103 EUB.SABURREUM_SUBGROUP 0.4 1 0.53 1 1 0.75 1 1 1 0.56 1 1 1 1 10
'02300713 B.SPHAERICUS_GROUP 0.86 1 0.71 1 1 0.92 1 1 1 0.93 1 1 1 1 10
'0230072109 STC.PNEUMONIAE_SUBGROUP 0.92 1 0.92 1 1 1 1 1 1 0.85 1 1 1 0.93 10
'02250102 ENVIRONMENTAL_CLONE_OCS307_GROUP 0.89 1 1 1 1 0.94 1 1 1 1 0.94 0.94 1 0.94 9
'0230040104 RUC.GNAVUS_SUBGROUP 0.9 1 0.95 1 1 0.96 1 1 0.96 0.95 1 1 1 1 9
'021305 ENVIRONMENTAL_CLONE_RB25_GROUP 0.93 1 0.95 1 1 1 0.95 1 0.96 0.83 1 1 1 1 9
• Scores for each of 9000 OTUs
Distribution of 16S rDNA Sequences Detected via Cloning or Microarray Analysis
• Clone Hits Only (8)
• Clone and Array Hits (73)
• Array Hits Only (97)
Confirmed by specific PCR and sequencing:
• Actinobacteria; Actinosynnemataceae; sf_1
• Nitrospira; Nitrospiraceae; sf_1
• Clostridia; Syntrophomonadaceae; sf_5
• Planctomycetes; Planctomycetaceae; sf_3
• Gammaproteobacteria; Pseudoalteromonadaceae; sf_1
• Acidobacteria; Ellin6075/11-25; sf_1
• Spirochaetes; Spirochaetaceae; sf_1
• Spirochaetes; Spirochaetaceae; sf_3
• Spirochaetes; Leptospiraceae; sf_3
Array is quantitative
[Figure: log2 Hyb Score (a.u.) vs. log2 Concentration (pM); r = 0.917]
Spike-in                   % G+C sequence   % G+C probes
Mycoplasma neurolyticum    50.0             45.4
Oenococcus oeni            50.9             50.8
Saprospira grandis         51.8             50.9
Fervidobacterium nodosum   58.2             53.8
Caulobacter vibrioides     56.4             58.5
Example query against meteorological data:
Does detection of Actinobacterium PENDANT-38 correlate with temperature?
r = 0.64, p = 0.026527 (adjusted for multiple testing)
[Figure: log(HybScore) vs. Temp. degC]
Uranium Bioremediation – is uranium re-oxidation under reducing conditions due to loss of metal reducers?
Real-time quantitative PCR confirmation of array monitoring.
Corrected Array Intensity:
Representative organism     Phylocode          Group               Area 2   Reduction   Oxidation
Geothrix fermentans         2.13.8.386         Acidobacteriaceae   45       2344        2290
Geobacter metallireducens   2.28.4.7.4.10207   Geobacteraceae      251      2238        2188
Geobacter arculus           2.28.4.7.4.10209   Geobacteraceae      38       1412        1698
[Figure panels: (a) Array quantitation; (b) qPCR quantitation. Species specific - Geothrix fermentans; Group specific - Geobacteraceae]
Real-time quantitative PCR confirmation – Urban Aerosol
Array hybridization signal correlates significantly with 16S copies in environmental aerosol DNA extract
Pseudomonas oleovorans example
Order Class Peak Duration (sec)
Phaeophyceae (phylum) Stramenopiles (no rank) 5
Basidiomycota (phylum) Fungi (kingdom) 45
Deferribacterales Cyanobacteria 450
Ascomycota (phylum) Fungi (kingdom) 450
Vibrionales Gammaproteobacteria 450
Flavobacteriales Flavobacteria 450
Clostridiales Clostridia 45
Rhizobiales Alphaproteobacteria 45
Rhodospirillales Alphaproteobacteria 45 n.s.
Lactobacillales Bacilli 45
Bacillales Bacilli 450
Mycoplasmatales Mollicutes 5
Xanthomonadales Gammaproteobacteria 5 n.s.
Burkholderiales Betaproteobacteria 0
Sphingomonadales Alphaproteobacteria 0
Sphingobacteriales Sphingobacteria 0
Acholeplasmatales Mollicutes 45
FEMS Letters - pseudoshift
Acknowledgements
• Phil Hugenholtz – Taxonomy, Arb Interface, Chimera
• Niels Larson – SimRank
• Igor Dubosarskiy – JSP
• Jordan Moberg – Microarrays, Cloning
• Yvette Piceno – Microarrays, Primer Design
• Ingrid Zubieta – PCR, Cloning
• Eoin Brodie – Microarrays, QPCR
• Gary Andersen – 16S Microarray Group Leader
...CGTAAAGCTCTGTCTTTGGGGAAGATAATGACGGTACCCAAGGAGGAAGCCACGGCTAACT... C. perf. str.CPN50
................................................................... C. perf. resistant
................................................................... Clostridium sp. AB&J
................................................................... clone p-4636-2Wa2
................................................................... C. perf. A
................................................................... C. perf rrnA
................................................................... C. perf rrnE
.................................T................................. C. perf rrnD
................................................................... C. perf rrnC
................................................................... C. perf rrnB
................................................................... C. perf rrnF
................................................................... C. perf rrnG
................................................................... C. perf str.13a
................................................................... C. perf str.13b
................................................................... C. perf rrnH
................................................................... C. perf rrnI
................................................................... C. perf rrnJ
................................................................... clone OI1612
................................................................... C. perf. B
................................................................... Swine manure 37-3
................................................................... Swine manure 37-4
TAAAGCTCTGTCTTTGGGGAAGATA tacccaaggaggaagccacggctaa AAAGCTCTGTCTTTGGGGAAGATAA AAGCTCTGTCTTTGGGGAAGATAAT AGCTCTGTCTTTGGGGAAGATAATG
Bacteria
CFB
Cyan
Proteo
Gram +
High G+C
Bacil-Strep
Clostridium
C.BOTULINUM_SUBGROUP
C.THERMOBUTYRICUM_SUBGROUP
C.BARATI_SUBGROUP
C.CADAVERIS
C.ALGIDICARNIS
C.PERFRINGENS
C.AURANTIBUTYRICUM
C. BUTYRICUM
16S rDNA
Probe Properties:
• 25-mer exists in 90% of the taxon’s seqs
• Internal 21-mer exists only in one taxon.
[Figure: Probes 5 - 8 on 16S rDNA; position labels 27, 1492, 420, 469; Ave Diff = 1891]
C. perfringens probe set identified in EPA sample 22 (N.Y. Spring)