Microbial Bioinformatics
Keith A. Crandall, PhD, FAAS, FLS
Director, Computational Biology Institute
Director, GW Genomics Core
Co-Director, Informatics, Clinical and Translational Science Institute CN
Co-Director, Institute for Biomedical Sciences Genomics and Bioinformatics Program
Professor, Department of Biostatistics and Bioinformatics, GWSPH
Professor, Department of Biological Sciences, CCAS
Research Associate, Department of Invertebrate Zoology, US National Museum of Natural History, Smithsonian Institution
16S rRNA Sequencing Timeline
Microbial NGS Amplicon (targeted) sequencing
• Gold standard for bacteria and archaea (16S rRNA): variable (loops) and conserved (stems) regions
• Fungi (ITS)
• Protozoa (18S rRNA)
Microbial NGS Amplicon (targeted) sequencing
16S rRNA
Microbial NGS
From microbial taxonomic profiles to biological questions
[Figure: taxonomic profiles at the phylum and genus level for samples S1, S2, S3]
Phyla            Sample1  Sample2  Sample3
Actinobacteria      18.8      7.9      8.9
Firmicutes          44.8     21.4     38.3
Fusobacteria         3.4      2.2      4.8
Proteobacteria      28.2     67.1     44.1
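One simple question to ask of a phylum-level profile is which phylum dominates each sample. The sketch below uses the values from this slide; the helper name dominant_phylum is made up for illustration and is not part of any tool mentioned in the talk.

```python
# Phylum-level abundances copied from the slide's table.
profiles = {
    "Sample1": {"Actinobacteria": 18.8, "Firmicutes": 44.8,
                "Fusobacteria": 3.4, "Proteobacteria": 28.2},
    "Sample2": {"Actinobacteria": 7.9, "Firmicutes": 21.4,
                "Fusobacteria": 2.2, "Proteobacteria": 67.1},
    "Sample3": {"Actinobacteria": 8.9, "Firmicutes": 38.3,
                "Fusobacteria": 4.8, "Proteobacteria": 44.1},
}


def dominant_phylum(profile):
    """Return the (phylum, abundance) pair with the highest abundance."""
    return max(profile.items(), key=lambda item: item[1])


# Hypothetical helper applied to every sample.
dominant = {sample: dominant_phylum(p) for sample, p in profiles.items()}

for sample, (phylum, value) in dominant.items():
    print(f"{sample}: {phylum} ({value})")
```

With these numbers, Firmicutes is the largest phylum in Sample1, while Proteobacteria is the largest in Sample2 and Sample3.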
Microbiome Analyses - Metagenomics
16S - Metataxonomy
16S – Advantages vs Disadvantages?
● Advantages
○ Cost
○ Samples
○ Ease of analysis
○ Reference databases
○ PCR-based -> needs less starting DNA template
● Disadvantages
○ Only a single locus
○ No functional information
○ Often not discriminatory at the species level – or even genus level
○ No strain differentiation
○ No pathogenicity inferences
○ No drug resistance inferences
16S - Cost
Approach
What does an Illumina library need to look like?
5’ p5 | Index2 | Rd1 seq primer | 16S amplicon insert | Rd2 seq primer | Index1 | p7 3’
3’ (complementary strand) 5’
Making amplicon libraries: 2-step PCR edition
[Diagram: both strands of the 16S gene (5’ to 3’ and 3’ to 5’), with gene-specific primers carrying the Rd1 primer overhang and the Rd2 primer overhang]
*DNA is synthesized in the 5’ to 3’ direction
Making amplicon libraries: 2-step PCR edition
[Diagram: product of the first PCR, the PCR amplicon flanked by the Rd1 primer overhang and the Rd2 primer overhang]
*DNA is synthesized in the 5’ to 3’ direction
Making amplicon libraries: 2-step PCR edition
[Diagram: second PCR, with primers carrying p5 + Index2 and Index1 + p7 annealing to the Rd1 and Rd2 primer overhangs of the PCR amplicon]
*DNA is synthesized in the 5’ to 3’ direction
Final library:
5’ p5 | Index2 | Rd1 seq primer | 16S amplicon insert | Rd2 seq primer | Index1 | p7 3’
3’ (complementary strand) 5’
Making amplicon libraries: 1-step PCR edition
[Diagram: both strands of the 16S gene, with primers carrying p5 + Index2 + misc. seqs and misc. seqs + Index1 + p7]
*DNA is synthesized in the 5’ to 3’ direction
Final library:
5’ p5 | Index2 | Misc. seqs | 16S amplicon insert | Misc. seqs | Index1 | p7 3’
3’ (complementary strand) 5’
Misc. seq + gene-specific primer region used as custom sequencing primer
One step PCR Primer Structure
SB501 - Forward primer option
AATGATACGGCGACCACCGAGATCTACACCTACTATATATGGTAATTGTGTGCCAGCMGCCGCGGTAA
Adapter - Allows binding to the flow cell
SB501 - Barcoded Primer - Different for every primer
Pad - Boost the primer melting temperature
Link - Anticomplementary to known sequences
V4f - 16S V4 region forward primer
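To make the five parts of the primer concrete, here is a small sketch that splits the SB501 sequence into Adapter, barcode, Pad, Link and V4f. The slide gives only the full sequence and the names of the parts; the exact segment boundaries below are an assumption (my reading of the sequence), so treat them as illustrative. The code checks that the pieces tile the printed primer, then expands the one degenerate base (IUPAC M means A or C).

```python
# Full SB501 forward primer exactly as printed on the slide.
PRIMER = "AATGATACGGCGACCACCGAGATCTACACCTACTATATATGGTAATTGTGTGCCAGCMGCCGCGGTAA"

# ASSUMPTION: segment boundaries are inferred, not stated on the slide.
SEGMENTS = [
    ("Adapter", "AATGATACGGCGACCACCGAGATCTACAC"),
    ("SB501 barcode", "CTACTATA"),
    ("Pad", "TATGGTAATT"),
    ("Link", "GT"),
    ("V4f", "GTGCCAGCMGCCGCGGTAA"),
]

# The assumed pieces must concatenate back to the printed primer.
assert "".join(seq for _, seq in SEGMENTS) == PRIMER


def expand_m(seq):
    """Expand each IUPAC 'M' into A and C, returning every concrete sequence."""
    variants = [""]
    for base in seq:
        choices = "AC" if base == "M" else base
        variants = [v + c for v in variants for c in choices]
    return variants


# The degenerate primer is really a mix of these concrete oligos.
concrete_primers = expand_m(PRIMER)

for name, seq in SEGMENTS:
    print(f"{name:14s} {len(seq):2d} nt  {seq}")
```

Only the barcode segment would change from one primer to the next; the other segments stay the same across the primer set.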
How many PCR steps?
One-step PCR
● PROS
○ Fewer steps
○ Less optimization
○ Less possibility for contamination
● CONS
○ Fewer options for optimization
○ Less sensitive
○ Expensive/less stable primers
Two-step PCR
● PROS
○ Well-established
○ Highly sensitive
○ Cheaper primers
● CONS
○ Possibility of amplicon contamination
○ Higher possibility for user error/contamination
○ More steps
○ More optimization
Don’t Trust Your Data
Tools & Databases
● Mothur (mothur.org) – full 16S analysis suite
● QIIME (qiime.org) – full 16S analysis suite
● MG-RAST server (metagenomics.anl.gov) – 16S and WGS
● PathoScope (GitHub) – 16S and WGS
● CloVR (clovr.org) – 16S and WGS
● Animalcules (R Shiny) – downstream hypothesis testing
● DADA2 – 16S analysis suite, etc.
● Ribosomal Database Project (RDP)
● GreenGenes
● SILVA (arb-silva.de)
Basic Analysis Steps
● Remove all those adapters you put on for sequencing!
● Remove unwanted reads and sequencing and PCR error
○ Read length, error score (remember fastq!)
● Assemble paired ends to make a contig
● Map contigs against a reference library
● Call taxa
● Characterize Diversity (alpha
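The "read length, error score" step can be sketched in a few lines. This is a toy filter, not what any of the tools above actually run: it assumes 4-line FASTQ records with Phred+33 quality encoding, and the thresholds (min_len, min_mean_q) and read names are made up for the example.

```python
def parse_fastq(text):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header, seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def passes_qc(seq, qual, min_len=8, min_mean_q=20):
    """Keep reads that are long enough and have a high enough mean quality."""
    return len(seq) >= min_len and mean_phred(qual) >= min_mean_q


# HYPOTHETICAL reads: one good, one too short, one low quality.
EXAMPLE = """\
@read_good
ACGTACGTAC
+
IIIIIIIIII
@read_short
ACGT
+
IIII
@read_lowq
ACGTACGTAC
+
##########
"""

kept = [h for h, s, q in parse_fastq(EXAMPLE) if passes_qc(s, q)]
print(kept)
```

In the quality string, 'I' decodes to Phred 40 and '#' to Phred 2, which is why only the first read survives both checks.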
QIIME2 Workflow
From 16S rRNA fastq files to table of microbial abundance and taxonomy
#ASV ID  sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8  sample9 sample10  taxonomy
ASV1       23408     7345       38     1947     1066    82761     2679     1681     1135     1650  Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia/Shigella
ASV2         149      174    21237     2619     2344       58       61       26     2232       60  Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Klebsiella
ASV3          68      141        0        0        7        0        0        0       28       18  Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Proteus
ASV4       11829    14760     1586       27       26     2084       41     1314      993      103  Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
ASV5        1395        0      551     2895     1010     1259      191       39      176     2003  Firmicutes; Bacilli; Lactobacillales; Aerococcaceae; Aerococcus
ASV6           0      218        0        0        0        0        0        0      104        0  Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Klebsiella
ASV7         353       39       12       58       12       22       37        0       30       17  Firmicutes; Bacilli; Lactobacillales; Enterococcaceae; Enterococcus
ASV8           0        0     2625    13431    55640       67       13       19     2414      502  Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
ASV9           0        0        0     5537     2332       25       18       20       19     1133  Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Pluralibacter
ASV10       3984        0      128     1538      341      297       94       12       54     1170  Actinobacteria; Actinobacteria; Actinomycetales; Actinomycetaceae; Actinotignum
ASV11         74     7268        0        0        0        0        0        0      129        0  Firmicutes; Bacilli; Lactobacillales; Lactobacillaceae; Lactobacillus
ASV12         56       63       29       23       91       12       38        0      512      648  Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus
ASV13          0        0        0        0        0        0        0        0        0        0  Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Citrobacter
ASV14          0        0        0       17        0        8        7        0       46        0  Firmicutes; Bacilli; Lactobacillales; Lactobacillaceae; Lactobacillus
ASV15        403        0      133      721      288      278        0        0        0      323  Firmicutes; Bacilli; Lactobacillales; Aerococcaceae; Aerococcus
ASV16        409        0       20      101        0       50        0        0       52      445  Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
ASV17        374       17        0        0        0       28       17        0      114       16  Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Lactococcus
ASV18          0        0        0        0        0       48        0      507        0        0  Bacteroidetes; Bacteroidia; Bacteroidales; Prevotellaceae; Prevotella_7
ASV19          0        0        0        0        0        0        0        0        0        0  Actinobacteria; Actinobacteria; Bifidobacteriales; Bifidobacteriaceae; Gardnerella
ASV20         50        0        0       22        0        0       69        0      183       26  Firmicutes; Negativicutes; Selenomonadales; Veillonellaceae; Veillonella
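Once you have a count table like this, a typical next step is to turn counts into relative abundances and summarize within-sample (alpha) diversity. The sketch below does that for the sample1 and sample2 columns. The choice of the Shannon index and of observed richness is mine, as two common alpha-diversity measures; the talk does not say which measures it uses.

```python
import math

# sample1 and sample2 counts for ASV1..ASV20, copied from the table.
counts = {
    "sample1": [23408, 149, 68, 11829, 1395, 0, 353, 0, 0, 3984,
                74, 56, 0, 0, 403, 409, 374, 0, 0, 50],
    "sample2": [7345, 174, 141, 14760, 0, 218, 39, 0, 0, 0,
                7268, 63, 0, 0, 0, 0, 17, 0, 0, 0],
}


def relative_abundance(sample_counts):
    """Convert raw counts to fractions that sum to 1."""
    total = sum(sample_counts)
    return [c / total for c in sample_counts]


def shannon(sample_counts):
    """Shannon index H = -sum(p * ln p) over ASVs with non-zero counts."""
    return -sum(p * math.log(p)
                for p in relative_abundance(sample_counts) if p > 0)


def richness(sample_counts):
    """Observed richness: number of ASVs with at least one read."""
    return sum(1 for c in sample_counts if c > 0)


for sample, values in counts.items():
    print(sample, richness(values), round(shannon(values), 3))
```

Raw counts are not directly comparable between samples with different sequencing depth, which is one reason to work with relative abundances (or to rarefy) before comparing diversity.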
Power Considerations for Experimental Design
Experimental Considerations – Sample Storage
Experimental Considerations – Extraction Method
Bias From Analysis Approaches
● OTUs vs ASVs (operational taxonomic units, amplicon sequence variants)
● Bioinformatics pipeline
● Reference database
● Lots to worry about!
Operational Taxonomic Units
● Why no species?
● Same 16S, different genomes
● Same species, different 16S
● OTUs are clusters of sequences that are within a small x% genetic distance from one another (typically 3%)
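To see what "within 3% of one another" means in practice, here is a toy greedy clustering at 97% identity. It is a deliberate simplification and not mothur's algorithm: it assumes the sequences are already aligned and of equal length, uses plain position-by-position identity, and compares each sequence only to the first member (centroid) of each OTU. The sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy example expects pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Put each sequence in the first OTU whose centroid is >= threshold identical."""
    otus = []  # list of (centroid, members)
    for seq in seqs:
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            # No existing OTU is close enough, so this sequence starts a new one.
            otus.append((seq, [seq]))
    return otus


# HYPOTHETICAL 100-nt sequences.
base = "ACGT" * 25                 # reference-like sequence
one_diff = "T" + base[1:]          # 1 mismatch  -> 99% identical to base
five_diff = "CATGC" + base[5:]     # 5 mismatches -> 95% identical to base

otus = greedy_otus([base, one_diff, five_diff])
print(len(otus), [len(members) for _, members in otus])
```

The first two sequences fall into one OTU and the third starts its own, so two different sequences end up sharing a single label. That loss of resolution is part of the "Why no species?" point.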
mothur
● QC
● Cluster sequences with 97% identity
● Form OTUs
● Classify OTUs
● Taxonomy table output
How do you classify reads?
● Align to a reference database
● SILVA is the most popular and has collected data for over 20 years
● >600 million sequences
DADA2 Pipeline - ASVs
● More taxonomic resolution
● ASVs are consistent
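One way to read "ASVs are consistent": an ASV is an exact sequence, so the sequence itself works as a label that means the same thing in every run, whereas an OTU label depends on the dataset it was clustered from. The sketch below merges two runs on that basis. The sequences, counts and the function name merge_asv_tables are all hypothetical, and this is not DADA2 code.

```python
# HYPOTHETICAL denoised ASV counts from two independent sequencing runs.
run_a = {"ACGTTGCA": 120, "ACGTTGCC": 30}
run_b = {"ACGTTGCA": 80, "TTGACGTA": 15}


def merge_asv_tables(*runs):
    """Combine per-run ASV counts, using the exact sequence as the shared key."""
    sequences = sorted(set().union(*runs))
    return {seq: [run.get(seq, 0) for run in runs] for seq in sequences}


merged = merge_asv_tables(run_a, run_b)

for seq, per_run in merged.items():
    print(seq, per_run)
```

No re-clustering is needed to combine the runs: the ASV seen in both simply gets a count in each column.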
Callahan et al. Nature Methods 2016
DADA2 will model sequencing error!
Resolution and Accuracy
Abundance predictions in DADA2 (ASV) are more accurate than with mothur (OTUs)
Summary
● 16S data are informative for a diversity of questions in microbiome research
● They have an extreme cost advantage for analyzing large numbers of samples
● One needs to take care in sample collection, storage, DNA extraction, PCR, data analyses, and reference databases to obtain accurate and replicable results
● There is a wide variety of tools available for QC and taxonomic assignment of 16S data. Then one needs to move to R for further statistical analyses.
Tutorials!!
● QIIME2
● mothur
● DADA2