metagenomics 2015 module6 - bioinformatics · 2018-11-21 · 150624 4 module!6!! bioinformatics.ca...
Post on 12-Jul-2020
2 Views
Preview:
TRANSCRIPT
15-‐06-‐24
1
Canadian Bioinforma3cs Workshops
www.bioinforma3cs.ca
2 Module #: Title of Module
15-‐06-‐24
2
Module 6 Microbiome biomarker discovery John Parkinson on behalf of Thea Van Rossum, Anamaria
Crisan, Mike Peabody & Fiona Brinkman
What are biomarkers and how do we find them?
bi·∙o·∙mark·∙er /ˈbīōˌmärkər/
Func6onal biomarkers
Taxonomic biomarkers
Biological func3ons that may be specific to a single organism or shared among mul3ple organisms
Can be a specific species, but most oWen is a category of organisms – also called Opera3onal Taxonomic Unit (OTU).
Measureable biological property that can be indica3ve of some phenomena, such as an infec3on, disease, or environmental disturbance
Anamaria Crisan 4
15-‐06-‐24
3
Module 6 bioinformatics.ca
DISCOVERY
VALIDATION
o Bioinforma6cs So?ware -‐ Takes the raw digi3zed genomic data and performs QC and biomarker quan3fica3on
o Use math – applied sta3s3cal methods can help us find useful biomarkers
o Design Primers – biological “hooks” that pick out our biomarker of interest from a sample
o qPCR – measures how many 3mes primers (hooks) manage to snag our biomarker of interest
What are biomarkers and how do we find them?
Anamaria Crisan
Module 6 bioinformatics.ca
1
2
3
4
5
6
Extract & Sequence DNA
Iden3fy Poten3al Biomarkers ( OTUs or Func3onal Genes )
Validate Biomarker Abundance
Select Useful Biomarkers
Plan Analysis: “bio”? “mark”?
Obtain Biological Samples
What are biomarkers and how do we find them?
Anamaria Crisan, Thea Van Rossum
15-‐06-‐24
4
Module 6 bioinformatics.ca
A) Bacteria
B) Viruses
o Shotgun or 16S amplicon o Best studied, most methods developed
o Shotgun or amplicon (RdRp, g23) o Can be challenging to get enough DNA o Host-‐specificity and popula3on “bursts” hold promise
What will put the “bio” in biomarker?
C) Eukaryotes
o Amplicon (18S, ITS) (large genomes make shotgun difficult) o Best studied, most methods developed
Combina3ons!
Thea Van Rossum
Module 6 bioinformatics.ca
A) TAXONOMIC
B) GENE-‐BASED
o Can use amplicon or shotgun data o Strain-‐level diversity can lead to false posi3ves/nega3ves o More variable across environments (for beser or worse)
o Requires shotgun data (DNA or RNA) o Need good sequencing depth to reach specialised genes o Domain-‐based gene architecture can be tricky
What kind of biomarker do we want?
C) OTHER? o Diversity metrics, using microbiome analysis to suggest other
metabolic markers, etc
Combina3ons!
Thea Van Rossum, Fiona Brinkman
15-‐06-‐24
5
What is biomarker selec6on? The process for removing non-‐informa3ve or redundant OTUs from an analysis.
Anamaria Crisan, Fiona Brinkman
Module 6 bioinformatics.ca Anamaria Crisan
Sample freq
uency
Sample freq
uency
Abundance Abundance
OTU1 OTU2
15-‐06-‐24
6
Module 6 bioinformatics.ca
What makes a good biomarker?
In this example we want to find biomarkers that separate between the red and blue class labels.
Anamaria Crisan
Module 6 bioinformatics.ca
How can biomarkers add value to current testing procedures? How can we make biomarkers easily accessible to those that routinely test water quality?
1
2
What makes a good biomarker?
Anamaria Crisan
15-‐06-‐24
7
Module 6 bioinformatics.ca
Biomarker selec3on relies on sta3s3cal techniques • These techniques can range from the very simple (like a t-‐test) to more
complex • You can write your own sta6s6cal analysis (using programs like R, Matlab,
Python, STATA, SAS etc.) • OR you can use more complex methods developed by others:
• Popular metagenomics / microbiome methods include: • LEfSE • metagenomeSeq
• Whatever you choose to use it’s important to understand the sta6s6cal methods especially:
• Your assump3ons about the data • The sta3s3cal method’s assump3ons about the data • The sta3s3cal method’s limita3ons • How to interpret your results from the output of the sta3s3cal method
Anamaria Crisan
Module 6 bioinformatics.ca
Biological Data is difficult
• Most biological data is correlated and, especially with microbial community data, quite sparse
• Sta3s3cal methods vary with respect to how correla3on and data sparseness are taken into account.
• So how do others deal with the problem? – They ignore correla3on and use only parametric methods (most common
approach); use a hard filter to look at only abundant OTUs – Use more complex approaches that take into account correla3on and
sparseness • BUT it can be more computa3onally intensive and 3me consuming to use these
methods • AND it can be difficult to understand and interpret what the method is doing
Anamaria Crisan
15-‐06-‐24
8
Module 6 bioinformatics.ca
So you might be wondering … what are these mysterious “statistical methods”?
Module 6 bioinformatics.ca
Generally, statistical techniques either try to predict labels or continuous values
Anamaria Crisan
15-‐06-‐24
9
Module 6 bioinformatics.ca
And…they also generally belong to one of two categories
(I lied a lisle : Semi-‐supervised methods also exist) Anamaria Crisan
Module 6 bioinformatics.ca Anamaria Crisan
15-‐06-‐24
10
Module 6 bioinformatics.ca 19 Anamaria Crisan
Module 6 bioinformatics.ca
Note, the same techniques are called different things in different fields
Anamaria Crisan
15-‐06-‐24
11
Module 6 bioinformatics.ca 21
The best approach to use depends on your research question
Module 6 bioinformatics.ca
Now that we have biomarkers, we should validate them.
15-‐06-‐24
12
Module 6 bioinformatics.ca
How do we validate our biomarkers?
• Once you ID the gene or taxonomic group you want to use as a biomarker, you need to design a test for it (q)PCR is a good op3on: 1. Iden3fy biomarker specific sequence – If used marker-‐based tool to iden3fy biomarker, you’re all set (e.g.
MetaPhlAn2) – Otherwise, can cluster reads or align them to find conserved sequences
– need to verify that “representa3ve” sequence is s3ll a good biomarker
2. Design primers (& probe) – PrimerProspector: designs primers from a sequence alignment – PrimerBLAST: designs primers specific to a clade Thea Van Rossum
Module 6 bioinformatics.ca
Case Study – Categorical Biomarker from Bacterial Shotgun Data
Example of fast track to marker iden3fica3on and PCR test development – IDing markers of water quality (see www.watersheddiscovery.ca for project)
• Bacterial shotgun data using Illumina HiSeq plazorm • Comparing riverwater microbiomes at an agricultural unimpacted site
(AUP) versus two impacted downstream sites (“at pollu3on” (APL) & “downstream” (ADS))
• Using MetaPhlAn
– Low sensi3vity, high precision – Based on clade-‐specific gene sequences – 3000 reference genomes – Fast: 3 million reads (100-‐150 bp) in 10 minutes
Thea Van Rossum, Fiona Brinkman
15-‐06-‐24
13
Quality trimmed and normalised data across samples MOCK COMMUNITY VALIDATION
Validated with mock community: DNA-free water spiked with DNA from lab-cultured bacteria 7% of reads were assigned to a species (low sensitivity) Of those, 84% were correctly assigned
Thea Van Rossum
Module 6 bioinformatics.ca
2
Taxonomic Classifica6ons – Rural Site • Priori3sed high abundance taxa • Use White’s non-‐parametric t-‐test with false discovery
rate mul3ple test correc3on to find differen3ally abundant taxa (alterna3ve: RandomForests)
26
Upstream At Site & Downstream
Taxon 1 Taxon 2
Top 50 m
ost “abu
ndant”
taxa
Thea Van Rossum
15-‐06-‐24
14
Module 6 bioinformatics.ca
3. Iden6fied sequences characteris6c of taxa
• 57,016 reads assigned to Taxon 1 • 2,176 reads assigned to Taxon 2 • Priori3sed taxon 1 due to our research indica3ng high
abundance taxa are most accurately predicted
• Extracted taxon-‐specific sequences from Metaphlan database – 607 sequences for Taxon 1
• Aligned reads against these sequences • Chose regions of Metaphlan sequences with most hits
Thea Van Rossum
Metaphlan marker sequence
Consensus sequence
Highest Coverage = candidate marker sequence
Thea Van Rossum, Fiona Brinkman
15-‐06-‐24
15
4. Designed qPCR primers and probes from marker sequence
• Used Primer3 for primer and probe design
• Checked rela3ve occurrence rates of candidate primers and probes
– Considered matches that are exact or have 1-‐2 mismatches
– Chose sequences that minimize non-‐specific matches
• Confirmed we can amplify a product of the right size
Upstream At site & downstream
Forward primer Probe Reverse primer
Next: itera3vely test & improve qPCR Thea Van Rossum
Module 6 bioinformatics.ca
“Fast track” Case Study Summary • Ini3al analysis iden3fied a PCR marker based on
differen3al abundance of a bacterial species that is being used to pilot our itera3ve valida3on process
• Benefits of this approach – Fast: sequence data to PCR primers in a couple days – Simple: does not require large amounts of processing power
• Limita3ons of this approach – Depends on differen3al abundance of known bacteria (if the differen3al bacteria aren’t highly similar to those in the Metaphlan database, this approach will not work)
– Depends on taxa, which have been shown to be more variable across environments than (gene) func3ons
• This is a “low hanging fruit” approach; a good first step Thea Van Rossum, Fiona Brinkman
15-‐06-‐24
16
Module 6 bioinformatics.ca
Bacterial Shotgun – alterna6ve methods
Get predicted proteins Cluster Find differen3al
clusters Design PCR
• More complete taxonomic analysis – Kraken, Discribinate
• Gene func3on analysis – MEGAN4 (SEED, KEGG) – Real3meMetagenomics (FIGFams)
• Cluster based analysis:
Thea Van Rossum
Module 6 bioinformatics.ca
From Biomarker to Lab Test
Iden3fy discrimina3ve taxa or func3ons
Design primers (Primer-‐BLAST or IDT Real3me PCR Tool)
Validate primers in silico (Primer Prospector,
Primer-‐BLAST)
Validate primers in vitro (qPCR)
Iden3fy informa3ve region for primer design (CD-‐HIT)
Primer 1 Primer 2
Mike Peabody
15-‐06-‐24
17
Module 6 bioinformatics.ca
Remember other markers… Community diversity could be a useful indicator of ecosystem health Microbiome analysis may suggest other types of screening tests… (metabolites, etc) Markers are only as good as the data they are based on, so design experiments carefully, include +ve and -ve controls
1
Anamaria Crisan, Fiona Brinkman
Module 6 bioinformatics.ca
We are on a Coffee Break & Networking Session
top related