metagenomics 2015 module6 - bioinformatics · 2018-11-21 · 150624 4 module!6!! bioinformatics.ca...

15-06-24

Canadian Bioinformatics Workshops

www.bioinformatics.ca


Module 6: Microbiome biomarker discovery

John Parkinson on behalf of Thea Van Rossum, Anamaria Crisan, Mike Peabody & Fiona Brinkman

What are biomarkers and how do we find them?

bi·o·mark·er /ˈbīōˌmärkər/

Measurable biological property that can be indicative of some phenomenon, such as an infection, disease, or environmental disturbance

Functional biomarkers
Biological functions that may be specific to a single organism or shared among multiple organisms

Taxonomic biomarkers
Can be a specific species, but most often is a category of organisms, also called an Operational Taxonomic Unit (OTU).

Anamaria Crisan

DISCOVERY

o  Bioinformatics software: takes the raw digitized genomic data and performs QC and biomarker quantification
o  Use math: applied statistical methods can help us find useful biomarkers

VALIDATION

o  Design primers: biological “hooks” that pick out our biomarker of interest from a sample
o  qPCR: measures how many times primers (hooks) manage to snag our biomarker of interest

Anamaria Crisan

1.  Plan Analysis: “bio”? “mark”?
2.  Obtain Biological Samples
3.  Extract & Sequence DNA
4.  Identify Potential Biomarkers (OTUs or Functional Genes)
5.  Select Useful Biomarkers
6.  Validate Biomarker Abundance

Anamaria Crisan, Thea Van Rossum


What will put the “bio” in biomarker?

A) Bacteria
o  Shotgun or 16S amplicon
o  Best studied, most methods developed

B) Viruses
o  Shotgun or amplicon (RdRp, g23)
o  Can be challenging to get enough DNA
o  Host-specificity and population “bursts” hold promise

C) Eukaryotes
o  Amplicon (18S, ITS) (large genomes make shotgun difficult)
o  Best studied, most methods developed

Combinations!

Thea Van Rossum

What kind of biomarker do we want?

A) TAXONOMIC
o  Can use amplicon or shotgun data
o  Strain-level diversity can lead to false positives/negatives
o  More variable across environments (for better or worse)

B) GENE-BASED
o  Requires shotgun data (DNA or RNA)
o  Need good sequencing depth to reach specialised genes
o  Domain-based gene architecture can be tricky

C) OTHER?
o  Diversity metrics, using microbiome analysis to suggest other metabolic markers, etc

Combinations!

Thea Van Rossum, Fiona Brinkman


What is biomarker selection?

The process for removing non-informative or redundant OTUs from an analysis.

Anamaria Crisan, Fiona Brinkman

[Figure: sample frequency plotted against abundance for OTU1 and OTU2]

Anamaria Crisan

What makes a good biomarker?

In this example we want to find biomarkers that separate the samples labelled red from those labelled blue.

Anamaria Crisan

How can biomarkers add value to current testing procedures? How can we make biomarkers easily accessible to those that routinely test water quality?

What makes a good biomarker?

Anamaria Crisan


Biomarker selection relies on statistical techniques

•  These techniques can range from the very simple (like a t-test) to more complex
•  You can write your own statistical analysis (using programs like R, Matlab, Python, STATA, SAS etc.)
•  OR you can use more complex methods developed by others:
   •  Popular metagenomics / microbiome methods include:
      •  LEfSe
      •  metagenomeSeq
•  Whatever you choose to use, it’s important to understand the statistical methods, especially:
   •  Your assumptions about the data
   •  The statistical method’s assumptions about the data
   •  The statistical method’s limitations
   •  How to interpret your results from the output of the statistical method
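As an illustration of the simple end of that range, here is a minimal Python sketch of a two-sample t statistic (Welch's form, which does not assume equal variances) for one OTU. The function name `welch_t` and the abundance values are hypothetical and not taken from the workshop. Turning the statistic into a p-value still needs a t distribution (or a permutation scheme), and the test assumes roughly normal data, which sparse microbiome counts often violate.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(group_a) / na
    se2_b = variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical relative abundances of one OTU in "red" and "blue" samples.
red = [0.10, 0.12, 0.11, 0.13]
blue = [0.02, 0.03, 0.01, 0.02]
t_stat, dof = welch_t(red, blue)  # roughly t = 12.4 with about 5.1 degrees of freedom
```

A large absolute t value suggests the OTU separates the two classes, but with many OTUs tested at once a multiple-test correction is still needed.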

Anamaria Crisan

Biological Data is difficult

•  Most biological data is correlated and, especially with microbial community data, quite sparse
•  Statistical methods vary with respect to how correlation and data sparseness are taken into account.
•  So how do others deal with the problem?
   –  They ignore correlation and use only parametric methods (most common approach); use a hard filter to look at only abundant OTUs
   –  Use more complex approaches that take into account correlation and sparseness
      •  BUT it can be more computationally intensive and time consuming to use these methods
      •  AND it can be difficult to understand and interpret what the method is doing
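The "hard filter" mentioned above can be sketched in a few lines of Python. This is only one possible definition: the function name `filter_otus`, the 1% relative-abundance cutoff, the 50% prevalence cutoff and the counts are all assumptions chosen for illustration, not values from the workshop.

```python
def filter_otus(counts, min_rel_abundance=0.01, min_prevalence=0.5):
    """Keep OTUs that reach min_rel_abundance in at least min_prevalence of samples.

    counts maps an OTU name to a list of read counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Total reads per sample, used to convert counts to relative abundance.
    totals = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    kept = {}
    for otu, row in counts.items():
        hits = sum(
            1
            for count, total in zip(row, totals)
            if total > 0 and count / total >= min_rel_abundance
        )
        if hits / n_samples >= min_prevalence:
            kept[otu] = row
    return kept


# Hypothetical count table: OTU2 is sparse and rare, so it is dropped.
counts = {
    "OTU1": [500, 450, 600, 550],
    "OTU2": [0, 3, 0, 0],
    "OTU3": [495, 547, 400, 450],
}
kept = filter_otus(counts)
```

Filtering like this sidesteps sparseness rather than modelling it, so a rare but genuinely informative OTU can be lost.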

Anamaria Crisan


So you might be wondering … what are these mysterious “statistical methods”?

Generally, statistical techniques either try to predict labels or continuous values

Anamaria Crisan


And… they also generally belong to one of two categories: supervised or unsupervised

(I lied a little: semi-supervised methods also exist)

Anamaria Crisan


Note, the same techniques are called different things in different fields

Anamaria Crisan


The best approach to use depends on your research question

Now that we have biomarkers, we should validate them.


How do we validate our biomarkers?

•  Once you ID the gene or taxonomic group you want to use as a biomarker, you need to design a test for it. (q)PCR is a good option:

1.  Identify biomarker-specific sequence
   –  If you used a marker-based tool to identify the biomarker, you’re all set (e.g. MetaPhlAn2)
   –  Otherwise, you can cluster reads or align them to find conserved sequences
      –  need to verify that the “representative” sequence is still a good biomarker

2.  Design primers (& probe)
   –  Primer Prospector: designs primers from a sequence alignment
   –  Primer-BLAST: designs primers specific to a clade

Thea Van Rossum

Case Study – Categorical Biomarker from Bacterial Shotgun Data

Example of fast track to marker identification and PCR test development: IDing markers of water quality (see www.watersheddiscovery.ca for project)

•  Bacterial shotgun data using Illumina HiSeq platform
•  Comparing river water microbiomes at an agricultural unimpacted site (AUP) versus two impacted downstream sites (“at pollution” (APL) & “downstream” (ADS))
•  Using MetaPhlAn
   –  Low sensitivity, high precision
   –  Based on clade-specific gene sequences
   –  3000 reference genomes
   –  Fast: 3 million reads (100-150 bp) in 10 minutes


Quality trimmed and normalised data across samples

MOCK COMMUNITY VALIDATION

•  Validated with mock community: DNA-free water spiked with DNA from lab-cultured bacteria
•  7% of reads were assigned to a species (low sensitivity)
•  Of those, 84% were correctly assigned
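To see what the two mock-community percentages imply together, the short Python calculation below combines them. The 7% and 84% figures come from the slide; the variable names are only illustrative, and it treats "correctly assigned" as a fraction of the assigned reads, as the slide states.

```python
# Figures reported on the slide for the mock community.
assigned_fraction = 0.07        # reads assigned to any species
correct_given_assigned = 0.84   # of the assigned reads, correctly assigned

# Fractions of ALL reads, obtained by multiplying the two proportions.
correct_overall = assigned_fraction * correct_given_assigned          # about 5.9% of reads
wrong_overall = assigned_fraction * (1 - correct_given_assigned)      # about 1.1% of reads
unassigned = 1 - assigned_fraction                                    # 93% of reads
```

So only about 6 in every 100 reads end up with a correct species label, which is why the slides describe the tool as low sensitivity but high precision.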

Thea Van Rossum

Taxonomic Classifications – Rural Site

•  Prioritised high abundance taxa
•  Use White’s non-parametric t-test with false discovery rate multiple test correction to find differentially abundant taxa (alternative: RandomForests)
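The general idea behind a non-parametric t-test with false discovery rate correction can be sketched in Python as a permutation test followed by the Benjamini-Hochberg adjustment. This is a generic illustration, not the implementation used in the case study; the function names, the number of permutations and the abundance values are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_stat(a, b):
    """Welch-style t statistic; returns 0.0 if both groups have zero variance."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se


def permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided p-value from randomly shuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_stat(pooled[:len(a)], pooled[len(a):])) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)  # add-one avoids a p-value of exactly zero


def benjamini_hochberg(pvals):
    """Return FDR-adjusted q-values in the same order as pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical abundances for two taxa: (upstream samples, downstream samples).
taxa = {
    "Taxon 1": ([120, 130, 125, 140, 135], [10, 12, 8, 15, 11]),
    "Taxon 2": ([5, 7, 6, 8, 5], [6, 5, 7, 6, 8]),
}
p_values = [permutation_p(up, down) for up, down in taxa.values()]
q_values = benjamini_hochberg(p_values)
```

With only five samples per group the smallest achievable p-value is limited by the number of distinct label permutations, so small studies can only support a few strong candidates.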

[Figure: top 50 most “abundant” taxa, upstream versus at site & downstream, with Taxon 1 and Taxon 2 labelled]

Thea Van Rossum


3. Identified sequences characteristic of taxa

•  57,016 reads assigned to Taxon 1
•  2,176 reads assigned to Taxon 2
•  Prioritised Taxon 1 due to our research indicating high abundance taxa are most accurately predicted
•  Extracted taxon-specific sequences from MetaPhlAn database
   –  607 sequences for Taxon 1
•  Aligned reads against these sequences
•  Chose regions of MetaPhlAn sequences with most hits

Thea Van Rossum

[Figure: reads aligned to a MetaPhlAn marker sequence, with their consensus sequence; highest coverage = candidate marker sequence]
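Picking the region with the most hits can be sketched in Python as a per-base coverage count followed by a sliding window. The function name `best_window`, the 100 bp window and the alignment coordinates are hypothetical; real alignments would come from an aligner's output rather than a hand-written list.

```python
def best_window(marker_length, alignments, window=100):
    """Return (start, end, mean_coverage) of the best-covered window.

    alignments is a list of (start, end) half-open read positions on the marker.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (marker_length + 1)
    for start, end in alignments:
        diff[min(max(start, 0), marker_length)] += 1
        diff[min(max(end, 0), marker_length)] -= 1
    coverage, depth = [], 0
    for i in range(marker_length):
        depth += diff[i]
        coverage.append(depth)

    window = min(window, marker_length)
    total = sum(coverage[:window])
    best_start, best_total = 0, total
    # Slide the window one base at a time, updating the running sum.
    for s in range(1, marker_length - window + 1):
        total += coverage[s + window - 1] - coverage[s - 1]
        if total > best_total:  # ties keep the leftmost window
            best_start, best_total = s, total
    return best_start, best_start + window, best_total / window


# Hypothetical read alignments on a 300 bp marker sequence.
reads = [(0, 100), (120, 220), (150, 250), (160, 260), (200, 300)]
region = best_window(300, reads, window=100)
```

The window length would in practice be set by the amplicon size wanted for the qPCR assay.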


4. Designed qPCR primers and probes from marker sequence

•  Used Primer3 for primer and probe design
•  Checked relative occurrence rates of candidate primers and probes
   –  Considered matches that are exact or have 1-2 mismatches
   –  Chose sequences that minimize non-specific matches
•  Confirmed we can amplify a product of the right size
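Counting exact and near-exact primer matches can be sketched in Python with a simple mismatch scan. This is a deliberately naive illustration: the function name `count_hits` and the sequences are hypothetical, only substitutions are counted (no insertions or deletions), and the reverse-complement strand is not searched.

```python
def count_hits(primer, sequences, max_mismatches=2):
    """Count sequences with at least one site matching the primer
    within max_mismatches substitutions."""
    k = len(primer)
    hits = 0
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            mismatches = sum(1 for a, b in zip(primer, seq[i:i + k]) if a != b)
            if mismatches <= max_mismatches:
                hits += 1
                break  # one matching site per sequence is enough
    return hits


# Hypothetical candidate primer and reads.
primer = "ACGTACGT"
reads = ["TTACGTACGTTT", "GGACGTTCGTAA", "CCCCCCCCCCCC"]
exact_hits = count_hits(primer, reads, max_mismatches=0)
near_hits = count_hits(primer, reads, max_mismatches=2)
```

Running the same count on reads from the upstream site and from the impacted sites gives the relative occurrence rates that the slide compares.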

[Figure: forward primer, probe and reverse primer; upstream versus at site & downstream]

Next: iteratively test & improve qPCR

Thea Van Rossum

“Fast track” Case Study Summary

•  Initial analysis identified a PCR marker based on differential abundance of a bacterial species that is being used to pilot our iterative validation process
•  Benefits of this approach
   –  Fast: sequence data to PCR primers in a couple of days
   –  Simple: does not require large amounts of processing power
•  Limitations of this approach
   –  Depends on differential abundance of known bacteria (if the differential bacteria aren’t highly similar to those in the MetaPhlAn database, this approach will not work)
   –  Depends on taxa, which have been shown to be more variable across environments than (gene) functions
•  This is a “low hanging fruit” approach; a good first step

Thea Van Rossum, Fiona Brinkman


Bacterial Shotgun – alternative methods

•  More complete taxonomic analysis
   –  Kraken, DiScRIBinATE
•  Gene function analysis
   –  MEGAN4 (SEED, KEGG)
   –  Real Time Metagenomics (FIGfams)
•  Cluster based analysis:
   –  Get predicted proteins, cluster, find differential clusters, design PCR

Thea Van Rossum

From Biomarker to Lab Test

1.  Identify discriminative taxa or functions
2.  Identify informative region for primer design (CD-HIT)
3.  Design primers (Primer-BLAST or IDT RealTime PCR Tool)
4.  Validate primers in silico (Primer Prospector, Primer-BLAST)
5.  Validate primers in vitro (qPCR)

Mike Peabody


Remember other markers…

•  Community diversity could be a useful indicator of ecosystem health
•  Microbiome analysis may suggest other types of screening tests… (metabolites, etc)
•  Markers are only as good as the data they are based on, so design experiments carefully, include +ve and -ve controls
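As one example of a community diversity measure, the Shannon index can be computed from OTU counts in a few lines of Python. The choice of Shannon's index is an assumption made for illustration (the slides do not name a specific metric), and the counts are hypothetical.

```python
from math import log


def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


# Hypothetical OTU counts for two samples with the same number of OTUs.
even_sample = [25, 25, 25, 25]    # evenly spread community
skewed_sample = [97, 1, 1, 1]     # dominated by a single OTU
h_even = shannon(even_sample)     # equals ln(4), about 1.386
h_skewed = shannon(skewed_sample)
```

The dominated sample scores lower, which is the kind of shift that could flag a disturbed site, provided sequencing depth is comparable between the samples.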

Anamaria Crisan, Fiona Brinkman

We are on a Coffee Break & Networking Session
