extracting the signal from the noise - github pages

Extracting the

signal from the noise in high-throughput amplicon data

Stanford Microbiome Symposium - May 2016

The amplicon inference problem

Infer the sample types and abundances {(s, a)} from error-ful amplicon reads {r}.

1. Low resolution - 97%, genus at best

2. High false positive rate - #(OTUs) >> richness

3. Big data scaling - time scales super-linearly

4. Cross-study comparison - must reprocess all data together

Amplicon inference problems

sample

sequences

amplicon reads OTUs

ErrorsMake OTUs

DADA2

DADA2: High resolution

sample

sequences

amplicon reads OTUs

ErrorsMake OTUs

DADA2

Callahan, et al. Nature Methods, 2016.

DADA2: High resolution

s: ATTAACGAGATTATAACCAGAGTACGAATA...

r: ATCAACGAGATTATAACAAGAGTACGAATA...

p(r|s) =LY

i=1

p(r(i)|s(i), qr(i), Z)

DADA2: Error model

s: ATTAACGAGATTATAACCAGAGTACGAATA...

r: ATCAACGAGATTATAACAAGAGTACGAATA...

Error rates depend on....- Substitution (eg. A->C) - Quality score (eg. Q=30) - Batch effect (eg. run)

p(r|s) =LY

i=1

p(r(i)|s(i), qr(i), Z)

Using more data!

DADA2: Error model

The true shape of an error cloud

read

s

0 1 2 3 4 5 6 70

5

10

15

20

25

Effective Hamming Distance

NOT AN ERROR

BIO

LOG

Y

ERR

OR

OTU

Signal from Noise: OTUs

The true shape of an error cloud

read

s

0 1 2 3 4 5 6 70

5

10

15

20

25

Effective Hamming Distance

NOT AN ERROR

ERROR

BIOLOGY

DADA2

Signal from Noise: DADA2

Accuracyand

Resolution

Data: Kopylova, et al. mSystems, 2016.

0 100 200 300 400

010

020

030

040

0

True abundance

Infe

rred

abun

danc

e

0 50 100 200 300

050

150

250

True abundanceIn

ferre

d ab

unda

nce

Accuracy: Simulated datamothur (an) DADA2

TP: 978 FP: 272 FN: 77 cor: 0.935

True abundance True abundance

Data: Kopylova, et al. mSystems, 2016.

0 100 200 300 400

010

020

030

040

0

True abundance

Infe

rred

abun

danc

e

0 50 100 200 300

050

150

250

True abundanceIn

ferre

d ab

unda

nce

Accuracy: Simulated datamothur (an) DADA2

TP: 978 FP: 272 FN: 77 cor: 0.935

True abundance True abundance

TP: 1042 FP: 0FN: 13 cor: 0.999

Credit: Kopylova, et al. mSystems, 2016.

Accuracy: Mock community

0

1000

2000

3000

4000

DADA2 mothur SUMAClust Swarm uclust UPARSEMethod

# O

TUs

QIIME: Open DADA2

PC1 + PC2

0.0

0.2

0.4

0.6

QIIME: Open DADA2

Arsenic

0.0

0.2

0.4

0.6

0.8

1.0

Accuracy: Arsenic treatment

Credit: Dylan Dahan & Gabriel G. Perron

Variance explained by

Unpublishedordinationremoved.

Resolution: Mock Community

1e−05

1e−03

1e−01

1 10 100Hamming (Log−scale)

Freq

uenc

y (L

og−s

cale

)

AccuracyReference

Exact

One Off

Other

UPARSE

1e−05

1e−03

1e−01


Freq

uenc

y (L

og−s

cale

)

Vs. UPARSE

Added

Lost

Same

AccuracyReference

Exact

One Off

Other

DADA2

1e−05

1e−03

1e−01


Freq

uenc

y (L

og−s

cale

)

Vs. UPARSE

Added

Lost

Same

AccuracyReference

Exact

One Off

Other

MED

1e−05

1e−03

1e−01


Freq

uenc

y (L

og−s

cale

)

Vs. UPARSE

Added

Lost

Same

AccuracyReference

Exact

One Off

Other

Mothur

1e−05

1e−03

1e−01


Freq

uenc

y (L

og−s

cale

)

Vs. UPARSE

Added

Lost

Same

AccuracyReference

Exact

One Off

Other

QIIME

Callahan et al., Nature Methods, 2016.

DADA2

Hamming (log)

Resolution: Petrel aDNA

PetrelErrors

QIIME: De novo

Petrel.1

Petrel.2

Petrel.3

Petrel.4

DADA2

Credit: Kealoha Kinney, Michael Bunce, Andreanna Welch

0.00

0.25

0.50

0.75

1.00

Sample

Frequency

OTU

O1

42 pregnant women

Data: MacIntyre et al. Scientific Reports, 2015.

Resolution: L. crispatusUPARSE

0.00

0.25

0.50

0.75

1.00

Sample

Frequency

Variant

L1

L2

L3

L4

L5

L6

Data: MacIntyre et al. Scientific Reports, 2015.

Resolution: L. crispatus

42 pregnant women

DADA2

Scalingand

Comparison

The sequence is the label

ATTAACGAGATTATAACCAGAGTACGAATA...

is a consistent labelOTU85 is not

The sequence is the label

ATTAACGAGATTATAACCAGAGTACGAATA...

Consistent labels allow samples to be

independently processed

is a consistent labelOTU85 is not

Separable processing. Flat memory requirements.

Task parallelization.

1e+05

1e+08

1e+11

1e+03 1e+05reads

Alig

nmen

ts MethodHierarchicalDADA3Greedy

One sample

1e+10

1e+13

1e+16

1e+19

100 10000samples

Alig

nmen

ts MethodHierarchicalDADA3Greedy

One study

Consistent Labels: Scaling

Consistent Labels: Comparison

Sequence Tables

Study 1

Study 2

Study 3

Cross-study comparisonmerge

Eliminates need for joint reprocessing of raw data.

- A replacement for OTU picking- Accurate and high-resolution- Implemented in an R package- Open source

DADA2 is...

- A replacement for OTU picking- Accurate and high-resolution- Implemented in an R package- Open source

DADA2 is...

@bejcal -- https://github.com/benjjneb/dada2

Susan Holmes Michael RosenJoey McMurdie

Acknowledgements

http://benjjneb.github.io/dada2/

extracting the signal from the noise - github pages

Documents