extracting the signal from the noise - github pages
TRANSCRIPT
Extracting the
signal from the noise in high-throughput amplicon data
Stanford Microbiome Symposium - May 2016
The amplicon inference problem
Infer the sample types and abundances {(s, a)} from error-ful amplicon reads {r}.
1. Low resolution - 97%, genus at best
2. High false positive rate - #(OTUs) >> richness
3. Big data scaling - time scales super-linearly
4. Cross-study comparison - must reprocess all data together
Amplicon inference problems
sample
sequences
amplicon reads OTUs
ErrorsMake OTUs
DADA2
Callahan, et al. Nature Methods, 2016.
DADA2: High resolution
s: ATTAACGAGATTATAACCAGAGTACGAATA...
r: ATCAACGAGATTATAACAAGAGTACGAATA...
p(r|s) =LY
i=1
p(r(i)|s(i), qr(i), Z)
DADA2: Error model
s: ATTAACGAGATTATAACCAGAGTACGAATA...
r: ATCAACGAGATTATAACAAGAGTACGAATA...
Error rates depend on....- Substitution (eg. A->C) - Quality score (eg. Q=30) - Batch effect (eg. run)
p(r|s) =LY
i=1
p(r(i)|s(i), qr(i), Z)
Using more data!
DADA2: Error model
The true shape of an error cloud
read
s
0 1 2 3 4 5 6 70
5
10
15
20
25
Effective Hamming Distance
NOT AN ERROR
BIO
LOG
Y
ERR
OR
OTU
Signal from Noise: OTUs
The true shape of an error cloud
read
s
0 1 2 3 4 5 6 70
5
10
15
20
25
Effective Hamming Distance
NOT AN ERROR
ERROR
BIOLOGY
DADA2
Signal from Noise: DADA2
Data: Kopylova, et al. mSystems, 2016.
0 100 200 300 400
010
020
030
040
0
True abundance
Infe
rred
abun
danc
e
0 50 100 200 300
050
150
250
True abundanceIn
ferre
d ab
unda
nce
Accuracy: Simulated datamothur (an) DADA2
TP: 978 FP: 272 FN: 77 cor: 0.935
True abundance True abundance
Data: Kopylova, et al. mSystems, 2016.
0 100 200 300 400
010
020
030
040
0
True abundance
Infe
rred
abun
danc
e
0 50 100 200 300
050
150
250
True abundanceIn
ferre
d ab
unda
nce
Accuracy: Simulated datamothur (an) DADA2
TP: 978 FP: 272 FN: 77 cor: 0.935
True abundance True abundance
TP: 1042 FP: 0FN: 13 cor: 0.999
Credit: Kopylova, et al. mSystems, 2016.
Accuracy: Mock community
0
1000
2000
3000
4000
DADA2 mothur SUMAClust Swarm uclust UPARSEMethod
# O
TUs
QIIME: Open DADA2
PC1 + PC2
0.0
0.2
0.4
0.6
QIIME: Open DADA2
Arsenic
0.0
0.2
0.4
0.6
0.8
1.0
Accuracy: Arsenic treatment
Credit: Dylan Dahan & Gabriel G. Perron
Variance explained by
Unpublishedordinationremoved.
Resolution: Mock Community
1e−05
1e−03
1e−01
1 10 100Hamming (Log−scale)
Freq
uenc
y (L
og−s
cale
)
AccuracyReference
Exact
One Off
Other
UPARSE
1e−05
1e−03
1e−01
1 10 100Hamming (Log−scale)
Freq
uenc
y (L
og−s
cale
)
Vs. UPARSE
Added
Lost
Same
AccuracyReference
Exact
One Off
Other
DADA2
1e−05
1e−03
1e−01
1 10 100Hamming (Log−scale)
Freq
uenc
y (L
og−s
cale
)
Vs. UPARSE
Added
Lost
Same
AccuracyReference
Exact
One Off
Other
MED
1e−05
1e−03
1e−01
1 10 100Hamming (Log−scale)
Freq
uenc
y (L
og−s
cale
)
Vs. UPARSE
Added
Lost
Same
AccuracyReference
Exact
One Off
Other
Mothur
1e−05
1e−03
1e−01
1 10 100Hamming (Log−scale)
Freq
uenc
y (L
og−s
cale
)
Vs. UPARSE
Added
Lost
Same
AccuracyReference
Exact
One Off
Other
QIIME
Callahan et al., Nature Methods, 2016.
DADA2
Hamming (log)
Resolution: Petrel aDNA
PetrelErrors
QIIME: De novo
Petrel.1
Petrel.2
Petrel.3
Petrel.4
DADA2
Credit: Kealoha Kinney, Michael Bunce, Andreanna Welch
0.00
0.25
0.50
0.75
1.00
Sample
Frequency
OTU
O1
42 pregnant women
Data: MacIntyre et al. Scientific Reports, 2015.
Resolution: L. crispatusUPARSE
0.00
0.25
0.50
0.75
1.00
Sample
Frequency
Variant
L1
L2
L3
L4
L5
L6
Data: MacIntyre et al. Scientific Reports, 2015.
Resolution: L. crispatus
42 pregnant women
DADA2
The sequence is the label
ATTAACGAGATTATAACCAGAGTACGAATA...
Consistent labels allow samples to be
independently processed
is a consistent labelOTU85 is not
Separable processing. Flat memory requirements.
Task parallelization.
1e+05
1e+08
1e+11
1e+03 1e+05reads
Alig
nmen
ts MethodHierarchicalDADA3Greedy
One sample
1e+10
1e+13
1e+16
1e+19
100 10000samples
Alig
nmen
ts MethodHierarchicalDADA3Greedy
One study
Consistent Labels: Scaling
Consistent Labels: Comparison
Sequence Tables
Study 1
Study 2
Study 3
Cross-study comparisonmerge
Eliminates need for joint reprocessing of raw data.
- A replacement for OTU picking- Accurate and high-resolution- Implemented in an R package- Open source
DADA2 is...
- A replacement for OTU picking- Accurate and high-resolution- Implemented in an R package- Open source
DADA2 is...
@bejcal -- https://github.com/benjjneb/dada2