model-based species identification using dna barcodes
![Page 1: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/1.jpg)
Model-based species identification using DNA barcodes
Bogdan Paşaniuc
CSE Department, University of Connecticut
Joint work with Ion Măndoiu and Sotirios Kentros
![Page 2: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/2.jpg)
Outline
Existing approaches to species identification
Proposed statistical model based methods
Experimental Results
Ongoing Work and Conclusions
![Page 3: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/3.jpg)
Background on DNA barcoding
Recently proposed tool for species identification
Use a short DNA region as a “fingerprint” for the species
Region of choice: cytochrome c oxidase subunit 1 mitochondrial gene ("COI", 648 base pairs long)
Key assumption: inter-species variability is higher than intra-species variability
![Page 4: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/4.jpg)
Species identification problem
Given:
  Database DB containing barcodes from known species
  New barcode x
Find:
  A high-confidence assignment to a species in the DB
  UNKNOWN, if the confidence is not high enough
Use additional evidence/methods to resolve UNKNOWN assignments and possibly discover new species
![Page 5: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/5.jpg)
Existing approaches and limitations
Neighbor Joining tree for new + known barcodes [Meyer&Paulay05]
  One barcode per species
  Runtime does not scale well with #species (quadratic or worse)
Likelihood ratio test for species membership using MCMC [Matz&Nielsen06]
  Impractical runtime even for moderate #species
Distance-based [BOLD-IDS, TaxI (Steinke et al.05)]
  Unclear statistical significance
![Page 6: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/6.jpg)
BOLD
BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert07]
  http://www.barcodinglife.org
  Currently: 28,129 species, 251,429 barcodes
Identification System: BOLD-IDS
  Distance-based (NJ tree for visualization)
  Employs a threshold (less than 1% divergence) to get a tight match to a barcode in the DB
![Page 7: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/7.jpg)
BOLD-IDS
[Ekrem et al.07]: “…identifications by the BOLD facility must be cautiously evaluated as the system at present may return high probabilities of placements that obviously are erroneous”
![Page 8: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/8.jpg)
Outline
Existing approaches to species identification
Proposed statistical model based methods
Experimental Results
Ongoing Work and Conclusions
![Page 9: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/9.jpg)
Bayesian approach to species identification
Assign barcode x = x1x2x3…xn to the species SPi that maximizes P(SPi|x) over all species SPi
P(SPi|x) computed using Bayes’ theorem: P(SP|x) = P(x|SP)*P(SP)/P(x)
  Uniform prior P(SP)
  P(x) constant for fixed x
  Need model for P(x|SP)
We explored three scalable models: position weight matrices, Markov chains, hidden Markov models
  Similar to models used successfully in other sequence analysis problems such as DNA motif finding and protein families
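Spelled out, the three Bayes bullets reduce the assignment rule to a maximum-likelihood choice. The hat notation is introduced here only for illustration; the step uses nothing beyond the uniform prior and the fact that P(x) does not depend on the species:

$$\widehat{SP}(x) = \arg\max_i P(SP_i \mid x) = \arg\max_i \frac{P(x \mid SP_i)\,P(SP_i)}{P(x)} = \arg\max_i P(x \mid SP_i)$$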
![Page 10: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/10.jpg)
Positional weight matrix (PWM)
Assumption: independence of loci
P(x|SP) = P(x1|SP)*P(x2|SP)*…*P(xn|SP)
For each locus i, P(xi|SP) is estimated as the frequency of nucleotide xi at that locus among the DB sequences from species SP
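A minimal sketch of the PWM model, assuming aligned barcodes of equal length. The function names and the add-one pseudocount (used so that unseen nucleotides do not get probability zero) are illustrative assumptions, not details given in the slides.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(seqs, pseudocount=1.0):
    """Estimate per-locus nucleotide probabilities from aligned sequences.

    The pseudocount is an assumption of this sketch, not part of the slides.
    """
    n = len(seqs[0])
    total = len(seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(n):
        counts = Counter(s[i] for s in seqs)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def pwm_log_likelihood(pwm, x):
    """ln P(x|SP) under independence of loci: sum of per-locus log-probabilities."""
    return sum(math.log(col[b]) for col, b in zip(pwm, x))

def classify(models, x):
    """Return the species whose PWM gives x the highest likelihood (uniform prior)."""
    return max(models, key=lambda sp: pwm_log_likelihood(models[sp], x))
```

Working in log space avoids numeric underflow when multiplying several hundred per-locus probabilities.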
![Page 11: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/11.jpg)
Inhomogeneous Markov Chain (IMC)
Takes into account dependencies between consecutive loci
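[Figure: IMC state diagram. A start state is followed by one column of states A, C, T, G per locus (locus 1, locus 2, locus 3, locus 4, …); transitions t1, t2, t3 connect consecutive loci.]

$$P(x \mid M) = \pi(x_1)\prod_{i=1}^{n-1} t_i(x_i, x_{i+1})$$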
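A minimal sketch of training and evaluating an inhomogeneous Markov chain, with P(x|M) computed as an initial probability for the first base times one locus-specific transition probability per consecutive pair of loci. It assumes aligned, equal-length barcodes; the function names and the pseudocount smoothing are assumptions of this example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_imc(seqs, pseudocount=1.0):
    """Return (pi, trans): initial probabilities and one transition table per locus pair."""
    n, k = len(seqs[0]), len(ALPHABET)
    first = Counter(s[0] for s in seqs)
    pi = {b: (first[b] + pseudocount) / (len(seqs) + pseudocount * k)
          for b in ALPHABET}
    trans = []
    for i in range(n - 1):
        pairs = Counter((s[i], s[i + 1]) for s in seqs)
        origin = Counter(s[i] for s in seqs)
        # Each row (fixed a) is smoothed separately so that it sums to 1.
        trans.append({(a, b): (pairs[(a, b)] + pseudocount)
                              / (origin[a] + pseudocount * k)
                      for a in ALPHABET for b in ALPHABET})
    return pi, trans

def imc_log_likelihood(model, x):
    """ln P(x|M) = ln pi(x_1) + sum over i of ln t_i(x_i, x_{i+1})."""
    pi, trans = model
    ll = math.log(pi[x[0]])
    for i, t in enumerate(trans):
        ll += math.log(t[(x[i], x[i + 1])])
    return ll
```

Unlike the PWM, each transition table conditions on the previous base, so the model captures dependencies between consecutive loci at the cost of 16 parameters per locus pair instead of 4 per locus.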
![Page 12: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/12.jpg)
Hidden Markov Model (HMM)
Same structure as the IMC
Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate
Barcode x is generated along path p with probability equal to the product of the emission and transition probabilities along p
P(x|HMM) = sum of probabilities over all paths
  Efficiently computed by the forward algorithm
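A minimal sketch of the forward algorithm for an HMM with the IMC structure (one state per base per locus). It assumes each state emits its own base with probability 1 - mu and each of the three other bases with probability mu/3; this split of the mutation rate, the parameter name `mu`, and the dictionary layout of `pi` and `trans` are assumptions of the example.

```python
ALPHABET = "ACGT"

def hmm_likelihood(pi, trans, x, mu=0.01):
    """P(x|HMM): sum over all state paths, computed locus by locus.

    pi[s] is the probability of starting in state s; trans[i][(r, s)] is the
    probability of moving from state r at locus i to state s at locus i + 1.
    """
    def emit(state, observed):
        return 1.0 - mu if state == observed else mu / 3.0

    # fwd[s]: probability of emitting the prefix seen so far and ending in state s
    fwd = {s: pi[s] * emit(s, x[0]) for s in ALPHABET}
    for i, t in enumerate(trans):
        fwd = {s: emit(s, x[i + 1]) * sum(fwd[r] * t[(r, s)] for r in ALPHABET)
               for s in ALPHABET}
    return sum(fwd.values())
```

For full-length barcodes the raw probabilities become very small, so a practical version would rescale `fwd` at each locus or work in log space.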
![Page 13: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/13.jpg)
Accuracy on BOLD dataset
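37 species with at least 100 barcodes from BOLD
10-50% barcodes removed and used for test
IMC yields better accuracy in all cases

Accuracy by percentage of barcodes removed for testing:

| Model | 10% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|
| PWM | 90.08% | 90.01% | 90.02% | 89.68% | 89.69% |
| IMC | 99.97% | 99.93% | 99.90% | 99.91% | 99.89% |
| HMM | 99.57% | 99.57% | 99.66% | 99.70% | 99.76% |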
![Page 14: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/14.jpg)
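Score normalization
DB barcodes have non-uniform lengths and cover different regions of the COI gene
  Membership probabilities not always comparable
Normalization scheme:
  Species models constructed only over positions covered in DB
  Scores normalized using background IMC constructed from all sequences in DB

$$Score(x) = \ln\frac{P(x \mid M)}{P(x \mid \acute{M})} = \ln\frac{\pi(x_1)}{\acute{\pi}(x_1)} + \sum_{i=1}^{n-1}\ln\frac{t_i(x_i, x_{i+1})}{\acute{t}_i(x_i, x_{i+1})}$$

where $\acute{\pi}$ and $\acute{t}_i$ are the initial and transition probabilities of the background model Ḿ.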
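A minimal sketch of the normalized score as a log-odds ratio between a species IMC and the background IMC. Both models are assumed to be `(pi, trans)` pairs with strictly positive probabilities; that layout and the function name are assumptions of the example.

```python
import math

def log_odds_score(model, background, x):
    """Score(x) = ln P(x|M) - ln P(x|background), accumulated term by term."""
    (pi, trans), (bpi, btrans) = model, background
    score = math.log(pi[x[0]] / bpi[x[0]])
    for i in range(len(x) - 1):
        pair = (x[i], x[i + 1])
        score += math.log(trans[i][pair] / btrans[i][pair])
    return score
```

A positive score means the species model explains the barcode better than the background does; a score near zero means the barcode carries little evidence for that species.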
![Page 15: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/15.jpg)
Computing the confidence of assignment
x assigned to species SP with score s
p-value: probability that a barcode generated under background model Ḿ has a score s’ ≥ s
Methods for p-value estimation:
  Random sampling
    Generate random sequences and count how many reach or exceed the score
  Exact computation (for PWMs):
    Dynamic programming [Rahmann03]
    Branch and bound [Zhang et al. 07]
    Shifted FFTs [Nagarajan et al. 05]
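A minimal sketch of the random-sampling estimate: draw sequences from a background IMC and report the fraction whose score is at least s. The `(pi, trans)` layout of the background model, the caller-supplied `score_fn`, and the default sample size are assumptions of the example.

```python
import random

def sample_from_imc(pi, trans, rng):
    """Draw one sequence from an inhomogeneous Markov chain."""
    bases = list(pi)
    seq = [rng.choices(bases, weights=[pi[b] for b in bases])[0]]
    for t in trans:
        prev = seq[-1]
        seq.append(rng.choices(bases, weights=[t[(prev, b)] for b in bases])[0])
    return "".join(seq)

def sampled_p_value(score_fn, background, s, n_samples=10000, seed=0):
    """Estimate P(score >= s) for sequences generated by the background model."""
    rng = random.Random(seed)
    pi, trans = background
    hits = sum(score_fn(sample_from_imc(pi, trans, rng)) >= s
               for _ in range(n_samples))
    return hits / n_samples
```

The resolution of this estimate is limited to 1/n_samples, so very small p-values need either many samples or an exact computation.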
![Page 16: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/16.jpg)
Exact computation for PWMs [Rahmann03]
Computes the entire distribution
Scores rounded by a granularity factor
Score is a sum of n independent variables (score contribution of each position)
Probability of a random sequence of length i having a given score is computed from the contribution of the first i-1 positions and the current position
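A minimal sketch of this dynamic program, in the spirit of the approach attributed to [Rahmann03] rather than a reproduction of it. Per-position score contributions are rounded to integer multiples of a granularity `eps`, and the distribution is extended one position at a time. The argument layout and names are assumptions of the example.

```python
from collections import defaultdict

def pwm_score_distribution(score_cols, bg_cols, eps=0.01):
    """Exact distribution of the rounded PWM score of a random sequence.

    score_cols[i][b]: score contribution of base b at position i.
    bg_cols[i][b]:    background probability of base b at position i.
    Returns a dict mapping rounded score (in units of eps) to probability.
    """
    dist = {0: 1.0}  # the empty prefix has score 0 with probability 1
    for scores, bg in zip(score_cols, bg_cols):
        nxt = defaultdict(float)
        for total, p in dist.items():
            for b, q in bg.items():
                nxt[total + round(scores[b] / eps)] += p * q
        dist = dict(nxt)
    return dist

def pwm_p_value(dist, s, eps=0.01):
    """P(score >= s) read off the exact distribution."""
    return sum(p for k, p in dist.items() if k >= round(s / eps))
```

Because positions are independent, only the running total has to be tracked, so the table never holds more entries than there are distinct rounded scores.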
![Page 17: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/17.jpg)
Exact computation for IMCs
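Define $f^i_y(\alpha)$ as the probability of a random sequence of length i having score $\alpha$ and last letter y:

$$f^i_y(\alpha) = P\big(Score(x_1,\ldots,x_i) = \alpha \,\wedge\, x_i = y \mid \acute{M}\big)$$

Basic recurrence (for i ≥ 2, with $t_{i-1}$ the transition from locus i-1 to locus i):

$$f^i_y(\alpha) = \sum_{z \in \{A,C,T,G\}} f^{i-1}_z\!\left(\alpha - \ln\frac{t_{i-1}(z,y)}{\acute{t}_{i-1}(z,y)}\right)\acute{t}_{i-1}(z,y)$$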
![Page 18: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/18.jpg)
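IMC exact p-value computation
Initially:

$$f^1_y(\alpha) = \begin{cases}\acute{\pi}(y), & \text{if } \alpha = \ln\pi(y) - \ln\acute{\pi}(y)\\ 0, & \text{otherwise}\end{cases}$$

The probability of a random barcode having score $\alpha$:

$$f^n(\alpha) = \sum_{y \in \{A,C,T,G\}} f^n_y(\alpha)$$

Runtime: O(…) in n and R [bound illegible in this transcript], where R is the difference between the max and min score for any i.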
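A minimal sketch of the exact IMC computation: a table indexed by last letter and rounded score is initialized from the first locus, extended one locus at a time using the background transition probabilities, and finally summed over the last letter. Rounding to a granularity `eps`, the `(pi, trans)` model layout, and the function names are assumptions of the example; all probabilities are assumed strictly positive.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def imc_score_distribution(model, background, eps=0.01):
    """Distribution of the rounded log-odds score under the background IMC."""
    (pi, trans), (bpi, btrans) = model, background
    # f[y][k]: probability that a random prefix ends in y with rounded score k
    f = {y: {round(math.log(pi[y] / bpi[y]) / eps): bpi[y]} for y in ALPHABET}
    for t, bt in zip(trans, btrans):
        nxt = {y: defaultdict(float) for y in ALPHABET}
        for z in ALPHABET:
            for k, p in f[z].items():
                for y in ALPHABET:
                    step = round(math.log(t[(z, y)] / bt[(z, y)]) / eps)
                    nxt[y][k + step] += p * bt[(z, y)]
        f = nxt
    total = defaultdict(float)  # sum over the last letter
    for y in ALPHABET:
        for k, p in f[y].items():
            total[k] += p
    return dict(total)

def imc_p_value(dist, s, eps=0.01):
    """P(score >= s) for a barcode generated by the background model."""
    return sum(p for k, p in dist.items() if k >= round(s / eps))
```

The last letter has to be part of the table key because, unlike in the PWM case, the score contribution of a locus depends on the base at the previous locus.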
![Page 19: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/19.jpg)
Outline
Existing approaches to species identification
Proposed statistical model based methods
Experimental Results
Ongoing Work and Conclusions
![Page 20: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/20.jpg)
Experimental setup (1)
Compared methods
  IMC
    Species with highest score
    If score < species-specific threshold: UNKNOWN
  Distance-based (BOLD-IDS like)
    Species containing the barcode showing the least divergence
    If divergence > threshold (default 1%): UNKNOWN
Basic questions
  What is the effect of training set size (#barcodes per species) on accuracy?
  What is the effect of the #species on accuracy?
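A minimal sketch of the distance-based baseline: assign the query to the species of its closest DB barcode, or to UNKNOWN when even the closest barcode diverges by more than the threshold. Divergence is taken here to be the fraction of mismatching aligned positions, which is an assumption of the example (the slides do not state the distance used), as are the function names.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length barcodes differ."""
    return sum(c != d for c, d in zip(a, b)) / len(a)

def classify_by_distance(db, x, threshold=0.01):
    """db maps species name to a list of barcodes; returns a species or "UNKNOWN"."""
    best_species, best_dist = None, float("inf")
    for species, barcodes in db.items():
        for barcode in barcodes:
            d = p_distance(x, barcode)
            if d < best_dist:
                best_species, best_dist = species, d
    return best_species if best_dist <= threshold else "UNKNOWN"
```

Each query is compared against every barcode in the DB, so the cost per query grows with the total number of barcodes rather than with the number of species models.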
![Page 21: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/21.jpg)
Experimental setup (2)
Two scenarios:
  Complete DB: all new barcodes belong to species in DB
  Incomplete DB: some new barcodes belong to species not in DB
![Page 22: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/22.jpg)
Accuracy measures
True positive rate = TP/(TP+FP)
Barcodes belonging to species present in the DB
  TP = #barcodes assigned to correct species
  FP = #barcodes assigned to incorrect species
Barcodes belonging to species not present in DB
  TP = #barcodes assigned to UNKNOWN
  FP = #barcodes assigned to species in the DB
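A minimal sketch of these measures, computed separately for test barcodes whose species is in the DB and for those whose species is not. It reads the definitions above as excluding UNKNOWN assignments of in-DB barcodes from both TP and FP (they are counted by the unknown rate instead); that reading and the function names are assumptions of the example.

```python
def tp_rates(truth, predicted, db_species):
    """Return (TP rate for in-DB barcodes, TP rate for out-of-DB barcodes).

    A rate is None when no barcode contributes to its denominator.
    """
    tp_in = fp_in = tp_out = fp_out = 0
    for t, p in zip(truth, predicted):
        if t in db_species:
            if p == t:
                tp_in += 1
            elif p != "UNKNOWN":
                fp_in += 1          # assigned to an incorrect species
        elif p == "UNKNOWN":
            tp_out += 1             # correctly left unassigned
        else:
            fp_out += 1             # wrongly assigned to a DB species
    rate = lambda tp, fp: tp / (tp + fp) if tp + fp else None
    return rate(tp_in, fp_in), rate(tp_out, fp_out)

def unknown_rate(predicted):
    """Fraction of test barcodes assigned to UNKNOWN."""
    return sum(p == "UNKNOWN" for p in predicted) / len(predicted)
```

Raising the score threshold turns more assignments into UNKNOWN, which is the tradeoff between TP rate and unknown rate explored in the experiments.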
![Page 23: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/23.jpg)
Effect of #barcodes/species
Datasets containing all BOLD species with at least 5/25 barcodes
  BOLD5: 1508 sp, 28600 barcodes
  BOLD25: 270 sp, 17197 barcodes
DB composed of randomly picked 5-20 barcodes from all species in BOLD25
Test barcodes
  Complete database scenario
    All remaining barcodes from BOLD25
  Incomplete database scenario
    All barcodes from BOLD5 not in DB
![Page 24: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/24.jpg)
Effect of #barcodes/species, complete DB
[Figure: IMC and Distance (thr 1%). TP rate (0.91 to 1) vs. unknown rate (0 to 0.5), one curve each for 5, 10, 15, and 20 barcodes per species.]
![Page 25: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/25.jpg)
Effect of #barcodes/species, incomplete DB
[Figure, panel 1: barcodes belonging to species in DB. TP rate (0.90 to 1.00) vs. unknown rate (0.00 to 0.50), one curve each for 5, 10, 15, and 20 barcodes per species.]
[Figure, panel 2: barcodes not belonging to species in DB. TP rate (0.40 to 1.00) vs. unknown rate (0.00 to 0.50), one curve each for 5, 10, 15, and 20 barcodes per species.]
![Page 26: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/26.jpg)
Effect of #species
Datasets containing all BOLD species with at least 5/10 barcodes
  BOLD5: 1508 sp, 28600 barcodes
  BOLD10: 690 sp, 23558 barcodes
DB composed of randomly picked 100 to 690 species from BOLD10
  10 barcodes per species
Test barcodes
  Complete database scenario
    All remaining barcodes from picked species
  Incomplete database scenario
    All barcodes from BOLD5 not in DB
![Page 27: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/27.jpg)
Effect of #species, complete DB
[Figure: TP rate (0.7 to 1) vs. unknown rate (0 to 0.5), one curve each for 100, 200, 300, 400, 500, 600, and 690 species in the DB.]
![Page 28: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/28.jpg)
Effect of #species, incomplete DB
[Figure, panel 1: barcodes belonging to species in the DB. TP rate (0.70 to 1.00) vs. unknown rate (0.00 to 0.50), one curve each for 100, 200, 300, 400, 500, 600, and 690 species in the DB.]
[Figure, panel 2: barcodes NOT belonging to species in DB. TP rate (0.40 to 1.00) vs. unknown rate (0.00 to 0.50), one curve each for 100, 200, 300, 400, 500, 600, and 690 species in the DB.]
![Page 29: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/29.jpg)
Outline
Existing approaches to species identification
Proposed statistical model based methods
Experimental Results
Ongoing Work and Conclusions
![Page 30: Model-based species identification using DNA barcodes](https://reader036.vdocuments.mx/reader036/viewer/2022062315/5681609a550346895dcfc206/html5/thumbnails/30.jpg)
Conclusions & Ongoing work
IMC provides a scalable method for species identification
  High accuracy, with useful tradeoff between TP rate and unknown rate
  Efficiently computable p-values
Comprehensive comparison of identification algorithms to be submitted to 2nd International Barcode Conference
  Broad coverage of methods
    tree-based, distance-based, character-based, model-based
  Assessment of further effects besides #species and #barcodes/species
    Barcode length
    Barcode quality
    Number of regions
  Runtime scalability (up to millions of species)
  Diverse datasets (BOLD, cowries, flu viruses, simulated data, etc.)