VirVarSeq vs ViVaMBC
Bie Verbist | NCS Brugge | 10-10-2014
Statistical methods for improved variant calling of massively parallel sequencing data.
Pictured above: The structure of HIV.
OUTLINE
• Viral dynamics
• Massively parallel sequencing
• Variant calling
– VirVarSeq
– ViVaMBC
• Results
– HCV plasmids
– HCV clinical sample
Viral dynamics
A virus is a small infectious agent that replicates only inside the living cells of other organisms.
High replication rate (10^11 replications a day for HIV)
High mutation rate
A viral population consists of closely related subgroups, the viral quasispecies, which we want to identify and quantify.
Viral dynamics
[Figure: number of viruses in the population (vertical axis) against time (horizontal axis), before treatment and on treatment. Labels: heterogeneous viral population, drug-sensitive variants, drug-resistant variant, undetectable.]
Sequencing
Sanger sequencing
• detection limit: 20-30%
• no accurate estimate of frequency
Massively parallel sequencing
• detection limit << 20%
• more accurate estimate of frequency
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGATTTCTGTCTGGG
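To illustrate why read-level data allow a frequency estimate, here is a minimal sketch that counts bases per position across the eight example sequences on this slide. The function name `base_frequencies` is hypothetical, and the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter

# The eight 16-base example sequences from the slide.
READS = [
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGATTTCTGTCTGGG",
]

def base_frequencies(reads):
    """Return one {base: frequency} dict per alignment column."""
    n = len(reads)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*reads)
    ]

freqs = base_frequencies(READS)
# Keep only the positions (1-based) where more than one base is observed.
variable = {pos + 1: f for pos, f in enumerate(freqs) if len(f) > 1}
print(variable)
```

Counting alone cannot tell whether a base seen in a single read is a rare variant or a sequencing error, which is the question the variant callers address.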
Massively parallel sequencing
[Figure: workflow from viral population, through amplification and fragmentation into DNA fragments, to sequencing by synthesis.]
Example, one fragment:
GAAACCGT
A C G G T T T C T
Massively parallel sequencing
Viral population
@HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
+
G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:;
@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
+
DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###
@HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
+
HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
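The records above are in FASTQ format: a header line, the base calls, a `+` separator, and one quality character per base. As a rough sketch of how such records can be read, the hypothetical helper `parse_fastq` below splits the four-line records and converts quality characters to Q-scores. The Phred+33 offset is an assumption (it is the usual encoding for this header style), and the example record is only the first 20 bases of the second read on the slide.

```python
def parse_fastq(lines):
    """Yield (header, sequence, q_scores) from four-line FASTQ records."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding, so Q = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Truncated excerpt (first 20 bases) of the second record on the slide.
EXAMPLE = """@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTG
+
DDDDDDDDDD2DDDDD#DDD
"""

records = list(parse_fastq(EXAMPLE.splitlines()))
header, seq, quals = records[0]
print(seq[16], quals[16])  # the uncalled base N and its Q-score
```

Records are read by position rather than by splitting on `@`, because `@` is also a valid quality character.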
Variant calling
Distinguish low-frequency variants from sequencing error.
VirVarSeq
Adaptive filtering approach based on quality scores.
Verbist et al. 2014, Bioinformatics. doi: 10.1093/bioinformatics/btu587.
ViVaMBC
Model-based clustering approach that models the error probabilities based on quality scores.
Verbist et al. 2014, BMC Bioinformatics, under revision.
VirVarSeq
• Extract reads that cover codon of interest
• Filter based on the quality scores
• Build a codon table
* codon = nucleotide triplet that specifies a single amino acid
[Figure: reads aligned to the reference at Pos x, with codons CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA → Filtering → Codon Table]
Codon table at Pos x:
Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13
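The three steps on this slide can be sketched roughly as follows. This is a simplified illustration, not the VirVarSeq implementation: the function `codon_table`, the rule that all three bases of a codon must reach the threshold, the threshold value, and the quality scores attached to the twelve codons from the slide are all assumptions made for the example.

```python
from collections import Counter

def codon_table(reads, q_threshold):
    """Build {codon: frequency} from (codon, [q1, q2, q3]) pairs.

    Assumption: a codon is kept only if all three base qualities
    are at or above q_threshold.
    """
    kept = [codon for codon, quals in reads if min(quals) >= q_threshold]
    if not kept:
        return {}
    counts = Counter(kept)
    return {codon: n / len(kept) for codon, n in counts.most_common()}

# The twelve codons shown on the slide, in order.
CODONS = ["CGA", "CGA", "CCA", "CGA", "CGA", "CGA",
          "CCA", "CCA", "CGT", "CGA", "CGA", "GGA"]

# Hypothetical qualities: four reads get one low-quality base.
HIGH, LOW = [38, 38, 38], [38, 2, 38]
LOW_QUALITY_READS = {1, 7, 8, 10}
reads = [(c, LOW if i in LOW_QUALITY_READS else HIGH)
         for i, c in enumerate(CODONS)]

table = codon_table(reads, q_threshold=20)  # threshold value is illustrative
print(table)
```

With these made-up qualities, eight codons survive the filter, which is one way the twelve reads could lead to a three-row table like the one on the slide.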
VirVarSeq
Definition of the Q-threshold (QIT):
Fit a mixture distribution on the Q-scores with 3 components:
– Point probability around Q 2
– Error distribution
– Reliable call distribution
The intersection point is the threshold.
[Figure: fitted mixture distribution of Q-scores with the QIT marked.]
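As a rough illustration of the intersection idea, the sketch below finds the Q-score at which two weighted component densities are equal. It assumes, purely for illustration, that the error and reliable-call components are normal densities and that the threshold is where their weighted curves cross; the component shapes, weights, means and standard deviations are hypothetical and not taken from the slide.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def intersection_threshold(w_err, mean_err, sd_err, w_ok, mean_ok, sd_ok, tol=1e-9):
    """Bisect between the two component means for the crossing point."""
    def diff(q):
        return (w_err * normal_pdf(q, mean_err, sd_err)
                - w_ok * normal_pdf(q, mean_ok, sd_ok))

    lo, hi = mean_err, mean_ok
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("weighted densities do not cross between the means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:   # error component still dominates: move right
            lo = mid
        else:               # reliable-call component dominates: move left
            hi = mid
    return (lo + hi) / 2

# Hypothetical fitted parameters (weights, means, standard deviations).
PARAMS = dict(w_err=0.1, mean_err=12.0, sd_err=5.0,
              w_ok=0.9, mean_ok=35.0, sd_ok=4.0)
qit = intersection_threshold(**PARAMS)
print(round(qit, 2))
```

Because the threshold comes from the fitted distribution rather than a fixed cut-off, it adapts to the quality profile of each data set.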
ViVaMBC
• Extract reads that cover codon of interest
• Perform Model Based Clustering
– Model the error probability
– Clusters unknown, EM algorithm
[Figure: reads aligned to the reference at Pos x, with codons CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA → Clustering → Codon Table]
Codon table at Pos x:
Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13
• Cluster medoid = variant
• Size of cluster = frequency
• N° clusters = N° variants
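To make the clustering idea concrete, here is a deliberately simplified EM sketch. It is not the ViVaMBC model: it assumes the candidate variants are just the distinct observed codons, derives a per-base error probability from the Q-score as 10^(-Q/10), spreads an error equally over the three other bases, ignores second-best base calls, and only estimates the cluster proportions. All function names and the quality scores are hypothetical.

```python
def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def read_likelihood(codon, quals, variant):
    """P(observed codon | true variant) under independent base errors."""
    p = 1.0
    for base, q, true_base in zip(codon, quals, variant):
        e = phred_to_error(q)
        p *= (1 - e) if base == true_base else e / 3
    return p

def em_codon_frequencies(reads, n_iter=50):
    """Estimate the proportion tau of each candidate variant by EM."""
    variants = sorted({codon for codon, _ in reads})
    tau = {v: 1 / len(variants) for v in variants}
    lik = [{v: read_likelihood(c, q, v) for v in variants} for c, q in reads]
    for _ in range(n_iter):
        # E-step: expected cluster membership of every read.
        totals = dict.fromkeys(variants, 0.0)
        for l in lik:
            norm = sum(tau[v] * l[v] for v in variants)
            for v in variants:
                totals[v] += tau[v] * l[v] / norm
        # M-step: cluster size becomes the new proportion.
        tau = {v: totals[v] / len(reads) for v in variants}
    return tau

# The twelve codons from the slide with hypothetical qualities:
# every base has Q38 except the last base of the single CGT read (Q2).
CODONS = ["CGA", "CGA", "CCA", "CGA", "CGA", "CGA",
          "CCA", "CCA", "CGT", "CGA", "CGA", "GGA"]
reads = [(c, [38, 38, 2] if c == "CGT" else [38, 38, 38]) for c in CODONS]

tau = em_codon_frequencies(reads)
print({v: round(p, 3) for v, p in tau.items()})
```

In this toy run the CGT read, whose differing base has a very low quality, ends up assigned to the CGA cluster instead of being reported as a separate variant, which is the behaviour a quality-aware clustering is meant to provide.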
Results – HCV plasmids
• Two plasmids
• Amino acids 1 to 181 of the NS3 region
• Differ at two codon positions (36 and 155)
• Mixed in 4 different proportions
Results – HCV plasmids
Other variants (11481 max) are false positives.
VirVarSeq reports:
– more false positives
– with frequencies going up to 0.5%
Results – HCV clinical sample
VirVarSeq reports more variants.
Above 1% the methods are in agreement, even above 0.5%.
A few false positives in the GC region for ViVaMBC?
[Figure with axes ViVaMBC and VirVarSeq.]
VirVarSeq vs ViVaMBC
VirVarSeq
• Adaptive approach
• Easy development
• Runs fast
ViVaMBC
• More elegant
• Longer development time
• Longer run time
With a reporting limit of 1% or 0.5%, the two methods are in agreement. Below that limit there is a trade-off between sensitivity and specificity, with VirVarSeq being less specific.
Acknowledgements
Promoters: Prof. Dr. O. Thas¹, Prof. Dr. L. Clement¹ and Prof. Dr. L. Bijnens²
Yves Wetzels, Tobias Verbeke, Joris Meys¹ for IT support
Scientists within discovery sciences²
Non-clinical statistics team²
[Affiliations ¹ and ² are not reproduced in this transcript.]
Back-up
Notation:
• $r_i$: best base calls of read $i$ ($i = 1, \dots, n$)
• $s_i$: second-best base calls of read $i$ ($i = 1, \dots, n$)
• $z_{ij}$: $z_{ij} = 1$ when read $i$ belongs to haplotype $j$ ($j = 1, \dots, k$)
• $\tau_j$: probability of belonging to haplotype $j$
Complete Data Likelihood:
ViVaMBC
Complete Data Likelihood:
Likelihood depends on cluster membership zij
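The formula itself is not reproduced in this transcript. For orientation only, the generic complete-data likelihood of a finite mixture, written with the notation from the back-up slide, has the form below. Here $f_j$ is a placeholder for the probability of the observed base calls given haplotype $j$ and $\theta$ for its parameters; both are assumptions for illustration, not the specific ViVaMBC error model.

```latex
L_c(\tau, \theta)
  = \prod_{i=1}^{n} \prod_{j=1}^{k}
    \left[ \tau_j \, f_j(r_i, s_i \mid \theta) \right]^{z_{ij}}

% Generic EM updates for such a mixture:
% E-step: expected membership given the current parameters
\hat{z}_{ij}
  = \frac{\tau_j \, f_j(r_i, s_i \mid \theta)}
         {\sum_{l=1}^{k} \tau_l \, f_l(r_i, s_i \mid \theta)}

% M-step for the mixing proportions
\tau_j = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ij}
```

Since the memberships $z_{ij}$ are unobserved, the E-step replaces them by their expectations $\hat{z}_{ij}$ before the parameters are updated.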
EM algorithm
ViVaMBC
Library preparation
Sequencing by synthesis