
VirVarSeq vs ViVaMBC

Bie Verbist | NCS Brugge | 10-10-2014

Statistical methods for improved variant calling of massively parallel sequencing data.

Pictured above: The structure of HIV.

OUTLINE

• Viral dynamics
• Massively parallel sequencing
• Variant calling
  – VirVarSeq
  – ViVaMBC
• Results
  – HCV plasmids
  – HCV clinical sample

Viral dynamics

A virus is a small infectious agent that replicates only inside the living cells of other organisms.

High replication rate (10^11 replications a day for HIV)

High mutation rate

A viral population consists of closely related subgroups, the viral quasispecies, which we want to identify and quantify.


Viral dynamics

[Figure: number of viruses in the population (y-axis) against time (x-axis), before treatment and on treatment. Labels: heterogeneous viral population, drug-sensitive variants, drug-resistant variant, undetectable.]

Sequencing

Sanger sequencing

ACGGTTTCCGTCTGGG

• detection limit: 20-30%
• no accurate estimate of frequency

Massively parallel sequencing

ACGGTTTCTGTCTGGG
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGATTTCTGTCTGGG

• detection limit << 20%
• more accurate estimate of frequency
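To illustrate why many short reads give a better frequency estimate than a single consensus call, the following minimal Python sketch tabulates base frequencies per position for the seven 16-base reads shown for massively parallel sequencing. The function name `base_frequencies` is introduced here for illustration only, and the reads are assumed to be already aligned to the same start position.

```python
from collections import Counter

# The seven reads shown on the slide for massively parallel sequencing.
reads = [
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGATTTCTGTCTGGG",
]


def base_frequencies(reads):
    """Return, per position, a dict mapping each observed base to its frequency."""
    freqs = []
    for column in zip(*reads):  # one tuple of bases per aligned position
        counts = Counter(column)
        freqs.append({base: n / len(column) for base, n in counts.items()})
    return freqs


freqs = base_frequencies(reads)

# Keep only positions (1-based) where more than one base is observed.
variable = {i + 1: f for i, f in enumerate(freqs) if len(f) > 1}
for pos, f in sorted(variable.items()):
    print(pos, {base: round(p, 3) for base, p in f.items()})
```

In these reads only positions 4 and 9 vary, and in both cases the minority base occurs in 1 of 7 reads (about 14%), which is below the 20-30% detection limit quoted for Sanger sequencing. Whether such a minority base is a true variant or a sequencing error is the variant calling question addressed in the rest of the talk.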


Massively parallel sequencing

Viral population → Amplification → Fragmentation → DNA fragments → Sequencing by synthesis

Example, one fragment:

GAAACCGT
A C G G T T T C T

Massively parallel sequencing

Viral population

@HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
+
G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:;

@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
+
DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###

@HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
+
HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
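The records above are in FASTQ format: a header line starting with `@`, the base calls, a `+` separator and one quality character per base. As a hedged sketch of how the quality scores used later can be read out, the Python below decodes the quality string into Phred scores. The offset of 33 is an assumption inferred from the characters seen in these records (for example `#`), not something the slides state, and the example record is truncated to the first ten cycles of the first read to keep it short. The function names are illustrative.

```python
def parse_fastq_record(lines, offset=33):
    """Split one 4-line FASTQ record into header, bases and Phred scores.

    `offset` is the ASCII offset of the quality encoding (assumed 33 here).
    """
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(ch) - offset for ch in qual]  # one Phred score per base
    return header[1:], seq, scores


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# First read from the slide, truncated to its first ten bases for brevity.
record = [
    "@HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG",
    "GATCGGAAGA",
    "+",
    "G@GG@GG@GG",
]

name, seq, scores = parse_fastq_record(record)
for base, q in zip(seq, scores):
    print(base, q, f"{error_probability(q):.5f}")
```

Under this assumed offset the character `#`, which accompanies every `N` base call in the records above, decodes to Q = 2.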


Variant calling

Distinguish low-frequency variants from sequencing error.

VirVarSeq

Adaptive filtering approach based on quality scores.

Verbist et al. 2014, Bioinformatics. doi: 10.1093/bioinformatics/btu587.

ViVaMBC

Model-based clustering approach which models the error probabilities based on quality scores.

Verbist et al. 2014, BMC Bioinformatics. Under revision.

VirVarSeq

• Extract reads that cover codon of interest
• Filter based on the quality scores
• Build a codon table

* codon = nucleotide triplet which specifies a single amino acid

Codons observed at Pos x in the reads aligned to the reference:

CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA

→ Filtering →

Codon table at Pos x:

Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13
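The three VirVarSeq steps (extract, filter, tabulate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the twelve codons are those shown on the slide, but the quality triplets are invented, the threshold of 20 is arbitrary, and the rule "keep a codon only if all three bases reach the threshold" is an assumption about how the filter is applied. The name `codon_table` is hypothetical.

```python
from collections import Counter


def codon_table(codon_reads, q_threshold):
    """Keep codons whose three bases all reach q_threshold, then tabulate.

    codon_reads: list of (codon, (q1, q2, q3)) pairs for one codon position.
    Returns a dict codon -> relative frequency among the retained reads.
    """
    kept = [codon for codon, quals in codon_reads if min(quals) >= q_threshold]
    counts = Counter(kept)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.most_common()}


# Codons from the slide; the quality scores are made up for illustration.
codon_reads = [
    ("CGA", (38, 37, 39)), ("CGA", (36, 38, 38)), ("CCA", (39, 38, 37)),
    ("CGA", (12, 35, 38)), ("CGA", (38, 38, 36)), ("CGA", (37, 39, 38)),
    ("CCA", (38, 9, 37)),  ("CCA", (36, 37, 38)), ("CGT", (38, 37, 5)),
    ("CGA", (39, 38, 38)), ("CGA", (37, 2, 38)),  ("GGA", (35, 38, 37)),
]

table = codon_table(codon_reads, q_threshold=20)
for codon, freq in table.items():
    print(codon, freq)
```

With these invented qualities four of the twelve reads are filtered out, including the only CGT read, and the remaining eight give CGA 5/8, CCA 2/8 and GGA 1/8, which agrees with the codon table on the slide up to rounding.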


VirVarSeq

Definition of the Q-threshold (QIT):

Fit a mixture distribution on the Q-scores with 3 components:

– Point probability around Q = 2
– Error distribution
– Reliable call distribution

The intersection point is the threshold.

[Figure with QIT indicated]
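As a rough sketch of the "intersection point" idea, the Python below takes an already fitted mixture and scans for the Q-score at which the weighted reliable-call density overtakes the weighted error density. Everything specific here is an assumption: the slides do not say which distribution families are used, so normal densities serve as a stand-in, the weights and parameters are invented, and the intersection is taken to be the one between the error and reliable-call components. The point mass around Q = 2 is left out of the scan by starting at Q = 3.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (stand-in for the fitted components)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def intersection_threshold(w_err, err, w_rel, rel, q_min=3.0, q_max=41.0, step=0.01):
    """Return the first Q >= q_min where the weighted reliable-call density
    is at least the weighted error density, or None if that never happens.

    err and rel are (mean, sd) tuples of the two components.
    """
    n_steps = int(round((q_max - q_min) / step))
    for i in range(n_steps + 1):
        q = q_min + i * step
        if w_rel * normal_pdf(q, *rel) >= w_err * normal_pdf(q, *err):
            return round(q, 2)
    return None


# Invented mixture: 5% point mass at Q = 2, 20% error, 75% reliable calls.
qit = intersection_threshold(w_err=0.20, err=(15.0, 5.0),
                             w_rel=0.75, rel=(36.0, 3.0))
print("QIT:", qit)
```

Because the threshold comes from the fitted mixture rather than from a fixed cut-off, it adapts to the quality profile of each run, which is presumably what "adaptive filtering" refers to.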


ViVaMBC

• Extract reads that cover codon of interest
• Perform Model Based Clustering
  – Model the error probability
  – Clusters unknown, EM algorithm

Codons observed at Pos x in the reads aligned to the reference:

CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA

→ Clustering →

[Figure: reads grouped into clusters by codon (CGA, CCA, GGA)]

Codon table at Pos x:

Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13

• Cluster medoid = variant
• Size of cluster = frequency
• N° clusters = N° variants
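The clustering idea can be illustrated with a deliberately simplified EM loop in Python. This is not the ViVaMBC implementation: the candidate variants are fixed to the codons observed in the reads instead of being updated as medoids, second-best base calls are ignored, bases are treated as independent, and a miscalled base is assumed to be any of the other three bases with equal probability. The quality scores are invented and all function names are hypothetical.

```python
def phred_to_error(q):
    """Error probability implied by a Phred score."""
    return 10 ** (-q / 10)


def read_likelihood(codon, quals, haplotype):
    """P(observed codon | true haplotype) under the simplified error model."""
    lik = 1.0
    for base, q, true_base in zip(codon, quals, haplotype):
        e = phred_to_error(q)
        lik *= (1 - e) if base == true_base else e / 3
    return lik


def em_frequencies(reads, haplotypes, n_iter=100):
    """Estimate mixing proportions tau_j of fixed candidate haplotypes by EM."""
    k = len(haplotypes)
    tau = [1 / k] * k
    lik = [[read_likelihood(c, q, h) for h in haplotypes] for c, q in reads]
    for _ in range(n_iter):
        # E-step: posterior membership of each read in each cluster.
        resp = []
        for row in lik:
            weighted = [t * l for t, l in zip(tau, row)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: cluster size becomes the frequency estimate.
        tau = [sum(r[j] for r in resp) / len(reads) for j in range(k)]
    return dict(zip(haplotypes, tau))


# Codons from the slide with invented quality scores.
reads = [
    ("CGA", (38, 37, 39)), ("CGA", (36, 38, 38)), ("CCA", (39, 38, 37)),
    ("CGA", (12, 35, 38)), ("CGA", (38, 38, 36)), ("CGA", (37, 39, 38)),
    ("CCA", (38, 9, 37)),  ("CCA", (36, 37, 38)), ("CGT", (38, 37, 5)),
    ("CGA", (39, 38, 38)), ("CGA", (37, 2, 38)),  ("GGA", (35, 38, 37)),
]
candidates = sorted({codon for codon, _ in reads})
estimates = em_frequencies(reads, candidates)
for codon, freq in estimates.items():
    print(codon, round(freq, 3))
```

Unlike the filtering approach, no read is discarded here. The single CGT read, whose deviating base has a low quality score in this invented data, is instead absorbed by the CGA cluster, so the estimated CGT frequency shrinks towards zero.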


Results – HCV plasmids

• Two plasmids
• Amino acids 1 to 181 of the NS3 region
• Differ at two codon positions (36 and 155)
• Mixed in 4 different proportions

Results – HCV plasmids

Other variants (11481 max) are false positives.

VirVarSeq reports:

– more false positives
– with frequencies going up to 0.5%

Results – HCV clinical sample

VirVarSeq reports more variants.

Above 1% the methods are in agreement, even above 0.5%.

A few false positives in the GC region for ViVaMBC?

[Figure: ViVaMBC against VirVarSeq]

VirVarSeq vs ViVaMBC

VirVarSeq
• Adaptive approach
• Easy development
• Runs fast

ViVaMBC
• More elegant
• Longer development time
• Longer run time

When applying reporting limits of 1% or 0.5%, the methods are in agreement. Below this limit there is a trade-off between sensitivity and specificity, with VirVarSeq being less specific.
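What "applying a reporting limit" means in practice can be shown with a small Python sketch that thresholds two call sets and compares them. The call sets below are entirely hypothetical (positions, codons and frequencies are made up to mimic the pattern described, not taken from the results), and `apply_reporting_limit` is an illustrative name.

```python
def apply_reporting_limit(calls, limit):
    """Keep the variants whose estimated frequency is at least `limit`."""
    return {variant for variant, freq in calls.items() if freq >= limit}


# Hypothetical calls: (codon position, codon) -> estimated frequency.
virvarseq_calls = {
    (36, "ATG"): 0.052,
    (155, "AAA"): 0.011,
    (80, "CAA"): 0.004,
    (122, "GGT"): 0.002,
}
vivambc_calls = {
    (36, "ATG"): 0.050,
    (155, "AAA"): 0.012,
    (80, "CAA"): 0.003,
}

agreement = {}
for limit in (0.01, 0.005, 0.001):
    a = apply_reporting_limit(virvarseq_calls, limit)
    b = apply_reporting_limit(vivambc_calls, limit)
    agreement[limit] = (a == b)
    status = "agree" if a == b else f"differ on {sorted(a ^ b)}"
    print(f"limit {limit:.3f}: {status}")
```

In this made-up example the two call sets coincide at limits of 1% and 0.5% and diverge only at 0.1%, where one method reports an extra low-frequency variant.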


Acknowledgements

2

Promoters: Prof. Dr. O. Thas (1), Prof. Dr. L. Clement (1) and Prof. Dr. L. Bijnens (2)

Yves Wetzels, Tobias Verbeke, Joris Meys (1) for IT support

Scientists within discovery sciences (2)

Non-clinical statistics team (2)

1

2


Thank you

Back-up

Notation:

r_i: best base calls of read i (i = 1, ..., n)
s_i: second best base calls of read i (i = 1, ..., n)
z_ij: z_ij = 1 when read i belongs to haplotype j (j = 1, ..., k)
τ_j: probability to belong to haplotype j

Complete Data Likelihood:


ViVaMBC

Complete Data Likelihood:

The likelihood depends on the cluster membership z_ij.

EM algorithm
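The likelihood expression itself is not reproduced in the text here. As a hedged sketch only, the generic complete-data likelihood of a finite mixture, written in the notation of the back-up slide, would have the form below. The density f and its parameters θ_j are placeholders for the read-level error model built from the quality scores; they are assumptions introduced for this sketch and not the author's exact expression.

```latex
% Generic complete-data likelihood of a k-component mixture (sketch).
% f(r_i, s_i | \theta_j) is a placeholder for the read-level model.
L_c(\tau, \theta) = \prod_{i=1}^{n} \prod_{j=1}^{k}
  \left[ \tau_j \, f(r_i, s_i \mid \theta_j) \right]^{z_{ij}}

% E-step: replace the unknown z_{ij} by its conditional expectation.
\hat{z}_{ij} = \frac{\tau_j \, f(r_i, s_i \mid \theta_j)}
                    {\sum_{l=1}^{k} \tau_l \, f(r_i, s_i \mid \theta_l)}

% M-step for the mixing proportions.
\hat{\tau}_j = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ij}
```

Since the memberships z_ij are not observed, the EM algorithm alternates between the E-step and the M-step until the estimates stabilise. The estimated τ_j then play the role of the variant frequencies.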

ViVaMBC


Library preparation

Sequencing by synthesis
