VirVarSeq vs ViVaMBC Bie Verbist | NCS Brugge | 10-10-2014 Statistical methods for improved variant calling of massively parallel sequencing data. Pictured above: The structure of HIV.


VirVarSeq vs ViVaMBC

Bie Verbist | NCS Brugge | 10-10-2014

Statistical methods for improved variant calling of massively parallel sequencing data.

Pictured above: The structure of HIV.


OUTLINE

Viral dynamics
Massively parallel sequencing
Variant calling
• VirVarSeq
• ViVaMBC
Results
• HCV plasmids
• HCV clinical sample


Viral dynamics

A virus is a small infectious agent that replicates only inside the living cells of other organisms.

High replication rate (10^11 replications a day for HIV)

High mutation rate

A viral population consists of closely related subgroups, the viral quasispecies, which we want to identify and quantify.


Viral dynamics

[Figure: number of viruses in the population (y-axis) against time (x-axis), before treatment and on treatment. Labels: heterogeneous viral population, drug-sensitive variants, drug-resistant variant, undetectable.]


Sequencing

Sanger sequencing
• detection limit: 20-30%
• no accurate estimate of frequency

Massively parallel sequencing
• detection limit << 20%
• more accurate estimate of frequency

Sequences shown on the slide:
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCCGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGGTTTCTGTCTGGG
ACGATTTCTGTCTGGG
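As a small illustration of why many individual reads give a frequency estimate, the sketch below counts bases per position over the eight sequences shown on this slide. Treating all eight as independent, already aligned reads is an assumption made for the example, and `base_frequencies` is a hypothetical helper name.

```python
from collections import Counter

# The eight sequences from the slide, assumed here to be aligned reads.
reads = [
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCCGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGGTTTCTGTCTGGG",
    "ACGATTTCTGTCTGGG",
]


def base_frequencies(reads):
    """Return one dict per position mapping base -> relative frequency."""
    n = len(reads)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*reads)
    ]


freqs = base_frequencies(reads)
# Keep only positions (1-based) where more than one base is observed.
variable = {pos + 1: f for pos, f in enumerate(freqs) if len(f) > 1}
print(variable)
```

With only eight reads a single sequencing error already shows up at 12.5%, which is why real data need both deep coverage and a way to separate errors from true variants.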


Massively parallel sequencing

[Figure: workflow from the viral population via amplification and fragmentation to DNA fragments, followed by sequencing by synthesis.]

Example, one fragment:
GAAACCGT
A C G G T T T C T


Massively parallel sequencing

Viral population

@HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA
+
G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:;

@HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG
ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA
+
DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD###

@HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG
ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA
+
HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
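The reads above are FASTQ records: a header, the base calls, a separator and one quality character per base. A minimal sketch of reading such records and turning the quality characters into Q-scores follows. It assumes four lines per record with no blank lines in between and a Phred+33 encoding (an assumption, though it fits the `#` characters under the `N` calls, which decode to Q 2). `parse_fastq` and the toy record `read1` are hypothetical names for illustration.

```python
def parse_fastq(text, offset=33):
    """Parse 4-line FASTQ records into (name, sequence, q_scores) tuples.

    offset=33 assumes Phred+33 quality encoding.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        q_scores = [ord(char) - offset for char in qual]
        records.append((header[1:], seq, q_scores))
    return records


def error_probability(q):
    """Phred definition: Q = -10 * log10(P(error))."""
    return 10 ** (-q / 10)


# Toy record, not taken from the slide.
example = "@read1\nACGN\n+\nHH#D\n"
records = parse_fastq(example)
print(records)
```

Here `H` decodes to Q 39 and `#` to Q 2, so the `N` call carries an error probability of about 0.63.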


Variant calling

Distinguish low-frequency variants from sequencing error.

VirVarSeq
Adaptive filtering approach based on quality scores.
Verbist et al. 2014, Bioinformatics. doi: 10.1093/bioinformatics/btu587.

ViVaMBC
Model-based clustering approach which models the error probabilities based on quality scores.
Verbist et al. 2014, BMC Bioinformatics. Under revision.


VirVarSeq

• Extract reads that cover codon of interest
• Filter based on the quality scores
• Build a codon table

* codon = nucleotide triplet which specifies a single amino acid

[Figure: reads aligned to the reference at position x (CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA) go through filtering to give the codon table.]

Codon table, pos x:

Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13
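The filter-then-count steps above can be sketched as follows. This is not the VirVarSeq implementation: the rule "keep a codon only if all three bases reach the threshold", the threshold value and the quality scores are assumptions invented for illustration, chosen so that the result lines up with the codon table on the slide.

```python
from collections import Counter


def codon_table(codon_calls, q_threshold):
    """Build a codon frequency table from (codon, [q1, q2, q3]) calls.

    A codon is kept only if its lowest base quality reaches q_threshold
    (an assumed filter rule, not necessarily the one VirVarSeq uses).
    """
    kept = [codon for codon, quals in codon_calls if min(quals) >= q_threshold]
    counts = Counter(kept)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.most_common()}


HIGH, LOW = [38, 38, 38], [38, 2, 38]  # invented quality scores

# The twelve codons from the slide, with invented qualities attached.
calls = [
    ("CGA", HIGH), ("CGA", HIGH), ("CCA", HIGH), ("CGA", LOW),
    ("CGA", HIGH), ("CGA", HIGH), ("CCA", LOW), ("CCA", HIGH),
    ("CGT", LOW), ("CGA", LOW), ("CGA", HIGH), ("GGA", HIGH),
]

table = codon_table(calls, q_threshold=20)
print(table)
```

Four of the twelve codons are dropped, leaving 5 CGA, 2 CCA and 1 GGA out of 8.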


VirVarSeq

Definition of the Q-threshold (QIT):

Fit mixture distribution on Q-scores with 3 components:
– Point probability around Q 2
– Error distribution
– Reliable call distribution

Intersection point is threshold.

[Figure with the QIT marked.]
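To make "intersection point" concrete, the sketch below finds the Q-score at which two weighted densities are equal. The component shapes are not stated on the slide, so normal densities and all parameter values here are assumptions, and the point mass around Q 2 is left out. `intersection_threshold` is a hypothetical helper, not part of VirVarSeq.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def intersection_threshold(w_err, mu_err, sd_err, w_rel, mu_rel, sd_rel, tol=1e-9):
    """Q between the two means where the weighted densities cross (bisection)."""

    def diff(q):
        return (w_err * normal_pdf(q, mu_err, sd_err)
                - w_rel * normal_pdf(q, mu_rel, sd_rel))

    lo, hi = mu_err, mu_rel
    if not (diff(lo) > 0 > diff(hi)):
        raise ValueError("weighted densities do not cross between the means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:   # error component still dominates: move right
            lo = mid
        else:               # reliable component dominates: move left
            hi = mid
    return (lo + hi) / 2


# Invented parameters: error calls centred at Q 10, reliable calls at Q 30.
qit = intersection_threshold(0.5, 10.0, 4.0, 0.5, 30.0, 4.0)
print(round(qit, 3))
```

Base calls with a Q-score below the crossing point are more likely to come from the error component than from the reliable one, which is the rationale for using it as a cut-off.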


ViVaMBC

• Extract reads that cover codon of interest
• Perform model-based clustering
  • Model the error probability
  • Clusters unknown, EM algorithm

[Figure: the reads covering position x of the reference (CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA) are clustered into groups around CGA, CCA and GGA, which give the codon table.]

Codon table, pos x:

Codon  Freq
CGA    0.62
CCA    0.25
GGA    0.13

Cluster medoids = variant
Size of cluster = frequency
N° clusters = N° variants
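The idea of model-based clustering with EM can be sketched in a much simplified form. The code below is not ViVaMBC: it assumes the candidate codons are already known, uses only the best base call with the Phred error probability 10^(-Q/10), spreads an error evenly over the three other bases, and estimates only the cluster proportions. All function names and quality values are hypothetical.

```python
def read_likelihood(codon, quals, haplotype):
    """P(observed codon | true haplotype) under an assumed per-base error model."""
    p = 1.0
    for base, q, true_base in zip(codon, quals, haplotype):
        e = 10 ** (-q / 10)                      # Phred error probability
        p *= (1 - e) if base == true_base else e / 3
    return p


def em_frequencies(reads, haplotypes, n_iter=50):
    """Estimate haplotype proportions by EM with the haplotypes held fixed."""
    k = len(haplotypes)
    tau = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for codon, quals in reads:
            # E-step: posterior probability that this read belongs to each haplotype.
            weighted = [tau[j] * read_likelihood(codon, quals, haplotypes[j])
                        for j in range(k)]
            norm = sum(weighted)
            for j in range(k):
                totals[j] += weighted[j] / norm
        # M-step: proportions are the average memberships.
        tau = [t / len(reads) for t in totals]
    return tau


q = [40, 40, 40]  # invented quality scores
reads = [("CGA", q)] * 5 + [("CCA", q)] * 2 + [("GGA", q)]
tau = em_frequencies(reads, ["CGA", "CCA", "GGA"])
print([round(t, 3) for t in tau])
```

In this toy setting the estimated proportions play the role of "size of cluster = frequency"; finding the number of clusters and their medoids is the harder part that this sketch leaves out.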


Results – HCV plasmids

Two plasmids
• Amino acids 1 to 181 of the NS3 region
• Differ at two codon positions (36 and 155)
• Mixed in 4 different proportions


Results – HCV plasmids

Other variants (11481 max) are false positives.

VirVarSeq reports:
– more false positives
– with frequencies going up to 0.5%


Results – HCV clinical sample

VirVarSeq reports more variants.

Above 1% the methods are in agreement, even above 0.5%.

A few false positives in the GC region for ViVaMBC?

[Figure: VirVarSeq versus ViVaMBC (axis labels).]


VirVarSeq vs ViVaMBC

VirVarSeq
• Adaptive approach
• Easy development
• Runs fast

ViVaMBC
• More elegant
• Longer development time
• Longer run time

When applying reporting limits of 1% or 0.5%, methods are in agreement. Below this limit, trade-off between sensitivity and specificity, with VirVarSeq less specific.


Acknowledgements

Promoters: Prof. Dr. O. Thas¹, Prof. Dr. L. Clement¹ and Prof. Dr. L. Bijnens²

Yves Wetzels, Tobias Verbeke, Joris Meys¹ for IT support

Scientists within discovery sciences²

Non-clinical statistics team²


10-10-2014

Thank you


Back-up


ViVaMBC

Notation:
• r_i: best base calls of read i (i = 1 ... n)
• s_i: second best base calls of read i (i = 1 ... n)
• z_ij: z_ij = 1 when read i belongs to haplotype j (j = 1 ... k)
• τ_j: probability to belong to haplotype j

Complete Data Likelihood:


ViVaMBC

Complete Data Likelihood:

Likelihood depends on cluster membership z_ij

EM algorithm
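As a sketch of what a complete-data likelihood looks like in the notation r_i, s_i, z_ij and τ_j, assuming a standard finite mixture model: f_j below is a hypothetical placeholder for the probability of observing the base calls of read i when it comes from haplotype j, whose exact form is not specified here.

```latex
% Generic complete-data likelihood of a finite mixture (assumed form)
L_c(\tau) = \prod_{i=1}^{n} \prod_{j=1}^{k}
            \bigl[\, \tau_j \, f_j(r_i, s_i) \,\bigr]^{z_{ij}},
\qquad \sum_{j=1}^{k} \tau_j = 1

% E-step: expected membership given the current parameters
\hat{z}_{ij} = \frac{\tau_j \, f_j(r_i, s_i)}
                    {\sum_{l=1}^{k} \tau_l \, f_l(r_i, s_i)}

% M-step for the mixing proportions
\hat{\tau}_j = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ij}
```

Because the memberships z_ij are unobserved, the EM algorithm alternates between replacing them by their expected values and re-estimating the parameters.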


Library preparation

Sequencing by synthesis